Weka
INTRODUCTION TO WEKA
Introduction
       Weka (Waikato Environment for Knowledge Analysis) is a popular suite of
machine learning software written in Java, developed at the University of Waikato, New
Zealand. The WEKA acronym was coined by Geoff Holmes. Weka was first implemented in its
modern form in 1997 and is free software available under the GNU General Public License.
       Weka is a comprehensive set of advanced data mining and analysis tools. The
strength of Weka lies in the area of classification where it covers many of the most
current machine learning (ML) approaches.
     At its simplest, it provides a quick and easy way to explore and analyze data.
Weka is also suitable for dealing with large data, where the resources of many
computers and/or multi-processor machines can be used in parallel.
       The software is written in the Java™ language and contains a GUI for interacting
with data files and producing visual results like tables and curves. It also has a general
API, so that we can embed WEKA, like any other library, in our own applications to do
such things as automated server-side data-mining tasks.
Data Handling
        Weka currently supports three external file formats, namely CSV, binary
serialized instances, and C4.5. Weka also allows data to be pulled directly from database
servers as well as web servers. Its native data format is known as the ARFF format. It is
basically a CSV (comma separated value) format with some extra headers that specify the
type of each attribute (numeric, nominal, string, date).
        All header commands start with ‘@’ and all comment lines start with ‘%’.
Comment and blank lines are ignored. Attribute declarations start with ‘@attribute’,
followed by the name of the attribute and then its type. There are two main types of
attributes, numeric and nominal. Numeric attributes are defined as either ‘real’, ‘integer’
or just ‘numeric’. Nominal attributes are defined by placing in brackets all the possible
values the attribute can take.
       The ARFF format does not specify which attribute is the class attribute. This is
intentional: in some cases, such as clustering applications, there is no class attribute.
In other cases there are many candidate class variables, as with association rules,
where one would like to test how well each attribute can be predicted based on the
other attributes.
        Weka permits the input data set to be in numerous file formats, such as CSV
(comma separated values: *.csv) and binary serialized instances (*.bsi). However, the
most preferred and most convenient input file format is the attribute-relation file format
(ARFF). Each ‘@attribute’ declaration uses one of the following types:
 ●     numeric
 ●     <nominal-specification>
 ●     string
 ●     date [<date-format>]
Numeric attributes
     Numeric attributes can be real or integer numbers.
Nominal attributes
         Nominal attributes are defined by providing a <nominal-specification> listing the
possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, …}. Values
that contain spaces must be quoted.
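For example, the outlook attribute of the weather data used later in this manual is
declared as:
   @attribute outlook {sunny, overcast, rainy}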
String attributes
        String attributes allow us to create attributes containing arbitrary textual values.
This is very useful in text-mining applications, as we can create datasets with string
attributes and then use Weka filters to manipulate the strings (such as StringToWordVector).
String attributes are declared as follows:
   @ATTRIBUTE name string
Date attributes
      Date attribute declarations take the form:
        @attribute <name> date [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used
by SimpleDateFormat). The default format string accepts the ISO-8601 combined date
and time format: "yyyy-MM-dd'T'HH:mm:ss".
Dates must be specified in the data section as the corresponding string representations
of the date/time.
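For example, an illustrative declaration and matching data row (the relation name and
values here are invented purely for illustration):
   @relation timestamps
   @attribute timestamp date "yyyy-MM-dd HH:mm:ss"
   @data
   "2011-05-03 12:59:55"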
INSTALLATION OF WEKA
Installation Of Weka
   Weka is an open source collection of machine learning algorithms for data mining
tasks, which you can use in a number of different ways. It comes with a Graphical User
Interface (GUI), but can also be called from your own Java code. You can even write
your own batch files for tasks that you need to execute more than once, maybe with
slightly different parameters each time.
Steps to be followed in Weka
Step 1: Starting up WEKA
       Go to C:\Program Files\Weka-3.6.2 and click on “weka.jar” or “RunWeka.bat”.
The Weka Explorer window then appears. It consists of a number of tabs.
Section Tabs
       At the very top of the window, just below the title bar, is a row of tabs. When the
Explorer is first started only the first tab is active; the others are grayed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting
to explore the data.
The tabs are as follows: Preprocess, Classify, Cluster, Associate, Select attributes,
and Visualize.
       Once the tabs are active, clicking on them flicks between different screens, on
which the respective actions can be performed. The bottom area of the window
(including the status box, the log button, and the Weka bird) stays visible regardless of
which section you are in.
Status Box
       The status box appears at the very bottom of the window. It displays messages
that keep you informed about what’s going on. The menu gives two options:
       ● Memory information: Display in the log box the amount of memory available
         to WEKA.
       ● Run garbage collector: Force the Java garbage collector to search for
         memory that is no longer needed and free it up, allowing more memory for
         new tasks. Note that the garbage collector is constantly running as a
         background task anyway.
Log Button
       Clicking on the Log button brings up a separate window containing a log of the
actions WEKA has performed in this session.
WEKA Status Icon
       To the right of the status box is the WEKA status icon. When no processes are
running, the bird sits down and takes a nap.
Step 3: Choosing a data set
       The first three buttons at the top of the preprocess section enable you to load
data into WEKA:
           OPENING A DATA SET FROM THE EXPLORER (Open File option)
Open file.... Brings up a dialog box allowing you to browse for the data file on the local
file system. Initially (in the Preprocess tab), click “Open file” and navigate to the
directory containing the data file (.csv or .arff). We now have a number of choices, but
before we can work with any data, we have to load it into Weka. For now, we will use
one of the datasets that are included; later on we will have to convert any file we use
into the right format. Open a file from the data subdirectory, for example the weather
data, to find the following screen (the default should be the Preprocess tab). The data
set is obtained as follows:
C:\Program Files\Weka3-6\data\Weather.arff
Open URL.... Asks for a Uniform Resource Locator address for where the data is
stored.
Open DB.... Reads data from a database. (Note that to make this work, we might have
to edit the file in weka/experiment/DatabaseUtils.props.)
        Weka supports many filters, and for convenience they are organized according to
whether they take class information into account (Supervised/Unsupervised) and
whether they act on instances or attributes. A filter can thus be one of four types, i.e.,
supervised instance filter, unsupervised instance filter, supervised attribute filter and
lastly, unsupervised attribute filter.
        Supervised filters take class information into account, while unsupervised filters
do not. Good examples of this are the two filters supervised discretization and
unsupervised discretization. Both these filters are designed to convert numerical
attributes into nominal ones; however the unsupervised filter does not take class
information into account when grouping instances together. There is always a risk that
distinctions between the different instances in relation to the class can be wiped out
when using such a filter.
       In our sample data file, we can discretize the attributes. In the "Filter" panel, click
on the "Choose" button. This will show a popup window with a list of available filters.
Scroll down the list and select the "weka.filters.supervised.attribute.Discretize" filter.
         Click the "Apply" button to apply this filter to the data. This will discretize the
attributes and create a new working relation (whose name now includes the details of
the filter that was applied).
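The same filtering can also be done programmatically, since Weka can be embedded
as a library (see the Introduction). Below is a minimal sketch, assuming a weather.arff
file in the working directory; the file name and the DiscretizeDemo class name are
illustrative only.
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.supervised.attribute.Discretize;

    public class DiscretizeDemo {
        public static void main(String[] args) throws Exception {
            // Load the data set (assumed file name)
            Instances data = DataSource.read("weather.arff");
            // Supervised filters need to know which attribute is the class
            data.setClassIndex(data.numAttributes() - 1);
            Discretize filter = new Discretize();
            filter.setInputFormat(data);                        // initialize the filter
            Instances newData = Filter.useFilter(data, filter); // apply it
            System.out.println(newData);                        // the discretized relation
        }
    }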
       The supervised filter does not have the same problem because it takes the class
information into account and tries to maintain the class distinctions in the grouped
instances. The usefulness of class information and thus supervised filters will of course
depend on the context and either type of filter is useful in many cases.
        Weka also has another type of filter, called instance filters. Instance filters work
on whole instances of the data rather than on a specific attribute, whereas attribute
filters work on attributes in general and not on specific instances.
Data Set
% student details
@relation student
@attribute sno integer
@attribute sname {Annie,Bpavani,Geetha,Krishna,Padmaja,Tulasi,Sravanthi,Swathi,Swav}
@attribute ERP numeric
@attribute SQAT numeric
@attribute SRE numeric
@attribute MC numeric
@attribute WT numeric
@attribute ADSLAB numeric
@attribute MCLAB numeric
@attribute ADS numeric
@attribute results {pass,fail}
@data
501,Annie,56,60,63,73,54,92,95,67,pass
502,Bpavani,65,76,74,78,67,89,98,72,pass
503,Geetha,63,75,67,78,70,90,96,73,pass
504,Krishna,58,72,70,78,59,88,89,64,pass
505,Padmaja,63,76,73,75,66,82,93,63,pass
506,Tulasi,57,77,67,74,59,87,92,63,pass
507,Sravanthi,74,78,74,78,69,96,97,73,pass
508,Swathi,53,73,62,67,66,81,90,65,pass
509,Swav,22,76,55,45,56,65,73,66,fail
Applying Association On Data Set
INTRODUCTION TO ASSOCIATION
      Rules that satisfy both a minimum support threshold and a minimum confidence
threshold are said to be strong. Association rule mining can be used for finding frequent
patterns, associations, correlations, or causal structures among sets of items or objects
in transaction databases, relational databases, and other information repositories.
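       As a quick reminder of the two measures (standard definitions, not specific to
Weka): for a rule A => B, support is the fraction of transactions that contain both A
and B, and confidence is support(A and B) divided by support(A), i.e. the conditional
probability of B given A. In the Apriori output shown later, the annotation "conf:(1)"
therefore means a confidence of 1 (100%).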
Setting Up
       This panel contains schemes for learning association rules, and the learners are
chosen and configured in the same way as the clusterers, filters, and classifiers in the
other panels.
Learning Associations
       Once appropriate parameters for the association rule learner have been set, click
the Start button. When complete, right-clicking on an entry in the result list allows the
results to be viewed or saved.
@relation asscdb
The association rule learners used here are:
   ● Apriori
   ● Filtered Associator
a) Apriori Algorithm
        Apriori is the best-known algorithm for mining association rules. The name of the
algorithm reflects the fact that it uses prior knowledge of frequent itemset properties. It
uses a breadth-first search strategy for counting the support of itemsets, and a
candidate generation function which exploits the downward closure property of support.
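The same associator can be run from Java code; a minimal sketch is given below,
assuming the asscdb data shown above has been saved as asscdb.arff (the file and
class names are illustrative).
    import weka.associations.Apriori;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AprioriDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("asscdb.arff"); // assumed file name
            Apriori apriori = new Apriori();  // defaults: 10 rules, min. confidence 0.9
            apriori.buildAssociations(data);  // mine the association rules
            System.out.println(apriori);      // prints rules like those listed below
        }
    }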
Output:
Apriori
=======
4. income=low 2 ==> class:buys_computer=no 2 conf:(1)
5. age=senior class:buys_computer=yes 2 ==> student=no 2 conf:(1)
6. income=high student=no 2 ==> class:buys_computer=yes 2 conf:(1)
7. income=medium class:buys_computer=yes 2 ==> student=no 2 conf:(1)
8. income=medium student=no 2 ==> class:buys_computer=yes 2 conf:(1)
9. student=no class:buys_computer=no 2 ==> income=low 2 conf:(1)
10. income=low class:buys_computer=no 2 ==> student=no 2 conf:(1)
b) Filtered Associator:
        Class for running an arbitrary associator on data that has been passed through
an arbitrary filter. Like the associator, the structure of the filter is based exclusively on
the training data and test instances will be processed by the filter without changing their
structure.
Output
=== Run information ===
Filtered Header
@relation asscdb-weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.MultiFilter-Fweka.filters.unsupervised.attribute.ReplaceMissingValues
@data
Associator Model
Apriori
=======
INTRODUCTION TO CLASSIFICATION
Definition
        It is the task of generalizing known structure to apply to new data. For example,
an email program might attempt to classify an email as legitimate or spam. Common
algorithms include decision tree learning, nearest neighbor, naive Bayesian
classification and neural networks.
        Confusion matrices are very useful for evaluating classifiers, as they provide an
efficient snapshot of a classifier's performance, displaying the distribution of correct and
incorrect instances. A confusion matrix summarizes the results of a supervised
classification. Entries along the main diagonal are correct classifications; entries off the
main diagonal are classification errors.
The Confusion Matrix View includes the following sections:
    ● Confusion matrix for training data
    ● Confusion matrix for validation data
    ● Confusion matrix as computed for the current prune level
Example:
a b <-- classified as
7 2 | a = yes
3 2 | b = no
        Weka was trying to classify instances into two possible classes: yes or no. For
the sake of simplicity, Weka substitutes a for ‘yes’ and b for ‘no’. The columns
represent the instances that were classified as each class. So the first column shows
that in total 10 instances were classified as a by Weka, and the second that 4 were
classified as b. The rows represent the actual instances that belong to each class. So
what this tells us is the number of times a given class is correctly or incorrectly
classified.
       From the matrix, we can observe that 7 of the instances that should have been
classed as a were in fact correctly identified. Similarly, 2 b’s were classified correctly.
However, we can also see that 2 a’s were incorrectly classified as b, whereas 3 b’s
were classed as a. This fine-grained perspective can provide interesting insights. It also
allows us to assess the suitability of a particular classifier.
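       To make this concrete (a small worked example using the counts above): the
main diagonal holds 7 + 2 = 9 correct classifications out of 7 + 2 + 3 + 2 = 14
instances, giving an accuracy of 9/14 ≈ 64.3% and an error rate of 5/14 ≈ 35.7%.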
       Cross validation is a method for estimating the true error of a model. When a
model is built from training data, the error on the training data is a rather optimistic
estimate of the error rates the model will achieve on unseen data. The aim of building a
model is usually to apply the model to new, unseen data--we expect the model to
generalize to data other than the training data on which it was built. Thus, we would like
to have some method for better approximating the error that might occur in general.
Program
Weather.arff:
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
Output
Scheme:       weka.classifiers.bayes.BayesNet -D -Q
weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E
weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
Test mode: evaluate on training data
TP Rate       FP Rate       Precision     Recall        F-Measure            Class
0.889         0.4           0.8           0.889         0.842                yes
0.6           0.111         0.75          0.6           0.667                no
        An advantage of the naive Bayes classifier is that it requires a small amount of
training data to estimate the parameters (means and variances of the variables)
necessary for classification.
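For reference, here is a minimal sketch of reproducing the "evaluate on training data"
run below through the Java API (the file name and class name are assumptions):
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class NaiveBayesDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // assumed file name
            data.setClassIndex(data.numAttributes() - 1);     // 'play' is the class
            NaiveBayes nb = new NaiveBayes();
            nb.buildClassifier(data);
            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(nb, data);              // evaluate on the training data
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString()); // the confusion matrix
        }
    }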
Output
Scheme:       weka.classifiers.bayes.NaiveBayes
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
Test mode: evaluate on training data
             StandardDev = 7.384
             WeightSum = 5
             Precision = 1.9090909090909092
humidity:    Normal Distribution.
             Mean = 86.1111
             StandardDev = 9.2424
             WeightSum = 5
             Precision = 3.4444444444444446
windy:       Discrete Estimator.
             Counts = 4 3 (Total = 7)
a b <-- classified as
9 0 | a = yes
1 4 | b = no
        Regression predicts a numeric value. It can also determine which input fields are
most relevant for predicting the target field values. The predicted value might not be
identical to any value contained in the data that was used to build the model.
       Numeric prediction is the task of predicting continuous (or ordered) values for
given input. Regression and numeric prediction are synonymous. Regression analysis
can be used to model the relationship between one or more independent or predictor
variables and a dependent or response variable (which is continuous-valued). The
predictor variables are the attributes of interest describing the tuple. The response
variable is what we want to predict.
i) Linear Regression
       Linear regression was the first type of regression analysis to be studied rigorously
and to be used extensively in practical applications. This is because models which
depend linearly on their unknown parameters are easier to fit than models which are
non-linearly related to their parameters, and because the statistical properties of the
resulting estimators are easier to determine.
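A minimal sketch of fitting Weka's LinearRegression to the student data through the
Java API, assuming the student relation shown earlier has been saved as student.arff
and that the numeric ERP attribute is the target (both assumptions for illustration):
    import weka.classifiers.functions.LinearRegression;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class LinearRegressionDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("student.arff");  // assumed file name
            data.setClassIndex(data.attribute("ERP").index()); // predict the numeric ERP marks
            LinearRegression lr = new LinearRegression();
            lr.buildClassifier(data);
            System.out.println(lr); // prints a model of the form shown below
        }
    }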
Output
=== Run information ===
=== Classifier model (full training set) ===
ERP =
    8.2262 * sname=Swathi,Annie,Tulasi,Krishna,Geetha,Padmaja,Bpavani,Sravanthi +
    1.171 * sname=Annie,Tulasi,Krishna,Geetha,Padmaja,Bpavani,Sravanthi +
    2.9988 * sname=Geetha,Padmaja,Bpavani,Sravanthi +
    1.2665 * sname=Padmaja,Bpavani,Sravanthi +
    7.2117 * sname=Sravanthi +
    0.1683 * SRE +
    0.2147 * MC +
    0.2205 * WT +
    0.2361 * ADSLAB +
    0.1473 * MCLAB +
   -0.158 * ADS +
    8.2262 * results=pass +
  -24.9382
Correlation coefficient           1
Mean absolute error               0
Root mean squared error           0
Relative absolute error           0%
Root relative squared error        0%
Total Number of Instances         9
Output
Scheme:       weka.classifiers.functions.SimpleLinearRegression
Relation: student
Instances: 17
Attributes: 8
         sno
         mc
         sqat
       sre
       wt
       erp
       adsa
       avg
Test mode: evaluate on training data
        A decision tree is a flowchart-like tree structure, where each internal node (non-
leaf node) denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a class label. The topmost node in a
tree is the root node.
       The decision trees generated by J48 can be used for classification. J48 builds
decision trees from a set of labeled training data. It uses the fact that each attribute of
the data can be used to make a decision by splitting the data into smaller subsets.
       J48 examines the normalized information gain (difference in entropy) that results
from choosing an attribute for splitting the data. The attribute with the highest
normalized information gain is used to make the decision. The algorithm then recurses
on the smaller subsets. The splitting procedure stops if all instances in a subset belong
to the same class; a leaf node is then created in the decision tree telling us to choose
that class.
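A minimal sketch of building a J48 tree and evaluating it, first on the training data and
then with 10-fold cross-validation (as in Mode 2 below); the file and class names are
assumptions:
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class J48Demo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // assumed file name
            data.setClassIndex(data.numAttributes() - 1);
            J48 tree = new J48();     // defaults: -C 0.25 -M 2
            tree.buildClassifier(data);
            System.out.println(tree); // the pruned tree, as printed below

            // Mode 2: 10-fold cross-validation
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(tree, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
            System.out.println(eval.toMatrixString());
        }
    }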
Output
outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
Number of Leaves : 5
Kappa statistic                     1
Mean absolute error                 0
Root mean squared error              0
Relative absolute error             0%
Root relative squared error         0%
Total Number of Instances           14
a b <-- classified as
9 0 | a = yes
0 5 | b = no
Mode 2: Using CROSS VALIDATION with different folds
Applying 10 folds
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
Number of Leaves: 5
Kappa statistic                         0.186
Mean absolute error                    0.2857
Root mean squared error                 0.4818
Relative absolute error                60     %
Root relative squared error            97.6586 %
Total Number of Instances              14
a b <-- classified as
7 2 | a = yes
3 2 | b = no
Class NO                        0.444                  0.4                   0.222
Confusion Matrix                a b<--classified as    a b<--classified as   a b<--classified as
                                7 2 | a = yes          6 3 | a = yes         6 3 | a = yes
                                3 2 | b = no           3 2 | b = no          4 1 | b = no
outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)
Number of Leaves : 5
=== Summary ===
a b <-- classified as
0 1 | a = yes
2 0 | b = no
Class NO                  0                       0                  0
Recall
Class YES                 0                       0.8                1
Class   NO                0                       0                  0
F-Measure
Class YES                 0                       0.727             0.824
Class NO                  0                       0                 0
Confusion Matrix          a b<-- classified as    a b<-- classified as   a b<-- classified as
                          0 1 | a = yes           4 1 | a = yes          7 0 | a = yes
                          2 0 | b = no            2 0 | b = no           3 0 | b = no
Output
Scheme:      weka.classifiers.trees.ADTree -B 10 -E -3
Relation:   weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
: -0.255
| (1)outlook = overcast: -1.049
| (1)outlook != overcast: 0.37
| | (2)humidity < 82.5: -0.43
| | | (4)temperature < 66.5: 0.78
| | | (4)temperature >= 66.5: -0.915
| | (2)humidity >= 82.5: 0.486
| | | (3)temperature < 70.5: -1.001
| | | (3)temperature >= 70.5: 1.485
| (5)temperature < 66.5: 0.183
| (5)temperature >= 66.5: -0.398
| | (6)outlook = sunny: 0.065
| | (6)outlook != sunny: -0.381
Legend: -ve = yes, +ve = no
a b <-- classified as
4 5 | a = yes
5 0 | b = no
INTRODUCTION TO CLUSTERING
Definition
       Clustering is the process of grouping data into classes or clusters. The process of
grouping a set of physical or abstract objects into classes of similar objects is called
clustering. A cluster is a collection of data objects that are similar to one another within
the same cluster and dissimilar to the objects in other clusters.
Cluster Modes
        An additional option in the Cluster mode box, the Store clusters for visualization
tick box, determines whether or not it will be possible to visualize the clusters once
training is complete.
Ignoring Attributes
        Sometimes, some of the attributes in the data should be ignored when clustering.
The Ignore attributes button brings up a small window that allows us to select which
attributes are to be ignored.
Learning Clusters
       The Cluster section has Start/Stop buttons, a result text area and a result list.
Right-clicking on an entry in the result list brings up a similar menu, except that it shows
only two visualization options: Visualize cluster assignments and Visualize tree.
       We have different types of clustering methods from which we can select the
required method to perform clustering.
a) SimpleKMeans
        The k-means algorithm assigns each point to the cluster whose center (also
called centroid) is nearest. The center is the average of all the points in the cluster —
that is, its coordinates are the arithmetic mean for each dimension separately over all
the points in the cluster.
        In statistics and machine learning, k-means clustering is a method of cluster
analysis which aims to partition n observations into k clusters in which each observation belongs
to the cluster with the nearest mean. It is one of the simplest unsupervised learning algorithms
that solve the well known clustering problem.
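A minimal sketch of the SimpleKMeans run below (-N 2 -S 10) via the Java API; the file
and class names are assumptions. The class index is deliberately left unset, since
clusterers refuse data that has a class attribute assigned.
    import weka.clusterers.ClusterEvaluation;
    import weka.clusterers.SimpleKMeans;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class KMeansDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // assumed file name
            SimpleKMeans km = new SimpleKMeans();
            km.setNumClusters(2); // -N 2
            km.setSeed(10);       // -S 10
            km.buildClusterer(data);
            ClusterEvaluation eval = new ClusterEvaluation();
            eval.setClusterer(km);
            eval.evaluateClusterer(data); // evaluate on the training data
            System.out.println(eval.clusterResultsToString());
        }
    }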
Output
Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
       play
Test mode: evaluate on training data
kMeans
======
Number of iterations: 3
Cluster centroids:
Cluster 0
              Mean/Mode: sunny             75.8889     84.1111   FALSE     yes
              Std Devs:  N/A               6.4893      8.767     N/A       N/A
Cluster 1
             Mean/Mode: overcast           69.4        77.2      TRUE      yes
             Std Devs:  N/A                4.7223      12.3167   N/A       N/A
Clustered Instances
                           0    9 ( 64%)
                           1    5 ( 36%)
Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
Test mode: split 75% train, remainder test
kMeans
======
Number of iterations: 3
Cluster centroids:
Cluster 0
               Mean/Mode:        sunny       75.8889      84.1111   FALSE
               yes
               Std Devs:         N/A         6.4893       8.767     N/A
               N/A
Cluster 1
               Mean/Mode:        overcast    69.4         77.2      TRUE
               yes
               Std Devs:         N/A         4.7223       12.3167   N/A
               N/A
kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 9.11539322593136
Cluster centroids:
Cluster 0
               Mean/Mode:        overcast    73.5         80.3333   FALSE
               yes
               Std Devs:         N/A         7.5033       10.4051   N/A
               N/A
Cluster 1
               Mean/Mode:        sunny       72.75        80.25     TRUE
               no
               Std Devs:         N/A         6.3443       11.8427   N/A
               N/A
Clustered Instances
                          0   2 ( 50%)
                          1   2 ( 50%)
                          75%                  50%                   25%
Number of iterations      2                    2                     2
Within cluster sum of     9.11539322593136     6.382060989909828     0.693530370728173
squared errors
Clustered Instances       0    2 ( 50%)        0    3 ( 43%)         0    9 ( 82%)
                          1    2 ( 50%)        1    4 ( 57%)         1    2 ( 18%)
Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
         temperature
         humidity
         windy
         play
Ignored:
         outlook
kMeans
======
Number of iterations: 3
Cluster centroids:
Cluster 0
                     Mean/Mode:         75.375         83.375    FALSE      yes
                     Std Devs:          6.7387         9.0702    N/A        N/A
Cluster 1
                     Mean/Mode:          71.1667       79.3333   TRUE       yes
                     Std Devs:           6.047         12.1929   N/A        N/A
Clustered Instances
                           0        8 ( 57%)
                           1        6 ( 43%)
Classes to Clusters:
Scheme:       weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         play
Ignored:
         windy
kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 9.517365461619956
Cluster centroids:
Cluster 0
                Mean/Mode:           sunny          74.6     86.2         no
                Std Devs:            N/A            7.893    9.7314       N/A
Cluster 1
                Mean/Mode:           overcast       73       79.1111      yes
                Std Devs:            N/A            6.1644   10.2157      N/A
Clustered Instances
                           0        5 ( 36%)
                           1        9 ( 64%)
Classes to Clusters:
                                    (Nom)outlook             (Nom)windy
Number of iterations                3                        5
Within cluster sum of squared       7.508138450200291        9.517365461619956
errors
Clustered Instances                 0     8 ( 57%)           0     5 ( 36%)
                                    1     6 ( 43%)           1     9 ( 64%)
Class attribute                     Outlook                  Windy
Classes to Clusters                  0 1<-- assigned to       0 1<-- assigned to
                                    cluster                  cluster
                                     3 2 | sunny              3 3 | TRUE
                                     2 2 | overcast           2 6 | FALSE
                                     3 2 | rainy
                                    Cluster 0 <-- sunny      Cluster 0 <-- TRUE
                                    Cluster 1 <-- overcast   Cluster 1 <-- FALSE
b) Farthest First
       FarthestFirst is a variant of k-means that places each cluster centre in turn at
the point farthest from the existing cluster centres. This point must lie within the data
area. This greatly speeds up the clustering in most cases, since less reassignment and
adjustment is needed.
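Programmatic use mirrors the SimpleKMeans sketch above; only the clusterer class
changes (the options match the -N 2 -S 1 scheme below; file and class names are
assumptions):
    import weka.clusterers.FarthestFirst;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class FarthestFirstDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // assumed file name
            FarthestFirst ff = new FarthestFirst();
            ff.setNumClusters(2); // -N 2
            ff.setSeed(1);        // -S 1
            ff.buildClusterer(data);
            System.out.println(ff); // cluster centroids, as below
        }
    }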
Scheme:       weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
Test mode: evaluate on training data
FarthestFirst
==============
Cluster centroids:
Cluster 0
                           overcast      72.0           90.0        TRUE          yes
Cluster 1
                           sunny         85.0           85.0        FALSE         no
Clustered Instances
                          0    10 ( 71%)
                          1    4 ( 29%)
Scheme:       weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
FarthestFirst
==============
Cluster centroids:
Cluster 0
                             overcast        72.0           90.0           TRUE        yes
Cluster 1
                             sunny           85.0           85.0           FALSE        no
FarthestFirst
==============
Cluster centroids:
Cluster 0
                             sunny           75.0           70.0           TRUE        yes
Cluster 1                    rainy           70.0           96.0           FALSE       yes
Clustered Instances
                             0    3 ( 75%)
                             1    1 ( 25%)
Mode 3: Using CLASSES TO CLUSTER EVALUATION
Scheme:       weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
         temperature
         humidity
         windy
         play
Ignored:
         outlook
Test mode: Classes to clusters evaluation on training data
FarthestFirst
==============
Cluster centroids:
Cluster 0
                           72.0              90.0     TRUE      yes
Cluster 1
                           85.0              85.0     FALSE     no
Clustered Instances
                           0      9 ( 64%)
                           1      5 ( 36%)
Classes to Clusters:
Incorrectly clustered instances :    6.0       42.8571 %
Scheme:       weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         play
Ignored:
         windy
Test mode: Classes to clusters evaluation on training data
FarthestFirst
==============
Cluster centroids:
Cluster 0
                            overcast           72.0        90.0   yes
Cluster 1
                            rainy              65.0        70.0   no
Clustered Instances
                           0        8 ( 57%)
                           1        6 ( 43%)
Class attribute: windy
Classes to Clusters:
    Figure: Visualize Cluster Assignments After Selecting The Windy Class Attribute
Applying Select Attributes On Data Set
        Select Attributes in WEKA can be used to decide which attributes are the most
relevant to the dependent variable.
Searching and Evaluating
       Attribute selection involves searching through all possible combinations of
attributes in the data to find which subset of attributes works best for prediction.
      To do this, two objects must be set up: an attribute evaluator and a search
method. The evaluator determines what method is used to assign a worth to each
subset of attributes. The search method determines what style of search is performed.
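These two objects map directly onto the Java API. The sketch below wires up the
GainRatioAttributeEval evaluator and Ranker search method used in the outputs that
follow (the file and class names are assumptions):
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.GainRatioAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class AttSelDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("weather.arff"); // assumed file name
            data.setClassIndex(data.numAttributes() - 1);     // class attribute: play
            AttributeSelection attsel = new AttributeSelection();
            attsel.setEvaluator(new GainRatioAttributeEval()); // the attribute evaluator
            attsel.setSearch(new Ranker());                    // the search method
            attsel.SelectAttributes(data);                     // run the selection
            System.out.println(attsel.toResultsString());     // ranked attributes, as below
        }
    }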
Options
1. Use full training set: The worth of the attribute subset is determined using the full
set of training data.
2. Cross-validation: The worth of the attribute subset is determined by a process of
cross-validation. The Fold and Seed fields set the number of folds to use and the
random seed used when shuffling the data.
        When the attribute selection process is finished, the results are output into the
result area, and an entry is added to the result list. Right-clicking on the result list gives
several options.
      To use the full training set, we first need to select an attribute evaluator and a
search method from the lists of available evaluators and search methods.
Output
Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search:    weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
Evaluation mode: evaluate on all training data
Search Method:
      Attribute ranking.
Ranked attributes:
0.1564 1 outlook
0.0488 4 windy
0    3 humidity
0    2 temperature
Output
Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search:      weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14
Attributes: 5
         outlook
         temperature
         humidity
         windy
         play
Evaluation mode: 10-fold cross-validation
Output
Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search:    weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14
Attributes: 5
        outlook
         temperature
         humidity
         windy
         play
Evaluation mode: 5-fold cross-validation
VISUALIZATION
        WEKA’s visualization section allows us to visualize 2D plots of the current
relation.
        When we select the Visualize panel, it shows a scatter plot matrix for all the
attributes, color coded according to the currently selected class. It is possible to change
the size of each individual 2D plot and the point size, and to randomly jitter the data (to
uncover obscured points). It is also possible to change the attribute used to color the
plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to
subsample the data. Note that changes will only come into effect once the Update
button has been pressed.
        When we click on a cell in the scatter plot matrix, this will bring up a separate
window with a visualization of the scatter plot we selected. Data points are plotted in the
main area of the window. At the top are two drop-down list buttons for selecting the
axes to plot. The one on the left shows which attribute is used for the x-axis; the one on
the right shows which is used for the y-axis. Beneath the x-axis selector is a drop-down
list for choosing the color scheme. This allows you to color the points based on the
attribute selected. Below the plot area, a legend describes what values the colors
correspond to. If the values are discrete, we can modify the color used for each one by
clicking on them and making an appropriate selection in the window that pops up.
        To the right of the plot area is a series of horizontal strips. Each strip represents
an attribute, and the dots within it show the distribution of values of the attribute. These
values are randomly scattered vertically to help us see concentrations of points. We can
choose what axes are used in the main graph by clicking on these strips. Left-clicking
an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the
y-axis. The ‘X’ and ‘Y’ written beside the strips shows what the current axes are (‘B’ is
used for ‘both X and Y’). Above the attribute strips is a slider labelled Jitter, which adds
a random displacement to all points in the plot.
Selecting Instances
       There may be situations where it is helpful to select a subset of the data using
the visualization tool. (A special case of this is the UserClassifier in the Classify panel,
which lets us build our own classifier by interactively selecting instances.)
       Below the y-axis selector button is a drop-down list button for choosing a
selection method. A group of data points can be selected in four ways:
1. Select Instance: Clicking on an individual data point brings up a window listing its
attributes. If more than one point appears at the same location, more than one set of
attributes is shown.
2. Rectangle: We can create a rectangle, by dragging, that selects the points inside it.
3. Polygon: We can build a free-form polygon that selects the points inside it. Left-click
to add vertices to the polygon, right-click to complete it. The polygon will always be
closed off by connecting the first point to the last.
4. Polyline: We can build a polyline that distinguishes the points on one side from those
on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting
shape is open (as opposed to a polygon, which is always closed).
        Once an area of the plot has been selected using Rectangle, Polygon or
Polyline, it turns grey. At this point, clicking the Submit button removes all instances
from the plot except those within the grey selection area. Clicking on the Clear button
erases the selected area without affecting the graph. Once any points have been
removed from the graph, the Submit button changes to a Reset button. This button
undoes all previous removals and returns us to the original graph with all points
included. Finally, clicking the Save button allows us to save the currently visible
instances to a new ARFF file.
Figure: Selecting The Open URL Option And Then Entering The URL
a) Preprocessing
Figure: Visualize All Screen After Selecting The Attribute Selection Filter
b) Classification
Output
Scheme:      weka.classifiers.rules.ZeroR
Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
         Manufacturer
         Highway_MPG
         Air_Bags_standard
         Drive_train_type
         Horsepower
         Wheelbase
         Rear_seat_room
         Luggage_capacity
         class
Test mode: evaluate on training data
Correlation coefficient           0
Mean absolute error                7.1368
Root mean squared error               9.6074
Relative absolute error           100    %
Root relative squared error         100    %
Total Number of Instances            93
c) Clustering
Output
=== Run information ===
Scheme:      weka.clusterers.FarthestFirst -N 2 -S 1
Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
         Manufacturer
         Highway_MPG
         Air_Bags_standard
         Drive_train_type
         Horsepower
         Wheelbase
         Rear_seat_room
         Luggage_capacity
         class
Test mode: evaluate on training data
FarthestFirst
==============
Cluster centroids:
Cluster 0
                         Pontiac 27.0 0 1 200.0 108.0 28.5 16.0 18.5
Cluster 1
                         Dodge 24.0 1 2 300.0 97.0 20.0 11.0 25.8
Clustered Instances
0    77 ( 83%)
1    16 ( 17%)
d) Association
Output
=== Run information ===
e) Select Attributes
Output
=== Run information ===
Evaluator: weka.attributeSelection.ClassifierSubsetEval -B
weka.classifiers.rules.ZeroR -T -H "Click to set hold out or test instances" --
Search:      weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -
S1
Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
         Manufacturer
         Highway_MPG
         Air_Bags_standard
         Drive_train_type
         Horsepower
         Wheelbase
         Rear_seat_room
         Luggage_capacity
         class
Evaluation mode: evaluate on all training data
Search Method:
                        Genetic search.
                        Start set: no attributes
                        Population size: 20
                        Number of generations: 20
                        Probability of crossover: 0.6
                     Probability of mutation: 0.033
                     Report frequency: 20
                     Random number seed: 1
Initial population
merit                scaled   subset
 7.1368               0       3
 7.1368               0       48
 7.1368               0       236
 7.1368               0       5
 7.1368               0       123678
 7.1368               0       235678
 7.1368               0       6
 7.1368               0       34
 7.1368               0       7
 7.1368               0       23578
 7.1368               0       18
 7.1368               0       247
 7.1368               0       568
 7.1368               0       3
 7.1368               0       12
 7.1368               0       12458
 7.1368               0       567
 7.1368               0       2456
 7.1368               0       27
 7.1368               0       1247
Generation: 1
merit                scaled   subset
7.1368                0       3
7.1368                0       3
7.1368                0       34
7.1368                0       3
7.1368                0       37
7.1368                0       3
7.1368                0       36
7.1368                0       3
7.1368                0       13
7.1368                0       3
7.1368                0       23
7.1368                0       3
7.1368                0       23
7.1368                0       3
7.1368                0       23
7.1368                0       3
7.1368                0       23
7.1368                    0       3
7.1368                    0       35
7.1368                    0       3
Selected attributes: 3 : 1
              Air_Bags_standard
f) Visualization
To perform the experimenter analysis for different data sets using different algorithms or
techniques, the following steps need to be followed:
Step 1: Select the Experimenter option from the WEKA start-up screen. From the
Experimenter window select the Setup tab. Then choose New, select the result file
format, and specify the path where the results are to be stored.
      In the above figure, we can also select the experiment type, number of iterations,
and number of folds in case of cross validation, data sets and the algorithms.
Step 2: Select the Run tab in the Experimenter window and click Start. The log displays information about the test being conducted and reports any errors.
Step 3: Select the Analyse tab in the Experimenter window. Here we can configure the test options; the result is displayed in the test output window, together with the result list of the techniques applied to the data sets.
       This analysis indicates that the J48 algorithm performs better on the iris.arff data than on the weather.arff data.
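The comparison behind this conclusion can be reproduced outside the Experimenter with WEKA's Evaluation class. This is a minimal sketch, assuming iris.arff and weather.arff sit in the working directory; it runs 10-fold cross-validation of J48 on each file with a fixed random seed.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareDataSets {
    public static void main(String[] args) throws Exception {
        for (String file : new String[] {"iris.arff", "weather.arff"}) {
            Instances data = DataSource.read(file);
            // The class is taken to be the last attribute, as in both files
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation of J48, as configured in the Experimenter
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correctly classified%n", file, eval.pctCorrect());
        }
    }
}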
b) Performing Operations Using The Experimenter On 2 Different Data
Sets With 2 Different Algorithms
For this experiment, we have selected one user-defined data set, namely student.csv, and one predefined data set, namely iris.csv. We have also selected one algorithm from the rules classifiers and one from the trees classifiers.
Output
Step 1: Select the CSV file format. Open the Setup tab to select the files to be compared and the algorithms to be applied.
Step 2: Select the Run tab to view any errors.
Step 3: Select the Analyse tab to perform the test on the selected data sets with the selected algorithms.
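This 2-data-sets-by-2-algorithms test can also be sketched programmatically. Since the manual does not name the exact classifiers chosen, ZeroR and J48 are assumed here as the rules and trees picks, and both CSV files are read with CSVLoader (student.csv is assumed to have a nominal class as its last attribute).

import java.io.File;
import java.util.Random;
import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.CSVLoader;

public class TwoByTwoExperiment {
    public static void main(String[] args) throws Exception {
        String[] files = {"student.csv", "iris.csv"};   // the two data sets
        Classifier[] models = {new ZeroR(), new J48()}; // assumed rules/trees picks

        for (String file : files) {
            CSVLoader loader = new CSVLoader();
            loader.setSource(new File(file));
            Instances data = loader.getDataSet();
            data.setClassIndex(data.numAttributes() - 1);

            for (Classifier model : models) {
                // crossValidateModel copies the classifier, so reuse is safe
                Evaluation eval = new Evaluation(data);
                eval.crossValidateModel(model, data, 10, new Random(1));
                System.out.printf("%s / %s: %.2f%% correct%n", file,
                        model.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }
}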
Step 1: Select the Knowledge Flow option from the WEKA start up screen to open a new layout.
Step 2: Choose the Data Sources tab in the Knowledge Flow window and select the Arff Loader component. Then choose the Configure option, select the data set, and browse to its location.
Figure: Selecting The ARFF Loader From The Data Sources Tab
Step 3: Click on the Evaluation tab and choose the Class Assigner component from the toolbar.
Now connect the Arff Loader to the Class Assigner by selecting the data set option from the pop-up menu of the Arff Loader.
Step 4: Click on the Class Assigner, choose Configure from the menu, and select the class attribute of the weather.arff data.
Select the Cross Validation Fold Maker from the Evaluation toolbar, place it on the layout, and connect the Class Assigner to the Cross Validation Fold Maker by selecting the data set option from the pop-up menu of the Class Assigner.
Figure: Selecting The Cross Validation Fold Maker In The Evaluation Tab
Step 5: Click on the Classifiers tab at the top of the window, select the J48 component, and place it on the layout.
Connect the Cross Validation Fold Maker to the J48 component twice, first choosing the training set connection and then the test set connection.
Step 6: Select the Evaluation tab and place a Classifier Performance Evaluator component on the layout. Connect J48 to the Classifier Performance Evaluator by selecting the batch classifier entry from the pop-up menu of J48.
Figure: Selecting The Classifier Performance Evaluator From The Evaluation Tab
Step 7: Go to the Visualization toolbar and place a Text Viewer component on the layout.
Connect the Classifier Performance Evaluator to the Text Viewer by selecting the text entry from the pop-up menu of the Classifier Performance Evaluator.
Step 8: Start the flow execution by selecting the Start Loading option from the pop-up menu of the Arff Loader.
Figure: Selecting The Start Loading Option From the Pop-up Menu Of ARFF Loader
Step 9: Choose the Show Results option from the pop-up menu of the Text Viewer component.
Figure: Selecting The Show Results Option From The Pop-up Menu Of Text Viewer
Step 10: View the results of the applied J48 classifier in the Text Viewer window.
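The whole flow built in Steps 2 through 10 has a compact programmatic counterpart. The sketch below, assuming weather.arff in the working directory, performs the same load, class assignment, cross-validation, and evaluation chain, and prints the summary that the Text Viewer displays.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnowledgeFlowEquivalent {
    public static void main(String[] args) throws Exception {
        // Arff Loader: read the data set
        Instances data = DataSource.read("weather.arff");
        // Class Assigner: mark the last attribute as the class
        data.setClassIndex(data.numAttributes() - 1);

        // Cross Validation Fold Maker + J48 + Classifier Performance Evaluator
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        // Text Viewer: print the evaluation summary
        System.out.println(eval.toSummaryString());
    }
}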
Step 1: Select the Data Sources tab and choose the ARFF Loader. Then select the Configure option from the pop-up menu of the ARFF Loader in order to select the data set.
Figure: Selecting The ARFF Loader Option From The Data Sources Tab
Step 2: Select the Filters tab and choose the unsupervised filter “AddCluster”. Connect the ARFF Loader to the AddCluster filter by selecting the data set option from the pop-up menu of the ARFF Loader.
Figure: Selecting The Add Cluster Unsupervised Filter From the Filters Tab
Step 3: Select the Visualization tab and choose the “Data Visualizer” to display the graph, and also choose the “Attribute Summarizer” to display a plot of each attribute.
Figure: Selecting The Data Visualizer And Attribute Summarizer From The Visualization Tab
Step 4: Select the Start Loading option from the pop-up menu of the ARFF Loader. Then choose the Show Plot option from the pop-up menu of the Data Visualizer and the Show Summaries option from the pop-up menu of the Attribute Summarizer.
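The AddCluster part of this flow can likewise be reproduced in code. The sketch below is illustrative only: it assumes iris.arff as the input and SimpleKMeans with three clusters as the clusterer, both of which are assumptions rather than settings fixed by the flow above.

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterFlow {
    public static void main(String[] args) throws Exception {
        // ARFF Loader: read the data set (file name assumed)
        Instances data = DataSource.read("iris.arff");

        // AddCluster: append a nominal "cluster" attribute to every instance
        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters(3);            // assumed value for illustration
        AddCluster filter = new AddCluster();
        filter.setClusterer(kMeans);
        filter.setInputFormat(data);
        Instances clustered = Filter.useFilter(data, filter);

        // In place of the Data Visualizer, print the new cluster attribute
        System.out.println(clustered.attribute(clustered.numAttributes() - 1));
    }
}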
Figure: Selecting The Show Plot Option from The Pop-up Menu Of Data Visualizer
Figure: Selecting The Show Summaries Option From The Pop-up Menu Of Attribute Summarizer