
Application Development Lab

INTRODUCTION TO WEKA
Introduction
Weka (Waikato Environment for Knowledge Analysis) is a popular suite of
machine learning software written in Java, developed at the University of Waikato, New
Zealand. The WEKA acronym was coined by Geoff Holmes, and the system was first
implemented in its modern form in 1997. WEKA is free software available under the
GNU General Public License.
Weka is a comprehensive set of advanced data mining and analysis tools. The
strength of Weka lies in the area of classification where it covers many of the most
current machine learning (ML) approaches.

At its simplest, it provides a quick and easy way to explore and analyze data.
Weka is also suitable for dealing with large data sets, where the resources of many
computers and/or multi-processor machines can be used in parallel.

The software is written in the Java™ language and contains a GUI for interacting
with data files and producing visual results like tables and curves. It also has a general
API, so that we can embed WEKA, like any other library, in our own applications to do
such things as automated server-side data-mining tasks.
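
As a minimal sketch of such embedding (assuming weka.jar is on the classpath and a
file named weather.arff is in the working directory), the following Java program loads a
dataset through the API and prints a summary of it:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaEmbed {
    public static void main(String[] args) throws Exception {
        // DataSource picks a suitable converter from the file extension (ARFF, CSV, ...)
        DataSource source = new DataSource("weather.arff");
        Instances data = source.getDataSet();
        // By convention the last attribute is treated as the class attribute
        data.setClassIndex(data.numAttributes() - 1);
        System.out.println(data.toSummaryString());
    }
}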

Data Handling
Weka currently supports three external file formats, namely CSV, binary serialized
instances and C4.5. Weka also allows data to be pulled directly from database servers
as well as web servers. Its native data format is known as the ARFF format. It is basically
a CSV (comma-separated value) format with some extra headers to specify what type
each attribute is (numeric, binary, nominal).

All header commands start with ‘@’ and all comment lines start with ‘%’.
Comment and blank lines are ignored. Attribute declarations start with ‘@attribute’,
followed by the name of the attribute and then its type. There are two main types of
attributes, numeric and nominal. Numeric attributes are declared as ‘real’, ‘integer’
or just ‘numeric’. Nominal attributes are defined by listing in braces all the possible
values the attribute can take.

The ARFF format does not specify which attribute is the class attribute. This is
intentional: in some cases, such as clustering, there is no class attribute. In other
cases there are many class variables, as with association rules, where one would like
to test how well each attribute can be predicted from the other attributes.

Weka Data Formats

Weka permits the input data set to be in numerous file formats like CSV (comma-
separated values: *.csv), binary serialized instances (*.bsi), etc. However, the most
preferred and most convenient input file format is the Attribute-Relation File Format
(ARFF).

Attribute-Relation File Format (ARFF)


WEKA’s file format “ARFF” was created by Andrew Donkin. ARFF was rumored
to stand for Andrew’s Ridiculous File Format. An ARFF (Attribute-Relation File Format)
file is an ASCII text file that describes a list of instances sharing a set of attributes.
ARFF files were developed by the Machine Learning Project at the Department of
Computer Science of The University of Waikato for use with the Weka machine learning
software.
ARFF files have two distinct sections. The first section is the Header
information, which is followed by the Data information. The Header of the ARFF file
contains the name of the relation, a list of the attributes (the columns in the data), and
their types. Comment lines begin with a %. The @RELATION, @ATTRIBUTE and
@DATA declarations are case insensitive.
Comma-Separated Values (CSV)
A comma-separated values (CSV) file is a simple text format for a database
table. Each record in the table is one line of the text file. Each field value of a record is
separated from the next with a comma. Implementations of CSV can often handle field
values with embedded line breaks or separator characters by using quotation marks or
escape sequences. CSV is a simple file format that is widely supported, so it is often
used to move tabular data between different computer programs that support the
format. For example, a CSV file might be used to transfer information from a database
program to a spreadsheet.
Weka Data Types
The <data type> can be any of the four types currently supported by Weka:

● numeric
● <nominal-specification>
● string
● date [<date-format>]

where <nominal-specification> and <date-format> are defined below. The keywords
numeric, string and date are case insensitive.

Numeric attributes
Numeric attributes can be real or integer numbers.

Nominal attributes
Nominal values are defined by providing a <nominal-specification> listing the
possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, …}. Values
that contain spaces must be quoted.

String attributes
String attributes allow us to create attributes containing arbitrary textual values.
This is very useful in text-mining applications, as we can create datasets with string
attributes and then write Weka filters to manipulate strings (such as the
StringToWordVector filter).
String attributes are declared as follows:
@ATTRIBUTE name string

Date attributes
Date attribute declarations take the form:
@attribute <name> date [<date-format>]

where <name> is the name for the attribute and <date-format> is an optional string
specifying how date values should be parsed and printed (this is the same format used
by SimpleDateFormat). The default format string accepts the ISO-8601 combined date
and time format: "yyyy-MM-dd'T'HH:mm:ss".
Dates must be specified in the data section as the corresponding string representations
of the date/time.

INSTALLATION OF WEKA
Installation Of Weka

● Download software from http://www.cs.waikato.ac.nz/ml/weka/


● If we are interested in modifying/extending Weka, there is a developer version
that includes the source code
● Set the environment variable (e.g. the Java CLASSPATH) so that Java can locate weka.jar
● Download some ML data from http://mlearn.ics.uci.edu/MLRepository.html
● It's Java-based, so if you don't have a JRE installed on your computer,
download the WEKA version that contains the JRE, as well.

Weka is an open source collection of data mining tasks which you can utilize in a
number of different ways. It comes with a Graphical User Interface (GUI), but can also
be called from your own Java code. You can even write your own batch files for tasks
that you need to execute more than once, maybe with slightly different parameters each
time.
Steps to be followed in weka
Step 1: Starting up WEKA
Go to C:\Program Files\Weka-3.6.2 and click on Weka.jar or RunWeka.bat.


Figure: Weka Startup Screen


Step 2: Choosing the Applications from Start up screen
We will have a choice between the Explorer, Experimenter, Knowledge Flow and
Simple CLI.
1. Simple CLI: Provides a simple command-line interface that allows direct
execution of WEKA commands for operating systems that do not provide their
own command line interface.
2. Explorer: An environment for exploring data with WEKA.
3. Experimenter: An environment for performing experiments and conducting
statistical tests between learning schemes.
4. Knowledge Flow: This environment supports essentially the same functions as
the Explorer but with a drag-and-drop interface. One advantage is that it supports
incremental learning.

The Weka Explorer can be seen in the following window. It consists of a number of tabs
namely:

Section Tabs

At the very top of the window, just below the title bar, is a row of tabs. When the
Explorer is first started only the first tab is active; the others are grayed out. This is
because it is necessary to open (and potentially pre-process) a data set before starting
to explore the data.
The tabs are as follows:

● Preprocess: Choose and modify the data being acted on.


● Classify: Train and test learning schemes that classify or perform regression.
● Cluster: Learn clusters for the data.
● Associate: Learn association rules for the data.
● Select attributes: Select the most relevant attributes in the data.
● Visualize: View an interactive 2D plot of the data.

Once the tabs are active, clicking on them flicks between different screens, on
which the respective actions can be performed. The bottom area of the window
(including the status box, the log button, and the Weka bird) stays visible regardless of
which section you are in.

Status Box

The status box appears at the very bottom of the window. It displays messages
that keep you informed about what’s going on. The menu gives two options:

● Memory information: Display in the log box the amount of memory available
to WEKA.
● Run garbage collector: Force the Java garbage collector to search for
memory that is no longer needed and free it up, allowing more memory for
new tasks. Note that the garbage collector is constantly running as a
background task anyway.




Figure: Weka Explorer

Log Button

Clicking on this button brings up a separate window containing a scrollable text
field. Each line of text is stamped with the time it was entered into the log. As we
perform actions in WEKA, the log keeps a record of what has happened.

WEKA Status Icon

To the right of the status box is the WEKA status icon. When no processes are
running, the bird sits down and takes a nap.
Step 3: Choosing a data set
The first three buttons at the top of the preprocess section enable you to load
data into WEKA:
OPENING A DATA SET FROM THE EXPLORER (Open File Option)

Open file.... Brings up a dialog box allowing you to browse for the data file on the local
file system. Initially (in the Preprocess tab) click "Open file" and navigate to the directory
containing the data file (.csv or .arff). We now have a number of choices, but before we
can work with any data, we will have to load it into Weka. For now, we will use one of the
included datasets, but later on we will have to get any file we use into the right format.
Open a file from the data subdirectory, for example the weather data, to find the
following screen (the default tab should be Preprocess). The data set is obtained as
follows:
C:\Program Files\Weka3-6\data\Weather.arff


Figure: Opening A Dataset Using Open File Option

Open URL.... Asks for a Uniform Resource Locator address for where the data is
stored.

Open DB.... Reads data from a database. (Note that to make this work, we might have
to edit the file in weka/experiment/DatabaseUtils.props.)

PREPROCESSING ON PREDEFINED DATA SET (weather.arff)

Applying Filters On The Data


A key strength of the Weka system lies in what are called filters. Filters are
algorithms that allow one to modify a dataset. They can be used to convert numeric
attributes to nominal or nominal into binary, to standardize numeric values, to remove
instances with incorrect or missing values, to remove misclassified instances, and much
more. Moreover, filters can be applied on top of each other. Weka’s filters give one a
powerful set of tools to clean and prepare data for analysis.

Weka supports many filters, and for convenience they are organized according to
whether they take class information into account (Supervised/Unsupervised) and
whether they act on instances or attributes. A filter can thus be one of four types, i.e.,
supervised instance filter, unsupervised instance filter, supervised attribute filter and
lastly, unsupervised attribute filter.

Supervised filters take class information into account, while unsupervised filters
do not. Good examples of this are the two filters supervised discretization and
unsupervised discretization. Both these filters are designed to convert numerical
attributes into nominal ones; however the unsupervised filter does not take class
information into account when grouping instances together. There is always a risk that
distinctions between the different instances in relation to the class can be wiped out
when using such a filter.

In our sample data file, we can discretize the attributes. In the "Filter" panel, click
on the "Choose" button. This will show a popup window with a list of available filters.
Scroll down the list and select the "weka.filters.supervised.attribute.Discretize" filter.

Figure: Selecting The Supervised Attribute Filter


Figure: Generic Object Editor Of Discretize Attribute Filter

Figure: Choosing Supervised Attribute Filter: Discretize

Click the "Apply" button to apply this filter to the data. This will discretize the
attributes and create a new working relation (whose name now includes the details of
the filter that was applied).
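
The same discretization can also be applied through the API. A minimal sketch
(assuming weka.jar on the classpath and weather.arff in the working directory; the
class attribute must be set, since the filter is supervised):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.Discretize;

public class DiscretizeDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // supervised filters need a class

        Discretize filter = new Discretize();
        filter.setInputFormat(data);                  // pass the dataset structure first
        Instances discretized = Filter.useFilter(data, filter);

        System.out.println(discretized);              // numeric attributes are now nominal bins
    }
}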


Figure: Applying Supervised Attribute Filter: Discretize

The supervised filter does not have the same problem because it takes the class
information into account and tries to maintain the class distinctions in the grouped
instances. The usefulness of class information and thus supervised filters will of course
depend on the context and either type of filter is useful in many cases.

Figure: Choosing Unsupervised Attribute Filter: Discretize


Figure: Applying Unsupervised Attribute Filter: Discretize

Weka also has another type of filter called instance filters. Instance filters operate
on whole instances of the data rather than on a specific attribute, while attribute filters
operate on attributes in general rather than on specific instances.

Instance filters are also of the supervised or unsupervised types. To further
illustrate the difference between supervised and unsupervised instance filters, we can
compare the supervised filter StratifiedRemoveFolds and its related unsupervised filter
RemoveFolds. Both filters allow one to select a specific cross-validation fold of the data.
The supervised filter takes class information into account and makes sure that the
selected fold is stratified appropriately; the unsupervised filter selects the fold without
stratification.

Figure: Choosing Supervised Instance Filter: Stratifiedremovefolds


Figure: Applying Supervised Instance Filter: “StratifiedRemovefolds”

Figure: Choosing Unsupervised Instance Filter: Removefolds


Figure: Applying Unsupervised Instance Filter: Removefolds


PREPROCESSING ON USER DEFINED DATA SET (student.arff)

Data Set

%student details

@relation student
@attribute sno integer
@attribute sname {Annie,Bpavani,Geetha,Krishna,Padmaja,Tulasi,Sravanthi,Swathi,Swav}
@attribute ERP numeric
@attribute SQAT numeric
@attribute SRE numeric
@attribute MC numeric
@attribute WT numeric
@attribute ADSLAB numeric
@attribute MCLAB numeric
@attribute ADS numeric
@attribute results {pass,fail}

@data
501,Annie,56,60,63,73,54,92,95,67,pass
502,Bpavani,65,76,74,78,67,89,98,72,pass
503,Geetha,63,75,67,78,70,90,96,73,pass
504,Krishna,58,72,70,78,59,88,89,64,pass
505,Padmaja,63,76,73,75,66,82,93,63,pass
506,Tulasi,57,77,67,74,59,87,92,63,pass
507,Sravanthi,74,78,74,78,69,96,97,73,pass
508,Swathi,53,73,62,67,66,81,90,65,pass
509,Swav,22,76,55,45,56,65,73,66,fail


Figure: Output Of User Defined Data Set: Student.Arff


Figure: Visualize All Screen Of All Class Attributes In Student.Arff

Applying Association On Data Set

INTRODUCTION TO ASSOCIATION

Association Rule Mining

An association rule is an implication of the form A => B. The rule holds in the
transaction set D with support s, where s is the percentage of transactions in D that
contain A ∪ B. The rule A => B has confidence c in the transaction set D if c is the
percentage of transactions in D containing A that also contain B:

support(A => B) = P(A ∪ B)

confidence(A => B) = P(B | A)

Rules that satisfy both a minimum support threshold and a minimum confidence
threshold are said to be strong. Association rule mining can be used for finding frequent
patterns, associations, correlations, or causal structures among sets of items or objects
in transaction databases, relational databases, and other information repositories.

Association Mining In Weka

Setting Up

This panel contains schemes for learning association rules, and the learners are
chosen and configured in the same way as the clusterers, filters, and classifiers in the
other panels.

Learning Associations

Once appropriate parameters for the association rule learner have been set, click
the Start button. When complete, right-clicking on an entry in the result list allows the
results to be viewed or saved.

Figure: Different Associators In Weka

Creation of Database For Association

@relation asscdb

@attribute name {annie,pavani,geetha,krishna,tulasi,padhu,sravanthi,swathi}
@attribute age {youth,middle_aged,senior}
@attribute income {high,medium,low}
@attribute student {no,yes}
@attribute class:buys_computer {no,yes}
@data

annie,youth,high,no,yes
pavani,youth,medium,yes,no
geetha,middle_aged,medium,no,yes
krishna,senior,low,no,no
tulasi,senior,high,no,yes
padhu,middle_aged,low,no,no
sravanthi,middle_aged,high,yes,yes
swathi,senior,medium,no,yes

Figure: Viewer Of Asscdb.Arff

Applying Different Associators On Asscdb.Arff

● Apriori

● Filtered Associator

a) Apriori Algorithm

Apriori is the best-known algorithm for mining association rules. The name of the
algorithm reflects the fact that it uses prior knowledge of frequent item set properties.
It uses a breadth-first search strategy for counting the support of item sets and a
candidate generation function which exploits the downward closure property of support.
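
The same run can be reproduced from Java. A minimal sketch (assuming weka.jar on
the classpath and the asscdb.arff file created above), using the option values shown in
the output below:

import weka.associations.Apriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Apriori requires all attributes to be nominal
        Instances data = new DataSource("asscdb.arff").getDataSet();

        Apriori apriori = new Apriori();
        apriori.setNumRules(10);       // -N 10: report the 10 best rules
        apriori.setMinMetric(0.9);     // -C 0.9: minimum confidence
        apriori.buildAssociations(data);

        System.out.println(apriori);   // prints the itemsets and best rules
    }
}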
Output:

=== Run information ===

Scheme: weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: asscdb
Instances: 8
Attributes: 5
name
age
income
student
class:buys_computer
=== Associator model (full training set) ===

Apriori
=======

Minimum support: 0.3 (2 instances)


Minimum metric <confidence>: 0.9
Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 10

Size of set of large itemsets L(2): 12

Size of set of large itemsets L(3): 4

Best rules found:

1. age=senior 3 ==> student=no 3 conf:(1)
2. income=high 3 ==> class:buys_computer=yes 3 conf:(1)
3. income=low 2 ==> student=no 2 conf:(1)

4. income=low 2 ==> class:buys_computer=no 2 conf:(1)
5. age=senior class:buys_computer=yes 2 ==> student=no 2 conf:(1)
6. income=high student=no 2 ==> class:buys_computer=yes 2 conf:(1)
7. income=medium class:buys_computer=yes 2 ==> student=no 2 conf:(1)
8. income=medium student=no 2 ==> class:buys_computer=yes 2 conf:(1)
9. student=no class:buys_computer=no 2 ==> income=low 2 conf:(1)
10. income=low class:buys_computer=no 2 ==> student=no 2 conf:(1)

b) Filtered Associator:
Class for running an arbitrary associator on data that has been passed through
an arbitrary filter. The structure of the filter is based exclusively on the training data,
and test instances will be processed by the filter without changing their structure.

Output
=== Run information ===

Scheme: weka.associations.FilteredAssociator -F "weka.filters.MultiFilter -F
\"weka.filters.unsupervised.attribute.ReplaceMissingValues \"" -c -1 -W
weka.associations.Apriori -- -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation: asscdb
Instances: 8
Attributes: 5
name
age
income
student
class:buys_computer
=== Associator model (full training set) ===

FilteredAssociator using weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M
0.1 -S -1.0 -c -1 on data filtered through weka.filters.MultiFilter -F
"weka.filters.unsupervised.attribute.ReplaceMissingValues "

Filtered Header
@relation asscdb-weka.filters.unsupervised.attribute.ReplaceMissingValues-
weka.filters.MultiFilter-Fweka.filters.unsupervised.attribute.ReplaceMissingValues

@attribute name {annie,pavani,geetha,krishna,tulasi,padhu,sravanthi,swathi}
@attribute age {youth,middle_aged,senior}
@attribute income {high,medium,low}
@attribute student {no,yes}
@attribute class:buys_computer {no,yes}

@data


Associator Model

Apriori
=======

Minimum support: 0.3 (2 instances)


Minimum metric <confidence>: 0.9
Number of cycles performed: 14

Generated sets of large itemsets:

Size of set of large itemsets L(1): 10

Size of set of large itemsets L(2): 12

Size of set of large itemsets L(3): 4

Best rules found:

1. age=senior 3 ==> student=no 3 conf:(1)
2. income=high 3 ==> class:buys_computer=yes 3 conf:(1)
3. income=low 2 ==> student=no 2 conf:(1)
4. income=low 2 ==> class:buys_computer=no 2 conf:(1)
5. age=senior class:buys_computer=yes 2 ==> student=no 2 conf:(1)
6. income=high student=no 2 ==> class:buys_computer=yes 2 conf:(1)
7. income=medium class:buys_computer=yes 2 ==> student=no 2 conf:(1)
8. income=medium student=no 2 ==> class:buys_computer=yes 2 conf:(1)
9. student=no class:buys_computer=no 2 ==> income=low 2 conf:(1)
10. income=low class:buys_computer=no 2 ==> student=no 2 conf:(1)

Applying Classification on Data set

INTRODUCTION TO CLASSIFICATION


Definition

It is the task of generalizing known structure to apply to new data. For example,
an email program might attempt to classify an email as legitimate or spam. Common
algorithms include decision tree learning, nearest neighbor, naive Bayesian
classification and neural networks.

Classification is a data mining (machine learning) technique used to predict
group membership for data instances. For example, you may wish to use classification
to predict whether the weather on a particular day will be “sunny”, “rainy” or “cloudy”.
Popular classification techniques include decision trees and neural networks.

In this session we perform classification using different classifiers as follows:

1. Bayes classifier
2. Functions classifier
3. Trees classifier

Figure: Different Classifiers In Weka

Confusion matrices are very useful for evaluating classifiers, as they provide an
efficient snapshot of a classifier's performance, displaying the distribution of correct and
incorrect instances. A confusion matrix summarizes the results of a supervised
classification: entries along the main diagonal are correct classifications, while entries
off the main diagonal are classification errors.
The Confusion Matrix View includes the following sections:
● Confusion matrix for training data
● Confusion matrix for validation data
● Confusion matrix as computed for the current prune level


Example:

=== Confusion Matrix ===

a b <-- classified as
7 2 | a = yes
3 2 | b = no

Weka was trying to classify instances into two possible classes: yes or no. For
the sake of simplicity, Weka substitutes ‘a’ for yes and ‘b’ for no. The columns
represent the instances that were classified as that class: the first column shows
that in total 10 instances were classified as a by Weka, and 4 were classified as b. The
rows represent the actual instances that belong to each class, so the matrix tells us the
number of times a given class is correctly or incorrectly classified.

From the matrix, we can observe that 7 of the instances that should have been
classed as a were in fact correctly identified. Similarly, 2 b’s were classified correctly.
However, we can also see that 2 a’s were incorrectly classified as b, whereas 3 b’s
were classed as a; the overall accuracy is therefore (7 + 2)/14 ≈ 64.3%. This fine-grained
perspective can provide interesting insights and allows us to assess the suitability of a
particular classifier.

Cross-validation is a method for estimating the true error of a model. When a
model is built from training data, the error on the training data is a rather optimistic
estimate of the error rates the model will achieve on unseen data. The aim of building a
model is usually to apply the model to new, unseen data: we expect the model to
generalize to data other than the training data on which it was built. Thus, we would like
to have some method for better approximating the error that might occur in general.

Test-sample cross-validation is often a preferred method when there is plenty of
data available. A model is built from a training set and its predictive accuracy is
measured by applying the model to a test set. A good rule of thumb is to partition the
dataset into a training set (about two-thirds) and a test set (about one-third).
To measure error rates you might build multiple models with the one algorithm,
using variations of the same training data for each model. The average performance is
then the measure of how well this algorithm works in building models from the data.
Cross-validation is a general, computer-intensive approach used in estimating the
accuracy of statistical models. The idea of cross-validation is to split the data into N
subsets, to put one subset aside, to estimate the parameters of the model from the
remaining N-1 subsets, and to use the retained subset to estimate the error of the
model. This process is repeated N times, with each of the N subsets being used as
the validation set. The error values obtained in these N steps are then combined to
provide the final estimate of the model error.

Cross-validation has the following applications:

● Validating the robustness of a particular mining model.
● Evaluating multiple models from a single statement.
● Building multiple models and then identifying the best model based on statistics.
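
As a programmatic sketch of this procedure (assuming weka.jar on the classpath and
weather.arff in the working directory; J48 is used here simply as an example scheme),
10-fold cross-validation can be run as follows:

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        Evaluation eval = new Evaluation(data);
        // Each of the 10 folds is held out once as the validation set
        eval.crossValidateModel(new J48(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString()); // the confusion matrix
    }
}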

Program

Weather.arff:

%weather data set

@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no


Figure: Viewer Of Weather.Arff

a) APPLYING BAYES METHODS ON WEATHER.ARFF

● Bayes Net classification
● Naïve Bayes classification

Bayesian classification is based on Bayes’ theorem. Bayesian classifiers also
exhibit high accuracy and speed when applied to large databases.

i) Bayes Net Classification

Bayes Nets, or Bayesian networks, are graphical representations of probabilistic
relationships among a set of random variables. Bayesian classifiers are statistical
classifiers. They can predict class membership probabilities, such as the probability that
a given tuple belongs to a particular class.

After applying the Bayes Net classifier on weather.arff, the following output is
generated.

Output

=== Run information ===

Scheme: weka.classifiers.bayes.BayesNet -D -Q
weka.classifiers.bayes.net.search.local.K2 -- -P 1 -S BAYES -E
weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Bayes Network Classifier


not using ADTree
#attributes=5 #classindex=4
Network structure (nodes followed by parents)
outlook(3): play
temperature(1): play
humidity(1): play
windy(2): play
play(2):
LogScore Bayes: -39.49991749238267
LogScore BDeu: -46.69773452542813
LogScore MDL: -46.72727827843424
LogScore ENTROPY: -37.49057762478084
LogScore AIC: -44.49057762478084

Time taken to build model: 0.05 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 11 78.5714 %


Incorrectly Classified Instances 3 21.4286 %
Kappa statistic 0.5116
Mean absolute error 0.3435
Root mean squared error 0.3907
Relative absolute error 73.9949 %
Root relative squared error 81.4869 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class
0.889 0.4 0.8 0.889 0.842 yes
0.6 0.111 0.75 0.6 0.667 no

=== Confusion Matrix ===


a b <-- classified as
8 1 | a = yes
2 3 | b = no

Figure: Visualize Graph Using Training Set

ii) Naive Bayes Classification

A Naïve Bayes classifier is a simple probabilistic classifier based on
applying Bayes' theorem (from Bayesian statistics) with strong (naive)
independence assumptions. A more descriptive term for the underlying probability
model would be "independent feature model".
It assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. Depending on the precise
nature of the probability model, naive Bayes classifiers can be trained very efficiently in
a supervised learning setting. In many practical applications, parameter estimation for
naive Bayes models uses the method of maximum likelihood.

An advantage of the naive Bayes classifier is that it requires a small amount of
training data to estimate the parameters (means and variances of the variables)
necessary for classification.
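
A minimal sketch of this classifier in code (assuming weka.jar on the classpath and
weather.arff in the working directory), evaluated in "use training set" mode as in the
output below:

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        nb.buildClassifier(data);       // estimates the per-class distributions
        System.out.println(nb);         // priors, means and standard deviations

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(nb, data);   // evaluate on the training data itself
        System.out.println(eval.toSummaryString());
    }
}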

Output

=== Run information ===

Scheme: weka.classifiers.bayes.NaiveBayes
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Naive Bayes Classifier

Class yes: Prior probability = 0.63

outlook: Discrete Estimator.


Counts = 3 5 4 (Total = 12)
temperature: Normal Distribution.
Mean = 72.9697
StandardDev = 5.2304
WeightSum = 9
Precision = 1.9090909090909092
humidity: Normal Distribution.
Mean = 78.8395
StandardDev = 9.8023
WeightSum = 9
Precision = 3.4444444444444446
windy: Discrete Estimator.
Counts = 4 7 (Total = 11)

Class no: Prior probability = 0.38

outlook: Discrete Estimator.


Counts = 4 1 3 (Total = 8)
temperature: Normal Distribution.
Mean = 74.8364

StandardDev = 7.384
WeightSum = 5
Precision = 1.9090909090909092
humidity: Normal Distribution.
Mean = 86.1111
StandardDev = 9.2424

WeightSum = 5
Precision = 3.4444444444444446
windy: Discrete Estimator.
Counts = 4 3 (Total = 7)

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 13 92.8571 %


Incorrectly Classified Instances 1 7.1429 %
Kappa statistic 0.8372
Mean absolute error 0.2798
Root mean squared error 0.3315
Relative absolute error 60.2576 %
Root relative squared error 69.1352 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class


1 0.2 0.9 1 0.947 yes
0.8 0 1 0.8 0.889 no

=== Confusion Matrix ===

a b <-- classified as
9 0 | a = yes
1 4 | b = no


Figure: Visualize Classify Errors Using The Training Set


Figure: Visualize Margin Curve Using Training Set

b) APPLYING THE FUNCTION CLASSIFIERS ON STUDENTDB.ARFF

Regression predicts a numeric value. It also can determine the input fields that
are most relevant to predict the target field values. The predicted value might not be
identical to any value contained in the data that is used to build the model.

Numeric prediction is the task of predicting continuous (or ordered) values for
given input. Regression and numeric prediction are synonymous. Regression analysis
can be used to model the relationship between one or more independent or predictor
variables and a dependent or response variable (which is continuous-valued). The
predictor variables are the attributes of interest describing the tuple. The response
variable is what we want to predict.

i) Linear Regression

Linear regression was the first type of regression analysis to be used
extensively in practical applications. This is because models which depend linearly on
their unknown parameters are easier to fit than models which are non-linearly related to
their parameters, and because the statistical properties of the resulting estimators are
easier to determine.

Output
=== Run information ===

Scheme: weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: student
Instances: 9
Attributes: 11
sno
sname
ERP
SQAT
SRE
MC
WT
ADSLAB
MCLAB
ADS
results
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Linear Regression Model

ERP =

8.2262 * sname=Swathi,Annie,Tulasi,Krishna,Geetha,Padmaja,Bpavani,Sravanthi +
1.171 * sname=Annie,Tulasi,Krishna,Geetha,Padmaja,Bpavani,Sravanthi +
2.9988 * sname=Geetha,Padmaja,Bpavani,Sravanthi +
1.2665 * sname=Padmaja,Bpavani,Sravanthi +
7.2117 * sname=Sravanthi +
0.1683 * SRE +
0.2147 * MC +
0.2205 * WT +
0.2361 * ADSLAB +
0.1473 * MCLAB +
-0.158 * ADS +
8.2262 * results=pass +
-24.9382

Time taken to build model: 0.02 seconds

=== Evaluation on training set ===


=== Summary ===

Correlation coefficient 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0%
Root relative squared error 0%
Total Number of Instances 9


Figure: Visualize Classify Errors Using Training Set

ii) Simple Linear Regression

Simple Linear Regression (SLR) is a special regression model where:

● There is only one (numerical) independent variable.
● The model is linear both in the independent variable and, more importantly,
in the parameters.

Output

=== Run information ===

Scheme: weka.classifiers.functions.SimpleLinearRegression
Relation: student
Instances: 17
Attributes: 8
sno
mc
sqat

sre
wt
erp
adsa
avg
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Linear regression on erp

0.6 * erp + 19.93

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correlation coefficient 0.9118


Mean absolute error 5.5765
Root mean squared error 6.66
Relative absolute error 47.2205 %
Root relative squared error 41.064%
Total Number of Instances 17

c) APPLYING DECISION TREES CLASSIFIER ON WEATHER.ARFF

A decision tree is a flowchart-like tree structure, where each internal node (non-
leaf node) denotes a test on an attribute, each branch represents an outcome of the
test, and each leaf node (or terminal node) holds a class label. The topmost node in a
tree is the root node.

● J48 pruned decision tree
● Alternating decision tree

i) J48 Pruned decision tree

The decision trees generated by J48 can be used for classification. J48 builds
decision trees from a set of labeled training data. It uses the fact that each attribute of
the data can be used to make a decision by splitting the data into smaller subsets.

J48 examines the normalized information gain (difference in entropy) that results
from choosing an attribute for splitting the data. To make the decision, the attribute with

the highest normalized information gain is used. The algorithm then recurses on the
smaller subsets. The splitting procedure stops when all instances in a subset belong to
the same class; a leaf node is then created in the decision tree choosing that class.
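
A minimal sketch of building the same tree through the API (assuming weka.jar on the
classpath and weather.arff in the working directory; -C 0.25 -M 2 are the option values
used in the runs below):

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Demo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        J48 tree = new J48();
        tree.setOptions(Utils.splitOptions("-C 0.25 -M 2")); // confidence factor, min leaf size
        tree.buildClassifier(data);

        System.out.println(tree); // prints the pruned tree in the same textual form
    }
}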

Output

Mode 1: Using USE TRAINING SET mode

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Classifier model (full training set) ===

J48 pruned tree


------------------

outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5

Size of the tree : 8

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correctly Classified Instances 14 100%


Incorrectly Classified Instances 0 0%

Kappa statistic 1
Mean absolute error 0
Root mean squared error 0
Relative absolute error 0%
Root relative squared error 0%
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


1 0 1 1 1 1 yes
1 0 1 1 1 1 no
Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

a b <-- classified as
9 0 | a = yes
0 5 | b = no

Figure: Visualize Tree After Applying J48 On Weather.Arff

Mode 2: Using CROSS VALIDATION with different folds

Applying 10 folds

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2

Relation: weather

Instances: 14

Attributes: 5
outlook
temperature
humidity
windy
play

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

J48 pruned tree


------------------

outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves: 5

Size of the tree: 8

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 9 64.2857 %


Incorrectly Classified Instances 5 35.7143 %

Kappa statistic 0.186
Mean absolute error 0.2857
Root mean squared error 0.4818
Relative absolute error 60 %
Root relative squared error 97.6586 %
Total Number of Instances 14

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class


0.778 0.6 0.7 0.778 0.737 yes
0.4 0.222 0.5 0.4 0.444 no

=== Confusion Matrix ===

a b <-- classified as
7 2 | a = yes
3 2 | b = no

Summary report after applying different folds (Cross Validation) on weather.arff

                                   10 folds        6 folds         4 folds
Correctly Classified Instances     64.2857 %       57.1429 %       50 %
Incorrectly Classified Instances   35.7143 %       42.8571 %       50 %
Kappa statistic                    0.186           0.0667          -0.1395
Mean absolute error                0.2857          0.3929          0.4643
Root mean squared error            0.4818          0.5782          0.6116
Relative absolute error            60 %            83.0017 %       98.4466 %
Root relative squared error        97.6586 %       118.3986 %      125.5591 %
Total Number of Instances          14              14              14
TP Rate (yes / no)                 0.778 / 0.4     0.667 / 0.4     0.667 / 0.2
Precision (yes / no)               0.7 / 0.5       0.667 / 0.4     0.6 / 0.25
Recall (yes / no)                  0.778 / 0.4     0.667 / 0.4     0.667 / 0.2
F-Measure (yes / no)               0.737 / 0.444   0.667 / 0.4     0.632 / 0.222
Confusion Matrix                   7 2 | a = yes   6 3 | a = yes   6 3 | a = yes
(a b <-- classified as)            3 2 | b = no    3 2 | b = no    4 1 | b = no

Mode 3: Using PERCENTAGE SPLIT with different percentages

Applying Trained Data: 75% Test Data: 25%

=== Run information ===

Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2


Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: split 75.0% train, remainder test

=== Classifier model (full training set) ===

J48 pruned tree


------------------

outlook = sunny
| humidity <= 75: yes (2.0)
| humidity > 75: no (3.0)
outlook = overcast: yes (4.0)
outlook = rainy
| windy = TRUE: no (2.0)
| windy = FALSE: yes (3.0)

Number of Leaves : 5

Size of the tree : 8

Time taken to build model: 0.02 seconds

=== Evaluation on test split ===

=== Summary ===

Correctly Classified Instances 0 0 %


Incorrectly Classified Instances 3 100 %
Kappa statistic -0.8
Mean absolute error 0.8667
Root mean squared error 0.8869
Relative absolute error 153.6364 %
Root relative squared error 149.6888 %
Total Number of Instances 3

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class


0 1 0 0 0 0 yes
0 1 0 0 0 0 no
Weighted Avg. 0 1 0 0 0 0

=== Confusion Matrix ===

a b <-- classified as
0 1 | a = yes
2 0 | b = no

Summary report after applying different % on weather.arff:

                                   75 %            50 %            28 %
Correctly Classified Instances     0 %             57.1429 %       70 %
Incorrectly Classified Instances   100 %           42.8571 %       30 %
Kappa statistic                    -0.8            -0.2353         0
Mean absolute error                0.8667          0.4286          0.5
Root mean squared error            0.8869          0.6547          0.5
Relative absolute error            153.6364 %      90 %            100 %
Root relative squared error        149.6888 %      136.7198 %      100 %
Total Number of Instances          3               7               10
TP Rate (yes / no)                 0 / 0           0.8 / 0         1 / 0
FP Rate (yes / no)                 1 / 1           1 / 0.2         1 / 0
Precision (yes / no)               0 / 0           0.667 / 0       0.7 / 0
Recall (yes / no)                  0 / 0           0.8 / 0         1 / 0
F-Measure (yes / no)               0 / 0           0.727 / 0       0.824 / 0
Confusion Matrix                   0 1 | a = yes   4 1 | a = yes   7 0 | a = yes
(a b <-- classified as)            2 0 | b = no    2 0 | b = no    3 0 | b = no

Figure: Classifier Evaluation Options Window


Figure: Visualization Classify Errors On Percentage Split Of 28%

ii) Alternating decision tree

An Alternating Decision Tree (ADTree) is a machine learning method for
classification. The ADTree data structure and algorithm are a generalization of decision
trees.

An alternating decision tree consists of decision nodes and prediction nodes.
Decision nodes specify a predicate condition. Prediction nodes contain a single
number. ADTrees always have prediction nodes as both root and leaves. An instance is
classified by an ADTree by following all paths for which all decision nodes are true and
summing any prediction nodes that are traversed.

Output

=== Run information ===

Scheme: weka.classifiers.trees.ADTree -B 10 -E -3
Relation: weather

Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play

Test mode: 10-fold cross-validation

=== Classifier model (full training set) ===

Alternating decision tree:

: -0.255
| (1)outlook = overcast: -1.049
| (1)outlook != overcast: 0.37
| | (2)humidity < 82.5: -0.43
| | | (4)temperature < 66.5: 0.78
| | | (4)temperature >= 66.5: -0.915
| | (2)humidity >= 82.5: 0.486
| | | (3)temperature < 70.5: -1.001
| | | (3)temperature >= 70.5: 1.485
| (5)temperature < 66.5: 0.183
| (5)temperature >= 66.5: -0.398
| | (6)outlook = sunny: 0.065
| | (6)outlook != sunny: -0.381
Legend: -ve = yes, +ve = no

Tree size (total number of nodes): 19


Leaves (number of predictor nodes): 13

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 4 28.5714 %


Incorrectly Classified Instances 10 71.4286 %
Kappa statistic -0.5556
Mean absolute error 0.5965
Root mean squared error 0.6574
Relative absolute error 125.2706 %
Root relative squared error 133.2388 %
Total Number of Instances 14


=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure Class


0.444 1 0.444 0.444 0.444 yes
0 0.556 0 0 0 no

=== Confusion Matrix ===

a b <-- classified as
4 5 | a = yes
5 0 | b = no

Figure: Visualize Tree After Applying Alternating Decision Tree On Weather.Arff


Applying Clustering On Data set

INTRODUCTION TO CLUSTERING
Definition

It is the process of grouping the data into classes or clusters. The process of
grouping a set of physical or abstract objects into classes of similar objects is called
clustering. A cluster is a collection of data objects that are similar to one another within
the same cluster and are dissimilar to the objects in other clusters.

Cluster analysis or clustering is the assignment of a set of observations into subsets
(called clusters) so that observations in the same cluster are similar in some sense.
Clustering is a method of unsupervised learning, and a common technique for statistical
data analysis used in many fields, including machine learning, data mining, pattern
recognition, image analysis and bioinformatics.

Cluster analysis provides a simple profile of individuals: given a number of analysis
units, each described by a set of characteristics and attributes, cluster analysis suggests
how groups of units can be determined such that units within groups are similar in some
respect and unlike those from other groups.

Cluster Modes

There are 4 types of modes, as listed below:

1. Use training set
2. Supplied test set
3. Percentage split
4. Classes to clusters evaluation

An additional option in the Cluster mode box, the Store clusters for visualization
tick box, determines whether or not it will be possible to visualize the clusters once
training is complete.

Ignoring Attributes

Sometimes, some of the attributes in the data should be ignored when clustering.
The Ignore attributes button brings up a small window that allows us to select which
attributes are to be ignored.

Learning Clusters

The Cluster section has Start/Stop buttons, a result text area and a result list.
Right-clicking on an entry in the result list brings up a similar menu, except that it shows
only two visualization options: Visualize cluster assignments and Visualize tree.

Figure: Different Cluster Modes In The Explorer Window

We have different types of clustering methods from which we can select the
required method to perform clustering.


Figure: Different Clustering Methods In Weka

Applying Different Clustering Methods On Weather.Arff

● SimpleKMeans
● FarthestFirst

a) SimpleKMeans
The k-means algorithm assigns each point to the cluster whose center (also
called the centroid) is nearest. The center is the average of all the points in the cluster;
that is, its coordinates are the arithmetic mean for each dimension separately over all
the points in the cluster.
In statistics and machine learning, k-means clustering is a method of cluster
analysis which aims to partition n observations into k clusters in which each observation
belongs to the cluster with the nearest mean. It is one of the simplest unsupervised
learning algorithms that solve the well-known clustering problem.
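
A minimal sketch of running SimpleKMeans through the API (assuming weka.jar on the
classpath and weather.arff in the working directory; -N 2 -S 10 match the runs below):

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KMeansDemo {
    public static void main(String[] args) throws Exception {
        // Note: no class index is set; clusterers reject data with a class attribute assigned
        Instances data = new DataSource("weather.arff").getDataSet();

        SimpleKMeans km = new SimpleKMeans();
        km.setNumClusters(2); // -N 2
        km.setSeed(10);       // -S 10
        km.buildClusterer(data);

        System.out.println(km); // centroids and within-cluster sum of squared errors
    }
}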

Output

Mode 1: Using USE TRAINING SET mode

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy

play
Test mode: evaluate on training data

=== Model and evaluation on training set ===

kMeans
======
Number of iterations: 3

Within cluster sum of squared errors: 16.23745631138724

Cluster centroids:
Cluster 0
Mean/Mode: sunny 75.8889 84.1111 FALSE yes
Std Devs: N/A 6.4893 8.767 N/A N/A

Cluster 1
Mean/Mode: overcast 69.4 77.2 TRUE yes
Std Devs: N/A 4.7223 12.3167 N/A N/A

Clustered Instances
0 9 ( 64%)
1 5 ( 36%)

Mode 2: Using PERCENTAGE SPLIT with different percentages

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: split 75% train, remainder test

=== Clustering model (full training set) ===

kMeans
======

Number of iterations: 3

Within cluster sum of squared errors: 16.23745631138724

Cluster centroids:

Cluster 0
Mean/Mode: sunny 75.8889 84.1111 FALSE yes
Std Devs: N/A 6.4893 8.767 N/A N/A

Cluster 1
Mean/Mode: overcast 69.4 77.2 TRUE yes
Std Devs: N/A 4.7223 12.3167 N/A N/A

=== Model and evaluation on test split ===

kMeans
======
Number of iterations: 2
Within cluster sum of squared errors: 9.11539322593136

Cluster centroids:

Cluster 0
Mean/Mode: overcast 73.5 80.3333 FALSE yes
Std Devs: N/A 7.5033 10.4051 N/A N/A
Cluster 1
Mean/Mode: sunny 72.75 80.25 TRUE no
Std Devs: N/A 6.3443 11.8427 N/A N/A

Clustered Instances
0 2 ( 50%)
1 2 ( 50%)

Summary report after applying different % on weather.arff

                                       75 %               50 %                25 %
Number of iterations                   2                  2                   2
Within cluster sum of squared errors   9.11539322593136   6.382060989909828   0.693530370728173
Clustered Instances                    0  2 ( 50%)        0  3 ( 43%)         0  9 ( 82%)
                                       1  2 ( 50%)        1  4 ( 57%)         1  2 ( 18%)

Mode 3: Using CLASSES TO CLUSTERS EVALUATION mode

a) SELECTING OUTLOOK CLASS ATTRIBUTE

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
temperature
humidity
windy
play
Ignored:
outlook

Test mode: Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

kMeans
======

Number of iterations: 3

Within cluster sum of squared errors: 7.508138450200291

Cluster centroids:

Cluster 0
Mean/Mode: 75.375 83.375 FALSE yes
Std Devs: 6.7387 9.0702 N/A N/A
Cluster 1
Mean/Mode: 71.1667 79.3333 TRUE yes
Std Devs: 6.047 12.1929 N/A N/A

Clustered Instances
0 8 ( 57%)
1 6 ( 43%)

Class attribute: outlook

Classes to Clusters:

0 1 <-- assigned to cluster


3 2 | sunny
2 2 | overcast
3 2 | rainy

Cluster 0 <-- sunny


Cluster 1 <-- overcast

Incorrectly clustered instances : 9.0 64.2857 %

b) SELECTING WINDY CLASS ATTRIBUTE

=== Run information ===

Scheme: weka.clusterers.SimpleKMeans -N 2 -S 10
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
play
Ignored:
windy

Test mode: Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

kMeans
======
Number of iterations: 5
Within cluster sum of squared errors: 9.517365461619956

Cluster centroids:

Cluster 0

Mean/Mode: sunny 74.6 86.2 no
Std Devs: N/A 7.893 9.7314 N/A
Cluster 1
Mean/Mode: overcast 73 79.1111 yes
Std Devs: N/A 6.1644 10.2157 N/A

Clustered Instances

0 5 ( 36%)
1 9 ( 64%)

Class attribute: windy

Classes to Clusters:

0 1 <-- assigned to cluster


3 3 | TRUE
2 6 | FALSE

Cluster 0 <-- TRUE


Cluster 1 <-- FALSE

Incorrectly clustered instances : 5.0 35.7143 %

Summary report after selecting different class attributes

                                   (Nom) outlook                (Nom) windy
Number of iterations               3                            5
Within cluster sum of
squared errors                     7.508138450200291            9.517365461619956
Clustered Instances                0  8 ( 57%)                  0  5 ( 36%)
                                   1  6 ( 43%)                  1  9 ( 64%)
Class attribute                    outlook                      windy
Classes to Clusters                0 1 <-- assigned to cluster  0 1 <-- assigned to cluster
                                   3 2 | sunny                  3 3 | TRUE
                                   2 2 | overcast               2 6 | FALSE
                                   3 2 | rainy
                                   Cluster 0 <-- sunny          Cluster 0 <-- TRUE
                                   Cluster 1 <-- overcast       Cluster 1 <-- FALSE
Incorrectly clustered instances    64.2857 %                    35.7143 %

b) Farthest First

Farthest-first is a variant of k-means that places each cluster centre in turn at
the point farthest from the existing cluster centres. This point must lie within the data
area. This greatly speeds up the clustering in most cases, since less reassignment and
adjustment is needed.
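
The classes-to-clusters evaluation used in the modes below can also be reproduced in
code. A sketch under the same assumptions as before (here play is arbitrarily chosen as
the class attribute; the runs below use outlook and windy instead):

import weka.clusterers.ClusterEvaluation;
import weka.clusterers.FarthestFirst;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

public class ClassesToClustersDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1); // play acts as the class

        // Build the clusterer on a copy with the class attribute removed...
        Remove rm = new Remove();
        rm.setAttributeIndices("" + (data.classIndex() + 1));
        rm.setInputFormat(data);
        Instances noClass = Filter.useFilter(data, rm);

        FarthestFirst ff = new FarthestFirst();
        ff.setNumClusters(2);
        ff.buildClusterer(noClass);

        // ...then evaluate against the data that still carries the class
        ClusterEvaluation eval = new ClusterEvaluation();
        eval.setClusterer(ff);
        eval.evaluateClusterer(data); // maps clusters to classes when a class is set
        System.out.println(eval.clusterResultsToString());
    }
}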

Mode 1: Using USE TRAINING SET mode

=== Run information ===

Scheme: weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Model and evaluation on training set ===

FarthestFirst
==============
Cluster centroids:

Cluster 0
overcast 72.0 90.0 TRUE yes
Cluster 1
sunny 85.0 85.0 FALSE no

Clustered Instances
0 10 ( 71%)
1 4 ( 29%)

Mode 2: Using PERCENTAGE SPLIT with different percentages

Applying Trained Data: 75%, Test Data: 25%

=== Run information ===

Scheme: weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
outlook

temperature
humidity
windy
play

Test mode: split 75% train, remainder test

=== Clustering model (full training set) ===

FarthestFirst
==============
Cluster centroids:

Cluster 0
overcast 72.0 90.0 TRUE yes
Cluster 1
sunny 85.0 85.0 FALSE no

=== Model and evaluation on test split ===

FarthestFirst
==============
Cluster centroids:

Cluster 0
sunny 75.0 70.0 TRUE yes
Cluster 1
rainy 70.0 96.0 FALSE yes

Clustered Instances

0 3 ( 75%)
1 1 ( 25%)

Summary report after applying different %s on weather.arff

                      75 %                  50 %                  25 %
Cluster centroids
  Cluster 0           sunny 75.0 70.0       rainy 70.0 96.0       rainy 65.0 70.0
                      TRUE yes              FALSE yes             TRUE no
  Cluster 1           rainy 70.0 96.0       sunny 80.0 90.0       overcast 64.0 65.0
                      FALSE yes             TRUE no               TRUE yes
Clustered Instances   0  3 ( 75%)           0  5 ( 71%)           0  6 ( 55%)
                      1  1 ( 25%)           1  2 ( 29%)           1  5 ( 45%)

Mode 3: Using CLASSES TO CLUSTERS EVALUATION

a) SELECTING OUTLOOK CLASS ATTRIBUTE

=== Run information ===

Scheme: weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
temperature
humidity
windy
play
Ignored:
outlook
Test mode: Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

FarthestFirst
==============

Cluster centroids:
Cluster 0
72.0 90.0 TRUE yes
Cluster 1
85.0 85.0 FALSE no

Clustered Instances

0 9 ( 64%)
1 5 ( 36%)

Class attribute: outlook

Classes to Clusters:

0 1 <-- assigned to cluster


2 3 | sunny
2 2 | overcast
5 0 | rainy

Cluster 0 <-- rainy


Cluster 1 <-- sunny

Incorrectly clustered instances : 6.0 42.8571 %

b) SELECTING WINDY CLASS ATTRIBUTE

=== Run information ===

Scheme: weka.clusterers.FarthestFirst -N 2 -S 1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
play
Ignored:
windy
Test mode: Classes to clusters evaluation on training data

=== Model and evaluation on training set ===

FarthestFirst
==============
Cluster centroids:

Cluster 0
overcast 72.0 90.0 yes
Cluster 1
rainy 65.0 70.0 no

Clustered Instances

0 8 ( 57%)
1 6 ( 43%)
Class attribute: windy

Classes to Clusters:

0 1 <-- assigned to cluster


3 3 | TRUE
5 3 | FALSE

Cluster 0 <-- FALSE


Cluster 1 <-- TRUE

Incorrectly clustered instances : 6.0 42.8571 %


Summary report after selecting different class attributes

                                   (Nom) outlook                (Nom) windy
Clustered Instances                0  9 ( 64%)                  0  8 ( 57%)
                                   1  5 ( 36%)                  1  6 ( 43%)
Class attribute                    outlook                      windy
Classes to Clusters                0 1 <-- assigned to cluster  0 1 <-- assigned to cluster
                                   2 3 | sunny                  3 3 | TRUE
                                   2 2 | overcast               5 3 | FALSE
                                   5 0 | rainy
                                   Cluster 0 <-- rainy          Cluster 0 <-- FALSE
                                   Cluster 1 <-- sunny          Cluster 1 <-- TRUE
Incorrectly clustered instances    42.8571 %                    42.8571 %

Figure: Visualize Cluster Assignments After Selecting The Windy Class Attribute

Applying Select Attributes On Data Set

INTRODUCTION TO SELECT ATTRIBUTES



Select Attributes in WEKA can be used to decide which are the most relevant
attributes that affect the dependent variable. Attribute selection involves searching
through all possible combinations of attributes in the data to find which subset of
attributes works best for prediction.

To do this, two objects must be set up: an attribute evaluator and a search
method. The evaluator determines what method is used to assign a worth to each
subset of attributes. The search method determines what style of search is performed.
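
A minimal sketch of setting up these two objects through the API (assuming weka.jar
on the classpath and weather.arff in the working directory; GainRatioAttributeEval and
Ranker are the evaluator and search method used in the runs below):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class SelectAttributesDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("weather.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new GainRatioAttributeEval()); // the attribute evaluator
        selector.setSearch(new Ranker());                    // the search method
        selector.SelectAttributes(data);                     // note the capital S in Weka's API

        System.out.println(selector.toResultsString());      // ranked attributes
    }
}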

Options

The Attribute Selection Mode box has two options:

1. Use full training set: The worth of the attribute subset is determined using the full
set of training data.
2. Cross-validation: The worth of the attribute subset is determined by a process of
cross-validation. The Fold and Seed fields set the number of folds to use and the
random seed used when shuffling the data.

When the attribute selection process is finished, the results are output into the
result area, and an entry is added to the result list. Right-clicking on the result list gives
several options.

Figure: Attribute Selection Modes

Mode 1: Using USE FULL TRAINING SET mode

For using the full training set, we first need to select an attribute evaluator and a
search method. The following figures show the available attribute evaluators and
search methods:


Figure: Different Attribute Evaluators

Figure: Different Search Methods

Figure: Choosing Attribute Evaluator


Figure: Generic Object Editor Of GainRatioAttributeEval

Figure: Choosing Search Method

Figure: Generic Object Editor Of The Ranker Search Method

Output

=== Run information ===

Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1

Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
Attribute ranking.

Attribute Evaluator (supervised, Class (nominal): 5 play):


Gain Ratio feature evaluator

Ranked attributes:
0.1564 1 outlook
0.0488 4 windy
0 3 humidity
0 2 temperature

Selected attributes: 1,4,3,2 : 4


Figure: Visualize Reduced Data After Applying The GainRatioAttributeEval Evaluator
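
The same ranking can be reproduced programmatically. A minimal sketch, assuming
weather.arff is in the working directory and that the last attribute, play, is the class:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1); // 'play' is the class

        AttributeSelection attsel = new AttributeSelection();
        attsel.setEvaluator(new GainRatioAttributeEval());
        attsel.setSearch(new Ranker());
        attsel.SelectAttributes(data); // evaluate on all training data

        System.out.println(attsel.toResultsString());
    }
}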

Mode 2: Using CROSS-VALIDATION with 10 folds and seed 1

Output

=== Run information ===

Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Evaluation mode: 10-fold cross-validation

=== Attribute selection 10 fold cross-validation (stratified), seed: 1 ===

average merit      average rank   attribute

0.165 +- 0.036     1.1 +- 0.3     1 outlook
0.06  +- 0.051     1.9 +- 0.3     4 windy
0     +- 0         3   +- 0       3 humidity
0     +- 0         4   +- 0       2 temperature
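
Cross-validated attribute selection can also be scripted; a sketch is shown below. The
setXval, setFolds and setSeed calls mirror the Explorer's cross-validation mode (the
method names are from the WEKA 3.x API and may differ between releases):

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributesCV {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection attsel = new AttributeSelection();
        attsel.setEvaluator(new GainRatioAttributeEval());
        attsel.setSearch(new Ranker());
        attsel.setXval(true); // cross-validate instead of using the full training set
        attsel.setFolds(10);
        attsel.setSeed(1);
        attsel.SelectAttributes(data);

        System.out.println(attsel.toResultsString()); // includes the CV merit/rank table
    }
}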

Using CROSS-VALIDATION with 10 folds and seed 8:

Output

=== Run information ===

Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Evaluation mode: 10-fold cross-validation

=== Attribute selection 10 fold cross-validation (stratified), seed: 8 ===

average merit      average rank   attribute

0.16  +- 0.024     1 +- 0         1 outlook
0.057 +- 0.046     2 +- 0         4 windy
0     +- 0         3 +- 0         3 humidity
0     +- 0         4 +- 0         2 temperature

Using CROSS-VALIDATION with 5 folds and seed 1

Output

=== Run information ===

Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14

Attributes: 5

outlook
temperature
humidity
windy
play
Evaluation mode: 5-fold cross-validation

=== Attribute selection 5 fold cross-validation (stratified), seed: 1 ===

average merit      average rank   attribute

0.198 +- 0.108     1.2 +- 0.4     1 outlook
0.063 +- 0.052     1.8 +- 0.4     4 windy
0     +- 0         3   +- 0       3 humidity
0     +- 0         4   +- 0       2 temperature

Using CROSS-VALIDATION with 5 folds and seed 8

Output

=== Run information ===

Evaluator: weka.attributeSelection.GainRatioAttributeEval
Search: weka.attributeSelection.Ranker -T -1.7976931348623157E308 -N -1
Relation: weather
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Evaluation mode: 5-fold cross-validation

=== Attribute selection 5 fold cross-validation (stratified), seed: 8 ===

average merit      average rank   attribute

0.167 +- 0.024     1 +- 0         1 outlook
0.061 +- 0.049     2 +- 0         4 windy
0     +- 0         3 +- 0         3 humidity
0     +- 0         4 +- 0         2 temperature

VISUALIZATION

WEKA’s visualization section allows us to visualize 2D plots of the current
relation.

The Scatter Plot Matrix

When we select the Visualize panel, it shows a scatter plot matrix for all the
attributes, color coded according to the currently selected class. It is possible to change
the size of each individual 2D plot and the point size, and to randomly jitter the data (to
uncover obscured points). It is also possible to change the attribute used to color the
plots, to select only a subset of attributes for inclusion in the scatter plot matrix, and to
subsample the data. Note that changes only come into effect once the Update button
has been pressed.

Selecting An Individual 2D Scatter Plot

When we click on a cell in the scatter plot matrix, this will bring up a separate
window with a visualization of the scatter plot we selected. Data points are plotted in the
main area of the window. At the top are two drop-down list buttons for selecting the
axes to plot. The one on the left shows which attribute is used for the x-axis; the one on
the right shows which is used for the y-axis. Beneath the x-axis selector is a drop-down
list for choosing the color scheme. This allows you to color the points based on the
attribute selected. Below the plot area, a legend describes what values the colors
correspond to. If the values are discrete, we can modify the color used for each one by
clicking on them and making an appropriate selection in the window that pops up.

To the right of the plot area is a series of horizontal strips. Each strip represents
an attribute, and the dots within it show the distribution of values of the attribute. These
values are randomly scattered vertically to help us see concentrations of points. We can
choose what axes are used in the main graph by clicking on these strips. Left-clicking
an attribute strip changes the x-axis to that attribute, whereas right-clicking changes the
y-axis. The ‘X’ and ‘Y’ written beside the strips shows what the current axes are (‘B’ is
used for ‘both X and Y’). Above the attribute strips is a slider labelled Jitter, which
adds a random displacement to all points in the plot.

Selecting Instances

There may be situations where it is helpful to select a subset of the data using
the visualization tool. (A special case of this is the UserClassifier in the Classify panel,
which lets us build our own classifier by interactively selecting instances.)

Below the y-axis selector button is a drop-down list button for choosing a
selection method. A group of data points can be selected in four ways:

1. Select Instance: Clicking on an individual data point brings up a window listing its
attributes. If more than one point appears at the same location, more than one set of
attributes is shown.


2. Rectangle: We can create a rectangle, by dragging, that selects the points inside it.

3. Polygon: We can build a free-form polygon that selects the points inside it. Left-click
to add vertices to the polygon, right-click to complete it. The polygon will always be
closed off by connecting the first point to the last.

4. Polyline: We can build a polyline that distinguishes the points on one side from those
on the other. Left-click to add vertices to the polyline, right-click to finish. The resulting
shape is open (as opposed to a polygon, which is always closed).

Once an area of the plot has been selected using Rectangle, Polygon or
Polyline, it turns grey. At this point, clicking the Submit button removes all instances
from the plot except those within the grey selection area. Clicking on the Clear button
erases the selected area without affecting the graph. Once any points have been
removed from the graph, the Submit button changes to a Reset button. This button
undoes all previous removals and returns us to the original graph with all points
included. Finally, clicking the Save button allows us to save the currently visible
instances to a new ARFF file.
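
The effect of the Save button can be reproduced programmatically with the ArffSaver
class. A small sketch, using the first 10 instances as a stand-in for an interactively
selected subset:

import java.io.File;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;

public class SaveSubset {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");
        // Keep only the first 10 instances (a stand-in for a GUI selection).
        Instances subset = new Instances(data, 0, 10);

        ArffSaver saver = new ArffSaver();
        saver.setInstances(subset);
        saver.setFile(new File("weather-subset.arff"));
        saver.writeBatch(); // writes the subset out as a new ARFF file
    }
}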


Figure: Visualize Graph Of Weather.Arff


Figure: Options For Visualization

Figure: Attribute Selection Panel

Figure: Subsample Panel


Figure: Visualization After Selecting The Subsamples


ILLUSTRATING THE OPENING OF A DATA SET USING URL


The Open URL option asks the user for a Uniform Resource Locator (URL) from which
the data can be loaded.

Figure: Selecting The Open URL Option And Then Entering The URL
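
Loading from a URL is also possible through the API: the DataSource helper accepts a
URL string as well as a file name. A sketch with a hypothetical address:

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoadFromUrl {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; substitute the address of a reachable ARFF file.
        Instances data = DataSource.read("http://example.com/datasets/auto93.arff");
        System.out.println("Loaded " + data.numInstances() + " instances with "
                + data.numAttributes() + " attributes.");
    }
}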


Figure: Window Displaying The Auto93.arff Data Set

a) Preprocessing

Figure: Selecting The Supervised Attribute Filter: AttributeSelection


Figure: Visualize All Screen After Selecting The Attribute Selection Filter
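
The same preprocessing step can be scripted with the supervised AttributeSelection
filter. A sketch using the CfsSubsetEval evaluator and BestFirst search (the combination
visible in the filtered relation name of the outputs below), assuming a local copy of
auto93.arff:

import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class PreprocessAuto93 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("auto93.arff");
        data.setClassIndex(data.numAttributes() - 1); // supervised filter needs a class

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new CfsSubsetEval());
        filter.setSearch(new BestFirst());
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        System.out.println(reduced.toSummaryString()); // only selected attributes remain
    }
}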

b) Classification

Figure: Selecting The ZeroR Classifier

Output

=== Run information ===

Scheme: weka.classifiers.rules.ZeroR
Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
Manufacturer
Highway_MPG
Air_Bags_standard
Drive_train_type
Horsepower
Wheelbase
Rear_seat_room
Luggage_capacity
class
Test mode: evaluate on training data

=== Classifier model (full training set) ===

ZeroR predicts class value: 19.509677419354837

Time taken to build model: 0 seconds

=== Evaluation on training set ===


=== Summary ===

Correlation coefficient 0
Mean absolute error 7.1368
Root mean squared error 9.6074
Relative absolute error 100 %
Root relative squared error 100 %
Total Number of Instances 93
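
For a numeric class, ZeroR simply predicts the mean of the class values in the training
data (19.5097 above), which is why the relative errors are exactly 100%. A minimal
sketch of building and evaluating it via the API, assuming the filtered auto93 data has
been saved locally:

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ZeroRBaseline {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("auto93.arff");
        data.setClassIndex(data.numAttributes() - 1); // numeric 'class' attribute

        ZeroR zr = new ZeroR();
        zr.buildClassifier(data); // for a numeric class, predicts the mean

        Evaluation eval = new Evaluation(data);
        eval.evaluateModel(zr, data); // evaluate on the training data itself
        System.out.println(eval.toSummaryString());
    }
}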


c) Clustering

Figure: Selecting The Farthest First Clusterer

Output
=== Run information ===

Scheme: weka.clusterers.FarthestFirst -N 2 -S 1
Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
Manufacturer
Highway_MPG
Air_Bags_standard
Drive_train_type
Horsepower
Wheelbase
Rear_seat_room
Luggage_capacity
class
Test mode: evaluate on training data

=== Model and evaluation on training set ===

FarthestFirst
==============

Cluster centroids:

Cluster 0
Pontiac 27.0 0 1 200.0 108.0 28.5 16.0 18.5
Cluster 1
Dodge 24.0 1 2 300.0 97.0 20.0 11.0 25.8

Clustered Instances

0 77 ( 83%)
1 16 ( 17%)


d) Association

Figure: Selecting The Predictive Apriori Algorithm

Output
=== Run information ===

Scheme: weka.associations.PredictiveApriori -N 100


Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
Manufacturer
Highway_MPG
Air_Bags_standard
Drive_train_type
Horsepower
Wheelbase
Rear_seat_room
Luggage_capacity
Class
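
The associator can be run from code as well. A sketch (PredictiveApriori ships with the
WEKA 3.x releases used here; in later versions it was moved to an external package),
assuming a local copy of the data:

import weka.associations.PredictiveApriori;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredApriori {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("auto93.arff");

        PredictiveApriori pa = new PredictiveApriori();
        pa.setNumRules(100); // corresponds to the -N 100 option above
        pa.buildAssociations(data);

        System.out.println(pa); // prints the mined rules
    }
}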


e) Select Attributes

Figure: Selecting The Attribute Evaluator: Classifier Subset Evaluator

Figure: Selecting The Search Method: Genetic Search

Output
=== Run information ===

Evaluator: weka.attributeSelection.ClassifierSubsetEval -B
weka.classifiers.rules.ZeroR -T -H "Click to set hold out or test instances" --
Search: weka.attributeSelection.GeneticSearch -Z 20 -G 20 -C 0.6 -M 0.033 -R 20 -S 1
Relation: auto93.names-weka.filters.supervised.attribute.AttributeSelection-
Eweka.attributeSelection.CfsSubsetEval-Sweka.attributeSelection.BestFirst -D 1 -N 5
Instances: 93
Attributes: 9
Manufacturer
Highway_MPG
Air_Bags_standard
Drive_train_type
Horsepower
Wheelbase
Rear_seat_room
Luggage_capacity
class
Evaluation mode: evaluate on all training data

=== Attribute Selection on all input data ===

Search Method:
Genetic search.
Start set: no attributes
Population size: 20
Number of generations: 20
Probability of crossover: 0.6

Probability of mutation: 0.033
Report frequency: 20
Random number seed: 1

Initial population
merit    scaled   subset
7.1368   0        3
7.1368   0        4 8
7.1368   0        2 3 6
7.1368   0        5
7.1368   0        1 2 3 6 7 8
7.1368   0        2 3 5 6 7 8
7.1368   0        6
7.1368   0        3 4
7.1368   0        7
7.1368   0        2 3 5 7 8
7.1368   0        1 8
7.1368   0        2 4 7
7.1368   0        5 6 8
7.1368   0        3
7.1368   0        1 2
7.1368   0        1 2 4 5 8
7.1368   0        5 6 7
7.1368   0        2 4 5 6
7.1368   0        2 7
7.1368   0        1 2 4 7

Generation: 1
merit    scaled   subset
7.1368   0        3
7.1368   0        3
7.1368   0        3 4
7.1368   0        3
7.1368   0        3 7
7.1368   0        3
7.1368   0        3 6
7.1368   0        3
7.1368   0        1 3
7.1368   0        3
7.1368   0        2 3
7.1368   0        3
7.1368   0        2 3
7.1368   0        3
7.1368   0        2 3
7.1368   0        3
7.1368   0        2 3
7.1368   0        3
7.1368   0        3 5
7.1368   0        3

Attribute Subset Evaluator (supervised, Class (numeric): 9 class):


Classifier Subset Evaluator
Learning scheme: weka.classifiers.rules.ZeroR
Scheme options:
Hold out/test set: Training data
Accuracy estimation: MAE

Selected attributes: 3 : 1
Air_Bags_standard
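
This evaluator/search combination can also be driven from code. A sketch, relying on
GeneticSearch's default settings, which match the run above (population 20, 20
generations, crossover 0.6, mutation 0.033, seed 1); both classes ship with the WEKA 3.x
releases used here:

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.ClassifierSubsetEval;
import weka.attributeSelection.GeneticSearch;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GeneticAttSel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("auto93.arff");
        data.setClassIndex(data.numAttributes() - 1);

        ClassifierSubsetEval evaluator = new ClassifierSubsetEval();
        evaluator.setClassifier(new ZeroR()); // corresponds to -B weka.classifiers.rules.ZeroR

        AttributeSelection attsel = new AttributeSelection();
        attsel.setEvaluator(evaluator);
        attsel.setSearch(new GeneticSearch()); // defaults match the options shown above
        attsel.SelectAttributes(data);

        System.out.println(attsel.toResultsString());
    }
}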
f) Visualization

Figure: Visualization Window Of auto93.arff data set


PERFORMING OPERATIONS WITH EXPERIMENTER


• The Experimenter makes it easy to compare the performance of different learning
schemes.
• It can be used for both classification and regression problems.
• Results can be written to a file or a database.
• Evaluation options include cross-validation, learning curve, and hold-out.
• It can also iterate over different parameter settings; a rough programmatic
analogue of such a comparison is sketched below.
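
The Experimenter itself is driven through the weka.experiment package; as a simpler
programmatic analogue of the comparison in (a) below, we can run the same classifier
over several data sets with the Evaluation class (the file locations are assumed):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareDatasets {
    public static void main(String[] args) throws Exception {
        for (String file : new String[] {"iris.arff", "weather.arff"}) {
            Instances data = DataSource.read(file);
            data.setClassIndex(data.numAttributes() - 1);

            // 10-fold cross-validation of J48 on this data set.
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new J48(), data, 10, new Random(1));
            System.out.printf("%s: %.2f%% correct%n", file, eval.pctCorrect());
        }
    }
}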

a) Comparing iris.arff and weather.arff for J48 classifier

To perform an Experimenter analysis on different data sets using different algorithms or
techniques, the following steps need to be followed:

Step 1: Select the Experimenter option from the WEKA start up screen. From the
Experimenter window select the Setup tab. Then choose New, select the file format, and
specify the path where the results are to be stored.

Figure: Selecting The Set up Option In The Experimenter Window

In the above figure, we can also select the experiment type, the number of iterations,
the number of folds (in the case of cross-validation), the data sets, and the algorithms.

Step 2: Select the run tab from the experimenter window. Then select the start option
from this window. It displays the log information regarding the test being conducted and
displays the errors, if any.

Figure: Selecting The Run Tab In The Experimenter Window

Step 3: Select the analyze tab in the experimenter window. It allows us to configure the
test options. It displays the result in the test output window. It also displays the result list
of the techniques applied on the data sets.


Figure: Selecting The Analyze Tab In The Experimenter Window

This analysis indicates that the J48 algorithm works better on the iris.arff data
than on the weather.arff data.

b) Performing Operations Using The Experimenter On 2 Different Data
Sets With 2 Different Algorithms

For this experiment, we have selected one user-defined data set, student.csv,
and a predefined data set, iris.csv. We have selected one algorithm from the rules
classifiers and one from the trees classifiers.

Output

Step 1: Select the CSV file format. Use the Setup tab to select the files to be
compared and the algorithms to be applied.

Figure: Selecting The Setup Tab In The Experimenter Window


Step 2: Select the run tab in order to view the errors, if any.

Figure: Selecting Run tab In The Experimenter Window

Step 3: Select the Analyze tab to perform the test on the selected data sets with the
selected algorithms.


Figure: Performing The Experiment Using The Analyze Tab


OPERATIONS ON KNOWLEDGE FLOW USING CLASSIFIERS


Step 1: Select the “Knowledge Flow” from the WEKA start up screen.

Step 2: Choose the DataSources tab in the Knowledge Flow window and then select the
ArffLoader. Then select the Configure option and browse to the location of the data
set.

Figure: Selecting The ARFF Loader From The Data Sources Tab

Step 3: Click on the Evaluation tab and choose the ClassAssigner component from the
toolbar. Now connect the ArffLoader to the ClassAssigner by selecting the dataSet
connection.


Figure: Selecting The Class Assigner From The Evaluation Tab


Step 4: Right-click the ClassAssigner, choose Configure from the menu, and select the
class attribute of the weather.arff data.

Select the CrossValidationFoldMaker from the Evaluation toolbar, place it on the
layout, and then connect the ClassAssigner to the CrossValidationFoldMaker by
clicking on the ClassAssigner and selecting the dataSet connection.

Figure: Selecting The Cross Validation Fold Maker In The Evaluation Tab

Step 5: Click on the Classifiers tab at the top of the window, select the J48 component,
and place it on the layout.

Connect the CrossValidationFoldMaker to the J48 component twice: first choosing the
trainingSet connection and then the testSet connection.

Figure: Selecting The J48 Algorithm From The Classifiers Tab

Step 6: Select the Evaluation tab and place a ClassifierPerformanceEvaluator
component on the layout. Connect J48 to the ClassifierPerformanceEvaluator by
selecting the batchClassifier entry from the pop-up menu of J48.

Figure: Selecting The Classifier Performance Evaluator From The Evaluation Tab


Step 7: Go to the Visualization toolbar and place a TextViewer component on the
layout. Connect the ClassifierPerformanceEvaluator to the TextViewer by selecting the
text connection from the pop-up menu of the ClassifierPerformanceEvaluator.

Figure: Selecting The Text Viewer From The Visualization Tab


Step 8: Start the flow execution by selecting the Start Loading option from the pop-up
menu of the ArffLoader.

Figure: Selecting The Start Loading Option From the Pop-up Menu Of ARFF Loader


Step 9: Choose the Show Results option from the pop-up menu of the TextViewer component.

Figure: Selecting The Show results Option From the Pop-up Menu Of Text Viewer


Step 10: View the results of the applied J48 classifier from the text viewer window.

Figure: Viewing The Results In The Text Viewer
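
The flow assembled in Steps 1 to 10 corresponds, component for component, to the
following hand-rolled cross-validation sketch (the mapping is noted in the comments;
the location of weather.arff is assumed):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class KnowledgeFlowEquivalent {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff");  // ArffLoader
        data.setClassIndex(data.numAttributes() - 1);      // ClassAssigner

        int folds = 10;
        Instances run = new Instances(data);
        run.randomize(new Random(1));
        run.stratify(folds);                               // CrossValidationFoldMaker

        Evaluation eval = new Evaluation(run);             // ClassifierPerformanceEvaluator
        for (int i = 0; i < folds; i++) {
            Instances train = run.trainCV(folds, i);
            Instances test = run.testCV(folds, i);
            J48 tree = new J48();                          // J48
            tree.buildClassifier(train);
            eval.evaluateModel(tree, test);
        }
        System.out.println(eval.toSummaryString());        // TextViewer
    }
}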


OPERATIONS ON KNOWLEDGE FLOW USING FILTERS

Step 1: Select the data sources and choose the ARFF Loader. Then select the
configure option from the pop-up menu of the ARFF loader in order to select the data
set.

Figure: Selecting The ARFF Loader Option From The Data Sources Tab

Step 2: Select the Filters tab, choose the unsupervised filter “AddCluster”, and
connect the ArffLoader to the AddCluster by selecting the dataSet option from the
pop-up menu of the ArffLoader.


Figure: Selecting The Add Cluster Unsupervised Filter From the Filters Tab

Step 3: Select the Visualization tab and choose the “DataVisualizer” to display the
graph; also choose the “AttributeSummarizer” to display a plot of every attribute.

Figure: Selecting The Data Visualizer And Attribute Summarizer From The Visualization
Tab

Step 4: Select the Start Loading option from the pop-up menu of the ArffLoader. Choose
the Show Plot option from the pop-up menu of the “DataVisualizer” and the Show
Summaries option from the pop-up menu of the “AttributeSummarizer”.
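
The AddCluster filter can likewise be applied from code; it appends a nominal 'cluster'
attribute holding each instance's assignment. A sketch, assuming weather.arff is
available locally; the choice of SimpleKMeans here is our own, so substitute whichever
clusterer the flow is configured with:

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class AddClusterDemo {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.arff"); // no class attribute set

        AddCluster filter = new AddCluster();
        filter.setClusterer(new SimpleKMeans()); // assumed choice; 2 clusters by default
        filter.setInputFormat(data);

        // The filtered data gains one extra attribute with the cluster assignments.
        Instances withClusters = Filter.useFilter(data, filter);
        System.out.println(withClusters);
    }
}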


Figure: Selecting The Show Plot Option from The Pop-up Menu Of Data Visualizer


Figure: Selecting The Show Summaries Option From The Pop-up Menu Of Attribute
Summarizer

Figure: Visualization After Selecting The Show Plot Of Data Visualizer


Figure: Visualization After Selecting The Show Summaries Of Attribute Summarizer
