P(U) = \prod_{u \in U} p(u|pa(u)).
Below, a Bayesian network is shown for the variables in the iris data set. Note that the links between the nodes class, petallength and petalwidth do not form a directed cycle, so the graph is a proper DAG.
This picture just shows the network structure of the Bayes net, but for each of the nodes a probability distribution for the node given its parents is specified as well. For example, in the Bayes net above there is a conditional distribution
for petallength given the value of class. Since class has no parents, there is an unconditional distribution for class.
Basic assumptions

The classification task consists of classifying a variable y = x_0, called the class variable, given a set of variables x = x_1 . . . x_n, called attribute variables. A classifier h : x \to y is a function that maps an instance of x to a value of y. The classifier is learned from a dataset D consisting of samples over (x, y). The learning task consists of finding an appropriate Bayesian network given a data set D over U.
All Bayes network algorithms implemented in Weka assume the following for the data set:

• all variables are discrete finite variables. If you have a data set with continuous variables, you can use the following filter to discretize them:
weka.filters.unsupervised.attribute.Discretize

• no instances have missing values. If there are missing values in the data set, values are filled in using the following filter:
weka.filters.unsupervised.attribute.ReplaceMissingValues
The first step performed by buildClassifier is checking if the data set fulfills those assumptions. If those assumptions are not met, the data set is automatically filtered and a warning is written to STDERR.¹
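The same preprocessing can also be applied manually before training. A minimal sketch (assuming a local iris.arff file; Filter.useFilter is Weka's standard filtering entry point):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // assumed local file
        data.setClassIndex(data.numAttributes() - 1);
        Discretize filter = new Discretize();            // unsupervised discretization
        filter.setInputFormat(data);                     // must be called before useFilter
        Instances discrete = Filter.useFilter(data, filter);
        System.out.println(discrete.numAttributes() + " attributes, all discrete");
    }
}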
Inference algorithm

To use a Bayesian network as a classifier, one simply calculates \mathrm{argmax}_y P(y|x) using the distribution P(U) represented by the Bayesian network. Now note that

P(y|x) = P(U)/P(x) \propto P(U) = \prod_{u \in U} p(u|pa(u))   (8.1)

And since all variables in x are known, we do not need complicated inference algorithms, but just calculate (8.1) for all class values.
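As a concrete illustration, the following sketch (assuming a local iris.arff) trains BayesNet and reads off the class distribution; distributionForInstance returns P(y|x) for each class value and classifyInstance returns the argmax:

import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BayesNetInference {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");   // assumed local file
        data.setClassIndex(data.numAttributes() - 1);    // class is the last attribute
        BayesNet net = new BayesNet();
        net.buildClassifier(data);                       // filtering, structure, CPTs
        // Evaluate (8.1) for each class value of the first instance.
        double[] dist = net.distributionForInstance(data.instance(0));
        int best = (int) net.classifyInstance(data.instance(0)); // argmax_y P(y|x)
        System.out.println(java.util.Arrays.toString(dist) + " -> class " + best);
    }
}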
Learning algorithms

The dual nature of a Bayesian network makes learning a Bayesian network a natural two-stage process: first learn a network structure, then learn the probability tables.
There are various approaches to structure learning and in Weka, the following areas are distinguished:
¹ If there are missing values in the test data, but not in the training data, the values in the test data are filled in with a ReplaceMissingValues filter based on the training data.
• local score metrics: Learning a network structure B_S can be considered an optimization problem where a quality measure of a network structure given the training data, Q(B_S|D), needs to be maximized. The quality measure can be based on a Bayesian approach, minimum description length, information and other criteria. Those metrics have the practical property that the score of the whole network can be decomposed as the sum (or product) of the scores of the individual nodes. This allows for local scoring and thus local search methods.
• conditional independence tests: These methods mainly stem from the goal of uncovering causal structure. The assumption is that there is a network structure that exactly represents the independencies in the distribution that generated the data. It then follows that if a (conditional) independency can be identified in the data between two variables, there is no arrow between those two variables. Once the locations of the edges are identified, the direction of the edges is assigned such that conditional independencies in the data are properly represented.
• global score metrics: A natural way to measure how well a Bayesian network performs on a given data set is to predict its future performance by estimating expected utilities, such as classification accuracy. Cross-validation provides an out-of-sample evaluation method to facilitate this by repeatedly splitting the data in training and validation sets. A Bayesian network structure can be evaluated by estimating the network's parameters from the training set and the resulting Bayesian network's performance determined against the validation set. The average performance of the Bayesian network over the validation sets provides a metric for the quality of the network.
Cross-validation differs from local scoring metrics in that the quality of a network structure often cannot be decomposed into the scores of the individual nodes. So, the whole network needs to be considered in order to determine the score.
• fixed structure: Finally, there are a few methods by which a structure can be fixed, for example, by reading it from an XML BIF file².
For each of these areas, different search algorithms are implemented in Weka, such as hill climbing, simulated annealing and tabu search.
Once a good network structure is identified, the conditional probability tables for each of the variables can be estimated.
You can select a Bayes net classifier by clicking the classifier Choose button in the Weka explorer, experimenter or knowledge flow and find BayesNet under the weka.classifiers.bayes package (see below).
² See http://www-2.cs.cmu.edu/
q_i denotes the cardinality of the parent set of x_i, i.e., q_i = \prod_{x_j \in pa(x_i)} r_j.
Note pa(x_i) = \emptyset implies q_i = 1. We use N_{ij} (1 \le i \le n, 1 \le j \le q_i) to denote the number of records in D for which pa(x_i) takes its jth value. We use N_{ijk} (1 \le i \le n, 1 \le j \le q_i, 1 \le k \le r_i) to denote the number of records in D for which pa(x_i) takes its jth value and for which x_i takes its kth value. So, N_{ij} = \sum_{k=1}^{r_i} N_{ijk}. We use N to denote the number of records in D.
Let the entropy metric H(B_S, D) of a network structure and database be defined as

H(B_S, D) = -N \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} \frac{N_{ijk}}{N} \log \frac{N_{ijk}}{N_{ij}}   (8.2)
and the number of parameters K as

K = \sum_{i=1}^{n} (r_i - 1) \cdot q_i   (8.3)
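As a quick worked example using the discretized iris network shown in Section 8.8 (binary attributes, ternary class): the node petalwidth has parents class and petallength, so r_i = 2 and q_i = 3 \cdot 2 = 6, and it contributes (2 - 1) \cdot 6 = 6 parameters to K.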
AIC metric The AIC metric Q_{AIC}(B_S, D) of a Bayesian network structure B_S for a database D is

Q_{AIC}(B_S, D) = H(B_S, D) + K   (8.4)
A term P(B_S) can be added [15] representing prior information over network structures, but will be ignored for simplicity in the Weka implementation.
MDL metric The minimum description length metric Q_{MDL}(B_S, D) of a Bayesian network structure B_S for a database D is defined as

Q_{MDL}(B_S, D) = H(B_S, D) + \frac{K}{2} \log N   (8.5)
Bayesian metric The Bayesian metric of a Bayesian network structure B_S for a database D is

Q_{Bayes}(B_S, D) = P(B_S) \prod_{i=0}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(N'_{ij})}{\Gamma(N'_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(N'_{ijk} + N_{ijk})}{\Gamma(N'_{ijk})}

where P(B_S) is the prior on the network structure (taken to be constant, hence ignored in the Weka implementation) and \Gamma(\cdot) the gamma function. N'_{ij} and N'_{ijk} represent choices of priors on counts restricted by N'_{ij} = \sum_{k=1}^{r_i} N'_{ijk}. With N'_{ijk} = 1 (and thus N'_{ij} = r_i), we obtain the K2 metric [19]

Q_{K2}(B_S, D) = P(B_S) \prod_{i=0}^{n} \prod_{j=1}^{q_i} \frac{(r_i - 1)!}{(r_i - 1 + N_{ij})!} \prod_{k=1}^{r_i} N_{ijk}!

With N'_{ijk} = 1/(r_i \cdot q_i) (and thus N'_{ij} = 1/q_i), we obtain the BDe metric [22].
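To make the K2 metric concrete, consider a small hand calculation with hypothetical counts: a single binary node with no parents has q_i = 1 and r_i = 2; with N_{i11} = 3 and N_{i12} = 1 (so N_{i1} = 4), its factor in the product is

\frac{(r_i - 1)!}{(r_i - 1 + N_{i1})!} \prod_{k=1}^{r_i} N_{i1k}! = \frac{1!}{5!} \cdot 3! \cdot 1! = \frac{6}{120} = \frac{1}{20}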
8.2.2 Search algorithms

The following search algorithms are implemented for local score metrics:
• K2 [19]: hill climbing adding arcs with a fixed ordering of variables.
Specific option: randomOrder; if true, a random ordering of the nodes is made at the beginning of the search. If false (default) the ordering in the data set is used. The only exception in both cases is that in case the initial network is a naive Bayes network (initAsNaiveBayes set true) the class variable is made first in the ordering.
• Hill Climbing [16]: hill climbing adding and deleting arcs with no fixed ordering of variables.
useArcReversal: if true, arc reversals are also considered when determining the next step to make.
• Repeated Hill Climber starts with a randomly generated network and then applies the hill climber to reach a local optimum. The best network found is returned.
useArcReversal option as for Hill Climber.
• LAGD Hill Climbing does hill climbing with look ahead on a limited set of best scoring steps, implemented by Manuel Neubach. The number of look ahead steps and number of steps considered for look ahead are configurable.
• TAN [17, 21]: Tree Augmented Naive Bayes where the tree is formed by calculating the maximum weight spanning tree using the Chow and Liu algorithm [18].
No specific options.
• Simulated annealing [15]: using adding and deleting arrows.
The algorithm randomly generates a candidate network B'_S close to the current network B_S. It accepts the network if it is better than the current, i.e., Q(B'_S, D) > Q(B_S, D). Otherwise, it accepts the candidate with probability

e^{t_i \cdot (Q(B'_S, D) - Q(B_S, D))}

where t_i is the temperature at iteration i. The temperature starts at t_0 and slowly decreases with each iteration (a sketch of this acceptance rule follows this list).
Specific options:
TStart is the start temperature t_0.
delta is the factor \delta used to update the temperature, so t_{i+1} = \delta \cdot t_i.
runs is the number of iterations used to traverse the search space.
seed is the initialization value for the random number generator.
• Tabu search [15]: using adding and deleting arrows.
Tabu search performs hill climbing until it hits a local optimum. Then it steps to the least worse candidate in the neighborhood. However, it does not consider points in the neighborhood it just visited in the last tl steps. These steps are stored in a so-called tabu list.
Specific options:
runs is the number of iterations used to traverse the search space.
tabuList is the length tl of the tabu list.
• Genetic search: applies a simple implementation of a genetic search algorithm to network structure learning. A Bayes net structure is represented by an array of n \cdot n (n = number of nodes) bits where bit i \cdot n + j represents whether there is an arrow from node j \to i.
Specific options:
populationSize is the size of the population selected in each generation.
descendantPopulationSize is the number of offspring generated in each generation.
runs is the number of generations to generate.
seed is the initialization value for the random number generator.
useMutation flag to indicate whether mutation should be used. Mutation is applied by randomly adding or deleting a single arc.
useCrossOver flag to indicate whether cross-over should be used. Cross-over is applied by randomly picking an index k in the bit representation and selecting the first k bits from one and the remainder from another network structure in the population. At least one of useMutation and useCrossOver should be set to true.
useTournamentSelection when false, the best performing networks are selected from the descendant population to form the population of the next generation. When true, tournament selection is used. Tournament selection randomly chooses two individuals from the descendant population and selects the one that performs best.
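As promised above, here is a minimal sketch of the simulated annealing acceptance rule described in this list (illustrative only, not Weka's internal code; the two score values and the random generator are assumed to be supplied by the caller):

import java.util.Random;

public class AnnealStep {
    // Accept a better network always; accept a worse one with
    // probability e^{t_i (Q(B'_S,D) - Q(B_S,D))} as defined above.
    public static boolean accept(double scoreCandidate, double scoreCurrent,
                                 double ti, Random rand) {
        double delta = scoreCandidate - scoreCurrent;
        if (delta > 0) return true;
        return rand.nextDouble() < Math.exp(ti * delta);
    }
}

After each iteration the temperature is updated as t_{i+1} = \delta \cdot t_i, so worse candidates become less likely to be accepted over time.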
8.3 Conditional independence test based structure learning

Conditional independence tests in Weka are slightly different from the standard tests described in the literature. To test whether variables x and y are conditionally independent given a set of variables Z, a network structure with arrows \bigcup_{z \in Z} z \to y is compared with one with arrows \{x \to y\} \cup \bigcup_{z \in Z} z \to y. A test is performed by using any of the score metrics described in Section 8.2.1.
At the moment, only the ICS [25] and CI algorithms are implemented.
The ICS algorithm makes two steps: first find a skeleton (the undirected graph with edges iff there is an arrow in the network structure), and second direct all the edges in the skeleton to get a DAG.
Starting with a complete undirected graph, we try to find conditional independencies \langle x, y|Z \rangle in the data. For each pair of nodes x, y, we consider sets Z starting with cardinality 0, then 1, up to a user defined maximum. Furthermore, the set Z is a subset of nodes that are neighbors of both x and y. If an independency is identified, the edge between x and y is removed from the skeleton.
The first step in directing arrows is to check for every configuration x - z - y, where x and y are not connected in the skeleton, whether z is in the set Z of variables that justified removing the link between x and y (cached in the first step). If z is not in Z, we can assign direction x \to z \gets y.
Finally, a set of graphical rules is applied [25] to direct the remaining arrows.

Rule 1: i->j--k  &  i-/-k  =>  j->k
Rule 2: i->j->k  &  i--k   =>  i->k
Rule 3:      m
            /|\
           i | k   =>  m->j
            \|/
             j      (given i->j<-k)
Rule 4:      m
            / \
           i---k   =>  i->m & k->m
            \ /
             j      (given i->j)
Rule 5: if no edges are directed then take a random one (first we can find)
The ICS algorithm comes with the following options.
Since the ICS algorithm is focused on recovering causal structure, instead of finding the optimal classifier, the Markov blanket correction can be made afterwards.
Specific options:
The maxCardinality option determines the largest subset of Z to be considered in conditional independence tests \langle x, y|Z \rangle.
The scoreType option is used to select the scoring metric.
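For example, a command line run of ICS might look as follows (an illustration following the -Q/-E pattern of Section 8.7, with a local, already discretized iris.arff assumed):

java weka.classifiers.bayes.BayesNet -t iris.arff \
  -Q weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm -- -cardinality 2 \
  -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5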
8.4 Global score metric based structure learning

Common options for cross-validation based algorithms are: initAsNaiveBayes, markovBlanketClassifier and maxNrOfParents (see Section 8.2 for description).
Further, for each of the cross-validation based algorithms the CVType can be chosen out of the following:
• Leave one out cross-validation (loo-cv) selects m = N training sets simply by taking the data set D and removing the ith record for training set D_i^t. The validation set consists of just the ith single record. Loo-cv does not always produce accurate performance estimates.
• K-fold cross-validation (k-fold cv) splits the data D in m approximately equal parts D_1, . . . , D_m. Training set D_i^t is obtained by removing part D_i from D. Typical values for m are 5, 10 and 20. With m = N, k-fold cross-validation becomes loo-cv.
• Cumulative cross-validation (cumulative cv) starts with an empty data set and adds instances item by item from D. After each time an item is added, the next item to be added is classified using the then current state of the Bayes network.
Finally, the useProb flag indicates whether the accuracy of the classifier should be estimated using the zero-one loss (if set to false) or using the estimated probability of the class.
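For example, a global score run using k-fold cross-validation as the score type might look as follows (an illustration assembled from the options documented in Section 8.7; a local iris.arff is assumed):

java weka.classifiers.bayes.BayesNet -t iris.arff \
  -Q weka.classifiers.bayes.net.search.global.K2 -- -P 2 -S k-Fold-CV \
  -E weka.classifiers.bayes.net.estimate.SimpleEstimator -- -A 0.5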
The following search algorithms are implemented: K2, HillClimbing, RepeatedHillClimber, TAN, Tabu Search, Simulated Annealing and Genetic Search. See Section 8.2 for a description of the specific options for those algorithms.
8.5 Fixed structure learning

The structure learning step can be skipped by selecting a fixed network structure. There are two methods of getting a fixed structure: just make it a naive Bayes network, or read it from a file in XML BIF format.
8.6 Distribution learning

Once the network structure is learned, you can choose how to learn the probability tables by selecting a class in the weka.classifiers.bayes.net.estimate package.
The SimpleEstimator class produces direct estimates of the conditional probabilities, that is,

P(x_i = k | pa(x_i) = j) = \frac{N_{ijk} + N'_{ijk}}{N_{ij} + N'_{ij}}

where N'_{ijk} is the alpha parameter that can be set and is 0.5 by default. With alpha = 0, we get maximum likelihood estimates.
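The estimator can also be configured programmatically before training; a minimal sketch, equivalent in intent to the command line options -E ... -- -A 0.5 of Section 8.7 (the setAlpha and setEstimator property setters are assumed from the corresponding GUI property names):

import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.estimate.SimpleEstimator;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AlphaExample {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff");  // assumed local file
        data.setClassIndex(data.numAttributes() - 1);
        SimpleEstimator estimator = new SimpleEstimator();
        estimator.setAlpha(0.5);       // N'_ijk; 0.0 gives maximum likelihood
        BayesNet net = new BayesNet();
        net.setEstimator(estimator);   // assumed property setter
        net.buildClassifier(data);
    }
}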
With the BMAEstimator, we get estimates for the conditional probability tables based on Bayes model averaging of all network structures that are substructures of the network structure learned [15]. This is achieved by estimating the conditional probability table of a node x_i given its parents pa(x_i) as a weighted average of all conditional probability tables of x_i given subsets of pa(x_i). The weight of a distribution P(x_i|S) with S \subseteq pa(x_i) used is proportional to the contribution of network structure \bigcup_{y \in S} y \to x_i to either the BDe metric or K2 metric depending on the setting of the useK2Prior option (false and true respectively).
8.7 Running from the command line

These are the command line options of BayesNet.

General options:
-t   <name   of   training   file>
Sets   training   file.
-T   <name   of   test   file>
Sets   test   file.   If   missing,   a   cross-validation   will   be   performed   on   the
training   data.
-c   <class   index>
Sets   index   of   class   attribute   (default:   last).
-x   <number   of   folds>
Sets   number   of   folds   for   cross-validation   (default:   10).
-no-cv
Do   not   perform   any   cross   validation.
-split-percentage   <percentage>
Sets   the   percentage   for   the   train/test   set   split,   e.g.,   66.
-preserve-order
Preserves   the   order   in   the   percentage   split.
-s   <random   number   seed>
Sets   random   number   seed   for   cross-validation   or   percentage   split
(default:   1).
-m   <name   of   file   with   cost   matrix>
Sets   file   with   cost   matrix.
-l   <name   of   input   file>
Sets   model   input   file.   In   case   the   filename   ends   with   .xml,
the   options   are   loaded   from   the   XML   file.
-d   <name   of   output   file>
Sets   model   output   file.   In   case   the   filename   ends   with   .xml,
only   the   options   are   saved   to   the   XML   file,   not   the   model.
-v
Outputs   no   statistics   for   training   data.
-o
Outputs   statistics   only,   not   the   classifier.
-i
Outputs   detailed   information-retrieval   statistics   for   each   class.
-k
Outputs   information-theoretic   statistics.
-p   <attribute   range>
Only   outputs   predictions   for   test   instances   (or   the   train
instances   if   no   test   instances   provided),   along   with   attributes
(0   for   none).
-distribution
Outputs   the   distribution   instead   of   only   the   prediction
in   conjunction   with   the   -p   option   (only   nominal   classes).
-r
Only   outputs   cumulative   margin   distribution.
-g
Only   outputs   the   graph   representation   of   the   classifier.
-xml   filename   |   xml-string
Retrieves   the   options   from   the   XML-data   instead   of   the   command   line.
Options   specific   to   weka.classifiers.bayes.BayesNet:
-D
Do   not   use   ADTree   data   structure
-B   <BIF   file>
BIF   file   to   compare   with
-Q   weka.classifiers.bayes.net.search.SearchAlgorithm
Search   algorithm
-E   weka.classifiers.bayes.net.estimate.SimpleEstimator
Estimator   algorithm
The search algorithm option -Q and the estimator option -E are mandatory.
Note that it is important that the -E option is used after the -Q option. Extra options can be passed to the search algorithm and the estimator after the class name, specified following --.
For  example:
java   weka.classifiers.bayes.BayesNet   -t   iris.arff   -D   \
-Q   weka.classifiers.bayes.net.search.local.K2   --   -P   2   -S   ENTROPY   \
-E   weka.classifiers.bayes.net.estimate.SimpleEstimator   --   -A   1.0
Overview  of  options  for  search  algorithms
• weka.classifiers.bayes.net.search.local.GeneticSearch
-L   <integer>
Population  size
-A   <integer>
Descendant  population  size
-U   <integer>
Number   of   runs
-M
Use   mutation.
(default  true)
-C
Use   cross-over.
(default  true)
-O
Use   tournament  selection  (true)   or   maximum  subpopulation  (false).
(default  false)
-R   <seed>
Random   number   seed
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,  BDeu,   MDL,   ENTROPY  and   AIC)
• weka.classifiers.bayes.net.search.local.HillClimber
-P   <nr   of   parents>
Maximum   number   of   parents
-R
Use   arc   reversal  operation.
(default  false)
-N
Initial   structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,  BDeu,   MDL,   ENTROPY  and   AIC)
• weka.classifiers.bayes.net.search.local.K2
-N
Initial   structure  is   empty   (instead  of   Naive   Bayes)
-P   <nr   of   parents>
Maximum   number   of   parents
-R
Random   order.
(default  false)
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,   BDeu,   MDL,   ENTROPY   and   AIC)
• weka.classifiers.bayes.net.search.local.LAGDHillClimber
-L   <nr   of   look   ahead   steps>
Look   Ahead   Depth
-G   <nr   of   good   operations>
Nr   of   Good   Operations
-P   <nr   of   parents>
Maximum  number   of   parents
-R
Use   arc   reversal   operation.
(default  false)
-N
Initial  structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies  a   Markov   Blanket  correction  to   the   network   structure,
after   a   network   structure  is   learned.  This   ensures   that   all
nodes   in   the   network   are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,   BDeu,   MDL,   ENTROPY   and   AIC)
• weka.classifiers.bayes.net.search.local.RepeatedHillClimber
-U   <integer>
Number   of   runs
-A   <seed>
Random   number   seed
-P   <nr   of   parents>
Maximum  number   of   parents
-R
Use   arc   reversal   operation.
(default  false)
-N
Initial  structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies  a   Markov   Blanket  correction  to   the   network   structure,
after   a   network   structure  is   learned.  This   ensures   that   all
nodes   in   the   network   are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,   BDeu,   MDL,   ENTROPY   and   AIC)
• weka.classifiers.bayes.net.search.local.SimulatedAnnealing
-A   <float>
Start   temperature
-U   <integer>
Number   of   runs
-D   <float>
Delta   temperature
-R   <seed>
Random   number   seed
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,  BDeu,   MDL,   ENTROPY  and   AIC)
• weka.classifiers.bayes.net.search.local.TabuSearch
-L   <integer>
Tabu   list   length
-U   <integer>
Number   of   runs
-P   <nr   of   parents>
Maximum   number   of   parents
-R
Use   arc   reversal  operation.
(default  false)
-N
Initial   structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,  BDeu,   MDL,   ENTROPY  and   AIC)
• weka.classifiers.bayes.net.search.local.TAN
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,   BDeu,   MDL,   ENTROPY   and   AIC)
• weka.classifiers.bayes.net.search.ci.CISearchAlgorithm
-mbc
Applies  a   Markov   Blanket  correction  to   the   network   structure,
after   a   network   structure  is   learned.  This   ensures   that   all
nodes   in   the   network   are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,   BDeu,   MDL,   ENTROPY   and   AIC)
• weka.classifiers.bayes.net.search.ci.ICSSearchAlgorithm
-cardinality  <num>
When   determining  whether  an   edge   exists   a   search   is   performed
for   a   set   Z   that   separates  the   nodes.   MaxCardinality  determines
the   maximum   size   of   the   set   Z.   This   greatly  influences  the
length   of   the   search.   (default  2)
-mbc
Applies  a   Markov   Blanket  correction  to   the   network   structure,
after   a   network   structure  is   learned.  This   ensures   that   all
nodes   in   the   network   are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [BAYES|MDL|ENTROPY|AIC|CROSS_CLASSIC|CROSS_BAYES]
Score   type   (BAYES,   BDeu,   MDL,   ENTROPY   and   AIC)
• weka.classifiers.bayes.net.search.global.GeneticSearch
-L   <integer>
Population  size
-A   <integer>
Descendant  population  size
-U   <integer>
Number   of   runs
-M
Use   mutation.
(default  true)
-C
Use   cross-over.
(default  true)
-O
Use   tournament  selection  (true)   or   maximum   subpopulation  (false).
(default  false)
-R   <seed>
Random   number   seed
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.global.HillClimber
-P   <nr   of   parents>
Maximum   number   of   parents
-R
Use   arc   reversal  operation.
(default  false)
-N
Initial   structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.global.K2
-N
Initial   structure  is   empty   (instead  of   Naive   Bayes)
-P   <nr   of   parents>
Maximum   number   of   parents
-R
Random   order.
(default  false)
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.global.RepeatedHillClimber
-U   <integer>
Number   of   runs
-A   <seed>
Random   number   seed
-P   <nr   of   parents>
Maximum  number   of   parents
-R
Use   arc   reversal   operation.
(default  false)
-N
Initial  structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies  a   Markov   Blanket  correction  to   the   network   structure,
after   a   network   structure  is   learned.  This   ensures   that   all
nodes   in   the   network   are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.global.SimulatedAnnealing
-A   <float>
Start   temperature
-U   <integer>
Number   of   runs
-D   <float>
Delta   temperature
-R   <seed>
Random   number   seed
-mbc
Applies  a   Markov   Blanket  correction  to   the   network   structure,
after   a   network   structure  is   learned.  This   ensures   that   all
nodes   in   the   network   are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.global.TabuSearch
-L   <integer>
Tabu   list   length
-U   <integer>
Number   of   runs
-P   <nr   of   parents>
Maximum   number   of   parents
-R
Use   arc   reversal  operation.
(default  false)
-N
Initial   structure  is   empty   (instead  of   Naive   Bayes)
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.global.TAN
-mbc
Applies   a   Markov   Blanket   correction  to   the   network   structure,
after   a   network  structure  is   learned.  This   ensures   that   all
nodes   in   the   network  are   part   of   the   Markov   blanket   of   the
classifier  node.
-S   [LOO-CV|k-Fold-CV|Cumulative-CV]
Score   type   (LOO-CV,k-Fold-CV,Cumulative-CV)
-Q
Use   probabilistic  or   0/1   scoring.
(default  probabilistic  scoring)
• weka.classifiers.bayes.net.search.fixed.FromFile
-B   <BIF   File>
Name   of   file   containing  network   structure  in   BIF   format
• weka.classifiers.bayes.net.search.fixed.NaiveBayes
No   options.
Overview  of  options  for  estimators
• weka.classifiers.bayes.net.estimate.BayesNetEstimator
-A   <alpha>
Initial  count   (alpha)
• weka.classifiers.bayes.net.estimate.BMAEstimator
-k2
Whether  to   use   K2   prior.
-A   <alpha>
Initial  count   (alpha)
• weka.classifiers.bayes.net.estimate.MultiNomialBMAEstimator
-k2
Whether  to   use   K2   prior.
-A   <alpha>
Initial  count   (alpha)
• weka.classifiers.bayes.net.estimate.SimpleEstimator
-A   <alpha>
Initial  count   (alpha)
Generating random networks and artificial data sets
You can generate random Bayes nets and data sets using
weka.classifiers.bayes.net.BayesNetGenerator
The  options  are:
-B
Generate  network   (instead  of   instances)
-N   <integer>
Nr   of   nodes
-A   <integer>
Nr   of   arcs
-M   <integer>
Nr   of   instances
-C   <integer>
Cardinality  of   the   variables
-S   <integer>
Seed   for   random   number   generator
-F   <file>
The   BIF   file   to   obtain   the   structure  from.
The network structure is generated by first generating a tree so that we can ensure that we have a connected graph. If any more arrows are specified they are randomly added.
8.8 Inspecting Bayesian networks

You can inspect some of the properties of Bayesian networks that you learned in the Explorer in text format and also in graphical format.

Bayesian networks in text

Below, you find output typical for a 10 fold cross-validation run in the Weka Explorer with comments where the output is specific for Bayesian nets.
===   Run   information  ===
Scheme:   weka.classifiers.bayes.BayesNet  -D   -B   iris.xml   -Q   weka.classifiers.bayes.net.
Options  for  BayesNet include  the  class  names  for  the  structure  learner  and  for
the  distribution  estimator.
Relation:   iris-weka.filters.unsupervised.attribute.Discretize-B2-M-1.0-Rfirst-last
Instances:   150
Attributes:   5
sepallength
sepalwidth
petallength
petalwidth
class
Test   mode:   10-fold   cross-validation
===   Classifier  model   (full   training  set)   ===
Bayes   Network   Classifier
not   using   ADTree
Indication whether the ADTree algorithm [24] for calculating counts in the data
set  was  used.
#attributes=5  #classindex=4
This line lists the number of attributes and the index of the class variable for which the classifier was trained.
Network   structure  (nodes   followed  by   parents)
sepallength(2):  class
sepalwidth(2):  class
petallength(2):  class   sepallength
petalwidth(2):  class   petallength
class(3):
This list specifies the network structure. Each of the variables is followed by a list of parents, so the petallength variable has parents sepallength and class, while class has no parents. The number in parentheses is the cardinality of the variable. It shows that in the iris dataset the class variable has three values. All other variables are made binary by running them through a discretization filter.
LogScore   Bayes:   -374.9942769685747
LogScore   BDeu:   -351.85811477631626
LogScore   MDL:   -416.86897021246466
LogScore   ENTROPY:  -366.76261727150217
LogScore   AIC:   -386.76261727150217
These lines list the logarithmic score of the network structure for various methods of scoring.
If a BIF file was specified, the following two lines will be produced (if no such file was specified, no information is printed).
Missing:   0   Extra:   2   Reversed:  0
Divergence:  -0.0719759699700729
In this case the network that was learned was compared with a file iris.xml which contained the naive Bayes network structure. The number after Missing is the number of arcs that were in the network in the file but are not recovered by the structure learner. Note that a reversed arc is not counted as missing. The number after Extra is the number of arcs in the learned network that are not in the network on file. The number of reversed arcs is listed as well.
Finally, the divergence between the network distribution on file and the one learned is reported. This number is calculated by enumerating all possible instantiations of all variables, so it may take some time to calculate the divergence for large networks.
The remainder of the output is standard output for all classifiers.
Time   taken   to   build   model:   0.01   seconds
===   Stratified  cross-validation  ===
===   Summary  ===
Correctly  Classified  Instances   116   77.3333   %
Incorrectly  Classified  Instances   34   22.6667   %
etc...
Bayesian  networks  in  GUI
To show the graphical structure, right click the appropriate BayesNet in the result list of the Explorer. A menu pops up, in which you select Visualize graph.
The Bayes network is automatically laid out and drawn thanks to a graph drawing algorithm implemented by Ashraf Kibriya.
When you hover the mouse over a node, the node lights up and all its children are highlighted as well, so that it is easy to identify the relation between nodes in crowded graphs.
Saving Bayes nets You can save the Bayes network to file in the graph visualizer. You have the choice to save as XML BIF format or as dot format. Select the floppy button and a file save dialog pops up that allows you to select the file name and file format.

Zoom The graph visualizer has two buttons to zoom in and out. Also, the exact zoom desired can be entered in the zoom percentage entry. Hit enter to redraw at the desired zoom level.
Graph drawing options Hit the extra controls button to show extra options that control the graph layout settings.
The Layout Type determines the algorithm applied to place the nodes.
The Layout Method determines in which direction nodes are considered.
The Edge Concentration toggle allows edges to be partially merged.
The Custom Node Size can be used to override the automatically determined node size.
When you click a node in the Bayesian net, a window with the probability table of the node clicked pops up. The left side shows the parent attributes and lists the values of the parents, the right side shows the probability of the node clicked conditioned on the values of the parents listed on the left.
So, the graph visualizer allows you to inspect both network structure and probability tables.
8.9 Bayes Network GUI

The Bayesian network editor is a stand-alone application with the following features:
• Edit Bayesian network completely by hand, with unlimited undo/redo stack, cut/copy/paste and layout support.
• Learn Bayesian network from data using learning algorithms in Weka.
• Edit structure by hand and learn conditional probability tables (CPTs) using learning algorithms in Weka.
• Generate dataset from Bayesian network.
• Inference (using junction tree method) of evidence through the network, interactively changing values of nodes.
• Viewing cliques in junction tree.
• Accelerator key support for most common operations.
The Bayes network GUI is started as
java weka.classifiers.bayes.net.GUI [bif file]
The following window pops up when an XML BIF file is specified (if none is specified an empty graph is shown).
Moving  a  node
Click  a   node   with  the   left   mouse   button  and  drag  the   node   to   the   desired
position.
Selecting  groups  of  nodes
Drag  the  left  mouse  button  in  the  graph  panel.   A  rectangle  is  shown  and  all
nodes  intersecting  with  the  rectangle  are  selected  when  the  mouse  is  released.
Selected nodes are made visible with four little black squares at the corners (see
screenshot  above).
The selection can be extended by keeping the shift key pressed while selecting
another  set  of  nodes.
The selection can be toggled by keeping the ctrl key pressed. All nodes in the rectangle that were already selected are de-selected, while the ones not in the selection but intersecting with the rectangle are added to the selection.
Groups  of  nodes  can  be  moved  by  keeping  the  left  mouse  pressed  on  one  of
the  selected  nodes  and  dragging the  group  to  the  desired  position.
File menu

The New, Save, Save As, and Exit menu items provide functionality as expected. The file format used is XML BIF [20].
There are two file formats supported for opening:
• .xml for XML BIF files. The Bayesian network is reconstructed from the information in the file. Node width information is not stored so the nodes are shown with the default width. This can be changed by laying out the graph (menu Tools/Layout).
• .arff Weka data files. When an arff file is selected, a new empty Bayesian network is created with nodes for each of the attributes in the arff file. Continuous variables are discretized using the weka.filters.supervised.attribute.Discretize filter (see note at end of this section for more details). The network structure can be specified and the CPTs learned using the Tools/Learn CPT menu.
The Print menu works (sometimes) as expected.
The Export menu allows for writing the graph panel to image (currently supported are bmp, jpg, png and eps formats). This can also be activated using the Alt-Shift-Left Click action in the graph panel.
Edit menu

Unlimited undo/redo support. Most edit operations on the Bayesian network are undoable. A notable exception is learning of network and CPTs.
Cut/copy/paste support. When a set of nodes is selected these can be placed on a clipboard (internal, so no interaction with other applications yet) and a paste action will add the nodes. Nodes are renamed by adding "Copy of" before the name and adding numbers if necessary to ensure uniqueness of name. Only the arrows to parents are copied, not those of the children.
The Add Node menu brings up a dialog (see below) that allows you to specify the name of the new node and the cardinality of the new node. Node values are assigned the names Value1, Value2 etc. These values can be renamed (right click the node in the graph panel and select Rename Value). Another option is to copy/paste a node with values that are already properly named and rename the node.
The Add Arc menu brings up a dialog to choose a child node first.
Then a dialog is shown to select a parent. Descendants of the child node, parents of the child node and the node itself are not listed, since these cannot be selected as parent: they would introduce cycles or already have an arc in the network.
The Delete Arc menu brings up a dialog with a list of all arcs that can be deleted.
The list of eight items at the bottom is active only when a group of at least two nodes is selected.
• Align Left/Right/Top/Bottom moves the nodes in the selection such that all nodes align to the utmost left, right, top or bottom node in the selection respectively.
• Center Horizontal/Vertical moves nodes in the selection halfway between the left and right most (or top and bottom most respectively).
• Space Horizontal/Vertical spaces out nodes in the selection evenly between the left and right most (or top and bottom most respectively). The order in which the nodes are selected impacts the place the node is moved to.
Tools  menu
The Generate Network menu allows generation of a complete random Bayesian
network.   It  brings  up  a  dialog  to  specify  the  number  of  nodes,  number  of  arcs,
cardinality  and  a  random  seed  to  generate  a  network.
The Generate Data menu allows for generating a data set from the Bayesian network in the editor. A dialog is shown to specify the number of instances to be generated, a random seed and the file to save the data set into. The file format is arff. When no file is selected (field left blank) no file is written and only the internal data set is set.
The Set Data menu sets the current data set. From this data set a new Bayesian network can be learned, or the CPTs of a network can be estimated. A file chooser dialog pops up to select the arff file containing the data.
The Learn Network and Learn CPT menus are only active when a data set is specified either through
• Tools/Set Data menu, or
• Tools/Generate Data menu, or
• File/Open menu when an arff file is selected.
The Learn Network action learns the whole Bayesian network from the data set. The learning algorithms can be selected from the set available in Weka by selecting the Options button in the dialog below. Learning a network clears the undo stack.
The Learn CPT menu does not change the structure of the Bayesian network, only the probability tables. Learning the CPTs clears the undo stack.
The Layout menu runs a graph layout algorithm on the network and tries to make the graph a bit more readable. When the menu item is selected, the node size can be specified, or left to be calculated by the algorithm based on the size of the labels by deselecting the custom node size check box.
The Show Margins menu item makes marginal distributions visible. These are calculated using the junction tree algorithm [23]. Marginal probabilities for nodes are shown in green next to the node. The value of a node can be set (right click node, set evidence, select a value) and the color is changed to red to indicate evidence is set for the node. Rounding errors may occur in the marginal probabilities.
The Show Cliques menu item makes the cliques visible that are used by the junction tree algorithm. Cliques are visualized using colored undirected edges. Both margins and cliques can be shown at the same time, but that makes for rather crowded graphs.
View  menu
The view menu allows for zooming in and out of the graph panel.   Also, it allows
for  hiding  or  showing  the  status  and  toolbars.
Help  menu
The  help  menu  points  to  this  document.
Toolbar

The toolbar allows a shortcut to many functions. Just hover the mouse over the toolbar buttons and a tooltip pops up that tells which function is activated. The toolbar can be shown or hidden with the View/View Toolbar menu.

Statusbar

At the bottom of the screen the statusbar shows messages. This can be helpful when an undo/redo action is performed that does not have any visible effects, such as edit actions on a CPT. The statusbar can be shown or hidden with the View/View Statusbar menu.
Click right mouse button

Clicking the right mouse button in the graph panel outside a node brings up the following popup menu. It allows you to add a node at the location that was clicked, or select a parent to add to all nodes in the selection. If no node is selected, or no node can be added as parent, this function is disabled.
Clicking the right mouse button on a node brings up a popup menu.
The popup menu shows the list of values that can be set as evidence for the selected node. This is only visible when margins are shown (menu Tools/Show margins). By selecting Clear, the value of the node is removed and the margins calculated based on CPTs again.
A node can be renamed by right click and select Rename in the popup menu. The following dialog appears that allows entering a new node name.
The CPT of a node can be edited manually by selecting a node, right click/Edit CPT. A dialog is shown with a table representing the CPT. When a value is edited, the values of the remainder of the table are updated in order to ensure that the probabilities add up to 1. It attempts to adjust the last column first, then goes backward from there.
The whole table can be filled with randomly generated distributions by selecting the Randomize button.
The popup menu shows the list of parents that can be added to the selected node. The CPT for the node is updated by making copies for each value of the new parent.
The popup menu shows the list of parents that can be deleted from the selected node. The CPT of the node keeps only the one conditioned on the first value of the parent node.
The popup menu shows the list of children that can be deleted from the selected node. The CPT of the child node keeps only the one conditioned on the first value of the parent node.
Selecting Add Value from the popup menu brings up this dialog, in which the name of the new value for the node can be specified. The distribution for the node assigns zero probability to the value. Child node CPTs are updated by copying distributions conditioned on the new value.
The popup menu shows the list of values that can be renamed for the selected node.
Selecting a value brings up the following dialog in which a new name can be specified.
The popup menu shows the list of values that can be deleted from the selected node. This is only active when there are more than two values for the node (single valued nodes do not make much sense). By selecting the value, the CPT of the node is updated in order to ensure that the CPT adds up to unity. The CPTs of children are updated by dropping the distributions conditioned on the value.
A note on CPT learning

Continuous variables are discretized by the Bayes network class. The discretization algorithm chooses its values based on the information in the data set. However, these values are not stored anywhere. So, reading an arff file with continuous variables using the File/Open menu allows one to specify a network, then learn the CPTs from it since the discretization bounds are still known. However, opening an arff file, specifying a structure, then closing the application, reopening and trying to learn the network from another file containing continuous variables may not give the desired result since the discretization algorithm is re-applied and new boundaries may have been found. Unexpected behavior may be the result.
Learning from a dataset that contains more attributes than there are nodes in the network is ok. The extra attributes are just ignored.
Learning from a dataset with differently ordered attributes is ok. Attributes are matched to nodes based on name. However, attribute values are matched with node values based on the order of the values.
The attributes in the dataset should have the same number of values as the corresponding nodes in the network (see above for continuous variables).
8.10 Bayesian nets in the experimenter

Bayesian networks generate extra measures that can be examined in the experimenter. The experimenter can then be used to calculate mean and variance for those measures.
The  following  metrics  are  generated:
• measureExtraArcs: extra arcs compared to reference network. The network must be provided as BIFFile to the BayesNet class. If no such network is provided, this value is zero.
• measureMissingArcs: missing arcs compared to reference network or zero if not provided.
• measureReversedArcs: reversed arcs compared to reference network or zero if not provided.
• measureDivergence: divergence of network learned compared to reference network or zero if not provided.
• measureBayesScore: log of the K2 score of the network structure.
• measureBDeuScore: log of the BDeu score of the network structure.
• measureMDLScore: log of the MDL score.
• measureAICScore: log of the AIC score.
• measureEntropyScore: log of the entropy.
8.11 Adding your own Bayesian network learners

You can add your own structure learners and estimators.
Adding a new structure learner

Here is the quick guide for adding a structure learner:
1. Create a class that derives from weka.classifiers.bayes.net.search.SearchAlgorithm. If your searcher is score based, conditional independence based or cross-validation based, you probably want to derive from ScoreSearchAlgorithm, CISearchAlgorithm or CVSearchAlgorithm instead of deriving from SearchAlgorithm directly. Let's say it is called weka.classifiers.bayes.net.search.local.MySearcher, derived from ScoreSearchAlgorithm.
2. Implement the method
public void buildStructure(BayesNet bayesNet, Instances instances).
Essentially, you are responsible for setting the parent sets in bayesNet. You can access the parent sets using bayesNet.getParentSet(iAttribute) where iAttribute is the number of the node/variable.
To add a parent iParent to node iAttribute, use
bayesNet.getParentSet(iAttribute).AddParent(iParent, instances)
where instances need to be passed for the parent set to derive properties of the attribute.
Alternatively, implement public void search(BayesNet bayesNet, Instances instances). The implementation of buildStructure in the base class will call search after initializing parent sets and, if the initAsNaiveBayes flag is set, it will start with a naive Bayes network structure. After calling search in your custom class, it will add arrows, if the markovBlanketClassifier flag is set, to ensure all attributes are in the Markov blanket of the class node (a sketch of such a class follows this list).
3. If the structure learner has options that are not default options, you want to implement public Enumeration listOptions(), public void setOptions(String[] options), public String[] getOptions() and the get and set methods for the properties you want to be able to set.
NB 1. Do not use the -E option, since that is reserved for the BayesNet class to distinguish the extra options for the SearchAlgorithm class and the Estimator class. If the -E option is used, it will not be passed to your SearchAlgorithm (and probably causes problems in the BayesNet class).
NB 2. Make sure to process options of the parent class, if any, in the get/setOptions methods.
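As referenced in step 2, here is a hedged skeleton of such a searcher. MySearcher is the hypothetical name used above; the body simply makes the class node a parent of every attribute, purely to illustrate the parent set API described in step 2 (the AddParent call is written here in lower camel case; check spelling, visibility and the exact base class against your Weka version):

package weka.classifiers.bayes.net.search.local;

import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;

public class MySearcher extends ScoreSearchAlgorithm {
    // Called by buildStructure in the base class after parent sets are
    // initialized (and after naive Bayes initialization, if selected).
    public void search(BayesNet bayesNet, Instances instances) throws Exception {
        int iClass = instances.classIndex();
        for (int iAttribute = 0; iAttribute < instances.numAttributes(); iAttribute++) {
            if (iAttribute != iClass) {
                // Illustration only: add the class as parent of every other node.
                bayesNet.getParentSet(iAttribute).addParent(iClass, instances);
            }
        }
    }
}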
Adding  a  new  estimator
This  is  the  quick  guide  for  adding  a  new  estimator:
1.   Create  a  class  that  derives  from
weka.classifiers.bayes.net.estimate.BayesNetEstimator. Let's say
it  is  called
weka.classifiers.bayes.net.estimate.MyEstimator.
2.  Implement the methods
    public void initCPTs(BayesNet bayesNet),
    public void estimateCPTs(BayesNet bayesNet),
    public void updateClassifier(BayesNet bayesNet, Instance instance), and
    public double[] distributionForInstance(BayesNet bayesNet, Instance instance).
    A skeleton showing these four methods follows after this list.
3.  If the estimator has options that are not default options, you want to
    implement public Enumeration listOptions(), public void
    setOptions(String[] options), public String[] getOptions() and the get
    and set methods for the properties you want to be able to set.
    NB. Do not use the -E option, since that is reserved for the BayesNet
    class to distinguish the extra options for the SearchAlgorithm class and
    the Estimator class. If the -E option is used and no extra arguments are
    passed to the SearchAlgorithm, the extra options meant for your Estimator
    will be passed to the SearchAlgorithm instead. In short, do not use the
    -E option.
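Here is a minimal skeleton of such an estimator; the method bodies are
placeholders only (the returned distribution is simply uniform), so this is a
sketch of the required signatures rather than a working estimator:

import weka.classifiers.bayes.BayesNet;
import weka.classifiers.bayes.net.estimate.BayesNetEstimator;
import weka.core.Instance;

public class MyEstimator extends BayesNetEstimator {
  public void initCPTs(BayesNet bayesNet) throws Exception {
    // allocate one conditional probability table per node here
  }
  public void estimateCPTs(BayesNet bayesNet) throws Exception {
    // fill the tables from the training data held by bayesNet here
  }
  public void updateClassifier(BayesNet bayesNet, Instance instance)
    throws Exception {
    // update the tables with a single new instance here
  }
  public double[] distributionForInstance(BayesNet bayesNet, Instance instance)
    throws Exception {
    // placeholder: a uniform class distribution
    double[] dist = new double[instance.classAttribute().numValues()];
    for (int i = 0; i < dist.length; i++)
      dist[i] = 1.0 / dist.length;
    return dist;
  }
}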
8.12   FAQ
How  do  I  use  a  data  set  with  continuous  variables  with  the
BayesNet  classes?
Use the class weka.filters.unsupervised.attribute.Discretize to discretize
them. From the command line, you can use
java   weka.filters.unsupervised.attribute.Discretize  -B   3   -i   infile.arff
-o   outfile.arff
where  the  -B  option  determines  the  cardinality  of  the  discretized  variables.
How  do   I   use   a   data   set   with   missing   values   with   the
BayesNet  classes?
You would have to delete the entries with missing values or fill in dummy values.
How  do  I  create  a  random  Bayes  net  structure?
Running  from  the  command  line
java   weka.classifiers.bayes.net.BayesNetGenerator  -B   -N   10   -A   9   -C
2
will   print  a  Bayes  net  with  10  nodes,   9  arcs  and  binary  variables  in  XML  BIF
format  to  standard  output.
How do I create an artificial data set using a random Bayes
net?
Running
java   weka.classifiers.bayes.net.BayesNetGenerator  -N   15   -A   20   -C   3
-M   300
will generate a data set in ARFF format with 300 instances from a random
network with 15 ternary variables and 20 arrows.
How do I create an artificial data set using a Bayes net I
have on file?
Running
java   weka.classifiers.bayes.net.BayesNetGenerator  -F   alarm.xml  -M   1000
will generate a data set with 1000 instances from the network stored in the
file alarm.xml.
How  do  I  save  a  Bayes  net  in  BIF  format?
   GUI:  In  the  Explorer
  learn  the  network  structure,
  right  click  the  relevant  run  in  the  result  list,
  choose Visualize graph in the pop-up menu,
  click the floppy button in the Graph Visualizer window,
  a file save as dialog pops up that allows you to select the file name
to save to.
   Java: Create a BayesNet and call BayesNet.toXMLBIF03(), which returns
the Bayes network in BIF format as a String (see the sketch after this list).
   Command line: use the -g option and redirect the output on stdout
into a file.
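For the Java route, a minimal sketch (the file names are placeholders):

import weka.classifiers.bayes.BayesNet;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import java.io.FileWriter;
...
Instances data = DataSource.read("train.arff");
data.setClassIndex(data.numAttributes() - 1);
BayesNet net = new BayesNet();
net.buildClassifier(data);
// write the learned network in XML BIF format
FileWriter writer = new FileWriter("net.xml");
writer.write(net.toXMLBIF03());
writer.close();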
How  do  I   compare   a  network  I   learned  with  one   in  BIF
format?
Specify the -B <bif-file> option to BayesNet. Calling toString() will produce
a summary of extra, missing and reversed arrows. Also the divergence between
the network learned and the one on file is reported.
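For example, from the command line (assuming the standard -t option for the
training file; the file names are placeholders):

java weka.classifiers.bayes.BayesNet -t train.arff -B reference.xml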
How  do  I  use  the  network  I  learned  for  general  inference?
There is no general purpose inference in Weka, but you can export the network
as an XML BIF file (see above) and import it in other packages, for example
JavaBayes, available under the GPL from http://www.cs.cmu.edu/~javabayes.
8.13   Future  development
If   you  would  like  to  add  to  the  current  Bayes  network  facilities  in  Weka,   you
might  consider  one  of  the  following  possibilities.
   Implement  more  search  algorithms,  in  particular,
  general   purpose  search  algorithms  (such  as  an  improved  implemen-
tation  of  genetic  search).
  structure  search  based  on  equivalent  model  classes.
  implement  those  algorithms  both  for  local   and  global   metric  based
search  algorithms.
  implement  more  conditional  independence  based  search  algorithms.
   Implement score metrics that can handle sparse instances in order to allow
for  processing  large  datasets.
   Implement  traditional   conditional   independence  tests  for  conditional   in-
dependence  based  structure  learning  algorithms.
   Currently,   all   search  algorithms   assume   that   all   variables   are   discrete.
Search algorithms that can handle continuous variables would be interest-
ing.
   A limitation of the current classes is that they assume that there are no
missing values. This limitation can be lifted by implementing score
metrics that can handle missing values. The classes used for estimating
the conditional probabilities need to be updated as well.
   Only leave-one-out, k-fold and cumulative cross-validation are implemented.
These implementations can be made more efficient and other cross-validation
methods can be implemented, such as Monte Carlo cross-validation and
bootstrap cross-validation.
   Implement  methods  that  can  handle  incremental   extensions  of   the  data
set  for  updating  network  structures.
And  for  the  more  ambitious  people,  there  are  the  following  challenges.
   A GUI for manipulating Bayesian networks to allow user intervention for
adding and deleting arcs and updating the probability tables.
   General purpose inference algorithms built into the GUI to allow
user-defined queries.
   Allow learning of other graphical models, such as chain graphs, undirected
graphs  and  variants  of  causal  graphs.
   Allow  learning  of  networks  with  latent  variables.
   Allow learning of dynamic Bayesian networks so that time  series data can
be  handled.
Part  III
Data
Chapter  9
ARFF
An ARFF (= Attribute-Relation File Format) file is an ASCII text file that
describes a list of instances sharing a set of attributes.
9.1   Overview
ARFF files have two distinct sections. The first section is the Header
information, which is followed by the Data information.
The Header of the ARFF file contains the name of the relation, a list of
the attributes (the columns in the data), and their types. An example header
on the standard IRIS dataset looks like this:
%   1.   Title:   Iris   Plants   Database
%
%   2.   Sources:
%   (a)   Creator:  R.A.   Fisher
%   (b)   Donor:   Michael   Marshall  (MARSHALL%PLU@io.arc.nasa.gov)
%   (c)   Date:   July,   1988
%
@RELATION  iris
@ATTRIBUTE  sepallength   NUMERIC
@ATTRIBUTE  sepalwidth   NUMERIC
@ATTRIBUTE  petallength   NUMERIC
@ATTRIBUTE  petalwidth   NUMERIC
@ATTRIBUTE  class   {Iris-setosa,Iris-versicolor,Iris-virginica}
The Data of the ARFF file looks like the following:
@DATA
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa
4.9,3.1,1.5,0.1,Iris-setosa
Lines  that  begin  with  a  %  are  comments.   The  @RELATION,  @ATTRIBUTE and
@DATA declarations  are  case  insensitive.
9.2   Examples
Several well-known machine learning datasets are distributed with Weka in the
$WEKAHOME/data directory as ARFF files.
9.2.1   The  ARFF  Header  Section
The ARFF Header section of the file contains the relation declaration and
attribute declarations.
The  @relation  Declaration
The relation name is defined as the first line in the ARFF file. The format is:
@relation  <relation-name>
where <relation-name> is a string. The string must be quoted if the name
includes spaces.
The  @attribute  Declarations
Attribute declarations take the form of an ordered sequence of @attribute
statements. Each attribute in the data set has its own @attribute statement
which uniquely defines the name of that attribute and its data type. The
order in which the attributes are declared indicates the column position in
the data section of the file. For example, if an attribute is the third one
declared, then Weka expects that all that attribute's values will be found
in the third comma-delimited column.
The format for the @attribute statement is:
@attribute  <attribute-name>  <datatype>
where the <attribute-name> must start with an alphabetic character. If
spaces are to be included in the name, then the entire name must be quoted.
The <datatype> can be any of the following types supported by Weka:
   numeric
   integer  is  treated  as  numeric
   real   is  treated  as  numeric
   <nominal-specification>
   string
   date  [<date-format>]
   relational   for  multi-instance  data  (for  future  use)
where <nominal-specification> and <date-format> are defined below. The
keywords numeric, real, integer, string and date are case insensitive.
Numeric  attributes
Numeric  attributes  can  be  real  or  integer  numbers.
Nominal  attributes
Nominal values are defined by providing a <nominal-specification> listing
the possible values: {<nominal-name1>, <nominal-name2>, <nominal-name3>, ...}
For example, the class value of the Iris dataset can be defined as follows:
@ATTRIBUTE  class   {Iris-setosa,Iris-versicolor,Iris-virginica}
Values  that  contain  spaces  must  be  quoted.
String  attributes
String attributes allow us to create attributes containing arbitrary textual
values. This is very useful in text-mining applications, as we can create
datasets with string attributes, then write Weka filters to manipulate strings
(like the StringToWordVector filter). String attributes are declared as follows:
@ATTRIBUTE  LCC   string
Date  attributes
Date  attribute  declarations  take  the  form:
@attribute  <name>   date   [<date-format>]
where <name> is the name for the attribute and <date-format> is an optional
string specifying how date values should be parsed and printed (this is the
same format used by SimpleDateFormat). The default format string accepts the
ISO-8601 combined date and time format: yyyy-MM-dd'T'HH:mm:ss.
Dates must be specified in the data section as the corresponding string
representations of the date/time (see example below).
Relational  attributes
Relational  attribute  declarations  take  the  form:
@attribute  <name>   relational
<further  attribute  definitions>
@end   <name>
For the multi-instance dataset MUSK1 the definition would look like this (...
denotes an omission):
@attribute  molecule_name  {MUSK-jf78,...,NON-MUSK-199}
@attribute  bag   relational
@attribute  f1   numeric
...
@attribute  f166   numeric
@end   bag
@attribute  class   {0,1}
...
9.2.2   The  ARFF  Data  Section
The ARFF Data section of the file contains the data declaration line and the
actual instance lines.
The  @data  Declaration
The @data declaration is a single line denoting the start of the data segment
in the file. The format is:
@data
The  instance  data
Each instance is represented on a single line, with carriage returns denoting
the end of the instance. A percent sign (%) introduces a comment, which
continues to the end of the line.
Attribute values for each instance are delimited by commas. They must appear
in the order that they were declared in the header section (i.e., the data
corresponding to the nth @attribute declaration is always the nth field of
the instance).
Missing  values  are  represented  by  a  single  question  mark,  as  in:
@data
4.4,?,1.5,?,Iris-setosa
Values of string and nominal attributes are case sensitive, and any that
contain a space or the comment-delimiter character % must be quoted. (The
code suggests that double-quotes are acceptable and that a backslash will
escape individual characters.) An example follows:
@relation  LCCvsLCSH
@attribute  LCC   string
@attribute  LCSH   string
@data
AG5,   Encyclopedias  and   dictionaries.;Twentieth  century.
AS262,   Science  --   Soviet   Union   --   History.
AE5,   Encyclopedias  and   dictionaries.
AS281,   Astronomy,  Assyro-Babylonian.;Moon  --   Phases.
AS281,   Astronomy,  Assyro-Babylonian.;Moon  --   Tables.
Dates must be specified in the data section using the string representation
specified in the attribute declaration. For example:
@RELATION  Timestamps
@ATTRIBUTE  timestamp  DATE   "yyyy-MM-dd  HH:mm:ss"
@DATA
"2001-04-03  12:12:12"
"2001-05-03  12:59:55"
Relational data must be enclosed within double quotes. For example, an
instance of the MUSK1 dataset (... denotes an omission):
MUSK-188,"42,...,30",1
9.3   Sparse ARFF files
Sparse ARFF files are very similar to ARFF files, but data with value 0 are
not explicitly represented.
Sparse ARFF files have the same header (i.e., @relation and @attribute
tags), but the data section is different. Instead of representing each value
in order, like this:
@data
0,   X,   0,   Y,   "class   A"
0,   0,   W,   0,   "class   B"
the non-zero attributes are explicitly identified by attribute number and
their value stated, like this:
@data
{1   X,   3   Y,   4   "class   A"}
{2   W,   4   "class   B"}
Each  instance  is  surrounded  by  curly  braces,  and  the  format  for  each  entry  is:
<index>  <space>  <value>  where  index  is  the  attribute  index  (starting  from
0).
Note that the omitted values in a sparse instance are 0; they are not missing
values! If a value is unknown, you must explicitly represent it with a
question mark (?).
Warning: There is a known problem saving SparseInstance objects from
datasets that have string attributes. In Weka, string and nominal data values
are stored as numbers; these numbers act as indexes into an array of possible
attribute values (this is very efficient). However, the first string value is
assigned index 0: this means that, internally, this value is stored as a 0.
When a SparseInstance is written, string instances with internal value 0 are
not output, so their string value is lost (and when the ARFF file is read
again, the default value 0 is the index of a different string value, so the
attribute value appears to change). To get around this problem, add a dummy
string value at index 0 that is never used whenever you declare string
attributes that are likely to be used in SparseInstance objects and saved as
Sparse ARFF files.
9.4   Instance weights in ARFF files
A weight can be associated with an instance in a standard ARFF file by
appending it to the end of the line for that instance and enclosing the
value in curly braces. E.g.:
@data
0,   X,   0,   Y,   "class   A",   {5}
For  a  sparse  instance,  this  example  would  look  like:
@data
{1   X,   3   Y,   4   "class   A"},   {5}
Note that any instance without a weight value specified is assumed to have a
weight of 1 for backwards compatibility.
Chapter  10
XRFF
The XRFF (XML attribute Relation File Format) is a format for representing
data in XML that is able to store comments, attribute and instance weights.
10.1   File  extensions
The following file extensions are recognized as XRFF files:
   .xrff
the default extension of XRFF files
   .xrff.gz
the extension for gzip-compressed XRFF files (see the Compression section
for more details)
10.2   Comparison
10.2.1   ARFF
In the following, a snippet of the UCI dataset iris in ARFF format:
@relation  iris
@attribute  sepallength  numeric
@attribute  sepalwidth  numeric
@attribute  petallength  numeric
@attribute  petalwidth  numeric
@attribute  class   {Iris-setosa,Iris-versicolor,Iris-virginica}
@data
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3,1.4,0.2,Iris-setosa
...
10.2.2   XRFF
And the same dataset represented as an XRFF file:
<?xml   version="1.0"  encoding="utf-8"?>
<!DOCTYPE  dataset
[
<!ELEMENT  dataset  (header,body)>
<!ATTLIST  dataset  name   CDATA   #REQUIRED>
<!ATTLIST  dataset  version   CDATA   "3.5.4">
<!ELEMENT  header   (notes?,attributes)>
<!ELEMENT  body   (instances)>
<!ELEMENT  notes   ANY>
<!ELEMENT  attributes  (attribute+)>
<!ELEMENT  attribute  (labels?,metadata?,attributes?)>
<!ATTLIST  attribute  name   CDATA   #REQUIRED>
<!ATTLIST  attribute  type   (numeric|date|nominal|string|relational)  #REQUIRED>
<!ATTLIST  attribute  format   CDATA   #IMPLIED>
<!ATTLIST  attribute  class   (yes|no)  "no">
<!ELEMENT  labels   (label*)>
<!ELEMENT  label   ANY>
<!ELEMENT  metadata  (property*)>
<!ELEMENT  property  ANY>
<!ATTLIST  property  name   CDATA   #REQUIRED>
<!ELEMENT  instances  (instance*)>
<!ELEMENT  instance  (value*)>
<!ATTLIST  instance  type   (normal|sparse)  "normal">
<!ATTLIST  instance  weight   CDATA   #IMPLIED>
<!ELEMENT  value   (#PCDATA|instances)*>
<!ATTLIST  value   index   CDATA   #IMPLIED>
<!ATTLIST  value   missing   (yes|no)  "no">
]
>
<dataset  name="iris"  version="3.5.3">
<header>
<attributes>
<attribute  name="sepallength"  type="numeric"/>
<attribute  name="sepalwidth"  type="numeric"/>
<attribute  name="petallength"  type="numeric"/>
<attribute  name="petalwidth"  type="numeric"/>
<attribute  class="yes"  name="class"  type="nominal">
<labels>
<label>Iris-setosa</label>
<label>Iris-versicolor</label>
<label>Iris-virginica</label>
</labels>
</attribute>
</attributes>
</header>
<body>
<instances>
<instance>
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
<instance>
<value>4.9</value>
<value>3</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
...
</instances>
</body>
</dataset>
10.3   Sparse  format
The XRFF format also supports a sparse data representation.   Even though the
iris  dataset  does  not  contain  sparse  data,  the  above  example  will   be  used  here
to  illustrate  the  sparse  format:
...
<instances>
<instance  type="sparse">
<value   index="1">5.1</value>
<value   index="2">3.5</value>
<value   index="3">1.4</value>
<value   index="4">0.2</value>
<value   index="5">Iris-setosa</value>
</instance>
<instance  type="sparse">
<value   index="1">4.9</value>
<value   index="2">3</value>
<value   index="3">1.4</value>
<value   index="4">0.2</value>
<value   index="5">Iris-setosa</value>
</instance>
...
</instances>
...
In contrast to the normal   data format, each sparse instance tag contains a type
attribute  with  the  value  sparse:
<instance  type="sparse">
And  each  value  tag  needs   to  specify  the  index   attribute,   which  contains   the
1-based  index  of  this  value.
<value   index="1">5.1</value>
10.4   Compression
Since the XML representation takes up considerably more space than the rather
compact ARFF format, one can also compress the data via gzip. Weka
automatically recognizes a file as being gzip-compressed if the file's
extension is .xrff.gz instead of .xrff.
The Weka Explorer, Experimenter and command line allow one to load/save
compressed and uncompressed XRFF files (this applies also to ARFF files).
10.5   Useful   features
In  addition  to  all  the  features  of  the  ARFF  format,  the  XRFF  format  contains
the  following  additional  features:
   class attribute specification
   attribute  weights
10.5.1   Class attribute specification
Via the class="yes" attribute in the attribute specification in the header,
one can define which attribute should act as the class attribute. This
feature can be used on the command line as well as in the Experimenter, which
can now also load other data formats; it removes the limitation of the class
attribute always having to be the last one.
Snippet  from  the  iris  dataset:
<attribute  class="yes"  name="class"  type="nominal">
10.5.2   Attribute  weights
Attribute weights are stored in an attribute's metadata tag (in the header
section). Here is an example of the petalwidth attribute with a weight of 0.9:
<attribute  name="petalwidth"  type="numeric">
<metadata>
<property  name="weight">0.9</property>
</metadata>
</attribute>
10.5.3   Instance  weights
Instance weights are defined via the weight attribute in each instance tag.
By default, the weight is 1. Here is an example:
<instance  weight="0.75">
<value>5.1</value>
<value>3.5</value>
<value>1.4</value>
<value>0.2</value>
<value>Iris-setosa</value>
</instance>
Chapter  11
Converters
11.1   Introduction
Weka offers conversion utilities for several formats, in order to allow
import from different sorts of data sources. These utilities, called
converters, are all located in the following package:
weka.core.converters
For each kind of converter you will find two classes:
   one for loading (classname ends with Loader) and
   one for saving (classname ends with Saver).
Weka contains converters for the following data sources:
   ARFF files (ArffLoader, ArffSaver)
   C4.5 files (C45Loader, C45Saver)
   CSV files (CSVLoader, CSVSaver)
   files containing serialized instances (SerializedInstancesLoader,
SerializedInstancesSaver)
   JDBC databases (DatabaseLoader, DatabaseSaver)
   libsvm files (LibSVMLoader, LibSVMSaver)
   XRFF files (XRFFLoader, XRFFSaver)
   text directories for text mining (TextDirectoryLoader)
11.2   Usage
11.2.1   File  converters
File  converters can  be  used  as  follows:
   Loader
They take one argument, which is the file that should be converted, and
print the result to stdout. You can also redirect the output into a file:
java <classname> <input-file> > <output-file>
Here's an example for loading the CSV file iris.csv and saving it as
iris.arff:
java weka.core.converters.CSVLoader iris.csv > iris.arff
   Saver
For a Saver you specify the ARFF input file via -i and the output file in
the specific format with -o:
java <classname> -i <input> -o <output>
Here's an example for saving an ARFF file to CSV:
java weka.core.converters.CSVSaver -i iris.arff -o iris.csv
A few notes:
   Using the ArffSaver from the commandline doesn't make much sense, since
this Saver takes an ARFF file as input and output. The ArffSaver is
normally used from Java for saving an object of weka.core.Instances
to a file, as the sketch after these notes illustrates.
   The C45Loader either takes the .names file or the .data file as input;
it automatically looks for the other one.
   For the C45Saver one specifies as output file a filename without any
extension, since two output files will be generated; .names and .data
are automatically appended.
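Here is a minimal sketch of that use case, converting a CSV file to ARFF from
Java (the file names are placeholders):

import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.CSVLoader;
import java.io.File;
...
// load the CSV file into memory
CSVLoader loader = new CSVLoader();
loader.setSource(new File("iris.csv"));
Instances data = loader.getDataSet();
// save the dataset as ARFF
ArffSaver saver = new ArffSaver();
saver.setInstances(data);
saver.setFile(new File("iris.arff"));
saver.writeBatch();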
11.2.2   Database  converters
The database converters are a bit more complex, since they also rely on
additional configuration files, besides the parameters on the commandline.
The setup for the database connection is stored in the following props file:
DatabaseUtils.props
The default file can be found here:
weka/experiment/DatabaseUtils.props
   Loader
You have to specify at least a SQL query with the -Q option (there are
additional options for incremental loading):
java   weka.core.converters.DatabaseLoader  -Q   "select   *   from   employee"
   Saver
The Saver takes an ARFF file as input like any other Saver, but it also
needs the table to save the data to, specified via -T:
java   weka.core.converters.DatabaseSaver  -i   iris.arff  -T   iris
Chapter  12
Stemmers
12.1   Introduction
Weka now supports stemming algorithms.   The stemming algorithms are located
in  the  following  package:
weka.core.stemmers
Currently,  the  Lovins  Stemmer  (+  iterated  version)  and  support  for  the  Snow-
ball  stemmers  are  included.
12.2   Snowball   stemmers
Weka contains a wrapper class for the Snowball stemmers (homepage:
http://snowball.tartarus.org/), containing the Porter stemmer and several
other stemmers for different languages. The relevant class is
weka.core.stemmers.SnowballStemmer.
The Snowball classes are not included; they only have to be present in the
classpath. The reason for this is that the Weka team does not have to watch
out for new versions of the stemmers and update them.
There  are  two  ways  of  getting  hold  of  the  Snowball  stemmers:
1.  You can add the following pre-compiled jar archive to your classpath and
    you're set (based on source code from 2005-10-19, compiled 2005-10-22):
    http://www.cs.waikato.ac.nz/~ml/weka/stemmers/snowball.jar
2.  You can compile the stemmers yourself with the newest sources. Just
    download the following ZIP file, unpack it and follow the instructions
    in the README file (the zip contains an ANT (http://ant.apache.org/)
    build script for generating the jar archive):
    http://www.cs.waikato.ac.nz/~ml/weka/stemmers/snowball.zip
    Note: the patch target is specific to the source code from 2005-10-19.
12.3   Using  stemmers
The stemmers can be used either
   from the commandline, or
   within the StringToWordVector filter (package
weka.filters.unsupervised.attribute).
12.3.1   Commandline
All  stemmers  support  the  following  options:
   -h
for  displaying  a  brief  help
   -i <input-file>
The file to process
   -o <output-file>
The file to output the processed data to (default stdout)
   -l
Uses  lowercase  strings,  i.e.,   the  input  is  automatically  converted  to  lower
case
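For example, the following call runs the built-in Lovins stemmer over a text
file, lowercasing the input first (the file names are placeholders):

java weka.core.stemmers.LovinsStemmer -i input.txt -o output.txt -l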
12.3.2   StringToWordVector
Just use the GenericObjectEditor to choose the right stemmer and the desired
options (if the stemmer offers additional options).
12.4   Adding  new  stemmers
You can easily add new stemmers, if you follow these guidelines (for use in
the GenericObjectEditor):
   they should be located in the weka.core.stemmers package (if not, then
the GenericObjectEditor.props/GenericPropertiesCreator.props file
needs to be updated) and
   they must implement the interface weka.core.stemmers.Stemmer.
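A minimal sketch of such a stemmer, one that merely lowercases each word,
could look as follows (the getRevision() method is assumed here because the
Stemmer interface extends RevisionHandler in recent Weka versions; check the
Javadoc of weka.core.stemmers.Stemmer for the exact requirements):

package weka.core.stemmers;

public class MyStemmer implements Stemmer {
  // a trivial "stemmer" that only lowercases the word
  public String stem(String word) {
    return word.toLowerCase();
  }
  public String getRevision() {
    return "1.0";
  }
}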
Chapter  13
Databases
13.1   Configuration files
Thanks to JDBC it is easy to connect to databases that provide a JDBC
driver. Responsible for the setup is the following properties file, located
in the weka.experiment package:
DatabaseUtils.props
You can get this properties file from the weka.jar or weka-src.jar archive,
both part of a normal Weka release. If you open up one of those files,
you'll find the properties file in the sub-folder weka/experiment.
Weka comes with example files for a wide range of databases:
   DatabaseUtils.props.hsql -  HSQLDB  (>=3.4.1)
   DatabaseUtils.props.msaccess -  MS  Access  (>3.4.14,  >3.5.8,  >3.6.0)
see  the  Windows  databases  chapter  for  more  information.
   DatabaseUtils.props.mssqlserver  -   MS  SQL  Server   2000  (>=3.4.9,
>=3.5.4)
   DatabaseUtils.props.mssqlserver2005- MS SQL Server 2005 (>=3.4.11,
>=3.5.6)
   DatabaseUtils.props.mysql -  MySQL  (>=3.4.9,  >=3.5.4)
   DatabaseUtils.props.odbc - ODBC access via Sun's ODBC/JDBC bridge,
e.g., for MS SQL Server (>=3.4.9, >=3.5.4)
see  the  Windows  databases  chapter  for  more  information.
   DatabaseUtils.props.oracle -  Oracle  10g  (>=3.4.9, >=3.5.4)
   DatabaseUtils.props.postgresql- PostgreSQL 7.4 (>=3.4.9, >=3.5.4)
   DatabaseUtils.props.sqlite3 -  sqlite  3.x  (>3.4.12,  >3.5.7)
The easiest way is just to place the extracted properties file into your
HOME directory. For more information on how property files are processed,
check out the following URL:
http://weka.wikispaces.com/Properties+File
Note: Weka only looks for the DatabaseUtils.props file. If you take one of
the example files listed above, you need to rename it first.
13.2   Setup
Under normal circumstances you only have to edit the following two properties:
   jdbcDriver
   jdbcURL
Driver
jdbcDriver is  the  classname  of  the  JDBC  driver,  necessary to  connect  to  your
database,  e.g.:
   HSQLDB
org.hsqldb.jdbcDriver
   MS  SQL  Server  2000 (Desktop  Edition)
com.microsoft.jdbc.sqlserver.SQLServerDriver
   MS  SQL  Server  2005
com.microsoft.sqlserver.jdbc.SQLServerDriver
   MySQL
org.gjt.mm.mysql.Driver (or  com.mysql.jdbc.Driver)
   ODBC - part of Sun's JDKs/JREs, no external driver necessary
sun.jdbc.odbc.JdbcOdbcDriver
   Oracle
oracle.jdbc.driver.OracleDriver
   PostgreSQL
org.postgresql.Driver
   sqlite  3.x
org.sqlite.JDBC
URL
jdbcURL specifies the JDBC URL pointing to your database (this can still be
changed in the Experimenter/Explorer), e.g., for the database MyDatabase on
the server server.my.domain:
   HSQLDB
jdbc:hsqldb:hsql://server.my.domain/MyDatabase
   MS  SQL  Server  2000  (Desktop  Edition)
jdbc:microsoft:sqlserver://server.my.domain:1433
(Note: if you add ;databasename=db-name you can connect to a different
database than the default one, e.g., MyDatabase)
   MS  SQL  Server  2005
jdbc:sqlserver://server.my.domain:1433
   MySQL
jdbc:mysql://server.my.domain:3306/MyDatabase
   ODBC
jdbc:odbc:DSN  name (replace DSN  name  with the DSN that you want to
use)
   Oracle  (thin  driver)
jdbc:oracle:thin:@server.my.domain:1526:orcl
(Note:   @machineName:port:SID)
for  the  Express  Edition  you  can  use
jdbc:oracle:thin:@server.my.domain:1521:XE
   PostgreSQL
jdbc:postgresql://server.my.domain:5432/MyDatabase
You  can  also  specify  user  and  password directly  in  the  URL:
jdbc:postgresql://server.my.domain:5432/MyDatabase?user=<...>&password=<...>
where  you  have  to  replace  the  <...>  with  the  correct  values
   sqlite  3.x
jdbc:sqlite:/path/to/database.db
(you can access only local files)
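Putting the two properties together, a minimal DatabaseUtils.props for, e.g.,
a MySQL database would contain the following two lines (server and database
names are placeholders):

jdbcDriver=com.mysql.jdbc.Driver
jdbcURL=jdbc:mysql://server.my.domain:3306/MyDatabase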
13.3   Missing  Datatypes
Sometimes (e.g., with MySQL) it can happen that a column type cannot be
interpreted. In that case it is necessary to map the name of the column type
to the Java type it should be interpreted as. E.g., the MySQL type TEXT is
returned as BLOB from the JDBC driver and has to be mapped to String (0
represents String - the mappings can be found in the comments of the
properties file):
Java type   Java method    Identifier   Weka attribute type
String      getString()    0            nominal
boolean     getBoolean()   1            nominal
double      getDouble()    2            numeric
byte        getByte()      3            numeric
short       getShort()     4            numeric
int         getInteger()   5            numeric
long        getLong()      6            numeric
float       getFloat()     7            numeric
date        getDate()      8            date
text        getString()    9            string
time        getTime()      10           date
In the props file one now lists the type names that the database returns and
what Java type each represents (via the identifier), e.g.:
CHAR=0
VARCHAR=0
CHAR and VARCHAR are both String types, hence they are interpreted as String
(identifier 0).
Note: in case database types have blanks, one needs to replace those blanks
with an underscore, e.g., DOUBLE PRECISION must be listed like this:
DOUBLE_PRECISION=2
13.4   Stored  Procedures
Let's say you're tired of typing the same query over and over again. A good
way to shorten that is to create a stored procedure.
PostgreSQL  7.4.x
The following example creates a procedure called employee_name that returns
the names of all the employees in table employee. Even though it doesn't
make much sense to create a stored procedure for this query, nonetheless, it
shows how to create and call stored procedures in PostgreSQL.
   Create
CREATE OR REPLACE FUNCTION public.employee_name()
RETURNS SETOF text AS 'select name from employee'
LANGUAGE sql VOLATILE;
   SQL  statement  to  call  procedure
SELECT   *   FROM   employee_name()
   Retrieve  data  via  InstanceQuery
java   weka.experiment.InstanceQuery
-Q   "SELECT   *   FROM   employee_name()"
-U   <user>   -P   <password>
13.5   Troubleshooting
   In case you're experiencing problems connecting to your database, check
out the WEKA Mailing List (see the Weka homepage for more information).
It is possible that somebody else encountered the same problem as you
and you'll find a post containing the solution to your problem.
   Specic  MS  SQL  Server  2000  Troubleshooting
  Error Establishing  Socket  with  JDBC  Driver
Add TCP/IP to the list of protocols as stated in the following article:
http://support.microsoft.com/default.aspx?scid=kb;en-us;313178
  Login failed for user sa.   Reason:   Not associated with a trusted SQL
Server  connection.
For   changing  the   authentication  to  mixed  mode   see   the   following
article:
http://support.microsoft.com/kb/319930/en-us
   MS SQL Server 2005: TCP/IP is not enabled for SQL Server, or the
server or port number specified is incorrect. Verify that SQL Server is
listening with TCP/IP on the specified server and port. This might be
reported with an exception similar to: The login has failed. The TCP/IP
connection to the host has failed. This indicates one of the following:
  SQL Server is installed but TCP/IP has not been installed as a
network protocol for SQL Server by using the SQL Server Network
Utility for SQL Server 2000, or the SQL Server Configuration Manager
for SQL Server 2005
  TCP/IP is installed as a SQL Server protocol, but it is not listening
on the port specified in the JDBC connection URL. The default port
is 1433.
  The port that is used by the server has not been opened in the firewall
   The  Added  driver:   ...   output  on  the  commandline  does  not  mean  that
the  actual   class  was  found,   but  only  that  Weka  will   attempt   to  load  the
class  later  on  in  order  to  establish  a  database  connection.
   The error message No suitable driver can be caused by the following:
  The JDBC driver you are attempting to load is not in the CLASS-
PATH (Note: using -jar in the java commandline overwrites the
CLASSPATH environment variable!). Open the SimpleCLI, run the
command java weka.core.SystemInfo and check whether the property
java.class.path lists your database jar. If not, correct your
CLASSPATH or the Java call you start Weka with.
  The JDBC driver class is misspelled in the jdbcDriver property or
you have multiple entries of jdbcDriver (properties files need unique
keys!)
  The jdbcURL property has a spelling error and tries to use a non-
existing protocol, or you listed it multiple times, which doesn't work
either (remember, properties files need unique keys!)
Chapter  14
Windows  databases
A  common  query  we  get  from  our  users  is  how  to  open  a  Windows  database  in
the Weka Explorer.   This page is intended as a guide to help you achieve this.   It
is a complicated process and we cannot guarantee that it will work for you.   The
process described makes use of the JDBC-ODBC bridge that is part of Sun's
JRE/JDK 1.3 (and higher).
The following instructions are for Windows 2000. Under other Windows
versions there may be slight differences.
Step  1:   Create  a  User  DSN
1.   Go  to  the  Control  Panel
2.   Choose Administrative Tools
3.   Choose  Data  Sources  (ODBC)
4.   At  the  User  DSN  tab,  choose  Add...
5.   Choose  database
   Microsoft Access
(a)   Note:   Make sure your database is not open in another application
before  following  the  steps  below.
(b)   Choose  the  Microsoft  Access  driver  and  click  Finish
(c)   Give   the   source   a   name   by  typing   it   into   the   Data   Source
Name  eld
(d)   In the Database section, choose Select...
(e)   Browse to find your database file, select it and click OK
(f)   Click OK to finalize your DSN
   Microsoft SQL  Server  2000 (Desktop  Engine)
(a)   Choose  the  SQL  Server  driver  and  click  Finish
(b)   Give  the  source  a  name  by  typing  it  into  the  Name  eld
(c)   Add  a  description  for  this  source  in  the  Description  eld
(d)   Select the server you're connecting to from the Server combobox
(e)   For the verification of the authenticity of the login ID choose
With SQL Server...
(f)   Check Connect to SQL Server to obtain default settings...
and  supply  the  user  ID  and  password  with  which  you  installed
the  Desktop  Engine
(g)   Just  click  on  Next  until   it  changes  into  Finish  and  click  this,
too
(h)   For  testing purposes,  click on Test  Data  Source...   - the  result
should  be  TESTS  COMPLETED  SUCCESSFULLY!
(i)   Click  on  OK
   MySQL
(a)   Choose  the  MySQL  ODBC  driver  and  click  Finish
(b)   Give   the   source   a   name   by  typing   it   into   the   Data   Source
Name  eld
(c)   Add  a  description  for  this  source  in  the  Description  eld
(d)   Specify the server you're connecting to in Server
(e)   Fill in the user to use for connecting to the database in the User
field, the same for the password
(f)   Choose the database for this DSN from the Database combobox
(g)   Click  on  OK
6.   Your  DSN  should  now  be  listed  in  the  User  Data  Sources  list
Step 2:   Set up the DatabaseUtils.props file
You will need to configure a file called DatabaseUtils.props. This file
already exists under the path weka/experiment/ in the weka.jar file (which
is just a ZIP file) that is part of the Weka download. In this directory you
will also find a sample file for ODBC connectivity, called
DatabaseUtils.props.odbc, and one specifically for MS Access, called
DatabaseUtils.props.msaccess, also using ODBC. You should use one of the
sample files as a basis for your setup, since they already contain default
values specific to ODBC access.
This file needs to be recognized when the Explorer starts. You can achieve
this by making sure it is in the working directory or the home directory (if
you are unsure what the terms working directory and home directory mean, see
the Notes section). The easiest is probably the second alternative, as the
setup will then apply to all the Weka instances on your machine.
Just make sure that the file contains at least the following lines:
jdbcDriver=sun.jdbc.odbc.JdbcOdbcDriver
jdbcURL=jdbc:odbc:dbname
where  dbname  is  the  name  you  gave  the  user  DSN.   (This  can  also  be  changed
once  the  Explorer  is  running.)
Step  3:   Open  the  database
1.   Start  up  the  Weka  Explorer.
2.   Choose  Open  DB...
3.   The  URL  should  read  jdbc:odbc:dbname  where  dbname   is  the  name
you  gave  the  user  DSN.
4.   Click  Connect
5.   Enter  a  Query,   e.g.,   select   *   from   tablename  where  tablename   is
the  name  of   the  database  table  you  want  to  read.   Or  you  could  put   a
more  complicated  SQL  query  here  instead.
6.   Click  Execute
7.   When you're satisfied with the returned data, click OK to load the data
into the Preprocess panel.
Notes
   Working  directory
The directory a process is started from. When you start Weka from the
Windows Start Menu, then this directory would be Weka's installation
directory (the java process is started from that directory).
   Home  directory
The directory that contains all the user's data. The exact location depends
on  the  operating  system  and  the  version  of   the  operating  system.   It  is
stored  in  the  following  environment  variable:
  Unix/Linux
$HOME
  Windows
%USERPROFILE%
  Cygwin
$USERPROFILE
You should be able to output the value in a command prompt/terminal with
the echo command. E.g., for Windows this would be:
echo   %USERPROFILE%
Part  IV
Appendix
Chapter  15
Research
15.1   Citing  Weka
If you want to refer to Weka in a publication, please cite the following
SIGKDD Explorations paper
(http://www.kdd.org/explorations/issues/11-1-2009-07/p2V11n1.pdf). The full
citation is:
      Mark Hall, Eibe Frank, Geoffrey Holmes, Bernhard Pfahringer, Peter
      Reutemann, Ian H. Witten (2009); The WEKA Data Mining
      Software: An Update; SIGKDD Explorations, Volume 11, Issue 1.
15.2   Paper  references
Due to the introduction of the weka.core.TechnicalInformationHandler
interface it is now easy to extract all the paper references via
weka.core.ClassDiscovery and weka.core.TechnicalInformation.
The script listed at the end extracts all the paper references from Weka,
based on a given jar file, and dumps them to stdout. One can generate either
simple plain-text output (option -p) or BibTeX-compliant output (option -b).
Typical  use  (after  an  ant   exejar) for  BibTeX:
get_wekatechinfo.sh  -d   ../   -w   ../dist/weka.jar  -b   >   ../tech.txt
(command is issued from the same directory the Weka  build.xml is located in)
Bash shell script get_wekatechinfo.sh
#!/bin/bash
#
#   This   script   prints   the   information   stored   in   TechnicalInformationHandlers
#   to   stdout.
#
#   FracPete,   $Revision:   4582   $
#   the   usage   of   this   script
function   usage()
{
echo
echo   "${0##*/}   -d   <dir>   [-w   <jar>]   [-p|-b]   [-h]"
echo
echo   "Prints   the   information   stored   in   TechnicalInformationHandlers  to   stdout."
echo
echo   "   -h   this   help"
echo   "   -d   <dir>"
echo   "   the   directory   to   look   for   packages,   must   be   the   one   just   above"
echo   "   the   weka   package,   default:   $DIR"
echo   "   -w   <jar>"
echo   "   the   weka   jar   to   use,   if   not   in   CLASSPATH"
echo   "   -p   prints   the   information   in   plaintext   format"
echo   "   -b   prints   the   information   in   BibTeX   format"
echo
}
#   generates   a   filename   out   of   the   classname   TMP   and   returns   it   in   TMP
#   uses   the   directory   in   DIR
function   class_to_filename()
{
TMP=$DIR"/"`echo $TMP | sed s/"\."/"\/"/g`".java"
}
#   variables
DIR="."
PLAINTEXT="no"
BIBTEX="no"
WEKA=""
TECHINFOHANDLER="weka.core.TechnicalInformationHandler"
TECHINFO="weka.core.TechnicalInformation"
CLASSDISCOVERY="weka.core.ClassDiscovery"
#   interpret   parameters
while   getopts   ":hpbw:d:"   flag
do
case   $flag   in
p)   PLAINTEXT="yes"
;;
b)   BIBTEX="yes"
;;
d)   DIR=$OPTARG
;;
w)   WEKA=$OPTARG
;;
h)   usage
exit   0
;;
*)   usage
exit   1
;;
esac
done
#   either   plaintext   or   bibtex
if   [   "$PLAINTEXT"   =   "$BIBTEX"   ]
then
echo
echo   "ERROR:   either   -p   or   -b   has   to   be   given!"
echo
usage
exit   2
fi
#   do   we   have   everything?
if   [   "$DIR"   =   ""   ]   ||   [   !   -d   "$DIR"   ]
then
echo
echo   "ERROR:   no   directory   or   non-existing   one   provided!"
echo
usage
exit   3
fi
#   generate   Java   call
if   [   "$WEKA"   =   ""   ]
then
JAVA="java"
else
JAVA="java   -classpath   $WEKA"
fi
if   [   "$PLAINTEXT"   =   "yes"   ]
then
CMD="$JAVA   $TECHINFO   -plaintext"
elif   [   "$BIBTEX"   =   "yes"   ]
then
CMD="$JAVA   $TECHINFO   -bibtex"
fi
#   find   packages
TMP=`find $DIR -mindepth 1 -type d | grep -v CVS | sed s/".*weka"/"weka"/g | sed s/"\/"/./g`
PACKAGES=`echo $TMP | sed s/" "/,/g`
#   get   technicalinformationhandlers
TECHINFOHANDLERS=`$JAVA weka.core.ClassDiscovery $TECHINFOHANDLER $PACKAGES | grep "\. weka" | sed s/".*weka"/weka/g`
#   output   information
echo
for   i   in   $TECHINFOHANDLERS
do
TMP=$i;class_to_filename
#   exclude   internal   classes
if   [   !   -f   $TMP   ]
then
continue
fi
$CMD   -W   $i
echo
done
Chapter  16
Using  the  API
Using the graphical tools, like the Explorer, or just the command line is in
most cases sufficient for the normal user. But WEKA's clearly defined API
(application programming interface) makes it very easy to embed it in other
projects. This chapter covers the basics of how to achieve the following
common tasks from source code:
   Setting  options
   Creating  datasets  in  memory
   Loading  and  saving  data
   Filtering
   Classifying
   Clustering
   Selecting  attributes
   Visualization
   Serialization
Even though most of the code examples are for the Linux platform, using
forward slashes in the paths and file names, they do work on the MS Windows
platform as well. To make the examples work under MS Windows, one only needs
to adapt the paths, changing the forward slashes to backslashes and adding a
drive letter where necessary.
Note
WEKA is released under the GNU General Public License version 2 (GPLv2,
http://www.gnu.org/licenses/gpl-2.0.html), i.e., derived code or code that
uses WEKA needs to be released under the GPLv2 as well. If one is just using
WEKA for a personal project that does not get released publicly, then one is
not affected. But as soon as one makes the project publicly available (e.g.,
for download), then one needs to make the source code available under the
GPLv2 as well, alongside the binaries.
16.1   Option handling
Configuring an object, e.g., a classifier, can either be done using the
appropriate get/set-methods for the property that one wishes to change, like
the Explorer does. Or, if the class implements the weka.core.OptionHandler
interface, one can just use the object's ability to parse command-line
options via the setOptions(String[]) method (the counterpart of this method
is getOptions(), which returns a String[] array). The difference between
the two approaches is that the setOptions(String[]) method cannot be used
to set the options incrementally. Default values are used for all options
that haven't been explicitly specified in the options array.
The most basic approach is to assemble the String array by hand. The
following example creates an array with a single option (-R) that takes an
argument (1) and initializes the Remove filter with this option:
import   weka.filters.unsupervised.attribute.Remove;
...
String[]  options   =   new   String[2];
options[0]  =   "-R";
options[1]  =   "1";
Remove   rm   =   new   Remove();
rm.setOptions(options);
Since the setOptions(String[]) method expects a fully parsed and correctly
split up array (which is done by the console/command prompt), some common
pitfalls with this approach are:
   Combination of option and argument - using "-R 1" as an element of
the String array will fail, prompting WEKA to output an error message
stating that the option "R 1" is unknown.
   Trailing blanks - using "-R " will fail as well, since no trailing
blanks are removed and therefore option "R " will not be recognized.
The easiest way to avoid these problems is to provide a String array that
has been generated automatically from a single command-line string using the
splitOptions(String) method of the weka.core.Utils class. Here is an
example:
import   weka.core.Utils;
...
String[]  options   =   Utils.splitOptions("-R  1");
As this method ignores whitespace, using " -R 1" or "-R 1 " will return the
same result as "-R 1".
Complicated command lines with lots of nested options, e.g., options for the
support-vector machine classifier SMO (package weka.classifiers.functions)
including a kernel setup, are a bit tricky, since Java requires one to escape
double quotes and backslashes inside a String. The Wiki[2] article "Use Weka
in your Java code" references the Java class OptionsToCode, which turns any
command line into appropriate Java source code. This example class is also
available from the Weka Examples collection[3]: weka.core.OptionsToCode.
Instead of using the Remove filter's setOptions(String[]) method, the
following code snippet uses the actual set-method for this property:
import   weka.filters.unsupervised.attribute.Remove;
...
Remove   rm   =   new   Remove();
rm.setAttributeIndices("1");
In order to find out which option belongs to which property, i.e.,
get/set-method, it is best to have a look at the setOptions(String[]) and
getOptions() methods. In case these methods use the member variables
directly, one just has to look for the methods making this particular member
variable accessible to the outside.
Using the set-methods, one will most likely come across ones that require a
weka.core.SelectedTag as parameter. An example for this is the
setEvaluation method of the meta-classifier GridSearch (located in package
weka.classifiers.meta). The SelectedTag class is used in the GUI for
displaying drop-down lists, enabling the user to choose from a predefined
list of values. GridSearch allows the user to choose the statistical measure
to base the evaluation on (accuracy, correlation coefficient, etc.).
A SelectedTag gets constructed using the array of all possible
weka.core.Tag elements that can be chosen and the integer or string ID of
the Tag. For instance, GridSearch's setOptions(String[]) method uses the
supplied string ID to set the evaluation type (e.g., "ACC" for accuracy),
or, if the evaluation option is missing, the default integer ID
EVALUATION_CC. In both cases, the array TAGS_EVALUATION is used, which
defines all possible options:
import   weka.core.SelectedTag;
...
String tmpStr = Utils.getOption('E', options);
if   (tmpStr.length()  !=   0)
setEvaluation(new  SelectedTag(tmpStr,  TAGS_EVALUATION));
else
setEvaluation(new  SelectedTag(EVALUATION_CC,  TAGS_EVALUATION));
16.2   Loading data
Before any filter, classifier or clusterer can be applied, data needs to be
present. WEKA enables one to load data from files (in various file formats)
and also from databases. In the latter case, it is assumed that the database
connection is set up and working. See chapter 13 for more details on how to
configure WEKA correctly and for more information on JDBC (Java Database
Connectivity) URLs.
Example classes, making use of the functionality covered in this section, can
be found in the wekaexamples.core.converters package of the Weka Examples
collection[3].
The  following  classes  are  used  to  store  data  in  memory:
   weka.core.Instances    holds  a  complete  dataset.   This  data  structure
is  row-based;  single  rows  can  be  accessed  via  the  instance(int) method
using a 0-based index.   Information about the columns can be accessed via
the attribute(int) method.   This method returns weka.core.Attribute
objects  (see  below).
   weka.core.Instance   encapsulates  a  single  row.   It  is  basically  a  wrap-
per   around  an  array  of   double  primitives.   Since  this   class   contains   no
information  about  the  type  of   the  columns,   it   always  needs  access  to  a
weka.core.Instances  object   (see  methods   dataset  and  setDataset).
The  class  weka.core.SparseInstance is  used  in  case  of  sparse  data.
   weka.core.Attribute  holds the type information about a single column
in  the  dataset.   It  stores  the  type  of   the  attribute,   as  well   as  the  labels
for   nominal   attributes,   the   possible   values   for   string   attributes   or   the
datasets   for   relational   attributes   (these   are  just   weka.core.Instances
objects  again).
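The following snippet illustrates how these three classes relate; it is a
sketch, and assumes the data object has already been loaded as shown in the
next sections:

import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
...
Instances data = ...;                     // a loaded dataset
Instance firstRow = data.instance(0);     // rows use a 0-based index
Attribute firstCol = data.attribute(0);   // type info for the first column
double value = firstRow.value(firstCol);  // the cell value, stored as a double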
16.2.1   Loading data from files
When loading data from files, one can either let WEKA choose the appropriate
loader (the available loaders can be found in the weka.core.converters
package) based on the file's extension, or one can use the correct loader
directly. The latter case is necessary if the files do not have the correct
extension.
The DataSource class (an inner class of the
weka.core.converters.ConverterUtils class) can be used to read data from
files that have the appropriate file extension. Here are some examples:
import   weka.core.converters.ConverterUtils.DataSource;
import   weka.core.Instances;
...
Instances  data1   =   DataSource.read("/some/where/dataset.arff");
Instances  data2   =   DataSource.read("/some/where/dataset.csv");
Instances  data3   =   DataSource.read("/some/where/dataset.xrff");
In case the file has a different file extension than is normally associated
with the loader, one has to use a loader directly. The following example
loads a CSV (comma-separated values) file:
import   weka.core.converters.CSVLoader;
import   weka.core.Instances;
import   java.io.File;
...
CSVLoader  loader   =   new   CSVLoader();
loader.setSource(new  File("/some/where/some.data"));
Instances  data   =   loader.getDataSet();
NB: Not all file formats can store information about the class attribute
(e.g., ARFF stores no information about the class attribute, but XRFF does).
If a class attribute is required further down the road, e.g., when using a
classifier, it can be set with the setClassIndex(int) method:
//   uses   the   first   attribute  as   class   attribute
if   (data.classIndex()  ==   -1)
data.setClassIndex(0);
...
//   uses   the   last   attribute  as   class   attribute
if   (data.classIndex()  ==   -1)
data.setClassIndex(data.numAttributes()  -   1);
16.2.2   Loading  data  from  databases
For  loading  data  from  databases,  one  of  the  following  two  classes  can  be  used:
   weka.experiment.InstanceQuery
   weka.core.converters.DatabaseLoader
The difference between them is that the InstanceQuery class allows one to
retrieve sparse data, while the DatabaseLoader can retrieve the data
incrementally.
Here  is  an  example  of  using  the  InstanceQuery class:
import   weka.core.Instances;
import   weka.experiment.InstanceQuery;
...
InstanceQuery  query   =   new   InstanceQuery();
query.setDatabaseURL("jdbc_url");
query.setUsername("the_user");
query.setPassword("the_password");
query.setQuery("select  *   from   whatsoever");
//   if   your   data   is   sparse,  then   you   can   say   so,   too:
//   query.setSparseData(true);
Instances  data   =   query.retrieveInstances();
And  an  example  using  the  DatabaseLoader class  in  batch  retrieval:
import   weka.core.Instances;
import   weka.core.converters.DatabaseLoader;
...
DatabaseLoader  loader   =   new   DatabaseLoader();
loader.setSource("jdbc_url",  "the_user",  "the_password");
loader.setQuery("select  *   from   whatsoever");
Instances  data   =   loader.getDataSet();
The  DatabaseLoader is  used  in  incremental  mode  as  follows:
import   weka.core.Instance;
import   weka.core.Instances;
import   weka.core.converters.DatabaseLoader;
...
DatabaseLoader  loader   =   new   DatabaseLoader();
loader.setSource("jdbc_url",  "the_user",  "the_password");
loader.setQuery("select  *   from   whatsoever");
Instances  structure  =   loader.getStructure();
Instances  data   =   new   Instances(structure);
Instance  inst;
while   ((inst   =   loader.getNextInstance(structure))  !=   null)
data.add(inst);
Notes:
   Not  all  database  systems  allow  incremental  retrieval.
   Not all queries have a unique key to retrieve rows incrementally. In that case, one can supply the necessary columns with the setKeys(String) method (comma-separated list of columns), as shown in the sketch below.
   If the data cannot be retrieved in an incremental fashion, it is first fully loaded into memory and then provided row by row (pseudo-incremental).
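For example, assuming the result set contains a unique column named id (a hypothetical column name, purely for illustration), the key can be supplied on the loader as follows:

loader.setKeys("id");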
16.3   Creating  datasets  in  memory
Loading datasets from disk or database is not the only way of obtaining data in WEKA: datasets can be created in memory or on-the-fly. Generating a dataset in memory (i.e., a weka.core.Instances object) is a two-stage process:
1.   Defining the format of the data by setting up the attributes.
2.   Adding  the  actual  data,  row  by  row.
The class wekaexamples.core.CreateInstances of the Weka Examples  collection[3]
generates an Instances object containing all attribute types WEKA can handle
at  the  moment.
16.3.1   Defining the format
There are currently five different types of attributes available in WEKA:
   numeric    continuous  variables
   date    date  variables
   nominal    predefined labels
   string    textual  data
   relational      contains   other   relations,   e.g.,   the   bags   in  case   of   multi-
instance  data
For all of the different attribute types, WEKA uses the same class, weka.core.Attribute, but with different constructors. In the following, these different constructors are explained.
   numeric    The  easiest  attribute  type  to  create,   as  it  requires  only  the
name  of  the  attribute:
Attribute  numeric   =   new   Attribute("name_of_attr");
   date    Date attributes are handled internally as numeric attributes, but in order to parse and present the date value correctly, the format of the date needs to be specified. The date and time patterns are explained in detail in the Javadoc of the java.text.SimpleDateFormat class. The following example creates a date attribute using a date format of 4-digit year, 2-digit month and 2-digit day, separated by hyphens:
Attribute  date   =   new   Attribute("name_of_attr",  "yyyy-MM-dd");
   nominal    Since nominal attributes contain predefined labels, one needs to supply these, stored in the form of a weka.core.FastVector object:
FastVector  labels   =   new   FastVector();
labels.addElement("label_a");
labels.addElement("label_b");
labels.addElement("label_c");
labels.addElement("label_d");
Attribute  nominal   =   new   Attribute("name_of_attr",  labels);
   string    In contrast to nominal attributes, this type does not store a predefined list of labels. It is normally used to store textual data, i.e., the content of documents for text categorization. The same constructor as for the nominal attribute is used, but a null value is provided instead of an instance of FastVector:
Attribute  string   =   new   Attribute("name_of_attr",  (FastVector)  null);
   relational    This attribute just takes another weka.core.Instances object for defining the relational structure in the constructor. The following code snippet generates a relational attribute that contains a relation with two attributes, a numeric and a nominal attribute:
FastVector  atts   =   new   FastVector();
atts.addElement(new  Attribute("rel.num"));
FastVector  values   =   new   FastVector();
values.addElement("val_A");
values.addElement("val_B");
values.addElement("val_C");
atts.addElement(new  Attribute("rel.nom",  values));
Instances  rel_struct  =   new   Instances("rel",  atts,   0);
Attribute  relational  =   new   Attribute("name_of_attr",  rel_struct);
A  weka.core.Instances  object   is   then  created  by  supplying  a  FastVector
object   containing  all   the  attribute  objects.   The  following  example  creates   a
dataset   with  two  numeric  attributes   and  a  nominal   class   attribute  with  two
labels  no  and  yes:
Attribute  num1   =   new   Attribute("num1");
Attribute  num2   =   new   Attribute("num2");
FastVector  labels   =   new   FastVector();
labels.addElement("no");
labels.addElement("yes");
Attribute  cls   =   new   Attribute("class",  labels);
FastVector  attributes  =   new   FastVector();
attributes.addElement(num1);
attributes.addElement(num2);
attributes.addElement(cls);
Instances  dataset   =   new   Instances("Test-dataset",  attributes,  0);
The final argument in the Instances constructor above tells WEKA how much memory to reserve for upcoming weka.core.Instance objects. If one knows how many rows will be added to the dataset, then it should be specified, as it saves costly operations for expanding the internal storage. It doesn't matter if one aims too high with the number of rows to be added; it is always possible to trim the dataset again, using the compactify() method.
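For instance, if roughly 1000 rows are expected (an arbitrary figure, purely for illustration), one could reserve the capacity up front and trim afterwards:

Instances dataset = new Instances("Test-dataset", attributes, 1000);
... // add the actual rows
dataset.compactify(); // trims the unused, reserved capacity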
16.3.2   Adding  data
After the structure of the dataset has been defined, one can add the actual data to it, row by row. There are basically two constructors of the weka.core.Instance class that one can use for this purpose:
   Instance(double weight, double[] attValues)    generates an Instance object with the specified weight and the given double values. WEKA's internal format uses doubles for all attribute types. For nominal, string and relational attributes this is just an index into the list of stored values.
   Instance(int  numAttributes)  generates a new  Instance object with
weight  1.0  and  all  missing  values.
The second constructor may be easier to use, but setting values via the Instance class methods is a bit costly, especially if one is adding a lot of rows. Therefore, the following code examples cover the first constructor; a short sketch of the second route follows below. For simplicity, an Instances object data, based on the code snippets for the different attribute types introduced above, is used, as it contains all possible attribute types.
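For completeness, a minimal sketch of the second constructor route (assuming the data object just mentioned):

Instance inst = new Instance(data.numAttributes()); // all values missing
inst.setDataset(data); // needed so labels can be resolved
inst.setValue(0, 1.23); // numeric attribute
inst.setValue(2, "label_b"); // nominal attribute, set via its label
data.add(inst);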
For each instance, the first step is to create a new double array to hold the attribute values. It is important not to reuse this array, but always create a new one, since WEKA only references it and does not create a copy of it when instantiating the Instance object. Reusing the array would change the previously generated Instance object:
double[]  values   =   new   double[data.numAttributes()];
After that, the double array is filled with the actual values:
   numeric    just  sets  the  numeric  value:
values[0]  =   1.23;
   date    turns  the  date  string  into  a  double  value:
values[1]  =   data.attribute(1).parseDate("2001-11-09");
   nominal    determines  the  index  of  the  label:
values[2]  =   data.attribute(2).indexOfValue("label_b");
   string    determines  the  index  of   the  string,   using  the  addStringValue
method  (internally,  a  hashtable  holds  all  the  string  values):
values[3]  =   data.attribute(3).addStringValue("This  is   a   string");
   relational    first, a new Instances object, based on the attribute's relational definition, has to be created before its index can be determined, using the addRelation method:
Instances  dataRel   =   new   Instances(data.attribute(4).relation(),0);
double[]  valuesRel  =   new   double[dataRel.numAttributes()];
valuesRel[0]  =   2.34;
valuesRel[1]  =   dataRel.attribute(1).indexOfValue("val_C");
dataRel.add(new  Instance(1.0,  valuesRel));
values[4]  =   data.attribute(4).addRelation(dataRel);
Finally,   an  Instance  object  is  generated  with  the  initialized  double  array  and
added  to  the  dataset:
Instance  inst   =   new   Instance(1.0,  values);
data.add(inst);
16.4   Randomizing  data
Since learning algorithms can be sensitive to the order in which the data arrives, randomizing (also called shuffling) the data is a common approach to alleviating this problem. Repeated randomizations in particular, e.g., during cross-validation, help to generate more realistic statistics.
WEKA offers two possibilities for randomizing a dataset:
   Using  the  randomize(Random) method  of  the  weka.core.Instances ob-
ject  containing  the  data  itself.   This  method  requires  an  instance  of   the
java.util.Random class.   How  to  correctly  instantiate  such  an  object  is
explained  below.
   Using the Randomize filter (package weka.filters.unsupervised.instance). For more information on how to use filters, see section 16.5.
A very important aspect of machine learning experiments is that they have to be repeatable: subsequent runs of the same experiment setup have to yield the exact same results. It may seem odd, but randomization is still possible in this scenario. Random number generators never return a completely random sequence of numbers anyway, only a pseudo-random one. In order to achieve repeatable pseudo-random sequences, seeded generators are used; using the same seed value will always result in the same sequence.
The default constructor of the java.util.Random random number generator class should never be used, as objects created this way will most likely generate different sequences. The constructor Random(long), which takes a specified seed value, is the recommended one to use.
In  order  to  get   a  more  dataset-dependent   randomization  of   the  data,   the
getRandomNumberGenerator(int) method  of  the  weka.core.Instances class
can  be used.   This method  returns a  java.util.Random object that was seeded
with the sum of the supplied seed and the hashcode of the string representation
of   a  randomly  chosen  weka.core.Instance  of   the  Instances  object  (using  a
random  number  generator  seeded  with  the  seed  supplied  to  this  method).
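Putting these recommendations together, here is a minimal sketch (the seed value 42 is arbitrary):

import java.util.Random;
import weka.core.Instances;
...
Instances data = ... // from somewhere
// dataset-dependent seeding, as described above
Random rand = data.getRandomNumberGenerator(42);
data.randomize(rand);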
16.5   Filtering
In WEKA, filters are used to preprocess the data. They can be found below package weka.filters. Each filter falls into one of the following two categories:
   supervised    The filter requires a class attribute to be set.
   unsupervised    A  class  attribute  is  not  required  to  be  present.
And  into  one  of  the  two  sub-categories:
   attribute-based    Columns  are  processed,  e.g.,  added  or  removed.
   instance-based    Rows  are  processed,  e.g.,  added  or  deleted.
These categories should make clear what the difference between the two Discretize filters in WEKA is: the supervised one takes the class attribute and its distribution over the dataset into account in order to determine the optimal number and size of bins, whereas the unsupervised one relies on a user-specified number of bins.
Apart from this classification, filters are either stream- or batch-based. Stream filters can process the data straight away and make it immediately available for collection again. Batch filters, on the other hand, need a batch of data to set up their internal data structures. The Add filter (this filter can be found in the weka.filters.unsupervised.attribute package) is an example of a stream filter: adding a new attribute with only missing values does not require any sophisticated setup. However, the ReplaceMissingValues filter (same package as the Add filter) needs a batch of data in order to determine the means and modes for each of the attributes; otherwise, the filter will not be able to replace the missing values with meaningful values. But as soon as a batch filter has been initialized with the first batch of data, it can also process data on a row-by-row basis, just like a stream filter.
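The following sketch illustrates this single-row mode; it assumes a filter that has already been initialized on a first batch, and an Instance inst that is compatible with that batch:

// push a single row through the initialized filter...
filter.input(inst);
// ...and collect the filtered result
Instance filteredInst = filter.output();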
Instance-based filters are a bit special in the way they handle data. As mentioned earlier, all filters can process data on a row-by-row basis after the first batch of data has been passed through. Of course, if a filter adds or removes rows from a batch of data, this no longer works in single-row processing mode. This makes sense if one thinks of a scenario involving the FilteredClassifier meta-classifier: after the training phase (= first batch of data), the classifier gets evaluated against a test set, one instance at a time. If the filter now removed the only instance, or added instances, it could no longer be evaluated correctly, as the evaluation expects to get only a single result back. This is the reason why instance-based filters only pass any subsequent batch of data through without processing it. The Resample filters, for instance, act like this.
One can find example classes for filtering in the wekaexamples.filters package of the Weka Examples collection[3].
The following example uses the Remove filter (the filter is located in package weka.filters.unsupervised.attribute) to remove the first attribute from a dataset. For setting the options, the setOptions(String[]) method is used.
import   weka.core.Instances;
import   weka.filters.Filter;
import   weka.filters.unsupervised.attribute.Remove;
...
String[]  options   =   new   String[2];
options[0]  =   "-R";   //   "range"
options[1]  =   "1";   //   first   attribute
Remove   remove   =   new   Remove();   //   new   instance  of   filter
remove.setOptions(options);   //   set   options
remove.setInputFormat(data);   //   inform   filter   about   dataset
//   **AFTER**  setting  options
Instances  newData   =   Filter.useFilter(data,  remove);   //   apply   filter
A common trap to fall into is setting options after setInputFormat(Instances) has been called. Since this method is (normally) used to determine the output format of the data, all the options have to be set before calling it; options set afterwards will be ignored.
16.5.1   Batch filtering
Batch filtering is necessary if two or more datasets need to be processed according to the same filter initialization. If batch filtering is not used, for instance when generating a training and a test set using the StringToWordVector filter (package weka.filters.unsupervised.attribute), then these two filter runs are completely independent and will most likely create two incompatible datasets. Running the StringToWordVector on two different datasets will result in two different word dictionaries and therefore different attributes being generated.
The following code example shows how to standardize a training and a test set, i.e., transform all numeric attributes to have zero mean and unit variance, with the Standardize filter (package weka.filters.unsupervised.attribute):
Instances  train   =   ...   //   from   somewhere
Instances  test   =   ...   //   from   somewhere
Standardize  filter   =   new   Standardize();
//   initializing  the   filter   once   with   training  set
filter.setInputFormat(train);
//   configures  the   Filter   based   on   train   instances  and   returns
//   filtered  instances
Instances  newTrain  =   Filter.useFilter(train,  filter);
//   create   new   test   set
Instances  newTest   =   Filter.useFilter(test,  filter);
16.5.2   Filtering on-the-fly
Even though using the API gives one full control over the data and makes it easier to juggle several datasets at the same time, filtering data on-the-fly makes life even easier. This handy feature is available through meta schemes in WEKA, like FilteredClassifier (package weka.classifiers.meta), FilteredClusterer (package weka.clusterers), FilteredAssociator (package weka.associations) and FilteredAttributeEval/FilteredSubsetEval (in weka.attributeSelection). Instead of filtering the data beforehand, one just sets up a meta-scheme and lets the meta-scheme do the filtering for one.
The following example uses the FilteredClassifier in conjunction with the Remove filter to remove the first attribute (which happens to be an ID attribute) from the dataset, and J48 (WEKA's implementation of C4.5; package weka.classifiers.trees) as base-classifier. First the classifier is built with a training set and then evaluated with a separate test set. The actual and predicted class values are printed in the console. For more information on classification, see section 16.6.
import   weka.classifiers.meta.FilteredClassifier;
import   weka.classifiers.trees.J48;
import   weka.core.Instances;
import   weka.filters.unsupervised.attribute.Remove;
...
Instances  train   =   ...   //   from   somewhere
Instances  test   =   ...   //   from   somewhere
//   filter
Remove   rm   =   new   Remove();
rm.setAttributeIndices("1");   //   remove   1st   attribute
//   classifier
J48   j48   =   new   J48();
j48.setUnpruned(true);   //   using   an   unpruned   J48
//   meta-classifier
FilteredClassifier  fc   =   new   FilteredClassifier();
fc.setFilter(rm);
fc.setClassifier(j48);
//   train   and   output   model
fc.buildClassifier(train);
System.out.println(fc);
for   (int   i   =   0;   i   <   test.numInstances();  i++)   {
double   pred   =   fc.classifyInstance(test.instance(i));
double   actual   =   test.instance(i).classValue();
System.out.print("ID:  "
+   test.instance(i).value(0));
System.out.print(",  actual:  "
+   test.classAttribute().value((int)  actual));
System.out.println(",  predicted:  "
+   test.classAttribute().value((int)  pred));
}
16.6   Classification
Classification and regression algorithms in WEKA are called classifiers and are located below the weka.classifiers package. This section covers the following topics:
   Building a classifier    batch and incremental learning.
   Evaluating a classifier    various evaluation techniques and how to obtain the generated statistics.
   Classifying instances    obtaining classifications for unknown data.
The Weka Examples collection[3] contains example classes covering classification in the wekaexamples.classifiers package.
16.6.1   Building a classifier
By design, all classifiers in WEKA are batch-trainable, i.e., they get trained on the whole dataset at once. This is fine if the training data fits into memory. But there are also algorithms available that can update their internal model on-the-go. These classifiers are called incremental. The following two sections cover the batch and the incremental classifiers.
Batch classifiers
A batch classifier is really simple to build:
   set  options     either  using  the  setOptions(String[]) method  or  the  ac-
tual  set-methods.
   train it    calling the buildClassifier(Instances) method with the training set. By definition, the buildClassifier(Instances) method resets the internal model completely, in order to ensure that subsequent calls of this method with the same data result in the same model (repeatable experiments).
The  following  code  snippet  builds  an  unpruned  J48  on  a  dataset:
import   weka.core.Instances;
import   weka.classifiers.trees.J48;
...
Instances  data   =   ...   //   from   somewhere
String[]  options   =   new   String[1];
options[0]  =   "-U";   //   unpruned  tree
J48   tree   =   new   J48();   //   new   instance   of   tree
tree.setOptions(options);   //   set   the   options
tree.buildClassifier(data);   //   build   classifier
Incremental classifiers
All incremental classifiers in WEKA implement the interface UpdateableClassifier (located in package weka.classifiers). Bringing up the Javadoc for this particular interface tells one which classifiers implement it. These classifiers can be used to process large amounts of data with a small memory footprint, as the training data does not have to fit in memory. ARFF files, for instance, can be read incrementally (see section 16.2).
Training an incremental classifier happens in two stages:
1.   initialize  the model by calling the buildClassifier(Instances) method.
One can either use a  weka.core.Instances object with no actual data or
one  with  an  initial  set  of  data.
2.   update the model row-by-row, by calling the updateClassifier(Instance)
method.
The following example shows how to load an ARFF file incrementally using the ArffLoader class and train the NaiveBayesUpdateable classifier one row at a time:
import   weka.classifiers.bayes.NaiveBayesUpdateable;
import   weka.core.Instance;
import   weka.core.Instances;
import   weka.core.converters.ArffLoader;
import   java.io.File;
...
//   load   data
ArffLoader  loader   =   new   ArffLoader();
loader.setFile(new  File("/some/where/data.arff"));
Instances  structure  =   loader.getStructure();
structure.setClassIndex(structure.numAttributes()  -   1);
//   train   NaiveBayes
NaiveBayesUpdateable  nb   =   new   NaiveBayesUpdateable();
nb.buildClassifier(structure);
Instance  current;
while   ((current  =   loader.getNextInstance(structure))  !=   null)
nb.updateClassifier(current);
16.6.2   Evaluating a classifier
Building a classifier is only one part of the equation; evaluating how well it performs is another important part. WEKA supports two types of evaluation:
   Cross-validation    If one only has a single dataset and wants to get a reasonably realistic evaluation. Setting the number of folds equal to the number of rows in the dataset will give one leave-one-out cross-validation (LOOCV).
   Dedicated test set    The test set is solely used to evaluate the built classifier. It is important to have a test set that incorporates the same (or similar) concepts as the training set, otherwise one will always end up with poor performance.
The evaluation step, including collection of statistics, is performed by the Evaluation
class  (package  weka.classifiers).
Cross-validation
The crossValidateModel method of the Evaluation class is used to perform cross-validation with an untrained classifier and a single dataset. Supplying an untrained classifier ensures that no information leaks into the actual evaluation. Even though it is an implementation requirement that the buildClassifier method resets the classifier, it cannot be guaranteed that this is indeed the case (a leaky implementation). Using an untrained classifier avoids unwanted side-effects, as a copy of the originally supplied classifier is used for each train/test set pair.
Before cross-validation is performed, the data gets randomized using the supplied random number generator (java.util.Random). It is recommended that this number generator is seeded with a specified seed value. Otherwise, subsequent runs of cross-validation on the same dataset will not yield the same results, due to different randomization of the data (see section 16.4 for more information on randomization).
The code snippet below performs 10-fold cross-validation with a J48 decision tree algorithm on a dataset newData, with a random number generator that is seeded with 1. The summary of the collected statistics is output to stdout.
import   weka.classifiers.Evaluation;
import   weka.classifiers.trees.J48;
import   weka.core.Instances;
import   java.util.Random;
...
Instances  newData   =   ...   //   from   somewhere
Evaluation  eval   =   new   Evaluation(newData);
J48   tree   =   new   J48();
eval.crossValidateModel(tree,  newData,  10,   new   Random(1));
System.out.println(eval.toSummaryString("\nResults\n\n",  false));
The Evaluation object in this example is initialized with the dataset used in the evaluation process. This is done in order to inform the evaluation about the type of data that is being evaluated, ensuring that all internal data structures are set up correctly.
Train/test  set
Using a dedicated test set to evaluate a classifier is just as easy as cross-validation. But instead of providing an untrained classifier, a trained classifier has to be provided now. Once again, the weka.classifiers.Evaluation class is used to perform the evaluation, this time using the evaluateModel method.
The  code  snippet  below  trains  a  J48  with  default  options  on  a  training  set
and  evaluates  it  on  a  test  set  before  outputting  the  summary  of   the  collected
statistics:
import   weka.core.Instances;
import   weka.classifiers.Classifier;
import   weka.classifiers.Evaluation;
import   weka.classifiers.trees.J48;
...
Instances  train   =   ...   //   from   somewhere
Instances  test   =   ...   //   from   somewhere
//   train   classifier
Classifier  cls   =   new   J48();
cls.buildClassifier(train);
//   evaluate   classifier  and   print   some   statistics
Evaluation  eval   =   new   Evaluation(train);
eval.evaluateModel(cls,  test);
System.out.println(eval.toSummaryString("\nResults\n\n",  false));
Statistics
In the previous sections, the toSummaryString method of the Evaluation class was already used in the code examples. But there are other summary methods for nominal class attributes available as well:
   toMatrixString   outputs  the  confusion  matrix.
   toClassDetailsString outputs TP/FP rates, precision, recall, F-measure,
AUC  (per  class).
   toCumulativeMarginDistributionString    outputs the cumulative margins distribution.
If   one  does  not   want  to  use  these  summary  methods,   it   is  possible  to  access
the  individual  statistical measures directly.   Below, a  few common  measures are
listed:
   nominal  class  attribute
  correct()    The number of correctly classified instances. The incorrectly classified ones are available through incorrect().
  pctCorrect()    The percentage of correctly classified instances (accuracy). pctIncorrect() returns the percentage of misclassified ones.
  areaUnderROC(int)    The AUC for the specified class label index (0-based index).
   numeric  class  attribute
  correlationCoefficient()    The correlation coefficient.
   general
  meanAbsoluteError()    The mean absolute error.
  rootMeanSquaredError()    The root mean squared error.
  numInstances()    The number of instances with a class value.
  unclassified()    The number of unclassified instances.
  pctUnclassified()    The percentage of unclassified instances.
For  a  complete  overview,   see  the  Javadoc  page  of   the  Evaluation  class.   By
looking  up  the  source code  of  the  summary  methods  mentioned  above,  one  can
easily  determine  what  methods  are  used  for  which  particular  output.
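A short sketch of accessing a few of these measures directly (assuming eval is an Evaluation object from one of the previous examples):

System.out.println("Accuracy: " + eval.pctCorrect() + "%");
System.out.println("AUC (first class): " + eval.areaUnderROC(0));
System.out.println("Mean absolute error: " + eval.meanAbsoluteError());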
16.6.3   Classifying  instances
After a classifier setup has been evaluated and proven to be useful, a built classifier can be used to make predictions and label previously unlabeled data. Section 16.5.2 already provided a glimpse of how to use a classifier's classifyInstance method. This section elaborates a bit more on this.
The following example uses a trained classifier tree to label all the instances in an unlabeled dataset that gets loaded from disk. After all the instances have been labeled, the newly labeled dataset gets written back to disk to a new file.
//   load   unlabeled  data   and   set   class   attribute
Instances  unlabeled  =   DataSource.read("/some/where/unlabeled.arff");
unlabeled.setClassIndex(unlabeled.numAttributes()  -   1);
//   create   copy
Instances  labeled   =   new   Instances(unlabeled);
//   label   instances
for   (int   i   =   0;   i   <   unlabeled.numInstances();  i++)   {
double   clsLabel  =   tree.classifyInstance(unlabeled.instance(i));
labeled.instance(i).setClassValue(clsLabel);
}
//   save   newly   labeled   data
DataSink.write("/some/where/labeled.arff",  labeled);
The above example works for classification and regression problems alike, as long as the classifier can handle numeric classes, of course. Why is that? The classifyInstance(Instance) method returns the regression value for numeric classes, and the 0-based index in the list of available class labels for nominal classes.
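For nominal classes, the returned double can therefore be turned back into a label string via the class attribute, e.g. (a sketch, assuming the tree classifier and the unlabeled dataset from the example above):

double clsLabel = tree.classifyInstance(unlabeled.instance(0));
String label = unlabeled.classAttribute().value((int) clsLabel);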
If one is interested in the class distribution instead, then one can use the distributionForInstance(Instance) method (this array sums up to 1). Of course, using this method only makes sense for classification problems. The code snippet below outputs the class distribution, and the actual and predicted labels, side-by-side in the console:
//   load   data
Instances  train   =   DataSource.read(args[0]);
train.setClassIndex(train.numAttributes()  -   1);
Instances  test   =   DataSource.read(args[1]);
test.setClassIndex(test.numAttributes()  -   1);
//   train   classifier
J48   cls   =   new   J48();
cls.buildClassifier(train);
//   output   predictions
System.out.println("#  -   actual   -   predicted  -   distribution");
for   (int   i   =   0;   i   <   test.numInstances();  i++)   {
double   pred   =   cls.classifyInstance(test.instance(i));
double[]  dist   =   cls.distributionForInstance(test.instance(i));
System.out.print((i+1)  +   "   -   ");
System.out.print(test.instance(i).toString(test.classIndex())  +   "   -   ");
System.out.print(test.classAttribute().value((int)  pred)   +   "   -   ");
System.out.println(Utils.arrayToString(dist));
}
16.7   Clustering
Clustering is an unsupervised machine learning technique for finding patterns in the data, i.e., these algorithms work without class attributes. Classifiers, on the other hand, are supervised and need a class attribute. This section, similar to the one about classifiers, covers the following topics:
   Building  a  clusterer    batch  and  incremental  learning.
   Evaluating  a  clusterer    how  to  evaluate  a  built  clusterer.
   Clustering  instances     determining  what  clusters  unknown  instances  be-
long  to.
Fully  functional  example  classes  are  located  in  the  wekaexamples.clusterers
package  of  the  Weka  Examples  collection[3].
16.7.1   Building  a  clusterer
Clusterers, just like classifiers, are by design batch-trainable as well. They all can be built on data that is completely stored in memory. But a small subset of the cluster algorithms can also update the internal representation incrementally. The following two sections cover both types of clusterers.
Batch  clusterers
Building a batch clusterer, just like a classifier, happens in two stages:
   set   options     either   calling  the   setOptions(String[])  method  or   the
appropriate  set-methods  of  the  properties.
   build the model with training data    calling the buildClusterer(Instances) method. By definition, subsequent calls of this method must result in the same model (repeatable experiments). In other words, calling this method must completely reset the model.
Below  is  an  example  of  building  the  EM  clusterer  with  a  maximum  of  100  itera-
tions.   The  options  are  set  using  the  setOptions(String[]) method:
import   weka.clusterers.EM;
import   weka.core.Instances;
...
Instances  data   =   ...   //   from   somewhere
String[]  options   =   new   String[2];
options[0]  =   "-I";   //   max.   iterations
options[1]  =   "100";
EM   clusterer  =   new   EM();   //   new   instance  of   clusterer
clusterer.setOptions(options);   //   set   the   options
clusterer.buildClusterer(data);   //   build   the   clusterer
Incremental   clusterers
Incremental clusterers in WEKA implement the interface UpdateableClusterer (package weka.clusterers). Training an incremental clusterer happens in three stages, similar to incremental classifiers:
1.   initialize the model by calling the buildClusterer(Instances) method. Once again, one can either use an empty weka.core.Instances object or one with an initial set of data.
2.   update the model row by row, by calling the updateClusterer(Instance) method.
3.   finish the training by calling the updateFinished() method, in case the cluster algorithm needs to perform computationally expensive post-processing or clean-up operations.
An ArffLoader is used in the following example to build the Cobweb clusterer
incrementally:
import   weka.clusterers.Cobweb;
import   weka.core.Instance;
import   weka.core.Instances;
import   weka.core.converters.ArffLoader;
import   java.io.File;
...
//   load   data
ArffLoader  loader   =   new   ArffLoader();
loader.setFile(new  File("/some/where/data.arff"));
Instances  structure  =   loader.getStructure();
//   train   Cobweb
Cobweb   cw   =   new   Cobweb();
cw.buildClusterer(structure);
Instance  current;
while   ((current  =   loader.getNextInstance(structure))  !=   null)
cw.updateClusterer(current);
cw.updateFinished();
16.7.2   Evaluating  a  clusterer
Evaluation of clusterers is not as comprehensive as the evaluation of classifiers. Since clustering is unsupervised, it is also a lot harder to determine how good a model is. The class used for evaluating cluster algorithms is ClusterEvaluation (package weka.clusterers).
In  order  to  generate  the  same  output  as  the  Explorer  or  the  command-line,
one  can  use  the  evaluateClusterer method,  as  shown  below:
import   weka.clusterers.EM;
import   weka.clusterers.ClusterEvaluation;
...
String[]  options   =   new   String[2];
options[0]  =   "-t";
options[1]  =   "/some/where/somefile.arff";
System.out.println(ClusterEvaluation.evaluateClusterer(new  EM(),   options));
Or,   if   the  dataset  is  already  present  in  memory,   one  can  use  the  following  ap-
proach:
import   weka.clusterers.ClusterEvaluation;
import   weka.clusterers.EM;
import   weka.core.Instances;
...
Instances  data   =   ...   //   from   somewhere
EM   cl   =   new   EM();
cl.buildClusterer(data);
ClusterEvaluation  eval   =   new   ClusterEvaluation();
eval.setClusterer(cl);
eval.evaluateClusterer(new  Instances(data));
System.out.println(eval.clusterResultsToString());
Density-based clusterers, i.e., algorithms that implement the DensityBasedClusterer interface (package weka.clusterers), can be cross-validated and the log-likelihood obtained. Using the MakeDensityBasedClusterer meta-clusterer, any non-density-based clusterer can be turned into one (see the sketch after the following example). Here is an example of cross-validating a density-based clusterer and obtaining the log-likelihood:
import   weka.clusterers.ClusterEvaluation;
import   weka.clusterers.DensityBasedClusterer;
import   weka.core.Instances;
import   java.util.Random;
...
Instances  data   =   ...   //   from   somewhere
DensityBasedClusterer  clusterer  =   new   ...   //   the   clusterer  to   evaluate
double   logLikelihood  =
ClusterEvaluation.crossValidateModel(   //   cross-validate
clusterer,  data,   10,   //   with   10   folds
new   Random(1));   //   and   random   number   generator
//   with   seed   1
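And here is a sketch of wrapping a non-density-based clusterer with the MakeDensityBasedClusterer meta-clusterer; SimpleKMeans is used purely as an example:

import weka.clusterers.MakeDensityBasedClusterer;
import weka.clusterers.SimpleKMeans;
...
MakeDensityBasedClusterer dbc = new MakeDensityBasedClusterer();
dbc.setClusterer(new SimpleKMeans()); // wrap the non-density clusterer
double logLikelihood = ClusterEvaluation.crossValidateModel(
dbc, data, 10, new Random(1));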
Classes  to  clusters
Datasets for supervised algorithms, like classifiers, can be used to evaluate a clusterer as well. This evaluation is called classes-to-clusters, as the clusters are mapped back onto the classes.
This  type  of  evaluation  is  performed  as  follows:
1.   create a copy of the dataset containing the class attribute and remove the class attribute, using the Remove filter (this filter is located in package weka.filters.unsupervised.attribute).
2.   build  the  clusterer  with  this  new  data.
3.   evaluate  the  clusterer  now  with  the  original  data.
And  here  are  the  steps   translated  into  code,   using  EM  as   the  clusterer   being
evaluated:
1.   create  a  copy  of  data  without  class  attribute
Instances  data   =   ...   //   from   somewhere
Remove   filter   =   new   Remove();
filter.setAttributeIndices(""  +   (data.classIndex()  +   1));
filter.setInputFormat(data);
Instances  dataClusterer  =   Filter.useFilter(data,  filter);
2.   build  the  clusterer
EM   clusterer  =   new   EM();
//   set   further   options   for   EM,   if   necessary...
clusterer.buildClusterer(dataClusterer);
3.   evaluate  the  clusterer
ClusterEvaluation  eval   =   new   ClusterEvaluation();
eval.setClusterer(clusterer);
eval.evaluateClusterer(data);
//   print   results
System.out.println(eval.clusterResultsToString());
16.7.3   Clustering  instances
Clustering of instances is very similar to classifying unknown instances when using classifiers. The following methods are involved:
   clusterInstance(Instance)  determines the cluster the Instance would
belong  to.
   distributionForInstance(Instance)    predicts the cluster membership for this Instance. The values of this array sum up to 1.
The  code  fragment  outlined  below  trains  an  EM  clusterer  on  one  dataset  and
outputs  for  a  second  dataset  the  predicted  clusters  and  cluster  memberships  of
the  individual  instances:
import   weka.clusterers.EM;
import   weka.core.Instances;
import   weka.core.Utils;
...
Instances  dataset1   =   ...   //   from   somewhere
Instances  dataset2   =   ...   //   from   somewhere
//   build   clusterer
EM   clusterer  =   new   EM();
clusterer.buildClusterer(dataset1);
//   output   predictions
System.out.println("#  -   cluster  -   distribution");
for   (int   i   =   0;   i   <   dataset2.numInstances();  i++)   {
int   cluster   =   clusterer.clusterInstance(dataset2.instance(i));
double[]  dist   =   clusterer.distributionForInstance(dataset2.instance(i));
System.out.print((i+1));
System.out.print("  -   ");
System.out.print(cluster);
System.out.print("  -   ");
System.out.print(Utils.arrayToString(dist));
System.out.println();
}
16.8   Selecting  attributes
Preparing one's data properly is a very important step for getting the best results. Reducing the number of attributes can not only speed up the runtime of algorithms (some algorithms' runtime is quadratic in the number of attributes), but also helps avoid burying the algorithm in a mass of attributes when only a few are essential for building a good model.
There are three different types of evaluators in WEKA at the moment:
   single attribute evaluators   perform evaluations on single attributes.   These
classes  implement  the  weka.attributeSelection.AttributeEvaluator
interface.   The Ranker search algorithm is usually used in conjunction with
these  algorithms.
   attribute  subset   evaluators     work  on  subsets  of   all   the  attributes  in  the
dataset.   The  weka.attributeSelection.SubsetEvaluator  interface  is
implemented  by  these  evaluators.
   attribute set evaluators    evaluate sets of attributes. Not to be confused with the subset evaluators, as these classes are derived from the weka.attributeSelection.AttributeSetEvaluator superclass.
Most  of  the  attribute  selection  schemes  currently  implemented  are  supervised,
i.e.,   they  require   a  dataset   with  a  class   attribute.   Unsupervised  evaluation
algorithms  are  derived  from  one  of  the  following  superclasses:
   weka.attributeSelection.UnsupervisedAttributeEvaluator
e.g.,  LatentSemanticAnalysis, PrincipalComponents
   weka.attributeSelection.UnsupervisedSubsetEvaluator
none  at  the  moment
Attribute selection offers on-the-fly filtering, like classifiers and clusterers, as well:
   weka.attributeSelection.FilteredAttributeEval    filter for evaluators that evaluate attributes individually.
   weka.attributeSelection.FilteredSubsetEval    for filtering evaluators that evaluate subsets of attributes.
So much for the differences among the various attribute selection algorithms; back to how to actually perform attribute selection. WEKA offers three different approaches:
   Using a meta-classifier    for performing attribute selection on-the-fly (similar to the FilteredClassifier's on-the-fly filtering).
   Using a filter    for preprocessing the data.
   Low-level API usage    instead of using the meta-schemes (classifier or filter), one can use the attribute selection API directly as well.
The following sections cover each of these topics, accompanied by a code example. For clarity, the same evaluator and search algorithm are used in all of these examples.
Feel free to check out the example classes of the Weka Examples  collection[3],
located  in  the  wekaexamples.attributeSelection package.
16.8.1   Using the meta-classifier
The meta-classifier AttributeSelectedClassifier (this classifier is located in package weka.classifiers.meta) is similar to the FilteredClassifier. But instead of taking a base-classifier and a filter as parameters to perform the filtering, the AttributeSelectedClassifier uses a search algorithm (derived from weka.attributeSelection.ASSearch) and an evaluator (superclass weka.attributeSelection.ASEvaluation) to perform the attribute selection, and a base-classifier to train on the reduced data.
This example uses J48 as base-classifier, CfsSubsetEval as evaluator and a backwards-operating GreedyStepwise as search method:
import   weka.attributeSelection.CfsSubsetEval;
import   weka.attributeSelection.GreedyStepwise;
import   weka.classifiers.Evaluation;
import   weka.classifiers.meta.AttributeSelectedClassifier;
import   weka.classifiers.trees.J48;
import   weka.core.Instances;
import   java.util.Random;
...
Instances  data   =   ...   //   from   somewhere
//   setup   meta-classifier
AttributeSelectedClassifier  classifier  =   new   AttributeSelectedClassifier();
CfsSubsetEval  eval   =   new   CfsSubsetEval();
GreedyStepwise  search   =   new   GreedyStepwise();
search.setSearchBackwards(true);
J48   base   =   new   J48();
classifier.setClassifier(base);
classifier.setEvaluator(eval);
classifier.setSearch(search);
//   cross-validate  classifier
Evaluation  evaluation  =   new   Evaluation(data);
evaluation.crossValidateModel(classifier,  data,   10,   new   Random(1));
System.out.println(evaluation.toSummaryString());
16.8.2   Using the filter
In case the data only needs to be reduced in dimensionality, but not used for training a classifier, then the filter approach is the right one. The AttributeSelection filter (package weka.filters.supervised.attribute) takes an evaluator and a search algorithm as parameters.
The code snippet below once again uses CfsSubsetEval as evaluator and a backwards-operating GreedyStepwise as search algorithm. It just outputs the reduced data to stdout after the filtering step:
import   weka.attributeSelection.CfsSubsetEval;
import   weka.attributeSelection.GreedyStepwise;
import   weka.core.Instances;
import   weka.filters.Filter;
import   weka.filters.supervised.attribute.AttributeSelection;
...
Instances  data   =   ...   //   from   somewhere
//   setup   filter
AttributeSelection  filter   =   new   AttributeSelection();
CfsSubsetEval  eval   =   new   CfsSubsetEval();
GreedyStepwise  search   =   new   GreedyStepwise();
search.setSearchBackwards(true);
filter.setEvaluator(eval);
filter.setSearch(search);
filter.setInputFormat(data);
//   filter   data
Instances  newData   =   Filter.useFilter(data,  filter);
System.out.println(newData);
16.8.3   Using  the  API  directly
Using the meta-classifier or the filter approach makes attribute selection fairly easy, but it might not satisfy everybody's needs: for instance, if one wants to obtain the ordering of the attributes (using Ranker; see the sketch at the end of this section) or retrieve the indices of the selected attributes instead of the reduced data.
Just  like  the  other  examples,   the  one  shown  here  uses  the  CfsSubsetEval
evaluator and the  GreedyStepwise search algorithm (in backwards mode).   But
instead  of  outputting  the  reduced  data,  only  the  selected  indices  are  printed  in
the  console:
import   weka.attributeSelection.AttributeSelection;
import   weka.attributeSelection.CfsSubsetEval;
import   weka.attributeSelection.GreedyStepwise;
import   weka.core.Instances;
import   weka.core.Utils;
...
Instances  data   =   ...   //   from   somewhere
//   setup   attribute  selection
AttributeSelection  attsel   =   new   AttributeSelection();
CfsSubsetEval  eval   =   new   CfsSubsetEval();
GreedyStepwise  search   =   new   GreedyStepwise();
search.setSearchBackwards(true);
attsel.setEvaluator(eval);
attsel.setSearch(search);
//   perform   attribute  selection
attsel.SelectAttributes(data);
int[]   indices   =   attsel.selectedAttributes();
System.out.println(
"selected  attribute  indices   (starting  with   0):\n"
+   Utils.arrayToString(indices));
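And here is a sketch of obtaining a ranking instead. Since CfsSubsetEval is a subset evaluator, it cannot be combined with Ranker; InfoGainAttributeEval is used here purely as an example of a single-attribute evaluator:

import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
...
AttributeSelection ranked = new AttributeSelection();
ranked.setEvaluator(new InfoGainAttributeEval());
ranked.setSearch(new Ranker());
ranked.SelectAttributes(data);
// {attribute index, merit} pairs, sorted by merit
double[][] ranking = ranked.rankedAttributes();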
16.9   Saving  data
Saving weka.core.Instances objects is as easy as reading the data in the first place, though the process of storing the data again is far less common than that of reading the data into memory. The following two sections cover how to save the data in files and in databases.
Just like with loading the data in section 16.2, example classes for saving data can be found in the wekaexamples.core.converters package of the Weka Examples collection[3].
16.9.1   Saving data to files
Once again, one can either let WEKA choose the appropriate converter for saving the data or use an explicit converter (all savers are located in the weka.core.converters package). The latter approach is necessary if the file name under which the data will be stored does not have an extension that WEKA recognizes.
Use the DataSink class (inner class of weka.core.converters.ConverterUtils) if the extension is not a problem. Here are a few examples:
import   weka.core.Instances;
import   weka.core.converters.ConverterUtils.DataSink;
...
//   data   structure  to   save
Instances  data   =   ...
//   save   as   ARFF
DataSink.write("/some/where/data.arff",  data);
//   save   as   CSV
DataSink.write("/some/where/data.csv",  data);
And  here  is  an  example  of  using  the  CSVSaver converter explicitly:
import   weka.core.Instances;
import   weka.core.converters.CSVSaver;
import   java.io.File;
...
//   data   structure  to   save
Instances  data   =   ...
//   save   as   CSV
CSVSaver  saver   =   new   CSVSaver();
saver.setInstances(data);
saver.setFile(new  File("/some/where/data.csv"));
saver.writeBatch();
16.9.2   Saving  data  to  databases
Apart from the KnowledgeFlow, saving to databases is not very obvious in WEKA, unless one knows about the DatabaseSaver converter. Just like the DatabaseLoader, its saver counterpart can store the data either in batch mode or incrementally.
The first example shows how to save the data in batch mode, which is the easier way of doing it:
import   weka.core.Instances;
import   weka.core.converters.DatabaseSaver;
...
//   data   structure  to   save
Instances  data   =   ...
//   store   data   in   database
DatabaseSaver  saver   =   new   DatabaseSaver();
saver.setDestination("jdbc_url",  "the_user",  "the_password");
//   we   explicitly  specify  the   table   name   here:
saver.setTableName("whatsoever2");
saver.setRelationForTableName(false);
//   or   we   could   just   update   the   name   of   the   dataset:
//   saver.setRelationForTableName(true);
//   data.setRelationName("whatsoever2");
saver.setInstances(data);
saver.writeBatch();
Saving the data incrementally requires a bit more work, as one has to specify that writing the data is done incrementally (using the setRetrieval method), as well as notify the saver when all the data has been saved:
import   weka.core.Instances;
import   weka.core.converters.DatabaseSaver;
...
//   data   structure  to   save
Instances  data   =   ...
//   store   data   in   database
DatabaseSaver  saver   =   new   DatabaseSaver();
saver.setDestination("jdbc_url",  "the_user",  "the_password");
//   we   explicitly  specify  the   table   name   here:
saver.setTableName("whatsoever2");
saver.setRelationForTableName(false);
//   or   we   could   just   update   the   name   of   the   dataset:
//   saver.setRelationForTableName(true);
//   data.setRelationName("whatsoever2");
saver.setRetrieval(DatabaseSaver.INCREMENTAL);
saver.setStructure(data);
for   (int   i   =   0;   i   <   data.numInstances();  i++)   {
saver.writeIncremental(data.instance(i));
}
//   notify   saver   that   we're   finished
saver.writeIncremental(null);
16.10   Visualization
The concepts covered in this section are also available through the example classes of the Weka Examples collection[3]. See the following packages:
   wekaexamples.gui.graphvisualizer
   wekaexamples.gui.treevisualizer
   wekaexamples.gui.visualize
16.10.1   ROC  curves
WEKA can generate Receiver Operating Characteristic (ROC) curves, based on the predictions collected during the evaluation of a classifier. In order to display a ROC curve, one needs to perform the following steps:
1.   Generate the plottable data based on the Evaluation's collected predictions, using the ThresholdCurve class (package weka.classifiers.evaluation).
2.   Put the plottable data into a plot container, an instance of the PlotData2D
class  (package  weka.gui.visualize).
3.   Add the plot container to a visualization panel for displaying the data, an
instance of the ThresholdVisualizePanelclass (package weka.gui.visualize).
4.   Add  the  visualization  panel  to  a  JFrame (package  javax.swing) and  dis-
play  it.
And  now,  the  four  steps  translated  into  actual  code:
1.   Generate the plottable data
Evaluation  eval   =   ...   //   from   somewhere
ThresholdCurve  tc   =   new   ThresholdCurve();
int   classIndex  =   0;   //   ROC   for   the   1st   class   label
Instances  curve   =   tc.getCurve(eval.predictions(),  classIndex);
2.   Put the plottable data into a plot container
PlotData2D  plotdata   =   new   PlotData2D(curve);
plotdata.setPlotName(curve.relationName());
plotdata.addInstanceNumberAttribute();
3.   Add  the  plot  container  to  a  visualization  panel
ThresholdVisualizePanel  tvp   =   new   ThresholdVisualizePanel();
tvp.setROCString("(Area  under   ROC   =   "   +
Utils.doubleToString(ThresholdCurve.getROCArea(curve),4)+")");
tvp.setName(curve.relationName());
tvp.addPlot(plotdata);
4.   Add  the  visualization  panel  to  a  JFrame
final   JFrame   jf   =   new   JFrame("WEKA  ROC:   "   +   tvp.getName());
jf.setSize(500,400);
jf.getContentPane().setLayout(new  BorderLayout());
jf.getContentPane().add(tvp,  BorderLayout.CENTER);
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setVisible(true);
16.10.2   Graphs
Classes implementing the weka.core.Drawable interface can generate graphs of their internal models which can be displayed. There are two different types of graphs available at the moment, which are explained in the subsequent sections:
   Tree    decision trees.
   BayesNet    Bayesian network graph structures.
16.10.2.1   Tree
It is quite easy to display the internal tree structure of classifiers like J48 or M5P (package weka.classifiers.trees). The following example builds a J48 classifier on a dataset and displays the generated tree visually using the TreeVisualizer class (package weka.gui.treevisualizer). This visualization class can be used to view trees (or digraphs) in GraphViz's DOT language[26].
import   weka.classifiers.trees.J48;
import   weka.core.Instances;
import   weka.gui.treevisualizer.PlaceNode2;
import   weka.gui.treevisualizer.TreeVisualizer;
import   java.awt.BorderLayout;
import   javax.swing.JFrame;
...
Instances  data   =   ...   //   from   somewhere
//   train   classifier
J48   cls   =   new   J48();
cls.buildClassifier(data);
//   display   tree
TreeVisualizer  tv   =   new   TreeVisualizer(
null,   cls.graph(),  new   PlaceNode2());
JFrame   jf   =   new   JFrame("Weka  Classifier  Tree   Visualizer:  J48");
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setSize(800,  600);
jf.getContentPane().setLayout(new  BorderLayout());
jf.getContentPane().add(tv,  BorderLayout.CENTER);
jf.setVisible(true);
//   adjust   tree
tv.fitToScreen();
16.10.2.2   BayesNet
The graphs that the BayesNet classifier (package weka.classifiers.bayes) generates can be displayed using the GraphVisualizer class (located in package weka.gui.graphvisualizer). The GraphVisualizer can display graphs that are either in GraphViz's DOT language[26] or in XML BIF[20] format. For displaying the DOT format, one needs to use the method readDOT, and for the BIF format the method readBIF.
The  following  code  snippet  trains  a  BayesNet  classier  on  some  data  and
then  displays  the  graph  generated  from  this  data  in  a  frame:
import   weka.classifiers.bayes.BayesNet;
import   weka.core.Instances;
import   weka.gui.graphvisualizer.GraphVisualizer;
import   java.awt.BorderLayout;
import   javax.swing.JFrame;
...
Instances  data   =   ...   //   from   somewhere
//   train   classifier
BayesNet  cls   =   new   BayesNet();
cls.buildClassifier(data);
//   display   graph
GraphVisualizer  gv   =   new   GraphVisualizer();
gv.readBIF(cls.graph());
JFrame   jf   =   new   JFrame("BayesNet  graph");
jf.setDefaultCloseOperation(JFrame.DISPOSE_ON_CLOSE);
jf.setSize(800,  600);
jf.getContentPane().setLayout(new  BorderLayout());
jf.getContentPane().add(gv,  BorderLayout.CENTER);
jf.setVisible(true);
//   layout   graph
gv.layoutGraph();
16.11   Serialization
Serialization² is the process of saving an object in a persistent form, e.g., on the hard disk as a bytestream. Deserialization is the process in the opposite direction, creating an object from a persistently saved data structure. In Java, an object can be serialized if it implements the java.io.Serializable interface. Members of an object that are not supposed to be serialized need to be declared with the keyword transient.
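A minimal, hypothetical sketch of both points:

public class Model implements java.io.Serializable {
  private double[] weights; // serialized with the object
  private transient StringBuilder log; // skipped during serialization
}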
The following are some Java code snippets for serializing and deserializing a J48 classifier. Of course, serialization is not limited to classifiers. Most schemes in WEKA, like clusterers and filters, are also serializable.
Serializing a classifier
The  weka.core.SerializationHelper  class  makes  it  easy  to  serialize  an  ob-
ject.   For  saving,  one  can  use  one  of  the  write  methods:
import   weka.classifiers.Classifier;
import   weka.classifiers.trees.J48;
import   weka.core.Instances;
import   weka.core.converters.ConverterUtils.DataSource;
import   weka.core.SerializationHelper;
...
//   load   data
Instances  inst   =   DataSource.read("/some/where/data.arff");
inst.setClassIndex(inst.numAttributes()  -   1);
//   train   J48
Classifier  cls   =   new   J48();
cls.buildClassifier(inst);
//   serialize  model
SerializationHelper.write("/some/where/j48.model",  cls);
Deserializing a classifier
Deserializing  an  object  can  be  achieved  by  using  one  of  the  read  methods:
import   weka.classifiers.Classifier;
import   weka.core.SerializationHelper;
...
//   deserialize  model
Classifier  cls   =   (Classifier)  SerializationHelper.read(
"/some/where/j48.model");
Deserializing a classifier saved from the Explorer
The Explorer does not only save the built classifier in the model file, but
also the header information of the dataset the classifier was built with. By
storing the dataset information as well, one can easily check whether a
serialized classifier can be applied to the current dataset. The readAll
method returns an array with all objects that are contained in the model file.
import   weka.classifiers.Classifier;
import   weka.core.Instances;
import   weka.core.SerializationHelper;
...
//   the   current   data   to   use   with   classifier
Instances  current   =   ...   //   from   somewhere
//   deserialize  model
Object   o[]   =   SerializationHelper.readAll("/some/where/j48.model");
Classifier  cls   =   (Classifier)  o[0];
Instances  data   =   (Instances)  o[1];
//   is   the   data   compatible?
if   (!data.equalHeaders(current))
throw   new   Exception("Incompatible  data!");
Serializing a classifier for the Explorer
If one wants to serialize the dataset header information alongside the
classifier, just like the Explorer does, then one can use one of the writeAll
methods:
import weka.classifiers.Classifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.SerializationHelper;
...
//   load   data
Instances  inst   =   DataSource.read("/some/where/data.arff");
inst.setClassIndex(inst.numAttributes()  -   1);
//   train   J48
Classifier  cls   =   new   J48();
cls.buildClassifier(inst);
//   serialize  classifier  and   header   information
Instances  header   =   new   Instances(inst,  0);
SerializationHelper.writeAll(
"/some/where/j48.model",  new   Object[]{cls,  header});
Chapter  17
Extending  WEKA
For most users, the existing WEKA framework will be sufficient to perform
the task at hand, offering a wide range of filters, classifiers, clusterers,
etc. Researchers, on the other hand, might want to add new algorithms and
compare them against existing ones. The framework with its existing algorithms
is not set in stone, but basically one big plugin framework. With WEKA's
automatic discovery of classes on the classpath, adding new classifiers,
filters, etc. to the existing framework is very easy.
Though algorithms like clusterers, associators, data generators and attribute
selection are not covered in this chapter, their implementation is very
similar to that of a classifier. You basically choose a superclass to derive
your new algorithm from and then implement additional interfaces, if
necessary. Just check out the other algorithms that are already implemented.
The section covering the GenericObjectEditor (see chapter 18.4) shows you
how to tell WEKA where to find your class(es) and therefore make it/them
available in the GUI (Explorer/Experimenter) via the GenericObjectEditor.
17.1 Writing a new Classifier
17.1.1   Choosing  the  base  class
The ancestor of all classifiers in WEKA is weka.classifiers.Classifier, an
abstract class. Your new classifier must be derived from this class at least
to be visible through the GenericObjectEditor. But in order to make
implementations of new classifiers even easier, WEKA already comes with a
range of other abstract classes derived from weka.classifiers.Classifier. In
the following you will find an overview that will help you decide what base
class to use for your classifier. For better readability, the
weka.classifiers prefix was dropped from the class names:
- simple classifier
  - Classifier – not randomizable
  - RandomizableClassifier – randomizable
- meta classifier
  - single base classifier
    - SingleClassifierEnhancer – not randomizable, not iterated
    - RandomizableSingleClassifierEnhancer – randomizable, not iterated
    - IteratedSingleClassifierEnhancer – not randomizable, iterated
    - RandomizableIteratedSingleClassifierEnhancer – randomizable, iterated
  - multiple base classifiers
    - MultipleClassifiersCombiner – not randomizable
    - RandomizableMultipleClassifiersCombiner – randomizable
If you are still unsure about what superclass to choose, then check out the
Javadoc of those superclasses. In the Javadoc you will find all the
classifiers that are derived from each of them, which should give you a
better idea of whether a particular superclass is suited for your needs.
17.1.2   Additional  interfaces
The abstract classes listed above basically just implement various
combinations of the following two interfaces:
- weka.core.Randomizable – to allow (seeded) randomization to take place
- weka.classifiers.IterativeClassifier – to make the classifier an iterated
  one
But these interfaces are not the only ones that can be implemented by a
classifier. Here is a list of further interfaces (a sketch of a classifier
declaring some of them follows the list):
- weka.core.AdditionalMeasureProducer – the classifier returns additional
  measures, e.g., J48 returns the tree size this way.
- weka.core.WeightedInstancesHandler – denotes that the classifier can
  make use of weighted Instance objects (the default weight of an Instance
  is 1.0).
- weka.core.TechnicalInformationHandler – for returning paper references
  and publications this classifier is based on.
- weka.classifiers.Sourcable – classifiers implementing this interface
  can return Java code of a built model, which can be used elsewhere.
- weka.classifiers.UpdateableClassifier – for classifiers that can be
  trained incrementally, i.e., row by row, like NaiveBayesUpdateable.
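As a rough sketch (the class name and publication details are made up), a
classifier declaring two of these interfaces could look like this:

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformationHandler;
import weka.core.WeightedInstancesHandler;

public class FunkyClassifier extends Classifier
  implements WeightedInstancesHandler, TechnicalInformationHandler {

  public TechnicalInformation getTechnicalInformation() {
    TechnicalInformation result = new TechnicalInformation(Type.ARTICLE);
    result.setValue(Field.AUTHOR, "J. Doe");
    result.setValue(Field.TITLE, "A funky classification algorithm");
    result.setValue(Field.YEAR, "2010");
    return result;
  }

  public void buildClassifier(Instances data) throws Exception {
    // instance weights are available via data.instance(i).weight()
    ...
  }
}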
17.1.3   Packages
A few comments about the different sub-packages in the weka.classifiers
package:
- bayes – contains Bayesian classifiers, e.g., NaiveBayes
- evaluation – classes related to evaluation, e.g., confusion matrix,
  threshold curve (= ROC)
- functions – e.g., support vector machines, regression algorithms, neural
  nets
- lazy – learning is performed at prediction time, e.g., k-nearest neighbor
  (k-NN)
- meta – meta-classifiers that use one or more base classifiers as input,
  e.g., boosting, bagging or stacking
- mi – classifiers that handle multi-instance data
- misc – various classifiers that don't fit in any other category
- rules – rule-based classifiers, e.g., ZeroR
- trees – tree classifiers, like decision trees, with J48 a very common one
17.1.4   Implementation
In the following you will find information on what methods need to be
implemented, along with other coding guidelines for methods, option handling
and documentation of the source code.
17.1.4.1   Methods
This section explains what methods need to be implemented in general, and
more specialized ones in case of meta-classifiers (either with single or
multiple base classifiers).
General
Here is an overview of methods that your new classifier needs to implement in
order to integrate nicely into the WEKA framework:
globalInfo()
returns a short description that is displayed in the GUI, like the Explorer
or Experimenter. How long this description will be is really up to you, but
it should be sufficient to understand the classifier's underlying algorithm.
If the classifier implements the weka.core.TechnicalInformationHandler
interface, then you could refer to the publication(s) by extending the
returned string with getTechnicalInformation().toString().
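A minimal sketch of such a method, here for the imaginary FunkyClassifier
used throughout this section:

public String globalInfo() {
  return "Implements the funky algorithm, a made-up scheme used here "
    + "purely for illustration. For more information, see\n\n"
    + getTechnicalInformation().toString();
}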
listOptions()
returns a java.util.Enumeration of weka.core.Option objects. This
enumeration is used to display the help on the command-line, hence it needs
to return the Option objects of the superclass as well.
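The following sketch lists the imaginary "alpha" option (the same option is
used in the setOptions/getOptions examples below) and appends the options of
the superclass:

import weka.core.Option;
import java.util.Enumeration;
import java.util.Vector;
...
public Enumeration listOptions() {
  Vector result = new Vector();
  result.addElement(new Option(
    "\tThe alpha parameter.\n\t(default: 0.75)",
    "alpha", 1, "-alpha <num>"));
  // append the options of the superclass
  Enumeration en = super.listOptions();
  while (en.hasMoreElements())
    result.addElement(en.nextElement());
  return result.elements();
}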
setOptions(String[])
parses the options that the classifier would receive from a command-line
invocation. A parameter and its argument are always two separate elements in
the string array. A common mistake is to use a single cell in the string
array for both of them, e.g., "-S 1" instead of "-S","1". You can use the
methods getOption and getFlag of the weka.core.Utils class to retrieve the
values of an option or to ascertain whether a flag is present. Note that
these calls remove the option and, if applicable, the argument from the
string array ("destructive"). The last call in the setOptions method should
always be super.setOptions(String[]), in order to pass on any other
arguments still present in the array to the superclass. The following code
snippet parses the only option, "alpha", that an imaginary classifier
defines:
import   weka.core.Utils;
...
public   void   setOptions(String[]   options)   throws   Exception   {
String   tmpStr   =   Utils.getOption("alpha",   options);
if   (tmpStr.length()   ==   0)   {
setAlpha(0.75);
}
else   {
setAlpha(Double.parseDouble(tmpStr));
}
super.setOptions(options);
}
getOptions()
returns a string array of command-line options that resemble the current
classifier setup. Supplying this array to the setOptions(String[]) method
must result in the same configuration. This method will get called in the GUI
when copying a classifier setup to the clipboard. Since handling of arrays is
a bit cumbersome in Java (due to fixed length), using an instance of
java.util.Vector is a lot easier for creating the array that needs to be
returned. The following code snippet adds the only option, "alpha", that the
classifier defines to the array that is being returned, including the options
of the superclass:
import   java.util.Arrays;
import   java.util.Vector;
...
public   String[]  getOptions()  {
Vector<String>  result   =   new   Vector<String>();
result.add("-alpha");
result.add(""  +   getAlpha());
result.addAll(Arrays.asList(super.getOptions()));  //   superclass
return   result.toArray(new  String[result.size()]);
}
Note that the getOptions() method requires you to add the preceding dash for
an option, as opposed to the getOption/getFlag calls in the setOptions
method.
getCapabilities()
returns meta-information on what type of data the classifier can handle, with
regard to attributes and class attributes. See section "Capabilities" on page
242 for more information.
buildClassifier(Instances)
builds the model from scratch with the provided dataset. Each subsequent call
of this method must result in the same model being built. The buildClassifier
method also tests whether the supplied data can be handled at all by the
classifier, utilizing the capabilities returned by the getCapabilities()
method:
public   void   buildClassifier(Instances  data)   throws   Exception  {
//   test   data   against   capabilities
getCapabilities().testWithFail(data);
//   remove   instances  with   missing   class   value,
//   but   don't   modify   original   data
data   =   new   Instances(data);
data.deleteWithMissingClass();
//   actual   model   generation
...
}
toString()
is used for outputting the built model. This is not required, but it is
useful for the user to see properties of the model. Decision trees normally
output the tree, support vector machines the support vectors, and rule-based
classifiers the generated rules.
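A minimal sketch, again for the imaginary FunkyClassifier (what exactly gets
appended depends entirely on the model):

public String toString() {
  StringBuffer result = new StringBuffer();
  result.append("FunkyClassifier\n");
  result.append("===============\n\n");
  result.append("alpha: " + getAlpha() + "\n");
  // a textual representation of the actual model would be appended here
  return result.toString();
}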
distributionForInstance(Instance)
returns the class probabilities array of the prediction for the given
weka.core.Instance object. If your classifier handles nominal class
attributes, then you need to override this method.
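A sketch of the method's general structure; a real classifier would fill the
array from its model instead of the uniform placeholder used here:

public double[] distributionForInstance(Instance instance) throws Exception {
  double[] result = new double[instance.numClasses()];
  // placeholder: a real model computes proper class probabilities;
  // the returned array should sum to 1
  for (int i = 0; i < result.length; i++)
    result[i] = 1.0 / result.length;
  return result;
}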
classifyInstance(Instance)
returns the classification or regression value for the given
weka.core.Instance object. In case of a nominal class attribute, this method
returns the index of the class label that got predicted. You do not need to
override this method in this case, as the weka.classifiers.Classifier
superclass already determines the class label index based on the
probabilities array that the distributionForInstance(Instance) method
returns (it returns the index in the array with the highest probability; in
case of ties the first one). For numeric class attributes, you need to
override this method, as it has to return the regression value predicted by
the model.
main(String[])
executes the classifier from the command-line. If your new algorithm is
called FunkyClassifier, then use the following code as your main method:
/**
*   Main   method   for   executing  this   classifier.
*
*   @param   args   the   options,  use   "-h"   to   display   options
*/
public   static   void   main(String[]  args)   {
runClassifier(new  FunkyClassifier(),  args);
}
Note: the static method runClassifier (defined in the abstract superclass
weka.classifiers.Classifier) handles all the appropriate calls and catches
and processes any exceptions as well.
Meta-classifiers
Meta-classifiers define a range of other methods that you might want to
override. Normally, this should not be the case. But if your classifier
requires the base classifier(s) to be of a certain type, you can override the
specific set-method and add additional checks.
SingleClassifierEnhancer
The following methods are used for handling the single base classifier of
this meta-classifier.
defaultClassifierString()
returns the class name of the classifier that is used as the default one for
this meta-classifier.
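A minimal sketch (ZeroR is merely a plausible choice of default here):

protected String defaultClassifierString() {
  return "weka.classifiers.rules.ZeroR";
}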
setClassifier(Classifier)
sets the classifier object. Override this method if you require further
checks, like that the classifier needs to be of a certain class. This is
necessary if you still want to allow the user to parametrize the base
classifier, but not choose another classifier with the GenericObjectEditor.
Be aware that this method does not create a copy of the provided classifier.
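A sketch of such an override, assuming the meta-classifier requires an
incrementally trainable base classifier:

public void setClassifier(Classifier value) {
  if (value instanceof weka.classifiers.UpdateableClassifier)
    super.setClassifier(value);
  else
    throw new IllegalArgumentException(
      "Classifier must implement UpdateableClassifier!");
}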
getClassifier()
returns the currently set classifier object. Note that this method returns
the internal object and not a copy.
MultipleClassifiersCombiner
This meta-classifier handles its multiple base classifiers with the following
methods:
setClassifiers(Classifier[])
sets the array of classifiers to use as base classifiers. If you require the
base classifiers to implement a certain interface or be of a certain class,
then override this method and add the necessary checks. Note that this method
does not create a copy of the array, but just uses the reference internally.
getClassifiers()
returns the array of classifiers that is in use. Careful: this method returns
the internal array and not a copy of it.
getClassifier(int)
returns the classifier from the internal classifier array specified by the
given index. Once again, this method does not return a copy of the
classifier, but the actual object used by this classifier.
17.1.4.2   Guidelines
WEKA's code base requires you to follow a few rules. The following sections
can be used as guidelines in writing your code.
Parameters
There are two different ways of setting/obtaining parameters of an algorithm.
Both of them are unfortunately completely independent, which makes option
handling prone to errors. Here are the two:
1. command-line options, using the setOptions/getOptions methods
2. using the properties through the GenericObjectEditor in the GUI
Each command-line option must have a corresponding GUI property and vice
versa. In case of GUI properties, the get- and set-method for a property must
comply with the Java Beans style in order to show up in the GUI. You need to
supply three methods for each property:
- public void set<PropertyName>(<Type>) – checks whether the supplied
  value is valid and only then updates the corresponding member variable. In
  any other case it should ignore the value and output a warning in the
  console or throw an IllegalArgumentException.
- public <Type> get<PropertyName>() – performs any necessary conversions
  of the internal value and returns it.
- public String <propertyName>TipText() – returns the help text that is
  available through the GUI. It should be the same as on the command-line.
  Note: everything after the first period "." gets truncated from the tool
  tip that pops up in the GUI when hovering with the mouse cursor over the
  field in the GenericObjectEditor.
With a property called "alpha" of type double, we get the following method
signatures:
- public void setAlpha(double)
- public double getAlpha()
- public String alphaTipText()
These get- and set-methods should be used in the getOptions and setOptions
methods as well, to impose the same checks when getting/setting parameters.
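A sketch of the three methods, assuming a member variable m_Alpha and a
made-up valid range of 0 to 1:

protected double m_Alpha = 0.75;

public void setAlpha(double value) {
  if ((value >= 0) && (value <= 1))
    m_Alpha = value;
  else
    System.err.println("alpha must be in [0,1], provided: " + value);
}

public double getAlpha() {
  return m_Alpha;
}

public String alphaTipText() {
  return "The alpha parameter of the funky algorithm.";
}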
Randomization
In order to get repeatable experiments, one is not allowed to use unseeded
random number generators like Math.random(). Instead, one has to instantiate
a java.util.Random object in the buildClassifier(Instances) method with a
specific seed value. The seed value can of course be user supplied, which all
the Randomizable... abstract classifiers already implement.
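A minimal sketch, assuming the user-supplied seed is stored in a member
variable m_Seed:

import java.util.Random;
...
public void buildClassifier(Instances data) throws Exception {
  // a seeded generator makes runs repeatable
  Random rand = new Random(m_Seed);
  ...
}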
Capabilities
By default, the weka.classifiers.Classifier superclass returns an object
that denotes that the classifier can handle any type of data. This is useful
for rapid prototyping of new algorithms, but also very dangerous. If you do
not specifically define what type of data can be handled by your classifier,
you can end up with meaningless models or errors. This can happen if you
devise a new classifier which is supposed to handle only numeric attributes.
By using the value(int/Attribute) method of a weka.core.Instance to obtain
the numeric value of an attribute, you also obtain the internal format of
nominal, string and relational attributes. Of course, treating these
attribute types as numeric ones does not make any sense. Hence it is highly
recommended (and required for contributions) to override this method in your
own classifier.
There are three different types of capabilities that you can define:
1. attribute related – e.g., nominal, numeric, date, missing values, ...
2. class attribute related – e.g., no-class, nominal, numeric, missing class
   values, ...
3. miscellaneous – e.g., only multi-instance data, minimum number of
   instances in the training data
There are some special cases:
- incremental classifiers – need to set the minimum number of instances in
  the training data to 0, since the default is 1:
  setMinimumNumberInstances(0)
- multi-instance classifiers – in order to signal that the special
  multi-instance format (bag-id, bag-data, class) is used, they need to
  enable the following capability:
  enable(Capability.ONLY_MULTIINSTANCE)
  These classifiers also need to implement the multi-instance-specific
  interface weka.core.MultiInstanceCapabilitiesHandler, which returns the
  capabilities for the bag-data.
- cluster algorithms – since clusterers are unsupervised algorithms, they
  cannot process data with the class attribute set. The capability that
  denotes that an algorithm can handle data without a class attribute is
  Capability.NO_CLASS
A note on enabling/disabling nominal attributes or nominal class attributes:
these operations automatically enable/disable the binary, unary and empty
nominal capabilities as well. The following sections list a few examples of
how to configure the capabilities.
Simple classifier
A classifier that handles only numeric classes, and numeric and nominal
attributes, but no missing values at all, would configure the Capabilities
object like this:
public   Capabilities  getCapabilities()  {
Capabilities  result   =   new   Capabilities(this);
//   attributes
result.enable(Capability.NOMINAL_ATTRIBUTES);
result.enable(Capability.NUMERIC_ATTRIBUTES);
//   class
result.enable(Capability.NUMERIC_CLASS);
return   result;
}
Another classifier, which only handles binary classes and only nominal
attributes and missing values, would implement the getCapabilities() method
as follows:
public   Capabilities  getCapabilities()  {
Capabilities  result   =   new   Capabilities(this);
//   attributes
result.enable(Capability.NOMINAL_ATTRIBUTES);
result.enable(Capability.MISSING_VALUES);
//   class
result.enable(Capability.BINARY_CLASS);
result.disable(Capability.UNARY_CLASS);
result.enable(Capability.MISSING_CLASS_VALUES);
return   result;
}
Meta-classifier
Meta-classifiers, by default, just return the capabilities of their base
classifiers; in case of descendants of
weka.classifiers.MultipleClassifiersCombiner, an AND over all the
Capabilities of the base classifiers is returned.
Due to this behavior, the capabilities normally depend only on the currently
configured base classifier(s). To soften filtering for certain behavior,
meta-classifiers also define so-called "Dependencies" on a per-Capability
basis. These dependencies tell the filter that even though a certain
capability is not supported right now, it is possible that it will be
supported with a different base classifier. By default, all capabilities are
initialized as "Dependencies".
weka.classifiers.meta.LogitBoost, e.g., is restricted to nominal classes.
For that reason it disables the Dependencies for the class:
result.disableAllClasses();   //   disable   all   class   types
result.disableAllClassDependencies();   //   no   dependencies!
result.enable(Capability.NOMINAL_CLASS);   //   only   nominal   classes   allowed
Javadoc
In order to keep code quality high and maintenance low, source code needs to
be well documented. This includes the following Javadoc requirements:
- class
  - description of the classifier
  - listing of command-line parameters
  - publication(s), if applicable
  - @author and @version tag
- methods (all, not just public)
  - each parameter is documented
  - return value, if applicable, is documented
  - exception(s) are documented
  - the setOptions(String[]) method also lists the command-line parameters
Most of the class-related and the setOptions Javadoc is already available
through the source code:
- description of the classifier – globalInfo()
- listing of command-line parameters – listOptions()
- publication(s), if applicable – getTechnicalInformation()
In order to avoid manual syncing between Javadoc and source code, WEKA comes
with some tools for updating the Javadoc automatically. The following tools
take a concrete class and update its source code (the source code directory
needs to be supplied as well, of course):
- weka.core.AllJavadoc – executes all Javadoc-producing classes (this is
  the tool you would normally use)
- weka.core.GlobalInfoJavadoc – updates the globalinfo tags
- weka.core.OptionHandlerJavadoc – updates the option tags
- weka.core.TechnicalInformationHandlerJavadoc – updates the technical
  tags (plain text and BibTeX)
These tools look for specific comment tags in the source code and replace
everything in between the start and end tag with the documentation obtained
from the actual class.
- description of the classifier
  <!-- globalinfo-start -->
  will be automatically replaced
  <!-- globalinfo-end -->
- listing of command-line parameters
  <!-- options-start -->
  will be automatically replaced
  <!-- options-end -->
- publication(s), if applicable
  <!-- technical-bibtex-start -->
  will be automatically replaced
  <!-- technical-bibtex-end -->
  for a shortened, plain-text version use the following:
  <!-- technical-plaintext-start -->
  will be automatically replaced
  <!-- technical-plaintext-end -->
Here is a template of a Javadoc class block for an imaginary classifier that
also implements the weka.core.TechnicalInformationHandler interface:
/**
<!--   globalinfo-start  -->
<!--   globalinfo-end  -->
*
<!--   technical-bibtex-start  -->
<!--   technical-bibtex-end  -->
*
<!--   options-start  -->
<!--   options-end  -->
*
*   @author   John   Doe   (john   dot   doe   at   no   dot   where   dot   com)
*   @version  $Revision:  6192   $
*/
The template for any classifier's setOptions(String[]) method is as follows:
/**
*   Parses   a   given   list   of   options.
*
<!--   options-start  -->
<!--   options-end  -->
*
*   @param   options   the   list   of   options   as   an   array   of   strings
*   @throws   Exception  if   an   option   is   not   supported
*/
Running the weka.core.AllJavadoc tool over this code will output code with
the comments filled out accordingly.
Revisions
Classifiers implement the weka.core.RevisionHandler interface. This provides
the functionality of obtaining the Subversion revision from within Java.
Classifiers that are not part of the official WEKA distribution do not have
to implement the method getRevision() as the weka.classifiers.Classifier
class already implements this method. Contributions, on the other hand, need
to implement it as follows, in order to obtain the revision of this
particular source file:
/**
*   Returns   the   revision   string.
*
*   @return   the   revision
*/
public   String   getRevision()  {
return   RevisionUtils.extract("$Revision:  6192   $");
}
Note: a commit into Subversion will replace the revision number above with
the actual revision number.
Testing
WEKA already provides a test framework to ensure correct basic functionality
of a classifier. It is essential for the classifier to pass these tests.
Option  handling
You can check the option handling of your classifier with the following tool
from the command-line:
weka.core.CheckOptionHandler -W classname [-- additional parameters]
All tests need to return "yes".
GenericObjectEditor
The CheckGOE class checks whether all the properties available in the GUI
have a tooltip accompanying them and whether the globalInfo() method is
declared:
weka.core.CheckGOE -W classname [-- additional parameters]
All tests, once again, need to return "yes".
Source  code
Classifiers that implement the weka.classifiers.Sourcable interface can
output Java code of the built model. In order to check the generated code,
one should not only compile the code, but also test it with the following
test class:
weka.classifiers.CheckSource
This class takes the original WEKA classifier, the generated code and the
dataset used for generating the model (and an optional class index) as
parameters. It builds the WEKA classifier on the dataset and compares its
output with the output of the generated source code, checking whether they
are the same.
Here is an example call for weka.classifiers.trees.J48 and the generated
class weka.classifiers.WEKAWrapper (it wraps the actual generated code in a
pseudo-classifier):
java   weka.classifiers.CheckSource  \
-W   weka.classifiers.trees.J48  \
-S   weka.classifiers.WEKAWrapper  \
-t   data.arff
It needs to return "Tests OK!".
Unit  tests
In order to make sure that your classifier conforms to the WEKA criteria, you
should add it to the junit unit test framework, i.e., by creating a Test
class. The superclass for classifier unit tests is
weka.classifiers.AbstractClassifierTest.
17.2   Writing  a  new  Filter
The work horses of preprocessing in WEKA are filters. They perform many
tasks, from resampling data to deleting and standardizing attributes. In the
following, two different approaches are covered that explain in detail how to
implement a new filter:
- default – this is how filters had to be implemented in the past.
- simple – since there are mainly two types of filters, batch or stream,
  additional abstract classes were introduced to speed up the implementation
  process.
17.2.1   Default  approach
The default approach is the most flexible, but also the most complicated one
for writing a new filter. This approach has to be used if the filter cannot
be written using the simple approach described further below.
17.2.1.1   Implementation
The following methods are of importance for the implementation of a filter
and are explained in detail further down. It is also a good idea to study the
Javadoc of these methods as declared in the weka.filters.Filter class:
- getCapabilities()
- setInputFormat(Instances)
- getInputFormat()
- setOutputFormat(Instances)
- getOutputFormat()
- input(Instance)
- bufferInput(Instance)
- push(Instance)
- output()
- batchFinished()
- flushInput()
- getRevision()
But only the following ones normally need to be modified:
- getCapabilities()
- setInputFormat(Instances)
- input(Instance)
- batchFinished()
- getRevision()
For more information on Capabilities, see section 17.2.3. Please note that
the weka.filters.Filter superclass does not implement the
weka.core.OptionHandler interface. See section "Option handling" on page 249.
setInputFormat(Instances)
With this call, the user tells the filter what structure, i.e., attributes,
the input data has. This method also tests whether the filter can actually
process this data, according to the capabilities specified in the
getCapabilities() method.
If the output format of the filter, i.e., the new Instances header, can be
determined based on this information alone, then the method should set the
output format via setOutputFormat(Instances) and return true, otherwise it
has to return false.
getInputFormat()
This method returns an Instances object containing all currently buffered
Instance objects from the input queue.
setOutputFormat(Instances)
setOutputFormat(Instances) defines the new Instances header for the output
data. For filters that work on a row basis, there should not be any changes
between the input and output format. But filters that work on attributes,
e.g., removing, adding or modifying them, will affect this format. This
method must be called with the appropriate Instances object as parameter,
since all Instance objects being processed will rely on the output format
(they use it as the dataset they belong to).
getOutputFormat()
This method returns the currently set Instances object that defines the
output format. In case setOutputFormat(Instances) has not been called yet,
this method will return null.
input(Instance)
returns true if the given Instance can be processed straight away and can be
collected immediately via the output() method (after adding it to the output
queue via push(Instance), of course). This is also the case if the first
batch of data has been processed and the Instance belongs to the second
batch. Via isFirstBatchDone() one can query whether this Instance is still
part of the first batch or of the second.
If the Instance cannot be processed immediately, e.g., the filter needs to
collect all the data first before doing some calculations, then it needs to
be buffered with bufferInput(Instance) until batchFinished() is called. In
this case, the method needs to return false.
bufferInput(Instance)
In case an Instance cannot be processed immediately, one can use this method
to buffer it in the input queue. All buffered Instance objects are available
via the getInputFormat() method.
push(Instance)
adds  the  given  Instance to  the  output  queue.
output()
Returns  the  next  Instance  object  from  the  output  queue  and  removes  it  from
there.   In  case  there  is  no  Instance available  this  method  returns  null.
batchFinished()
signals the end of a dataset being pushed through the filter. In case of a
filter that could not process the data of the first batch immediately, this
is the place to determine what the output format will be (and set it via
setOutputFormat(Instances)) and finally process the input data. The currently
available data can be retrieved with the getInputFormat() method. After
processing the data, one needs to call flushInput() to remove all the pending
input data.
flushInput()
flushInput() removes all buffered Instance objects from the input queue.
This method must be called after all the Instance objects have been processed
in the batchFinished() method.
Option  handling
If the filter should be able to handle command-line options, then the
interface weka.core.OptionHandler needs to be implemented. In addition, the
following code should be added at the end of the setOptions(String[])
method:
if   (getInputFormat()  !=   null)   {
setInputFormat(getInputFormat());
}
This  will  inform  the  lter  about  changes  in  the  options  and  therefore  reset  it.
17.2.1.2   Examples
The following examples, covering batch and stream filters, illustrate the
filter framework and how to use it.
Unseeded random number generators like Math.random() should never be used,
since they will produce different results in each run, and repeatable
experiments are essential in machine learning.
BatchFilter
This simple batch filter adds a new attribute called "blah" at the end of the
dataset. The values of this attribute are just the row's index in the data.
Since the batch filter does not have to see all the data before creating the
output format, setInputFormat(Instances) sets the output format and returns
true (indicating that the output format can be queried immediately). The
batchFinished() method performs the processing of all the data.
import   weka.core.*;
import   weka.core.Capabilities.*;
public   class   BatchFilter   extends   Filter   {
public   String   globalInfo()   {
return   "A   batch   filter   that   adds   an   additional   attribute   blah   at   the   end   "
+   "containing   the   index   of   the   processed   instance.   The   output   format   "
+   "can   be   collected   immediately.";
}
public   Capabilities   getCapabilities()   {
Capabilities   result   =   super.getCapabilities();
result.enableAllAttributes();
result.enableAllClasses();
result.enable(Capability.NO_CLASS);   //   filter   doesn't   need   class   to   be   set
return   result;
}
public   boolean   setInputFormat(Instances  instanceInfo)   throws   Exception   {
super.setInputFormat(instanceInfo);
Instances   outFormat   =   new   Instances(instanceInfo,  0);
outFormat.insertAttributeAt(new  Attribute("blah"),
outFormat.numAttributes());
setOutputFormat(outFormat);
return   true;   //   output   format   is   immediately   available
}
public   boolean   batchFinished()   throws   Exception   {
if   (getInputFormat()   ==   null)
throw   new   NullPointerException("No  input   instance   format   defined");
Instances   inst   =   getInputFormat();
Instances   outFormat   =   getOutputFormat();
for   (int   i   =   0;   i   <   inst.numInstances();  i++)   {
double[]   newValues   =   new   double[outFormat.numAttributes()];
double[]   oldValues   =   inst.instance(i).toDoubleArray();
System.arraycopy(oldValues,  0,   newValues,   0,   oldValues.length);
newValues[newValues.length  -   1]   =   i;
push(new   Instance(1.0,   newValues));
}
flushInput();
m_NewBatch   =   true;
m_FirstBatchDone   =   true;
return   (numPendingOutput()   !=   0);
}
public   static   void   main(String[]   args)   {
runFilter(new   BatchFilter(),   args);
}
}
BatchFilter2
In contrast to the first batch filter, this one cannot determine the output
format immediately (the number of instances in the first batch is now part of
the attribute name). This is done in the batchFinished() method.
import   weka.core.*;
import   weka.core.Capabilities.*;
public   class   BatchFilter2   extends   Filter   {
public   String   globalInfo()   {
return   "A   batch   filter   that   adds   an   additional   attribute   blah   at   the   end   "
+   "containing   the   index   of   the   processed   instance.   The   output   format   "
+   "cannot   be   collected   immediately.";
}
public   Capabilities   getCapabilities()  {
Capabilities   result   =   super.getCapabilities();
result.enableAllAttributes();
result.enableAllClasses();
result.enable(Capability.NO_CLASS);   //   filter   doesn't   need   class   to   be   set
return   result;
}
public   boolean   batchFinished()   throws   Exception   {
if   (getInputFormat()   ==   null)
throw   new   NullPointerException("No  input   instance   format   defined");
//   output   format   still   needs   to   be   set   (depends   on   first   batch   of   data)
if   (!isFirstBatchDone())  {
Instances   outFormat   =   new   Instances(getInputFormat(),  0);
outFormat.insertAttributeAt(new  Attribute(
"blah-"   +   getInputFormat().numInstances()),  outFormat.numAttributes());
setOutputFormat(outFormat);
}
Instances   inst   =   getInputFormat();
Instances   outFormat   =   getOutputFormat();
for   (int   i   =   0;   i   <   inst.numInstances();   i++)   {
double[]   newValues   =   new   double[outFormat.numAttributes()];
double[]   oldValues   =   inst.instance(i).toDoubleArray();
System.arraycopy(oldValues,  0,   newValues,   0,   oldValues.length);
newValues[newValues.length  -   1]   =   i;
push(new   Instance(1.0,   newValues));
}
flushInput();
m_NewBatch   =   true;
m_FirstBatchDone   =   true;
return   (numPendingOutput()   !=   0);
}
public   static   void   main(String[]   args)   {
runFilter(new   BatchFilter2(),   args);
}
}
BatchFilter3
As soon as this batch filter's first batch is done, it can process Instance
objects immediately in the input(Instance) method. It adds a new attribute,
which contains just a random number, but the random number generator being
used is seeded with the number of instances from the first batch.
import   weka.core.*;
import   weka.core.Capabilities.*;
import   java.util.Random;
public   class   BatchFilter3   extends   Filter   {
protected   int   m_Seed;
protected   Random   m_Random;
public   String   globalInfo()   {
return   "A   batch   filter   that   adds   an   attribute   blah   at   the   end   "
+   "containing   a   random   number.   The   output   format   cannot   be   collected   "
+   "immediately.";
}
public   Capabilities   getCapabilities()   {
Capabilities   result   =   super.getCapabilities();
result.enableAllAttributes();
result.enableAllClasses();
result.enable(Capability.NO_CLASS);   //   filter   doesn't   need   class   to   be   set
return   result;
}
public   boolean   input(Instance   instance)   throws   Exception   {
if   (getInputFormat()   ==   null)
throw   new   NullPointerException("No  input   instance   format   defined");
if   (isNewBatch())   {
resetQueue();
m_NewBatch   =   false;
}
if   (isFirstBatchDone())
convertInstance(instance);
else
bufferInput(instance);
return   isFirstBatchDone();
}
public   boolean   batchFinished()   throws   Exception   {
if   (getInputFormat()   ==   null)
throw   new   NullPointerException("No  input   instance   format   defined");
//   output   format   still   needs   to   be   set   (random   number   generator   is   seeded
//   with   number   of   instances   of   first   batch)
if   (!isFirstBatchDone())  {
m_Seed   =   getInputFormat().numInstances();
Instances   outFormat   =   new   Instances(getInputFormat(),  0);
outFormat.insertAttributeAt(new  Attribute(
"blah-"   +   getInputFormat().numInstances()),  outFormat.numAttributes());
setOutputFormat(outFormat);
}
Instances   inst   =   getInputFormat();
for   (int   i   =   0;   i   <   inst.numInstances();  i++)   {
convertInstance(inst.instance(i));
}
flushInput();
m_NewBatch   =   true;
m_FirstBatchDone   =   true;
m_Random   =   null;
return   (numPendingOutput()   !=   0);
}
protected   void   convertInstance(Instance  instance)   {
if   (m_Random   ==   null)
m_Random   =   new   Random(m_Seed);
double[]   newValues   =   new   double[instance.numAttributes()  +   1];
double[]   oldValues   =   instance.toDoubleArray();
newValues[newValues.length  -   1]   =   m_Random.nextInt();
System.arraycopy(oldValues,  0,   newValues,   0,   oldValues.length);
push(new   Instance(1.0,   newValues));
}
public   static   void   main(String[]   args)   {
runFilter(new   BatchFilter3(),   args);
}
}
StreamFilter
This stream filter adds a random number (the seed value is hard-coded) at the
end of each Instance of the input data. Since this does not rely on having
access to the full data of the first batch, the output format is accessible
immediately after calling setInputFormat(Instances). All Instance objects
are immediately processed in input(Instance) via the
convertInstance(Instance) method, which pushes them immediately to the
output queue.
import   weka.core.*;
import   weka.core.Capabilities.*;
import   java.util.Random;
public   class   StreamFilter   extends   Filter   {
protected   Random   m_Random;
public   String   globalInfo()   {
return   "A   stream   filter   that   adds   an   attribute   blah   at   the   end   "
+   "containing   a   random   number.   The   output   format   can   be   collected   "
+   "immediately.";
}
public   Capabilities   getCapabilities()  {
Capabilities   result   =   super.getCapabilities();
result.enableAllAttributes();
result.enableAllClasses();
result.enable(Capability.NO_CLASS);   //   filter   doesn't   need   class   to   be   set
return   result;
}
public   boolean   setInputFormat(Instances  instanceInfo)   throws   Exception   {
super.setInputFormat(instanceInfo);
Instances   outFormat   =   new   Instances(instanceInfo,  0);
outFormat.insertAttributeAt(new  Attribute("blah"),
outFormat.numAttributes());
setOutputFormat(outFormat);
m_Random   =   new   Random(1);
return   true;   //   output   format   is   immediately   available
}
public   boolean   input(Instance   instance)   throws   Exception   {
if   (getInputFormat()   ==   null)
throw   new   NullPointerException("No  input   instance   format   defined");
if   (isNewBatch())   {
resetQueue();
m_NewBatch   =   false;
}
convertInstance(instance);
return   true;   //   can   be   immediately   collected   via   output()
}
protected   void   convertInstance(Instance  instance)   {
double[]   newValues   =   new   double[instance.numAttributes()  +   1];
double[]   oldValues   =   instance.toDoubleArray();
newValues[newValues.length  -   1]   =   m_Random.nextInt();
System.arraycopy(oldValues,  0,   newValues,   0,   oldValues.length);
push(new   Instance(1.0,   newValues));
}
public   static   void   main(String[]   args)   {
runFilter(new   StreamFilter(),   args);
}
}
17.2.2   Simple  approach
The base filters and interfaces are all located in the following package:
weka.filters
One can roughly divide filters into two different kinds:
- batch filters – they need to see the whole dataset before they can start
  processing it, which they do in one go
- stream filters – they can start producing output right away and the data
  just passes through while being modified
You can subclass one of the following abstract filters, depending on the kind
of filter you want to implement:
- weka.filters.SimpleBatchFilter
- weka.filters.SimpleStreamFilter
These filters simplify the rather general and complex framework introduced by
the abstract superclass weka.filters.Filter. One only needs to implement a
couple of abstract methods that will process the actual data and override, if
necessary, a few existing methods for option handling.
17.2.2.1   SimpleBatchFilter
Only the following abstract methods need to be implemented:
- globalInfo() – returns a short description of what the filter does; will
  be displayed in the GUI
- determineOutputFormat(Instances) – generates the new format, based on
  the input data
- process(Instances) – processes the whole dataset in one go
- getRevision() – returns the Subversion revision information, see section
  "Revisions" on page 258
If more options are necessary, then the following methods need to be
overridden:
- listOptions() – returns an enumeration of the available options; these
  are printed if one calls the filter with the -h option
- setOptions(String[]) – parses the given option array that was passed
  from the command-line
- getOptions() – returns an array of options, resembling the current setup
  of the filter
See section "Methods" on page 237 and section "Parameters" on page 241 for
more information.
The following is an example implementation that adds an additional attribute
at the end, containing the index of the processed instance:
import   weka.core.*;
import   weka.core.Capabilities.*;
import   weka.filters.*;
public   class   SimpleBatch   extends   SimpleBatchFilter   {
public   String   globalInfo()   {
return   "A   simple   batch   filter   that   adds   an   additional   attribute   blah   at   the   end   "
+   "containing   the   index   of   the   processed   instance.";
}
public   Capabilities   getCapabilities()   {
Capabilities   result   =   super.getCapabilities();
result.enableAllAttributes();
result.enableAllClasses();
result.enable(Capability.NO_CLASS);   //   filter   doesn't   need   class   to   be   set
return   result;
}
protected   Instances   determineOutputFormat(Instances   inputFormat)   {
Instances   result   =   new   Instances(inputFormat,   0);
result.insertAttributeAt(new   Attribute("blah"),   result.numAttributes());
return   result;
}
protected   Instances   process(Instances   inst)   {
Instances   result   =   new   Instances(determineOutputFormat(inst),   0);
for   (int   i   =   0;   i   <   inst.numInstances();   i++)   {
double[]   values   =   new   double[result.numAttributes()];
for   (int   n   =   0;   n   <   inst.numAttributes();   n++)
values[n]   =   inst.instance(i).value(n);
values[values.length   -   1]   =   i;
result.add(new   Instance(1,   values));
}
return   result;
}
public   static   void   main(String[]   args)   {
runFilter(new   SimpleBatch(),   args);
}
}
17.2.2.2   SimpleStreamFilter
Only the following abstract methods need to be implemented for a stream
filter:
- globalInfo() – returns a short description of what the filter does; will
  be displayed in the GUI
- determineOutputFormat(Instances) – generates the new format, based on
  the input data
- process(Instance) – processes a single instance and turns it from the
  old format into the new one
- getRevision() – returns the Subversion revision information, see section
  "Revisions" on page 258
If more options are necessary, then the following methods need to be
overridden:
- listOptions() – returns an enumeration of the available options; these
  are printed if one calls the filter with the -h option
- setOptions(String[]) – parses the given option array that was passed
  from the command-line
- getOptions() – returns an array of options, resembling the current setup
  of the filter
See also section 17.1.4.1, covering "Methods" for classifiers.
The following is an example implementation of a stream filter that adds an
extra attribute at the end, which is filled with random numbers. The reset()
method is only used in this example because the random number generator
needs to be re-initialized in order to obtain repeatable results.
import   weka.core.*;
import   weka.core.Capabilities.*;
import   weka.filters.*;
import   java.util.Random;
public   class   SimpleStream   extends   SimpleStreamFilter   {
protected   Random   m_Random;
public   String   globalInfo()   {
return   "A   simple   stream   filter   that   adds   an   attribute   blah   at   the   end   "
+   "containing   a   random   number.";
}
public   Capabilities   getCapabilities()   {
Capabilities   result   =   super.getCapabilities();
result.enableAllAttributes();
result.enableAllClasses();
result.enable(Capability.NO_CLASS);   //   filter   doesn't   need   class   to   be   set
return   result;
}
protected   void   reset()   {
super.reset();
m_Random   =   new   Random(1);
}
protected   Instances   determineOutputFormat(Instances   inputFormat)   {
Instances   result   =   new   Instances(inputFormat,   0);
result.insertAttributeAt(new   Attribute("blah"),   result.numAttributes());
return   result;
}
protected   Instance   process(Instance   inst)   {
double[]   values   =   new   double[inst.numAttributes()   +   1];
for   (int   n   =   0;   n   <   inst.numAttributes();   n++)
values[n]   =   inst.value(n);
values[values.length   -   1]   =   m_Random.nextInt();
Instance   result   =   new   Instance(1,   values);
return   result;
}
public   static   void   main(String[]   args)   {
runFilter(new   SimpleStream(),   args);
}
}
A  real-world implementation  of  a  stream  lter  is  the  MultiFilter class  (pack-
age  weka.filters),   which  passes  the  data  through  all   the  lters  it   contains.
Depending  on  whether  all   the  used  lters  are  streamable  or  not,   it  acts  either
as  stream  lter  or  as  batch  lter.
17.2.2.3   Internals
Some useful methods of the filter classes:
- isNewBatch() – returns true if an instance of the filter was just
  instantiated or a new batch was started via the batchFinished() method.
- isFirstBatchDone() – returns true as soon as the first batch was finished
  via the batchFinished() method. Useful for supervised filters, which
  should not be altered after being trained with the first batch of
  instances.
17.2.3   Capabilities
Filters implement the weka.core.CapabilitiesHandler interface, like
classifiers. The getCapabilities() method returns what kind of data the
filter is able to process. It needs to be adapted for each individual filter,
since the default implementation allows the processing of all kinds of
attributes and classes; otherwise correct functioning of the filter cannot be
guaranteed. See section "Capabilities" on page 242 for more information.
17.2.4   Packages
A few comments about the different filter sub-packages:
- supervised – contains supervised filters, i.e., filters that take class
  distributions into account. Must implement the
  weka.filters.SupervisedFilter interface.
  - attribute – filters that work column-wise.
  - instance – filters that work row-wise.
- unsupervised – contains unsupervised filters, i.e., they work without
  taking any class distributions into account. The filter must implement the
  weka.filters.UnsupervisedFilter interface.
  - attribute – filters that work column-wise.
  - instance – filters that work row-wise.
Javadoc
The Javadoc generation works the same as with classifiers. See section
"Javadoc" on page 244 for more information.
17.2.5   Revisions
Filters, like classifiers, implement the weka.core.RevisionHandler
interface. This provides the functionality of obtaining the Subversion
revision from within Java. Filters that are not part of the official WEKA
distribution do not have to implement the method getRevision() as the
weka.filters.Filter class already implements this method. Contributions, on
the other hand, need to implement it, in order to obtain the revision of this
particular source file. See section "Revisions" on page 245.
17.2.6   Testing
WEKA already provides a test framework to ensure correct basic functionality
of a filter. It is essential for the filter to pass these tests.
17.2.6.1   Option  handling
You can check the option handling of your filter with the following tool from
the command-line:
weka.core.CheckOptionHandler -W classname [-- additional parameters]
All tests need to return "yes".
17.2.6.2   GenericObjectEditor
The CheckGOE class checks whether all the properties available in the GUI
have a tooltip accompanying them and whether the globalInfo() method is
declared:
weka.core.CheckGOE -W classname [-- additional parameters]
All tests, once again, need to return "yes".
17.2.6.3   Source  code
Filters that implement the weka.filters.Sourcable interface can output Java
code of their internal representation. In order to check the generated code,
one should not only compile the code, but also test it with the following
test class:
weka.filters.CheckSource
This class takes the original WEKA filter, the generated code and the dataset
used for generating the source code (and an optional class index) as
parameters. It builds the WEKA filter on the dataset and compares its output
with the output of the generated source code, checking whether they are the
same.
Here is an example call for
weka.filters.unsupervised.attribute.ReplaceMissingValues and the generated
class weka.filters.WEKAWrapper (it wraps the actual generated code in a
pseudo-filter):
java weka.filters.CheckSource \
  -W weka.filters.unsupervised.attribute.ReplaceMissingValues \
  -S weka.filters.WEKAWrapper \
  -t data.arff
It needs to return "Tests OK!".
17.2.6.4   Unit  tests
In order to make sure that your filter conforms to the WEKA criteria, you
should add it to the junit unit test framework, i.e., by creating a Test
class. The superclass for filter unit tests is
weka.filters.AbstractFilterTest.
17.3   Writing  other  algorithms
The previous sections covered how to implement classifiers and filters. In
the following you will find some information on how to implement clusterers,
associators and attribute selection algorithms. The various algorithms are
only covered briefly, since other important components (capabilities, option
handling, revisions) have already been discussed in the other chapters.
17.3.1   Clusterers
Superclasses  and  interfaces
All clusterers implement the interface weka.clusterers.Clusterer, but most algorithms will most likely be derived (directly or further up in the class hierarchy) from the abstract superclass weka.clusterers.AbstractClusterer.
weka.clusterers.SingleClustererEnhancer is used for meta-clusterers, like the FilteredClusterer, which filters the data on-the-fly for the base clusterer.
Here  are  some  common  interfaces  that  can  be  implemented:
- weka.clusterers.DensityBasedClusterer - for clusterers that can estimate the density for a given instance. AbstractDensityBasedClusterer already implements this interface.
- weka.clusterers.UpdateableClusterer - clusterers that can generate their model incrementally implement this interface, like CobWeb.
- NumberOfClustersRequestable - for clusterers that allow the user to specify the number of clusters to generate, like SimpleKMeans.
- weka.core.Randomizable - for clusterers that support randomization in one way or another. RandomizableClusterer, RandomizableDensityBasedClusterer and RandomizableSingleClustererEnhancer all implement this interface already.
Methods
In the following is a short description of the methods that are common to all clustering algorithms; see also the Javadoc for the Clusterer interface.
buildClusterer(Instances)
Like the buildClassifier(Instances) method, this method completely rebuilds the model. Subsequent calls of this method with the same dataset must result in exactly the same model being built. This method also tests the training data against the capabilities of this clusterer:
public   void   buildClusterer(Instances  data)   throws   Exception  {
//   test   data   against   capabilities
getCapabilities().testWithFail(data);
//   actual   model   generation
...
}
clusterInstance(Instance)
returns  the  index  of  the  cluster  the  provided  Instance belongs  to.
distributionForInstance(Instance)
returns  the  cluster  membership  for  this  Instance object.   The  membership  is  a
double  array containing  the  probabilities  for  each  cluster.
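The two methods are usually closely related; a minimal sketch (assuming distributionForInstance returns normalized probabilities) derives clusterInstance from the distribution:

public int clusterInstance(Instance instance) throws Exception {
  // pick the cluster with the highest membership probability
  double[] dist = distributionForInstance(instance);
  return weka.core.Utils.maxIndex(dist);
}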
numberOfClusters()
returns   the  number   of   clusters   that   the  model   contains,   after   the  model   has
been  generated  with  the  buildClusterer(Instances) method.
getCapabilities()
see  section  Capabilities  on  page  242  for  more  information.
toString()
should  output  some  information  on  the  generated  model.   Even  though  this  is
not  required,   it  is  rather  useful   for  the  user  to  get  some  feedback  on  the  built
model.
main(String[])
executes   the   clusterer   from  command-line.   If   your   new  algorithm  is   called
FunkyClusterer, then  use  the  following  code  as  your  main  method:
/**
*   Main   method   for   executing  this   clusterer.
*
*   @param   args   the   options,  use   "-h"   to   display   options
*/
public   static   void   main(String[]  args)   {
AbstractClusterer.runClusterer(new  FunkyClusterer(),  args);
}
Testing
For   some  basic  tests   from  the   command-line,   you  can  use  the   following  test
class:
weka.clusterers.CheckClusterer  -W   classname  [further   options]
For junit tests, you can subclass the weka.clusterers.AbstractClustererTest
class  and  add  additional  tests.
17.3.2   Attribute  selection
Attribute selection basically consists of two different types of classes:
- evaluator - determines the merit of single attributes or subsets of attributes
- search algorithm - the search heuristic
Each of them will be discussed separately in the following sections.
Evaluator
The evaluator algorithm is responsible for determining the merit of the current attribute selection.
Superclasses  and  interfaces
The ancestor for all evaluators is the weka.attributeSelection.ASEvaluation
class.
Here  are  some  interfaces  that  are  commonly  implemented  by  evaluators:
- AttributeEvaluator - evaluates only single attributes
- SubsetEvaluator - evaluates subsets of attributes
- AttributeTransformer - evaluators that transform the input data
Methods
In the following is a brief description of the main methods of an evaluator.
buildEvaluator(Instances)
Generates  the  attribute  evaluator.   Subsequent   calls   of   this   method  with  the
same  data  (and  the  same  search  algorithm)  must  result  in  the  same  attributes
being  selected.   This  method  also  checks  the  data  against the  capabilities:
public   void   buildEvaluator  (Instances  data)   throws   Exception  {
//   can   evaluator  handle   data?
getCapabilities().testWithFail(data);
//   actual   initialization  of   evaluator
...
}
postProcess(int[])
can  be   used  for   optional   post-processing  of   the   selected  attributes,   e.g.,   for
ranking  purposes.
main(String[])
executes   the   evaluator   from  command-line.   If   your   new  algorithm  is   called
FunkyEvaluator, then  use  the  following  code  as  your  main  method:
/**
*   Main   method   for   executing  this   evaluator.
*
*   @param   args   the   options,  use   "-h"   to   display   options
*/
public   static   void   main(String[]  args)   {
ASEvaluation.runEvaluator(new  FunkyEvaluator(),  args);
}
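Putting the pieces together, a minimal single-attribute evaluator (a hypothetical FunkyEvaluator, with the actual merit computation left out) could be sketched like this:

import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.AttributeEvaluator;
import weka.core.Instances;

public class FunkyEvaluator extends ASEvaluation
  implements AttributeEvaluator {

  /** the training data */
  protected Instances m_Data;

  public void buildEvaluator(Instances data) throws Exception {
    // can evaluator handle data?
    getCapabilities().testWithFail(data);
    m_Data = data;
  }

  public double evaluateAttribute(int attribute) throws Exception {
    // compute and return the merit of the given attribute here
    return 0.0;
  }

  public static void main(String[] args) {
    ASEvaluation.runEvaluator(new FunkyEvaluator(), args);
  }
}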
Search
The search algorithm defines the search heuristic, e.g., exhaustive, greedy or genetic search.
Superclasses  and  interfaces
The ancestor for all search algorithms is the weka.attributeSelection.ASSearch
class.
Interfaces  that  can  be  implemented,  if  applicable,  by  a  search  algorithm:
- RankedOutputSearch - for search algorithms that produce ranked lists of attributes
- StartSetHandler - search algorithms that can make use of a start set of attributes implement this interface
Methods
Search algorithms are rather basic classes with regard to the methods that need to be implemented; only the following method must be implemented:
search(ASEvaluation,Instances)
uses  the  provided  evaluator  to  guide  the  search.
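As a rough sketch, a hypothetical FunkySearch could greedily add attributes as long as the subset merit improves (assuming the passed-in evaluator is a SubsetEvaluator):

import java.util.BitSet;
import weka.attributeSelection.ASEvaluation;
import weka.attributeSelection.ASSearch;
import weka.attributeSelection.SubsetEvaluator;
import weka.core.Instances;

public class FunkySearch extends ASSearch {

  /** Returns the indices of the selected attributes. */
  public int[] search(ASEvaluation ASEval, Instances data)
    throws Exception {

    SubsetEvaluator evaluator = (SubsetEvaluator) ASEval;
    BitSet group = new BitSet(data.numAttributes());
    double best = -Double.MAX_VALUE;

    // greedily add each attribute (except the class) and keep it
    // only if it improves the merit of the subset
    for (int i = 0; i < data.numAttributes(); i++) {
      if (i == data.classIndex())
        continue;
      group.set(i);
      double merit = evaluator.evaluateSubset(group);
      if (merit > best)
        best = merit;
      else
        group.clear(i);
    }

    // convert the BitSet into an index array
    int[] result = new int[group.cardinality()];
    for (int i = 0, n = 0; i < data.numAttributes(); i++) {
      if (group.get(i))
        result[n++] = i;
    }
    return result;
  }
}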
Testing
For   some  basic  tests   from  the   command-line,   you  can  use  the   following  test
class:
weka.attributeSelection.CheckAttributeSelection
-eval   classname  -search   classname  [further  options]
For junit tests, you can subclass the weka.attributeSelection.AbstractEvaluatorTest
or  weka.attributeSelection.AbstractSearchTest class  and  add  additional
tests.
17.3.3   Associators
Superclasses  and  interfaces
The interface weka.associations.Associator is common to all associator algorithms. But most algorithms will be derived from AbstractAssociator, an abstract class implementing this interface. As with classifiers and clusterers, you can also implement a meta-associator, derived from SingleAssociatorEnhancer. An example for this is the FilteredAssociator, which filters the training data on-the-fly for the base associator.
The only other interface that is used by some association algorithms is weka.associations.CARuleMiner. Associators that learn class association rules implement this interface, like Apriori.
Methods
The   associators   are   very  basic   algorithms   and  only  support   building   of   the
model.
buildAssociations(Instances)
Like  the   buildClassifier(Instances)  method,   this   method  completely  re-
builds  the  model.   Subsequent  calls  of  this  method  with  the  same  dataset  must
result in exactly the same model being built.   This method also tests the training
data  against the  capabilities:
public   void   buildAssociations(Instances  data)   throws   Exception  {
//   other   necessary  setups
...
//   test   data   against   capabilities
getCapabilities().testWithFail(data);
//   actual   model   generation
...
}
getCapabilities()
see  section  Capabilities  on  page  242  for  more  information.
toString()
should  output  some  information  on  the  generated  model.   Even  though  this  is
not  required,   it  is  rather  useful   for  the  user  to  get  some  feedback  on  the  built
model.
main(String[])
executes   the   associator   from  command-line.   If   your   new  algorithm  is   called
FunkyAssociator, then  use  the  following  code  as  your  main  method:
/**
*   Main   method   for   executing  this   associator.
*
*   @param   args   the   options,  use   "-h"   to   display   options
*/
public   static   void   main(String[]  args)   {
AbstractAssociator.runAssociator(new  FunkyAssociator(),  args);
}
Testing
For   some  basic  tests   from  the   command-line,   you  can  use  the   following  test
class:
weka.associations.CheckAssociator  -W   classname  [further  options]
For junit tests, you can subclass the weka.associations.AbstractAssociatorTest
class  and  add  additional  tests.
17.4   Extending  the  Explorer
The plugin architecture of the Explorer allows you to add new functionality easily without having to dig into the code of the Explorer itself. In the following you will find information on how to add new tabs, like the Classify tab, and new visualization plugins for the Classify tab.
17.4.1   Adding  tabs
The Explorer is a handy tool for the initial exploration of your data; for proper statistical evaluation, the Experimenter should be used instead. But if the available functionality is not enough, you can always add your own custom-made tabs to the Explorer.
17.4.1.1   Requirements
Here is roughly what is required in order to add a new tab (the examples below go into more detail):
- your class must be derived from javax.swing.JPanel
- the interface weka.gui.explorer.Explorer.ExplorerPanel must be implemented by your class
- optional interfaces:
  - weka.gui.explorer.Explorer.LogHandler - in case you want to take advantage of the logging in the Explorer
  - weka.gui.explorer.Explorer.CapabilitiesFilterChangeListener - in case your class needs to be notified of changes in the Capabilities, e.g., if new data is loaded into the Explorer
- adding the classname of your class to the Tabs property in the Explorer.props file
17.4.1.2   Examples
The  following  examples  demonstrate  the  plugin  architecture.   Only  the  neces-
sary  details  are  discussed,   as  the  full   source  code  is  available  from  the  WEKA
Examples  [3]  (package  wekaexamples.gui.explorer).
SQL  worksheet
Purpose
Displaying the SqlViewer as a tab in the Explorer instead of using it either via the Open DB... button or as a standalone application. This tab uses the existing components already available in WEKA and just assembles them in a JPanel. Since it does not rely on a dataset being loaded into the Explorer, it will be used as a standalone one.
Useful   for  people  who  are  working  a  lot  with  databases  and  would  like  to
have  an  SQL  worksheet  available  all   the  time  instead  of   clicking  on  a  button
every  time  to  open  up  a  database  dialog.
Implementation
- the class is derived from javax.swing.JPanel and implements the interface weka.gui.Explorer.ExplorerPanel (the full source code also imports the weka.gui.Explorer.LogHandler interface, but that is only additional functionality):
public   class   SqlPanel
extends   JPanel
implements  ExplorerPanel  {
- some basic members that we need to have:
/**   the   parent   frame   */
protected  Explorer  m_Explorer  =   null;
/**   sends   notifications  when   the   set   of   working  instances  gets   changed*/
protected  PropertyChangeSupport  m_Support  =   new   PropertyChangeSupport(this);
- methods we need to implement due to the used interfaces:
/**   Sets   the   Explorer  to   use   as   parent   frame   */
public   void   setExplorer(Explorer  parent)  {
m_Explorer  =   parent;
}
/**   returns  the   parent   Explorer  frame   */
public   Explorer   getExplorer()  {
return   m_Explorer;
}
/**   Returns  the   title   for   the   tab   in   the   Explorer  */
public   String   getTabTitle()  {
return "SQL";   // what's displayed as tab-title, e.g., Classify
}
/**   Returns  the   tooltip  for   the   tab   in   the   Explorer   */
public   String   getTabTitleToolTip()  {
return   "Retrieving  data   from   databases";   //   the   tooltip   of   the   tab
}
/**   ignored,  since   we   "generate"  data   and   not   receive   it   */
public   void   setInstances(Instances  inst)   {
}
/**   PropertyChangeListener  which   will   be   notified  of   value   changes.  */
public   void   addPropertyChangeListener(PropertyChangeListener  l)   {
m_Support.addPropertyChangeListener(l);
}
/**   Removes  a   PropertyChangeListener.  */
public   void   removePropertyChangeListener(PropertyChangeListener  l)   {
m_Support.removePropertyChangeListener(l);
}
- additional GUI elements:
/**   the   actual   SQL   worksheet  */
protected  SqlViewer  m_Viewer;
/**   the   panel   for   the   buttons   */
protected  JPanel   m_PanelButtons;
/**   the   Load   button   -   makes   the   data   available  in   the   Explorer   */
protected  JButton   m_ButtonLoad  =   new   JButton("Load  data");
/**   displays  the   current  query   */
protected  JLabel   m_LabelQuery  =   new   JLabel("");
- loading the data into the Explorer by clicking on the Load button will fire a propertyChange event:
m_ButtonLoad.addActionListener(new  ActionListener()  {
public   void   actionPerformed(ActionEvent  evt){
m_Support.firePropertyChange("",  null,   null);
}
});
- the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel:
addPropertyChangeListener(new  PropertyChangeListener()  {
public   void   propertyChange(PropertyChangeEvent  e)   {
try   {
//   load   data
InstanceQuery  query   =   new   InstanceQuery();
query.setDatabaseURL(m_Viewer.getURL());
query.setUsername(m_Viewer.getUser());
query.setPassword(m_Viewer.getPassword());
Instances  data   =   query.retrieveInstances(m_Viewer.getQuery());
//   set   data   in   preprocess  panel   (also   notifies  of   capabilities  changes)
getExplorer().getPreprocessPanel().setInstances(data);
}
catch   (Exception  ex)   {
ex.printStackTrace();
}
}
});
- In order to add our SqlPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this:
Tabs=weka.gui.explorer.SqlPanel,\
weka.gui.explorer.ClassifierPanel,\
weka.gui.explorer.ClustererPanel,\
weka.gui.explorer.AssociationsPanel,\
weka.gui.explorer.AttributeSelectionPanel,\
weka.gui.explorer.VisualizePanel
Screenshot
Artificial data generation
Purpose
Instead of only having a Generate... button in the PreprocessPanel or using it from the command line, this example creates a new panel to be displayed as an extra tab in the Explorer. This tab will be available regardless of whether a dataset is already loaded or not (= standalone).
Implementation
- the class is derived from javax.swing.JPanel and implements the interface weka.gui.Explorer.ExplorerPanel (the full source code also imports the weka.gui.Explorer.LogHandler interface, but that is only additional functionality):
public   class   GeneratorPanel
extends  JPanel
implements  ExplorerPanel  {
- some basic members that we need to have (the same as for the SqlPanel class):
/**   the   parent   frame   */
protected  Explorer   m_Explorer  =   null;
/**   sends   notifications  when   the   set   of   working   instances  gets   changed*/
protected  PropertyChangeSupport  m_Support  =   new   PropertyChangeSupport(this);
- methods we need to implement due to the used interfaces (almost identical to SqlPanel):
/**   Sets   the   Explorer  to   use   as   parent   frame   */
public   void   setExplorer(Explorer  parent)   {
m_Explorer  =   parent;
}
/**   returns   the   parent   Explorer  frame   */
public   Explorer  getExplorer()  {
return   m_Explorer;
}
/**   Returns   the   title   for   the   tab   in   the   Explorer  */
public   String   getTabTitle()  {
return "DataGeneration";   // what's displayed as tab-title, e.g., Classify
}
/**   Returns   the   tooltip   for   the   tab   in   the   Explorer  */
public   String   getTabTitleToolTip()  {
return   "Generating  artificial  datasets";   //   the   tooltip   of   the   tab
}
/**   ignored,  since   we   "generate"  data   and   not   receive  it   */
public   void   setInstances(Instances  inst)   {
}
/**   PropertyChangeListener  which   will   be   notified  of   value   changes.  */
public   void   addPropertyChangeListener(PropertyChangeListener  l)   {
m_Support.addPropertyChangeListener(l);
}
/**   Removes   a   PropertyChangeListener.  */
public   void   removePropertyChangeListener(PropertyChangeListener  l)   {
m_Support.removePropertyChangeListener(l);
}
- additional GUI elements:
/**   the   GOE   for   the   generators  */
protected  GenericObjectEditor  m_GeneratorEditor  =   new   GenericObjectEditor();
/**   the   text   area   for   the   output   of   the   generated  data   */
protected  JTextArea  m_Output  =   new   JTextArea();
/**   the   Generate  button   */
protected  JButton   m_ButtonGenerate  =   new   JButton("Generate");
/**   the   Use   button   */
protected  JButton   m_ButtonUse  =   new   JButton("Use");
- the Generate button does not load the generated data directly into the Explorer, but only outputs it in the JTextArea (the Use button loads the data; see further down):
m_ButtonGenerate.addActionListener(new  ActionListener(){
public   void   actionPerformed(ActionEvent  evt){
DataGenerator  generator  =   (DataGenerator)  m_GeneratorEditor.getValue();
String   relName  =   generator.getRelationName();
String   cname   =   generator.getClass().getName().replaceAll(".*\\.",  "");
String   cmd   =   generator.getClass().getName();
if   (generator  instanceof  OptionHandler)
cmd   +=   "   "+Utils.joinOptions(((OptionHandler)generator).getOptions());
try   {
//   generate   data
StringWriter  output   =   new   StringWriter();
generator.setOutput(new  PrintWriter(output));
DataGenerator.makeData(generator,  generator.getOptions());
m_Output.setText(output.toString());
}
catch   (Exception  ex)   {
ex.printStackTrace();
JOptionPane.showMessageDialog(
getExplorer(),  "Error   generating  data:\n"  +   ex.getMessage(),
"Error",  JOptionPane.ERROR_MESSAGE);
}
generator.setRelationName(relName);
}
});
- the Use button finally fires a propertyChange event that will load the data into the Explorer:
m_ButtonUse.addActionListener(new  ActionListener(){
public   void   actionPerformed(ActionEvent  evt){
m_Support.firePropertyChange("",  null,   null);
}
});
- the propertyChange event will perform the actual loading of the data, hence we add an anonymous property change listener to our panel:
addPropertyChangeListener(new  PropertyChangeListener()  {
public   void   propertyChange(PropertyChangeEvent  e)   {
try   {
Instances  data   =   new   Instances(new  StringReader(m_Output.getText()));
//   set   data   in   preprocess  panel   (also   notifies  of   capabilities  changes)
getExplorer().getPreprocessPanel().setInstances(data);
}
catch   (Exception  ex)   {
ex.printStackTrace();
JOptionPane.showMessageDialog(
getExplorer(),  "Error   generating  data:\n"  +   ex.getMessage(),
"Error",   JOptionPane.ERROR_MESSAGE);
}
}
});
- In order to add our GeneratorPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this:
Tabs=weka.gui.explorer.GeneratorPanel:standalone,\
weka.gui.explorer.ClassifierPanel,\
weka.gui.explorer.ClustererPanel,\
weka.gui.explorer.AssociationsPanel,\
weka.gui.explorer.AttributeSelectionPanel,\
weka.gui.explorer.VisualizePanel
- Note: the "standalone" option is used to make the tab available without requiring the preprocess panel to load a dataset first.
Screenshot
Experimenter  light
Purpose
By default the Classify panel only performs 1 run of 10-fold cross-validation. Since most classifiers are rather sensitive to the order of the data being presented to them, those results can be too optimistic or pessimistic. Averaging the results over 10 runs with differently randomized train/test pairs returns more reliable results. And this is where this plugin comes in: it can be used to obtain statistically sound results for a specific classifier/dataset combination, without having to set up a whole experiment in the Experimenter.
Implementation
- Since this plugin is rather bulky, we omit the implementation details, but the following can be said:
  - it is based on the weka.gui.explorer.ClassifierPanel
  - the actual code doing the work follows the example in the Using the Experiment API wiki article [2]
- In order to add our ExperimentPanel to the list of tabs displayed in the Explorer, we need to modify the Explorer.props file (just extract it from the weka.jar and place it in your home directory). The Tabs property must look like this:
Tabs=weka.gui.explorer.ClassifierPanel,\
weka.gui.explorer.ExperimentPanel,\
weka.gui.explorer.ClustererPanel,\
weka.gui.explorer.AssociationsPanel,\
weka.gui.explorer.AttributeSelectionPanel,\
weka.gui.explorer.VisualizePanel
Screenshot
17.4.2   Adding  visualization  plugins
Introduction
As of WEKA version 3.5.3 you can easily add visualization plugins in the Explorer (Classify panel). This makes it easy to implement custom visualizations, if the ones WEKA offers are not sufficient. The following examples can be found in the Examples collection [3] (package wekaexamples.gui.visualize.plugins).
Requirements
- the custom visualization class must implement the following interface:
weka.gui.visualize.plugins.VisualizePlugin
- the class must either reside in the following package (visualization classes are automatically discovered during run-time):
weka.gui.visualize.plugins
- or you must list the package this class belongs to in the properties file weka/gui/GenericPropertiesCreator.props (or the equivalent in your home directory) under the key weka.gui.visualize.plugins.VisualizePlugin.
Implementation
The visualization interface contains the following four methods (a skeleton implementation is shown after the list):
- getMinVersion - This method returns the minimum version (inclusive) of WEKA that is necessary to execute the plugin, e.g., 3.5.0.
- getMaxVersion - This method returns the maximum version (exclusive) of WEKA that is necessary to execute the plugin, e.g., 3.6.0.
- getDesignVersion - Returns the actual version of WEKA this plugin was designed for, e.g., 3.5.1.
- getVisualizeMenuItem - The JMenuItem that is returned via this method will be added to the plugins menu in the popup in the Explorer. The ActionListener for clicking the menu item will most likely open a new frame containing the visualized data.
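The skeleton referenced above might look like this; a rough sketch that assumes getVisualizeMenuItem receives the instances with the predictions appended and a plot name (check the Javadoc of VisualizePlugin for the exact parameter list):

import java.awt.event.ActionEvent;
import java.awt.event.ActionListener;
import javax.swing.JMenuItem;
import weka.core.Instances;
import weka.gui.visualize.plugins.VisualizePlugin;

public class FunkyVisualizePlugin implements VisualizePlugin {

  public String getMinVersion()    { return "3.5.3"; }
  public String getMaxVersion()    { return "3.7.0"; }
  public String getDesignVersion() { return "3.6.0"; }

  public JMenuItem getVisualizeMenuItem(final Instances predInst,
                                        final String name) {
    JMenuItem result = new JMenuItem("Funky visualization");
    result.addActionListener(new ActionListener() {
      public void actionPerformed(ActionEvent e) {
        // open a new frame here and display predInst in it
      }
    });
    return result;
  }
}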
Examples
Table  with  predictions
The PredictionTable.java example simply displays the actual class label and the one predicted by the classifier. In addition to that, it lists whether it was an incorrect prediction and the class probability for the correct class label.
Bar  plot  with  probabilities
The PredictionError.java example uses the JMathTools library (needs the jmathplot.jar [27] in the CLASSPATH) to display a simple bar plot of the predictions. The correct predictions are displayed in blue, the incorrect ones in red. In both cases the class probability that the classifier returned for the correct class label is displayed on the y axis. The x axis is simply the index of the prediction, starting with 0.
Chapter  18
Technical  documentation
18.1   ANT
What is ANT? This is how the ANT homepage (http://ant.apache.org/) defines its tool:
Apache Ant is a Java-based build tool. In theory, it is kind of like Make, but without Make's wrinkles.
18.1.1   Basics
- the ANT build file is based on XML (http://www.w3.org/XML/)
- the usual name for the build file is:
build.xml
- invocation - the build file need not be specified explicitly if it is in the current directory; if no target is specified, the default one is used:
ant [-f <build-file>] [<target>]
- displaying all the available targets of a build file:
ant [-f <build-file>] -projecthelp
18.1.2   Weka  and  ANT
- a build file for Weka is available from subversion
- some targets of interest:
  - clean - Removes the build, dist and reports directories; also any class files in the source tree
  - compile - Compile weka and deposit class files in ${path_modifier}/build/classes
  - docs - Make javadocs into ${path_modifier}/doc
  - exejar - Create an executable jar file in ${path_modifier}/dist
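For example, building a fresh executable jar from a source checkout (run in the directory containing build.xml) boils down to:

ant clean exejar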
18.2   CLASSPATH
The CLASSPATH environment variable tells Java where to look for classes. Since Java does the search in a first-come-first-serve kind of manner, you'll have to take care where and what to put in your CLASSPATH. I, personally, never use the environment variable, since I'm often working on a project in different versions in parallel. The CLASSPATH would just mess things up, if you're not careful (or just forget to remove an entry). ANT (http://ant.apache.org/) offers a nice way of building (and separating source code and class files of) Java projects. But still, if you're only working on totally separate projects, it might be easiest for you to use the environment variable.
18.2.1   Setting  the  CLASSPATH
In  the   following  we   add  the   mysql-connector-java-5.1.7-bin.jar  to   our
CLASSPATH variable (this  works for any other jar archive) to make it possible to
access  MySQL  databases  via  JDBC.
Win32  (2k  and  XP)
We assume that the  mysql-connector-java-5.1.7-bin.jar archive is located
in  the  following  directory:
C:\Program  Files\Weka-3-7
In the Control Panel click on System (or right click on My Computer and select Properties) and then go to the Advanced tab. There you'll find a button called Environment Variables, click it. Depending on whether you're the only person using this computer or it's a lab computer shared by many, you can either create a new system-wide (you're the only user) environment variable or a user-dependent one (recommended for multi-user machines). Enter the following name for the variable:
CLASSPATH
and  add  this  value
C:\Program  Files\Weka-3-7\mysql-connector-java-5.1.7-bin.jar
If you want to add additional jars, you will have to separate them with the path separator, the semicolon ";" (no spaces!).
Unix/Linux
We make the assumption that the mysql jar is located in the following directory:
/home/johndoe/jars/
Open a shell and execute the following command, depending on the shell you're using:
- bash
export CLASSPATH=$CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.7-bin.jar
- c shell
setenv CLASSPATH $CLASSPATH:/home/johndoe/jars/mysql-connector-java-5.1.7-bin.jar
Cygwin
The process is like with Unix/Linux systems, but since the host system is Win32 and therefore the Java installation also a Win32 application, you'll have to use the semicolon ";" as the separator for several jars.
18.2.2   RunWeka.bat
From version 3.5.4, Weka is launched differently under Win32. The simple batch file got replaced by a central launcher class (= RunWeka.class) in combination with an INI-file (= RunWeka.ini). The RunWeka.bat now only calls this launcher class with the appropriate parameters. With this launcher approach it is possible to define different launch scenarios, with the advantage of having placeholders, e.g., for the max heap size, which enables one to change the memory for all setups easily.
The key of a command in the INI-file is prefixed with cmd_, all other keys are considered placeholders:
cmd_blah=java  ...   command  blah
bloerk=   ...   placeholder   bloerk
A  placeholder  is  surrounded  in  a  command  with  #:
cmd_blah=java  #bloerk#
Note: The key "wekajar" is determined by the -w parameter with which the launcher class is called.
By default, the following commands are predefined:
- default
The default Weka start, without a terminal window.
- console
For debugging purposes. Useful as Weka gets started from a terminal window.
- explorer
The command that is executed if one double-clicks on an ARFF or XRFF file.
In  order to  change the  maximum  heap  size for all those  commands, one only
has  to  modify  the  maxheap  placeholder.
For more information check out the comments in the INI-file.
18.2.3   java  -jar
When you're using the Java interpreter with the -jar option, be aware of the fact that it overwrites your CLASSPATH rather than augmenting it. Out of convenience, people often only use the -jar option to skip the declaration of the main class to start. But as soon as you need more jars, e.g., for database access, you need to use the -classpath option and specify the main class.
Here's once again how you start the Weka Main-GUI with your current CLASSPATH variable (and 128MB for the JVM):
- Linux
java -Xmx128m -classpath $CLASSPATH:weka.jar weka.gui.Main
- Win32
java -Xmx128m -classpath "%CLASSPATH%;weka.jar" weka.gui.Main
18.3   Subversion
18.3.1   General
The  Weka  Subversion  repository  is  accessible  and  browseable  via  the  following
URL:
https://svn.scms.waikato.ac.nz/svn/weka/
A Subversion repository usually has the following layout:
root
|
+-   trunk
|
+-   tags
|
+-   branches
Here, trunk contains the main trunk of the development, tags contains snapshots in time of the repository (e.g., when a new version got released) and branches contains development branches that forked off the main trunk at some stage (e.g., legacy versions that still get bugfixed).
18.3.2   Source  code
The  latest  version  of  the  Weka  source  code  can  be  obtained  with  this  URL:
https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka
If  you  want  to  obtain  the  source  code  of  the  book  version,  use  this  URL:
https://svn.scms.waikato.ac.nz/svn/weka/branches/book2ndEd-branch/weka
18.3.3   JUnit
The latest version of Weka's JUnit tests can be obtained with this URL:
https://svn.scms.waikato.ac.nz/svn/weka/trunk/tests
And  if  you  want  to  obtain  the  JUnit  tests  of  the  book  version,  use  this  URL:
https://svn.scms.waikato.ac.nz/svn/weka/branches/book2ndEd-branch/tests
18.3.4   Specic  version
Whenever  a  release  of  Weka  is  generated,  the  repository  gets  tagged
   dev-X-Y-Z
the tag for a release of the developer version, e.g., dev-3.7.0  for Weka 3.7.0
https://svn.scms.waikato.ac.nz/svn/weka/tags/dev-3-7-0
   stable-X-Y-Z
the tag for a release of the book version, e.g., stable-3-4-15  for Weka 3.4.15
https://svn.scms.waikato.ac.nz/svn/weka/tags/stable-3-4-15
18.3.5   Clients
Commandline
Modern Linux distributions already come with Subversion either pre-installed or easily installed via the package manager of the distribution. If that should not be the case, or if you are using Windows, you have to download the appropriate client from the Subversion homepage (http://subversion.tigris.org/). A checkout of the current developer version of Weka looks like this:
svn   co   https://svn.scms.waikato.ac.nz/svn/weka/trunk/weka
SmartSVN
SmartSVN (http://smartsvn.com/) is a Java-based, graphical, cross-platform client for Subversion. Though it is not open-source/free software, the foundation version is free.
TortoiseSVN
Under   Windows,   TortoiseCVS  was   a  CVS  client,   neatly  integrated  into  the
Windows  Explorer.   TortoiseSVN  (http://tortoisesvn.tigris.org/)  is  the
equivalent  for  Subversion.
18.4   GenericObjectEditor
18.4.1   Introduction
As of version 3.4.4 it is possible for WEKA to dynamically discover classes at runtime (rather than using only those specified in the GenericObjectEditor.props (GOE) file). In some versions (3.5.8, 3.6.0) this facility was not enabled by default as it is a bit slower than the GOE file approach, and, furthermore, does not function in environments that do not have a CLASSPATH (e.g., application servers). Later versions (3.6.1, 3.7.0) enabled the dynamic discovery again, as WEKA can now distinguish between being a standalone Java application and being run in a non-CLASSPATH environment.
If you wish to enable or disable dynamic class discovery, the relevant file to edit is GenericPropertiesCreator.props (GPC). You can obtain this file either from the weka.jar or the weka-src.jar archive. Open one of these files with an archive manager that can handle ZIP files (Windows users can use 7-Zip (http://7-zip.org/) for this) and navigate to the weka/gui directory, where the GPC file is located. All that is required is to change the UseDynamic property in this file from false to true (for enabling it) or the other way round (for disabling it). After changing the file, you just place it in your home directory. In order to find out the location of your home directory, do the following:
- Linux/Unix
  - Open a terminal
  - run the following command:
echo $HOME
- Windows
  - Open a command prompt
  - run the following command:
echo %USERPROFILE%
If dynamic class discovery is too slow, e.g., due to an enormous CLASSPATH, you can generate a new GenericObjectEditor.props file and then turn dynamic class discovery off again. It is assumed that you have already placed the GPC file in your home directory (see steps above) and that the weka.jar jar archive with the WEKA classes is in your CLASSPATH (otherwise you have to add it to the java call using the -classpath option).
For generating the GOE file, execute the following steps:
- generate a new GenericObjectEditor.props file using the following command:
  - Linux/Unix
java weka.gui.GenericPropertiesCreator \
$HOME/GenericPropertiesCreator.props \
$HOME/GenericObjectEditor.props
  - Windows (the command must be on one line)
java weka.gui.GenericPropertiesCreator
%USERPROFILE%\GenericPropertiesCreator.props
%USERPROFILE%\GenericObjectEditor.props
- edit the GenericPropertiesCreator.props file in your home directory and set UseDynamic to false.
A limitation of the GOE prior to 3.4.4 was that additional classifiers, filters, etc., had to fit into the same package structure as the already existing ones, i.e., all had to be located below weka. WEKA can now display multiple class hierarchies in the GUI, which makes adding new functionality quite easy, as we will see later in an example (this is not restricted to classifiers only, but also works with all the other entries in the GPC file).
18.4.2   File  Structure
The structure of the GOE file is that of key-value pairs, separated by an equals sign. The value is a comma-separated list of classes that are all derived from the superclass/superinterface key. The GPC is slightly different: instead of declaring all the classes/interfaces, one only needs to specify the packages in which descendants are located (only non-abstract classes are then listed). E.g., the weka.classifiers.Classifier entry in the GOE file looks like this:
weka.classifiers.Classifier=\
weka.classifiers.bayes.AODE,\
weka.classifiers.bayes.BayesNet,\
weka.classifiers.bayes.ComplementNaiveBayes,\
weka.classifiers.bayes.NaiveBayes,\
weka.classifiers.bayes.NaiveBayesMultinomial,\
weka.classifiers.bayes.NaiveBayesSimple,\
weka.classifiers.bayes.NaiveBayesUpdateable,\
weka.classifiers.functions.LeastMedSq,\
weka.classifiers.functions.LinearRegression,\
weka.classifiers.functions.Logistic,\
...
The entry producing the same output for the classifiers in the GPC looks like this (7 lines instead of over 70 for WEKA 3.4.4):
weka.classifiers.Classifier=\
weka.classifiers.bayes,\
weka.classifiers.functions,\
weka.classifiers.lazy,\
weka.classifiers.meta,\
weka.classifiers.trees,\
weka.classifiers.rules
18.4.3   Exclusion
It may not always be desired to list all the classes that can be found along the CLASSPATH. Sometimes, classes cannot be declared abstract but still shouldn't be listed in the GOE. For that reason one can list classes, interfaces or superclasses for certain packages to be excluded from display. This exclusion is done with the following file:
weka/gui/GenericPropertiesCreator.excludes
The format of this properties file is fairly simple:
<key>=<prefix>:<class>[,<prefix>:<class>]
Where the <key> corresponds to a key in the GenericPropertiesCreator.props file and the <prefix> can be one of the following:
- S - Superclass
any class derived from this will be excluded
- I - Interface
any class implementing this interface will be excluded
- C - Class
exactly this class will be excluded
Here  are  a  few  examples:
#   exclude   all   ResultListeners  that   also   implement  the   ResultProducer  interface
#   (all   ResultProducers  do   that!)
weka.experiment.ResultListener=\
I:weka.experiment.ResultProducer
#   exclude   J48   and   all   SingleClassifierEnhancers
weka.classifiers.Classifier=\
C:weka.classifiers.trees.J48,\
S:weka.classifiers.SingleClassifierEnhancer
18.4.4   Class  Discovery
Unlike the Class.forName(String) method that grabs the first class it can find in the CLASSPATH, and therefore fixes the location of the package it found the class in, the dynamic discovery examines the complete CLASSPATH you are starting the Java Virtual Machine (= JVM) with. This means that you can have several parallel directories with the same WEKA package structure, e.g., the standard release of WEKA in one directory (/distribution/weka.jar) and another one with your own classes (/development/weka/...), and display all of the classifiers in the GUI. In case of a name conflict, i.e., two directories contain the same class, the first one that can be found is used. In a nutshell, your java call of the GUIChooser can look like this:
java   -classpath  "/development:/distribution/weka.jar"  weka.gui.GUIChooser
Note: Windows users have to replace the ":" with ";" and the forward slashes with backslashes.
18.4.5   Multiple  Class  Hierarchies
In case you are developing your own framework but still want to use your classifiers within WEKA: that was not possible with WEKA prior to 3.4.4. Starting with the release 3.4.4 it is possible to have multiple class hierarchies displayed in the GUI. If you have developed a modified version of NaiveBayes, let us call it DummyBayes, and it is located in the package dummy.classifiers, then you will have to add this package to the classifiers list in the GPC file like this:
weka.classifiers.Classifier=\
weka.classifiers.bayes,\
weka.classifiers.functions,\
weka.classifiers.lazy,\
weka.classifiers.meta,\
weka.classifiers.trees,\
weka.classifiers.rules,\
dummy.classifiers
Your  java  call  for  the  GUIChooser  might  look  like  this:
java   -classpath  "weka.jar:dummy.jar"  weka.gui.GUIChooser
Starting up the GUI you will now have another root node in the tree view of the classifiers, called root, and below it the weka and the dummy package hierarchies.
18.4.6   Capabilities
Version 3.5.3 of Weka introduced the notion of Capabilities. Capabilities basically list what kind of data a certain object can handle, e.g., one classifier can handle numeric classes, but another cannot. In case a class supports capabilities, the additional buttons Filter... and Remove filter will be available in the GOE. The Filter... button pops up a dialog which lists all available Capabilities.
One can then choose those capabilities an object, e.g., a classifier, should have. If one is looking for a classification problem, then the Nominal class Capability can be selected. On the other hand, if one needs a regression scheme, then the Capability Numeric class can be selected. This filtering mechanism makes the search for an appropriate learning scheme easier. After applying the filter, the tree with the objects is displayed again, listing all objects that can handle all the selected Capabilities in black, the ones that cannot in grey, and the ones that might be able to handle them in blue (e.g., meta classifiers, which depend on their base classifier(s)).
18.5   Properties
A properties file is a simple text file with this structure:
<key>=<value>
Comments  start  with  the  hash  sign  #.
To  make  a  rather  long  property  line  more  readable,  one  can  use  a  backslash  to
continue  on  the  next  line.   The  Filter  property,  e.g.,  looks  like  this:
weka.filters.Filter=  \
weka.filters.supervised.attribute,  \
weka.filters.supervised.instance,  \
weka.filters.unsupervised.attribute,  \
weka.filters.unsupervised.instance
18.5.1   Precedence
The Weka property files (extension .props) are searched for in the following order:
- current directory
- the user's home directory (*nix $HOME, Windows %USERPROFILE%)
- the class path (normally the weka.jar file)
If Weka encounters those files it only supplements the properties, never overrides them. In other words, a property in the property file of the current directory has a higher precedence than the one in the user's home directory.
Note: Under Cygwin (http://cygwin.com/), the home directory is still the Windows one, since the Java installation will still be one for Windows.
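Programmatically, WEKA merges the locations listed above via weka.core.Utils; a minimal sketch (assuming weka.jar is on the CLASSPATH):

import java.util.Properties;
import weka.core.Utils;

public class ShowProps {
  public static void main(String[] args) throws Exception {
    // looks the file up in the current directory, the home directory
    // and the classpath, merging all properties that are found
    Properties props = Utils.readProperties("weka/gui/LookAndFeel.props");
    props.list(System.out);  // print the resulting key-value pairs
  }
}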
18.5.2   Examples
- weka/gui/LookAndFeel.props
- weka/gui/GenericPropertiesCreator.props
- weka/gui/beans/Beans.props
18.6   XML
Weka   now  supports   XML  (http://www.w3c.org/XML/)   (eXtensible   Markup
Language)  in  several  places.
18.6.1   Command  Line
WEKA now allows Classifiers and Experiments to be started using an -xml option followed by a filename to retrieve the command line options from the XML file instead of the command line.
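For example, instead of passing the options directly, one can store them in a file and start a classifier like this (hypothetical file name):

java weka.classifiers.trees.J48 -xml myoptions.xml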
For such simple classifiers as, e.g., J48 this looks like overkill, but as soon as one uses Meta-Classifiers or Meta-Meta-Classifiers the handling gets tricky and one spends a lot of time looking for missing quotes. With the hierarchical structure of XML files it is simple to plug in other classifiers by just exchanging tags.
The  DTD  for  the  XML  options  is  quite  simple:
<!DOCTYPE  options
[
<!ELEMENT  options   (option)*>
<!ATTLIST  options   type   CDATA   "classifier">
<!ATTLIST  options   value   CDATA   "">
<!ELEMENT  option   (#PCDATA  |   options)*>
<!ATTLIST  option   name   CDATA   #REQUIRED>
<!ATTLIST  option   type   (flag   |   single   |   hyphens   |   quotes)   "single">
]
>
The type attribute of the option tag needs some explanation. There are currently four different types of options in WEKA:
- flag
The simplest option that takes no arguments, like, e.g., the -V flag for inverting a selection.
<option name="V" type="flag"/>
- single
The option takes exactly one parameter, directly following after the option, e.g., for specifying the training file with -t somefile.arff. Here the parameter value is just put between the opening and closing tag. Since single is the default value for the type attribute we don't need to specify it explicitly.
<option name="t">somefile.arff</option>
- hyphens
Meta-Classifiers like AdaBoostM1 take another classifier as option with the -W option, where the options for the base classifier follow after the --. And here is where the fun starts: where to put the parameters for the base classifier if the Meta-Classifier itself is a base classifier for another Meta-Classifier?
E.g., -W weka.classifiers.trees.J48 -- -C 0.001 becomes this:
<option   name="W"  type="hyphens">
<options  type="classifier"  value="weka.classifiers.trees.J48">
<option   name="C">0.001</option>
</options>
</option>
Internally,  all  the  options  enclosed  by  the  options tag  are  pushed  to  the
end  after  the  --  if  one  transforms  the  XML  into  a  command  line  string.
- quotes
A Meta-Classifier like Stacking can take several -B options, where each single one encloses other options in quotes (and this can itself contain a Meta-Classifier!). From -B "weka.classifiers.trees.J48" we then get this XML:
<option   name="B"  type="quotes">
<options  type="classifier"  value="weka.classifiers.trees.J48"/>
</option>
With the XML representation one doesn't have to worry anymore about the level of quotes one is using and therefore doesn't have to care about the correct escaping (i.e., ... \" ... \" ...) since this is done automatically.
And if we now put it all together, a more complicated command line (java call and CLASSPATH omitted) can be transformed into XML like this:
<options   type="class"   value="weka.classifiers.meta.Stacking">
<option   name="B"   type="quotes">
<options   type="classifier"   value="weka.classifiers.meta.AdaBoostM1">
<option   name="W"   type="hyphens">
<options   type="classifier"   value="weka.classifiers.trees.J48">
<option   name="C">0.001</option>
</options>
</option>
</options>
</option>
<option   name="B"   type="quotes">
<options   type="classifier"   value="weka.classifiers.meta.Bagging">
<option   name="W"   type="hyphens">
<options   type="classifier"   value="weka.classifiers.meta.AdaBoostM1">
<option   name="W"   type="hyphens">
<options   type="classifier"   value="weka.classifiers.trees.J48"/>
</option>
</options>
</option>
</options>
</option>
<option   name="B"   type="quotes">
<options   type="classifier"   value="weka.classifiers.meta.Stacking">
<option   name="B"   type="quotes">
<options   type="classifier"   value="weka.classifiers.trees.J48"/>
</option>
</options>
</option>
<option   name="t">test/datasets/hepatitis.arff</option>
</options>
Note: The type and value attributes of the outermost options tag are not used while reading the parameters. They are merely for documentation purposes, so that one knows which class was actually started from the command line.
Responsible  Class(es):
weka.core.xml.XMLOptions
18.6.2   Serialization  of  Experiments
It is now possible to serialize the Experiments from the WEKA Experimenter not only in the proprietary binary format Java offers with serialization (with this you run into problems trying to read old experiments with a newer WEKA version, due to different SerialUIDs), but also in XML. There are currently two different ways to do this:
- built-in
The built-in serialization captures only the necessary information of an experiment and doesn't serialize anything else. Its sole purpose is to save the setup of a specific experiment and it can therefore not store any built models. Thanks to this limitation we'll never run into problems with mismatching SerialUIDs.
This kind of serialization is always available and can be selected via a Filter (*.xml) in the Save/Open dialog of the Experimenter.
The  DTD  is  very  simple  and  looks  like  this  (for  version  3.4.5):
<!DOCTYPE  object[
<!ELEMENT  object   (#PCDATA  |   object)*>
<!ATTLIST  object   name   CDATA   #REQUIRED>
<!ATTLIST  object   class   CDATA   #REQUIRED>
<!ATTLIST  object   primitive  CDATA   "no">
<!ATTLIST  object   array   CDATA   "no">
<!ATTLIST  object   null   CDATA   "no">
<!ATTLIST  object   version   CDATA   "3.4.5">
]>
Prior to versions 3.4.5 and 3.5.0 it looked like this:
<!DOCTYPE  object
[
<!ELEMENT  object   (#PCDATA  |   object)*>
<!ATTLIST  object   name   CDATA   #REQUIRED>
<!ATTLIST  object   class   CDATA   #REQUIRED>
<!ATTLIST  object   primitive  CDATA   "yes">
<!ATTLIST  object   array   CDATA   "no">
]
>
Responsible  Class(es):
weka.experiment.xml.XMLExperiment
for  general   Serialization:
weka.core.xml.XMLSerialization
weka.core.xml.XMLBasicSerialization
- KOML (http://old.koalateam.com/xml/serialization/)
The Koala Object Markup Language (KOML) is published under the LGPL (http://www.gnu.org/copyleft/lgpl.html) and is an alternative way of serializing and deserializing Java Objects in an XML file. Like the normal serialization it serializes everything into XML via an ObjectOutputStream, including the SerialUID of each class. Even though we have the same problems with mismatching SerialUIDs, it is at least possible to edit the XML files by hand and replace the offending IDs with the new ones.
In order to use KOML one only has to assure that the KOML classes are in the CLASSPATH with which the Experimenter is launched. As soon as KOML is present, another Filter (*.koml) will show up in the Save/Open dialog.
The DTD for KOML can be found at http://old.koalateam.com/xml/koml12.dtd
Responsible  Class(es):
weka.core.xml.KOML
The experiment class can of course read those XML files if passed as input or output file (see the options of weka.experiment.Experiment and weka.experiment.RemoteExperiment).
18.6.3   Serialization of Classifiers
The options for models of a classifier, -l for the input model and -d for the output model, now also support XML serialized files. Here we have to differentiate between two different formats:
- built-in
The built-in serialization captures only the options of a classifier but not the built model. With -l one still has to provide a training file, since we only retrieve the options from the XML file. It is possible to add more options on the command line, but no check is performed as to whether they collide with the ones stored in the XML file.
The file is expected to end with .xml (see the example after this list).
- KOML
Since the KOML serialization captures everything of a Java Object, we can use it just like the normal Java serialization.
The file is expected to end with .koml.
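For illustration, a round trip with the built-in format might look like this (hypothetical file names; note that with -l a training file must still be supplied):

java weka.classifiers.trees.J48 -t train.arff -d j48options.xml
java weka.classifiers.trees.J48 -l j48options.xml -t train.arff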
The built-in serialization can be used in the Experimenter for loading/saving options of algorithms that have been added to a Simple Experiment. Unfortunately it is not possible to create a hierarchical structure like the one mentioned in Section 18.6.1. This is because of the loss of information caused by the getOptions() method of classifiers: it returns only a flat String array and not a tree structure.
Responsible  Class(es):
weka.core.xml.KOML
weka.classifiers.xml.XMLClassifier
18.6.4   Bayesian  Networks
The GraphVisualizer (weka.gui.graphvisualizer.GraphVisualizer) can save graphs into the Interchange Format for Bayesian Networks (BIF)
(http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/).
If started from the command line with an XML filename as the first parameter (and not from the Explorer), it can display the given file directly.
The  DTD  for  BIF  is  this:
<!DOCTYPE  BIF   [
<!ELEMENT  BIF   (   NETWORK   )*>
<!ATTLIST  BIF   VERSION   CDATA   #REQUIRED>
<!ELEMENT  NETWORK   (   NAME,   (   PROPERTY  |   VARIABLE   |   DEFINITION  )*   )>
<!ELEMENT  NAME   (#PCDATA)>
<!ELEMENT  VARIABLE   (   NAME,   (   OUTCOME  |   PROPERTY  )*   )   >
<!ATTLIST  VARIABLE  TYPE   (nature|decision|utility)  "nature">
<!ELEMENT  OUTCOME   (#PCDATA)>
<!ELEMENT  DEFINITION  (   FOR   |   GIVEN   |   TABLE   |   PROPERTY  )*   >
<!ELEMENT  FOR   (#PCDATA)>
<!ELEMENT  GIVEN   (#PCDATA)>
<!ELEMENT  TABLE   (#PCDATA)>
<!ELEMENT  PROPERTY   (#PCDATA)>
]>
Responsible  Class(es):
weka.classifiers.bayes.BayesNet#toXMLBIF03()
weka.classifiers.bayes.net.BIFReader
weka.gui.graphvisualizer.BIFParser
18.6.5   XRFF files
With Weka 3.5.4 a new, more feature-rich, XML-based data format got introduced: XRFF. For more information, please see Chapter 10.
Chapter  19
Other  resources
19.1   Mailing  list
The WEKA Mailing list can be found here:
- http://list.scms.waikato.ac.nz/mailman/listinfo/wekalist
for subscribing/unsubscribing to the list
- https://list.scms.waikato.ac.nz/pipermail/wekalist/
(Mirrors: http://news.gmane.org/gmane.comp.ai.weka,
http://www.nabble.com/WEKA-f435.html)
for searching previously posted messages
Before posting, please read the Mailing List Etiquette:
http://www.cs.waikato.ac.nz/~ml/weka/mailinglist_etiquette.html
19.2   Troubleshooting
Here are a few things that are useful to know when you are having trouble installing or running Weka successfully on your machine.
NB: these java commands refer to ones executed in a shell (bash, command prompt, etc.) and NOT to commands executed in the SimpleCLI.
19.2.1   Weka  download  problems
When you download Weka, make sure that the resulting file size is the same as on our webpage. Otherwise things won't work properly. Apparently some web browsers have trouble downloading Weka.
19.2.2   OutOfMemoryException
Most Java virtual machines only allocate a certain maximum amount of memory
to  run  Java  programs.   Usually  this  is  much  less  than  the  amount  of   RAM  in
your  computer.   However,  you  can  extend  the  memory  available  for  the  virtual
machine by setting appropriate options. With Sun's JDK, for example, you can go
java   -Xmx100m   ...
to set the maximum Java heap size to 100MB. For more information about these
options  see  http://java.sun.com/docs/hotspot/VMOptions.html.
19.2.2.1   Windows
Book  version
You have to modify the JVM invocation in the RunWeka.bat batch file in your installation directory.
Developer  version
- up to Weka 3.5.2
just like the book version.
- Weka 3.5.3
You have to modify the link in the Windows Start menu if you're starting the console-less Weka (only the link with console in its name executes the RunWeka.bat batch file).
- Weka 3.5.4 and higher
Due to the new launching scheme, you no longer modify the batch file, but the RunWeka.ini file. In that particular file, you'll have to change the maxheap placeholder. See section 18.2.2.
19.2.3   Mac OSX
In your Weka installation directory (weka-3-x-y.app) locate the Contents
subdirectory and edit the Info.plist file. Near the bottom of the file you
should see some text like:

<key>VMOptions</key>
<string>-Xmx256M</string>

Alter the 256M to something higher.
19.2.4   StackOverflowError
Try increasing the stack of your virtual machine. With Sun's JDK you can use
this command to increase the stack size:

java -Xss512k ...

to set the maximum Java stack size to 512KB. If still not sufficient, slowly
increase it.
19.2.5   just-in-time (JIT) compiler
For maximum enjoyment, use a virtual machine that incorporates a just-in-time
compiler. This can speed things up quite significantly. Note also that there
can be large differences in execution time between different virtual machines.
19.2.6   CSV file conversion
Either load the CSV file in the Explorer or use the CSV converter on the
command line as follows:
java   weka.core.converters.CSVLoader  filename.csv  >   filename.arff
19.2.7   ARFF file doesn't load
One way to figure out why ARFF files are failing to load is to give them to
the Instances class. At the command line type the following:

java weka.core.Instances filename.arff

where you substitute 'filename' for the actual name of your file. This should
return an error if there is a problem reading the file, or show some statistics
if the file is ok. The error message you get should give some indication of
what is wrong.
19.2.8   Spaces in labels of ARFF files
A common problem people have with ARFF files is that labels can only have
spaces if they are enclosed in quotes, i.e. a label such as:

some value

should be written either 'some value' or "some value" in the file.
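For instance, a nominal attribute declaration using such a quoted label might
look like this (a made-up attribute, purely for illustration):

@attribute outlook {'very sunny', overcast, rainy}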
19.2.9   CLASSPATH problems
Having problems getting Weka to run from a DOS/UNIX command prompt?
Getting java.lang.NoClassDefFoundError exceptions? Most likely your
CLASSPATH environment variable is not set correctly - it needs to point to
the weka.jar file that you downloaded with Weka (or the parent of the Weka
directory if you have extracted the jar). Under DOS this can be achieved with:

set CLASSPATH=c:\weka-3-4\weka.jar;%CLASSPATH%

Under UNIX/Linux something like:

export CLASSPATH=/home/weka/weka.jar:$CLASSPATH

An easy way to avoid setting the variable is to specify the CLASSPATH when
calling Java. For example, if the jar file is located at c:\weka-3-4\weka.jar
you can use:

java -cp c:\weka-3-4\weka.jar weka.classifiers... etc.
See  also  Section  18.2.
19.2.10   Instance ID
People often want to 'tag' their instances with identifiers, so they can keep
track of them and the predictions made on them.

19.2.10.1   Adding the ID
A new ID attribute is added easily: one only needs to run the AddID filter
over the dataset and it's done. Here's an example (at a DOS/Unix command
prompt):

java weka.filters.unsupervised.attribute.AddID
     -i data_without_id.arff
     -o data_with_id.arff

(all on a single line).
Note: the AddID filter adds a numeric attribute, not a String attribute, to
the dataset. If you want to remove this ID attribute for the classifier in a
FilteredClassifier environment again, use the Remove filter instead of the
RemoveType filter (same package).
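For example, assuming the ID ended up as the first attribute of
data_with_id.arff (as it does after running AddID as above), a call along
these lines removes it again before the base classifier sees the data:

java weka.classifiers.meta.FilteredClassifier
     -F "weka.filters.unsupervised.attribute.Remove -R 1"
     -W weka.classifiers.trees.J48
     -t data_with_id.arff

(all on a single line).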
19.2.10.2   Removing the ID
If you run from the command line you can use the -p option to output
predictions plus any other attributes you are interested in. So it is possible
to have a string attribute in your data that acts as an identifier. A problem
is that most classifiers don't like String attributes, but you can get around
this by using the RemoveType filter (this removes String attributes by
default).
Here's an example. Let's say you have a training file named train.arff,
a testing file named test.arff, and they have an identifier String attribute
as their 5th attribute. You can get the predictions from J48 along with the
identifier strings by issuing the following command (at a DOS/Unix command
prompt):

java weka.classifiers.meta.FilteredClassifier
     -F weka.filters.unsupervised.attribute.RemoveType
     -W weka.classifiers.trees.J48
     -t train.arff -T test.arff -p 5

(all on a single line).
If you want, you can redirect the output to a file by adding '> output.txt'
to the end of the line.
In the Explorer GUI you could try a similar trick of using the String
attribute identifiers here as well. Choose the FilteredClassifier, with
RemoveType as the filter, and whatever classifier you prefer. When you
visualize the results you will need to click through each instance to see the
identifier listed for each.
19.2.11   Visualization
Access to visualization from the ClassifierPanel, ClusterPanel and
AttributeSelection panel is available from a popup menu. Click the right
mouse button over an entry in the Result list to bring up the menu. You will
be presented with options for viewing or saving the text output and, depending
on the scheme, further options for visualizing errors, clusters, trees etc.
19.2.12   Memory consumption and Garbage collector
There is the ability to print how much memory is available in the Explorer
and Experimenter and to run the garbage collector. Just right click over the
Status area in the Explorer/Experimenter.
19.2.13   GUIChooser starts but not Experimenter or Explorer
The GUIChooser starts, but Explorer and Experimenter don't start and output
an Exception like this in the terminal:
/usr/share/themes/Mist/gtk-2.0/gtkrc:48:  Engine   "mist"   is   unsupported,  ignoring
---Registering  Weka   Editors---
java.lang.NullPointerException
at   weka.gui.explorer.PreprocessPanel.addPropertyChangeListener(PreprocessPanel.java:519)
at   javax.swing.plaf.synth.SynthPanelUI.installListeners(SynthPanelUI.java:49)
at   javax.swing.plaf.synth.SynthPanelUI.installUI(SynthPanelUI.java:38)
at   javax.swing.JComponent.setUI(JComponent.java:652)
at   javax.swing.JPanel.setUI(JPanel.java:131)
...
This behavior happens only under Java 1.5 and Gnome/Linux; KDE doesn't
produce this error. The reason for this is that Weka tries to look more
'native' and therefore sets a platform-specific Swing theme. Unfortunately,
this doesn't seem to be working correctly in Java 1.5 together with Gnome. A
workaround for this is to set the cross-platform Metal theme.
In order to use another theme one only has to create the following properties
file in one's home directory:

LookAndFeel.props

With this content:
Theme=javax.swing.plaf.metal.MetalLookAndFeel
19.2.14   KnowledgeFlow toolbars are empty
In the terminal, you will most likely see this output as well:

Failed to instantiate: weka.gui.beans.Loader

This behavior can happen under Gnome with Java 1.5; see Section 19.2.13 for
a solution.
19.2.15   Links

- Java VM options (http://java.sun.com/docs/hotspot/VMOptions.html)
Bibliography
[1] Witten, I.H. and Frank, E. (2005) Data Mining: Practical machine learning
    tools and techniques. 2nd edition. Morgan Kaufmann, San Francisco.

[2] WekaWiki - http://weka.wikispaces.com/

[3] Weka Examples - A collection of example classes, as part of an ANT
    project, included in the WEKA snapshots (available for download on the
    homepage) or directly from subversion
    https://svn.scms.waikato.ac.nz/svn/weka/branches/stable-3-6/wekaexamples/

[4] J. Platt (1998): Fast Training of Support Vector Machines using Sequential
    Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola,
    editors, Advances in Kernel Methods - Support Vector Learning.

[5] Drummond, C. and Holte, R. (2000) Explicitly representing expected cost:
    An alternative to ROC representation. Proceedings of the Sixth ACM
    SIGKDD International Conference on Knowledge Discovery and Data Mining.
    Publishers, San Mateo, CA.

[6] Extensions for Weka's main GUI on WekaWiki -
    http://weka.wikispaces.com/Extensions+for+Weka%27s+main+GUI

[7] Adding tabs in the Explorer on WekaWiki -
    http://weka.wikispaces.com/Adding+tabs+in+the+Explorer

[8] Explorer visualization plugins on WekaWiki -
    http://weka.wikispaces.com/Explorer+visualization+plugins

[9] Bengio, Y. and Nadeau, C. (1999) Inference for the Generalization Error.

[10] Ross Quinlan (1993). C4.5: Programs for Machine Learning, Morgan
     Kaufmann Publishers, San Mateo, CA.

[11] Subversion - http://weka.wikispaces.com/Subversion

[12] HSQLDB - http://hsqldb.sourceforge.net/

[13] MySQL - http://www.mysql.com/

[14] Plotting multiple ROC curves on WekaWiki -
     http://weka.wikispaces.com/Plotting+multiple+ROC+curves

[15] R.R. Bouckaert. Bayesian Belief Networks: from Construction to Inference.
     Ph.D. thesis, University of Utrecht, 1995.

[16] W.L. Buntine. A guide to the literature on learning probabilistic
     networks from data. IEEE Transactions on Knowledge and Data Engineering,
     8:195-210, 1996.

[17] J. Cheng, R. Greiner. Comparing Bayesian network classifiers.
     Proceedings UAI, 101-107, 1999.

[18] C.K. Chow, C.N. Liu. Approximating discrete probability distributions
     with dependence trees. IEEE Trans. on Info. Theory, IT-14:462-467, 1968.

[19] G. Cooper, E. Herskovits. A Bayesian method for the induction of
     probabilistic networks from data. Machine Learning, 9:309-347, 1992.

[20] Cozman. See http://www-2.cs.cmu.edu/~fgcozman/Research/InterchangeFormat/
     for details on XML BIF.

[21] N. Friedman, D. Geiger, M. Goldszmidt. Bayesian Network Classifiers.
     Machine Learning, 29:131-163, 1997.

[22] D. Heckerman, D. Geiger, D.M. Chickering. Learning Bayesian networks:
     the combination of knowledge and statistical data. Machine Learning,
     20(3):197-243, 1995.

[23] S.L. Lauritzen and D.J. Spiegelhalter. Local Computations with
     Probabilities on graphical structures and their applications to expert
     systems (with discussion). Journal of the Royal Statistical Society B,
     50:157-224, 1988.

[24] Moore, A. and Lee, M.S. Cached Sufficient Statistics for Efficient
     Machine Learning with Large Datasets. JAIR, Volume 8, pages 67-91, 1998.

[25] Verma, T. and Pearl, J.: An algorithm for deciding if a set of observed
     independencies has a causal explanation. Proc. of the Eighth Conference
     on Uncertainty in Artificial Intelligence, 323-330, 1992.

[26] GraphViz. See http://www.graphviz.org/doc/info/lang.html for more
     information on the DOT language.

[27] JMathPlot. See http://code.google.com/p/jmathplot/ for more information
     on the project.