Overview:
Data Mining Methods
WEKA Tutorial
§ WEKA: A Machine Learning Toolkit
§ The Explorer
§ Classification and Regression
§ Clustering
§ Association Rules
§ Attribute Selection
§ Data Visualization
§ The Experimenter
§ The Knowledge Flow GUI
§ Conclusions
WEKA - Introduction
§ Machine learning/data mining software written in
Java (distributed under the GNU General Public License)
§ Used for research, education, and applications
§ Main features:
§ Comprehensive set of data pre-processing tools, learning algorithms
and evaluation methods
§ Graphical user interfaces (incl. data visualization)
§ Environment for comparing learning algorithms
Pre-processing the data
§ Data can be imported from a file in various formats:
ARFF, CSV, C4.5, binary
§ Data can also be read from a URL or from an SQL
database (using JDBC)
§ Pre-processing tools in WEKA are called “filters”
§ WEKA contains filters for:
§ Discretization, normalization, resampling, attribute selection,
transforming and combining attributes, …
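Two of the filter types listed above, normalization and discretization, can be sketched in a few lines. This is only an illustration of the ideas, not WEKA's filter code: the function names, the equal-width binning strategy, and the sample values are assumptions made for this example.

```python
def normalize(values):
    """Rescale numeric values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def discretize_equal_width(values, bins):
    """Map each numeric value to a bin index in 0..bins-1 (equal-width bins)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0 for _ in values]
    width = (hi - lo) / bins
    # The maximum value would land in bin `bins`, so clamp it to the last bin.
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Illustrative numeric attribute values (hypothetical sample).
ages = [63, 67, 67, 38]
normalized_ages = normalize(ages)
binned_ages = discretize_equal_width(ages, bins=3)
```

Both functions take one numeric attribute at a time; a real filter would apply the same transformation to every selected attribute of the dataset.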
WEKA with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
…
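To make the structure of the file above concrete, here is a minimal sketch of a reader for this kind of "flat" ARFF file. It is a simplification, not WEKA's loader: the name `parse_arff` is hypothetical, and the sketch assumes unquoted names, only `numeric` and nominal `{...}` attributes, and dense comma-separated rows. Missing values (`?`) become `None`.

```python
def parse_arff(text):
    """Parse a simple ARFF document into (relation, attributes, rows)."""
    relation = None
    attributes = []  # (name, "numeric") or (name, [nominal labels])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # blank line or ARFF comment
            continue
        lower = line.lower()
        if in_data:
            cells = [c.strip() for c in line.split(",")]
            row = []
            for cell, (_, kind) in zip(cells, attributes):
                if cell == "?":            # missing value
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(cell))
                else:                      # nominal value, kept as a string
                    row.append(cell)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                labels = [s.strip() for s in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


SAMPLE = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

The header declares the attribute names and types once, and every row after `@data` lists one value per attribute in the same order.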
Building “Classifiers”
§ Classifiers in WEKA are models for predicting nominal
or numeric quantities
§ Implemented learning schemes include:
§ Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
§ “Meta”-classifiers include:
§ Bagging, boosting, stacking, error-correcting output codes, locally
weighted learning, …
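As a small illustration of one scheme family named above, the following sketches an instance-based (1-nearest-neighbour) classifier that predicts a nominal class. It is not WEKA's implementation; the function name and the tiny training set of (age, cholesterol) pairs are assumptions chosen only to show the mechanics.

```python
import math


def nearest_neighbour_predict(train, query):
    """Return the label of the training instance closest to `query`.

    `train` is a list of (feature_tuple, label) pairs; distance is Euclidean.
    """
    best_label, best_dist = None, math.inf
    for features, label in train:
        dist = math.dist(features, query)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical toy training set: (age, cholesterol) -> class.
train = [
    ((63, 233), "not_present"),
    ((67, 286), "present"),
    ((67, 229), "present"),
]

prediction_a = nearest_neighbour_predict(train, (66, 280))
prediction_b = nearest_neighbour_predict(train, (62, 235))
```

In practice the attributes would usually be normalized first, because otherwise the attribute with the largest numeric range dominates the distance.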
Clustering
§ WEKA contains many clustering implementations
§ They work with both discrete and numeric data
§ Example: k-means
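The k-means procedure can be sketched as follows. This is a minimal version of the standard assign-and-update loop, not WEKA's clusterer; passing the initial centroids explicitly is an assumption made here so the run is reproducible (implementations typically choose them at random), and the sample points are hypothetical.

```python
def kmeans(points, centroids, iterations=10):
    """Cluster tuples of numbers by alternating assignment and centroid update."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(
                range(len(centroids)),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:     # no change: converged
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical 2-d points forming two obvious groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = kmeans(points, centroids=[(1, 1), (8, 8)])
```

The number of clusters is fixed by the number of initial centroids, which is why k must be chosen before the algorithm runs.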
Finding Associations
§ WEKA contains an implementation of the Apriori
algorithm for learning association rules
§ Works only with discrete data
§ Can identify statistical dependencies between groups
of attributes:
§ milk, butter -> bread, eggs (with confidence 0.9 and
support 2000)
§ Apriori can compute all rules that have a given
minimum support and exceed a given confidence
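The two measures attached to the rule above can be computed directly from a list of transactions. The sketch below evaluates support (as an absolute count, matching the "support 2000" usage above) and confidence for a single rule; it does not perform Apriori's search over candidate itemsets. The function names and the five baskets are hypothetical.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    both = support_count(transactions, set(antecedent) | set(consequent))
    ante = support_count(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical market baskets.
transactions = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "bread"},
    {"butter", "eggs"},
]

rule_support = support_count(transactions, {"milk", "butter", "bread", "eggs"})
rule_confidence = confidence(transactions, {"milk", "butter"}, {"bread", "eggs"})
```

Here three baskets contain milk and butter, and two of those also contain bread and eggs, so the rule has support 2 and confidence 2/3.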
Data visualization
§ Visualization is very useful in practice:
§ e.g. it helps to determine the difficulty of the learning problem
§ WEKA can visualize single attributes and pairs of
attributes
§ To do: rotating 3-d visualizations (Xgobi-style)
§ Color-coded class values
§ “Jitter” option to deal with nominal attributes (and
to detect “hidden” data points)
§ “Zoom-in” function
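The idea behind the "jitter" option can be sketched as adding a small random offset to each plotted coordinate, so that points with identical nominal values no longer sit exactly on top of each other. This is an illustration only, not WEKA's plotting code; the function name, the uniform noise, and the fixed seed are assumptions.

```python
import random


def jitter(values, amount, seed=0):
    """Return a copy of `values` with uniform noise in [-amount, amount] added."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# A nominal attribute encoded as 0/1: several points share each position.
encoded = [0, 0, 1, 1]
plotted = jitter(encoded, amount=0.1)
```

Only the plotted positions change; the underlying data values stay as they were.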