机器学习 (Machine Learning), 周志华 (Zhou Zhihua)

1. Introduction

1.1 Machine learning

Machine learning studies how to improve a system's performance on some task by computational means, using experience in the form of data. [Mitchell, 1997] gives a more formal definition: a program is said to learn from experience E with respect to a class of tasks T and a performance measure P if its performance at tasks in T, as measured by P, improves with experience E. The result learned from data is called a model, and the method that produces a model from data is a learning algorithm. (Some literature, e.g. [Hand et al., 2001], uses "model" for global results and "pattern" for local ones; this book uses "model" for any result learned from data.)
1.2

(data set ),
instance)
(sample).
attribute) (feature);
(attribute value).
( attribute space) (sample space)

(feature vector).
In general, let D = {x₁, x₂, ..., x_m} denote a data set of m examples. Each example is described by d attributes; x_i = (x_i1; x_i2; ...; x_id) is a vector in the d-dimensional sample space 𝒳, where x_ij is the value of x_i on the j-th attribute, and d is called the dimensionality of the example.
learning) (training),

j ” (training instance)
( training data), ( training sample),
(training set).
(hypothesis);
-
(ground truth),
learner),

prediction)

“label”
“((
.
“label” example).
(label);
1.2

(xhyi) i ey y
(label space)

(classification);
regression).
(binary classification) ( positive class),
negative class);
multi-class classification)
{(®1,2/l ),(®2,2/2 ),••. (*m ,Vm ) } Y
y f y +1} 1};
y 1R R
testing)
(testing sample).
( testing instance)
f [ x ).
clustering),
(cluster );

13.6

(supervised learning) ( unsupervised learning)

( unseen instance).

generalization)

1020).
(distribution) V
(independent and identically
distributed, P
1.3

(induction) (deduction)
generalization)
specialization)

(inductive learning).

(concept )

1.1

e(
_
A A B ) V (C A D )
=?) A =?) A =?)
1.1
1.1 A A
)”
1.3

[Cohen and Feigenbaum , A A


1983],
1.5
(hypothesis)
fit )

=?) A =?) A =?)


.
Z3 V

e *) A A

M 4 x 3 x 3 + 1 = 37. 1.1

=*)

=. =* =* ) =* )

=* ) =* )

1.1

version space).
1.1 1.2
= *; =*) = *; = *;

=*

1.2

1.4

1.2

) e(
*) A A *)”

1.1

O *) A A

O *) A A *)”.
(inductive bia ),

feature selection)
.
^

1.3
( x, y),
1.4

1.3

1.3
A B.

Occam’s razor )

A y
- B
1.3 A.

*) A A e *)
A * )”

1.3.
A B

1.4 B A
A B
y y

(a) A B (b) B A

v
1.4

1.4(b)
A B
£a ,
£a

£a

^
Y P(h \ X a)
£a
. £a

^ ^^ ^
Consider a binary classification problem where the true target can be any function f: 𝒳 → {0, 1}. The expected off-training-set error of algorithm 𝔏_a trained on data X, with respect to target f, is

E_ote(𝔏_a | X, f) = Σ_h Σ_{x∈𝒳−X} P(x) 𝕀(h(x) ≠ f(x)) P(h | X, 𝔏_a) ,   (1.1)

where 𝕀(·) is the indicator function, 1 if its argument is true and 0 otherwise. Summing over all possible target functions f, assumed uniformly distributed:

Σ_f E_ote(𝔏_a | X, f)
  = Σ_f Σ_h Σ_{x∈𝒳−X} P(x) 𝕀(h(x) ≠ f(x)) P(h | X, 𝔏_a)
  = Σ_{x∈𝒳−X} P(x) Σ_h P(h | X, 𝔏_a) Σ_f 𝕀(h(x) ≠ f(x))
  = Σ_{x∈𝒳−X} P(x) Σ_h P(h | X, 𝔏_a) · ½ 2^{|𝒳|}
  = 2^{|𝒳|−1} Σ_{x∈𝒳−X} P(x) Σ_h P(h | X, 𝔏_a)
  = 2^{|𝒳|−1} Σ_{x∈𝒳−X} P(x) ,   (1.2)

since Σ_h P(h | X, 𝔏_a) = 1. Equation (1.2) shows that the total error summed over all targets does not depend on the learning algorithm, so for any two algorithms 𝔏_a and 𝔏_b,

Σ_f E_ote(𝔏_a | X, f) = Σ_f E_ote(𝔏_b | X, f) .   (1.3)

In other words, however clever 𝔏_a may be and however clumsy 𝔏_b, their expected performance is identical when averaged over all possible problems. This is the No Free Lunch (NFL) theorem [Wolpert, 1996; Wolpert and Macready, 1995].

NFL

A B A •

B A
B

NFL
*)
A A } { e *) A
A NFL
)

)”

NFL
1.5

( artificial intelligence)

A. Newell H. Simon Logic Theorist )


( General Problem Solving)

2.85 . A.
Newell H. Simon

E.A. Feigenbamn

Feigenbaum
. E. A .
DENDRAL . Feigenbamn

A. Samuel
p.22.
comiecticmism)
F.Rosenblatt (Perceptron) B. Widrow
Adaline symbolism)
P. Winston R.S.Michalski
E. B. Hunt

N. J . Nilson

IWML
(IWML);
ICML . Tioga
R. S. Michalski J.G. Carbonell T. Mitchell *

[Michalski et al,1983],
Machine Learning
1.5

Artificial Intelligence
J. G . Carbonell MIT
[Carbonell, 1990]

R.S.Michalski [Michalski et al.,1983]

E.A. Fdgenbaum
[Cohen and Feigenbaum , 1983]

R.S. Michalski

(decision tree)

(Inductive Logic
ILP
Programming, ILP)
Prolog

.ILP

ILP
H.
Simon

M . Minsky S.Papert

J .J . Hopfield NP
D.E. Rumelhart
BP

BP
BP

(statistical learning)
(Support Vector Machine,
SVM) ( kernel methods).
[Vapnik, 1998]
V.N. Vapnik A. J .
Chervonenkis VC

6.5.

( kernel trick )
1.6

5.6

Intel x86

1.6
-
NASA JPL
NASA JPL
-
Science [Mjolsness and DeCoste, 2001]

DARPA PAL
DARPA
NASA DARPA

DARPA

T. Mitchell

(crowdsourcing).

(data mining)

[Zhou, 2003].

F
1.6

[Mitchell,1997]


ALVINN

DARPA
S. Thmn

. S . Thmn

.
R. Ghani

P. Kanerva
SDM (Sparse Distributed Memory) [Kanerva,
1988] SDM

1.7

[Mitchell, 1997]
paydin , 2004; Flach , 2012]
[Duda et al.,
[Hastie et al., 2009]
Al
-
[Bishop , 2006]

WEKA
-
[Shalev-Shwartz and Ben David, 2014] [Witten et al.,
2011] WEKA WEKA
Waikato
JAVA
http: // www.es.waikato. 1.5 1.6 2007].
ac.nz / ml / weka /.
[Michalski et al,1983]
Morgan Kaufmann
1.7

E.A. Feigenbaum
[Cohen and Feigenbaum, 1983]
[Dietterich,1997]

(transfer learning) [Pan and Yang,


2010], (learning by analogy)
deep learning)
5.6

[Hunt and Hovland , 1963]. [Winston , 1970]

[Simon and Lea, 1974]


[Mitchell, 1977]

[Blumer et al., 1996].

[Webb , Domingos, 1999].

( principle of multiple
explanations), [Asmis, 1984],
(ensemble learning)

(ICML)
( NIPS) (COLT)
(ECML) (ACML);
Journal of Machine Learning Research Machine Learning.

gence
IJCAI AAAI
Journal of Artificial Intelligence Research,
Artificial Intelli
-
^
KDD ICDM ACM Transactions on Knowledge Discovery
from Data Data Mining and Knowledge Discovery,
1S -

CVPR IEEE Transactions on Pattern


Analysis and Machine Intelligence, Neural Com
-
putation IEEE Transactions on Neural Networks and Learning Systems
Annals of
Statistics
1996]. 2012]

(CCML)
(MLA);
1.1 1.1

1.2

G *) *) )
(( A *) A
)
)

A A )”
)”
( A = a) V ( A = )
1.1
A *)
*

o 1.3

1.4* 1.4
A
(1.1)

Eote ( 2a \ XJ ) =
h
E-x P ( x )Hh ( x ) J ( x ) ) P ( h \ X ,& )
J2 xex a

1.5
(1996 ) .
. 2007). 3(12 ) :35-44.
( 2012 ) .
Alpaydin , E. ( 2004) . Introduction to Machine Learning. MIT Press, Cambridge,
MA.
Asmis, E. (1984) . Epicurus’ Scientific Method. Cornell University Press , Ithaca,
NY.
Bishop, C. M. ( 2006 ) . Pattern Recognition and Machine Learning. Springer ,
New York , NY.
Blumer , A., A. Ehrenfeucht , D. Haussler, and M. K . Warmuth. (1996) . “Oc-
cam 5s razor.” Information Processing Letters, 24 ( 6 ) :377-380.
Carbonell, J. G., ed. (1990 ) . Machine Learning: Paradigms and Methods. MIT
Press, Cambridge, MA.
Cohen , P. R. and E. A. Feigenbaum , eds. (1983). The Handbook of Artificial
Intelligence , volume 3. William Kaufmann, New York , NY.
Dietterich , T. G. (1997) . “Machine learning research: Four current directions.”
AI Magazine , 18( 4):97 136.
Domingos, P. (1999). “The role of Occam’s razor in knowledge discovery.” Data
Mining and Knowledge Discovery, 3( 4):409-425.
Duda, R. O . P. E. Hart , and D. G. Stork. ( 2001) . Pattern Classification, 2nd
edition. John Wiley Sons, New York , NY.
Flach , P. ( 2012 ) . Machine Learning: The Art and Science of Algorithms that
Make Sense of Data. Cambridge University Press , Cambridge, UK .
Hand , D., H. Mannila, and P. Smyth. (2001) . Principles of Data Mining. MIT
Press, Cambridge, MA.
Hastie, T., R. Tibshirani, and J . Friedman. ( 2009 ) . The Elements of Statistical
Learning 2nd edition. Springer , New York , NY.
Hunt , E. G. and D. I. Hovland. (1963). “Programming a model of human con-
cept formation.” In Computers and Thought (E. Feigenbaum and J. Feldman,
eds.) , 310-325, McGraw Hill, New York, NY.
Kanerva, P. (1988) . Sparse Distributed Memory. MIT Press, Cambridge, MA.
Michalski, R. S., J. G. Carbonell, and T. M. Mitchell, eds. (1983) . Machine
Learning: An Artificial Intelligence Approach. Tioga, Palo Alto, CA.
Mitchell, T. (1997) . Machine Learning. McGraw Hill, New York , NY.
Mitchell, T. M. (1977) . “Version spaces: A candidate elimination approach to
rule learning.” In Proceedings of the 5th International Joint Conference on
Artificial Intelligence ( IJCAI ) , 305-310, Cambridge, MA.
Mjolsness, E. and D. DeCoste. (2001) . “Machine learning for science: State of
the art and future prospects.” Science , 293(5537):2051-2055.
Pan, S. J. and Q. Yang. (2010) . UA survey of transfer learning.” IEEE Trans-
actions on Knowledge and Data Engineering , 22 (10):1345-1359.
Shalev-Shwartz , S. and S. Ben-David. (2014) . Understanding Machine Learn-
ing. Cambridge University Press, Cambridge, UK.
Simon , H. A. and G. Lea. (1974) . “Problem solving and rule induction: A
unified view.” In Knowledge and Cognition ( L. W. Gregg, ed.) , 105 127, -

Erlbaum , New York, NY.


Vapnik , V. N. (1998). Statistical Learning Theory. Wiley, New York, NY.
Webb, G. I. (1996). “Further experimental evidence against the utility of Oc-
cam s razor. Journal of Artificial Intelligence Research 43:397-417.
Winston , P. H. (1970). “Learning structural descriptions from examples.” Tech-
nical Report AI-TR-231, AI Lab, MIT, Cambridge, MA.
Witten , I. H., E. Frank , and M. A. Hall. ( 2011) . Data Mining : Practical Ma-
chine Leaving Tools and Techniques , 3rd edition. Elsevier, Burlington , MA.
Wolpert , D. H. (1996). “The lack of a priori distinctions between learning al-
gorithms.” Neural Computation, 8( 7) :1341-1390.
Wolpert , D. H. and W. G. Macready. (1995). “No free lunch theorems for
search.” Technical Report SFI-TR-05-010, Santa Fe Institute, Sante Fe,
NM.
Zhou , Z.-H. ( 2003). “Three perspectives of data mining.” Artificial Intelligence ,
143(1):139-146.
.•
Arthur Samuel, 1990)
IBM

( John McCarthy,

“ Some studies in machine learning using the game of checkers” 1959


IBM Journal • Edward Feigenbaum,
Computers and Thought,

IBM
2.1

(error
rate), m a a /m;

(1
1 - a /m accuracy),
error)
training error )
(empirical error) generalization
error ).

overfitting) underfitting)
2.1

NP
Pf

2.1

“ P= NP” “P NP”

model
selection)

2.2

( testing set )
testing error )
2.2

m _D ={(a i yi )
( x m y y m ) },
S T.

2.2.1

-
hold out ) D
T, D = SUT,snr = 0. s
r
D S
T 5
* T
(90/300) x -

(sampling)
(stratified
sampling). D S
T, D
S r
s r

D
S
2.6
r
r s D
D
(fidelity ).
[Mitchell,
1997].

2.2 .2
( cross validation) D A:
D Di U Z)2 U . . . U Dfc , Di n j). A
D
k-1 A

A:
A: tfold cross
“ fc
validation), k A:
[Figure 2.2: Illustration of 10-fold cross-validation: D is split into D₁, ..., D₁₀; each fold in turn serves as the test set while the other nine form the training set, and the ten test results are averaged.]

D
/c

P p

D m k m

- -
(Leave One Out , LOO).
2.2

m m
.
D

• m

2.2.

NFL 1.4
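To make the protocol concrete, here is a minimal NumPy sketch of k-fold cross-validation. The helper names and the error-rate scoring are illustrative choices, not from the text; fit and predict stand for any learner.

```python
import numpy as np

def k_fold_indices(m, k, seed=0):
    """Shuffle indices 0..m-1 and split them into k folds of (almost) equal size."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(m), k)

def cross_validate(fit, predict, X, y, k=10):
    """Mean test error rate over k folds.

    fit(X_train, y_train) -> model;  predict(model, X_test) -> predicted labels.
    """
    folds = k_fold_indices(len(X), k)
    errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[test_idx])
        errors.append(np.mean(y_hat != y[test_idx]))   # error rate on the held-out fold
    return float(np.mean(errors))
```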

2.2.3 Bootstrapping

Hold-out and cross-validation train on a subset of D, so the evaluated model uses fewer examples than the model we would finally deploy. Bootstrapping addresses this, with bootstrap sampling [Efron and Tibshirani, 1993] as its basis. Given a data set D of m examples, we build D' by sampling with replacement: each round, pick one example from D uniformly at random, copy it into D', and put it back so it may be picked again; after m rounds D' also contains m examples. Some examples of D appear several times in D', others not at all. The probability that a given example is never picked in m rounds is (1 − 1/m)^m, and

lim_{m→∞} (1 − 1/m)^m = 1/e ≈ 0.368 ,   (2.1)

so roughly 36.8% of the examples in D never appear in D'. We can then train on D' and test on D \ D'; the resulting estimate is called the out-of-bag estimate.
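A minimal sketch of bootstrap sampling and the resulting out-of-bag split, assuming the data set is referenced by indices 0..m−1 (the function name is illustrative):

```python
import numpy as np

def bootstrap_split(m, seed=0):
    """Sample m indices with replacement; untouched indices form the out-of-bag set."""
    rng = np.random.default_rng(seed)
    train_idx = rng.integers(0, m, size=m)   # D': m draws with replacement
    oob_mask = np.ones(m, dtype=bool)
    oob_mask[train_idx] = False              # indices never drawn
    return train_idx, np.flatnonzero(oob_mask)   # (D', D \ D')

train_idx, test_idx = bootstrap_split(1000)
print(len(test_idx) / 1000)   # roughly 0.368, matching (2.1)
```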
2.2.4

(parameter )

( parameter tuning).

[0,0.2] 0.05

m D

validation set ).

2.3

(performance measure).
2.3

Given a data set D = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)}, where y_i is the true label of x_i, we evaluate a learner by comparing its prediction f(x) with y.

For regression, the most common performance measure is the mean squared error

E(f; D) = (1/m) Σ_{i=1}^{m} ( f(x_i) − y_i )² .   (2.2)

More generally, for a data distribution 𝒟 with density p(·),

E(f; 𝒟) = ∫_{x∼𝒟} ( f(x) − y )² p(x) dx .   (2.3)

2.3.1 Error rate and accuracy

Error rate and accuracy are the two most common measures for classification. For data set D, the error rate is

E(f; D) = (1/m) Σ_{i=1}^{m} 𝕀( f(x_i) ≠ y_i ) ,   (2.4)

and the accuracy is

acc(f; D) = (1/m) Σ_{i=1}^{m} 𝕀( f(x_i) = y_i ) = 1 − E(f; D) .   (2.5)

More generally, for a distribution 𝒟 with density p(·),

E(f; 𝒟) = ∫_{x∼𝒟} 𝕀( f(x) ≠ y ) p(x) dx ,   (2.6)

acc(f; 𝒟) = ∫_{x∼𝒟} 𝕀( f(x) = y ) p(x) dx = 1 − E(f; 𝒟) .   (2.7)
2.3.2 F1

Web

precision)
recall)

For binary classification, the combination of the true class and the predicted class divides the examples into true positives, false positives, true negatives, and false negatives, with counts TP, FP, TN, FN summing to the total number of examples. The four counts form the confusion matrix of Table 2.1.

Table 2.1: Confusion matrix for binary classification

                     predicted positive   predicted negative
actual positive             TP                   FN
actual negative             FP                   TN

Precision P and recall R are defined as

P = TP / (TP + FP) ,   (2.8)

R = TP / (TP + FN) .   (2.9)
^
Precision and recall are in tension: raising one usually lowers the other. Plotting precision (vertical axis) against recall (horizontal axis) gives the precision-recall curve, or P-R chart, as in Figure 2.3.

[Figure 2.3: P-R curves and break-even points (schematic), learners A, B, C]

P-R charts give an intuitive comparison: if the P-R curve of learner A completely encloses that of learner C, then A performs better than C. If two curves cross, as A and B do in Figure 2.3, no general conclusion can be drawn, although the areas under the curves can be compared. A simpler summary is the Break-Even Point (BEP), the value at which precision equals recall; in Figure 2.3, C's BEP is 0.64, and by BEP learner A is better than B. BEP is rather simplistic, though, and the F1 measure is more commonly used:

F1 = 2 × P × R / (P + R) = 2 × TP / (number of examples + TP − TN) .   (2.10)

F1 is the harmonic mean of precision and recall: 1/F1 = (1/2)(1/P + 1/R). When precision and recall are not equally important, the general form F_β gives recall β times the weight of precision:

F_β = (1 + β²) × P × R / ( (β² × P) + R ) ,   (2.11)

where β > 0 [Van Rijsbergen, 1979]; β = 1 recovers F1, β > 1 emphasizes recall, and β < 1 emphasizes precision. F_β is the weighted harmonic mean: 1/F_β = 1/(1 + β²) · (1/P + β²/R).

Often we have several binary confusion matrices, e.g. from multiple training/test rounds or multiple data sets, and want an overall precision and recall. One approach computes (P_i, R_i) on each of the n confusion matrices and averages them, giving the macro-averaged measures

macro-P = (1/n) Σ_{i=1}^{n} P_i ,   (2.12)

macro-R = (1/n) Σ_{i=1}^{n} R_i ,   (2.13)

macro-F1 = 2 × macro-P × macro-R / ( macro-P + macro-R ) .   (2.14)

Alternatively, one first averages the counts of the confusion matrices, obtaining TP̄, FP̄, TN̄, FN̄, and then computes the micro-averaged measures

micro-P = TP̄ / ( TP̄ + FP̄ ) ,   (2.15)

micro-R = TP̄ / ( TP̄ + FN̄ ) ,   (2.16)

micro-F1 = 2 × micro-P × micro-R / ( micro-P + micro-R ) .   (2.17)
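A small sketch that computes (2.8)-(2.10) from one confusion matrix and (2.12)-(2.17) from several; the tuple-based interface is an illustrative choice, not from the text.

```python
import numpy as np

def prf(tp, fp, fn):
    """Precision, recall and F1 from one confusion matrix: (2.8), (2.9), (2.10)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

def macro_micro(confusions):
    """confusions: list of (TP, FP, FN) tuples, one per binary confusion matrix."""
    ps, rs = zip(*[prf(*c)[:2] for c in confusions])
    macro_p, macro_r = np.mean(ps), np.mean(rs)               # (2.12), (2.13)
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)    # (2.14)
    tp, fp, fn = np.sum(confusions, axis=0)                   # averaging the counts gives the same ratios
    micro = prf(tp, fp, fn)                                   # (2.15)-(2.17)
    return (macro_p, macro_r, macro_f1), micro
```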

2.3 . 3 ROC AUC

(threshold)
[0.0,1.0]
0.5 0.5

cut point )

ROC

ROC Receiver Operating Characteristic)

[Spackman , 1989]. 2.3.2 -


PR

“ROC

-
PR
The vertical axis of the ROC curve is the True Positive Rate (TPR) and the horizontal axis is the False Positive Rate (FPR); with the notation of Table 2.1,

TPR = TP / ( TP + FN ) ,   (2.18)

FPR = FP / ( TN + FP ) .   (2.19)
The plot showing the ROC curve is called an ROC chart; Figure 2.4 gives an example, where the diagonal corresponds to a random-guessing model and the point (0, 1) corresponds to an ideal model that ranks every positive example ahead of every negative one.

[Figure 2.4: (a) ROC curve and AUC; (b) ROC curve drawn from a finite set of test examples]

ROC
2.4 ROC
2.4 ROC m+
-
PR
m—
AUC 0,0)

(x,y),

^
( 2/ + r )

ROC
^ PR
- ROC

As with P-R charts, if one learner's ROC curve completely encloses another's, the former is better; if the curves cross, a reasonable criterion is to compare the areas under the ROC curves, i.e. the AUC (Area Under ROC Curve), shown in Figure 2.4. From a finite sample the AUC is estimated by summing the areas of the trapezoids under the curve: given the points {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)} on the ROC curve (with x₁ = 0, x_m = 1), as in Figure 2.4(b),

AUC = ½ Σ_{i=1}^{m−1} ( x_{i+1} − x_i )( y_i + y_{i+1} ) .   (2.20)

AUC is closely related to ranking quality. Given m⁺ positive and m⁻ negative examples, define the ranking loss

ℓ_rank = (1/(m⁺ m⁻)) Σ_{x⁺∈D⁺} Σ_{x⁻∈D⁻} ( 𝕀( f(x⁺) < f(x⁻) ) + ½ 𝕀( f(x⁺) = f(x⁻) ) ) ,   (2.21)

i.e. a full penalty whenever a positive example is scored below a negative one and half a penalty for a tie. ℓ_rank corresponds to the area above the ROC curve, so

AUC = 1 − ℓ_rank .   (2.22)
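A direct computation of AUC from classifier scores via (2.21)-(2.22); this brute-force pairwise version is an illustrative sketch (an O(m log m) sort-based variant would be used for large data).

```python
import numpy as np

def auc_from_scores(scores_pos, scores_neg):
    """AUC of a real-valued scorer, computed as 1 - l_rank from (2.21)/(2.22)."""
    sp = np.asarray(scores_pos, float)[:, None]   # f(x+), shape (m+, 1)
    sn = np.asarray(scores_neg, float)[None, :]   # f(x-), shape (1, m-)
    l_rank = np.mean((sp < sn) + 0.5 * (sp == sn))   # penalty 1 per reversed pair, 0.5 per tie
    return 1.0 - l_rank

# a scorer that ranks all positives above all negatives has AUC 1
print(auc_from_scores([0.9, 0.8], [0.3, 0.1, 0.2]))   # -> 1.0
```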

2.3 . 4

unequal cost ).

(cost matrix), 2.2 costij i j


costa
COStoi > COStio COS % costio
costoi costio
2.2

costoi
costio
Analogously to (2.4), under unequal costs we no longer simply minimize the number of mistakes but the total cost. Taking Table 2.2 as an example, with class 0 as the positive class and class 1 as the negative class, and D⁺ and D⁻ the subsets of positive and negative examples, the cost-sensitive error rate is

E(f; D; cost) = (1/m) ( Σ_{x_i∈D⁺} 𝕀( f(x_i) ≠ y_i ) × cost₀₁ + Σ_{x_i∈D⁻} 𝕀( f(x_i) ≠ y_i ) × cost₁₀ ) .   (2.23)

Similarly, other cost-sensitive performance measures can be defined; if cost_ij is allowed to index arbitrary class pairs i and j, cost-sensitive measures for multi-class problems follow in the same way.

Under unequal costs the ROC curve cannot directly reflect a learner's expected total cost, but the cost curve can. Its horizontal axis is the probability cost of the positive class, in [0, 1],

P(+)cost = p × cost₀₁ / ( p × cost₀₁ + (1 − p) × cost₁₀ ) ,   (2.24)

where p is the prior probability that an example is positive; its vertical axis is the normalized expected cost, also in [0, 1],

cost_norm = ( FNR × p × cost₀₁ + FPR × (1 − p) × cost₁₀ ) / ( p × cost₀₁ + (1 − p) × cost₁₀ ) ,   (2.25)

where FPR is the false positive rate of (2.19) and FNR = 1 − TPR is the false negative rate. To draw a cost curve, each point (TPR, FPR) on the ROC curve is converted into a line segment on the cost plane from (0, FPR) to (1, FNR); the area under the segment is the expected total cost under the corresponding condition. Converting every ROC point in this way and taking the lower envelope of all segments, the area under the envelope is the learner's expected total cost over all conditions, as shown in Figure 2.5.

[Figure 2.5: Cost curve and expected total cost]

2.4

( hypothesis test )
A B
A B,

[Wellek 2010],
e

2.4. 1

the hypothesis concerns the generalization error rate ε, e.g. "ε = ε₀". In practice we only know the test error rate ε̂, which need not equal ε but is usually close to it, so we can infer the distribution of ε from ε̂. A learner with generalization error ε misclassifies each of m test examples independently with probability ε, so the probability that it misclassifies exactly ê × m of them (and classifies the rest correctly) is

P(ê; ε) = C(m, ê × m) ε^{ê×m} (1 − ε)^{m − ê×m} .   (2.26)

Solving ∂P(ê; ε)/∂ε = 0 shows that P(ê; ε) is maximized at ε = ê and decreases as |ε − ê| grows; this is a binomial distribution.

[Figure 2.6: Binomial distribution (m = 10, ε = 0.3)]

We can use a binomial test on a hypothesis such as "ε ≤ 0.3". More generally, to test "ε ≤ ε₀" at significance level α, the critical value is the largest error rate ε̄ such that the probability of observing more than ê₀ × m mistakes is still below α ("s.t." abbreviates "subject to"):

ε̄ = max ε   s.t.   Σ_{i = ε₀×m + 1}^{m} C(m, i) ε^i (1 − ε)^{m−i} < α .   (2.27)

If the observed test error rate ε̂ is below the critical value ε̄, the hypothesis "ε ≤ ε₀" cannot be rejected at confidence 1 − α; otherwise it is rejected and we conclude that ε > ε₀ at significance level α. (The critical value can be obtained in R with qbinom(1 − α, m, ε₀), or in MATLAB with icdf('Binomial', 1 − α, m, ε₀); R is a free environment for statistical computing, www.r-project.org.)

In practice we often obtain several test error rates through repeated hold-out or cross-validation; the t-test can then be used. Suppose we have k test error rates ε̂₁, ε̂₂, ..., ε̂_k; their mean and variance are

μ = (1/k) Σ_{i=1}^{k} ε̂_i ,   (2.28)

σ² = (1/(k−1)) Σ_{i=1}^{k} ( ε̂_i − μ )² .   (2.29)

Treating the k error rates as independent samples of the generalization error ε₀, the statistic

τ_t = √k ( μ − ε₀ ) / σ   (2.30)

follows a t-distribution with k − 1 degrees of freedom, illustrated in Figure 2.7.

[Figure 2.7: The t-distribution (k = 10)]

For the hypothesis "μ = ε₀" and significance level α, a two-tailed test is used, as shown in Figure 2.7: the shaded regions at the two tails each have probability α/2. If τ_t lies in the critical range [t_{−α/2}, t_{α/2}], i.e. |τ_t| ≤ t_{α/2}, the hypothesis that the generalization error rate equals ε₀ cannot be rejected at confidence 1 − α; otherwise it is rejected. Here t_{α/2} is the two-tailed critical value; commonly α = 0.05 or 0.1, and Table 2.3 lists frequently used values. (In R, t_{α/2} is obtained with qt(1 − α/2, k − 1); in MATLAB with icdf('T', 1 − α/2, k − 1).)

Table 2.3: Commonly used two-tailed critical values for the t-test

α \ k      2        5        10       20       30
0.05    12.706    2.776    2.262    2.093    2.045
0.10     6.314    2.132    1.833    1.729    1.699

2.4.2 Cross-validated t-test

For two learners A and B, k-fold cross-validation yields paired test error rates ε₁^A, ε₂^A, ..., ε_k^A and ε₁^B, ε₂^B, ..., ε_k^B, where ε_i^A and ε_i^B come from the same i-th fold; a paired t-test can then be applied. If the two learners perform the same, their error rates on each fold should be equal. Compute the per-fold differences Δ_i = ε_i^A − ε_i^B, and let μ and σ² be the mean and variance of Δ₁, Δ₂, ..., Δ_k. Under the null hypothesis the statistic

τ_t = | √k μ / σ |   (2.31)

follows a t-distribution with k − 1 degrees of freedom; if it is below the critical value t_{α/2, k−1}, the hypothesis that the learners perform the same cannot be rejected, otherwise the learner with the smaller mean error is taken to be better.

A key assumption is that the error rates are independent samples of the generalization error, but in cross-validation the training sets overlap, so the assumption does not really hold and significance is typically overestimated [Dietterich, 1998]. The 5×2 cross-validation test alleviates this: perform 2-fold cross-validation 5 times, shuffling the data before each repetition, and let Δ_i^1 and Δ_i^2 denote the differences on the two folds of the i-th repetition. To reduce dependence, only the mean of the first repetition, μ = 0.5(Δ₁^1 + Δ₁^2), is used, while the variance of every repetition, σ_i² = (Δ_i^1 − (Δ_i^1 + Δ_i^2)/2)² + (Δ_i^2 − (Δ_i^1 + Δ_i^2)/2)², enters the statistic

τ_t = μ / √( (1/5) Σ_{i=1}^{5} σ_i² ) ,   (2.32)

which follows a t-distribution with 5 degrees of freedom; the two-tailed critical value t_{α/2, 5} is 2.5706 for α = 0.05 and 2.0150 for α = 0.1.
2.4.3 McNemar test

For binary classification with a hold-out test, we can record not only the error rates of learners A and B but also, for each test example, whether both are correct, both are wrong, or exactly one of them is wrong; the counts form the contingency table of Table 2.4.

Table 2.4: Contingency table for two learners

                      B correct   B wrong
A correct                e₀₀        e₀₁
A wrong                  e₁₀        e₁₁

If the two learners perform the same, then e₀₁ and e₁₀ should be equal, so |e₀₁ − e₁₀| should be small. The statistic

τ_{χ²} = ( |e₀₁ − e₁₀| − 1 )² / ( e₀₁ + e₁₀ )   (2.33)

follows a chi-square distribution with 1 degree of freedom. If it is below the critical value (3.8415 at α = 0.05, 2.7055 at α = 0.1; in R use qchisq(1 − α, k − 1), in MATLAB icdf('Chisquare', 1 − α, k − 1) with k = 2), the hypothesis that the two learners perform the same cannot be rejected; otherwise the learner with the smaller error rate is considered significantly better.
2.4.4 Friedman test and Nemenyi post-hoc test

The cross-validated t-test and the McNemar test compare two algorithms on one data set. To compare several algorithms on several data sets, one could run pairwise comparisons, but it is more direct to use a rank-based test such as the Friedman test.

Suppose algorithms A, B, C are compared on data sets D₁, D₂, D₃, D₄. First obtain each algorithm's test result on each data set (e.g. by hold-out or cross-validation), then on each data set rank the algorithms from best (rank 1) to worst, splitting ranks evenly in case of ties, and finally average each algorithm's ranks, as in Table 2.5.

Table 2.5: Algorithm ranks

data set        algorithm A   algorithm B   algorithm C
D₁                   1             2             3
D₂                   1           2.5           2.5
D₃                   1             2             3
D₄                   1             2             3
average rank         1         2.125         2.875

The Friedman test then checks whether the algorithms perform the same. Suppose k algorithms are compared on N data sets and r_i is the average rank of the i-th algorithm (ignoring ties for simplicity); under the null hypothesis the r_i should be roughly equal, with mean (k + 1)/2 and variance (k² − 1)/12N. The statistic

τ_{χ²} = (k − 1)/k · 12N/(k² − 1) · Σ_{i=1}^{k} ( r_i − (k + 1)/2 )²
       = 12N/( k(k + 1) ) · ( Σ_{i=1}^{k} r_i² − k(k + 1)²/4 )   (2.34)

follows a chi-square distribution with k − 1 degrees of freedom when k and N are large. This "original" Friedman test is too conservative; a commonly used variant is

τ_F = (N − 1) τ_{χ²} / ( N(k − 1) − τ_{χ²} ) ,   (2.35)

where τ_{χ²} is given by (2.34). τ_F follows an F-distribution with k − 1 and (k − 1)(N − 1) degrees of freedom; Table 2.6 gives commonly used critical values.
Table 2.6: Commonly used critical values for the F-test (upper block α = 0.05, lower block α = 0.1; rows correspond to the number of data sets N, columns to the number of algorithms k = 2, ..., 10). For example, at α = 0.05 with N = 4 the critical values for k = 2, ..., 10 are 10.128, 5.143, 3.863, 3.259, 2.901, 2.661, 2.488, 2.355, 2.250. (In R the critical value is obtained with qf(1 − α, k − 1, (k − 1)(N − 1)); in MATLAB with icdf('F', 1 − α, k − 1, (k − 1)(N − 1)).)

If the hypothesis "all algorithms perform the same" is rejected, a post-hoc test is used to distinguish the algorithms; a common choice is the Nemenyi test. It computes the critical difference of average ranks

CD = q_α √( k(k + 1) / (6N) ) ,   (2.36)

where q_α is based on the Studentized range (Tukey) distribution; Table 2.7 lists common values for α = 0.05 and 0.1. (In R, q_α can be obtained as qtukey(1 − α, k, Inf) / sqrt(2).) If the difference between the average ranks of two algorithms exceeds CD, the hypothesis that they perform the same is rejected at the corresponding confidence level.

Table 2.7: Commonly used q_α values for the Nemenyi test

α \ k     2      3      4      5      6      7      8      9      10
0.05    1.960  2.344  2.569  2.728  2.850  2.949  3.031  3.102  3.164
0.10    1.645  2.052  2.291  2.459  2.589  2.693  2.780  2.855  2.920
For the data of Table 2.5, (2.34) and (2.35) give τ_F = 24.429, which exceeds the critical value 5.143 at α = 0.05 (k = 3, N = 4) in Table 2.6, so the hypothesis that all algorithms perform the same is rejected. Following up with the Nemenyi test, Table 2.7 gives q_{0.05} = 2.344 and (2.36) gives CD = 1.657. Comparing average ranks in Table 2.5, the gaps between A and B and between B and C are both smaller than CD, while the gap between A and C exceeds it; so the test concludes that A is significantly better than C but cannot distinguish A from B or B from C. This is conveniently shown in a Friedman test diagram (Figure 2.8), which plots each algorithm's average rank with a horizontal segment of length CD: if the segments of two algorithms overlap, they are not significantly different; here A and B overlap, while A and C do not.

[Figure 2.8: Friedman test diagram]
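The sketch below reproduces these numbers from the average ranks of Table 2.5; q_alpha = 2.344 is taken from Table 2.7 (the function names are illustrative).

```python
import numpy as np

def friedman_F(avg_ranks, N):
    """Friedman statistic tau_F of (2.34)-(2.35) from the average ranks over N data sets."""
    r = np.asarray(avg_ranks, float)
    k = len(r)
    tau_chi2 = 12 * N / (k * (k + 1)) * (np.sum(r ** 2) - k * (k + 1) ** 2 / 4)   # (2.34)
    return (N - 1) * tau_chi2 / (N * (k - 1) - tau_chi2)                           # (2.35)

def nemenyi_CD(k, N, q_alpha):
    """Critical difference of the Nemenyi test, eq. (2.36)."""
    return q_alpha * np.sqrt(k * (k + 1) / (6.0 * N))

avg_ranks = [1.0, 2.125, 2.875]                  # Table 2.5
print(friedman_F(avg_ranks, N=4))                # about 24.43, matching the text
print(nemenyi_CD(k=3, N=4, q_alpha=2.344))       # about 1.657
```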

2.5

The bias-variance decomposition is an important tool for explaining generalization performance. For a test example x, let y_D be the label of x in the data set (which may be corrupted by noise), y its true label, and f(x; D) the prediction of the model learned from training set D. For a regression task, the expected prediction over training sets of the same size is

f̄(x) = E_D[ f(x; D) ] ,   (2.37)

the variance induced by different training sets is

var(x) = E_D[ ( f(x; D) − f̄(x) )² ] ,   (2.38)

the noise is

ε² = E_D[ ( y_D − y )² ] ,   (2.39)

and the bias is the gap between the expected prediction and the true label,

bias²(x) = ( f̄(x) − y )² .   (2.40)

Assume the noise has zero mean, E_D[ y_D − y ] = 0. The expected generalization error can be decomposed as follows:

E(f; D) = E_D[ ( f(x; D) − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) + f̄(x) − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + E_D[ ( f̄(x) − y_D )² ]
          + E_D[ 2 ( f(x; D) − f̄(x) )( f̄(x) − y_D ) ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + E_D[ ( f̄(x) − y + y − y_D )² ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + ( f̄(x) − y )² + E_D[ ( y − y_D )² ]
          + 2 E_D[ ( f̄(x) − y )( y − y_D ) ]
        = E_D[ ( f(x; D) − f̄(x) )² ] + ( f̄(x) − y )² + E_D[ ( y_D − y )² ] ,   (2.41)

where the cross terms vanish because f̄(x) is a constant with respect to D by (2.37) and the noise has zero mean. That is,

E(f; D) = bias²(x) + var(x) + ε² ,   (2.42)

so the generalization error decomposes into bias (2.40), variance (2.38), and noise (2.39).

(bias-variance
dilemma). 2.9

2.9

2.6

[Efron and Tibshirani , 1993]

ROC [Spackman, 1989],


AUC [Bradley , 1997],
2.6

ROC [Hanley
and McNeil,1983]. [Hand and Till, 2001] ROC
[Fawcett , 2006] ROC
[Drummond and Holte, 2006]

(cost-sensitive learning) [Elkan ,


2.3 4,

Zhou and Liu , 2006]


[Dietterich, 1998] A: 5x2
[Demsar , 2006]
[Geman et al., 1992] (bias -
variance-covariance decomposition)
(2.42)

[Kong and Dietterich, Kohavi and Wolpert ,


1996; Breiman, 1996; Friedman , 1997; Domingos, 2000].
2.1

2.2

2.3 A B A BEP B

2.4 (TPR) (FPR) (P) (R)

2.5 (2.22).

2.6 ROC

2.7 ROC

2.8 Min-max normalization and z-score normalization are two common ways to rescale a feature. Let x and x' denote the value before and after normalization, x_min and x_max the minimum and maximum of the feature, x'_min and x'_max the desired range after normalization, and x̄ and σ_x the mean and standard deviation. Min-max normalization and z-score normalization are given by (2.43) and (2.44); discuss their advantages and drawbacks.

x' = x'_min + ( x − x_min ) / ( x_max − x_min ) × ( x'_max − x'_min )   (2.43)

x' = ( x − x̄ ) / σ_x   (2.44)

2.9

2.10* Friedman (2.34) (2.35) lj .


Bradley, A. P. (1997) . “The use of the area under the ROC curve in the evalua-
tion of machine learning algorithms.” Pattern Recognition, 30( 7) :1145-1159.
Breiman , L. (1996) . “Bias, variance, and arcing classifiers.” Technical Report
460 , Statistics Department , University of California, Berkeley, CA.
Demsar , J . (2006). “Statistical comparison of classifiers over multiple data
sets.” Journal of Machine Learning Research, 7:1-30.
Dietterich , T. G. (1998). “Approximate statistical tests for comparing super-
vised classification learning algorithms.” Neural Computation, 10 ( 7) :1895-
1923.
Domingos, P. ( 2000) . “ A unified bias-variance decomposition.55 In Proceedings
of the 17th International Conference on Machine Learning ( ICML ) , 231-238,
Stanford, CA.
Drummond , C. and R. C. Holte. (2006) . “Cost curves: An improved method
for visualizing classifier performance.” Machine Learning , 65 (1):95-130.
Efron , B. and R. Tibshirani. (1993) . An Introduction to the Bootstrap. Chap-
man Hall, New York , NY.
Elkan , C. ( 2001) . “The foundations of cost-senstive learning.” In Proceedings of
the 17th International Joint Conference on Artificial Intelligence ( IJCAI ) ,
973-978, Seattle, WA.
Fawcett , T. ( 2006). “An introduction to ROC analysis.” Pattern Recognition
Letters, 27(8):861-874.
Friedman, J. H. (1997) . “On bias, variance , 0/1-loss, and the curse-of-
dimensionality.” Data Mining and Knowledge Discovery, 1(1) :55-77.
Geman , S. , E. Bienenstock , and R. Doursat. (1992 ). “ Neural networks and the
bias / variance dilemma.” Neural Computation, 4( l ):l-58.
Hand , D. J . and R. J . Till. (2001). “A simple generalisation of the area under
the ROC curve for multiple class classification problems.” Machine Learning ,
45( 2 ):171-186.
Hanley, J . A. and B. J . McNeil. (1983) . “A method of comparing the areas
under receiver operating characteristic curves derived from the same cases.”
Radiology, 148(3):839-843.
Kohavi, R. and D. H. Wolpert. (1996) . “Bias plus variance decomposition for
zero-one loss functions.” In Proceeding of the 13th International Conference
on Machine Learning ( ICML ) , 275-283, Bari, Italy.
Kong , E. B. and T. G. Dietterich. (1995) . “Error-correcting output coding cor-
rects bias and variance.” In Proceedings of the 12th International Conference
on Machine Learning ( ICML ) , 313-321, Tahoe City, CA.
Mitchell, T. (1997) . Machine Learning. McGraw Hill , New York, NY.
Spackman, K. A. (1989). “Signal detection theory: Valuable tools for evaluating
inductive learning.” In Proceedings of the 6th International Workshop on
Machine Learning ( IWML )^ 160-163, Ithaca, NY.
Van Rijsbergen, C. J . (1979 ). Information Retrieval , 2nd edition. Butterworths,
London, UK.
Wellek , S. ( 2010). Testing Statistical Hypotheses of Equivalence and Noninfe-
riority, 2nd edition. Chapman & Hall/ CRC , Boca Raton , FL.
Zhou , Z.-H. and X.-Y. Liu. ( 2006) . “On multi-class cost-sensitive learning.” In
Proceeding of the 21st National Conference on Artificial Intelligence ( AAAI ),
567-572, Boston , WA.
William Gosset , 1876—1937)

f Biometrika

t
-
(Student’s t test).

( Karl Pearson , 1857—1936)


( University College London UCL) t
UCL
UCL
Biometrika
3.1

Given an example x = (x₁; x₂; ...; x_d) described by d attributes, where x_i is the value of x on the i-th attribute, a linear model tries to learn a prediction function that is a linear combination of the attributes:

f(x) = w₁x₁ + w₂x₂ + ... + w_d x_d + b ,   (3.1)

or, in vector form,

f(x) = wᵀx + b ,   (3.2)

where w = (w₁; w₂; ...; w_d). Once w and b are learned, the model is determined.

( nonlinear model)

un-
derstandability) .
(comprehensibility).
(cc) 0.2 • z 0.5 . c 0.3 • x ”

3.2

{(«i ,yi ),( x2,2/2 ), (xm,ym )}, A = (


• linear regression)

^
• • Vi

{(xi ,yi )}Zv e R.


order)
{1.0,0.0},
{1.0,0.5,0.0};
/c

9.3

Linear regression tries to learn

f(x_i) = w x_i + b  such that  f(x_i) ≃ y_i .   (3.3)

How should w and b be determined? The key is how to measure the difference between f(x) and y. The mean squared error (Section 2.3) is the most common performance measure for regression, so we minimize it:

(w*, b*) = argmin_{(w,b)} Σ_{i=1}^{m} ( f(x_i) − y_i )²
         = argmin_{(w,b)} Σ_{i=1}^{m} ( y_i − w x_i − b )² .   (3.4)

The mean squared error has an appealing geometric meaning: it corresponds to Euclidean distance. Solving a model by minimizing it is called the least square method; for linear regression it amounts to finding a line that minimizes the sum of squared distances from all examples to the line.

The procedure of finding the w and b that minimize E_(w,b) = Σ_{i=1}^{m} ( y_i − w x_i − b )² is the least squares parameter estimation of the linear regression model. Taking derivatives of E_(w,b) with respect to w and b gives

∂E_(w,b)/∂w = 2 ( w Σ_{i=1}^{m} x_i² − Σ_{i=1}^{m} ( y_i − b ) x_i ) ,   (3.5)

∂E_(w,b)/∂b = 2 ( m b − Σ_{i=1}^{m} ( y_i − w x_i ) ) .   (3.6)

Since E_(w,b) is convex in w and b, setting (3.5) and (3.6) to zero yields the closed-form solution

w = Σ_{i=1}^{m} y_i ( x_i − x̄ ) / ( Σ_{i=1}^{m} x_i² − (1/m)( Σ_{i=1}^{m} x_i )² ) ,   (3.7)

b = (1/m) Σ_{i=1}^{m} ( y_i − w x_i ) ,   (3.8)

where x̄ = (1/m) Σ_{i=1}^{m} x_i is the mean of x.
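A minimal NumPy sketch of (3.7)-(3.8); the toy data and function name are illustrative.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Closed-form least-squares solution of (3.7) and (3.8) for f(x) = w*x + b."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    m = len(x)
    x_bar = x.mean()
    w = np.sum(y * (x - x_bar)) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)   # (3.7)
    b = np.mean(y - w * x)                                                # (3.8)
    return w, b

# noise-free data from y = 2x + 1, so we should recover w ~= 2, b ~= 1
x = np.array([0.0, 1.0, 2.0, 3.0])
print(simple_linear_regression(x, 2 * x + 1))
```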

More generally, we try to learn

f(x_i) = wᵀx_i + b  such that  f(x_i) ≃ y_i ,

which is called multivariate linear regression. Again w and b can be estimated by least squares. For convenience absorb b into the weight vector, ŵ = (w; b), and write the data as an m × (d+1) matrix X whose i-th row holds the attribute values of the i-th example followed by a constant 1:

X = [ x₁₁  x₁₂  ...  x₁d  1 ;
      x₂₁  x₂₂  ...  x₂d  1 ;
      ...                    ;
      x_m1 x_m2 ...  x_md 1 ] = [ x₁ᵀ 1; x₂ᵀ 1; ...; x_mᵀ 1 ] .

With y = (y₁; y₂; ...; y_m), analogously to (3.4),

ŵ* = argmin_ŵ ( y − Xŵ )ᵀ( y − Xŵ ) .   (3.9)

Let E_ŵ = ( y − Xŵ )ᵀ( y − Xŵ ). Differentiating with respect to ŵ gives

∂E_ŵ/∂ŵ = 2 Xᵀ( Xŵ − y ) .   (3.10)

Setting (3.10) to zero yields the closed-form solution, but the situation is more involved than in the univariate case because a matrix inverse is required. When XᵀX is full-rank or positive definite,

ŵ* = ( XᵀX )⁻¹ Xᵀ y ,   (3.11)

where ( XᵀX )⁻¹ is the inverse of XᵀX. With x̂_i = (x_i; 1), the learned model is

f(x̂_i) = x̂_iᵀ ( XᵀX )⁻¹ Xᵀ y .   (3.12)

In practice XᵀX is often not full-rank: the number of attributes may even exceed the number of examples, so (3.9) has multiple solutions. Which one is output is decided by the inductive bias of the learning algorithm, most commonly by introducing a regularization term (see Sections 1.4 and 11.4).
For an example (x, y) with y ∈ ℝ, the linear regression model of (3.2) is

y = wᵀx + b .   (3.13)

Can the model's prediction be made to approximate some transformation of y instead? For instance, if the output varies on an exponential scale, we can use the logarithm of the label as the regression target:

ln y = wᵀx + b .   (3.14)

This is log-linear regression: it actually tries to make e^{wᵀx + b} approach y. Formally (3.14) is still linear regression, but in effect it fits a nonlinear mapping from the input to the output, as Figure 3.1 illustrates.

[Figure 3.1: Log-linear regression]
3.3

More generally, consider a monotone differentiable function g(·) and let

y = g⁻¹( wᵀx + b ) .   (3.15)

A model of this form is a generalized linear model, and g(·) is called the link function. Log-linear regression is the special case with g(·) = ln(·).

3.3

The previous section used linear models for regression; for classification, the generalized linear model (3.15) suggests finding a monotone differentiable function that links the real-valued output of the linear model to the class label. For binary classification with y ∈ {0, 1}, the linear model produces a real value z = wᵀx + b, which must be converted to 0/1. The ideal converter is the unit step (Heaviside) function

y = 0,   if z < 0 ;
    0.5, if z = 0 ;
    1,   if z > 0 ,   (3.16)

i.e. predict positive when z > 0, negative when z < 0, and break the tie arbitrarily at zero, as shown in Figure 3.2.

[Figure 3.2: Unit step function and logistic function]
As Figure 3.2 shows, the unit step function is not continuous, so it cannot serve directly as g⁻¹(·) in (3.15). We therefore want a surrogate function that approximates the unit step while being monotone and differentiable. The logistic function is such a function:

y = 1 / ( 1 + e^{−z} ) .   (3.17)

The logistic function is a sigmoid ("S-shaped") function: it maps z into (0, 1) and is steep around z = 0. Substituting it as g⁻¹(·) in (3.15) gives

y = 1 / ( 1 + e^{−(wᵀx + b)} ) ,   (3.18)

which, similarly to (3.14), can be rewritten as

ln( y / (1 − y) ) = wᵀx + b .   (3.19)

If y is viewed as the likelihood that x is positive and 1 − y as the likelihood that it is negative, the ratio

y / (1 − y)   (3.20)

is called the odds, reflecting the relative likelihood of x being a positive example, and its logarithm

ln( y / (1 − y) )   (3.21)

is the log odds, or logit. Thus (3.18) uses the prediction of the linear model to approximate the log odds of the true label; the corresponding model is called logistic regression (logit regression). Although its name contains "regression", it is in fact a classification method.
3.3

We now determine w and b in (3.18). If y in (3.18) is viewed as the posterior probability p(y = 1 | x), then (3.19) can be rewritten as

ln( p(y=1 | x) / p(y=0 | x) ) = wᵀx + b ,   (3.22)

and clearly

p(y=1 | x) = e^{wᵀx + b} / ( 1 + e^{wᵀx + b} ) ,   (3.23)

p(y=0 | x) = 1 / ( 1 + e^{wᵀx + b} ) .   (3.24)

We can then estimate w and b by the maximum likelihood method (Section 7.2). Given the data set {(x_i, y_i)}_{i=1}^m, the log-likelihood to maximize is

ℓ(w, b) = Σ_{i=1}^{m} ln p( y_i | x_i; w, b ) ,   (3.25)

i.e. each example should be assigned as high a probability of its true label as possible. For compactness write β = (w; b) and x̂ = (x; 1), so that wᵀx + b = βᵀx̂, and let p₁(x̂; β) = p(y=1 | x̂; β) and p₀(x̂; β) = p(y=0 | x̂; β) = 1 − p₁(x̂; β). The likelihood term in (3.25) can be written

p( y_i | x_i; w, b ) = y_i p₁(x̂_i; β) + (1 − y_i) p₀(x̂_i; β) .   (3.26)

Substituting (3.26) into (3.25) and using (3.23)-(3.24), maximizing (3.25) is equivalent to minimizing

ℓ(β) = Σ_{i=1}^{m} ( −y_i βᵀx̂_i + ln( 1 + e^{βᵀx̂_i} ) ) .   (3.27)

Equation (3.27) is a convex function of β with continuous derivatives of all orders, so classical numerical optimization methods such as gradient descent or Newton's method [Boyd and Vandenberghe, 2004] (Appendix B.4) yield the optimum

β* = argmin_β ℓ(β) .   (3.28)

Taking Newton's method as an example, the update at round t+1 is

β^{t+1} = β^t − ( ∂²ℓ(β)/∂β∂βᵀ )⁻¹ ∂ℓ(β)/∂β ,   (3.29)

with first- and second-order derivatives

∂ℓ(β)/∂β = −Σ_{i=1}^{m} x̂_i ( y_i − p₁(x̂_i; β) ) ,   (3.30)

∂²ℓ(β)/∂β∂βᵀ = Σ_{i=1}^{m} x̂_i x̂_iᵀ p₁(x̂_i; β) ( 1 − p₁(x̂_i; β) ) .   (3.31)
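A minimal sketch that minimizes (3.27) by plain gradient descent using the gradient (3.30) rather than Newton's method, for simplicity; the learning rate, iteration count, and toy data are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.1, n_iter=2000):
    """Gradient descent on the objective (3.27). X: (m, d); y in {0, 1}. Returns beta = (w; b)."""
    X_hat = np.hstack([X, np.ones((len(X), 1))])   # x_hat = (x; 1)
    beta = np.zeros(X_hat.shape[1])
    for _ in range(n_iter):
        p1 = sigmoid(X_hat @ beta)                 # p(y=1 | x; beta), eq. (3.23)
        grad = -X_hat.T @ (y - p1)                 # eq. (3.30)
        beta -= lr * grad / len(X)
    return beta

X = np.array([[0.1], [0.4], [0.6], [0.9]])
y = np.array([0, 0, 1, 1])
beta = logistic_regression(X, y)
print(sigmoid(np.hstack([X, np.ones((4, 1))]) @ beta).round(2))   # predicted p(y=1|x)
```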

3.4

(Linear Discriminant Analysis, LDA )


LDA
[Fisher , 1936]
Fisher
“Fisher
.
LDA

3.3

X2

XI
3.3 LDA U

Given a data set D = {(x_i, y_i)}_{i=1}^m with y_i ∈ {0, 1}, let X_i, μ_i, and Σ_i denote the set of examples, the mean vector, and the covariance matrix of class i ∈ {0, 1}. If the examples are projected onto a direction w, the projections of the two class centers are wᵀμ₀ and wᵀμ₁, and the covariances of the projected examples are wᵀΣ₀w and wᵀΣ₁w; since the projection is onto a line, all of these are scalars.

LDA wants the projections of examples from the same class to be close together (small wᵀΣ₀w + wᵀΣ₁w) and the projected class centers to be far apart (large ‖wᵀμ₀ − wᵀμ₁‖₂²). Both goals are captured by maximizing

J = ‖wᵀμ₀ − wᵀμ₁‖₂² / ( wᵀΣ₀w + wᵀΣ₁w )
  = wᵀ(μ₀ − μ₁)(μ₀ − μ₁)ᵀw / ( wᵀ(Σ₀ + Σ₁)w ) .   (3.32)

Define the within-class scatter matrix

S_w = Σ₀ + Σ₁
    = Σ_{x∈X₀} (x − μ₀)(x − μ₀)ᵀ + Σ_{x∈X₁} (x − μ₁)(x − μ₁)ᵀ   (3.33)

and the between-class scatter matrix

S_b = (μ₀ − μ₁)(μ₀ − μ₁)ᵀ .   (3.34)

Then (3.32) can be rewritten as

J = wᵀS_b w / wᵀS_w w ,   (3.35)

the objective LDA maximizes, known as the generalized Rayleigh quotient of S_b and S_w.

How do we determine w? The value of (3.35) depends only on the direction of w, not its length, so we may fix wᵀS_w w = 1, and (3.35) becomes

min_w  −wᵀS_b w
s.t.   wᵀS_w w = 1 .   (3.36)

By the method of Lagrange multipliers (Appendix B.1) this is equivalent to

S_b w = λ S_w w ,   (3.37)

where λ is the Lagrange multiplier. Since S_b w always points in the direction of μ₀ − μ₁, we may set

S_b w = λ (μ₀ − μ₁) ,   (3.38)

and substituting into (3.37) gives

w = S_w⁻¹ (μ₀ − μ₁) .   (3.39)

For numerical stability, S_w⁻¹ is computed in practice via singular value decomposition: S_w = UΣVᵀ, then S_w⁻¹ = VΣ⁻¹Uᵀ (Appendix A.3).
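A minimal sketch of (3.33) and (3.39) for two classes; for brevity it solves the linear system S_w w = μ₀ − μ₁ with np.linalg.solve instead of the SVD-based inverse described in the text, and the toy data are illustrative.

```python
import numpy as np

def lda_direction(X0, X1):
    """Two-class LDA projection direction w = S_w^{-1} (mu0 - mu1), eq. (3.39)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_w = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)   # within-class scatter (3.33)
    w = np.linalg.solve(S_w, mu0 - mu1)
    return w / np.linalg.norm(w)

X0 = np.array([[1.0, 2.0], [2.0, 3.0], [3.0, 3.0]])   # class 0
X1 = np.array([[6.0, 5.0], [7.0, 8.0], [8.0, 7.0]])   # class 1
w = lda_direction(X0, X1)
print(X0 @ w, X1 @ w)   # the two classes project to well-separated intervals
```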
LDA
7.5 .
LDA
LDA iV i
TTH .

St = Sb + S

- f l ) { X i - f i )T , (3.40)
i= l

M Sw

(3.41)

xeXi
:{ x Mz ) Mi ) • (3.42)

Sb = St — sw
-
M)
T
. (3.43)

LDA S6, Sw
3.5

tr (WTS6W)
-
^
(3 44)
tr(WTS W )
W G RhW-1) tr ^
(trace). (3.44)

S6W = AS W . (3.45)
W AT - ^
LDA iV
iV
LDA

3.5

LDA

AT •

classifier).

OvR OvA (One VS.


(One vs. One, OvO)
AH ), OvA (One vs. Rest, OvR) (Many vs. Many, MvM).
{(aJi ,2/i),(«2 ,2/2 ),...,(® m ,2/m )},Vi {Ci,C2 ,...,C N } .
OvO

Cj
iV
G G
N {N - l ) / 2

^ G
OvO

N(N - l ) / 2
3.4
8.4 OvR
iV
3.4
(\
D

ISlM )

OvR
OvO
OvO

/i

f4

f5

f6



C\

cn

C3

Cs
q

-
4
C2

c3

3.4 OvO

iV
C3 C4
OvR
__
msg))
mmi)
OvR

OvO
OvR
iW

W
U


N ( N - 1)/2
c3

OvR OvO
OvO OvR

MvM OvO
OvR MvM MvM
MvM Error Correcting
Output Codes, ECOC).
ECOC [Dietterich and Bakiri 1995]
ECOC

• iV M
M
M

• M
3.5

coding matrix)
[Dietterich and Bakiri, 1995] [Allwein et al.,
2000].
3.5 3.5(a)
C3 C2 C4 3.5(b)
c4 c3

3.5(a)
C3.

n f

:n;
/1 /2 /3 /4 /5 fi h h fe /7
I
4 4
C2

C4
- BlBB TBl
>
[ -> 2 2
->• 5 2 \/5
3 \ /l0

^ ECOC
i
+i - +i

(b)
+i -i

ECOC

3.5 ECOC
A
-
“ 1”

ECOC
3.5(a)

C3. ECOC

NP
3.6

-
(class imbalance)

OvR MvM
OvR MvM

When we classify a new example x with y = wᵀx + b, we are in effect comparing the predicted value y with a threshold of 0.5: since y expresses the likelihood that x is positive, the odds y/(1 − y) reflect the relative likelihood of positive versus negative, and predicting positive whenever

y / (1 − y) > 1   (3.46)

implicitly assumes the two classes are equally likely. When the training set contains m⁺ positive and m⁻ negative examples, however, the observed class odds are m⁺/m⁻, so as long as the training set is an unbiased sample of the true distribution we should predict positive whenever

y / (1 − y) > m⁺ / m⁻ .   (3.47)

But the classifier decides by (3.46), so we rescale its output,

y' / (1 − y') = y / (1 − y) × m⁻ / m⁺ ,   (3.48)

ance).
-
rebal
rescaling).

midersampling)

(downsampling),

pling).
(upsam
- oversampling),

(3.48)

-
(threshold moving).

SMOTE [Chawla
et al., 2002]

EasyEnsemble [Liu et al., 2009]

ing) (3.48)
_
m /m+ cost cost
-
cost-sensitive learn

-
^
2.3.4 cost cost

3.7

(sparse representation)
sparsity)
L0 NP LASSO
[Tibshirani , 1996] h L0
OvO OvR ECOC [Allwein et al., 2000].
[Crammer and Singer , 2002]

NP ECOC
[Pujol et al., 2008]. [Escalera et
al.,2010] ECOC
MvM ECOC DAG ( Directed Acyclic
Graph) [Platt et al,2000]

[Crammer and Singer, Lee et al.,2004].

( misclassification cost), 2.2

[Elkan, 2001], [Zhou


and Liu , 2006a].
[Zhou and Liu, 2006b].

(multi-label learning),
[Zhang and Zhou, 2014].
3.1 (3.2)
3.2 (3.18)
(327)
3.0a p.89
3.3 3.0a
4.5.

UCI
3.4 UCI
http: // archive.ics.uci. edu / ml / .

3.5 3.0a

3.6
6.3

3.7 ECOC

3.8* ECOC
ECOC

3.9 OvR MvM

3.10* )
Allwein, E. L., R. E. Schapire, and Y. Singer. ( 2000). ^Reducing multiclass
to binary: A unifying approach for margin classifiers.” Journal of Machine
Learning Research, 1:113-141.
Boyd , S. and L. Vandenberghe. ( 2004) . Convex Optimization, Cambridge Uni-
versity Press, Cambridge, UK.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. (2002 ).
“SMOTE: Synthetic minority over-sampling technique.” Journal of Artificial
Intelligence Research, 16:321-357.
Crammer , K . and Y. Singer. ( 2001) . “On the algorithmic implementation of
multiclass kernel-based vector machines.” Journal of Machine Learning Re-
search^ 2:265-292.
Crammer , K. and Y. Singer. ( 2002). “On the learnability and design of output
codes for multiclass problems.” Machine Learning , 47(2-3) :201—233.
Dietterich, T. G. and G. Bakiri. (1995). “Solving multiclass learning problems
via error-correcting output codes.” Journal of Artificial Intelligence Re -
search 2:263-286.
Elkan , C. ( 2001). “The foundations of cost-sensitive learning.” In Proceedings
of the 17th International Joint Conference on Artificial Intelligence ( IJCAI )
973-978, Seattle, WA.
Escalera, S. O. Pujol, and P. Radeva. ( 2010) . “Error-correcting ouput codes
library.” Journal of Machine Learning Research, 11:661-664.
Fisher , R. A. (1936). “The use of multiple measurements in taxonomic prob-
lems. Annals of Eugenics , 7( 2 ):179-188.
Lee, Y., Y. Lin, and G. Wahba. (2004). “Multicategory support vector ma-
chines, theory, and application to the classification of microarray data and
satellite radiance data.” Journal of the American Statistical Association, 99
( 465) :67-81.
Liu , X.-Y., J. Wu , and Z.-H. Zhou. ( 2009). ^Exploratory undersamping for
class-imbalance learning.” IEEE Transactions on Systems , Man, and Cyber-
netic
^- Part B: Cybernetics, 39 ( 2 ) :539-550.
Platt J C N. Cristianini, and J . Shawe-Taylor. ( 2000) . “Large margin DAGs
, . .,
for multiclass classification.” In Advances in Neural Information Processing
Systems 12 ( NIPS ) (S. A. Solla, T. K. Leen, and K.-R. Muller , eds.) , MIT
Press, Cambridge, MA.
Pujol, 0., S. Escalera, and P. Radeva. (2008) . “An incremental node embedding
,
technique for error correcting output codes.” Pattern Recognition 41( 2 ):713-
725.
Pujol, O., P. Radeva, and J. Vitria. ( 2006 ). “Discriminant ECOC: A heuristic
method for application dependent design of error correcting output codes.”
IEEE Transactions on Pattern Analysis and Machine Intelligence , 28( 6):
1007-1012.
Tibshirani, R. (1996 ) . “Regression shrinkage and selection via the LASSO.”
Journal of the Royal Statistical Society: Series B , 58(1):267-288.
Zhang, M.-L. and Z.-H. Zhou. ( 2014). “A review on multi-label learning al-
gorithms.” IEEE Transactions on Knowledge and Data Engineering , 26 (8):
1819-1837.
Zhou, Z.-H. and X.-Y. Liu. (2006a). “On multi-class cost-sensitive learning.” In
Proceeding of the 21st National Conference on Artificial Intelligence ( AAAI ),
567-572, Boston, WA.
Zhou, Z.-H. and X.-Y. Liu. ( 2006b) . “Training cost-sensitive neural networks
with methods addressing the class imbalance problem.” IEEE Transactions
on Knowledge and Data Engineering , 18(1):63-77.
G 09674175 N 9

(1993

(1777-1855)

1752 1833)

“3L” .
4.1

(decision tree)

4.1

4.1
- -
divide and conquer ) 4.2

• .. m ym ) }'i
A ={a i ,a2 ,...,ad}.
TreeGenerate( D , A )
node;
if D C then
(1). node C7 return
end if
if A = 0 O K D A then
(2). node return
end if
a #;
for a * do
node D a
if then
(3). D return
else
a* . TreeGenerate(Dv , A \{a *})
end if
end for
node

4.2

1) 2)
3)

(2)
(3)

(3)
4.2

4.2

4.2

purity)

4.2.1 Information gain

Information entropy is the most common measure of the purity of a sample set. Suppose the proportion of the k-th class in the current set D is p_k (k = 1, 2, ..., |𝒴|). The information entropy of D is

Ent(D) = − Σ_{k=1}^{|𝒴|} p_k log₂ p_k ,   (4.1)

with the convention p log₂ p = 0 when p = 0. The smaller Ent(D), the higher the purity of D; its minimum is 0 and its maximum is log₂|𝒴|.

Suppose the discrete attribute a has V possible values {a¹, a², ..., a^V}. Splitting D on a produces V branch nodes, where the v-th branch contains the subset Dᵛ of examples taking value aᵛ on a. We compute Ent(Dᵛ) by (4.1) and weight each branch by |Dᵛ|/|D|, so branches with more examples count more. The information gain of splitting D on attribute a is then

Gain(D, a) = Ent(D) − Σ_{v=1}^{V} ( |Dᵛ| / |D| ) Ent(Dᵛ) .   (4.2)

In general, the larger the information gain, the greater the purity improvement from splitting on a, so in line 8 of Figure 4.2 the split attribute can be chosen as a* = argmax_{a∈A} Gain(D, a). This is the criterion of the famous ID3 decision tree algorithm [Quinlan, 1986], where ID stands for Iterative Dichotomiser.
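A minimal sketch of (4.1) and (4.2) for a discrete attribute; the interface (plain label/value sequences) is an illustrative choice.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Information entropy Ent(D) of (4.1) for any sequence of class labels."""
    counts = np.array(list(Counter(labels).values()), float)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(values, labels):
    """Gain(D, a) of (4.2) for a discrete attribute, given per-example values."""
    values, labels = np.asarray(values), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(values):
        mask = values == v
        gain -= mask.mean() * entropy(labels[mask])   # weight |D^v| / |D|
    return gain

labels = ['good'] * 8 + ['bad'] * 9      # the root of watermelon data set 2.0
print(round(entropy(labels), 3))         # -> 0.998, matching the worked example below
```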
Take the watermelon data set 2.0 in Table 4.1 as an example: it contains 17 training examples and is used to learn a decision tree that predicts whether a melon is good, so |𝒴| = 2. At the start, the root node contains all 17 examples, of which the proportion of positive (good) examples is p₁ = 8/17 and that of negative examples is p₂ = 9/17. By (4.1), the entropy of the root node is

Ent(D) = − Σ_{k=1}^{2} p_k log₂ p_k = − ( (8/17) log₂(8/17) + (9/17) log₂(9/17) ) = 0.998 .

[Table 4.1: The watermelon data set 2.0]
}. D D1
D2 D3
D1 17}
P2 15}
p2 D3 16}
= I P =I
2 (4.1)

Ent(D1) —( log2 log2


J 1.000

Ent(D2) — log2 log2 0.918

Ent(D3)
_ _+-
log2 log2
- = 0.722

(4.2)
4.2

j- Ent { D v )
Gain(D, Ent { D )

0.998 - —
17
-= V 1

x 1.000 ^
' I


17
x 0.918 —
17
x 0.722

= 0.109 .

Gain(D, 0.143; Gain( D, 0.141;


Gain(D, 0.381; Gain(D, 0.289;
Gain( D, 0.006.
4.3

{1,2, 15} {7, 17} {11, 16}

4.3

4.3
D1 {1,
15} {
P1
Gainp1 0.043; Gainp1 0.458;
Gain(JD1, 0.331; Gain(L)1, 0.458;
Gainp1 , = 0.458.

4.4

4.2.2
4.1
C

4.4 2.0

In fact, the information gain criterion (4.2) favors attributes with many possible values. To reduce this bias, the famous C4.5 algorithm [Quinlan, 1993] does not use information gain directly but selects the split attribute by the gain ratio, defined as

Gain_ratio(D, a) = Gain(D, a) / IV(a) ,   (4.3)

where

IV(a) = − Σ_{v=1}^{V} ( |Dᵛ| / |D| ) log₂( |Dᵛ| / |D| )   (4.4)

is the intrinsic value of attribute a [Quinlan, 1993]; it grows with the number of possible values V. For the watermelon data of Table 4.1, for example, IV(touch) = 0.874 (V = 2), IV(color) = 1.580 (V = 3), and IV(ID) = 4.088 (V = 17).

Note that the gain ratio criterion in turn favors attributes with few values, so C4.5 does not simply pick the attribute with the largest gain ratio; instead it uses a heuristic [Quinlan, 1993]: first select the candidate attributes whose information gain is above average, then among them choose the one with the highest gain ratio.

4.2.3 Gini index

The CART decision tree [Breiman et al., 1984] (Classification and Regression Tree) selects its split attributes using the Gini index. With the notation of (4.1), the purity of data set D can be measured by the Gini value

Gini(D) = Σ_{k=1}^{|𝒴|} Σ_{k'≠k} p_k p_{k'} = 1 − Σ_{k=1}^{|𝒴|} p_k² .   (4.5)

Intuitively, Gini(D) is the probability that two examples drawn at random from D belong to different classes; the smaller it is, the purer D. Analogously to (4.2), the Gini index of attribute a is

Gini_index(D, a) = Σ_{v=1}^{V} ( |Dᵛ| / |D| ) Gini(Dᵛ) ,   (4.6)

and among the candidate attribute set A we choose the attribute that minimizes the Gini index: a* = argmin_{a∈A} Gini_index(D, a).
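A short sketch of (4.5) and (4.6), mirroring the information-gain helpers above (illustrative interface):

```python
import numpy as np
from collections import Counter

def gini(labels):
    """Gini value of (4.5)."""
    counts = np.array(list(Counter(labels).values()), float)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def gini_index(values, labels):
    """Gini index of a discrete attribute, eq. (4.6)."""
    values, labels = np.asarray(values), np.asarray(labels)
    return sum((values == v).mean() * gini(labels[values == v]) for v in np.unique(values))
```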

4.3
2.1
(pruning)

prepmning) (post -
pruning) [Quinlan , 1993].
2.2

4.1
4.2 17}
{4, 13} .

4.2 2.0

1
x
si
1
X

is
!
1
®''
«
«'
'

®'
1
X r
@'

4.2.1 4.2
4.5

4.3.1

4.6

4.2
4.3

CM
C

4.5 4.2

< ■
42.

57.

4.6 4.2

4.2
{4 ,5,8}
x
4.6
{6,7,15,17} .
{4 ,5,8,11,12} x
{5}
57.1% .

4.2 4.6
(decision
stump).

4.6 4.5

4.3.2

4.2
4.5

4.5
{7,15}

4.7

{6,7,15}

{1,2,3,14}
4.4

( r :

Y
)
3

t ] (5

C¥ 57.

4.7 4.2

4.2 4.7

4.7 4.6

4.4

4.4.1 Continuous values

Our discussion so far assumed discrete attributes, but continuous attributes are common in practice. Since a continuous attribute has infinitely many possible values, we cannot branch directly on each value; instead the attribute is discretized, most simply by bi-partition, the strategy used in C4.5 [Quinlan, 1993].

Suppose continuous attribute a takes n distinct values on D, sorted as a¹ < a² < ... < aⁿ. A split point t partitions D into D_t⁻ (examples with value of a not larger than t) and D_t⁺ (examples with value larger than t). Any t in the interval [aⁱ, aⁱ⁺¹) induces the same partition, so it suffices to examine the n − 1 candidate split points

T_a = { (aⁱ + aⁱ⁺¹)/2 : 1 ≤ i ≤ n − 1 } ,   (4.7)

i.e. the midpoints of adjacent values. Split points can then be evaluated like discrete attribute values; e.g. (4.2) becomes

Gain(D, a) = max_{t∈T_a} Gain(D, a, t)
           = max_{t∈T_a} Ent(D) − Σ_{λ∈{−,+}} ( |D_t^λ| / |D| ) Ent(D_t^λ) ,   (4.8)

where Gain(D, a, t) is the information gain of bi-partitioning D at split point t, and the split point maximizing it is chosen.
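A small sketch of (4.7)-(4.8): enumerate the midpoints and return the split point with the largest information gain (the helper names are illustrative).

```python
import numpy as np
from collections import Counter

def _entropy(labels):
    p = np.array(list(Counter(labels).values()), float)
    p /= p.sum()
    return float(-(p * np.log2(p)).sum())

def best_split(values, labels):
    """Best bi-partition split point of a continuous attribute, following (4.7)-(4.8)."""
    values, labels = np.asarray(values, float), np.asarray(labels)
    a = np.unique(values)
    candidates = (a[:-1] + a[1:]) / 2.0      # midpoints, eq. (4.7)
    base = _entropy(labels)
    best_t, best_gain = None, -np.inf
    for t in candidates:
        left = values <= t
        gain = base - left.mean() * _entropy(labels[left]) \
                    - (~left).mean() * _entropy(labels[~left])   # eq. (4.8)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```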

4.1 2.0
. 4.3 3.0.

4.3 3.0

isl
0.697 0.460
0.774 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
r 0.481 0.149
«
1
X
051
r
'
0.437
0.666
0.211
0.091
0.267
1
x « ' 0.245 0.057
1
i r 0.343 0.099
1
X « ' 0.639 0.161
« '
' 0.657 0.198
0.360 0.370
0.593 0.042
0.719 0.103
4.4

(4 . 7),
r 0.351, 0.381, 0.420, 0.459, 0.518,
0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708, 0.746}. (4.8)
0.381.

{0.049, 0.074, 0.095, 0.101, 0.126, 0.155, 0.179, 0.204, 0.213, 0.250, 0.265,
0.418}. (4.8)
0.126.
4.2.1 4.3
Gain(D,
Gain( D _ 0.109; Gain( D,
0.141; Gain(D,
0.143;
0.381;
Gain(D, 0.289; Gain(D, 0.006;
Gain(D, 0.262; Gain(D, 0.349.

4.8

$0.381?

4.8 3.0

<0.381”

$0.294” .
4.4.2

HIV
4.4 4.1 2.0
16}

4.4 2.0a

i
1

1
X
ffi

-
lwt Il
1)
(2) •

D a, f) D a
(1), ci a F
ay}, D a
D A: ... Ul= i Dkl

-
^
D= uLi $ WB
1.

P
E ^iDX
(4.9)
ECCG-D W .

^
xeDk
Pk (1 (4.10)
.
rv = E xeDv
W
(4.H )
4.4

p
/c a
ElkllPk = h Etirv 1. .

(4.2)

_
Gain( D a) p x Gain( jD ,a)
v
(D) -
(4.1),
= px Ent
^
V =1
fvEnt (4.12)

1 1
^

(2), z
Ent ( D ) = -
^
k l

ci
2

pk log 2 p k .

z
a

C4.5 [Quinlan,1993]. 4.4

1.
17}

Ent ( D ) = - [ pk og 2 p k
k=l
^
6 6 8 8
.
i4 l0 2 l4 + Til0 2 l4 = 0.985
g g

f)1, D2 f)3

Ent(D1) =
4+4
l0g2 l0g
24 =
1.000
4
2
Ent ( D2 ) = n
6
l0g
26 + 6 l0g26 = 0.918
Ent ( _Z 3 )

D
^
^^ ^ og + log2 - 0.000

^ ^
Gain(£> E n t( D ) - f Ent( )
V =1
^
0.985 - 14—— x 1.000 IT
14
x 0.918 H - 14—— x 0.000
= 0.306 .

Gain(D, = p x Gain(£), x 0.306 0.252 .

D
Gain(D, 0.252; Gain(D, 0.171;
Gain(£), 0.145; Gain(D, 0.424;
Gain(D, 0.289; Gain(D, 0.006.

15}
17} {11,12,16}
1.
{8}
{10}

4.9

4.5

d
d

-
(axis parallel)
4.5

^
C T)( J xf
=?
\
J
=? CpjT )

4.9 2.0a

4.5 3.0a 4.10


4.11

3 . 0a
4.5 3.0a
4.3 3.0

0.697 0.460
0.774 0.376
0.634 0.264
0.608 0.318
0.556 0.215
0.403 0.237
0.481 0.149
0.437 0.211
0.666 0.091
0.243 0.267
0.245 0.057
0.343 0.099
0.639 0.161
0.657 0.198
0.360 0.370
0.593 0.042
0.719 0.103
<0.126?

< 0.381?

< 0.205?

< 0.560?

C
4.10 3,0a

0.6

0.2 0.4 0.6 0.8

4.11 4.10

4.12

4.12
multivariate decision tree)

(oblique
decision tree) .

t
t
univariate decision tree)
4.5

Vii

o X

4.12

3.0a 4.13
4.14

[ -0.800 x -0.044 x -0.313?


-0.365 x 0.366 x
-0.158?

4.13 3.0a

0.6

0.4
4
^ 0.2

0.2 0.4 0.6 0.8

4.14 4.13
4.6

ID3 [Quinlan, 1986] C4.5 [Quin -


lan , 1993] CART [Breiman et al., 1984]. [Murthy , 1998]
C4.5Rule C4.5
[Quinlan, 1993], C4.5Rule

[Mingers , 1989b],
[Raileanu and Stoffel,
2004]
. 4.3
[Mingers, 1989a],

0C1 [Murthy et al., 1994 ] [Brodley and Ut -


goff , 1995] OC1
[Brodley
and UtgofF, 1995]

( Perceptron tree) [Utgoff , 1989b]


[Guo and Gelfand , 1992]
incremental learning),

ID4 [Schlimmer and Fisher , ID5R [Utgoff ,1989a] ITI [Utgoff et al.,
1997]
4.1
0)

4.2

4.3 4.3

4.4 4.2

4.5 4.3

uci
http:// archive.ics.uci.edu / ml/. 4.6 UCI

2.4

4.7 4.2

MaxDepth 4.2

4.8*
MaxNode 4.7
4.7

4.9 4.4.2

4.10
3.0 p.84
3.0
4.3.
Breiman , L., J . Friedman , C. J . Stone, and R. A. Olshen. (1984) . Classification
and Regression Trees. Chapman & Hall/ CRC , Boca Raton , FL.
Brodley, C. E. and P. E. Utgoff . (1995) . “Multivariate decision trees.” Machine
Learning , 19 (1) :45-77.
Guo, H. and S. B. Gelfand. (1992). “ Classification trees with neural network
feature extraction.” IEEE Transactions on Neural Networks, 3(6 ) :923-933.
Mingers , J. (1989a) . “An empirical comparison of pruning methods for decision
tree induction.” Machine Learning , 4 ( 2 ):227-243.
Mingers, J . (1989b) . “An empirical comparison of selection measures for
decision-tree induction.” Machine Learning , 3( 4) :319-342.
Murthy, S. K . (1998) . “Automatic construction of decision trees from data:
A multi-disciplinary survey.” Data Mining and Knowledge Discovery , 2 ( 4):
345-389.
Murthy, S. K ., S. Kasif , and S. Salzberg. (1994) . “A system for induction of
oblique decision trees.” Journal of Artificial Intelligence Research, 2:1-32.
Quinlan , J . R. (1979) . “Discovering rules by induction from large collections of
examples.” In Expert Systems in the Micro- electronic Age ( D. Michie, ed.) ,
168—201, Edinburgh University Press , Edinburgh , UK.
Quinlan , J . R. (1986 ). “Induction of decision trees.” Machine Learning , 1(1):
81-106.
Quinlan , J . R. (1993) . C4.5: Programs for Machine Learning. Morgan Kauf-
mann San Mateo, CA.
Raileanu , L. E. and K. Stoffel. ( 2004) . “Theoretical comparison between the
Gini index and information gain criteria.” Annals of Mathematics and Arti-
ficial Intelligence 41(1):77-93.
Schlimmer , J . C. and D. Fisher. (1986). “A case study of incremental concept
induction.” In Proceedings of the 5th National Conference on Artificial In-
telligence ( AAAI )^ 495-501, Philadelphia, PA .
Utgoff , RE. (1989a) . “Incremental induction of decision trees.” Machine
Learning , 4( 2 ) :161-186.
UtgofF, P.E.(1989b). “Perceptron trees: A case study in hybrid concept rep
-
resenations.55 Connection Science,1(4):377-391.
Utgoff ,P.E., N.C.Berkman ,and J .A.Clouse.(1997). “Decision tree induction
based on effcient tree restructuring.” Machine Learning, 29(1):5-44.

• J . Ross Quinlan, .
' E. B .
Hunt CLS
( Concept Learning System)
. Hunt

D. Michie

CLS

ID3
Machine Learning U ID3
ID4 ID5
ID3
C4.0 Classifier 4 .o C4.0, C4.5. C4.5
C4.0 4.5
C4.5 WEKA
C5.0.
J 4.8.
5.1

( neural networks)

T. Kohonen
Neural Networks
[Kohonen , 1988] .

neuron unit. ( neuron )

bias.
( threshold ) ,

[McCulloch and Pitts, 1943] 5.1


“ M-P
n
(connection )

Xi

-
>

5.1 M P
-
( activation function)
5.2(a)

Sigmoid Sigmoid
3.3
Sigmoid 5.2(b)
0,1)
(squashing function).

sgn(x) = 1 if x ≥ 0, and 0 if x < 0;   sigmoid(x) = 1 / ( 1 + e^{−x} ).

[Figure 5.2: Typical neuron activation functions: (a) the step function; (b) the sigmoid function]

5.2

( Perception) 5.3
MP -
( threshold logic unit ).
y / CE -


5.2

X2 ): y /(1 .
^ . —
5.2

5.3

^^
Xi = X2 = 1

• $1 X2)

Z2
IJ
-
/(1 X i ~\
-1 - X2

• ~
i ): — 0.6, W 2 = 0, 9 = — 0.5, J y /( — 0.6 ri .
X2 0.5), y 1.

More generally, given training data, the weights w_i (i = 1, 2, ..., n) and the threshold θ can be learned. The threshold θ can be treated as a weight w_{n+1} attached to a dummy node whose input is fixed at −1.0, so that learning the weights and the threshold is unified as learning weights. For a training example (x, y), if the current perceptron outputs ŷ, its weights are adjusted by

w_i ← w_i + Δw_i ,   (5.1)

Δw_i = η ( y − ŷ ) x_i ,   (5.2)

where η ∈ (0, 1) is the learning rate, typically set to a small value such as 0.1. If the perceptron predicts (x, y) correctly, i.e. ŷ = y, nothing changes; otherwise the weights are adjusted in proportion to the error.
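A minimal sketch of the perceptron rule (5.1)-(5.2), with the threshold folded in as a dummy input of −1.0 as described above; the logical-AND toy task is an illustrative, linearly separable example.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, n_epochs=100):
    """Perceptron learning with the update rule (5.1)-(5.2); y in {0, 1}."""
    X_aug = np.hstack([X, -np.ones((len(X), 1))])   # dummy node with fixed input -1.0
    w = np.zeros(X_aug.shape[1])
    for _ in range(n_epochs):
        for x_i, y_i in zip(X_aug, y):
            y_hat = 1.0 if x_i @ w >= 0 else 0.0    # step activation
            w += eta * (y_i - y_hat) * x_i          # (5.1), (5.2)
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0, 0, 0, 1], float)                   # logical AND
w = train_perceptron(X, y)
print([1.0 if np.append(x, -1.0) @ w >= 0 else 0.0 for x in X])   # -> [0, 0, 0, 1]
```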

(functional neuron),
(linearly separable) [Minsky and Papert ,
1969],

-
5.4(a) (c) (converge)
;
it ...; wn+ i ) ; (fluctuation),W
5.4(d )
(0, 1) K
—-- —
S, (1, 1)

(1,0) X (0, 0) (1,0) X1

(a) { xi A X 2 ) (b) V r 2)

^
2

>
(0,

(1,0)
(c) .
(- x i ) (d) r 2)

5.4

5.5
5.5( a) M
( hidden layer ),

5.6

multi-layer feedforward neural

0.5

0.5 0.5

X1

(a) (b)

5.5
5.3

(a) (b)

5.6

networks) ,

5.5.5
5.6 ( a)

( connection weight )

5.3

(5.1)
(error
BackPropagation , BP )
BP
BP
[Pineda, 1987]. “ BP
BP
BP D
( ®2 ,2/2 ),• .. 2/m )}, Xi G Vi G ,
M1 d Z
5.7 d Z
g j

3.2
h j
Whj . Eti ^ih^i
j ELI hjbh,
VI VI
► ./
Q

^
wQ3
Pj = 2Whjh
h=l
bh
h
d

1 dfb,
Oih = EVihXi
1 Xi

5.7 BP

. U 5.2( b) Sigmoid
3.3

For a training example (x_k, y_k), suppose the network outputs ŷ_k = (ŷ₁^k, ŷ₂^k, ..., ŷ_l^k), where

ŷ_j^k = f( β_j − θ_j ) ;   (5.3)

the error of the network on (x_k, y_k) is then

E_k = ½ Σ_{j=1}^{l} ( ŷ_j^k − y_j^k )² .   (5.4)

5.7 d Z 1) g Z dxg
gx Z g Z
. BP
(5.1) ;
t

v v - Av . (5.5)

5.7
B.4 .
BP is based on gradient descent: parameters are adjusted in the negative gradient direction of the error. For the error E_k in (5.4) and learning rate η, the update of the hidden-to-output weight w_hj is

Δw_hj = −η ∂E_k/∂w_hj .   (5.6)

Note that w_hj first affects the input β_j of the j-th output unit, then its output ŷ_j^k, and then E_k, so by the chain rule

∂E_k/∂w_hj = ∂E_k/∂ŷ_j^k · ∂ŷ_j^k/∂β_j · ∂β_j/∂w_hj .   (5.7)

By the definition of β_j,

∂β_j/∂w_hj = b_h .   (5.8)

The sigmoid function of Figure 5.2 has the convenient property

f'(x) = f(x)( 1 − f(x) ) ,   (5.9)

so from (5.4) and (5.3),

g_j = −∂E_k/∂ŷ_j^k · ∂ŷ_j^k/∂β_j
    = −( ŷ_j^k − y_j^k ) f'( β_j − θ_j )
    = ŷ_j^k ( 1 − ŷ_j^k )( y_j^k − ŷ_j^k ) .   (5.10)

Substituting (5.10) and (5.8) into (5.7), and then into (5.6), gives the BP update for w_hj,

Δw_hj = η g_j b_h ,   (5.11)

and similarly

Δθ_j = −η g_j ,   (5.12)

Δv_ih = η e_h x_i ,   (5.13)

Δγ_h = −η e_h ,   (5.14)

where, in (5.13) and (5.14),

e_h = −∂E_k/∂b_h · ∂b_h/∂α_h
    = b_h ( 1 − b_h ) Σ_{j=1}^{l} w_hj g_j .   (5.15)

The learning rate η ∈ (0, 1) controls the step size of each update: too large and training oscillates, too small and convergence is slow; it is often set to 0.1, and the rates used in (5.11)-(5.12) and in (5.13)-(5.14) need not be the same.
5.8 BP BP

-
4 5

BP
5.9

Input: training set D = {(x_k, y_k)}_{k=1}^m; learning rate η.
Process:
 1: randomly initialize all connection weights and thresholds in (0, 1)
 2: repeat
 3:   for all (x_k, y_k) ∈ D do
 4:     compute the current output ŷ_k by (5.3);
 5:     compute the output-layer gradient term g_j by (5.10);
 6:     compute the hidden-layer gradient term e_h by (5.15);
 7:     update the connection weights w_hj, v_ih and thresholds θ_j, γ_h by (5.11)-(5.14)
 8:   end for
 9: until the stopping condition is met
Output: a multi-layer feedforward network with the learned weights and thresholds

Figure 5.8: The error backpropagation (BP) algorithm
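A minimal NumPy sketch of standard (per-example) BP for a d-q-l sigmoid network, following (5.3)-(5.15); the layer size q, learning rate, epoch count, and the XOR toy task are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_bp(X, Y, q=5, eta=0.1, n_epochs=5000, seed=0):
    """Standard BP updating after every example, as in Figure 5.8."""
    rng = np.random.default_rng(seed)
    d, l = X.shape[1], Y.shape[1]
    V = rng.uniform(0, 1, (d, q)); gamma = rng.uniform(0, 1, q)   # input->hidden weights, hidden thresholds
    W = rng.uniform(0, 1, (q, l)); theta = rng.uniform(0, 1, l)   # hidden->output weights, output thresholds
    for _ in range(n_epochs):
        for x, y in zip(X, Y):
            b = sigmoid(x @ V - gamma)                  # hidden outputs
            y_hat = sigmoid(b @ W - theta)              # network outputs, eq. (5.3)
            g = y_hat * (1 - y_hat) * (y - y_hat)       # (5.10)
            e = b * (1 - b) * (W @ g)                   # (5.15)
            W += eta * np.outer(b, g); theta -= eta * g     # (5.11), (5.12)
            V += eta * np.outer(x, e); gamma -= eta * e     # (5.13), (5.14)
    return V, gamma, W, theta

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)   # XOR, not solvable by a single unit
Y = np.array([[0], [1], [1], [0]], float)
V, gamma, W, theta = train_bp(X, Y)
print(sigmoid(sigmoid(X @ V - gamma) @ W - theta).round(2))
```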

BP as presented above updates after each training example, minimizing the single-example error E_k. The accumulated error over the whole training set D is

E = (1/m) Σ_{k=1}^{m} E_k .   (5.16)

BP
5.8
5.3

-0.4? 0.51

0.63 0.32 2.00 1.65


-0.66 -0.34 -0.04 -0.96(
-2.02 -1.68
' 0.53
-0.66 0.26
0.35 -0.58 1.30 -1.40

(a) (b) (c)

5.9 BP

( accumulated error backpropagation) BP BP


BP

BP BP
(one round , D
one epoch)

BP BP
BP
D
(stochastic gradient
descent , SGD) [Hornik et al.,1989]

BP
- -
trial by error)

BP
(early stopping):

The second strategy is regularization [Barron, 1991; Girosi et al., 1995]: add to the error objective a term that penalizes model complexity, such as the sum of squared connection weights and thresholds. With w_i denoting these parameters, (5.16) becomes

E = λ (1/m) Σ_{k=1}^{m} E_k + (1 − λ) Σ_i w_i² ,   (5.17)

where λ ∈ (0, 1) trades off the empirical error against the complexity term and is usually estimated by cross-validation.

5.4

(local minimum)
global minimum). ;*
tt *
,

w (w e) e {(w 9) \ \\ ( w e ) - ( w e* ) \\ e}
^ ^
E(W ] e)
(w e )
-
E(W * O * )
( *
,#)

E(w *;9* )

5.10

5.1)
BP
(5.14)
-
(5.11)
5.4

--
14 . —
E

f
6

5.10

• (simulated annealing ) [Aarts and Korst , 1989] .

( genetic algorithms ) [Goldberg, 1989]


5.5

5.5.1 RBF networks

An RBF (Radial Basis Function) network [Broomhead and Lowe, 1988] is a single-hidden-layer feedforward network that uses radial basis functions as the activation of its hidden units, while the output is a linear combination of the hidden-unit outputs. With d-dimensional input x and a single output, an RBF network can be written as

φ(x) = Σ_{i=1}^{q} w_i ρ(x, c_i) ,   (5.18)

where q is the number of hidden units, c_i and w_i are the center and weight of the i-th hidden unit, and ρ(x, c_i) is a radial basis function, a radially symmetric scalar function of the distance between x and c_i, commonly the Gaussian

ρ(x, c_i) = e^{−β_i ‖x − c_i‖²} .   (5.19)

[Park and Sandberg, 1991] proved that an RBF network with enough hidden units can approximate any continuous function to arbitrary precision. Such networks are usually trained in two steps: first determine the centers c_i, e.g. by random sampling or clustering, then learn the parameters w_i and β_i, e.g. by BP.
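A minimal sketch of the two-stage recipe just described: with the centers fixed in advance and a shared width β, the output weights of (5.18) can be fit by linear least squares (BP could be used instead, as the text notes). The function names and shared-β simplification are illustrative assumptions.

```python
import numpy as np

def rbf_fit(X, y, centers, beta=1.0):
    """Fit the output weights w_i of the RBF network (5.18)-(5.19) by least squares."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-beta * sq_dist)                 # hidden activations rho(x, c_i), eq. (5.19)
    w, *_ = np.linalg.lstsq(H, y, rcond=None)   # output weights of (5.18)
    return w

def rbf_predict(X, centers, w, beta=1.0):
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-beta * sq_dist) @ w
```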
5.5 .2 ART
(competitive learning)

.
- -
winner take all)
ART ( Adaptive Resonance Theory, [Carpenter and
Grossberg, 1987]
5.5

ART

ART
plasticity dilemma),
-
(stability

ART
( incremental learning) (online
learning).
ART ART
ART2 FuzzyART
ARTMAP
-
(batch mode)
5.5.3 S0 M
SOM(Self-Organizing Map , [Kohonen, 1982]
(Self-Organizing Fea
ture Map) Kohonen
-
^

5.11 SOM

SOM

SOM

( best matching unit ).


5.11 SOM

5.5 .4

tive)
-
(construc

5.5.2 ART
-
(Cascade Correlation) [Fahlman and Lcbiere, 1990]

( a) b) c)

5.12

5.12

(correlation)
5.5

5.5.5 Elman
“recursive neural recurrent neural networks)
networks” .

t t
t -
Elman [Elman , 1990] 5.13

Sigmoid BP
[Pineda, 1987].

5.13 Elman

5.5.6 Boltzmann

(energy),

Boltzmann [Ackley et al.,1985] energy-based


5.14( a )
model), 5.14(a)
Boltzmann
Boltzmann

Let the states of the n neurons be s ∈ {0, 1}ⁿ, with w_ij the connection weight between neurons i and j and θ_i the threshold of neuron i. The energy of the Boltzmann machine in state s is

E(s) = − Σ_{i=1}^{n−1} Σ_{j=i+1}^{n} w_ij s_i s_j − Σ_{i=1}^{n} θ_i s_i .   (5.20)

[Figure 5.14: (a) Boltzmann machine; (b) restricted Boltzmann machine]

If the neurons are updated in an arbitrary order independent of the inputs, the network eventually reaches a Boltzmann (equilibrium, stationary) distribution, in which the probability of state s is determined solely by its energy:

P(s) = e^{−E(s)} / Σ_t e^{−E(t)} .   (5.21)

Boltzmann
Boltzmann
Boltzmann
( Restricted Boltzmann Machine,
mann
RBM).
Boltzmann
5.14(b)
-
Boltz

The restricted Boltzmann machine is commonly trained with the Contrastive Divergence (CD) algorithm [Hinton, 2010]. With d visible and q hidden units and state vectors v and h, the bipartite restriction means that the hidden units are conditionally independent given the visible states, and vice versa:

P(v | h) = Π_{i=1}^{d} P(v_i | h) ,   (5.22)

P(h | v) = Π_{j=1}^{q} P(h_j | v) .   (5.23)

For each training example v, CD first computes the hidden-unit probabilities by (5.23) and samples h; it then reconstructs v' from h by (5.22), and computes h' from v'; the connection weights are updated by

Δw = η ( v hᵀ − v' h'ᵀ ) .   (5.24)
^
5.6

5.6

capacity)

deep learning) •

BP
diverge)

( unsupervised layer-wise training)

-
(pre training); (fine -
tuning) ( deep belief network , DBN) [Hinton
etal. 2006] Boltzmann
RBM
RBM RBM
BP

weight sharing),
( Convolutional Neural
Network, CNN) [LeCun and Bengio, 1995; LeCun et al. 1998]
CNN [LeCun et al.,1998], 5.15
m Kir. I-

x 6014 x 14
^
16 10 x 10

5.15 [LeCun et al.,1998]

32 x 32 CNN

( feature map),

CNN
5.15
Sigmoid
28 x 28 5x5
pooling)
if x
otherwise , 5.15
14 x 14
Re
LU ( Rectified Linear Unit );
- 2x 2 5.15
CNN
CNN BP
8.4 •
5.15

DBN CNN

(feature learning)
( representation learning).

(feature engineering).
5.7

5.7

[Haykin, 1998] [Bishop, 1995]


Neural Computation Neural
Networks IEEE Transactions on Neural Networks and Learning Systems;
IEEE Transactions on Neu-
ral Networks. ( NIPS)
NIPS (IJCNN) (ICANN)

(IC0NIP).
MP -
(spiking neuron)
[Gerstner and Kistler , 2002].
BP [Werbos,1974] [Rumelhart et al.,1986a,b]
LMS Widrow- HofF
• BP LMS (Least Mean Square) LMS
d
LMS
BP BP
[Chauvin and Rumelhart , 1995].
[MacKay, 1992]
[Gori and Tesi , 1992] BP [Yao,
1999] (evolutionary computation)
BP

(trick) [Reed and Marks, Orr and Muller , 1998].


RBF [Schwenker et al ., 2001]. [Carpenter and
Grossberg, 1991] ART SOM
[Kohonen, 2001]. [Bengio et al., 2013]

[Tickle et al., Zhou , 2004].


5.1 f { x ) = wTx

5.2 5.2 ( b)

5.3 5.7 BP (5.13) .

5.4 (5.6 )

5.5 BP BP 3.0
3.0 p.84
4.3.

5.6 BP
uci
UCI BP
http:/ / archive.ics.uci.edu / ml / .

5.7 (5.18) (5.19 ) , RBF

5.8 SOM 3.0a


3.0o p.89
4.5.

5.9* Elman BP

5.10
MNIST
MNIST
http: / / yann.lecun.com /
exdb/ mnist /.
Aarts, E. and J . Korst . (1989) . Simulated Annealing and Boltzmann Machines:
A Stochastic Approach to Combinatorial Optimization and Neural Comput-
ing. John Wiley Sons, New York, NY.
Ackley, D. H., G. E. Hinton, and T. J . Sejnowski. (1985). “A learning algorithm
for Boltzmann machines.” Cognitive Science^ 9 (1):147-169.
Barron, A. R. (1991) . “Complexity regularization with application to artifi-
cial neural networks. ” In Nonparametric Functional Estimation and Related
Topics; NATO A SI Series Volume 335 ( G. Roussas , ed.) , 561-576 , Kluwer ,
Amsterdam , The Netherlands.
Bengio, Y., A . Courville, and P. Vincent . ( 2013) . “Representation learning: A
review and new perspectives.” IEEE Transactions on Pattern Analysis and
Machine Intelligence 35(8) :1798-1828.
Bishop, C. M. (1995) . Neural Networks for Pattern Recognition. Oxford Uni-
versity Press , New York, NY.
Broomhead , D . S. and D. Lowe. (1988) . “ Multivariate functional interpolation
and adaptive networks.” Complex Systems, 2 ( 3) :321-355.
Carpenter , G. A. and S. Grossberg. (1987) . “A massively parallel architecture
for a self-organizing neural pattern recognition machine.” Computer Vision,
Graphics , and Image Processing , 37 (1) :54-115.
Carpenter , G. A. and S. Grossberg, eds. (1991). Pattern Recognition by Self -
Organizing Neural Networks. MIT Press, Cambridge, MA.
Chauvin , Y. and D. E. Rumelhart , eds. (1995). Backpropagation: Theory, Ar-
chitecture , and Applications. Lawrence Erlbaum Associates, Hillsdale, NJ .
Elman, J . L. (1990). “ Finding structure in time.” Cognitive Science , 14 ( 2 ) :
179-211.
Fahlman, S. E. and C. Lebiere. (1990) . “The cascade-correlation learning archi-
tecture.” Technical Report CMU-CS-90-100 , School of Computer Sciences,
Carnergie Mellon University, Pittsburgh , PA .
Gerstner , W. and W. Kistler. (2002 ) . Spiking Neuron Models: Single Neurons,
Populations , Plasticity. Cambridge University Press, Cambridge, UK.
Girosi, F., M. Jones, and T. Poggio. (1995) . “Regularization theory and neural
networks architectures.” Neural Computation, 7( 2 ) :219-269.
Goldberg , D . E. (1989 ) . Genetic Algorithms in Search, Optimizaiton and Ma-
chine Learning. Addison-Wesley, Boston , MA.
Gori, M. and A. Tesi. (1992). “On the problem of local minima in backpropa-
gation.” IEEE Transactions on Pattern Analysis and Machine Intelligence ,
14 (1) :76-86.
Haykin , S. (1998) . Neural Networks: A Comprehensive Foundation, 2nd edi-
tion. Prentice-Hall , Upper Saddle River , NJ .
Hinton , G. ( 2010) . “A practical guide to training restricted Boltzmann ma-
chines.” Technical Report UTML TR 2010-003, Department of Computer
Science, University of Toronto.
Hinton , G ., S. Osindero, and Y.-W. Teh. ( 2006) . “A fast learning algorithm for
deep belief nets.” Neural Computation, 18( 7) :1527-1554.
Hornik , K., M. Stinchcombe, and H . White. (1989 ) . “ Multilayer feedforward
networks are universal approximators.” Neural Networks , 2 (5) :359-366.
Kohonen , T. (1982) . “Self-organized formation of topologically correct feature
maps. Biological Cybernetics , 43(1):59-69.
Kohonen , T. (1988) . “ An introduction to neural computing.” Neural Networks ,
1(1):3-16.
• Kohonen , T. ( 2001) . Self - Organizing Maps , 3rd edition. Springer , Berlin.
LeCun , Y. and Y. Bengio. (1995). “Convolutional networks for images , speech ,
and time-series.” In The Handbook of Brain Theory and Neural Networks
( M. A. Arbib, ed.) , MIT Press, Cambridge, MA.
LeCun , Y., L. Bottou , Y. Bengio, and P. Haffner. (1998) . “Gradient-based
learning applied to document recognition.” Proceedings of the IEEE , 86 (11):
2278-2324.
MacKay, D. J . C. (1992 ) . “A practical Bayesian framework for backpropagation
networks.” Neural Computation, 4 ( 3) :448-472.
McCulloch , W. S. and W. Pitts. (1943) . “A logical calculus of the ideas im-
manent in nervous activity.” Bulletin of Mathematical Biophysics, 5( 4 ) :115-
133.
Minsky, M. and S. Papert. (1969 ) . Perceptrons. MIT Press, Cambridge, MA.
Orr , G. B. and K.-R. Muller , eds. (1998) . Neural Networks: Tricks of the Trade.
Springer , London, UK.
Park , J . and I. W. Sandberg. (1991) . “Universal approximation using radial-
basis-function networks.” Neural Computation, 3( 2 ):246-257.
Pineda, F. J . (1987). “Generalization of Back-Propagation to recurrent neural
,
networks.” Physical Review Letters 59 (19):2229-2232.
Reed , R. D. and R. J . Marks. (1998) . Neural Smithing: Supervised Learning in
Feedforward Artificial Neural Networks. MIT Press, Cambridge, MA.
Rumelhart , D. E. G. E. Hinton , and R. J . Williams. (1986a) . “Learning internal
representations by error propagation.” In Parallel Distributed Processing:
Explorations in the Microstructure of Cognition ( D. E. Rumelhart and J . L.
McClelland , eds. ) , volume 1, 318-362, MIT Press, Cambridge, MA.
Rumelhart , D. E., G. E. Hinton, and R. J. Williams. (1986b ) . “Learning repre-
sentations by backpropagating errors.” Nature , 323( 9 ) :318-362.
Schwenker , F. , H .A. Kestler , and G. Palm. ( 2001) . “Three learning phases for
radial- basis-function networks.” Neural Networks , 14( 4-5) :439-458.
Tickle, A. B., R. Andrews, M. Golea , and J . Diederich. (1998) . “The truth
will come to light: Directions and challenges in extracting the knowledge
embedded within trained artificial neural networks.” IEEE Transactions on
Neural Networks, 9 (6):1057-1067.
Werbos, P. (1974) . Beyond regression: New tools for prediction and analysis in
the behavior science. Ph.D. thesis, Harvard University, Cambridge , MA .
Yao, X. (1999 ) . “Evolving artificial neural networks.” Proceedings of the IEEE ,
87( 9 ) :1423 1447.
-

Zhou, Z .-H. ( 2004). “Rule extraction: Using neural networks or for neural
networks?” Journal of Computer Science and Technology , 19 ( 2 ) :249-253.
MP- Hebb
Adaline

MIT ( Marvin
Minsky, ) Seymour Papert

Paul Werbos BP

John Ilopfield
NP UCSD
David Rumelhart James McClelland PDP
Rumelhart BP
Hopfield BP

trick) NIPS

ImageNet
6.1 Margin and support vectors

Given a training set D = {(x₁, y₁), (x₂, y₂), ..., (x_m, y_m)} with y_i ∈ {−1, +1}, the most basic idea of classification learning is to find a separating hyperplane in the sample space that splits the two classes. Many such hyperplanes may exist (Figure 6.1); intuitively we should pick the one "in the middle" of the two classes, since it is the most robust to perturbations of the training examples and generalizes best.

[Figure 6.1: Several hyperplanes separate the two classes; the middle, bold one is preferable]

In the sample space a separating hyperplane can be written as the linear equation

wᵀx + b = 0 ,   (6.1)

where w = (w₁; w₂; ...; w_d) is the normal vector determining the direction of the hyperplane and b is the displacement determining its distance from the origin. The hyperplane is denoted (w, b). The distance from any point x to the hyperplane (w, b) is

r = |wᵀx + b| / ‖w‖ .   (6.2)

If the hyperplane (w, b) classifies all training examples correctly, i.e. wᵀx_i + b > 0 for y_i = +1 and wᵀx_i + b < 0 for y_i = −1, we can rescale (w, b) so that

wᵀx_i + b ≥ +1,  if y_i = +1 ;
wᵀx_i + b ≤ −1,  if y_i = −1 .   (6.3)

As illustrated in Figure 6.2, the few training examples closest to the hyperplane make the equalities in (6.3) hold; they are called support vectors. The sum of the distances from two support vectors of different classes to the hyperplane,

γ = 2 / ‖w‖ ,   (6.4)

is called the margin.

[Figure 6.2: Support vectors and margin]

To find the separating hyperplane with the maximum margin, we look for w and b satisfying the constraints in (6.3) such that γ is maximized:

max_{w,b}  2 / ‖w‖
s.t.  y_i ( wᵀx_i + b ) ≥ 1 ,  i = 1, 2, ..., m .   (6.5)

Maximizing 2/‖w‖ is equivalent to minimizing ‖w‖², so (6.5) can be rewritten as

min_{w,b}  ½ ‖w‖²
s.t.  y_i ( wᵀx_i + b ) ≥ 1 ,  i = 1, 2, ..., m .   (6.6)

This is the basic form of the support vector machine (SVM).

6.2 Dual problem

We want to solve (6.6) to obtain the maximum-margin model

f(x) = wᵀx + b ,   (6.7)

where w and b are the model parameters. (6.6) is itself a convex quadratic programming problem that off-the-shelf optimization packages can solve, but there is a more efficient route. Introducing a Lagrange multiplier α_i ≥ 0 for each constraint of (6.6) gives the Lagrangian

L(w, b, α) = ½‖w‖² + Σ_{i=1}^{m} α_i ( 1 − y_i ( wᵀx_i + b ) ) ,   (6.8)

where α = (α₁; α₂; ...; α_m). Setting the partial derivatives of L(w, b, α) with respect to w and b to zero yields

w = Σ_{i=1}^{m} α_i y_i x_i ,   (6.9)

0 = Σ_{i=1}^{m} α_i y_i .   (6.10)

Substituting (6.9) into (6.8) to eliminate w and b, and using (6.10), gives the dual problem of (6.6):

max_α  Σ_{i=1}^{m} α_i − ½ Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_iᵀx_j
s.t.   Σ_{i=1}^{m} α_i y_i = 0 ,
       α_i ≥ 0 ,  i = 1, 2, ..., m .   (6.11)

Solving for α and recovering w and b gives the model

f(x) = wᵀx + b = Σ_{i=1}^{m} α_i y_i x_iᵀx + b .   (6.12)

The α_i obtained from (6.11) are the Lagrange multipliers of (6.8), one per training example. Since (6.6) has inequality constraints, the solution must satisfy the KKT (Karush-Kuhn-Tucker) conditions

α_i ≥ 0 ;
y_i f(x_i) − 1 ≥ 0 ;
α_i ( y_i f(x_i) − 1 ) = 0 .   (6.13)

Hence for every training example (x_i, y_i), either α_i = 0 or y_i f(x_i) = 1. If α_i = 0, the example does not appear in the sum of (6.12) and has no influence on f(x); if α_i > 0, then y_i f(x_i) = 1 and the example lies on the maximum-margin boundary, i.e. it is a support vector. This reveals an important property of support vector machines: after training, most training examples need not be retained, and the final model depends only on the support vectors.

(6.11)
B .2.

SMO (Sequential Minimal Optimization)


[Platt,1998].
SMO
E"i = o,


SMO
SMO
^
6.2

• (6.11)
KKT
[Osuna et al,1997]. KKT
SMO KKT

SMO

• SMO
(6.11)

-
(6 14)

E OLkVk
k i,j
(6.15)

E o
i=l

OiiVi ajyj =c (6.16)

(6.11)
0.

( xs,ys ) ysf {xs)

Vs (S aiyixfxs b (6.17)

s {i I i m}
(6.17)

b= Vs - ^ aiyjxjxs • (6.18)
6.3 Kernel functions

If the original sample space admits no separating hyperplane, the examples can be mapped into a higher-dimensional feature space in which they become linearly separable; if the original space has a finite number of attributes, a suitable feature space always exists. Let φ(x) denote the image of x after the mapping. The separating hyperplane in the feature space is

f(x) = wᵀφ(x) + b ,   (6.19)

and, analogously to (6.6),

min_{w,b}  ½‖w‖²
s.t.  y_i ( wᵀφ(x_i) + b ) ≥ 1 ,  i = 1, 2, ..., m ,   (6.20)

with dual

max_α  Σ_{i=1}^{m} α_i − ½ Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j φ(x_i)ᵀφ(x_j)
s.t.   Σ_{i=1}^{m} α_i y_i = 0 ,
       α_i ≥ 0 ,  i = 1, 2, ..., m .   (6.21)

Solving (6.21) requires the inner products φ(x_i)ᵀφ(x_j), which can be hard or impossible to compute directly since the feature space may be very high (even infinite) dimensional. We therefore posit a function κ such that

κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ = φ(x_i)ᵀφ(x_j) ,   (6.22)

i.e. the inner product in the feature space equals a function evaluated in the original input space; this is the kernel trick. With it, (6.21) becomes

max_α  Σ_{i=1}^{m} α_i − ½ Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j κ(x_i, x_j)
s.t.   Σ_{i=1}^{m} α_i y_i = 0 ,
       α_i ≥ 0 ,  i = 1, 2, ..., m ,   (6.23)

and solving it gives

f(x) = wᵀφ(x) + b = Σ_{i=1}^{m} α_i y_i κ(x, x_i) + b .   (6.24)

Here κ(·, ·) is the kernel function, and (6.24), which writes the solution as an expansion over training examples, is known as the support vector expansion.
Theorem 6.1 (Kernel function) [Schölkopf and Smola, 2002]. Let 𝒳 be the input space and κ(·, ·) a symmetric function on 𝒳 × 𝒳. Then κ is a kernel function if and only if, for any data set D = {x₁, x₂, ..., x_m}, the kernel matrix K with entries K_ij = κ(x_i, x_j) is positive semidefinite.

Theorem 6.1 says that any symmetric function whose kernel matrix is positive semidefinite can be used as a kernel; every such kernel implicitly defines a feature space, called the Reproducing Kernel Hilbert Space (RKHS). Table 6.1 lists commonly used kernel functions.

Table 6.1: Commonly used kernel functions

linear kernel        κ(x_i, x_j) = x_iᵀx_j
polynomial kernel    κ(x_i, x_j) = (x_iᵀx_j)^d ,   d ≥ 1 is the degree
Gaussian (RBF)       κ(x_i, x_j) = exp( −‖x_i − x_j‖² / (2σ²) ) ,   σ > 0 is the width
Laplacian kernel     κ(x_i, x_j) = exp( −‖x_i − x_j‖ / σ ) ,   σ > 0
Sigmoid kernel       κ(x_i, x_j) = tanh( β x_iᵀx_j + θ ) ,   β > 0, θ < 0

New kernels can also be built from existing ones. If κ₁ and κ₂ are kernels, then for any γ₁, γ₂ > 0 their linear combination

γ₁κ₁ + γ₂κ₂   (6.25)

is a kernel; their direct product

κ₁ ⊗ κ₂ (x, z) = κ₁(x, z) κ₂(x, z)   (6.26)

is a kernel; and for any function g(·),

κ(x, z) = g(x) κ₁(x, z) g(z)   (6.27)

is a kernel.
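A small sketch computing a Gaussian kernel matrix and checking the property of Theorem 6.1 numerically (symmetric, positive semidefinite); the toy data are illustrative.

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """Kernel matrix with K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), cf. Table 6.1."""
    sq_norms = (X ** 2).sum(axis=1)
    sq_dist = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
    return np.exp(-np.maximum(sq_dist, 0) / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
K = gaussian_kernel_matrix(X)
print(np.allclose(K, K.T), np.all(np.linalg.eigvalsh(K) >= -1e-10))   # True True
```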

6.4 Soft margin and regularization

The discussion so far assumed the training examples are linearly separable in the sample or feature space, but in practice it is hard to know whether a given kernel makes them separable, and even when it does the result may simply be overfitting. One remedy is to allow the SVM to make mistakes on some examples; this leads to the soft margin, illustrated in Figure 6.4.

[Figure 6.4: Soft margin; the circled points violate the constraints]

The hard margin of the previous sections requires every example to satisfy

y_i ( wᵀx_i + b ) ≥ 1 ,   (6.28)

whereas the soft margin allows some examples to violate it while keeping the number of violators as small as possible. The objective becomes

min_{w,b}  ½‖w‖² + C Σ_{i=1}^{m} ℓ_{0/1}( y_i ( wᵀx_i + b ) − 1 ) ,   (6.29)

where C > 0 is a constant and ℓ_{0/1} is the 0/1 loss

ℓ_{0/1}(z) = 1 if z < 0 ;  0 otherwise.   (6.30)

As C → ∞, (6.29) forces all examples to satisfy (6.28) and reduces to (6.6); for finite C some violations are allowed. However, ℓ_{0/1} is non-convex and discontinuous, which makes (6.29) hard to optimize, so it is usually replaced by a surrogate loss, a convex continuous upper bound of ℓ_{0/1}. Three common surrogates are

hinge loss:        ℓ_hinge(z) = max( 0, 1 − z ) ;   (6.31)
exponential loss:  ℓ_exp(z) = exp( −z ) ;   (6.32)
logistic loss:     ℓ_log(z) = log( 1 + exp( −z ) ) .   (6.33)

[Figure 6.5: Three common surrogate losses: hinge, exponential, logistic]

With the hinge loss, (6.29) becomes

min_{w,b}  ½‖w‖² + C Σ_{i=1}^{m} max( 0, 1 − y_i ( wᵀx_i + b ) ) .   (6.34)

Introducing slack variables ξ_i ≥ 0, (6.34) can be rewritten as

min_{w,b,ξ}  ½‖w‖² + C Σ_{i=1}^{m} ξ_i
s.t.  y_i ( wᵀx_i + b ) ≥ 1 − ξ_i ,
      ξ_i ≥ 0 ,  i = 1, 2, ..., m .   (6.35)

This is the common form of the soft-margin support vector machine. Each example has its own slack variable, measuring by how much it violates constraint (6.28); like (6.6), (6.35) is still a quadratic programming problem.
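Instead of solving the quadratic program (6.35) with a dedicated solver such as SMO, the unconstrained hinge-loss form (6.34) can also be minimized directly by (sub)gradient descent; this is the idea behind linear-SVM solvers such as Pegasos [Shalev-Shwartz et al., 2011] mentioned in Section 6.7. A minimal full-batch sketch, with illustrative step size, epoch count, and toy data:

```python
import numpy as np

def linear_svm_subgradient(X, y, C=1.0, lr=0.01, n_epochs=200):
    """Minimize (6.34): 0.5*||w||^2 + C * sum_i max(0, 1 - y_i (w^T x_i + b)); y in {-1, +1}."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(n_epochs):
        margins = y * (X @ w + b)
        active = margins < 1                                          # examples with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)    # subgradient in w
        grad_b = -C * y[active].sum()                                 # subgradient in b
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = linear_svm_subgradient(X, y)
print(np.sign(X @ w + b))   # -> [ 1.  1. -1. -1.]
```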
m

m m

E
=
i l
(1 - - ( wTXi + 0 ) — E/ i ( 6.36 )

Qi > A>0
L( w;,6 a t tt?

m
w= E (6.37)

◦ = E^
m
iVi

5 (6.38)
i= l

C — OLi ^r \li . (6.39)


(6.36) (6.35)

T
max (6.40)
1 3=1

s.t . oily% =o,

0 ; C
a i = 1, 2, •• • , m •

(6.40) (6.11)
C 6.2
(6.40); (6.24)

(6.13), KKT

^+ o
^
y i f ( x i ) - l -h i 5
(6.41)
{V i f -1 0

( Xi , yi ), Vif ( xi )
f (x) yif { x i ) -
(6.39) C A
C, A

hinge

(6.29)
4P > (6.29)
(3.27).

[Platt , 2000];
psu and Lin,
2002]. 6.5 hinge
6.5

(6.29)

iin + (6.42)

structural risk ),
empirical risk )
C n( f )

(6.42)
regularization) C
Lp (norm) L2 ti

Lo |
|w|
|o h I Hx
11.4
w
^
6.5

( m ,2/m )}, (6.7)


{( X1 2/i ),(*2,2/2),.. ,
)

f(x) y -
^ W

(x ,y), f i x)
f ( x )3 y
(Support Vector Regression, SVR) /(» )
y e f(x) y e
6.6 f (x ) 2e
yi
+e
/ 0) wT x + b

O O
O
O
o
o
,
6
< o

0 r
6.6 e
-
SVR
m
mm WI + cE
2
A) %) (6.43)
=
1

(7 A 6.7 e
- ( e-insensitive loss)

0 if kl < e ;
4( ) (6.44)

^ kl — e

(6.43)
otherwise.

m
mm (6.45)
Mi

if e;
4( )

0 ^ z
\ z\ otherwise.

6.7 e
-
6.5

s.t . f { x i ) - Vi

- f { x i)
A

Vi +&

^^
0, i= m.

(6.36),
(6.45)
A
^ 0 , di ^ 0,
A

m m m

m
-

^ \\ w \\ 2 c + 6)
i= l ^-
m
i= l

+ X=
i l
^ ( / (% ) - V i

L(to, a A,
~ 6 ~
6) +
^ 2 ^ (V i
=
i l
i - f ( xi ) - ^ -
^) (6.46)

(6.7) d p /i) 6,
rrt
w= ai ) Xi (6.47)
m
o=y (

C — OLi \
~ -
^ Qz

5
on ) (6.48)

(6.49)
C = dii (l i . (6.50)

(6.47)-(6.50) (6.46), SVR


rrt
max ~
ai ) Oii ) (6.51)
OL
i= l
m 7Ti

^jYl
- aj ) xIxJ
i= l j = l
m
s. t . ai ) = 0

0 ai , ai
^C .

KKT
^
a i ( f ( x i ) - y% - ii ) =
on{Vi - / e-
(6.52)
aidii = =0

^
{C ai ) 0 , (C ai ) ii = 0 .
f (Xi) Pi -e &= 0
Vi f ( xi ) —e (x
“V i - e) —
yi
-
e
f (Xi) ~

Vi f{xi )
- e ii = 0
— •

(6.47) ij SVR

m
xJx -\- b .
f(x)
^
i=l
( di ai ) (6.53)

- - ai)

^
e
3i
< 0.
(6.53) 0 SVR
e
- SVR

KKT (6.52) ( xhyi ) C -a


Oii{ f ( xi ) — yi - e - Q 0. C,

m
b yi e- ai )x j x . (6.54)

(6.51) C
(6.54) 6.
C

(6.47)
m
w (6.55)

(6.55) (6.19), SVR


6.6

/ (*) - a i ) K,( x , Xi ) b (6.56)

Xi , X j ) =
^ ^ ^ cPiXj
Xi )

6.6

( (6 . 5 6) {(«i ,yi ),
SVM SVR,

representer theorem)

[Scholkopf and 6.2


Smola , 2002],
M \\ h\\ M
H D [0,oo] R
Mercer
[0,oo],

min F ( h ) = 0|
(| I|H )
/| i ( h( x i ) y h( x2),...,h( x m ) ) (6.57)
/iGlHI

h* ( x ) = x i) . (6.58)

D (6.57)
*
/1
»
( kernel
methods).

3.4
( Kernelized Linear Discriminant
Analysis, KLDA).
F,
F

h ( x ) = wTcj)( x ) . ( 6.59 )
(3.35), KLDA

w w
m a,x J(w ) (6.60)

^
w TS w

< St F
A i G {0,1}
m = m0 + mi . i F

^
(6.61)
xEXi

= (/4 — 4)( 4 - 4) ; / /
T
(6.62)

E EXi (0 - 4)T .
/ (6.63)
X
^

( ) /
0( T
0Or ) F. J(w ) (6.57)
A D h(x )

h{ x ) = y^j a i K J { x , x i ) (6.64)
i= l

(6.59)
w (6.65)

{1,0}

- K e
Xl

j
i
0.

Mo

Ai
K

m0
Kl0

Kli
li j
(K )

^
-
^
(x i .
X j ). h
e
G

(6.66)

(6.67)
mi
t
M= ( Ao Ai ) (Ao AI) (6.68)
6.7

N KKT mikpj . (6.69)


i =0

(6.60)

aTMa
max J(a )
aTNa (6.70)

3.4
a (6.64)
ZiOr ).

6.7

[Cortes and Vapnik , 1995],


SVM
[Joachims, 1998],
(statistical learning)

Mercer
-
[Cristianini and Shawe Taylor ,
2000] RKHS
12.4

Scholkopf et al.,
-
[Cristianini and Shawe Taylor , Burges,1998;
Scholkopf and Smola, 2002]
[Vapnik , 1999].
[Boyd and Vandenberghe,
2004]. SVM
SVM ( cutting plane algorithm) SVMperf
[Joachims, 2006], Pegasos
[Shalev-Shwartz et al., 2011],
[Hsieh et al., 2008]. SVM 0(m2)
CVM [Tsang et al , 2006]
Nystrom [Williams and Seeger ,
[Rahimi and Recht , 2007]
Nystrom [Yang et al., 2012].

[Hsu and Lin, 2002], [Tsochantaridis


et al., 2005]. [Drucker et al., 1997], [Smola and
Scholkopf , 2004]

( multiple kernel learning)


[Lanckriet et al., Bach et
8
al., 2004],


consistency) [Vapnik and Chervonenkis, 1991]
[Zhang, 2004]

SVM LIBSVM [Chang and Lin, 2011]


LIBLINEAR [Fan et al., 2008]
6.1 (w ,b) (6.2).
UBSVM http: // www.
6.2 LIBSVM, 3.0a
csie.ntu.edu.tw/ cjlin/ libsvm /.
~
3.0CK p.89
SVM,
4.5.

uci 6.3 UCI SVM,


http: / / archive.ics.uci.edu / ml/. BP C4.5

6.4

6.5 SVM RBF

6.6 SVM

6.7 (6.52) KKT •

6.8 3.0a
LIBSVM SVR.

6.9

6.10* SVM
. ( 2009 ) .
Bach, R. R. G. R. G. Lanckriet , and M. I. Jordan. (2004). “Multiple kernel
learning, conic duality, and the SMO algorithm.” In Proceedings of the 21 st
International Conference on Machine Learning ( ICML ) , 6—13, Banff , Cana-
da.
Boyd , S. and L. Vandenberghe. ( 2004) . Convex Optimization. Cambridge Uni-
versity Press, Cambridge, UK .
Burges, C. J . C. (1998) . “A tutorial on support vector machines for pattern
recognition.” Data Mining and Knowledge Discovery , 2 (1) :121-167.
Chang, C.-C. and C.-J. Lin. ( 2011) . “LIBSVM: A library for support vector
machines.” ACM Transactions on Intelligent Systems and Technology, 2 ( 3) :
27.
Cortes, C. and V. N. Vapnik. (1995) . “Support vector networks.” Machine
Learning , 20 ( 3):273-297.
Cristianini, N. and J. Shawe-Taylor. ( 2000 ) . An Introduction to Support Vector
Machines and Other Kernel- Based Learning Methods. Cambridge University
Press, Cambridge, UK .
Drucker , H., C. J . C. Burges , L. Kaufman, A. J. Smola , and V. Vapnik. (1997) .
“Support vector regression machines.55 In Advances in Neural Information
Processing Systems 9 ( NIPS ) ( M. C. Mozer , M. I. Jordan, and T. Petsche,
eds.) , 155—161, MIT Press, Cambridge, MA .
Fan , R.-E. K .-W. Chang, C.-J . Hsieh, X.- R. Wang, and C.-J . Lin. ( 2008) .
ULIBLINEAR: A library for large linear classification.” Journal of Machine
Learning Research, 9:1871-1874.
Hsieh , C.-J ., K.-W. Chang , C.-J . Lin , S. S. Keerthi , and S. Sundararajan.
( 2008) . “A dual coordinate descent method for large-scale linear SVM.”
In Proceedings of the 25th International Conference on Machine Learning
( ICML ) , 408-415, Helsinki, Finland.
Hsu, C.-W. and C.-J . Lin. ( 2002 ) . “A comparison of methods for multi-class
,
support vector machines.5 IEEE Transactions on Neural Networks , 13( 2 ) :
415-425.
Joachims , T. (1998) . “Text classification with support vector machines: Learn-
ing with many relevant features.” In Proceedings of the 10th European Con-
ference on Machine Learning ( ECML ) , 137—142, Chemnitz , Germany.
Joachims, T. ( 2006). “Training linear SVMs in linear time.” In Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining ( KDD ), 217-226, Philadelphia, PA.
Lanckriet , G. R . G ., N. Cristianini, and M. I. Jordan P. Bartlett , L. El Ghaoui.
( 2004 ). “Learning the kernel matrix with semidefinite programming.” Jour-
nal of Machine Learning Research, 5:27-72.
Osuna, E. , R. Freund , and F. Girosi. (1997). “An improved training algorithm
for support vector machines.” In Proceedings of the IEEE Workshop on Neu-
ral Networks for Signal Processing ( NNSP ), 276-285, Amelia Island , FL .
Platt , J . (1998) . “Sequential minimal optimization: A fast algorithm for train-
ing support vector machines.” Technical Report MSR-TR-98-14 , Microsoft
Research.
Platt , J . ( 2000 ) . “Probabilities for (SV) machines.” In Advances in Large Mar-
gin Classifiers ( A. Smola, P. Bartlett , B. Scholkopf , and D. Schuurmans,
eds.) 61-74, MIT Press, Cambridge, MA. .

Rahimi, A . and B. Recht. ( 2007) . “Random features for large-scale kernel ma-
chines.” In Advances in Neural Information Processing Systems 20 ( NIPS )

( J .C. Platt , D. Roller , Y. Singer , and S. Roweis, eds.) , 1177 1184, MIT Press,
Cambridge, MA.
Scholkopf , B., C. J . C. Burges, and A. J . Smola, eds. (1999 ). Advances in Kernel

- Methods: Support Vector Learning. MIT Press, Cambridge, MA.


Scholkopf , B. and A. J. Smola, eds. ( 2002 ) . Learning with Kernels: Support Vec-
tor Machines, Regularization, Optimization and Beyond. MIT Press , Cam-
bridge, MA.
Shalev-Shwartz, S., Y. Singer , N. Srebro, and A . Cotter. ( 2011) . “Pegasos: Pri-
mal estimated sub-gradient solver for SVM.” Mathematical Programming ,
127(1) :3-30.
Smola , A. J . and B. Scholkopf . ( 2004) . “A tutorial on support vector regres-
,
sion. 5 Statistics and Computing , 14 ( 3):199-222.
Tsang, I. W., J . T. Kwok , and P. Cheung. ( 2006) . “Core vector machines:
Fast SVM training on very large data sets.” Journal of Machine Learning
Research, 6:363-392.
Tsochantaridis, I., T. Joachims, T. Hofmann , and Y. Altun. (2005). “Large
margin methods for structured and interdependent output variables.” Jour-
nal of Machine Learning Research, 6:1453-1484.
Vapnik, V. N. (1995) . The Nature of Statistical Learning Theory. Springer , New
York , NY.
Vapnik , V. N. (1998) . Statistical Learning Theory. Wiley, New York, NY.
Vapnik, V. N. (1999). “An overview of statistical learning theory.” IEEE Trans-
actions on Neural Networks , 10 (5) :988-999.
Vapnik , V. N. and A . J . Chervonenkis. (1991). “The necessary and sufficient
conditions for consistency of the method of empirical risk.” Pattern Recog -
nition and Image Analysis, 1(3) :284-305.
Williams, C. K. and M. Seeger. ( 2001) . “Using the Nystrom method to speed
up kernel machines.” In Advances in Neural Information Processing Systems
13 ( NIPS ) (T. K . Leen, T. G. Dietterich , and V. Tresp , eds.) , 682-688, MIT
Press, Cambridge, MA.
Yang, T.-B., Y.-F. Li, M. Mahdavi, R. Jin, and Z.-H. Zhou. ( 2012 ) . “ Nystrdm
method vs random Fourier features: A theoretical and empirical compari-
son.” In Advances in Neural Information Processing Systems 25 ( NIPS ) ( P.
Bartlett , F. C. N. Pereira, C. J. C. Burges, L. Bottou , and K . Q. Weinberger ,
eds.) , 485-493, MIT Press, Cambridge, MA.
Zhang, T. ( 2004). “Statistical behavior and consistency of classification meth-
ods based on convex risk minimization ( with discussion ).” Annals of Statis-
tics, 32 (5):56-85.
1
Vladimir N. Vapnik, )

SVM MacMne
SVM

A. Chervonenkis “ VC
RBF 5.5 . 1
SVM

NEC
( Facebook )

“ Nothing is more practical than a good


theory .”
7.1

( Bayesian decision theory )

AT {cuc2 ...,cN}, Xij


l

q q P ( Ci x )
CC Ci (expected loss) , a;
(conditional risk )
(risk ) .

R{c.i x ) = XijP ( cj|x) . (7.1)

x y

^
h

R{h) = Ex [R (h{x ) x )] . (7.2)

x h R{h{x ) x ),
( Bayes decision rule) :
(c

h*( x ) = arg min i? (c|x) (7.3)


cey

/i* ( Bayes optimal classifier ) ,


-

R(h ( Bayes risk ) . R(h* )

Ay
ifi = j
Xij (7.4)
otherwise,

R( c \ x ) = 1 — P ( c x ) , (7.5)

h* ( x ) = arg max P(c|x )


cG
^
5 (7 6)
-
x,

*).
.

P( c I x). r,
c (discriminative models);
P(x ,c) P(c
(generative models). BP

P( x , c)
P{ c x ) p(x ) (7.7)

|x )

P(c) P( x I c)
P (c|x )
P{ x ) (7.8)

P(c) prior )
(class-conditional probability)
C) x
_ c
(likelihood) P(x )
evidence)
P{ x)
P(C a ) D
P(c) c).
P( c )
_
p(.)
P ( c)

I
7.2

7.3 d

m,

7.2

c
c), c)
.
p(® C)
D _POr

( parameter estimation)

(Frequentist)
(Bayesian)

[Efron ,
Samaniego, 2010].
( Maximum Likelihood Estimation, MLE),

Dc D c
Dc

P ( DC \ ec ) = JxeDcJ p( x ec) . -
(7 9)

P ( DC ec )

(7.9) -
(log likelihood)

LL ( 0c ) = log P ( DC 6C )

= logP ( x I Gc) (7.10)


xeDc
6C arg max LL(6C) . (7.11)
Oc

A/
C.1.7.
pOr C)

Me

Me E
xeDc
(7.12)

C ( ® - A c)(« - A C) T . (7.13)
\D C \ XGDC

(x - ftc )( x - AC)
T

7.3

(7.8)
c)
( naive Bayes classifier)
attribute conditional independence assiimption):

(7.8)

P(c) P { x I c) _ P(c) d
P(c|x ) = p Y[ p ( ^ i i c) (7.14)
7.3

Xi d i

(7.6)
Ci i
d

n -
,

i
® hnb» arg max
cey
c) (7 15)

P(c), P ( Xi c).
D c

P{c) \ D C\ *
(7.16)

Dc,Xi i
P ( xi c)

P { Xi c) = \ - ^\ D C, \
C Xi
(7.17)

p( c)
c i

P ( Xi c) =
Z27rcrC 2
\ }
^
exp
( ^2 Mc i )
2(7C?,,
2
2
(7.18)

3.0
3.0 p.84
4.3.

0.697 0.460

P(c),

P — 0.471

P a 0.529 .
I c):

P 0.375
3.0

p P
- 0.333

P 0.375

P P 0.333

P P 0.750

P P 0.444

P
■ P 0.875

P 0.222

P
■ P 0.750

P P(
y
0.222

P 0.750

P P
- 0.667

P P 0.697

exp
(0.697 1.959
0.129 • 0.1292
P P 0.697
(0.697
exp 1.203
0.195 • 0.1952
P P 0.460

exp
(0.460 - 0.788
A/2TT - 0.101 2.0. 1012

P P 0.460
(0.460 - 0.154) 2

- 0.066
exp •
X/ STT 0.108 2.0.1082
7.3

P P P P P P
P X P X P .•0.4601 a 0.038

P _P P P P X P
xP X X 6.80 x -5
P P

0.038 6.80 x
. •
.

(7.17) (7.15)
3.0

P -

(7.15)

smoothing),
(Laplacian correction). iV JD
i (7.16) (7.17)

\ \
P ( c ) = DC
|+W (7.19)

P( X i I c ) = \^ , \
C Xj
(7.20)
iDcl + Ni

P 0.474 P 0.526 .

P P

P 0.364
B 0.333 .

0.091 .

(prior)

10.1 lazy learning)

5.5.2

7.4

(7.8) P(C

-
(semi naive Bayes classifiers)

-
(One Dependent Estimator, ODE)

J [ P(xi I c paj)
P{c I x) oc P(c)
^ 5 (7.21)

pq Ti
(7.20) P{xi c pai).
7.4

(super -
parent ),
SPODE (Super-Parent ODE) 7.1(b)

x i M x 2 y{ x3 xd

(a) NB (b)SPODE (c) TAN

7.1

TAN (Tree Augmented naive Bayes) [Friedman et al.,1997]


( maximum weighted spanning tree) [Chow and Liu , 1968]
7.1(c)

(1) (conditional mutual information)

P( X j , X j c)
|y ) =
^ 2c e y p( x i ^ x j I c) log -
I c)P( |c ) 5 (7.22)

^
Xi , x j

(2)
y) ;

(3)

(4) y

Iixi . Xj ] y)
TAN

AODE ( Averaged One-Dependent Estimator) [Webb et al., 2005]


SPODE
AODE SPODE,
SPODE

d d

^^
P { c I x ) oc
i=l ,
P(c,Xi ) Y=[ P ( ^
j l
j I ? (7.23)

^
\ DXi \ m

m' 30 [Webb
i 77/
e t a l ., 2005].
AODE P(c,X i ) m P{ X j c,X i ). (7.20),

I
\^ C ,X i 1
(C Xi ) = D (7.24)
\ \ + Ni
|C, =
\ Dc,X j ,X j 1 1 (7.25)
\ Dc,xi |
i Dc X i c i
Xi D Xj c i j
q 3.0

0.350

0.444 .

AODE

.AODE

(7 . 2 3 ) P% , ODE
k D E. A; P( Xi pa, )

7.5

( Bayesian network) (belief network),


( Directed Acyclic Graph, DAG )
7.5

( Conditional Probability Table, CPT)


B G e B G
G
©
G ©
PB { XI I TTi ) .
7.2

P 0.1

$2(

rsj 0.1 0.9


0.7 0.3
$4( } 5(

7.2

7.5 .1

B (G ,0)

d d
PB { ^ 1 ^ 2
) 5 • • • 5
^ d ) = Y[ PB { Xi TTi ) =\
\ 0 ki .
^
X (7.26)

7.2

P ( x i , X 2 , X 3 , X 4 , X 5 ) = P ( x 1 ) P( x 2 ) P ( X 3 I X i ) P ( x4 \ X 1 ) X 2 ) P { x5 \ X2)
x3 x4 r4 x5 x2
X4 A T4 C5

7.3
Xl ( X2

X s) [ X4 X4

7.3

common parent ) T4

structure)
r y z . (V
-

P ( x i , x2 ) = y
y
^ P( x i , X 2 , X4 )

^ -P( ^4 X i , X 2 ) P { x i ) P { x 2 )

P ( x1 ) P ( x2 ) • (7.27)

marginal independence) x\ xo .
ization).
marginal
-
X4

X3 XsJLx^
y ylz
D -
D
ed ).
(direct
- separation).


[Pearl , 1988].

moral graph),
moralization) [Cowell et al.,1999].

x, y z x z
7.5

z x x z
r z 7.2 7.4
T3 Z4 T5 C2 Z3

D @2(

$3(

7.4
^^ 1
( 4(

7.2
)

7.5.2

(score function),

1.4

Minimal Description
Length, MDL)

D = { X 1 , X 2 , . . . , xm } , B G D
JCi

s ( B \ D ) = f ( 0 ) \ B\ - L L( B \ D ) , (7.28)
/(0

L L( B \ D ) = Y= ^g P B ( x i )
i l
/
(7.29)

B (7.28) B
S
S
s(B
| D)
f AIC ( Akaike Information
Criterion)

AIC ( B D) = \ B \ LL{ B \ D) . (7.30) .

^
f (0) = logm , log m BIC ( Bayesian
Information Criterion)

^
-
lo
BIC(5|D) = \ B\ LL( B \ D) . (7.31)

/(0) =

B (G,0) G
s ( B D) D) ©
(7.29) (7.26) D

\ = PD Xi , (7.32)

PD ( .) D
^ Xi 7n
^ 7Ti )

I D)

NP

TAN [Friedman et
al . 1997]
7.5

7.5.3

query)

inference)
(evidence).

NP [Cooper 1990];

(Gibbs sampling)
14.5

{Qi ,Q2,
Q
e
Qn}
{ei e2, efc}.
E = { E u E2
_P...(Q, Ekq}
j

E e)
92 ,..., }

^
q
Q E
e ={ q {

7.5 E=e
q
° t

'- ,
q 1

B Z z)
T q

-
P( Q = q | E = e )

( random walk).
^ .

E=e
(7.33)

14.5 (Markov chain).


t t oo
(stationary distribution);
P( Q E e) . r P( Q E e)
(7.33) P( Q q E e) .
B = G, e
r;
E e;
Q q.

nq = 0
qG Q
for t . . . , T do
4 for Q i e Q do
Qi 5 Z = E U Q\ { Q i } ]
z = e Uq \{ d
7 B PB { Qi|Z = z);
8 q\ = PB { Qi | Z = z )
9
10 end for
11 if qt = q then
12 nq = nq -\- l
13 end if
14 end for
~
P(Q = q E = e )

7.5
^

7.6 EM

(latent variable).
Z e e

EM LL(0 I X Z) = lnP(X, Z I 0) . (7.34)


ln (.) . Z Z
7.6 EM 163

marginal likelihood )

LL( Q X) = lnP( X|0) = InY" P(X,Z|0) .■


(7.35)

EM -
EM (Expectation Maximization) [Dempster et al., 1977]
©
Z E ); Z
EM
e M

©Q (7.35),

'
Z Z

• e ei+1

EM
Z Z
P( Z ©0, EM

^
•E Expectation): P( Z X ,0 ,
LL( S Z) Z

Q(© ©') EZ|x,e* LL ( Q I X,Z) . (7.36)

•M Maximization):

em = arg©max Q(© 0') . (7.37)

EM (E)
( M)
EM E E
[Wu ,

EM
(coordinate descent)
EM

B.5.
7.7

[Domingos and Pazzani, Ng and Jordan, 2002].

[Domingos and Pazzani, 1997];

[Zhang,2004].
[Lewis, 1998] [McCallum and Nigam, 1998]

[Kononenko, 1991]. ODE


TAN [Friedman et al.,1997] AODE [Webb et al.,
>

2005] LBR (lazy Bayesian Rule) [Zheng and Webb, 2000] kDE
A: KDB [Sahami, 1996] NBtree
[Kohavi, 1996]
( Bayes Classifier) ( Bayesian
Learning)
[Bishop,2006].

Pearl •
pearl, 1988].
NP
[Cooper, Chickering et al., 2004],
[Friedman and Goldszmidt , 1996].
[Grossman and Domingos,2004].
[Jensen, Heckerman ,1998].
EM
7.7

(Gaussian mixture model, GMM)


9.4 A: EM EM
[McLachlan and Krishnan,2008].
EM
C4.5 CART [Wu et al., 2007].
Ada Boost fc
fc
3.0 p.84 7.1 3.0
4.3.

7.2*

7.3 3.0
P.151

7.4 (7.15)
Ilti I c) 0

7.5
3.4

7.6 AODE 3.0 p.151

7.7 d
(7.15)
P(c) x AODE (7.23)
P(c,Xi )

7.8 7.3, xi
_
X 3 \LX 4

y r z
yA. •

2.0 p . 76
7.9 2.0 BIC
4.1.

7.10 2.0 EM
Bishop, C. M. ( 2006) . Pa ern Recognition and Machine Learning. Springer ,
New York, NY.
^
Chickering , D. M., D. Heckerman , and C. Meek. ( 2004). “Large-sample learning
of Bayesian networks is NP-hard.” Journal of Machine Learning Research,
5:1287-1330.
Chow, C. K. and C. N. Liu. (1968) . “Approximating discrete probability distri-
butions with dependence trees.55 IEEE Transactions on Information Theory ,
14 (3) :462-467.
Cooper , G. F. (1990) . “The computational complexity of probabilistic inference
using Bayesian belief networks.” Artificial Intelligence , 42 ( 2-3) :393-405.
Cowell, R. G., P. Dawid , S. L. Lauritzen , and D. J. Spiegelhalter. (1999 ). Prob-
abilistic Networks and Expert Systems. Springer , New York , NY.
Dempster , A. P N. M. Laird , and D. B. Rubin. (1977) . “Maximum likelihood

from incomplete data via the EM algorithm.” Journal of the Royal Statistical

-
Society Series B, 39 (1) :1-38.
Domingos , P. and M. Pazzani. (1997) . “On the optimality of the simple
Bayesian classifier under zero-one loss.” Machine Learning 29 ( 2-3) :103-130.
Efron, B. (2005). “Bayesians frequentists, and scientists.” Journal of the
American Statistical Association, 100 ( 469) :1-5.
Friedman, N. D. Geiger , and M. Goldszmidt . (1997) . “Bayesian network clas-
,
sifiers.7 Machine Learning , 29 ( 2-3):131-163.
Friedman, N. and M. Goldszmidt . (1996). “Learning Bayesian networks with
local structure.” In Proceedings of the 12th Annual Conference on Uncer-
tainty in Artificial Intelligence ( UAI ), 252— 262, Portland , OR.
Grossman , D. and P. Domingos. ( 2004) . “Learning Bayesian network classifiers
by maximizing conditional likelihood.” In Proceedings of the 21st Interna-
tional Conference on Machine Learning ( ICML ) , 46—53, Banff , Canada.
Heckerman, D. (1998) . “A tutorial on learning with Bayesian networks.” In

Learning in Graphical Models ( M. I. Jordan , ed.) , 301 354, Kluwer , Dor-
drecht , The Netherlands.
Jensen, F. V. (1997). An Introduction to Bayesian Networks. Springer, NY.
Kohavi, R. (1996 ) . “Scaling up the accuracy of naive-Bayes classifiers: A
decision-tree hybrid.” In Proceedings of the 2nd International Conference
on Knowledge Discovery and Data Mining ( KDD ), 202-207, Portland , OR.
Kononenko, I. (1991) . “Semi-naive Bayesian classifier.” In Proceedings of the
6th European Working Session on Learning ( EWSL ), 206—219, Porto, Por-
tugal.
Lewis , D. D. (1998). “ Naive ( Bayes) at forty: The independence assumption in
information retrieval. In Proceedings of the 10th European Conference on
Machine Learning ( ECML ) , 4—15, Chemnitz , Germany.
McCallum, A. and K . Nigam. (1998) . “A comparison of event models for naive
Bayes text classification.” In Working Notes of the AAAF 98 Workshop on
Learning for Text Cagegorization, Madison , WI.
McLachlan, G. and T. Krishnan. (2008) . The EM Algorithm and Extensions ,
2nd edition. John Wiley & Sons, Hoboken, NJ.
Ng, A. Y. and M. I. Jordan. (2002 ) . “On discriminative vs. generative classifiers:
A comparison of logistic regression and naive Bayes.” In chances in Neural
Information Processing Systems 14 ( NIPS ) (T. G. Dietterich , S. Becker , and
Z. Ghahramani, eds.) , 841-848, MIT Press, Cambridge, MA.
Pearl, J . (1988). Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco, CA.
Sahami, M. (1996) . “Learning limited dependence Bayesian classifiers.” In Pro-
ceedings of the 2nd International Conference on Knowledge Discovery and
Data Mining ( KDD ), 335—338, Portland, OR.
Samaniego, F. J. ( 2010) . A Comparison of the Bayesian and Frequentist Ap-
proaches to Estimation. Springer, New York, NY.
Webb, G. J. Boughton , and Z. Wang. ( 2005) . “ Not so naive Bayes: Aggregating
one-dependence estimators.” Machine Learning , 58(1):5-24.
Wu, C. F. Jeff. (1983). “On the convergence properties of the EM algorithm.”
Annals of Statistics, 11(1) :95-103.
Wu, X. , V. Kumar, J . R. Quinlan , J . Ghosh , Q. Yang, H. Motoda, G. J . M-
cLachlan, A. Ng , B. Liu, P. S. Yu , Z.-H. Zhou, M. Steinbach, D. J . Hand ,
and D. Steinberg. ( 2007) . “Top 10 algorithms in data mining.” Knowledge
and Information Systems, 14(1):1-37.
Zhang, H.(2004). “The optimality of naive Bayes.” In Proceedings of the
17th International Florida Artificial Intelligence Research Society Confer-

-
ence ( FLAIRS ) , 562 567, Miami , FL.
Zheng, Z . and G .I. Webb.(2000). “Lazy learning of Bayesian rules.” Machine

Learning , 41(1):53 84.

(Thomas Bayes,
1761) R. Price
8.1
ensemble (ensemble learning)
(multi-classifier system)
(committee-based learning)
8.1
individual learner ),
C4.5 BP

(homogeneous). ( base learner ),


( base learning algorithm).

(heterogenous).

(component learner )

8.1

weak learner )
8.2 x
(voting) 8.2(a)
8.2(b)
8.2(c)

(diversity),

m yj v mv
(a) (b) (c)

e{-l,+l}

P ( hi ( x )
^ ( ®)) . (8.1)

r T

H{ x )
— sign (8.2)

8.1.
Hoeffding
8.2 Boosting 173

(1 e ) keT ~ k

< exp -(X (8.3)

Boosting,
Bagging Random Forest).

8.2 Boosting

Boosting

r r
Boosting AdaBoost [Freund and Schapire, 1997],
8.3
AdaBoost
additive model),

H{x) =
^ atht { x )

(exponential loss function) [Friedman et al.,2000]


(8.4)

4xp( I v)= ]• (8.5)


{(® i ••• (® m 2/m )};
r.
T>i ( x ) = 1 / m .
for t • . . T do
D /it . 3: ht = 2( D ,Vt ) ;
t = PB X>t
if et > 0.5 then break
i-gt
/it
\
OLt = \n t
exp ( -at ) , if
_
Zt 7: X
ht ( x ) = f ( x )
exp ( at ) , ht ( x ) f ( x )
Vt+i
if
T>t ( x )exp(-atf ( x ) ht ( x ) )
Zt
^
8: end for
H(x) = sign(E =i
8.3 AdaBoost

H(x) (8.5) H ( x)

d£exp ( H V)
dH ( x )
e~ H ^ P ( f { x ) = l \ x ) + eH^ P { f ( x ) (8.6)

(8.6)
1 P( / Qr ) = l \ x )
H { x ) = - In (8.7)
2 P{ f { x ) = -l \ x )
5

1 - P(/ =1
sign ( H ( x )) = sign In
2 P ( f ( x ) = -1|x )
P ( f ( x) =
1 x) = P( / (x ) = -1 I a;) P(/(x) = l \ x ) > P { f ( x ) = -l \ x)
—1, P ( f ( x ) = l \ x ) < P ( f ( x ) = - l \ x)

= arg maxP(/(x) = y \ x ) 5 (8.8)


i i} ‘

sign ( H ( x ) )

6.7
(consistent)
8.2 Boosting 175

AdaBoost

~ f { x )a t h t { x )
{ p^t
exp
^t ^t ) = ® £C Vt e

= Ea e ~atl (® ) = ht ( x ) ) eatl (x)


^( )ht (ht))(] ))
®

^
=e atPx^ vt ( ® ) = ht ( ® )) eatPx^Vt
~
x *

=e
~ at
(1 et ) eatet (8.9)

et = Px^Vt ( ht ( x )
^
d £exp ( o tht I T>t )
dat
= -e-at ( l - t ) + eatet (8.10)

(8.10)
at = 2 hi — 1 —Q (8.11)
et
8.3 6

AdaBoost
i/t
- i
-1

-i
\ V ) = E* D[e- (*)( -
’“ *) *))

^
4xp( t-i + ht ]
E p[e -/
- e-/
i(® ) ( )
]. (8.12)

^^
f 2 ( x ) = ht2 ( x ) = f h
(8.12) e ~

4xP ( ^t-i + h t \ V ) c^ (- 1 f ( x ) ht ( x )
2

= E® p e f ( x ) ht ( x ) • (8.13)

ht { x ) = argmin 4xp ( -i Jth \ V )


h ^
argminEa
h
e- f ( x ) H t - i ( x ) - f( x)h{x)
argmaxEjc p e- f ( x ) H t -i ( x )
h

e- f ( x ) H t -i ( x )
= arg maxE® v rsj f ( x)h{x) (8.14)
h

- {x)]

^
f{x )Ht
v[e ~ l
A

^t { x ) = EJC (8.15)
p[e t
-10 )]
»

e - f ( x ) H t -i ( x )
ht( x) arg max Ex f ( x )h( x )
h p[e / - - i

= arg rnaxEa [ f {x )h{x )\ (8.16)

^


h


f ( x),h( x) e { l,+l},

f ( x )h( x) = 1 - 21(/ h( x )) (8.17)

^
ht{x ) = argminEa
h
/i(x))] • (8.18)

Vt 0.5.
A Vt+1
V {x)e /

^^
+1 (® ) = D [e — / (*) (*) ]
X>{x)e - - -
f ( x)Ht i( x)e f ( x)atht( x )

Exr^v [e - /

Ej [e

^^
p
Vt {x) e • ~ f atht
(8.19)
Ex v [e-/
^ ]
8.2 Boosting

8.3

(8.11) (8.19)
8.3 AdaBoost

Boosting

-
re weighting)

resampling)

Boosting
8.3

[Kohavi and Wolpert ,1996],

r
2.5
Boosting Boosting

4.3
4.5 3.0a AdaBoost (size)
8.4

0.2 0.2

0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8

(a) (b) (c)

m 8.4 3.0a AdaBoost


8.3 Bagging

8.1

8.3 . 1 Bagging
Bagging
Bagging [Breiman ,1996a]
Bootstrap AGGregatING
2.2.3 ( bootstrap sampling).
m
m
m
(2.1)

T m
Bagging
Bagging

Bagging 8.5

Z) {( xi , 2/i ) , ( x 2 , 2/2 ) , • • • (® m , y m ) }\
£;
r.
for t T do
2: ht = £> ( D , T>bs )
3 end for
H ( x ) = argmax Kht ( x ) = y )

8.5 Bagging
8.3 Bagging

0(m) Bagging
T(0(m) + 0(5)), r
Bagging
Bagging
AdaBoost
AdaBoost Bagging
[Zhou , 2012].

Bagging

2.2.3
-
out of-bag estimate) [Breiman ,1996a;
Wolpert and Macready , 1999].
Dt Hoob( x )
x ®

Hoob{x) argmax
^ I( /it (aj) y) l ( x Dt) (8.20)

Bagging

eoob = E y) . (8.21)
( x ,y ) eD

2.5
Bagging

8.5.3
4.5 3.0a Bagging
8.6

8.3.2
( Random Forest , RF) [Breiman , 2001a] Bagging
. RF Bagging

d )
RF A:
S
1

S "

(b) (b)

8.6 3 . 0o Bagging

k k = d,
k A; log2 d
[Breiman , 2001a].

Bagging Bagging

8.5.3

Bagging 8.7

0.028
Bagging 0.024 Bagging

0.020
0.016

0.012

0.008
XiC - -
0.004
10
'
10' 10 '

(a) glass (b) auto-mpg

m 8.7 UCI Bagging


8.4

Bagging, Bagging

8.4

[Dietterich, 2000]:

8.8

'
h3 • h2 /l2 •
• f
".3 •
• f‘
" 1
/ll

(a) (b) (c)

8.8 [Dietterich , 2000]

T l 5 /12 ,...,hT }
hi( x ).
8.4 .1

h i( x ) e M, (averaging).

• (simple averaging)

( 8.22)
• (weighted averaging)
T
H( x) =
^ W j h x) .
^ (8.23)

Breiman [1996b] T
Stacking o, E 1.

1/T
[Markowitz ,1952], [Perrone and Cooper, 1993]

[Xu et al. Ho et al. 1994;


Kittler et al,1998].

8.4. 2
{ c U C2 , ... , C N }
(voting).
iV hf ( x )) h{( x )

• ( majority voting)

H ( x) =
cj if E M (®) > °-5 kE= l iE
i= l =l
hi (x) ; (8.24)
reject , otherwise.

• ( plurality voting)

iif ( aj) — carg max h\( x) . (8.25)


3
8.4

• (w e i g h t e d v o t i n g)

H ( x ) = carg max
YliL i wi (8.26)
3 ^i (*) •

T
1.

(8.24)

majority voting ,
(8.24) (8.26)
voting.
MW
e{0,1}, q
0. ( h a r d voting).
h{ ( x ) e [0,1], _
P (q x )
(soft voting).

hi ( x )
Platt
(Platt scaling) [Platt , (isotonic regression) [Zadrozny and
Elkan , 2001] calibration)

h{ ( x ) o)
8.4.3

Stacking
Stacking [Wolpert , Breiman ,1996b]

-
(meta learner ).
Stacking

Stacking 8.9

.• • ® m V m )}
£I ...
ii.

-
£t for t . . . T do

^
ht = t( ) D
3: end for
4 D' = 0;
5 for i = 1, 2 , . . . , m do
for t = 1, 2 , . . . , T do
^it =
end for
D' Dr U (( 1, 2 , . . . , ziT ) , yi )\
10 end for ^ ^
V 11 h! = Z [ D' )\
H ( x ) = hr( hi ( x ) , h2 { x ) , .. HT ( X ) )

8.9 Stacking

/c D fc
DI , D2 , ... 14. Dj = D\ Dj j
T t
h{tj\Xi),
A k1;A2;... ziT ) ,
r
zr
MLR Stacking

(Multi-response Linear Regression , MLR)


[Ting and Witten, 1999], MLR
[Seewald ,2002].
WEKA StackingC
8.5

(Bayes Model Averaging, BMA)


[Clarke 2003]
Stacking BMA
BMA Stacking;

Stacking BMA , BMA


BMA

8.5

8.5.1
8.1

huh2 , .• ”hT (8.23)


f : Rd
ambiguity)
2
A( hi x ) ( hi ( x ) H( x)) (8.27)

A( h x ) WiA( hi x )
.
i

^
2
= Y i= iWi
^^
hi ~ H (x) ) (8.28)

(/(*) - hi ( x ) )
2
E ( hi x ) (8.29)
2
E( H \ x ) = ( f ( x ) - H ( x ) ) . (8.30)

E( /i 4 = Ej= i E ( hi|x )

T
A( h x ) =
i= l
WjE ( hj x ) — E( H \ x )
= E{ h | x ) - E( H | x) . (8.31)
(8.31) ® p( x )

A( hi x)p( x )dx = E ( hi|x )p( x ) dx — E ( H \ x)p( x )dx .


(8.32)

Ai
A( hi ) .

Ei
J E ( hi x)p{x ) dx (8.33)

Ai = A( hi|x)p( x )dx . (8.34)

E
E= E{H \ x)p(x )dx . (8.35)
E [ H ).

(8.33) (8.35) (8.32) , Y J=i


Eti
E E—A. (8.36)

(8.36)
[Krogh and Vedelsby, 1995]
( error-ambiguity decomposition ) .

8.5.2

(diversity measure)
8.5

* 0 {(£Ci ,2/i ),(® 2 ,2/2 ),• • • (xm,ym)},


2.3 . 2
(contingency table)

hi = hi = —
= a c
hj = 1
- b d

~
a C d
a + 6 + c + d = m.

• (disagreement measure)

b+c
disij = (8.37)
m

dis [0,1].


^ (correlation coefficient)

Pij =
ad — be
/ (a + 6) (a + c)(c -\- d )(b
^ - d)
(8.38)

Pij 1 1]. hi hj • hi
{), hj

• Q- -
(Q statistic)
Qij =
ad
ad — bebe . (8.39)

Qij py |(3y
|< \ p i j \ .

• -
( / statistic)

M
^ K
Pi
= 1-
P2
~
P2

p2
(8.40)

Pi
a{ d
m
-- (8.41)

P2
{ CL + 6)(a
- - c)m+ (c + d){ b
j d)
(8.42)

.
2

hj D / = 1;
K 0. K hi

pairwise)

8.10

2()

0.2 0.4 0.6 0.8 0.2 0.4 0.6 0.8

(a) AdaBoost ( b) Bagging

8.10 UCI
-
tic tac - toe C4.5

8.5.3


Bagging
AdaBoost

/c (stable
base learner ),
8.5


subspace,

(random subspace) fHo,1998]

8.11

_D ={( , , , ,
aji yi ) (»2 2/2 ) * * *
0m ,2/m )};
T;
!
6 d.

for t = .. . , ,T do
/
Tt Tt = RS ( D , d )
Dt 3: Dt = Map ( D )
ht = £( A )
5: end for
H { x ) = argmax 1 (Mapjrt (a;)) = y )
yey

8.11


Flipping Output ) [Breiman, 2000]
( Output
Smearing) [Breiman , 2000]
EC0C [Dietterich and
Bakiri 1995]
EC0C 3.5


( Negative Correlation) [Liu and Yao, 1999]

8.3.2

[Zhou, 2012].

8.6

[Zhou , 2012],
[Kuncheva, Rokach 2010b]
[Schapire and Freund ,2012] Boosting
Boosting [Schapire, 1990] [Kearns and Valiant , 1989]
Boosting
[Freund and Schapire,1997] AdaBoost ,

Boosting Bagging MultiBoosting


[Webb , 2000] Boosting Bagging
[Zhou , 2012]
8.2
man et al., 2000],
AdaBoost
AdaBoost
(statistical view) [Pried
-
( additive
model)
GradientBoosting [Friedman , LPBoost
[Demiriz et al.,2008] i AdaBoost
[Mease and Wyner, 2008], AdaBoost
AdaBoost

AdaBoost AdaBoost
margin theory) [Schapire et al., 1998]
8.6

[Zhou, 2014].
-
DS
( mixture of experts)
[Kuncheva and Whitaker, Tang et al., 2006]

[Zhou,2012]

(ensemble pruning).

tive ensemble),
-
selec
[Rokach 2010a]; [Zhou et al., 2002]

tion ). -
( ensemble selec [Zhou, 2012]

[Zhou, 2012]
KDDCup

-
twice learning) NeC4.5 [Zhou and
Jiang, 2004]. [Zhou,2012]
8.1 p 1 - p. H[n)
n A:

(1)^ (1 n-1

^
P(H(n) k) E
=i 0 V
P) • (8.43)

6> A;= (p — 5)n Hoeffding

- S)n) < e—25271

^
P(H(n) {p • (8.44)

(8.3).
8.2 0/1
/(
-/ H{x)
-
[ oo,5] (J > 0)
8.3 AdaBoost,
3.0o p.89
3.0a AdaBoost 8.4
4.5.

8.4 GradientBoosting [Friedman, 2001] Boosting


AdaBoost

8.5 Bagging, 3.0a


Bagging 8.6

8.6 Bagging

8.7 Bagging

8.8 MultiBoosting [Webb, 2000] AdaBoost Bagging


Iterative Bagging [Breiman, 2001b] Bagging
AdaBoost

8.9* 8.3 8.5

8.10* A:
Breiman, L. (1996a). “Bagging predictors.” Machine Learning , 24( 2 ) :123-140.
Breiman, L. (1996b). “Stacked regressions.” Machine Learning , 24(1) :49-64.
Breiman, L. ( 2000 ). “ Randomizing outputs to increase prediction accuracy.”
Machine Learning , 40 ( 3) :113-120.
Breiman, L. ( 2001a). “Random forests.” Machine Learning , 45(1):5-32.
Breiman, L. ( 2001b). “Using iterated bagging to debias regressions.” Machine
Learning , 45(3):261-277.
Clarke, B. (2003). “Comparing Bayes model averaging and stacking when mod-
el approximation error cannot be ignored.” Journal of Machine Learning
Research, 4:683-712.
Demiriz , A., K . P. Bennett , and J. Shawe-Taylor. ( 2008). “Linear programming
Boosting via column generation.” Machine Learning , 46(1-3):225-254.
Dietterich, T. G. ( 2000 ) . “Ensemble methods in machine learning.” In Pro-
ceedings of the 1st International Workshop on Multiple Classifier Systems
( MCS ), 1—15, Cagliari, Italy.
Dietterich, T. G. and G. Bakiri. (1995). “Solving multiclass learning problems
via error-correcting output codes.” Journal of Artificial Intelligence Re-
search^ 2:263-286.
Freund , Y. and R. E. Schapire. (1997). “A decision-theoretic generalization of
on-line learning and an application to boosting.” Journal of Computer and
System Sciences^ 55(1):119-139.
Friedman , J ., T. Hastie, and R. Tibshirani. ( 2000) . “Additive logistic regres-
sion: A statistical view of boosting (with discussions).” Annals of Statistics ,
28( 2 ):337-407.
Friedman , J . H. (2001). “Greedy function approximation: A gradient Boosting
machine.” Annals of Statistics, 29 (5):1189-1232.
Ho, T. K. (1998). “The random subspace method for constructing decision
forests.” IEEE Transactions on Pattern Analysis and Machine Intelligence ,
20(8) :832-844.
Ho, T. K., J. J . Hull, and S. N. Srihari. (1994). “Decision combination in multi-

pie classifier systems.” IEEE Transaction on Pattern Analysis and Machine


Intelligence 16 (1):66-75.
Kearns, M. and L. G. Valiant . (1989). “Cryptographic limitations on learning
Boolean formulae and finite automata.” In Proceedings of the 21 st Annual
ACM Symposium on Theory of Computing ( STOC ) , 433-444, Seattle, WA.
Kittler , J. M. Hatef , R. Duin, and J . Matas. (1998) . “On combining classifiers.”
IEEE Transactions on Pattern Analysis and Machine Intelligence , 20 (3):
226-239.
Kohavi, R. and D. H. Wolpert . (1996 ) . “Bias plus variance decomposition for
zero-one loss functions.” In Proceedings of the 13th International Conference
on Machine Learning ( ICML ), 275-283, Bari, Italy.
Krogh , A. and J. Vedelsby. (1995). “ Neural network ensembles, cross validation ,
and active learning.” In Advances in Neural Information Processing Systems
7 ( NIPS ) ( G. Tesauro, D. S. Touretzky, and T. K . Leen, eds.) 231-238, MIT
Press, Cambridge, MA.
Kuncheva, L. I. ( 2004). Combining Pattern Classifiers: Methods and Algo-
rithms. John Wiley & Sons, Hoboken , NJ .
Kuncheva, L. I. and C . J . Whitaker. ( 2003). “Measures of diversity in classifi-
er ensembles and their relationship with the ensemble accuracy.” Machine
Learning , 51(2):181-207.
Liu, Y. and X. Yao. (1999) . “Ensemble learning via negative . correlation.” Neu-
ral Networks, 12 (10):1399-1404.
Markowitz H. (1952) . Portfolio selection.” Journal of Finance , 7(1):77-91.
Mease, D. and A. Wyner. ( 2008). “Evidence contrary to the statistical view
of boosting ( with discussions ) .” Journal of Machine Learning Research, 9:
131-201.
Perrone, M. P. and L. N. Cooper. (1993) . “When networks disagree: Ensemble
method for neural networks.” In Artificial Neural Networks for Speech and
Vision ( R. J . Mammone, ed . ) , 126-142, Chapman Sz Hall, New York, NY.
Platt , J. C. (2000) . “Probabilities for SV machines.” In Advances in Large Mar-
gin Classifiers ( A. J . Smola, P. L. Bartlett , B. Scholkopf , and D. Schuurmans,
eds.) , 61-74, MIT Press, Cambridge, MA.
Rokach, L. (2010a) . “Ensemble-based classifiers.” Artificial Intelligence Review,
33(1):1-39.
Rokach, L. ( 2010b ). Pattern Classification Using Ensemble Methods. World
Scientific, Singapore.
Schapire, R. E. (1990) . “The strength of weak learnability.” Machine Learning ,
5(2 ):197-227.
Schapire, R. E. and Y. Freund . ( 2012 ). Boosting: Foundations and Algorithms.
MIT Press, Cambridge, MA.
Schapire, R. E., Y. Freund , P. Bartlett , and W. S. Lee. (1998) . “Boosting the
margin: A new explanation for the effectiveness of voting methods.” Annals
of Statistics, 26(5):1651-1686.
Seewald , A. K . ( 2002). “How to make Stacking better and faster while also tak-
ing care of an unknown weakness.” In Proceedings of the 19th International
Conference on Machine Learning ( ICML ) , 554-561, Sydney, Australia.
Tang, E. K ., P. N. Suganthan , and X. Yao. ( 2006) . “ An analysis of diversity
measures.” Machine Learning , 65(1) :247-271.
Ting, K. M. and I. H. Witten. (1999). “Issues in stacked generalization.” Jour-
nal of Artificial Intelligence Research, 10:271-289.
Webb, G . I. ( 2000). “MultiBoosting: A technique for combining boosting and
wagging.” Machine Learning , 40 ( 2 ):159-196.
Wolpert , D. H. (1992). “Stacked generalization.” Neural Networks , 5 ( 2):241-
260.
Wolpert , D. H. and W. G. Macready. (1999). “An efficient method to estimate
Bagging’s generalization error.” Machine Learning , 35(1) :41-55. •

Xu , L. A. Krzyzak , and C. Y. Suen. (1992). “Methods of combining multiple


classifiers and their applications to handwriting recognition.” IEEE Trans-
actions on Systems, Man, and Cybernetics, 22 (3) :418-435.
Zadrozny, B. and C. Elkan. ( 2001) . “Obtaining calibrated probability esti-
mates from decision trees and naive Bayesian classifiers.” In Proceedings of
the 18th International Conference on Machine Learning ( ICML ), 609-616,
Williamstown, MA.
Zhou , Z.- H. (2012 ). Ensemble Methods: Foundations and Algorithms.
Chapman Hall/CRC, Boca Raton , FL.

-
Zhou,Z.H.(2014). “Large margin distribution learning.”In Proceedings of the
6th I APR International Workshop on Artificial Neural Networks in Pattern

-
Recognition ( ANNPR ), 1 11, Montreal, Canada.

-
Zhou, Z. H. and Y. Jiang.(2004).“NeC4.5: Neural ensemble based C4.5.”
IEEE Transactions on Knowledge and Data Engineering , 16(6):770 773. -
-
Zhou, Z.H J .Wu,and W.Tang.(2002). “Ensembling neural networks: Many

-
could be better than all.” Artificial Intelligence , 137(1 2):239-263.

(Leo Breiman , 1928-2005)

CART
Bagging Boosting

(UCLA)

UCLA
9.1

unsupervised learning)

.
ty estimation)
(densi
-
(anomaly detection)
(clustering).

cluster ).

m
n
9.4.2
13.6 D fc Z
D UiLiQ.
(cluster label), IP X j G C\y m
T (Al; • .•

9.2

validity index).
2.3

(intra-cluster similarity) (inter-cluster similarity)

reference model) external index)


(internal
index).
D .• }, C
s.
k
C2,... Ck } , C *

A* C C *

a S S = { ( xi , X j ) Xi = X j , X * = X j , i < j ) } ,

-
-
(9 1)

^^ ^
b =|5D
| , S D ={( , )1 = < j)}, (9.2)

^^ ^
c =\D S\ , D S = { { x u x j|
) Xi X j , X* = X j , i < j)}, (9.3)
d =\ DDl DD = {( , )| j)}, (9.4)

SS C C*
SD C C*
( xi , Xj ) ( i < j)
a c d m( m — 1)/2

• Jaccard (Jaccard Coefficient , JC)

• FM ( Fowlkes and Mallows Index


JC =
a -\- b -\- c
FMI)
-
(9 5)

FMI =
a+6 a+c
*

-
(9 6)
9.3

• Rand (Rand Index, RI)

2 ( (2 d)
RI =
m( m — 1) (9.7)

p l]

{CV,C2 ,... Cfc},

avg(C) =

diam ( C)
|C|(|C|

>
1)
^
maxKi < 7X|C| dist ( xi , Xj ) ,
< <\c\
l i< j
dist (a?i , X j ) (9.8)

(9.9)
dmin ( Ci Cj ) = IRiTLXi ^ Ci ,XjECj dist (3Ji X j) , (9.10)
_
d cen { C i Cj ) = dist
> /i)) (9.11)

dist( ) /x C

^-xi ayg(C)
.
C diam(C)

^
C d Ci C j ) Ci Cj
.

^
d Ci C j ) Q Cj
(9.8) (9.11)

• DB (Davies- Bouldin Index, DBI)

k
avg(Q) avg(CJ )
DBI max ( (9.12)
(
cen MZ ? MJ )
kU V ^
• Dunn ( Dunn Index, DI)

Cj )
DI = mm mm . dminiCidiam ^
(9.13)
l
^ k j i VmaxK» (C/ )
DBI DI

9.3
dist(., (distance measure),

d i s t( x i ,X j ) (9.14)
dist(®i ,X j ) (9.15)
dist( xi,Xj ) dist( x j ,Xi ); (9.16)
dist(a i,Xj ) < dist(cci ,ajfc) + dist(ajfc ccj) •
-
(9 17)

An ) ( Xji Xj 2 ...] Xjn),


Minkowski distance)

(9.18) aji - ajj


— \I u=
^^
Lp || jCi X j ||p . dist \ Xiu ju \ (9.18)
l

1, (9.18) (9.14) (9.17)


p
^ oo p (Euclidean distance)

diste j{xi Xj ) \\ xi — ® jf ||2 > \ iu ju \ . (9.19)


^ \ =1 ^
^
( A ~~
U

(city p=1
block distance).
(Manhattan distance)

distman ( ^i « 5 — \\ x i - x j \\ 1 = Y \^iu — X iju \


U
^
=1 - (9.20)

(continuous attribute)
(categorical attribute),
(numerical attribute),

(nominal attribute)•
3}

ordinal attribute); {
non-ordinal
attribute).
VDM (Value Difference Metric) [Stanfill and Waltz ,
1986]. mu,a w a m i i

SSr VDM
b

k V
mu ,a,i
VDMP(a,b) = , Y u ,a u ,b
(9.21)
9.3

VDM nc
n

MinkovDMp(cc x j ) I \ p [ VDMp ( xiu , x j u) .

^
iu j.
u=nc -\- l
(9.22)
( weighted
distance).

distwm ^ 5 Xj) ( t^i . \ x i\ X j\ + . . + W - \ ^ in



p
(9.23)

1, 2, . . . , n) YA= I =
Wi
^ 0 (i
(similarity measure)
(9.17).

9.1
(non-metric
distance).

10.6
(distance metric learning) .

d\ d2 C?3

9.1
9.4

(prototype-based clustering),

9.4.1 k

{£CI ,®2 ,• • . , x },
D -
fc means)
C {Ci C2 ,• • . Cfc}

k
E= H II ® - Mill!, (9.24)

i \ xECi

A= * Ci (9.24)
A
(9.24) D
NP [Aloise et al. 2009]. /c

(9.24). 9.2
4-8 9-16

9.1 4.0
i

P.89 3.0o 9.1 4.0


4.0

0.697 0.460 0.245 0.057 0.748 0.232


0.774 0.376 0.343 0.099 0.714 0.346
0.634 0.264 0.639 0.161 0.483 0.312
0.608 0.318 0.657 0.198 0.478 0.437
0.556 0.215 0.360 0.370 0.525 0.369
0.403 0.237 0.593 0.042 0.751 0.489
0.481 0.149 0.719 0.103 0.532 0.472
0.437 0.211 0.359 0.188 0.473 0.376
0.666 0.091 0.339 0.241 0.725 0.445
0.243 0.267 0.282 0.257 0.446 0.459
9.4

D {aJi ,a?2 ,... a? m};


/c.


{Ml ,M2 ,
repeat
Ci (1 i fc)
for j = 1, . .. m do
-

^
=|
^ (1 ^ i ^ k ) Aj
|%
argmin M k\
U{ } ^

^
end for
for i = 1,2,...,/c do

^ ^ ^^
i =] xeCi x:
11 if /x • \ii then
12 A i
13 else

end if
end for
IT until
m%

^
C ={CuC2,...,Ck}

9.2

k = S, x6 , ® 27

Mi (0.403;0.237), /x2 (0.343;0.099), /x3 (0.532;0.472) .

(0.697;0.460), Mi ,M2 ,M3


0.369,0.506, C3

{®5 ? ®6 )
« ®10 ? ®13 ? ®14 > ®15 ? ®17 ? ®18 ? ®19 ? ® 20 ? ® 23}j
«

◦= {«11 , , 16}

^^
2 12

Cs = {a?i , ®2 , ®3, ®4 , a?21, ®22 , *24 , ®25 ? ® 265 ® 28 ? ®3 o} *

Ch C2

^
Mi ( 0.473; 0.214) , (0.394; 0.066) , = ( 0.623; 0.388) .

9.3
'
' . i•

F

<• •
- ¥•
'

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

( a) (b)

i
<•

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.2

(c) (d )

9.3 4.0 /c ( /c 3)

9.4 .2
k Learning Vector Quantization ,
LVQ)
LVQ

( x2 ,2/2 ),• • . («m ,2/m )}, n


[ x j i \ X j 2 ] ... \ X j n ) , y j G y LVQ
n {p!,p2 ,..
G

LVQ 9.4
g
9.4

D {(xi ,2/i ),(a 2 ,2/2 ),..• (®m ,y m)}\


g t2 ,..
r e (0 l ).
{pi ,P2 , Pg}
repeat
( X j , y j );
1 i g) ||
||x j pi
axgminiG{1 2
-
2
? >
dji ]
6 if y j = U * then
j P i* 7
8: else

P = Pi* V ~ Pi* )

P i*
-
Pf = P i * r j •( x j
-p )

^
Xj
end if
V
until
0>i ,P2 , }
9.4

SOM

LVQ SOM

SOM ,
5.5.2 5.5.3

LVQ 6 10-
Xj

Pr P i* +V ( xj
' ~ P i* ) » (9.25)

V
\\p - Xj \ \ 2 = WPi * V '( X j
- jib)-
P i* X j \ \2

= (1 H* — x • (9.26)

7 e .

(1 + 7?) \ \ p i * X j \ \ 2,

Y
r,

( lossy compression),
Ri = { x e X \ \\ x - p i \\ 2 \ \ X - P i > \ \ 2 , i’ i} • (9.27)
vector
quantization); LVQ

^^ ^
Y { , 2 ,..., },
“Voronoi Voronoi tessellation).
9.1 4.0 LVQ -
9 21
c2 , Cl . g
pl 5 P2 , P3 , P4 , P5 ,
Cl , C2 , C2 , Cl , Cl .

®12 , $29 .
P2 , P3 , P4 , P5
0.283, 0.506, 0.434, 0.032. p5
c2 , 0.1, LVQ p5

P5 P5 )

-
(0.725;0.445) + 0.1 ((0.697;0.460) (0.725;0.445))
(0.722;0.442) •

p5 9.5

9.4.3

Gaussian)
k LVQ (Mixture of
--
) n Y
® A/(pt ,S).
*
x,

S:
||
S :S p( x ) (9.28)
-
S 1: S ( 2TT ) 2 |I | 2

M nxn (9.28)
M S
9.4

0.50
4200..31
.

2 .
V.

0.3 0.5 0.7 I1 0.3 0.4 0.8

(a) ( b)
0.8

0.6

0.5

0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.2 0.3 0.4

(c) (d )

9.5 4.0 LVQ ( g = 5) . C L C2


Voronoi

/x ,E ).

PM (.) k
P M ( x )d x = 1. PM {x) = (9.29)

/c /Xi
i ( mixture
coefficient ), J2i=i ai =

m ..
i
D G {1,
...
P{ z j i) (i = l ,2,...,/c ).

P{zj i ) • pM ( x j I Zj = i)
PM (zj =i | xj) =
P M{x j )
o^i p( X |j
k (9.30)
1=1
Oil p { xj I

^
PMizj = I xj) i
(i ... /c).
(9.29) D A:
C ={Ci ,C2 ,...,Ck }

Xj arg max (9.31)

^^
\ l i k}
7.2
D

L L( D ) = In n=
i i

m k

Eh
=
'
?
E
=1 \i l
0i '
^ P { X j | /ii , (9.32)

EM 7.6 EM
d L L( D)
(9 . 3 2) ~
9JM

E k
(xj - Mi) = 0 , (9.33)

^^
j=1
E a i p( x j
' I u i)

(9.30) i X j ), W
7# = P M { z j = |
9.4

*y • • nf •

Mi (9.34)
S

= (9.35)
E iji

LL( D), =l
LL( D )

LL{ D ) A
SH (9.36)

A (9.36)

p { x j /Xj , 5] )
E k
^ A (9.37)
=
j 1
J2 ai - p( x j \
1=1

A= m —
ai = (9.38)
J=I

EM
(E
9.35) (9.38) W (M •

9.6
2-12 EM
EM LL( D )
{® i ,x2 ,• • • ® m .

A:.

repeat
EM E for j = 1,2,... m do
(9.30)

^
Iji ~ = (1
end for
EM M 6: for i = 1,2,..., do
W = Eg
jjgj
7: = 7 i •
i jLi 7 > j

8: = 7j*

a
end for

^
(K ,E0
12 until
13 Ci = 0
14 for j = 1,2,..., m do
(9.31)
(7Aj U{ }

^
C
end for
Q}

9.6

14-17

9.1 4.0 k = 3.
ai a2 as XQ

, M3
O.I o.o
M2 ®22 ® 27 Ei = S2 = H 3 = o.o o.i

(9.30)

Q
-i
0.219,

0.361, cx 2 0.323,
0.404,

0.316
0.377.

^
(0.491;0.251), (0.571; (0.534;0.295)

0.025 0.004 0.023 0.004 0.024 0.005


, S2 = ,

^
3 =
0.004 0.016 0.004 0.017 0.005 0.016

9.7
9.5

8.7
o.
o.

65.4
o.
o.
o.
orJ

i.l

(a) 5 (b)
.
8.7
.6
.S

4.3
. ▲

+o

.2

8.i 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

(C) (d)
9.7 3) G C3
“o”

9.5

(density-based clustering),

-
Density- Based S

-
patial Clustering of Appli
DBSCAN neigh
-
cations with Noise” . borhood) (e,MinPts )
D= xm } ,

dist(.,.)
•e - G
Ne ( Xj ) = { xi
e
-
D dist ( x i , X j )
e \
D
;
e}
e

\ Ne{Xj)\
(core object ):
MirLPk
e
- MinPts

• -
(directly density reachable): e
-
• -
(density reachable): pi ,P2 , Pn ,
Pi Pn Pi+i Pi


xk
-
(density connected): Xi

9.8

o
o
o
o
9.8 DBSCAN
Xi
( Minims
a?3
3) e
-
DBSCAN
(e,MinPts),
D
C QD
( noise)
(anomaly )

(connectivity): e C, G C (9.39)
(maximality): G C, Xj eC (9.40)

D x
X {x' e D
x K x
DBSCAN (seed),
9.9
(e , MinPts)
9.5

D . . . ® m };
(e , MinPts ) .

for j ... m do
e- Ne ( xj ) ;
if \ Ne ( x j ) \ ^ MinPts then
5 D
6 end if
7: end for
8: k =0
9 r

^
while Q 0 do
r;
oe o
r = r \{o}

^
while Q 0 do
g g;
if | iVe (g ) | MinPts then

(3;
r = r \ A;
end if
end while
fc = /c + i , C4 = rQld \ r;
O = 0 \ Ck
24 end while
Q}

9.9 DBSCAN

9.1 4.0 (e,MinPts ) e


0.11 , MinPts 5 . DBSCAN e-
{®3,®5,®6 ,® 8 , ®13 ,®14 ,®18 , ® 25 ,® 28 ,« 29}• D

^ U DBSCAN

C\ = •

DBSCAN m Cl D Cl
{ xs ,x5,x9, *24 ,® 25 ,® 28 ,X 2Q}. D
D
9.10 DBSCAN . Ci
O.

o.
o.
o.
o.
o.
o

1.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.2 0.3

(a) CL (b) C2

r
..


Vo ' -
o\

•••
/o .
'
--- ,
1? ..

• y
:
.3
•••••• V.
8.1 0.7 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

(c) C3 (d) C4

9.10 DBSCAN ( e = 0.11, 5)


®13 ,® 14 ,®16 ,

^
2 ®4 , ® 2l }
C3 { X\ , X 2 , X 2 2 , *29}

CA = {^24 , ^25 , ^27, ^28 , ^30}

9.6

( hierarchical clustering)

AGNES AGglomera- AGNES


tive NESting
9.6

Ci Cj ,

dorff distance),
( Haus
-
9.2. dmin(Ci Cj) min dist(cc ,z ) (9.41)
xeCi ,zeCj

dmax(Ci Cj ) = xeCmax dist(«,z ) , (9.42)


z £ Cj

daVg(Ci ,Cj ) = H dist( x z) • (9.43)
laii i xeCi
^ zeCj

dmin dmax

dmin , dm, d;
davg. /c.

for j .. . m do
={x j }
3 end for
4 for i = 1, 2, . .. , m do
for j = 2, m do
M ( i , j ) = d{ ci , cjy,

end for
9: end for
10: g m
11 while q > k do
i* < j * .

Cr: Ce
for j = j* +l j* + ••• g do
G
end for
M r
for j 1, 2, .. . 1 do
M ( i* j ) = d ( ci* , cjy,
n
M(i, = M ( i* , j )
end for
q = q-i
23 end while
mM
^ c = { cuc , ... ,ck }
2

9.11 AGNES
216 9

davg AGNES (single linkage)-


-
complete linkage)
-
average linkage)

-
'
AGNES 9.11 1 9
11 23- AGNES

4.0 p .202 4.0 AGNES


9.1.
A: 9.12 (dendrogram),

rj
1 29 26 2 22 21 3 4 23 25 28 24 30 27 5 7 9 17 13 14 16 6 8 18 19 10 20 15 11 12

9.12 4.0 AGNES rfmax).

9.12

Cl = C2 ={CC2 ,®3,® 4 ,«21 , 22}

Cs = {*23, *24 , ®25



? ® 27 ?

13 ? ®14 ? *®16 ? «®17}5


^
*28 , ®3o} C4 = {^ 5 , ^7}
, , , , ,
5 = ^
*
^
6 ~ ®8 ®10 ®15 ®18 ®19 ® 2o};

◦=
7 {® n , Xi 2}.
9.7 21?

9.13
9.12

•./
■ ■■

0.2 0.2 0.5

( a) k =7 (b) k=6

8.i 0.3 0.6 0.9 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

(c) /c (d ) k =4

9.13 AGNES dmax ) ( A:

9.7

-
[Estivill Castro, 2002].
[Mitchell , 1997]
[Jain and Dubes,
Jain et al. 1999; Xu and Wunsch II, 2005; Jain, 2009]
9.2 F mutual
information) (average silhouette width) [Rousseeuw, 1987]
[Jain and Dubes,1988; Halkidi et al. 2001; Maulik and Bandyopadhyay ,
2002] .

[Deza and Deza, 2009]. MinkovDM [Zhou and Yu , 2005]


[Jacobs et al,
Tan et al., 2009]. [Xing et
10.6
al., 2003].
k

Steinhaus Lloyd McQueen


[Jain and Dubes, Jain , 2009]. k /c-medoids
[Kaufman and Rousseeuw , 1987] -modes
/c

[Huang,1998] Fuzzy C-means FCM) [Bezdek,1981]


(soft clustering)
A:
Bregman
Bregman
Bregman divergence,
[Banerjee et al., 2005].
[Scholkopf et al., 1998],
/c
-
kernel A: means)
(spectral clustering) [von Luxburg, 2007]
[Dhillon et al., 2004], (Laplacian
Eigenmap) /c
/c [Pelleg and Moore, 2000; Tibshirani et al., 2001],
/c

LVQ
LVQ2 LVQ3
[Kohonen, 2001]. [McLachlan and Peel, 2000]
EM [Bilmes, Jain and Dubes,1988].

DBSCAN [Ester et al.,1996] OPTICS [Ankerst et al.,


9.7

DENCLUE [Hinneburg and Keim, 1998] A G N E S [Kaufman and


Rousseeuw , 1990]
DIANA [Kaufman and Rousseeuw, 1990]
AGNES DIANA
BIRCH [Zhang et al.,1996] ROCK [Guha et al ,1999]

( clustering ensemble)

[Zhou , 2012] 7
.

outlier detection. anomaly detection) [Hodge and Austin, 2004; Chandola et


al. 2009]

isolation) [Liu et al. 2012].


9.1 p
p
P

lim
-> H-oo
pi
iu xju \P
U=1
9.2 X Z
Hausdorff distance)

distnC-X , Z ) max (disth ( X, Z) , disth ( , X ) ) (9.44)


'

^
disth (X, Z ) = maxmin ||® z\\2 . (9.45)
xex zez

9.3 (9.24)
9.4 /c
4.0 p.202
4.0
9.1.
.

9.5 DBSCAN ®
X. (9.39) (9.40).
9.6 AGNES
9.7

9.8 9.2

9.9*

9.10* A:
4.0
Aloise, D., A. Deshpande, P. Hansen, and P. Popat . ( 2009) . “ NP-hardness of
Euclidean sum-of-squares clustering.” Machine Learning , 75( 2 ) :245-248.
Ankerst , M., M. Breunig , H.-P. Kriegel, and J . Sander. (1999 ) . ^OPTICS: Or-
dering points to identify the clustering structure.” In Proceedings of the
ACM SIGMOD International Conference on Management of Data ( SIG-
MOD ) , 49-60 , Philadelphia, PA.
Banerjee, A., S. Merugu, I. Dhillon , and J . Ghosh. ( 2005) . “Clustering with
Bregman divergences.” Journal of Machine Learning Research, 6:1705-1749.
Bezdek , J . C. (1981) . Pattern Recognition with Fuzzy Objective Function Algo-
rithms. Plenum Press, New York , NY.
Bilmes, J. A. (1998) . “A gentle tutorial of the EM algorithm and its appli-
cations to parameter estimation for Gaussian mixture and hidden Markov
models.” Technical Report TR-97-021, Department of Electrical Engineering
and Computer Science, University of California at Berkeley, Berkeley, CA.
Chandola, V., A. Banerjee, and V. Kumar . ( 2009) . “Anomaly detection: A
survey.” ACM Computing Surveys, 41(3) :Article 15.
Deza, M. and E. Deza. ( 2009) . Encyclopedia of Distances. Springer ,. Berlin.
Dhillon , I. S., Y. Guan , and B. Kulis. ( 2004) . “ Kernel fc-means: Spectral clus-
tering and normalized cuts.” In Proceedings of the 10th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining ( KDD ) ,
551-556, Seattle, WA.
Ester , M., H. P. Kriegel, J . Sander , and X. Xu. (1996 ). “A density-based algo-
rithm for discovering clusters in large spatial databases.” In Proceedings of
the 2nd International Conference on Knowledge Discovery and Data Mining
( KDD ), 226-231, Portland , OR.
Estivill-Castro, V. ( 2002 ) . “Why so many clustering algorithms a position
paper. SIGKDD Explorations , 1( 4) :65-75.
Guha, S. , R. Rastogi, and K. Shim. (1999 ). “ROCK : A robust clustering al-
gorithm for categorical attributes.” In Proceedings of the 15th International
Conference on Data Engineering ( ICDE ), 512-521, Sydney, Australia.
Halkidi, M., Y. Batistakis, and M. Vazirgiannis. ( 2001) . “On clustering valida-
,
tion techniques.” Journal of Intelligent Information Systems 27( 2-3) :107-
145.
Hinneburg, A. and D. A. Keim. (1998). “ An efficient approach to clustering in
large multimedia databases with noise.” In Proceedings of the 4 th Interna-
tional Conference on Knowledge Discovery and Data Mining ( KDD ) , 58—65,
New York, NY.
Hodge, V. J . and J . Austin. ( 2004) . “A survey of outlier detection methodolo-
gies. Artificial Intelligence Review , 22 (2):85-126.
Huang, Z. (1998). “Extensions to the A:-means algorithm for clustering large
data sets with categorical values.” Data Mining and Knowledge Discovery ,
2 (3) :283-304.
Jacobs , D. W., D. Weinshall, and Y. Gdalyahu. ( 2000 ) . “Classification with
non-metric distances: Image retrieval and class representation.55 IEEE
Transactions on Pattern Analysis and Machine Intelligence 6 ( 22 ) :583-600.
Jain, A. K . ( 2009) . “Data clustering: 50 years beyond /c-means.” Pattern Recog-
nition Letters, 31(8):651-666.
Jain , A. K. and R. C. Dubes. (1988) . Algorithms for Clustering Data. Prentice
Hall, Upper Saddle River , NJ .
Jain , A. K., M. N . Murty, and P. J . Flynn. (1999 ) . “Data clustering: A review.”
ACM Computing Surveys, 3(31):264-323.
Kaufman, L. and P. J . Rousseeuw. (1987) . “Clustering by means of medoids.”
In Statistical Data Analysis Based on the Li - Norm and Related Methods (Y.
Dodge, ed . ) , 405-416 , Elsevier , Amsterdam , The Netherlands.
Kaufman, L. and P. J . Rousseeuw. (1990). Finding Groups in Data: An Intro-
duction to Cluster Analysis. John Wiley Sons New York , NY.
Kohonen , T. ( 2001). Self - Organizing Maps , 3rd edition. Springer , Berlin.
Liu , F. T., K. M. Ting , and Z.-H. Zhou. ( 2012 ) . “Isolation-based anomaly detec-
tion.” ACM Transactions on Knowledge Discovery from Data, 6 (1) :Article
3.
Maulik, U. and S. Bandyopadhyay. ( 2002 ) . “Performance evaluation of some
clustering algorithms and validity indices.” IEEE Transactions on Pattern
Analysis and Machine Intelligence , 24(12):1650-1654.
McLachlan , G. and D. Peel. (2000) . Finite Mixture Models. John Wiley Sons,
New York, NY.
Mitchell, T. (1997) . Machine Learning. McGraw Hill, New York, NY.
Pelleg, D. and A. Moore. ( 2000) . “X-means: Extending fc-means with efficient
estimation of the number of clusters.” In Proceedings of the 17th Interna-
tional Conference on Machine Learning ( ICML ), 727-734, Stanford , CA.
Rousseeuw , P. J . (1987). “Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis.” Journal of Computational and Applied
Mathematics, 20:53-65.
Scholkopf , B., A. Smola, and K.-R. Muller. (1998). “ Nonliear component anal-
ysis as a kernel eigenvalue problem.” Neural Computation, 10 (5):1299-1319.
Stanfill, C. and D. Waltz. (1986) . “Toward memory-based reasoning.” Commu-
nications of the ACM , 29 (12 ) :1213-1228.
Tan , X. , S. Chen , Z.-H. Zhou, and J . Liu. ( 2009) . “Face recognition under oc-
elusions and variant expressions with partial similarity.” IEEE Transactions
on Information Forensics and Security , 2 ( 4):217-230.
Tibshirani, R., G. Walt her , and T. Hastie. (2001). “Estimating the number of
clusters in a data set via the gap statistic.” Journal of the Royal Statistical

- —
Society Series B, 63( 2):411 423.
von Luxburg, U. (2007). “A tutorial on spectral clustering.” Statistics and
Computing , 17( 4) :395-416.
Xing, E. P., A. Y. Ng, M. I. Jordan , and S. Russell. ( 2003) . “Distance metric
learning , with application to clustering with side-information.” In Advances
in Neural Information Processing Systems 15 ( NIPS ) (S. Becker , S. Thrun ,
and K. Obermayer , eds.) , 505-512, MIT Press, Cambridge, MA.
Xu, R. and D. Wunsch II. ( 2005) . “Survey of clustering algorithms.” IEEE
Transactions on Neural Networks, 3(16):645-678.
Zhang, T., R. Ramakrishnan, and M. Livny. (1996 ). “BIRCH: An efficient data
clustering method for very large databases.” In Proceedings of the ACM
SIGMOD International Conference on Management of Data ( SIGMOD ) ,
103—114, Montreal, Canada.
Zhou, Z.-H. ( 2012 ). Ensemble Methods: Foundations and Algorithms. Chap-
man Hall/CRC, Boca Raton , FL.

-
Zhou , Z . H.and Y.Yu.(2005). “Ensembling local learners through multimodal

-
perturbation.” IEEE Transactions on Systems , Man, and Cybernetics Part
B: Cybernetics , 35(4):725-735.

( Manhattan distance)
( Taxicab geometry),

Hermann Minkowski , 1864 1909)

5 - 3) (33 — 23)
( Kaunas) .
( Alexotas)

n

10.1 A;

/c
. Nearest Neighbor ,

A:
ANN )

/c

8.4

lazy learning)

( eager learning).
10.1 A: /c

A:
1NN A 1)

10.1 /c
— /

y
/
L
> A:


k=s

- >

/c /c
k =3
Z

^
P ( err ) P( c x ) P ( c z ) . ( 10.1)
cey

7.1
(10.1) z. c* argmaxc
-
3 P(c x)

P ( err ) = 1 — cey P ( c \ x ) P ( c \ z )
5cey P (c |
^
2
1- ®)

1 - P2(c* x )

< 2 x (1 - P(c|x ) )
*
. ( 10.2)

[Cover
and Hart , 1967].

10.2

(dense sample).
5=
0.001
10.2

[Bellman , 1957] (curse of


dimensionality).
(dimension reduction),

(subspace),

embedding). 10.2

( a) (b)

10.2

10.2
( Multiple Dimensional Scaling, MDS) [Cox and
Cox , 2001]
m D G i j
disUj
ZG Rd x m , S!

^
d,


A — distij .
B ZTZ e Mmxm , B


^zjzj
ba + bjj ^ bij . (10.3)
oeRd
Y T=I bij
z
ET=1 hij
YZi zi 0
- B

distfj tr(B) mbjj , (10.4)

distfj = tr(B) + mbu , (10.5)


i= i

J2 distl = 2m tr(B) ,
i= l j = l
( 10.6)

tr(.) (trace),tr(B) = Y T= I IWI 2.


1 \
distf =
^i= distfj ,
^ i
(10.7)
m
distij = ( 10.8)
i= i
m m

^
dist . = 2 £ distij
i= l j = l
(10.9)

(10.3) (10.4) (10.9)

hj = -- ( distfj — distf — dist 2


j dist 2) , ( 10.10)

D B.
B (eigenvalue decomposition), B VAVT
A diag(Ai ,A2 , ...,Ad) Ai A2 ... Ad,
cT A+
diag( Ai ,A2,... A# ) V# Z

z AI / 2VJ e Rd * xm . ( 10.11)

d
A = diag( A!,A2,... A& ), Z
10.3

Z AVSyT e Rd’ xm . (10.12)

10.3 MDS

De disUj

(10.7) (10.9) distf dist2j , dist2;


.
(10.10) B;
B
cf
VAV2 G Rmxd'
10.3 MDS

d X = ( x1 , X 2 , . . . , x m ) e Rdxm , d d
d.

Z WTX , (10.13)
W e Rdxd Z e Rd xm
W !
d d WTXi i f
(

{W i ,W 2 r
- d}
W
"
(i j)

(10.13)
AV

10.3

( Principal Component Analysis, PCA)


PCA
)


Ei o
{wi,W2,.• .,Wd},
(i i ).
A {zn ] Zi2 ...;zidf ),
j .
YU ㈣

d'
const HZi — X i
0W 3
=
i 1
Zi -2 ZiTWTXi const

oc —tr WT
\i=l
W . (10.14)

(10.14) xiXJ

mm tr (WTXXTW) (10.15)

s.t . WTW = I.

10.4

Ei TXixJW,

^
max tr (WTXXTW)

s.t . WXW = I
(10.16)
10.3

= 0.206

= 0.045
o

10.4

(10.16) (10.15)
(10.15) (10.16)

XXTW = AW (10.17)

x XXT
Ai A2 Ad , cf W (i/n ,
PCA
W2 , ... ,W d r ) . PCA 10.5

Ei *#?
D= X m };
wi a.
-
Si xixI AllUltOi'
tu2; XXT;
W XXT
Wd .

^

i= l
= 2XjWjwj
=
j \
W

10.5 PCA

cf
A;
PCA, t

EtiA . t. (10.18)
PCA W

d - d!

10.4

10.6
S

intrinsic)

(a) (b) (c) PCA


10.6
S

6.6
(kernelized). ( Kernelized PCA , KPCA )
[Scholkopf et al., 1998]
M
PCA

( )
| - T W = AW (10.19)
10.4

w = E Zi A
i= l

E ZiOCi , ( 10.20)

={z j w.
A i •• • ) m.
PCA (10.19)

W AW ( 10.21)

(10.20)
W= ( j^ ( Xi ) OLi . ( 10.22)

{ Xi Xj) () j { Xi )^ . (10.23)

(10.22) (10.23) (10.21)

KA AA (10.24)

K K (K )i j K( x i X j ), ,A (ai;o 2; a ).
(10.24) K cf
j ( j = l 2,...,cT)

^
ZJ w j f(x) = Y
() (
4 Hxi )T l(x) ()

Y .
^^ oP Xi x )

j
5

(10.25)
(10.25)

KPCA
10.5

( manifold learning)

10.5.1
( Isometric Mapping, Isomap) [Tenenbaum et al ., 2000]

10.7 geodesic)
10.7(a)
S

fi

mm "1

0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.I

( a) ( b)
10.7
10.5

10.7(b)

E. Dijkstra
W.Dijkstra
R. Floyd Floyd 10.2 MDS
10.8 Isomap

JD

for i ..., m do
2: fc

end for

^^
dist , );
MDS 10.2 distix^ Xj ) MDS
return MDS
Z

10.8 Isomap

Isomap

A A:
e, e e

10.5.2
Isomap (Locally
Linear Embedding, LLE) [Roweis and Saul,2000]
xk
wik
ik
Wil Xi
m
, xi

10.9

10.9
Xk xi

( 10.26)

LLE ( 10.26)
LLE

^
mm
, ,...,Wm
Wl W 2 2 j Qi
w xj
^ ( 10.27)
^ 2

s .t E
jeQi
Win =1

= — X j )T ( x i - Xk ) , W i j

wij = • ( 10.28)
E C
i ,seQi

LLE

( 10.29)

( 10.27) (10.29 ) ( 10.27)


Wi, ( 10.29 )
10.6

Z (Z i • ) eRd’ xm W j = Wij ,

M = (I - W ) T (I - W ) , (10.30)

(10.29)

lin tr ( ZMZT ) , (10.31)

s.t . ZZT

(10.31) M (f
ZT.
LLE 10.10

D ={aJi ,£C2,...,Xm }\
fc;
(f.
for i ... m do
2: &
3 (10.27) jG
j 0 Qi , = o;
5 end for
6 (10.30) M;
7: M
return M /
Z = { zuz2 , ... , zm }.

10.10 LLE

10.6

(distance metric learning) .

( metric learning )
9.3

distgd (aJi , X j ) = \\ x i — X j W l = distfj^ dist 2 + • • • + dist i2j d (10.32)

distijik fc
^ 2

ti

dist
^ e(j ( xi ,

= (xi
\\ x {

T^ W
\
Xj 2

- Xj ) (xi
• distfj 1
X j)
}

,
W2 dist j2 ... + Wcr dist jd

(10.33)

0, W = diag( ) (W )u = W i .

(10.33) AV
^ W

(10.33) AV

P.C. Mahalanobis
M ( Mahalanobis distance)
M

-
S 1;
M
M

M
dist
^^ ah i Xj ) ^ ^\/ .{ Xi Xj ) = \\ X {

M
(10.34)

M ) P M
M PPT.
M
M
M. ( Neighbourhood Component
Analysis, NCA) [Goldberger et al. 2005]
10.6

Pij =
exp \\ xi AIIL) (10.35)
e x p l - Wxi IlL)
i= j
^ - xi

2.2.2
(LOO)

Pi = (10.36)
j
^ i

A
.

^
2 Pi
i =\ i = l j fli
. (10.37)

(10.35) (10.37), M PPT , NCA

^
ain
exp (-|
|P - PTx |Q .

(10.38)
P
^ Ez exP llpT pT

^^
i=i je i l2

(10.38) L00 M.
[Goldberger 2005].

C G
must link )
-
c ,
M
C ajfc
-
cannot link)

M [Xing et al., 2003]:

I - HM (10.39)

^^
mm
M
( xi ,x j ) eM

s.t. \\ xi ~
>1
(ajj,® /e ) GC

M 0 M (10.39)
M M M
M
M rank(M), d
P

10.7

[Friedman et al., 1996];

[Aha,1997].

(LDA) [Fisher ,1936], 3.4 KLDA [Baudat


and Anouar , 2000] 6.6
( Canonical Correlation Analysis, CCA)[Hotelling,
1936] KCCA [Harden et al., 2004], -
(multi
view learning)

2DPCA [Yang et al.,


2DLDA [Ye et al., 2005] (2 [Zhang and Zhou , 2005]
D ) 2 PCA
(tensor ) [Kolda and Bader , 2009].
Isomap LLE, (Lapl -
cian Eigenmaps, LE) [Belkin and Niyogi, 2003] (Local
Tangent Space Alignment , LTSA) [ Zhang and Zha,2004]
(Locality Preserving Projections, LPP) [He and Niyogi, 2004]
LE
[Geng et al., 2005].

[Belkin et al., 2006]. [Yan et al., 2007]

13.6
[Wagstaffetal. 2001].
[Xing et al., 2003],
10.7

[Weinberger and Saul, 2009],


[Frome et al., Zhan et al., 2009].
[Yang et
al. 2006]
[Davis et al., 2007] Bregman
10.1 A: 3.0a
3.0a p.89
4.5.

10.2 err, err*

err* err err* [ x err (10.40)


1^ 1 - 1
10.3
XXT XHHTXT , H 11T

10.4 xxT

10.5

princomp
10.6 MATLAB PCA Yale
Yale
http: // vision . ucsd .edu / content
/yale-face-database. 10.7

10.8* k e Isomap

10.9* LLE

9.3
10.10
Aha, D . , ed. (1997). Lazy Learning. Kluwer , Nor well , MA.
Baudat , G. and F. Anouar. ( 2000) . “Generalized discriminant analysis using a
kernel approach.” Neural Computation, 12 (10 ):2385-2404.
Belkin, M. and P. Niyogi. ( 2003). “Laplacian eigenmaps for dimensionality re-
duction and data representation.” Neural Computation, 15(6):1373-1396.
Belkin , M., P. Niyogi, and V. Sindhwani. ( 2006 ) . “Manifold regularization: A
geometric framework for learning from labeled and unlabeled examples.”
Journal of Machine Learning Research, 7:2399-2434.
Bellman , R. E. (1957). Dynamic Programming. Princeton University Press,
Princeton, NJ .
Cover , T. M. and P. E. Hart . (1967) . “ Nearest neighbor pattern classification.”
IEEE Transactions on Information Theory, 13(1) :21-27.
Cox , T. F. and M. A. Cox. ( 2001). Multidimensional Scaling. Chapman & Hal-
1/ CRC, London , UK .
Davis , J . V., B. Kulis, P. Jain , S. Sra, and I. S. Dhillon. ( 2007) . aInformation-
theoretic metric learning.” In Proceedings of the 24 th International Confer-
ence on Machine Learning ( ICML ) , 209-216 , Corvalis, OR.
Fisher , R. A. (1936) . “The use of multiple measurements in taxonomic prob-
lems. Annals of Eugenics , 7( 2 ) :179-188.
Friedman, J . H., R. Kohavi, and Y. Yun. (1996 ) . “Lazy decision trees.” In Pro-
ceedings of the 13th National Conference on Aritificial Intelligence ( AAAI )
^
717-724, Portland , OR.
Frome, A., Y. Singer , and J . Malik. ( 2007) . “Image retrieval and classification
using local distance functions.” In Advances in Neural Information Process-
ing Systems 19 ( NIPS ) ( B. Scholkopf , J . C. Platt , and T. Hoffman , eds.) ,
417-424, MIT Press, Cambridge, MA.
Geng , X. D.-C. Zhan , and Z.-H. Zhou. ( 2005) . “Supervised nonlinear dimen-
sionality reduction for visualization and classification.” IEEE Transactions
on Systems , Man, and Cybernetics Part B: Cybernetics, 35(6 ) :1098-1107.
Goldberger, J . , G. E. Hinton , S. T. Roweis, and R. R. Salakhutdinov. ( 2005) .
^ Neighbourhood components analysis.” In Advances in Neural Information
Processing Systems 17 ( NIPS ) (L. K . Saul, Y. Weiss, and L. Bottou, eds.) ,
513-520, MIT Press, Cambridge, MA.
Harden, D. R., S. Szedmak, and J. Shawe-Taylor. ( 2004) . “Canonical correla-
tion analysis: An overview with application to learning methods.” Neural
Computation, 16 (12 ) :2639-2664.
He, X. and P. Niyogi. ( 2004) . “Locality preserving projections.” In Advances
in Neural Information Processing Systems 16 ( NIPS ) (S. Thrun , L. K. Saul,
and B. Scholkopf , eds.) , 153—160 , MIT Press, Cambridge, MA.
Hotelling, H. (1936) . “Relations between two sets of variates.” Biometrika, 28
(3-4 ) :321-377.
Kolda, T. G. and B. W. Bader. (2009) . “Tensor decompositions and applica-
tions.” SIAM Review, 51(3):455-500.
Roweis, S. T. and L. K. Saul. ( 2000) . “Locally linear embedding.” Science , 290
(5500) :2323-2316.
Scholkopf , B. , A. Smola, and K .-R. Muller. (1998) . “ Nonlinear component anal-
ysis as a kernel eigenvalue problem.” Neural Computation, 10 (5 ):1299-1319.
Tenenbaum , J . B., V. de Silva, and J . C. Langford. ( 2000). “A global geomet-
ric framework for nonlinear dimensionality reduction.” Science , 290 (5500 ):
2319-2323.
Wagstaff , K. C. Cardie, S. Rogers, and S. Schrodl. (2001). “Constrained
/c-means clustering with background knowledge.” In Proceedings of the
18th International Conference on Machine Learning ( ICML ) , 577-584,
Williamstown, MA.
Weinberger , K . Q. and L. K. Saul. ( 2009 ) . Distance metric learning for large
margin nearest neighbor classification.” Journal of Machine Learning Re -
search, 10:207-244.
Xing, E. P., A. Y. Ng, M. I. Jordan, and S. Russell. ( 2003) . “Distance metric
learning, with application to clustering with side-information.” In Advances
in Neural Information Processing Systems 15 ( NIPS ) (S. Becker , S. Thrun ,
and K . Obermayer , eds.) , 505-512, MIT Press, Cambridge, MA .
Yan, S. , D. Xu, B. Zhang, and H.-J. Zhang. ( 2007). “Graph embedding and ex-
tensions: A general framework for dimensionality reduction.” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence , 29 (1) :40-51.
Yang, J . , D. Zhang , A. F. Frangi, and J .-Y. Yang. ( 2004). “Two-dimensional
PCA: A new approach to appearance-based face representation and recog-
nition.” IEEE Transactions on Pattern Analysis and Machine Intelligence ,
26 (1):131-137.
Yang, L., R. Jin , R. Sukthankar , and Y. Liu. (2006 ) . “An efficient algorithm
for local distance metric learning.” In Proceedings of the 21 st National Con-
ference on Artificial Intelligence ( AAAI ), 543-548, Boston , MA.
Ye, J ., R. Janardan, and Q. Li. ( 2005) . “Two-dimensional linear discriminant
analysis.” In Advances in Neural Information Processing Systems 17 ( NIPS )
(L. K. Saul, Y. Weiss, and L. Bottou, eds.) , 1569-1576 , MIT Press, Cam-
bridge, MA .
Zhan, D.-C., Y.-F. Li , and Z.-H. Zhou . ( 2009 ) . ^Learning instance specific
distances using metric propagation.” In Proceedings of the 26th International
Conference on Machine Learning ( ICML ) , 1225-1232, Montreal, Canada.
Zhang , D. and Z.-H. Zhou. ( 2005) . “( 2D) 2 PCA: 2-directional 2-dimensional
PCA for efficient face representation and recognition.” Neurocomputing , 69
(1-3) :224-231.
Zhang , Z. and H . Zha. ( 2004). “Principal manifolds and nonlinear dimension
reduction via local tangent space alignment .” SIAM Journal on Scientific
Computing , 26 (1):313-338.
PCA)
SVD)
factor analysis)

-
Karhiinen Loeve Hotelling
LSA)
POD) EOF)
(EMA) -
Schmidt Mirsky
( Karl Pearson, 1936) PCA.

University
College London, UCL)

F. Galton W.Welton
Galton

Biometrika, Egon
UCL
omeMk
11.1

(feature),
relevant feature)
(irrelevant feature).
(feature selection).
(data preprocessing)

( redundant feature),

(subset search) {ai
ad } , d

d
{a2 ,a4} {a2}, {a2 ,a4}
A: (/c+1)

fbrwani)

backward )

bidirectional)

{a2 ,a4 ,a5},


{a2 a4,a6 ,a8}
?
,
{a2,a4 ,a5 ,a }

Subset evaluation can be done as follows. Given data set $D$ in which the proportion of the $k$-th class is $p_k$ $(k=1,2,\ldots,|\mathcal{Y}|)$, and assuming for simplicity that all attributes are discrete, an attribute subset $A$ partitions $D$ into $V$ subsets $\{D^1,D^2,\ldots,D^V\}$ according to the values taken on $A$, each subset containing the samples that agree on $A$. The information gain of $A$ is then

$$\mathrm{Gain}(A)=\mathrm{Ent}(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\,\mathrm{Ent}(D^v)\;, \qquad (11.1)$$

where the information entropy is defined as

$$\mathrm{Ent}(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k\;. \qquad (11.2)$$

The larger $\mathrm{Gain}(A)$ is, the more information useful for classification the subset $A$ carries, so the information gain computed on the training set $D$ can serve as an evaluation criterion for candidate feature subsets.

More generally, a feature subset $A$ determines a partition of $D$, while the label information $Y$ corresponds to the true partition; a subset can be evaluated by estimating the discrepancy between these two partitions, and the smaller the discrepancy, the better the subset. Information entropy is only one way of measuring this discrepancy; any mechanism that can compare two partitions can be used for subset evaluation.

Combining a subset search mechanism with a subset evaluation mechanism yields a feature selection method. Common feature selection methods fall roughly into three categories: filter, wrapper, and embedding.

11.2

Relief ( Relevant Features) [Kira and Rendell, l 2]


"
T T

The key of Relief is how to determine the relevance statistic. Given training set $\{(\boldsymbol{x}_1,y_1),(\boldsymbol{x}_2,y_2),\ldots,(\boldsymbol{x}_m,y_m)\}$, for each sample $\boldsymbol{x}_i$ Relief first finds its nearest neighbour $\boldsymbol{x}_{i,\mathrm{nh}}$ among samples of the same class, called the near-hit, and its nearest neighbour $\boldsymbol{x}_{i,\mathrm{nm}}$ among samples of a different class, called the near-miss. The component of the relevance statistic corresponding to attribute $j$ is then

$$\delta^j=\sum_{i}-\mathrm{diff}\!\left(x_i^j,x_{i,\mathrm{nh}}^j\right)^2+\mathrm{diff}\!\left(x_i^j,x_{i,\mathrm{nm}}^j\right)^2\;, \qquad (11.3)$$

where $x_a^j$ denotes the value of sample $\boldsymbol{x}_a$ on attribute $j$, and $\mathrm{diff}(x_a^j,x_b^j)$ depends on the attribute type: for a discrete attribute it equals $0$ if $x_a^j=x_b^j$ and $1$ otherwise; for a continuous attribute $\mathrm{diff}(x_a^j,x_b^j)=|x_a^j-x_b^j|$, with $x_a^j,x_b^j$ normalised to $[0,1]$.

From (11.3) one can see that if $\boldsymbol{x}_i$ is closer to its near-hit than to its near-miss on attribute $j$, then attribute $j$ helps to distinguish same-class from different-class samples, so its statistic component is increased; otherwise the component is decreased. Averaging the results obtained from different samples gives the relevance statistic component of each attribute: the larger the value, the stronger the discriminative ability of the corresponding attribute.

The index $i$ in (11.3) runs over the samples used for averaging. In fact, Relief only needs to estimate the relevance statistics on a sample of the data set rather than on the whole data set [Kira and Rendell, 1992].
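A minimal NumPy sketch of the binary Relief statistic (11.3) follows, assuming continuous attributes already scaled to [0, 1]; the nearest-hit/near-miss search is done by brute force over the whole set and the sub-sampling of instances mentioned above is omitted. All names are illustrative.

```python
import numpy as np

def relief_weights(X, y):
    """Relevance statistic delta^j of Eq. (11.3) for each attribute j (two-class case).
    X is assumed scaled to [0, 1] so that diff(a, b) = |a - b| for continuous attributes."""
    m, d = X.shape
    delta = np.zeros(d)
    for i in range(m):
        dist = ((X - X[i]) ** 2).sum(axis=1)          # squared distances for neighbour search
        dist[i] = np.inf                              # exclude the sample itself
        same, other = (y == y[i]), (y != y[i])
        nh = np.where(same, dist, np.inf).argmin()    # near-hit
        nm = np.where(other, dist, np.inf).argmin()   # near-miss
        delta += -(X[i] - X[nh]) ** 2 + (X[i] - X[nm]) ** 2
    return delta / m   # averaged over samples; larger values indicate more useful attributes
```

Attributes can then be ranked by `delta`, and either the top-k attributes or those above a user-set threshold are kept, as described above.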
Relief

Relief Relief-F [Kononenko, 1994] m


D

- ^
A: /c e ... Relief F
- ;
A
A:

^
,nh
xiMm (/ 1,2,.. l k ).
j

6j = J2 -d [
^^X
XUh ) E( Pz x diff « 4nJ ) 2 •
(11.4)
^
l k

^
A l

11.3
11.3

LVW ( Las Vegas Wrapper) [Liu and Setiono, 1996]


( Las Vegas method)

11.1

U
T.

1 E = oo
2 d = \ A\ ]
A* = A ]
4: t = 0;
5 while t < T do
6
7
W E f = CrossValidation (ii( jDA )) ;
,
if ( E < E ) W ( ( E = E ) A ( d r < d ) ) then
t
11 E = Ef ]
12 d = df ]
13 A* = A/
14 else
15 t = t -h l
16 end if
T 17 end while

11.1 LVW

11.1 D
W
W
W
LVW

T. LVW
r
11.4 U

Given data set $D=\{(\boldsymbol{x}_1,y_1),(\boldsymbol{x}_2,y_2),\ldots,(\boldsymbol{x}_m,y_m)\}$ with $\boldsymbol{x}\in\mathbb{R}^d$, $y\in\mathbb{R}$, consider the simplest linear regression model with squared error as loss function; the optimisation objective is

$$\min_{\boldsymbol{w}}\;\sum_{i=1}^{m}\left(y_i-\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}_i\right)^2\;. \qquad (11.5)$$

When there are many features but relatively few samples, (11.5) easily overfits. To alleviate this, a regularisation term can be added. Using the $L_2$ norm gives

$$\min_{\boldsymbol{w}}\;\sum_{i=1}^{m}\left(y_i-\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}_i\right)^2+\lambda\|\boldsymbol{w}\|_2^2\;, \qquad (11.6)$$

with regularisation parameter $\lambda>0$. Problem (11.6) is known as ridge regression [Tikhonov and Arsenin, 1977]; the $L_2$ regulariser is also called Tikhonov regularisation, and introducing it markedly reduces the risk of overfitting.

Can the $L_2$ norm be replaced by an $L_p$ norm? Taking $p=1$, i.e. the $L_1$ norm, gives

$$\min_{\boldsymbol{w}}\;\sum_{i=1}^{m}\left(y_i-\boldsymbol{w}^{\mathrm{T}}\boldsymbol{x}_i\right)^2+\lambda\|\boldsymbol{w}\|_1\;, \qquad (11.7)$$

again with $\lambda>0$. Problem (11.7) is known as LASSO (Least Absolute Shrinkage and Selection Operator) [Tibshirani, 1996].

Both $L_1$ and $L_2$ regularisation help reduce the risk of overfitting, but the former brings an extra benefit: it is more likely than the latter to produce a sparse solution, i.e. a $\boldsymbol{w}$ with fewer non-zero components. (The most natural way to enforce sparsity would be the $L_0$ norm, but the $L_0$ norm is discontinuous and hard to optimise, so the $L_1$ norm is commonly used as a surrogate.)
To see this intuitively, suppose $\boldsymbol{x}$ has only two attributes, so the solution $\boldsymbol{w}$ of either (11.6) or (11.7) has two components $w_1,w_2$, which we take as the two coordinate axes. Plot the contours of the squared-error term together with the contours of the $L_1$ and $L_2$ regularisers (Figure 11.2: $L_1$ regularisation is more likely than $L_2$ regularisation to yield a sparse solution). The solution of (11.6) or (11.7) must strike a balance between the two terms, i.e. it lies where a squared-error contour intersects a regulariser contour. With the $L_1$ norm this intersection often lies on a coordinate axis, so $w_1$ or $w_2$ is zero; with the $L_2$ norm it usually lies inside a quadrant, where both $w_1$ and $w_2$ are non-zero. In other words, the $L_1$ norm is more likely to produce a sparse solution.

A sparse $\boldsymbol{w}$ means that only those of the original $d$ features whose components in $\boldsymbol{w}$ are non-zero enter the final model, so solving the $L_1$-regularised problem yields a model that uses only part of the initial features. In other words, $L_1$-regularised learning is an embedded feature selection method, in which feature selection and learner training are carried out in one unified process.

The $L_1$-regularised problem can be solved by Proximal Gradient Descent (PGD) [Boyd and Vandenberghe, 2004].

Specifically, let $\nabla$ denote the differential operator and consider the optimisation problem

$$\min_{\boldsymbol{x}}\;f(\boldsymbol{x})+\lambda\|\boldsymbol{x}\|_1\;. \qquad (11.8)$$

If $f(\boldsymbol{x})$ is differentiable and $\nabla f$ satisfies the $L$-Lipschitz condition, i.e. there exists a constant $L>0$ such that

$$\|\nabla f(\boldsymbol{x}')-\nabla f(\boldsymbol{x})\|_2\leqslant L\|\boldsymbol{x}'-\boldsymbol{x}\|_2\quad(\forall\,\boldsymbol{x},\boldsymbol{x}')\;, \qquad (11.9)$$

then in the vicinity of $\boldsymbol{x}_k$, $f(\boldsymbol{x})$ can be approximated by its second-order Taylor expansion:

$$\hat f(\boldsymbol{x})\simeq f(\boldsymbol{x}_k)+\langle\nabla f(\boldsymbol{x}_k),\boldsymbol{x}-\boldsymbol{x}_k\rangle+\frac{L}{2}\|\boldsymbol{x}-\boldsymbol{x}_k\|^2=\frac{L}{2}\Big\|\boldsymbol{x}-\Big(\boldsymbol{x}_k-\frac{1}{L}\nabla f(\boldsymbol{x}_k)\Big)\Big\|_2^2+\mathrm{const}\;, \qquad (11.10)$$

where $\mathrm{const}$ is a constant independent of $\boldsymbol{x}$ and $\langle\cdot,\cdot\rangle$ denotes the inner product. Clearly the minimum of (11.10) is attained at

$$\boldsymbol{x}_{k+1}=\boldsymbol{x}_k-\frac{1}{L}\nabla f(\boldsymbol{x}_k)\;. \qquad (11.11)$$

Thus, minimising $f(\boldsymbol{x})$ by gradient descent amounts to minimising the quadratic approximation $\hat f(\boldsymbol{x})$ at each step. Extending this idea to (11.8), each iteration should instead be

$$\boldsymbol{x}_{k+1}=\operatorname*{arg\,min}_{\boldsymbol{x}}\;\frac{L}{2}\Big\|\boldsymbol{x}-\Big(\boldsymbol{x}_k-\frac{1}{L}\nabla f(\boldsymbol{x}_k)\Big)\Big\|_2^2+\lambda\|\boldsymbol{x}\|_1\;, \qquad (11.12)$$

i.e. each gradient descent step on $f(\boldsymbol{x})$ also takes the $L_1$ norm into account.

For (11.12), one can first compute $\boldsymbol{z}=\boldsymbol{x}_k-\frac{1}{L}\nabla f(\boldsymbol{x}_k)$ and then solve

$$\boldsymbol{x}_{k+1}=\operatorname*{arg\,min}_{\boldsymbol{x}}\;\frac{L}{2}\|\boldsymbol{x}-\boldsymbol{z}\|_2^2+\lambda\|\boldsymbol{x}\|_1\;. \qquad (11.13)$$

Let $x^i$ denote the $i$-th component of $\boldsymbol{x}$. Expanding (11.13) componentwise shows that it contains no cross terms $x^ix^j$ $(i\neq j)$, so the components do not interact and (11.13) has the closed-form solution

$$x_{k+1}^i=\begin{cases}z^i-\lambda/L, & z^i>\lambda/L;\\[2pt] 0, & |z^i|\leqslant\lambda/L;\\[2pt] z^i+\lambda/L, & z^i<-\lambda/L,\end{cases} \qquad (11.14)$$

where $x_{k+1}^i$ and $z^i$ are the $i$-th components of $\boldsymbol{x}_{k+1}$ and $\boldsymbol{z}$ respectively. Hence PGD enables LASSO and other $L_1$-norm-based minimisation problems to be solved quickly.
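To make the iteration (11.12)-(11.14) concrete, here is a small NumPy sketch of proximal gradient descent (ISTA) for the LASSO objective (11.7), using the closed-form soft-thresholding step. The step size uses $L$ equal to the largest eigenvalue of $2X^{\mathrm T}X$, one valid Lipschitz constant for the squared-error gradient; the fixed iteration count and names are illustrative choices, not from the text.

```python
import numpy as np

def soft_threshold(z, tau):
    """Componentwise solution of Eq. (11.13), i.e. Eq. (11.14) with tau = lambda / L."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_pgd(X, y, lam, n_iter=200):
    """Proximal gradient descent for min_w ||y - Xw||_2^2 + lam * ||w||_1."""
    m, d = X.shape
    L = 2.0 * np.linalg.eigvalsh(X.T @ X).max()    # Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2.0 * X.T @ (X @ w - y)             # gradient of f(w) = ||y - Xw||^2
        z = w - grad / L                           # gradient step, Eq. (11.11)
        w = soft_threshold(z, lam / L)             # proximal step, Eq. (11.14)
    return w
```

Increasing `lam` drives more components of `w` exactly to zero, which is the embedded feature selection behaviour described above.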

11.5
11.5

) D

6.3 12.4

D
(sparse representation)

(codebook) .

(dictionary learning),
(codebook learning).
(sparse coding).

Given a data set $\{\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_m\}$, the simplest form of dictionary learning is

$$\min_{\mathbf{B},\,\boldsymbol{\alpha}_i}\;\sum_{i=1}^{m}\|\boldsymbol{x}_i-\mathbf{B}\boldsymbol{\alpha}_i\|_2^2+\lambda\sum_{i=1}^{m}\|\boldsymbol{\alpha}_i\|_1\;, \qquad (11.15)$$

where $\mathbf{B}\in\mathbb{R}^{d\times k}$ is the dictionary matrix, $k$ is the vocabulary size of the dictionary (usually specified by the user), and $\boldsymbol{\alpha}_i\in\mathbb{R}^{k}$ is the sparse representation of $\boldsymbol{x}_i$. The first term asks that $\boldsymbol{\alpha}_i$ reconstruct $\boldsymbol{x}_i$ well, the second that $\boldsymbol{\alpha}_i$ be as sparse as possible.

Compared with LASSO, (11.15) is considerably harder, since besides the $\boldsymbol{\alpha}_i$ (playing the role of $\boldsymbol{w}$ in (11.7)) the dictionary $\mathbf{B}$ must also be learned. Inspired by LASSO, however, (11.15) can be solved by alternating optimisation.

In the first step the dictionary $\mathbf{B}$ is fixed. Expanding (11.15) componentwise shows no cross terms $\alpha_i^u\alpha_i^v$ $(u\neq v)$, so each $\boldsymbol{\alpha}_i$ can be found, in the manner of LASSO, by solving

$$\min_{\boldsymbol{\alpha}_i}\;\|\boldsymbol{x}_i-\mathbf{B}\boldsymbol{\alpha}_i\|_2^2+\lambda\|\boldsymbol{\alpha}_i\|_1\;. \qquad (11.16)$$

In the second step the $\boldsymbol{\alpha}_i$ are fixed and the dictionary is updated, i.e. (11.15) becomes

$$\min_{\mathbf{B}}\;\|\mathbf{X}-\mathbf{B}\mathbf{A}\|_F^2\;, \qquad (11.17)$$

where $\mathbf{X}=(\boldsymbol{x}_1,\boldsymbol{x}_2,\ldots,\boldsymbol{x}_m)\in\mathbb{R}^{d\times m}$, $\mathbf{A}=(\boldsymbol{\alpha}_1,\boldsymbol{\alpha}_2,\ldots,\boldsymbol{\alpha}_m)\in\mathbb{R}^{k\times m}$, and $\|\cdot\|_F$ is the Frobenius norm. A common way to solve (11.17) is KSVD [Aharon et al., 2006], which updates the dictionary column by column. Let $\boldsymbol{b}_i$ denote the $i$-th column of $\mathbf{B}$ and $\boldsymbol{\alpha}^i$ the $i$-th row of $\mathbf{A}$; then (11.17) can be rewritten as

$$\min_{\boldsymbol{b}_i}\;\Big\|\mathbf{X}-\sum_{j=1}^{k}\boldsymbol{b}_j\boldsymbol{\alpha}^j\Big\|_F^2=\min_{\boldsymbol{b}_i}\;\Big\|\Big(\mathbf{X}-\sum_{j\neq i}\boldsymbol{b}_j\boldsymbol{\alpha}^j\Big)-\boldsymbol{b}_i\boldsymbol{\alpha}^i\Big\|_F^2=\min_{\boldsymbol{b}_i}\;\|\mathbf{E}_i-\boldsymbol{b}_i\boldsymbol{\alpha}^i\|_F^2\;. \qquad (11.18)$$

When updating the $i$-th column, all other columns are fixed, so $\mathbf{E}_i=\mathbf{X}-\sum_{j\neq i}\boldsymbol{b}_j\boldsymbol{\alpha}^j$ is fixed, and (11.18) is in principle minimised by a singular value decomposition of $\mathbf{E}_i$, taking the vectors associated with the largest singular value. Directly decomposing $\mathbf{E}_i$, however, may destroy the sparsity of $\mathbf{A}$ obtained in the first step, so KSVD keeps only the non-zero entries of $\boldsymbol{\alpha}^i$ and the corresponding columns of $\mathbf{E}_i$ before the decomposition, thereby preserving the sparsity.

After initialising the dictionary matrix $\mathbf{B}$, iterating the two steps above eventually yields the dictionary $\mathbf{B}$ and the sparse representations $\boldsymbol{\alpha}_i$ of the samples $\boldsymbol{x}_i$; the vocabulary size $k$ controls how rich the dictionary is and hence the attainable sparseness.
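The following NumPy sketch alternates the two steps just described: sparse coding of each sample with a small proximal-gradient LASSO solver (Section 11.4), and a plain least-squares dictionary update with column renormalisation. This simplifies the column-wise KSVD update of (11.18); the random initialisation, iteration counts, and ridge stabilisation are assumptions for illustration only.

```python
import numpy as np

def sparse_code(B, x, lam, n_iter=100):
    """Solve Eq. (11.16) for one sample by proximal gradient descent."""
    L = 2.0 * np.linalg.eigvalsh(B.T @ B).max() + 1e-12
    alpha = np.zeros(B.shape[1])
    for _ in range(n_iter):
        z = alpha - 2.0 * B.T @ (B @ alpha - x) / L
        alpha = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return alpha

def dictionary_learning(X, k, lam, n_outer=20, seed=0):
    """Alternate sparse coding (11.16) and dictionary update (11.17) on X of shape (d, m)."""
    rng = np.random.default_rng(seed)
    d, m = X.shape
    B = rng.standard_normal((d, k))
    B /= np.linalg.norm(B, axis=0)                               # unit-norm atoms
    A = np.zeros((k, m))
    for _ in range(n_outer):
        A = np.column_stack([sparse_code(B, X[:, i], lam) for i in range(m)])
        # least-squares dictionary update for min_B ||X - BA||_F^2 (ridge-stabilised)
        B = X @ A.T @ np.linalg.inv(A @ A.T + 1e-6 * np.eye(k))
        B /= np.linalg.norm(B, axis=0) + 1e-12                   # renormalise columns
    return B, A
```

In practice a proper KSVD column update (SVD of the restricted residual) converges faster, but the alternating structure is the same as in the text.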
11.6

B B
A A:

11.6

( Nyquist )

compressive sens-
(compressed sensing) [Donoho, Candes et al ., 2006]

m
n y, n m,

y= (11.19)

G z if

z
®
“No” n m y
(11.19
® G Mmxm , C s, y

y As ( 11.20)

e Rnxm . s, x=
A
r.
y
^s
(11.20) (11.20) s
s
(11.20)
A
S

11.5

(Restricted Isometry Property, RIP) [Candes, 2008].


n x m (n m) A 4 (0,1)
s A Afc e

^
(1 5fc|
|s| |
)| |
| | (1
AfcS| )||
/c| s|2 ( 11.21)

A A: AxRIP).
y s, ®

mjn|
||
s|0 ( 11.22 )

s.t . y = As .

(11.22) L0 NP h
Lo [Candes et al., 2006 j ,

min|
|s||i (11.23)
s.t . y = As .

h (11.23)
11.6

LASSO (Basis

-
Pursuit De Noising) [Chen et al 1998].

collaborative filter-
ing)

11.1
5

11.1

11.1

11.1

11.1

( matrix completion) [Candes and Recht ,2009]

min rank(X) (11.24)

-
s.t . (X) = (A)ij, ( i , j ) e O,

X
^
rank(X) A 11.1
Q A (A) ,,- i j)
(11.24) (X)

( trace norm ).
{XG |
(11.22)
Rmxn :|
| |
X
(11.24)
| 1}
NP
^ rank(X)
(nuclear norm )

min{m n}

l|x||* = X (x) (11.25)

aj ( X ) X

min|
x ||
X|* (11.26)

s.t . (x.)ij = ( A ) G rj.

(11.26) (Semi-Definite Programming,


B.3.
SDP
SDP) A r n m,
2
0 { mr \og m ) A [Recht,2011].

11.7

[Narendra and Pukunaga,


[Pudil et al., 1994]
AIC ( Akaike Information Criterion) [Akaike, 1974] [Blum and
Langley,1997] [Forman,2003]

[Kohavi and
John , 1997], [Weston et al., 2003],
ID3
[Quinlan, 1986]. [Yang
and Pederson , Jain and Zongker , 1997].
[Guyon and Elisseeff , Liu et al., 2010],
11.7

[Liu and Motoda, 1998, 2007].


LARS (Least Angle Regression) [Efron et al., 2004]
LARS .

L A S S O [Tibshirani, 1996] LARS


LASSO Group LASSO [Yuan and
Lin , 2006] Fused LASSO [Tibshirani et al., 2005]
LASSO ( Elastic
Net ) [ Zou and Hastie, 2005].
[Aharon et al., 2006],

group sparsity), (group


sparse coding) [Bengio et al. 2009].
[Mairal et al., Wang et al., 2010].

[Donoho, 2006; Candes et al., 2006]


[Candes et al., 2011] [Recht
et al , 2010]. [Baraniuk , 2007] Lo
h LASSO
Basis Pursuit ) [Chen et al., 1998] >

Matching Pursuit )[Mallat and Zhang, 1993] [L i u and Ye, 2009]


SLEP
(http: www.yelab.net /software/SLEP/).
3.0 p.84
11.1 Relief 3.0

11.2
-
Relief F

11.3 Relief

11.4 LVW

11.5

11.6

11.7 Lo
11.8 In (11.14)

11.9

11.10* (11.15),
Aharon, M., M. Elad, and A. Bruckstein. ( 2006) . “K-SVD: An algorithm for
designing overcomplete dictionaries for sparse representation.” IEEE Trans-
actions on Image Processing ^ 54(11) :4311-4322.
Akaike, H. (1974) . UA new look at the statistical model identification.” IEEE
Transactions on Automatic Control , 19 (6):716-723.
Baraniuk , R. G. ( 2007). “Compressive sensing.” IEEE Signal Processing Mag-
azine , 24(4):118-121.
Bengio, S., F. Pereira, Y. Singer , and D. Strelow. ( 2009). “Group sparse cod-
mg. In Advances in Neural Information Processing Systems 22 ( NIPS ) (Y.
Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta,
eds.) , 82-89, MIT Press, Cambridge, MA.
Blum , A. and P. Langley. (1997). “Selection of relevant features and examples
,
in machine learning.” Artificial Intelligence 97(1-2 ) :245 271.
-

Boyd , S. and L. Vandenberghe. (2004). Convex Optimization. Cambridge Uni-


versity Press, Cambridge, UK .
Candes E. J. ( 2008). “The restricted isometry property and its implications for
compressed sensing.” Comptes Rendus Mathematique , 346 ( 9-10) :589-592.
Candes, E. J., X. Li, Y. Ma and J. Wright. ( 2011). “Robust principal compo-
nent analysis?” Journal of the ACM ^ 58 (3) :Article 11.
Candes, E. J. and B. Recht. (2009). “Exact matrix completion via convex op-
,
timization. 5 Foundations of Computational Mathematics , 9(6):717-772.
Candes, E. J., J . Romberg, and T. Tao. (2006 ) . “Robust uncertainty principles:
Exact signal reconstruction from highly incomplete frequency information.”
IEEE Transactions on Information Theory , 52 (2):489-509.
Chen , S. S., D. L. Donoho, and M. A. Saunders. (1998). “Atomic decomposition
by basis pursuit .” SIAM Journal on Scientific Computing , 20(1):33-61.
Donoho, D. L. (2006) . “Compressed sensing.” IEEE Transactions on Informa-
tion Theory, 52 ( 4) :1289-1306.
Efron, B., T. Hastie, I. Johnstone, and R. Tibshirani. ( 2004) . “Least angle
regression.” Annals of Statistics , 32 ( 2):407-499.
Forman , G . ( 2003) . “An extensive empirical study of feature selection metrics
for text classification.” Journal of Machine Learning Research, 3:1289-1305.
Guyon , I. and A. ElisseefF. ( 2003) . “ An introduction to variable and feature
selection.” Journal of Machine Learning Research, 3:1157-1182.
Jain , A. and D. Zongker. (1997). “Feature selection: Evaluation , application ,
and small sample performance.” IEEE Transactions on Pattern Analysis
and Machine Intelligence , 19 ( 2 ) :153-158.
Kira, K. and L. A. Rendell. (1992). “The feature selection problem: Tradi-
tional methods and a new algorithm.” In Proceedings of the 10th National
Conference on Artificial Intelligence ( AAAI ), 129-134, San Jose, CA.
Kohavi, R. and G. H . John. (1997). “Wrappers for feature subset selection.”
Artificial Intelligence , 97(1-2) :273-324.
Kononenko, I. (1994) . “Estimating attributes: Analysis and extensions of RE-
LIEF.” In Proceedings of the 7th European Conference on Machine Learning
( ECML ) , 171-182, Catania, Italy.
Liu, H . and H. Motoda. (1998) . Feature Selection for Knowledge Discovery and
Data Mining. Kluwer , Boston, MA.
Liu , H. and H. Motoda. (2007) . Computational Methods of Feature Selection.
Chapman & Hall / CRC , Boca Raton , FL.
Liu , H., H. Motoda, R. Setiono, and Z. Zhao. ( 2010). “Feature selection: An
ever evolving frontier in data mining.” In Proceedings of the J th Workshop
^
on Feature Selection in Data Mining ( FSDM ) , 4-13, Hyderabad, India.
Liu, H. and R. Setiono. (1996 ). Feature selection and classification — a prob-
abilistic wrapper approach.” In Proceedings of the 9th International Con-
ference on Industrial and Engineering Applications of Artificial Intelligence
and Expert Systems ( IEA/ AIE ) , 419-424, Fukuoka, Japan.
Liu, J . and J . Ye. ( 2009) . “Efficient Euclidean projections in linear time.”
In Proceedings of the 26th International Conference on Machine Learning
( ICML ) , 657-664, Montreal, Canada.
Mairal, J ., M. Elad , and G. Sapiro. ( 2008) . “Sparse representation for color
image restoration.” IEEE Transactions on Image Processing , 17 (1) :53-69.
Mallat , S. G. and Z. F. Zhang. (1993) . “Matching pursuits with time-frequency
dictionaries.” IEEE Transactions on Signal Processing , 41(12 ):3397-3415.
Narendra, R M. and K. Fukunaga. (1977) . “A branch and bound algorithm
for feature subset selection.” IEEE Transactions on Computers, C- 26 ( 9 ):
917-922.
Pudil, P., J . Novovicova, and J. Kittler. (1994) . “Floating search methods in
feature selection.” Pattern Recognition Letters , 15(11) :1119-1125.
Quinlan, J. R. (1986) . “Induction of decision trees.” Machine Learning , 1(1) :
81-106.
Recht , B. ( 2011). “ A simpler approach to matrix completion.” Journal of
Machine Learning Research , 12:3413-3430.
Recht , B., M. Fazel, and P. Parrilo. ( 2010). “Guaranteed minimum-rank so-
lutions of linear matrix equations via nuclear norm minimization.” SIAM
Review, 52 (3):471-501.
Tibshirani, R. (1996 ) . “Regression shrinkage and selection via the LASSO.”
Journal of the Royal Statistical Society - Series B, 58(1):267-288.
Tibshirani, R. , M. Saunders, S. Rosset , J . Zhu, and K . Knight. ( 2005). “Spar-
sity and smoothness via the fused LASSO.” Journal of the Royal Statistical
Society
-
Series B , 67(1) :91-108.
Tikhonov A N. and V. Y. Arsenin , eds. (1977) . Solution of III - Posed Problems.
, .
Winston , Washington , DC.
Wang, J ., J . Yang, K . Yu, F. Lv, T. Huang, and Y. Gong. ( 2010). “Locality-
constrained linear coding for image classification.” In Proceedings of the
IEEE Computer Society Conference on Computer Vision and Pattern Recog -
nition ( CVPR ) , 3360-3367, San Francisco, CA.
Weston, J. , A. Elisseff , B. Scholkopf , and M. Tipping. (2003). “Use of the zero
norm with linear models and kernel methods.” Journal of Machine Learning
Research, 3:1439—1461.
Yang, Y. and J. O. Pederson. (1997) . “A comparative study on feature selection
in text categorization.” In Proceedings of the 14 th International Conference
on Machine Learning ( ICML ) , 412-420 , Nashville, TN.
Yuan, M. and Y. Lin. (2006 ) . “Model selection and estimation in regression
with grouped variables.” Journal of the Royal Statistical Society - Series B,
68(1):49-67.
Zou, H.and T.Hastie.(2005). “Regularization and variable selection via the

-
elastic net .” Journal of the Royal Statistical Society Series JB, 67(2):301-
320.

• (Stanislaw Ulam , 1909-1984)

( Lviv)

1867 1918


ENIAC

Nicolas
Metropolis- Hasting
Metropolis
12.1

(computational learning theory)

{( x i ,y i ),( x2,V 2 ),• • • (xm,ym )}, Xi e X ,


rney =
D
( independent and identically distributed, i.i.d.)
/i Y y

E(h\ V ) Px v{h( x ) y) ( 12.1)


^
D
no

^ ^
(h;D) — l(h( x i ) yi ). ( 12.2)
i= l

D D
E(h;D)
e E( h) E(h) e

D D D
disagreement)

.
"
h ') 2)( i( ® ) 2( ® )) (12.3)

^
Several standard inequalities will be used in the rest of this chapter:

• Jensen's inequality: for any convex function $f(x)$,

$$f(\mathbb{E}(x))\leqslant\mathbb{E}(f(x))\;. \qquad (12.4)$$

• Hoeffding's inequality [Hoeffding, 1963]: if $x_1,x_2,\ldots,x_m$ are $m$ independent random variables with $0\leqslant x_i\leqslant 1$, then for any $\epsilon>0$,

$$P\!\left(\frac{1}{m}\sum_{i=1}^{m}x_i-\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}(x_i)\geqslant\epsilon\right)\leqslant\exp(-2m\epsilon^2)\;, \qquad (12.5)$$

$$P\!\left(\left|\frac{1}{m}\sum_{i=1}^{m}x_i-\frac{1}{m}\sum_{i=1}^{m}\mathbb{E}(x_i)\right|\geqslant\epsilon\right)\leqslant 2\exp(-2m\epsilon^2)\;. \qquad (12.6)$$

• McDiarmid's inequality [McDiarmid, 1989]: if $x_1,x_2,\ldots,x_m$ are $m$ independent random variables and, for every $1\leqslant i\leqslant m$, the function $f$ satisfies

$$\sup_{x_1,\ldots,x_m,\,x_i'}\big|f(x_1,\ldots,x_m)-f(x_1,\ldots,x_{i-1},x_i',x_{i+1},\ldots,x_m)\big|\leqslant c_i\;,$$

then for any $\epsilon>0$,

$$P\big(f(x_1,\ldots,x_m)-\mathbb{E}(f(x_1,\ldots,x_m))\geqslant\epsilon\big)\leqslant\exp\!\left(\frac{-2\epsilon^2}{\sum_i c_i^2}\right)\;, \qquad (12.7)$$

$$P\big(|f(x_1,\ldots,x_m)-\mathbb{E}(f(x_1,\ldots,x_m))|\geqslant\epsilon\big)\leqslant 2\exp\!\left(\frac{-2\epsilon^2}{\sum_i c_i^2}\right)\;. \qquad (12.8)$$
12.2 PAC

( Probably Approximately
Correct , PAC) [Valiant , 1984].

c (concept ), y
r (o0
C y c
( concept
class), C

ii hypothesis space),
1.3
. C
12.2 PAC 269

Z i e W,
hypothesis).
ce H
(separable),
(consistent ); c

-
non consistent ). *
-
non separable),

c. c
D
1.4

P D

12.1 PAC ( PAC Identify): e ,J ceC


V hen

P( E(h) 1 -5 (12.9)

PAC C.

- 5) c
e).

12.2 PAC PAC Learnable): m


D
m e
, poly( l /e ,l /(5 size(a?) size(c)), i g
size(cc)
poly( *

PAC C,
m
C
^
PAC
size(c)
C PAC
12.3 PAC PAC Learning Algorithm):
C PAC poly(l/6,1/5,
,
size(a;) size(c)), C PAC efficiently PAC learnable)
C PAC

12.4 (Sample Complexity): PAC


m poly(l /e,1/5,size(x ),size(c)) m,

PAC

PAC .W il
PAC W C,
PAC properly PAC learnable);

C. W
.

12.3

12.3.1
c m
D
.
• D c
c P
C. D
P
12.3 ’

K a
D
D
c PAC
D
e
- e

e P

- P ( h{ x )

^
P ( h( x ) = y ) = 1 y)
E( h)

— *
. ( 12.10)

Since $D$ contains $m$ examples drawn independently and identically from $\mathcal{D}$, the probability that such an $h$ is consistent with $D$ is

$$P\big((h(\boldsymbol{x}_1)=y_1)\wedge\cdots\wedge(h(\boldsymbol{x}_m)=y_m)\big)=\big(1-P(h(\boldsymbol{x})\neq y)\big)^m<(1-\epsilon)^m\;. \qquad (12.11)$$

We do not know in advance which hypothesis in $\mathcal{H}$ the learning algorithm will output, so it suffices to ensure that the total probability of any hypothesis with generalisation error greater than $\epsilon$ appearing consistent with the training set is at most $\delta$:

$$P\big(h\in\mathcal{H}:E(h)>\epsilon\;\wedge\;\hat E(h)=0\big)<|\mathcal{H}|(1-\epsilon)^m<|\mathcal{H}|\,e^{-m\epsilon}\;. \qquad (12.12)$$

Requiring (12.12) to be no greater than $\delta$, i.e.

$$|\mathcal{H}|\,e^{-m\epsilon}\leqslant\delta\;, \qquad (12.13)$$

gives

$$m\geqslant\frac{1}{\epsilon}\Big(\ln|\mathcal{H}|+\ln\frac{1}{\delta}\Big)\;. \qquad (12.14)$$

Hence every finite hypothesis space $\mathcal{H}$ is PAC learnable in this setting, with the sample complexity given by (12.14); the generalisation error of the output hypothesis converges to $0$ as the number of examples grows, at rate $O(1/m)$.
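A one-line computation of the sample-complexity bound (12.14): for a finite hypothesis space in the realisable case, m ≥ (1/ε)(ln|H| + ln(1/δ)) examples suffice. The numbers below are arbitrary.

```python
import math

def pac_sample_size(H_size, eps, delta):
    """Smallest integer m satisfying Eq. (12.14) for a finite, realisable hypothesis space."""
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / eps)

# e.g. |H| = 2**20 boolean hypotheses, accuracy eps = 0.05, confidence 1 - delta = 0.99
print(pac_sample_size(2 ** 20, eps=0.05, delta=0.01))   # 370
```

Note the dependence is only logarithmic in |H| and 1/δ but linear in 1/ε, matching the O(1/m) convergence rate stated above.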
12.3.2
c
h e n, E{ h) o, W
Hoeffding

12.1 D m D
e hGH,

P E( h) — E{h) — e) exp( 2me2) (12.15)


P[E{h) — E(h) e) exp(—2me2) (12.16)

[ - ) —
p \ E{h) E(h)\ e ( 2 exp( 2m 2 ) .

^
(12.17)

12.1 D m P
e h e H , (12.18) -
In (2/5) In (2/S )
E(h) ~
2m
< E[h)( E(h)
2m
(12.18)

12.1 m /i

12.1 w h e n,

ln| + ln(2/5) 5. (12.19)


2m

...

P(3h e n \ E{h) - E{h)\ > e)


[ - Ehl \ > e) V . . . V ( \ E h n
= p ( \ Ehl \ \ ||
H I )
e)

^ j2 p ( \ Ew - m\ > e )
hen

(12.17)

hen
P( \ ^W E(h)\ > e) < 2|W
| exp( 2me2)
-
12.4 VC 273

5
(

2 \H \ exp( 2me2) (12.19).

W
e
K .K
PAC C W
agnostic learning).

12.5 PAC (agnostic PAC learnable): m


D e
D poly(.
m poly(l/e,l/5,size(x),size(c)), (12.20)

^
P(E{h) minE ) e) 1 -6 ( 12.20 )

H PAC

PAC
poly(l/e,1 5 size size(c)) PAC
W PAC
m

12.4 VC

“VC Vapnik -
Chervonenkis dimension) [Vapnik and Chervonenkis, 1971].
VC (growth function)
dichotomy) (shattering).
H D ={Xl ,x2,...,xm}, n
D

h\ D {(ft{xi ),h{x2),...,h{xm ))}.


m H D
D

12.6 m e N, W
N

UH(m) = max \ {h(xi ),...,h(xm))


{ h e H }\ . (12.21)

Un(m) m

12.2 m G N, e hen
[Vapnik
and Chervonenkis, 1971].
(E
P| /i|
) >e ) < 4nK(2m)exp( -l). (12.22)

W D
W D
m
D
D D
D Un(m) = 2m , D

VC
12.7 VC K

2m}.

^
V C(? ) max {m n (m) (12.23)

^
^
VC( ) = d
W H
VC P
VC
VC d
d W VC d.
VC
12.4 VC 275

12.1 [a ,6]
b}, M = R . +1,

^
[a ,6] : CL,b e R, a x G [a 6], h[a (x) =
"
{ x 1

^
h [aM { x ) 1. ci 0.5, X 2 1.5, W

"
~^^
{ [0 1] 0,2] [1 2] [2,3]} VC
{a 3,x 4 , 5}, x3 X 4
/l[a,fe]

12.2
^ -
{(a 3,+),(a 4 , ),(X5 ,+)}. VC 2.

Y = R2 . 12.1
H
VC 3.

(b)

12.1 VC

12.7 VC 12.2
[Sauer , 1972]:
“S a u e r 12.2 VC m eN

Hw ( m )

m
t d d
(12.24)

m - l , d - 1) m
- l d) _D { x i , X 2 ,...,«m},
D ={iCi •• . l }

^i\ D = { ( h ( x i ) , h ( x2 ) h ( xm) ) \ h e n } ,
n|D’ = { { h ( x i ) , h ( x 2 ) h ( «m-i ) ) \ h e 7i } .

n +i
h &
^ Xm
W
| D W|D
K|iy

. . . Vm — l ) 3 /l /i G
^Dr \ D

=hi
^ ) — JJij A
i 7 ^ h (a^ ))
TTj 1
^ ^ TTl •

^
y|jD H|D H
| zy

\^ \ D \ = l%|_D’ l I^D DI (12.25)


^
D’ 771 - 1,

(12.26)
i
— O

HD \ D Ujy U { xm }

'
D
U \D VC d |D VC d

(12.27)
i=0

(12.25) (12.27)

i=0 V i =0 V 7

r -1
.i) = o-

E (xT)
i=0 7
>

D 12.2 •
12.2

12.2 VC d m d

e
(12.28)
_
12.4 VC 277

n« M (7)
d
m d-i
(7
^=
i 0

m d J.
d

d \i
(7
m
^ d.
d
^
i =0
m
m

2 =0
m\ d d m
d
i+
m —
e • m\ d
d

12.2 12.2 VC

12.3 VC d, m J
hen

Min + 81nf

^
P E( h ) — E ( h)
m
1— S . (12.29)

—f ) < 4(t

^
4 I (2m)exp( exp )

e=
8dln + 8 1nf
m

12.2, 12.3

12.3 (12.29) m
P D VC

-
distribution free)
-
data independent )
h h

(12.30)

Empirical Risk Minimization, ERM)

12.4 VC W PAC

E(g ) min E(h) • (12.31)


hen

{ (In2m e
2
(12.32)

12.1
|

^
E( g ) E(g)
E(g )

1
_ J /2

8dln + 8 1nf e
m 2
(12.34)

12.3

6
P
^ -
2

E h) - E(g)( E(h) --
= E(h) - E(g )+ e
12.5 Rademacher 279

1 -6 m,
12.4 ■

12.5 Rademacher

12.4 VC
VC
VC

Rademacher Rademacher complexity)


VC
-
H. Rademach

-
er (1892 1969).
P {( xi ,yi ), ... (® m ,2/m )}
I Iv
?

"

^^
I { h( x i ) yi )

yih { x i )
E
2 2m 2 y i h( x i ) (12.36)

ET=i h(Xi )
f e {l 2, YALI ViKxi )

arg max
hen rn
^
2 y i h{ x i ) . (12.37)

(xuyi),
Vi

0.5 0.5
Rademacher (12.37)
K
sup (12.38)

(12.38)

Eo- SUp (12.39)


lhen

a = {ai , J2 , .. .
(
5 m}.
(7 (12.39) [0, 1] ,
|W|
(12.39) |W| H D (7

h{ xi ) = ai (i • •• m) r (12.39) 1.

R. Z { zuz 2 , ... , zm } , A e
(12.39) Z

12.8 7
*

Z Rademacher

Rz { ) = Eo sup
^ - i= l
• (12.40)

Rademacher 7 Z
P
P m Z

12.9 7 2"
*

D Rademacher

ZC.Z \ \ Z \=m (12.41)
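The empirical Rademacher complexity of Definition 12.8 (Eq. 12.40) can be estimated by Monte Carlo over the random signs σ. The sketch below does this for a finite function class given as a matrix of function values on the sample Z; both the class and the sample are toy assumptions for illustration.

```python
import numpy as np

def empirical_rademacher(F_values, n_sigma=2000, seed=0):
    """Estimate R_hat_Z(F) = E_sigma[ sup_f (1/m) sum_i sigma_i f(z_i) ]  (Eq. 12.40).
    F_values has shape (num_functions, m): row k holds f_k(z_1), ..., f_k(z_m)."""
    rng = np.random.default_rng(seed)
    m = F_values.shape[1]
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))   # uniform Rademacher variables
    # for each sigma draw, take the supremum over the (finite) function class
    sups = (sigma @ F_values.T / m).max(axis=1)
    return sups.mean()

# toy class: 50 random {-1,+1}-valued functions evaluated on m = 100 points
rng = np.random.default_rng(1)
F = rng.choice([-1.0, 1.0], size=(50, 100))
print(empirical_rademacher(F))
```

Averaging such estimates over independent draws of Z would approximate the (distribution-dependent) Rademacher complexity of Definition 12.9.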

Rademacher [Mohri et al .
2012]:

12.5 [0,1], D Z
Z { Z m } , Zi e Z , 0 < S < 1, G J7,
12.5 Rademacher 281

1 -6
m A
y in(2im/5)
E [/ W ]
^ / (^) + 2^m (7 ) + (12.42)
'
-
i= l
m
^ ln(2/(5)
E -
E
= ^ ^
/ )+2 (
i l
j r) +3 2m
(12.43)

771

=; E/
$(Z) = sup E
feT
[f] - E z( f ) .
Z G Z G

- ( Z ) = ( supE [ f ]
- Ezr( )) ( supE[/] - Ez(/))
^ ^
(Z ) f ~

^
< sf euTp S z ( f ) - S z ( f )
= sup
/( m) — f{zm )
feT m


m

HZ) - ( Z’ K
m

McDiarmid (12.7) 6 G (0,1),

ln ( l / 5)
(12.44)
$ ( Z)
^ Ez [ $ ( Z) ]
2m
-5 ( Ez[$(Z)]

Ez [^ { Z ) ] = Ez \ sup E /
feT
- Ez ( f ) ]
Ez [ supEz [ Ezr ( f ) - z ( /) ] ]
/

^
(12.4)
Jensen
zz L /eF 1
)

-- $1( (4)
m
= SUP / /

,1
- fe
^ i=l

= ]Ea z [LfeSUPr — i=1


)
m m
1 1
+ Ecr,Z SUp V" - Tif ( Zi )
(

tTj
_ <7i = 2E(r Z }
sup
L /eF m
i==l
m

i= l

= 2KJ7) .

(l2.42) 12.9 Z RziJ7)


1/m. McDiarmid (12.7)

ln(2/5)
Rm { ^F ) Rz { ^F ) 2m
(12.45)

- 5/2
( (12.44)

\n( 2 / 5 )
(Z) EzMZ ) ) 2m

ln { 2 / S )
$(Z) 2RZ {T ) (12.46)
2m

- (12.43)

12.5 1]
12.5
12.5 Rademacher 283

12.6 P Af
{x i ,x 2,...,xm}, Xi e S h eH ,

-
^
1

ln(l/5)
E(h) E(h) Rm{n) 2m
(12.47)
ln(2/5)
E( h )
^ E { h ) + RD { ) + Z n 2m
(12.48)

H :Z Xx
/i
fh( z ) fh[ x y) I(h( x) y) (12.49)

[0,1]
Jrn ={ fh h e H }.

Rz { n )
^ sup Vi )
- f h e T n m 1i=l
Ecr sup — V CTil( ( ) ^
' h Xi yi )

= ^ <T
lhenm
sup

1 ^ IIV

V] H 1 (
yih( Xi )
2

y ai + s y^ (
2

1
E<T -
lmU
1 "
m
^ \i p — Jih{ x i ) )
- yi(

— V yiGih{ Xi ) )1
2 ^ lhenm
7 a sup

^
K

m
1 1
7
^ CT sup V" (aih( xi ))
2 vhenm
^ (12.50)

(12.50)

Rmi^n ) = --R m (H ) . (12.51)

12.6
12.6 Rademacher 12.3
VC Rademacher
(12.47) D (12.48) D
Rademacher
VC

Rademacher

[ Mohri et 12.7 Rademacher Rm {V)


al 2012].

nK(m)
In (m )
(12.52)

(12.47) ( l 2.52) l2.2

E( h) ( E( h) -
2dln , IH 1 / 5 ) (12.53)
^ 2m

Rademacher VC

12.6

VC Rademacher

stability/

D = { zi = { xi , y1 ) , z2 = (® m 2/m )} G
P W
eH D
P
12.6

• DV D i

. ••
^i+ l
* • ••

• D i

D1 • i —1 i i+1 ,•
• • jzm }

x
- X> JD.

e( D( x),y ) y x y R+

^
£D


^
t{Z D,z ).

£ { 2 ,V ) = [K ^ D , Z ) ] . ( 12.54)
^ xex ,z={ x ,y )


^ D) = -
m

YJ= ^
i l
D . zi ) . ( 12.55)

• - -
(leave one out )
- m
eloo (
^ D) = -
TTt J2 ^ Dv ^ ) i ( 12.56 )

F uniform stability):

12.10 z= y) i

^
\ t{ Z D\ ,z )\ 1,2,..., m

^
D ,z ) i z ( 12.57)

\ ,z) - , z ) \ \ e( 2Di , z ) - £( 2D\i , z ) \


( 23
^/ D
^ D\i
D Z 2/)
[Bousquet and Elisseeff , 2002]

[Bous
quet and Elisseeff, 2002].
- 12.8 D m
D,
M, J m —
qn ( l / )
^
_
V D) (4m/3 M ) 2m
(12.58)
ln(l /0
£(£,P)

^
Uil £)) + + (4m + M) 2m
(12.59)

12.8 £
(12.58)

^
0( ) ,
VC Rademacher
12.3 12.6

^
suP/ie% { EW - Eih)
)

ln(l /(5)
P) ?(£,D) m (4 M) 2m
(12.60)

Empirical Risk Minimization) ERM

12.9 ERM W

g U
12.7

£ ( g ,V ) = min £ ( h ,V ) .
hen


e
2

= 2 exp (-2m(e’) ) 2

Hoeffding (12.6) m|
> In

J /2 (12.60)

ln(2/(J ) e
m +( + )
4 M
2m

m = o( In
*) D) < D) +

- 5/2
(


£ ( 2 , V) £ ( g , V) D) ( i( g , D )
^ T( 2, D ) - ?( g , D ) + e

—J 12.9

12.7 •

[Valiant ,1984] PAC


[Kearns and Vazirani 1994]
(COLT).
vc VC [Vapnik and Chervonenkis, 1971]
Sauer [Sauer , 1972] [Vapnik
and Chervonenkis, 1971] [Shelah, 1972]
VC Natamjan
[Natarajan , Ben-David et al., 1995].
Rademacher [Koltchinskii and Panchenko, 2000]
[Bartlett and Mendelson, 2003] [Bartlett et al., 2002]
Rademacher
[Bousquet and Elisseeff , 2002]
[Mukherjee
et al., 2006] [Shalev-Shwartz et al., 2010] ERM ERM
ERM [Shalev-Shwartz
et al., 2010] AERM ( Asymptotical Empirical Risk Minimization)

deterministic)
z
stochastic)

[Devroye et al., 1996].


12.1 Jensen

12.2 12.1.

2e
_ 2me2
12.3 12.1.

12.4 VC 1.

12.5 VC

12.6 VC

12.7 VC

12.8 c Rademacher 0.

^
Rmi^i + 2)
*

12.9 Ji Rademacher

12.10*
Bartlett , P. L., O. Bousquet , and S. Mendelson. ( 2002 ) . “Localized Rademacher
complexities.” In Proceedings of the 15th Annual Conference on Learning
Theory ( COLT ), 44-58, Sydney, Australia.
Bartlett , P. L. and S. Mendelson. (2003) . u Rademacher and Gaussian com-
plexities: Risk bounds and structural results.” Journal of Machine Learning
Research, 3:463-482.
Ben-David , S., N. Cesa-Bianchi, D. Haussler , and P. M. Long. (1995) . “Charac-
terizations of learnability for classes of {0, , n}-valued functions.” Journal
of Computer and System Sciences , 50 (1):74-86.
Bousquet , O. and A. ElisSeeff . (2002 ) . “Stability and generalisation.” Journal
of Machine Learning Research, 2:499-526.
Devroye, L., L. Gyorfi and G. Lugosi, eds. (1996). A Probabilistic Theory of
Pattern Recognition. Springer , New York, NY.
Hoeffding, W. (1963). “Probability inequalities for sums of bounded random
variables.” Journal of the American Statistical Association, 58(301):13-30.
Kearns, M. J . and U. V. Vazirani. (1994) . An Introduction to Computational
Learning Theory. MIT Press, Cambridge, MA.
Koltchinskii, V. and D. Panchenko. (2000) . “Rademacher processes and bound-
ing the risk of function learning.” In High Dimensional Probability II ( E.
Gine, D. M. Mason, and J. A. Wellner , eds.) , 443-457, Birkhauser Boston,
Cambridge, MA.
McDiarmid , C. (1989 ) . “On the method of bounded differences.” Surveys in
Combinatorics , 141(1):148-188.
Mohri , M., A. Rostamizadeh , and A. Talwalkar , eds. ( 2012). Foundations of
Machine Learning. MIT Press, Cambridge, MA.
Mukherjee, S., P. Niyogi, T. Poggio, and R. M. Rifkin. ( 2006 ) . “Learning the-
ory: Stability is sufficient for generalization and necessary and sufficient
for consistency of empirical risk minimization.” Advances in Computational
Mathematics, 25(1-3):161-193.
Natarajan , B. K. (1989). “On learning sets and functions.” Machine Learning ,
4(1):67-97.
Sauer , N. (1972 ) . “On the density of families of sets.” Journal of Combinatorial

-
Theory Series A , 13(1):145-147.
Shalev-Shwartz, S., O. Shamir , N. Srebro, and K . Sridharan. ( 2010) . “Learn-
,
ability, stability and uniform convergence.5 Journal of Machine Learning
Research, 11:2635-2670.
Shelah , S. (1972 ) . “A combinatorial problem ; stability and order for models
and theories in infinitary languages.” Pacific Journal of Mathematics, 41
(1):247-261.
Valiant , L. G. (1984) . “A theory of the learnable.” Communications of the
ACM , 27(11) :1134-1142.
Vapnik, V. N. and A. Chervonenkis. (1971). “On the uniform convergence of
relative frequencies of events to their probabilities.” Theory of Probability
and Its Applications , 16 ( 2) :264-280.
TCS (Theoretical Computer
Science),
“P?=NP” .

(Leslie G . Valiant ,

ACM “ A theory of the


learnable” . PAC
ACM PAC

ACM
“ ACM Turing Award Goes to Innovator in Machine Learning”
13.1

A Z
labeled )
{x i+i ,Xi+2,.• .,X /+W}, l U U

unlabeled )
A Q

A
A
SVM ,

active learning), query)

“Yes!”
13.1

O > o

13.1

(semi-supervised learning).

(cluster assumption),
13.1

manifold assumption),
10.5

r,
13.2

( pure) ( transductive
learning),

13.2

> zMM y

13.2

13.2

( generative methods)

EM
EM 7.6
ye y {1, AT}

N
2 a i - p( x \
P{ x ) =
^ (13.1)

sill ai = p{ x I T i
9.4
A Si
f(x) G y ® ,
G {1 2,...,iV} x

f(x) arg maxp(?


jey
j x)
N
= arg max = j,0 = i \ x)

N
= arg max p(y = j \ S = i,x) •p(@ = i \ x) (13.2)
^

p{x|
p = i \ x) = N (13.3)
I Mi ?

:c i y j
i j
j ©

^
py j \ Q = i) i i
^p(y = j \ Q = i) = l i j p(y j Q i) 0.

= j \ 0 = i,x)
= i \ x)

(13.2)

A {(aJUi) ...
13.2

Du = ... Xl+U } , u = m.

N }, A U Du

L L( Di U Du ) = E
( x j ,y j ) eDi
In
(s oti p{ x j| j 0 = i,X j )
\i i , Si) p( y |

+ E - In
XjED
(s Oi
^ p( X j I . (13.4)

(13.4)
EM
EM 9.4

•E
p{ *
Iji = N
*

^j • (13.5)
E1 a i P ( X j I Mi
2 =

•M i

Mi = (13.6)
® jf
E
G -Oti
+ Xj
^ D- ^ jixj ( x j ,yj ) eDiAyj=i
xj

5] =

^ 3? G ^ +h
0
Jf G D\L

( x j — P i )( x j — Pi ) T
9 (13.7)
lxj y j ) GDi /\ yj =i

oti —m ®J G Dy, J
I (13.8)

(13.3) (13.2)

[Miller and Uyar ,


1 9 9 7], [ Nigam et al., 2000]
_
[Cozman and Cohen , 2002].
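The update equations (13.5)-(13.8) can be turned into a compact EM loop. The sketch below does this for a Gaussian mixture with one component per class, letting labelled samples contribute hard assignments and unlabelled samples contribute through their posterior responsibilities; the initialisation from the labelled data, the fixed iteration count, and the use of scipy's multivariate_normal are implementation assumptions, not part of the text.

```python
import numpy as np
from scipy.stats import multivariate_normal

def semi_gmm_em(Xl, yl, Xu, n_iter=50):
    """EM for a class-conditional Gaussian mixture using labelled (Xl, yl) and unlabelled Xu."""
    classes = np.unique(yl)
    N, d = len(classes), Xl.shape[1]
    alpha = np.array([(yl == c).mean() for c in classes])
    mu = np.array([Xl[yl == c].mean(axis=0) for c in classes])
    cov = np.array([np.cov(Xl[yl == c].T) + 1e-6 * np.eye(d) for c in classes])
    for _ in range(n_iter):
        # E-step (Eq. 13.5): responsibilities of the unlabelled samples
        dens = np.column_stack([alpha[i] * multivariate_normal.pdf(Xu, mu[i], cov[i])
                                for i in range(N)])
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step (Eqs. 13.6-13.8): labelled samples count with weight 1 for their own class
        for i, c in enumerate(classes):
            Xc, w_u = Xl[yl == c], gamma[:, i]
            total = w_u.sum() + len(Xc)
            mu[i] = (w_u @ Xu + Xc.sum(axis=0)) / total
            du, dl = Xu - mu[i], Xc - mu[i]
            cov[i] = ((du * w_u[:, None]).T @ du + dl.T @ dl) / total + 1e-6 * np.eye(d)
            alpha[i] = total / (len(Xl) + len(Xu))
    return alpha, mu, cov
```

Prediction then follows (13.2)-(13.3): assign a new sample to the class whose weighted component density is largest.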

13.3 SVM

Semi-Supervised Support Vector Machine ,


SVM 6
S3VM)
S3VM

13.3 ( low-density separation )

S3 VM

• • •- •

SVM

13.3

TSVM (Transductive Support Vector


Machine) [Joachims , 1999]. SVM TSVM
TSVM ( label
assignment ),
13.3 SVM

A (»2 ,y2),
® Z+2 ,• • • ® Z+w} G{ Z I \ u = m. TSVM
~ ~

DU (U i+2,… U m e{- i ,+i},


i
mm
-2 \ H \ 22 + CiJ2
^ ^ Cu ^ ^
i=l
i (13.9)

s .t .
^ + )> ^ —
yii u) Xi 6 1 5 i 2, • • • Z

(wTXi + 6)
yi i= 1 - l - - l , l + 2,
\ • ••5 m

0 ? i = l 2 ...

(w,b)

^( = i /
^
+ l ,/ + 2,...,m)
=
Q Cu

TSVM (13.9)
SVM (13.9) Du
SVM (label assignment), SVM

-
pseudo label)
(13.9) SVM

Q TSVM
(13.9)

CU = Q
SVM
TSVM 13.4

SVM
13.4

(13.10) 3.6
A {( 1,2/1),(«2 , 2 ), ( z ,yz )};

^ ^ ^
• •
{ x i+1 , X i+2 , • • • X i .u }]
a. ^

-
A SVMZ
SVMz Ax { yi+ u y i+2,...,yi+u)
Cu Cl;
while Cu < Ci do
5: (l3.9) (w , b),
A
while 3{ i , j \ < 0) A > 0) A > 0) A ^ 2 ) } do
6:

Vj = -Vj\
j
^
DhDu , y , CuCn (13.9),
end while
11: Cu = mm{ 2Cu , Ci }
12: end while
(UZ+2 .. . U
13.4 TSVM

+ = - Q-

^
(13.10)

13.4 6-1 0

(13.9),
[Joachims, 1999].
(13.9) .

SVM
(graph
kernel) LDS [Chapelle and Zien,
meanS3VM [Li et al., 2009]

13.4

(strength)
13.4

A {(® l yi ) y2) } { +1, +2,...,

^^
2
I U I U m. G ( F E),
{xi,...,X/ ,CC/+1 ,...,xi+u}, E affinity
matrix),

ex \ \ Xi - X j \\ l
2&1 j
(W )y = (13.11)
0 otherwise

i , j G {l , 2 , . . . , m}, a > 0

{V,E) G f : V R,
yi sign(/(xj)), yi e{-l,+l}.
(energy function) [Zhu et al.,2003]: ^
m m
2
Ein )
i=l j=l

(m m m m

^
' + - X=2 (w)t/ / /
2 \i \
= j=l 2 1 j=l
m m m

^
- X=
=
^
i=l
2 dif 2{ xi )
i l j=l
(W )o / /(

= /T(D
- W)/ , (13.12)

/ = (//T /J )T 5 f l = (/(«1);/ ;• • •;/(% ))


2)

f{xi+2)] f{x i+u)) /


D diag(di, 2 ,
d ... d i+u )
w i
w i

f(x i) = y i (i = l,2,...,l),
/ A=D -W (Laplacian
matrix). Z W
W n W iU
Wui wu u
^
>11
D=

E( f ) = ( m) 0 Wn
wnZ wuu_
fi
fu
(13.13)

?
f ( Du
- Wu ) ft - 2 f ^ ^
Wutfi + f ( Duu - Wuu ) fu . (13.14)

d E( f )
df =0
'
.
fu = (Dun Wuu
^ Wulfl (13.15)

^
1
p = D 1W = D Oiu Wu Wz.
Oul I>Zu wm Ww
DfW DfWz.
..

- ^ ^^
" D-JW
(13.16)

P =D W Py = D- W , (13.15)

fu = ( Duu ( l -

= (i - D
^ w
1

_ m
1
= (I Puu ) Pui /z (13.17)

A = yi )
•••
'
(label propagation)
[Zhou et al., 2004].
A U G = (V , E ) ,
F { x i ,... ® z • W

^
D diag(di ,d2 ,..., +n )

^
Z+W x F (F? F V... FDT
F, ,
((F)il ,(F) 2 ,...,(F)w )
Vi argmaxi 3 | ( F) W
i m, j | F
13.4

if (1 i A ( yi = j )\
F (0) = (Y ) = (13.18)
^ otherwise.

Y / Z
W S D-iWD i, D - 5

diag

F(t 1) aSF ( t ) + ( 1 - a ) Y - (13.19)

a e (0,1) SF Y
(13.19)
_
F* = (1 - - aS ) 1Y (13.20)

F* { yw , yw , . . +u ). 13.5

A { ( x i , y i ) , (£C2 , 2/2 ) , ( x h y i ) }\
P Du {jCz +i X^+2 , •
^
"

••

a;
a.

a W; _
W S = D IWD
(13.18) F
t
repeat
F ( i + 1) aSF + ( 1 - a ) Y;
7: t = t+l
8 until F*
9: for i = l + 1, 1 + 2, . . . , 1 + u do
F* ) -
10
11:
Vi = axgmax
end for ^^ ^ ^ ^
d = d i, ..• m+u )
13.5

13.5 [Zhou etal,2004]


- -u
l\ l

E ( w ) ij ,- 2

" "
F + Fd (13.21)
"
11
2 ,i= Vdi
* i
(13.21) 13.5
"
M

(13.21)

F

(13.12)
(13.21)
(13.12)

0(m2),
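The iteration (13.19) and its fixed point (13.20) are easy to write down directly. The following sketch builds the Gaussian affinity matrix of (13.11), the symmetrically normalised matrix S = D^{-1/2} W D^{-1/2}, and iterates F ← αSF + (1−α)Y until convergence; it assumes the labelled samples are the first rows of X, and the bandwidth σ, the value of α, and the tolerance are illustrative choices.

```python
import numpy as np

def label_propagation(X, y_l, n_labeled, n_classes, sigma=1.0, alpha=0.99, tol=1e-6):
    """Iterative label propagation of Eq. (13.19); returns predicted labels for all samples."""
    m = X.shape[0]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                        # Eq. (13.11): no self-edges
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))                 # D^{-1/2} W D^{-1/2}
    Y = np.zeros((m, n_classes))
    Y[np.arange(n_labeled), y_l] = 1.0              # Eq. (13.18): one-hot rows for labelled data
    F = Y.copy()
    while True:
        F_new = alpha * S @ F + (1 - alpha) * Y     # Eq. (13.19)
        if np.abs(F_new - F).max() < tol:
            break
        F = F_new
    return F.argmax(axis=1)                         # label = arg max over the rows of F*
```

The closed form (13.20), F* = (1−α)(I−αS)^{-1}Y, gives the same answer but requires a matrix inversion, which is why the iterative form is often preferred for larger graphs.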

13.5

SVM
(disagreement-based methods)
disagreement diver-
disagreement )

-
co training) [Blum and Mitchell, 1998]

-
multi view)
multi-view learning)

attribute
set), view).

( ( x\ x 2 ) , y),
W i

( ( x\ x 2 ) , y)
'
13.5

(compatibility),
y 1 y2
y y1
y1 y2

(sufficient)

13.6

[Blum and Mitchell, 1998].

fBlum and Mitchell,1998].

2013].

[Goldman and Zhou ,


2000], [ Zhou and Li,2005b],
[ Zhou and Li,2005a]
Dt ={((ajJ ,xf ),2/i ),...,((xl xf ),yi )}]
Ax {(x[+1,a;f+1> ,... (xl+u,xf+u)}]
s;
p
p n s. n;

T.

h Du s Ds;
Du = Du\ Ds ]
3: for 7 = 1, 2 do
Di = { ( xivi ) I { { xi xl
end for
~3
) , y i ) G A}


5
for t ••• T do
for j = 2 do
j 8: hj W)
={g
/ij. •
j
9: hj <,xl ~
) G Ds} p
C Ds n C
- {( x r j l ) x{ e D0
3

^^
Dp3

^
10
11 bl ~
--
3 j
{( x , i ) \ x{ e Diy,
12 Ds = Ds\ ( Dp U Dn )\
13 end for
14 if then
15 break
16 else
17 for j = 2 do
18 D )
19
20
end for
Ax 2p + 2n
'
21 end if
22: end for
/ll Zl2

13.6

2013];
13.6

13.6

(semi-supervised clustering)

10.6
-
(must link) -
cannot link)

;
A ( Constrained fe means) - [Wagstaff et al. 2001]

m};
A4
*

C;
.

f Ml , M2 , - .. , /ifc}
repeat
fc Cj = 0 ( l j k ) ;
3
4 ^ ^
for i = 1, 2, . .. , m do
-
^^
5 (1 j k) =|
\\ xi Xj|
|2
6 ={1,2,
7 is _merged =false;
8 while is _merged do
/C r = argmir
A G
&
M C ^ K

if -i is voilated then
12 Cr = Cr U i}; '

13 is _merged =true
14 else
JC = JC\ { r } ;
if /C = 0 then
break
end if
end if
end while
end for
for j = 1, /c do

end for
25: until
= \ c~\
^ xeCj X

13.7 /c
C, { x i , X j ) e M
fc 9.4.1 {xi . Xj ) eC
C
A:
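A compact, hedged sketch of the constrained k-means procedure of Figure 13.7 follows: each sample is assigned to the nearest centre whose choice does not violate any must-link or cannot-link constraint against assignments already made in this round. Where the algorithm in the figure would report failure when no feasible cluster exists, this sketch simply falls back to the nearest centre; that simplification, and all names, are assumptions.

```python
import numpy as np

def violates(i, c, assign, must, cannot):
    """Would putting sample i into cluster c break a constraint w.r.t. already-assigned samples?"""
    for a, b in must:
        j = b if a == i else (a if b == i else None)
        if j is not None and assign[j] != -1 and assign[j] != c:
            return True
    for a, b in cannot:
        j = b if a == i else (a if b == i else None)
        if j is not None and assign[j] == c:
            return True
    return False

def constrained_kmeans(X, k, must, cannot, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        assign = -np.ones(len(X), dtype=int)
        for i in range(len(X)):
            order = np.argsort(((X[i] - centers) ** 2).sum(axis=1))   # nearest centres first
            feasible = [c for c in order if not violates(i, c, assign, must, cannot)]
            assign[i] = feasible[0] if feasible else order[0]         # fallback: nearest centre
        centers = np.array([X[assign == c].mean(axis=0) if np.any(assign == c) else centers[c]
                            for c in range(k)])
    return assign, centers
```

The constraint sets are passed as lists of index pairs, mirroring the must-link set M and cannot-link set C in the text.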

13.7
p .202 4.0
X13

M = {( X4 , 25 ), ), X u ) },

C
^
{(x2 ,X2l ),{ x 2 l : X 2 ) ,( 13 ,«23 ) , ®13 ),(«19 ,® 23),(«23 ,

/c
^ z6 , :c27 13.8

Figure 13.8: Clustering results of constrained k-means (k = 3) on watermelon data set 4.0 after different numbers of iterations (panels (a)-(d); only the axis ticks of the original plots are recoverable).
13.6

A:

®9, ®14 , ®17,®2l }


;
C2 {®65 *8, ®10 ?
^11, ^12, ®15, ^19 *2o } ?

Cs = {a i , aJ 2, X4, «22 , X 2S , X 24 , ® 26 , *28, *29 , *30}•

D { x u x 2 , ... , x m } ,
(cluster label),
(class label ).
S \Jkj=1 S j C D, /0 j

A:
(Constrained Seed

-
Ai means) [Basu et al., 2002], 13.9

JD {xi ,x2,...,Xm}
,|
S C D|5 <|D
| . S = U =! •

for j .. . A: do
3 = 15 1
^^
x
• ^
3 end for
xeSj

4: repeat

^
5: C j = 0 (1 < j k ) ]
for j = 1, 2, . .. , /c do
for all x S j do
C j = Cj \J { x }
end for
end for
for a l l X i e D\ S do
(1 \ \2

^
j k) HA
r argmin

^ ^^ ^
Tj g ,... }
A CV CV U{^}
end for
for j = 1, 2, . .. , /c do

19 until
end for
= W i\
^ xeCj

13.9 /c
4.0

SI { X 4, X25}, S2 {xi2 ,X2o}y Ss = {«14, X 17 }.

13.10 A;

Cl = { x i , X 2 , ,«22 ,*23 ,*24 ,® 25 , ® 29 ? ®30}

^ ^
4 $ 27 5 28 ?

^^
2 = ®7 , ® 10 ,®11 ®12 , ® 18 ,®19 ,®2o}
;
( 3 = ® 13 , 21

Figure 13.10: Clustering results of constrained seed k-means (k = 3) on watermelon data set 4.0 after different numbers of iterations (panels (a)-(d); only the axis ticks of the original plots are recoverable).
13.7

13.7

[Shahshahani and Landgrebe,

(ICML)
(paradigm)
SVM [Blum and Mitchell,
[Joachims, 1999] [Zhu et al.? 2003]
[Shahshahani and Landgrebe, 1994].

SVM
(continuation)
S3VM [Chapelle et al,2006a];
(deterministic annealing)
[Sindhwani et al., 2006]; CCCP
[Collobert et al.,2006]
[Blum and Chawla, 2001]
(mincut).
fc e
13.4 A: e
10.5.1
[Wang and Zhang, Jebara et al.,2009], (graph
kernel) [Chapelle et al., 2003].

[Blum and Mitchell, 1998]. -


(tri training)
[Zhou and Li, 2005b].

[Zhou, 2009].
2013].
[Belkin et al., 2006]
ization)
( manifold regular
-
.

[Cozman
and Cohen, 2002] ,
SVM,
S4VM [Li and Zhou , 2015]
safe)

[ Zhou and Li, 2005a] [Zhang et al., 2007]


[Chapelle et al, 2006b;
Zhu, 2006] , [Zhou and Li, 2010; 2013]
[Settles, 2009]
13.1 (13.5) (13.8).
13.2

13.3 ( mixture of experts) /c

P ( X \ 0) = Y ai
...
^ ' Pix \ ei )

i
(13.22)

o, =

uci 13.4 TSVM UCI


http:/ / archive.ics. uci.edu / ml /.

TSVM
SVM,

13.5
TSVM

13.6* TSVM

r
i 3.

13.8
-
(self training)

AA,
13.9*

13.10 13.7
. 2013). 39 (11):1871-1878.
Basu S., A. Banerjee, and R. J. Mooney. ( 2002). “Semi-supervised clustering
by seeding.” In Proceedings of the 19th International Conference on Machine
Learning ( ICML ) , 19-26, Sydney, Australia.
Belkin, M., P. Niyogi, and V. Sindhwani. ( 2006) . “Manifold regularization: A
geometric framework for learning from labeled and unlabeled examples.”
Journal of Machine Learning Research, 7:2399-2434.
Blum, A. and S. Chawla. (2001). “Learning from labeled and unlabeled data
using graph mincuts.” In Proceedings of the 18th International Conference
on Machine Learning ( ICML ), 19-26, Williamston, MA.
Blum, A. and T. Mitchell. (1998). “Combining labeled and unlabeled data with
co-training.” In Proceedings of the 11th Annual Conference on Computation-
al Learning Theory ( COLT ), 92-100, Madison , WI.
Chapelle, O., M. Chi, and A. Zien. ( 2006a) . “A continuation method for semi-
supervised SVMs.” In Proceedings of the 23rd International Conference on
Machine Learning ( ICML ) , 185-192, Pittsburgh , PA.
Chapelle, O., B. Scholkopf , and A. Zien , eds. ( 2006b). Semi-Supervised Learn-
ing. MIT Press, Cambridge, MA.
Chapelle, O., J . Weston, and B. Scholkopf. ( 2003) . “Cluster kernels for semi-

supervised learning.” In Advances in Neural Information Processing Systems


15 ( NIPS ) (S. Becker, S. Thrun , and K. Obermayer , eds.) , 585-592, MIT
Press, Cambridge, MA.
Chapelle, O. and A. Zien. ( 2005). “Semi-supervised learning by low density
separation.” In Proceedings of the 10th International Workshop on Artificial
Intelligence and Statistics ( AISTATS ), 57 64, Savannah Hotel, Barbados.
-

Collobert , R., F. Sinz , J. Weston , and L. Bottou. ( 2006). “Trading convexity for
scalability.” In Proceedings of the 23rd International Conference on Machine
Learning ( ICML ) , 201-208, Pittsburgh , PA.
Cozman , F. G. and I. Cohen. ( 2002 ) . “Unlabeled data can degrade classifica-
tion performance of generative classifiers.” In Proceedings of the 15th Inter-
national Conference of the Florida Artificial Intelligence Research Society
( FLAIRS ) , 327-331, Pensacola, FL.
Goldman , S. and Y. Zhou. (2000) . “Enhancing supervised learning with unla-
beled data.” In Proceedings of the 17th International Conference on Machine
Learning ( ICML ) , 327-334, San Francisco, CA.
Jebara, T., J. Wang, and S. F. Chang. ( 2009) . “Graph construction and b-
matching for semi-supervised learning.” In Proceedings of the 26th Interna-
tional Conference on Machine Learning ( ICML ), 441—448, Montreal, Cana-
da.
Joachims, T. (1999) . “Transductive inference for text classification using sup-
port vector machines.” In Proceedings of the 16th International Conference
on Machine Learning ( ICML ), 200-209, Bled , Slovenia.
Li, Y.-F. , J . T. Kwok, and Z.-H. Zhou. (2009). “Semi-supervised learning using
label mean.” In Proceedings of the 26th International Conference on Machine
Learning ( ICML ) , 633-640 , Montreal, Canada.
Li, Y.-F. and Z.-H. Zhou. ( 2015) . “Towards making unlabeled data never hurt .”
IEEE Transactions on Pattern Analysis and Machine Intelligence , 37(1):
175-188.
Miller, D. J . and H. S. Uyar. (1997) . “A mixture of experts classifier with
learning based on both labelled and unlabelled data.” In Advances in Neural
Information Processing Systems 9 ( NIPS ) ( M. Mozer , M. I. Jordan, and T.
Petsche, eds.) , 571-577, MIT Press, Cambridge, MA.
Nigam, K ., A. McCallum , S. Thrun , and T. Mitchell. (2000 ). “Text classifica-
tion from labeled and unlabeled documents using EM.” Machine Learning ,
39 ( 2-3):103—134.
Settles, B. ( 2009 ). “Active learning literature survey.” Technical Re-
port 1648, Department of Computer Sciences, University of Wis-
~
consin at Madison, Wisconsin, WI. http:// pages.cs.wisc.edu / bsettles /
pub /settles.activelearning.pdf .
Shahshahani, B. and D. Landgrebe. (1994) . “The effect of unlabeled samples
in reducing the small sample size problem and mitigating the Hughes phe-
nomenon. IEEE Transactions on Geoscience and Remote Sensing , 32 (5):
1087-1095.
Sindhwani, S. S. Keerthi, and O. Chapelle. (2006 ) . “Deterministic annealing
for semi-supervised kernel machines.” In Proceedings of the 23rd Internation-
al Conference on Machine Learning ( ICML ) , 123-130, Pittsburgh, PA.
Wagstaff , K., C. Cardie, S. Rogers, and S. Schrodl. ( 2001) . “Constrained
fc-means clustering with background knowledge.” In Proceedings of the
18th International Conference on Machine Learning ( ICML ) , 577 584, -

Williamstown , MA.
Wang, F. and C. Zhang. ( 2006 ) . “Label propagation through linear neighbor-
hoods.” In Proceedings of the 23rd International Conference on Machine
Learning ( ICML ), 985-992, Pittsburgh , PA.
Zhang , D., Z.-H. Zhou, and S. Chen. ( 2007) . “Semi-supervised dimensionality
reduction.” In Proceedings of the 7th SIAM International Conference on
Data Mining ( SDM ), 629-634, Minneapolis, MN.
Zhou, D., O. Bousquet , T. N. Lai, J . Weston , and B. Scholkopf. (2004) . “Learn-
ing with local and global consistency.” In Advances in Neural Information
Processing Systems 16 ( NIPS ) (S. Thrun, L. Saul, and B. Scholkopf , eds.) ,
284 291, MIT Press, Cambridge, MA.
-

Zhou, Z.-H. ( 2009 ). “When semi-supervised learning meets ensemble learning.”


In Proceedings of the 8th International Workshop on Multiple Classifier Sys-
tems, 529 538, Reykjavik , Iceland.
-

Zhou , Z.-H. and M. Li. ( 2005a) . “Semi-supervised regression with co-training.”


In Proceedings of the 19th International Joint Conference on Artificial In-
telligence ( IJCAI ) , 908-913, Edinburgh, Scotland.
Zhou, Z.-H. and M. Li. ( 2005b) . “Tri-training: Exploiting unlabeled data using
three classifiers.” IEEE Transactions on Knowledge and Data Engineering ,
17(11) :1529-1541.
Zhou , Z.-H. and M. Li. ( 2010) . “Semi-supervised learning by disagreement .”
Knowledge and Information Systems, 24(3):415-439.
Zhu, X. ( 2006). “Semi-supervised learning literature survey.” Technical Report
1530 , Department of Computer Sciences, University of Wisconsin at Madi-
~ _
son, Madison, WI. http: www.cs.wisc.edu / jerryzhu / pub/ssl survey.pdf.
Zhu , X., Z. Ghahramani, and J . Lafferty. ( 2003). “Semi-supervised learning
using Gaussian fields and harmonic functions.” In Proceedings of the 20th
International Conference on Machine Learning ( ICML ), 912—919,Washing
ton , DC.
-

manifold ) Mannig
-
faltigkeit , • ( Bernhard
Riemann , 1866)

(Breselenz),

Mannigfaltigkeit

(
14.1

)
(probabilistic
model)
inference),

'
0,
generative) P% discriminative)
P (X , R \ 0 ) . P(y, O )
P(y,R I O ) P (Y I O).

0(2lyW l ).

(probabilistic graphical model)

( Bayesian network);
( Markov
network)•
( Hidden Markov Model, HMM)
7.5
(dynamic Bayesian network),

14.1
{2/1, • . 2/n} Vi j
( hidden variable).
y i M y2 Vi Vn

Xi) ( X2 Xi

14.1

{xi ,x2 ,...,xn}, xi eX f


S 2,... , SN }
y iV A

14.1
W
t f
- —
yt 1 , n-2
(Markov chain), BP

Vn ) = P { yi ) P { xi yi )
Y=[ P ( yi yi - i ) P { xi \ yi ) . (14.1)
i 2>

• A
[ao]ivxiV

dij = P { yt -\- i = Sj yt = Si ) ,


B [ b{ j ]iv x M

bij = P { xt Oj yt = si )


14.1

(7ri ,7T2 ...,7riv )

P ( Vl = Si )
TTz l
^i^ N

y
A [A,B ,TT] A
x2 , ... xn}:

(1) t 7T

(2) B

(3) A

t n t t .

G {& ••• }
Sjv {oi ,02,...,0M} t

A [A,B,TT],
{m , X2 , .. .
^} n P( X A )

• A [A,B ,7r] x 2, .
y {yi y2,… yn}

• x= A = [A B ,TT]
P(x A)

{ i ,x2,...,xn i} _
^ P (x A )
(14.1)
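A small sketch of the generative view just described: sampling a state/observation sequence from (π, A, B) and evaluating the joint probability of Eq. (14.1). Matrix shapes follow the conventions above (A is N×N, B is N×M); the function names are assumptions.

```python
import numpy as np

def hmm_sample(pi, A, B, n, seed=0):
    """Sample (y_1..y_n, x_1..x_n) following steps (1)-(3) of the generative process."""
    rng = np.random.default_rng(seed)
    y, x = [], []
    state = rng.choice(len(pi), p=pi)                 # step (1): initial state from pi
    for _ in range(n):
        y.append(state)
        x.append(rng.choice(B.shape[1], p=B[state]))  # step (2): emit observation from B
        state = rng.choice(len(pi), p=A[state])       # step (3): transition via A
    return np.array(y), np.array(x)

def hmm_joint_prob(pi, A, B, y, x):
    """Joint probability P(x_1..x_n, y_1..y_n) of Eq. (14.1)."""
    p = pi[y[0]] * B[y[0], x[0]]
    for i in range(1, len(y)):
        p *= A[y[i - 1], y[i]] * B[y[i], x[i]]
    return p
```

Summing hmm_joint_prob over all hidden sequences y would give P(x | λ), which is exactly the evaluation problem listed above; the forward algorithm computes that sum efficiently.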

14.2

(Markov Random Field , MRF)

(potential
functions), factor ),

14.2
clique).
maximal
clique); 14.2
{ 1 , 2},{ x i , XS } ,{ 2 , 4},{ 2 , 5},{ X 2 , Xe } ,{ 3 , 5},

^^x3 ^^ ^^
{ X 2 , X },
^
{ x u x2 , xs }
{X5,a 6}
^^
XA
X2 x6

X3

14.2

n
x { x u x2 , .. .,xn}, C, Q eC
xQ , P(x)

p(x) n
Qec
(XQ ) , (14.2)

Q Q Z=
14.2

^
Ex UQec Q( XQ )
Z Z

(14.2)
9
( 3*
< Xg Q Xg* ;
XQ
P(x) C*

P (x )
Z* n
QeC*
(14.3)

Z* = EX Ugec* (XQ )
x6}, P(x)

P( X ) —^ 12 ( X1 , X 2 ) 13 (a l , 3 ) 24
^ ^ ^ ^4 ) 035 (^3, ^5 )
*

^ ^ ,^ ,
256 2 5

0256( 2 , 5 , 6)

^^ ^
*

{x|2,x }, a 2 a;6 ) {X5 ,X 6}

7.5.1
14.3 B C
A B c C separating
set ).

• global Markov property):

X _
e
14.3
XC
A C
XA X _e XC7 .
xA xs xa

14.3 S (7
14.3 B C
14.3 1 4 . 4.

Q)
(x
^ )

14.4 14.3

14.4, (14.2)

P( XA, XB , XC ) = AC { XA, XC ) BC { XB, XC ) . (14.4)


^ ^
P{ x A, X B , X C ) P( X A , X B , X C )
xc) P { xc ) ExfAExfBP( x A ^ X B ^ X c )
, XCyipBC( XB
^^ ^ ^
E^B
AC{ X A
Adx XC )
XC )
,
B c { xfB X C )

,AC{ XA XC) ,
BC { XB X C )
^
YUX Xx^
• (14.5)

^^ AC XC ) B ’
BC [ X B XC )

P{ x A I xc ) =
P { x A , XC )
p( xc )
. ^ .^
T P A X XC )
^ Pix x ^ xc )
X

J2xfA Ex
JD
^ ^^
A C ( X A , X C ) B C ( XfB X C )

i A C [ XfA , XCyiPBC ( XfB X C )


AC [ XA, XC )
.
T x Acix
' (14.6)

(14.5) (14.6)
^ ^ Xc )

P ( x A , XB I Xc ) = P { x A | xc ) P { x B I xc ) , (14.7)

CA

• (local Markov property):


14.3

Markov V n\v ) = n{ v ) U { v } , xy\n*(v)| n(y ) ■ •

blanket ).
• (pairwise Markov property):
F
W WJ Xw ^v\{ u ,v)

XQ
14.4

1.5, if XA = xc\
AC { XA, XC )
^ 0.1, otherwise

0.2, if xB = xc\
,
B C{ X B XC )
^ 1.3, otherwise ,

XC7 XC7
(1 4 . 2)
M M

4Q(XQ ) e- Q(XQ )
. (14.8)

^
g(xg ) xQ

HQ ( XQ ) =
/

u ,veQ ,u v
-
OL yX n X y H-
^ :P v V (14.9)
^ VGQ

14.3

( Conditional Random Field , CRF)


. 14.1

3.3
X2 , Tn} y 2/n}
P( y I
y

14.5(a)
14.5(b)

[S]
[ NP] [VP]
y
{ yi V2 V3 DA V 5 ye }
[PP] y
[ D] [ N ] [V] [ P ] [D] [N ]

{ ^1
• X2 X3 X4 X5 6 } [ D] [ N ] [V] [ P][D] [N ]
X
The boy knocked at the watermelon. The boy knocked at the watermelon . x

( a) ( b)
14.5

G (V , E ) y
r n( v ) r G

P {Vv I ^ Yv\{ v } ) = P (Vv I x, yn(t ) ) ; (14.10)

(y,x )
G
14.6
(chain-structured CRF).

{ xi X2 ... xn }
14.6
14.3

x). x, 14.6

(1 4 . 2)
(feature function),

P( y I x ) = exp
Z

(14.11)

( transition feature function),


sk ( yi i) i
(status feature function),
^ Aj
Z (14.11)

14.5(a)

if = [P], yi = [V ] and Xi =“knock”


yi + i ;
0, otherwise,

i A “knock”

if yi [V] and = “knock”;


sk ( y i x i ) =
0, otherwise,

A “knock” [F].

(14.11) (14.2)
14.4

( marginal distribution)
x
A

z
marginalization).

F
{X15 X2 , M
XF P(xF)
P(xF I XE ) .

P( X E , X F ) XF )
P( X
P( XF = (14.12)
P (^E ) EXFP XF)
( X

P(xs ,xF)

P{XE) = y P(xjg , XF )
^ . (14.13)

14.4 .1
14.4

. •

14.7(a)

X4
m43(X3) x4

© X3 X3

(b)
14.7

POr5).
{xi ,X2,x3 , 4}, BP

X4 X3 X2
^ Xi

Y1 2 2 YlP ^ Xl > P ^ I Xl)P(X3 I X 2 )P { X 4 I X s ) P ( x 5 I X 3) .


^^
>
X2
X4 X3 X2 Xl

(14.14)

{ X U X 2 , X 4 , XS }

^
2P{ 2 P{xs I 2
^ ^2 ^ ^ ^^2 ^
P( x ) = 3) P( I 3) ) I l)
^
X5 4 X2
^ Xs X4 X2 x\

p( (14.15)
= ^ I^
5 3) P ( X 4 I Xs ) P( X 3 I X 2 )m12 ( x2 )

rrHjixj ) i
Xj )
j m
^

^^
= 2 P( x 5 I Xs ) 2P( x 4 I X 3 ) m2s { xs )

2 p(x 2 P { x^ I xs )
=
^2 X3
5 I xs )m2s { xs )
^ X4

^
p( x
=
^ 1 ^3 )^23 (^3 )^43 (^3 )
= 35( 5 ) . (14.16)

^^
• m35(a 5) x5
14.7(a)

P{ XI , X2 , X
^ X^ X ^ ) = — ^ 1 2 ( X 1 , X 2 ) 2S { X 2 ,
^
Z _
P(x5)
(14.17)

^ ^ ^
P(X5) = 35( 3, 5) 34( 3 , 4) 3) 2)
Xs

^^ ^^ X4 X2

^
^^
=
^ ^ ^ ) 2^
35 3 5 34 ( x3 , x4 ) 2^ ^ , xs )m12 { x2 )
23 ( 2

^-3 5 (^5) (14.18)

14.7(a)
P ( x5 ) P(x4), { xi , X2 , X , Xs } 77 2(2:2 )

^
^
m23(x3)

14.4.2

Sum- Product ( Belief Propagation)

i(xj ) JJ
ken( i )\j
mki ( xi ) (14.19)

n
A myh ). (14.15) (14.16)
14.7(b)
A
14.5

P ( xi ) ex JJ( ) mki { xi ) .
kGn i
(14.20)

14.7(b) X2

X4 m35(x5) P ( x5 ) .


14.7 n r4 x5
14.8
(14.20)

^43( 3) ( 3)
^ 43

^^
m2i ( x i ) rn32 ( x2 ) 21( 1) 32( 2)

^
34( 4 )

53( 3)
( 2) 23( 3)

^
12

^ 35 { x5 )
(a) (b)
14.8

14.5

(sampling),

(variational inference).
14.5.1 MCMC
14.7(a) r5

r
EP[/] f { x ) p( x ) dx (14.21)

p( x )
p( x ) ^2 , - .. /(4

(14.22)

(Markov Chain
Monte Carlo, MCMC) xGX p(z)
x A SJ
"

P ( A) = p ( x ) dx (14.23)
JA

R, f { x)

[ f ( X )] = f ( x ) p( x ) dx (14.24)
JX

a
(14.24) MCMC
x
_
p
Xl,X2 , ..,xN ,
- (14.24)
N
P( f ) = • (14.25)

p(x) p
14.5

MCMC p

x p.
r r(/
t p(x* ),

p(xt )T( xt-1 x£ ) p( xt-1) T (xt x^1 ) (14.26 )

p(x)

MCMC

MCMC
Metropolis-Hastings Metropolis-Hastings MH ) MCMC
N . Metropolis
[Metropolis (reject sampling ) p. 14.9
et al., 1953], W. K . x* ,
Hastings
X* Q(X*
i-1
[Hastings, 19 70]
*

-
^
x* 1), 1
') A(X*
Q(X* x 1
"
X )
X* X *
(14.26 )

p(xt ~1) Q ( x* xt-1) A (x* V —1) p( x* ) Q ( xt-1 | X


*
)A(X£ 1
~

I x* ) , (14.27)

xQ ;
for t . . . do
3: Q (x* | x^1) x* ;

(14.28).
4: 0, 1)
if u A ( x* xt 1) then
_
= x*
else
xt xt-i
9 end if
10 end for
11 return x x2 ,
^
^
x x2 , .. .

14.9 Metropolis-Hastings
A (x* x4-1) (14.28)

7.5 . 3 (Gibbs sampling ) MH


p(x).
x { Xh ^2 , - • }, P(x)
”XJV

(1)
( 2) x I X0,
xi = {XU2 , • • •

(3) |x7) •
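A minimal Gibbs sampler following steps (1)-(3) above, for a two-dimensional Gaussian with correlation ρ, where each full conditional is itself a one-dimensional Gaussian; the target distribution and the parameter values are illustrative, not from the text.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples=5000, burn_in=500, seed=0):
    """Gibbs sampling for (x1, x2) ~ N(0, [[1, rho], [rho, 1]]).
    Each step resamples one coordinate from its conditional given the other."""
    rng = np.random.default_rng(seed)
    x1, x2 = 0.0, 0.0                     # step (1): arbitrary starting point
    sd = np.sqrt(1.0 - rho ** 2)          # conditional standard deviation
    samples = []
    for t in range(n_samples + burn_in):
        x1 = rng.normal(rho * x2, sd)     # step (2): draw x1 | x2
        x2 = rng.normal(rho * x1, sd)     # step (3): draw x2 | x1
        if t >= burn_in:
            samples.append((x1, x2))
    return np.array(samples)

S = gibbs_bivariate_normal(0.8)
print(np.corrcoef(S.T)[0, 1])             # should be close to 0.8
```

The same skeleton applies to graphical models: each variable is resampled from its conditional given its Markov blanket, and the chain's stationary distribution is the target joint distribution.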

14.5. 2

( plate notation ) [Buntine, I994]. M.


14.10 (a) iV z.
14.10 ( b)
7V;
14.10 A

(a) ( b)
14.10
14.5

14.10 x

N
p(X 0) = Y[ J2 p( Xi , Z I ©) (14.29)

N
lnp(x 0) 0) 5 (14.30)

X ={X1,X2 ,.. ㊀ X Z

14.10 x
Z ©, p(z ©) ©.
(14.30)
EM 7.6 EM E t p(z x,
p(x,z 0); M E
e Q(0;0 )
'
©t +1 arg max Q( @ ; @t )
©

^
^
arg max p( z|x,0 )lnp(x,z|0) • (14.31)
©

(14.31) Q(0;©0 _
ln p(x z ©)
p(z x,

^
p( z x © ) z
2(0;0 ) ' EM
z
z x,©0 z
g(z)

lnp(x) = C(q) \ KL ( q p),-- (14.32)

q) dz (14.33)

KL C.3.
KL (Q p) = ~
J• ) h P(
z x)
dz . (14.34)
p( z x,© )
E
z
' z

M
( 14.35 )

z
(exponential
gt (zt ) qi .
family )

Q) =
/ n ^{ lnp ( x , z )
^ In q i > dz

const
J lnp( x , z ) JJ q i d z i dzj — qj In q j d z j 4 const

q j lnp( x , Z j ) d z j — qj In q j d z j const ( 14.36)

lnp( x , Z j ) = E [lnp(x , z ) ] const ( 14.37 )

E [lnp(x z)] = lnp( x , z ) . ( 14.38)


#3

^

( 14.36) -KL ( q j p( x, Zj )) , (x, z) ) C(q)

\n q j ( z j ) = E [lnp(x , z ) ] const ( 14.39 )

exp (E[ (x , z )] ) lnp


^ ,

f exp (E [lnp (x z)] ) d z j


* ( 14.40)

( 14.35 )
( 14.40)
( 14.35 )
[lnp(x , z )] ( 14.40 )
z ( 14.38) Zj
14.6

Zj lnp(x,z ) Z

mean field mean field )

(14.40) EM

14.6

( topic model)

(Latent Dirichlet Allocation , LDA)


(word) (document )
(topic).

- -
bag of words).

14.11
K
r iV T AT
W K iV
= G n n f
n fc G n n A; n
i
i ,2,... r)
LDA LDA
G t
©# t A:

C. I .6.
(1)
(2) 7V

Figure 14.11: An intuitive illustration of document generation in LDA: (a) word-frequency distributions of the topics; (b) documents generated by mixing the topics according to document-specific topic proportions.

14.11
1),
2b), 2a).

14.12 LDA n

n,
a,
ry .

OOOO
a © t n Pk V
N T K
14.12 LDA

LDA
14.7

p(W,z,/3,0 I a 77)

PJp(0£ I a ) Y[ p( f3k I r/) ( P ( wt n \ zt n


}
P ( zt n|©t ) I , (14.41)
t=l 2=1 \n=l

p(0t a) p( f3k|rj ) a r/ K iV

r (Sfc ak)
p( Qt a ) =
nfcrK ) m 1
(14.42)

C.1.5. r Gamma a ;
r (14.41)
W { w 1 , w 2 , .. . , w T } , LDA
a ry

LL(a ,rj ) = p( wt|a,ry) . (14.43)

a, (14.43)

a T7 n

At n)

'
p(z /3 0 W,a,T7) =
p(W,z,/3,0 a,
(14.44)
p(W|a ,r/)

p(W|a,77) (14.44)

14.7

[Roller and Friedman, 2009].


[Pearl, 1982] [Pearl, 1988]
[Geman and Geman, 1984]

[Rabiner, 1989]. [Lafferty et al.,


2001] [Sutton and McCallum, 2012].
[Pearl, 1986]

(Loopy Belief Propagation)


[Murphy et al., 1999], [Mooij
and Kappen , Weiss, 2000]. factor graph)
[Kschischang et al.,2001] (factor tree)
[Lauritzen and Spiegelhalter ,1988].
[Gonzalez
et al,2009] Te

[Jordan , 1998]
[Wainwright and Jordan, 2008].

L D A [Blei et al , 2003]
[Blei, 2012].
-
(non parametric)
[Teh et al., 2006] [Ghahramani and
p.164. Griffiths , 2006]
PLSA
LSA SVD
[Hofmann , 2001], LSA

p .266.
MCMC
[Pearl,1987] MCMC
[Neal, 1993], MCMC [Andrieu et al.,
Gilks et al., 1996].
14.1

14.2 .

14.3

14.4

14.5

14.6

14.7 MH

14.8

14.9* LDA,

14.10* LDA
Andrieu, C., N. De Freitas , A. Doucet , and M. I. Jordan. ( 2003). “An intro-
duction to MCMC for machine learning.” Machine Learning 50 (1-2 ) :5-43.
Blei, D. M. ( 2012 ) . “Probabilistic topic models.” Communications of the ACM
^
55 7-84.
Blei, D. M., A. Ng, and M. I. Jordan. ( 2003). “Latent Dirichlet allocation.”
Journal of Machine Learning Research, 3:993-1022.
Buntine, W. (1994) . “Operations for learning with graphical models.” Journal
of Artificial Intelligence Research, 2:159-225.
Geman , S. and D. Geman. (1984) . “Stochastic relaxation , Gibbs distributions,
and the Bayesian restoration of images.” IEEE Transactions on Pattern
Analysis and Machine Intelligence , 6 ( 6) :721-741.
Ghahramani, Z. and T. L. Griffiths. ( 2006 ) . “Infinite latent feature models and
the Indian buffet process.” In Advances in Neural Information Processing
Systems 18 ( NIPS ) (Y. Weiss, B. Scholkopf , and J . C. Platt , eds.) , 475-482,
MIT Press , Cambridge, MA.
Gilks, W. R. S. Richardson , and D. J . Spiegelhalter. (1996) . Markov Chain
Monte Carlo in Practice. Chapman & Hall/ CRC , Boca Raton , FL.
Gonzalez, J . E. Y. Low, and C. Guestrin. ( 2009) . “Residual splash for optimally
parallelizing belief propagation.” In Proceedings of the 12th International
Conference on Artificial Intelligence and Statistics ( AISTATS ) , 177-184,
Clearwater Beach, FL.
Hastings, W. K. (1970 ) . “Monte Carlo sampling methods using Markov chains
and their applications.” Biometrica, 57(1):97-109.
Hofmann , T. ( 2001) . “ Unsupervised learning by probabilistic latent semantic
analysis.” Machine Learning , 42 (1) :177-196.
Jordan, M. I., ed. (1998) . Learning in Graphical Models. Kluwer , Dordrecht ,
The Netherlands.
Koller , D. and N. Friedman. ( 2009 ) . Probabilistic Graphical Models: Principles
and Techniques. MIT Press , Cambridge , MA.
Kschischang, F. R., B. J . Frey, and H.-A. Loeliger. ( 2001). “ Factor graphs and
the sum-product algorithm.” IEEE Transactions on Information Theory 47
( 2 ):498-519. .
Lafferty, J. D., A. McCallum , and F. C. N. Pereira. ( 2001) . “Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence data.”
In Proceedings of the 18th International Conference on Machine Learning
( ICML ), 282-289, Williamstown, MA.
Lauritzen, S. L. and D. J . Spiegelhalter. (1988). “Local computations with prob-
abilities on graphical structures and their application to expert systems.”
Journal of the Royal Statistical Society - Series B, 50 ( 2):157-224.
Metropolis , N., A. W. Rosenbluth, M. N. Rosenbluth , A. H. Teller , and E.
Teller. (1953) . “Equations of state calculations by fast computing machines.”
Journal of Chemical Physics , 21(6 ) :1087-1092.
Mooij , J . M . and H. J . Kappen. ( 2007). “Sufficient conditions for convergence
of the sum-product algorithm.” IEEE Transactions on Information Theory,
53(12 ) :4422-4437.
Murphy, K. P., Y. Weiss , and M. I. Jordan. (1999) . “Loopy belief propaga-
tion for approximate inference: An empirical study.” In Proceedings of the
15th Conference on Uncertainty in Artificial Intelligence ( UAI ) , 467-475,
Stockholm, Sweden.
Neal , R. M. (1993) . “Probabilistic inference using Markov chain Monte Carlo
methods.” Technical Report CRG-TR-93-1, Department of Computer Sci-
ence University of Toronto.
Pearl , .J . (1982 ) . uAsymptotic properties of minimax trees and game-searching
procedures.” In Proceedings of the 2nd National Conference on Artificial
Intelligence ( AAAI )^ Pittsburgh, PA.
Pearl , J . (1986) . “Fusion , propagation and structuring in belief networks.”
Artificial Intelligence , 29 ( 3) :241-288.
Pearl, J . (1987). “Evidential reasoning using stochastic simulation of causal
models.” Artificial Intelligence, 32(2 ):245-258.
Pearl, J . (1988) . Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann , San Francisco, CA.
Rabiner , L. R. (1989 ). UA tutorial on hidden Markov model and selected ap-
plications in speech recognition.” Proceedings of the IEEE , 77(2):257-286.
Sutton , C. and A. McCallum. ( 2012 ) . “An introduction to conditional random
fields.” Foundations and Trends in Machine Learning , 4 ( 4) :267-373.
Teh , Y. W., M. I. Jordan , M. J . Beal, and D. M. Blei. ( 2006) . “Hierarchical
Dirichlet processes.” Journal of the American Statistical Association , 101
(476):1566-1581.
Wainwright , M. J. and M. I. Jordan. ( 2008) . “ Graphical models, exponential
families, and variational inference.” Foundations and Trends in Machine
Learning , 1(1- 2) :1-305.
Weiss, Y. ( 2000 ). “Correctness of local probability propagation in graphical
models with loops.” Neural Computation, 12 (1):1-41.
( Judea Pearl, ).

Rutgers
RCA

1.5

• ACM AAAI ACM

Allen Newell

“9.11”
Michael Jordan


15.1

mle)

[Fiirnkranz et al., 2012]. ( rule learning)

f A f2 A • • • A (15.1)

“i” (body),
(head),
(literal) (conjunction), “ A”

atom )

-
“if then

(X K ) (X Z) A (ZJ )”

AI
t

. (valuation)
2.oaP.76 2.0 1)
4.1.
cover).

(conflict), (conflict resolution).

(ordered
mle) (priority rule)

-
meta mle),

(default rule),

(propositional
rule) -
(first order rule). (propositional
atom) A) t)

(atomic formula), (predicate) (X,y)”


v
X+ a( X )
(( ((
a( X )

P0” “VX”
“3F” Y
“VX3F (Y) t (Y = (T(X)))”
“VX KX)) e .
F
quantifier).
15.2

( relational rule).

pO t

15.2

The most straightforward way to induce a rule set from training data is sequential covering (sequential covering): rules are learned one at a time; each time a rule is learned, the training examples it covers are removed, and the next rule is learned on the remaining examples. Because only part of the data is dealt with in each round, this strategy is also known as separate-and-conquer (separate-and-conquer).

Consider learning propositional rules on the watermelon data set 2.0 (Table 4.2, p.80). A single rule r for the concept "good melon" can be grown top-down: start from the most general rule with an empty body, "good_melon ← ", and repeatedly add one literal, i.e. one attribute test, to the body. Each candidate literal is scored by the accuracy of the resulting rule on the current training examples: if the rule covers m examples of which n are positive, its accuracy is n/m. For instance, one candidate first literal covers 4 examples of which 3 are positive, giving accuracy 3/4, while other candidates obtain accuracies such as 2/4, 2/2, 3/5 or 4/6; the candidate with the highest accuracy is kept, breaking ties, for example, in favour of the literal that covers more examples or the one encountered first. The rule is then extended with a second literal in the same way, for example until it reaches accuracy 1/1, i.e. covers only positive examples; the finished rule is added to the rule set, the examples it covers are removed, and a new rule is grown on the remainder.

[Figure 15.1: an example of growing a single rule on the watermelon data set 2.0 by adding one literal at a time.]

This one-literal-at-a-time procedure is a generate-then-test (generate-then-test) search that moves from general to specific, i.e. a specialization (specialization) strategy. The opposite, bottom-up (bottom-up) strategy starts from a specific rule constructed from a training example and gradually drops or relaxes literals; it is data-driven (data-driven) and proceeds by generalization (generalization). Top-down search tends to produce rules that generalize better and is more robust to noise, so it is the usual choice in propositional rule learning, while bottom-up search is more common in first-order rule learning, where the hypothesis space is far more complex. Since greedily keeping only the single best literal in each round easily gets trapped in a local optimum, one can instead use beam search (beam search): keep the best b candidates of the current round and expand all of them in the next round.
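The sketch below is an illustrative assumption rather than the book's exact algorithm: the candidate pool, the accuracy-and-coverage tie-breaking and the stopping tests are all simplified. It only shows the overall shape of sequential covering with greedy top-down rule growth.

```python
# Illustrative sketch of sequential covering with greedy top-down rule growth.
# Examples are (attribute-dict, label) pairs; accuracy n/m guides the search.
from typing import Dict, List, Set, Tuple

Example = Tuple[Dict[str, str], bool]
Literal = Tuple[str, str]

def covers(body: List[Literal], x: Dict[str, str]) -> bool:
    return all(x.get(a) == v for a, v in body)

def grow_rule(data: List[Example]) -> List[Literal]:
    """Greedily add the most accurate literal until no negative example is covered."""
    body: List[Literal] = []
    candidates: Set[Literal] = {(a, v) for x, _ in data for a, v in x.items()}
    covered = list(data)
    while any(not y for _, y in covered) and candidates:
        def score(lit: Literal):
            sub = [(x, y) for x, y in covered if covers([lit], x)]
            pos = sum(1 for _, y in sub if y)
            return (pos / len(sub), len(sub)) if sub else (0.0, 0)
        best = max(candidates, key=score)
        if score(best)[1] == 0:          # no remaining candidate covers anything
            break
        body.append(best)
        candidates.discard(best)
        covered = [(x, y) for x, y in covered if covers([best], x)]
    return body

def sequential_covering(data: List[Example]) -> List[List[Literal]]:
    """Learn rules one at a time, removing the examples each new rule covers."""
    rules, remaining = [], list(data)
    while any(y for _, y in remaining):  # positive examples are still uncovered
        body = grow_rule(remaining)
        rules.append(body)
        remaining = [(x, y) for x, y in remaining if not covers(body, x)]
    return rules
```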

15.3 Pruning Optimization

Rule generation is essentially a greedy search, so some mechanism is needed to reduce the risk of overfitting; the most common one is pruning (pruning), analogous to decision-tree pruning (Section 4.3). Pruning may take place while a rule is being grown (pre-pruning) or after rules have been produced (post-pruning), and is usually decided by comparing, under some evaluation measure, the performance of the rule or rule set before and after adding or deleting literals or rules.

Pruning can also be guided by a statistical significance test (Section 2.4). For example, the CN2 algorithm [Clark and Niblett, 1989] performs pre-pruning by requiring that prediction with the rule set be significantly better than simply guessing according to the empirical class distribution of the training set. CN2 uses the likelihood ratio statistics (Likelihood Ratio Statistics, LRS). Let m+ and m− denote the numbers of positive and negative training examples, and m̂+ and m̂− the numbers of positive and negative examples covered by the rule (set); then

    LRS = 2 · ( m̂+ · log2( (m̂+ / (m̂+ + m̂−)) / (m+ / (m+ + m−)) )
              + m̂− · log2( (m̂− / (m̂+ + m̂−)) / (m− / (m+ + m−)) ) ) .    (15.2)

LRS is essentially an information-theoretic measure of how much the class distribution of the covered examples differs from the empirical distribution of the whole training set: the larger LRS is, the more the predictions of the rule (set) differ from simply guessing by the training-set class ratio; the smaller it is, the more likely the apparent effect of the rule (set) is mere coincidence. In real tasks with fairly large data, CN2 is usually set to stop growing the rule (set) only when LRS is very large (e.g. at a 0.99 confidence level).
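Below is a direct transcription of Eq. (15.2), with the convention stated above that m̂+ and m̂− count the covered examples and m+ and m− the whole training set; the zero-coverage guard is an added assumption.

```python
import math

def lrs(m_hat_pos: int, m_hat_neg: int, m_pos: int, m_neg: int) -> float:
    """Likelihood ratio statistic of Eq. (15.2).
    m_hat_pos/m_hat_neg: positives/negatives covered by the rule (set);
    m_pos/m_neg:         positives/negatives in the whole training set."""
    m_hat = m_hat_pos + m_hat_neg
    m = m_pos + m_neg
    def term(k_hat: int, k: int) -> float:
        if k_hat == 0:
            return 0.0                      # limit of x*log(x/c) as x -> 0
        return k_hat * math.log2((k_hat / m_hat) / (k / m))
    return 2.0 * (term(m_hat_pos, m_pos) + term(m_hat_neg, m_neg))

# e.g. a rule set covering 6 of 8 positives and 1 of 9 negatives:
print(round(lrs(6, 1, 8, 9), 3))
```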

The most common post-pruning strategy is reduced error pruning (Reduced Error Pruning, REP) [Brunk and Pazzani, 1991]. Its basic procedure is to split the examples into a growing set (growing set) and a pruning set (pruning set): a rule set R is learned on the growing set and then pruned in multiple rounds; in each round all possible pruning operations are tried, such as deleting a literal, deleting the trailing literals of a rule, or deleting a whole rule, the resulting candidate rule sets are evaluated on the pruning set, and the best one is kept for the next round, until pruning no longer improves the pruning-set performance.

REP is generally effective [Brunk and Pazzani, 1991], but its complexity is O(m^4) for m training examples. IREP (Incremental REP) [Fürnkranz and Widmer, 1994] reduces this to O(m log^2 m): before each rule is generated, the current example set is split into a growing set and a pruning set; a single rule r is grown on the growing set and immediately REP-pruned on the pruning set; the examples covered by the pruned rule are removed and the process is repeated on the remaining examples. In other words, REP prunes the whole rule set while IREP prunes one rule at a time, which makes IREP considerably more efficient.

Even better results are usually obtained when pruning is combined with post-processing steps that optimize the rule set as a whole. The best-known example is RIPPER [Cohen, 1995] (short for Repeated Incremental Pruning to Produce Error Reduction; implemented in WEKA as JRip), whose generalization performance exceeds that of many decision-tree learners while also training faster. RIPPER is described in Figure 15.2. Its rule-generation procedure IREP* is an improved variant of IREP that, among other changes, replaces IREP's pruning measure with one based on the covered counts, (m̂+ − m̂−)/(m̂+ + m̂−). RIPPERk denotes RIPPER with k rounds of post-processing optimization; by default k = 5, i.e. RIPPER5.

Input:  data set D; number of optimization rounds k.
Process:
 1: R = IREP*(D);
 2: i = 0;
 3: repeat
 4:     R' = PostOpt(R);
 5:     D_i = NotCovered(R', D);
 6:     R_i = IREP*(D_i);
 7:     R = R' ∪ R_i;
 8:     i = i + 1;
 9: until i = k
Output: rule set R

Figure 15.2 The RIPPER algorithm.

RIPPER's post-processing step is what lifts its performance beyond plain pruning. After IREP* has produced the rule set R, RIPPER generates two variants for every rule r_i in R:

• r_i': a replacement rule (replacement rule), obtained by re-growing and re-pruning a rule with IREP* on the examples covered by r_i;
• r_i'': a revised rule (revised rule), obtained by specializing r_i with additional literals and then re-pruning it with IREP*.

Each variant is then put together with the remaining rules of R, the resulting rule sets are compared with R itself, and the best rule set is kept. Why does this optimization help? The reason is simple: when R is first generated, the rules are produced one after another and no rule takes into account the rules generated after it, so the greedy process easily ends in a local optimum. RIPPER's post-processing re-optimizes all rules of R together; this more global view alleviates the locality of the greedy search and usually yields better results. More on pruning and optimization in rule learning can be found in [Fürnkranz et al., 2012].
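The following sketch is an illustrative assumption rather than the exact REP, IREP or IREP* procedure; the accuracy measure and the restriction to deleting trailing literals are simplifications. It shows the reduced-error idea these algorithms build on: a grown rule is pruned back as long as a held-out pruning set does not get worse.

```python
# Reduced-error pruning of a single rule against a held-out pruning set (sketch).
from typing import Dict, List, Tuple

Example = Tuple[Dict[str, str], bool]
Literal = Tuple[str, str]

def covers(body: List[Literal], x: Dict[str, str]) -> bool:
    return all(x.get(a) == v for a, v in body)

def rule_accuracy(body: List[Literal], data: List[Example]) -> float:
    """Fraction of pruning-set examples the rule gets right:
    predict positive when covered, negative when not covered."""
    if not data:
        return 0.0
    return sum(covers(body, x) == y for x, y in data) / len(data)

def rep_prune(body: List[Literal], prune_set: List[Example]) -> List[Literal]:
    """Greedily delete the last literal while pruning-set accuracy does not drop."""
    best, best_acc = list(body), rule_accuracy(body, prune_set)
    while best:
        candidate = best[:-1]
        acc = rule_accuracy(candidate, prune_set)
        if acc >= best_acc:      # prefer the simpler rule on ties
            best, best_acc = candidate, acc
        else:
            break
    return best
```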

15.4 First-Order Rule Learning

Because of the limited expressive power of propositional logic, propositional rule learning cannot handle relations (relation) between objects, yet relational information is essential in many tasks. When we pick watermelons, for instance, it is hard to describe every melon on the stall with crisp attribute values: how dark must the colour look before it counts as "green", and how dull must the knock sound be before it counts as "dull"? What we actually do is compare melons with one another, e.g. "melon 1 has a darker colour and a curlier stalk than melon 2, so melon 1 is more likely to be a good melon". Statements of this kind are beyond propositional logic; they require first-order logic and, correspondingly, first-order rule learning.

[Table 15.1 The watermelon data set 5.0. The table lists, as ground atoms over pairs of melons taken from the watermelon data set 2.0 (Table 4.2, p.80), the pairwise comparison relations between melons, e.g. which of the two melons in a pair such as (1,10), (2,6) or (17,16) has the darker colour, the curlier stalk, the duller knock sound, the clearer texture, the more sunken navel or the harder touch, together with the target relation stating which melon of each pair is the better one. The individual ground facts are omitted here.]
Table 15.1 is an example of relational data (relational data): the relations between the melons are determined from their attribute values in data set 2.0. The atoms describing the pairwise attribute comparisons are called background knowledge (background knowledge), while the atoms involving the target relation "better" are the relational examples (examples). From the watermelon data set 5.0 we can learn first-order rules, which are logical clauses (clause), such as

    (∀X, ∀Y) ( better(X, Y) ← stalk_curlier(X, Y) ∧ navel_more_sunken(X, Y) ) .

Compared with the propositional form (15.1), the key difference is that "X" and "Y" are logical variables: the rule mentions no concrete melon, it holds for every pair of objects X and Y, and it therefore applies to unseen objects as well. Literals may also introduce new variables, as in "better(X, Y) ← stalk_curlier(X, Z) ∧ navel_more_sunken(Z, Y)".

The best-known first-order rule learner is FOIL (First-Order Inductive Learner) [Quinlan, 1990]. FOIL follows the sequential covering framework with top-down rule induction. On data set 5.0 it starts from the most general rule "better(X, Y) ← " and repeatedly adds one literal to the body; the candidate literals are atoms built from the available predicates over the variables already present in the rule or over newly introduced variables, e.g. "stalk_curlier(X, Y)", "stalk_curlier(X, Z)", "colour_darker(Z, Y)", and so on.
To choose among candidate literals FOIL uses the FOIL gain (FOIL gain):

    F_Gain = m̂+ × ( log2( m̂+ / (m̂+ + m̂−) ) − log2( m+ / (m+ + m−) ) ) ,    (15.3)

where m̂+ and m̂− are the numbers of positive and negative examples covered by the rule after the candidate literal is added, and m+ and m− are the numbers covered before. Unlike the information gain used for decision trees (Section 4.2.1), FOIL gain considers only the information carried by the positive examples and weights it by m̂+, the number of positives still covered; the reason is that in relational learning the positive examples are usually far scarcer than the negatives, so the positives deserve most of the attention. On data set 5.0, for example, FOIL scores every candidate literal for the rule "better(X, Y) ← " with (15.3), adds the best one, and continues until the rule covers no negative example; the covered positives are then removed and the next rule is grown on the remainder.

If the target predicate itself is allowed as a candidate body literal, FOIL can learn recursive rules; if negated literals are allowed, the learned rule set is often more compact. FOIL can be regarded as a bridge between propositional rule learning and inductive logic programming: its top-down generation of candidate literals cannot support functions or nested logical expressions, so its expressive power remains limited, but because it essentially lifts propositional rule learning through variable substitution, it is much more efficient than full inductive logic programming techniques.
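Eq. (15.3) transcribed directly; the variable names and the zero-coverage guard are the only additions.

```python
import math

def foil_gain(m_hat_pos: int, m_hat_neg: int, m_pos: int, m_neg: int) -> float:
    """FOIL gain of Eq. (15.3): new-rule coverage (m_hat_*) vs. old-rule coverage (m_*)."""
    if m_hat_pos == 0 or m_pos == 0:
        return 0.0
    new_info = math.log2(m_hat_pos / (m_hat_pos + m_hat_neg))
    old_info = math.log2(m_pos / (m_pos + m_neg))
    return m_hat_pos * (new_info - old_info)

# a candidate literal that narrows coverage from 10+/10- down to 6+/1-:
print(round(foil_gain(6, 1, 10, 10), 3))   # positive gain
```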

15.5 Inductive Logic Programming

Inductive logic programming (Inductive Logic Programming, ILP) gives first-order rule learning its full power by allowing functions and nested logical expressions inside rules. On the one hand this makes the learner far more expressive, and it lets ILP be viewed as machine-learning-based induction of logic programs (logic program) from examples and background knowledge, so that the learned rules can be used directly in logic programming languages such as PROLOG. On the other hand, the added expressiveness creates severe computational difficulties. For example, given a unary predicate P and a unary function f, the candidate literals already include P(X), P(f(X)), P(f(f(X))) and so on without end, so the set of atoms available during rule growth is infinite; a FOIL-style top-down procedure that enumerates all candidate literals to extend a rule would simply fail, and evaluating candidates by counting covered positives and negatives likewise becomes impractical once functions are nested.
15.5.1 Least General Generalization

ILP therefore usually adopts a bottom-up strategy: it starts from the most specific rules, written down directly from grounded facts (grounded fact), and generalizes them step by step. For example, from the relations recorded in Table 15.1 for melon 1 versus melons 10 and 15 we can write down two grounded rules

    better(1, 10) ← q1(1, 10) ∧ q2(1, 10) ∧ q3(1, 10) ;
    better(1, 15) ← q1(1, 15) ∧ q2(1, 15) ∧ q4(1, 15) ,

where q1, q2, . . . stand for pairwise comparison predicates of Table 15.1 (e.g. "has a curlier stalk", "has a more sunken navel"). These rules are maximally specific: each covers exactly one relational example.

The basic tool for generalizing such rules is the least general generalization (Least General Generalization, LGG) [Plotkin, 1970]. Given two first-order formulae r1 and r2, LGG starts by comparing the terms occurring at the same positions: LGG(t, t) = t, while for s ≠ t, LGG(s, t) = V, where V is a newly introduced variable that from then on replaces this particular pair of terms wherever it occurs. In the example above, comparing the arguments "(1, 10)" in r1 with "(1, 15)" in r2, the common constant 1 is kept and the differing constants 10 and 15 are replaced by a new variable Y, so every literal qi(1, 10) of r1 that has a counterpart qi(1, 15) in r2 generalizes to qi(1, Y). Literals whose predicate occurs in only one of the two rules (here q3 and q4) have no counterpart and are simply dropped, because the LGG may only keep what both rules share. This yields

    better(1, Y) ← q1(1, Y) ∧ q2(1, Y) .    (15.4)

Now take in addition the grounded rule for melons 2 and 10,

    better(2, 10) ← q1(2, 10) ∧ q2(2, 10) ∧ q5(2, 10) ∧ q6(2, 10) ,    (15.5)

and compute the LGG of (15.4) and (15.5). Comparing corresponding terms, LGG(2, 1) yields a new variable X and LGG(10, Y) yields a new variable Y2; after dropping the literals that do not appear in both rules we obtain

    better(X, Y2) ← q1(X, Y2) ∧ q2(X, Y2) ,

a first-order rule that mentions no concrete melon and covers all of the grounded rules above.

The computation above used only ground literals without negation; LGG can be extended to richer clause forms, see [Lavrac and Dzeroski, 1993]. It can be proved that LGG(r1, r2) is the most specific formula among all formulae that generalize both r1 and r2: no formula that generalizes r1 and r2 is properly more specific than it. In practice, ILP systems often use the relative least general generalization (Relative Least General Generalization, RLGG) [Plotkin, 1971], which takes the background knowledge K into account: instead of generalizing the bare examples, it combines each example e with the background knowledge K before the LGG is taken.
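A small sketch of the core LGG step on function-free atoms, under the assumptions that comparable literals in the two rule bodies use the same predicate symbol and that each pair of differing constants is mapped to one shared fresh variable; clause-level bookkeeping (heads, background knowledge) is omitted.

```python
from typing import Dict, List, Tuple

Atom = Tuple[str, Tuple[str, ...]]   # (predicate, arguments), e.g. ("better", ("1", "10"))

def lgg_atoms(a: Atom, b: Atom, var_map: Dict[Tuple[str, str], str]) -> Atom:
    """LGG of two atoms with the same predicate: equal terms are kept,
    differing term pairs are replaced by a shared fresh variable."""
    pred_a, args_a = a
    pred_b, args_b = b
    assert pred_a == pred_b and len(args_a) == len(args_b)
    new_args = []
    for s, t in zip(args_a, args_b):
        if s == t:
            new_args.append(s)
        else:
            var = var_map.setdefault((s, t), f"V{len(var_map)}")
            new_args.append(var)
    return (pred_a, tuple(new_args))

def lgg_bodies(r1: List[Atom], r2: List[Atom]) -> List[Atom]:
    """Keep only predicates occurring in both bodies and generalize them pairwise."""
    var_map: Dict[Tuple[str, str], str] = {}
    out = []
    for atom in r1:
        for other in r2:
            if atom[0] == other[0]:
                out.append(lgg_atoms(atom, other, var_map))
                break
    return out

body1 = [("q1", ("1", "10")), ("q2", ("1", "10")), ("q3", ("1", "10"))]
body2 = [("q1", ("1", "15")), ("q2", ("1", "15"))]
print(lgg_bodies(body1, body2))
# [('q1', ('1', 'V0')), ('q2', ('1', 'V0'))]  -- q3 is dropped, 10/15 -> V0
```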

15.5.2 Inverse Resolution

Deduction (deduction) and induction (induction) are two basic ways of reasoning: deduction specializes general knowledge to particular cases, while induction generalizes from particular facts, an idea whose study goes back to W. S. Jevons in the nineteenth century. In 1965 the logician J. A. Robinson proposed the resolution principle (resolution principle) for first-order predicate calculus [Robinson, 1965], which made automated theorem proving practical. Two decades later, S. Muggleton and W. Buntine introduced inverse resolution (inverse resolution) for inductive reasoning [Muggleton and Buntine, 1988], which played an important role in the development of ILP.

Resolution is easiest to describe in propositional logic. Let C1 = A ∨ L and C2 = B ∨ ¬L be two clauses containing the complementary literals L and ¬L, where A and B are disjunctions of other literals. Writing clause subtraction as

    (A ∨ B) − {B} = A ,    (15.6)

the resolvent of C1 and C2 on L is

    C = (C1 − {L}) ∨ (C2 − {¬L}) .    (15.7)

[Figure 15.3: resolution merges C1 = A ∨ L and C2 = B ∨ ¬L into C = A ∨ B; inverse resolution reconstructs one input clause from C and the other input clause.]

Inverse resolution asks the opposite question: given the resolvent C and one of the input clauses, how can the other clause be recovered? From (15.7) we obtain

    C1 = (C − (C2 − {¬L})) ∨ {L} ,    (15.8)
    C2 = (C − (C1 − {L})) ∨ {¬L} .    (15.9)
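A tiny propositional illustration of (15.7) and (15.9), with clauses represented as sets of signed literals; in general the inverse step is not unique, but for clauses of this simple shape it recovers C2 exactly.

```python
from typing import FrozenSet, Tuple

Lit = Tuple[str, bool]          # (symbol, polarity): ("L", True) is L, ("L", False) is ~L
Clause = FrozenSet[Lit]

def neg(l: Lit) -> Lit:
    return (l[0], not l[1])

def resolve(c1: Clause, c2: Clause, l: Lit) -> Clause:
    """Eq. (15.7): C = (C1 - {L}) OR (C2 - {~L})."""
    return (c1 - {l}) | (c2 - {neg(l)})

def inverse_resolve(c: Clause, c1: Clause, l: Lit) -> Clause:
    """Eq. (15.9): recover C2 = (C - (C1 - {L})) OR {~L} from C and C1."""
    return (c - (c1 - {l})) | {neg(l)}

A, B, L = ("A", True), ("B", True), ("L", True)
C1: Clause = frozenset({A, L})              # A or L
C2: Clause = frozenset({B, neg(L)})         # B or ~L
C = resolve(C1, C2, L)                      # A or B
print(sorted(C))                            # [('A', True), ('B', True)]
print(inverse_resolve(C, C1, L) == C2)      # True: C2 is recovered
```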

For inverse resolution, Muggleton [1995] defines four complete operations. Below, the lowercase letters p, q, r denote single literals (used here as rule heads) and the uppercase letters A, B, C denote conjunctions of literals; each operation induces the rules on the right from the rules on the left:

    absorption:           from  p ← A ∧ B   and  q ← A,        induce  p ← q ∧ B   and  q ← A ;                       (15.10)
    identification:       from  p ← A ∧ B   and  p ← A ∧ q,    induce  q ← B       and  p ← A ∧ q ;                   (15.11)
    intra-construction:   from  p ← A ∧ B   and  p ← A ∧ C,    induce  q ← B,  p ← A ∧ q   and  q ← C ;               (15.12)
    inter-construction:   from  p ← A ∧ B   and  q ← A ∧ C,    induce  p ← r ∧ B,  r ← A   and  q ← r ∧ C .           (15.13)

To apply resolution and inverse resolution in first-order logic we need the operations of substitution (substitution) and unification (unification). A substitution replaces the variables of a logical expression by terms. For example, applying θ = {1/X, 2/Y} to C = darker(X, Y) ∧ duller(X, Y) gives Cθ = darker(1, 2) ∧ duller(1, 2); the set of variables being replaced, here {X, Y}, is called the domain (domain) of θ. As in algebra, substitutions can be composed and inverted: θ = {Y/X} followed by λ = {1/Y} is the composite substitution θ ∘ λ, and θ⁻¹ = {X/Y} is the inverse of θ. Unification means making two or more logical expressions identical by a variable substitution. For example, A = darker(1, X) and B = darker(Y, 2) become identical under θ = {2/X, 1/Y}, since Aθ = Bθ = darker(1, 2); A and B are then said to be unifiable (unifiable) and θ is a unifier (unifier) of A and B. A unifier δ of a set W of expressions is a most general unifier (most general unifier, MGU) of W if for every unifier θ of W there is a substitution λ such that θ = δ ∘ λ. For instance, darker(1, Y) and darker(X, Y) are unified by θ1 = {1/X} and also by θ2 = {1/X, 2/Y}, among others, and θ1 is their MGU. The MGU is one of the most important notions in first-order logic.
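A minimal unification sketch for function-free atoms, enough to reproduce MGUs like the ones above. By convention here, capitalised argument names are variables, and the returned dictionary maps a variable to the term that replaces it (so {'X': '1'} corresponds to the substitution {1/X}); function terms and the occurs check are omitted.

```python
from typing import Dict, Optional, Tuple

Atom = Tuple[str, Tuple[str, ...]]

def is_var(t: str) -> bool:
    return t[:1].isupper()                 # convention: capitalised symbols are variables

def substitute(t: str, theta: Dict[str, str]) -> str:
    while is_var(t) and t in theta:        # follow chains such as X -> Y -> 1
        t = theta[t]
    return t

def mgu(a: Atom, b: Atom) -> Optional[Dict[str, str]]:
    """Most general unifier of two function-free atoms, or None if not unifiable."""
    (pa, args_a), (pb, args_b) = a, b
    if pa != pb or len(args_a) != len(args_b):
        return None
    theta: Dict[str, str] = {}
    for s, t in zip(args_a, args_b):
        s, t = substitute(s, theta), substitute(t, theta)
        if s == t:
            continue
        if is_var(s):
            theta[s] = t
        elif is_var(t):
            theta[t] = s
        else:                              # two distinct constants clash
            return None
    return theta

print(mgu(("darker", ("1", "Y")), ("darker", ("X", "Y"))))   # {'X': '1'}
print(mgu(("darker", ("1", "X")), ("darker", ("Y", "2"))))   # {'Y': '1', 'X': '2'}
```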
First-order resolution then works as follows. To resolve clauses C1 and C2, pick literals L1 ∈ C1 and L2 ∈ C2 such that L1 and ¬L2 have a most general unifier θ; the resolvent is

    C = (C1 − {L1})θ ∨ (C2 − {L2})θ .    (15.14)

For first-order inverse resolution, define the resolution quotient (resolution quotient): C/C2 plays the role of C1 and C/C1 the role of C2, so the task is, given C and C1, to construct C2 = C/C1. This requires choosing the literal L1 ∈ C1 that was resolved upon, together with the substitutions involved. Let θ1 be a substitution whose domain is vars(C1) and which maps the remainder of C1 onto literals of C, i.e.

    (C1 − {L1})θ1 ⊆ C ,    (15.15)

and let θ2 be a substitution whose domain is the variables of L1 that do not occur in C1 − {L1}; θ1 and θ2 together act on L1 so that ¬L1θ1θ2 can be unified with the literal L2 that is to appear in C2. By analogy with (15.9), the resolution quotient is then

    C2 = (C − (C1 − {L1})θ1)θ2⁻¹ ∨ {¬L1θ1θ2⁻¹} .    (15.16)

In general the choices of L1, θ1 and θ2 are not unique, so C2 is not determined uniquely either, and an ILP system must search among or select from the candidates.
As an illustration on the watermelon data set 5.0, take two learned rules of the forms

    C1 = p(1, X) ← A(1, X) ∧ B(1, X) ,
    C2 = p(1, Y) ← A(1, Y) ∧ D(1, 10) ,

where p is the target predicate and A, B, D are body literals built from the comparison predicates of Table 15.1. These match the patterns "p ← A ∧ B" and "p ← A ∧ C" required by the intra-construction operation (15.12). Applying (15.12) invents a clause C' whose head is a brand-new predicate q(M, N) that occurs nowhere in the data or the background knowledge; the remaining clauses of (15.12) are then obtained as the resolution quotients C1/C' and C2/C'. Computing these quotients requires choosing the resolved literal, e.g. L1 = q(M, N), and substitutions such as θ1 = {1/M, X/N}, after which (15.16) yields the new rules, in which the invented predicate appears applied to concrete terms, e.g. q(1, S).

A notable strength of inverse resolution is precisely this ability to invent new predicates automatically. Invented predicates may correspond to concepts that appear neither in the examples nor in the background knowledge, which can be valuable for knowledge discovery and refinement; what such a predicate actually means, however, can only be established through further use of the rules or by human experts. The example above involved just two rules; a practical ILP system works with many rules and a large body of background knowledge, and must decide to which clause pairs inverse resolution should be applied and which of the many possible results to keep.

15.6 Further Reading

Rule learning is the main representative of symbolism learning (symbolism learning) and one of the earliest machine learning technologies to be studied [Michalski, 1983]. [Fürnkranz et al., 2012] is an excellent book on rule learning.

The pioneering rule learning system is AQ [Michalski, 1969] (short for Algorithm Quasi-optimal), which grew into a large family of algorithms, including the well-known AQ15 [Michalski et al., 1986] and AQ17-HCI [Wnek and Michalski, 1994]. AQ grows rules from seed examples and is therefore affected by the order in which examples are chosen; PRISM [Cendrowska, 1987] was proposed as an improvement and is available in WEKA. CN2 [Clark and Niblett, 1989] combines ideas from AQ and decision-tree learning, uses beam search, and was designed with noisy data in mind. Pruning in rule learning is discussed in [Fürnkranz, 1994]. RIPPER [Cohen, 1995], whose generalization performance exceeds that of the decision-tree learner C4.5 while training faster, renewed the interest in rule learning. Early work on learning relational concepts goes back to [Winston, 1970]; FOIL remains the best-known first-order rule learner, and a variant of FOIL is used in the Never-Ending Language Learning (NELL) project [Carlson et al., 2010].

Inductive logic programming (ILP) was formally proposed by [Muggleton, 1991]. GOLEM [Muggleton and Feng, 1990] is an early representative ILP system; it adopts a bottom-up strategy based on the least general generalization (LGG) of [Plotkin, 1970], in the form of RLGG. PROGOL [Muggleton, 1995] instead uses inverse entailment (inverse entailment), an idea developed further in [Muggleton and Lin, 2013]. Rules learned by ILP can be used directly in logic programming languages, and PROLOG is also the implementation language of well-known ILP systems such as PROGOL [Muggleton, 1995] and ALEPH [Srinivasan, 1999]. The deductive database query language Datalog [Ceri et al., 1989] is closely related to ILP; part of its functionality has found its way into the SQL standard and into database systems such as IBM DB2. More on ILP can be found in [Muggleton, 1992; Lavrac and Dzeroski, 1993]; the main forum of the area is the International Conference on Inductive Logic Programming (ILP). ILP has been applied in domains such as biological data mining and natural language processing; see [Bratko and Muggleton, 1995].

Because of its strong expressive power, ILP faces a huge hypothesis space and thus has difficulty scaling to even moderately large problems. The rise of probabilistic graphical models led to combinations of probabilistic models with logical rules, known as probabilistic inductive logic programming (probabilistic ILP) [De Raedt et al., 2008]. Representative work includes relational Bayesian networks (relational Bayesian network) [Jaeger, 2002], probabilistic relational models [Friedman et al., 1999], Bayesian logic programs (Bayesian Logic Program) [Kersting et al., 2000] and Markov logic networks (Markov logic network) [Richardson and Domingos, 2006]; this line of research is collectively known as statistical relational learning (statistical relational learning) [Getoor and Taskar, 2007].
Exercises

[Exercises 15.1-15.10: among other things, they ask the reader to apply sequential covering and RIPPER to the watermelon data sets 2.0 (Table 4.1, p.76), 2.0α (p.86) and 5.0, to analyse the least general generalization (LGG) of two formulae r1 and r2 (including an LGG computation on data set 5.0), and, in the starred exercise 15.9*, to design an algorithm that computes the MGU of a set of atomic formulae of the form P(t1, t2, . . . , tn). Exercise 15.10 is also starred. The full wording of the exercises is not reproduced here.]
Bratko, I. and S. Muggleton. (1995). “Applications of inductive logic programming.” Communications of the ACM, 38(11):65-70.
Brunk, C. A. and M. J . Pazzani. (1991). “An investigation of noise-tolerant re-
lational concept learning algorithms.” In Proceedings of the 8th International
Workshop on Machine Learning ( IWML ) , 389-393, Evanston, IL.
Carlson , A., J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M.
Mitchell. ( 2010). “Toward an architecture for never-ending language learn-
ing.” In Proceedings of the 24th AAAI Conference on Artificial Intelligence
( AAAI ), 1306-1313, Atlanta, GA.
Cendrowska, J. (1987). “PRISM: An algorithm for inducing modular rules.” International Journal of Man-Machine Studies, 27(4):349-370.
Ceri, S., G. Gottlob, and L. Tanca. (1989). “What you always wanted to know about Datalog (and never dared to ask).” IEEE Transactions on Knowledge and Data Engineering, 1(1):146-166.
Clark, P. and T. Niblett. (1989). “The CN2 induction algorithm.” Machine Learning, 3(4):261-283.
Cohen, W. W. (1995). “Fast effective rule induction.” In Proceedings of the 12th
International Conference on Machine Learning ( ICML ) , 115-123, Tahoe,
CA.
De Raedt , L., P. Frasconi, K. Kersting, and S. Muggleton, eds. ( 2008) . Prob-
abilistic Inductive Logic Programming: Theory and Applications. Springer ,
Berlin.
Friedman, N., L. Getoor, D. Koller, and A. Pfeffer. (1999). “Learning prob-
abilistic relational models.” In Proceedings of the 16th International Joint
Conference on Artificial Intelligence ( IJCAI ), 1300-1307, Stockholm , Swe-
den.
Fürnkranz, J. (1994). “Top-down pruning in relational learning.” In Proceed-
ings of the 11th European Conference on Artificial Intelligence ( ECAI ) , 453-
457, Amsterdam, The Netherlands.
Fürnkranz, J., D. Gamberger, and N. Lavrac. (2012). Foundations of Rule
Learning. Springer, Berlin.
Fürnkranz, J. and G. Widmer. (1994). “Incremental reduced error pruning.”
In Proceedings of the 11th International Conference on Machine Learning
( ICML ), 70-77, New Brunswick , NJ.
Getoor , L. and B. Taskar. (2007) . Introduction to Statistical Relational Learn-
ing. MIT Press, Cambridge, MA.
Jaeger, M. (2002). “Relational Bayesian networks: A survey.” Electronic Transactions on Artificial Intelligence, 6: Article 15.
Kersting, K., L. De Raedt , and S. Kramer. (2000). “Interpreting Bayesian logic
programs.” In Proceedings of the AAAI-2000 Workshop on Learning Statis-
tical Models from Relational Data, 29-35, Austin, TX.
Lavrac, N. and S. Dzeroski. (1993). Inductive Logic Programming: Techniques
and Applications. Ellis Horwood , New York , NY.
Michalski, R. S. (1969). “On the quasi-minimal solution of the general covering
problem.” In Proceedings of the 5th International Symposium on Information
Processing ( FCIP ), volume A3, 125-128, Bled, Yugoslavia.
Michalski, R. S. (1983). “A theory and methodology of inductive learning.” In
Machine Learning: An Artificial Intelligence Approach ( R. S. Michalski, J.
Carbonell, and T. Mitchell, eds.) , 111-161, Tioga, Palo Alto, CA.
Michalski, R. S., I. Mozetic, J . Hong, and N. Lavrac. (1986) . “The multi-purpose
incremental learning system AQ15 and its testing application to three med-
ical domains.” In Proceedings of the 5th National Conference on Artificial
Intelligence ( AAAI ) , 1041-1045, Philadelphia, PA.
Muggleton, S. (1991). “Inductive logic programming.” New Generation Computing, 8(4):295-318.
Muggleton , S., ed. (1992) . Inductive Logic Programming. Academic Press, Lon-
don, UK .
Muggleton, S. (1995) . “Inverse entailment and Progol.” New Generation Com-
puting , 13(3-4):245-286.
Muggleton, S. and W. Buntine. (1988) . “Machine invention of first order predi-
cates by inverting resolution.” In Proceedings of the 5th International Work -
shop on Machine Learning (IWML), 339-352, Ann Arbor, MI.
Muggleton, S. and C. Feng. (1990) . “Efficient induction of logic programs.”
In Proceedings of the 1st International Workshop on Algorithmic Learning
Theory (ALT), 368-381, Tokyo, Japan.
Muggleton , S. and D. Lin. (2013) . “Meta-interpretive learning of higher-order
dyadic datalog: Predicate invention revisited.” In Proceedings of the 23rd In-
ternational Joint Conference on Artificial Intelligence ( IJCAI ), 1551-1557,
Beijing, China.
Plotkin, G. D. (1970). “A note on inductive generalization.” In Machine Intelligence 5 (B. Meltzer and D. Michie, eds.), 153-165, Edinburgh University Press, Edinburgh, Scotland.
Plotkin, G. D. (1971). “A further note on inductive generalization.” In Machine Intelligence 6 (B. Meltzer and D. Michie, eds.), 107-124, Edinburgh University Press, Edinburgh, Scotland.
Quinlan, J. R. (1990) . “Learning logical definitions from relations.” Machine
Learning , 5(3):239-266.
Richardson, M. and P. Domingos. (2006) . “Markov logic networks.” Machine
Learning , 62 (1-2 ):107-136.
Robinson, J . A. (1965). “A machine-oriented logic based on the resolution prin-
ciple.” Journal of the ACM , 12 (1):23-41.
Srinivasan, A. (1999). “The Aleph manual.” http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html.
Winston , P. H. (1970) . Learning structural descriptions from examples. Ph.D.
thesis, Department of Electrical Engineering, MIT, Cambridge, MA.
Wnek, J. and R. S. Michalski. (1994). “Hypothesis-driven constructive induction in AQ17-HCI: A method and experiments.” Machine Learning, 14(2):139-168.