Machine Learning (Zhou Zhihua)
Machine learning studies algorithms that produce models from data. A widely cited definition is due to [Mitchell, 1997]: a program is said to learn from experience E with respect to tasks T and performance measure P if its performance on T, as measured by P, improves with experience E. The learned result is a model, and the procedure that produces it from data is a learning algorithm; see also [Hand et al., 2001] for a data-analysis perspective.
1.2 Basic Terminology

A collection of records is a data set; each record describing an object or event is an instance or sample. The properties describing an object are its attributes or features, their values are attribute values, the space they span is the attribute space (sample space), and each instance corresponds to a feature vector in that space. Let D = {x_1, x_2, ..., x_m} be a data set of m samples, each described by d attributes, x_i = (x_i1; x_i2; ...; x_id) ∈ X, where x_ij is the value of x_i on the j-th attribute and d is the dimensionality of the sample space.

The process of producing a model from data is learning or training; the data used are the training data, each record a training instance (training sample), and the whole collection the training set. The learned model corresponds to a hypothesis about the underlying regularity, whose true form is the ground truth; the learning algorithm (learner) tries to approximate it. To make predictions we also need label information: an instance together with its label, (x_i, y_i), is an example, with y_i ∈ Y, the label space.

If the predicted value is discrete the task is classification; if continuous, regression. Binary classification distinguishes a positive class from a negative class; with more than two classes the task is multi-class classification. In general, learning builds a mapping f: X → Y from the training set {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}; for binary classification typically Y = {-1, +1} or {0, 1}, and for regression Y = R. After training, using the model to predict is testing, and the samples predicted on are testing samples (testing instances): for a test instance x the model outputs f(x). Learning may also be unsupervised, e.g. clustering, which partitions the training samples into clusters that may reveal unknown structure (see also Sec. 13.6).

The goal is a model that performs well on unseen instances, i.e. one with strong generalization. We usually assume every sample is drawn independently from an unknown distribution D, i.e. the samples are independent and identically distributed (i.i.d.); the more training samples, the more information about D and, in general, the better the expected generalization.
1.3 Hypothesis Space

Induction and deduction are the two basic modes of reasoning: induction generalizes from particular facts to general laws, deduction specializes general laws to particular cases. Learning from examples is an inductive process, hence also called inductive learning; in its narrow sense it means learning a concept from training data (concept learning), the simplest case being Boolean concept learning such as deciding whether a watermelon is ripe.

Learning can be viewed as a search through a hypothesis space. For the data of Table 1.1, a conjunctive hypothesis has the form "(attribute_1 = ?) ∧ (attribute_2 = ?) ∧ (attribute_3 = ?)", where each attribute may take any of its values or the wildcard "*" meaning "any value"; more expressive spaces allow disjunctions such as (A ∧ B) ∨ (C ∧ D). Counting all wildcard combinations plus the hypothesis that no watermelon is ripe, the hypothesis space for Table 1.1 contains 4 × 3 × 3 + 1 = 37 hypotheses. The search may proceed from general to specific or from specific to general, deleting hypotheses inconsistent with the training examples. Usually several hypotheses remain consistent with the training data; the set of all of them is called the version space (Figure 1.2 shows the version space for Table 1.1).
1.4 Inductive Bias

A learning algorithm must commit to preferences among the hypotheses that are equally consistent with the training data: it may, for example, prefer hypotheses that are as specific as possible, or as general as possible, or that involve certain attributes. Such a preference is the inductive bias of the algorithm; feature selection, for instance, reflects a bias about which attributes matter. Any effective algorithm must have some bias, otherwise it could not choose among the hypotheses in the version space and would produce no definite prediction on new samples.

A general guiding principle for forming a bias is Occam's razor: among hypotheses consistent with the observations, choose the simplest. In Figure 1.3, both curve A and curve B pass through the training points (x, y), and the smoother curve A is preferred. But no bias is universally right: as Figure 1.4 shows, if the unseen samples happen to follow curve B rather than A, then B outperforms A. In other words, for any algorithm L_a whose bias suits some problems, there are other problems on which another algorithm L_b does better, and this holds even if L_b is random guessing. This can be made precise as follows.
For simplicity consider binary classification with a discrete sample space X and hypothesis space H. Let P(h | X, L_a) be the probability that algorithm L_a outputs hypothesis h given training data X, and let f be the true target function. The off-training-set error of L_a is

  E_ote(L_a | X, f) = Σ_h Σ_{x ∈ X\X} P(x) · I(h(x) ≠ f(x)) · P(h | X, L_a) ,   (1.1)

where I(·) is the indicator function, equal to 1 if its argument is true and 0 otherwise. Now sum over all possible target functions, assumed uniformly distributed over the 2^{|X|} mappings from X to {0, 1}:

  Σ_f E_ote(L_a | X, f)
    = Σ_f Σ_h Σ_{x ∈ X\X} P(x) · I(h(x) ≠ f(x)) · P(h | X, L_a)
    = Σ_{x ∈ X\X} P(x) Σ_h P(h | X, L_a) Σ_f I(h(x) ≠ f(x))
    = Σ_{x ∈ X\X} P(x) Σ_h P(h | X, L_a) · (1/2) 2^{|X|}
    = 2^{|X|-1} Σ_{x ∈ X\X} P(x) · 1 .   (1.2)

The total error is independent of the learning algorithm, so for any two algorithms L_a and L_b,

  Σ_f E_ote(L_a | X, f) = Σ_f E_ote(L_b | X, f) .   (1.3)
This is the No Free Lunch (NFL) theorem [Wolpert, 1996; Wolpert and Macready, 1995]: averaged over all possible problems, every learning algorithm, even a random guesser, has the same expected off-training-set performance, so no algorithm A is universally better than algorithm B. The crucial premise, however, is that all target functions are equally likely, which is almost never the case: real tasks concern specific, highly non-uniform problem distributions. The real lesson of the NFL theorem is that discussing "which learning algorithm is better" is meaningless without reference to the concrete problem: an algorithm whose inductive bias matches the problem performs well there, and learning research must therefore be problem-oriented.
1.5 A Brief History

Machine learning grew out of artificial intelligence research. In the "reasoning period" of the 1950s-1970s, represented by the work of A. Newell and H. Simon, it was hoped that logical reasoning alone would yield intelligence; the subsequent "knowledge period", represented by E. A. Feigenbaum's expert systems such as DENDRAL, emphasized hand-coded knowledge, and A. Samuel's checkers program was an early demonstration of a machine that learns. From the mid-1970s machine learning became a field of its own, with connectionist approaches (F. Rosenblatt's Perceptron, B. Widrow's Adaline), symbolic approaches (P. Winston, R. S. Michalski, E. B. Hunt) and statistical approaches (e.g. N. J. Nilsson's learning machines). The first International Workshop on Machine Learning (IWML) was held in 1980 and later grew into ICML; the influential collection edited by R. S. Michalski, J. G. Carbonell and T. Mitchell was published by Tioga [Michalski et al., 1983], the journal Machine Learning was founded shortly after, and surveys of the field's directions appeared in [Carbonell, 1990] and in E. A. Feigenbaum's Handbook of Artificial Intelligence [Cohen and Feigenbaum, 1983].

In the 1980s symbolic learning dominated, notably decision-tree learning and Inductive Logic Programming (ILP), which combines learning with Prolog-style logical representations; ILP is very expressive but its hypothesis space is so large that learning is hard to scale. Connectionism revived in the mid-1980s: although M. Minsky and S. Papert had shown the limitations of simple perceptrons, J. J. Hopfield's network tackled NP-hard combinatorial problems, and the BP (backpropagation) algorithm popularized by D. E. Rumelhart and colleagues made multi-layer networks trainable; BP remains one of the most widely used learning algorithms, although connectionist learning involves much trial-and-error parameter tuning. In the mid-1990s statistical learning came to the fore, represented by the Support Vector Machine (SVM) and, more broadly, kernel methods, building on the VC theory of V. N. Vapnik and A. J. Chervonenkis [Vapnik, 1998] (see Sec. 6.5); the kernel trick has since spread throughout machine learning, and in the early twenty-first century connectionism returned once more in the form of deep learning (Sec. 5.6).
1.6 Applications

Machine learning now underpins much of computer science and beyond, including the design of processors such as the Intel x86 family's optimization tooling and, above all, data analysis in the era of big data. NASA JPL scientists wrote in Science [Mjolsness and DeCoste, 2001] that machine learning plays an increasingly important role in supporting scientific research; DARPA's PAL program applied learning to personal assistants, and agencies such as NASA and DARPA have funded it heavily. T. Mitchell and others have highlighted applications ranging from crowdsourcing to autonomous driving, e.g. the ALVINN system and S. Thrun's vehicle that won the DARPA grand challenge, as well as commercial data analysis (e.g. R. Ghani's work). Machine learning also supplies the core techniques of data mining [Zhou, 2003], and even models not designed for learning, such as P. Kanerva's Sparse Distributed Memory (SDM) [Kanerva, 1988], turn out to parallel learning models closely.
1.7 Further Reading

[Mitchell, 1997] is the first textbook devoted to machine learning; [Duda et al., 2001], [Alpaydin, 2004], [Flach, 2012], [Hastie et al., 2009], [Bishop, 2006] and [Shalev-Shwartz and Ben-David, 2014] are other important textbooks and monographs from various perspectives. [Witten et al., 2011] accompanies WEKA, a well-known open-source machine-learning suite in JAVA developed at the University of Waikato (http://www.cs.waikato.ac.nz/ml/weka/). The collection [Michalski et al., 1983], continued by Morgan Kaufmann, and the Handbook of Artificial Intelligence [Cohen and Feigenbaum, 1983] document the early development of the field; [Dietterich, 1997] reviews later directions. The discussion of inductive bias relates to the principle of multiple explanations attributed to Epicurus [Asmis, 1984], an idea echoed in ensemble learning (Chapter 8).

The most important conferences in machine learning are ICML, NIPS and COLT, with ECML and ACML the leading regional conferences in Europe and Asia; the most important journals are Journal of Machine Learning Research and Machine Learning. Significant work also appears at the major AI conferences IJCAI and AAAI, in Artificial Intelligence and Journal of Artificial Intelligence Research, and at the data-mining venues KDD and ICDM, ACM Transactions on Knowledge Discovery from Data and Data Mining and Knowledge Discovery. In China the main domestic venues are the China Conference on Machine Learning (CCML) and the Machine Learning and Applications workshop (MLA).
1.1 1.1
1.2
G *) *) )
(( A *) A
)
)
A A )”
)”
( A = a) V ( A = )
1.1
A *)
*
o 1.3
1.4* 1.4
A
(1.1)
Eote ( 2a \ XJ ) =
h
E-x P ( x )Hh ( x ) J ( x ) ) P ( h \ X ,& )
J2 xex a
1.5
(1996 ) .
. 2007). 3(12 ) :35-44.
( 2012 ) .
Alpaydin , E. ( 2004) . Introduction to Machine Learning. MIT Press, Cambridge,
MA.
Asmis, E. (1984) . Epicurus’ Scientific Method. Cornell University Press , Ithaca,
NY.
Bishop, C. M. ( 2006 ) . Pattern Recognition and Machine Learning. Springer ,
New York , NY.
Blumer, A., A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. (1987). "Occam's razor." Information Processing Letters, 24(6):377-380.
Carbonell, J. G., ed. (1990 ) . Machine Learning: Paradigms and Methods. MIT
Press, Cambridge, MA.
Cohen , P. R. and E. A. Feigenbaum , eds. (1983). The Handbook of Artificial
Intelligence , volume 3. William Kaufmann, New York , NY.
Dietterich, T. G. (1997). "Machine learning research: Four current directions." AI Magazine, 18(4):97-136.
Domingos, P. (1999). “The role of Occam’s razor in knowledge discovery.” Data
Mining and Knowledge Discovery, 3( 4):409-425.
Duda, R. O . P. E. Hart , and D. G. Stork. ( 2001) . Pattern Classification, 2nd
edition. John Wiley Sons, New York , NY.
Flach , P. ( 2012 ) . Machine Learning: The Art and Science of Algorithms that
Make Sense of Data. Cambridge University Press , Cambridge, UK .
Hand , D., H. Mannila, and P. Smyth. (2001) . Principles of Data Mining. MIT
Press, Cambridge, MA.
Hastie, T., R. Tibshirani, and J . Friedman. ( 2009 ) . The Elements of Statistical
Learning 2nd edition. Springer , New York , NY.
Hunt , E. G. and D. I. Hovland. (1963). “Programming a model of human con-
cept formation.” In Computers and Thought (E. Feigenbaum and J. Feldman,
eds.) , 310-325, McGraw Hill, New York, NY.
Kanerva, P. (1988) . Sparse Distributed Memory. MIT Press, Cambridge, MA.
Michalski, R. S., J. G. Carbonell, and T. M. Mitchell, eds. (1983) . Machine
Learning: An Artificial Intelligence Approach. Tioga, Palo Alto, CA.
Mitchell, T. (1997) . Machine Learning. McGraw Hill, New York , NY.
Mitchell, T. M. (1977) . “Version spaces: A candidate elimination approach to
rule learning.” In Proceedings of the 5th International Joint Conference on
Artificial Intelligence ( IJCAI ) , 305-310, Cambridge, MA.
Mjolsness, E. and D. DeCoste. (2001) . “Machine learning for science: State of
the art and future prospects.” Science , 293(5537):2051-2055.
Pan, S. J. and Q. Yang. (2010). "A survey of transfer learning." IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359.
Shalev-Shwartz , S. and S. Ben-David. (2014) . Understanding Machine Learn-
ing. Cambridge University Press, Cambridge, UK.
Simon, H. A. and G. Lea. (1974). "Problem solving and rule induction: A unified view." In Knowledge and Cognition (L. W. Gregg, ed.), 105-127, Lawrence Erlbaum, Potomac, MD.
2.1 Empirical Error and Overfitting

If a classifier misclassifies a of the m samples, its error rate is E = a/m and its accuracy is 1 - a/m. The error on the training set is the training error or empirical error; the error on new samples is the generalization error. We want small generalization error but can only minimize the empirical error, and a learner that fits the training data too closely may have learned peculiarities of the training samples that do not generalize: this is overfitting (Figure 2.1). Its opposite, underfitting, means the learner has not captured even the general properties of the training data; underfitting is easy to remedy (e.g. more branches in a decision tree, more training epochs for a neural network), whereas overfitting can only be alleviated, never fully avoided. Indeed, machine-learning problems are typically NP-hard or worse while practical algorithms run in polynomial time; if overfitting could be avoided entirely, minimizing the empirical error would yield the optimum and we would have constructively shown "P = NP". Since different algorithms and parameter settings produce different models, we face the model selection problem, which raises the question of how to evaluate and compare models.
2.2 Evaluation Methods

Since the generalization error cannot be measured directly and the training error is biased by overfitting, a learner is evaluated on a testing set assumed to be sampled i.i.d. from the true distribution, and the testing error is used as an approximation of the generalization error; test samples should not appear in the training set. Given a data set D = {(x_1, y_1), ..., (x_m, y_m)}, we must therefore produce from it a training set S and a testing set T.

2.2.1 Hold-out

The hold-out method splits D into two disjoint sets, D = S ∪ T with S ∩ T = ∅: the model is trained on S and evaluated on T. For example, with |D| = 1000 split into |S| = 700 and |T| = 300, a model that misclassifies 90 test samples has error rate 90/300 = 30% and accuracy 70%. The split should preserve the class proportions of D, i.e. use stratified sampling, otherwise the difference in class distributions introduces extra bias. Even with stratification many splits are possible, so a single hold-out estimate is unstable; in practice the random split is repeated several times and the results averaged. There is also an inherent trade-off: a large S gives a model closer to the one trained on all of D but a less reliable evaluation on the small T; a large T gives a more stable evaluation but a model further from the one we ultimately want (lower fidelity). A common practice is to use about 2/3 to 4/5 of the samples for training and the rest for testing [Mitchell, 1997].
2.2.2 Cross-Validation

Cross-validation partitions D into k disjoint subsets of similar size, D = D_1 ∪ D_2 ∪ ... ∪ D_k with D_i ∩ D_j = ∅ (i ≠ j), each obtained by stratified sampling. In each of k rounds the union of k-1 subsets is used for training and the remaining subset for testing, and the k test results are averaged; this is k-fold cross-validation, with k = 10 the most common choice (also 5 and 20), illustrated in Figure 2.2. Since the partition itself is random, the whole procedure is usually repeated p times with different partitions, e.g. "10 times 10-fold cross-validation".

If k = m each subset contains a single sample: this is leave-one-out (LOO). LOO is unaffected by the random partition and each model is trained on m-1 samples, so it is often regarded as quite accurate; however, it requires training m models, which is expensive for large data sets (worse still with parameter tuning), and by the NFL theorem (Sec. 1.4) its estimates are not guaranteed to beat those of other methods.
2.2.3 Bootstrapping

Hold-out and cross-validation train on fewer samples than D contains, which introduces bias, while LOO is expensive; bootstrapping [Efron and Tibshirani, 1993] offers a compromise. Given D with m samples, build D' by bootstrap sampling: m times, draw a sample from D uniformly at random with replacement and copy it into D'. Some samples appear several times in D', others not at all; the probability that a given sample is never drawn in m draws is (1 - 1/m)^m, and

  lim_{m→∞} (1 - 1/m)^m = 1/e ≈ 0.368 ,   (2.1)

so roughly 36.8% of the samples of D do not appear in D'. Training on D' and testing on D \ D' gives an out-of-bag estimate that uses all m samples for training while still reserving about a third of the data for testing. Bootstrapping is useful when data are scarce or when many different training sets are needed (as in ensemble learning, Chapter 8), but the bootstrap sample changes the data distribution and thus introduces estimation bias; with enough data, hold-out and cross-validation are preferred.
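A minimal Python/numpy sketch (my own, not from the book) of bootstrap sampling: it draws m indices with replacement and checks that the out-of-bag fraction is close to the 1/e ≈ 0.368 limit of (2.1); the function name bootstrap_split is my own.

```python
import numpy as np

def bootstrap_split(m, rng):
    """Return bootstrap (in-bag) indices and out-of-bag indices for m samples."""
    in_bag = rng.integers(0, m, size=m)          # m draws with replacement
    oob = np.setdiff1d(np.arange(m), in_bag)     # samples never drawn
    return in_bag, oob

rng = np.random.default_rng(0)
m = 10000
_, oob = bootstrap_split(m, rng)
print(f"out-of-bag fraction: {len(oob) / m:.3f}  (limit 1/e = {np.exp(-1.0):.3f})")
```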
2.2.4 Parameter Tuning and the Final Model

Most algorithms have parameters whose settings strongly affect the resulting model, so evaluation involves parameter tuning. Real-valued parameters cannot be tried exhaustively; in practice one chooses a range and a step, e.g. candidate values in [0, 0.2] with step 0.05, giving five candidates per parameter, and even then the number of configurations grows quickly with the number of parameters. Once the algorithm and parameters have been selected, the model should be retrained on the full data set D before delivery. The data used during model selection to assess candidates is called the validation set, to distinguish it from the test data used to estimate generalization.
2.3 Performance Measures

Comparing models requires performance measures appropriate to the task. Given D = {(x_1, y_1), ..., (x_m, y_m)}, where y_i is the true label of x_i, we compare the learner's prediction f(x) with y.

For regression the most common measure is the mean squared error,

  E(f; D) = (1/m) Σ_{i=1}^{m} (f(x_i) - y_i)² ,   (2.2)

and more generally, for a distribution D with density p(·),

  E(f; D) = ∫_{x~D} (f(x) - y)² p(x) dx .   (2.3)

2.3.1 Error Rate and Accuracy

For classification, the error rate and accuracy on D are

  E(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) ≠ y_i) ,   (2.4)
  acc(f; D) = (1/m) Σ_{i=1}^{m} I(f(x_i) = y_i) = 1 - E(f; D) ,   (2.5)

and for a distribution D with density p(·),

  E(f; D) = ∫_{x~D} I(f(x) ≠ y) p(x) dx ,   (2.6)
  acc(f; D) = ∫_{x~D} I(f(x) = y) p(x) dx = 1 - E(f; D) .   (2.7)
2.3.2 Precision, Recall and F1

Error rate and accuracy cannot answer questions such as "what fraction of the retrieved Web pages are relevant" or "what fraction of the relevant pages were retrieved"; precision and recall are the right measures for these. For binary classification, the combinations of true and predicted classes give the confusion matrix of Table 2.1 with entries TP (true positive), FN (false negative), FP (false positive) and TN (true negative). Precision P and recall R are

  P = TP / (TP + FP) ,   (2.8)
  R = TP / (TP + FN) .   (2.9)

Precision and recall are usually in tension: raising one tends to lower the other. Sorting the samples by how confidently the learner predicts "positive" and moving the decision point through the ranking traces out a precision-recall (P-R) curve, with recall on the horizontal and precision on the vertical axis (Figure 2.3). If the P-R curve of one learner entirely encloses another's, the former is better, e.g. A is better than C in Figure 2.3; when two curves cross, as A and B do, the comparison depends on the region of interest. The area under the P-R curve is one summary but is hard to compute; a simpler one is the Break-Even Point (BEP), the value at which precision equals recall.
The BEP is rather simplistic; the F1 measure is more common:

  F1 = 2 × P × R / (P + R) = 2 × TP / (m + TP - TN) .   (2.10)

F1 is the harmonic mean of precision and recall: 1/F1 = (1/2)(1/P + 1/R). When the two are not equally important, the weighted form Fβ is used,

  Fβ = (1 + β²) × P × R / ((β² × P) + R) ,   (2.11)

where β > 0 measures the relative importance of recall over precision [Van Rijsbergen, 1979]: β = 1 recovers F1, β > 1 favours recall, β < 1 favours precision.

When several binary confusion matrices are available, e.g. from multiple training/testing rounds, multiple data sets, or the pairwise decomposition of a multi-class task, they can be averaged in two ways. Macro-averaging computes precision and recall on each matrix and then averages:

  macro-P = (1/n) Σ_{i=1}^{n} P_i ,   (2.12)
  macro-R = (1/n) Σ_{i=1}^{n} R_i ,   (2.13)
  macro-F1 = 2 × macro-P × macro-R / (macro-P + macro-R) .   (2.14)

Micro-averaging first averages the matrix entries, giving TP̄, FP̄, TN̄, FN̄, and then computes

  micro-P = TP̄ / (TP̄ + FP̄) ,   (2.15)
  micro-R = TP̄ / (TP̄ + FN̄) ,   (2.16)
  micro-F1 = 2 × micro-P × micro-R / (micro-P + micro-R) .   (2.17)
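A small Python/numpy sketch (my own, not the book's) that computes precision, recall and Fβ from binary labels, and macro-/micro-averages over several confusion matrices following (2.8)-(2.17); the data are made up.

```python
import numpy as np

def confusion(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    return tp, fp, fn, tn

def prf(tp, fp, fn, beta=1.0):
    p = tp / (tp + fp) if tp + fp > 0 else 0.0                  # (2.8)
    r = tp / (tp + fn) if tp + fn > 0 else 0.0                  # (2.9)
    b2 = beta ** 2
    f = (1 + b2) * p * r / (b2 * p + r) if p + r > 0 else 0.0   # (2.10)/(2.11)
    return p, r, f

# two folds' predictions, e.g. from cross-validation
folds = [(np.array([1, 1, 0, 0, 1]), np.array([1, 0, 0, 1, 1])),
         (np.array([1, 0, 0, 0, 1]), np.array([1, 0, 1, 0, 0]))]
mats = [confusion(t, p) for t, p in folds]

# macro-averaging: average P and R first                         (2.12)-(2.14)
ps, rs = zip(*[prf(tp, fp, fn)[:2] for tp, fp, fn, tn in mats])
macro_p, macro_r = np.mean(ps), np.mean(rs)
macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r)

# micro-averaging: average the confusion-matrix entries first    (2.15)-(2.17)
tp_, fp_, fn_, tn_ = np.mean(mats, axis=0)
micro_p, micro_r, micro_f1 = prf(tp_, fp_, fn_)

print(f"macro-F1 = {macro_f1:.3f}, micro-F1 = {micro_f1:.3f}")
```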
2.3.3 ROC and AUC

Many learners produce a real-valued score or probability and compare it with a classification threshold (cut point), e.g. predicting positive when the value in [0.0, 1.0] exceeds 0.5. Ranking the test samples by this value, the choice of cut point reflects the relative importance of the two kinds of error, while the quality of the ranking itself reflects expected generalization. The ROC (Receiver Operating Characteristic) curve studies learners from this ranking perspective: like the P-R curve it is drawn by moving the cut point through the ranked samples, but its axes are the true positive rate and false positive rate,

  TPR = TP / (TP + FN) ,   (2.18)
  FPR = FP / (TN + FP) .   (2.19)

The diagonal corresponds to random guessing and the point (0, 1) to the ideal ranking that places all positives before all negatives. With a finite test set of m+ positives and m- negatives, the curve is drawn stepwise from (0, 0): each positive encountered in the ranking moves the curve up by 1/m+, each negative moves it right by 1/m- (Figure 2.4).

If one learner's ROC curve encloses another's, it is better; when curves cross, a common summary is the area under the ROC curve, the AUC, estimated from the curve points {(x_1, y_1), ..., (x_m, y_m)} by

  AUC = (1/2) Σ_{i=1}^{m-1} (x_{i+1} - x_i)(y_i + y_{i+1}) .   (2.20)

The AUC is closely tied to ranking quality. Define the ranking loss

  ℓ_rank = (1/(m+ m-)) Σ_{x+∈D+} Σ_{x-∈D-} ( I(f(x+) < f(x-)) + (1/2) I(f(x+) = f(x-)) ) ,   (2.21)

a full penalty for each positive-negative pair ranked in the wrong order and half a penalty for ties; then

  AUC = 1 - ℓ_rank .   (2.22)
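A small Python sketch (mine, with made-up scores) that estimates AUC by computing the ranking loss of (2.21) and using AUC = 1 - ℓ_rank from (2.22).

```python
import numpy as np

def auc_from_scores(scores, labels):
    """AUC = 1 - l_rank; l_rank penalizes wrongly ordered +/- pairs, ties count 1/2."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]          # every positive score vs every negative score
    l_rank = (np.sum(diff < 0) + 0.5 * np.sum(diff == 0)) / (len(pos) * len(neg))  # (2.21)
    return 1.0 - l_rank                                                            # (2.22)

scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3])
labels = np.array([1,   1,   0,   1,   0,    0,   0  ])
print(f"AUC = {auc_from_scores(scores, labels):.3f}")
```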
2.3.4 Cost-Sensitive Error Rate and Cost Curves

Different errors often incur unequal cost: predicting a sick patient healthy is usually worse than the reverse. A cost matrix assigns cost_ij to predicting class i as class j (cost_ii = 0). For binary classification with class 0 positive and class 1 negative, the cost-sensitive error rate replaces (2.4) by

  E(f; D; cost) = (1/m) ( Σ_{x_i∈D+} I(f(x_i) ≠ y_i) × cost_01 + Σ_{x_i∈D-} I(f(x_i) ≠ y_i) × cost_10 ) .   (2.23)

Under unequal costs the ROC curve no longer reflects expected total cost; the cost curve does. Its horizontal axis is the probability cost of the positive class,

  P(+)cost = p × cost_01 / ( p × cost_01 + (1 - p) × cost_10 ) ,   (2.24)

with p the prior probability of the positive class, and its vertical axis is the normalized expected cost,

  cost_norm = ( FNR × p × cost_01 + FPR × (1 - p) × cost_10 ) / ( p × cost_01 + (1 - p) × cost_10 ) ,   (2.25)

where FPR is given by (2.19) and FNR = 1 - TPR is the false negative rate. Each point (TPR, FPR) on a ROC curve maps to the segment from (0, FPR) to (1, FNR) in the cost plane; drawing all such segments and taking their lower envelope gives the cost curve, whose enclosed area is the learner's expected total cost over all conditions (Figure 2.5).
2.4 Comparison Tests

Having performance estimates, how do we compare learners? Test performance is not generalization performance, it depends on the particular test set, and many algorithms are themselves randomized, so statistical hypothesis tests are the appropriate machinery [Wellek, 2010]: if a test concludes that learner A beats learner B on the observed test results, then A's generalization performance is better with a stated confidence. Below ε denotes the error rate.

2.4.1 Hypothesis Test

Consider the hypothesis "ε = ε_0" for a single learner. The generalization error ε is unknown; we only observe the test error ε̂. On a test set of m samples, a learner with generalization error ε misclassifies ε̂ × m samples with probability

  P(ε̂; ε) = C(m, ε̂m) ε^{ε̂m} (1 - ε)^{m - ε̂m} ,   (2.26)

which, as a function of ε, is maximized at ε = ε̂ (set ∂P(ε̂; ε)/∂ε = 0); this is the binomial distribution sketched in Figure 2.6 for m = 10, ε = 0.3. A one-sided binomial test of "ε ≤ ε_0" at significance level α (commonly 0.05 or 0.1) finds the largest error rate whose exceedance probability is below α,

  ε̄ = max ε  such that  Σ_{i = ε_0·m + 1}^{m} C(m, i) ε^i (1 - ε)^{m-i} < α .   (2.27)

If ε̂ < ε̄ the hypothesis cannot be rejected, i.e. with confidence 1 - α the generalization error is at most ε_0; otherwise it is rejected. (Critical values: qbinom(1 - α, m, ε_0) in R, or icdf('Binomial', 1 - α, m, ε_0) in MATLAB; see www.r-project.org.)

Often we have k test error rates ε̂_1, ..., ε̂_k from repeated hold-out or cross-validation; then a t-test is used. With

  μ = (1/k) Σ_{i=1}^{k} ε̂_i ,   (2.28)
  σ² = (1/(k-1)) Σ_{i=1}^{k} (ε̂_i - μ)² ,   (2.29)

the statistic

  τ_t = √k (μ - ε_0) / σ   (2.30)

follows a t-distribution with k - 1 degrees of freedom (Figure 2.7 shows its density for k = 10). For the two-tailed hypothesis "μ = ε_0" at level α, the hypothesis cannot be rejected when |μ - ε_0| lies within the interval determined by the critical values ±t_{α/2, k-1} (qt(1 - α/2, k - 1) in R, or the corresponding MATLAB icdf call); Table 2.3 lists commonly used values for α = 0.05 and 0.1.
2.4.2 Cross-Validated Paired t-Test

To compare two learners A and B, run k-fold cross-validation to obtain paired test error rates ε̂_1^A, ..., ε̂_k^A and ε̂_1^B, ..., ε̂_k^B, the i-th pair coming from the same fold, and compute the differences Δ_i = ε̂_i^A - ε̂_i^B. Under the hypothesis that the learners perform the same, the Δ_i have zero mean; with μ and σ² the mean and variance of Δ_1, ..., Δ_k, the statistic

  τ_t = | √k μ / σ |   (2.31)

is compared with t_{α/2, k-1}: if smaller, the hypothesis of equal performance cannot be rejected, otherwise the learner with the smaller mean error is judged better.

A valid t-test needs independent error estimates, which is violated because training sets of different folds overlap; the 5×2 cross-validation of [Dietterich, 1998] alleviates this. Perform 2-fold cross-validation five times, shuffling the data each time; let Δ_i^1 and Δ_i^2 be the differences on the two folds of repetition i. Use only the first repetition for the mean, μ = 0.5(Δ_1^1 + Δ_1^2), compute each repetition's variance σ_i² = (Δ_i^1 - (Δ_i^1 + Δ_i^2)/2)² + (Δ_i^2 - (Δ_i^1 + Δ_i^2)/2)², and use

  τ_t = μ / √( (1/5) Σ_{i=1}^{5} σ_i² ) ,   (2.32)

which follows a t-distribution with 5 degrees of freedom.

2.4.3 McNemar Test

For binary classification with a single hold-out test set, record how the two learners agree and disagree: e_00 (both wrong), e_01 (A wrong, B right), e_10 (A right, B wrong), e_11 (both right). If they perform the same, e_01 and e_10 should be similar, and

  τ_χ² = ( |e_01 - e_10| - 1 )² / ( e_01 + e_10 )   (2.33)

follows a χ² distribution with one degree of freedom. If τ_χ² is below the critical value (3.8415 at α = 0.05, 2.7055 at α = 0.1; qchisq(1 - α, 1) in R, or the corresponding MATLAB icdf call), the hypothesis of equal performance cannot be rejected; otherwise the learner with the smaller average error is judged better.
2.4.4 Friedman Test and Nemenyi Post-hoc Test

The paired t-test and the McNemar test compare two algorithms on one data set; to compare several algorithms on several data sets, the rank-based Friedman test is used. Suppose algorithms A, B and C are compared on data sets D_1, ..., D_4: on each data set the algorithms are ranked 1, 2, 3, ... by test performance (ties share the average rank), and each algorithm's average rank r_i over the data sets is computed, as in Table 2.5 (e.g. average ranks 1, 2.125 and 2.875).

If all algorithms perform the same, their average ranks should be equal. With N data sets and k algorithms, r_i has mean (k+1)/2 and variance (k²-1)/12 under this hypothesis, and

  τ_χ² = (12N / (k(k+1))) ( Σ_{i=1}^{k} r_i² - k(k+1)²/4 )   (2.34)

follows a χ² distribution with k-1 degrees of freedom when k and N are large. This "original" Friedman test is conservative; the commonly used variant is

  τ_F = (N - 1) τ_χ² / ( N(k - 1) - τ_χ² ) ,   (2.35)

which follows an F distribution with k-1 and (k-1)(N-1) degrees of freedom. Table 2.6 lists commonly used F critical values (qf(1 - α, k-1, (k-1)(N-1)) in R, or the corresponding MATLAB icdf call).

If the hypothesis of equal performance is rejected, a post-hoc test distinguishes the algorithms; the Nemenyi test is commonly used. It computes the critical difference of average ranks

  CD = q_α √( k(k+1) / (6N) ) ,   (2.36)

where q_α is based on the Studentized range (Tukey) statistic divided by √2 (qtukey(1 - α, k, Inf)/sqrt(2) in R); Table 2.7 lists q_α values:

Table 2.7  Commonly used q_α values for the Nemenyi test
  k:        2      3      4      5      6      7      8      9      10
  α = 0.05  1.960  2.344  2.569  2.728  2.850  2.949  3.031  3.102  3.164
  α = 0.10  1.645  2.052  2.291  2.459  2.589  2.693  2.780  2.855  2.920

If the gap between two algorithms' average ranks exceeds CD, the hypothesis that they perform the same is rejected at the corresponding confidence level.

For the data of Table 2.5, (2.34) and (2.35) give τ_F = 24.429, which exceeds the critical value 5.143 of F at α = 0.05 (k = 3, N = 4), so the algorithms differ significantly. The Nemenyi test with q_{0.05} = 2.344 gives CD = 1.657 by (2.36): the gaps between A and B and between B and C are below CD, while the gap between A and C exceeds it, so the test concludes that A is significantly better than C and detects no significant difference in the other two pairs. Such results are conveniently displayed in a Friedman test diagram (Figure 2.8), which plots each algorithm's average rank as a dot with a horizontal segment of length CD: algorithms whose segments overlap are not significantly different.
2.5 Bias and Variance

Beyond estimating generalization performance, we often want to understand why a learner behaves as it does; the bias-variance decomposition is the standard tool. For a regression task, let f(x; D) be the prediction of the model trained on training set D, y_D the observed (possibly noisy) label of x in the data, and y its true label. The expected prediction is

  f̄(x) = E_D[ f(x; D) ] ,   (2.37)

the variance of the prediction across training sets of the same size is

  var(x) = E_D[ ( f(x; D) - f̄(x) )² ] ,   (2.38)

the noise is

  ε² = E_D[ ( y_D - y )² ] ,   (2.39)

and the bias is the deviation of the expected prediction from the true label,

  bias²(x) = ( f̄(x) - y )² .   (2.40)

Assuming the noise has zero mean, E_D[ y_D - y ] = 0, expanding the square and noting that the cross terms vanish (by (2.37) and the zero-mean noise), the expected generalization error decomposes as

  E(f; D) = E_D[ ( f(x; D) - y_D )² ]
          = E_D[ ( f(x; D) - f̄(x) )² ] + ( f̄(x) - y )² + E_D[ ( y_D - y )² ]
          = var(x) + bias²(x) + ε² .   (2.41)

Generalization error is thus the sum of bias, variance and noise: bias measures how well the learner can fit the problem, variance measures sensitivity to perturbations of the training set, and noise lower-bounds the achievable error. Bias and variance are generally in conflict (the bias-variance dilemma, Figure 2.9): with little training the learner underfits and bias dominates; as training deepens, the learner tracks the particular training data ever more closely, variance grows, and eventually overfitting occurs.
2.6 Further Reading

ROC analysis was introduced to machine learning from signal detection and medical decision making [Hanley and McNeil, 1983]; [Hand and Till, 2001] extend ROC to multi-class tasks, [Fawcett, 2006] surveys ROC analysis, and [Drummond and Holte, 2006] propose cost curves. The exercises of this chapter ask the reader, among other things, to design evaluation experiments with hold-out, cross-validation and bootstrapping, to relate the BEP of Sec. 2.3 to F1, to prove (2.22), to study the correspondence between ROC curves and cost curves, and (Exercise 2.8) to compare min-max normalization,

  x' = x'_min + ((x - x_min)/(x_max - x_min)) (x'_max - x'_min) ,   (2.43)

with z-score normalization,

  x' = (x - μ) / σ .   (2.44)

Historical note: the t-test was published in Biometrika in 1908 by William Gosset under the pseudonym "Student", hence the name Student's t-test.
3.1 Basic Form

Given an instance described by d attributes, x = (x_1; x_2; ...; x_d), a linear model predicts with a linear combination of the attributes,

  f(x) = w_1 x_1 + w_2 x_2 + ... + w_d x_d + b ,   (3.1)

or in vector form

  f(x) = wᵀx + b ,   (3.2)

with w = (w_1; w_2; ...; w_d); once w and b are learned the model is determined. Linear models are simple yet fundamental: many powerful nonlinear models are obtained from them by adding hierarchical structure or high-dimensional mappings. Because w directly expresses the importance of each attribute, linear models also have good comprehensibility (understandability); e.g. a model such as f_ripe(x) = 0.2·x_colour + 0.5·x_root + 0.3·x_sound + 1 says that the root matters most, then the sound, then the colour.
3.2 Linear Regression

Given D = {(x_1, y_1), ..., (x_m, y_m)}, linear regression learns f(x_i) = w x_i + b such that f(x_i) ≈ y_i (consider first a single attribute; discrete attributes are converted to numbers when an order exists, or one-hot coded otherwise). To determine w and b we minimize the squared loss, i.e. the Euclidean distance between predictions and labels; this is the method of least squares:

  (w*, b*) = argmin_{(w,b)} Σ_{i=1}^{m} ( f(x_i) - y_i )² = argmin_{(w,b)} Σ_{i=1}^{m} ( y_i - w x_i - b )² .   (3.4)

Determining w and b is parameter estimation. Setting the partial derivatives of E_(w,b) = Σ_i (y_i - w x_i - b)² to zero,

  ∂E/∂w = 2 ( w Σ_{i=1}^{m} x_i² - Σ_{i=1}^{m} (y_i - b) x_i ) ,   (3.5)
  ∂E/∂b = 2 ( m b - Σ_{i=1}^{m} (y_i - w x_i) ) ,   (3.6)

gives the closed-form solution

  w = Σ_{i=1}^{m} y_i (x_i - x̄) / ( Σ_{i=1}^{m} x_i² - (1/m)(Σ_{i=1}^{m} x_i)² ) ,   (3.7)
  b = (1/m) Σ_{i=1}^{m} ( y_i - w x_i ) ,   (3.8)

where x̄ = (1/m) Σ_i x_i.
More generally, with d attributes we seek f(x_i) = wᵀx_i + b with f(x_i) ≈ y_i; this is multivariate linear regression. Absorb b into the weight vector as ŵ = (w; b) and collect the data into the m × (d+1) matrix X whose i-th row is (x_i1, x_i2, ..., x_id, 1), with y = (y_1; y_2; ...; y_m). Then

  ŵ* = argmin_ŵ ( y - Xŵ )ᵀ ( y - Xŵ ) .   (3.9)

Let E_ŵ = ( y - Xŵ )ᵀ( y - Xŵ ); differentiating,

  ∂E_ŵ/∂ŵ = 2 Xᵀ( Xŵ - y ) .   (3.10)

When XᵀX is full-rank (positive definite), setting the derivative to zero gives

  ŵ* = ( XᵀX )^{-1} Xᵀ y ,   (3.11)

and with x̂_i = (x_i; 1) the learned model is

  f(x̂_i) = x̂_iᵀ ( XᵀX )^{-1} Xᵀ y .   (3.12)

In practice XᵀX is often not full-rank, e.g. when the number of attributes exceeds the number of samples, so (3.11) has multiple solutions; which one is chosen is determined by the algorithm's inductive bias, commonly by adding a regularization term (see Secs. 6.4 and 11.4).
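A minimal numpy sketch (my own, with synthetic data) of the closed-form solution (3.11): it appends the constant column, solves the normal equations with a pseudo-inverse so that a rank-deficient XᵀX does not break it, and compares with the true coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 200, 3
X = rng.normal(size=(m, d))
w_true, b_true = np.array([1.5, -2.0, 0.5]), 0.7
y = X @ w_true + b_true + 0.1 * rng.normal(size=m)

Xh = np.hstack([X, np.ones((m, 1))])        # x_hat = (x; 1): absorb b into w_hat
# w_hat* = (X^T X)^{-1} X^T y  (3.11); pinv also covers the rank-deficient case
w_hat = np.linalg.pinv(Xh.T @ Xh) @ Xh.T @ y
print("learned  w, b:", np.round(w_hat, 3))
print("true     w, b:", np.append(w_true, b_true))
```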
The linear model (3.2) can be generalized. If the output is believed to vary on an exponential scale, the model can approximate ln y instead,

  ln y = wᵀx + b ,   (3.14)

the log-linear regression, which in effect uses e^{wᵀx+b} to approximate y (Figure 3.1). More generally, for any monotone differentiable link function g(·),

  y = g^{-1}( wᵀx + b )   (3.15)

is a generalized linear model, of which log-linear regression is the special case g(·) = ln(·).
3.3 Logistic Regression

For classification we must connect the real-valued output z = wᵀx + b with the label y ∈ {0, 1}. The ideal link is the unit step (Heaviside) function,

  y = 0 if z < 0;  0.5 if z = 0;  1 if z > 0 ,   (3.16)

but it is discontinuous and so cannot serve as g^{-1}(·) in (3.15). A common surrogate is the logistic (sigmoid) function, an S-shaped function that maps z into (0, 1) and changes steeply near z = 0 (Figure 3.2):

  y = 1 / ( 1 + e^{-z} ) .   (3.17)

Substituting it into (3.15) gives

  y = 1 / ( 1 + e^{-(wᵀx + b)} ) ,   (3.18)

which can be rewritten as

  ln ( y / (1 - y) ) = wᵀx + b .   (3.19)

Interpreting y as the probability that x is positive and 1 - y as the probability that it is negative, y/(1 - y) is the odds and ln(y/(1 - y)) the log odds (logit); so (3.18) uses a linear model to approximate the log odds of the label, and the resulting classifier is called logistic regression (logit regression). Despite the name it is a classification method; it models class probabilities directly without assuming a data distribution, and its objective is convex and easy to optimize.
Writing y as the posterior p(y = 1 | x), (3.19) becomes

  ln ( p(y=1|x) / p(y=0|x) ) = wᵀx + b ,   (3.22)

so that

  p(y=1|x) = e^{wᵀx+b} / ( 1 + e^{wᵀx+b} ) ,   (3.23)
  p(y=0|x) = 1 / ( 1 + e^{wᵀx+b} ) .   (3.24)

The parameters are estimated by maximum likelihood (Sec. 7.2): given {(x_i, y_i)}_{i=1}^{m}, maximize the log-likelihood

  ℓ(w, b) = Σ_{i=1}^{m} ln p( y_i | x_i; w, b ) .   (3.25)

With β = (w; b) and x̂ = (x; 1), so that wᵀx + b = βᵀx̂, maximizing (3.25) is equivalent to minimizing

  ℓ(β) = Σ_{i=1}^{m} ( -y_i βᵀx̂_i + ln( 1 + e^{βᵀx̂_i} ) ) ,   (3.27)

a convex function that can be minimized by gradient descent or Newton's method. Newton's method iterates

  β^{t+1} = β^{t} - ( ∂²ℓ(β)/∂β∂βᵀ )^{-1} ∂ℓ(β)/∂β ,   (3.29)

with

  ∂ℓ(β)/∂β = - Σ_{i=1}^{m} x̂_i ( y_i - p_1(x̂_i; β) ) ,   (3.30)
  ∂²ℓ(β)/∂β∂βᵀ = Σ_{i=1}^{m} x̂_i x̂_iᵀ p_1(x̂_i; β) ( 1 - p_1(x̂_i; β) ) ,   (3.31)

where p_1(x̂_i; β) = p(y=1 | x̂_i; β).
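A short sketch (mine, not the book's code) that minimizes the objective (3.27) by plain gradient descent using the gradient (3.30) on synthetic data; a Newton iteration (3.29) with the Hessian (3.31) would converge faster but needs a matrix solve per step.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logit(X, y, lr=0.1, iters=2000):
    """Gradient descent on l(beta) of (3.27); beta = (w; b), x_hat = (x; 1)."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    beta = np.zeros(Xh.shape[1])
    for _ in range(iters):
        p1 = sigmoid(Xh @ beta)               # p(y=1 | x_hat; beta)
        grad = -Xh.T @ (y - p1)               # (3.30)
        beta -= lr * grad / len(X)
    return beta

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)), rng.normal(1, 1, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
beta = fit_logit(X, y)
acc = np.mean((sigmoid(np.hstack([X, np.ones((100, 1))]) @ beta) > 0.5) == y)
print("training accuracy:", acc)
```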
3.4 Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) projects the samples onto a line such that projections of same-class samples are close together and projections of different classes are far apart; a new sample is classified by where its projection falls (Figure 3.3).

For binary classification with D = {(x_i, y_i)}, y_i ∈ {0, 1}, let X_i, μ_i and Σ_i be the sample set, mean vector and covariance matrix of class i. After projecting onto direction w, the class means are wᵀμ_0 and wᵀμ_1 and the within-class scatters are wᵀΣ_0 w and wᵀΣ_1 w. LDA maximizes

  J = ||wᵀμ_0 - wᵀμ_1||²_2 / ( wᵀΣ_0 w + wᵀΣ_1 w ) = wᵀ S_b w / wᵀ S_w w ,   (3.32)/(3.35)

where the within-class scatter matrix is

  S_w = Σ_0 + Σ_1 = Σ_{x∈X_0} (x - μ_0)(x - μ_0)ᵀ + Σ_{x∈X_1} (x - μ_1)(x - μ_1)ᵀ   (3.33)

and the between-class scatter matrix is

  S_b = (μ_0 - μ_1)(μ_0 - μ_1)ᵀ .   (3.34)

J is a generalized Rayleigh quotient of S_b and S_w. Since only the direction of w matters, (3.35) is equivalent to

  min_w  -wᵀ S_b w   s.t.  wᵀ S_w w = 1 ,   (3.36)

whose Lagrangian conditions give S_b w = λ S_w w (3.37); because S_b w always points along μ_0 - μ_1, the solution is

  w = S_w^{-1} ( μ_0 - μ_1 ) .   (3.39)

For numerical stability S_w^{-1} is computed via the singular value decomposition S_w = U Σ Vᵀ, S_w^{-1} = V Σ^{-1} Uᵀ. LDA can also be justified from a Bayesian decision viewpoint: when both classes are Gaussian with equal covariance, it yields the optimal classifier (see Chapter 7).

LDA extends to N classes. Define the global scatter S_t = S_b + S_w = Σ_{i=1}^{m} (x_i - μ)(x_i - μ)ᵀ (3.40), with μ the mean of all samples; the within-class scatter S_w = Σ_{i=1}^{N} S_{w_i}, where S_{w_i} = Σ_{x∈X_i} (x - μ_i)(x - μ_i)ᵀ (3.41)-(3.42); and S_b = S_t - S_w = Σ_{i=1}^{N} m_i (μ_i - μ)(μ_i - μ)ᵀ (3.43), with m_i the number of class-i samples. A common objective is

  max_W  tr( Wᵀ S_b W ) / tr( Wᵀ S_w W ) ,   (3.44)

with W ∈ R^{d×(N-1)} and tr(·) the trace; the solution consists of the eigenvectors of the generalized eigenvalue problem S_b W = λ S_w W (3.45) associated with the N-1 largest eigenvalues. Since W projects the samples into at most N-1 dimensions, multi-class LDA is also a classical supervised dimensionality-reduction technique.
3.5 Multi-class Learning

Some binary classifiers extend directly to N classes, but more often a multi-class task is decomposed into binary tasks: a binary classifier is trained for each and their predictions are combined. The classical strategies are One vs. One (OvO), One vs. Rest (OvR) and Many vs. Many (MvM).

OvO pairs the N classes, training N(N-1)/2 classifiers, each on the samples of just two classes; at test time all classifiers vote and the most frequently predicted class wins (Figure 3.4). OvR trains N classifiers, each treating one class as positive and the rest as negative; if exactly one classifier predicts positive that class is output, otherwise the most confident classifier decides. OvR trains only N classifiers but each on all samples, whereas OvO trains more classifiers on smaller subsets; in most cases their predictive performance is similar.

MvM treats several classes as positive and several as negative in each binary task; the splits must be designed carefully, most commonly with Error Correcting Output Codes (ECOC) [Dietterich and Bakiri, 1995]. In the encoding step the N classes are split M times, giving M binary training tasks and M classifiers; each class thereby receives an M-bit codeword, and the codewords form the coding matrix, which may be binary (+1/-1) or ternary (+1/0/-1, 0 meaning the class is excluded from that task) [Allwein et al., 2000]. In the decoding step the M classifiers' predictions on a test sample form a codeword, which is compared with each class codeword; the class at the smallest Hamming or Euclidean distance is predicted (Figure 3.5 gives an example where the test sample is decoded to class c_3). Because the codewords are redundant, ECOC tolerates some mistakes of individual classifiers: longer codes give stronger error correction but require more classifiers, and for a finite number of classes the possible splits are limited. Moreover, codes that are better in the coding-theoretic sense are not necessarily better for learning, since different splits yield binary problems of different difficulty, and finding the optimal coding matrix is NP-hard.
3.6 Class Imbalance

The methods above assume roughly balanced classes. If 998 of 1000 training samples are negative, a learner that always predicts "negative" reaches 99.8% accuracy yet is useless; class-imbalance learning addresses such cases. (Imbalance also arises inside OvR and MvM decompositions, but since the decompositions are symmetric across classes the effects largely cancel and usually need no special treatment.)

For a linear classifier y = wᵀx + b, predicting positive when y > 0.5 implicitly assumes equal priors; the decision really compares odds, predicting positive if

  y / (1 - y) > 1 .   (3.46)

If the training set has m+ positive and m- negative samples, the observed class odds are m+/m-, and as long as the training set is an unbiased sample of the true distribution we should predict positive when

  y / (1 - y) > m+ / m- ,   (3.47)

which is achieved by rescaling (rebalancing) the output:

  y' / (1 - y') = ( y / (1 - y) ) × ( m- / m+ ) .   (3.48)

In practice three strategies are common: undersampling (downsampling) the majority class, e.g. EasyEnsemble, which uses ensembles so that majority-class information is not simply discarded; oversampling (upsampling) the minority class, e.g. SMOTE [Chawla et al., 2002], which interpolates new minority samples rather than duplicating existing ones (duplication risks overfitting); and threshold moving, which trains on the original data and applies (3.48) at decision time. Rescaling is also the basis of cost-sensitive learning, where m-/m+ in (3.48) is replaced by the cost ratio cost+/cost- (cf. Sec. 2.3.4).
sparsity)
L0 NP LASSO
[Tibshirani , 1996] h L0
OvO OvR ECOC [Allwein et al., 2000].
[Crammer and Singer , 2002]
NP ECOC
[Pujol et al., 2008]. [Escalera et
al.,2010] ECOC
MvM ECOC DAG ( Directed Acyclic
Graph) [Platt et al,2000]
(multi-label learning),
[Zhang and Zhou, 2014].
3.1 (3.2)
3.2 (3.18)
(327)
3.0a p.89
3.3 3.0a
4.5.
UCI
3.4 UCI
http: // archive.ics.uci. edu / ml / .
3.5 3.0a
3.6
6.3
3.7 ECOC
3.8* ECOC
ECOC
3.10* )
Allwein, E. L., R. E. Schapire, and Y. Singer. ( 2000). ^Reducing multiclass
to binary: A unifying approach for margin classifiers.” Journal of Machine
Learning Research, 1:113-141.
Boyd , S. and L. Vandenberghe. ( 2004) . Convex Optimization, Cambridge Uni-
versity Press, Cambridge, UK.
Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. (2002 ).
“SMOTE: Synthetic minority over-sampling technique.” Journal of Artificial
Intelligence Research, 16:321-357.
Crammer, K. and Y. Singer. (2001). "On the algorithmic implementation of multiclass kernel-based vector machines." Journal of Machine Learning Research, 2:265-292.
Crammer , K. and Y. Singer. ( 2002). “On the learnability and design of output
codes for multiclass problems.” Machine Learning , 47(2-3) :201—233.
Dietterich, T. G. and G. Bakiri. (1995). “Solving multiclass learning problems
via error-correcting output codes.” Journal of Artificial Intelligence Re -
search 2:263-286.
Elkan , C. ( 2001). “The foundations of cost-sensitive learning.” In Proceedings
of the 17th International Joint Conference on Artificial Intelligence ( IJCAI )
973-978, Seattle, WA.
Escalera, S. O. Pujol, and P. Radeva. ( 2010) . “Error-correcting ouput codes
library.” Journal of Machine Learning Research, 11:661-664.
Fisher , R. A. (1936). “The use of multiple measurements in taxonomic prob-
lems. Annals of Eugenics , 7( 2 ):179-188.
Lee, Y., Y. Lin, and G. Wahba. (2004). “Multicategory support vector ma-
chines, theory, and application to the classification of microarray data and
satellite radiance data.” Journal of the American Statistical Association, 99
( 465) :67-81.
Liu, X.-Y., J. Wu, and Z.-H. Zhou. (2009). "Exploratory undersampling for class-imbalance learning." IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 39(2):539-550.
Platt, J. C., N. Cristianini, and J. Shawe-Taylor. (2000). "Large margin DAGs for multiclass classification." In Advances in Neural Information Processing Systems 12 (NIPS) (S. A. Solla, T. K. Leen, and K.-R. Muller, eds.), MIT Press, Cambridge, MA.
Pujol, 0., S. Escalera, and P. Radeva. (2008) . “An incremental node embedding
,
technique for error correcting output codes.” Pattern Recognition 41( 2 ):713-
725.
Pujol, O., P. Radeva, and J. Vitria. ( 2006 ). “Discriminant ECOC: A heuristic
method for application dependent design of error correcting output codes.”
IEEE Transactions on Pattern Analysis and Machine Intelligence , 28( 6):
1007-1012.
Tibshirani, R. (1996 ) . “Regression shrinkage and selection via the LASSO.”
Journal of the Royal Statistical Society: Series B , 58(1):267-288.
Zhang, M.-L. and Z.-H. Zhou. ( 2014). “A review on multi-label learning al-
gorithms.” IEEE Transactions on Knowledge and Data Engineering , 26 (8):
1819-1837.
Zhou, Z.-H. and X.-Y. Liu. (2006a). "On multi-class cost-sensitive learning." In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI), 567-572, Boston, MA.
Zhou, Z.-H. and X.-Y. Liu. ( 2006b) . “Training cost-sensitive neural networks
with methods addressing the class imbalance problem.” IEEE Transactions
on Knowledge and Data Engineering , 18(1):63-77.
Historical note: the method of least squares was published by A.-M. Legendre (1752-1833) and also claimed by C. F. Gauss (1777-1855), who said he had used it earlier; their priority dispute is famous in the history of statistics.
4.1 Basic Flow

A decision tree classifies a sample through a sequence of attribute tests: each internal node tests an attribute, each branch corresponds to an outcome of the test, and each leaf holds a decision (Figure 4.1). Decision-tree learning follows the divide-and-conquer procedure of Figure 4.2:

Input: training set D = {(x_1, y_1), ..., (x_m, y_m)};
       attribute set A = {a_1, a_2, ..., a_d}.
Process: function TreeGenerate(D, A)
  generate node;
  if all samples in D belong to the same class C then
      mark node as a class-C leaf; return                              // case (1)
  end if
  if A = ∅ OR all samples in D take the same values on A then
      mark node as a leaf labelled with the majority class of D; return // case (2)
  end if
  select the optimal splitting attribute a* from A;
  for each value a*_v of a* do
      add a branch to node; let D_v be the subset of D with value a*_v on a*;
      if D_v is empty then
          mark the branch node as a leaf labelled with the majority class of D; return // case (3)
      else
          use TreeGenerate(D_v, A \ {a*}) as the branch node
      end if
  end for
Output: a decision tree rooted at node

Recursion stops in three cases: (1) all samples at the node belong to one class; (2) the attribute set is empty or all samples take identical attribute values, so no further split is possible, and the node becomes a leaf labelled with the majority class of its samples (using the node's posterior distribution); (3) the sample set at a branch is empty, and the node becomes a leaf labelled with the majority class of its parent (using the parent's prior distribution).
4.2 Split Selection

The key step is choosing the optimal splitting attribute: as splitting proceeds we want the samples reaching each branch to belong increasingly to a single class, i.e. node purity to increase.

4.2.1 Information Gain

Information entropy is the most common purity measure. If the proportion of class-k samples in D is p_k (k = 1, 2, ..., |Y|), the entropy of D is

  Ent(D) = - Σ_{k=1}^{|Y|} p_k log₂ p_k ,   (4.1)

with the convention p log₂ p = 0 when p = 0; the smaller Ent(D), the purer D (minimum 0, maximum log₂|Y|). Suppose discrete attribute a takes V values {a¹, a², ..., a^V} and splitting D on a produces V branches, the v-th containing the subset D^v of samples with value a^v. Weighting each branch by |D^v|/|D|, the information gain of a is

  Gain(D, a) = Ent(D) - Σ_{v=1}^{V} (|D^v|/|D|) Ent(D^v) .   (4.2)

The larger the gain, the greater the purity improvement; ID3 chooses a* = argmax_{a∈A} Gain(D, a).

As an example, the watermelon data set 2.0 of Table 4.1 has 17 samples, 8 ripe and 9 unripe, so

  Ent(D) = -(8/17) log₂(8/17) - (9/17) log₂(9/17) = 0.998 .

For the attribute "colour" with three values, the subsets D¹ (6 samples, 3 ripe), D² (6 samples, 4 ripe) and D³ (5 samples, 1 ripe) have entropies 1.000, 0.918 and 0.722, giving

  Gain(D, colour) = 0.998 - ( (6/17)×1.000 + (6/17)×0.918 + (5/17)×0.722 ) = 0.109 .

Computing the remaining attributes similarly, "texture" attains the largest gain (0.381) and is chosen as the root split. The procedure recurses on each branch; on the branch D¹ = {1, 2, 3, 4, 5, 6, 8, 10, 15} (texture = clear), for instance, the gains of the remaining attributes are 0.043 (colour), 0.458 (root), 0.331 (sound), 0.458 (umbilicus) and 0.458 (touch), so any of the three attributes with gain 0.458 may be chosen. Continuing yields the decision tree of Figure 4.4. A small sketch of this computation follows.
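A small Python sketch (my own) of (4.1)-(4.2), entropy and the information gain of a discrete attribute; the toy data below simply mirror the colour example above.

```python
import numpy as np
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k log2 p_k  (4.1)."""
    counts = np.array(list(Counter(labels).values()), dtype=float)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def info_gain(values, labels):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)  (4.2)."""
    values, labels = np.asarray(values), np.asarray(labels)
    gain = entropy(labels)
    for v in np.unique(values):
        mask = values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

# toy attribute with 3 values and binary labels (8 positive, 9 negative)
colour = ["green"] * 6 + ["dark"] * 6 + ["light"] * 5
ripe   = [1, 1, 1, 0, 0, 0,  1, 1, 1, 1, 0, 0,  1, 0, 0, 0, 0]
print(f"Ent(D) = {entropy(ripe):.3f}, Gain(D, colour) = {info_gain(colour, ripe):.3f}")
```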
4.2.2 Gain Ratio

Information gain is biased towards attributes with many values: an attribute like "ID", distinct for every sample of Table 4.1, would produce pure one-sample branches and the largest gain, yet such a split clearly does not generalize. C4.5 [Quinlan, 1993] therefore uses the gain ratio,

  Gain_ratio(D, a) = Gain(D, a) / IV(a) ,   (4.3)

where the intrinsic value of a,

  IV(a) = - Σ_{v=1}^{V} (|D^v|/|D|) log₂ (|D^v|/|D|) ,   (4.4)

grows with the number of values of a. Since the gain ratio is in turn biased towards attributes with few values, C4.5 uses a heuristic [Quinlan, 1993]: first keep the attributes whose information gain is above average, then choose among them the one with the highest gain ratio.

4.2.3 Gini Index

CART (Classification and Regression Tree) [Breiman et al., 1984] uses the Gini index. The Gini value of D,

  Gini(D) = Σ_{k=1}^{|Y|} Σ_{k'≠k} p_k p_{k'} = 1 - Σ_{k=1}^{|Y|} p_k² ,   (4.5)

is the probability that two samples drawn at random from D belong to different classes; the smaller, the purer. The Gini index of attribute a is

  Gini_index(D, a) = Σ_{v=1}^{V} (|D^v|/|D|) Gini(D^v) ,   (4.6)

and CART chooses a* = argmin_{a∈A} Gini_index(D, a).
4.3 Pruning

Pruning is the decision tree's main defence against overfitting (Sec. 2.1): a tree grown until every branch is pure may have learned peculiarities of the training data. The two basic strategies are pre-pruning and post-pruning [Quinlan, 1993]. Pre-pruning evaluates each candidate split during construction and cancels it if it does not improve generalization, turning the node into a leaf; post-pruning first grows a complete tree and then examines its non-leaf nodes bottom-up, replacing a subtree by a leaf when that improves generalization. Generalization is estimated as in Sec. 2.2, e.g. with a hold-out validation set.

As a running example, split the watermelon data 2.0 of Table 4.1 into a training set {1, 2, 3, 6, 7, 10, 14, 15, 16, 17} and a validation set {4, 5, 8, 9, 11, 12, 13} (Table 4.2). Using information gain on the training set gives the unpruned tree of Figure 4.5.

With pre-pruning (Figure 4.6), before splitting the root we compare validation accuracy: labelling the root as a leaf with the majority class gives 42.9%, while splitting raises it to 71.4%, so the split is kept; candidate splits lower in the tree that do not improve validation accuracy are rejected, and in this example the result is a very shallow tree, here a decision stump. Pre-pruning lowers overfitting risk and training/testing time, but because it forbids splits whose benefit would only show after further splits, it risks underfitting.

With post-pruning (Figure 4.7), the fully grown tree of Figure 4.5 (validation accuracy 42.9%) is examined bottom-up, and each subtree whose replacement by a leaf does not decrease validation accuracy is pruned, raising the accuracy to 71.4% in this example. Post-pruned trees usually retain more branches than pre-pruned ones, generally underfit less and generalize better, but require growing the full tree first and examining every non-leaf node, so training cost is higher.
4.4 Continuous and Missing Values

4.4.1 Continuous Values

Real tasks often involve continuous attributes, which cannot be split directly on their values; discretization is needed, the simplest strategy being the bi-partition used by C4.5 [Quinlan, 1993]. Suppose attribute a takes n distinct values on D, sorted as a¹ ≤ a² ≤ ... ≤ aⁿ. A threshold t splits D into D_t⁻ (values ≤ t) and D_t⁺ (values > t); since any t in [aⁱ, aⁱ⁺¹) induces the same split, it suffices to examine the midpoints

  T_a = { (aⁱ + aⁱ⁺¹)/2 : 1 ≤ i ≤ n-1 } ,   (4.7)

and the information gain of a is the best over these thresholds,

  Gain(D, a) = max_{t∈T_a} Gain(D, a, t) = max_{t∈T_a} [ Ent(D) - Σ_{λ∈{-,+}} (|D_t^λ|/|D|) Ent(D_t^λ) ] .   (4.8)

Table 4.3 (watermelon data 3.0) extends Table 4.1 with two continuous attributes, density and sugar content (categorical attributes as in Table 4.1):

  No.  density  sugar   ripe
   1   0.697    0.460   yes
   2   0.774    0.376   yes
   3   0.634    0.264   yes
   4   0.608    0.318   yes
   5   0.556    0.215   yes
   6   0.403    0.237   yes
   7   0.481    0.149   yes
   8   0.437    0.211   yes
   9   0.666    0.091   no
  10   0.243    0.267   no
  11   0.245    0.057   no
  12   0.343    0.099   no
  13   0.639    0.161   no
  14   0.657    0.198   no
  15   0.360    0.370   no
  16   0.593    0.042   no
  17   0.719    0.103   no

For density, (4.7) gives 16 candidate thresholds {0.244, 0.294, 0.351, 0.381, 0.420, 0.459, 0.518, 0.574, 0.600, 0.621, 0.636, 0.648, 0.661, 0.681, 0.708, 0.746}; the best is 0.381, with information gain 0.262. For sugar content the candidates are {0.049, 0.074, 0.095, 0.101, 0.126, 0.155, 0.179, 0.204, 0.213, 0.226, 0.250, 0.265, 0.292, 0.344, 0.373, 0.418}, the best being 0.126 with gain 0.349. Together with the gains of the categorical attributes computed as in Sec. 4.2.1 (colour 0.109, root 0.143, sound 0.141, texture 0.381, umbilicus 0.289, touch 0.006), "texture" is again chosen at the root; recursing yields the tree of Figure 4.8. Note that, unlike a discrete attribute, a continuous attribute may be reused in the descendants of a node: using "density ≤ 0.381" at a node does not forbid "density ≤ 0.294" below it.
4.4.2 Missing Values

Real data often have missing attribute values, e.g. because some tests (such as an HIV test) were not performed; discarding incomplete samples wastes information (Table 4.4 shows watermelon data 2.0a with missing entries). Two questions arise: (1) how to choose a splitting attribute when values are missing, and (2) how to route a sample whose value on the chosen attribute is missing.

Let D̃ be the subset of D whose value of attribute a is not missing, D̃_k its class-k part and D̃^v its part taking value a^v, and give each sample x a weight w_x (initially 1). Define

  ρ = Σ_{x∈D̃} w_x / Σ_{x∈D} w_x ,   (4.9)
  p̃_k = Σ_{x∈D̃_k} w_x / Σ_{x∈D̃} w_x  (1 ≤ k ≤ |Y|) ,   (4.10)
  r̃_v = Σ_{x∈D̃^v} w_x / Σ_{x∈D̃} w_x  (1 ≤ v ≤ V) ,   (4.11)

so that ρ is the (weighted) fraction of samples with no missing value of a, p̃_k the fraction of class k among them and r̃_v the fraction taking value a^v, with Σ_k p̃_k = 1 and Σ_v r̃_v = 1. The information gain generalizes to

  Gain(D, a) = ρ × Gain(D̃, a) = ρ × ( Ent(D̃) - Σ_{v=1}^{V} r̃_v Ent(D̃^v) ) ,   (4.12)

with Ent(D̃) = - Σ_k p̃_k log₂ p̃_k. This answers question (1). For question (2), a sample whose value of a is known enters the corresponding branch with its weight unchanged, while a sample with a missing value enters every branch, its weight multiplied by r̃_v in branch v; this is the mechanism of C4.5 [Quinlan, 1993].

For example, on Table 4.4 the attribute "colour" has 14 samples with known values, 6 ripe and 8 unripe, so Ent(D̃) = -(6/14)log₂(6/14) - (8/14)log₂(8/14) = 0.985; its three value subsets have entropies 1.000, 0.918 and 0.000, giving

  Gain(D̃, colour) = 0.985 - ( (4/14)×1.000 + (6/14)×0.918 + (4/14)×0.000 ) = 0.306

and Gain(D, colour) = (14/17) × 0.306 = 0.252. The other attributes give 0.171 (root), 0.145 (sound), 0.424 (texture), 0.289 (umbilicus) and 0.006 (touch), so "texture" is chosen at the root; samples 8 and 10, whose texture is missing, enter all three branches with correspondingly reduced weights. Continuing yields the tree of Figure 4.9.
4.5 Multivariate Decision Trees

Viewing each attribute as a coordinate axis, a sample with d attributes is a point in a d-dimensional space, and classification seeks a boundary between classes. The boundaries produced by ordinary decision trees are axis-parallel: every segment is perpendicular to some axis, because each node tests a single attribute. Figures 4.10 and 4.11 show the tree learned on the density and sugar attributes of the watermelon data 3.0a (cf. Table 4.3) and its axis-parallel boundary; approximating a sloped true boundary this way requires many segments, hence deep trees and many tests (Figure 4.12).

A multivariate (oblique) decision tree instead uses a linear combination of attributes at each non-leaf node, i.e. a test of the form Σ_i w_i a_i ≤ t, where the weights w_i and threshold t are learned at that node (e.g. by a linear classifier) rather than a single optimal attribute being selected; this produces oblique boundaries that can be far simpler. Figures 4.13 and 4.14 show a multivariate tree for the watermelon data 3.0a and its boundary. By contrast, a tree whose nodes each test a single attribute is a univariate decision tree.
4.6 Further Reading

Comparisons of split-selection criteria can be found in [Mingers, 1989b], and [Raileanu and Stoffel, 2004] show theoretically that the choice between information gain and the Gini index affects the resulting tree far less than is often assumed, whereas pruning is decisive for generalization; pruning methods are compared in [Mingers, 1989a]. Incremental tree learners include ID4 [Schlimmer and Fisher, 1986], ID5R [Utgoff, 1989a] and ITI [Utgoff et al., 1997].
4.1
0)
4.2
4.3 4.3
4.4 4.2
4.5 4.3
uci
http:// archive.ics.uci.edu / ml/. 4.6 UCI
2.4
4.7 4.2
MaxDepth 4.2
4.8*
MaxNode 4.7
4.7
4.9 4.4.2
4.10
3.0 p.84
3.0
4.3.
Breiman , L., J . Friedman , C. J . Stone, and R. A. Olshen. (1984) . Classification
and Regression Trees. Chapman & Hall/ CRC , Boca Raton , FL.
Brodley, C. E. and P. E. Utgoff . (1995) . “Multivariate decision trees.” Machine
Learning , 19 (1) :45-77.
Guo, H. and S. B. Gelfand. (1992). “ Classification trees with neural network
feature extraction.” IEEE Transactions on Neural Networks, 3(6 ) :923-933.
Mingers , J. (1989a) . “An empirical comparison of pruning methods for decision
tree induction.” Machine Learning , 4 ( 2 ):227-243.
Mingers, J . (1989b) . “An empirical comparison of selection measures for
decision-tree induction.” Machine Learning , 3( 4) :319-342.
Murthy, S. K . (1998) . “Automatic construction of decision trees from data:
A multi-disciplinary survey.” Data Mining and Knowledge Discovery , 2 ( 4):
345-389.
Murthy, S. K ., S. Kasif , and S. Salzberg. (1994) . “A system for induction of
oblique decision trees.” Journal of Artificial Intelligence Research, 2:1-32.
Quinlan , J . R. (1979) . “Discovering rules by induction from large collections of
examples.” In Expert Systems in the Micro- electronic Age ( D. Michie, ed.) ,
168—201, Edinburgh University Press , Edinburgh , UK.
Quinlan , J . R. (1986 ). “Induction of decision trees.” Machine Learning , 1(1):
81-106.
Quinlan , J . R. (1993) . C4.5: Programs for Machine Learning. Morgan Kauf-
mann San Mateo, CA.
Raileanu , L. E. and K. Stoffel. ( 2004) . “Theoretical comparison between the
Gini index and information gain criteria.” Annals of Mathematics and Arti-
ficial Intelligence 41(1):77-93.
Schlimmer , J . C. and D. Fisher. (1986). “A case study of incremental concept
induction.” In Proceedings of the 5th National Conference on Artificial In-
telligence ( AAAI )^ 495-501, Philadelphia, PA .
Utgoff, P. E. (1989a). "Incremental induction of decision trees." Machine Learning, 4(2):161-186.
Utgoff, P. E. (1989b). "Perceptron trees: A case study in hybrid concept representations." Connection Science, 1(4):377-391.
Utgoff, P. E., N. C. Berkman, and J. A. Clouse. (1997). "Decision tree induction based on efficient tree restructuring." Machine Learning, 29(1):5-44.
Historical note: decision-tree learning goes back to E. B. Hunt's Concept Learning System (CLS). J. Ross Quinlan, who learned of CLS while visiting D. Michie's group, developed ID3 and published it in Machine Learning; after other researchers proposed ID4, ID5 and similar systems, Quinlan named his improved system C4.0 ("Classifier 4.0") rather than continuing the ID numbering, followed by C4.5; its commercial successor is C5.0, and WEKA's reimplementation of C4.5 is J4.8.
5.1 Neuron Model

Neural networks are massively parallel interconnected networks of simple adaptive units whose organization models the way biological nervous systems interact with real-world objects [Kohonen, 1988]. The basic unit is the neuron: in the classical M-P neuron model (Figure 5.1), a neuron receives inputs x_i from n other neurons through weighted connections w_i, compares the total input with its threshold θ (also called a bias), and produces its output through an activation function,

  y = f( Σ_{i=1}^{n} w_i x_i - θ ) .

The ideal activation is the step function sgn(·), mapping the input to 1 (excited) or 0 (inhibited), but it is discontinuous and non-smooth (Figure 5.2(a)); the common substitute is the sigmoid of Figure 5.2(b) (cf. Sec. 3.3), which squashes a wide input range into (0, 1) and is therefore also called a squashing function. A neural network is many such neurons connected in a particular structure; from a computational standpoint it can simply be regarded as a parameterized mathematical model composed of many nested simple functions.
5.2 Perceptrons and Multi-layer Networks

A perceptron (Figure 5.3) has two layers of neurons: the input layer passes the input signals to the output layer, which is an M-P neuron, also called a threshold logic unit. With output y = f(Σ_i w_i x_i - θ) and f the step function, the perceptron easily realizes the logical AND, OR and NOT operations on x_1, x_2 ∈ {0, 1}:

- AND (x_1 ∧ x_2): w_1 = w_2 = 1, θ = 2, so y = f(x_1 + x_2 - 2) = 1 only when x_1 = x_2 = 1;
- OR (x_1 ∨ x_2): w_1 = w_2 = 1, θ = 0.5, so y = 1 whenever x_1 = 1 or x_2 = 1;
- NOT (¬x_1): w_1 = -0.6, w_2 = 0, θ = -0.5, so y = f(-0.6 x_1 + 0.5) is 1 when x_1 = 0 and 0 when x_1 = 1.

More generally the weights w_i (i = 1, ..., n) and threshold θ are learned from data; the threshold can be treated as the weight w_{n+1} of a dummy node with fixed input -1.0, so learning the threshold is unified with learning weights. For a training example (x, y) with current output ŷ, the perceptron adjusts its weights by

  w_i ← w_i + Δw_i ,   (5.1)
  Δw_i = η (y - ŷ) x_i ,   (5.2)

where η ∈ (0, 1) is the learning rate, e.g. 0.1: a correct prediction changes nothing, a wrong one moves the weights to reduce the error.

The perceptron has only one layer of functional neurons (only the output layer applies an activation function), so its capacity is very limited. The AND, OR and NOT problems are linearly separable, and for linearly separable problems the learning procedure is guaranteed to converge to a separating weight vector w = (w_1; w_2; ...; w_{n+1}); for non-separable problems such as XOR it keeps oscillating (fluctuating) and no solution exists [Minsky and Papert, 1969]. (Figure 5.4 illustrates the separable AND, OR and NOT problems and the non-separable XOR problem.)
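A minimal perceptron sketch in Python (my own): the update rule (5.1)-(5.2) with the threshold folded in as a dummy weight, converging on the linearly separable AND problem and failing to converge on XOR.

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, epochs=100):
    """Perceptron rule w_i <- w_i + eta*(y - y_hat)*x_i  (5.1)-(5.2); last input is the dummy -1."""
    Xd = np.hstack([X, -np.ones((len(X), 1))])      # dummy node with fixed input -1
    w = np.zeros(Xd.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(Xd, y):
            y_hat = 1.0 if xi @ w >= 0 else 0.0
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi
                errors += 1
        if errors == 0:                              # converged: all samples correct
            return w, True
    return w, False                                  # still oscillating after all epochs

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print("AND converges:", train_perceptron(X, np.array([0, 0, 0, 1.0]))[1])   # separable
print("XOR converges:", train_perceptron(X, np.array([0, 1, 1, 0.0]))[1])   # not separable
```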
Solving non-linearly-separable problems requires multiple layers of functional neurons. Adding a hidden layer between input and output layers, whose neurons, like the output neurons, apply activation functions, yields a network that can solve XOR (Figure 5.5). More generally, the common multi-layer feedforward network arranges neurons in layers, each layer fully connected to the next, with no intra-layer or skip-layer connections (Figure 5.6): the input layer merely receives the inputs, the hidden and output layers process them, and learning means adjusting the connection weights between neurons and the thresholds of the functional neurons. (Networks with feedback connections are discussed in Sec. 5.5.5.)
(5.1)
(error
BackPropagation , BP )
BP
BP
[Pineda, 1987]. “ BP
BP
BP D
( ®2 ,2/2 ),• .. 2/m )}, Xi G Vi G ,
M1 d Z
5.7 d Z
g j
3.2
h j
Whj . Eti ^ih^i
j ELI hjbh,
VI VI
► ./
Q
^
wQ3
Pj = 2Whjh
h=l
bh
h
d
1 dfb,
Oih = EVihXi
1 Xi
5.7 BP
. U 5.2( b) Sigmoid
3.3
( xk , yk ) ,
{ xk , yk )
i
. (5.4)
i=i
5.7 d Z 1) g Z dxg
gx Z g Z
. BP
(5.1) ;
t
v v - Av . (5.5)
5.7
B.4 .
BP (gradient descent )
(5.4)
dEk
Aw
^ j = — T ] dwhj (5.6)
5.3
j
Efc ,
8Ek dy] d j
dEk • ^ (5.7)
^
~
dwhj dy dpj dwhj
dwhj
h• (5.8)
5.2 Sigmoid
1— (5.9)
(5.4) (5.3),
dEk dy)
9j =
dykj d /3 j
(yj - yj )fWj Oj )
(5.10)
whj
(5.10) (5.8)
—
(5.7)
hj
vih
= rwjbh
―
= rjehXi
(5.6)
•
BP
(5.11)
-
(5 12)
(5.13)
7/t = -V ^ h (5.14)
(5.13) (5.14)
dEk db
^
6h =
dbh dah
l
/’(QTi — 7/ ) i
h ( l bh )
^^ hj9j . (5.15)
r? e (0,1)
0.1
(5.11) (5.12)
(5.13) (5.14)
Figure 5.8 summarizes the BP workflow:

Input: training set D = {(x_k, y_k)}_{k=1}^{m}; learning rate η.
Process:
  initialize all weights and thresholds randomly in (0, 1)
  repeat
    for all (x_k, y_k) ∈ D do
      compute the current outputs ŷ_k by (5.3);
      compute the output-layer gradients g_j by (5.10);
      compute the hidden-layer gradients e_h by (5.15);
      update the weights w_hj, v_ih and thresholds θ_j, γ_h by (5.11)-(5.14)
    end for
  until the stopping condition is met
Output: the trained multi-layer feedforward network

This standard BP updates the parameters after each example and targets E_k, whereas the real goal is the accumulated error over D,

  E = (1/m) Σ_{k=1}^{m} E_k .   (5.16)

Accumulated BP updates only after a full pass (one round, or epoch) over D; it often needs fewer updates, but each is slower, and on large data sets standard BP, which behaves like stochastic gradient descent (SGD), often reaches a good solution faster. [Hornik et al., 1989] proved that a feedforward network with one sufficiently large hidden layer can approximate any continuous function to arbitrary accuracy; how many hidden neurons to use is still set by trial and error.

Because of their capacity, BP networks often overfit. Two common remedies are early stopping (split off a validation set and stop when the training error still falls but the validation error rises) and regularization, adding a penalty on network complexity such as the sum of squared weights:

  E = λ (1/m) Σ_{k=1}^{m} E_k + (1 - λ) Σ_i w_i² ,   (5.17)

where λ ∈ (0, 1) trades off empirical error against complexity and is usually set by cross-validation.
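A compact sketch (my own numpy code, not the book's reference implementation) of standard BP for a single-hidden-layer sigmoid network, using the gradients (5.10) and (5.15) and the updates (5.11)-(5.14), trained on XOR; hidden-layer size, learning rate and epoch count are arbitrary choices of mine.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
Y = np.array([[0], [1], [1], [0]], dtype=float)              # XOR, one output neuron

d, q, l, eta = 2, 4, 1, 0.5
V = rng.uniform(-1, 1, size=(d, q)); gamma = rng.uniform(-1, 1, size=q)  # input->hidden
W = rng.uniform(-1, 1, size=(q, l)); theta = rng.uniform(-1, 1, size=l)  # hidden->output

for epoch in range(10000):                   # standard BP: update after each example
    for x, y in zip(X, Y):
        b = sigmoid(x @ V - gamma)           # hidden outputs b_h
        yhat = sigmoid(b @ W - theta)        # outputs y_hat_j        (5.3)
        g = yhat * (1 - yhat) * (y - yhat)                           # (5.10)
        e = b * (1 - b) * (W @ g)                                    # (5.15)
        W += eta * np.outer(b, g); theta -= eta * g                  # (5.11)-(5.12)
        V += eta * np.outer(x, e); gamma -= eta * e                  # (5.13)-(5.14)

pred = sigmoid(sigmoid(X @ V - gamma) @ W - theta)
print(np.round(pred.ravel(), 2))             # typically close to [0, 1, 1, 0]
```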
5.4 Global Minimum and Local Minimum

Training seeks parameters (w, θ) minimizing the error E. A local minimum is a point whose neighbourhood contains no point with lower error; the global minimum has the lowest error over the whole parameter space (Figure 5.10 shows a surface with one global and several local minima). Gradient descent stops wherever the gradient vanishes, so with multiple local minima the solution found may not be global. Common heuristics for escaping local minima include: training from several different random initializations and keeping the best result; simulated annealing, which accepts a worse solution with a probability that decreases over time; stochastic gradient descent, whose noisy gradients can push the search out of a local minimum; and genetic algorithms. All of these are heuristics without theoretical guarantees.
5.5 Other Common Neural Networks

5.5.1 RBF Networks

A Radial Basis Function (RBF) network [Broomhead and Lowe, 1988] is a single-hidden-layer feedforward network whose hidden neurons apply radial basis functions of the distance to a centre and whose output is a linear combination of the hidden outputs,

  φ(x) = Σ_{i=1}^{q} w_i ρ(x, c_i) ,   (5.18)

with the Gaussian radial basis ρ(x, c_i) = e^{-β_i ||x - c_i||²} (5.19) the usual choice. With enough hidden neurons an RBF network can approximate any continuous function to arbitrary accuracy. Training typically has two steps: determine the centres c_i (randomly or by clustering), then learn w_i and β_i by BP-style optimization.

5.5.2 ART Networks

Competitive learning is an unsupervised strategy in which output neurons compete and only the winner is activated (winner-take-all). Adaptive Resonance Theory (ART) networks [Carpenter and Grossberg, 1987] consist of a comparison layer, a recognition layer, a vigilance threshold and a reset module: the recognition neuron closest to the input wins; if the similarity exceeds the vigilance threshold the input is assimilated into that neuron's class, otherwise a new recognition neuron (class) is created. ART eases the stability-plasticity dilemma (learning new knowledge while preserving old) and supports incremental and online learning; the family includes ART2, FuzzyART and ARTMAP.

5.5.3 SOM Networks

A Self-Organizing Map (SOM, also Self-Organizing Feature Map or Kohonen network) [Kohonen, 1982] is an unsupervised competitive network that maps high-dimensional inputs onto a low-dimensional (usually two-dimensional) grid while preserving topology, so similar inputs map to nearby grid neurons (Figure 5.11). Training repeatedly finds the best-matching unit for each input and moves its weight vector, and those of its grid neighbours, towards the input until convergence.

5.5.4 Cascade-Correlation Networks

Constructive algorithms learn the network structure along with the parameters. Cascade-Correlation [Fahlman and Lebiere, 1990] starts with no hidden neurons and adds them one at a time (cascade), freezing each new neuron's input weights after maximizing the correlation between its output and the network error (Figure 5.12). It needs no preset depth and often trains fast, but tends to overfit on small data sets.

5.5.5 Elman Networks

Recurrent neural networks allow cyclic connections, so the network state depends on both the current input and the previous state, which suits time-series data. The Elman network [Elman, 1990] is among the most used (Figure 5.13): the hidden outputs at time t are fed back as inputs at time t+1, the hidden neurons use sigmoid activation, and training uses a generalized BP [Pineda, 1987].

5.5.6 Boltzmann Machines

Energy-based models assign an energy to each network state and treat learning as finding low-energy states. In a Boltzmann machine the neurons are Boolean, s ∈ {0, 1}^n, and the energy of state s is

  E(s) = - Σ_{i=1}^{n-1} Σ_{j=i+1}^{n} w_ij s_i s_j - Σ_{i=1}^{n} θ_i s_i .   (5.20)

When the network reaches equilibrium (a stationary distribution), the probability of state s is

  P(s) = e^{-E(s)} / Σ_t e^{-E(t)} .   (5.21)

The fully connected Boltzmann machine is intractable to train; the Restricted Boltzmann Machine (RBM, Figure 5.14(b)) keeps only connections between one visible and one hidden layer, so that

  P(v | h) = Π_i P(v_i | h) ,   (5.22)
  P(h | v) = Π_j P(h_j | v) ,   (5.23)

and is commonly trained with Contrastive Divergence (CD): from a visible vector v, sample h by (5.23), resample v' by (5.22) and h' from v', then update the weights by

  Δw = η ( v hᵀ - v' h'ᵀ ) .   (5.24)
5.6 Deep Learning

Models with more parameters have larger capacity and can handle more complex tasks, but are harder to train and easier to overfit; with big data and powerful computation this balance shifted, and deep learning, typically neural networks with many hidden layers, became practical. Simply stacking layers and training with classical BP fails, because the error signal diverges or vanishes as it propagates through many layers. One remedy is unsupervised layer-wise pre-training followed by fine-tuning, as in deep belief networks (DBN) [Hinton et al., 2006], which stack RBMs, pre-train each layer on the output of the previous one, and then fine-tune the whole network (e.g. with BP); pre-training can be viewed as finding a good initialization. Another remedy is weight sharing, used by convolutional neural networks (CNN) [LeCun and Bengio, 1995; LeCun et al., 1998]: in the handwritten-character network of Figure 5.15 a 32×32 image passes through alternating convolutional layers, which produce feature maps by applying shared filters, and pooling (subsampling) layers, which reduce resolution while keeping salient features, followed by fully connected layers, and the whole network is trained with BP. (The original CNN used sigmoid activations; modern networks often use the Rectified Linear Unit, ReLU.) From another viewpoint the layers of a deep network perform feature learning or representation learning, turning low-level inputs into representations on which the final classifier succeeds, thereby replacing hand-crafted feature engineering.
5.7
5.7
5.2 5.2 ( b)
5.4 (5.6 )
5.5 BP BP 3.0
3.0 p.84
4.3.
5.6 BP
uci
UCI BP
http:/ / archive.ics.uci.edu / ml / .
5.9* Elman BP
5.10
MNIST
MNIST
http: / / yann.lecun.com /
exdb/ mnist /.
Aarts, E. and J . Korst . (1989) . Simulated Annealing and Boltzmann Machines:
A Stochastic Approach to Combinatorial Optimization and Neural Comput-
ing. John Wiley Sons, New York, NY.
Ackley, D. H., G. E. Hinton, and T. J . Sejnowski. (1985). “A learning algorithm
for Boltzmann machines.” Cognitive Science^ 9 (1):147-169.
Barron, A. R. (1991) . “Complexity regularization with application to artifi-
cial neural networks. ” In Nonparametric Functional Estimation and Related
Topics; NATO A SI Series Volume 335 ( G. Roussas , ed.) , 561-576 , Kluwer ,
Amsterdam , The Netherlands.
Bengio, Y., A . Courville, and P. Vincent . ( 2013) . “Representation learning: A
review and new perspectives.” IEEE Transactions on Pattern Analysis and
Machine Intelligence 35(8) :1798-1828.
Bishop, C. M. (1995) . Neural Networks for Pattern Recognition. Oxford Uni-
versity Press , New York, NY.
Broomhead , D . S. and D. Lowe. (1988) . “ Multivariate functional interpolation
and adaptive networks.” Complex Systems, 2 ( 3) :321-355.
Carpenter , G. A. and S. Grossberg. (1987) . “A massively parallel architecture
for a self-organizing neural pattern recognition machine.” Computer Vision,
Graphics , and Image Processing , 37 (1) :54-115.
Carpenter , G. A. and S. Grossberg, eds. (1991). Pattern Recognition by Self -
Organizing Neural Networks. MIT Press, Cambridge, MA.
Chauvin , Y. and D. E. Rumelhart , eds. (1995). Backpropagation: Theory, Ar-
chitecture , and Applications. Lawrence Erlbaum Associates, Hillsdale, NJ .
Elman, J . L. (1990). “ Finding structure in time.” Cognitive Science , 14 ( 2 ) :
179-211.
Fahlman, S. E. and C. Lebiere. (1990) . “The cascade-correlation learning archi-
tecture.” Technical Report CMU-CS-90-100 , School of Computer Sciences,
Carnergie Mellon University, Pittsburgh , PA .
Gerstner , W. and W. Kistler. (2002 ) . Spiking Neuron Models: Single Neurons,
Populations , Plasticity. Cambridge University Press, Cambridge, UK.
Girosi, F., M. Jones, and T. Poggio. (1995) . “Regularization theory and neural
networks architectures.” Neural Computation, 7( 2 ) :219-269.
Goldberg , D . E. (1989 ) . Genetic Algorithms in Search, Optimizaiton and Ma-
chine Learning. Addison-Wesley, Boston , MA.
Gori, M. and A. Tesi. (1992). “On the problem of local minima in backpropa-
gation.” IEEE Transactions on Pattern Analysis and Machine Intelligence ,
14 (1) :76-86.
Haykin , S. (1998) . Neural Networks: A Comprehensive Foundation, 2nd edi-
tion. Prentice-Hall , Upper Saddle River , NJ .
Hinton , G. ( 2010) . “A practical guide to training restricted Boltzmann ma-
chines.” Technical Report UTML TR 2010-003, Department of Computer
Science, University of Toronto.
Hinton , G ., S. Osindero, and Y.-W. Teh. ( 2006) . “A fast learning algorithm for
deep belief nets.” Neural Computation, 18( 7) :1527-1554.
Hornik , K., M. Stinchcombe, and H . White. (1989 ) . “ Multilayer feedforward
networks are universal approximators.” Neural Networks , 2 (5) :359-366.
Kohonen , T. (1982) . “Self-organized formation of topologically correct feature
maps. Biological Cybernetics , 43(1):59-69.
Kohonen , T. (1988) . “ An introduction to neural computing.” Neural Networks ,
1(1):3-16.
• Kohonen , T. ( 2001) . Self - Organizing Maps , 3rd edition. Springer , Berlin.
LeCun , Y. and Y. Bengio. (1995). “Convolutional networks for images , speech ,
and time-series.” In The Handbook of Brain Theory and Neural Networks
( M. A. Arbib, ed.) , MIT Press, Cambridge, MA.
LeCun , Y., L. Bottou , Y. Bengio, and P. Haffner. (1998) . “Gradient-based
learning applied to document recognition.” Proceedings of the IEEE , 86 (11):
2278-2324.
MacKay, D. J . C. (1992 ) . “A practical Bayesian framework for backpropagation
networks.” Neural Computation, 4 ( 3) :448-472.
McCulloch , W. S. and W. Pitts. (1943) . “A logical calculus of the ideas im-
manent in nervous activity.” Bulletin of Mathematical Biophysics, 5( 4 ) :115-
133.
Minsky, M. and S. Papert. (1969 ) . Perceptrons. MIT Press, Cambridge, MA.
Orr , G. B. and K.-R. Muller , eds. (1998) . Neural Networks: Tricks of the Trade.
Springer , London, UK.
Park , J . and I. W. Sandberg. (1991) . “Universal approximation using radial-
basis-function networks.” Neural Computation, 3( 2 ):246-257.
Pineda, F. J . (1987). “Generalization of Back-Propagation to recurrent neural
,
networks.” Physical Review Letters 59 (19):2229-2232.
Reed , R. D. and R. J . Marks. (1998) . Neural Smithing: Supervised Learning in
Feedforward Artificial Neural Networks. MIT Press, Cambridge, MA.
Rumelhart , D. E. G. E. Hinton , and R. J . Williams. (1986a) . “Learning internal
representations by error propagation.” In Parallel Distributed Processing:
Explorations in the Microstructure of Cognition ( D. E. Rumelhart and J . L.
McClelland , eds. ) , volume 1, 318-362, MIT Press, Cambridge, MA.
Rumelhart , D. E., G. E. Hinton, and R. J. Williams. (1986b ) . “Learning repre-
sentations by backpropagating errors.” Nature , 323( 9 ) :318-362.
Schwenker , F. , H .A. Kestler , and G. Palm. ( 2001) . “Three learning phases for
radial- basis-function networks.” Neural Networks , 14( 4-5) :439-458.
Tickle, A. B., R. Andrews, M. Golea , and J . Diederich. (1998) . “The truth
will come to light: Directions and challenges in extracting the knowledge
embedded within trained artificial neural networks.” IEEE Transactions on
Neural Networks, 9 (6):1057-1067.
Werbos, P. (1974) . Beyond regression: New tools for prediction and analysis in
the behavior science. Ph.D. thesis, Harvard University, Cambridge , MA .
Yao, X. (1999 ) . “Evolving artificial neural networks.” Proceedings of the IEEE ,
87( 9 ) :1423 1447.
-
Zhou, Z .-H. ( 2004). “Rule extraction: Using neural networks or for neural
networks?” Journal of Computer Science and Technology , 19 ( 2 ) :249-253.
Historical note: the M-P neuron model and Hebbian learning launched neural-network research, and the Perceptron and Adaline brought its first boom; the book Perceptrons by MIT's Marvin Minsky and Seymour Papert, showing the limits of single-layer networks, sent the field into its first winter. Paul Werbos proposed the idea behind BP in 1974 but it went largely unnoticed; John Hopfield's network for NP-hard combinatorial problems and the PDP volumes of UCSD's David Rumelhart and James McClelland, which popularized BP, brought a second boom, later overshadowed by statistical learning when neural-network practice was seen as relying too heavily on "tricks". With the success of deep learning (e.g. on ImageNet) and venues such as NIPS, neural networks have returned to the centre of machine learning.
6.1 Margin and Support Vectors

Given a training set D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}, y_i ∈ {-1, +1}, classification seeks a separating hyperplane in the sample space. Among the many hyperplanes that separate the training samples (Figure 6.1), the most robust choice is the one "in the middle", farthest from both classes. A hyperplane is described by

  wᵀx + b = 0 ,   (6.1)

with normal vector w = (w_1; w_2; ...; w_d) and displacement b. The distance from a point x to the hyperplane (w, b) is

  r = |wᵀx + b| / ||w|| .   (6.2)

If the hyperplane classifies all training samples correctly, (w, b) can be rescaled so that

  wᵀx_i + b ≥ +1 for y_i = +1,  wᵀx_i + b ≤ -1 for y_i = -1 .   (6.3)

The samples closest to the hyperplane, for which equality holds, are the support vectors (Figure 6.2); the sum of the distances of one support vector of each class to the hyperplane is the margin

  γ = 2 / ||w|| .   (6.4)

The support vector machine (SVM) seeks the separating hyperplane with maximum margin,

  max_{w,b} 2/||w||   s.t.  y_i (wᵀx_i + b) ≥ 1,  i = 1, ..., m ,   (6.5)

equivalently

  min_{w,b} (1/2) ||w||²   s.t.  y_i (wᵀx_i + b) ≥ 1,  i = 1, ..., m .   (6.6)
6.2 The Dual Problem

Problem (6.6) is a convex quadratic program and can be solved directly, but its dual is more revealing. Introducing Lagrange multipliers α_i ≥ 0 and setting the derivatives of the Lagrangian with respect to w and b to zero gives

  w = Σ_{i=1}^{m} α_i y_i x_i ,   (6.9)
  0 = Σ_{i=1}^{m} α_i y_i ,   (6.10)

and substituting back yields the dual

  max_α  Σ_{i=1}^{m} α_i - (1/2) Σ_{i=1}^{m} Σ_{j=1}^{m} α_i α_j y_i y_j x_iᵀx_j
  s.t.  Σ_{i=1}^{m} α_i y_i = 0,  α_i ≥ 0,  i = 1, ..., m .   (6.11)

Solving for α gives the model

  f(x) = wᵀx + b = Σ_{i=1}^{m} α_i y_i x_iᵀx + b .   (6.12)

Since (6.6) has inequality constraints, the solution must satisfy the KKT (Karush-Kuhn-Tucker) conditions (see Appendix B.1):

  α_i ≥ 0,  y_i f(x_i) - 1 ≥ 0,  α_i ( y_i f(x_i) - 1 ) = 0 .   (6.13)

Hence for every sample either α_i = 0 or y_i f(x_i) = 1: samples with α_i = 0 do not appear in (6.12), so the final model depends only on the support vectors.

The dual is still a quadratic program whose size grows with the number of samples; special-purpose algorithms exploit its structure, the most famous being SMO (Sequential Minimal Optimization) [Platt, 1998]. SMO repeatedly picks two variables α_i, α_j, heuristically a pair that violates the KKT conditions most (which promises the largest decrease of the objective), fixes the others and solves the two-variable subproblem in closed form: the constraint α_i y_i + α_j y_j = c = -Σ_{k≠i,j} α_k y_k reduces it to a single-variable quadratic. The offset b is obtained from the support vectors, each satisfying y_s f(x_s) = 1 (6.17); a robust choice averages over all support vectors S:

  b = (1/|S|) Σ_{s∈S} ( y_s - Σ_{i∈S} α_i y_i x_iᵀx_s ) .   (6.18)
6.3 Kernel Functions

When the training samples are not linearly separable in the original space (e.g. XOR, Figure 6.3), they can be mapped to a higher-dimensional feature space in which a separating hyperplane exists; if the original space has finite dimension, such a feature space always exists. With the mapping φ(·), the model becomes f(x) = wᵀφ(x) + b (6.19), the primal is (6.6) with φ(x_i) in place of x_i (6.20), and the dual is

  max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j φ(x_i)ᵀφ(x_j)
  s.t.  Σ_i α_i y_i = 0,  α_i ≥ 0 .   (6.21)

Computing φ(x_i)ᵀφ(x_j) directly is difficult because the feature space may have very high (even infinite) dimension; the kernel trick avoids it by using a function

  κ(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ = φ(x_i)ᵀφ(x_j) ,   (6.22)

so the dual becomes

  max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j κ(x_i, x_j) ,   (6.23)

and the solution can be written as the support vector expansion

  f(x) = Σ_{i=1}^{m} α_i y_i κ(x, x_i) + b .   (6.24)

Theorem 6.1 characterizes valid kernels: a symmetric function κ on X × X is a kernel if and only if for every data set D = {x_1, ..., x_m} the kernel matrix K with K_ij = κ(x_i, x_j) is positive semidefinite; every such kernel implicitly defines a feature space, the Reproducing Kernel Hilbert Space (RKHS) [Scholkopf and Smola, 2002]. Commonly used kernels (Table 6.1) include the linear kernel x_iᵀx_j, the polynomial kernel (x_iᵀx_j)^d with degree d ≥ 1, the Gaussian (RBF) kernel exp(-||x_i - x_j||²/(2σ²)) with width σ > 0, the Laplacian kernel exp(-||x_i - x_j||/σ), and the sigmoid kernel tanh(β x_iᵀx_j + θ) with β > 0, θ < 0. New kernels can be built from old ones: any positive linear combination γ_1κ_1 + γ_2κ_2 is a kernel (6.25), the product κ_1κ_2 is a kernel (6.26), and κ(x, z) = g(x) κ_1(x, z) g(z) is a kernel for any function g(·) (6.27). The choice of kernel is crucial because it implicitly defines the feature space; e.g. the linear kernel is a common first choice for text data and the Gaussian kernel when little is known about the data.
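A brief numpy sketch (my own) that builds Gaussian and polynomial kernel matrices for a random sample set and checks, via eigenvalues, that they are symmetric positive semidefinite up to numerical round-off, as Theorem 6.1 requires.

```python
import numpy as np

def rbf_kernel(X, sigma=1.0):
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T         # ||x_i - x_j||^2
    return np.exp(-d2 / (2 * sigma**2))

def poly_kernel(X, degree=3):
    return (X @ X.T) ** degree

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))
for name, K in [("RBF", rbf_kernel(X)), ("poly", poly_kernel(X))]:
    eigs = np.linalg.eigvalsh(K)                          # K is symmetric
    print(f"{name}: symmetric={np.allclose(K, K.T)}, min eigenvalue={eigs.min():.2e}")
```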
6.4 Soft Margin and Regularization

In practice it is hard to find a kernel that makes the training samples perfectly separable, and even then the perfect separation may be overfitting. The soft margin allows some samples to violate the constraint y_i(wᵀx_i + b) ≥ 1 (6.28), whereas the hard margin above requires all to satisfy it. Violations should be few, giving

  min_{w,b} (1/2)||w||² + C Σ_{i=1}^{m} ℓ_{0/1}( y_i(wᵀx_i + b) - 1 ) ,   (6.29)

where C > 0 and ℓ_{0/1}(z) is 1 if z < 0 and 0 otherwise (6.30): as C → ∞ all samples must satisfy the constraint and (6.29) becomes (6.6), while finite C tolerates some violations. Since ℓ_{0/1} is non-convex and discontinuous, surrogate losses replace it, typically convex continuous upper bounds (Figure 6.5): the hinge loss max(0, 1 - z), the exponential loss exp(-z) and the logistic loss log(1 + exp(-z)). With the hinge loss, (6.29) becomes

  min_{w,b} (1/2)||w||² + C Σ_{i=1}^{m} max( 0, 1 - y_i(wᵀx_i + b) ) ,   (6.34)

and introducing slack variables ξ_i ≥ 0 gives the standard soft-margin SVM

  min_{w,b,ξ} (1/2)||w||² + C Σ_{i=1}^{m} ξ_i
  s.t.  y_i(wᵀx_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0,  i = 1, ..., m .   (6.35)
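Problem (6.35) can also be attacked in the primal. The sketch below (my own numpy code, not SMO and not the book's algorithm) runs batch subgradient descent on the unconstrained form (6.34) for a linear soft-margin SVM on toy data; step size and iteration count are arbitrary choices of mine.

```python
import numpy as np

def linear_svm(X, y, C=1.0, lr=0.001, iters=2000):
    """Batch subgradient descent on (1/2)||w||^2 + C*sum_i max(0, 1 - y_i(w^T x_i + b))  (6.34)."""
    m, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)
        active = margins < 1                       # samples with nonzero hinge loss
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0)
        grad_b = -C * y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)), rng.normal(2, 1, size=(50, 2))])
y = np.array([-1] * 50 + [1] * 50)
w, b = linear_svm(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```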
Like (6.6), problem (6.35) is a quadratic program. Its Lagrangian, with multipliers α_i ≥ 0 and μ_i ≥ 0, is

  L(w, b, α, ξ, μ) = (1/2)||w||² + C Σ_i ξ_i + Σ_i α_i ( 1 - ξ_i - y_i(wᵀx_i + b) ) - Σ_i μ_i ξ_i ,   (6.36)

and setting the derivatives to zero gives w = Σ_i α_i y_i x_i (6.37), 0 = Σ_i α_i y_i (6.38) and C = α_i + μ_i (6.39), leading to the dual

  max_α  Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j x_iᵀx_j
  s.t.  Σ_i α_i y_i = 0,  0 ≤ α_i ≤ C ,   (6.40)

which differs from (6.11) only in the box constraint 0 ≤ α_i ≤ C, so the same algorithms (e.g. SMO, Sec. 6.2) and the kernelized form (6.24) apply. The KKT conditions require α_i ≥ 0, μ_i ≥ 0, y_i f(x_i) - 1 + ξ_i ≥ 0, α_i( y_i f(x_i) - 1 + ξ_i ) = 0, ξ_i ≥ 0 and μ_i ξ_i = 0 (6.41): samples with α_i = 0 do not affect the model; samples with 0 < α_i < C lie exactly on the margin; samples with α_i = C have μ_i = 0 and lie inside the margin (ξ_i ≤ 1) or are misclassified (ξ_i > 1). The soft-margin solution thus still depends only on the support vectors, the sparsity being preserved by the hinge loss. Using the logistic loss instead yields essentially logistic regression (cf. (3.27)), which outputs probabilities naturally and extends directly to multi-class tasks but loses sparsity; probability outputs for SVMs need extra work [Platt, 2000], and multi-class SVMs need decomposition or reformulation [Hsu and Lin, 2002].

More generally these methods minimize an objective of the form

  min_f  Ω(f) + C Σ_{i=1}^{m} ℓ( f(x_i), y_i ) ,   (6.42)

where Ω(f) is the structural risk, describing properties of the model itself, and the sum is the empirical risk, describing the fit to the data, with C trading them off. From this viewpoint the first term is a regularization term and C the regularization constant: L2 regularization ||w||₂² favours many small weights, while L0 and L1 regularization favour sparse solutions with many zero components (see Sec. 11.4).
6.5 Support Vector Regression

For regression we want f(x) = wᵀx + b close to y. Support Vector Regression (SVR) tolerates a deviation of at most ε between f(x) and y: only samples outside the band of width 2ε around f(x) incur loss (Figure 6.6). The problem is

  min_{w,b} (1/2)||w||² + C Σ_{i=1}^{m} ℓ_ε( f(x_i) - y_i ) ,   (6.43)

with the ε-insensitive loss ℓ_ε(z) = 0 if |z| ≤ ε and |z| - ε otherwise (6.44, Figure 6.7). Introducing slack variables ξ_i ≥ 0 and ξ̂_i ≥ 0 for deviations above and below the band:

  min_{w,b,ξ,ξ̂} (1/2)||w||² + C Σ_i ( ξ_i + ξ̂_i )
  s.t.  f(x_i) - y_i ≤ ε + ξ_i,  y_i - f(x_i) ≤ ε + ξ̂_i,  ξ_i ≥ 0,  ξ̂_i ≥ 0,  i = 1, ..., m .   (6.45)

Forming the Lagrangian with multipliers α_i, α̂_i, μ_i, μ̂_i ≥ 0 (6.46) and setting the derivatives to zero gives w = Σ_i (α̂_i - α_i) x_i (6.47), 0 = Σ_i (α̂_i - α_i) (6.48) and C = α_i + μ_i = α̂_i + μ̂_i (6.49)-(6.50), leading to the dual

  max_{α,α̂}  Σ_i [ y_i(α̂_i - α_i) - ε(α̂_i + α_i) ] - (1/2) Σ_i Σ_j (α̂_i - α_i)(α̂_j - α_j) x_iᵀx_j
  s.t.  Σ_i (α̂_i - α_i) = 0,  0 ≤ α_i, α̂_i ≤ C .   (6.51)

The KKT conditions (6.52) imply that α_i and α̂_i can be nonzero only when the sample lies on or outside the ε-band, and at most one of them is nonzero. The SVR solution is

  f(x) = Σ_{i=1}^{m} (α̂_i - α_i) x_iᵀx + b ,   (6.53)

which depends only on the samples outside the band, i.e. the SVR support vectors, so the solution is sparse. The offset b is obtained from any sample with 0 < α_i < C, for which f(x_i) = y_i + ε, giving

  b = y_i + ε - Σ_{j=1}^{m} (α̂_j - α_j) x_jᵀx_i ;   (6.54)

in practice one averages over several such samples. With a feature mapping, (6.47) becomes w = Σ_i (α̂_i - α_i) φ(x_i) (6.55) and the kernelized SVR is

  f(x) = Σ_{i=1}^{m} (α̂_i - α_i) κ(x, x_i) + b .   (6.56)
6.6 Kernel Methods

Whatever the training set {(x_1, y_1), ..., (x_m, y_m)}, the SVM and SVR solutions can be written as linear combinations of kernel evaluations at the training samples. This is an instance of the representer theorem: let H be the RKHS of kernel κ and ||h||_H the norm in H; then for any monotonically increasing Ω and any (not necessarily convex or differentiable) loss ℓ, every minimizer of

  min_{h∈H}  F(h) = Ω( ||h||_H ) + ℓ( h(x_1), h(x_2), ..., h(x_m) )   (6.57)

can be written as

  h*(x) = Σ_{i=1}^{m} α_i κ(x, x_i) .   (6.58)

The representer theorem licenses kernel methods: take a linear learner and "kernelize" it. A classic example is Kernelized Linear Discriminant Analysis (KLDA). Map the samples to a feature space F with φ and perform LDA there, h(x) = wᵀφ(x) (6.59), maximizing

  max_w  J(w) = wᵀ S_b^φ w / wᵀ S_w^φ w ,   (6.60)

where S_b^φ and S_w^φ are the between- and within-class scatter matrices in F, built from the class means μ_i^φ = (1/m_i) Σ_{x∈X_i} φ(x) (6.61)-(6.63), with m_0 + m_1 = m. By the representer theorem w = Σ_{i=1}^{m} α_i φ(x_i) (6.64)-(6.65), and with the kernel matrix K (K_ij = κ(x_i, x_j)) and the class indicator vectors the objective reduces to a generalized Rayleigh quotient in α,

  max_α  J(α) = αᵀ M α / αᵀ N α ,   (6.70)

where M and N are computed from K and the class memberships (6.66)-(6.69); it is solved as in linear LDA (Sec. 3.4), and the learned projection is h(x) = Σ_i α_i κ(x, x_i).
6.7
Mercer
-
[Cristianini and Shawe Taylor ,
2000] RKHS
12.4
Scholkopf et al.,
-
[Cristianini and Shawe Taylor , Burges,1998;
Scholkopf and Smola, 2002]
[Vapnik , 1999].
[Boyd and Vandenberghe,
2004]. SVM
SVM ( cutting plane algorithm) SVMperf
[Joachims, 2006], Pegasos
[Shalev-Shwartz et al., 2011],
[Hsieh et al., 2008]. SVM 0(m2)
CVM [Tsang et al , 2006]
Nystrom [Williams and Seeger ,
[Rahimi and Recht , 2007]
Nystrom [Yang et al., 2012].
•
consistency) [Vapnik and Chervonenkis, 1991]
[Zhang, 2004]
6.4
6.6 SVM
6.8 3.0a
LIBSVM SVR.
6.9
6.10* SVM
. ( 2009 ) .
Bach, R. R. G. R. G. Lanckriet , and M. I. Jordan. (2004). “Multiple kernel
learning, conic duality, and the SMO algorithm.” In Proceedings of the 21 st
International Conference on Machine Learning ( ICML ) , 6—13, Banff , Cana-
da.
Boyd , S. and L. Vandenberghe. ( 2004) . Convex Optimization. Cambridge Uni-
versity Press, Cambridge, UK .
Burges, C. J . C. (1998) . “A tutorial on support vector machines for pattern
recognition.” Data Mining and Knowledge Discovery , 2 (1) :121-167.
Chang, C.-C. and C.-J. Lin. ( 2011) . “LIBSVM: A library for support vector
machines.” ACM Transactions on Intelligent Systems and Technology, 2 ( 3) :
27.
Cortes, C. and V. N. Vapnik. (1995) . “Support vector networks.” Machine
Learning , 20 ( 3):273-297.
Cristianini, N. and J. Shawe-Taylor. ( 2000 ) . An Introduction to Support Vector
Machines and Other Kernel- Based Learning Methods. Cambridge University
Press, Cambridge, UK .
Drucker , H., C. J . C. Burges , L. Kaufman, A. J. Smola , and V. Vapnik. (1997) .
“Support vector regression machines.55 In Advances in Neural Information
Processing Systems 9 ( NIPS ) ( M. C. Mozer , M. I. Jordan, and T. Petsche,
eds.) , 155—161, MIT Press, Cambridge, MA .
Fan , R.-E. K .-W. Chang, C.-J . Hsieh, X.- R. Wang, and C.-J . Lin. ( 2008) .
ULIBLINEAR: A library for large linear classification.” Journal of Machine
Learning Research, 9:1871-1874.
Hsieh , C.-J ., K.-W. Chang , C.-J . Lin , S. S. Keerthi , and S. Sundararajan.
( 2008) . “A dual coordinate descent method for large-scale linear SVM.”
In Proceedings of the 25th International Conference on Machine Learning
( ICML ) , 408-415, Helsinki, Finland.
Hsu, C.-W. and C.-J . Lin. ( 2002 ) . “A comparison of methods for multi-class
,
support vector machines.5 IEEE Transactions on Neural Networks , 13( 2 ) :
415-425.
Joachims , T. (1998) . “Text classification with support vector machines: Learn-
ing with many relevant features.” In Proceedings of the 10th European Con-
ference on Machine Learning ( ECML ) , 137—142, Chemnitz , Germany.
Joachims, T. ( 2006). “Training linear SVMs in linear time.” In Proceedings of
the 12th ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining ( KDD ), 217-226, Philadelphia, PA.
Lanckriet , G. R . G ., N. Cristianini, and M. I. Jordan P. Bartlett , L. El Ghaoui.
( 2004 ). “Learning the kernel matrix with semidefinite programming.” Jour-
nal of Machine Learning Research, 5:27-72.
Osuna, E. , R. Freund , and F. Girosi. (1997). “An improved training algorithm
for support vector machines.” In Proceedings of the IEEE Workshop on Neu-
ral Networks for Signal Processing ( NNSP ), 276-285, Amelia Island , FL .
Platt , J . (1998) . “Sequential minimal optimization: A fast algorithm for train-
ing support vector machines.” Technical Report MSR-TR-98-14 , Microsoft
Research.
Platt , J . ( 2000 ) . “Probabilities for (SV) machines.” In Advances in Large Mar-
gin Classifiers ( A. Smola, P. Bartlett , B. Scholkopf , and D. Schuurmans,
eds.) 61-74, MIT Press, Cambridge, MA. .
Rahimi, A . and B. Recht. ( 2007) . “Random features for large-scale kernel ma-
chines.” In Advances in Neural Information Processing Systems 20 ( NIPS )
—
( J .C. Platt , D. Roller , Y. Singer , and S. Roweis, eds.) , 1177 1184, MIT Press,
Cambridge, MA.
Scholkopf , B., C. J . C. Burges, and A. J . Smola, eds. (1999 ). Advances in Kernel
SVM MacMne
SVM
A. Chervonenkis “ VC
RBF 5.5 . 1
SVM
NEC
( Facebook )
7.1 Bayesian Decision Theory

Bayesian decision theory makes optimal decisions under known probabilities. For an N-class task with labels Y = {c_1, ..., c_N}, let λ_ij be the loss of classifying a sample of true class c_j as c_i. Based on the posterior P(c_i | x), the expected loss (conditional risk) of labelling x as c_i is

  R(c_i | x) = Σ_{j=1}^{N} λ_ij P(c_j | x) .   (7.1)

We seek a rule h: X → Y minimizing the overall risk R(h) = E_x[ R(h(x) | x) ]; minimizing the conditional risk at every x minimizes the overall risk, giving the Bayes decision rule

  h*(x) = argmin_{c∈Y} R(c | x) ,   (7.3)

the Bayes optimal classifier, whose risk R(h*) is the Bayes risk, and 1 - R(h*) the best achievable accuracy. With the 0/1 loss, λ_ij = 0 if i = j and 1 otherwise (7.4), the conditional risk is R(c | x) = 1 - P(c | x) (7.5) and the rule becomes: choose the class with the largest posterior P(c | x).

The key is thus obtaining P(c | x). Discriminative models estimate P(c | x) directly (decision trees, BP networks and SVMs can be viewed this way), whereas generative models model the joint distribution P(x, c) and derive the posterior from it:

  P(c | x) = P(x, c) / P(x) = P(c) P(x | c) / P(x) ,   (7.7)-(7.8)

where P(c) is the class prior, P(x | c) the class-conditional probability (likelihood), and P(x) the evidence, which does not depend on the class. The prior P(c) is easily estimated from class frequencies by the law of large numbers; the difficulty is estimating P(x | c), since x involves all d attributes jointly and many attribute-value combinations never appear among the m training samples.
7.2 Maximum Likelihood Estimation

A common strategy for estimating the class-conditional probability is to assume it has a parametric form p(x | θ_c) and estimate θ_c from the data. Frequentists treat parameters as fixed unknown constants and estimate them, e.g. by maximum likelihood; Bayesians treat parameters as random variables with a prior distribution [Efron, 2005; Samaniego, 2010].

Maximum Likelihood Estimation (MLE): let D_c be the set of class-c samples, assumed i.i.d.; the likelihood of θ_c is

  P(D_c | θ_c) = Π_{x∈D_c} P(x | θ_c) ,   (7.9)

and since a product of many probabilities underflows easily, one maximizes the log-likelihood (cf. Appendix C.1.7)

  LL(θ_c) = log P(D_c | θ_c) = Σ_{x∈D_c} log P(x | θ_c) ,   (7.10)
  θ̂_c = argmax_{θ_c} LL(θ_c) .   (7.11)

For continuous attributes with p(x | c) ~ N(μ_c, σ_c²), the maximum-likelihood estimates are

  μ̂_c = (1/|D_c|) Σ_{x∈D_c} x ,   (7.12)
  σ̂_c² = (1/|D_c|) Σ_{x∈D_c} (x - μ̂_c)(x - μ̂_c)ᵀ ,   (7.13)

i.e. the sample mean and the sample covariance. MLE is convenient, but its accuracy depends heavily on whether the assumed distributional form matches reality.
7.3 Naive Bayes Classifiers

Estimating P(x | c) in (7.8) is hard because it is a joint distribution over all attributes. The naive Bayes classifier adopts the attribute conditional independence assumption: given the class, the attributes are independent. Then (7.8) becomes

  P(c | x) = P(c) P(x | c) / P(x) = ( P(c) / P(x) ) Π_{i=1}^{d} P(x_i | c) ,   (7.14)

with d the number of attributes and x_i the value of x on attribute i. Since P(x) is the same for all classes, the decision rule is

  h_nb(x) = argmax_{c∈Y} P(c) Π_{i=1}^{d} P(x_i | c) .   (7.15)

Training amounts to estimating P(c) and P(x_i | c) from D: with D_c the class-c samples,

  P(c) = |D_c| / |D| ,   (7.16)

for a discrete attribute, with D_{c,x_i} the class-c samples taking value x_i on attribute i,

  P(x_i | c) = |D_{c,x_i}| / |D_c| ,   (7.17)

and for a continuous attribute one assumes a density, e.g. p(x_i | c) ~ N(μ_{c,i}, σ_{c,i}²):

  p(x_i | c) = ( 1 / (√(2π) σ_{c,i}) ) exp( -(x_i - μ_{c,i})² / (2σ_{c,i}²) ) .   (7.18)

On the watermelon data 3.0 (Table 4.3, p. 84), classifying a test sample proceeds by estimating the class priors (P(ripe) = 8/17 ≈ 0.471, P(unripe) = 9/17 ≈ 0.529), the conditional probability of each of its attribute values under each class (e.g. P(colour = green | ripe) = 3/8 = 0.375), and the Gaussian densities of its density and sugar values; multiplying as in (7.15) gives roughly 0.038 for "ripe" and 6.80 × 10⁻⁵ for "unripe", so the sample is classified as ripe. (In practice the product of many small probabilities underflows, so sums of logarithms are used instead.)

If some attribute value never co-occurs with some class in the training set, (7.17) is zero and wipes out the information of all other attributes in the product (7.15): a test melon with sound = "crisp" would be judged unripe no matter what, since P(sound = crisp | ripe) = 0/8. To avoid this the estimates are smoothed with the Laplacian correction: with N the number of classes and N_i the number of values of attribute i,

  P̂(c) = ( |D_c| + 1 ) / ( |D| + N ) ,   (7.19)
  P̂(x_i | c) = ( |D_{c,x_i}| + 1 ) / ( |D_c| + N_i ) .   (7.20)

In the example the priors become P̂(ripe) = 9/19 ≈ 0.474 and P̂(unripe) = 10/19 ≈ 0.526, P̂(colour = green | ripe) = 4/11 ≈ 0.364, P̂(colour = green | unripe) = 4/12 ≈ 0.333, and the problematic P̂(sound = crisp | ripe) = 1/11 ≈ 0.091. The Laplacian correction avoids zero estimates due to small samples, washes out as the training set grows, and amounts to assuming a uniform prior over values and classes. In practice naive Bayes can be used in different ways: if speed matters, all probabilities can be precomputed and stored in tables; if data arrive incrementally, the counts can be updated online, and lazy evaluation is possible when predictions are rare.
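A compact sketch (mine, not from the book) of a discrete-attribute naive Bayes classifier with the Laplacian correction (7.19)-(7.20), using log-probabilities to avoid the underflow mentioned above; the data are a made-up toy set with integer-coded attribute values.

```python
import numpy as np

class NaiveBayes:
    """Discrete naive Bayes with Laplacian smoothing; attribute values coded 0..N_i-1."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.n_values = [int(X[:, i].max()) + 1 for i in range(X.shape[1])]
        N = len(self.classes)
        self.log_prior, self.log_cond = {}, {}
        for c in self.classes:
            Xc = X[y == c]
            self.log_prior[c] = np.log((len(Xc) + 1) / (len(X) + N))                   # (7.19)
            self.log_cond[c] = [
                np.log((np.bincount(Xc[:, i], minlength=Ni) + 1) / (len(Xc) + Ni))     # (7.20)
                for i, Ni in enumerate(self.n_values)]
        return self

    def predict(self, x):
        scores = {c: self.log_prior[c] + sum(self.log_cond[c][i][v] for i, v in enumerate(x))
                  for c in self.classes}
        return max(scores, key=scores.get)          # (7.15), in log form

X = np.array([[0, 1], [0, 0], [1, 1], [1, 0], [2, 1], [2, 2], [0, 2], [1, 2]])
y = np.array([1, 1, 1, 0, 0, 0, 1, 0])
nb = NaiveBayes().fit(X, y)
print("prediction for (0, 1):", nb.predict(np.array([0, 1])))
```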
7.4 Semi-Naive Bayes Classifiers

The attribute conditional independence assumption rarely holds exactly; semi-naive Bayes classifiers relax it by modelling a limited amount of attribute dependence. The most common strategy is One-Dependent Estimation (ODE): each attribute depends on at most one other attribute besides the class,

  P(c | x) ∝ P(c) Π_{i=1}^{d} P( x_i | c, pa_i ) ,   (7.21)

where pa_i is the parent attribute of x_i, and P(x_i | c, pa_i) is estimated as in (7.20). Approaches differ in how parents are chosen. SPODE (Super-Parent ODE) assumes all attributes depend on the same super-parent, selected e.g. by cross-validation (Figure 7.1(b)). TAN (Tree Augmented naive Bayes) builds a maximum weighted spanning tree over the attributes with edge weights given by the conditional mutual information

  I(x_i, x_j | y) = Σ_{x_i, x_j; c∈Y} P(x_i, x_j | c) log [ P(x_i, x_j | c) / ( P(x_i | c) P(x_j | c) ) ] ,   (7.22)

then directs the tree from a chosen root and adds the class as parent of every attribute, so only dependencies between strongly related attributes are kept. AODE (Averaged ODE) builds one SPODE for every attribute with sufficient support as super-parent and averages them:

  P(c | x) ∝ Σ_{i: |D_{x_i}| ≥ m'} P(c, x_i) Π_{j=1}^{d} P( x_j | c, x_i ) ,   (7.23)

where D_{x_i} is the set of samples taking value x_i on attribute i and m' is a threshold, commonly 30 [Webb et al., 2005]. Like naive Bayes, AODE needs only counting, with Laplacian-style smoothing:

  P̂(c, x_i) = ( |D_{c,x_i}| + 1 ) / ( |D| + N_i ) ,   (7.24)
  P̂(x_j | c, x_i) = ( |D_{c,x_i,x_j}| + 1 ) / ( |D_{c,x_i}| + N_j ) .   (7.25)

AODE thus needs no model selection and supports precomputation, lazy evaluation and incremental learning. Could accuracy improve further with kDE, allowing k parent attributes? Estimating P(x_i | c, pa_i) with multi-attribute parents needs exponentially more data, so larger k helps only when the training set is very large.
7.5 Bayesian Networks

A Bayesian network (belief network) B = ⟨G, Θ⟩ consists of a directed acyclic graph G whose nodes are attributes and whose edges denote direct dependence, together with the conditional probability tables Θ giving, for each attribute x_i, P_B(x_i | π_i) given its parent set π_i (Figure 7.2 shows an example).

7.5.1 Structure

The network assumes each attribute independent of its non-descendants given its parents, so the joint distribution factorizes as

  P_B(x_1, x_2, ..., x_d) = Π_{i=1}^{d} P_B( x_i | π_i ) = Π_{i=1}^{d} θ_{x_i|π_i} .   (7.26)

For the network of Figure 7.2, for instance, P(x_1, x_2, x_3, x_4, x_5) = P(x_1) P(x_2) P(x_3 | x_1) P(x_4 | x_1, x_2) P(x_5 | x_2). Three basic structures among variable triples (Figure 7.3) determine the conditional (in)dependencies: in the common-parent structure the children are independent given the parent; in the chain structure the two ends are independent given the middle variable; in the V-structure the two parents are marginally independent but become dependent once their common child is observed. For example, marginalizing x_4 out of P(x_1, x_2, x_4) gives P(x_1, x_2) = P(x_1) P(x_2) (7.27), i.e. marginal independence of x_1 and x_2. To read off all conditional independencies one uses D-separation: moralize the DAG (connect the parents of every V-structure and drop edge directions, giving the moral graph [Cowell et al., 1999]); then x and y are conditionally independent given z iff removing z separates them in the moral graph [Pearl, 1988] (Figure 7.4).

7.5.2 Learning

If the structure is known, learning reduces to counting to fill the conditional probability tables. Otherwise structure learning searches for the network that best describes the data, guided by a score function; a common choice, based on the Minimal Description Length (MDL) principle, is

  s(B | D) = f(θ) |B| - LL(B | D) ,   (7.28)

where |B| is the number of parameters, f(θ) the encoding cost per parameter, and

  LL(B | D) = Σ_{i=1}^{m} log P_B(x_i)   (7.29)

the log-likelihood of D = {x_1, ..., x_m}. With f(θ) = 1 this is AIC; with f(θ) = (1/2) log m it is the BIC score

  BIC(B | D) = (log m / 2) |B| - LL(B | D) ;   (7.31)

with f(θ) = 0 it reduces to maximum likelihood. For a fixed structure the table entries maximizing LL are the empirical conditional frequencies, θ_{x_i|π_i} = P̂_D(x_i | π_i) (7.32), so the search is over structures only; that search is NP-hard, and practical algorithms use greedy edge operations or restrict the structure, e.g. to trees as TAN does [Friedman et al., 1997].

7.5.3 Inference

A trained network answers queries: given evidence E = e, infer the posterior P(Q = q | E = e) of query variables Q = {Q_1, ..., Q_n}. Exact inference is NP-hard [Cooper, 1990]; a common approximation is Gibbs sampling (Figure 7.5): start from an arbitrary assignment q⁰ consistent with the evidence; in each of T steps, resample every non-evidence variable Q_i from its conditional distribution P_B(Q_i | Z = z) given the current values z of all other variables; finally estimate

  P(Q = q | E = e) ≈ n_q / T ,   (7.33)

where n_q counts the sampled states equal to q. Gibbs sampling is a random walk over the consistent assignments whose stationary distribution is the desired posterior; it converges slowly, and care is needed when the network contains zero probabilities.
7.6 The EM Algorithm

So far all attribute values were observed; in practice some variables are unobserved, i.e. latent variables. Let X be the observed variables, Z the latent variables and Θ the parameters. Maximum likelihood should maximize the marginal likelihood

  LL(Θ | X) = ln P(X | Θ) = ln Σ_Z P(X, Z | Θ) ,   (7.35)

obtained by summing Z out, which is hard to optimize directly. The EM (Expectation-Maximization) algorithm [Dempster et al., 1977] iterates from an initial Θ⁰: the E-step infers the distribution of Z given X and the current Θᵗ and forms the expected complete-data log-likelihood Q(Θ | Θᵗ) = E_{Z|X,Θᵗ}[ LL(Θ | X, Z) ]; the M-step maximizes it, Θᵗ⁺¹ = argmax_Θ Q(Θ | Θᵗ). In a simpler variant, the E-step computes the expected value of Z and the M-step maximizes the likelihood with Z fixed at that value. EM increases the likelihood monotonically and converges to a local optimum [Wu, 1983]; it can be viewed as a coordinate-descent-like alternation between the latent-variable distribution and the parameters (cf. Appendix B.5), and is the standard tool for estimating models with latent variables.
7.7 Further Reading

Despite its unrealistic independence assumption, naive Bayes works remarkably well in practice, notably in text classification [Lewis, 1998; McCallum and Nigam, 1998]; explanations are discussed in [Domingos and Pazzani, 1997; Ng and Jordan, 2002; Zhang, 2004]. Semi-naive extensions include LBR (lazy Bayesian rules) [Zheng and Webb, 2000] and kDE variants such as KDB [Sahami, 1996] and NBtree [Kohavi, 1996]. Note the distinction between Bayes classifiers, which use Bayes' theorem for classification, and Bayesian learning, which treats parameters as random variables [Bishop, 2006]. Bayesian networks, championed by J. Pearl [Pearl, 1988], are a central tool of probabilistic reasoning; structure learning is NP-hard [Cooper, 1990; Chickering et al., 2004], and much work addresses scoring, search and inference [Friedman and Goldszmidt, 1996; Grossman and Domingos, 2004; Jensen, 1997; Heckerman, 1998]. The EM algorithm and its extensions are covered in [McLachlan and Krishnan, 2008].
7.2*
7.3 3.0
P.151
7.4 (7.15)
Ilti I c) 0
7.5
3.4
7.7 d
(7.15)
P(c) x AODE (7.23)
P(c,Xi )
7.8 7.3, xi
_
X 3 \LX 4
y r z
yA. •
2.0 p . 76
7.9 2.0 BIC
4.1.
7.10 2.0 EM
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer,
New York, NY.
^
Chickering , D. M., D. Heckerman , and C. Meek. ( 2004). “Large-sample learning
of Bayesian networks is NP-hard.” Journal of Machine Learning Research,
5:1287-1330.
Chow, C. K. and C. N. Liu. (1968) . “Approximating discrete probability distri-
butions with dependence trees.55 IEEE Transactions on Information Theory ,
14 (3) :462-467.
Cooper , G. F. (1990) . “The computational complexity of probabilistic inference
using Bayesian belief networks.” Artificial Intelligence , 42 ( 2-3) :393-405.
Cowell, R. G., P. Dawid , S. L. Lauritzen , and D. J. Spiegelhalter. (1999 ). Prob-
abilistic Networks and Expert Systems. Springer , New York , NY.
Dempster, A. P., N. M. Laird, and D. B. Rubin. (1977). "Maximum likelihood from incomplete data via the EM algorithm." Journal of the Royal Statistical Society - Series B, 39(1):1-38.
Domingos , P. and M. Pazzani. (1997) . “On the optimality of the simple
Bayesian classifier under zero-one loss.” Machine Learning 29 ( 2-3) :103-130.
Efron, B. (2005). “Bayesians frequentists, and scientists.” Journal of the
American Statistical Association, 100 ( 469) :1-5.
Friedman, N. D. Geiger , and M. Goldszmidt . (1997) . “Bayesian network clas-
,
sifiers.7 Machine Learning , 29 ( 2-3):131-163.
Friedman, N. and M. Goldszmidt . (1996). “Learning Bayesian networks with
local structure.” In Proceedings of the 12th Annual Conference on Uncer-
tainty in Artificial Intelligence ( UAI ), 252— 262, Portland , OR.
Grossman , D. and P. Domingos. ( 2004) . “Learning Bayesian network classifiers
by maximizing conditional likelihood.” In Proceedings of the 21st Interna-
tional Conference on Machine Learning ( ICML ) , 46—53, Banff , Canada.
Heckerman, D. (1998) . “A tutorial on learning with Bayesian networks.” In
—
Learning in Graphical Models ( M. I. Jordan , ed.) , 301 354, Kluwer , Dor-
drecht , The Netherlands.
Jensen, F. V. (1997). An Introduction to Bayesian Networks. Springer, NY.
Kohavi, R. (1996 ) . “Scaling up the accuracy of naive-Bayes classifiers: A
decision-tree hybrid.” In Proceedings of the 2nd International Conference
on Knowledge Discovery and Data Mining ( KDD ), 202-207, Portland , OR.
Kononenko, I. (1991) . “Semi-naive Bayesian classifier.” In Proceedings of the
6th European Working Session on Learning ( EWSL ), 206—219, Porto, Por-
tugal.
Lewis , D. D. (1998). “ Naive ( Bayes) at forty: The independence assumption in
information retrieval. In Proceedings of the 10th European Conference on
Machine Learning ( ECML ) , 4—15, Chemnitz , Germany.
McCallum, A. and K . Nigam. (1998) . “A comparison of event models for naive
Bayes text classification.” In Working Notes of the AAAF 98 Workshop on
Learning for Text Cagegorization, Madison , WI.
McLachlan, G. and T. Krishnan. (2008) . The EM Algorithm and Extensions ,
2nd edition. John Wiley & Sons, Hoboken, NJ.
Ng, A. Y. and M. I. Jordan. (2002 ) . “On discriminative vs. generative classifiers:
A comparison of logistic regression and naive Bayes.” In chances in Neural
Information Processing Systems 14 ( NIPS ) (T. G. Dietterich , S. Becker , and
Z. Ghahramani, eds.) , 841-848, MIT Press, Cambridge, MA.
Pearl, J . (1988). Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann, San Francisco, CA.
Sahami, M. (1996) . “Learning limited dependence Bayesian classifiers.” In Pro-
ceedings of the 2nd International Conference on Knowledge Discovery and
Data Mining ( KDD ), 335—338, Portland, OR.
Samaniego, F. J. ( 2010) . A Comparison of the Bayesian and Frequentist Ap-
proaches to Estimation. Springer, New York, NY.
Webb, G. J. Boughton , and Z. Wang. ( 2005) . “ Not so naive Bayes: Aggregating
one-dependence estimators.” Machine Learning , 58(1):5-24.
Wu, C. F. Jeff. (1983). “On the convergence properties of the EM algorithm.”
Annals of Statistics, 11(1) :95-103.
Wu, X. , V. Kumar, J . R. Quinlan , J . Ghosh , Q. Yang, H. Motoda, G. J . M-
cLachlan, A. Ng , B. Liu, P. S. Yu , Z.-H. Zhou, M. Steinbach, D. J . Hand ,
and D. Steinberg. ( 2007) . “Top 10 algorithms in data mining.” Knowledge
and Information Systems, 14(1):1-37.
Zhang, H.(2004). “The optimality of naive Bayes.” In Proceedings of the
17th International Florida Artificial Intelligence Research Society Confer-
-
ence ( FLAIRS ) , 562 567, Miami , FL.
Zheng, Z . and G .I. Webb.(2000). “Lazy learning of Bayesian rules.” Machine
—
Learning , 41(1):53 84.
Historical note: Thomas Bayes (died 1761) was an English clergyman; the essay containing what is now called Bayes' theorem was published posthumously, edited by his friend Richard Price.
8.1 Individuals and Ensembles

Ensemble learning (also multi-classifier systems or committee-based learning) builds and combines multiple learners (Figure 8.1): individual learners are produced by existing algorithms and combined by some strategy. A homogeneous ensemble uses individual learners of the same type (e.g. all C4.5 trees or all BP networks), called base learners, produced by a base learning algorithm; a heterogeneous ensemble mixes types, its members then being called component learners. Ensembles often generalize markedly better than their members, and can even boost weak learners (only slightly better than random guessing) into strong ones, although in practice strong learners are usually combined.

To benefit, the individual learners should be accurate and diverse: as the voting example of Figure 8.2 shows, combining three classifiers helps when they are good and err on different samples, does nothing when they are identical, and hurts when each is worse than random. For binary classification with y ∈ {-1, +1}, suppose each base classifier has error rate ε, i.e. P(h_i(x) ≠ f(x)) = ε (8.1), and the ensemble of T classifiers predicts by majority vote,

  H(x) = sign( Σ_{i=1}^{T} h_i(x) ) .   (8.2)

If the base classifiers' errors are independent, Hoeffding's inequality bounds the ensemble error by

  P( H(x) ≠ f(x) ) ≤ exp( -(1/2) T (1 - 2ε)² ) ,   (8.3)

which drops exponentially with T. The independence assumption never holds in practice, since the learners are trained on the same data for the same task, and accuracy and diversity are in conflict; producing individual learners that are both accurate and diverse is the core problem of ensemble research. Ensemble methods fall into two families: sequential methods with strong dependence among learners, represented by Boosting, and parallel methods, represented by Bagging and Random Forest.
8.2 Boosting
Boosting
r r
Boosting AdaBoost [Freund and Schapire, 1997],
8.3
AdaBoost
can be viewed as fitting an additive model, i.e., a linear combination of base learners
$$H(\boldsymbol{x}) = \sum_{t=1}^{T} \alpha_t h_t(\boldsymbol{x}) \tag{8.4}$$
that minimizes the exponential loss
$$\ell_{\exp}(H \mid \mathcal{D}) = \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ e^{-f(\boldsymbol{x}) H(\boldsymbol{x})} \right] . \tag{8.5}$$
If $H(\boldsymbol{x})$ minimizes (8.5), consider the partial derivative of the exponential loss with respect to $H(\boldsymbol{x})$:
$$\frac{\partial \ell_{\exp}(H \mid \mathcal{D})}{\partial H(\boldsymbol{x})} = -e^{-H(\boldsymbol{x})} P\big(f(\boldsymbol{x})=1 \mid \boldsymbol{x}\big) + e^{H(\boldsymbol{x})} P\big(f(\boldsymbol{x})=-1 \mid \boldsymbol{x}\big) . \tag{8.6}$$
Setting (8.6) to zero gives
$$H(\boldsymbol{x}) = \frac{1}{2} \ln \frac{P(f(\boldsymbol{x})=1 \mid \boldsymbol{x})}{P(f(\boldsymbol{x})=-1 \mid \boldsymbol{x})} , \tag{8.7}$$
and therefore
$$\operatorname{sign}\big(H(\boldsymbol{x})\big) = \operatorname{sign}\left(\frac{1}{2} \ln \frac{P(f(\boldsymbol{x})=1 \mid \boldsymbol{x})}{P(f(\boldsymbol{x})=-1 \mid \boldsymbol{x})}\right) = \begin{cases} 1, & P(f(\boldsymbol{x})=1 \mid \boldsymbol{x}) > P(f(\boldsymbol{x})=-1 \mid \boldsymbol{x}) , \\ -1, & P(f(\boldsymbol{x})=1 \mid \boldsymbol{x}) < P(f(\boldsymbol{x})=-1 \mid \boldsymbol{x}) , \end{cases} \tag{8.8}$$
so $\operatorname{sign}(H(\boldsymbol{x}))$ attains the Bayes-optimal decision rule.
6.7
(consistent)
In AdaBoost, the first base classifier $h_1$ is obtained by applying the base learning algorithm to the original data distribution; $h_t$ and $\alpha_t$ are then generated iteratively. Once $h_t$ has been produced from distribution $\mathcal{D}_t$, its weight $\alpha_t$ should minimize the exponential loss of $\alpha_t h_t$:
$$\begin{aligned} \ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t) &= \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}_t}\left[ e^{-f(\boldsymbol{x}) \alpha_t h_t(\boldsymbol{x})} \right] \\ &= e^{-\alpha_t} P_{\boldsymbol{x} \sim \mathcal{D}_t}\big(f(\boldsymbol{x}) = h_t(\boldsymbol{x})\big) + e^{\alpha_t} P_{\boldsymbol{x} \sim \mathcal{D}_t}\big(f(\boldsymbol{x}) \neq h_t(\boldsymbol{x})\big) \\ &= e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t , \end{aligned} \tag{8.9}$$
where $\epsilon_t = P_{\boldsymbol{x} \sim \mathcal{D}_t}\big(h_t(\boldsymbol{x}) \neq f(\boldsymbol{x})\big)$. The derivative of the exponential loss is
$$\frac{\partial \ell_{\exp}(\alpha_t h_t \mid \mathcal{D}_t)}{\partial \alpha_t} = -e^{-\alpha_t} (1 - \epsilon_t) + e^{\alpha_t} \epsilon_t , \tag{8.10}$$
and setting (8.10) to zero yields the classifier weight
$$\alpha_t = \frac{1}{2} \ln\left( \frac{1 - \epsilon_t}{\epsilon_t} \right) . \tag{8.11}$$
8.3 6
After obtaining $H_{t-1}$, AdaBoost adjusts the sample distribution so that the next base learner $h_t$ can correct some of the mistakes of $H_{t-1}$; ideally $h_t$ should minimize
$$\begin{aligned} \ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) &= \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ e^{-f(\boldsymbol{x})(H_{t-1}(\boldsymbol{x}) + h_t(\boldsymbol{x}))} \right] \\ &= \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\, e^{-f(\boldsymbol{x}) h_t(\boldsymbol{x})} \right] . \end{aligned} \tag{8.12}$$
Since $f^2(\boldsymbol{x}) = h_t^2(\boldsymbol{x}) = 1$, the Taylor expansion of $e^{-f(\boldsymbol{x}) h_t(\boldsymbol{x})}$ gives the approximation
$$\ell_{\exp}(H_{t-1} + h_t \mid \mathcal{D}) \simeq \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} \left( 1 - f(\boldsymbol{x}) h_t(\boldsymbol{x}) + \frac{f^2(\boldsymbol{x}) h_t^2(\boldsymbol{x})}{2} \right) \right] = \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} \left( 1 - f(\boldsymbol{x}) h_t(\boldsymbol{x}) + \frac{1}{2} \right) \right] . \tag{8.13}$$
Hence the ideal base learner is
$$h_t(\boldsymbol{x}) = \underset{h}{\arg\min}\ \ell_{\exp}(H_{t-1} + h \mid \mathcal{D}) = \underset{h}{\arg\max}\ \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} f(\boldsymbol{x}) h(\boldsymbol{x}) \right] = \underset{h}{\arg\max}\ \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\left[ \frac{e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\!\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} \right]}\, f(\boldsymbol{x}) h(\boldsymbol{x}) \right] , \tag{8.14}$$
where $\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}[e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}]$ is a constant. Define the distribution
$$\mathcal{D}_t(\boldsymbol{x}) = \frac{\mathcal{D}(\boldsymbol{x})\, e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\!\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} \right]} . \tag{8.15}$$
By the definition of expectation, the ideal base learner is equivalently
$$h_t(\boldsymbol{x}) = \underset{h}{\arg\max}\ \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}_t}\big[ f(\boldsymbol{x}) h(\boldsymbol{x}) \big] . \tag{8.16}$$
Since $f(\boldsymbol{x}), h(\boldsymbol{x}) \in \{-1,+1\}$, we have
$$f(\boldsymbol{x}) h(\boldsymbol{x}) = 1 - 2\, \mathbb{I}\big(f(\boldsymbol{x}) \neq h(\boldsymbol{x})\big) , \tag{8.17}$$
so the ideal base learner is
$$h_t(\boldsymbol{x}) = \underset{h}{\arg\min}\ \mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}_t}\left[ \mathbb{I}\big(f(\boldsymbol{x}) \neq h(\boldsymbol{x})\big) \right] . \tag{8.18}$$
Thus $h_t$ minimizes the classification error under distribution $\mathcal{D}_t$, and its error $\epsilon_t = P_{\boldsymbol{x} \sim \mathcal{D}_t}(h_t(\boldsymbol{x}) \neq f(\boldsymbol{x}))$ should be less than 0.5. Considering the relation between $\mathcal{D}_t$ and $\mathcal{D}_{t+1}$,
$$\begin{aligned} \mathcal{D}_{t+1}(\boldsymbol{x}) &= \frac{\mathcal{D}(\boldsymbol{x})\, e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\!\left[ e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})} \right]} = \frac{\mathcal{D}(\boldsymbol{x})\, e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})}\, e^{-f(\boldsymbol{x}) \alpha_t h_t(\boldsymbol{x})}}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\!\left[ e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})} \right]} \\ &= \mathcal{D}_t(\boldsymbol{x}) \cdot e^{-f(\boldsymbol{x}) \alpha_t h_t(\boldsymbol{x})}\, \frac{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\!\left[ e^{-f(\boldsymbol{x}) H_{t-1}(\boldsymbol{x})} \right]}{\mathbb{E}_{\boldsymbol{x} \sim \mathcal{D}}\!\left[ e^{-f(\boldsymbol{x}) H_t(\boldsymbol{x})} \right]} , \end{aligned} \tag{8.19}$$
which is exactly the sample-distribution update rule used in the AdaBoost algorithm of Figure 8.3. The weight formula (8.11) and the update rule (8.19) are thus both derived from iteratively optimizing the exponential loss of an additive model.
8.3 AdaBoost
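The update rules (8.11) and (8.19) translate directly into code. Below is a minimal sketch of AdaBoost with decision stumps as base learners, following the re-weighting scheme of Figure 8.3; the function and variable names (`fit_stump`, `adaboost`) are illustrative choices, not from the book.

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively search a one-split decision stump minimizing weighted 0/1 error."""
    m, d = X.shape
    best = (np.inf, 0, 0.0, 1)                  # (error, feature, threshold, polarity)
    for j in range(d):
        for thr in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(s * (X[:, j] - thr) > 0, 1, -1)
                err = np.sum(w[pred != y])
                if err < best[0]:
                    best = (err, j, thr, s)
    err, j, thr, s = best
    return err, (lambda Z: np.where(s * (Z[:, j] - thr) > 0, 1, -1))

def adaboost(X, y, T=20):
    """AdaBoost with sample re-weighting; y must be in {-1, +1}."""
    m = len(y)
    w = np.full(m, 1.0 / m)                     # initial distribution D_1
    learners, alphas = [], []
    for _ in range(T):
        eps, h = fit_stump(X, y, w)
        if eps > 0.5:                           # base learner worse than random: stop
            break
        eps = max(eps, 1e-12)
        alpha = 0.5 * np.log((1 - eps) / eps)   # Eq. (8.11)
        w = w * np.exp(-alpha * y * h(X))       # Eq. (8.19), up to normalization
        w /= w.sum()
        learners.append(h); alphas.append(alpha)
    def H(Z):                                   # Eqs. (8.2)/(8.4): weighted vote
        return np.sign(sum(a * h(Z) for a, h in zip(alphas, learners)))
    return H
```

On a toy dataset, `adaboost(X, y)(X_test)` returns the ensemble's {-1, +1} predictions.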
Boosting
-
re weighting)
resampling)
Boosting
8.3
r
2.5
Boosting Boosting
4.3
4.5 3.0a AdaBoost (size)
8.4
0.2 0.2
8.1
8.3 . 1 Bagging
Bagging
Bagging [Breiman ,1996a]
Bootstrap AGGregatING
2.2.3 ( bootstrap sampling).
m
m
m
(2.1)
T m
Bagging
Bagging
Bagging 8.5
Z) {( xi , 2/i ) , ( x 2 , 2/2 ) , • • • (® m , y m ) }\
£;
r.
for t T do
2: ht = £> ( D , T>bs )
3 end for
H ( x ) = argmax Kht ( x ) = y )
8.5 Bagging
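A minimal sketch of the procedure in Figure 8.5: each base learner is trained on a bootstrap sample of D and the predictions are combined by voting. Scikit-learn's DecisionTreeClassifier is used only as a stand-in base learner (any learner with fit/predict would do), and the helper names are mine, not the book's; integer class labels are assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging(X, y, T=25, base=DecisionTreeClassifier):
    """Train T base learners on bootstrap samples of (X, y)."""
    m = len(y)
    rng = np.random.default_rng(0)
    learners = []
    for _ in range(T):
        idx = rng.integers(0, m, size=m)           # bootstrap sampling (Section 2.2.3)
        learners.append(base().fit(X[idx], y[idx]))
    return learners

def bagging_predict(learners, X):
    """Combine predictions by plurality voting; labels assumed to be 0..K-1 integers."""
    votes = np.array([h.predict(X) for h in learners])        # shape (T, n_samples)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```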
8.3 Bagging
0(m) Bagging
T(0(m) + 0(5)), r
Bagging
Bagging
AdaBoost
AdaBoost Bagging
[Zhou , 2012].
Bagging
2.2.3
-
out of-bag estimate) [Breiman ,1996a;
Wolpert and Macready , 1999].
Dt Hoob( x )
x ®
Hoob{x) argmax
^ I( /it (aj) y) l ( x Dt) (8.20)
Bagging
eoob = E y) . (8.21)
( x ,y ) eD
2.5
Bagging
8.5.3
4.5 3.0a Bagging
8.6
8.3.2
( Random Forest , RF) [Breiman , 2001a] Bagging
. RF Bagging
d )
RF A:
S
1
S "
(b) (b)
8.6 3 . 0o Bagging
k k = d,
k A; log2 d
[Breiman , 2001a].
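Random forest differs from Bagging only in how each split is chosen: at every node a random subset of k attributes is drawn and the best split is searched within that subset, with k = log2(d) recommended. The sketch below shows just this attribute-subsampling step as it might be inserted into a decision-tree learner; `choose_split` and its arguments are hypothetical names, not from the book.

```python
import numpy as np

def choose_split(X, y, k=None, rng=np.random.default_rng()):
    """Pick the best split, but only among a random subset of k attributes.

    With k = d this degenerates to a conventional decision tree (as used in Bagging);
    the recommended default is k = log2(d) [Breiman, 2001a].
    """
    d = X.shape[1]
    if k is None:
        k = max(1, int(np.log2(d)))
    candidates = rng.choice(d, size=k, replace=False)   # random attribute subset
    best = None
    for j in candidates:
        for thr in np.unique(X[:, j]):
            left, right = y[X[:, j] <= thr], y[X[:, j] > thr]
            if len(left) == 0 or len(right) == 0:
                continue
            # weighted Gini impurity of the candidate split
            gini = sum(len(part) / len(y) *
                       (1 - sum(np.mean(part == c) ** 2 for c in np.unique(part)))
                       for part in (left, right))
            if best is None or gini < best[0]:
                best = (gini, j, thr)
    return best          # (impurity, attribute index, threshold)
```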
Bagging Bagging
8.5.3
Bagging 8.7
[Figure 8.7: test error of random forest versus Bagging as the ensemble size grows; with few base learners random forest may start off worse, but as more learners are added it typically converges to a lower generalization error.]
8.4
[Dietterich, 2000]:
[Figure 8.8: illustration of the benefits of combining learners (after Dietterich, 2000): combination reduces the risk of selecting a single poor hypothesis, helps escape poor local minima of the search, and enlarges the hypothesis space that can be represented.]
T l 5 /12 ,...,hT }
hi( x ).
8.4 .1
For numerical outputs $h_i(\boldsymbol{x}) \in \mathbb{R}$, the most common combination strategy is averaging.
• Simple averaging:
$$H(\boldsymbol{x}) = \frac{1}{T} \sum_{i=1}^{T} h_i(\boldsymbol{x}) . \tag{8.22}$$
• Weighted averaging:
$$H(\boldsymbol{x}) = \sum_{i=1}^{T} w_i h_i(\boldsymbol{x}) , \tag{8.23}$$
where $w_i$ is the weight of $h_i$, usually required to satisfy $w_i \geqslant 0$ and $\sum_{i=1}^{T} w_i = 1$. Weighted averaging was used by Breiman [1996b] in stacked regression; simple averaging is the special case with all $w_i = 1/T$.
[Markowitz ,1952], [Perrone and Cooper, 1993]
8.4. 2
{ c U C2 , ... , C N }
(voting).
iV hf ( x )) h{( x )
• Majority voting:
$$H(\boldsymbol{x}) = \begin{cases} c_j, & \text{if } \displaystyle\sum_{i=1}^{T} h_i^{j}(\boldsymbol{x}) > 0.5 \sum_{k=1}^{N} \sum_{i=1}^{T} h_i^{k}(\boldsymbol{x}) ; \\ \text{reject}, & \text{otherwise}, \end{cases} \tag{8.24}$$
i.e., predict class $c_j$ if it receives more than half of the votes, and refuse to predict otherwise.
• Plurality voting:
$$H(\boldsymbol{x}) = c_{\arg\max_{j} \sum_{i=1}^{T} h_i^{j}(\boldsymbol{x})} , \tag{8.25}$$
i.e., predict the class with the most votes, breaking ties arbitrarily.
• Weighted voting:
$$H(\boldsymbol{x}) = c_{\arg\max_{j} \sum_{i=1}^{T} w_i h_i^{j}(\boldsymbol{x})} , \tag{8.26}$$
where $w_i$ is the weight of $h_i$, usually with $w_i \geqslant 0$ and $\sum_{i=1}^{T} w_i = 1$.
The standard majority voting rule (8.24) offers a "reject" option, which is valuable in tasks with high reliability requirements. If the task must always produce a prediction, majority voting degenerates to plurality voting; in that case (8.24) and (8.26) are often not distinguished and are referred to collectively as voting.
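A small sketch of hard voting for class labels, assuming each classifier outputs a one-hot vote vector $h_i^j(\boldsymbol{x})$ as in (8.24)-(8.26); names are illustrative.

```python
import numpy as np

def weighted_vote(votes, weights=None, reject_below_half=False):
    """Combine class votes.

    votes:   array of shape (T, N) -- votes[i, j] is h_i^j(x) in {0, 1}.
    weights: optional length-T array w_i (Eq. 8.26); plain voting if None.
    Returns the winning class index, or None if majority voting is requested
    and no class receives more than half of the total votes (Eq. 8.24).
    """
    votes = np.asarray(votes, dtype=float)
    if weights is not None:
        votes = votes * np.asarray(weights)[:, None]
    scores = votes.sum(axis=0)                 # total (weighted) votes per class
    winner = int(np.argmax(scores))            # plurality / weighted voting (8.25)/(8.26)
    if reject_below_half and scores[winner] <= 0.5 * scores.sum():
        return None                            # majority voting rejects (8.24)
    return winner

# Example with three classifiers and classes {c1, c2, c3}
print(weighted_vote([[1, 0, 0], [0, 1, 0], [0, 1, 0]]))                           # -> 1
print(weighted_vote([[1, 0, 0], [0, 1, 0], [0, 0, 1]], reject_below_half=True))   # -> None
```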
MW
e{0,1}, q
0. ( h a r d voting).
h{ ( x ) e [0,1], _
P (q x )
(soft voting).
hi ( x )
Platt
(Platt scaling) [Platt , (isotonic regression) [Zadrozny and
Elkan , 2001] calibration)
h{ ( x ) o)
8.4.3
Stacking
Stacking [Wolpert , Breiman ,1996b]
-
(meta learner ).
Stacking
Stacking 8.9
.• • ® m V m )}
£I ...
ii.
-
£t for t . . . T do
^
ht = t( ) D
3: end for
4 D' = 0;
5 for i = 1, 2 , . . . , m do
for t = 1, 2 , . . . , T do
^it =
end for
D' Dr U (( 1, 2 , . . . , ziT ) , yi )\
10 end for ^ ^
V 11 h! = Z [ D' )\
H ( x ) = hr( hi ( x ) , h2 { x ) , .. HT ( X ) )
8.9 Stacking
/c D fc
DI , D2 , ... 14. Dj = D\ Dj j
T t
h{tj\Xi),
A k1;A2;... ziT ) ,
r
zr
MLR Stacking
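A minimal sketch of Stacking as described in Figure 8.9: the first-level learners are trained on D, their out-of-fold predictions on the training samples form the new dataset D', and a meta-learner h' is trained on D'. k-fold splitting is used to generate the z_it values, as the text recommends; the scikit-learn estimators are only stand-ins for the unspecified learning algorithms, and numeric class labels are assumed.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def stacking_fit(X, y, first_level, meta, k=5):
    """Train first-level learners and a meta-learner on their out-of-fold outputs."""
    m, T = len(y), len(first_level)
    Z = np.zeros((m, T))                       # D': one column z_t per first-level learner
    for t, make_learner in enumerate(first_level):
        for train_idx, test_idx in KFold(n_splits=k, shuffle=True, random_state=0).split(X):
            h = make_learner().fit(X[train_idx], y[train_idx])
            Z[test_idx, t] = h.predict(X[test_idx])   # z_it: prediction on the held-out fold
    learners = [make().fit(X, y) for make in first_level]   # refit on the full data for use at test time
    h_prime = meta().fit(Z, y)                 # meta-learner trained on (z_i, y_i)
    return learners, h_prime

def stacking_predict(learners, h_prime, X):
    Z = np.column_stack([h.predict(X) for h in learners])
    return h_prime.predict(Z)

# Usage sketch:
# learners, h_prime = stacking_fit(X, y,
#     first_level=[DecisionTreeClassifier, lambda: SVC()], meta=LogisticRegression)
# y_hat = stacking_predict(learners, h_prime, X_test)
```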
8.5
8.5.1
8.1
A( h x ) WiA( hi x )
.
i
^
2
= Y i= iWi
^^
hi ~ H (x) ) (8.28)
(/(*) - hi ( x ) )
2
E ( hi x ) (8.29)
2
E( H \ x ) = ( f ( x ) - H ( x ) ) . (8.30)
E( /i 4 = Ej= i E ( hi|x )
T
A( h x ) =
i= l
WjE ( hj x ) — E( H \ x )
= E{ h | x ) - E( H | x) . (8.31)
(8.31) ® p( x )
Ai
A( hi ) .
Ei
J E ( hi x)p{x ) dx (8.33)
E
E= E{H \ x)p(x )dx . (8.35)
E [ H ).
(8.36)
[Krogh and Vedelsby, 1995]
( error-ambiguity decomposition ) .
8.5.2
(diversity measure)
8.5
Given the contingency table of a pair of classifiers $h_i$ and $h_j$ on a binary problem ($y \in \{-1,+1\}$):

              h_i = +1    h_i = -1
  h_j = +1       a           c
  h_j = -1       b           d

where $a$ is the number of samples on which both predict $+1$, and so on, with $a + b + c + d = m$. Common diversity measures include:
• Disagreement measure:
$$dis_{ij} = \frac{b + c}{m} , \tag{8.37}$$
whose range is $[0, 1]$; larger values mean greater diversity.
• Correlation coefficient:
$$\rho_{ij} = \frac{ad - bc}{\sqrt{(a+b)(a+c)(c+d)(b+d)}} , \tag{8.38}$$
whose range is $[-1, 1]$; it is 0 if $h_i$ and $h_j$ are independent, positive if they are positively correlated, and negative otherwise.
• Q-statistic:
$$Q_{ij} = \frac{ad - bc}{ad + bc} , \tag{8.39}$$
which has the same sign as $\rho_{ij}$ and satisfies $|Q_{ij}| \geqslant |\rho_{ij}|$.
• $\kappa$-statistic:
$$\kappa = \frac{p_1 - p_2}{1 - p_2} , \tag{8.40}$$
where $p_1$ is the probability that the two classifiers agree and $p_2$ is the probability that they agree by chance:
$$p_1 = \frac{a + d}{m} , \tag{8.41}$$
$$p_2 = \frac{(a+b)(a+c) + (c+d)(b+d)}{m^2} . \tag{8.42}$$
If $h_i$ and $h_j$ agree on every sample then $\kappa = 1$; if they agree only by chance then $\kappa = 0$. $\kappa$ is usually non-negative, becoming negative only when the agreement is even lower than expected by chance.
The measures above are all pairwise diversity measures, which are easily computed from the 2x2 contingency table.
8.10
2()
8.10 UCI
-
tic tac - toe C4.5
8.5.3
•
Bagging
AdaBoost
/c (stable
base learner ),
8.5
•
subspace,
8.11
_D ={( , , , ,
aji yi ) (»2 2/2 ) * * *
0m ,2/m )};
T;
!
6 d.
for t = .. . , ,T do
/
Tt Tt = RS ( D , d )
Dt 3: Dt = Map ( D )
ht = £( A )
5: end for
H { x ) = argmax 1 (Mapjrt (a;)) = y )
yey
8.11
•
Flipping Output ) [Breiman, 2000]
( Output
Smearing) [Breiman , 2000]
EC0C [Dietterich and
Bakiri 1995]
EC0C 3.5
•
( Negative Correlation) [Liu and Yao, 1999]
8.3.2
[Zhou, 2012].
8.6
[Zhou , 2012],
[Kuncheva, Rokach 2010b]
[Schapire and Freund ,2012] Boosting
Boosting [Schapire, 1990] [Kearns and Valiant , 1989]
Boosting
[Freund and Schapire,1997] AdaBoost ,
AdaBoost AdaBoost
margin theory) [Schapire et al., 1998]
8.6
[Zhou, 2014].
-
DS
( mixture of experts)
[Kuncheva and Whitaker, Tang et al., 2006]
[Zhou,2012]
(ensemble pruning).
tive ensemble),
-
selec
[Rokach 2010a]; [Zhou et al., 2002]
tion ). -
( ensemble selec [Zhou, 2012]
[Zhou, 2012]
KDDCup
-
twice learning) NeC4.5 [Zhou and
Jiang, 2004]. [Zhou,2012]
8.1 p 1 - p. H[n)
n A:
(1)^ (1 n-1
^
P(H(n) k) E
=i 0 V
P) • (8.43)
^
P(H(n) {p • (8.44)
(8.3).
8.2 0/1
/(
-/ H{x)
-
[ oo,5] (J > 0)
8.3 AdaBoost,
3.0o p.89
3.0a AdaBoost 8.4
4.5.
8.6 Bagging
8.7 Bagging
8.10* A:
Breiman, L. (1996a). “Bagging predictors.” Machine Learning , 24( 2 ) :123-140.
Breiman, L. (1996b). “Stacked regressions.” Machine Learning , 24(1) :49-64.
Breiman, L. ( 2000 ). “ Randomizing outputs to increase prediction accuracy.”
Machine Learning , 40 ( 3) :113-120.
Breiman, L. ( 2001a). “Random forests.” Machine Learning , 45(1):5-32.
Breiman, L. ( 2001b). “Using iterated bagging to debias regressions.” Machine
Learning , 45(3):261-277.
Clarke, B. (2003). “Comparing Bayes model averaging and stacking when mod-
el approximation error cannot be ignored.” Journal of Machine Learning
Research, 4:683-712.
Demiriz , A., K . P. Bennett , and J. Shawe-Taylor. ( 2008). “Linear programming
Boosting via column generation.” Machine Learning , 46(1-3):225-254.
Dietterich, T. G. ( 2000 ) . “Ensemble methods in machine learning.” In Pro-
ceedings of the 1st International Workshop on Multiple Classifier Systems
( MCS ), 1—15, Cagliari, Italy.
Dietterich, T. G. and G. Bakiri. (1995). “Solving multiclass learning problems
via error-correcting output codes.” Journal of Artificial Intelligence Re-
search^ 2:263-286.
Freund , Y. and R. E. Schapire. (1997). “A decision-theoretic generalization of
on-line learning and an application to boosting.” Journal of Computer and
System Sciences^ 55(1):119-139.
Friedman , J ., T. Hastie, and R. Tibshirani. ( 2000) . “Additive logistic regres-
sion: A statistical view of boosting (with discussions).” Annals of Statistics ,
28( 2 ):337-407.
Friedman , J . H. (2001). “Greedy function approximation: A gradient Boosting
machine.” Annals of Statistics, 29 (5):1189-1232.
Ho, T. K. (1998). “The random subspace method for constructing decision
forests.” IEEE Transactions on Pattern Analysis and Machine Intelligence ,
20(8) :832-844.
Ho, T. K., J. J . Hull, and S. N. Srihari. (1994). “Decision combination in multi-
•
-
Zhou, Z.-H. (2014). "Large margin distribution learning." In Proceedings of the 6th IAPR International Workshop on Artificial Neural Networks in Pattern Recognition (ANNPR), 1-11, Montreal, Canada.
Zhou, Z.-H. and Y. Jiang. (2004). "NeC4.5: Neural ensemble based C4.5." IEEE Transactions on Knowledge and Data Engineering, 16(6):770-773.
Zhou, Z.-H., J. Wu, and W. Tang. (2002). "Ensembling neural networks: Many could be better than all." Artificial Intelligence, 137(1-2):239-263.
CART
Bagging Boosting
(UCLA)
UCLA
9.1
unsupervised learning)
.
ty estimation)
(densi
-
(anomaly detection)
(clustering).
cluster ).
m
n
9.4.2
13.6 D fc Z
D UiLiQ.
(cluster label), IP X j G C\y m
T (Al; • .•
9.2
validity index).
2.3
A* C C *
$$a = |SS|, \quad SS = \{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i = \lambda_j,\ \lambda_i^* = \lambda_j^*,\ i < j\} , \tag{9.1}$$
$$b = |SD|, \quad SD = \{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i = \lambda_j,\ \lambda_i^* \neq \lambda_j^*,\ i < j\} , \tag{9.2}$$
$$c = |DS|, \quad DS = \{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i \neq \lambda_j,\ \lambda_i^* = \lambda_j^*,\ i < j\} , \tag{9.3}$$
$$d = |DD|, \quad DD = \{(\boldsymbol{x}_i, \boldsymbol{x}_j) \mid \lambda_i \neq \lambda_j,\ \lambda_i^* \neq \lambda_j^*,\ i < j\} , \tag{9.4}$$
where $SS$ contains the sample pairs that belong to the same cluster in $\mathcal{C}$ and to the same cluster in $\mathcal{C}^*$, $SD$ the pairs in the same cluster in $\mathcal{C}$ but in different clusters in $\mathcal{C}^*$, and so on. Since each pair $(\boldsymbol{x}_i, \boldsymbol{x}_j)$ ($i < j$) falls into exactly one of these sets, $a + b + c + d = m(m-1)/2$. These quantities give the following commonly used external validity indices:
• Jaccard coefficient:
$$\mathrm{JC} = \frac{a}{a + b + c} . \tag{9.5}$$
• Fowlkes and Mallows index:
$$\mathrm{FMI} = \sqrt{\frac{a}{a+b} \cdot \frac{a}{a+c}} . \tag{9.6}$$
• Rand index:
$$\mathrm{RI} = \frac{2(a+d)}{m(m-1)} . \tag{9.7}$$
These indices take values in $[0, 1]$, and larger values indicate better agreement with the reference partition.
$$\mathrm{avg}(C) = \frac{2}{|C|(|C|-1)} \sum_{1 \leqslant i < j \leqslant |C|} \mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) , \tag{9.8}$$
$$\mathrm{diam}(C) = \max_{1 \leqslant i < j \leqslant |C|} \mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) , \tag{9.9}$$
$$d_{\min}(C_i, C_j) = \min_{\boldsymbol{x}_i \in C_i,\ \boldsymbol{x}_j \in C_j} \mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) , \tag{9.10}$$
$$d_{\mathrm{cen}}(C_i, C_j) = \mathrm{dist}(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j) , \tag{9.11}$$
where $\mathrm{dist}(\cdot,\cdot)$ is a distance between two samples and $\boldsymbol{\mu}$ denotes the centroid of a cluster. Thus $\mathrm{avg}(C)$ is the average distance between samples within cluster $C$, $\mathrm{diam}(C)$ is the largest distance within $C$, $d_{\min}(C_i, C_j)$ is the distance between the two closest samples of $C_i$ and $C_j$, and $d_{\mathrm{cen}}(C_i, C_j)$ is the distance between the centroids of $C_i$ and $C_j$. From (9.8)-(9.11) the following internal validity indices are derived:
• Davies-Bouldin index:
$$\mathrm{DBI} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \left( \frac{\mathrm{avg}(C_i) + \mathrm{avg}(C_j)}{d_{\mathrm{cen}}(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)} \right) . \tag{9.12}$$
• Dunn index:
$$\mathrm{DI} = \min_{1 \leqslant i \leqslant k} \left\{ \min_{j \neq i} \left( \frac{d_{\min}(C_i, C_j)}{\max_{1 \leqslant l \leqslant k} \mathrm{diam}(C_l)} \right) \right\} . \tag{9.13}$$
Clearly, smaller DBI values and larger DI values indicate better clusterings.
9.3
The function $\mathrm{dist}(\cdot,\cdot)$ used above is a distance measure, which must satisfy some basic axioms:
non-negativity: $\mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) \geqslant 0$; (9.14)
identity: $\mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) = 0$ if and only if $\boldsymbol{x}_i = \boldsymbol{x}_j$; (9.15)
symmetry: $\mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \mathrm{dist}(\boldsymbol{x}_j, \boldsymbol{x}_i)$; (9.16)
triangle inequality: $\mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_j) \leqslant \mathrm{dist}(\boldsymbol{x}_i, \boldsymbol{x}_k) + \mathrm{dist}(\boldsymbol{x}_k, \boldsymbol{x}_j)$. (9.17)
The most common choice is the Minkowski distance, $\mathrm{dist}_{\mathrm{mk}}(\boldsymbol{x}_i, \boldsymbol{x}_j) = \big(\sum_{u=1}^{n} |x_{iu} - x_{ju}|^{p}\big)^{1/p}$; with $p = 2$ it is the Euclidean distance, and with $p = 1$ it is the Manhattan distance, also called the city block distance.
(continuous attribute)
(categorical attribute),
(numerical attribute),
(nominal attribute)•
3}
ordinal attribute); {
non-ordinal
attribute).
For categorical attributes that are non-ordinal, the VDM (Value Difference Metric) [Stanfill and Waltz, 1986] can be used. Let $m_{u,a}$ be the number of samples taking value $a$ on attribute $u$, $m_{u,a,i}$ the number of such samples within cluster $i$, and $k$ the number of clusters; the VDM distance between two values $a$ and $b$ of attribute $u$ is then
$$\mathrm{VDM}_p(a, b) = \sum_{i=1}^{k} \left| \frac{m_{u,a,i}}{m_{u,a}} - \frac{m_{u,b,i}}{m_{u,b}} \right|^{p} . \tag{9.21}$$
Combining the Minkowski distance and the VDM handles mixed attribute types: assuming the first $n_c$ attributes are numerical and the remaining $n - n_c$ are categorical,
$$\mathrm{MinkovDM}_p(\boldsymbol{x}_i, \boldsymbol{x}_j) = \left( \sum_{u=1}^{n_c} |x_{iu} - x_{ju}|^{p} + \sum_{u=n_c+1}^{n} \mathrm{VDM}_p(x_{iu}, x_{ju}) \right)^{\frac{1}{p}} . \tag{9.22}$$
( weighted
distance).
1, 2, . . . , n) YA= I =
Wi
^ 0 (i
(similarity measure)
(9.17).
9.1
(non-metric
distance).
10.6
(distance metric learning) .
d\ d2 C?3
9.1
9.4
(prototype-based clustering),
9.4.1 k
{£CI ,®2 ,• • . , x },
D -
fc means)
C {Ci C2 ,• • . Cfc}
k
$$E = \sum_{i=1}^{k} \sum_{\boldsymbol{x} \in C_i} \| \boldsymbol{x} - \boldsymbol{\mu}_i \|_2^2 , \tag{9.24}$$
where $\boldsymbol{\mu}_i = \frac{1}{|C_i|} \sum_{\boldsymbol{x} \in C_i} \boldsymbol{x}$ is the mean vector of cluster $C_i$. Intuitively, (9.24) measures how tightly the samples of each cluster gather around the cluster mean; the smaller $E$ is, the higher the within-cluster similarity.
(9.24) D
NP [Aloise et al. 2009]. /c
(9.24). 9.2
4-8 9-16
9.1 4.0
i
^
=|
^ (1 ^ i ^ k ) Aj
|%
argmin M k\
U{ } ^
^
end for
for i = 1,2,...,/c do
^ ^ ^^
i =] xeCi x:
11 if /x • \ii then
12 A i
13 else
end if
end for
IT until
m%
^
C ={CuC2,...,Ck}
9.2
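A compact NumPy version of the procedure of Figure 9.2, minimizing the squared error of (9.24): assign each sample to its nearest mean vector, then recompute the means, until they stop changing. Names are illustrative.

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=np.random.default_rng(0)):
    """Standard k-means; X has shape (m, n)."""
    means = X[rng.choice(len(X), size=k, replace=False)]   # random initial mean vectors
    for _ in range(max_iter):
        # assignment step: nearest mean in Euclidean distance
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)   # (m, k)
        labels = d.argmin(axis=1)
        # update step: recompute each cluster mean (the mu_i of Eq. 9.24)
        new_means = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                              else means[i] for i in range(k)])
        if np.allclose(new_means, means):
            break
        means = new_means
    return labels, means
```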
k = S, x6 , ® 27
{®5 ? ®6 )
« ®10 ? ®13 ? ®14 > ®15 ? ®17 ? ®18 ? ®19 ? ® 20 ? ® 23}j
«
◦= {«11 , , 16}
^^
2 12
Ch C2
^
Mi ( 0.473; 0.214) , (0.394; 0.066) , = ( 0.623; 0.388) .
[Figure 9.3: results of the k-means algorithm (k = 3) on watermelon dataset 4.0 after different iteration rounds; samples, cluster mean vectors, and cluster boundaries are shown in panels (a)-(d).]
9.4 .2
k Learning Vector Quantization ,
LVQ)
LVQ
LVQ 9.4
g
9.4
P i*
-
Pf = P i * r j •( x j
-p )
^
Xj
end if
V
until
0>i ,P2 , }
9.4
SOM
LVQ SOM
SOM ,
5.5.2 5.5.3
LVQ 6 10-
Xj
$$\boldsymbol{p}' = \boldsymbol{p}_{i^*} + \eta \cdot (\boldsymbol{x}_j - \boldsymbol{p}_{i^*}) , \tag{9.25}$$
where $\eta \in (0,1)$ is the learning rate. The updated prototype $\boldsymbol{p}'$ is closer to $\boldsymbol{x}_j$:
$$\| \boldsymbol{p}' - \boldsymbol{x}_j \|_2 = \| \boldsymbol{p}_{i^*} + \eta \cdot (\boldsymbol{x}_j - \boldsymbol{p}_{i^*}) - \boldsymbol{x}_j \|_2 = (1 - \eta) \cdot \| \boldsymbol{p}_{i^*} - \boldsymbol{x}_j \|_2 . \tag{9.26}$$
If instead the class labels of $\boldsymbol{p}_{i^*}$ and $\boldsymbol{x}_j$ differ, the prototype is pushed away from $\boldsymbol{x}_j$, and the distance after the update grows to $(1 + \eta) \cdot \| \boldsymbol{p}_{i^*} - \boldsymbol{x}_j \|_2$.
Y
r,
( lossy compression),
Ri = { x e X \ \\ x - p i \\ 2 \ \ X - P i > \ \ 2 , i’ i} • (9.27)
vector
quantization); LVQ
^^ ^
Y { , 2 ,..., },
“Voronoi Voronoi tessellation).
9.1 4.0 LVQ -
9 21
c2 , Cl . g
pl 5 P2 , P3 , P4 , P5 ,
Cl , C2 , C2 , Cl , Cl .
®12 , $29 .
P2 , P3 , P4 , P5
0.283, 0.506, 0.434, 0.032. p5
c2 , 0.1, LVQ p5
P5 P5 )
-
(0.725;0.445) + 0.1 ((0.697;0.460) (0.725;0.445))
(0.722;0.442) •
p5 9.5
9.4.3
Gaussian)
k LVQ (Mixture of
--
) n Y
® A/(pt ,S).
For an $n$-dimensional random vector $\boldsymbol{x}$ that follows a Gaussian distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$, the probability density function is
$$p(\boldsymbol{x}) = \frac{1}{(2\pi)^{\frac{n}{2}} |\boldsymbol{\Sigma}|^{\frac{1}{2}}} \exp\!\left( -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right) , \tag{9.28}$$
where $\boldsymbol{\mu}$ is the $n$-dimensional mean vector and $\boldsymbol{\Sigma}$ is the $n \times n$ covariance matrix; the density is fully determined by these two parameters.
9.4
0.50
4200..31
.
2 .
V.
•
(a) ( b)
0.8
0.6
0.5
0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.2 0.3 0.4
(c) (d )
To make the dependence on the parameters explicit, we write the density as $p(\boldsymbol{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma})$. The Gaussian mixture distribution is then defined as
$$p_{\mathcal{M}}(\boldsymbol{x}) = \sum_{i=1}^{k} \alpha_i \cdot p(\boldsymbol{x} \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) , \tag{9.29}$$
a mixture of $k$ Gaussian components, where $\boldsymbol{\mu}_i$ and $\boldsymbol{\Sigma}_i$ are the parameters of the $i$-th component and the $\alpha_i > 0$ are the mixture coefficients, with $\sum_{i=1}^{k} \alpha_i = 1$.
Assume the samples of $D$ are generated by this mixture: a component is first chosen according to the prior probabilities $\alpha_1, \ldots, \alpha_k$, and the sample is then drawn from that component's Gaussian density. Let the random variable $z_j \in \{1, 2, \ldots, k\}$ denote the (unobserved) component that generated $\boldsymbol{x}_j$; its prior probability $P(z_j = i)$ equals $\alpha_i$ ($i = 1, 2, \ldots, k$). By Bayes' theorem, the posterior of $z_j$ is
$$p_{\mathcal{M}}(z_j = i \mid \boldsymbol{x}_j) = \frac{P(z_j = i) \cdot p_{\mathcal{M}}(\boldsymbol{x}_j \mid z_j = i)}{p_{\mathcal{M}}(\boldsymbol{x}_j)} = \frac{\alpha_i \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i)}{\sum_{l=1}^{k} \alpha_l \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)} . \tag{9.30}$$
In other words, $p_{\mathcal{M}}(z_j = i \mid \boldsymbol{x}_j)$ is the posterior probability that $\boldsymbol{x}_j$ was generated by the $i$-th Gaussian component; for brevity we denote it by $\gamma_{ji}$ ($i = 1, 2, \ldots, k$).
When the Gaussian mixture distribution (9.29) is known, it partitions $D$ into $k$ clusters $\mathcal{C} = \{C_1, C_2, \ldots, C_k\}$, and the cluster label $\lambda_j$ of each sample $\boldsymbol{x}_j$ is determined as
$$\lambda_j = \underset{i \in \{1, 2, \ldots, k\}}{\arg\max}\ \gamma_{ji} . \tag{9.31}$$
From the prototype-clustering viewpoint, Gaussian mixture clustering thus uses probabilistic models (Gaussian distributions) as prototypes and assigns samples by the posterior probabilities of those prototypes. It remains to estimate the model parameters $\{(\alpha_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \mid 1 \leqslant i \leqslant k\}$ from $D$; maximum likelihood estimation maximizes the log-likelihood
$$LL(D) = \ln\left( \prod_{j=1}^{m} p_{\mathcal{M}}(\boldsymbol{x}_j) \right) = \sum_{j=1}^{m} \ln\left( \sum_{i=1}^{k} \alpha_i \cdot p(\boldsymbol{x}_j \mid \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i) \right) , \tag{9.32}$$
EM 7.6 EM
d L L( D)
(9 . 3 2) ~
9JM
E k
(xj - Mi) = 0 , (9.33)
^^
j=1
E a i p( x j
' I u i)
(9.30) i X j ), W
7# = P M { z j = |
9.4
*y • • nf •
Mi (9.34)
S
= (9.35)
E iji
LL( D), =l
LL( D )
LL{ D ) A
SH (9.36)
A (9.36)
p { x j /Xj , 5] )
E k
^ A (9.37)
=
j 1
J2 ai - p( x j \
1=1
A= m —
ai = (9.38)
J=I
EM
(E
9.35) (9.38) W (M •
9.6
2-12 EM
EM LL( D )
{® i ,x2 ,• • • ® m .
A:.
repeat
EM E for j = 1,2,... m do
(9.30)
^
Iji ~ = (1
end for
EM M 6: for i = 1,2,..., do
W = Eg
jjgj
7: = 7 i •
i jLi 7 > j
8: = 7j*
a
end for
^
(K ,E0
12 until
13 Ci = 0
14 for j = 1,2,..., m do
(9.31)
(7Aj U{ }
^
C
end for
Q}
9.6
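A sketch of the EM iterations of Figure 9.6 for a Gaussian mixture: the E-step computes the posteriors γ_ji of (9.30), and the M-step updates μ_i, Σ_i, α_i by (9.34), (9.35) and (9.38). A small ridge is added to the covariances for numerical stability, which is my addition and not part of the book's algorithm; other names are illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, k, n_iter=50, rng=np.random.default_rng(0)):
    m, n = X.shape
    alpha = np.full(k, 1.0 / k)
    mu = X[rng.choice(m, size=k, replace=False)]
    sigma = np.array([np.eye(n) * 0.1 for _ in range(k)])
    for _ in range(n_iter):
        # E-step: posterior gamma_ji of each mixture component, Eq. (9.30)
        dens = np.column_stack([alpha[i] * multivariate_normal.pdf(X, mu[i], sigma[i])
                                for i in range(k)])             # shape (m, k)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update parameters, Eqs. (9.34), (9.35), (9.38)
        Nk = gamma.sum(axis=0)                                  # effective cluster sizes
        mu = (gamma.T @ X) / Nk[:, None]
        for i in range(k):
            diff = X - mu[i]
            sigma[i] = (gamma[:, i, None] * diff).T @ diff / Nk[i] + 1e-6 * np.eye(n)
        alpha = Nk / m
    labels = gamma.argmax(axis=1)              # cluster label by Eq. (9.31)
    return labels, (alpha, mu, sigma)
```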
14-17
9.1 4.0 k = 3.
ai a2 as XQ
, M3
O.I o.o
M2 ®22 ® 27 Ei = S2 = H 3 = o.o o.i
(9.30)
Q
-i
0.219,
0.361, cx 2 0.323,
0.404,
0.316
0.377.
^
(0.491;0.251), (0.571; (0.534;0.295)
^
3 =
0.004 0.016 0.004 0.017 0.005 0.016
9.7
[Figure 9.7: results of Gaussian mixture clustering (k = 3) on watermelon dataset 4.0 after different numbers of EM iterations; the samples of each cluster and the component mean vectors are shown in panels (a)-(d).]
9.5
(density-based clustering),
-
Density- Based S
-
patial Clustering of Appli
DBSCAN neigh
-
cations with Noise” . borhood) (e,MinPts )
D= xm } ,
dist(.,.)
•e - G
Ne ( Xj ) = { xi
e
-
D dist ( x i , X j )
e \
D
;
e}
e
•
\ Ne{Xj)\
(core object ):
MirLPk
e
- MinPts
• -
(directly density reachable): e
-
• -
(density reachable): pi ,P2 , Pn ,
Pi Pn Pi+i Pi
•
xk
-
(density connected): Xi
9.8
o
o
o
o
9.8 DBSCAN
Xi
( Minims
a?3
3) e
-
DBSCAN
(e,MinPts),
D
C QD
( noise)
(anomaly )
(connectivity): e C, G C (9.39)
(maximality): G C, Xj eC (9.40)
D x
X {x' e D
x K x
DBSCAN (seed),
9.9
(e , MinPts)
9.5
D . . . ® m };
(e , MinPts ) .
for j ... m do
e- Ne ( xj ) ;
if \ Ne ( x j ) \ ^ MinPts then
5 D
6 end if
7: end for
8: k =0
9 r
^
while Q 0 do
r;
oe o
r = r \{o}
^
while Q 0 do
g g;
if | iVe (g ) | MinPts then
(3;
r = r \ A;
end if
end while
fc = /c + i , C4 = rQld \ r;
O = 0 \ Ck
24 end while
Q}
9.9 DBSCAN
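A plain-Python sketch of Figure 9.9: find the core objects whose ε-neighborhood contains at least MinPts samples, then grow one cluster per unused core object by expanding density-reachable samples with a queue. Names are illustrative.

```python
import numpy as np
from collections import deque

def dbscan(X, eps, min_pts):
    m = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(m)]   # eps-neighborhoods
    core = {i for i in range(m) if len(neighbors[i]) >= min_pts}     # core objects
    labels = np.full(m, -1)            # -1 marks noise / not yet assigned
    k = 0
    for o in range(m):
        if o not in core or labels[o] != -1:
            continue
        # grow a new cluster from the unused core object o (the "seed")
        queue, labels[o] = deque([o]), k
        while queue:
            q = queue.popleft()
            if q in core:                       # only core objects expand the cluster
                for p in neighbors[q]:
                    if labels[p] == -1:
                        labels[p] = k
                        queue.append(p)
        k += 1
    return labels                               # cluster index per sample, -1 = noise
```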
^ U DBSCAN
C\ = •
DBSCAN m Cl D Cl
{ xs ,x5,x9, *24 ,® 25 ,® 28 ,X 2Q}. D
D
9.10 DBSCAN . Ci
[Figure 9.10: clusters generated by DBSCAN on watermelon dataset 4.0; panels (a)-(d) show the clusters C1-C4 found in successive rounds, with core objects and noise samples marked.]
®13 ,® 14 ,®16 ,
^
2 ®4 , ® 2l }
C3 { X\ , X 2 , X 2 2 , *29}
9.6
( hierarchical clustering)
Ci Cj ,
dorff distance),
( Haus
-
9.2. dmin(Ci Cj) min dist(cc ,z ) (9.41)
xeCi ,zeCj
dmin dmax
dmin , dm, d;
davg. /c.
for j .. . m do
={x j }
3 end for
4 for i = 1, 2, . .. , m do
for j = 2, m do
M ( i , j ) = d{ ci , cjy,
end for
9: end for
10: g m
11 while q > k do
i* < j * .
Cr: Ce
for j = j* +l j* + ••• g do
G
end for
M r
for j 1, 2, .. . 1 do
M ( i* j ) = d ( ci* , cjy,
n
M(i, = M ( i* , j )
end for
q = q-i
23 end while
mM
^ c = { cuc , ... ,ck }
2
9.11 AGNES
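A minimal sketch of the AGNES procedure in Figure 9.11: start with every sample as a singleton cluster and repeatedly merge the two closest clusters until k clusters remain. Single linkage (d_min, Eq. 9.41) is used by default; passing max instead gives complete linkage. Names are illustrative.

```python
import numpy as np

def agnes(X, k, linkage=min):
    """Bottom-up hierarchical clustering; single linkage by default."""
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    clusters = [[i] for i in range(len(X))]          # every sample starts as its own cluster

    def cluster_dist(a, b):
        # d_min (Eq. 9.41) with linkage=min, d_max with linkage=max
        return linkage(dist[i, j] for i in a for j in b)

    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance and merge them
        i_star, j_star = min(((i, j) for i in range(len(clusters))
                              for j in range(i + 1, len(clusters))),
                             key=lambda ij: cluster_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i_star].extend(clusters[j_star])
        del clusters[j_star]
    labels = np.empty(len(X), dtype=int)
    for c, members in enumerate(clusters):
        labels[members] = c
    return labels
```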
216 9
-
'
AGNES 9.11 1 9
11 23- AGNES
rj
1 29 26 2 22 21 3 4 23 25 28 24 30 27 5 7 9 17 13 14 16 6 8 18 19 10 20 15 11 12
9.12
◦=
7 {® n , Xi 2}.
9.7 21?
9.13
9.12
[Figure 9.13: clustering results obtained by cutting the AGNES dendrogram of Figure 9.12 at different levels, giving partitions with k = 7, 6, 5, and 4 clusters.]
9.7
-
[Estivill Castro, 2002].
[Mitchell , 1997]
[Jain and Dubes,
Jain et al. 1999; Xu and Wunsch II, 2005; Jain, 2009]
9.2 F mutual
information) (average silhouette width) [Rousseeuw, 1987]
[Jain and Dubes,1988; Halkidi et al. 2001; Maulik and Bandyopadhyay ,
2002] .
LVQ
LVQ2 LVQ3
[Kohonen, 2001]. [McLachlan and Peel, 2000]
EM [Bilmes, Jain and Dubes,1988].
( clustering ensemble)
[Zhou , 2012] 7
.
lim
-> H-oo
pi
iu xju \P
U=1
9.2 X Z
Hausdorff distance)
^
disth (X, Z ) = maxmin ||® z\\2 . (9.45)
xex zez
9.3 (9.24)
9.4 /c
4.0 p.202
4.0
9.1.
.
9.5 DBSCAN ®
X. (9.39) (9.40).
9.6 AGNES
9.7
9.8 9.2
9.9*
9.10* A:
4.0
Aloise, D., A. Deshpande, P. Hansen, and P. Popat . ( 2009) . “ NP-hardness of
Euclidean sum-of-squares clustering.” Machine Learning , 75( 2 ) :245-248.
Ankerst , M., M. Breunig , H.-P. Kriegel, and J . Sander. (1999 ) . ^OPTICS: Or-
dering points to identify the clustering structure.” In Proceedings of the
ACM SIGMOD International Conference on Management of Data ( SIG-
MOD ) , 49-60 , Philadelphia, PA.
Banerjee, A., S. Merugu, I. Dhillon , and J . Ghosh. ( 2005) . “Clustering with
Bregman divergences.” Journal of Machine Learning Research, 6:1705-1749.
Bezdek , J . C. (1981) . Pattern Recognition with Fuzzy Objective Function Algo-
rithms. Plenum Press, New York , NY.
Bilmes, J. A. (1998) . “A gentle tutorial of the EM algorithm and its appli-
cations to parameter estimation for Gaussian mixture and hidden Markov
models.” Technical Report TR-97-021, Department of Electrical Engineering
and Computer Science, University of California at Berkeley, Berkeley, CA.
Chandola, V., A. Banerjee, and V. Kumar . ( 2009) . “Anomaly detection: A
survey.” ACM Computing Surveys, 41(3) :Article 15.
Deza, M. and E. Deza. ( 2009) . Encyclopedia of Distances. Springer ,. Berlin.
Dhillon , I. S., Y. Guan , and B. Kulis. ( 2004) . “ Kernel fc-means: Spectral clus-
tering and normalized cuts.” In Proceedings of the 10th ACM SIGKDD In-
ternational Conference on Knowledge Discovery and Data Mining ( KDD ) ,
551-556, Seattle, WA.
Ester , M., H. P. Kriegel, J . Sander , and X. Xu. (1996 ). “A density-based algo-
rithm for discovering clusters in large spatial databases.” In Proceedings of
the 2nd International Conference on Knowledge Discovery and Data Mining
( KDD ), 226-231, Portland , OR.
Estivill-Castro, V. ( 2002 ) . “Why so many clustering algorithms a position
paper. SIGKDD Explorations , 1( 4) :65-75.
Guha, S. , R. Rastogi, and K. Shim. (1999 ). “ROCK : A robust clustering al-
gorithm for categorical attributes.” In Proceedings of the 15th International
Conference on Data Engineering ( ICDE ), 512-521, Sydney, Australia.
Halkidi, M., Y. Batistakis, and M. Vazirgiannis. ( 2001) . “On clustering valida-
,
tion techniques.” Journal of Intelligent Information Systems 27( 2-3) :107-
145.
Hinneburg, A. and D. A. Keim. (1998). “ An efficient approach to clustering in
large multimedia databases with noise.” In Proceedings of the 4 th Interna-
tional Conference on Knowledge Discovery and Data Mining ( KDD ) , 58—65,
New York, NY.
Hodge, V. J . and J . Austin. ( 2004) . “A survey of outlier detection methodolo-
gies. Artificial Intelligence Review , 22 (2):85-126.
Huang, Z. (1998). “Extensions to the A:-means algorithm for clustering large
data sets with categorical values.” Data Mining and Knowledge Discovery ,
2 (3) :283-304.
Jacobs , D. W., D. Weinshall, and Y. Gdalyahu. ( 2000 ) . “Classification with
non-metric distances: Image retrieval and class representation.55 IEEE
Transactions on Pattern Analysis and Machine Intelligence 6 ( 22 ) :583-600.
Jain, A. K . ( 2009) . “Data clustering: 50 years beyond /c-means.” Pattern Recog-
nition Letters, 31(8):651-666.
Jain , A. K. and R. C. Dubes. (1988) . Algorithms for Clustering Data. Prentice
Hall, Upper Saddle River , NJ .
Jain , A. K., M. N . Murty, and P. J . Flynn. (1999 ) . “Data clustering: A review.”
ACM Computing Surveys, 3(31):264-323.
Kaufman, L. and P. J . Rousseeuw. (1987) . “Clustering by means of medoids.”
In Statistical Data Analysis Based on the Li - Norm and Related Methods (Y.
Dodge, ed . ) , 405-416 , Elsevier , Amsterdam , The Netherlands.
Kaufman, L. and P. J . Rousseeuw. (1990). Finding Groups in Data: An Intro-
duction to Cluster Analysis. John Wiley Sons New York , NY.
Kohonen , T. ( 2001). Self - Organizing Maps , 3rd edition. Springer , Berlin.
Liu , F. T., K. M. Ting , and Z.-H. Zhou. ( 2012 ) . “Isolation-based anomaly detec-
tion.” ACM Transactions on Knowledge Discovery from Data, 6 (1) :Article
3.
Maulik, U. and S. Bandyopadhyay. ( 2002 ) . “Performance evaluation of some
clustering algorithms and validity indices.” IEEE Transactions on Pattern
Analysis and Machine Intelligence , 24(12):1650-1654.
McLachlan , G. and D. Peel. (2000) . Finite Mixture Models. John Wiley Sons,
New York, NY.
Mitchell, T. (1997) . Machine Learning. McGraw Hill, New York, NY.
Pelleg, D. and A. Moore. ( 2000) . “X-means: Extending fc-means with efficient
estimation of the number of clusters.” In Proceedings of the 17th Interna-
tional Conference on Machine Learning ( ICML ), 727-734, Stanford , CA.
Rousseeuw , P. J . (1987). “Silhouettes: A graphical aid to the interpretation
and validation of cluster analysis.” Journal of Computational and Applied
Mathematics, 20:53-65.
Scholkopf, B., A. Smola, and K.-R. Muller. (1998). "Nonlinear component analysis as a kernel eigenvalue problem." Neural Computation, 10(5):1299-1319.
Stanfill, C. and D. Waltz. (1986) . “Toward memory-based reasoning.” Commu-
nications of the ACM , 29 (12 ) :1213-1228.
Tan , X. , S. Chen , Z.-H. Zhou, and J . Liu. ( 2009) . “Face recognition under oc-
elusions and variant expressions with partial similarity.” IEEE Transactions
on Information Forensics and Security , 2 ( 4):217-230.
Tibshirani, R., G. Walt her , and T. Hastie. (2001). “Estimating the number of
clusters in a data set via the gap statistic.” Journal of the Royal Statistical
- —
Society Series B, 63( 2):411 423.
von Luxburg, U. (2007). “A tutorial on spectral clustering.” Statistics and
Computing , 17( 4) :395-416.
Xing, E. P., A. Y. Ng, M. I. Jordan , and S. Russell. ( 2003) . “Distance metric
learning , with application to clustering with side-information.” In Advances
in Neural Information Processing Systems 15 ( NIPS ) (S. Becker , S. Thrun ,
and K. Obermayer , eds.) , 505-512, MIT Press, Cambridge, MA.
Xu, R. and D. Wunsch II. ( 2005) . “Survey of clustering algorithms.” IEEE
Transactions on Neural Networks, 3(16):645-678.
Zhang, T., R. Ramakrishnan, and M. Livny. (1996 ). “BIRCH: An efficient data
clustering method for very large databases.” In Proceedings of the ACM
SIGMOD International Conference on Management of Data ( SIGMOD ) ,
103—114, Montreal, Canada.
Zhou, Z.-H. (2012). Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC, Boca Raton, FL.
Zhou, Z.-H. and Y. Yu. (2005). "Ensembling local learners through multimodal perturbation." IEEE Transactions on Systems, Man, and Cybernetics - Part B: Cybernetics, 35(4):725-735.
( Manhattan distance)
( Taxicab geometry),
—
Hermann Minkowski , 1864 1909)
5 - 3) (33 — 23)
( Kaunas) .
( Alexotas)
n
•
10.1 A;
/c
. Nearest Neighbor ,
A:
ANN )
/c
8.4
lazy learning)
( eager learning).
10.1 A: /c
A:
1NN A 1)
10.1 /c
[Figure 10.1: illustration of the k-nearest neighbor classifier; the dashed circles show the neighborhoods used for different values of k (e.g., k = 1, 3, 5), and the prediction for the test sample can change with k.]
Given a test sample $\boldsymbol{x}$ whose nearest neighbor is $\boldsymbol{z}$, the nearest-neighbor classifier errs exactly when $\boldsymbol{x}$ and $\boldsymbol{z}$ have different labels, so
$$P(err) = 1 - \sum_{c \in \mathcal{Y}} P(c \mid \boldsymbol{x})\, P(c \mid \boldsymbol{z}) . \tag{10.1}$$
Assume the samples are i.i.d. and that for any $\boldsymbol{x}$ and any small positive $\delta$ a training sample can always be found within distance $\delta$ of $\boldsymbol{x}$, so that (10.1) is evaluated at a nearest neighbor $\boldsymbol{z}$ arbitrarily close to $\boldsymbol{x}$. Let $c^* = \arg\max_{c \in \mathcal{Y}} P(c \mid \boldsymbol{x})$ denote the Bayes-optimal prediction. Then
$$\begin{aligned} P(err) &= 1 - \sum_{c \in \mathcal{Y}} P(c \mid \boldsymbol{x})\, P(c \mid \boldsymbol{z}) \\ &\simeq 1 - \sum_{c \in \mathcal{Y}} P^{2}(c \mid \boldsymbol{x}) \\ &\leqslant 1 - P^{2}(c^* \mid \boldsymbol{x}) \\ &= \big(1 + P(c^* \mid \boldsymbol{x})\big)\big(1 - P(c^* \mid \boldsymbol{x})\big) \\ &\leqslant 2 \times \big(1 - P(c^* \mid \boldsymbol{x})\big) . \end{aligned} \tag{10.2}$$
That is, the generalization error of the nearest-neighbor classifier is at most twice the error rate of the Bayes-optimal classifier [Cover and Hart, 1967].
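A direct implementation of the k-nearest-neighbor classifier described above: store the training set, and for each test sample take the majority label among its k closest training samples in Euclidean distance (a "lazy learning" method, since all computation is deferred to prediction time). Names are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_test, k=3):
    """Predict each test sample by majority vote of its k nearest training samples."""
    preds = []
    for x in X_test:
        d = np.linalg.norm(X_train - x, axis=1)      # distances to all training samples
        nearest = np.argsort(d)[:k]                  # indices of the k nearest neighbors
        preds.append(Counter(y_train[i] for i in nearest).most_common(1)[0][0])
    return np.array(preds)

# With k = 1 this is the nearest-neighbor classifier whose error is bounded by (10.2).
```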
10.2
(dense sample).
5=
0.001
10.2
(subspace),
embedding). 10.2
( a) (b)
10.2
10.2
( Multiple Dimensional Scaling, MDS) [Cox and
Cox , 2001]
m D G i j
disUj
ZG Rd x m , S!
^
d,
―
A — distij .
B ZTZ e Mmxm , B
—
^zjzj
ba + bjj ^ bij . (10.3)
oeRd
Y T=I bij
z
ET=1 hij
YZi zi 0
- B
J2 distl = 2m tr(B) ,
i= l j = l
( 10.6)
^
dist . = 2 £ distij
i= l j = l
(10.9)
D B.
B (eigenvalue decomposition), B VAVT
A diag(Ai ,A2 , ...,Ad) Ai A2 ... Ad,
cT A+
diag( Ai ,A2,... A# ) V# Z
z AI / 2VJ e Rd * xm . ( 10.11)
d
A = diag( A!,A2,... A& ), Z
10.3
10.3 MDS
De disUj
d X = ( x1 , X 2 , . . . , x m ) e Rdxm , d d
d.
Z WTX , (10.13)
W e Rdxd Z e Rd xm
W !
d d WTXi i f
(
{W i ,W 2 r
- d}
W
"
(i j)
(10.13)
AV
10.3
•
•
Ei o
{wi,W2,.• .,Wd},
(i i ).
A {zn ] Zi2 ...;zidf ),
j .
YU ㈣
d'
const HZi — X i
0W 3
=
i 1
Zi -2 ZiTWTXi const
oc —tr WT
\i=l
W . (10.14)
(10.14) xiXJ
mm tr (WTXXTW) (10.15)
s.t . WTW = I.
10.4
Ei TXixJW,
^
max tr (WTXXTW)
s.t . WXW = I
(10.16)
10.3
= 0.206
= 0.045
o
10.4
(10.16) (10.15)
(10.15) (10.16)
XXTW = AW (10.17)
x XXT
Ai A2 Ad , cf W (i/n ,
PCA
W2 , ... ,W d r ) . PCA 10.5
Ei *#?
D= X m };
wi a.
-
Si xixI AllUltOi'
tu2; XXT;
W XXT
Wd .
^
•
i= l
= 2XjWjwj
=
j \
W
10.5 PCA
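The algorithm of Figure 10.5 in NumPy: center the data, eigendecompose XX^T, and keep the eigenvectors of the d' largest eigenvalues as the projection matrix W. Variable names follow the text where possible; the rest are illustrative.

```python
import numpy as np

def pca(X, d_prime):
    """X has shape (d, m): each column is a sample, following the book's notation."""
    X = X - X.mean(axis=1, keepdims=True)          # 1. center every attribute
    eigvals, eigvecs = np.linalg.eigh(X @ X.T)     # 2-3. eigendecomposition of X X^T
    order = np.argsort(eigvals)[::-1][:d_prime]    # 4. take the d' largest eigenvalues
    W = eigvecs[:, order]                          # projection matrix W = (w_1, ..., w_d')
    Z = W.T @ X                                    # low-dimensional coordinates, Eq. (10.13)
    return W, Z
```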
cf
A;
PCA, t
EtiA . t. (10.18)
PCA W
d - d!
10.4
10.6
S
intrinsic)
6.6
(kernelized). ( Kernelized PCA , KPCA )
[Scholkopf et al., 1998]
M
PCA
( )
| - T W = AW (10.19)
10.4
w = E Zi A
i= l
E ZiOCi , ( 10.20)
={z j w.
A i •• • ) m.
PCA (10.19)
W AW ( 10.21)
(10.20)
W= ( j^ ( Xi ) OLi . ( 10.22)
{ Xi Xj) () j { Xi )^ . (10.23)
KA AA (10.24)
K K (K )i j K( x i X j ), ,A (ai;o 2; a ).
(10.24) K cf
j ( j = l 2,...,cT)
^
ZJ w j f(x) = Y
() (
4 Hxi )T l(x) ()
Y .
^^ oP Xi x )
j
5
(10.25)
(10.25)
KPCA
10.5
( manifold learning)
10.5.1
( Isometric Mapping, Isomap) [Tenenbaum et al ., 2000]
10.7 geodesic)
10.7(a)
[Figure 10.7: (a) the straight-line distance in the high-dimensional space between two points on the manifold is misleading compared with the geodesic distance along the manifold; (b) the geodesic distance can be approximated by shortest paths on a nearest-neighbor graph.]
10.5
10.7(b)
E. Dijkstra
W.Dijkstra
R. Floyd Floyd 10.2 MDS
10.8 Isomap
JD
for i ..., m do
2: fc
end for
^^
dist , );
MDS 10.2 distix^ Xj ) MDS
return MDS
Z
10.8 Isomap
Isomap
A A:
e, e e
10.5.2
Isomap (Locally
Linear Embedding, LLE) [Roweis and Saul,2000]
xk
wik
ik
Wil Xi
m
, xi
10.9
10.9
Xk xi
( 10.26)
LLE ( 10.26)
LLE
^
mm
, ,...,Wm
Wl W 2 2 j Qi
w xj
^ ( 10.27)
^ 2
s .t E
jeQi
Win =1
= — X j )T ( x i - Xk ) , W i j
wij = • ( 10.28)
E C
i ,seQi
LLE
( 10.29)
Z (Z i • ) eRd’ xm W j = Wij ,
M = (I - W ) T (I - W ) , (10.30)
(10.29)
s.t . ZZT
(10.31) M (f
ZT.
LLE 10.10
D ={aJi ,£C2,...,Xm }\
fc;
(f.
for i ... m do
2: &
3 (10.27) jG
j 0 Qi , = o;
5 end for
6 (10.30) M;
7: M
return M /
Z = { zuz2 , ... , zm }.
10.10 LLE
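A sketch of Figure 10.10: for each sample, solve (10.27) for the reconstruction weights over its k nearest neighbors, assemble W, form M = (I - W)^T (I - W) as in (10.30), and take the eigenvectors of the d' smallest non-trivial eigenvalues as the embedding. A small regularizer is added when solving the local system, a common practical tweak that the book does not spell out; names are illustrative.

```python
import numpy as np

def lle(X, k=10, d_prime=2):
    """X has shape (m, n): one sample per row."""
    m = len(X)
    dist = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    W = np.zeros((m, m))
    for i in range(m):
        nbrs = np.argsort(dist[i])[1:k + 1]       # k nearest neighbors of x_i (excluding itself)
        Z = X[nbrs] - X[i]                        # local differences
        C = Z @ Z.T                               # local Gram matrix C_jk of Eq. (10.27)
        C += 1e-3 * np.trace(C) * np.eye(k)       # regularization for stability (my addition)
        w = np.linalg.solve(C, np.ones(k))
        W[i, nbrs] = w / w.sum()                  # weights sum to one, Eq. (10.28)
    M = (np.eye(m) - W).T @ (np.eye(m) - W)       # Eq. (10.30)
    eigvals, eigvecs = np.linalg.eigh(M)
    # discard the eigenvector of the (near-)zero eigenvalue, keep the next d' smallest
    return eigvecs[:, 1:d_prime + 1]
```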
10.6
( metric learning )
9.3
distijik fc
^ 2
ti
dist
^ e(j ( xi ,
= (xi
\\ x {
T^ W
\
Xj 2
- Xj ) (xi
• distfj 1
X j)
}
,
W2 dist j2 ... + Wcr dist jd
(10.33)
0, W = diag( ) (W )u = W i .
(10.33) AV
^ W
(10.33) AV
P.C. Mahalanobis
M ( Mahalanobis distance)
M
-
S 1;
M
M
M
dist
^^ ah i Xj ) ^ ^\/ .{ Xi Xj ) = \\ X {
M
(10.34)
M ) P M
M PPT.
M
M
M. ( Neighbourhood Component
Analysis, NCA) [Goldberger et al. 2005]
10.6
Pij =
exp \\ xi AIIL) (10.35)
e x p l - Wxi IlL)
i= j
^ - xi
2.2.2
(LOO)
Pi = (10.36)
j
^ i
A
.
^
2 Pi
i =\ i = l j fli
. (10.37)
^
ain
exp (-|
|P - PTx |Q .
(10.38)
P
^ Ez exP llpT pT
^^
i=i je i l2
(10.38) L00 M.
[Goldberger 2005].
C G
must link )
-
c ,
M
C ajfc
-
cannot link)
I - HM (10.39)
^^
mm
M
( xi ,x j ) eM
s.t. \\ xi ~
>1
(ajj,® /e ) GC
M 0 M (10.39)
M M M
M
M rank(M), d
P
10.7
[Aha,1997].
13.6
[Wagstaffetal. 2001].
[Xing et al., 2003],
10.7
10.4 xxT
10.5
princomp
10.6 MATLAB PCA Yale
Yale
http: // vision . ucsd .edu / content
/yale-face-database. 10.7
10.8* k e Isomap
10.9* LLE
9.3
10.10
Aha, D . , ed. (1997). Lazy Learning. Kluwer , Nor well , MA.
Baudat , G. and F. Anouar. ( 2000) . “Generalized discriminant analysis using a
kernel approach.” Neural Computation, 12 (10 ):2385-2404.
Belkin, M. and P. Niyogi. ( 2003). “Laplacian eigenmaps for dimensionality re-
duction and data representation.” Neural Computation, 15(6):1373-1396.
Belkin , M., P. Niyogi, and V. Sindhwani. ( 2006 ) . “Manifold regularization: A
geometric framework for learning from labeled and unlabeled examples.”
Journal of Machine Learning Research, 7:2399-2434.
Bellman , R. E. (1957). Dynamic Programming. Princeton University Press,
Princeton, NJ .
Cover , T. M. and P. E. Hart . (1967) . “ Nearest neighbor pattern classification.”
IEEE Transactions on Information Theory, 13(1) :21-27.
Cox , T. F. and M. A. Cox. ( 2001). Multidimensional Scaling. Chapman & Hal-
1/ CRC, London , UK .
Davis , J . V., B. Kulis, P. Jain , S. Sra, and I. S. Dhillon. ( 2007) . aInformation-
theoretic metric learning.” In Proceedings of the 24 th International Confer-
ence on Machine Learning ( ICML ) , 209-216 , Corvalis, OR.
Fisher , R. A. (1936) . “The use of multiple measurements in taxonomic prob-
lems. Annals of Eugenics , 7( 2 ) :179-188.
Friedman, J . H., R. Kohavi, and Y. Yun. (1996 ) . “Lazy decision trees.” In Pro-
ceedings of the 13th National Conference on Aritificial Intelligence ( AAAI )
^
717-724, Portland , OR.
Frome, A., Y. Singer , and J . Malik. ( 2007) . “Image retrieval and classification
using local distance functions.” In Advances in Neural Information Process-
ing Systems 19 ( NIPS ) ( B. Scholkopf , J . C. Platt , and T. Hoffman , eds.) ,
417-424, MIT Press, Cambridge, MA.
Geng , X. D.-C. Zhan , and Z.-H. Zhou. ( 2005) . “Supervised nonlinear dimen-
sionality reduction for visualization and classification.” IEEE Transactions
on Systems , Man, and Cybernetics Part B: Cybernetics, 35(6 ) :1098-1107.
Goldberger, J . , G. E. Hinton , S. T. Roweis, and R. R. Salakhutdinov. ( 2005) .
^ Neighbourhood components analysis.” In Advances in Neural Information
Processing Systems 17 ( NIPS ) (L. K . Saul, Y. Weiss, and L. Bottou, eds.) ,
513-520, MIT Press, Cambridge, MA.
Hardoon, D. R., S. Szedmak, and J. Shawe-Taylor. (2004). "Canonical correlation analysis: An overview with application to learning methods." Neural Computation, 16(12):2639-2664.
He, X. and P. Niyogi. ( 2004) . “Locality preserving projections.” In Advances
in Neural Information Processing Systems 16 ( NIPS ) (S. Thrun , L. K. Saul,
and B. Scholkopf , eds.) , 153—160 , MIT Press, Cambridge, MA.
Hotelling, H. (1936) . “Relations between two sets of variates.” Biometrika, 28
(3-4 ) :321-377.
Kolda, T. G. and B. W. Bader. (2009) . “Tensor decompositions and applica-
tions.” SIAM Review, 51(3):455-500.
Roweis, S. T. and L. K. Saul. ( 2000) . “Locally linear embedding.” Science , 290
(5500) :2323-2316.
Scholkopf , B. , A. Smola, and K .-R. Muller. (1998) . “ Nonlinear component anal-
ysis as a kernel eigenvalue problem.” Neural Computation, 10 (5 ):1299-1319.
Tenenbaum , J . B., V. de Silva, and J . C. Langford. ( 2000). “A global geomet-
ric framework for nonlinear dimensionality reduction.” Science , 290 (5500 ):
2319-2323.
Wagstaff , K. C. Cardie, S. Rogers, and S. Schrodl. (2001). “Constrained
/c-means clustering with background knowledge.” In Proceedings of the
18th International Conference on Machine Learning ( ICML ) , 577-584,
Williamstown, MA.
Weinberger , K . Q. and L. K. Saul. ( 2009 ) . Distance metric learning for large
margin nearest neighbor classification.” Journal of Machine Learning Re -
search, 10:207-244.
Xing, E. P., A. Y. Ng, M. I. Jordan, and S. Russell. ( 2003) . “Distance metric
learning, with application to clustering with side-information.” In Advances
in Neural Information Processing Systems 15 ( NIPS ) (S. Becker , S. Thrun ,
and K . Obermayer , eds.) , 505-512, MIT Press, Cambridge, MA .
Yan, S. , D. Xu, B. Zhang, and H.-J. Zhang. ( 2007). “Graph embedding and ex-
tensions: A general framework for dimensionality reduction.” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence , 29 (1) :40-51.
Yang, J . , D. Zhang , A. F. Frangi, and J .-Y. Yang. ( 2004). “Two-dimensional
PCA: A new approach to appearance-based face representation and recog-
nition.” IEEE Transactions on Pattern Analysis and Machine Intelligence ,
26 (1):131-137.
Yang, L., R. Jin , R. Sukthankar , and Y. Liu. (2006 ) . “An efficient algorithm
for local distance metric learning.” In Proceedings of the 21 st National Con-
ference on Artificial Intelligence ( AAAI ), 543-548, Boston , MA.
Ye, J ., R. Janardan, and Q. Li. ( 2005) . “Two-dimensional linear discriminant
analysis.” In Advances in Neural Information Processing Systems 17 ( NIPS )
(L. K. Saul, Y. Weiss, and L. Bottou, eds.) , 1569-1576 , MIT Press, Cam-
bridge, MA .
Zhan, D.-C., Y.-F. Li , and Z.-H. Zhou . ( 2009 ) . ^Learning instance specific
distances using metric propagation.” In Proceedings of the 26th International
Conference on Machine Learning ( ICML ) , 1225-1232, Montreal, Canada.
Zhang , D. and Z.-H. Zhou. ( 2005) . “( 2D) 2 PCA: 2-directional 2-dimensional
PCA for efficient face representation and recognition.” Neurocomputing , 69
(1-3) :224-231.
Zhang , Z. and H . Zha. ( 2004). “Principal manifolds and nonlinear dimension
reduction via local tangent space alignment .” SIAM Journal on Scientific
Computing , 26 (1):313-338.
PCA)
SVD)
factor analysis)
-
Karhiinen Loeve Hotelling
LSA)
POD) EOF)
(EMA) -
Schmidt Mirsky
( Karl Pearson, 1936) PCA.
University
College London, UCL)
F. Galton W.Welton
Galton
Biometrika, Egon
UCL
omeMk
11.1
(feature),
relevant feature)
(irrelevant feature).
(feature selection).
(data preprocessing)
( redundant feature),
•
(subset search) {ai
ad } , d
d
{a2 ,a4} {a2}, {a2 ,a4}
A: (/c+1)
fbrwani)
backward )
bidirectional)
(subset evaluation) _
D,
D i
v
A D F
{ D\ D2 , ... , DV } , A
g
^
Gain(A) Ent - Ent( ) ( 11.1)
11.2
4.2 . 1
Gain(A)
Ent ( D ) = -
^
2 pk
A
log 2 pk ( 11.2)
D
F D
A F
8.5 . 2
(filter ) (wrapper )
(embedding).
11.2
Relief
( 2,2/2 ), ... Relief
^ aVh , -
(near hit ),
nm
-
(near miss), j
j diff «0 j
j d diS( xi,xJb) j
diE ( xJa , XJb ) = \ xJa - XJb \ , x{ , x{ [0,1]
Relief
(11.3) A nh j
^
a nm j
10.6
^
j a nh j
^
A ,nm j
j
(11.3) i Relief
[Kira and Rendell, 1992].
Relief
- ^
A: /c e ... Relief F
- ;
A
A:
^
,nh
xiMm (/ 1,2,.. l k ).
j
6j = J2 -d [
^^X
XUh ) E( Pz x diff « 4nJ ) 2 •
(11.4)
^
l k
^
A l
11.3
11.3
11.1
U
T.
1 E = oo
2 d = \ A\ ]
A* = A ]
4: t = 0;
5 while t < T do
6
7
W E f = CrossValidation (ii( jDA )) ;
,
if ( E < E ) W ( ( E = E ) A ( d r < d ) ) then
t
11 E = Ef ]
12 d = df ]
13 A* = A/
14 else
15 t = t -h l
16 end if
T 17 end while
11.1 LVW
11.1 D
W
W
W
LVW
T. LVW
r
11.4 U
x e R d , y e R. ®
mm 2
°=
(
^
i l
yi - wTXi ) 2 . (11.5)
(11.5)
6.4
If the $L_1$ norm is used instead of the $L_2$ norm of ridge regression (11.6), (11.5) becomes
$$\min_{\boldsymbol{w}}\ \sum_{i=1}^{m} \big(y_i - \boldsymbol{w}^{\mathrm{T}} \boldsymbol{x}_i\big)^2 + \lambda \|\boldsymbol{w}\|_1 , \tag{11.7}$$
where the regularization parameter $\lambda > 0$. Formulation (11.7) is known as LASSO (Least Absolute Shrinkage and Selection Operator) [Tibshirani, 1996].
. Li L2
sparse) w
T
L0 L0 . .
W2
u (11.6) (11.7)
11.4 h
L2
W2
11.2 In L2
{ wuw2 ) h L2
{ wuw2 ) In L2
11.2 (11.6) (11.7)
11.2 h
L2
M In L2
w i
( w
In
h
f (x) / L-Lipschitz L
X (VX ’)
X (11.9)
f ( x)
f(x) f ( xk )
L
X
—
(V f { x k ) , x - x k ) +
Xk - TL ^ f ( x k ) ^ \\ x - xk \\ 2
const , ( 11.10)
const z (11.10)
L / O f c) . ( 11.11)
f{x)
/». (11.8),
Thus each PGD step for (11.8) combined with the $L_1$ term becomes
$$\boldsymbol{x}_{k+1} = \underset{\boldsymbol{x}}{\arg\min}\ \frac{L}{2} \left\| \boldsymbol{x} - \left( \boldsymbol{x}_k - \frac{1}{L} \nabla f(\boldsymbol{x}_k) \right) \right\|_2^2 + \lambda \|\boldsymbol{x}\|_1 , \tag{11.12}$$
i.e., each gradient-descent iteration on $f(\boldsymbol{x})$ also takes the $L_1$-norm minimization into account. For (11.12), first compute $\boldsymbol{z} = \boldsymbol{x}_k - \frac{1}{L} \nabla f(\boldsymbol{x}_k)$, and then solve
$$\boldsymbol{x}_{k+1} = \underset{\boldsymbol{x}}{\arg\min}\ \frac{L}{2} \| \boldsymbol{x} - \boldsymbol{z} \|_2^2 + \lambda \|\boldsymbol{x}\|_1 . \tag{11.13}$$
Let $x^{i}$ denote the $i$-th component of $\boldsymbol{x}$. Expanding (11.13) by components shows that it contains no cross terms $x^{i} x^{j}$ ($i \neq j$), so the components do not interact and (11.13) has the closed-form, component-wise solution
$$x^{i}_{k+1} = \begin{cases} z^{i} - \lambda / L , & \lambda / L < z^{i} ; \\ 0 , & |z^{i}| \leqslant \lambda / L ; \\ z^{i} + \lambda / L , & z^{i} < -\lambda / L , \end{cases} \tag{11.14}$$
i.e., a soft-thresholding of $\boldsymbol{z}$. Therefore PGD can solve LASSO and other $L_1$-norm minimization problems efficiently by alternating a gradient step on the smooth part $f$ with this thresholding step.
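Under the assumption that f(x) in (11.8) is the squared loss of (11.5), so that its gradient has Lipschitz constant L equal to twice the largest eigenvalue of X^T X, the PGD iteration of (11.12)-(11.14) becomes the sketch below; the names are mine.

```python
import numpy as np

def soft_threshold(z, tau):
    """Component-wise solution of Eq. (11.14) with tau = lambda / L."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def lasso_pgd(X, y, lam, n_iter=500):
    """Proximal gradient descent for min_w ||y - X w||^2 + lam * ||w||_1, Eq. (11.7)."""
    m, d = X.shape
    L = 2 * np.linalg.eigvalsh(X.T @ X).max() + 1e-12   # Lipschitz constant of the gradient
    w = np.zeros(d)
    for _ in range(n_iter):
        grad = 2 * X.T @ (X @ w - y)                    # gradient of the squared loss
        z = w - grad / L                                # gradient step: the z of Eq. (11.13)
        w = soft_threshold(z, lam / L)                  # proximal step, Eq. (11.14)
    return w
```

Increasing `lam` drives more components of `w` exactly to zero, which is how the L1 penalty performs feature selection.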
11.5
11.5
) D
6.3 12.4
D
(sparse representation)
(codebook) .
(dictionary learning),
(codebook learning).
(sparse coding).
xm},
^
min
i=l%
• •
ll^ - Bailll + A
^
x=\-
i
llailli (11.15)
B e Rdxfc e
e (11.15)
min \\ xi
- Bcxi \\l A||ai ||i . (11.16)
B (11.15)
mm || X - BA ||2F , (11.17)
k
min || X - BA | min X-
B bi
3= F
2
Y aj
^ - biOL 1
mm X-
bi
F
i bjcxj
(11.18)
A KSVD
'
a
11.6
B B
A A:
11.6
( Nyquist )
compressive sens-
(compressed sensing) [Donoho, Candes et al ., 2006]
m
n y, n m,
y= (11.19)
G z if
z
®
“No” n m y
(11.19
® G Mmxm , C s, y
y As ( 11.20)
e Rnxm . s, x=
A
r.
y
^s
(11.20) (11.20) s
s
(11.20)
A
S
11.5
^
(1 5fc|
|s| |
)| |
| | (1
AfcS| )||
/c| s|2 ( 11.21)
A A: AxRIP).
y s, ®
mjn|
||
s|0 ( 11.22 )
s.t . y = As .
(11.22) L0 NP h
Lo [Candes et al., 2006 j ,
min|
|s||i (11.23)
s.t . y = As .
h (11.23)
11.6
LASSO (Basis
-
Pursuit De Noising) [Chen et al 1998].
”
collaborative filter-
ing)
11.1
5
11.1
11.1
11.1
11.1
-
s.t . (X) = (A)ij, ( i , j ) e O,
X
^
rank(X) A 11.1
Q A (A) ,,- i j)
(11.24) (X)
( trace norm ).
{XG |
(11.22)
Rmxn :|
| |
X
(11.24)
| 1}
NP
^ rank(X)
(nuclear norm )
min{m n}
aj ( X ) X
min|
x ||
X|* (11.26)
11.7
[Kohavi and
John , 1997], [Weston et al., 2003],
ID3
[Quinlan, 1986]. [Yang
and Pederson , Jain and Zongker , 1997].
[Guyon and Elisseeff , Liu et al., 2010],
11.7
11.2
-
Relief F
11.3 Relief
11.4 LVW
11.5
11.6
11.7 Lo
11.8 In (11.14)
11.9
11.10* (11.15),
Aharon, M., M. Elad, and A. Bruckstein. ( 2006) . “K-SVD: An algorithm for
designing overcomplete dictionaries for sparse representation.” IEEE Trans-
actions on Image Processing ^ 54(11) :4311-4322.
Akaike, H. (1974) . UA new look at the statistical model identification.” IEEE
Transactions on Automatic Control , 19 (6):716-723.
Baraniuk , R. G. ( 2007). “Compressive sensing.” IEEE Signal Processing Mag-
azine , 24(4):118-121.
Bengio, S., F. Pereira, Y. Singer , and D. Strelow. ( 2009). “Group sparse cod-
mg. In Advances in Neural Information Processing Systems 22 ( NIPS ) (Y.
Bengio, D. Schuurmans, J. D. Lafferty, C. K. I. Williams, and A. Culotta,
eds.) , 82-89, MIT Press, Cambridge, MA.
Blum , A. and P. Langley. (1997). “Selection of relevant features and examples
,
in machine learning.” Artificial Intelligence 97(1-2 ) :245 271.
-
-
elastic net .” Journal of the Royal Statistical Society Series JB, 67(2):301-
320.
( Lviv)
—
1867 1918
•
•
ENIAC
Nicolas
Metropolis- Hasting
Metropolis
12.1
^ ^
(h;D) — l(h( x i ) yi ). ( 12.2)
i= l
D D
E(h;D)
e E( h) E(h) e
D D D
disagreement)
.
"
h ') 2)( i( ® ) 2( ® )) (12.3)
^
• Jensen /($),
(12.4)
m m
P E ( exp ( 2me2 ) — (12.5)
^
m
=
i l =1
1
rrt -j m
^ I ^ 2 exp( — 2me ) .
2
P 6 ( 12.6)
1 =1 =1
1
^
ri • • • 5 , 771
i m
Xi i , .
sup
— m)
|
^
• •• Q
Xiy ...,X m j xi
e>
^^
) exp (12.7)
6
^
2e2
P (|/ (^ i , . . . , a:m ) - E(/ (a i , ...,xm ))\ e) < 2 exp . ( 12.8)
12.2 PAC
( Probably Approximately
Correct , PAC) [Valiant , 1984].
c (concept ), y
r (o0
C y c
( concept
class), C
ii hypothesis space),
1.3
. C
12.2 PAC 269
Z i e W,
hypothesis).
ce H
(separable),
(consistent ); c
-
non consistent ). *
-
non separable),
c. c
D
1.4
P D
P( E(h) 1 -5 (12.9)
PAC C.
- 5) c
e).
PAC C,
m
C
^
PAC
size(c)
C PAC
12.3 PAC PAC Learning Algorithm):
C PAC poly(l/6,1/5,
,
size(a;) size(c)), C PAC efficiently PAC learnable)
C PAC
PAC
PAC .W il
PAC W C,
PAC properly PAC learnable);
C. W
.
12.3
12.3.1
c m
D
.
• D c
c P
C. D
P
12.3 ’
K a
D
D
c PAC
D
e
- e
e P
- P ( h{ x )
^
P ( h( x ) = y ) = 1 y)
E( h)
— *
. ( 12.10)
D m P D
m
P ( ( h( xi ) yi ) ( h( x m ) = y m ) ) = { l - P ( h { x ) y) )
< (1 e) m .
^ ( 12.11)
£ W
e
P ( h e n E { h ) > e A E { h ) = 0) < \U \ ( l - e )m
< \ n\ e — me (12.12)
(12.12)
\n\ e -^ ^ d , (12.13)
- ( in \ H \ -f- In
^
m (12.14)
PAC
(12.14)
o{ ).
^
12.3.2
c
h e n, E{ h) o, W
Hoeffding
12.1 D m D
e hGH,
[ - ) —
p \ E{h) E(h)\ e ( 2 exp( 2m 2 ) .
^
(12.17)
12.1 D m P
e h e H , (12.18) -
In (2/5) In (2/S )
E(h) ~
2m
< E[h)( E(h)
2m
(12.18)
12.1 m /i
12.1 w h e n,
...
^ j2 p ( \ Ew - m\ > e )
hen
(12.17)
hen
P( \ ^W E(h)\ > e) < 2|W
| exp( 2me2)
-
12.4 VC 273
5
(
—
2 \H \ exp( 2me2) (12.19).
W
e
K .K
PAC C W
agnostic learning).
^
P(E{h) minE ) e) 1 -6 ( 12.20 )
H PAC
PAC
poly(l/e,1 5 size size(c)) PAC
W PAC
m
12.4 VC
“VC Vapnik -
Chervonenkis dimension) [Vapnik and Chervonenkis, 1971].
VC (growth function)
dichotomy) (shattering).
H D ={Xl ,x2,...,xm}, n
D
12.6 m e N, W
N
Un(m) m
12.2 m G N, e hen
[Vapnik
and Chervonenkis, 1971].
(E
P| /i|
) >e ) < 4nK(2m)exp( -l). (12.22)
W D
W D
m
D
D D
D Un(m) = 2m , D
VC
12.7 VC K
2m}.
^
V C(? ) max {m n (m) (12.23)
^
^
VC( ) = d
W H
VC P
VC
VC d
d W VC d.
VC
12.4 VC 275
12.1 [a ,6]
b}, M = R . +1,
^
[a ,6] : CL,b e R, a x G [a 6], h[a (x) =
"
{ x 1
^
h [aM { x ) 1. ci 0.5, X 2 1.5, W
"
~^^
{ [0 1] 0,2] [1 2] [2,3]} VC
{a 3,x 4 , 5}, x3 X 4
/l[a,fe]
12.2
^ -
{(a 3,+),(a 4 , ),(X5 ,+)}. VC 2.
Y = R2 . 12.1
H
VC 3.
(b)
12.1 VC
12.7 VC 12.2
[Sauer , 1972]:
“S a u e r 12.2 VC m eN
Hw ( m )
m
t d d
(12.24)
m - l , d - 1) m
- l d) _D { x i , X 2 ,...,«m},
D ={iCi •• . l }
^i\ D = { ( h ( x i ) , h ( x2 ) h ( xm) ) \ h e n } ,
n|D’ = { { h ( x i ) , h ( x 2 ) h ( «m-i ) ) \ h e 7i } .
n +i
h &
^ Xm
W
| D W|D
K|iy
. . . Vm — l ) 3 /l /i G
^Dr \ D
=hi
^ ) — JJij A
i 7 ^ h (a^ ))
TTj 1
^ ^ TTl •
^
y|jD H|D H
| zy
(12.26)
i
— O
HD \ D Ujy U { xm }
'
D
U \D VC d |D VC d
(12.27)
i=0
(12.25) (12.27)
i=0 V i =0 V 7
r -1
.i) = o-
E (xT)
i=0 7
>
D 12.2 •
12.2
12.2 VC d m d
e
(12.28)
_
12.4 VC 277
n« M (7)
d
m d-i
(7
^=
i 0
m d J.
d
d \i
(7
m
^ d.
d
^
i =0
m
m
2 =0
m\ d d m
d
i+
m —
e • m\ d
d
12.2 12.2 VC
12.3 VC d, m J
hen
Min + 81nf
^
P E( h ) — E ( h)
m
1— S . (12.29)
—f ) < 4(t
^
4 I (2m)exp( exp )
e=
8dln + 8 1nf
m
12.2, 12.3
12.3 (12.29) m
P D VC
-
distribution free)
-
data independent )
h h
(12.30)
12.4 VC W PAC
{ (In2m e
2
(12.32)
12.1
|
^
E( g ) E(g)
E(g )
1
_ J /2
8dln + 8 1nf e
m 2
(12.34)
12.3
6
P
^ -
2
E h) - E(g)( E(h) --
= E(h) - E(g )+ e
12.5 Rademacher 279
1 -6 m,
12.4 ■
12.5 Rademacher
12.4 VC
VC
VC
-
er (1892 1969).
P {( xi ,yi ), ... (® m ,2/m )}
I Iv
?
"
^^
I { h( x i ) yi )
yih { x i )
E
2 2m 2 y i h( x i ) (12.36)
ET=i h(Xi )
f e {l 2, YALI ViKxi )
arg max
hen rn
^
2 y i h{ x i ) . (12.37)
(xuyi),
Vi
0.5 0.5
Rademacher (12.37)
K
sup (12.38)
(12.38)
a = {ai , J2 , .. .
(
5 m}.
(7 (12.39) [0, 1] ,
|W|
(12.39) |W| H D (7
h{ xi ) = ai (i • •• m) r (12.39) 1.
R. Z { zuz 2 , ... , zm } , A e
(12.39) Z
12.8 7
*
Z Rademacher
Rz { ) = Eo sup
^ - i= l
• (12.40)
Rademacher 7 Z
P
P m Z
12.9 7 2"
*
D Rademacher
Rademacher [Mohri et al .
2012]:
12.5 [0,1], D Z
Z { Z m } , Zi e Z , 0 < S < 1, G J7,
12.5 Rademacher 281
1 -6
m A
y in(2im/5)
E [/ W ]
^ / (^) + 2^m (7 ) + (12.42)
'
-
i= l
m
^ ln(2/(5)
E -
E
= ^ ^
/ )+2 (
i l
j r) +3 2m
(12.43)
771
=; E/
$(Z) = sup E
feT
[f] - E z( f ) .
Z G Z G
- ( Z ) = ( supE [ f ]
- Ezr( )) ( supE[/] - Ez(/))
^ ^
(Z ) f ~
^
< sf euTp S z ( f ) - S z ( f )
= sup
/( m) — f{zm )
feT m
•
m
HZ) - ( Z’ K
m
ln ( l / 5)
(12.44)
$ ( Z)
^ Ez [ $ ( Z) ]
2m
-5 ( Ez[$(Z)]
Ez [^ { Z ) ] = Ez \ sup E /
feT
- Ez ( f ) ]
Ez [ supEz [ Ezr ( f ) - z ( /) ] ]
/
^
(12.4)
Jensen
zz L /eF 1
)
-- $1( (4)
m
= SUP / /
,1
- fe
^ i=l
tTj
_ <7i = 2E(r Z }
sup
L /eF m
i==l
m
i= l
= 2KJ7) .
ln(2/5)
Rm { ^F ) Rz { ^F ) 2m
(12.45)
- 5/2
( (12.44)
\n( 2 / 5 )
(Z) EzMZ ) ) 2m
ln { 2 / S )
$(Z) 2RZ {T ) (12.46)
2m
- (12.43)
12.5 1]
12.5
12.5 Rademacher 283
12.6 P Af
{x i ,x 2,...,xm}, Xi e S h eH ,
-
^
1
ln(l/5)
E(h) E(h) Rm{n) 2m
(12.47)
ln(2/5)
E( h )
^ E { h ) + RD { ) + Z n 2m
(12.48)
H :Z Xx
/i
fh( z ) fh[ x y) I(h( x) y) (12.49)
[0,1]
Jrn ={ fh h e H }.
Rz { n )
^ sup Vi )
- f h e T n m 1i=l
Ecr sup — V CTil( ( ) ^
' h Xi yi )
= ^ <T
lhenm
sup
—
1 ^ IIV
V] H 1 (
yih( Xi )
2
y ai + s y^ (
2
1
E<T -
lmU
1 "
m
^ \i p — Jih{ x i ) )
- yi(
— V yiGih{ Xi ) )1
2 ^ lhenm
7 a sup
^
K
m
1 1
7
^ CT sup V" (aih( xi ))
2 vhenm
^ (12.50)
(12.50)
12.6
12.6 Rademacher 12.3
VC Rademacher
(12.47) D (12.48) D
Rademacher
VC
Rademacher
E( h) ( E( h) -
2dln , IH 1 / 5 ) (12.53)
^ 2m
Rademacher VC
12.6
VC Rademacher
stability/
•
D = { zi = { xi , y1 ) , z2 = (® m 2/m )} G
P W
eH D
P
12.6
• DV D i
. ••
^i+ l
* • ••
• D i
D1 • i —1 i i+1 ,•
• • jzm }
x
- X> JD.
e( D( x),y ) y x y R+
^
£D
•
^
t{Z D,z ).
£ { 2 ,V ) = [K ^ D , Z ) ] . ( 12.54)
^ xex ,z={ x ,y )
•
^ D) = -
m
YJ= ^
i l
D . zi ) . ( 12.55)
• - -
(leave one out )
- m
eloo (
^ D) = -
TTt J2 ^ Dv ^ ) i ( 12.56 )
F uniform stability):
12.10 z= y) i
^
\ t{ Z D\ ,z )\ 1,2,..., m
^
D ,z ) i z ( 12.57)
[Bous
quet and Elisseeff, 2002].
- 12.8 D m
D,
M, J m —
qn ( l / )
^
_
V D) (4m/3 M ) 2m
(12.58)
ln(l /0
£(£,P)
^
Uil £)) + + (4m + M) 2m
(12.59)
12.8 £
(12.58)
^
0( ) ,
VC Rademacher
12.3 12.6
^
suP/ie% { EW - Eih)
)
ln(l /(5)
P) ?(£,D) m (4 M) 2m
(12.60)
12.9 ERM W
g U
12.7
£ ( g ,V ) = min £ ( h ,V ) .
hen
■
e
2
= 2 exp (-2m(e’) ) 2
Hoeffding (12.6) m|
> In
J /2 (12.60)
ln(2/(J ) e
m +( + )
4 M
2m
m = o( In
*) D) < D) +
- 5/2
(
—
£ ( 2 , V) £ ( g , V) D) ( i( g , D )
^ T( 2, D ) - ?( g , D ) + e
—J 12.9
12.7 •
deterministic)
z
stochastic)
12.2 12.1.
2e
_ 2me2
12.3 12.1.
12.4 VC 1.
12.5 VC
12.6 VC
12.7 VC
12.8 c Rademacher 0.
^
Rmi^i + 2)
*
12.9 Ji Rademacher
12.10*
Bartlett , P. L., O. Bousquet , and S. Mendelson. ( 2002 ) . “Localized Rademacher
complexities.” In Proceedings of the 15th Annual Conference on Learning
Theory ( COLT ), 44-58, Sydney, Australia.
Bartlett , P. L. and S. Mendelson. (2003) . u Rademacher and Gaussian com-
plexities: Risk bounds and structural results.” Journal of Machine Learning
Research, 3:463-482.
Ben-David , S., N. Cesa-Bianchi, D. Haussler , and P. M. Long. (1995) . “Charac-
terizations of learnability for classes of {0, , n}-valued functions.” Journal
of Computer and System Sciences , 50 (1):74-86.
Bousquet , O. and A. ElisSeeff . (2002 ) . “Stability and generalisation.” Journal
of Machine Learning Research, 2:499-526.
Devroye, L., L. Gyorfi and G. Lugosi, eds. (1996). A Probabilistic Theory of
Pattern Recognition. Springer , New York, NY.
Hoeffding, W. (1963). “Probability inequalities for sums of bounded random
variables.” Journal of the American Statistical Association, 58(301):13-30.
Kearns, M. J . and U. V. Vazirani. (1994) . An Introduction to Computational
Learning Theory. MIT Press, Cambridge, MA.
Koltchinskii, V. and D. Panchenko. (2000) . “Rademacher processes and bound-
ing the risk of function learning.” In High Dimensional Probability II ( E.
Gine, D. M. Mason, and J. A. Wellner , eds.) , 443-457, Birkhauser Boston,
Cambridge, MA.
McDiarmid , C. (1989 ) . “On the method of bounded differences.” Surveys in
Combinatorics , 141(1):148-188.
Mohri , M., A. Rostamizadeh , and A. Talwalkar , eds. ( 2012). Foundations of
Machine Learning. MIT Press, Cambridge, MA.
Mukherjee, S., P. Niyogi, T. Poggio, and R. M. Rifkin. ( 2006 ) . “Learning the-
ory: Stability is sufficient for generalization and necessary and sufficient
for consistency of empirical risk minimization.” Advances in Computational
Mathematics, 25(1-3):161-193.
Natarajan , B. K. (1989). “On learning sets and functions.” Machine Learning ,
4(1):67-97.
Sauer , N. (1972 ) . “On the density of families of sets.” Journal of Combinatorial
-
Theory Series A , 13(1):145-147.
Shalev-Shwartz, S., O. Shamir , N. Srebro, and K . Sridharan. ( 2010) . “Learn-
,
ability, stability and uniform convergence.5 Journal of Machine Learning
Research, 11:2635-2670.
Shelah , S. (1972 ) . “A combinatorial problem ; stability and order for models
and theories in infinitary languages.” Pacific Journal of Mathematics, 41
(1):247-261.
Valiant , L. G. (1984) . “A theory of the learnable.” Communications of the
ACM , 27(11) :1134-1142.
Vapnik, V. N. and A. Chervonenkis. (1971). “On the uniform convergence of
relative frequencies of events to their probabilities.” Theory of Probability
and Its Applications , 16 ( 2) :264-280.
TCS (Theoretical Computer
Science),
“P?=NP” .
(Leslie G . Valiant ,
ACM
“ ACM Turing Award Goes to Innovator in Machine Learning”
13.1
A Z
labeled )
{x i+i ,Xi+2,.• .,X /+W}, l U U
unlabeled )
A Q
A
A
SVM ,
“Yes!”
13.1
O > o
13.1
(semi-supervised learning).
(cluster assumption),
13.1
manifold assumption),
10.5
r,
13.2
( pure) ( transductive
learning),
13.2
> zMM y
13.2
13.2
( generative methods)
EM
EM 7.6
ye y {1, AT}
N
2 a i - p( x \
P{ x ) =
^ (13.1)
sill ai = p{ x I T i
9.4
A Si
f(x) G y ® ,
G {1 2,...,iV} x
N
= arg max p(y = j \ S = i,x) •p(@ = i \ x) (13.2)
^
p{x|
p = i \ x) = N (13.3)
I Mi ?
:c i y j
i j
j ©
^
py j \ Q = i) i i
^p(y = j \ Q = i) = l i j p(y j Q i) 0.
= j \ 0 = i,x)
= i \ x)
(13.2)
A {(aJUi) ...
13.2
Du = ... Xl+U } , u = m.
N }, A U Du
L L( Di U Du ) = E
( x j ,y j ) eDi
In
(s oti p{ x j| j 0 = i,X j )
\i i , Si) p( y |
+ E - In
XjED
(s Oi
^ p( X j I . (13.4)
(13.4)
EM
EM 9.4
•E
p{ *
Iji = N
*
^j • (13.5)
E1 a i P ( X j I Mi
2 =
•M i
Mi = (13.6)
® jf
E
G -Oti
+ Xj
^ D- ^ jixj ( x j ,yj ) eDiAyj=i
xj
5] =
^ 3? G ^ +h
0
Jf G D\L
( x j — P i )( x j — Pi ) T
9 (13.7)
lxj y j ) GDi /\ yj =i
oti —m ®J G Dy, J
I (13.8)
(13.3) (13.2)
13.3 SVM
S3 VM
• • •- •
•
SVM
13.3
A (»2 ,y2),
® Z+2 ,• • • ® Z+w} G{ Z I \ u = m. TSVM
~ ~
s .t .
^ + )> ^ —
yii u) Xi 6 1 5 i 2, • • • Z
(wTXi + 6)
yi i= 1 - l - - l , l + 2,
\ • ••5 m
0 ? i = l 2 ...
(w,b)
^( = i /
^
+ l ,/ + 2,...,m)
=
Q Cu
TSVM (13.9)
SVM (13.9) Du
SVM (label assignment), SVM
-
pseudo label)
(13.9) SVM
Q TSVM
(13.9)
CU = Q
SVM
TSVM 13.4
SVM
13.4
(13.10) 3.6
A {( 1,2/1),(«2 , 2 ), ( z ,yz )};
^ ^ ^
• •
{ x i+1 , X i+2 , • • • X i .u }]
a. ^
-
A SVMZ
SVMz Ax { yi+ u y i+2,...,yi+u)
Cu Cl;
while Cu < Ci do
5: (l3.9) (w , b),
A
while 3{ i , j \ < 0) A > 0) A > 0) A ^ 2 ) } do
6:
Vj = -Vj\
j
^
DhDu , y , CuCn (13.9),
end while
11: Cu = mm{ 2Cu , Ci }
12: end while
(UZ+2 .. . U
13.4 TSVM
+ = - Q-
^
(13.10)
13.4 6-1 0
(13.9),
[Joachims, 1999].
(13.9) .
SVM
(graph
kernel) LDS [Chapelle and Zien,
meanS3VM [Li et al., 2009]
13.4
(strength)
13.4
^^
2
I U I U m. G ( F E),
{xi,...,X/ ,CC/+1 ,...,xi+u}, E affinity
matrix),
ex \ \ Xi - X j \\ l
2&1 j
(W )y = (13.11)
0 otherwise
i , j G {l , 2 , . . . , m}, a > 0
{V,E) G f : V R,
yi sign(/(xj)), yi e{-l,+l}.
(energy function) [Zhu et al.,2003]: ^
m m
2
Ein )
i=l j=l
(m m m m
^
' + - X=2 (w)t/ / /
2 \i \
= j=l 2 1 j=l
m m m
^
- X=
=
^
i=l
2 dif 2{ xi )
i l j=l
(W )o / /(
= /T(D
- W)/ , (13.12)
f(x i) = y i (i = l,2,...,l),
/ A=D -W (Laplacian
matrix). Z W
W n W iU
Wui wu u
^
>11
D=
E( f ) = ( m) 0 Wn
wnZ wuu_
fi
fu
(13.13)
?
f ( Du
- Wu ) ft - 2 f ^ ^
Wutfi + f ( Duu - Wuu ) fu . (13.14)
d E( f )
df =0
'
.
fu = (Dun Wuu
^ Wulfl (13.15)
^
1
p = D 1W = D Oiu Wu Wz.
Oul I>Zu wm Ww
DfW DfWz.
..
- ^ ^^
" D-JW
(13.16)
P =D W Py = D- W , (13.15)
fu = ( Duu ( l -
= (i - D
^ w
1
_ m
1
= (I Puu ) Pui /z (13.17)
A = yi )
•••
'
(label propagation)
[Zhou et al., 2004].
A U G = (V , E ) ,
F { x i ,... ® z • W
^
D diag(di ,d2 ,..., +n )
^
Z+W x F (F? F V... FDT
F, ,
((F)il ,(F) 2 ,...,(F)w )
Vi argmaxi 3 | ( F) W
i m, j | F
13.4
if (1 i A ( yi = j )\
F (0) = (Y ) = (13.18)
^ otherwise.
From $\boldsymbol{W}$ construct the normalized matrix $\boldsymbol{S} = \boldsymbol{D}^{-\frac{1}{2}} \boldsymbol{W} \boldsymbol{D}^{-\frac{1}{2}}$, where $\boldsymbol{D} = \mathrm{diag}(d_1, d_2, \ldots, d_{l+u})$ and $d_i$ is the sum of the $i$-th row of $\boldsymbol{W}$. With a parameter $\alpha \in (0, 1)$ that balances the propagated term $\boldsymbol{S}\boldsymbol{F}(t)$ against the initial labels $\boldsymbol{Y}$, the iteration is
$$\boldsymbol{F}(t+1) = \alpha \boldsymbol{S} \boldsymbol{F}(t) + (1 - \alpha) \boldsymbol{Y} , \tag{13.19}$$
which converges to
$$\boldsymbol{F}^{*} = (1 - \alpha)(\mathbf{I} - \alpha \boldsymbol{S})^{-1} \boldsymbol{Y} . \tag{13.20}$$
F* { yw , yw , . . +u ). 13.5
A { ( x i , y i ) , (£C2 , 2/2 ) , ( x h y i ) }\
P Du {jCz +i X^+2 , •
^
"
••
a;
a.
a W; _
W S = D IWD
(13.18) F
t
repeat
F ( i + 1) aSF + ( 1 - a ) Y;
7: t = t+l
8 until F*
9: for i = l + 1, 1 + 2, . . . , 1 + u do
F* ) -
10
11:
Vi = axgmax
end for ^^ ^ ^ ^
d = d i, ..• m+u )
13.5
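A sketch of the iteration in Figure 13.5 and Eqs. (13.19)-(13.20): build the Gaussian affinity matrix W of (13.11), normalize it to S, iterate F ← αSF + (1-α)Y until convergence, and assign each unlabeled sample the class with the largest entry of F. Names are illustrative.

```python
import numpy as np

def label_propagation(X, y, labeled_mask, n_classes, sigma=1.0, alpha=0.9, n_iter=200):
    """X: (m, n) samples; y: integer labels in {0, ..., n_classes-1}, used only where
    labeled_mask is True."""
    m = len(X)
    # affinity matrix, Eq. (13.11), with a zero diagonal
    d2 = np.linalg.norm(X[:, None] - X[None, :], axis=2) ** 2
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = D_inv_sqrt @ W @ D_inv_sqrt                     # S = D^{-1/2} W D^{-1/2}
    # initial label matrix Y, Eq. (13.18)
    Y = np.zeros((m, n_classes))
    Y[labeled_mask, y[labeled_mask]] = 1.0
    F = Y.copy()
    for _ in range(n_iter):
        F = alpha * S @ F + (1 - alpha) * Y             # Eq. (13.19)
    # the closed form (13.20), F* = (1-alpha)(I - alpha S)^{-1} Y, gives the same limit
    return F.argmax(axis=1)                             # predicted class per sample
```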
E ( w ) ij ,- 2
" "
F + Fd (13.21)
"
11
2 ,i= Vdi
* i
(13.21) 13.5
"
M
(13.21)
”
F
(13.12)
(13.21)
(13.12)
0(m2),
13.5
SVM
(disagreement-based methods)
disagreement diver-
disagreement )
-
co training) [Blum and Mitchell, 1998]
-
multi view)
multi-view learning)
attribute
set), view).
( ( x\ x 2 ) , y),
W i
( ( x\ x 2 ) , y)
'
13.5
(compatibility),
y 1 y2
y y1
y1 y2
(sufficient)
13.6
2013].
T.
h Du s Ds;
Du = Du\ Ds ]
3: for 7 = 1, 2 do
Di = { ( xivi ) I { { xi xl
end for
~3
) , y i ) G A}
—
5
for t ••• T do
for j = 2 do
j 8: hj W)
={g
/ij. •
j
9: hj <,xl ~
) G Ds} p
C Ds n C
- {( x r j l ) x{ e D0
3
^^
Dp3
^
10
11 bl ~
--
3 j
{( x , i ) \ x{ e Diy,
12 Ds = Ds\ ( Dp U Dn )\
13 end for
14 if then
15 break
16 else
17 for j = 2 do
18 D )
19
20
end for
Ax 2p + 2n
'
21 end if
22: end for
/ll Zl2
13.6
2013];
13.6
13.6
(semi-supervised clustering)
10.6
-
(must link) -
cannot link)
;
A ( Constrained fe means) - [Wagstaff et al. 2001]
m};
A4
*
C;
.
f Ml , M2 , - .. , /ifc}
repeat
fc Cj = 0 ( l j k ) ;
3
4 ^ ^
for i = 1, 2, . .. , m do
-
^^
5 (1 j k) =|
\\ xi Xj|
|2
6 ={1,2,
7 is _merged =false;
8 while is _merged do
/C r = argmir
A G
&
M C ^ K
if -i is voilated then
12 Cr = Cr U i}; '
13 is _merged =true
14 else
JC = JC\ { r } ;
if /C = 0 then
break
end if
end if
end while
end for
for j = 1, /c do
•
end for
25: until
= \ c~\
^ xeCj X
13.7 /c
C, { x i , X j ) e M
fc 9.4.1 {xi . Xj ) eC
C
A:
13.7
p .202 4.0
X13
M = {( X4 , 25 ), ), X u ) },
C
^
{(x2 ,X2l ),{ x 2 l : X 2 ) ,( 13 ,«23 ) , ®13 ),(«19 ,® 23),(«23 ,
/c
^ z6 , :c27 13.8
[Figure 13.8: results of constrained k-means (k = 3) with must-link and cannot-link constraints on watermelon dataset 4.0 after different iteration rounds, shown in panels (a)-(d).]
13.6
A:
D { x u x 2 , ... , x m } ,
(cluster label),
(class label ).
S \Jkj=1 S j C D, /0 j
A:
(Constrained Seed
-
Ai means) [Basu et al., 2002], 13.9
JD {xi ,x2,...,Xm}
,|
S C D|5 <|D
| . S = U =! •
for j .. . A: do
3 = 15 1
^^
x
• ^
3 end for
xeSj
4: repeat
^
5: C j = 0 (1 < j k ) ]
for j = 1, 2, . .. , /c do
for all x S j do
C j = Cj \J { x }
end for
end for
for a l l X i e D\ S do
(1 \ \2
^
j k) HA
r argmin
^ ^^ ^
Tj g ,... }
A CV CV U{^}
end for
for j = 1, 2, . .. , /c do
19 until
end for
= W i\
^ xeCj
13.9 /c
4.0
13.10 A;
^ ^
4 $ 27 5 28 ?
^^
2 = ®7 , ® 10 ,®11 ®12 , ® 18 ,®19 ,®2o}
;
( 3 = ® 13 , 21
[Figure 13.10: results of constrained seed k-means (k = 3) on watermelon dataset 4.0 after different iteration rounds; the labeled seed samples are marked and the cluster means are initialized from them.]
13.7
13.7
(ICML)
(paradigm)
SVM [Blum and Mitchell,
[Joachims, 1999] [Zhu et al.? 2003]
[Shahshahani and Landgrebe, 1994].
SVM
(continuation)
S3VM [Chapelle et al,2006a];
(deterministic annealing)
[Sindhwani et al., 2006]; CCCP
[Collobert et al.,2006]
[Blum and Chawla, 2001]
(mincut).
fc e
13.4 A: e
10.5.1
[Wang and Zhang, Jebara et al.,2009], (graph
kernel) [Chapelle et al., 2003].
[Zhou, 2009].
2013].
[Belkin et al., 2006]
ization)
( manifold regular
-
.
[Cozman
and Cohen, 2002] ,
SVM,
S4VM [Li and Zhou , 2015]
safe)
P ( X \ 0) = Y ai
...
^ ' Pix \ ei )
i
(13.22)
o, =
TSVM
SVM,
13.5
TSVM
13.6* TSVM
r
i 3.
13.8
-
(self training)
AA,
13.9*
13.10 13.7
. 2013). 39 (11):1871-1878.
Basu S., A. Banerjee, and R. J. Mooney. ( 2002). “Semi-supervised clustering
by seeding.” In Proceedings of the 19th International Conference on Machine
Learning ( ICML ) , 19-26, Sydney, Australia.
Belkin, M., P. Niyogi, and V. Sindhwani. ( 2006) . “Manifold regularization: A
geometric framework for learning from labeled and unlabeled examples.”
Journal of Machine Learning Research, 7:2399-2434.
Blum, A. and S. Chawla. (2001). “Learning from labeled and unlabeled data
using graph mincuts.” In Proceedings of the 18th International Conference
on Machine Learning ( ICML ), 19-26, Williamston, MA.
Blum, A. and T. Mitchell. (1998). “Combining labeled and unlabeled data with
co-training.” In Proceedings of the 11th Annual Conference on Computation-
al Learning Theory ( COLT ), 92-100, Madison , WI.
Chapelle, O., M. Chi, and A. Zien. ( 2006a) . “A continuation method for semi-
supervised SVMs.” In Proceedings of the 23rd International Conference on
Machine Learning ( ICML ) , 185-192, Pittsburgh , PA.
Chapelle, O., B. Scholkopf , and A. Zien , eds. ( 2006b). Semi-Supervised Learn-
ing. MIT Press, Cambridge, MA.
Chapelle, O., J . Weston, and B. Scholkopf. ( 2003) . “Cluster kernels for semi-
Collobert , R., F. Sinz , J. Weston , and L. Bottou. ( 2006). “Trading convexity for
scalability.” In Proceedings of the 23rd International Conference on Machine
Learning ( ICML ) , 201-208, Pittsburgh , PA.
Cozman , F. G. and I. Cohen. ( 2002 ) . “Unlabeled data can degrade classifica-
tion performance of generative classifiers.” In Proceedings of the 15th Inter-
national Conference of the Florida Artificial Intelligence Research Society
( FLAIRS ) , 327-331, Pensacola, FL.
Goldman , S. and Y. Zhou. (2000) . “Enhancing supervised learning with unla-
beled data.” In Proceedings of the 17th International Conference on Machine
Learning ( ICML ) , 327-334, San Francisco, CA.
Jebara, T., J. Wang, and S. F. Chang. ( 2009) . “Graph construction and b-
matching for semi-supervised learning.” In Proceedings of the 26th Interna-
tional Conference on Machine Learning ( ICML ), 441—448, Montreal, Cana-
da.
Joachims, T. (1999) . “Transductive inference for text classification using sup-
port vector machines.” In Proceedings of the 16th International Conference
on Machine Learning ( ICML ), 200-209, Bled , Slovenia.
Li, Y.-F. , J . T. Kwok, and Z.-H. Zhou. (2009). “Semi-supervised learning using
label mean.” In Proceedings of the 26th International Conference on Machine
Learning ( ICML ) , 633-640 , Montreal, Canada.
Li, Y.-F. and Z.-H. Zhou. ( 2015) . “Towards making unlabeled data never hurt .”
IEEE Transactions on Pattern Analysis and Machine Intelligence , 37(1):
175-188.
Miller, D. J . and H. S. Uyar. (1997) . “A mixture of experts classifier with
learning based on both labelled and unlabelled data.” In Advances in Neural
Information Processing Systems 9 ( NIPS ) ( M. Mozer , M. I. Jordan, and T.
Petsche, eds.) , 571-577, MIT Press, Cambridge, MA.
Nigam, K ., A. McCallum , S. Thrun , and T. Mitchell. (2000 ). “Text classifica-
tion from labeled and unlabeled documents using EM.” Machine Learning ,
39 ( 2-3):103—134.
Settles, B. ( 2009 ). “Active learning literature survey.” Technical Re-
port 1648, Department of Computer Sciences, University of Wis-
~
consin at Madison, Wisconsin, WI. http:// pages.cs.wisc.edu / bsettles /
pub /settles.activelearning.pdf .
Shahshahani, B. and D. Landgrebe. (1994) . “The effect of unlabeled samples
in reducing the small sample size problem and mitigating the Hughes phe-
nomenon. IEEE Transactions on Geoscience and Remote Sensing , 32 (5):
1087-1095.
Sindhwani, S. S. Keerthi, and O. Chapelle. (2006 ) . “Deterministic annealing
for semi-supervised kernel machines.” In Proceedings of the 23rd Internation-
al Conference on Machine Learning ( ICML ) , 123-130, Pittsburgh, PA.
Wagstaff , K., C. Cardie, S. Rogers, and S. Schrodl. ( 2001) . “Constrained
fc-means clustering with background knowledge.” In Proceedings of the
18th International Conference on Machine Learning ( ICML ) , 577 584, -
Williamstown , MA.
Wang, F. and C. Zhang. ( 2006 ) . “Label propagation through linear neighbor-
hoods.” In Proceedings of the 23rd International Conference on Machine
Learning ( ICML ), 985-992, Pittsburgh , PA.
Zhang , D., Z.-H. Zhou, and S. Chen. ( 2007) . “Semi-supervised dimensionality
reduction.” In Proceedings of the 7th SIAM International Conference on
Data Mining ( SDM ), 629-634, Minneapolis, MN.
Zhou, D., O. Bousquet , T. N. Lai, J . Weston , and B. Scholkopf. (2004) . “Learn-
ing with local and global consistency.” In Advances in Neural Information
Processing Systems 16 ( NIPS ) (S. Thrun, L. Saul, and B. Scholkopf , eds.) ,
284 291, MIT Press, Cambridge, MA.
-
manifold ) Mannig
-
faltigkeit , • ( Bernhard
Riemann , 1866)
(Breselenz),
Mannigfaltigkeit
(
14.1
)
Chapter 14 Probabilistic Graphical Models

A central task in machine learning is to estimate unknown variables of interest (for example, class labels) from observed evidence (for example, training samples). Probabilistic models provide a framework in which such tasks are cast as computing probability distributions over variables; using the known variables to infer the distribution of the unknown ones is called inference. Let Y denote the variables of interest, O the observable variables, and R the remaining variables. A generative model considers the joint distribution P(Y, R, O), whereas a discriminative model considers the conditional distribution P(Y, R | O); inference amounts to obtaining P(Y | O) from either of them. Eliminating R by direct summation is infeasible in general: even if every variable were binary, the cost would already be at least O(2^{|Y|+|R|}).

Probabilistic graphical models use a graph to express the relations among variables, with each node corresponding to a variable. They fall into two broad families: models based on directed acyclic graphs, known as Bayesian networks, and models based on undirected graphs, known as Markov networks.

14.1 Hidden Markov Models

The Hidden Markov Model (HMM) is the simplest dynamic Bayesian network, a well-known directed graphical model (cf. the Bayesian networks of Section 7.5) that is mainly used for modelling time-series data, for instance in speech recognition and natural language processing.
An HMM involves two groups of variables. The state variables {y_1, y_2, ..., y_n}, where y_i denotes the system state at time step i, are assumed to be hidden and unobservable and are therefore also called hidden variables. The observed variables {x_1, x_2, ..., x_n} record what is observed at each time step. Figure 14.1 shows the graphical structure: each observation x_i depends only on the state y_i at the same time step, and each state y_t depends only on the previous state y_{t-1}, not on any earlier state. The state sequence therefore forms a Markov chain, in which the next state is determined solely by the current one. Under these dependencies the joint probability of all variables is

P(x_1, y_1, \ldots, x_n, y_n) = P(y_1)\, P(x_1 \mid y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1})\, P(x_i \mid y_i) .   (14.1)
Besides this structure, an HMM is specified by three groups of parameters. Writing the state space as S = {s_1, ..., s_N} and the observation space as O = {o_1, ..., o_M}:

• the state transition matrix A = [a_{ij}]_{N \times N}, where a_{ij} = P(y_{t+1} = s_j \mid y_t = s_i) is the probability of moving from state s_i to state s_j;
• the output observation matrix B = [b_{ij}]_{N \times M}, where b_{ij} = P(x_t = o_j \mid y_t = s_i) is the probability of observing o_j in state s_i;
• the initial state probabilities \pi = (\pi_1, ..., \pi_N), where \pi_i = P(y_1 = s_i).

Together with the state space and the observation space, these parameters determine an HMM, usually written \lambda = [A, B, \pi].
Given \lambda = [A, B, \pi], an HMM generates an observation sequence {x_1, x_2, ..., x_n} as follows:

(1) set t = 1 and choose the initial state y_1 according to \pi;
(2) choose the observation x_t according to the state y_t and the output matrix B;
(3) move to the next state y_{t+1} according to the current state y_t and the transition matrix A;
(4) if t < n, set t = t + 1 and return to step (2); otherwise stop.

In applications, three basic problems about HMMs are of interest:

• given \lambda = [A, B, \pi], how to compute the probability P(x \mid \lambda) of an observation sequence x = {x_1, x_2, ..., x_n}; in other words, how well does the model match the observations?
• given \lambda = [A, B, \pi] and an observation sequence x = {x_1, x_2, ..., x_n}, how to find the state sequence y = {y_1, y_2, ..., y_n} that best explains it; in other words, how to infer the hidden states from the observations?
• given an observation sequence x = {x_1, x_2, ..., x_n}, how to adjust \lambda = [A, B, \pi] so that P(x \mid \lambda) is maximized; in other words, how to train the model to best describe the data?

Thanks to the conditional independences expressed by (14.1), all three problems can be solved efficiently.
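As a concrete illustration of the first problem, the following is a minimal sketch of the forward algorithm for computing P(x | λ) in Python/NumPy; the matrices A and B and the vector pi are hypothetical toy parameters, not values from the text.

```python
import numpy as np

def hmm_forward(A, B, pi, obs):
    """Forward algorithm: returns P(x | lambda) for a sequence of observation indices."""
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * b_{i, x_1}
    for x_t in obs[1:]:
        # alpha_t(j) = (sum_i alpha_{t-1}(i) * a_{ij}) * b_{j, x_t}
        alpha = (alpha @ A) * B[:, x_t]
    return alpha.sum()

# Toy model with N = 2 states and M = 3 observation symbols (assumed values).
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])              # a_ij = P(y_{t+1} = s_j | y_t = s_i)
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])         # b_ij = P(x_t = o_j | y_t = s_i)
pi = np.array([0.6, 0.4])                # pi_i = P(y_1 = s_i)

print(hmm_forward(A, B, pi, obs=[0, 1, 2]))   # P(x | lambda)
```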
14.2 Markov Random Fields

A Markov random field (MRF) is a typical Markov network, a well-known undirected graphical model. Each node of the graph represents one variable (or a group of variables), and edges represent dependencies between variables. An MRF is equipped with a set of potential functions, also called factors: non-negative real-valued functions defined on subsets of the variables, which are used to define the probability distribution.

A subset of nodes in which every pair of nodes is connected by an edge is called a clique; a clique that cannot be enlarged by adding any further node is a maximal clique (every clique is contained in some maximal clique). In the graph of Figure 14.2, for example, {x_1, x_2}, {x_1, x_3}, {x_2, x_4}, {x_2, x_5}, {x_2, x_6}, {x_3, x_5}, {x_5, x_6} and {x_2, x_5, x_6} are all cliques, and all of them except {x_2, x_5}, {x_2, x_6} and {x_5, x_6} are maximal; {x_1, x_2, x_3} is not a clique because x_2 and x_3 are not connected.

Figure 14.2: a simple Markov random field over the variables x_1, ..., x_6.
In an MRF the joint distribution of the n variables x = {x_1, x_2, ..., x_n} is defined as a product of potential functions over the cliques. Let C be the set of cliques and x_Q the variables of clique Q \in C; then

P(\mathbf{x}) = \frac{1}{Z} \prod_{Q \in C} \psi_Q(\mathbf{x}_Q) ,   (14.2)

where \psi_Q is the potential function of clique Q and Z = \sum_{\mathbf{x}} \prod_{Q \in C} \psi_Q(\mathbf{x}_Q) is the normalization constant that makes P a proper distribution; in practice Z is often hard to compute exactly, but its exact value is frequently not needed.

When there are many variables the number of cliques can be very large, and (14.2) then contains many factors. Note, however, that any clique Q that is not maximal is contained in some maximal clique Q*, so the relations among the variables x_Q are already captured by \psi_{Q*}. The joint distribution can therefore be defined using only the set C* of maximal cliques:

P(\mathbf{x}) = \frac{1}{Z^*} \prod_{Q \in C^*} \psi_Q(\mathbf{x}_Q) ,   (14.3)

where Z^* = \sum_{\mathbf{x}} \prod_{Q \in C^*} \psi_Q(\mathbf{x}_Q).
For the graph of Figure 14.2 with x = {x_1, ..., x_6}, the joint distribution defined on the maximal cliques is

P(\mathbf{x}) = \frac{1}{Z}\, \psi_{12}(x_1, x_2)\, \psi_{13}(x_1, x_3)\, \psi_{24}(x_2, x_4)\, \psi_{35}(x_3, x_5)\, \psi_{256}(x_2, x_5, x_6) ,

where the potential \psi_{256}(x_2, x_5, x_6) is defined on the maximal clique {x_2, x_5, x_6}; because of it, no separate potentials are needed for the cliques {x_2, x_5}, {x_2, x_6} and {x_5, x_6}.
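To make the maximal-clique factorization concrete, the sketch below evaluates the joint distribution of the six binary variables of the example by brute-force normalization over all 2^6 assignments; the numeric potential tables are made up for illustration.

```python
import itertools
import numpy as np

# Hypothetical potentials on the maximal cliques of the example graph (0-based
# variable indices): each variable is binary, so a pairwise potential is a 2x2
# table and the triple potential psi_256 is a 2x2x2 table.
psi = {
    (0, 1):    np.array([[1.5, 0.1], [0.1, 1.5]]),   # psi_12(x1, x2)
    (0, 2):    np.array([[1.0, 0.5], [0.5, 1.0]]),   # psi_13(x1, x3)
    (1, 3):    np.array([[0.2, 1.3], [1.3, 0.2]]),   # psi_24(x2, x4)
    (2, 4):    np.array([[1.2, 0.8], [0.8, 1.2]]),   # psi_35(x3, x5)
    (1, 4, 5): np.ones((2, 2, 2)),                   # psi_256(x2, x5, x6)
}

def unnormalized(x):
    p = 1.0
    for clique, table in psi.items():
        p *= table[tuple(x[i] for i in clique)]
    return p

Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=6))
x = (0, 1, 0, 1, 0, 1)
print(unnormalized(x) / Z)    # P(x) under the maximal-clique factorization
```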
The graph structure of an MRF encodes conditional-independence relations (compare Section 7.5.1). Let A, B and C be three disjoint sets of nodes, as in Figure 14.3. If every path from a node in A to a node in B passes through a node in C, we say that A and B are separated by C, and C is called a separating set. MRFs then satisfy the

global Markov property: given the variables of a separating set, the variables of the two separated node sets are conditionally independent, i.e. x_A \perp x_B \mid x_C, where x_A, x_B, x_C denote the variables corresponding to A, B and C.

Figure 14.3: node set C separates node sets B and A (wherever a path from A to B must pass through C).

To verify this property, consider the simplified situation of Figure 14.4, obtained from Figure 14.3 by letting each of A, B and C consist of a single variable x_A, x_B, x_C.
For the graph of Figure 14.4, (14.2) gives

P(x_A, x_B, x_C) = \frac{1}{Z}\, \psi_{AC}(x_A, x_C)\, \psi_{BC}(x_B, x_C) .   (14.4)

By the definition of conditional probability,

P(x_A, x_B \mid x_C) = \frac{P(x_A, x_B, x_C)}{\sum_{x'_A} \sum_{x'_B} P(x'_A, x'_B, x_C)} = \frac{\psi_{AC}(x_A, x_C)\, \psi_{BC}(x_B, x_C)}{\sum_{x'_A} \sum_{x'_B} \psi_{AC}(x'_A, x_C)\, \psi_{BC}(x'_B, x_C)} = \frac{\psi_{AC}(x_A, x_C)}{\sum_{x'_A} \psi_{AC}(x'_A, x_C)} \cdot \frac{\psi_{BC}(x_B, x_C)}{\sum_{x'_B} \psi_{BC}(x'_B, x_C)} .   (14.5)

Similarly,

P(x_A \mid x_C) = \frac{\sum_{x'_B} P(x_A, x'_B, x_C)}{\sum_{x'_A} \sum_{x'_B} P(x'_A, x'_B, x_C)} = \frac{\psi_{AC}(x_A, x_C)}{\sum_{x'_A} \psi_{AC}(x'_A, x_C)} ,   (14.6)

and the analogous expression holds for P(x_B \mid x_C). Combining (14.5) and (14.6) yields

P(x_A, x_B \mid x_C) = P(x_A \mid x_C)\, P(x_B \mid x_C) ,   (14.7)

that is, x_A and x_B are conditionally independent given x_C.
From the global Markov property two useful corollaries follow:

• local Markov property: given the variables of its adjacent nodes, a variable is conditionally independent of all remaining variables. Writing V for the node set, n(v) for the neighbours of node v, and n*(v) = n(v) \cup {v}, this reads x_v \perp x_{V \setminus n*(v)} \mid x_{n(v)}; the neighbour set of a variable is also called its Markov blanket.
• pairwise Markov property: given all other variables, two non-adjacent variables are conditionally independent, i.e. x_u \perp x_v \mid x_{V \setminus \{u, v\}} whenever the edge \langle u, v \rangle is not in the edge set E.
We now look more closely at the potential functions. The role of \psi_Q(x_Q) is to quantify the affinity between the variables in x_Q: joint assignments that the model prefers should receive larger values. Supposing, for example, that the variables of Figure 14.4 are binary, the potentials

\psi_{AC}(x_A, x_C) = \begin{cases} 1.5, & \text{if } x_A = x_C; \\ 0.1, & \text{otherwise,} \end{cases}
\qquad
\psi_{BC}(x_B, x_C) = \begin{cases} 0.2, & \text{if } x_B = x_C; \\ 1.3, & \text{otherwise,} \end{cases}

express a preference for x_A and x_C to take the same value and for x_B and x_C to take different values; assignments in which x_A agrees with x_C while x_B disagrees with x_C therefore receive higher probability.

Since potential functions must be non-negative, exponential functions are commonly used to define them:

\psi_Q(\mathbf{x}_Q) = e^{-H_Q(\mathbf{x}_Q)} .   (14.8)

Here H_Q(x_Q) is a real-valued function of x_Q, typically of the form

H_Q(\mathbf{x}_Q) = \sum_{u, v \in Q,\, u \neq v} \alpha_{uv} x_u x_v + \sum_{v \in Q} \beta_v x_v ,   (14.9)

where \alpha_{uv} and \beta_v are parameters; the first term involves pairs of variables and the second term single variables.
14.3 Conditional Random Fields

A conditional random field (CRF) is a discriminative undirected graphical model. As noted in Section 14.1, generative models work with joint distributions while discriminative models work with conditional distributions: the hidden Markov model and the Markov random field described above are generative, whereas the CRF is discriminative.

A CRF models the conditional probability of several output variables given observations. Let x = {x_1, x_2, ..., x_n} be an observation sequence and y = {y_1, y_2, ..., y_n} the corresponding label sequence; the goal is the conditional model P(y | x). The labels y may be structured, i.e. their components may be interrelated. In part-of-speech tagging, for example, the observation is a sentence (a word sequence) and the label is the corresponding tag sequence, which has a linear chain structure, as in Figure 14.5(a); in syntactic parsing the output label is a parse tree, as in Figure 14.5(b).

Figure 14.5: output structures for the sentence "The boy knocked at the watermelon": (a) the part-of-speech tag sequence [D][N][V][P][D][N]; (b) a parse tree with non-terminals such as [S], [NP], [VP], [PP].
Let G = \langle V, E \rangle be an undirected graph whose nodes correspond one-to-one to the components of y, and let n(v) denote the neighbours of node v. If every label variable y_v satisfies the Markov property with respect to G, that is,

P(y_v \mid \mathbf{x}, \mathbf{y}_{V \setminus \{v\}}) = P(y_v \mid \mathbf{x}, \mathbf{y}_{n(v)}) ,   (14.10)

then (y, x) forms a conditional random field.

In principle G may have any structure, as long as it reflects the conditional independences among the label variables; in practice, especially for modelling label sequences, the most common choice is the chain structure of Figure 14.6, the chain-structured CRF, on which we focus in the rest of this section.

Figure 14.6: graphical structure of a chain-structured CRF, with the observation sequence x = {x_1, x_2, ..., x_n} shown as a single conditioning node.
Like the Markov random field, a CRF defines probabilities through potential functions on the cliques of the graph, but here the potentials are conditioned on the observations x. Given x, a chain-structured CRF involves two kinds of cliques over the label variables: single labels {y_i} and adjacent pairs {y_{i-1}, y_i}. Using exponential potentials and introducing feature functions, the conditional probability is defined as

P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z} \exp\Bigl( \sum_j \sum_{i < n} \lambda_j\, t_j(y_{i+1}, y_i, \mathbf{x}, i) + \sum_k \sum_i \mu_k\, s_k(y_i, \mathbf{x}, i) \Bigr) ,   (14.11)

where t_j(y_{i+1}, y_i, x, i) is a transition feature function defined on two adjacent label positions, capturing the correlation between neighbouring labels and the influence of the observations on them; s_k(y_i, x, i) is a status feature function defined on label position i, capturing the influence of the observations on that label; \lambda_j and \mu_k are weights; and Z is the normalization factor that turns the exponential expression into a proper probability. For the tagging example of Figure 14.5(a), a transition feature might take the value 1 when y_{i+1} = [P], y_i = [V] and the word x_i is "knock", and 0 otherwise, while a status feature might take the value 1 when y_i = [V] and x_i is "knock". Comparing (14.11) with (14.2), both the CRF and the MRF define probabilities via potentials on cliques; the essential difference is that the CRF models a conditional distribution, whereas the MRF models a joint distribution.
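The sketch below evaluates (14.11) for a toy chain-structured CRF with one transition feature and one status feature; the feature functions, weights and labels are illustrative assumptions, and the normalizer Z is computed by brute force over all label sequences.

```python
import itertools
import math

def t1(y_next, y_cur, x, i):          # transition feature (assumed definition)
    return 1.0 if y_next == y_cur else 0.0

def s1(y_i, x, i):                    # status feature (assumed definition)
    return 1.0 if (x[i] == 'knock' and y_i == 1) else 0.0

lam, mu = 0.8, 1.5                    # feature weights (assumed values)

def score(y, x):
    s = sum(lam * t1(y[i + 1], y[i], x, i) for i in range(len(y) - 1))
    s += sum(mu * s1(y[i], x, i) for i in range(len(y)))
    return math.exp(s)

def crf_prob(y, x):
    Z = sum(score(y2, x) for y2 in itertools.product([0, 1], repeat=len(x)))
    return score(y, x) / Z            # P(y | x) as in (14.11)

x = ['the', 'boy', 'knock', 'at']
print(crf_prob((0, 0, 1, 1), x))
```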
14.4 Learning and Inference

Given the joint distribution defined by a graphical model, we can infer the marginal distribution of target variables, or their conditional distribution given some observed (evidence) variables. The marginal distribution of a subset of variables is obtained by summing out (or integrating out) the remaining variables, an operation called marginalization.

Suppose the variables x = {x_1, x_2, ..., x_n} of a graphical model are partitioned into two disjoint sets, the evidence variables x_E and the query variables x_F. The inference task is to compute the marginal P(x_F) or the conditional P(x_F | x_E). From the definition of conditional probability,

P(\mathbf{x}_F \mid \mathbf{x}_E) = \frac{P(\mathbf{x}_E, \mathbf{x}_F)}{P(\mathbf{x}_E)} = \frac{P(\mathbf{x}_E, \mathbf{x}_F)}{\sum_{\mathbf{x}_F} P(\mathbf{x}_E, \mathbf{x}_F)} ,   (14.12)

where the joint P(x_E, x_F) is available from the graphical model; the key computational problem is therefore to evaluate the marginal

P(\mathbf{x}_E) = \sum_{\mathbf{x}_F} P(\mathbf{x}_E, \mathbf{x}_F)   (14.13)

efficiently. Inference methods fall into two broad categories: exact inference, which computes the target distribution exactly but whose cost grows quickly with the size of the graph, and approximate inference, which trades accuracy for efficiency and is discussed in Section 14.5.
14.4.1 Variable Elimination

Exact inference methods are essentially dynamic programming: they exploit the conditional independences encoded by the graph to reduce the cost of computation. Variable elimination is the most intuitive exact inference method and underlies the others.
Consider the directed graphical model of Figure 14.7(a), in which x_1 \to x_2 \to x_3 and x_3 has the two children x_4 and x_5, and suppose the goal is the marginal P(x_5). We need only sum out the variables {x_1, x_2, x_3, x_4}:

P(x_5) = \sum_{x_4} \sum_{x_3} \sum_{x_2} \sum_{x_1} P(x_1, x_2, x_3, x_4, x_5) = \sum_{x_4} \sum_{x_3} \sum_{x_2} \sum_{x_1} P(x_1) P(x_2 \mid x_1) P(x_3 \mid x_2) P(x_4 \mid x_3) P(x_5 \mid x_3) .   (14.14)

Carrying out the summations in the order {x_1, x_2, x_4, x_3} gives

P(x_5) = \sum_{x_3} P(x_5 \mid x_3) \sum_{x_4} P(x_4 \mid x_3) \sum_{x_2} P(x_3 \mid x_2) \sum_{x_1} P(x_1) P(x_2 \mid x_1) = \sum_{x_3} P(x_5 \mid x_3) \sum_{x_4} P(x_4 \mid x_3) \sum_{x_2} P(x_3 \mid x_2)\, m_{12}(x_2) ,   (14.15)

where m_{ij}(x_j) denotes an intermediate result of the summation: the subscript i indicates that variable x_i has been summed out, and j indicates the variable that remains, so m_{ij}(x_j) is a function of x_j. Continuing in the same way,

P(x_5) = \sum_{x_3} P(x_5 \mid x_3) \sum_{x_4} P(x_4 \mid x_3)\, m_{23}(x_3) = \sum_{x_3} P(x_5 \mid x_3)\, m_{23}(x_3) \sum_{x_4} P(x_4 \mid x_3) = \sum_{x_3} P(x_5 \mid x_3)\, m_{23}(x_3)\, m_{43}(x_3) = m_{35}(x_5) .   (14.16)
The quantity m_{35}(x_5) is a function of x_5 alone and gives the desired marginal P(x_5).

The same procedure applies to undirected models. If the directed edges of Figure 14.7(a) are replaced by undirected ones, the joint distribution becomes

P(x_1, x_2, x_3, x_4, x_5) = \frac{1}{Z}\, \psi_{12}(x_1, x_2)\, \psi_{23}(x_2, x_3)\, \psi_{34}(x_3, x_4)\, \psi_{35}(x_3, x_5) ,   (14.17)

where Z is the normalization factor, and the marginal of x_5 is computed as

P(x_5) = \frac{1}{Z} \sum_{x_3} \psi_{35}(x_3, x_5) \sum_{x_4} \psi_{34}(x_3, x_4) \sum_{x_2} \psi_{23}(x_2, x_3) \sum_{x_1} \psi_{12}(x_1, x_2) = \frac{1}{Z}\, m_{35}(x_5) .   (14.18)

By repeatedly applying the distributive law of multiplication over addition, variable elimination turns the summation of a product over many variables into an alternation of local sums and products, each involving only a few variables, which greatly reduces the amount of computation.

Variable elimination has an obvious drawback: when several marginal distributions are needed, running it separately for each query repeats a great deal of work. In Figure 14.7(a), for example, if after P(x_5) we also want P(x_4) and use the elimination order {x_1, x_2, x_5, x_3}, then m_{12}(x_2) and m_{23}(x_3) are computed all over again.
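The sketch below carries out the elimination (14.15)-(14.16) numerically for the directed model of Figure 14.7(a); the conditional probability tables are toy values chosen for illustration, not taken from the text.

```python
import numpy as np

# Toy CPTs for x1 -> x2 -> x3 with children x3 -> x4 and x3 -> x5 (binary variables).
P1   = np.array([0.6, 0.4])                   # P(x1)
P2_1 = np.array([[0.7, 0.3], [0.2, 0.8]])     # P(x2 | x1)
P3_2 = np.array([[0.9, 0.1], [0.4, 0.6]])     # P(x3 | x2)
P4_3 = np.array([[0.5, 0.5], [0.3, 0.7]])     # P(x4 | x3)
P5_3 = np.array([[0.8, 0.2], [0.1, 0.9]])     # P(x5 | x3)

# Eliminate variables in the order x1, x2, x4, x3, as in (14.15)-(14.16).
m12 = P1 @ P2_1                 # m_12(x2) = sum_x1 P(x1) P(x2 | x1)
m23 = m12 @ P3_2                # m_23(x3) = sum_x2 P(x3 | x2) m_12(x2)
m43 = P4_3.sum(axis=1)          # m_43(x3) = sum_x4 P(x4 | x3) = 1
m35 = (m23 * m43) @ P5_3        # m_35(x5) = sum_x3 P(x5 | x3) m_23(x3) m_43(x3)

print(m35)                      # P(x5); the two entries sum to 1
```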
14.4.2 Belief Propagation

The belief propagation algorithm treats the summation operations of variable elimination as messages passed between nodes, which removes the redundant computation that arises when many marginals are required. In variable elimination, a variable x_i is eliminated by the operation

m_{ij}(x_j) = \sum_{x_i} \psi(x_i, x_j) \prod_{k \in n(i) \setminus j} m_{ki}(x_i) ,   (14.19)

where n(i) denotes the neighbours of node x_i. In belief propagation this operation is viewed as node x_i sending the message m_{ij}(x_j) to node x_j, so the elimination process becomes the message-passing process shown in Figure 14.7(b). Each message-passing step involves only x_i and its direct neighbours, so the computation is confined to a local part of the graph.

In belief propagation a node sends a message to one neighbour only after it has received messages from all of its other neighbours, and the marginal of a variable is proportional to the product of all messages it receives:

P(x_i) \propto \prod_{k \in n(i)} m_{ki}(x_i) .   (14.20)

In Figure 14.7(b), for instance, x_3 can send a message to x_5 only after receiving the messages from x_2 and x_4, and the message m_{35}(x_5) arriving at x_5 is exactly (proportional to) P(x_5).
If the graph contains no cycles, belief propagation computes all marginals after just two sweeps of message passing:

• choose a root node and pass messages from all leaf nodes towards the root, until the root has received messages from all of its neighbours;
• then pass messages from the root back towards the leaves, until every leaf has received its messages.

For example, taking x_1 as the root in Figure 14.7, so that x_4 and x_5 are leaves, Figure 14.8 shows the two sweeps: (a) messages sent towards the root, m_{43}(x_3), m_{53}(x_3), m_{32}(x_2), m_{21}(x_1); (b) messages sent from the root towards the leaves, m_{12}(x_2), m_{23}(x_3), m_{34}(x_4), m_{35}(x_5). After the two sweeps every edge carries messages in both directions, and all marginals follow from (14.20).

Figure 14.8: the two message-passing sweeps of belief propagation on the graph of Figure 14.7.
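The following is a minimal sum-product sketch on a three-node chain with pairwise potentials (toy tables): after the leaf-to-root and root-to-leaf passes, every marginal is obtained from the product of incoming messages as in (14.20).

```python
import numpy as np

# Undirected chain x1 - x2 - x3 with binary variables and toy pairwise potentials.
psi12 = np.array([[1.5, 0.1], [0.1, 1.5]])
psi23 = np.array([[0.2, 1.3], [1.3, 0.2]])

# Leaf-to-root pass (root = x3), then root-to-leaf pass, following (14.19).
m12 = psi12.sum(axis=0)                      # m_{1->2}(x2) = sum_x1 psi12(x1, x2)
m23 = (m12[:, None] * psi23).sum(axis=0)     # m_{2->3}(x3) = sum_x2 psi23(x2, x3) m_{1->2}(x2)
m32 = psi23.sum(axis=1)                      # m_{3->2}(x2) = sum_x3 psi23(x2, x3)
m21 = (m32[None, :] * psi12).sum(axis=1)     # m_{2->1}(x1) = sum_x2 psi12(x1, x2) m_{3->2}(x2)

def normalize(v):
    return v / v.sum()

# Marginals are proportional to the product of incoming messages, as in (14.20).
print(normalize(m21))            # P(x1)
print(normalize(m12 * m32))      # P(x2)
print(normalize(m23))            # P(x3)
```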
14.5 Approximate Inference

Exact inference is usually too expensive for large models, so approximate inference is commonly used in practice. Approximate inference methods fall into two broad classes: sampling methods, which rely on randomization, and deterministic approximations, whose typical representative is variational inference.

14.5.1 MCMC Sampling

In many tasks we care about a distribution not for its own sake but because we need expectations computed under it, perhaps in order to make decisions. For the Bayesian network of Figure 14.7(a), for example, the purpose of inference may be to compute the expectation of x_5; if that expectation can be computed or approximated directly, obtaining the full distribution becomes unnecessary. Sampling methods follow exactly this idea. Suppose the target is the expectation of a function f(x) under a probability density p(x),

\mathbb{E}_p[f] = \int f(x)\, p(x)\, dx .   (14.21)

We can draw samples {x_1, x_2, ..., x_N} from p(x) and approximate the expectation by the sample mean

\hat{f} = \frac{1}{N} \sum_{i=1}^{N} f(x_i) .   (14.22)

If the samples are drawn independently from p, the law of large numbers guarantees good accuracy for large N. The crux is how to sample, that is, how to draw samples efficiently from the distribution defined by the graphical model.
Markov chain Monte Carlo (MCMC) is the most widely used sampling technique for graphical models. Given the probability density p(x) of a continuous variable x \in X, the probability that x falls in a region A \subseteq X is

P(A) = \int_A p(x)\, dx ,   (14.23)

and for a function f: X \to \mathbb{R} the expectation of f(x) is

p(f) = \mathbb{E}_p[f(X)] = \int_X f(x)\, p(x)\, dx .   (14.24)

When x is a high-dimensional variable with a complicated distribution, the integral in (14.24) is hard to evaluate directly. MCMC therefore constructs samples x_1, x_2, ..., x_N that (approximately) follow the distribution p and uses the estimate

\tilde{p}(f) = \frac{1}{N} \sum_{i=1}^{N} f(x_i) .   (14.25)
When p(x) is complicated, however, generating independent samples from it is itself difficult. The key idea of MCMC is to construct a Markov chain whose stationary distribution is exactly p: if the chain is run long enough to reach its stationary state, the states it subsequently produces are (approximately) samples from p. Concretely, let T(x' | x) be the transition probability of the chain from state x to state x', and let p(x^t) be the distribution of the state at time t. If at some point the chain satisfies the stationarity (detailed balance) condition

p(\mathbf{x}^t)\, T(\mathbf{x}^{t-1} \mid \mathbf{x}^t) = p(\mathbf{x}^{t-1})\, T(\mathbf{x}^t \mid \mathbf{x}^{t-1}) ,   (14.26)

then p(x) is a stationary distribution of the chain, and the chain has converged to it. MCMC methods therefore construct a Markov chain whose stationary distribution is the target (for example, posterior) distribution, run it until convergence, and use the states it produces as samples for estimation. How the transition probability is constructed is crucial; different constructions yield different MCMC algorithms.

The Metropolis-Hastings (MH) algorithm [Metropolis et al., 1953; Hastings, 1970] is an important representative of MCMC. It approaches the stationary distribution by means of reject sampling: in each round a candidate state x* is drawn from a user-specified proposal distribution Q(x* | x^{t-1}) based on the previous sample x^{t-1}, and the candidate is accepted only with a certain probability A(x* | x^{t-1}). If the process converges to the stationary state, then by (14.26)

p(\mathbf{x}^{t-1})\, Q(\mathbf{x}^* \mid \mathbf{x}^{t-1})\, A(\mathbf{x}^* \mid \mathbf{x}^{t-1}) = p(\mathbf{x}^*)\, Q(\mathbf{x}^{t-1} \mid \mathbf{x}^*)\, A(\mathbf{x}^{t-1} \mid \mathbf{x}^*) .   (14.27)
Stationarity is therefore achieved by setting the acceptance probability to

A(\mathbf{x}^* \mid \mathbf{x}^{t-1}) = \min\!\left(1,\; \frac{p(\mathbf{x}^*)\, Q(\mathbf{x}^{t-1} \mid \mathbf{x}^*)}{p(\mathbf{x}^{t-1})\, Q(\mathbf{x}^* \mid \mathbf{x}^{t-1})}\right) .   (14.28)

Figure 14.9 summarizes the Metropolis-Hastings sampling algorithm.

Input: proposal distribution Q(x* | x^{t-1}).
Process:
 1: initialize x^0;
 2: for t = 1, 2, ... do
 3:   draw a candidate sample x* from Q(x* | x^{t-1});
 4:   draw a threshold u uniformly from (0, 1);
 5:   if u <= A(x* | x^{t-1}) then      (A given by (14.28))
 6:     x^t = x*
 7:   else
 8:     x^t = x^{t-1}
 9:   end if
10: end for
Output: the sample sequence x^1, x^2, ...

Figure 14.9: the Metropolis-Hastings sampling algorithm.
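The sketch below implements the MH iteration of Figure 14.9 for a one-dimensional target; the target density (a standard normal) and the symmetric Gaussian proposal are assumptions made for illustration. With a symmetric proposal the acceptance rate (14.28) reduces to min(1, p(x*)/p(x^{t-1})).

```python
import numpy as np

def metropolis_hastings(log_p, x0, n_samples, step=1.0, rng=None):
    """MH sampling with a symmetric Gaussian proposal Q(x* | x^{t-1})."""
    rng = rng or np.random.default_rng(0)
    x, samples = x0, []
    for _ in range(n_samples):
        x_star = x + step * rng.standard_normal()        # candidate from Q
        if np.log(rng.random()) <= log_p(x_star) - log_p(x):
            x = x_star                                    # accept x*
        samples.append(x)                                 # otherwise keep x^{t-1}
    return np.array(samples)

log_p = lambda x: -0.5 * x * x          # log of an (unnormalized) standard normal
samples = metropolis_hastings(log_p, x0=0.0, n_samples=5000)
print(samples[1000:].mean(), samples[1000:].var())        # roughly 0 and 1
```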
Gibbs sampling can be regarded as a special case of the MH algorithm. It also uses a Markov chain whose stationary distribution is the target distribution p(x), where x = {x_1, x_2, ..., x_N} is the set of variables. The sampler repeats the following steps:

(1) choose a variable x_i, either at random or according to some fixed order;
(2) compute the conditional distribution p(x_i | x_{\bar{i}}), where x_{\bar{i}} = {x_1, ..., x_{i-1}, x_{i+1}, ..., x_N} denotes the current values of all variables other than x_i;
(3) draw a new value for x_i from p(x_i | x_{\bar{i}}) and use it to replace the current value.
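As an illustration of these three steps, the following sketch runs Gibbs sampling on a two-dimensional standard normal with correlation rho, a target chosen only because its conditionals p(x_i | x_{-i}) have a simple closed form.

```python
import numpy as np

def gibbs_bivariate_normal(rho, n_samples, rng=None):
    """Gibbs sampling from a 2-D standard normal with correlation rho:
    each step resamples one coordinate from its conditional distribution."""
    rng = rng or np.random.default_rng(0)
    x1, x2, samples = 0.0, 0.0, []
    sd = np.sqrt(1.0 - rho * rho)
    for _ in range(n_samples):
        x1 = rng.normal(rho * x2, sd)   # p(x1 | x2) = N(rho * x2, 1 - rho^2)
        x2 = rng.normal(rho * x1, sd)   # p(x2 | x1) = N(rho * x1, 1 - rho^2)
        samples.append((x1, x2))
    return np.array(samples)

samples = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
print(np.corrcoef(samples[1000:].T)[0, 1])   # approximately 0.8
```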
14.5.2 Variational Inference

Variational inference approximates a complicated target distribution by a simpler one of restricted form; by optimizing within the restricted family one obtains an approximate posterior that is only locally optimal but has a deterministic, closed form.

Before describing the method we introduce a compact way of drawing graphical models, plate notation, illustrated in Figure 14.10. In Figure 14.10(a), N variables x_1, x_2, ..., x_N that all depend on another variable z are drawn individually; in Figure 14.10(b) the mutually independent variables generated by the same mechanism are placed inside a rectangle (a plate) annotated with the number of repetitions N, which is far more compact when N is large.

For the model of Figure 14.10(b), the probability density of the observed variables x is

p(\mathbf{x} \mid \Theta) = \prod_{i=1}^{N} \sum_{\mathbf{z}} p(x_i, \mathbf{z} \mid \Theta) ,   (14.29)

with log-likelihood

\ln p(\mathbf{x} \mid \Theta) = \sum_{i=1}^{N} \ln \Bigl( \sum_{\mathbf{z}} p(x_i, \mathbf{z} \mid \Theta) \Bigr) ,   (14.30)

where x = {x_1, x_2, ..., x_N} and \Theta are the parameters of the distributions followed by x and z. The inference and learning task for Figure 14.10 is to estimate the hidden variable z and the parameters \Theta from the observed x, that is, to obtain p(z | x, \Theta) and \Theta.
Parameter estimation is usually carried out by maximizing the log-likelihood (14.30), for which the EM algorithm (Section 7.6) can be used. In the E step, the current parameters \Theta^t are used to infer the hidden-variable posterior p(z | x, \Theta^t) and to evaluate the joint likelihood p(x, z | \Theta); in the M step, the function Q(\Theta; \Theta^t) obtained from the E step is maximized over \Theta:

\Theta^{t+1} = \mathop{\arg\max}_{\Theta} Q(\Theta; \Theta^t) = \mathop{\arg\max}_{\Theta} \sum_{\mathbf{z}} p(\mathbf{z} \mid \mathbf{x}, \Theta^t) \ln p(\mathbf{x}, \mathbf{z} \mid \Theta) .   (14.31)

Here Q(\Theta; \Theta^t) is the expectation of the complete-data log-likelihood \ln p(x, z | \Theta) under the distribution p(z | x, \Theta^t); when p(z | x, \Theta^t) equals the true posterior of z, Q(\Theta; \Theta^t) approximates the log-likelihood, and alternating the two steps drives EM towards a stable estimate of \Theta, from which the distribution of z can also be obtained.

Note, however, that the E step requires p(z | x, \Theta^t), which may be intractable when the model for z is complex; this is where variational inference helps. One assumes that z follows a tractable approximating distribution q(z).
g(z)
q) dz (14.33)
KL C.3.
KL (Q p) = ~
J• ) h P(
z x)
dz . (14.34)
p( z x,© )
E
z
' z
M
( 14.35 )
z
(exponential
gt (zt ) qi .
family )
Q) =
/ n ^{ lnp ( x , z )
^ In q i > dz
const
J lnp( x , z ) JJ q i d z i dzj — qj In q j d z j 4 const
^
•
( 14.35 )
( 14.40)
( 14.35 )
[lnp(x , z )] ( 14.40 )
z ( 14.38) Zj
14.6
Zj lnp(x,z ) Z
(14.40) EM
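As a concrete instance of the EM iteration (14.31), the sketch below fits a two-component univariate Gaussian mixture: the E step computes the posterior responsibilities p(z | x, Θ^t), and the M step maximizes the expected complete-data log-likelihood. All numeric settings are illustrative assumptions, not an example from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])   # toy data

# Initial parameters Theta^0 = (pi, mu, sigma) for the two components.
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # E step: responsibilities r_ik = p(z_i = k | x_i, Theta^t).
    dens = np.exp(-0.5 * ((x[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    r = pi * dens
    r /= r.sum(axis=1, keepdims=True)
    # M step: Theta^{t+1} maximizes the expected complete-data log-likelihood.
    Nk = r.sum(axis=0)
    pi = Nk / len(x)
    mu = (r * x[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)

print(pi, mu, sigma)   # roughly recovers the generating mixture
```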
14.6 Topic Models

Topic models are a family of generative directed graphical models used mainly for discrete data such as text collections, with wide application in information retrieval and natural language processing. Latent Dirichlet Allocation (LDA) is their best-known representative.

Topic models are described in terms of words, documents and topics. A word is the basic discrete unit of the data, for instance an English word in a text corpus. A document is a data object, such as a paper or a web page, consisting of a set of words whose order is ignored; this representation is known as the bag-of-words, and topic models apply to any data that can be described in this way. A topic represents a concept, given concretely by a collection of related words together with the probability of each word occurring under the concept.

Figure 14.11 illustrates these notions. Assume a data set of T documents over a vocabulary of N words, involving K topics. Each document is represented by its word-frequency vector, giving the T observed vectors W = {w_1, w_2, ..., w_T}, where the n-th component w_{t,n} of w_t \in \mathbb{R}^N is the frequency of word n in document t; each topic k is represented by a vector \beta_k \in \mathbb{R}^N whose n-th component \beta_{k,n} is the frequency of word n under topic k.
In practice the word-frequency vectors w_i (i = 1, 2, ..., T) are obtained by counting the words in the documents, but we do not observe which topics the documents discuss or in what proportions. LDA takes a generative view: each document involves several topics, with a vector \Theta_t \in \mathbb{R}^K recording the topic proportions of document t (\Theta_{t,k} is the proportion of topic k in document t), and document t is "generated" from the topics as follows:

(1) draw a topic distribution \Theta_t from a Dirichlet distribution with parameter \alpha (the Dirichlet distribution is described in Appendix C.1.6);
(2) generate each of the N word positions of the document by
   (a) drawing a topic assignment z_{t,n} for position n according to \Theta_t, and
   (b) drawing the word w_{t,n} from the word-frequency distribution \beta_k of the assigned topic.

Figure 14.11: the LDA view of documents and topics: (a) topics described by word-frequency tables; (b) a document generated by repeatedly picking a topic according to step (1) and then a word according to steps (2a)-(2b).
Figure 14.12 shows the variable relations of LDA in plate notation. The word frequencies w_{t,n} are the only observed variables; each w_{t,n} depends on the topic assignment z_{t,n} and on the word-frequency vector \beta_k of the assigned topic; the assignment z_{t,n} depends on the topic distribution \Theta_t of the document; \Theta_t depends on the Dirichlet parameter \alpha, and the topic word-frequencies \beta_k depend on the parameter \eta.

Figure 14.12: plate notation for LDA, with plates of sizes N (word positions per document), T (documents) and K (topics) over the variables \alpha, \Theta_t, z_{t,n}, w_{t,n}, \beta_k, \eta.
The probability distribution defined by LDA is therefore

p(\mathbf{W}, \mathbf{z}, \boldsymbol{\beta}, \Theta \mid \boldsymbol{\alpha}, \boldsymbol{\eta}) = \prod_{t=1}^{T} p(\Theta_t \mid \boldsymbol{\alpha}) \prod_{k=1}^{K} p(\boldsymbol{\beta}_k \mid \boldsymbol{\eta}) \Bigl( \prod_{n=1}^{N} P(w_{t,n} \mid z_{t,n}, \boldsymbol{\beta}_k)\, P(z_{t,n} \mid \Theta_t) \Bigr) ,   (14.41)

where p(\Theta_t | \alpha) and p(\beta_k | \eta) are usually taken to be Dirichlet distributions of dimension K and N with parameters \alpha and \eta respectively; for example

p(\Theta_t \mid \boldsymbol{\alpha}) = \frac{\Gamma\bigl(\sum_k \alpha_k\bigr)}{\prod_k \Gamma(\alpha_k)} \prod_k \Theta_{t,k}^{\alpha_k - 1} ,   (14.42)

where \Gamma(\cdot) is the Gamma function (see Appendix C.1.5). The hyperparameters \alpha and \eta are the quantities of (14.41) that must be determined.

Given a training corpus W = {w_1, w_2, ..., w_T}, the parameters of LDA can be estimated by maximum likelihood, that is, by finding \alpha and \eta that maximize the log-likelihood

LL(\boldsymbol{\alpha}, \boldsymbol{\eta}) = \sum_{t=1}^{T} \ln p(\mathbf{w}_t \mid \boldsymbol{\alpha}, \boldsymbol{\eta}) .   (14.43)

Because p(w_t | \alpha, \eta) is hard to compute, (14.43) cannot be solved directly, and variational approximations are used in practice.

Once the model is known, that is, once \alpha and \eta are determined, the topic structure of the corpus (\Theta_t, \beta_k and z_{t,n}) can be inferred from the word frequencies w_{t,n} by computing

p(\mathbf{z}, \boldsymbol{\beta}, \Theta \mid \mathbf{W}, \boldsymbol{\alpha}, \boldsymbol{\eta}) = \frac{p(\mathbf{W}, \mathbf{z}, \boldsymbol{\beta}, \Theta \mid \boldsymbol{\alpha}, \boldsymbol{\eta})}{p(\mathbf{W} \mid \boldsymbol{\alpha}, \boldsymbol{\eta})} .   (14.44)

The denominator p(W | \alpha, \eta) is itself intractable, so (14.44) is in practice evaluated approximately, typically by Gibbs sampling or variational inference.
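The sketch below simulates the LDA generative process, steps (1) and (2a)-(2b) above, to produce a document-word count matrix; the corpus sizes and hyperparameter values are arbitrary assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
K, T, N, doc_len = 3, 5, 20, 50      # topics, documents, vocabulary size, words per document
alpha, eta = 0.1, 0.01               # Dirichlet hyperparameters (assumed values)

beta = rng.dirichlet(eta * np.ones(N), size=K)    # topic-word distributions beta_k
W = np.zeros((T, N), dtype=int)                   # word-count vectors w_t

for t in range(T):
    theta_t = rng.dirichlet(alpha * np.ones(K))   # step (1): topic mixture Theta_t
    for _ in range(doc_len):
        z = rng.choice(K, p=theta_t)              # step (2a): topic assignment z_{t,n}
        w = rng.choice(N, p=beta[z])              # step (2b): word drawn from beta_z
        W[t, w] += 1

print(W)    # the observed document-word count matrix
```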
14.7 Further Reading

A general introduction to learning in graphical models can be found in [Jordan, 1998], and [Wainwright and Jordan, 2008] gives a thorough treatment of exponential families and variational inference. LDA was proposed in [Blei et al., 2003], and [Blei, 2012] is an accessible survey of probabilistic topic models. Non-parametric extensions of such latent-variable models (see also p.164) include the hierarchical Dirichlet process [Teh et al., 2006] and the infinite latent feature model based on the Indian buffet process [Ghahramani and Griffiths, 2006]. Probabilistic latent semantic analysis (PLSA) [Hofmann, 2001] is an earlier probabilistic topic model and can be viewed as a probabilistic counterpart of latent semantic analysis (LSA), which is based on the SVD (see p.266). Early use of stochastic simulation for inference in graphical models appears in [Pearl, 1987]; [Neal, 1993] surveys probabilistic inference with Markov chain Monte Carlo, and introductions to MCMC can be found in [Andrieu et al., 2003; Gilks et al., 1996].
Exercises

14.1
14.2
14.3
14.4
14.5
14.6
14.7 (on the Metropolis-Hastings algorithm)
14.8
14.9* (on LDA)
14.10* (on LDA)
Andrieu, C., N. De Freitas , A. Doucet , and M. I. Jordan. ( 2003). “An intro-
duction to MCMC for machine learning.” Machine Learning 50 (1-2 ) :5-43.
Blei, D. M. (2012). "Probabilistic topic models." Communications of the ACM, 55(4):77-84.
Blei, D. M., A. Ng, and M. I. Jordan. ( 2003). “Latent Dirichlet allocation.”
Journal of Machine Learning Research, 3:993-1022.
Buntine, W. (1994) . “Operations for learning with graphical models.” Journal
of Artificial Intelligence Research, 2:159-225.
Geman , S. and D. Geman. (1984) . “Stochastic relaxation , Gibbs distributions,
and the Bayesian restoration of images.” IEEE Transactions on Pattern
Analysis and Machine Intelligence , 6 ( 6) :721-741.
Ghahramani, Z. and T. L. Griffiths. ( 2006 ) . “Infinite latent feature models and
the Indian buffet process.” In Advances in Neural Information Processing
Systems 18 ( NIPS ) (Y. Weiss, B. Scholkopf , and J . C. Platt , eds.) , 475-482,
MIT Press , Cambridge, MA.
Gilks, W. R., S. Richardson, and D. J. Spiegelhalter. (1996). Markov Chain
Monte Carlo in Practice. Chapman & Hall/ CRC , Boca Raton , FL.
Gonzalez, J. E., Y. Low, and C. Guestrin. (2009). "Residual splash for optimally
parallelizing belief propagation.” In Proceedings of the 12th International
Conference on Artificial Intelligence and Statistics ( AISTATS ) , 177-184,
Clearwater Beach, FL.
Hastings, W. K. (1970 ) . “Monte Carlo sampling methods using Markov chains
and their applications." Biometrika, 57(1):97-109.
Hofmann , T. ( 2001) . “ Unsupervised learning by probabilistic latent semantic
analysis.” Machine Learning , 42 (1) :177-196.
Jordan, M. I., ed. (1998) . Learning in Graphical Models. Kluwer , Dordrecht ,
The Netherlands.
Koller , D. and N. Friedman. ( 2009 ) . Probabilistic Graphical Models: Principles
and Techniques. MIT Press , Cambridge , MA.
Kschischang, F. R., B. J . Frey, and H.-A. Loeliger. ( 2001). “ Factor graphs and
the sum-product algorithm." IEEE Transactions on Information Theory, 47
(2):498-519.
Lafferty, J. D., A. McCallum , and F. C. N. Pereira. ( 2001) . “Conditional ran-
dom fields: Probabilistic models for segmenting and labeling sequence data.”
In Proceedings of the 18th International Conference on Machine Learning
( ICML ), 282-289, Williamstown, MA.
Lauritzen, S. L. and D. J . Spiegelhalter. (1988). “Local computations with prob-
abilities on graphical structures and their application to expert systems.”
Journal of the Royal Statistical Society - Series B, 50 ( 2):157-224.
Metropolis , N., A. W. Rosenbluth, M. N. Rosenbluth , A. H. Teller , and E.
Teller. (1953). "Equation of state calculations by fast computing machines."
Journal of Chemical Physics , 21(6 ) :1087-1092.
Mooij , J . M . and H. J . Kappen. ( 2007). “Sufficient conditions for convergence
of the sum-product algorithm.” IEEE Transactions on Information Theory,
53(12 ) :4422-4437.
Murphy, K. P., Y. Weiss , and M. I. Jordan. (1999) . “Loopy belief propaga-
tion for approximate inference: An empirical study.” In Proceedings of the
15th Conference on Uncertainty in Artificial Intelligence ( UAI ) , 467-475,
Stockholm, Sweden.
Neal , R. M. (1993) . “Probabilistic inference using Markov chain Monte Carlo
methods." Technical Report CRG-TR-93-1, Department of Computer Science, University of Toronto.
Pearl, J. (1982). "Asymptotic properties of minimax trees and game-searching
procedures." In Proceedings of the 2nd National Conference on Artificial
Intelligence (AAAI), Pittsburgh, PA.
Pearl , J . (1986) . “Fusion , propagation and structuring in belief networks.”
Artificial Intelligence , 29 ( 3) :241-288.
Pearl, J . (1987). “Evidential reasoning using stochastic simulation of causal
models.” Artificial Intelligence, 32(2 ):245-258.
Pearl, J . (1988) . Probabilistic Reasoning in Intelligent Systems: Networks of
Plausible Inference. Morgan Kaufmann , San Francisco, CA.
Rabiner, L. R. (1989). "A tutorial on hidden Markov models and selected applications in speech recognition." Proceedings of the IEEE, 77(2):257-286.
Sutton , C. and A. McCallum. ( 2012 ) . “An introduction to conditional random
fields.” Foundations and Trends in Machine Learning , 4 ( 4) :267-373.
Teh , Y. W., M. I. Jordan , M. J . Beal, and D. M. Blei. ( 2006) . “Hierarchical
Dirichlet processes.” Journal of the American Statistical Association , 101
(476):1566-1581.
Wainwright , M. J. and M. I. Jordan. ( 2008) . “ Graphical models, exponential
families, and variational inference.” Foundations and Trends in Machine
Learning , 1(1- 2) :1-305.
Weiss, Y. ( 2000 ). “Correctness of local probability propagation in graphical
models with loops.” Neural Computation, 12 (1):1-41.
(Sidebar: Judea Pearl, a pioneer of probabilistic graphical models and causal reasoning. He earned a master's degree at Rutgers University and worked at RCA Research Laboratories before turning to academia. He received the ACM/AAAI Allen Newell Award and, in 2011, the Turing Award (see also Section 1.5). His son, the journalist Daniel Pearl, was kidnapped and murdered in the aftermath of "9.11". Michael I. Jordan is another leading figure in the study of graphical models.)
Chapter 15 Rule Learning

15.1 Basic Concepts

In machine learning, a rule is a logical statement of "if ..., then ..." form with clear semantics that describes regularities or domain concepts implicit in the data; rule learning is the task of learning from training data a set of rules that can be used to classify unseen instances. Formally, a rule has the form

\oplus \leftarrow f_1 \wedge f_2 \wedge \cdots \wedge f_L ,   (15.1)

where the part to the right of the implication symbol "←" is the body of the rule, stating its precondition, and the part to the left is the head, stating its conclusion. The body is a conjunction of literals f_k (the connective "∧" means "and"), each literal being a Boolean test on the attributes of an example, and L is the length of the rule; the head ⊕ is also a literal, typically denoting the target class or concept asserted by the rule. Such rules are therefore also called if-then rules. Compared with "black-box" models such as neural networks and support vector machines, rule-based models are much easier for people to understand. On the watermelon data set 2.0 (p.76, Table 4.1), for instance, one may learn rules such as "good ← (root = curled) ∧ (navel = sunken)".

An example that satisfies the body of a rule is said to be covered by it. The goal of rule learning is a rule set that covers as many examples as possible. Since different rules in the set may cover the same example yet assign it different labels, a conflict resolution strategy is needed, for example voting, ordering the rules (an ordered rule or priority rule list in which the first applicable rule fires), or meta-rules that decide which rule to trust; a default rule is usually added to handle examples covered by no rule.

In terms of expressive power, rules fall into two categories. Propositional rules are built from propositional atoms and logical connectives. First-order rules have as their basic components atomic formulae that describe properties of, or relations between, objects (for example rules such as p(X, Y) ← q(X, Z) ∧ r(Z, Y)), and they may use the quantifiers "∀" (for all) and "∃" (there exists), as in "∀X ∃Y (Y = σ(X))". Because they can express relations, first-order rules are also called relational rules; they are strictly more expressive than propositional rules.
15.2 Sequential Covering

Rule sets are most naturally learned by sequential covering: rules are induced one at a time, and each time a rule is learned, the training examples it covers are removed, after which the process is repeated on the remaining examples. Because only part of the data is processed in each round, this is also called the separate-and-conquer strategy.

Consider propositional rule learning. The simplest procedure starts from the empty rule "⊕ ←" with the positive class as the head, tries every attribute-value test as a literal to add to the body, keeps the literal that most improves the rule (for example, its accuracy on the examples it still covers), and repeats until the rule covers no negative examples. This top-down strategy starts from a general rule and gradually specializes it by adding literals; it is also called the generate-then-test approach. The opposite, bottom-up, strategy starts from a very specific rule (for example, one describing a single positive example) and gradually generalizes it by deleting literals; it is also called the data-driven approach. The top-down strategy is much more robust to noise and is the usual choice in propositional rule learning, whereas the bottom-up strategy is more common in first-order rule learning, whose hypothesis space is far more complex.
Figure 15.1 illustrates top-down rule growth for the "good melon" concept on watermelon data set 2.0 (p.80, Table 4.2). Candidate literals are compared by the accuracy n/m of the resulting rule, where m is the number of covered examples and n the number of covered positives: single literals reach accuracies such as 3/4, 2/4 or 4/6, and two-literal conjunctions reach, for example, (3/4) ∧ (1/1) or (4/6) ∧ (2/2); when accuracies tie, preference can be given to the literal that covers more examples or that comes from an earlier attribute. Growing a rule in this way, removing the examples it covers, and repeating produces a rule set for the data set.

In practice, rather than keeping only the single best literal in each round, one usually performs a beam search that keeps the best b candidate literals of the current round and expands all of them in the next round, which reduces the risk that the greedy choice misses good rules. See the sketch after this paragraph for a minimal greedy version.
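The following is a minimal sketch of greedy top-down sequential covering for propositional rules (beam width 1, accuracy as the evaluation criterion); the toy data, attribute names and labels are hypothetical, and real implementations add beam search, better evaluation measures and pruning.

```python
def learn_one_rule(data, target):
    """Grow one rule top-down: greedily add the attribute=value literal whose
    resulting rule has the highest accuracy on the examples it still covers."""
    rule, covered = {}, list(data)
    while any(y != target for _, y in covered):
        best = None
        for x, _ in covered:
            for a, v in x.items():
                if a in rule:
                    continue
                sub = [(xi, yi) for xi, yi in covered if xi.get(a) == v]
                acc = sum(yi == target for _, yi in sub) / len(sub)
                if best is None or acc > best[0]:
                    best = (acc, a, v)
        if best is None:            # no literal left to add
            break
        _, a, v = best
        rule[a] = v
        covered = [(xi, yi) for xi, yi in covered if xi.get(a) == v]
    return rule

def sequential_covering(data, target):
    rules, remaining = [], list(data)
    while any(y == target for _, y in remaining):
        rule = learn_one_rule(remaining, target)
        rules.append(rule)
        # Remove the examples covered by the new rule and repeat on the rest.
        remaining = [(x, y) for x, y in remaining
                     if not all(x.get(a) == v for a, v in rule.items())]
    return rules

# Toy data: each example is (attribute dict, label).
data = [({'color': 'green', 'root': 'curled'}, 'good'),
        ({'color': 'green', 'root': 'straight'}, 'bad'),
        ({'color': 'dark',  'root': 'curled'}, 'good'),
        ({'color': 'pale',  'root': 'straight'}, 'bad')]
print(sequential_covering(data, 'good'))   # e.g. [{'root': 'curled'}]
```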
15.3 Pruning Optimization

Rule growth is essentially a greedy search, so some mechanism is needed to alleviate the risk of overfitting; the most common one is pruning (cf. decision-tree pruning, Section 4.3). Pruning may take place while the rule is being grown (pre-pruning) or after a rule or rule set has been produced (post-pruning), usually by comparing, on some performance measure, the rule (set) before and after a literal or a rule is added or removed.

Pruning can also be guided by a statistical significance test. The CN2 algorithm, for example, requires during pre-pruning that predicting with the rule set be significantly better than guessing according to the class distribution of the training set, and it uses the likelihood ratio statistic (LRS) for this purpose. Let m_+ and m_- be the numbers of positive and negative training examples, and \hat{m}_+ and \hat{m}_- the numbers of positive and negative examples covered by the rule (set); then

\mathrm{LRS} = 2 \cdot \Bigl( \hat{m}_+ \log_2 \frac{\hat{m}_+ / (\hat{m}_+ + \hat{m}_-)}{m_+ / (m_+ + m_-)} + \hat{m}_- \log_2 \frac{\hat{m}_- / (\hat{m}_+ + \hat{m}_-)}{m_- / (m_+ + m_-)} \Bigr) .   (15.2)

The LRS is an information-theoretic quantity measuring how much the class distribution of the covered examples differs from that of the whole training set: the larger it is, the more the rule (set) differs from guessing by the overall positive/negative ratio, and the smaller it is, the more likely the rule's apparent effect is coincidental. On large real-world data sets, CN2 is usually set to stop growing the rule (set) only when the LRS is very large (for example, at the 0.99 level).
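For concreteness, the small function below evaluates (15.2); the counts in the usage line are made-up values (zero covered counts would need special handling before taking logarithms).

```python
from math import log2

def lrs(m_pos, m_neg, mhat_pos, mhat_neg):
    """Likelihood ratio statistic (15.2): compares the class distribution of the
    examples covered by a rule (set) with the overall training distribution."""
    p_hat = mhat_pos / (mhat_pos + mhat_neg)   # positive fraction among covered examples
    p     = m_pos / (m_pos + m_neg)            # positive fraction overall
    return 2 * (mhat_pos * log2(p_hat / p) +
                mhat_neg * log2((1 - p_hat) / (1 - p)))

# Example: 100 positives and 100 negatives overall; the rule covers 40 pos, 5 neg.
print(lrs(100, 100, 40, 5))
```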
Post-pruning is most often based on reduced error pruning (REP) and its more efficient descendants IREP and IREP*. The well-known RIPPER algorithm combines IREP* with a post-processing optimization of the whole rule set; its procedure (Figure 15.2) is:

Input: data set D; number of optimization rounds k.
Process:
 1: R = IREP*(D);
 2: i = 0;
 3: repeat
 4:   R_i = PostOpt(R);
 5:   D_i = NotCovered(R_i, D);
 6:   R_i' = IREP*(D_i);
 7:   R = R_i ∪ R_i';
 8:   i = i + 1;
 9: until i = k
Output: rule set R.

Figure 15.2: the RIPPER algorithm, where IREP* learns and prunes a rule set, PostOpt is the post-processing step described below, and NotCovered returns the examples of D not covered by the current rule set.
The post-processing step of RIPPER aims to improve performance further on top of pruning. For every rule r_i in the learned set R, RIPPER generates two variants:

• r_i', the replacement rule, obtained by relearning a rule with IREP* on the examples covered by r_i;
• r_i'', the revised rule, obtained by specializing r_i with additional literals and then pruning it with IREP*.

Each variant is then put together with the remaining rules of R, and the resulting rule sets are compared with the original R; the best-performing rule set is kept. The reason this works is simple: when R was first generated, the rules were learned greedily one after another, each without regard to the rules produced later, so the procedure easily gets stuck in a local optimum; re-optimizing all rules together, in the context of the whole rule set, alleviates this locality [Fürnkranz et al., 2012].
15.4 First-Order Rule Learning

Because of the limited expressive power of propositional logic, propositional rule learning cannot handle relations between objects, yet relational information is often exactly what matters. When choosing watermelons, for example, it is hard to say which value of an attribute such as colour makes a melon good in absolute terms; it is far more natural to compare melons, saying that melon 1 is better than melon 2 because its root is more curled, its navel more sunken, and so on. Expressing such knowledge requires predicates over pairs of examples, which propositional rules cannot do, so first-order rule learning is needed.

To this end the attribute-value data are converted into relational facts between examples. Applying pairwise comparisons such as "root more curled (X, Y)", "sound duller (X, Y)", "navel more sunken (X, Y)" and the target relation "better (X, Y)" to the samples of watermelon data set 2.0 produces watermelon data set 5.0 (Table 15.1), which consists of ground atoms over pairs of sample numbers such as (1, 10), (1, 15), (2, 6) and (17, 16).

Table 15.1: watermelon data set 5.0, listing the ground comparison facts and the (positive and negative) "better" facts over pairs of sample numbers.

Data of this kind are called relational data. The atoms derived from the original attributes (the pairwise comparison facts) constitute the background knowledge, and the atoms derived from the class labels (the facts about which melon is better, and their negations) constitute the examples of the relational learning task.
On such data a first-order rule of the form (15.1) can directly express concepts such as

better(X, Y) ← root_more_curled(X, Y) ∧ navel_more_sunken(X, Y) ,

where X and Y are logic variables ranging over the samples. Rules of this kind are far more expressive than propositional rules, and they make it easy to incorporate domain knowledge stated as predicates or as constraints relating them.

FOIL (First-Order Inductive Learner) is a well-known first-order rule learning algorithm. It follows the sequential covering framework and grows rules top-down: starting from a rule with empty body whose head is the target concept, say "better(X, Y) ←", it repeatedly considers candidate literals built from the available predicates and variable combinations such as (X, Z), (Z, Y) or (X, Y), adds the best one to the body, and continues until the rule covers no negative examples; the covered positive examples are then removed and the next rule is learned.
FOIL selects candidate literals by the FOIL gain

\mathrm{F\_Gain} = \hat{m}_+ \times \Bigl( \log_2 \frac{\hat{m}_+}{\hat{m}_+ + \hat{m}_-} - \log_2 \frac{m_+}{m_+ + m_-} \Bigr) ,   (15.3)

where \hat{m}_+ and \hat{m}_- are the numbers of positive and negative examples covered by the new rule after the candidate literal is added, and m_+ and m_- are the numbers covered by the original rule. Unlike the information gain used for decision trees (Section 4.2.1), FOIL gain considers only the information about the positive examples and weights it by the number of positives covered by the new rule, because in relational data the positive examples are usually far fewer than the negatives and deserve more attention. If pruning is needed, FOIL can apply criteria similar to those of the propositional case. Overall, FOIL can be regarded as a transition between propositional rule learning and inductive logic programming: it uses a first-order representation, but its rules do not allow function symbols or nested logical expressions.
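A small function evaluating (15.3) is given below; the coverage counts in the usage line are illustrative, not numbers from the text.

```python
from math import log2

def foil_gain(m_pos, m_neg, mhat_pos, mhat_neg):
    """FOIL gain (15.3): information gained about the positive examples when a
    candidate literal is added, weighted by the positives still covered."""
    return mhat_pos * (log2(mhat_pos / (mhat_pos + mhat_neg)) -
                       log2(m_pos / (m_pos + m_neg)))

# The rule covers 6 pos / 6 neg; after adding the literal it covers 4 pos / 1 neg.
print(foil_gain(6, 6, 4, 1))   # 4 * (log2(4/5) - log2(1/2)) ~= 2.71
```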
15.5 Inductive Logic Programming

Inductive logic programming (ILP) adds function symbols and nested logical expressions to first-order rule learning. On the one hand this gives the learner much stronger expressive power; on the other hand, ILP can be seen as using machine learning to induce logic programs from examples and background knowledge, and the learned rules can be used directly by logic programming languages such as PROLOG.

The added expressiveness, however, creates severe computational difficulties. Given a unary predicate P and a unary function f, one can already form infinitely many literals P(X), P(f(X)), P(f(f(X))), ..., so the candidate atoms during rule growth cannot be enumerated and the top-down generate-then-test strategy of FOIL breaks down; even evaluating the FOIL gain becomes infeasible, since it requires counting the positive and negative examples covered by every candidate rule.
15.5.1 Least General Generalization

ILP therefore adopts a bottom-up strategy: it starts directly from the grounded facts corresponding to one or more positive examples and generalizes the resulting most specific rules step by step so that they cover more examples; generalization may replace constants by logic variables or delete literals from the rule body.

Take watermelon data set 5.0 and the positive examples "better(1, 10)" and "better(1, 15)". The corresponding initial rules simply list as their bodies all background facts that hold for the pair in question, for example

r_1: better(1, 10) ← root_more_curled(1, 10) ∧ sound_duller(1, 10) ∧ navel_more_sunken(1, 10),
r_2: better(1, 15) ← root_more_curled(1, 15) ∧ navel_more_sunken(1, 15).

These rules hold only for the specific pairs (1, 10) and (1, 15); we would like to generalize them into a rule that applies to more pairs. The standard tool is the least general generalization (LGG).

Given two first-order formulae r_1 and r_2, LGG first picks out literals with the same predicate and then compares the terms at each argument position: identical terms are kept, written LGG(t, t) = t; differing terms s and t are replaced by a new variable V, written LGG(s, t) = V, and the same replacement is applied at every other position where the pair (s, t) occurs. For "root_more_curled(1, 10)" and "root_more_curled(1, 15)", the constants 10 and 15 differ, so both are replaced by a new variable Y, giving root_more_curled(1, Y); treating the heads and the remaining comparable literals in the same way yields

better(1, Y) ← root_more_curled(1, Y) ∧ navel_more_sunken(1, Y)   (15.4)

as the LGG of r_1 and r_2. Literals without a counterpart in the other rule, such as sound_duller(1, 10) above, cannot be generalized and are dropped: the LGG keeps only literals that appear (up to the variable replacement) in both rules. Generalizing the remaining constant 1 to a variable X in the same way gives

better(X, Y) ← root_more_curled(X, Y) ∧ navel_more_sunken(X, Y) .   (15.5)

[Lavrac and Dzeroski, 1993] show that the LGG is the most specific formula that generalizes both r_1 and r_2: no other generalization of r_1 and r_2 is more specific than their LGG. Background knowledge K can be taken into account as well, by computing the generalization relative to it, which leads to the relative least general generalization (RLGG) [Plotkin, 1971], in which an example e is first combined with K (as e ← K) before the LGG is taken. LGG is one of the most basic techniques of ILP.
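The sketch below computes the LGG of the bodies of two ground rules represented as lists of (predicate, argument-tuple) atoms; the predicate names are hypothetical English stand-ins for the comparison relations of the example, and literals with no counterpart are dropped as described above.

```python
def lgg_atoms(a1, a2, var_map):
    """LGG of two atoms with the same predicate: equal terms are kept
    (LGG(t, t) = t); each differing pair of terms is consistently replaced
    by a shared new variable (LGG(s, t) = V)."""
    pred1, args1 = a1
    pred2, args2 = a2
    if pred1 != pred2 or len(args1) != len(args2):
        return None
    out = []
    for s, t in zip(args1, args2):
        if s == t:
            out.append(s)
        else:
            if (s, t) not in var_map:
                var_map[(s, t)] = f"V{len(var_map)}"
            out.append(var_map[(s, t)])
    return (pred1, tuple(out))

def lgg_rule_bodies(body1, body2):
    """Keep only literal pairs that admit an LGG; the rest are dropped."""
    var_map, out = {}, []
    for lit1 in body1:
        for lit2 in body2:
            g = lgg_atoms(lit1, lit2, var_map)
            if g is not None:
                out.append(g)
    return out

r1 = [("root_more_curled", ("1", "10")), ("sound_duller", ("1", "10"))]
r2 = [("root_more_curled", ("1", "15"))]
print(lgg_rule_bodies(r1, r2))   # [('root_more_curled', ('1', 'V0'))]
```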
15.5.2
deduction) induction)
AM
w. s.
Jevons J . A. Robinson
-
Q C2 L2;
L L2 , C1 = A W L , C2 = B W ^ L .
L C = AVB.
( AW B) - { B} = A , (15.6)
15.3
C = (Ci - { L})
c . c2 .
c =\
(C2 { L})
- (15.7)
(15.8)
B ^L
~
AVB
15.3
C Q
C/i # C (15.7),
C2 = ( C - ( C l - { L} ) ) L}. (15.9)
For inductive learning, [Muggleton, 1995] defined four complete inverse-resolution operations. Writing rules in fraction form, with the given rules above the line and the derived rules below it, and letting p, q, r be atomic predicates and A, B, C conjunctions:

absorption:   \dfrac{p \leftarrow A \wedge B \qquad q \leftarrow A}{p \leftarrow q \wedge B \qquad q \leftarrow A}   (15.10)

identification:   \dfrac{p \leftarrow A \wedge B \qquad p \leftarrow A \wedge q}{q \leftarrow B \qquad p \leftarrow A \wedge q}   (15.11)

intra-construction:   \dfrac{p \leftarrow A \wedge B \qquad p \leftarrow A \wedge C}{q \leftarrow B \qquad p \leftarrow A \wedge q \qquad q \leftarrow C}   (15.12)

inter-construction:   \dfrac{p \leftarrow A \wedge B \qquad q \leftarrow A \wedge C}{p \leftarrow r \wedge B \qquad r \leftarrow A \qquad q \leftarrow r \wedge C}   (15.13)

Note that the predicate q below the line in (15.12) and the predicate r in (15.13) do not occur in the given rules; they are newly introduced.
Applying these operations in first-order logic requires handling variables, which is done with substitution and unification.

A substitution replaces variables in a logical expression by terms. For example, applying θ = {1/X, 2/Y} to an expression C over the variables X and Y yields C' = Cθ, in which X and Y have been replaced by 1 and 2; the set of variables being replaced, here {X, Y}, is called the domain of θ. As in algebra, substitutions can be composed and inverted: applying θ = {Y/X} and then λ = {1/Y} is written θ∘λ, and the inverse of θ = {1/Y, 2/X} is θ^{-1} = {Y/1, X/2}.

Unification uses substitution to make two or more logical expressions identical. For two atoms of the same predicate, A = p(1, X) and B = p(Y, 2), the substitution θ = {2/X, 1/Y} gives Aθ = Bθ = p(1, 2); A and B are then said to be unifiable, and θ is a unifier of A and B. A unifier δ of a set W of expressions is a most general unifier (MGU) of W if for every unifier θ of W there is a substitution λ such that θ = δ∘λ. For example, p(1, Y) and p(X, Y) are unified by θ_1 = {1/X}, by θ_2 = {1/X, 2/Y} and by θ_3 = {1/Z, Z/X}, and θ_1 is their MGU.
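The sketch below computes an MGU for two atoms whose arguments are constants or variables (no nested function terms, for simplicity); the predicate name "darker" is a hypothetical stand-in, the convention that capitalized strings are variables is an assumption, and the returned dictionary maps each variable to its term, so {'Y': '1', 'X': '2'} corresponds to the text's notation {2/X, 1/Y}.

```python
def is_var(t):
    # Convention (assumed): variables are capitalized strings such as 'X', 'Y'.
    return isinstance(t, str) and t[:1].isupper()

def substitute(term, theta):
    return theta.get(term, term) if is_var(term) else term

def mgu(a1, a2):
    """Most general unifier of two atoms (predicate, args) over constants and variables."""
    (p1, args1), (p2, args2) = a1, a2
    if p1 != p2 or len(args1) != len(args2):
        return None
    theta = {}
    for s, t in zip(args1, args2):
        s, t = substitute(s, theta), substitute(t, theta)
        if s == t:
            continue
        if is_var(s):
            theta = {v: (t if x == s else x) for v, x in theta.items()}
            theta[s] = t
        elif is_var(t):
            theta = {v: (s if x == t else x) for v, x in theta.items()}
            theta[t] = s
        else:
            return None            # two different constants cannot be unified
    return theta

# Unify darker(1, X) with darker(Y, 2): the MGU is {Y: '1', X: '2'}.
print(mgu(("darker", ("1", "X")), ("darker", ("Y", "2"))))
```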
Resolution and inverse resolution in first-order logic rely on unification to extend the deletion operation. Let C_1 and C_2 be first-order clauses containing literals L_1 and L_2 respectively. If there is a unifying substitution θ such that L_1θ = ¬L_2θ, then the resolvent is

C = (C_1 - \{L_1\})\theta \vee (C_2 - \{L_2\})\theta .   (15.14)

By analogy with (15.8) and (15.9), C_1 − {L_1} and C_2 − {L_2} are called the resolution quotients of C with respect to C_2 and C_1, and the goal of inverse resolution is to find C_2 given C and C_1. The literal L_1 chosen from C_1 must satisfy

(C_1 - \{L_1\}) \vdash C ,   (15.15)

where "⊢" denotes logical entailment (A ⊢ C means that C is a logical consequence of A; for example, if C = A ∨ B then A ⊢ C). Let θ_1 be a substitution on the variables of C_1 that makes (C_1 − {L_1})θ_1 coincide with the corresponding part of C, and θ_2 a substitution on the variables of L_2 in C_2; θ_1 and θ_2 act jointly on L_1 so that ¬L_1θ_1θ_2 = L_2θ_2, i.e. θ_1∘θ_2 is an MGU of ¬L_1 and L_2. Then, analogously to (15.9), the missing clause is

C_2 = \bigl(C - (C_1 - \{L_1\})\theta_1\bigr)\theta_2^{-1} \vee \{\neg L_1\theta_1\theta_2^{-1}\} .   (15.16)
As an illustration on watermelon data set 5.0, suppose two learned rules have the forms "p ← A ∧ B" and "p ← A ∧ C", for example two rules for the "better" relation that share one comparison literal in their bodies but differ in the other. They match the premises of intra-construction (15.12), with the two rules playing the roles of C_1 and C_2, so the operation produces a rule whose body uses a newly invented predicate, say q(M, N), together with rules defining q. To obtain a defining rule for q, inverse resolution is applied: choosing the resolution quotient and computing the required substitutions (for instance θ_1 = {1/M, X/N} and θ_2 = {X/S}, whose composition is an MGU of the literals q(M, N) and q(1, S)) yields by (15.16) a clause that defines q over the pair (1, S) in terms of one of the original comparison relations. The invented predicate q may correspond to a concept that appears neither among the original attributes nor in the background knowledge; its meaning is settled as the learned rules are further used and refined. This ability to invent new predicates, and with them new concepts, is an important characteristic of ILP.
15.6 Further Reading

Rule learning is the main representative of symbolism learning and one of the earliest machine learning technologies [Michalski, 1983]; [Fürnkranz et al., 2012] is an excellent modern reference on the foundations of rule learning. The AQ family [Michalski, 1969], whose name comes from the covering procedure Algorithm Quasi-optimal that it employs, is among the earliest rule learning approaches and developed into many variants such as AQ15 [Michalski et al., 1986] and AQ17-HCI [Wnek and Michalski, 1994]. PRISM [Cendrowska, 1987] is another classical covering-based rule learner, and an implementation is available in WEKA.

Combining probabilistic models with rule and logic learning has led to probabilistic inductive logic programming [De Raedt et al., 2008]. Related lines of work include relational Bayesian networks [Jaeger, 2002], probabilistic relational models [Friedman et al., 1999], Bayesian logic programs [Kersting et al., 2000] and Markov logic networks [Richardson and Domingos, 2006]; collectively these are referred to as statistical relational learning [Getoor and Taskar, 2007].
Exercises

15.1 (on watermelon data set 2.0, p.76, Table 4.1)
15.2
15.3
15.4 (on watermelon data set 2.0a, p.86)
15.5 (on RIPPER; watermelon data set 5.0)
15.6
15.7 (given rules r_1 and r_2, compute their LGG)
15.8
15.9* (given atoms of the form P(t_1, t_2, ..., t_n) over a set {E_1, E_2, ..., E_n}, design an algorithm to compute their MGU)
15.10*
Bratko, I. and S. Muggleton. (1995). "Applications of inductive logic programming." Communications of the ACM, 38(11):65-70.
Brunk, C. A. and M. J . Pazzani. (1991). “An investigation of noise-tolerant re-
lational concept learning algorithms.” In Proceedings of the 8th International
Workshop on Machine Learning ( IWML ) , 389-393, Evanston, IL.
Carlson , A., J. Betteridge, B. Kisiel, B. Settles, E. R. Hruschka, and T. M.
Mitchell. ( 2010). “Toward an architecture for never-ending language learn-
ing.” In Proceedings of the 24th AAAI Conference on Artificial Intelligence
( AAAI ), 1306-1313, Atlanta, GA.
Cendrowska, J. (1987). "PRISM: An algorithm for inducing modular rules."
International Journal of Man-Machine Studies, 27(4):349-370.
Ceri, S., G. Gottlob, and L. Tanca. (1989) . “What you always wanted to know
about Datalog (and never dared to ask ).’’ IEEE Transactions on Knowledge
and Data Engineering , 1(1):146-166.
Clark, P. and T. Niblett . (1989 ) . “The CN2 induction algorithm.” Machine
Learning 3( 4):261-283.
Cohen, W. W. (1995). “Fast effective rule induction.” In Proceedings of the 12th
International Conference on Machine Learning ( ICML ) , 115-123, Tahoe,
CA.
De Raedt , L., P. Frasconi, K. Kersting, and S. Muggleton, eds. ( 2008) . Prob-
abilistic Inductive Logic Programming: Theory and Applications. Springer ,
Berlin.
Friedman, N., L. Getoor, D. Koller, and A. Pfeffer. (1999). "Learning prob-
abilistic relational models.” In Proceedings of the 16th International Joint
Conference on Artificial Intelligence ( IJCAI ), 1300-1307, Stockholm , Swe-
den.
Fürnkranz, J. (1994). "Top-down pruning in relational learning." In Proceed-
ings of the 11th European Conference on Artificial Intelligence ( ECAI ) , 453-
457, Amsterdam, The Netherlands.
Fürnkranz, J., D. Gamberger, and N. Lavrac. (2012). Foundations of Rule
Learning. Springer , Berlin.
Fürnkranz, J. and G. Widmer. (1994). "Incremental reduced error pruning."
In Proceedings of the 11th International Conference on Machine Learning
( ICML ), 70-77, New Brunswick , NJ.
Getoor , L. and B. Taskar. (2007) . Introduction to Statistical Relational Learn-
ing. MIT Press, Cambridge, MA.
Jaeger, M. (2002). "Relational Bayesian networks: A survey." Electronic Transactions on Artificial Intelligence, 6: Article 15.
Kersting, K., L. De Raedt , and S. Kramer. (2000). “Interpreting Bayesian logic
programs." In Proceedings of the AAAI-2000 Workshop on Learning Statis-
tical Models from Relational Data, 29-35, Austin, TX.
Lavrac, N. and S. Dzeroski. (1993). Inductive Logic Programming: Techniques
and Applications. Ellis Horwood , New York , NY.
Michalski, R. S. (1969). “On the quasi-minimal solution of the general covering
problem.” In Proceedings of the 5th International Symposium on Information
Processing ( FCIP ), volume A3, 125-128, Bled, Yugoslavia.
Michalski, R. S. (1983). “A theory and methodology of inductive learning.” In
Machine Learning: An Artificial Intelligence Approach ( R. S. Michalski, J.
Carbonell, and T. Mitchell, eds.) , 111-161, Tioga, Palo Alto, CA.
Michalski, R. S., I. Mozetic, J . Hong, and N. Lavrac. (1986) . “The multi-purpose
incremental learning system AQ15 and its testing application to three med-
ical domains.” In Proceedings of the 5th National Conference on Artificial
Intelligence ( AAAI ) , 1041-1045, Philadelphia, PA.
Muggleton, S. (1991). "Inductive logic programming." New Generation Computing, 8(4):295-318.
Muggleton , S., ed. (1992) . Inductive Logic Programming. Academic Press, Lon-
don, UK .
Muggleton, S. (1995) . “Inverse entailment and Progol.” New Generation Com-
puting , 13(3-4):245-286.
Muggleton, S. and W. Buntine. (1988). "Machine invention of first-order predicates by inverting resolution." In Proceedings of the 5th International Workshop on Machine Learning (IWML), 339-352, Ann Arbor, MI.
Muggleton, S. and C. Feng. (1990) . “Efficient induction of logic programs.”
In Proceedings of the 1st International Workshop on Algorithmic Learning
Theory ( ALT ) , 368-381 Tokyo, Japan.
Muggleton , S. and D. Lin. (2013) . “Meta-interpretive learning of higher-order
dyadic datalog: Predicate invention revisited.” In Proceedings of the 23rd In-
ternational Joint Conference on Artificial Intelligence ( IJCAI ), 1551-1557,
Beijing, China.
Plotkin, G. D. (1970). "A note on inductive generalization." In Machine Intelligence 5 (B. Meltzer and D. Michie, eds.), 153-165, Edinburgh University
Press, Edinburgh , Scotland.
Plotkin , G. D. (1971). “A further note on inductive generalization.” In Ma-
chine Intelligence 6 (B. Meltzer and D. Michie, eds.), 107-124, Edinburgh
University Press, Edinburgh , Scotland.
Quinlan, J. R. (1990) . “Learning logical definitions from relations.” Machine
Learning , 5(3):239-266.
Richardson, M. and P. Domingos. (2006) . “Markov logic networks.” Machine
Learning , 62 (1-2 ):107-136.
Robinson, J . A. (1965). “A machine-oriented logic based on the resolution prin-
ciple.” Journal of the ACM , 12 (1):23-41.
Srinivasan, A. (1999). "The Aleph manual." http://www.cs.ox.ac.uk/activities/machlearn/Aleph/aleph.html.
Winston , P. H. (1970) . Learning structural descriptions from examples. Ph.D.
thesis, Department of Electrical Engineering, MIT, Cambridge, MA.
Wnek, J. and R. S. Michalski. (1994). "Hypothesis-driven constructive induction in AQ17-HCI: A method and experiments." Machine Learning, 14(2):139-168.