
Bayesian Models and Logistic Regression

Probability theory as basis for the construction of classifiers:

• Multivariate probabilistic models
• Independence assumptions
• Naive Bayesian classifier
• Forest-augmented networks (FANs)
• Approximation: logistic regression

Notation

• Random (= statistical = stochastic) variable: upper-case letter, e.g. V or X, or upper-case string, e.g. RAIN
• Binary variables: take one of two values, X = true (abbreviated x) and X = false (abbreviated ¬x)
• Conjunctions: (X = x) ∧ (Y = y) is written as X = x, Y = y
• Templates: X, Y means X = x, Y = y for any value x, y, i.e. the choice of the values x and y does not really matter
• Σ_X P(X) = P(x) + P(¬x), where X is binary

Joint probability distribution

Joint (= multivariate) distribution:

P(X1, X2, . . . , Xn)

Example of a joint probability distribution P(X1, X2, X3):

P(x1, x2, x3) = 0.1
P(¬x1, x2, x3) = 0.05
P(x1, ¬x2, x3) = 0.10
P(x1, x2, ¬x3) = 0.0
P(¬x1, ¬x2, x3) = 0.3
P(x1, ¬x2, ¬x3) = 0.2
P(¬x1, x2, ¬x3) = 0.1
P(¬x1, ¬x2, ¬x3) = 0.15

Note that Σ_{X1,X2,X3} P(X1, X2, X3) = 1.

Marginalisation:

P(x3) = Σ_{X1,X2} P(X1, X2, x3) = 0.55

Chain rule

Definition of conditional probability distribution:

P(X1 | X2, . . . , Xn) = P(X1, X2, . . . , Xn) / P(X2, . . . , Xn)

with:

⇒ P(X1, X2, . . . , Xn) = P(X1 | X2, . . . , Xn) P(X2, . . . , Xn)

Furthermore,

P(X2, . . . , Xn) = P(X2 | X3, . . . , Xn) P(X3, . . . , Xn)
. . .
P(Xn−1, Xn) = P(Xn−1 | Xn) P(Xn)
P(Xn) = P(Xn)

Chain rule yields factorisation:

P(⋀_{i=1}^{n} Xi) = ∏_{i=1}^{n} P(Xi | ⋀_{k=i+1}^{n} Xk)
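As a minimal Python illustration of the example above, the joint distribution can be stored as a dictionary, checked to sum to 1, and marginalised to reproduce P(x3) = 0.55:

# Minimal sketch: the example joint distribution over three binary variables.
joint = {  # keys are (X1, X2, X3) with True for x and False for ¬x
    (True,  True,  True):  0.10, (False, True,  True):  0.05,
    (True,  False, True):  0.10, (True,  True,  False): 0.00,
    (False, False, True):  0.30, (True,  False, False): 0.20,
    (False, True,  False): 0.10, (False, False, False): 0.15,
}

assert abs(sum(joint.values()) - 1.0) < 1e-12              # total probability is 1
p_x3 = sum(p for (x1, x2, x3), p in joint.items() if x3)   # marginalise out X1, X2
print(p_x3)                                                # 0.55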
Chain rule – digraph

[Figure: two digraphs over X1, X2, X3, corresponding to factorisations (1) and (2).]

Factorisation (1):

P(X1, X2, X3) = P(X1 | X2, X3) P(X2 | X3) P(X3)

Other factorisation (2):

P(X1, X2, X3) = P(X2 | X1, X3) P(X1 | X3) P(X3)

⇒ different factorisations possible

Does the chain rule help?

P(X1, X2, X3) = P(X1 | X2, X3) P(X2 | X3) P(X3)

i.e. we need:

P(x1 | x2, x3)
P(x1 | ¬x2, x3)
P(x1 | x2, ¬x3)
P(x1 | ¬x2, ¬x3)
P(x2 | x3)
P(x2 | ¬x3)
P(x3)

Note P(¬x1 | x2, x3) = 1 − P(x1 | x2, x3), etc.

⇒ 7 probabilities required (as for P(X1, X2, X3))

Use stochastic independence

P(X1, X2, X3) = P(X2 | X1, X3) P(X3 | X1) P(X1)

Assume that X2 and X3 are conditionally independent given X1:

P(X2 | X1, X3) = P(X2 | X1)

and

P(X3 | X1, X2) = P(X3 | X1)

Notation: X2 ⊥⊥ X3 | X1, X3 ⊥⊥ X2 | X1

[Figure: digraph with X1 and arcs X1 → X2 and X1 → X3.]

Only 5 = 2 + 2 + 1 probabilities (instead of 7)

Definition Bayesian Network (BN)

A Bayesian network B is a pair B = (G, P), where:

• G = (V(G), A(G)) is an acyclic directed graph, with
  – V(G) = {X1, X2, . . . , Xn}, a set of vertices (nodes)
  – A(G) ⊆ V(G) × V(G), a set of arcs
• P : ℘(V(G)) → [0, 1] is a joint probability distribution, such that

  P(V(G)) = ∏_{i=1}^{n} P(Xi | πG(Xi))

  where πG(Xi) denotes the set of immediate ancestors (parents) of vertex Xi in G
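A minimal Python sketch (the numbers are made up) of such a factorisation: with X2 ⊥⊥ X3 | X1, the joint P(X1, X2, X3) = P(X2 | X1) P(X3 | X1) P(X1) is fully defined by five probabilities.

# Minimal sketch: five numbers define the full joint when X2 ⊥⊥ X3 | X1.
from itertools import product

p_x1 = 0.3                                  # P(x1)
p_x2_given_x1 = {True: 0.9, False: 0.4}     # P(x2 | X1)
p_x3_given_x1 = {True: 0.7, False: 0.2}     # P(x3 | X1)

def joint(x1: bool, x2: bool, x3: bool) -> float:
    p1 = p_x1 if x1 else 1 - p_x1
    p2 = p_x2_given_x1[x1] if x2 else 1 - p_x2_given_x1[x1]
    p3 = p_x3_given_x1[x1] if x3 else 1 - p_x3_given_x1[x1]
    return p2 * p3 * p1

# The eight joint probabilities still sum to 1, although only 5 = 2 + 2 + 1
# numbers were specified.
assert abs(sum(joint(*v) for v in product([True, False], repeat=3)) - 1.0) < 1e-12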


Example Bayesian network

Joint probability distribution: P(FL, PN, MY, FE, TEMP)

Variables: Flu (FL) (yes/no), Pneumonia (PN) (yes/no), Myalgia (MY) (yes/no), Fever (FE) (yes/no), TEMP (≤ 37.5 / > 37.5)

[Figure: network over these variables; the conditional probability tables below imply the arcs FL → MY, FL → FE, PN → FE and FE → TEMP.]

P(FL = y) = 0.1
P(PN = y) = 0.05
P(MY = y | FL = y) = 0.96
P(MY = y | FL = n) = 0.20
P(FE = y | FL = y, PN = y) = 0.95
P(FE = y | FL = n, PN = y) = 0.80
P(FE = y | FL = y, PN = n) = 0.88
P(FE = y | FL = n, PN = n) = 0.001
P(TEMP ≤ 37.5 | FE = y) = 0.1
P(TEMP ≤ 37.5 | FE = n) = 0.99

Reasoning: evidence propagation

[Figure: three snapshots of the network with probability bars for MYALGIA, FLU, FEVER, PNEUMONIA and TEMP:]

• Nothing known
• Temperature > 37.5 °C
• Likely symptoms of the flu?
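For a network this small, posterior probabilities such as P(FL = y | TEMP > 37.5) can be reproduced by brute-force enumeration over the joint distribution. A minimal Python sketch (not the evidence-propagation algorithm itself), using the conditional probability tables above:

# Minimal inference-by-enumeration sketch for the flu/pneumonia network.
from itertools import product

P_FL = {True: 0.1, False: 0.9}                     # P(FL)
P_PN = {True: 0.05, False: 0.95}                   # P(PN)
P_MY = {True: 0.96, False: 0.20}                   # P(MY = y | FL)
P_FE = {(True, True): 0.95, (False, True): 0.80,   # P(FE = y | FL, PN)
        (True, False): 0.88, (False, False): 0.001}
P_LOW = {True: 0.1, False: 0.99}                   # P(TEMP <= 37.5 | FE)

def joint(fl, pn, my, fe, high_temp):
    p = P_FL[fl] * P_PN[pn]
    p *= P_MY[fl] if my else 1 - P_MY[fl]
    p *= P_FE[fl, pn] if fe else 1 - P_FE[fl, pn]
    p *= (1 - P_LOW[fe]) if high_temp else P_LOW[fe]
    return p

# P(FL = y | TEMP > 37.5): sum out the unobserved variables PN, MY and FE.
num = sum(joint(True, pn, my, fe, True)
          for pn, my, fe in product([True, False], repeat=3))
den = sum(joint(fl, pn, my, fe, True)
          for fl, pn, my, fe in product([True, False], repeat=4))
print(num / den)   # roughly 0.66, up from the prior P(FL = y) = 0.1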

Bayesian network structure learning

Bayesian network B = (G, P), with

• digraph G = (V(G), A(G)), and
• probability distribution

  P(V) = ∏_{X∈V(G)} P(X | π(X))

Spectrum:

[Figure: spectrum from naive Bayesian network, via tree-augmented Bayesian network (TAN), to general Bayesian networks; restricted structure learning versus unrestricted structure learning.]

Special form Bayesian networks

Problems:

• for many BNs too many probabilities have to be assessed
• complex BNs do not necessarily yield better classifiers
• complex BNs may yield better estimates of the probability distribution

Solution: use simple probabilistic models for classification:

• naive (independent) form BN
• Tree-Augmented Bayesian Network (TAN)
• Forest-Augmented Bayesian Network (FAN)
Naive (independent) form BN

[Figure: class variable C with arcs to evidence variables E1, E2, . . . , Em.]

• C is a class variable
• The evidence variables Ei in the evidence E ⊆ {E1, . . . , Em} are conditionally independent given the class variable C

This yields, using Bayes' rule:

P(C | E) = P(E | C) P(C) / P(E)

with, as Ei ⊥⊥ Ej | C for i ≠ j:

P(E | C) = ∏_{E∈E} P(E | C)      (by conditional independence)

P(E) = Σ_C P(E | C) P(C)         (marginalisation and conditioning)

Classifier: cmax = arg max_C P(C | E)

Example of naive Bayes

[Figure: class variable Disease (pneu/flu) with children Myalgia (y/n), Fever (y/n) and TEMP (≤ 37.5 / > 37.5).]

P(myalgia | flu) = 0.96
P(myalgia | pneu) = 0.28
P(fever | flu) = 0.88
P(fever | pneu) = 0.82
P(TEMP ≤ 37.5 | flu) = 0.20
P(TEMP ≤ 37.5 | pneu) = 0.26
P(flu) = 0.67
P(pneu) = 0.33

Evidence: E = {TEMP > 37.5}; computation of the probability of flu using Bayes' rule:

P(flu | TEMP > 37.5) = P(TEMP > 37.5 | flu) P(flu) / P(TEMP > 37.5)

P(TEMP > 37.5) = P(TEMP > 37.5 | flu) P(flu) + P(TEMP > 37.5 | pneu) P(pneu)
               = 0.8 · 0.67 + 0.74 · 0.33 ≈ 0.78

⇒ P(flu | TEMP > 37.5) = 0.8 · 0.67 / 0.78 ≈ 0.687
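A minimal Python sketch of this naive Bayes computation, using the probabilities from the example (one piece of evidence, TEMP > 37.5, and two classes):

# Minimal sketch of the naive Bayes example above.
prior = {"flu": 0.67, "pneu": 0.33}
p_high_temp = {"flu": 1 - 0.20, "pneu": 1 - 0.26}   # P(TEMP > 37.5 | class)

# Bayes' rule: P(class | E) = P(E | class) P(class) / P(E),
# with P(E) obtained by marginalisation over the class variable.
p_evidence = sum(p_high_temp[c] * prior[c] for c in prior)            # ≈ 0.78
posterior = {c: p_high_temp[c] * prior[c] / p_evidence for c in prior}

print(round(posterior["flu"], 3))          # ≈ 0.687, matching the slide
print(max(posterior, key=posterior.get))   # classifier: arg max_C P(C | E)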

Computing probabilities from data

Compute the weighted average of

• the estimate P̂_D(V | π(V)) of the conditional probability distribution for variable V, based on the dataset D
• a Dirichlet prior Θ, which reflects prior knowledge

These are combined as follows:

P_D(V | π(V)) = n/(n + n0) · P̂_D(V | π(V)) + n0/(n + n0) · Θ

where

• n is the size of the dataset D
• n0 is the estimated size of the (virtual) 'dataset' on which the prior knowledge is based (equivalence sample size)

More complex Bayesian networks

• We want to add arcs to a naive Bayesian network to improve its performance
• Result: possibly a TAN

[Figure: naive Bayes structure with class C and evidence variables E1, E2, . . . , Em; a candidate arc between two evidence variables E and E′.]

• Which arc should be added? Compute the mutual information between variables E, E′ conditioned on the class variable C:

I_P(E, E′ | C) = Σ_{E,E′,C} P(E, E′, C) · log [ P(E, E′ | C) / ( P(E | C) P(E′ | C) ) ]
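A minimal Python sketch (the numbers are made up) of the weighted average of the data-based estimate and the Dirichlet prior, with n the dataset size and n0 the equivalence sample size:

# Minimal sketch of the weighted average P_D(V | pi(V)).
def combine(p_data: float, theta: float, n: int, n0: int) -> float:
    """P_D(V | pi(V)) = n/(n+n0) * P^_D(V | pi(V)) + n0/(n+n0) * Theta."""
    return (n / (n + n0)) * p_data + (n0 / (n + n0)) * theta

# e.g. a data estimate of 0.8 from 10 cases against a prior belief of 0.5
# that is worth n0 = 5 virtual cases:
print(combine(p_data=0.8, theta=0.5, n=10, n0=5))   # 0.7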
FAN algorithm

Choose k ≥ 0. Given evidence variables Ei, a class variable C, and a dataset D:

1. Compute the mutual information −I_P(Ei, Ej | C) for all (Ei, Ej), i ≠ j, in a complete undirected graph
2. Construct a minimum-cost spanning forest containing exactly k edges
3. Change each tree in the forest into a directed tree
4. Add an arc from the class vertex C to every evidence vertex Ei in the forest
5. Learn conditional probability distributions from D using Dirichlet distributions

Performance evaluation

• Success rate σ based on:

  cmax = arg max_c {P(c | xi)}

  for xi ∈ D, i = 1, . . . , n = |D|

• Total entropy (penalty):

  E = − Σ_{i=1}^{n} ln P(c | xi)

  – if P(c | xi) = 1, then ln P(c | xi) = 0
  – if P(c | xi) ↓ 0, then ln P(c | xi) → −∞
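A minimal Python sketch of these two measures, assuming a binary class variable (so that cmax is the true class exactly when P(c | xi) > 0.5) and made-up values of P(c | xi) for the true class c of each case:

# Minimal sketch of success rate and total-entropy penalty.
import math

true_class_probs = [0.9, 0.6, 0.35, 0.99]   # P(c | x_i) for the true class c

# Success rate: fraction of cases where c_max = arg max_c P(c | x_i) is correct.
success_rate = sum(p > 0.5 for p in true_class_probs) / len(true_class_probs)

# Total entropy penalty: E = - sum_i ln P(c | x_i)
total_entropy = -sum(math.log(p) for p in true_class_probs)

print(success_rate, total_entropy)   # 0.75 and roughly 1.68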

Example COMIK BN

[Figure: Bayesian network for liver and biliary tract disease, with vertices such as CLOTTING-FACTORS, FEVER, BILIRUBIN, ASCITES, GALL-BLADDER, LEUKAEMIA-LYMPHOMA, ASAT, INTERMITTENT-JAUNDICE, UPPER-ABDOMINAL-PAIN, LIVER-SURFACE, GI-CANCER, ALKALINE-PHOSPHATASE, LDH, AGE, CONGESTIVE-HEART-FAILURE, BILIARY-COLICS-GALLSTONES, ALCOHOL, HISTORY-GE-2-WEEKS, SPIDERS, WEIGHT-LOSS and JAUNDICE-DUE-TO-CIRRHOSIS, and a diagnostic class variable with values ACUTE-NON-OBSTRUCTIVE, CHRONIC-NON-OBSTRUCTIVE, BENIGN-OBSTRUCTIVE, MALIGNANT-OBSTRUCTIVE and OTHER.]

Based on the COMIK dataset:

• Dataset with 1002 patient cases with liver and biliary tract disease, collected by the Danish COMIK group
• Has been used as a vehicle for learning various probabilistic models

Results for COMIK Dataset

[Figure: correct conclusion (%), roughly between 50% and 80%, plotted against the number of added arcs (0–20).]

[Figure: total entropy, roughly between 900 and 1200, plotted against the number of added arcs (0–20).]
Comparison Bayesian networks: NHL

[Figure: two Bayesian networks for non-Hodgkin lymphoma (NHL), over variables such as CT&RT-SCHEDULE, GENERAL-HEALTH-STATUS, AGE, CLINICAL-STAGE, BULKY-DISEASE, HISTOLOGICAL-CLASSIFICATION, CLINICAL-PRESENTATION, SURGERY, BM-DEPRESSION, PERFORATION, HEMORRHAGE, THERAPY-ADJUSTMENT, EARLY-RESULT, IMMEDIATE-SURVIVAL, POST-CT&RT-SURVIVAL, POST-SURGICAL-SURVIVAL and 5-YEAR-RESULT (ALIVE/DEATH).]

Naive Bayes: odds-likelihood form

For class variable C and evidence E:

P(C | E) = ∏_{E∈E} P(E | C) P(C) / P(E)

if E ⊥⊥ E′ | C for all E, E′ ∈ E. For C = true:

P(c | E) = ∏_{E∈E} P(E | c) P(c) / P(E)

For C = false:

P(¬c | E) = ∏_{E∈E} P(E | ¬c) P(¬c) / P(E)

⇒ P(c | E) / P(¬c | E) = [ ∏_{E∈E} P(E | c) / ∏_{E∈E} P(E | ¬c) ] · P(c) / P(¬c)
                        = ∏_{i=1}^{m} λi · O(c)
                        = O(c | E)

Here O(c | E) is the conditional odds, and λi = P(Ei | c) / P(Ei | ¬c) is a likelihood ratio.
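A minimal Python sketch (the probabilities are made up) of this odds-likelihood computation: the prior odds O(c) are multiplied by one likelihood ratio λi per finding, and the resulting odds are converted back to a probability:

# Minimal sketch: O(c | E) = O(c) * prod_i lambda_i,
# with lambda_i = P(E_i | c) / P(E_i | ¬c).
prior_c = 0.2
findings = [(0.9, 0.3), (0.7, 0.6)]   # (P(E_i | c), P(E_i | ¬c)) per finding

odds = prior_c / (1 - prior_c)        # prior odds O(c)
for p_e_given_c, p_e_given_not_c in findings:
    odds *= p_e_given_c / p_e_given_not_c   # multiply by lambda_i

prob = odds / (1 + odds)              # back to a probability P(c | E)
print(odds, prob)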

Odds, likelihoods and logarithms

Odds:

O(c | E) = P(c | E) / P(¬c | E) = P(c | E) / (1 − P(c | E))

Back to probabilities:

P(c | E) = O(c | E) / (1 + O(c | E))

Logarithmic odds-likelihood form:

ln O(c | E) = ln [ ∏_{i=1}^{m} λi · O(c) ]
            = Σ_{i=1}^{m} ln λi + ln O(c)
            = Σ_{i=0}^{m} ωi

with ω0 = ln O(c) and ωi = ln λi, i = 1, . . . , m

Odds and probabilities

[Figure: odds O plotted against probability P, for P from 0 to 1 and O from 0 to 10.]

Note that:

• O(c | E) = 1 if P(c | E) = 0.5
• O(c | E) → ∞ if P(c | E) ↑ 1


Log-odds and weights

Log-odds:

ln O(c | E) = ln [ ∏_{i=1}^{m} λi · O(c) ]
            = Σ_{i=1}^{m} ln λi + ln O(c)
            = Σ_{i=0}^{m} ωi

Back to probabilities:

P(c | E) = O(c | E) / (1 + O(c | E))
         = exp( Σ_{i=0}^{m} ωi ) / ( 1 + exp( Σ_{i=0}^{m} ωi ) )

Adjust the ωi with weights βi, based on existing interactions between the variables:

ln O(c | E) = Σ_{i=0}^{m} βi ωi = β^T ω

Logistic regression

[Figure: points in the plane separated by a decision hyperplane.]

Hyperplane: {ω | β^T ω = 0} where

• c = β0 ω0 is the intercept (recall that ω0 = ln O(c), which is independent of any evidence E)
• the ωi, i = 1, . . . , m, correspond to the probabilities we want to find
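A minimal Python sketch (the β and ω values are made up) of the resulting model: the linear score β^T ω is the log-odds ln O(c | E), and the logistic (sigmoid) function maps it back to P(c | E); the decision boundary is the hyperplane β^T ω = 0.

# Minimal sketch of the logistic model.
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

beta = [0.5, 1.2, -0.8]    # beta_0 (weight of the intercept term), beta_1, beta_2
omega = [1.0, 0.7, 0.3]    # omega_0 = ln O(c), omega_i = ln lambda_i

score = sum(b * w for b, w in zip(beta, omega))   # beta^T omega = ln O(c | E)
print(sigmoid(score))      # P(c | E); exceeds 0.5 exactly when the score > 0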

Maximum likelihood estimate

Database D, |D| = N, with independent instances xi ∈ D; then the likelihood function l is:

l(β) = ∏_{i=1}^{N} Pβ(Ci | x′i)

where Ci is the class value for instance i, and x′i is xi without Ci.

Log-likelihood function L:

L(β) = ln l(β)
     = Σ_{i=1}^{N} ln Pβ(Ci | x′i)
     = Σ_{i=1}^{N} [ yi ln Pβ(ci | x′i) + (1 − yi) ln(1 − Pβ(ci | x′i)) ]
     = Σ_{i=1}^{N} [ yi β^T ω − ln( 1 + e^{β^T ω} ) ]

where ci is (always) the value Ci = true; yi = 1 if ci is the class value for xi, and yi = 0 otherwise.

Maximisation of L(β):

∂L(β)/∂β = Σ_{i=1}^{N} [ yi − e^{β^T ω} / ( 1 + e^{β^T ω} ) ] ω
         = Σ_{i=1}^{N} [ yi − Pβ(ci | x′i) ] ω
         = 0

This can be solved using equation-solving methods, such as the Newton–Raphson method:

β_{r+1} = β_r − ( ∂²L(β) / ∂β ∂β^T )^{-1} · ∂L(β)/∂β

for r = 0, 1, . . ., with starting value β_0 = 0; the result is an approximation for β.
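A minimal numpy sketch of this Newton–Raphson maximisation on synthetic data (the generating weights are made up); it illustrates the update rule above, not the WEKA implementation:

# Minimal Newton-Raphson sketch for maximising the log-likelihood L(beta).
import numpy as np

rng = np.random.default_rng(0)
N = 200
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, 2))])  # column of 1s for omega_0
beta_true = np.array([0.5, 2.0, -1.0])                     # made-up generating weights
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-X @ beta_true))).astype(float)

beta = np.zeros(X.shape[1])                 # starting value beta_0 = 0
for _ in range(25):                         # Newton-Raphson iterations
    p = 1.0 / (1.0 + np.exp(-X @ beta))     # P_beta(c | x'_i) for every instance
    gradient = X.T @ (y - p)                # dL/dbeta = sum_i (y_i - p_i) x_i
    hessian = -(X * (p * (1 - p))[:, None]).T @ X   # d^2 L / dbeta dbeta^T
    beta = beta - np.linalg.solve(hessian, gradient)

print(beta)   # approximation for beta, roughly recovering beta_true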
WEKA output
Scheme: weka.classifiers.Logistic
Relation: weather.symbolic
Instances: 14
Attributes: 5
outlook
temperature
humidity
windy
play
Test mode: evaluate on training data

=== Classifier model (full training set) ===

Logistic Regression (2 classes)

Coefficients...
Variable Coeff.
1 34.9227
2 -48.1161
3 7.8472
4 17.3933
5 -33.0445
6 22.2601
7 -82.415
8 -54.6671
Intercept 66.1064
