
Unit IV

Types of learning
Types of supervised learning
Machine learning examples
• Classification
• Spam / Ham
• Weather prediction
• Regression
• Stock market analysis
• Gold price prediction (IRCTC)?
• Similarity determination
• Collaborative filtering
• Clustering – structure in data
Supervised Learning



Unsupervised learning
Issues in Machine Learning
• What algorithms can approximate functions well and when?
• How does the number of training examples influence accuracy?
• How does the complexity of hypothesis representation impact
it?
• How does noisy data influence accuracy?
• What are the theoretical limits of learnability?
• How can prior knowledge of the learner help?
• How can systems alter their own representation?
• What clues can we get from biological learning systems?
Machine Learning
• How to use a model to make optimal decisions

• Machine learning: how to acquire a model from data / experience


• Learning parameters (e.g. probabilities)
• Learning structure (e.g. BN graphs)
• Learning hidden concepts (e.g. clustering)
Classification
Example: Spam Filter
• Input: an email
• Output: spam/ham

• Setup:
• Get a large collection of example emails, each labeled “spam” or “ham”
• Note: someone has to hand label all this data!
• Want to learn to predict labels of new, future emails

• Features: the attributes used to make the ham / spam decision
• Words: FREE!
• Text patterns: $dd, CAPS
• Non-text: SenderInContacts
• …

Example emails:
“Dear Sir. First, I must solicit your confidence in this transaction, this is by virture of its nature as being utterly confidencial and top secret. …”
“TO BE REMOVED FROM FUTURE MAILINGS, SIMPLY REPLY TO THIS MESSAGE AND PUT "REMOVE" IN THE SUBJECT. 99 MILLION EMAIL ADDRESSES FOR ONLY $99”
“Ok, Iknow this is blatantly OT but I'm beginning to go insane. Had an old Dell Dimension XPS sitting in the corner and decided to put it to use, I know it was working pre being stuck in the corner, but when I plugged it in, hit the power nothing happened.”
Example: Digit Recognition
• Input: images / pixel grids
• Output: a digit 0-9

• Setup:
• Get a large collection of example images, each labeled with a digit
• Note: someone has to hand label all this data!
• Want to learn to predict labels of new, future digit images

• Features: the attributes used to make the digit decision
• Pixels: (6,8)=ON
• Shape patterns: NumComponents, AspectRatio, NumLoops
• …

(Figure: example digit images labeled 0, 1, 2, 1, and an unlabeled “??” image to be classified.)
Other Classification Tasks
• Classification: given inputs x, predict labels (classes) y

• Examples:
• Spam detection (input: document,
classes: spam / ham)
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms,
classes: diseases)
• Automatic essay grading (input: document,
classes: grades)
• Fraud detection (input: account activity,
classes: fraud / no fraud)
• Customer service email routing
• … many more

• Classification is an important commercial technology!


Model-Based Classification
• Model-based approach
• Build a model (e.g. Bayes’ net) where
both the label and features are
random variables
• Instantiate any observed features
• Query for the distribution of the label
conditioned on the features

• Challenges
• What structure should the BN have?
• How should we learn its parameters?
Naïve Bayes for Digits
• Naïve Bayes: assume all features are independent effects of the label

• Simple digit recognition version:
• One feature (variable) Fij for each grid position <i,j>
• Feature values are on / off, based on whether intensity is more or less than 0.5 in the underlying image
• Each input maps to a feature vector of on/off pixel values

• Here: lots of features, each is binary valued

• Naïve Bayes model (label Y with features F1 … Fn as its children):
P(Y, F1, …, Fn) = P(Y) Πi P(Fi|Y)

• What do we need to learn? The prior P(Y) and the conditional tables P(Fi|Y)
General Naïve Bayes
• A general Naïve Bayes model with label Y and features F1 … Fn:
P(Y, F1, …, Fn) = P(Y) Πi P(Fi|Y)

• Parameter counting:
• The full joint over Y and F1 … Fn has |Y| × |F|^n values
• The Naïve Bayes model needs only |Y| parameters for P(Y), plus n × |F| × |Y| parameters for the P(Fi|Y) tables

• We only have to specify how each feature depends on the class

• Total number of parameters is linear in n
• Model is very simplistic, but often works anyway
Inference for Naïve Bayes
• Goal: compute the posterior distribution over the label variable Y
• Step 1: get the joint probability of label and evidence for each label:
P(y, f1, …, fn) = P(y) Πi P(fi|y)
• Step 2: sum over labels to get the probability of the evidence:
P(f1, …, fn) = Σy P(y, f1, …, fn)
• Step 3: normalize by dividing Step 1 by Step 2:
P(y | f1, …, fn) = P(y, f1, …, fn) / P(f1, …, fn)
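These three steps as a minimal Python sketch. The prior and conditional tables here are illustrative placeholders, not values from the slides:

prior = {"spam": 0.33, "ham": 0.67}                 # P(Y)
cond = {                                            # P(F_i = true | Y)
    "spam": [0.8, 0.1, 0.6],
    "ham":  [0.2, 0.3, 0.4],
}
features = [True, False, True]                      # observed F_1..F_n

# Step 1: joint probability of label and evidence, for each label
joint = {}
for y in prior:
    p = prior[y]
    for f, p_true in zip(features, cond[y]):
        p *= p_true if f else (1.0 - p_true)
    joint[y] = p

# Step 2: probability of the evidence (sum over labels)
evidence = sum(joint.values())

# Step 3: normalize to get the posterior P(Y | F_1..F_n)
posterior = {y: joint[y] / evidence for y in joint}
print(posterior)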


General Naïve Bayes
• What do we need in order to use Naïve Bayes?
• Inference method (we just saw this part)
• Start with a bunch of probabilities: P(Y) and the P(Fi|Y) tables
• Use standard inference to compute P(Y|F1…Fn)
• Nothing new here

• Estimates of local conditional probability tables


• P(Y), the prior over labels
• P(Fi|Y) for each feature (evidence variable)
• These probabilities are collectively called the parameters of the model and denoted by θ
• Up until now, we assumed these appeared by magic, but…
• …they typically come from training data counts: we’ll look at this soon
Example: Conditional Probabilities
(A uniform prior over digits, and the probability that two example pixel features are on, given each digit:)

Y    P(Y)   P(F1=on|Y)   P(F2=on|Y)
1    0.1    0.01         0.05
2    0.1    0.05         0.01
3    0.1    0.05         0.90
4    0.1    0.30         0.80
5    0.1    0.80         0.90
6    0.1    0.90         0.90
7    0.1    0.05         0.25
8    0.1    0.60         0.85
9    0.1    0.50         0.60
0    0.1    0.80         0.80
A Spam Filter
• Naïve Bayes spam filter

• Data:
• Collection of emails, labeled spam or ham
• Note: someone has to hand label all this data!
• Split into training, held-out, test sets

• Classifiers:
• Learn on the training set
• (Tune it on a held-out set)
• Test it on new emails

(The same example spam and ham emails shown earlier.)
Naïve Bayes for Text
• Bag-of-words Naïve Bayes:
• Features: Wi is the word at position i (how many variables are there? how many values?)
• As before: predict label conditioned on feature variables (spam vs. ham)
• As before: assume features are conditionally independent given label
• New: each Wi is identically distributed
• Note: Wi is the word at position i, not the ith word in the dictionary!

• Generative model:
P(Y, W1, …, Wn) = P(Y) Πi P(Wi|Y)

• “Tied” distributions and bag-of-words
• Usually, each variable gets its own conditional probability distribution P(F|Y)
• In a bag-of-words model:
• Each position is identically distributed
• All positions share the same conditional probs P(W|Y)
• Why make this assumption?
• Called “bag-of-words” because the model is insensitive to word order or reordering
(Illustration: the sentence “When the lecture is over, remember to wake up the person sitting next to you in the lecture room” and a scrambled reordering of its words receive the same probability under the model.)
Example: Spam Filtering
• Model: P(Y, W1, …, Wn) = P(Y) Πi P(Wi|Y)
• What are the parameters?

P(Y)          P(W|ham)        P(W|spam)
ham : 0.66    the : 0.0156    the : 0.0210
spam: 0.33    to  : 0.0153    to  : 0.0133
              and : 0.0115    of  : 0.0119
              of  : 0.0095    2002: 0.0110
              you : 0.0093    with: 0.0108
              a   : 0.0086    from: 0.0107
              with: 0.0080    and : 0.0105
              from: 0.0075    a   : 0.0100
              ...             ...

• Where do these tables come from?
Spam Example
Word      P(w|spam)  P(w|ham)  Tot Spam  Tot Ham
(prior)   0.33333    0.66666   -1.1      -0.4
Gary      0.00002    0.00021   -11.8     -8.9
would     0.00069    0.00084   -19.1     -16.0
you       0.00881    0.00304   -23.8     -21.8
like      0.00086    0.00083   -30.9     -28.9
to        0.01517    0.01339   -35.1     -33.2
lose      0.00008    0.00002   -44.5     -44.0
weight    0.00016    0.00002   -53.3     -55.0
while     0.00027    0.00027   -61.5     -63.2
you       0.00881    0.00304   -66.2     -69.0
sleep     0.00006    0.00001   -76.0     -80.5

(Tot Spam / Tot Ham are running sums of log probabilities, natural log.)
P(spam | w) = 98.9%
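A small Python sketch reproducing the running-total computation above in log space; the word probabilities are the ones in the table, and the final normalization recovers P(spam|w) ≈ 98.9%:

import math

p_spam, p_ham = 0.33333, 0.66666
words = {                                  # (P(w|spam), P(w|ham))
    "Gary": (0.00002, 0.00021), "would": (0.00069, 0.00084),
    "you": (0.00881, 0.00304), "like": (0.00086, 0.00083),
    "to": (0.01517, 0.01339), "lose": (0.00008, 0.00002),
    "weight": (0.00016, 0.00002), "while": (0.00027, 0.00027),
    "sleep": (0.00006, 0.00001),
}
sentence = "Gary would you like to lose weight while you sleep".split()

log_spam, log_ham = math.log(p_spam), math.log(p_ham)
for w in sentence:                         # accumulate log P(y) + sum log P(w|y)
    pw_spam, pw_ham = words[w]
    log_spam += math.log(pw_spam)
    log_ham += math.log(pw_ham)

odds = math.exp(log_spam - log_ham)        # exponentiate and normalize
print("P(spam | w) =", odds / (1 + odds))  # ~0.989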
Naïve Bayes Classifier: Training Dataset

Class:
C1: buys_computer = ‘yes’
C2: buys_computer = ‘no’

Data to be classified:
X = (age <= 30, income = medium, student = yes, credit_rating = fair)

age     income   student  credit_rating  buys_computer
<=30    high     no       fair           no
<=30    high     no       excellent      no
31…40   high     no       fair           yes
>40     medium   no       fair           yes
>40     low      yes      fair           yes
>40     low      yes      excellent      no
31…40   low      yes      excellent      yes
<=30    medium   no       fair           no
<=30    low      yes      fair           yes
>40     medium   yes      fair           yes
<=30    medium   yes      excellent      yes
31…40   medium   no       excellent      yes
31…40   high     yes      fair           yes
>40     medium   no       excellent      no
Naïve Bayes Classifier: An Example
(Using the training dataset above.)

• P(Ci):
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357

• Compute P(X|Ci) for each class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.6
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.4
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.2
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.4

• X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
P(X|buys_computer = “no”) = 0.6 × 0.4 × 0.2 × 0.4 = 0.019
P(X|buys_computer = “yes”) × P(buys_computer = “yes”) = 0.028
P(X|buys_computer = “no”) × P(buys_computer = “no”) = 0.007
Therefore, X belongs to class (“buys_computer = yes”)
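The same computation as a short Python sketch; the probabilities are the counts read off the training table above:

p_yes, p_no = 9/14, 5/14

# P(attribute value | class), from the worked example
px_yes = (2/9) * (4/9) * (6/9) * (6/9)   # age<=30, income=med, student=yes, credit=fair
px_no  = (3/5) * (2/5) * (1/5) * (2/5)

score_yes = px_yes * p_yes               # 0.044 * 0.643 = 0.028
score_no  = px_no * p_no                 # 0.019 * 0.357 = 0.007
print("yes" if score_yes > score_no else "no")   # -> yes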
Avoiding the Zero-Probability Problem
• Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise, the predicted probability will be zero:
P(X|Ci) = Πk=1..n P(xk|Ci)
• Ex. Suppose a dataset with 1000 tuples, income=low (0),
income= medium (990), and income = high (10)
• Use Laplacian correction (or Laplacian estimator)
• Adding 1 to each case
Prob(income = low) = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high) = 11/1003
• The “corrected” prob. estimates are close to their
“uncorrected” counterparts
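A sketch of the Laplacian correction on the income example, with the counts as stated above:

from collections import Counter

counts = Counter({"low": 0, "medium": 990, "high": 10})
n = sum(counts.values())                 # 1000 tuples
k = len(counts)                          # 3 distinct values

# add 1 to each case; the denominator grows by the number of values
corrected = {v: (c + 1) / (n + k) for v, c in counts.items()}
print(corrected)   # low: 1/1003, medium: 991/1003, high: 11/1003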
Naïve Bayes Summary
Naïve Bayes Generalization
Naïve Bayes Classifier: Comments
• Advantages
• Easy to implement
• Good results obtained in most of the cases
• Can deal with missing attributes (some xj is not observed)
• Disadvantages
• Assumption: class conditional independence, therefore loss of
accuracy
• Practically, dependencies exist among variables
• E.g., hospital patient data: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
• Dependencies among these cannot be modeled by Naïve Bayes
Classifier
• How to deal with these dependencies? Bayesian Belief Networks
Training and Testing
Important Concepts
• Data: labeled instances, e.g. emails marked spam/ham
• Training set
• Held-out set
• Test set

• Features: attribute-value pairs which characterize each x

• Experimentation cycle
• Learn parameters (e.g. model probabilities) on the training set
• (Tune hyperparameters on the held-out set)
• Compute accuracy on the test set
• Very important: never “peek” at the test set!

• Evaluation
• Accuracy: fraction of instances predicted correctly

• Overfitting and generalization
• Want a classifier which does well on test data
• Overfitting: fitting the training data very closely, but not generalizing well
• Underfitting: fits the training set poorly
Underfitting and Overfitting
Overfitting
(Figure: data points fit with a degree-15 polynomial, which passes through the training points but oscillates wildly between them.)
Example: Overfitting

(With relative-frequency parameters, the posterior saturates to a single class: 2 wins!!)
Example: Overfitting
• Posteriors determined by relative probabilities (odds ratios):

south-west : inf screens : inf


nation : inf minute : inf
morally : inf guaranteed : inf
nicely : inf $205.00 : inf
extent : inf delivery : inf
seriously : inf signature : inf
... ...

What went wrong here?


Generalization and Overfitting
• Relative frequency parameters will overfit the training data!
• Just because we never saw a 3 with pixel (15,15) on during training doesn’t mean we won’t see it at test
time
• Unlikely that every occurrence of “minute” is 100% spam
• Unlikely that every occurrence of “seriously” is 100% ham
• What about all the words that don’t occur in the training set at all?
• In general, we can’t go around giving unseen events zero probability

• As an extreme case, imagine using the entire email as the only feature
• Would get the training data perfect (if deterministic labeling)
• Wouldn’t generalize at all
• Just making the bag-of-words assumption gives us some generalization, but isn’t enough

• To generalize better: we need to smooth or regularize the estimates


Parameter Estimation
• Estimating the distribution of a random variable

• Elicitation: ask a human (why is this hard?)

• Empirically: use training data (learning!)
• E.g.: for each outcome x, look at the empirical rate of that value:
PML(x) = count(x) / total samples
e.g. for the observed sequence r, r, b: PML(r) = 2/3

• This is the estimate that maximizes the likelihood of the data
Your First Consulting Job
• A billionaire tech entrepreneur asks you a question:
• He says: I have a thumbtack; if I flip it, what’s the probability it will fall with the nail up?
• You say: Please flip it a few times:
(Suppose 3 of 5 flips come up heads.)
• You say: The probability is:
• P(H) = 3/5
• He says: Why???
• You say: Because…
Your First Consulting Job
• P(Heads) = θ, P(Tails) = 1 − θ

• Flips are i.i.d.:
• Independent events: D = {xi | i = 1…n}, P(D | θ) = Πi P(xi | θ)
• Identically distributed according to an unknown distribution
• Sequence D of H Heads and T Tails
Maximum Likelihood Estimation
• Data: observed set D of H Heads and T Tails
• Hypothesis space: binomial distributions
• Learning: finding θ is an optimization problem
• What’s the objective function?

• MLE: choose θ to maximize the probability of D:
P(D | θ) = θ^H (1 − θ)^T

• Take the log, set the derivative to zero, and solve:
ln P(D | θ) = H ln θ + T ln(1 − θ)
d/dθ ln P(D | θ) = H/θ − T/(1 − θ) = 0  ⟹  θ = H / (H + T)
Smoothing
Maximum Likelihood?
• Relative frequencies are the maximum likelihood estimates

• Another option is to consider the most likely parameter value given the data (the MAP estimate, which maximizes P(θ | D))
Unseen Events
Laplace Smoothing
• Laplace’s estimate:
• Pretend you saw every outcome once more than you actually did:
PLAP(x) = (count(x) + 1) / (N + |X|)
• e.g. for the observed sequence r, r, b: PLAP(r) = 3/5, PLAP(b) = 2/5

• Can derive this estimate with Dirichlet priors (see cs281a)
Decision trees
Inductive inference with decision trees

 Decision tree learning is one of the most widely used and practical methods of inductive inference
 Features:
 Method for approximating discrete-valued functions (including boolean)
 Learned functions are represented as decision trees (or if-then-else rules)
 Expressive hypothesis space, including disjunction
 Robust to noisy data
Decision tree representation (PlayTennis)

(Figure: the PlayTennis decision tree.) The instance Outlook=Sunny, Temp=Hot, Humidity=High, Wind=Strong is classified No.

Decision trees expressivity
• Decision trees represent a disjunction of conjunctions
on constraints on the value of attributes:
(Outlook = Sunny  Humidity = Normal) 
(Outlook = Overcast) 
(Outlook = Rain  Wind = Weak)
Decision trees representation
(Figure: a decision tree and the corresponding partition of the instance space.)
When to use Decision Trees
 Problem characteristics:
 Instances can be described by attribute value pairs
 Target function is discrete valued
 Disjunctive hypothesis may be required
 Possibly noisy training data samples
 Robust to errors in training data
 Missing attribute values
 Different classification problems:
 Equipment or medical diagnosis
 Credit risk analysis
 Several tasks in natural language processing
Top-down induction of Decision Trees
• ID3 (Quinlan, 1986) is a basic algorithm for learning DT's
• Given a training set of examples, the algorithms for building DT
performs search in the space of decision trees
• The construction of the tree is top-down. The algorithm is greedy.
• The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
• Select the best attribute
• A descendent node is then created for each possible value of this
attribute and examples are partitioned according to this value
• The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?

 A statistical property called information gain, measures


how well a given attribute separates the training
examples
 Information gain uses the notion of entropy, commonly
used in information theory
 Information gain = expected reduction of entropy
Entropy in binary classification
• Entropy measures the impurity of a collection of examples.
It depends from the distribution of the random variable p.
• S is a collection of training examples
• p+ the proportion of positive examples in S
• p– the proportion of negative examples in S
Entropy(S) = – p+ log2 p+ – p– log2 p–   [0 log2 0 = 0]
Entropy([14+, 0–]) = – 14/14 log2 (14/14) – 0 log2 0 = 0
Entropy([9+, 5–]) = – 9/14 log2 (9/14) – 5/14 log2 (5/14) = 0.94
Entropy([7+, 7–]) = – 7/14 log2 (7/14) – 7/14 log2 (7/14) = 1/2 + 1/2 = 1   [log2 1/2 = –1]
Note: the log of a number < 1 is negative; 0 ≤ p ≤ 1, 0 ≤ entropy ≤ 1
Entropy
(Figure: entropy as a function of the proportion of positive examples, maximal at p+ = 1/2.)
Entropy in general
• Entropy measures the amount of information in a random variable
H(X) = – p+ log2 p+ – p– log2 p–   X = {+, –}
for binary classification [two-valued random variable]
H(X) = – Σi=1..c pi log2 pi = Σi=1..c pi log2 (1/pi)   X = {1, …, c}
for classification in c classes
Example: rolling a die with 8 equally probable sides
H(X) = – Σi=1..8 1/8 log2 1/8 = – log2 1/8 = log2 8 = 3
Information gain as entropy reduction
• Information gain is the expected reduction in entropy caused by partitioning the examples on an attribute.
• The higher the information gain, the more effective the attribute in classifying training data.
• Expected reduction in entropy knowing A:
Gain(S, A) = Entropy(S) − Σv∈Values(A) (|Sv| / |S|) Entropy(Sv)
Values(A): possible values for A
Sv: subset of S for which A has value v
Example: expected information gain
• Let
• Values(Wind) = {Weak, Strong}
• S = [9+, 5−]
• SWeak = [6+, 2−]
• SStrong = [3+, 3−]
• Information gain due to knowing Wind:
Gain(S, Wind) = Entropy(S) − 8/14 Entropy(SWeak) − 6/14 Entropy(SStrong)
= 0.94 − 8/14 × 0.811 − 6/14 × 1.00
= 0.048
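The entropy and gain computations as small Python helpers; the function names are illustrative, and the final line reproduces the Wind example:

import math

def entropy(pos, neg):
    # Entropy of a collection with pos positive and neg negative examples.
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                       # convention: 0 log 0 = 0
            h -= p * math.log2(p)
    return h

def gain(s, partitions):
    # Information gain: s and each partition given as (pos, neg) pairs.
    total = s[0] + s[1]
    remainder = sum((p + n) / total * entropy(p, n) for p, n in partitions)
    return entropy(*s) - remainder

print(gain((9, 5), [(6, 2), (3, 3)]))   # Wind example: ~0.048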
Which attribute is the best classifier?
Example
First step: which attribute to test at the root?

 Which attribute should be tested at the root?


 Gain(S, Outlook) = 0.246
 Gain(S, Humidity) = 0.151
 Gain(S, Wind) = 0.084
 Gain(S, Temperature) = 0.029
 Outlook provides the best prediction for the target
 Let’s grow the tree:
 add to the tree a successor for each possible value of Outlook
 partition the training samples according to the value of Outlook
After first step
Example
Second step
 Working on the Outlook=Sunny node:
Gain(SSunny, Humidity) = 0.970 − 3/5 × 0.0 − 2/5 × 0.0 = 0.970
Gain(SSunny, Wind) = 0.970 − 2/5 × 1.0 − 3/5 × 0.918 = 0.019
Gain(SSunny, Temp.) = 0.970 − 2/5 × 0.0 − 2/5 × 1.0 − 1/5 × 0.0 = 0.570
 Humidity provides the best prediction for the target
 Let’s grow the tree:
 add to the tree a successor for each possible value of Humidity
 partition the training samples according to the value of Humidity
Second and third steps
(Resulting leaves: {D1, D2, D8} → No, {D9, D11} → Yes, {D4, D5, D10} → Yes, {D6, D14} → No)
ID3: algorithm
ID3(X, T, Attrs)
  X: training examples
  T: target attribute (e.g. PlayTennis)
  Attrs: other attributes, initially all attributes
Create Root node
If all X’s are +, return Root with class +
If all X’s are –, return Root with class –
If Attrs is empty, return Root with class the most common value of T in X
else
  A ← best attribute; decision attribute for Root ← A
  For each possible value vi of A:
    - add a new branch below Root, for test A = vi
    - Xi ← subset of X with A = vi
    - If Xi is empty then add a new leaf with class the most common value of T in X
      else add the subtree generated by ID3(Xi, T, Attrs − {A})
return Root
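A compact runnable Python sketch of ID3, assuming examples are dicts of attribute values; unlike the pseudocode, it only branches on attribute values actually observed in the current subset:

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(examples, target, attrs):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                 # all examples in one class
        return labels[0]
    if not attrs:                             # no attributes left
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                              # information gain of attribute a
        rem = 0.0
        for v in {e[a] for e in examples}:
            sub = [e[target] for e in examples if e[a] == v]
            rem += len(sub) / len(examples) * entropy(sub)
        return entropy(labels) - rem
    best = max(attrs, key=gain)               # select the best attribute
    tree = {}
    for v in {e[best] for e in examples}:     # branch and recurse
        subset = [e for e in examples if e[best] == v]
        tree[(best, v)] = id3(subset, target, [a for a in attrs if a != best])
    return tree

toy = [{"Outlook": "Sunny", "Wind": "Weak", "Play": "No"},
       {"Outlook": "Overcast", "Wind": "Weak", "Play": "Yes"},
       {"Outlook": "Sunny", "Wind": "Strong", "Play": "No"}]
print(id3(toy, "Play", ["Outlook", "Wind"]))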
Search space in Decision Tree learning
 The search space is made by partial
decision trees
 The algorithm is hill-climbing
 The evaluation function is
information gain
 The hypotheses space is complete
(represents all discrete-valued
functions)
 The search maintains a single current
hypothesis
 No backtracking; no guarantee of
optimality
 It uses all the available examples (not
incremental)
 May terminate earlier, accepting
noisy classes
Inductive bias in decision tree learning

 Inductive Bias: The inductive bias (also known as learning bias) of


a learning algorithm is the set of assumptions that the learner uses
to predict outputs given inputs that it has not encountered.
 What is the inductive bias of DT learning?
1. Shorter trees are preferred over longer trees
Not enough. This is the bias exhibited by a simple breadth first
algorithm generating all DT's selecting the shorter one
2. Prefer trees that place high information gain attributes close to the
root
 Note: DT's are not limited in representing all possible functions
Two kinds of biases
 Preference or search biases (due to the search strategy)
 ID3 searches a complete hypotheses space; the search strategy is incomplete
 Restriction or language biases (due to the set of hypotheses
expressible or considered)
 Candidate-Elimination searches an incomplete hypotheses space; the search
strategy is complete
 A combination of biases in learning a linear combination of
weighted features in board games.
Prefer shorter hypotheses: Occam's razor
 Why prefer shorter hypotheses?
 Arguments in favor:
 There are fewer short hypotheses than long ones
 If a short hypothesis fits data unlikely to be a coincidence
 Elegance and aesthetics
 Arguments against:
 Not every short hypothesis is a reasonable one.
 Occam's razor:"The simplest explanation is usually the
best one."
 a principle usually (though incorrectly) attributed to the 14th-century
English logician and Franciscan friar, William of Ockham.
 lex parsimoniae ("law of parsimony", "law of economy", or "law
of succinctness")
 The term razor refers to the act of shaving away unnecessary
assumptions to get to the simplest explanation.
Issues in decision trees learning
 Overfitting
 Reduced error pruning
 Rule post-pruning
 Extensions
 Continuous valued attributes
 Alternative measures for selecting attributes
 Handling training examples with missing attribute values
 Handling attributes with different costs
 Improving computational efficiency
 Most of these improvements in C4.5 (Quinlan, 1993)
Overfitting: definition
• Building trees that “adapt too much” to the training
examples may lead to “overfitting”.
• Consider error of hypothesis h over
• training data: errorD(h) empirical error
• entire distribution X of data: errorX(h) expected error
• Hypothesis h overfits the training data if there is an alternative hypothesis h' ∈ H such that
errorD(h) < errorD(h') and
errorX(h') < errorX(h)
i.e. h' behaves better over unseen data
Example

D15   Sunny   Hot   Normal   Strong   No

Overfitting in decision trees

Outlook=Sunny, Temp=Hot, Humidity=Normal, Wind=Strong, PlayTennis=No

The new noisy example causes splitting of the second leaf node.

Overfitting in decision tree learning
Overfitting in decision tree learning
Avoid overfitting in Decision Trees
 Two strategies:
1. Stop growing the tree earlier, before perfect classification
2. Allow the tree to overfit the data, and then post-prune the
tree
 Training and validation set
 split the training in two parts (training and validation) and use
validation to assess the utility of post-pruning
 Reduced error pruning
 Rule pruning
 Other approaches
 Use a statistical test to estimate effect of expanding or pruning
 Minimum description length principle: uses a measure of
complexity of encoding the DT and the examples, and halt
growing the tree when this encoding size is minimal
Reduced-error pruning (Quinlan 1987)
 Each node is a candidate for pruning
 Pruning consists in removing a subtree rooted in a node: the node
becomes a leaf and is assigned the most common classification
 Nodes are removed only if the resulting tree performs no worse on
the validation set.
 Nodes are pruned iteratively: at each iteration the node whose
removal most increases accuracy on the validation set is pruned.
 Pruning stops when no pruning increases accuracy
Effect of reduced error pruning
Rule post-pruning
1. Create the decision tree from the training set
2. Convert the tree into an equivalent set of rules
• Each path corresponds to a rule
• Each node along a path corresponds to a pre-condition
• Each leaf classification to the post-condition
3. Prune (generalize) each rule by removing those
preconditions whose removal improves accuracy …
• … over validation set
• … over training with a pessimistic, statistically inspired,
measure
4. Sort the rules in estimated order of accuracy, and
consider them in sequence when classifying new
instances
Converting to rules

(Outlook=Sunny)(Humidity=High) ⇒ (PlayTennis=No)
Why converting to rules?
 Each distinct path produces a different rule: a condition removal may
be based on a local (contextual) criterion. Node pruning is global and
affects all the rules
 In rule form, tests are not ordered and there is no book-keeping
involved when conditions (nodes) are removed
 Converting to rules improves readability for humans
Dealing with continuous-valued
attributes
 So far discrete values for attributes and for outcome.
 Given a continuous-valued attribute A, dynamically create
a new attribute Ac
Ac = True if A < c, False otherwise
 How to determine threshold value c ?
 Example. Temperature in the PlayTennis example
 Sort the examples according to Temperature
Temperature   40   48   60   72   80   90
PlayTennis    No   No   Yes  Yes  Yes  No
 Determine candidate thresholds by averaging consecutive values
where there is a change in classification: (48+60)/2=54 and
(80+90)/2=85
 Evaluate candidate thresholds (attributes) according to
information gain. The best is Temperature>54.The new attribute
competes with the other ones
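A small sketch of the candidate-threshold step on the Temperature data above:

temps = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]

# average consecutive values wherever the classification changes
candidates = [
    (a + b) / 2
    for (a, la), (b, lb) in zip(zip(temps, labels), zip(temps[1:], labels[1:]))
    if la != lb
]
print(candidates)   # [54.0, 85.0]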
Problems with information gain
• Natural bias of information gain: it favours attributes with
many possible values.
• Consider the attribute Date in the PlayTennis example.
• Date would have the highest information gain since it perfectly
separates the training data.
• It would be selected at the root resulting in a very broad tree
• Very good on the training, this tree would perform poorly in
predicting unknown instances. Overfitting.
• The problem is that the partition is too specific, too many
small classes are generated.
• We need to look at alternative measures …
An alternative measure: gain ratio
SplitInformation(S, A) = − Σi=1..c (|Si| / |S|) log2 (|Si| / |S|)
• Si are the sets obtained by partitioning on value i of A
• SplitInformation measures the entropy of S with respect to the values of A. The more uniformly dispersed the data, the higher it is.
GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)
• GainRatio penalizes attributes that split examples into many small classes, such as Date. Let |S| = n; Date splits the examples into n classes:
• SplitInformation(S, Date) = −[(1/n) log2 (1/n) + … + (1/n) log2 (1/n)] = − log2 (1/n) = log2 n
• Compare with A, which splits the data into two even classes:
• SplitInformation(S, A) = − [(1/2) log2 (1/2) + (1/2) log2 (1/2)] = − [− 1/2 − 1/2] = 1
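A small sketch of SplitInformation, checking the two cases above (a Date-like attribute with n singleton classes versus a two-way even split):

import math

def split_information(subset_sizes):
    # SplitInformation(S, A), from the sizes |Si| of the partition of S by A
    total = sum(subset_sizes)
    return -sum(s / total * math.log2(s / total) for s in subset_sizes if s)

print(split_information([1] * 14))   # Date-like: log2(14) ~ 3.81
print(split_information([7, 7]))     # two even classes: 1.0
# GainRatio(S, A) is then gain(S, A) / split_information(...)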
Adjusting gain-ratio
• Problem: SplitInformation(S, A) can be zero or very small when |Si |
≈ |S | for some value i
• To mitigate this effect, the following heuristic has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
• Other measures have been proposed:
• Distance-based metric [Lopez-De Mantaras, 1991] on the partitions of data
• Each partition (induced by an attribute) is evaluated according to the
distance to the partition that perfectly classifies the data.
• The partition closest to the ideal partition is chosen
Handling incomplete training data
 How to cope with the problem that the value of some
attribute may be missing?
 Example: Blood-Test-Result in a medical diagnosis problem
 The strategy: use other examples to guess attribute
1. Assign the value that is most common among the training
examples at the node
2. Assign a probability to each value, based on frequencies, and
assign values to missing attribute, according to this probability
distribution
 Missing values in new instances to be classified are
treated accordingly, and the most probable classification
is chosen (C4.5)
Handling attributes with different costs
• Instance attributes may have an associated cost: we
would prefer decision trees that use low-cost attributes
• ID3 can be modified to take costs into account:
1. Tan and Schlimmer (1990):
Gain²(S, A) / Cost(A)
2. Nunez (1988):
(2^Gain(S, A) − 1) / (Cost(A) + 1)^w,  where w ∈ [0, 1]
A Simple Neuron

• An artificial neuron is a device with many inputs and one


output.
• The neuron has two modes of operation;
• the training mode and
• the using mode.
A Simple Neuron (Cont.)
• In the training mode, the neuron can be trained to fire (or not), for particular input patterns.
• In the using mode, when a taught input pattern is detected at the input, its associated output
becomes the current output. If the input pattern does not belong in the taught list of input
patterns, the firing rule is used to determine whether to fire or not.
• The firing rule is an important concept in neural networks and accounts for their high flexibility. A
firing rule determines how one calculates whether a neuron should fire for any input pattern. It
relates to all the input patterns, not only the ones on which the node was trained previously.
Neural Networks
• A NN is a machine learning approach inspired by the way in
which the brain performs a particular learning task:
• Knowledge about the learning task is given in the form of
examples.

• Inter neuron connection strengths (weights) are used to store the


acquired information (the training examples).

• During the learning process the weights are modified in order to


model the particular learning task correctly on the training
examples.
When is neural network suitable?
• Problems having following characteristics
• Instances are available as many attribute-value pairs
• Target function output may be discrete-valued, real-valued, or a vector of discrete- or real-valued attributes
• Training examples may contain errors
• Long training times are ok
• Fast evaluation of the learned target function may be
required
• Ability of humans to understand the learned target
function is not important
Learning
• Supervised Learning
• Recognizing hand-written digits, pattern recognition, regression.
• Labeled examples
(input , desired output)
• Neural Network models: perceptron, feed-forward, radial basis
function, support vector machine.
• Unsupervised Learning
• Find similar groups of documents in the web, content addressable
memory, clustering.
• Unlabeled examples
(different realizations of the input alone)
• Neural Network models: self organizing maps, Hopfield networks.
Network architectures

• Three different classes of network architectures:
• single-layer feed-forward
• multi-layer feed-forward (in both, neurons are organized in acyclic layers)
• recurrent

• The architecture of a neural network is linked with the learning algorithm used to train it
Single Layer Feed-forward
(Figure: an input layer of source nodes projecting onto an output layer of neurons.)
Multi layer feed-forward
(Figure: a 3-4-2 network: an input layer, a hidden layer of four neurons, and an output layer.)
Recurrent network
Recurrent network with hidden neuron(s): the unit-delay operator z⁻¹ implies a dynamic system.
(Figure: input, hidden, and output units with feedback connections through z⁻¹ delay elements.)
Neural Network Architectures



The Neuron
• The neuron is the basic information processing unit of a NN. It consists of:
1. A set of synapses or connecting links, each link characterized by a weight: w1, w2, …, wm
2. An adder function (linear combiner) which computes the weighted sum of the inputs:
u = Σj=1..m wj xj
3. An activation function (squashing function) φ for limiting the amplitude of the output of the neuron:
y = φ(u + b), where b is the bias
The Neuron
(Figure: input signals x1 … xm are weighted by synaptic weights w1 … wm and summed together with the bias b to give the local field v; the activation function φ(·) then produces the output y.)
Bias of a Neuron
• Bias b has the effect of applying an affine transformation to u:
v = u + b
• v is the induced field of the neuron, where
u = Σj=1..m wj xj
Bias as extra input
• Bias is an external parameter of the neuron. It can be modeled by adding an extra input:
v = Σj=0..m wj xj, with x0 = +1 and w0 = b
(Figure: the same neuron diagram with the bias folded in as input x0 = +1 with weight w0 = b.)
Dimensions of a Neural Network

• Various types of neurons

• Various network architectures

• Various learning algorithms

• Various applications
Face Recognition

90% accurate learning head pose, and recognizing 1-of-20 faces


Handwritten digit recognition
ALVINN drives 70 mph on highways

Perceptrons
• Initial proposal of connectionist networks
• Rosenblatt, ’50s and ’60s
• Essentially a linear discriminant composed of nodes and weights
(Figure: inputs I1, I2, I3 with weights W1, W2, W3 feeding a single threshold unit O, drawn either with an explicit threshold or with the threshold as a bias input.)
Activation function:
O = 1 if Σi wi Ii + θ > 0, 0 otherwise   (θ is a threshold/bias term)
Perceptron
Sometimes we’ll use simpler vector notation:
o(x) = sgn(w · x)
(assuming a constant input x0 = 1 so the threshold is folded into w)
Perceptron Example
Inputs 2 and 1, with weights 0.5 and 0.3 and bias term θ = −1:
2(0.5) + 1(0.3) + (−1) = 0.3 > 0, so O = 1

Learning procedure:
Randomly assign weights (between 0 and 1)
Present inputs from the training data
Get output O, nudge the weights to move the result toward our desired output T
Repeat; stop when there are no errors, or enough epochs have completed
Decision Surface of a Perceptron
Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
• e.g., those that are not linearly separable
• Therefore, we’ll want networks of these...
Perceptron training rule
wi ← wi + Δwi
where Δwi = η (t – o) xi
Where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• If the training data is linearly separable and η is sufficiently small
• If (t – o) is 0 then Δwi = 0, so the weight vector does not change
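A minimal Python sketch of the perceptron training rule; AND is used as an illustrative linearly separable target, and the threshold is folded into a bias weight:

import random

data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]   # AND
w = [random.uniform(0, 1) for _ in range(2)]
b = random.uniform(0, 1)
eta = 0.1

for epoch in range(100):
    errors = 0
    for x, t in data:
        o = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if o != t:
            errors += 1
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]   # nudge weights
            b += eta * (t - o)
    if errors == 0:     # converged: all training examples classified correctly
        break
print(w, b)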
Gradient Descent and Delta rule
• Perceptron can be successful in finding weight vector when training
data is linearly separable
• The delta rule is designed to overcome this difficulty: it converges toward a best-fit approximation even when the data is not linearly separable
• The delta rule uses gradient descent for its search; this is the fundamental idea behind the backpropagation algorithm
Gradient Descent (1/4)
• To understand, consider a simpler linear unit, where
o = w0 + w1x1 + ··· + wnxn
• Let’s learn wi’s that minimize the squared error
E[w] = ½ Σd∈D (td – od)²
• Where D is the set of training examples

• The vector derivative is the gradient of E, written
∇E[w] = [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Gradient Descent (2/4)
Gradient: ∇E[w] = [∂E/∂w0, …, ∂E/∂wn]
Training rule: Δw = – η ∇E[w]
i.e., Δwi = – η ∂E/∂wi

Gradient Descent (3/4)
∂E/∂wi = Σd (td – od)(– xid), so Δwi = η Σd (td – od) xid
Gradient Descent (4/4)
• Initialize each wi to some small random value
• Until the termination condition is met, Do
• Initialize each Δwi to zero.
• For each <x, t> in training_examples, Do
* Input the instance x to the unit and compute the output o
* For each linear unit weight wi, Do
Δwi ← Δwi + η (t – o) xi
• For each linear unit weight wi, Do
wi ← wi + Δwi
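A sketch of this batch gradient descent loop for a linear unit; the toy data and learning rate are illustrative:

training = [((1.0, 2.0), 5.0), ((2.0, 1.0), 4.0), ((3.0, 3.0), 9.0)]
w = [0.0, 0.0]
eta = 0.01

for step in range(1000):
    delta = [0.0, 0.0]
    for x, t in training:
        o = sum(wi * xi for wi, xi in zip(w, x))       # linear unit output
        for i, xi in enumerate(x):
            delta[i] += eta * (t - o) * xi             # accumulate over all of D
    w = [wi + dwi for wi, dwi in zip(w, delta)]        # one batch update
print(w)   # approaches the least-squares weights (here ~[1.0, 2.0])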
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to the hypothesis with minimum squared error
• Given a sufficiently small learning rate η
• Even when the training data contains noise
• Even when the training data is not separable by H
Incremental (Stochastic) Gradient Descent (1/2)
Batch mode Gradient Descent:
Do until satisfied
1. Compute the gradient ∇ED[w]
2. w ← w – η ∇ED[w]

Incremental mode Gradient Descent:
Do until satisfied
• For each training example d in D
1. Compute the gradient ∇Ed[w]
2. w ← w – η ∇Ed[w]

Incremental (Stochastic) Gradient Descent (2/2)
Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough
Difference
Perceptron training rule vs. gradient descent:
• Perceptron rule: weights are updated based on the error in the thresholded perceptron output; for linearly separable data it converges after a finite number of iterations.
• Gradient descent: weights are updated based on the error in the unthresholded linear combination of inputs; it converges only asymptotically toward the minimum-error hypothesis, whether or not the data is linearly separable.

Linear programming could also be used for learning the weight vector

Difference
Standard (batch) vs. stochastic gradient descent:
• Standard: the error is summed over all examples before updating the weights; summing over multiple examples requires more computation per weight-update step.
• Stochastic: weights are updated after examining each training example, with a larger effective step size per weight update.
• Standard gradient descent can fall into local minima of the total error ED; stochastic gradient descent, by following the per-example gradients ∇Ed, can sometimes avoid them.
Multilayer Networks of Sigmoid Units

Sigmoid Unit
σ(x) = 1 / (1 + e^(–x)) is the sigmoid function
Nice property: dσ(x)/dx = σ(x)(1 – σ(x))
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi ½ Σd (td – od)²
But we know: ∂od/∂netd = od (1 – od) and ∂netd/∂wi = xid
So: ∂E/∂wi = – Σd (td – od) od (1 – od) xid
Backpropagation Learning Principles:
Hidden Layers and Gradients

There are two differences for the updating rule :

1) The activation of the hidden unit is used instead of activation


of the input value.

2) The rule contains a term for the gradient of the activation


function.
Backward Pass
• Compute deltas to weights
• from hidden layer
• to output layer

• Without changing any weights (yet), compute the actual contributions


• within the hidden layer(s)
• and compute deltas
Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, Do
• For each training example, Do
1. Input the training example to the network and compute the network outputs
2. For each output unit k: δk ← ok(1 – ok)(tk – ok)
3. For each hidden unit h: δh ← oh(1 – oh) Σk∈outputs wh,k δk
4. Update each network weight wi,j:
wi,j ← wi,j + Δwi,j, where Δwi,j = η δj xi,j
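A small Python sketch of these update rules for a 2-2-1 network of sigmoid units; XOR is the illustrative training set (with random initialization, convergence is not guaranteed on every run):

import math, random

def sig(x):
    return 1.0 / (1.0 + math.exp(-x))

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR
rnd = lambda: random.uniform(-0.5, 0.5)
w_h = [[rnd(), rnd(), rnd()] for _ in range(2)]   # 2 hidden units: [w1, w2, bias]
w_o = [rnd(), rnd(), rnd()]                       # output unit:    [v1, v2, bias]
eta = 0.5

for epoch in range(20000):
    for (x1, x2), t in data:
        # step 1: forward pass
        h = [sig(w[0] * x1 + w[1] * x2 + w[2]) for w in w_h]
        o = sig(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2])
        # step 2: output delta;  step 3: hidden deltas
        d_o = o * (1 - o) * (t - o)
        d_h = [h[i] * (1 - h[i]) * w_o[i] * d_o for i in range(2)]
        # step 4: weight updates, delta_w = eta * delta_j * x_ij
        w_o = [w_o[0] + eta * d_o * h[0], w_o[1] + eta * d_o * h[1], w_o[2] + eta * d_o]
        for i in range(2):
            w_h[i] = [w_h[i][0] + eta * d_h[i] * x1,
                      w_h[i][1] + eta * d_h[i] * x2,
                      w_h[i][2] + eta * d_h[i]]

for (x1, x2), t in data:                          # inspect the learned outputs
    h = [sig(w[0] * x1 + w[1] * x2 + w[2]) for w in w_h]
    print((x1, x2), t, round(sig(w_o[0] * h[0] + w_o[1] * h[1] + w_o[2]), 2))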
More on Backpropagation
• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
• In practice, often works well (can run multiple times)
• Often include weight momentum α:
Δwi,j(n) = η δj xi,j + α Δwi,j(n – 1)
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations → slow!
• Using the network after training is very fast
Learning Algorithm: Backpropagation
Each neuron is composed of two units. The first unit adds the products of the weight coefficients and the input signals. The second unit realises a nonlinear function, called the neuron transfer (activation) function. Signal e is the adder output signal, and y = f(e) is the output signal of the nonlinear element and of the neuron as a whole.
To teach the neural network we need a training data set. The training data set consists of input signals (x1 and x2) assigned with the corresponding target (desired output) z.

Network training is an iterative process. In each iteration the weight coefficients of the nodes are modified using new data from the training data set. The modification is calculated using the algorithm described below:

Each teaching step starts with forcing both input signals from the training set. After this stage we can determine the output signal values for each neuron in each network layer.
Figures (omitted here) illustrate how the signal propagates through the network. Symbols w(xm)n represent the weights of the connections between network input xm and neuron n in the input layer. Symbol yn represents the output signal of neuron n.
Propagation of signals through the hidden layer: symbols wmn represent the weights of connections between the output of neuron m and the input of neuron n in the next layer.
Propagation of signals through the output layer.
In the next algorithm step, the output signal of the network y is compared with the desired output value (the target), which is found in the training data set. The difference is called the error signal d of the output layer neuron.
The idea is to propagate the error signal d (computed in a single teaching step) back to all neurons whose output signals were inputs for the discussed neuron.
The weight coefficients wmn used to propagate errors back are equal to those used when computing the output value. Only the direction of data flow is changed (signals are propagated from outputs to inputs, one layer after the other). This technique is used for all network layers. If the propagated errors come from several neurons, they are added.
When the error signal for each neuron has been computed, the weight coefficients of each neuron's input nodes may be modified. In the formulas (omitted here), df(e)/de represents the derivative of the activation function of the neuron whose weights are modified.
Main Idea
• Max-Margin Classifier
• Formalize notion of the best linear separator
• Lagrangian Multipliers
• Way to convert a constrained optimization problem to one that is easier to
solve
• Kernels
• Projecting data into higher-dimensional space makes it linearly separable
• Complexity
• Depends only on the number of training examples, not on dimensionality of
the kernel space!
Text categorization
Tennis example
(Figure: examples plotted in the Temperature/Humidity plane, with one marker for "play tennis" and another for "do not play tennis".)
Linear Support Vector Machines
Data: <xi, yi>, i = 1, …, l
xi ∈ Rd
yi ∈ {−1, +1}
(Figure: two classes of points, labeled +1 and −1, in the (x1, x2) plane.)

Linear SVM 2
All hyperplanes in Rd are parameterized by a vector w and a constant b, and can be expressed as w·x + b = 0 (remember the equation for a hyperplane from algebra!).
Our aim is to find a hyperplane f(x) = sign(w·x + b) that correctly classifies our data.
Definitions
Define the hyperplane H such that:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
H1 and H2 are the planes:
H1: xi·w + b = +1
H2: xi·w + b = −1
The points on the planes H1 and H2 are the support vectors.
d+ = the shortest distance to the closest positive point
d− = the shortest distance to the closest negative point
The margin of a separating hyperplane is d+ + d−.
Definition
SVM
Maximizing the margin
We want a classifier with as big a margin as possible.

Recall that the distance from a point (x0, y0) to a line Ax + By + c = 0 is |A x0 + B y0 + c| / sqrt(A² + B²).
The distance between H and H1 is: |w·x + b| / ||w|| = 1 / ||w||
The distance between H1 and H2 is: 2 / ||w||

In order to maximize the margin, we need to minimize ||w||, with the condition that there are no data points between H1 and H2:
xi·w + b ≥ +1 when yi = +1
xi·w + b ≤ −1 when yi = −1
These can be combined into: yi (xi·w + b) ≥ 1
Linear SVM
• An SVM used to classify data which is linearly separable is called a linear SVM
• A linear SVM searches for a hyperplane with the maximal margin
• This is why an SVM is also called a maximal margin classifier

(Figure slides: hyperplane, finding a hyperplane, hyperplane and classification, calculating the margin, support vectors, hard margin, hard margin primal problem, kernel function.)
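For reference, a sketch of fitting a linear SVM with scikit-learn on a toy 2-D dataset (assumes scikit-learn is installed; a large C approximates the hard margin discussed above):

from sklearn import svm

X = [[1, 1], [2, 1], [1, 2],       # class -1
     [4, 4], [5, 4], [4, 5]]       # class +1
y = [-1, -1, -1, 1, 1, 1]

clf = svm.SVC(kernel="linear", C=1e6)   # large C ~ hard margin
clf.fit(X, y)

print(clf.coef_, clf.intercept_)   # w and b of the separating hyperplane
print(clf.support_vectors_)        # the points lying on H1 and H2
print(clf.predict([[2, 2], [5, 5]]))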
Monte Carlo Artificial Intelligence:
Bayesian Networks

Why This Matters
• Bayesian networks have been the most important contribution to the
field of AI in the last 10 years
• Provide a way to represent knowledge in an uncertain domain and a
way to reason about this knowledge
• Many applications: medicine, factories, help desks, spam filtering, etc.

A Bayesian Network
A Bayesian network is made up of two parts:
1. A directed acyclic graph
2. A set of parameters

(Graph: Burglary → Alarm ← Earthquake.)

B      P(B)        E      P(E)
false  0.999       false  0.998
true   0.001       true   0.002

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95
A Directed Acyclic Graph

Burglary → Alarm ← Earthquake

1. A directed acyclic graph:
• The nodes are random variables (which can be discrete or continuous)
• Arrows connect pairs of nodes (X is a parent of Y if there is an arrow from node X to node Y).
A Directed Acyclic Graph
Burglary → Alarm ← Earthquake

• Intuitively, an arrow from node X to node Y means X has a direct influence on Y (we can say X has a causal effect on Y)
• Easy for a domain expert to determine these relationships
• The absence/presence of arrows will be made more precise later on
A Set of Parameters
(The CPTs for Burglary, Earthquake, and Alarm shown above.)

Each node Xi has a conditional probability distribution P(Xi | Parents(Xi)) that quantifies the effect of the parents on the node.
The parameters are the probabilities in these conditional probability distributions.
Because we have discrete random variables, we have conditional probability tables (CPTs).
A Set of Parameters
The conditional probability distribution for Alarm stores the probability distribution for Alarm given the values of Burglary and Earthquake.

B      E      A      P(A|B,E)
false  false  false  0.999
false  false  true   0.001
false  true   false  0.71
false  true   true   0.29
true   false  false  0.06
true   false  true   0.94
true   true   false  0.05
true   true   true   0.95

For a given combination of values of the parents (B and E in this example), the entries for P(A=true|B,E) and P(A=false|B,E) must add up to 1, e.g.
P(A=true|B=false,E=false) + P(A=false|B=false,E=false) = 1

If you have a Boolean variable with k Boolean parents, how big is the conditional probability table? How many entries are independently specifiable?
(Answer: the table has 2^k rows of two entries each, so 2^k entries are independently specifiable.)
Semantics of Bayesian
Networks

Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented directed acyclic graph, represented by the pair ⟨V, E⟩ where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any length are allowed.

Each vertex in V contains the following information:
• The name of a random variable
• A probability distribution table indicating how the probability of this variable’s values depends on all possible combinations of parental values.
Semantics of Bayesian Networks
Two ways to view Bayes nets:
1. A representation of a joint probability distribution
2. An encoding of a collection of conditional independence
statements

Bayesian Network Example
(Nodes: Weather; Cavity with children Toothache and Catch. Conditional independence notation: I(A, B | C) means P(A|B,C) = P(A|C).)

• Weather is independent of the other variables, I(Weather, Cavity), or
P(Weather) = P(Weather|Cavity) = P(Weather|Catch) = P(Weather|Toothache)
• Toothache and Catch are conditionally independent given Cavity:
I(Toothache, Catch | Cavity), meaning P(Toothache|Catch,Cavity) = P(Toothache|Cavity)
Conditional Independence
We can look at the actual graph structure and determine conditional independence relationships.
1. A node X is conditionally independent of its non-descendants (Z1, …, Zn), given its parents (U1, …, Um).
A Representation of the Full Joint Distribution
• We will use the following abbreviations:
• P(x1, …, xn) for P(X1 = x1 ∧ … ∧ Xn = xn)
• parents(Xi) for the values of the parents of Xi
• From the Bayes net, we can calculate:
P(x1, …, xn) = Πi=1..n P(xi | parents(Xi))
The Full Joint Distribution
P(x1, …, xn)
= P(xn | xn−1, …, x1) P(xn−1, …, x1)    (chain rule)
= P(xn | xn−1, …, x1) P(xn−1 | xn−2, …, x1) P(xn−2, …, x1)    (chain rule)
= P(xn | xn−1, …, x1) P(xn−1 | xn−2, …, x1) … P(x2 | x1) P(x1)    (chain rule)
= Πi=1..n P(xi | xi−1, …, x1)
= Πi=1..n P(xi | parents(xi))    ← we’ll look at this step more closely
The Full Joint Distribution
Πi=1..n P(xi | xi−1, …, x1) = Πi=1..n P(xi | parents(xi))

To be able to do this, we need two things:
1. Parents(Xi) ⊆ {Xi−1, …, X1}
This is easy: we just label the nodes according to the partial order in the graph
2. We need Xi to be conditionally independent of its predecessors given its parents
This can be done when constructing the network. Choose parents that directly influence Xi.
Example
(Network: Burglary → Alarm ← Earthquake; Alarm → JohnCalls, Alarm → MaryCalls.)

P(JohnCalls, MaryCalls, Alarm, Burglary, Earthquake)
= P(JohnCalls | Alarm) P(MaryCalls | Alarm) P(Alarm | Burglary, Earthquake) P(Burglary) P(Earthquake)
Conditional Independence
• There is a general topological criterion called d-separation
• d-separation determines whether a set of nodes X is independent of
another set Y given a third set E

D-separation
• We will use the notation I(X, Y | E) to mean that X and Y are
conditionally independent given E
• Theorem [Verma and Pearl 1988]:
If a set of evidence variables E d-separates X and
Y in the Bayesian Network’s graph, then
I(X, Y | E)
• d-separation can be determined in linear time using a DFS-like
algorithm

D-separation
• Let the evidence nodes be E ⊆ V (where V are the vertices or nodes in the graph), and let X and Y be distinct nodes in V – E.
• We say X and Y are d-separated by E in the Bayesian network if every undirected path between X and Y is blocked by E.
• What does it mean for a path to be blocked? There are 3 cases…
Conditional Independence
revisited - D-Separation

• To do inference in a Belief Network we have to


know if two sets of variables are conditionally
independent given a set of evidence.

• Method to do this is called Direction-Dependent


Separation or D-Separation.

D-Separation contd.

• If every undirected path from a node in X to a


node in Y is d-separated by E, then X and Y are
conditionally independent given E.

• X is a set of variables with unknown values


• Y is a set of variables with unknown values
• E is a set of variables with known values.

D-Separation contd.
• A set of nodes, E, d-separates two sets of nodes, X and Y, if every undirected path from a node in X to a node in Y is blocked given E.

• A path is blocked given a set of nodes, E, if there is a node Z on the path such that:
1) Z is in E and Z has one arrow leading in and one leading out, or
2) Z is in E and has both arrows leading out, or
3) Neither Z nor any descendant of Z is in E, and both path arrows lead in to Z.

Blocking
(Figure: a path from X to Y through a node Z, with evidence set E, illustrating the three blocking configurations.)
Case 1
There exists a node N on the path such that
• It is in the evidence set E (shaded grey)
• The arcs putting N in the path are “tail-to-tail”.

X ← N → Y
X = “Owns expensive car”, N = “Rich”, Y = “Owns expensive home”

The path between X and Y is blocked by N
Case 2
There exists a node N on the path such that
• It is in the evidence set E
• The arcs putting N in the path are “tail-to-head”.

X → N → Y
X = Education, N = Job, Y = Rich

The path between X and Y is blocked by N
Case 3
There exists a node N on the path such that
• It is NOT in the evidence set E (not shaded)
• Neither are any of its descendants
• The arcs putting N in the path are “head-to-head”.

X → N ← Y

The path between X and Y is blocked by N
(Note N is not in the evidence set)
Case 3 (Explaining Away)
Burglary Earthquake

Alarm

Your house has a twitchy burglar alarm that is also sometimes triggered by
earthquakes
Earth obviously doesn’t care if your house is currently being broken into
While you are on vacation, one of your nice neighbors calls and lets you
know your alarm went off

Case 3 (Explaining Away)
Burglary Earthquake

Alarm

But if you knew that a medium-sized earthquake happened, then you’re probably relieved that it’s probably not a burglar
The earthquake “explains away” the hypothetical burglar
This means that Burglary and Earthquake are not independent given Alarm.
But Burglary and Earthquake are independent given no evidence, i.e. learning about an earthquake when you know nothing about the status of your alarm doesn’t give you any information about the burglary
d-separation Recipe
• To determine if I(X, Y | E), ignore the directions of the arrows, find all
paths between X and Y
• Now pay attention to the arrows. Determine if the paths are blocked
according to the 3 cases
• If all the paths are blocked, X and Y are d-separated given E
• Which means they are conditionally independent given E

Conditional Independence
• Note: D-separation only finds random variables
that are conditionally independent based on the
topology of the network
• Some random variables that are not d-separated
may still be conditionally independent because of
the probabilities in their CPTs

Bayesian Network - Definition

• Causal Structure
• Interconnected Nodes
• Directed Acyclic Links
• Joint Distribution formed
from conditional
distributions at each node.

Earthquake or Burglar

Burglary Earthquake

Alarm

Mary Calls John Calls


Bayesian Network for Alarm Domain
(Graph: Burglary → Alarm ← Earthquake; Alarm → Mary Calls, Alarm → John Calls.)

P(B) = .001        P(E) = .002

B  E  P(A)
T  T  .95
T  F  .94
F  T  .29
F  F  .001

A  P(M)      A  P(J)
T  .70       T  .90
F  .01       F  .05
Retrieving Probabilities from the Conditional Distributions
• P(x1, …, xn) = Πi=1..n P(xi | Parents(xi))
• E.g.
P(J ∧ M ∧ A ∧ ¬B ∧ ¬E)
= P(J|A) P(M|A) P(A|¬B,¬E) P(¬B) P(¬E)
= 0.9 × 0.7 × 0.001 × 0.999 × 0.998
= 0.00062
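A short Python sketch reading this joint probability off the CPTs of the alarm network above:

# CPTs from the alarm network (only P(node=true | parents) is stored)
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                      # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                      # P(M=true | A)

def joint(j, m, a, b, e):
    # product of each node's conditional probability given its parents
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return pj * pm * pa * P_B[b] * P_E[e]

print(joint(True, True, True, False, False))   # ~0.00062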
Constructing A Network - Node Ordering and Compactness

• Adding nodes in the order Mary Calls, John Calls, Alarm, Burglary, Earthquake:
(Figure: the network built with this ordering, with Mary Calls and John Calls at the top, Alarm in the middle, and Burglary and Earthquake at the bottom.)

Node Ordering and Compactness contd.
• Adding nodes in the order Mary Calls, John Calls, Earthquake, Burglary, Alarm:
(Figure: the network built with this ordering, with Earthquake, Burglary, and Alarm below the call nodes; a poor node ordering yields a less compact network.)
Inference
• Diagnostic Inferences (effects to causes)
• Causal Inferences (causes to effects)
• Intercausal Inferences - or ‘Explaining Away’ (between causes of
common effect)
• Mixed Inferences (combination of two or more of the above)

D-Separation - Example
(Car network: Battery → Radio, Battery → Ignition, Ignition → Starts, Petrol → Starts, Starts → Moves.)
• Moves and Battery are independent given that Ignition is known
• Moves and Radio are independent if it is known that the Battery works
• Petrol and Radio are independent given no evidence, but are dependent given evidence of Starts
Inference contd.
(Figure: four query/evidence patterns: Diagnostic (evidence below the query), Causal (evidence above the query), Intercausal (query and evidence are causes of a common observed effect), and Mixed.)

Inference contd.
(Figure: the alarm network, Burglary → Alarm ← Earthquake with Alarm → Mary Calls and Alarm → John Calls, as an example for these inference types.)
Inference in Singly Connected Networks
• E.g. P(X|E) involves computing two values:
• Causal support (evidence variables above X, connected through its parents)
• Evidential support (evidence variables below X, connected through its children)

• The algorithm can perform in linear time.
Inference Algorithm
(Figure: computation spreads out from the query node Q to the evidence nodes, root nodes, and leaf nodes; causal support flows from above Q, evidential support from below.)
Each recursive call excludes the node from which it was called.
Inference in Multiply Connected Networks

• Exact Inference is known to be NP-Hard


• Approaches include:
• Clustering
• Conditioning
• Stochastic Simulation
• Stochastic Simulation is most often used, particularly on large
networks.

Clustering
The multiply connected network Cloudy → Sprinkler, Cloudy → Rain, Sprinkler → Wet Grass, Rain → Wet Grass is transformed by merging Sprinkler and Rain into a single compound node Sprinkler&Rain, whose CPT is the joint of the two originals:

P(S&R | C)
C    TT    TF    FT    FF
T    .08   .02   .72   .18
F    .40   .10   .40   .10
Conditioning
(Figure: the Cloudy network is evaluated once for each instantiation of Cloudy (+cloudy and −cloudy), cutting the loop; the results are then combined.)
Stochastic Simulation - Example
(Network: A → B, A → C, B → D, C → D, C → E.)
P(A=1) = 0.2

A  P(B=1)      A  P(C=1)
0  0.2         0  0.05
1  0.8         1  0.2
Stochastic Simulation
Run repeated simulations to estimate the probability distribution

• Let Wx = the states of all other variables except x.
• Let the Markov blanket of a node be all of its parents, children and parents of children.
• The distribution of each node x, conditioned upon Wx, can be computed locally from the node's own probability together with its children's:
• P(a|Wa) = α · P(a) · P(b|a) · P(c|a)
• P(b|Wb) = α · P(b|a) · P(d|b,c)
• P(c|Wc) = α · P(c|a) · P(d|b,c) · P(e|c)
• Therefore, only the Markov blanket of a node is required to compute the distribution
The Algorithm
• Set all observed nodes to their values
• Set all other nodes to random values

• STEP 1
• Select a node randomly from the network
• According to the states of the node's Markov blanket, compute P(x = state | Wx) for all states
• STEP 2
• Use a random number generator that is biased according to the distribution computed in step 1 to select the next value of the node
• Repeat
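A sketch of this simulation loop on the alarm network, reusing the CPTs and joint() from the earlier snippet. On a network this small, normalizing the full joint over a node's two values is equivalent to using its Markov blanket; it estimates P(Burglary | JohnCalls, MaryCalls):

import random

def simulate(n_steps=100_000):
    state = {"J": True, "M": True,            # observed nodes clamped
             "A": random.random() < 0.5,      # others start at random values
             "B": random.random() < 0.5,
             "E": random.random() < 0.5}
    burglary_true = 0
    for _ in range(n_steps):
        x = random.choice(["A", "B", "E"])    # STEP 1: pick an unobserved node
        weights = {}
        for v in (True, False):               # score both states of x
            s = dict(state, **{x: v})
            weights[v] = joint(s["J"], s["M"], s["A"], s["B"], s["E"])
        p_true = weights[True] / (weights[True] + weights[False])
        state[x] = random.random() < p_true   # STEP 2: biased random choice
        burglary_true += state["B"]           # tally how often B was true
    return burglary_true / n_steps

print(simulate())   # estimate of P(B=true | J=true, M=true), roughly 0.28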
Algorithm contd.

• The final probability distribution of each unobserved node is calculated from


either:

1) the number of times each node took a particular state


2) the average conditional probability of each node taking a particular state given
the other variables states.

Section 2 - Research Issues in Uncertainty

1 Learning Belief Networks from Data

• Assume no knowledge of probability distributions or causal structure.
• Is it possible to infer both of these from data?

Case  Fraud  Gas  Jewellery  Age    Sex
1     No     No   No         30-50  F
2     No     No   No         30-50  M
3     Yes    Yes  Yes        >50    M
4     No     No   No         30-50  M
5     No     Yes  No         <30    F
6     No     No   No         <30    F
7     No     No   No         >50    M
8     No     No   Yes        30-50  F
9     No     Yes  No         <30    M
10    No     No   No         <30    F