AI ML Unit4
AI ML Unit4
Types of learning
Types of supervised learning
Machine learning examples
• Classification
• Spam / Ham
• Weather prediction
• Regression
• Stock market analysis
• Gold price prediction (IRCTC)?
• Similarity determination
• Collaborative filtering
• Clustering – structure in data
Supervised Learning
1
• Features: The attributes used to make the digit decision
• Pixels: (6,8)=ON
• Shape Patterns: NumComponents, AspectRatio, NumLoops
• …
??
Other Classification Tasks
• Classification: given inputs x, predict labels (classes) y
• Examples:
• Spam detection (input: document,
classes: spam / ham)
• OCR (input: images, classes: characters)
• Medical diagnosis (input: symptoms,
classes: diseases)
• Automatic essay grading (input: document,
classes: grades)
• Fraud detection (input: account activity,
classes: fraud / no fraud)
• Customer service email routing
• … many more
• Challenges
• What structure should the BN have?
• How should we learn its parameters?
Naïve Bayes for Digits
• Naïve Bayes: Assume all features are independent effects of the label
|Y| parameters
F1 F2 Fn
P(spam | w) = 98.9
Naïve Bayes Classifier: Training Dataset
age income studentcredit_rating
buys_computer
<=30 high no fair no
Class: <=30 high no excellent no
C1:buys_computer = ‘yes’ 31…40 high no fair yes
C2:buys_computer = ‘no’ >40 medium no fair yes
>40 low yes fair yes
Data to be classified: >40 low yes excellent no
31…40 low yes excellent yes
X = (age <=30,
<=30 medium no fair no
Income = medium, <=30 low yes fair yes
Student = yes >40 medium yes fair yes
Credit_rating = Fair) <=30 medium yes excellent yes
31…40 medium no excellent yes
31…40 high yes fair yes
>40 medium no excellent no
Naïve Bayes Classifier: An Example age income studentcredit_rating
buys_computer
<=30 high no fair no
<=30 high no excellent no
31…40 high no fair yes
• P(Ci): P(buys_computer = “yes”) = 9/14 = 0.643 >40
>40
medium
low
no fair
yes fair
yes
yes
>40 low yes excellent no
P(buys_computer = “no”) = 5/14= 0.357 31…40
<=30
low
medium
yes excellent
no fair
yes
no
<=30 low yes fair yes
• Evaluation Held-Out
• Accuracy: fraction of instances predicted correctly
Data
• Overfitting and generalization
• Want a classifier which does well on test data
• Overfitting: fitting the training data very closely, but not generalizing Test
well Data
• Underfitting: fits the training set poorly
Underfitting and Overfitting
30
Overfitting
25
20
Degree 15 polynomial
15
10
-5
-10
-15
0 2 4 6 8 10 12 14 16 18 20
Example: Overfitting
2 wins!!
Example: Overfitting
• Posteriors determined by relative probabilities (odds ratios):
• As an extreme case, imagine using the entire email as the only feature
• Would get the training data perfect (if deterministic labeling)
• Wouldn’t generalize at all
• Just making the bag-of-words assumption gives us some generalization, but isn’t enough
…
• Flips are i.i.d.:
• Independent events D={xi | i=1…n}, P(D | θ) = ΠiP(xi | θ)
• Identically distributed according to unknown distribution
• Sequence D of H Heads and T Tails
Maximum Likelihood Estimation
• Data: Observed set D of H Heads and T Tails
• Hypothesis space: Binomial distributions
• Learning: finding is an optimization problem
• What’s the objective function?
• Another option is to consider the most likely parameter value given the data
????
Unseen Events
Laplace Smoothing
• Laplace’s estimate:
• Pretend you saw every outcome
once more than you actually did
r r b
Decision Trees is one of the most widely used and practical methods
of inductive inference
Features
Method for approximating discrete-valued functions (including boolean)
Learned functions are represented as decision trees (or if-then-else rules)
Expressive hypotheses space, including disjunction
Robust to noisy data
Decision tree representation (PlayTennis)
+
When to use Decision Trees
Problem characteristics:
Instances can be described by attribute value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data samples
Robust to errors in training data
Missing attribute values
Different classification problems:
Equipment or medical diagnosis
Credit risk analysis
Several tasks in natural language processing
Top-down induction of Decision Trees
• ID3 (Quinlan, 1986) is a basic algorithm for learning DT's
• Given a training set of examples, the algorithms for building DT
performs search in the space of decision trees
• The construction of the tree is top-down. The algorithm is greedy.
• The fundamental question is “which attribute should be tested next?
Which question gives us more information?”
• Select the best attribute
• A descendent node is then created for each possible value of this
attribute and examples are partitioned according to this value
• The process is repeated for each successor node until all the
examples are classified correctly or there are no attributes left
Which attribute is the best classifier?
{D1, D2, D8} {D9, D11} {D4, D5, D10} {D6, D14}
No Yes Yes No
ID3: algorithm
ID3(X, T, Attrs) X: training examples:
T: target attribute (e.g. PlayTennis),
Attrs: other attributes, initially all attributes
Create Root node
If all X's are +, return Root with class +
If all X's are –, return Root with class –
If Attrs is empty return Root with class most common value of T in X
else
A best attribute; decision attribute for Root A
For each possible value vi of A:
- add a new branch below Root, for test A = vi
- Xi subset of X with A = vi
- If Xi is empty then add a new leaf with class the most common value of T in X
else add the subtree generated by ID3(Xi, T, Attrs {A})
return Root
Search space in Decision Tree learning
The search space is made by partial
decision trees
The algorithm is hill-climbing
The evaluation function is
information gain
The hypotheses space is complete
(represents all discrete-valued
functions)
The search maintains a single current
hypothesis
No backtracking; no guarantee of
optimality
It uses all the available examples (not
incremental)
May terminate earlier, accepting
noisy classes
Inductive bias in decision tree learning
(Outlook=Sunny)(Humidity=High) ⇒ (PlayTennis=No)
Why converting to rules?
Each distinct path produces a different rule: a condition removal may
be based on a local (contextual) criterion. Node pruning is global and
affects all the rules
In rule form, tests are not ordered and there is no book-keeping
involved when conditions (nodes) are removed
Converting to rules improves readability for humans
Dealing with continuous-valued
attributes
So far discrete values for attributes and for outcome.
Given a continuous-valued attribute A, dynamically create
a new attribute Ac
Ac = True if A < c, False otherwise
How to determine threshold value c ?
Example. Temperature in the PlayTennis example
Sort the examples according to Temperature
Temperature 40 48 | 60 72 80 | 90
PlayTennis No No 54 Yes Yes Yes 85 No
Determine candidate thresholds by averaging consecutive values
where there is a change in classification: (48+60)/2=54 and
(80+90)/2=85
Evaluate candidate thresholds (attributes) according to
information gain. The best is Temperature>54.The new attribute
competes with the other ones
Problems with information gain
• Natural bias of information gain: it favours attributes with
many possible values.
• Consider the attribute Date in the PlayTennis example.
• Date would have the highest information gain since it perfectly
separates the training data.
• It would be selected at the root resulting in a very broad tree
• Very good on the training, this tree would perform poorly in
predicting unknown instances. Overfitting.
• The problem is that the partition is too specific, too many
small classes are generated.
• We need to look at alternative measures …
An alternative measure: gain ratio
c |Si | |Si |
SplitInformation(S, A) − log2
i=1 |S | |S |
• Si are the sets obtained by partitioning on value i of A
• SplitInformation measures the entropy of S with respect to the values of A. The
more uniformly dispersed the data the higher it is.
Gain(S, A)
GainRatio(S, A)
SplitInformation(S, A)
• GainRatio penalizes attributes that split examples in many small classes such
as Date. Let |S |=n, Date splits examples in n classes
• SplitInformation(S, Date)= −[(1/n log2 1/n)+…+ (1/n log2 1/n)]= −log21/n =log2n
• Compare with A, which splits data in two even classes:
• SplitInformation(S, A)= − [(1/2 log21/2)+ (1/2 log21/2) ]= − [− 1/2 −1/2]=1
Adjusting gain-ratio
• Problem: SplitInformation(S, A) can be zero or very small when |Si |
≈ |S | for some value i
• To mitigate this effect, the following heuristics has been used:
1. compute Gain for each attribute
2. apply GainRatio only to attributes with Gain above average
• Other measures have been proposed:
• Distance-based metric [Lopez-De Mantaras, 1991] on the partitions of data
• Each partition (induced by an attribute) is evaluated according to the
distance to the partition that perfectly classifies the data.
• The partition closest to the ideal partition is chosen
Handling incomplete training data
How to cope with the problem that the value of some
attribute may be missing?
Example: Blood-Test-Result in a medical diagnosis problem
The strategy: use other examples to guess attribute
1. Assign the value that is most common among the training
examples at the node
2. Assign a probability to each value, based on frequencies, and
assign values to missing attribute, according to this probability
distribution
Missing values in new instances to be classified are
treated accordingly, and the most probable classification
is chosen (C4.5)
Handling attributes with different costs
• Instance attributes may have an associated cost: we
would prefer decision trees that use low-cost attributes
• ID3 can be modified to take into account costs:
1. Tan and Schlimmer (1990)
Gain2(S, A)
Cost(S, A)
2. Nunez (1988)
2Gain(S, A) 1
(Cost(A) + 1)w w ∈ [0,1]
A Simple Neuron
3-4-2 Network
Input Output
layer layer
Hidden Layer
Recurrent network
Recurrent Network with hidden neuron(s): unit delay
operator z-1 implies dynamic system
z-1
input
z-1 hidden
output
z-1
Neural Network Architectures
( )
Output
v
Input x2 w2 y
signal
Summing
function
xm wm
Synaptic
weights
Bias of a Neuron
• Bias b has the effect of applying an affine transformation to
u
v=u+b
• v is the induced field of the neuron
m
u wjxj
j 1
Bias as extra input
• Bias is an external parameter of the neuron. Can be modeled
by adding an extra input. m
w0
v wj xj
x0 = +1 j 0
x1
w0 b
w1 Activation
Local function
Field
( )
Input Output
v
signal x2 w2 y
Summing
function
xm wm Synaptic
weights
Dimensions of a Neural Network
• Various applications
Face Recognition
125
Perceptrons
• Initial proposal of connectionist networks
• Rosenblatt, 50’s and 60’s
• Essentially a linear discriminant composed of
nodes, weights
I1 W1 I1 W1
or
I2
W2
O I2
W2 O
W3 W3
I3 Activation Function I3
O i
1 : wi I i 0
0 : otherwise 1
Perceptron
127
Perceptron Example
2 .5
1
.3
=-1
Learning Procedure:
129
Perceptron training rule
wi wi + wi
where wi = (t – o) xi
Where:
• t = c(x) is target value
• o is perceptron output
• is small constant (e.g., .1) called learning rate
133
Gradient Descent (2/4)
Gradient
Training rule:
i.e.,
134
Gradient Descent (3/4)
135
Gradient Descent (4/4)
138
Incremental (Stochastic) Gradient Descent (2/2)
139
Difference
Perceptron Gradient descent
Training rules weights are Updated based on error in
updated based on error in unthresholded linear
threshold perceptron output combination of inputs.
For a linearly separable Converges only asymptotically
example, converges after a toward the minimum error
finite number of iterations hypothesis irrespective of the
linear / non-linear separable
data
142
Sigmoid Unit
Nice property:
We can derive gradient decent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units Backpropagation
143
Error Gradient for a Sigmoid Unit
But we know:
So:
144
Backpropagation Learning Principles:
Hidden Layers and Gradients
147
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global error minimum
• In practice, often works well (can run multiple times)
• Often include weight momentum
wi,j (n) = j xi,j + wi,j (n - 1)
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations slow!
• Using network after training is very fast
148
Learning Algorithm:
Backpropagation
Each neuron is composed of two units. First unit adds products of weights coefficients and
input signals. The second unit realise nonlinear function, called neuron transfer (activation)
function. Signal e is adder output signal, and y = f(e) is output signal of nonlinear element.
Signal y is also output signal of neuron.
Learning Algorithm:
Backpropagation
To teach the neural network we need training data set. The training data set consists of
input signals (x1 and x2 ) assigned with corresponding target (desired output) z.
The network training is an iterative process. In each iteration weights coefficients of nodes
are modified using new data from training data set. Modification is calculated using
algorithm described below:
Each teaching step starts with forcing both input signals from training set. After this stage
we can determine output signals values for each neuron in each network layer.
Learning Algorithm:
Backpropagation
Pictures below illustrate how signal is propagating through the network,
Symbols w(xm)n represent weights of connections between network input xm and
neuron n in input layer. Symbols yn represents output signal of neuron n.
Learning Algorithm:
Backpropagation
Pictures below illustrate how signal is propagating through the network,
Symbols w(xm)n represent weights of connections between network input xm and
neuron n in input layer. Symbols yn represents output signal of neuron n.
Learning Algorithm:
Backpropagation
Learning Algorithm:
Backpropagation
Learning Algorithm:
Backpropagation
Propagation of signals through the hidden layer. Symbols wmn represent weights
of connections between output of neuron m and input of neuron n in the next
layer.
Learning Algorithm:
Backpropagation
Learning Algorithm:
Backpropagation
Learning Algorithm:
Backpropagation
Propagation of signals through the output layer.
Learning Algorithm:
Backpropagation
In the next algorithm step the output signal of the network y is
compared with the desired output value (the target), which is found in
training data set. The difference is called error signal d of output layer
neuron
Learning Algorithm:
Backpropagation
The idea is to propagate error signal d (computed in single teaching step)
back to all neurons, which output signals were input for discussed
neuron.
Learning Algorithm:
Backpropagation
The idea is to propagate error signal d (computed in single teaching step)
back to all neurons, which output signals were input for discussed
neuron.
Learning Algorithm:
Backpropagation
The weights' coefficients wmn used to propagate errors back are equal to
this used during computing output value. Only the direction of data flow
is changed (signals are propagated from output to inputs one after the
other). This technique is used for all network layers. If propagated errors
came from few neurons they are added. The illustration is below:
Learning Algorithm:
Backpropagation
When the error signal for each neuron is computed, the weights
coefficients of each neuron input node may be modified. In formulas
below df(e)/de represents derivative of neuron activation function
(which weights are modified).
Learning Algorithm:
Backpropagation
When the error signal for each neuron is computed, the weights
coefficients of each neuron input node may be modified. In formulas
below df(e)/de represents derivative of neuron activation function
(which weights are modified).
Learning Algorithm:
Backpropagation
When the error signal for each neuron is computed, the weights
coefficients of each neuron input node may be modified. In formulas
below df(e)/de represents derivative of neuron activation function
(which weights are modified).
Main Idea
• Max-Margin Classifier
• Formalize notion of the best linear separator
• Lagrangian Multipliers
• Way to convert a constrained optimization problem to one that is easier to
solve
• Kernels
• Projecting data into higher-dimensional space makes it linearly separable
• Complexity
• Depends only on the number of training examples, not on dimensionality of
the kernel space!
Text categorization
Tennis example
Temperature
Humidity
= play tennis
= do not play tennis
Linear Support Vector Machines
Data: <xi,yi>, i=1,..,l
xi Rd
yi {-1,+1}
x2
=+1
=-1
x1
Linear SVM 2
f(x) =-1
=+1
H1
H
H2
Recall the distance from a point(x0,y0) to a line: d+
Ax+By+c = 0 is|A x0 +B y0 +c|/sqrt(A2+B2) d-
The distance between H and H1 is:
|w•x+b|/||w||=1/||w||
202
Why This Matters
• Bayesian networks have been the most important contribution to the
field of AI in the last 10 years
• Provide a way to represent knowledge in an uncertain domain and a
way to reason about this knowledge
• Many applications: medicine, factories, help desks, spam filtering, etc.
203
A Bayesian Network
B P(B) E P(E) A Bayesian network is made
false 0.999 false 0.998 up of two parts:
true 0.001 true 0.002 1. A directed acyclic graph
Burglary Earthquake
2. A set of parameters
B E A P(A|B,E)
Alarm
false false false 0.999
false false true 0.001
false true false 0.71
false true true 0.29
true false false 0.06
true false true 0.94
true true false 0.05
true true true 0.95
A Directed Acyclic Graph
Burglary Earthquake
Alarm
205
A Directed Acyclic Graph
Burglary Earthquake
Alarm
206
A Set of Parameters
B P(B) E P(E) Burglary Earthquake
false 0.999 false 0.998
true 0.001 true 0.002
Alarm
B E A P(A|B,E)
false false false 0.999
false false true 0.001 Each node Xi has a conditional probability
false true false 0.71 distribution P(Xi | Parents(Xi)) that quantifies the
false true true 0.29 effect of the parents on the node
true false false 0.06 The parameters are the probabilities in these
true false true 0.94 conditional probability distributions
true true false 0.05
Because we have discrete random variables, we
true true true 0.95
have conditional probability tables (CPTs)
207
A Set of Parameters
Conditional Probability Stores the probability distribution for
Distribution for Alarm Alarm given the values of Burglary
and Earthquake
B E A P(A|B,E)
false false false 0.999
For a given combination of values of the
false false true 0.001
parents (B and E in this example), the
false true false 0.71
entries for P(A=true|B,E) and
false true true 0.29 P(A=false|B,E) must add up to 1 eg.
true false false 0.06 P(A=true|B=false,E=false) +
true false true 0.94 P(A=false|B=false,E=false)=1
true true false 0.05
true true true 0.95
209
Bayes Nets Formalized
A Bayes net (also called a belief network) is an augmented
directed acyclic graph, represented by the pair V , E
where:
• V is a set of vertices.
• E is a set of directed edges joining vertices. No loops of any
length are allowed.
210
Semantics of Bayesian Networks
Two ways to view Bayes nets:
1. A representation of a joint probability distribution
2. An encoding of a collection of conditional independence
statements
211
Bayesian Network Example
Weather Cavity
213
A Representation of the Full Joint Distribution
• We will use the following abbrevations:
• P(x1, …, xn) for P( X1 = x1 … Xn = xn)
• parents(Xi) for the values of the parents of Xi
• From the Bayes net, we can calculate:
n
P( x1 ,..., xn ) P( xi | parents( X i ))
i 1
214
The Full Joint Distribution
P( x1 ,..., xn )
P( xn | xn 1 ,..., x1 ) P( xn 1 ,..., x1 ) ( Chain Rule)
215
The Full Joint Distribution
n n
P( x | x
i 1
i i 1 ,..., x1 ) P( xi | parents( xi ))
i 1
216
Example
Burglary Earthquake
Alarm
JohnCalls MaryCalls
217
Conditional Independence
• There is a general topological criterion called d-separation
• d-separation determines whether a set of nodes X is independent of
another set Y given a third set E
218
D-separation
• We will use the notation I(X, Y | E) to mean that X and Y are
conditionally independent given E
• Theorem [Verma and Pearl 1988]:
If a set of evidence variables E d-separates X and
Y in the Bayesian Network’s graph, then
I(X, Y | E)
• d-separation can be determined in linear time using a DFS-like
algorithm
219
D-separation
• Let evidence nodes E V (where V are the vertices or nodes in the
graph), and X and Y be distinct nodes in V – E.
• We say X and Y are d-separated by E in the Bayesian network if every
undirected path between X and Y is blocked by E.
• What does it mean for a path to be blocked? There are 3 cases…
220
Conditional Independence
revisited - D-Separation
221
D-Separation contd.
222
D-Separation contd.
• A set of nodes, E, d-separates two sets of nodes, X and Y, if every
undirected path from a node in X to a node in Y is Blocked given E.
223
Blocking
X Y
E
Z
224
Case 1
There exists a node N on the path such that
• It is in the evidence set E (shaded grey)
• The arcs putting N in the path are “tail-to-
tail”.
X N Y
Alarm
Your house has a twitchy burglar alarm that is also sometimes triggered by
earthquakes
Earth obviously doesn’t care if your house is currently being broken into
While you are on vacation, one of your nice neighbors calls and lets you
know your alarm went off
228
Case 3 (Explaining Away)
Burglary Earthquake
Alarm
230
Conditional Independence
• Note: D-separation only finds random variables
that are conditionally independent based on the
topology of the network
• Some random variables that are not d-separated
may still be conditionally independent because of
the probabilities in their CPTs
231
Bayesian Network - Definition
• Causal Structure
• Interconnected Nodes
• Directed Acyclic Links
• Joint Distribution formed
from conditional
distributions at each node.
232
Earthquake or Burglar
Burglary Earthquake
Alarm
B E P(A)
T T .95
T F .94
F T .29 Alarm
F F .001
A P(M) A P(J)
T .70 T .90
F .01 F .05
Mary Calls John Calls
234
Retrieving Probabilities from the Conditional
Distributions
• P(x1,…xn) = P P(xi|Parents(x
n i))
• E.g. i=1
P(J & M & A & ¬B & ¬E)
= P(J|A)P(M|A)P(A|¬B,¬E)P(¬B)P(¬E)
= 0.9 x 0.7 x 0.001 x 0.999 x 0.998
= 0.00062
235
Constructing A Network
- Node Ordering and Compactness
Earthquake
236
Node Ordering and Compactness contd.
• Mary Calls
• Johns Calls
• Earthquake
• Burglary
• Alarm
237
Node Ordering and Compactness contd.
• Mary Calls
• Johns Calls
Mary Calls
• Earthquake John Calls
• Burglary
• Alarm
Earthquake
Burglary
Alarm
238
Inference
• Diagnostic Inferences (effects to causes)
• Causal Inferences (causes to effects)
• Intercausal Inferences - or ‘Explaining Away’ (between causes of
common effect)
• Mixed Inferences (combination of two or more of the above)
239
D-Separation - Example
• Moves and Battery are independent
given it is known about Ignition Battery
• Moves and Radio are independent if it is
known that Battery works
• Petrol and Radio are independent given Radio Ignition
no evidence. But are dependent given
evidence of Starts Petrol
Starts
Moves
240
Inference contd.
Q E Q E E
E Q E
241
Inference contd.
Burglary Earthquake
Alarm
• E.g. P(X|E):
Involves computing two values:
• Causal Support (evidence variables above X connected through it’s parents)
• Evidential Support (evidence variables below X connected through it’s
children
243
Inference Algorithm
244
Inference in Multiply Connected Networks
245
Clustering
Cloudy Cloudy
246
Conditioning
+ + - -
Cloudy Cloudy Cloudy Cloudy
247
Stochastic Simulation - Example
P(A=1) = 0.2
A
A P(B=1) A p(C=1)
0 0.2 B C 0 0.05
1 0.8 1 0.2
D E
248
Stochastic Simulation
Run repeated simulations to estimate the probability distribution
• STEP 1
• Select a node randomly from the network
• According to the states of the node’s markov blanket,
compute P(x=state, Wx) for all states
• STEP 2
• Use a random number generator that is biased according to
the distribution computed in step 1 to select the next value
of the node
• Repeat
250
Algorithm contd.
251
Section 2 - Research Issues in Uncertainty
252