
UNIT II

PROBABILISTIC REASONING
Full Joint Probability Distribution
A full joint distribution is a probability distribution over multiple variables. For example,
Weather = {sunny, rain, cloudy, snow}
Cavity = {cavity, ¬cavity}
Consider the joint probability distribution of Weather and Cavity.

Here, Weather and Cavity are the variables. Weather has 4 values and Cavity has 2 values, so the possible combinations of (W, C) number 4 × 2 = 8.
So, we want the probability distribution over these 2 variables. The possible worlds include sunny weather with a cavity, sunny weather with no cavity, rain with a cavity, rain without a cavity, and so on, giving 4 × 2 = 8 combinations in all. Every possible world is assigned some probability, and some possible worlds are more probable than others. The probability assigned to each combination of values is called a joint probability.

When you have multiple variables, a probability distribution over those variables is known as a joint probability distribution.
If it enumerates every possible world, it becomes a full joint probability distribution. In other words, a joint probability distribution that covers the complete set of random variables is called a full joint probability distribution.
A simple example: if the problem world consists of 3 random variables, Weather, Cavity and Toothache, then the full joint probability distribution is P(Weather, Cavity, Toothache), represented as a 4 × 2 × 2 table of probabilities.
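To make the idea concrete, here is a minimal Python sketch of such a table, with illustrative probabilities (assumed numbers, not taken from the text) assigned to the 4 × 2 = 8 possible worlds of Weather and Cavity:

from itertools import product

weather_values = ["sunny", "rain", "cloudy", "snow"]
cavity_values = [True, False]

# Hypothetical probabilities, chosen only so that the 8 entries sum to 1.
joint = {
    ("sunny", True): 0.144,  ("sunny", False): 0.576,
    ("rain", True): 0.02,    ("rain", False): 0.08,
    ("cloudy", True): 0.016, ("cloudy", False): 0.064,
    ("snow", True): 0.02,    ("snow", False): 0.08,
}

# Every possible world (w, c) gets exactly one probability.
assert set(joint) == set(product(weather_values, cavity_values))
assert abs(sum(joint.values()) - 1.0) < 1e-9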
Inference using Full Joint Distribution
Here we study a method for probabilistic inference: how to infer a new fact under uncertainty. When you infer a new fact, that fact will have some probability attached, because uncertainty is present, and the given data itself has a probability.

Consider an instance with three Boolean variables: Toothache, Cavity and Catch (the dentist's nasty steel probe catches in my tooth). The full joint distribution of these three variables is a table with 2 × 2 × 2 = 8 entries.
How do we infer the probability of a proposition (a new fact)? Here, our proposition is Cavity ∨ Toothache: only Cavity, only Toothache, or both.
There is a direct way to calculate the probability of any proposition, simple or complex: simply identify those possible worlds in which the proposition is true and add up their probabilities. For example, there are six possible worlds in which Cavity ∨ Toothache holds:
P(Cavity ∨ Toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28
Probabilistic inference means the computation of posterior probabilities for query propositions from observed evidence. The knowledge base used for answering the query is represented as the full joint distribution.
Consider a simple example consisting of three Boolean variables: Toothache, Cavity, Catch. The full joint distribution is a 2 × 2 × 2 table.
One particularly common task in inference is to extract the distribution over some subset of the variables, or over a single variable. This distribution is called a marginal probability.
For example: P(Cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
This process is called marginalization, or summing out, because the variables other than Cavity (the variable whose probability is being computed) are summed out.
The general marginalization rule is as follows. For any sets of variables Y and Z,

P(Y) = Σz P(Y, z) ….(7.1.3)

This indicates that the distribution over Y can be obtained by summing out all the other variables from any joint distribution containing Y.
A variant of the general marginalization rule involves conditional probabilities, using the product rule:

P(Y) = Σz P(Y | z) P(z) ….(7.1.4)

This rule is called conditioning.
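As an illustrative Python sketch, marginalization is just a sum over the hidden variables. The table below uses the same numbers as this section; the two entries not quoted in the text are filled in with the standard textbook values so the table sums to 1.

full_joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def marginal_cavity(joint):
    # P(Cavity) = sum over Toothache and Catch of P(toothache, catch, Cavity)
    p = {True: 0.0, False: 0.0}
    for (toothache, catch, cavity), prob in joint.items():
        p[cavity] += prob  # sum out Toothache and Catch
    return p

print(marginal_cavity(full_joint))  # ≈ {True: 0.2, False: 0.8}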


For example, computing the probability of a cavity, given evidence of a toothache, is done as follows:
P(Cavity | Toothache) = P(Cavity Ʌ Toothache) / P(Toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.12 / 0.2 = 0.6

Normalization constant: a value that remains constant across the distribution and ensures that the distribution sums to 1. The symbol α is used to denote this constant.
For example: We can compute the probability of a cavity, given
evidence of a toothache, as follows:
P(Cavity | Toothache) = P(Cavity Ʌ Toothache) / P(Toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6
Just to check, we can also compute the probability that there is no cavity, given a toothache:
P(¬Cavity | Toothache) = P(¬Cavity Ʌ Toothache) / P(Toothache)
= (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4

Notice that in these two calculations the term 1/P(Toothache) remains constant, no matter which value of Cavity we calculate. Using the α notation, we can write the two equations as one:

P(Cavity | Toothache) = αP(Cavity, Toothache)

= α [P(Cavity, Toothache, Catch) + P(Cavity, Toothache, ¬ Catch)]


= α [< 0.108, 0.016> + <0.012, 0.064>]

= α <0.12, 0.08> = <0.6, 0.4>
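A small Python sketch of the same normalization step, reusing an 8-entry joint table that mirrors the numbers above:

full_joint = {
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}

def condition_on_toothache(joint):
    # Collect the unnormalized sums for each value of Cavity ...
    unnormalized = {True: 0.0, False: 0.0}
    for (toothache, catch, cavity), prob in joint.items():
        if toothache:  # keep only the worlds in which toothache holds
            unnormalized[cavity] += prob
    # ... then normalize once: alpha = 1 / P(toothache)
    alpha = 1.0 / sum(unnormalized.values())
    return {cavity: alpha * p for cavity, p in unnormalized.items()}

print(condition_on_toothache(full_joint))  # ≈ {True: 0.6, False: 0.4}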

From the above, one can extract a general inference procedure for computing conditional probabilities, such as the probability of a cavity given evidence of a toothache. To reduce the computational cost, the denominator need only be computed once, because it is the same for every value of the query variable; this is exactly what the normalization constant α accomplishes.
INDEPENDENCE
Independence is a relationship between variables (or sets of variables) in a joint distribution. It is also called marginal or absolute independence of the variables.
Independence indicates whether two variables affect each other's probabilities.
The independence between variables X and Y can be written as follows:

P(X | Y) = P(X) or P(Y | X) = P(Y) or P(X, Y) = P(X) P(Y)

For example, the weather is independent of one's dental problems, which can be expressed by the equation:

P(Toothache, Catch, Cavity, Weather) = P(Toothache, Catch, Cavity) P(Weather)
The following diagram shows how a large joint distribution is factored into smaller distributions using absolute independence: Weather and the dental problem variables are independent.
Thus, the 32-element table for four variables can be constructed
from one 8-element table and one 4-element table.
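A Python sketch of this factoring, using the dental table from above and assumed (made-up) weather probabilities, shows that 8 + 4 numbers are enough to rebuild all 32 joint entries:

dental = {  # P(Toothache, Catch, Cavity): the same 8 entries as above
    (True,  True,  True):  0.108, (True,  False, True):  0.012,
    (False, True,  True):  0.072, (False, False, True):  0.008,
    (True,  True,  False): 0.016, (True,  False, False): 0.064,
    (False, True,  False): 0.144, (False, False, False): 0.576,
}
weather = {"sunny": 0.6, "rain": 0.1, "cloudy": 0.29, "snow": 0.01}  # assumed values

# Absolute independence lets us build the 32-entry joint as a product.
joint = {d + (w,): p_d * p_w
         for d, p_d in dental.items()
         for w, p_w in weather.items()}

print(len(joint))                             # 32 entries from only 8 + 4 numbers
print(abs(sum(joint.values()) - 1.0) < 1e-9)  # True: still a valid distribution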
Bayes’ Rule and its use
Recall the two forms of the product rule for two events:
P(a Ʌ b) = P(a | b) P(b) = P(b | a) P(a)
In a task such as medical diagnosis, we often have conditional probabilities on causal relationships. The doctor knows P(symptoms | disease) and wants to derive a diagnosis, P(disease | symptoms).
Generalized Bayes' rule is,

P(Y | X) = P(X | Y) P(Y) / P(X)


(where the terms have the same meanings as before)
We can have more general version, conditionalized on some background
evidence e.
P(Y | X, e) = P(X |Y, e) P(Y | e) / P(X | e)
The general form of Bayes' rule with normalization is:
P(y | x) = α P(x | y) P(y).
Bayesian inference
Bayes' theorem in Artificial intelligence
Bayes' theorem:
Bayes' theorem is also known as Bayes' rule, Bayes' law, or Bayesian
reasoning, which determines the probability of an event with uncertain
knowledge. In probability theory, it relates the conditional probability and
marginal probabilities of two random events. Bayes' theorem was named
after the British mathematician Thomas Bayes. The Bayesian inference is
an application of Bayes' theorem, which is fundamental to Bayesian statistics.
It is a way to calculate the value of P(B|A) with the knowledge of P(A|B).
Bayes' theorem allows updating the probability prediction of an event by
observing new information of the real world.
Example: if the likelihood of cancer is related to a person's age, then by using Bayes' theorem we can determine the probability of cancer more accurately with the help of age.
Bayes' theorem can be derived using the product rule and the conditional probability of event A given event B. From the product rule we can write:

P(A ⋀ B) = P(B|A) P(A)

Similarly, the probability of event B given event A:

P(A ⋀ B) = P(A|B) P(B)

Equating the right-hand sides of both equations, we get:

P(A|B) = P(B|A) P(A) / P(B)    ….(a)

The above equation (a) is called Bayes' rule or Bayes' theorem.
This equation is the basis of most modern AI systems for probabilistic inference.

It shows the simple relationship between joint and conditional probabilities. Here, P(A|B) is known as the posterior, which we need to calculate; it is read as the probability of hypothesis A given that evidence B has occurred. P(B|A) is called the likelihood: assuming the hypothesis is true, it is the probability of the evidence. P(A) is called the prior probability, the probability of the hypothesis before considering the evidence. P(B) is called the marginal probability, the pure probability of the evidence.
In equation (a), in general, we can write P(B) = Σi P(Ai) P(B|Ai), hence Bayes' rule can be written as:

P(Ai|B) = P(B|Ai) P(Ai) / Σk P(Ak) P(B|Ak)

where A1, A2, A3, …, An is a set of mutually exclusive and exhaustive events.

Applying Bayes' rule:


Bayes' rule allows us to compute the single term P(B|A) in terms of P(A|B), P(B), and P(A). This is very useful in cases where we have good estimates of these three terms and want to determine the fourth one.
Suppose we want to perceive the effect of some unknown cause and compute that cause; then Bayes' rule becomes:

P(cause | effect) = P(effect | cause) P(cause) / P(effect)
Example-2:
Question: From a standard deck of playing cards, a single card is drawn. The probability that the card is a king is 4/52. Calculate the posterior probability P(King | Face), i.e. the probability that the drawn card is a king given that it is a face card.
Solution:

P(King): probability that the card is a king = 4/52 = 1/13
P(Face): probability that the card is a face card = 12/52 = 3/13
P(Face | King): probability that the card is a face card given that it is a king = 1

Putting all the values into Bayes' rule, we get:

P(King | Face) = P(Face | King) P(King) / P(Face) = (1 × 1/13) / (3/13) = 1/3
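A short Python check of this example (the fractions and the result follow directly from Bayes' rule):

from fractions import Fraction

p_king = Fraction(4, 52)          # = 1/13
p_face = Fraction(12, 52)         # = 3/13 (J, Q, K in each of the 4 suits)
p_face_given_king = Fraction(1)   # every king is a face card

# Bayes' rule: P(King | Face) = P(Face | King) P(King) / P(Face)
p_king_given_face = p_face_given_king * p_king / p_face
print(p_king_given_face)          # 1/3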

Application of Bayes' theorem in Artificial intelligence:


Following are some applications of Bayes' theorem:
o It is used to calculate the next step of a robot when the already executed step is given.
o Bayes' theorem is helpful in weather forecasting.
o It can solve the Monty Hall problem.
Probabilistic reasoning:

Probabilistic reasoning is a way of knowledge representation where we


apply the concept of probability to indicate the uncertainty in knowledge. In
probabilistic reasoning, we combine probability theory with logic to handle
the uncertainty. We use probability in probabilistic reasoning because it
provides a way to handle the uncertainty that is the result of someone's
laziness and ignorance.

In the real world, there are lots of scenarios, where the certainty of something
is not confirmed, such as "It will rain today," "behavior of someone for some
situations," "A match between two teams or two players." These are probable
sentences for which we can assume that it will happen but not sure about it, so
here we use probabilistic reasoning.
Need for probabilistic reasoning in AI:
o When there are unpredictable outcomes.
o When the specifications or possibilities of predicates become too large to handle.
o When an unknown error occurs during an experiment.
In probabilistic reasoning, there are two main ways to solve problems with uncertain knowledge:
o Bayes' rule
o Bayesian statistics
As probabilistic reasoning uses probability and related terms, let's first understand some common terms.
Probability: Probability can be defined as the chance that an uncertain event will occur. It is the numerical measure of the likelihood that an event will occur. The value of a probability always lies between 0 and 1:
0 ≤ P(A) ≤ 1, where P(A) is the probability of an event A.
P(A) = 0 indicates that event A will not occur; P(A) = 1 indicates that event A is certain to occur.
We can find the probability of an uncertain event by using the formula:
P(A) = Number of favourable outcomes / Total number of outcomes

o P(¬A) = probability of event A not happening.


o P(¬A) + P(A) = 1.
Event: Each possible outcome of a variable is called an event.
Sample space: The collection of all possible events is called sample
space.
Random variables: Random variables are used to represent the events
and objects in the real world.

Prior probability: The prior probability of an event is the probability computed before observing new information.

Posterior probability: The probability that is calculated after all evidence or information has been taken into account. It is a combination of the prior probability and the new information.
Conditional probability:
Conditional probability is the probability of an event occurring given that another event has already happened.
Suppose we want to calculate the probability of event A given that event B has already occurred, "the probability of A under the condition B". It can be written as:

P(A | B) = P(A ⋀ B) / P(B)

where P(A ⋀ B) = joint probability of A and B, and P(B) = marginal probability of B.
If the probability of A is given and we need to find the probability of B, then it is given as:

P(B | A) = P(A ⋀ B) / P(A)

This can be explained using a Venn diagram: once B has occurred, the sample space is reduced to the set B, and we can calculate the probability of event A given B by dividing P(A ⋀ B) by P(B).
Example:
In a class, 70% of the students like English and 40% of the students like both English and Mathematics. What percentage of the students who like English also like Mathematics?
Solution:

Let A be the event that a student likes Mathematics and B be the event that a student likes English. Then
P(A | B) = P(A ⋀ B) / P(B) = 0.4 / 0.7 ≈ 0.57
Hence, 57% of the students who like English also like Mathematics.
1. Bayesian networks or Belief networks
Bayesian Belief Network in artificial intelligence
A Bayesian belief network is a key technology for dealing with probabilistic events and for solving problems that involve uncertainty. We can define a Bayesian network as:
"A Bayesian network is a probabilistic graphical model which represents a set of variables and their conditional dependencies using a directed acyclic graph."
It is also called a Bayes network, belief network, decision network, or Bayesian model.
Bayesian networks are probabilistic, because these networks are built
from a probability distribution, and also use probability theory for
prediction and anomaly detection.
Real world applications are probabilistic in nature, and to represent
the relationship between multiple events, we need a Bayesian
network. It can also be used in various tasks including prediction,
anomaly detection, diagnostics, automated insight, reasoning,
time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and Arcs (directed links),
where:

o Each node corresponds to a random variable, which can be continuous or discrete.
o Arcs (directed arrows) represent the causal relationships or conditional dependencies between random variables. These directed links connect pairs of nodes in the graph.
A directed link indicates that one node directly influences the other; if there is no directed link between two nodes, there is no direct influence between them.
o In the above diagram, A, B, C, and D are random variables
represented by the nodes of the network graph.
o If we are considering node B, which is connected with node A
by a directed arrow, then node A is called the parent of Node
B.
o Node C is independent of node A.

Note: A Bayesian network graph does not contain any cycles. Hence, it is known as a directed acyclic graph, or DAG.
A Bayesian network has two main components:
o the causal component (the graph structure), and
o the actual numbers (the conditional probabilities).
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parents(Xi)), which quantifies the effect of the parents on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first understand the joint probability distribution.
Joint probability distribution:
If we have variables x1, x2, x3, …, xn, then the probabilities of the different combinations of x1, x2, x3, …, xn are known as the joint probability distribution.
P[x1, x2, x3, …, xn] can be written in the following way in terms of conditional probabilities:

= P[x1 | x2, x3, …, xn] P[x2, x3, …, xn]

= P[x1 | x2, x3, …, xn] P[x2 | x3, …, xn] … P[xn-1 | xn] P[xn]

In general for each variable Xi, we can write the equation as:

P(Xi|Xi-1,........., X1) = P(Xi |Parents(Xi ))

Explanation of Bayesian network:


Let's understand the Bayesian network through an example by
creating a directed acyclic graph:
Example: Harry installed a new burglar alarm at his home to detect burglary. The alarm reliably responds to a burglary, but it also responds to minor earthquakes. Harry has two neighbors, David and Sophia, who have taken the responsibility of informing Harry at work when they hear the alarm. David always calls Harry when he hears the alarm, but sometimes he confuses the phone ringing with the alarm and calls then too. Sophia, on the other hand, likes to listen to loud music, so she sometimes misses hearing the alarm. Here we would like to compute the probability of the burglar alarm scenario.
Problem:
Calculate the probability that the alarm has sounded but neither a burglary nor an earthquake has occurred, and both David and Sophia have called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend only on Alarm.
o The network represents the assumptions that David and Sophia do not directly perceive the burglary, do not notice a minor earthquake, and do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1, because the entries in the row represent an exhaustive set of cases for the variable.
o In a CPT, a Boolean variable with k Boolean parents requires 2^k rows of probabilities. Hence, if there are two parents, the CPT contains 4 rows of probability values.
List of all events occurring in this network:

o Burglary (B)
o Earthquake(E)
o Alarm(A)
o David Calls(D)
o Sophia calls(S)
We can write the events of the problem statement in the form of a probability expression P[D, S, A, B, E], and rewrite it using the joint probability distribution:
P[D, S, A, B, E]= P[D | S, A, B, E]. P[S, A, B, E]
=P[D | S, A, B, E]. P[S | A, B, E]. P[A, B, E]
= P [D| A]. P [ S| A, B, E]. P[ A, B, E]
= P[D | A]. P[ S | A]. P[A| B, E]. P[B, E]

= P[D | A ]. P[S | A]. P[A| B, E]. P[B |E]. P[E]


Let's take the observed probability for the Burglary and earthquake
component:
P(B= True) = 0.002, which is the probability of burglary.
P(B= False)= 0.998, which is the probability of no burglary.
P(E= True)= 0.001, which is the probability of a minor earthquake
P(E= False) = 0.999, which is the probability that an earthquake has not occurred.
We can provide the conditional probabilities as per the below tables:
Conditional probability table for Alarm (A): the conditional probability of Alarm depends on Burglary and Earthquake.

B      E      P(A= True)   P(A= False)
True   True   0.94         0.06
True   False  0.95         0.05
False  True   0.31         0.69
False  False  0.001        0.999
Conditional probability table for David Calls (D): the conditional probability that David calls depends on the probability of Alarm.

A      P(D= True)   P(D= False)
True   0.91         0.09
False  0.05         0.95

Conditional probability table for Sophia Calls (S): the conditional probability that Sophia calls depends on its parent node, Alarm.

A      P(S= True)   P(S= False)
True   0.75         0.25
False  0.02         0.98


From the formula of joint distribution, we can write the problem
statement in the form of probability distribution:
P(S, D, A, ¬B, ¬E) = P (S|A) *P (D|A)*P (A|¬B ^ ¬E) *P (¬B) *P (¬E).

= 0.75* 0.91* 0.001* 0.998*0.999 = 0.00068045.


Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
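The same query can also be evaluated programmatically. The following Python sketch encodes the CPTs listed above (storing only P(X = True | parents), with the False case as the complement) and reproduces the result:

# CPTs from the tables above.
P_B = {True: 0.002, False: 0.998}
P_E = {True: 0.001, False: 0.999}
P_A = {(True, True): 0.94, (True, False): 0.95,
       (False, True): 0.31, (False, False): 0.001}   # P(A=True | B, E)
P_D = {True: 0.91, False: 0.05}                       # P(D=True | A)
P_S = {True: 0.75, False: 0.02}                       # P(S=True | A)

def joint(d, s, a, b, e):
    # P(D, S, A, B, E) = P(D|A) P(S|A) P(A|B,E) P(B) P(E)
    p_a = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    p_d = P_D[a] if d else 1 - P_D[a]
    p_s = P_S[a] if s else 1 - P_S[a]
    return p_d * p_s * p_a * P_B[b] * P_E[e]

print(joint(d=True, s=True, a=True, b=False, e=False))  # ≈ 0.00068045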
The semantics of Bayesian Network:
There are two ways to understand the semantics of a Bayesian network, which are given below:
1. To understand the network as the representation of the Joint probability
distribution. It is helpful to understand how to construct the network.

2. To understand the network as an encoding of a collection of conditional independence statements. This is helpful in designing inference procedures.
Inference in Bayesian Networks
1. Exact inference
2. Approximate inference

Exact inference: in exact inference, we analytically compute the conditional probability distribution over the variables of interest. Sometimes that is too hard to do, in which case we can use approximation techniques based on statistical sampling.
Given a Bayesian network, what questions might we want to ask?
• Conditional probability query: P(x | e)
• Maximum a posteriori probability: What value of x
maximizes P(x|e) ?
• General question: What is the whole probability distribution over variable X given evidence e, P(X | e)?
In our discrete probability situation, the only way to answer a MAP query
is to compute the probability of x given e for all possible values of x and
see which one is greatest

So, in general, we’d like to be able to compute a whole probability


distribution over some variable or variables X, given instantiations of a
set of variables e

Using the joint distribution

To answer any query involving a conjunction of variables, sum over the


variables not involved in the query. Given the joint distribution over the
variables, we can easily answer any question about the value of a single
variable by summing (or marginalizing) over the other variables.
So, in a domain with four variables, A, B, C, and D, the probability that variable D
has value d is the sum over all possible combinations of values of the other three
variables of the joint probability of all four values. This is exactly the same as the
procedure we went through in the last lecture, where to compute the probability of
cavity, we added up the probability of cavity and toothache and the probability of
cavity and not toothache. In general, we’ll use the first notation, with a single
summation indexed by a list of variable names, and a joint probability expression
that mentions values of those variables. But here we can see the completely
written-out definition, just so we all know what the shorthand is supposed to mean.
To compute a conditional probability, we reduce it to a ratio of
conjunctive queries using the definition of conditional probability,
and then answer each of those queries by marginalizing out the
variables not mentioned.
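Written out for the four-variable example (assuming the query is P(d | b), consistent with the remark below that b and d are instantiated), the equation referred to here has the form:

P(d | b) = P(b, d) / P(b) = [ Σa Σc P(a, b, c, d) ] / [ Σa Σc Σd P(a, b, c, d) ]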
In the numerator, here, you can see that we’re only summing over
variables A and C, because b and d are instantiated in the query.
We’re going to learn a general purpose algorithm for answering these
joint queries fairly efficiently. We’ll start by looking at a very simple
case to build up our intuitions, then we’ll write down the algorithm,
then we’ll apply it to a more complex case.
Here's our very simple case: a Bayes net with four nodes, arranged in a chain A → B → C → D.
So, we know from before that the probability that variable D has some
value little d is the sum over A, B, and C of the joint distribution, with d
fixed.

Now, using the chain rule of Bayesian networks, we can write down the joint probability as a product, over the nodes, of the probability of each node's value given the values of its parents. In this case, we get P(d|c) times P(c|b) times P(b|a) times P(a).
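Written as an equation in the notation of this section, and pushing the summations inward:

P(d) = Σc Σb Σa P(d|c) P(c|b) P(b|a) P(a)
     = Σc P(d|c) Σb P(c|b) Σa P(b|a) P(a)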
If you look, for a minute, at the terms inside the summation over A, you'll see that we're redoing these multiplications for each value of C, which isn't necessary, because they're independent of C. Our idea, here, is to do the multiplications once and store them for later use. So, first, for each value of A and B, we can compute the product P(b|a) P(a), generating a two-dimensional matrix.

Then, we can sum over the rows of the matrix, yielding one value of
the sum for each possible value of b.
We’ll call this set of values, which depends on b, f1 of b.

Now, we can substitute f1 of b in for the sum over A in our


previous expression. And, effectively, we can remove node A
from our diagram. Now, we express the contribution of b, which
takes the contribution of a into account, as f_1 of b.
We can continue the process in basically the same way. We can look
at the summation over b and see that the only other variable it
involves is c. We can summarize those products as a set of factors,
one for each value of c. We’ll call those factors f_2 of c.

We substitute f_2 of c into the formula, remove node b from


the diagram, and now we’re down to a simple expression in
which d is known and we have to sum over values of c.
