ARTIFICIAL INTELLIGENCE IV YEAR CSE I SEMESTER
IV. Uncertainty Handling
Agents almost never have access to the whole truth about their environment. Agents must,
therefore, act under uncertainty.
Handling Uncertain Knowledge:
Let us try to write rules for dental diagnosis using first-order logic, so that we can see how the
logical approach breaks down. Consider the following rule:
Toothache ⇒ Cavity
The problem is that this rule is wrong. Not all patients with toothaches have cavities; some of
them have gum disease, an abscess, or one of several other problems:
Toothache ⇒ Cavity ∨ GumProblem ∨ Abscess . . .
Unfortunately, in order to make the rule true, we have to add an almost unlimited list of possible
causes. We could try turning the rule into a causal rule:
Cavity ⇒ Toothache
But this rule is not right either; not all cavities cause pain.
Trying to use first-order logic to cope with a domain like medical diagnosis thus fails for three
main reasons:
Laziness: It is too much work to list the complete set of antecedents or consequents needed to
ensure an exceptionless rule, and too hard to use such rules.
Theoretical ignorance: Medical science has no complete theory for the domain.
Practical ignorance: Even if we know all the rules, we might be uncertain about a particular
patient because not all the necessary tests have been or can be run.
Probability provides a way of summarizing the uncertainty that comes from our laziness and
ignorance.
To make such choices, an agent must first have preferences between the different possible
outcomes of the various plans.
Preferences, as expressed by utilities, are combined with probabilities in the general theory of
rational decisions called decision theory:
Decision theory = probability theory + utility theory.
The fundamental idea of decision theory is that an agent is rational if and only if it chooses the
action that yields the highest expected utility, averaged over all the possible outcomes of the
action. This is called the principle of Maximum Expected Utility (MEU).
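To make the MEU idea concrete, here is a minimal Python sketch under made-up numbers: the two actions, their outcome probabilities, and their utilities are all hypothetical, chosen only to show how expected utilities are averaged and compared.

# Minimal sketch of the Maximum Expected Utility (MEU) principle.
# The actions, outcome probabilities, and utilities are hypothetical,
# chosen only to illustrate the computation.
actions = {
    # action: list of (probability_of_outcome, utility_of_outcome)
    "treat_cavity": [(0.9, 80), (0.1, -20)],
    "wait_and_see": [(0.5, 50), (0.5, -60)],
}

def expected_utility(outcomes):
    """Average the utility over all possible outcomes of an action."""
    return sum(p * u for p, u in outcomes)

# A rational agent chooses the action with the highest expected utility.
best_action = max(actions, key=lambda a: expected_utility(actions[a]))
print(best_action, expected_utility(actions[best_action]))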
The structure of a typical decision-theoretic agent is shown in Figure 13.1.
Basic Probability Notation:
Degrees of belief are always applied to propositions. The basic element of the language is the
random variable. Random variables are typically divided into three kinds.
Boolean random variables, such as Cavity, have the domain (true, false). We will often
abbreviate a proposition such as Cavity = true simply by the lowercase name cavity. Similarly,
Cavity = false would be abbreviated by ¬cavity.
Discrete random variables, which include Boolean random variables as a special case, take
on values from a countable domain. For example, the domain of Weather might be (sunny,
rain, cloudy, snow). The values in the domain must be mutually exclusive and exhaustive.
Where no confusion arises, we will use, for example, snow as an abbreviation for Weather =
snow.
Continuous random variables take on values from the real numbers. The domain can be
either the entire real line or some subset such as the interval [0, 1]. For example, the proposition
X = 4.02 asserts that the random variable X has the exact value 4.02. Propositions concerning
continuous random variables can also be inequalities, such as X ≤ 4.02.
Atomic events: An atomic event is a complete specification of the state of the world about
which the agent is uncertain.
Atomic events have some important properties:
They are mutually exclusive: at most one can actually be the case. For example, cavity ∧
toothache and cavity ∧ ¬toothache cannot both be the case.
The set of all possible atomic events is exhaustive: at least one must be the case. That is, the
disjunction of all atomic events is logically equivalent to true.
Any particular atomic event entails the truth or falsehood of every proposition, whether
simple or complex. For example, the atomic event cavity ∧ ¬toothache entails the truth of
cavity and the falsehood of cavity ⇒ toothache.
Any proposition is logically equivalent to the disjunction of all atomic events that entail the
truth of the proposition. For example, the proposition cavity is equivalent to the disjunction of
the atomic events cavity ∧ toothache and cavity ∧ ¬toothache.
Prior probability:
The unconditional or prior probability associated with a proposition a is the degree of belief
accorded to it in the absence of any other information; it is written as P (a). For example, if the
prior probability that I have a cavity is 0.1, then we would write P (Cavity = true) = 0.1 or P
(cavity) = 0.1.
Ex:
P (Weather = sunny) = 0.7
P (Weather = rain) = 0.2
P (Weather = cloudy) = 0.08
P (Weather = snow) = 0.02.
We may simply write
P (Weather) = (0.7, 0.2, 0.08, 0.02).
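As a small illustration, such a prior distribution can be stored as a simple lookup table; the Python sketch below just holds the four numbers given above and checks that they sum to 1.

# The prior distribution P(Weather) as a lookup table (values from the text above).
P_weather = {"sunny": 0.7, "rain": 0.2, "cloudy": 0.08, "snow": 0.02}

assert abs(sum(P_weather.values()) - 1.0) < 1e-9   # a distribution must sum to 1
print(P_weather["sunny"])                           # P(Weather = sunny) = 0.7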
Conditional Probabilities:
Conditional probabilities can be defined in terms of unconditional probabilities. The defining
equation is
P(a | b) = P(a ∧ b) / P(b), which holds whenever P(b) > 0.
This can be rewritten in the form called the product rule:
P(a ∧ b) = P(a | b) P(b), or equivalently P(a ∧ b) = P(b | a) P(a).
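A minimal sketch of the definition, assuming two hypothetical propositions a and b with made-up probabilities (the numbers are not taken from any figure in the text):

# Checking the definition of conditional probability and the product rule
# on hypothetical numbers.
P_a_and_b = 0.04   # P(a ∧ b)
P_b = 0.10         # P(b)

P_a_given_b = P_a_and_b / P_b                       # P(a | b) = P(a ∧ b) / P(b)
assert abs(P_a_given_b * P_b - P_a_and_b) < 1e-12   # product rule: P(a ∧ b) = P(a | b) P(b)
print(P_a_given_b)                                  # 0.4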
The Axioms of Probability:
There are three axioms, which are often called Kolmogorov’s axioms:
1. Every probability lies between 0 and 1: 0 ≤ P(a) ≤ 1.
2. Necessarily true propositions have probability 1, and necessarily false propositions have
probability 0: P(true) = 1 and P(false) = 0.
3. The probability of a disjunction is given by P(a ∨ b) = P(a) + P(b) − P(a ∧ b).
Using the axioms of probability:
We can derive a variety of useful facts from the basic axioms. For example, applying axiom 3
to a and ¬a gives P(true) = P(a) + P(¬a) − P(false), from which P(¬a) = 1 − P(a).
INFERENCE USING FULL JOINT DISTRIBUTIONS:
We will use the full joint distribution as the "knowledge base" from which answers to all
questions may be derived.
Example: Consider a domain consisting of just the three Boolean variables Toothache, Cavity,
and Catch (the dentist's nasty steel probe catches in my tooth). The full joint distribution is a
2 × 2 × 2 table as shown in Figure 13.3.
Notice that the probabilities in the joint distribution sum to 1, as required by the axioms of
probability. Now we can calculate the probability of any proposition, whether simple or complex.
Example 1) The probability of cavity or toothache, P(cavity ∨ toothache), can be computed
as
P(cavity ∨ toothache) = 0.108 + 0.012 + 0.072 + 0.008 + 0.016 + 0.064 = 0.28 .
Example 2) The probability of having a cavity, P(cavity), gives the unconditional or marginal
probability of cavity:
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2.
Example 3) The probability of a cavity, given evidence of a toothache, is computed as
P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
= (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.12 / 0.2 = 0.6.
Example 4) The probability of no cavity, given evidence of a toothache, is computed as
P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache) = (0.016 + 0.064) / 0.2 = 0.4.
In a similar way, we can compute any other required probabilities.
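The examples above can be reproduced mechanically by summing entries of the table. The sketch below encodes the full joint distribution of Figure 13.3; the six entries quoted in the examples are taken from the text, while the remaining two (0.144 and 0.576) are assumed from the standard version of the table so that everything sums to 1.

# Full joint distribution P(Toothache, Catch, Cavity) from Figure 13.3.
# Entries not quoted in the text (0.144, 0.576) are assumed from the standard table.
joint = {
    # (toothache, catch, cavity): probability
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}
assert abs(sum(joint.values()) - 1.0) < 1e-9

def prob(event):
    """Sum the atomic events in which `event` (a predicate on a world) holds."""
    return sum(p for world, p in joint.items() if event(*world))

# Example 1: P(cavity ∨ toothache)
print(prob(lambda t, c, cav: cav or t))                                     # 0.28
# Example 2: marginal P(cavity)
print(prob(lambda t, c, cav: cav))                                          # 0.2
# Example 3: P(cavity | toothache)
print(prob(lambda t, c, cav: cav and t) / prob(lambda t, c, cav: t))        # 0.6
# Example 4: P(¬cavity | toothache)
print(prob(lambda t, c, cav: (not cav) and t) / prob(lambda t, c, cav: t))  # 0.4

The same prob helper answers any query over these three variables, which is exactly what using the full joint distribution as a "knowledge base" means.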
Independence:
For example, if we add a fourth variable, Weather, the full joint distribution then becomes
P(Toothache, Catch, Cavity, Weather ), which has 2 × 2 × 2 × 4 = 32 entries. It contains four
“editions” of the table shown in Figure 13.3, one for each kind of weather. What relationship
do these editions have to each other and to the original three-variable table? For example, how
are P (toothache, catch, cavity, cloudy) and P (toothache, catch, cavity) related?
We can use the product rule:
P (toothache, catch, cavity, cloudy) = P (cloudy | toothache, catch, cavity) P (toothache, catch,
cavity).
Since one's dental problems have no influence on the weather, the following assertion seems reasonable:
P(cloudy | toothache, catch, cavity) = P(cloudy)
From this, we can deduce
P(toothache, catch, cavity, cloudy) = P(cloudy)P(toothache, catch, cavity)
This property is called independence or marginal independence or absolute independence.
In particular, the weather is independent of one’s dental problems. Independence between
propositions a and b can be written as
P(a | b)=P(a) or P(b | a)=P(b) or P(a ∧ b)=P(a)P(b).
Independence assertions are usually based on knowledge of the domain. As the toothache–
weather example illustrates, they can dramatically reduce the amount of information
necessary to specify the full joint distribution.
If the complete set of variables can be divided into independent subsets, then the full joint
distribution can be factored into separate joint distributions on those subsets. For example, the
full joint distribution on the outcome of n independent coin flips, P(C1, . . . , Cn), has 2^n entries,
but it can be represented as the product of n single-variable distributions P(Ci). An example is
shown in Figure 13.4
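A quick way to see the saving: for n independent coin flips, the factored form needs only n numbers rather than 2^n joint entries. The sketch below uses hypothetical fair coins and rebuilds any joint entry as a product of single-coin probabilities.

from math import prod

# n independent coin flips: the full joint has 2**n entries,
# but under independence it factors into n single-variable distributions.
n = 5
P_heads = [0.5] * n   # P(Ci = heads) for each coin (hypothetical fair coins)

def joint_entry(outcome):
    """P(C1 = o1, ..., Cn = on) = product of the individual P(Ci = oi)."""
    return prod(p if o == "heads" else 1 - p for p, o in zip(P_heads, outcome))

print(joint_entry(("heads",) * n))                          # 0.5**5 = 0.03125
print(2 ** n, "joint entries vs", n, "numbers when factored")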
BAYES’ RULE AND ITS USE:
We defined the product rule. It can actually be written in two forms:
P(a ∧ b) = P(a | b) P(b) and P(a ∧ b) = P(b | a) P(a).
Equating the two right-hand sides and dividing by P(a), we get
P(b | a) = P(a | b) P(b) / P(a).
This equation is known as Bayes’ rule (also Bayes’ law or Bayes’ theorem). This simple
equation underlies most modern AI systems for probabilistic inference.
The more general case of Bayes’ rule for multivalued variables can be written in the P
notation as follows:
P(Y | X) = P(X | Y) P(Y) / P(X).
We will also have occasion to use a more general version conditionalized on some
background evidence e:
P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e).
Applying Bayes’ rule: The simple case:
Bayes’ rule does not seem very useful. It allows us to compute the single term P(b | a) in terms
of three terms: P(a | b), P(b), and P(a). That seems like two steps backwards, but Bayes’ rule
is useful in practice because there are many cases where we do have good probability estimates
for these three numbers and need to compute the fourth.
Often, we perceive as evidence the effect of some unknown cause, and we would like to
determine that cause. In that case, Bayes’ rule becomes
P(cause | effect) = P(effect | cause) P(cause) / P(effect).
The conditional probability P(effect | cause) quantifies the relationship in the causal
direction, whereas P(cause | effect) describes the diagnostic direction. In a task such as
medical diagnosis, we often have conditional probabilities on causal relationships (that is, the
doctor knows P (symptoms|disease)) and want to derive a diagnosis, P(disease|
symptoms).
An Example Problem:
For example, a doctor knows that the disease meningitis causes the patient to have a stiff neck,
say, 70% of the time. The doctor also knows some unconditional facts: the prior probability
that a patient has meningitis is 1/50,000, and the prior probability that any patient has a stiff
neck is 1%.
Solution: Letting s be the proposition that the patient has a stiff neck and m be the proposition
that the patient has meningitis, we have
P(s | m) = 0.7
P(m) = 1/50000
P(s) = 0.01
P(m | s) = P(s | m) P(m) / P(s) = (0.7 × 1/50000) / 0.01 = 0.0014.
Inference: That is, we expect less than 1 in 700 patients with a stiff neck to have meningitis.
Observation: Notice that even though a stiff neck is quite strongly indicated by meningitis
(with probability 0.7), the probability of meningitis in the patient remains small. This is because
the prior probability of stiff necks is much higher than that of meningitis.
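The arithmetic of this example is easy to reproduce; the sketch below simply plugs the three numbers given above into Bayes' rule.

# Meningitis example: P(m | s) = P(s | m) P(m) / P(s), using the numbers given above.
P_s_given_m = 0.7        # stiff neck given meningitis
P_m = 1 / 50000          # prior probability of meningitis
P_s = 0.01               # prior probability of a stiff neck

P_m_given_s = P_s_given_m * P_m / P_s
print(P_m_given_s)       # ≈ 0.0014, i.e. roughly 1 in 714 patients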
The same normalization process can be applied when using Bayes’ rule:
P(M | s) = α (P(s | m) P(m), P(s | ¬m) P(¬m)),
where the constant α rescales the two entries so that they sum to 1; this lets us avoid assessing
the prior probability P(s) of the evidence directly.
The general form of Bayes’ rule with normalization is
P(Y | X) = α P(X | Y) P(Y),
where α is the normalization constant needed to make the entries in P(Y | X) sum to 1.
Using Bayes’ rule: Combining evidence:
Bayes’ rule can also be applied when we have two or more pieces of evidence. For example,
consider the full joint distribution of the dental domain shown in Figure 13.3.
If we know the full joint distribution, we can answer the query directly:
P(Cavity | toothache ∧ catch) = α (0.108, 0.016) ≈ (0.871, 0.129).
Using Bayes’ rule, the same query can be reformulated as
P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch | Cavity) P(Cavity).
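The normalization step can be carried out directly on the relevant joint entries. Using the two numbers 0.108 and 0.016 quoted earlier from Figure 13.3, the sketch below computes P(Cavity | toothache ∧ catch) with the constant α.

# P(Cavity | toothache ∧ catch) = α P(toothache ∧ catch ∧ Cavity),
# using the joint entries 0.108 (cavity) and 0.016 (¬cavity) from Figure 13.3.
unnormalized = {"cavity": 0.108, "not_cavity": 0.016}

alpha = 1.0 / sum(unnormalized.values())        # α makes the entries sum to 1
posterior = {v: alpha * p for v, p in unnormalized.items()}
print(posterior)   # approximately {'cavity': 0.871, 'not_cavity': 0.129}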
Conditional Independence: It would be nice if Toothache and Catch were independent, but
they are not: if the probe catches in the tooth, then it is likely that the tooth has a cavity and
that the cavity causes a toothache. These variables are independent, however, given the
presence or the absence of a cavity. Each is directly caused by the cavity, but neither has a
direct effect on the other: toothache depends on the state of the nerves in the tooth, whereas the
probe’s accuracy depends on the dentist’s skill, to which the toothache is irrelevant.
Mathematically, this property is written as
P(toothache ∧ catch | Cavity) = P(toothache | Cavity) P(catch | Cavity).
This equation expresses the conditional independence of toothache and catch given Cavity.
The general definition of conditional independence of two variables X and Y, given a third
variable Z, is
P(X, Y | Z) = P(X | Z) P(Y | Z).
Using this, we can derive a decomposition of the full joint distribution:
P(Toothache, Catch, Cavity) = P(Toothache, Catch | Cavity) P(Cavity)   (product rule)
= P(Toothache | Cavity) P(Catch | Cavity) P(Cavity)   (conditional independence).
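This claim can be checked numerically against the full joint table used earlier (with the same caveat that the two entries not quoted in the text are assumed from the standard table). In that table, Toothache and Catch come out exactly conditionally independent given Cavity.

# Verify P(toothache ∧ catch | cavity) = P(toothache | cavity) P(catch | cavity)
# using the full joint table from Figure 13.3 (two entries assumed, as noted earlier).
joint = {
    (True,  True,  True):  0.108, (True,  True,  False): 0.016,
    (True,  False, True):  0.012, (True,  False, False): 0.064,
    (False, True,  True):  0.072, (False, True,  False): 0.144,
    (False, False, True):  0.008, (False, False, False): 0.576,
}

def prob(event):
    return sum(p for (t, c, cav), p in joint.items() if event(t, c, cav))

p_cavity = prob(lambda t, c, cav: cav)                              # 0.2
lhs = prob(lambda t, c, cav: t and c and cav) / p_cavity            # 0.108 / 0.2 = 0.54
rhs = (prob(lambda t, c, cav: t and cav) / p_cavity) * \
      (prob(lambda t, c, cav: c and cav) / p_cavity)                # 0.6 * 0.9 = 0.54
print(lhs, rhs)    # both ≈ 0.54, so the conditional independence holds here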
In general, for a single cause that directly influences a number of conditionally independent
effects, the full joint distribution can be written as
P(Cause, Effect1, . . . , Effectn) = P(Cause) × P(Effect1 | Cause) × . . . × P(Effectn | Cause).
Such a probability distribution is called a naive Bayes model—“naive” because it is often used
(as a simplifying assumption) in cases where the “effect” variables are not actually
conditionally independent given the cause variable. (The naive Bayes model is sometimes
called a Bayesian classifier, a somewhat careless usage that has prompted true Bayesians to
call it the idiot Bayes model.)
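Below is a sketch of a naive Bayes model for the dental domain, with the prior and the per-effect conditionals derived from the same Figure 13.3 table (same assumption about the two unquoted entries); any full-joint entry is then just the product P(Cause) × P(Effect1 | Cause) × . . . × P(Effectn | Cause).

# Naive Bayes model for the dental domain:
#   P(Cavity, Toothache, Catch) = P(Cavity) P(Toothache | Cavity) P(Catch | Cavity).
# The conditionals below are derived from the Figure 13.3 table used earlier.
P_cavity = {True: 0.2, False: 0.8}
P_toothache_given_cavity = {True: 0.6, False: 0.1}   # P(toothache | Cavity = v)
P_catch_given_cavity     = {True: 0.9, False: 0.2}   # P(catch | Cavity = v)

def naive_joint(cavity, toothache, catch):
    """P(cause) times the product of P(effect_i | cause) for each effect."""
    p = P_cavity[cavity]
    p *= P_toothache_given_cavity[cavity] if toothache else 1 - P_toothache_given_cavity[cavity]
    p *= P_catch_given_cavity[cavity] if catch else 1 - P_catch_given_cavity[cavity]
    return p

print(naive_joint(True, True, True))   # 0.2 * 0.6 * 0.9 ≈ 0.108, matching the table entry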
Probabilistic Reasoning:
Independence and conditional independence relationships are useful in
simplifying probabilistic representations of the world.
Bayesian networks provide a systematic way to represent such relationships explicitly.
We define the syntax and semantics of these networks and show how they can be used to
capture uncertain knowledge in a natural and efficient way.
REPRESENTING KNOWLEDGE IN AN UNCERTAIN DOMAIN:
The independence and conditional independence relationships among variables can greatly
reduce the number of probabilities that need to be specified in order to define the full joint
distribution. This section introduces a data structure called a Bayesian network to represent
the dependencies among variables. Bayesian networks can represent essentially any full joint
probability distribution and in many cases can do so very concisely.
What are Bayesian networks?
A Bayesian network is a directed graph in which each node is annotated with quantitative
probability information. The full specification is as follows:
1. Each node corresponds to a random variable, which may be discrete or continuous.
2. A set of directed links or arrows connects pairs of nodes. If there is an arrow from node X
to node Y, X is said to be a parent of Y. The graph has no directed cycles (and hence is a
directed acyclic graph, or DAG).
3. Each node Xi has a conditional probability distribution P(Xi |Parents(Xi)) that quantifies the
effect of the parents on the node.
The topology of the network—the set of nodes and links—specifies the conditional
independence relationships that hold in the domain, in a way that will be made precise
shortly.
The intuitive meaning of an arrow is typically that X has a direct influence on Y, which
suggests that causes should be parents of effects.
It is usually easy for a domain expert to decide what direct influences exist in the
domain—much easier, in fact, than actually specifying the probabilities themselves.
Once the topology of the Bayesian network is laid out, we need only specify a
conditional probability distribution for each variable, given its parents.
We will see that the combination of the topology and the conditional distributions
suffices to specify (implicitly) the full joint distribution for all the variables.
Example 1: Consider the variables Toothache, Cavity, Catch, and Weather. We know that Weather
is independent of the other variables; furthermore, we know that Toothache and Catch are
conditionally independent, given Cavity. These relationships are represented by the Bayesian
network structure shown in Figure 14.1.
Formally, the conditional independence of Toothache and Catch, given Cavity, is indicated
by the absence of a link between Toothache and Catch. Intuitively, the network represents
the fact that Cavity is a direct cause of Toothache and Catch, whereas no direct causal
relationship exists between Toothache and Catch.
Example 2: You have a new burglar alarm installed at home. It is fairly reliable at detecting a
burglary, but also responds on occasion to minor earthquakes. You also have two neighbors,
John and Mary, who have promised to call you at work when they hear the alarm. John nearly
always calls when he hears the alarm, but sometimes confuses the telephone ringing with the
alarm and calls then, too. Mary, on the other hand, likes rather loud music and often misses the
alarm altogether.
Query: Given the evidence of who has or has not called, we would like to estimate the
probability of a burglary.
A Bayesian network for this domain appears in Figure 14.2.
The network structure shows that burglary and earthquakes directly affect the
probability of the alarm’s going off, but whether John and Mary call depends only on
the alarm.
The network thus represents our assumptions that they do not perceive burglaries
directly, they do not notice minor earthquakes, and they do not confer before calling.
The conditional distributions in Figure 14.2 are shown as a conditional probability table, or
CPT. Each row in a CPT contains the conditional probability of each node value for a
conditioning case. A conditioning case is just a possible combination of values for the parent
nodes—a miniature possible world.
Example problem: What is the probability that the alarm has sounded, but neither a
burglary nor an earthquake has occurred, and both John and Mary call?
P(j, m, a, ¬b, ¬e) = P(j | a) P(m | a) P(a | ¬b ∧ ¬e) P(¬b) P(¬e)
= 0.90 × 0.70 × 0.001 × 0.999 × 0.998 ≈ 0.000628.
Exercise Problem: What is the probability that the alarm has sounded, a burglary but no
earthquake has occurred, and John calls but Mary does not call?
P(j, ¬m, a, b, ¬e) = P(j | a) P(¬m | a) P(a | b ∧ ¬e) P(b) P(¬e)
= 0.90 × 0.30 × 0.94 × 0.001 × 0.998 ≈ 0.00025.
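Both calculations follow the same pattern: multiply one entry from each node's CPT along P(J | A) P(M | A) P(A | B, E) P(B) P(E). The sketch below uses the CPT numbers that appear in the exercise above; the remaining entries, for example P(E) = 0.002 and P(A | ¬b, ¬e) = 0.001, are assumed from the standard version of Figure 14.2, which is not reproduced here.

# Joint entries of the burglary network, computed as the product of CPT entries:
#   P(J, M, A, B, E) = P(J | A) P(M | A) P(A | B, E) P(B) P(E)
# CPT values not quoted in the text are assumed from the standard Figure 14.2.
P_B = 0.001
P_E = 0.002
P_A_given_BE = {(True, True): 0.95, (True, False): 0.94,
                (False, True): 0.29, (False, False): 0.001}
P_J_given_A = {True: 0.90, False: 0.05}
P_M_given_A = {True: 0.70, False: 0.01}

def p(value, prob_true):
    """Probability that a Boolean variable takes `value`, given P(variable = true)."""
    return prob_true if value else 1 - prob_true

def joint(j, m, a, b, e):
    return (p(j, P_J_given_A[a]) * p(m, P_M_given_A[a]) *
            p(a, P_A_given_BE[(b, e)]) * p(b, P_B) * p(e, P_E))

# Example problem: alarm sounds, no burglary, no earthquake, both call.
print(joint(True, True, True, False, False))    # ≈ 0.000628
# Exercise: alarm sounds, burglary but no earthquake, John calls, Mary does not.
print(joint(True, False, True, True, False))    # ≈ 0.00025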
The semantics of Bayesian networks:
There are two ways in which one can understand the semantics of Bayesian networks.
The first is to see the network as a representation of the joint probability distribution.
The second is to view it as an encoding of a collection of conditional independence
statements.
The two views are equivalent, but the first turns out to be helpful in understanding how
to construct networks, whereas the second is helpful in designing inference procedures.
Representing the full joint distribution:
The conditional probabilities in the CPTs can be thought of as the parameters that, together
with the network topology, define the full joint distribution.
A generic entry in the joint distribution is the probability of a conjunction of particular
assignments to each variable, such as P(X1 = x1 ∧ . . . ∧ Xn = xn). We use the notation P(x1,
. . . , xn) as an abbreviation for this. The value of this entry is given by the formula
P(x1, . . . , xn) = P(x1 | parents(X1)) × . . . × P(xn | parents(Xn)),   (14.2)
that is, each entry in the joint distribution is the product of the appropriate elements of the
conditional probability tables in the network. Further, we can rewrite this equation by repeated
application of the product rule, as shown below.
Method for constructing a Bayesian network:
Repeated application of the product rule gives
P(x1, . . . , xn) = P(xn | xn−1, . . . , x1) P(xn−1 | xn−2, . . . , x1) . . . P(x2 | x1) P(x1).
This identity is called the chain rule. Comparing it with Equation (14.2), we see that the
specification of the joint distribution is equivalent to the general assertion that, for every
variable Xi in the network,
P(Xi | Xi−1, . . . , X1) = P(Xi | Parents(Xi)),
provided that Parents(Xi) ⊆ {Xi−1, . . . , X1}.
A Bayesian network is a correct representation of the domain only if each node is conditionally
independent of its other predecessors in the node ordering, given its parents. We can satisfy
this condition with the following methodology:
1. Nodes: First determine the set of variables that are required to model the domain. Now order
them, {X1, . . . ,Xn}. Any order will work, but the resulting network will be more compact if
the variables are ordered such that causes precede effects.
2. Links: For i = 1 to n do:
• Choose, from X1, . . . , Xi−1, a minimal set of parents for Xi.
• For each parent insert a link from the parent to Xi.
• CPTs: Write down the conditional probability table, P (Xi|Parents(Xi)).
Intuitively, the parents of node Xi should contain all those nodes in X1, . . . , Xi−1 that directly
influence Xi. For example, suppose we have completed the network in Figure 14.2 except for
the choice of parents for MaryCalls. MaryCalls is certainly influenced by whether there is a
Burglary or an Earthquake, but it is not directly influenced by them. Intuitively, our knowledge of the
domain tells us that these events influence Mary’s calling behavior only through their effect on
the alarm. Also, given the state of the alarm, whether John calls has no influence on Mary’s
calling. Formally speaking, we believe that the following conditional independence statement
holds:
P(MaryCalls | JohnCalls , Alarm, Earthquake, Burglary) = P(MaryCalls | Alarm)
Thus, Alarm will be the only parent node for MaryCalls.
Compactness and node ordering:
The compactness of Bayesian networks is an example of a general property of locally
structured (also called sparse) systems. In a locally structured system, each subcomponent
interacts directly with only a bounded number of other components, regardless of the total
number of components.
An Example:
Even in a locally structured domain, we will get a compact Bayesian network only if we choose
the node ordering well. What happens if we happen to choose the wrong order? Consider the
burglary example again. Suppose we decide to add the nodes in the order MaryCalls,
JohnCalls, Alarm, Burglary, and Earthquake. We then get the somewhat more complicated
network shown in Figure 14.3(a).
The process goes as follows:
• Adding MaryCalls: No parents.
• Adding JohnCalls: If Mary calls, that probably means the alarm has gone off, which of course
would make it more likely that John calls. Therefore, JohnCalls needs MaryCalls as a parent.
• Adding Alarm: Clearly, if both call, it is more likely that the alarm has gone off than if just
one or neither calls, so we need both MaryCalls and JohnCalls as parents.
• Adding Burglary: If we know the alarm state, then the call from John or Mary might give us
information about our phone ringing or Mary’s music, but not about burglary: P(Burglary |
Alarm, JohnCalls ,MaryCalls) = P(Burglary | Alarm) . Hence we need just Alarm as parent.
• Adding Earthquake: If the alarm is on, it is more likely that there has been an earthquake.
(The alarm is an earthquake detector of sorts.) But if we know that there has been a burglary,
then that explains the alarm, and the probability of an earthquake would be only slightly above
normal. Hence, we need both Alarm and Burglary as parents.
The resulting network has two more links than the original network in Figure 14.2 and requires
three more probabilities to be specified. What’s worse, some of the links represent tenuous
relationships that require difficult and unnatural probability judgments, such as assessing the
probability of Earthquake, given Burglary and Alarm.
Figure 14.3(b) shows a very bad node ordering: MaryCalls, JohnCalls, Earthquake, Burglary,
Alarm. This network requires 31 distinct probabilities to be specified—exactly the same
number as the full joint distribution. It is important to realize, however, that any of the three
networks can represent exactly the same joint distribution. The last two versions simply fail to
represent all the conditional independence relationships and hence end up specifying a lot of
unnecessary numbers instead.
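The parameter counts mentioned above are easy to verify: a Boolean node with k Boolean parents needs 2^k numbers, one per conditioning case. The sketch below compares the original network with the two orderings of Figure 14.3; the parent sets follow the construction described in the text, and for Figure 14.3(b) a fully connected structure is assumed, consistent with the count of 31 probabilities.

# Number of independent probabilities needed: a Boolean node with k Boolean
# parents needs 2**k numbers (one per conditioning case).
def num_parameters(parents_of):
    return sum(2 ** len(parents) for parents in parents_of.values())

original = {            # Figure 14.2
    "Burglary": [], "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"], "MaryCalls": ["Alarm"],
}
ordering_a = {          # Figure 14.3(a): M, J, A, B, E
    "MaryCalls": [], "JohnCalls": ["MaryCalls"],
    "Alarm": ["MaryCalls", "JohnCalls"],
    "Burglary": ["Alarm"], "Earthquake": ["Alarm", "Burglary"],
}
ordering_b = {          # Figure 14.3(b): M, J, E, B, A (fully connected, assumed)
    "MaryCalls": [], "JohnCalls": ["MaryCalls"],
    "Earthquake": ["MaryCalls", "JohnCalls"],
    "Burglary": ["MaryCalls", "JohnCalls", "Earthquake"],
    "Alarm": ["MaryCalls", "JohnCalls", "Earthquake", "Burglary"],
}

print(num_parameters(original))    # 10
print(num_parameters(ordering_a))  # 13  (three more than the original)
print(num_parameters(ordering_b))  # 31  (as many as the full joint distribution)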
Conditional independence relations in Bayesian networks:
The “topological” semantics specifies the conditional independence relationships encoded
by the graph structure, and from these we can derive the “numerical” semantics. The topological
semantics specifies that each variable is conditionally independent of its non-descendants,
given its parents. For example, in Figure 14.2, JohnCalls is independent of Burglary,
Earthquake, and MaryCalls given the value of Alarm. The definition is illustrated in Figure
14.4(a). From these conditional independence assertions and the interpretation of the network
parameters θ(Xi |Parents(Xi)) as specifications of conditional probabilities P(Xi |Parents(Xi)),
the full joint distribution given in Equation (14.2) can be reconstructed. In this sense, the
“numerical” semantics and the “topological” semantics are equivalent.
Another important independence property is implied by the topological semantics: a node is
conditionally independent of all other nodes in the network, given its parents, children, and children’s
parents—that is, given its Markov blanket. (Exercise 14.7 asks you to prove this.) For example,
Burglary is independent of JohnCalls and MaryCalls, given Alarm and Earthquake. This property
is illustrated in Figure 14.4(b).