Artificial Intelligence
Roman Barták
          Department of Theoretical Computer Science and Mathematical Logic
                                                                              Introduction
We construct rational agents.
An agent is an entity that perceives its environment
through sensors and acts upon that environment through
actuators.
A rational agent is an agent maximizing its expected
performance measure.
          In AI 1 we dealt mainly with a logical approach
          to agent design (no uncertainty).
          We ignored
                  – interface to environment (sensors, actuators)
                  – uncertainty
                  – the possibility of self-improvement (learning)
                                          Course structure
Introduction
  – motivation and background on probability
Probabilistic reasoning
  – uncertainty, probabilistic reasoning, Bayesian
    networks, Hidden Markov Models
Rational decisions
  – utility theory, Markov Decision Processes, game
    theory, mechanism design
Machine learning
  – decision trees, regression, SVM, reinforcement
    learning
                                                 Resources
Artificial Intelligence: A Modern Approach
                       – S. Russell and P. Norvig
                       – Prentice Hall, 2010 (3rd ed.)
                       – http://aima.cs.berkeley.edu/
                       Umělá inteligence 1-6
                       – Vladimír Mařík, Olga Štěpánková,
                         Jiří Lažanský a kol.
                       – Academia
                                              Course web site
 http://ktiml.mff.cuni.cz/~bartak/ui2
                                 You can find there:
                                 – slides
                                 – links and resources
                                 – contacts
                                 – quiz
                                 –…
                                        Links to other courses
Seminar on Artificial Intelligence II
   – how to apply AI techniques in practice
Machine learning
   – how can computers learn new things
Multi-agent systems
   – how to handle multiple agents
Probabilistic graphical models
   – how to do Bayesian inference efficiently etc.
Human-like artificial agents
   – how to design agents for virtual environments
Practical course on robotics
   – how to design hardware agents
                                             Uncertainty so far
Can we handle uncertain information in the pure
logical approach?
belief states
   – represent sets of all possible world
     states for the agent
Drawbacks
   – a logical agent must consider every logically possible
     explanation for the observations, no matter how
     unlikely (large and complex representations)
   – a correct contingent plan must consider arbitrary
     likely contingencies (big plans)
   – sometimes there is no plan that is guaranteed to
     achieve the goal, yet the agent must act
                                                       Example
Diagnosing a dental patient‘s toothache
   Let us try to apply propositional logic:
     Toothache ⇒ Cavity
   Hmm, is it really true?
   – not all patients with toothaches have cavities; some
     of them have gum disease, an abscess, or other
     problems
     Toothache ⇒ Cavity ∨ GumProblem ∨ Abscess ∨ …
   We could try turning the rule into a causal rule:
     Cavity ⇒ Toothache
   But this is not right either – not all cavities cause pain
   The only way to fix the rule is to make it logically
     exhaustive!
                                      Using a logical approach
Why does logic fail to cope with a domain like medical
   diagnosis?
• laziness: it is too much work to list the complete set of
   antecedent or consequents and too hard to use such rules
• theoretical ignorance: medical science has no complete
   theory for the domain
• practical ignorance: even if we know all the rules we might
   be uncertain because not all the necessary tests have been or
   can be run
We need another tool to deal with degrees of belief –
   probability theory.
A logical agent believes each sentence to be true or false or has
   no opinion. A probabilistic agent may have a numerical degree
   of belief between 0 (certainly false) and 1 (certainly true).
                                     Basic probability notation
Like logical assertions, probabilistic assertions are about
possible worlds – sample space Ω.
   – the possible worlds are mutually exclusive and
      exhaustive
Each possible world ω is associated with a numerical
probability P(ω) such that:
   0 ≤ P(ω) ≤ 1
   Σω∈Ω P(ω) = 1
Example:
   If we are about to roll two (distinguishable) dice, there
   are 36 possible worlds to consider: (1,1), (1,2),…, (6,6)
   P(ω) =1/36
                                                              Events
The sets of possible worlds are called events.
   Example: „doubles are rolled“ is an event
The probability of event is the sum of probabilities
of possible worlds in the event.
   P(φ) = Σω∈φ P(ω)
   Example:
     P(doubles) =
     1/36+1/36+1/36+1/36+1/36+1/36 = 1/6
   These probabilities are called unconditional or
    prior probabilities („priors“ for short).
                                          Conditional probability
Frequently, we have some information (evidence) and we
are interested in probability of some event.
   For example, what is the probability of double if we already
   know that first die rolled to 5?
     P(doubles | Die1 = 5) = 1/36 / (6*1/36) = 1/6
This is called conditional or posterior probability
     P(a | b) = P(a ∧ b) / P(b), whenever P(B) > 0
   This can be also written in a different form called the product
     rule
     P(a ∧ b) = P(a | b) . P(b)
Beware! If we have more evidence then the conditional
  probability needs to assume it.
  P(doubles | Die1 = 5, Die2 = 5) = 1
                                           Random variables
In a factored representation, a possible world is
represented by a set of variable/value pairs.
Variables in probability theory are called random
variables. Every random variable has a domain –
the set of possible values it can take on (similarly to
a CSP).
   Die1 – represents a value on the first die 1 (1,…,6)
   Cavity – describes whether the patient has or has not
   cavity (true, false)
A possible world is fully identified by values of all
random variables.
   P(Die1 = 5, Die2 = 5)
                                      Probability distribution
Probability of all possible worlds can be described using a
table called a full joint probability distribution – the
elements are indexed by values of random variables.
Given the table, we can calculate probabilities of values of
any random variable:
   P(toothache=true) = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
   P(toothache=false) = 0.072+ 0.008 + 0.144 + 0.576 = 0.8
We will describe the table in a short way as:
   P(Toothache) = 〈0.2, 0.8〉
                                         Probability axioms
  P(¬a) = 1 – P(a)
  inclusion-exclusion principle
    P(a ∨ b) = P(a) + P(b) – P(a ∧ b)
  chain rule
    P(A,B,C,D)
    = P(A|B,C,D) P(B,C,D)
    = P(A|B,C,D) P(B|C,D) P(C,D)
    = P(A|B,C,D) P(B|C,D) P(C|D) P(D)
                     Inference using full joint distributions
How to answer questions?
Knowledge base is
represented using full joint distribution.
To compute posterior probability of a query
proposition given observed evidence, we add
up probabilities of possible worlds in which
the proposition is true (marginalization or
summing out).
P(φ) = Σω:ω|=φ P(ω)
P(Y) = Σz∈Z P(Y,z)
                              Example of probabilistic inference
                                        P(φ) = Σω:ω|=φ P(ω)
                                        P(Y) = Σz∈Z P(Y,z)
P(toothache) (= P(Toothache=true))
   = 0.108 + 0.012 + 0.016 + 0.064 = 0.2
P(cavity ∨ toothache)
   = 0.108+0.012+0.072+0.008+0.016+0.064 = 0.28
P(¬cavity|toothache)
   = P(¬cavity ∧ toothache) / P(toothache)
   = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064)
   = 0.4
                                                      Normalization
P(¬cavity|toothache)
    = P(¬cavity ∧ toothache) / P(toothache)
    = (0.016 + 0.064) / (0.108 + 0.012 + 0.016 + 0.064) = 0.4
P(cavity|toothache)
    = P(cavity ∧ toothache) / P(toothache)
    = (0.108 + 0.012) / (0.108 + 0.012 + 0.016 + 0.064) = 0.6
Notice that denominators are identical in both formulas!
We even do not need to know the exact value of denominator:
   P(¬cavity|toothache) + P(cavity|toothache) = 1
We can use a normalization constant α instead, computed such
   that the evaluated distribution adds up to 1.
P(Cavity|toothache) = α P(Cavity,toothache)
   = α [P(Cavity,toothache,catch) + P(Cavity,toothache,¬catch)]
   = α [ 〈0.108, 0.016〉 + 〈0.012, 0.064〉 ]
   = α [ 〈0.12, 0.08〉 ] = [ 〈0.6, 0.4〉 ]
                                        Inference via enumeration
In a typical case, we know values e of random variables E
from the observation and we are looking for probability
distribution of random variables Y from the query.
The other random variables are hidden H = X – Y – E.
P(Y | E=e) = α P(Y, E=e) = α Σh P(Y,E=e,H=h)
Some drawbacks of inference by enumeration:
• the worst-case time complexity is O(dn), where d is the
  number of values in domains of each random variable
• to store full joint probability distribution we need O(dn)
  space
• last but not least, it is not easy to obtain
  probabilities for all possible worlds
                                                          Independence
Let us expand the full joint distribution by adding a fourth variable
Weather with the domain {cloudy, sunny, rain, snow} – the new
full joint distribution has 2x2x2x4 = 32 elements (possible worlds).
  P(toothache,catch,cavity,cloudy)
  = P(cloudy|toothache,catch,cavity) * P(toothache,catch,cavity)
Do one‘s dental problems influence the weather?
  P(cloudy|toothache,catch,cavity) = P(cloudy)
We can write in general:
  P(Toothache,Catch,Cavity,Weather)
  = P(Toothache,Catch,Cavity) * P(Weather)
Hence the full joint distribution can be constructed from two
  smaller tables, one with 8 elements and one with 4 elements.
This property is called (absolute) independence:
   P(X|Y) = P(X) or P(Y|X) = P(Y) or P(X,Y) = P(X).P(Y)
                                   Conditional independence
Full independence allows us reducing the size of the domain
representation, but unfortunately full independence is rare
and even independent subsets can be quite large.
When one has cavity, does catch depend on toothache?
  P(catch | toothache, cavity) = P(catch | cavity)
  P(catch | toothache, ¬ cavity) = P(catch | ¬ cavity)
Random variables Catch and Toothache are independent if
we know the value of Cavity.
   P(Catch | Toothache, Cavity) = P(Catch | Cavity)
This property is called conditional independence:
  P(X|Y,Z) = P(X|Y) or P(Z|X,Y) = P(Z|Y) or
  P(Z,X|Y) = P(Z|Y) P(X|Y)
                         Exploiting conditional independence
 Conditional independence can be used to further reduce
 the size of domain representation.
    P(Toothache,Catch,Cavity)
    = P(Toothache|Catch,Cavity) P(Catch,Cavity)
    = P(Toothache|Catch,Cavity) P(Catch|Cavity) P(Cavity)
    = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)
 The full joint distribution can be constructed from three
 smaller tables of sizes 2 + 2 + 1 = 5 (only independent
 elements are represented).
                                           Diagnostic systems
Let us go back to diagnostic problems.
Usually we are looking for disease (the source of
problems) based on symptoms (observations).
   – we are interested in the diagnostic direction
     expressed as conditional probability
     P(disease|symptoms)
However, from past experience we often have other
information:
   – the probability of disease P(disease)
   – the probability of symptoms P(symptoms)
   – the causal relation expressed as conditional
     probability P(symptoms|disease)
How can this information be exploited to get the
probability of the diagnostic direction?
                                                     Bayes' rule
Recall the product rule
   P(a∧b) = P(a|b) P(b) = P(b|a) P(a)
We can deduce a so called Bayes‘ rule (law or theorem):
   P(a|b) = P(b|a) P(a) / P(b)
in general:
   P(Y|X) = P(X|Y) P(Y) / P(X) = α P(X|Y) P(Y)
It looks like two steps backward as now we need to know
P(X|Y), P(Y), P(X).
But these are the values that we frequently have.
    P(cause|effect) = P(effect|cause) P(cause) / P(effect)
     – P(effect|cause) describes the causal direction
     – P(cause|effect) describes the diagnostic relation
                                                           Using Bayes' rule
Medical diagnosis
   – from past cases we know P(symptoms|disease), P(disease),
     P(symptoms)
   – for a new patient we know symptoms and looking for diagnosis
     P(disease|symptoms)
Example:
   – meningitis causes a stiff neck 70% of the time
   – the prior probability of meningitis is 1/50 000
   – the prior probability of stiff neck is 1%
   What is the probability that a patient having a stiff neck has
     meningitis?
   P(m|s) = P(s|m).P(m) / P(s) = 0.7 * 1/50000 / 0.01 = 0.0014
   Why the conditional probability for the diagnostic direction is not
    stored directly?
       • diagnostic knowledge is often more fragile than causal knowledge
       • for example, if there is a sudden epidemic of meningitis, the
         unconditional probability of meningitis P(m) will go up so P(m|s) should
         also go up while the causal relation P(s|m) is unaffected by the
         epidemic, as it reflects how meningitis works
                                                        Naive Bayes model
What if there are more observations?
We can exploit conditional independence as follows
  P(Toothache,Catch,Cavity)
  = P(Toothache|Cavity) P(Catch|Cavity) P(Cavity)
If all the effects are conditionally independent given
   the cause variable, we get:
P(Cause,Effect1,…,Effectn) = P(Cause) Πi P(Effecti|Cause)
Such a probability distribution is called a naive Bayes model
  (it is often used even in cases where the “effect” variables
  are not actually conditionally independent given the value of
  the cause variable).
                                     The Wumpus world revisited
Wumpus is back!
                                  We have a maze with pits that are
                                  detected in neighboring squares
                                  via breeze (Wumpus and gold will
                                  not be assumed now).
                                  Where does the agent should go,
                                  if there is breeze at (1,2) and
                                  (2,1)?
                                  Pure logical inference can
                                  conclude nothing about which
                                  square is most likely to be safe!
       To which square does the agent should go?
                                     Wumpus: probabilistic model
Boolean variables:
  Pi,j – pit at square (i,j)
  Bi,j – breeze at square (i,j)
  (only for the observed squares B1,1, B1,2 a B2,1.
Full joint probability distribution
  P(P1,2,…,P4,4,B1,1,B1,2,B2,1)                      product rule
  = P(B1,1,B1,2,B2,1 |P1,2,…,P4,4,) * P(P1,2,…,P4,4)
  P(P1,2,…,P4,4) = Πi,j P(Pi,j)          pits are spread independently
  P(P1,2,…,P4,4) = 0.2n * 0.816-n        probability of pit is 0.2 and
                                              there are n pits
                        Wumpus: query and simple reasoning
Assume that we have evidence:
    b = b1,1 ∧ b1,2 ∧ b2,1
    known = ¬p1,1 ∧ ¬p1,2 ∧ ¬p2,1
We are interested in answering queries
  such as P(P1,3 | known, b).
Answer can be computed by enumeration of the full
  joint probability distribution.
    Let Unknown be the variables Pi,j except P1,3 and Known:
    P(P1,3 | known, b)
      = Σunknown P(P1,3, unknown, known, b)
    But it means to explore all possible values of variables
      Unknown and there are 212 = 4096 terms!
    Can we do it better (faster)?
                          Wumpus: conditional independence
Observation:
  The observed breezes are conditionally
  independent of the other variables given
  the known (white), frontier (yellow),
  and query variables.
We split the set of hidden variables into fringe and other
  variables:
   Unknown = Fringe ∪ Other
From conditional independence we get:
   P(b | P1,3, known, unknown) = P(b | P1,3, known, fringe)
Now, let us exploit this formula.
                                                        Wumpus: reasoning
P(P1,3 | known, b)
                                               product rule P(X,Y) = P(X|Y) P(Y)
= α Σunknown P(P1,3, known, unknown, b)
= α Σunknown P(b | P1,3, known, unknown) * P(P1,3, known, unknown)
= α ΣfringeΣother P(b | P1,3, known, fringe,other) * P(P1,3, known, fringe,other)
= α ΣfringeΣother P(b | P1,3,known, fringe) * P(P1,3,known,fringe,other)
= α Σfringe P(b | P1,3,known, fringe) * Σother P(P1,3,known,fringe,other)
= α Σfringe P(b | P1,3,known, fringe) * Σother P(P1,3)P(known)P(fringe)P(other)
= α P(known)P(P1,3) Σfringe P(b | P1,3,known, fringe) P(fringe) Σother P(other)
= α´ P(P1,3) Σfringe P(b | P1,3,known, fringe) P(fringe)
       α = α. P(known)
        Σother P(other) = 1
                                                           Wumpus: solution
 P(P1,3 | known, b) = α´ P(P1,3) Σfringe P(b | P1,3,known, fringe) P(fringe)
 Let us explore possible models (values) of Fringe that are
   compatible with observation b.
 P(P1,3 | known, b)
    = α´ 〈0.2 (0.04 + 0.16 + 0.16), 0.8 (0.04 + 0.16) 〉
    = 〈 0.31, 0.69 〉
 P(P2,2 | known, b) = 〈 0.86, 0.14 〉
 Definitely avoid the square (2,2)!
                                                                                Summary
Probability theory is a formal mechanism to handle
uncertainty.
Full joint distribution describes probabilities of all
possible worlds.
Answers to queries can be obtained by summing out
probabilities of possible worlds consistent with the
observation.
However, larger problems will require a better
approach.
We are going to exploit independence and
conditional independence.
                                © 2016 Roman Barták
            Department of Theoretical Computer Science and Mathematical Logic
                                 bartak@ktiml.mff.cuni.cz