Practice Assignment 12
Reinforcement Learning
Prof. B. Ravindran
Instructions: In the following questions, one or more choices may be correct. Select all that apply.
1. Suppose that we solve a POMDP using a Q-MDP-like solution as discussed in the lectures, where
we assume that the MDP is known and solve it to learn Q values for the true (state, action)
pairs. Which of the following are true?
(a) We can recover a policy for execution in the partially observable environment by weighting
Q values by the belief distribution bel so that $\pi(bel) = \arg\max_a \sum_s bel(s)\,Q(s, a)$.
(b) We can recover an optimal policy for the POMDP from the Q values that have been
learnt for the true (state, action) pairs.
(c) Policies recovered from Q-MDP-like solution methods are always better than policies
learnt by history-based methods.
(d) None of the above
Sol. (a)
(a) is true. This strategy can be used to recover a policy for execution.
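(b) is false. When learning the Q values, we assumed that the state was available. However, this is
not true for the partially observable environment, so it will typically not be possible to recover
a policy that is optimal for the POMDP from the learnt Q values.
(c) is false. There is no such guarantee: a Q-MDP-like policy acts as if the state uncertainty
disappears after one step, so it places no value on information-gathering actions, and a
history-based method can outperform it on some problems.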
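The belief-weighted action choice in option (a) can be sketched in a few lines of Python. This is a minimal illustration, not code from the lectures: the function name qmdp_action and the two-state, three-action Q table are hypothetical, chosen only to show that the selected action changes with the belief.

```python
def qmdp_action(belief, q_values):
    """Return argmax_a sum_s belief[s] * q_values[s][a].

    belief   : list of probabilities, one per state
    q_values : list of rows, q_values[s][a] = Q value learnt on the true MDP
    """
    num_actions = len(q_values[0])
    # Expected Q value of each action under the current belief.
    expected_q = [
        sum(b * q_row[a] for b, q_row in zip(belief, q_values))
        for a in range(num_actions)
    ]
    return max(range(num_actions), key=lambda a: expected_q[a])


# Hypothetical Q table: action 0 is best in state 0, action 1 is best in
# state 1, and action 2 is a "safe" action that is decent in both states.
Q_TABLE = [
    [10.0, 0.0, 6.0],
    [0.0, 10.0, 6.0],
]

print(qmdp_action([1.0, 0.0], Q_TABLE))  # 0
print(qmdp_action([0.0, 1.0], Q_TABLE))  # 1
print(qmdp_action([0.5, 0.5], Q_TABLE))  # 2
```

Under a uniform belief the expected Q values are 5, 5 and 6, so the safe action is chosen even though it is optimal in neither state. The recovered policy is executable, but nothing here makes it optimal for the POMDP.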
2. Consider the grid-world below:
In the figure above, black squares are blocked. Assume the agent can see one step in the 4
cardinal directions. Assume that the agent’s observations are always correct and that there is
no prior information given regarding the states.
Assertion: If the observation is that there are no obstructions to the East or West, but there are
obstructions to the North and South, the belief that the agent is in the green shaded square is 0.5.
Reason: Only the green and blue shaded squares have obstructions to the North and South,
but not to the East or West.
(a) Assertion and Reason are both true and Reason is a correct explanation of Assertion.
(b) Assertion and Reason are both true and Reason is not a correct explanation of Assertion.
(c) Assertion is true but Reason is false.
(d) Assertion and Reason are both false.
Sol. (d)
The square 3 units to the north of the green square also has the same property, invalidating
the assertion and the reason.
3. Consider the grid world shown below. Walls and obstacles are colored gray. An agent is
dropped into one of the unoccupied cells of the environment uniformly at random. The agent
is equipped with a sensor that can detect the presence of walls or obstacles immediately to its
North, South, East or West. However, the sensor is noisy, and the observation made in each
direction may be wrong with probability 0.1. Given that the agent senses no obstacles in
any direction, what is the probability that it was dropped into the cell marked ‘x’?
(a) 1/5
(b) 82/91
(c) 164/173
(d) None of the above.
Sol. (d)
Applying Bayes' rule,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$$
with A being the event that the agent was “dropped into the cell marked x” and B being
“No obstacles observed.”
$$P(A \mid B) = \frac{(0.9^4)(0.2)}{(0.9^4)(0.2) + (0.1^3 \times 0.9)(0.8)} = \frac{729}{733}$$
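The arithmetic can be checked with exact fractions. The sketch below is only a numerical check of the expression in the solution; the helper name posterior is hypothetical. The two likelihoods are read off the solution: four correct readings for the cell marked x, and three wrong readings plus one correct reading for each of the other cells.

```python
from fractions import Fraction


def posterior(prior_a, lik_a, lik_not_a):
    """Bayes' rule for a binary event A given evidence B.

    prior_a   : P(A)
    lik_a     : P(B | A)
    lik_not_a : P(B | not A)
    """
    numerator = lik_a * prior_a
    return numerator / (numerator + lik_not_a * (1 - prior_a))


P_WRONG = Fraction(1, 10)   # probability that a single reading is wrong
P_RIGHT = 1 - P_WRONG       # probability that a single reading is correct

q3_answer = posterior(
    prior_a=Fraction(1, 5),               # uniform over 5 free cells
    lik_a=P_RIGHT ** 4,                   # all four readings correct at x
    lik_not_a=P_WRONG ** 3 * P_RIGHT,     # three wrong, one correct elsewhere
)
print(q3_answer)  # 729/733
```

Since 729/733 matches none of options (a) to (c), the answer is (d).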
4. In the same environment as Question 3, what is the probability that the agent was not dropped
onto the cell marked ‘x’, if the observation made is that there are obstacles present only to the
North and to the South?
(a) 4/5
(b) 82/91
(c) 164/173
(d) None of the above.
Sol. (c)
Applying Bayes' rule,
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)}$$
with A being the event that the agent was “not dropped into the cell marked x” and B being
“Obstacles observed only to the North and to the South.”
$$P(A \mid B) = \frac{[0.4 \times 0.1 \times 0.9^3] + [0.4 \times 0.1^3 \times 0.9]}{[0.4 \times 0.1 \times 0.9^3] + [0.4 \times 0.1^3 \times 0.9] + (0.9^2 \times 0.1^2)(0.2)} = \frac{164}{173}$$
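The same result can be obtained by computing the full belief over cells. The sketch below assumes a plus-shaped layout: the cell marked x in the centre with one free cell on each side. The figure is not reproduced here, so this layout is an assumption inferred from the likelihood terms in the solution, and the names WALLS, likelihood and belief are hypothetical.

```python
from fractions import Fraction

DIRECTIONS = ("N", "S", "E", "W")

# ASSUMED layout (inferred from the solution, not taken from the figure):
# for each free cell, the set of directions that are truly blocked.
WALLS = {
    "x": set(),
    "north": {"N", "E", "W"},
    "south": {"S", "E", "W"},
    "east": {"N", "S", "E"},
    "west": {"N", "S", "W"},
}


def likelihood(observed, actual, p_wrong=Fraction(1, 10)):
    """P(observation | cell): each direction is misread independently."""
    result = Fraction(1)
    for direction in DIRECTIONS:
        reading_correct = (direction in observed) == (direction in actual)
        result *= (1 - p_wrong) if reading_correct else p_wrong
    return result


def belief(observed):
    """Posterior over cells after one observation, from a uniform prior."""
    prior = Fraction(1, len(WALLS))
    joint = {cell: prior * likelihood(observed, walls)
             for cell, walls in WALLS.items()}
    total = sum(joint.values())
    return {cell: p / total for cell, p in joint.items()}


b = belief({"N", "S"})   # obstacles sensed only to the North and South
print(1 - b["x"])        # 164/173
```

Under the same assumed layout, belief(set())["x"] gives 729/733, the value obtained in Question 3.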
5. Assertion: In partially observable systems, histories that include both the sequence of
observations and the sequence of actions are typically able to disambiguate the true state of an
agent better than histories that include only the sequence of observations.
Reason: Different sequences of actions can lead to different interpretations of the sequence
of sensor observations.
(a) Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion.
(b) Both Assertion and Reason are true, but Reason is not a correct explanation of the
Assertion.
(c) Assertion is true, Reason is false
(d) Both Assertion and Reason are false
Sol. (a)
Both Assertion and Reason are true, and Reason is a correct explanation of the Assertion. Refer
to the lecture on Solving POMDPs.
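A small illustration of why actions help, using a made-up example rather than one from the lectures: a four-cell corridor with noise-free observations, in which the same observation sequence points to different cells depending on the action taken. The corridor, the function names and the action labels are all hypothetical.

```python
# Hypothetical corridor: what the agent sees in cells 0, 1, 2 and 3.
OBS = ["door", "wall", "wall", "door"]
ACTIONS = ("left", "right")


def step(cell, action):
    """Deterministic move; the agent stays put when it hits either end."""
    nxt = cell + (1 if action == "right" else -1)
    return min(max(nxt, 0), len(OBS) - 1)


def consistent_states(observations, actions=None):
    """Return the sorted cells the agent could be in after the history.

    If actions is None, the action taken at each step is treated as unknown.
    """
    states = {c for c in range(len(OBS)) if OBS[c] == observations[0]}
    for t, obs in enumerate(observations[1:]):
        choices = [actions[t]] if actions is not None else ACTIONS
        moved = {step(c, a) for c in states for a in choices}
        states = {c for c in moved if OBS[c] == obs}
    return sorted(states)


print(consistent_states(["door", "wall"]))             # [1, 2]
print(consistent_states(["door", "wall"], ["right"]))  # [1]
print(consistent_states(["door", "wall"], ["left"]))   # [2]
```

With observations alone, the agent cannot tell cells 1 and 2 apart. Knowing the action resolves the ambiguity, and each action leads to a different interpretation of the same observations.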