
Assignment 9 (Sol.)
Reinforcement Learning
Prof. B. Ravindran

1. Which among the following is/are the advantages of using the Deep Q-learning method over
other learning methods that we have seen?

(a) a faster implementation of the Q-learning algorithm
(b) guarantees convergence to the optimal policy
(c) obviates the need to hand-craft features used in function approximation
(d) allows the use of off-policy algorithms rather than on-policy learning schemes

Sol. (c)
As we have seen with the Atari games example, the input to the network is raw pixel data
from which the network manages to learn appropriate features to be used for representing the
action value function.
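For concreteness, here is a minimal sketch (not part of the original solution) of such a network in PyTorch, roughly following the three-convolutional-layer architecture reported for the Atari agents; preprocessing, frame stacking and the training loop are omitted, and the layer sizes are assumptions for illustration.

    import torch
    import torch.nn as nn

    class DQN(nn.Module):
        """Maps a stack of 4 raw 84x84 frames to one Q-value per action (no hand-crafted features)."""
        def __init__(self, n_actions: int):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
                nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
                nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            )
            self.head = nn.Sequential(
                nn.Flatten(), nn.Linear(64 * 7 * 7, 512), nn.ReLU(), nn.Linear(512, n_actions)
            )

        def forward(self, x):  # x: (batch, 4, 84, 84) tensor of raw pixels
            return self.head(self.features(x))

    q_values = DQN(n_actions=6)(torch.zeros(1, 4, 84, 84))  # -> tensor of shape (1, 6)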
2. In the Deep Q-learning method, is ε-greedy (or some other equivalent technique) required to ensure
exploration, or is this taken care of by the randomisation provided by experience replay?

(a) no
(b) yes

Sol. (b)
Some technique to ensure exploration is still required. As with the original Q-learning algorithm, if we only gather transitions by acting greedily with respect to the current action value function (which is essentially what we get with experience replay but no explicit exploration), a large part of the state space will likely remain unexplored.
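As an illustration (a sketch only; the Q-value and environment interfaces below are placeholders), exploration is typically injected at action-selection time with ε-greedy, independently of the replay buffer, which merely re-samples already collected transitions:

    import random
    from collections import deque

    replay_buffer = deque(maxlen=100_000)  # experience replay: stores past transitions
    epsilon = 0.1

    def select_action(q_values_for_state):
        # Replay only reshuffles data we already gathered; exploration has to be added here.
        if random.random() < epsilon:
            return random.randrange(len(q_values_for_state))          # explore
        return max(range(len(q_values_for_state)),
                   key=lambda a: q_values_for_state[a])               # exploit

    # After acting:      replay_buffer.append((s, a, r, s_next, done))
    # During training:   batch = random.sample(replay_buffer, k=32)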
3. Value function based methods are oriented towards finding deterministic policies whereas policy
search methods are geared towards finding stochastic policies. True or false?

(a) false
(b) true

Sol. (b)
With value function based methods, policies are derived from the value function by choosing, in each state, the action with the maximum value, which yields a deterministic policy. In policy search methods there is no such maximisation at work: the learned parameters (obtained, for example, by following the policy gradient) directly determine the agent's policy, which is likely to be stochastic if the optimal policy (global or local) is stochastic.
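A toy numpy sketch of the contrast (the numbers are made up for illustration): a greedy policy read off a value function is deterministic, while a softmax policy defined directly by parameters remains stochastic.

    import numpy as np

    # Value-function route: act greedily w.r.t. Q(s, .) -> a single action is always chosen.
    q_values = np.array([1.2, 0.7, 1.5])          # hypothetical Q(s, a) for one state
    greedy_action = int(np.argmax(q_values))      # always action 2 in this state

    # Policy-search route: parameters directly define action probabilities (softmax here).
    theta = np.array([1.2, 0.7, 1.5])             # hypothetical action preferences
    probs = np.exp(theta) / np.exp(theta).sum()   # roughly [0.34, 0.21, 0.46]
    action = np.random.default_rng(0).choice(3, p=probs)   # sampled, not fixed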

4. Suppose we are using a policy gradient method to solve a reinforcement learning problem.
Assuming that the policy returned by the method is not optimal, which among the following
are plausible reasons for such an outcome?

(a) the search procedure converged to a locally optimal policy
(b) the search procedure was terminated before it could reach the optimal policy
(c) the sample trajectories arising in the problem were very long
(d) the optimal policy could not be represented by the parametrisation used to represent the policy

Sol. (a), (b), (d)
Option (c) may result in an increase in the time it takes to converge to a policy, but does not necessarily affect the optimality of the policy obtained.
5. In using policy gradient methods, if we make use of the average reward formulation rather
than the discounted reward formulation, then is it necessary to consider, for problems that do
not have a unique start state, a designated start state, s0 ?

(a) no
(b) yes

Sol. (a)
We used the concept of a designated start state so that a single value can be assigned to a policy for the purpose of evaluation. The average reward formulation achieves the same thing: it lets us compare policies according to their long-term expected reward per step, ρ(π), where
ρ(π) = lim_{N→∞} (1/N) E{r_1 + r_2 + r_3 + ... + r_N | π}
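As a small illustration of why no designated start state is needed (the two-state chain and its rewards below are invented for this sketch), the per-step average reward estimated from a long trajectory converges to the same ρ(π) regardless of where the simulation starts, provided the chain is ergodic:

    import numpy as np

    P = np.array([[0.9, 0.1],    # hypothetical state-transition matrix under a fixed policy
                  [0.5, 0.5]])
    r = np.array([1.0, 0.0])     # reward collected in each state

    def avg_reward(start, steps=200_000, seed=0):
        rng = np.random.default_rng(seed)
        s, total = start, 0.0
        for _ in range(steps):
            total += r[s]
            s = rng.choice(2, p=P[s])
        return total / steps

    print(avg_reward(0), avg_reward(1))   # both approach the same rho(pi), about 0.83 here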

6. Using similar parametrisations to represent policies, would you expect, in general, MC policy
gradient methods to converge faster or slower than actor-critic methods assuming that the
approximation to Qπ used in the actor-critic method satisfies the compatibility criteria?

(a) slower
(b) faster

Sol. (a)
As we have seen, MC policy gradient algorithms can suffer from large variance due to long episode lengths, which slows down convergence. Actor-critic methods, by relying on value function estimates, reduce this variance and hence tend to converge faster.
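A minimal sketch of the two update targets on a made-up four-step episode (the rewards and critic values are illustrative): the Monte Carlo target depends on every subsequent reward, while the one-step actor-critic target bootstraps from the critic.

    import numpy as np

    gamma = 0.99
    rewards = np.array([0.0, 0.0, 0.0, 1.0])   # hypothetical episode rewards
    values  = np.array([0.1, 0.2, 0.5, 0.0])   # critic's V(s_{t+1}); terminal value is 0

    # Monte Carlo (REINFORCE) target: the full return G_t from each time step.
    returns = np.zeros_like(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g

    # One-step actor-critic target: r_t + gamma * V(s_{t+1}).
    td_targets = rewards + gamma * values

    print(returns)      # each entry sums many noisy future rewards -> higher variance
    print(td_targets)   # each entry uses one reward plus the critic -> lower variance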

7. If fw approximates Qπ and is compatible with the parameterisation used for the policy, then
this indicates that we can use fw in place of Qπ in the expression for calculating the gradient
of the policy performance metric with respect to the policy parameter because

(a) Qπ(s, a) − fw(s, a) = 0 in the direction of the gradient of fw(s, a)
(b) Qπ(s, a) − fw(s, a) = 0 in the direction of the gradient of π(s, a)
(c) the error between Qπ and fw is orthogonal to the gradient of the policy parameterisation

Sol. (b), (c)
As indicated by options (b) and (c), we can use fw in place of Qπ if the difference between the two is zero in the direction of the gradient of π(s, a).
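A numerical sanity check of the compatibility condition (a sketch with randomly generated, made-up features): for a softmax policy with a compatible critic fw(s, a) = wᵀ ∇θ log π(a|s), the gradient of fw with respect to w coincides with the gradient of log π with respect to θ.

    import numpy as np

    rng = np.random.default_rng(0)
    n_actions, d = 3, 4
    phi = rng.normal(size=(n_actions, d))   # hypothetical features phi(s, a) for one state
    theta = rng.normal(size=d)              # policy parameters

    def pi(th):
        prefs = phi @ th
        e = np.exp(prefs - prefs.max())
        return e / e.sum()

    def grad_log_pi(th, a):
        return phi[a] - pi(th) @ phi        # analytic gradient of log pi(a|s) w.r.t. theta

    # Compatible critic: f_w(s, a) = w . grad_log_pi(a|s), so grad_w f_w = grad_theta log pi.
    a, eps = 1, 1e-6
    numeric = np.zeros(d)                   # finite-difference estimate of grad_theta log pi
    for i in range(d):
        t_plus, t_minus = theta.copy(), theta.copy()
        t_plus[i] += eps
        t_minus[i] -= eps
        numeric[i] = (np.log(pi(t_plus)[a]) - np.log(pi(t_minus)[a])) / (2 * eps)

    print(np.allclose(numeric, grad_log_pi(theta, a), atol=1e-5))   # True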
8. Suppose we use the actor-critic algorithm described in the lectures where Qπ is approximated and the approximation used is compatible with the parametrisation used for the actor. Assuming the use of differentiable function approximators, we can conclude that the use of such a scheme will result in

(a) convergence to a globally optimal policy
(b) convergence to a locally optimal policy
(c) cannot comment on the convergence of such an algorithm

Sol. (b)
The two theorems that we saw in the lectures (the policy gradient theorem and the policy gradient theorem with function approximation) show that using such an approach will result in the policy converging. However, we can only prove convergence to a locally optimal policy.
