Assignment 9 (Sol.)
                        Reinforcement Learning
                                 Prof. B. Ravindran
1. Which among the following is/are the advantages of using the Deep Q-learning method over
   other learning methods that we have seen?
   (a) a faster implementation of the Q-learning algorithm
   (b) guarantees convergence to the optimal policy
    (c) obviates the need to hand-craft features used in function approximation
   (d) allows the use of off-policy algorithms rather than on-policy learning schemes
  Sol. (c)
   As we have seen with the Atari games example, the input to the network is raw pixel data,
   from which the network learns appropriate features for representing the action value function,
   so no hand-crafted features are needed.
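   For illustration, a minimal sketch of such a network in PyTorch (the stacked 84x84x4 frame
   input and the layer sizes are indicative of the Atari setup, not a definitive reproduction of
   the published architecture):

import torch
import torch.nn as nn

class DQN(nn.Module):
    """Maps raw stacked frames directly to Q-values, one per action."""
    def __init__(self, num_actions):
        super().__init__()
        # convolutional layers learn visual features from raw pixels
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # fully connected head outputs one Q-value per action
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_actions),
        )

    def forward(self, x):  # x: (batch, 4, 84, 84) tensor of pixels
        return self.head(self.features(x))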
2. In the Deep Q-learning method, is ε-greedy (or other equivalent techniques) required to ensure
   exploration, or is this taken care of by the randomisation provided by experience replay?
   (a) no
   (b) yes
  Sol. (b)
   Some technique to ensure exploration is still required. As with the original Q-learning algorithm,
   if we only store and replay transitions generated by acting greedily with respect to the current
   action value function (which is essentially what experience replay gives us without any
   exploration), a large part of the state space will likely remain unexplored.
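   A minimal sketch of ε-greedy action selection layered on top of the Q-network (the names
   q_network, state and epsilon are illustrative, assuming the PyTorch sketch above):

import random
import torch

def select_action(q_network, state, num_actions, epsilon=0.1):
    """With probability epsilon explore; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.randrange(num_actions)      # exploratory action
    with torch.no_grad():
        q_values = q_network(state.unsqueeze(0))  # add a batch dimension
    return int(q_values.argmax(dim=1).item())     # greedy action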
3. Value function based methods are oriented towards finding deterministic policies whereas policy
   search methods are geared towards finding stochastic policies. True or false?
   (a) false
   (b) true
  Sol. (b)
   With value function based methods, the policy is derived from the value function by selecting,
   in each state, the action with the maximum value, which yields a deterministic policy. No such
   maximisation is at work in policy search methods: the learned parameters (obtained, for
   example, by following the gradient of the performance metric) directly determine the agent’s
   policy, which will be stochastic if the optimal policy (global or local) is stochastic.
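   To make the contrast concrete, a small illustrative sketch in NumPy (the Q-values and policy
   parameters below are made up) of a greedy policy extracted from Q versus a softmax policy
   defined directly by its parameters:

import numpy as np

# value-based: the policy is the argmax over action values -> deterministic
q_values = np.array([1.0, 2.5, 0.3])          # Q(s, .) for one state (illustrative)
greedy_action = int(np.argmax(q_values))

# policy search: parameters define action probabilities directly -> can remain stochastic
theta = np.array([0.2, 1.1, -0.4])            # action preferences for the same state
probs = np.exp(theta) / np.exp(theta).sum()   # softmax policy over actions
sampled_action = np.random.choice(len(theta), p=probs)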
4. Suppose we are using a policy gradient method to solve a reinforcement learning problem.
   Assuming that the policy returned by the method is not optimal, which among the following
   are plausible reasons for such an outcome?
   (a) the search procedure converged to a locally optimal policy
   (b) the search procedure was terminated before it could reach the optimal policy
    (c) the sample trajectories arising in the problem were very long
   (d) the optimal policy could not be represented by the parametrisation used to represent the
       policy
  Sol. (a), (b), (d)
   Option (c) may increase the time it takes to converge to a policy, but does not necessarily
   affect the optimality of the policy obtained.
5. In using policy gradient methods, if we make use of the average reward formulation rather
   than the discounted reward formulation, then is it necessary to consider, for problems that do
   not have a unique start state, a designated start state, s0 ?
   (a) no
   (b) yes
  Sol. (a)
   We used the concept of a designated start state so that a single value could be assigned to a
   policy for the purpose of evaluation. The average reward formulation achieves the same end
   without one, i.e., it lets us compare policies according to their long-term expected reward per
   step, ρ(π), where
                            ρ(π) = lim_{N→∞} (1/N) E{r1 + r2 + r3 + ... + rN | π}
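   A minimal Monte Carlo sketch of estimating ρ(π) as the average reward per step over a long
   run (env and policy are hypothetical, with a Gym-style step(action) -> (state, reward, done,
   info) interface; an ergodic problem is assumed, so the estimate does not depend on the start
   state):

def estimate_average_reward(env, policy, num_steps=100_000):
    """Estimate rho(pi): long-run average reward per step under policy pi."""
    state = env.reset()
    total_reward = 0.0
    for _ in range(num_steps):
        action = policy(state)                     # sample an action from pi(.|state)
        state, reward, done, _ = env.step(action)
        total_reward += reward
        if done:                                   # restart; rho does not depend on the start state
            state = env.reset()
    return total_reward / num_steps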
6. Using similar parametrisations to represent policies, would you expect, in general, MC policy
   gradient methods to converge faster or slower than actor-critic methods assuming that the
   approximation to Qπ used in the actor-critic method satisfies the compatibility criteria?
   (a) slower
   (b) faster
  Sol. (a)
   As we have seen, MC policy gradient algorithms may suffer from large variance in their return
   estimates, particularly with long episodes, which can slow down convergence. Actor-critic
   methods, by relying on value function estimates instead of full sampled returns, reduce this
   variance and hence tend to converge faster.
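   An illustrative sketch in NumPy contrasting the two gradient estimators (grad_log_pi, returns
   and q_estimates are placeholders for the per-step quantities each algorithm would compute):

import numpy as np

def reinforce_gradient(grad_log_pi, returns):
    # MC policy gradient: weight each grad log pi(a|s) by the full sampled return G_t,
    # which is high-variance when episodes are long
    return np.mean([g * G for g, G in zip(grad_log_pi, returns)], axis=0)

def actor_critic_gradient(grad_log_pi, q_estimates):
    # actor-critic: weight by the critic's estimate f_w(s, a) of Q^pi(s, a) instead,
    # trading a little bias for much lower variance
    return np.mean([g * q for g, q in zip(grad_log_pi, q_estimates)], axis=0)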
7. If fw approximates Qπ and is compatible with the parameterisation used for the policy, then
   this indicates that we can use fw in place of Qπ in the expression for calculating the gradient
   of the policy performance metric with respect to the policy parameter because
   (a) Qπ (s, a) − fw (s, a) = 0 in the direction of the gradient of fw (s, a)
   (b) Qπ (s, a) − fw (s, a) = 0 in the direction of the gradient of π(s, a)
   (c) the error between Qπ and fw is orthogonal to the gradient of the policy parameterisation
  Sol. (b), (c)
   As indicated by options (b) & (c), we can use fw in place of Qπ if the difference between the
   two is zero in the direction of the gradient of π(s, a); equivalently, the approximation error
   Qπ − fw is orthogonal to the gradient of the policy parameterisation, so it contributes nothing
   to the policy gradient expression.
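   For reference, a sketch of the standard compatibility result (notation as in the options above):
   if the approximator satisfies

                  ∇_w fw(s, a) = ∇_θ π(s, a) / π(s, a) = ∇_θ log π(s, a)

   and w is chosen so that

              Σ_s d^π(s) Σ_a π(s, a) [Qπ(s, a) − fw(s, a)] ∇_w fw(s, a) = 0,

   then the policy gradient can be written exactly with fw in place of Qπ:

                  ∇_θ ρ(π) = Σ_s d^π(s) Σ_a ∇_θ π(s, a) fw(s, a).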
8. Suppose we use the actor-critic algorithm described in the lectures where Qπ is approximated
   and the approximation used is compatible with the parametrisation used for the actor. As-
   suming the use of differentiable function approximators, we can conclude that the use of such
   a scheme will result in
   (a) convergence to a globally optimal policy
   (b) convergence to a locally optimal policy
   (c) cannot comment on the convergence of such an algorithm
  Sol. (b)
  The idea behind the two theorems (policy gradient and policy gradient with function approx-
  imation) that we saw in the lectures is to show that using such an approach will result in the
  policy converging. However, we can only prove convergence to a locally optimal policy.