Lecture 6:
Policy Search
            Roberto Calandra
Learning, Adaptive Systems, and Robotics (LASR) Lab
       Machine Learning for Robotics - SuSe 25        v1.01
Previous Lectures
 Reinforcement Learning Formalism
 Markov Decision Process (and variants)
Types of Reinforcement Learning Algorithms
 Policy Search
    Policy Gradient  ← Today
    Bayesian optimization
 Value-based
    Q-learning
 Actor-critic
 Model-based Reinforcement Learning
\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big]
Notation
 Reward to maximize: r
 State: s
 Action: a
 Induced trajectory: τ
 Discount factor: γ
 Future discounted reward (see the sketch after this list):
R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
 State value function:
V^\pi(s) = \mathbb{E}_\pi [ R_t \mid s_t = s ]
 State-Action value function (Q-function):
Q^\pi(s, a) = \mathbb{E}_\pi [ R_t \mid s_t = s, a_t = a ]
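As a quick illustration of the future discounted reward, here is a minimal Python sketch; the function name and the example rewards are made up for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards [1, 0, 2] with gamma = 0.9
# R_2 = 2.0, R_1 = 0 + 0.9 * 2 = 1.8, R_0 = 1 + 0.9 * 1.8 = 2.62
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```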
                                     Policy Gradient
We just learned that policy-gradient methods aim to find parameters θ that maximize the expected return.
The idea is that we have a parameterized stochastic policy: in our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the action preference.
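To make the idea of a parameterized stochastic policy concrete, here is a minimal sketch in PyTorch; the state dimension, number of actions, and network size are arbitrary choices for illustration, not part of the lecture:

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """A small network mapping a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),  # action preferences (logits)
        )

    def forward(self, state):
        logits = self.net(state)
        # Softmax over the logits gives the action probabilities pi_theta(a|s)
        return torch.distributions.Categorical(logits=logits)

policy = CategoricalPolicy()
dist = policy(torch.zeros(4))      # distribution over actions for an example state
action = dist.sample()             # a_t ~ pi_theta(.|s_t)
log_prob = dist.log_prob(action)   # log pi_theta(a_t|s_t), the quantity policy gradients differentiate
```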
Policy Search

 Given an existing parametrized policy π(θ)
 Directly optimize its parameters w.r.t. the reward:

\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big] = \arg\max_\theta J^\pi(\theta)

 In theory, this can be done with your optimizer of choice (e.g., zero-order or first-order)
 Algorithms using first-order optimizers are called Policy Gradient methods
 Use of zero-order optimizers is also sometimes possible
     Bayesian optimization, later in the course

Exact gradient is impossible: to compute ∇J(θ) exactly you would need to sum over all possible trajectories, which is infeasible. So we approximate it using samples: run the policy a few times and estimate the gradient from those episodes.
Environment dynamics are unknown and non-differentiable: the state transitions depend on the environment's rules (dynamics), which we usually cannot differentiate through. The likelihood-ratio trick avoids this by differentiating only the policy, not the environment.
REINFORCE Algorithm

The REINFORCE algorithm (also called Monte-Carlo policy gradient) is a policy-gradient algorithm that uses an estimated return from an entire episode to update the policy parameters.

 Execute policy \pi_\theta(a_t \mid s_t) and obtain trajectories \{\tau^i\}
 Compute gradients:

\nabla_\theta J^\pi(\theta) \approx \sum_i \Big( \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \Big) \Big( \sum_t r(s_t^i, a_t^i) \Big)

 Update policy:

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

REINFORCE uses gradient ascent to maximize the expected return J(θ): the plus sign means you are adding the gradient to your current parameters θ, pushing θ in the direction that increases the expected reward.
If α is too large, learning will not be smooth (and might even diverge); if α is too small, learning is slow. How do we choose α to be right?

 Repeat
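Below is a minimal sketch of one REINFORCE update in PyTorch. It assumes the CategoricalPolicy sketched earlier, a Gymnasium-style environment `env`, and an optimizer such as `torch.optim.SGD(policy.parameters(), lr=1e-3)`; all of these are illustrative assumptions, not the lecture's reference implementation:

```python
import torch

def reinforce_update(policy, optimizer, env, n_episodes=10):
    """One REINFORCE step: sample episodes with the current policy, build the
    surrogate loss whose gradient matches the estimator above, and step the optimizer."""
    losses = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        episode_return = sum(rewards)  # sum_t r(s_t, a_t) for this trajectory
        # (sum_t log pi_theta(a_t|s_t)) * (sum_t r_t); its gradient is the REINFORCE estimator
        losses.append(-torch.stack(log_probs).sum() * episode_return)
    loss = torch.stack(losses).sum()   # negated, so minimizing performs gradient ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                   # theta <- theta + alpha * grad_theta J(theta)
```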
Natural Policy Gradients

One possibility is to constrain the step size:

\| \theta' - \theta \|^2 < \epsilon

This does not work well in practice. Plain gradient steps move parameters θ in Euclidean space, but the same Euclidean step can change the policy distribution a lot or a little, depending on how the policy is parameterized. That makes learning sensitive to how you parameterize the network.

Intuition: take a step, but remain close to your past policy:

D(\pi_{\theta'}, \pi_\theta) < \epsilon

The natural gradient fixes this by measuring the step size with a distributional distance (KL), not with the Euclidean parameter distance, which makes updates invariant to reparameterization and often much more stable. Put differently: measure the step size by how much the policy distribution changes, not by how much the parameters change. The KL divergence is a natural measure of distribution change.

Kullback–Leibler divergence:

D_{KL}(\pi_{\theta'} \| \pi_\theta) \approx (\theta' - \theta)^T F (\theta' - \theta)

Fisher-information matrix:

F = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^T \big]

Natural gradient update:

\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)

The Fisher matrix approximates the local curvature of the KL and acts like a metric on parameter space that measures "distance between policies" rather than "distance between parameters"; this is why the update is invariant to the parameterization. So we choose a direction that increases the expected reward the most, while keeping the new policy close (small KL) to the old policy.

 Theoretical advantage: the learning is agnostic of the parametrization of the policy
 Practical advantage: faster and more robust convergence
 Approximate second-order optimization

In vanilla policy gradient, the meaning of "distance" between policies depends on how θ is parameterized. In NPG, the Fisher matrix makes the update invariant to parameterization: steps are taken in a way that respects the geometry of the policy space, measured by KL divergence.
[Peters, J. & Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, Elsevier, 2008, 21, 682-697]
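To make the update rule above concrete, here is a rough sketch of one natural-gradient step that builds the Fisher matrix explicitly from sampled state-action pairs and solves for F^{-1}∇J directly. This is only feasible for very small policies; practical implementations (e.g., TRPO) approximate the product with conjugate gradient instead. The interface of `policy` (returning a torch Categorical as in the earlier sketch), the damping term, and the batch format are assumptions made for illustration:

```python
import torch

def natural_gradient_step(policy, states, actions, returns, alpha=0.05, damping=1e-3):
    """theta <- theta + alpha * F^{-1} grad J, with F estimated from samples."""
    params = [p for p in policy.parameters() if p.requires_grad]

    def flat_grad(scalar):
        grads = torch.autograd.grad(scalar, params)
        return torch.cat([g.reshape(-1) for g in grads])

    # Score vectors: grad_theta log pi_theta(a|s) for each sampled (s, a) pair
    scores = [flat_grad(policy(s).log_prob(a)) for s, a in zip(states, actions)]

    # Fisher information matrix: average outer product of the score vectors (plus damping)
    F = sum(torch.outer(v, v) for v in scores) / len(scores)
    F = F + damping * torch.eye(F.shape[0])

    # Vanilla (REINFORCE-style) estimate of grad_theta J(theta)
    g = sum(v * R for v, R in zip(scores, returns)) / len(returns)

    step = alpha * torch.linalg.solve(F, g)  # alpha * F^{-1} grad J

    with torch.no_grad():  # write the natural-gradient step back into the parameters
        offset = 0
        for p in params:
            n = p.numel()
            p.add_(step[offset:offset + n].view_as(p))
            offset += n
```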
Trust Region Policy Optimization

TRPO implements the KL-constrained optimization in a practical way: instead of choosing an arbitrary α, it approximately solves the constrained optimization.

\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)

How do we choose α? Backtracking line search:
 Shrink α from a large initial value and accept the first step that improves the reward and satisfies the KL-divergence constraint

How TRPO enforces it:
 Compute a candidate update using the natural policy gradient
 Use backtracking line search: start with a big step size and gradually shrink α until
     the reward actually improves, and
     the KL divergence between the old and the new policy is below the threshold δ

Intuition: Natural Policy Gradients + automatic step size

The natural policy gradient vector can be viewed as the steepest-ascent direction in policy space under the KL divergence as a metric. This means the policy update tries to maximize reward while limiting how much the policy changes in distribution. Advantage estimates reduce variance and improve stability.
[Pirotta, M., Restelli, M. & Bascetta, L. Adaptive step-size for policy gradient methods. Advances in Neural Information Processing Systems, 2013]
[Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization]
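A sketch of the backtracking line search described above, in Python. The helpers `surrogate_reward` (estimated return for a candidate parameter vector) and `kl_divergence` (KL between the old and the candidate policy) are hypothetical functions assumed for illustration; TRPO's actual implementation evaluates a sample-based surrogate objective:

```python
def backtracking_line_search(theta, full_step, surrogate_reward, kl_divergence,
                             max_kl=0.01, shrink=0.5, max_backtracks=10):
    """Start from the full natural-gradient step and shrink it until the surrogate
    reward improves AND the KL divergence to the old policy stays below max_kl."""
    old_reward = surrogate_reward(theta)
    step = full_step
    for _ in range(max_backtracks):
        candidate = theta + step
        improved = surrogate_reward(candidate) > old_reward
        close_enough = kl_divergence(theta, candidate) <= max_kl
        if improved and close_enough:
            return candidate      # accept the first (largest tried) step satisfying both conditions
        step = shrink * step      # otherwise gradually shrink the step size
    return theta                  # no acceptable step found: keep the old policy
```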
Proximal Policy Optimization (PPO)

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

 Advantage function: how much better it is to take a specific action rather than selecting one at random in that state

PPO improves the agent's training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policy, and clip this ratio to a specific range.

The idea behind the policy-gradient objective is that, by taking a gradient ascent step on it (equivalent to taking gradient descent on its negative), we push our agent to take actions that lead to higher rewards and avoid harmful actions. However, the problem comes from the step size: too small, and the training process is too slow; too large, and there is too much variability in the training.

The value function parameters are updated by minimizing the mean-squared error between the predicted values and the computed rewards-to-go (the discounted cumulative reward from time t).

We then take the minimum of the clipped and the unclipped objective, so the final objective is a lower (pessimistic) bound on the unclipped objective. Taking the minimum means we select either the clipped or the unclipped objective, depending on the ratio and the sign of the advantage.
[Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient]
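A minimal sketch of the clipped surrogate loss that PPO optimizes, in PyTorch. It assumes precomputed advantage estimates and the stored log-probabilities of the taken actions under the old policy; the clipping range ε = 0.2 is an illustrative default:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """L = -E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],
    where r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # probability ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic lower bound: take the element-wise minimum, then negate so minimizing ascends J
    return -torch.min(unclipped, clipped).mean()
```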
Real-world considerations
Often unstable algorithms.
Hyperparameter optimization makes a huge difference.
Implementation details make a huge difference too.
To get the most out of these algorithms, the human designer needs to learn how to tune them!
[Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. & Meger, D. (2017). Deep Reinforcement Learning that Matters]
                                                  https://arxiv.org/pdf/1809.07731
Limitations

 High variance in gradient estimates
    Returns are noisy → gradient estimates fluctuate → slower convergence unless variance reduction (like baselines or GAE) is used
 Sample inefficiency
    Require many interactions with the environment to make progress, compared to some value-based methods
 Possibility of poor local optima
    Since they do gradient ascent on a non-convex objective, they can get stuck in suboptimal policies
 Sensitive to hyperparameters
    Learning rate, discount factor, and batch size heavily influence stability
 Instability from large updates
    A single bad update can drastically worsen the policy (why methods like PPO/TRPO were developed)
 Credit assignment difficulty
    In long-horizon tasks, figuring out which actions caused which rewards can be slow
On-policy vs Off-policy
 On-policy methods require data generated by the current policy (on the given task)
 Off-policy methods can use any data (from the given task)
 Until recently, on-policy algorithms were more effective
 In recent years, off-policy RL has become a hot topic as a way to scale up datasets and thus benefit from large deep learning models
 We now have pretty reasonable algorithms
  Questions ?
         OPAL Course Forum
(preferred for course-related questions)
                or
         teaching@lasr.org
Additional Readings
 Peters, J. & Schaal, S. Reinforcement learning of motor skills with policy gradients, Neural
 networks, Elsevier, 2008, 21, 682-697
 TRPO and PPO
 https://www.youtube.com/playlist?list=PLB79uOaPEEU6uU1-Pfaqr08RTTzhyB8hu
                     Lecture 6:
        Policy Search
            Roberto Calandra
Learning, Adaptive Systems, and Robotics (LASR) Lab
       Machine Learning for Robotics - SuSe 25        v1.0