
Lecture 6:

Policy Search
Roberto Calandra
Learning, Adaptive Systems, and Robotics (LASR) Lab

Machine Learning for Robotics - SuSe 25 v1.01


Previous Lectures

Reinforcement Learning Formalism


Markov Decision Process (and variants)

2
Types of Reinforcement Learning Algorithms

Policy Search
    Policy Gradient  ← Today
    Bayesian optimization
Value-based
    Q-learning
Actor-critic
Model-based Reinforcement Learning

θ* = argmax_θ E_{τ ∼ p_θ(τ)} [ Σ_t r(s_t, a_t) ]
3
Notation
Reward to maximize: r
State: s
Action: a
Induced trajectory: τ
Discount factor: γ
Future discounted reward: R_t = Σ_{k=t}^{T} γ^{k−t} r_k

State value function:
V^π(s) = E_π [ R_t | s_t = s ]

State-Action value function (Q-function):
Q^π(s, a) = E_π [ R_t | s_t = s, a_t = a ]

4
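To make the discounted-return definition above concrete, here is a minimal Python sketch; the function name and the example rewards are illustrative, not taken from the lecture.

def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every timestep t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Iterate backwards so each R_t reuses R_{t+1}: R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a three-step episode with gamma = 0.9
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]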


Policy Gradient
We just learned that policy-gradient methods aim to find the parameters θ that maximize the expected return.

The idea is that we have a parameterized stochastic policy: in our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the action preference.

5
Policy Search

Exact gradient is impossible: to compute ∇_θ J(θ) exactly, you would need to sum over all possible trajectories, which is infeasible. So we approximate it using samples: run the policy a few times and estimate the gradient from those episodes.

Environment dynamics are unknown / non-differentiable: the state transitions depend on the environment's rules (dynamics), which we usually cannot differentiate through. The likelihood-ratio trick avoids this by differentiating only the policy, not the environment.

Given an existing parametrized policy π_θ

Directly optimize its parameters w.r.t. the reward:

θ* = argmax_θ E_{τ ∼ p_θ(τ)} [ Σ_t r(s_t, a_t) ] = argmax_θ J(θ)

In theory, this can be done with your optimizer of choice
(e.g., zero-order or first-order).
Algorithms using first-order optimizers take the name of Policy Gradient.
Use of zero-order optimizers is also sometimes possible
(Bayesian optimization, later in the course); see the sketch after this slide for a simple zero-order example.
6
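To illustrate the zero-order option mentioned on this slide, here is a minimal random-search sketch. `rollout_return` is a hypothetical placeholder for "run the policy with parameters θ and return the summed reward of an episode"; the quadratic toy objective only stands in for it so the snippet runs.

import numpy as np

def rollout_return(theta):
    # Placeholder objective: in practice this would run the policy in the
    # environment and return the (estimated) expected return J(theta).
    return -np.sum((theta - 1.0) ** 2)

def random_search(theta, iters=200, sigma=0.1, seed=0):
    """Zero-order optimization: perturb parameters, keep whatever scores best."""
    rng = np.random.default_rng(seed)
    best_return = rollout_return(theta)
    for _ in range(iters):
        candidate = theta + sigma * rng.standard_normal(theta.shape)
        candidate_return = rollout_return(candidate)
        if candidate_return > best_return:  # no gradients needed, only evaluations
            theta, best_return = candidate, candidate_return
    return theta, best_return

theta_star, best_J = random_search(np.zeros(4))
print(theta_star, best_J)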
REINFORCE Algorithm

The REINFORCE algorithm (also called Monte-Carlo policy gradient) is a policy-gradient algorithm that uses an estimated return from an entire episode to update the policy parameters.

Execute the policy π_θ(a_t | s_t) and obtain trajectories {τ^i}

Compute gradients:

∇_θ J(θ) ≈ Σ_i ( Σ_t ∇_θ log π_θ(a_t^i | s_t^i) ) ( Σ_t r(s_t^i, a_t^i) )

REINFORCE uses gradient ascent to maximize the expected return J(θ).

Update the policy:

θ ← θ + α ∇_θ J(θ)

The + sign means you are adding the gradient to your current parameters θ, pushing θ in the direction that increases the expected reward.

If α is too large, learning will not be smooth (and might even diverge).
If α is too small, learning is slow.

Repeat.

How do we choose α to be right? (A code sketch of the update follows below.)

7
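A minimal sketch of the REINFORCE update above, assuming a Gymnasium-style discrete-action environment and a small PyTorch policy network; the network sizes, the `env` object, and the helper names are illustrative, not part of the lecture material.

import torch
import torch.nn as nn

# Assumed: 4-dimensional observations, 2 discrete actions (CartPole-like).
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

def run_episode(env):
    """Execute pi_theta(a_t | s_t) for one episode; collect log-probs and rewards."""
    log_probs, rewards = [], []
    obs, _ = env.reset()
    done = False
    while not done:
        logits = policy(torch.as_tensor(obs, dtype=torch.float32))
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))        # log pi_theta(a_t | s_t)
        obs, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    return log_probs, rewards

def reinforce_update(env, num_episodes=10):
    """One gradient-ascent step on the Monte-Carlo estimate of grad J(theta)."""
    loss = 0.0
    for _ in range(num_episodes):
        log_probs, rewards = run_episode(env)
        episode_return = sum(rewards)                  # sum_t r(s_t, a_t)
        # Ascent on J(theta) == descent on -J(theta), hence the minus sign.
        loss = loss - torch.stack(log_probs).sum() * episode_return
    optimizer.zero_grad()
    (loss / num_episodes).backward()
    optimizer.step()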
Natural Policy Gradients

One possibility is to constrain the step size:

‖θ′ − θ‖² < ε

Does not work well in practice.

Plain gradient steps move the parameters θ in Euclidean space, but the same Euclidean step can change the policy distribution a lot or a little, depending on how the policy is parameterized. That makes learning sensitive to how you parameterize the network.

Intuition: take a step, but remain close to your past policy:

D(π_θ′, π_θ) < ε

Natural gradient fixes this by measuring the step size with a distributional distance (the KL divergence), not with the Euclidean parameter distance. NPG says: measure the step size by how much the policy distribution changes, not by how much the parameters change. The KL divergence is a natural measure of distribution change. That makes the updates invariant to reparameterization and often much more stable.

Kullback–Leibler divergence:

D_KL(π_θ′ ‖ π_θ) ≈ (θ′ − θ)^T F (θ′ − θ)

Fisher-information matrix:

F = E_{π_θ} [ ∇_θ log π_θ(a|s) ∇_θ log π_θ(a|s)^T ]

The Fisher matrix approximates the local curvature of the KL divergence and acts like a metric on parameter space that measures "distance between policies" rather than "distance between parameters". This is why the update is invariant to the parameterization.

θ ← θ + α F^{-1} ∇_θ J(θ)

So we choose a direction that increases the expected reward the most, but keep the new policy close (small KL) to the old policy. (A small sketch of this update follows below.)

Theoretical advantage: the learning is agnostic of the parametrization of the policy.
In vanilla policy gradient, the meaning of "distance" between policies depends on how θ is parameterized. In NPG, the Fisher matrix makes the update invariant to parameterization: steps are taken in a way that respects the geometry of the policy space, measured by the KL divergence.

Practical advantage: faster and more robust convergence.
Approximate second-order optimization.

[Peters, J. & Schaal, S. Reinforcement learning of motor skills with policy gradients, Neural Networks, Elsevier, 2008, 21, 682-697]
8
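A minimal sketch of the update θ ← θ + α F⁻¹ ∇_θ J(θ), assuming you already have per-sample score vectors ∇_θ log π_θ(a|s) and an ordinary policy-gradient estimate; the random arrays at the bottom are placeholders just so the snippet runs.

import numpy as np

def natural_gradient_step(theta, scores, vanilla_grad, alpha=0.05, damping=1e-3):
    """scores: [num_samples, num_params] matrix of grad log pi_theta(a|s)."""
    # Empirical Fisher matrix: average outer product of the score vectors.
    fisher = scores.T @ scores / scores.shape[0]
    # Small damping keeps F invertible when few samples are available.
    fisher += damping * np.eye(fisher.shape[0])
    natural_grad = np.linalg.solve(fisher, vanilla_grad)   # F^{-1} grad J(theta)
    return theta + alpha * natural_grad

# Placeholder data: 6 parameters, 128 sampled state-action pairs.
rng = np.random.default_rng(0)
theta = np.zeros(6)
scores = rng.standard_normal((128, 6))
vanilla_grad = rng.standard_normal(6)
theta_new = natural_gradient_step(theta, scores, vanilla_grad)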
Trust Region Policy Optimization

TRPO implements the KL-constrained optimization in a practical way. Instead of choosing an arbitrary α, it solves the constrained optimization approximately.

θ ← θ + α F^{-1} ∇_θ J(θ)
How do we choose α?

Backtracking line search:
Choose α by starting from a large step and shrinking it until the step improves the reward and satisfies the KL-divergence constraint.
    Compute a candidate update using the natural policy gradient.
    Start with a big step size α.
    Gradually shrink α until:
        the reward actually improves, and
        the KL divergence between the old and new policy is ≤ ε.
(A sketch of this line search follows below.)

How TRPO enforces it: the natural policy gradient vector can be viewed as the steepest ascent direction in policy space, under the KL divergence as a metric. This means the policy update tries to maximize the reward while limiting how much the policy changes in distribution.

Intuition: Natural Policy Gradients + automatic step size

Advantage estimates reduce variance and improve stability.

[Pirotta, M., Restelli, M. & Bascetta, L. Adaptive step-size for policy gradient methods, Advances in Neural Information Processing Systems, 2013]
9
[Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization]
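A minimal sketch of the backtracking line search described on this slide; `surrogate` and `kl_to_old` are hypothetical callbacks standing in for "estimate the expected return of these parameters" and "KL divergence to the current policy", not real library functions.

def backtracking_step(theta, natural_grad, surrogate, kl_to_old,
                      max_step=1.0, shrink=0.5, kl_limit=0.01, max_tries=10):
    """Try the natural-gradient direction with a shrinking step size."""
    old_value = surrogate(theta)
    step = max_step                           # start with a big step size
    for _ in range(max_tries):
        candidate = theta + step * natural_grad
        # Accept the first step that improves reward AND respects the KL limit.
        if surrogate(candidate) > old_value and kl_to_old(candidate) <= kl_limit:
            return candidate
        step *= shrink                        # gradually shrink the step
    return theta                              # no acceptable step: keep old policy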
Proximal Policy Optimization (PPO)

Advantage function:
A^π(s, a) = Q^π(s, a) − V^π(s)
How much better it is to take a specific action over randomly selecting one.

Proximal Policy Optimization (PPO) is an algorithm that improves our agent's training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policy, and clip this ratio to a specific range.

The idea is that by taking a gradient ascent step on this objective (equivalent to taking gradient descent on its negative), we push our agent to take actions that lead to higher rewards and to avoid harmful actions.

However, the problem comes from the step size:
    Too small: the training process is too slow.
    Too high: there is too much variability in the training.

We take the minimum of the clipped and non-clipped objective, so the final objective is a lower (pessimistic) bound of the unclipped objective. Taking the minimum means we select either the clipped or the non-clipped objective, depending on the ratio and the sign of the advantage. (A sketch of the clipped objective follows below.)

The value-function parameters are updated by minimizing the mean-squared error between the predicted values and the computed rewards-to-go.
10
[Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient]
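A minimal sketch of the clipped surrogate objective described above, assuming you already have the log-probabilities of the taken actions under the old and new policy and an advantage estimate (all tensor arguments here are placeholders):

import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, clip_eps=0.2):
    """Clipped surrogate objective, returned as a loss to be minimized."""
    ratio = torch.exp(log_prob_new - log_prob_old)       # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # Minimum of clipped and unclipped terms = pessimistic lower bound;
    # negated because the optimizer minimizes.
    return -torch.minimum(unclipped, clipped).mean()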
Real-world considerations

Often unstable algorithms.

Hyperparameter optimization makes a huge difference.

Implementation details make a huge difference too.

To get the most out of these algorithms, the human designer needs to learn how to tune them!

11
[Deep Reinforcement Learning that Matters Peter Henderson, Riashat Islam, Philip Bachman, Joelle Pineau, Doina Precup, David Meger 2017]
https://arxiv.org/pdf/1809.07731
Limitations

High variance in gradient estimates
Returns are noisy → gradient estimates fluctuate → slower convergence, unless variance reduction (like baselines or GAE) is used. (A small baseline sketch follows below.)

Sample inefficiency
Requires many interactions with the environment to make progress, compared to some value-based methods.

Possibility of poor local optima
Since they do gradient ascent on a non-convex objective, they can get stuck in suboptimal policies.

Sensitive to hyperparameters
Learning rate, discount factor, and batch size heavily influence stability.

Instability from large updates
A single bad update can drastically worsen the policy (this is why methods like PPO/TRPO were developed).

Credit assignment difficulty
In long-horizon tasks, figuring out which actions caused which rewards can be slow.

12
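A minimal sketch of the baseline idea mentioned under "High variance in gradient estimates": subtracting a baseline from the Monte-Carlo returns centres the gradient estimate without biasing it. The mean baseline and the normalization used here are simple illustrative choices, not the lecture's prescription.

import numpy as np

def advantages_with_baseline(returns):
    """Centre (and normalize) a batch of episode returns to reduce variance."""
    returns = np.asarray(returns, dtype=np.float64)
    baseline = returns.mean()                 # simplest possible baseline
    advantages = returns - baseline
    return advantages / (returns.std() + 1e-8)

print(advantages_with_baseline([12.0, 7.0, 9.5, 11.0]))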
On-policy vs Off-policy

On-policy methods require data generated by the current policy
(on the given task).
Off-policy methods can use any data
(from the given task).
Until recently, on-policy algorithms were more effective.
In the last years, off-policy RL has become a hot topic to scale up datasets and thus
benefit from large deep learning models.
We now have pretty reasonable algorithms.

13
Questions ?
OPAL Course Forum
(preferred for course-related questions)
or
teaching@lasr.org

14
Additional Readings

Peters, J. & Schaal, S. Reinforcement learning of motor skills with policy gradients, Neural
networks, Elsevier, 2008, 21, 682-697
TRPO and PPO
https://www.youtube.com/playlist?list=PLB79uOaPEEU6uU1-Pfaqr08RTTzhyB8hu

15
Lecture 6:

Policy Search
Roberto Calandra
Learning, Adaptive Systems, and Robotics (LASR) Lab

Machine Learning for Robotics - SuSe 25 v1.0


16
