Lecture 6:
Policy Search
            Roberto Calandra
Learning, Adaptive Systems, and Robotics (LASR) Lab
       Machine Learning for Robotics - SuSe 25        v1.01
Previous Lectures
 Reinforcement Learning Formalism
 Markov Decision Process (and variants)
Types of Reinforcement Learning Algorithms
 Policy Search
    Policy Gradient  ← Today
    Bayesian optimization
 Value-based
    Q-learning
 Actor-critic
 Model-based Reinforcement Learning
\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big]
Notation
 Reward to maximize: r
 State: s
 Action: a
 Induced trajectory: τ
 Discount factor: γ
 Future discounted reward (see the sketch after this list):
R_t = \sum_{k=t}^{T} \gamma^{k-t} r_k
 State value function:
V^\pi(s) = \mathbb{E}_\pi [ R_t \mid s_t = s ]
 State-Action value function (Q-function):
Q^\pi(s, a) = \mathbb{E}_\pi [ R_t \mid s_t = s, a_t = a ]
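As a quick illustration of the future discounted reward, here is a minimal Python sketch; the function name and the example rewards are made up for illustration:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute R_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every time step t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Work backwards using the recursion R_t = r_t + gamma * R_{t+1}
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: rewards [1, 0, 2] with gamma = 0.9
# R_2 = 2.0, R_1 = 0 + 0.9 * 2 = 1.8, R_0 = 1 + 0.9 * 1.8 = 2.62
print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```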
                                     Policy Gradient
We just learned that policy-gradient methods aim to find parameters θ that maximize the expected return.
The idea is that we have a parameterized stochastic policy: in our case, a neural network outputs a probability distribution over actions. The probability of taking each action is also called the action preference.
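To make the idea of a parameterized stochastic policy concrete, here is a minimal sketch in PyTorch; the state dimension, number of actions, and network size are arbitrary choices for illustration, not part of the lecture:

```python
import torch
import torch.nn as nn

class CategoricalPolicy(nn.Module):
    """A small network mapping a state to a probability distribution over discrete actions."""
    def __init__(self, state_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, n_actions),  # action preferences (logits)
        )

    def forward(self, state):
        logits = self.net(state)
        # Softmax over the logits gives the action probabilities pi_theta(a|s)
        return torch.distributions.Categorical(logits=logits)

policy = CategoricalPolicy()
dist = policy(torch.zeros(4))      # distribution over actions for an example state
action = dist.sample()             # a_t ~ pi_theta(.|s_t)
log_prob = dist.log_prob(action)   # log pi_theta(a_t|s_t), the quantity policy gradients differentiate
```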
Policy Search

 Given an existing parametrized policy π(θ)
 Directly optimize its parameters w.r.t. the reward:

\theta^* = \arg\max_\theta \; \mathbb{E}_{\tau \sim p_\theta(\tau)} \Big[ \sum_t r(s_t, a_t) \Big] = \arg\max_\theta J^\pi(\theta)

 In theory, this can be done with your optimizer of choice (e.g., zero-order or first-order)
 Algorithms using first-order optimizers are called Policy Gradient methods
 Use of zero-order optimizers is also sometimes possible
     Bayesian optimization, later in the course

Exact gradient is impossible: to compute ∇J(θ) exactly you would need to sum over all possible trajectories, which is infeasible. So we approximate it using samples: run the policy a few times and estimate the gradient from those episodes.
Environment dynamics are unknown and non-differentiable: the state transitions depend on the environment's rules (dynamics), which we usually cannot differentiate through. The likelihood-ratio trick avoids this by differentiating only the policy, not the environment.
REINFORCE Algorithm

The REINFORCE algorithm (also called Monte-Carlo policy gradient) is a policy-gradient algorithm that uses an estimated return from an entire episode to update the policy parameters.

 Execute policy \pi_\theta(a_t \mid s_t) and obtain trajectories \{\tau^i\}
 Compute gradients:

\nabla_\theta J^\pi(\theta) \approx \sum_i \Big( \sum_t \nabla_\theta \log \pi_\theta(a_t^i \mid s_t^i) \Big) \Big( \sum_t r(s_t^i, a_t^i) \Big)

 Update policy:

\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta)

REINFORCE uses gradient ascent to maximize the expected return J(θ): the plus sign means you are adding the gradient to your current parameters θ, pushing θ in the direction that increases the expected reward.
If α is too large, learning will not be smooth (and might even diverge); if α is too small, learning is slow. How do we choose α to be right?

 Repeat
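Below is a minimal sketch of one REINFORCE update in PyTorch. It assumes the CategoricalPolicy sketched earlier, a Gymnasium-style environment `env`, and an optimizer such as `torch.optim.SGD(policy.parameters(), lr=1e-3)`; all of these are illustrative assumptions, not the lecture's reference implementation:

```python
import torch

def reinforce_update(policy, optimizer, env, n_episodes=10):
    """One REINFORCE step: sample episodes with the current policy, build the
    surrogate loss whose gradient matches the estimator above, and step the optimizer."""
    losses = []
    for _ in range(n_episodes):
        state, _ = env.reset()
        log_probs, rewards = [], []
        done = False
        while not done:
            dist = policy(torch.as_tensor(state, dtype=torch.float32))
            action = dist.sample()
            state, reward, terminated, truncated, _ = env.step(action.item())
            done = terminated or truncated
            log_probs.append(dist.log_prob(action))
            rewards.append(reward)
        episode_return = sum(rewards)  # sum_t r(s_t, a_t) for this trajectory
        # (sum_t log pi_theta(a_t|s_t)) * (sum_t r_t); its gradient is the REINFORCE estimator
        losses.append(-torch.stack(log_probs).sum() * episode_return)
    loss = torch.stack(losses).sum()   # negated, so minimizing performs gradient ascent on J(theta)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                   # theta <- theta + alpha * grad_theta J(theta)
```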
Natural Policy Gradients

One possibility is to constrain the step size:

\| \theta' - \theta \|^2 < \epsilon

This does not work well in practice. Plain gradient steps move parameters θ in Euclidean space, but the same Euclidean step can change the policy distribution a lot or a little, depending on how the policy is parameterized. That makes learning sensitive to how you parameterize the network.

Intuition: take a step, but remain close to your past policy:

D(\pi_{\theta'}, \pi_\theta) < \epsilon

The natural gradient fixes this by measuring the step size with a distributional distance (KL), not with the Euclidean parameter distance, which makes updates invariant to reparameterization and often much more stable. Put differently: measure the step size by how much the policy distribution changes, not by how much the parameters change. The KL divergence is a natural measure of distribution change.

Kullback–Leibler divergence:

D_{KL}(\pi_{\theta'} \| \pi_\theta) \approx (\theta' - \theta)^T F (\theta' - \theta)

Fisher-information matrix:

F = \mathbb{E}_{\pi_\theta} \big[ \nabla_\theta \log \pi_\theta(a \mid s) \, \nabla_\theta \log \pi_\theta(a \mid s)^T \big]

Natural gradient update:

\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)

The Fisher matrix approximates the local curvature of the KL and acts like a metric on parameter space that measures "distance between policies" rather than "distance between parameters"; this is why the update is invariant to the parameterization. So we choose a direction that increases the expected reward the most, while keeping the new policy close (small KL) to the old policy.

 Theoretical advantage: the learning is agnostic of the parametrization of the policy
 Practical advantage: faster and more robust convergence
 Approximate second-order optimization

In vanilla policy gradient, the meaning of "distance" between policies depends on how θ is parameterized. In NPG, the Fisher matrix makes the update invariant to parameterization: steps are taken in a way that respects the geometry of the policy space, measured by KL divergence.
[Peters, J. & Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, Elsevier, 2008, 21, 682-697]
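To make the update rule above concrete, here is a rough sketch of one natural-gradient step that builds the Fisher matrix explicitly from sampled state-action pairs and solves for F^{-1}∇J directly. This is only feasible for very small policies; practical implementations (e.g., TRPO) approximate the product with conjugate gradient instead. The interface of `policy` (returning a torch Categorical as in the earlier sketch), the damping term, and the batch format are assumptions made for illustration:

```python
import torch

def natural_gradient_step(policy, states, actions, returns, alpha=0.05, damping=1e-3):
    """theta <- theta + alpha * F^{-1} grad J, with F estimated from samples."""
    params = [p for p in policy.parameters() if p.requires_grad]

    def flat_grad(scalar):
        grads = torch.autograd.grad(scalar, params)
        return torch.cat([g.reshape(-1) for g in grads])

    # Score vectors: grad_theta log pi_theta(a|s) for each sampled (s, a) pair
    scores = [flat_grad(policy(s).log_prob(a)) for s, a in zip(states, actions)]

    # Fisher information matrix: average outer product of the score vectors (plus damping)
    F = sum(torch.outer(v, v) for v in scores) / len(scores)
    F = F + damping * torch.eye(F.shape[0])

    # Vanilla (REINFORCE-style) estimate of grad_theta J(theta)
    g = sum(v * R for v, R in zip(scores, returns)) / len(returns)

    step = alpha * torch.linalg.solve(F, g)  # alpha * F^{-1} grad J

    with torch.no_grad():  # write the natural-gradient step back into the parameters
        offset = 0
        for p in params:
            n = p.numel()
            p.add_(step[offset:offset + n].view_as(p))
            offset += n
```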
Trust Region Policy Optimization

TRPO implements the KL-constrained optimization in a practical way: instead of choosing an arbitrary α, it approximately solves the constrained optimization.

\theta \leftarrow \theta + \alpha F^{-1} \nabla_\theta J(\theta)

How do we choose α? Backtracking line search:
 Shrink α from a large initial value and accept the first step that improves the reward and satisfies the KL-divergence constraint

How TRPO enforces it:
 Compute a candidate update using the natural policy gradient
 Use backtracking line search: start with a big step size and gradually shrink α until
     the reward actually improves, and
     the KL divergence between the old and the new policy is below the threshold δ

Intuition: Natural Policy Gradients + automatic step size

The natural policy gradient vector can be viewed as the steepest-ascent direction in policy space under the KL divergence as a metric. This means the policy update tries to maximize reward while limiting how much the policy changes in distribution. Advantage estimates reduce variance and improve stability.
[Pirotta, M., Restelli, M. & Bascetta, L. Adaptive step-size for policy gradient methods. Advances in Neural Information Processing Systems, 2013]
[Schulman, Levine, Moritz, Jordan, Abbeel (2015). Trust region policy optimization]
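A sketch of the backtracking line search described above, in Python. The helpers `surrogate_reward` (estimated return for a candidate parameter vector) and `kl_divergence` (KL between the old and the candidate policy) are hypothetical functions assumed for illustration; TRPO's actual implementation evaluates a sample-based surrogate objective:

```python
def backtracking_line_search(theta, full_step, surrogate_reward, kl_divergence,
                             max_kl=0.01, shrink=0.5, max_backtracks=10):
    """Start from the full natural-gradient step and shrink it until the surrogate
    reward improves AND the KL divergence to the old policy stays below max_kl."""
    old_reward = surrogate_reward(theta)
    step = full_step
    for _ in range(max_backtracks):
        candidate = theta + step
        improved = surrogate_reward(candidate) > old_reward
        close_enough = kl_divergence(theta, candidate) <= max_kl
        if improved and close_enough:
            return candidate      # accept the first (largest tried) step satisfying both conditions
        step = shrink * step      # otherwise gradually shrink the step size
    return theta                  # no acceptable step found: keep the old policy
```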
Proximal Policy Optimization (PPO)

A^\pi(s, a) = Q^\pi(s, a) - V^\pi(s)

 Advantage function: how much better it is to take a specific action rather than selecting one at random in that state

PPO improves the agent's training stability by avoiding policy updates that are too large. To do that, we use a ratio that indicates the difference between our current and old policy, and clip this ratio to a specific range.

The idea behind the policy-gradient objective is that, by taking a gradient ascent step on it (equivalent to taking gradient descent on its negative), we push our agent to take actions that lead to higher rewards and avoid harmful actions. However, the problem comes from the step size: too small, and the training process is too slow; too large, and there is too much variability in the training.

The value function parameters are updated by minimizing the mean-squared error between the predicted values and the computed rewards-to-go (the discounted cumulative reward from time t).

We then take the minimum of the clipped and the unclipped objective, so the final objective is a lower (pessimistic) bound on the unclipped objective. Taking the minimum means we select either the clipped or the unclipped objective, depending on the ratio and the sign of the advantage.
[Schulman, Wolski, Dhariwal, Radford, Klimov (2017). Proximal policy optimization algorithms: deep RL with importance sampled policy gradient]
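A minimal sketch of the clipped surrogate loss that PPO optimizes, in PyTorch. It assumes precomputed advantage estimates and the stored log-probabilities of the taken actions under the old policy; the clipping range ε = 0.2 is an illustrative default:

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """L = -E[ min(r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t) ],
    where r_t = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t)."""
    ratio = torch.exp(new_log_probs - old_log_probs)   # probability ratio r_t
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic lower bound: take the element-wise minimum, then negate so minimizing ascends J
    return -torch.min(unclipped, clipped).mean()
```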
Real-world considerations
Often unstable algorithms.
Hyperparameter optimization makes a huge difference.
Implementation details make a huge difference too.
To get the most out of these algorithms, the human designer needs to learn how to tune them!
[Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. & Meger, D. (2017). Deep Reinforcement Learning that Matters]
                                                  https://arxiv.org/pdf/1809.07731
Limitations

 High variance in gradient estimates
    Returns are noisy → gradient estimates fluctuate → slower convergence unless variance reduction (like baselines or GAE) is used
 Sample inefficiency
    Require many interactions with the environment to make progress, compared to some value-based methods
 Possibility of poor local optima
    Since they do gradient ascent on a non-convex objective, they can get stuck in suboptimal policies
 Sensitive to hyperparameters
    Learning rate, discount factor, and batch size heavily influence stability
 Instability from large updates
    A single bad update can drastically worsen the policy (why methods like PPO/TRPO were developed)
 Credit assignment difficulty
    In long-horizon tasks, figuring out which actions caused which rewards can be slow
On-policy vs Off-policy
 On-policy methods require data generated by the current policy (on the given task)
 Off-policy methods can use any data (from the given task)
 Until recently, on-policy algorithms were more effective
 In recent years, off-policy RL has become a hot topic as a way to scale up datasets and thus benefit from large deep learning models
 We now have pretty reasonable algorithms
  Questions ?
         OPAL Course Forum
(preferred for course-related questions)
                or
         teaching@lasr.org
Additional Readings
 Peters, J. & Schaal, S. Reinforcement learning of motor skills with policy gradients, Neural
 networks, Elsevier, 2008, 21, 682-697
 TRPO and PPO
 https://www.youtube.com/playlist?list=PLB79uOaPEEU6uU1-Pfaqr08RTTzhyB8hu
                     Lecture 6:
        Policy Search
            Roberto Calandra
Learning, Adaptive Systems, and Robotics (LASR) Lab
       Machine Learning for Robotics - SuSe 25        v1.0