Reinforcement Learning Cheat Sheet: Return
The first-visit MC method averages the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. The every-visit MC prediction algorithm is obtained from the first-visit version by removing the "if" condition. In other words, we move backward from step T, compute G incrementally, associate the value of G with the current state, and take the average.

MC Estimation of Action Values

If a model is not available, state values alone are not sufficient to determine a policy, and we have to estimate the values of state–action pairs. The MC methods are essentially the same as just presented for state values, but now we have state–action pairs. The only complication is that many state–action pairs may never be visited. We need to estimate the value of all the actions from each state, not just the one we currently favor. We can specify that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start (the assumption of exploring starts).
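As an illustration of the previous paragraph, here is a minimal Python sketch of every-visit MC estimation of action values under exploring starts. The environment interface (env.state_action_pairs(), env.generate_episode()) and the episode format are assumptions made for this sketch, not part of the cheat sheet.

    import random
    from collections import defaultdict

    def mc_action_values_exploring_starts(env, policy, num_episodes, gamma=0.99):
        """Every-visit MC estimate of Q(s, a) under `policy`, assuming episodes
        can start from any (state, action) pair (exploring starts).
        `env` and `policy` are hypothetical interfaces used for illustration."""
        returns = defaultdict(list)   # (s, a) -> list of observed returns
        Q = defaultdict(float)        # (s, a) -> estimated action value
        for _ in range(num_episodes):
            # Exploring start: every pair has a nonzero chance of starting the episode.
            s0, a0 = random.choice(env.state_action_pairs())
            episode = env.generate_episode(policy, start_state=s0, start_action=a0)
            # Walk backward from step T, accumulating the return G incrementally.
            G = 0.0
            for s_t, a_t, r_next in reversed(episode):   # episode = [(S0, A0, R1), ...]
                G = gamma * G + r_next
                returns[(s_t, a_t)].append(G)
                Q[(s_t, a_t)] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
        return Q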
For off-policy prediction we evaluate the target policy π using episodes generated by a different behavior policy b. We require that every action taken under π is also taken, at least occasionally, under b. That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the assumption of coverage.

Off-policy Every-visit MC Prediction

    Inputs: π - the policy to be evaluated
    Initialize: V(s) ∈ ℝ for all s ∈ S
                Return(s) ← an empty list for all s ∈ S
    while forever - for each episode do
        Generate an episode following b: S0, A0, R1, S1, A1, ..., ST−1, AT−1, RT
        G ← 0
        W ← 1
        foreach step of episode, t = T−1, T−2, ..., 0 do
            G ← γWG + Rt+1
            Append G to Return(St)
            V(St) ← average(Return(St))
            W ← W · π(At|St) / b(At|St)
        end
    end
    Algorithm 6: Off-policy Every-visit Monte Carlo prediction - estimating V ∼ vπ [Course2-Week2]

The control counterpart is Algorithm 7: Off-policy MC Control - estimating π ∼ π∗ [§5.7].
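A minimal Python sketch of Algorithm 6 above (off-policy every-visit MC prediction with importance-sampling ratios). The episode generator and the probability functions pi_prob(a, s) / b_prob(a, s) are assumed helpers; the update mirrors the G ← γWG + Rt+1 rule as written in the box.

    from collections import defaultdict

    def off_policy_mc_prediction(generate_episode, pi_prob, b_prob, num_episodes, gamma=0.99):
        """Off-policy every-visit MC prediction (sketch of Algorithm 6).
        `generate_episode()` returns [(S0, A0, R1), (S1, A1, R2), ...] following b;
        `pi_prob(a, s)` and `b_prob(a, s)` give action probabilities (assumed helpers)."""
        returns = defaultdict(list)
        V = defaultdict(float)
        for _ in range(num_episodes):
            episode = generate_episode()          # behavior policy b generates the data
            G, W = 0.0, 1.0
            for s_t, a_t, r_next in reversed(episode):   # t = T-1, ..., 0
                G = gamma * W * G + r_next               # G <- gamma * W * G + R_{t+1}
                returns[s_t].append(G)
                V[s_t] = sum(returns[s_t]) / len(returns[s_t])
                # Importance-sampling ratio; coverage guarantees b_prob > 0 wherever pi_prob > 0.
                W *= pi_prob(a_t, s_t) / b_prob(a_t, s_t)
        return V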
Temporal-Difference Learning

TD Prediction

Starting from Eq. 26, we can consider a generic update rule for V(St):

    V(St) ← V(St) + α(Gt − V(St))   [6.1] (27)

α is a constant step size, and we call this method constant-α MC. MC has to wait until the end of an episode to determine the increment to V(St). Unlike MC, TD updates the value at each step of the episode according to:

    V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))   [6.2] (28)

    Inputs: π - the policy to be evaluated
    Params: step size α ∈ ]0, 1]
    Initialize: V(s) ∈ ℝ for all s ∈ S+, except V(terminal) = 0
    foreach episode do
        Initialize S
        foreach step of episode - until S is terminal do
            A ← action given by π for S
            Take action A, observe R, S'
            V(S) ← V(S) + α(R + γV(S') − V(S))
            S ← S'
        end
    end
    Algorithm 8: Tabular TD(0) - estimating vπ [§6.1]

Recall that:

    vπ(s) ≐ Eπ[Gt | St = s]   [3.12|6.3] (29)
          = Eπ[Rt+1 + γGt+1 | St = s]   [by 3.9] (30)
          = Eπ[Rt+1 + γvπ(St+1) | St = s]   [6.4] (31)

MC methods use an estimate of Eq. 29 as a target. The MC target is an estimate because the expected value in Eq. 29 is not known; a sample return is used in place of the real expected return.
DP and TD methods use an estimate of Eq. 31 as a target. The DP target is an estimate because vπ(St+1) is not known and the current estimate, V(St+1), is used instead.
The TD target is an estimate because it samples the expected values in Eq. 31 and it uses the current estimate V instead of the true vπ.
TD methods update their estimates based in part on other estimates. They learn a guess from a guess, i.e. they bootstrap. TD and MC methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The most obvious advantage of TD methods over MC methods is that they are naturally implemented in an online, fully incremental fashion. With MC methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. In practice, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.
The error, available at time t + 1, between V(St) and the better estimate Rt+1 + γV(St+1) is called the TD error:

    δt ≐ Rt+1 + γV(St+1) − V(St)   [6.5] (32)
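A minimal Python sketch of tabular TD(0) (Algorithm 8). The env interface (env.reset(), env.step(a) returning (R, S', done)) and policy(s) are assumptions for illustration.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
        """Tabular TD(0) prediction of v_pi (sketch of Algorithm 8)."""
        V = defaultdict(float)   # V(terminal) stays 0: it is only read, never updated
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)                          # action given by pi for S
                r, s_next, done = env.step(a)
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])        # TD(0) update, Eq. 28
                s = s_next
        return V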
Sarsa - On-policy TD Control

Sarsa (state-action-reward-state-action) is an on-policy TD control method. Sarsa is a sample-based version of policy iteration which uses the Bellman equation for action values. The update rule:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)]   [6.7] (33)

    Params: step size α ∈ ]0, 1], small ε > 0
    Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
    foreach episode do
        Initialize S
        Choose A from S using policy derived from Q (e.g. ε-greedy)
        foreach step of episode - until S is terminal do
            Take action A, observe R, S'
            Choose A' from S' using policy derived from Q (e.g. ε-greedy)
            Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
            S ← S'
            A ← A'
        end
    end
    Algorithm 9: Sarsa - On-policy TD Control - estimating Q ∼ q∗ [§6.4]
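A Python sketch of the Sarsa control loop (Algorithm 9), with a hypothetical ε-greedy helper and the same assumed env interface as in the TD(0) sketch above.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, eps):
        """Random action with probability eps, otherwise greedy w.r.t. Q."""
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, eps=0.1):
        """On-policy TD control with Sarsa (sketch of Algorithm 9)."""
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, env.actions(s), eps)
            done = False
            while not done:
                r, s_next, done = env.step(a)
                if done:
                    Q[(s, a)] += alpha * (r - Q[(s, a)])    # terminal: Q(S', .) = 0
                    break
                a_next = epsilon_greedy(Q, s_next, env.actions(s_next), eps)
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])   # Eq. 33
                s, a = s_next, a_next
        return Q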
Q-Learning - Off-policy TD Control

Q-Learning is an off-policy TD control method: a sample-based version of value iteration which iteratively applies the Bellman optimality equation. The update rule:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]   [6.8] (34)

    Params: step size α ∈ ]0, 1], small ε > 0
    Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
    foreach episode do
        Initialize S
        foreach step of episode - until S is terminal do
            Choose A from S using policy derived from Q (e.g. ε-greedy)
            Take action A, observe R, S'
            Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
            S ← S'
        end
    end
    Algorithm 10: Q-Learning - Off-policy TD Control - estimating π ∼ π∗ [§6.5]
Expected Sarsa

Similar to Q-Learning, but the update rule of Expected Sarsa takes the expected value over the next state's actions instead of the maximum:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ Eπ[Q(St+1, At+1) | St+1] − Q(St, At)]
              = Q(St, At) + α[Rt+1 + γ Σa π(a|St+1) Q(St+1, a) − Q(St, At)]   [6.9] (35)

The next action is sampled from π. However, the expectation over actions is computed independently of the action actually selected in the next state. In fact, it is not necessary that π be equal to the behavior policy. This means that Expected Sarsa, like Q-learning, can be used to learn off-policy without importance sampling.
If the target policy is greedy with respect to its action-value estimates, we obtain Q-Learning. Hence Q-Learning is a special case of Expected Sarsa.
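A sketch of a single Expected Sarsa update (Eq. 35) with an ε-greedy target policy π; Q is a defaultdict(float) as in the sketches above, and the ε-greedy action probabilities are an assumption about the target policy, not part of the cheat sheet.

    def expected_sarsa_update(Q, s, a, r, s_next, next_actions, alpha, gamma, eps, done):
        """One Expected Sarsa update: expectation over next actions instead of a max."""
        if done or not next_actions:
            expected = 0.0
        else:
            greedy = max(next_actions, key=lambda a2: Q[(s_next, a2)])
            expected = 0.0
            for a2 in next_actions:
                # eps-greedy target policy: eps/|A| on every action, plus (1 - eps) on the greedy one
                prob = eps / len(next_actions) + ((1.0 - eps) if a2 == greedy else 0.0)
                expected += prob * Q[(s_next, a2)]
        Q[(s, a)] += alpha * (r + gamma * expected - Q[(s, a)])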
n-step Bootstrapping

n-step TD Prediction

MC methods update the estimate of vπ(St) for each state based on the entire sequence of observed rewards from that state until the end of the episode, using the full return:

    Gt ≐ Rt+1 + γRt+2 + ... + γ^(T−t−1) RT   (36)

In one-step TD, instead, the update is based on just the next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards, using the one-step return:

    Gt:t+1 ≐ Rt+1 + γVt(St+1)   (37)

n-step TD uses the n-step return:

    Gt:t+n ≐ Rt+1 + γRt+2 + ... + γ^(n−1) Rt+n + γ^n Vt+n−1(St+n)   (38)
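As a worked example of Eq. 38, a small Python helper that computes the n-step return from stored rewards and the current value estimates; the data layout (rewards[k] holds R_{k+1}, states[k] holds S_k) is an assumption made for this sketch.

    def n_step_return(rewards, states, V, t, n, T, gamma=0.99):
        """n-step return G_{t:t+n} (Eq. 38); falls back to the full return when t + n >= T."""
        horizon = min(t + n, T)
        G = 0.0
        for k in range(t, horizon):                  # discounted sum of up to n rewards
            G += (gamma ** (k - t)) * rewards[k]     # rewards[k] holds R_{k+1}
        if t + n < T:                                # bootstrap from V(S_{t+n}) if the episode goes on
            G += (gamma ** n) * V[states[t + n]]
        return G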
Planning and Learning

Planning methods use simulated experience generated by a model; learning methods use real experience generated by the environment. Many ideas and algorithms can be transferred between planning and learning.

    Params: step size α ∈ ]0, 1], small ε > 0
    Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
    foreach episode do
        1. Select a state S ∈ S and an action A ∈ A(s) at random
        2. From a sample model obtain the sample reward and next state: R, S' = model(S, A)
        3. Apply one-step tabular Q-Learning to S, A, R, S':
           Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
    end
    Algorithm 11: Random-sample one-step tabular Q-planning [§8.1]
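A minimal Python sketch of random-sample one-step tabular Q-planning (Algorithm 11); sample_model(s, a) returning (R, S') and the states/actions lists are assumptions.

    import random
    from collections import defaultdict

    def q_planning(sample_model, states, actions, num_updates, alpha=0.1, gamma=0.99):
        """Random-sample one-step tabular Q-planning (sketch of Algorithm 11)."""
        Q = defaultdict(float)
        for _ in range(num_updates):
            s = random.choice(states)            # 1. random state and action
            a = random.choice(actions)
            r, s_next = sample_model(s, a)       # 2. sample model gives reward and next state
            max_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])   # 3. one-step Q-learning update
        return Q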
Dyna

Within a planning agent, there are at least two roles for real experience:
(1) it can be used to improve the model (to match the real environment more accurately) - model learning;
(2) it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods discussed before - direct RL [§8.2].
Experience can improve value functions and policies either directly or indirectly via the model. The real experience obtained from interaction with the environment can be used to improve the policy/value function directly (direct RL), or indirectly through model learning and planning with simulated experience (indirect RL) [§8.2].

    Initialize Q(s, a) and Model(s, a) for all s ∈ S+ and a ∈ A(s)
    while forever do
        (a) S ← current (nonterminal) state
        (b) A ← ε-greedy(S, Q)
        (c) Take action A; observe resultant reward, R, and state, S'
        (d) Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
        (e) Model(S, A) ← R, S' (assuming deterministic environment)
        (f) foreach n times do
            S ← random previously observed state
            A ← random action previously taken in S
            R, S' ← Model(S, A)
            Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
        end
    end
    Algorithm 12: Dyna-Q [§8.2]

(d) Direct reinforcement learning, (e) Model-learning, (f) Planning.

If (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning. The agent responds instantly to the latest sensory information and yet is always planning in the background. The model-learning process also runs in the background: as new information is gained, the model is updated to better match reality. As the model changes, the ongoing planning process gradually computes a different way of behaving to match the new model.
Models may be incorrect for many reasons: the environment is stochastic and only a limited number of samples have been observed; the model was learned using function approximation that has generalized imperfectly; the environment has changed and its new behavior has not yet been observed. When the model is incorrect, the planning process is likely to compute a suboptimal policy. In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This happens when the model is optimistic, predicting greater rewards or better state transitions than are actually possible. It is more difficult to correct a model when the environment becomes better than it was before.
The relations among the algorithms presented in the course on Coursera are shown in a figure in the original cheat sheet [Course3-Week1].
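A minimal Python sketch of the Dyna-Q loop (Algorithm 12) for a deterministic environment; the env interface is the same assumption as in the earlier sketches.

    import random
    from collections import defaultdict

    def dyna_q(env, num_steps, n_planning=10, alpha=0.1, gamma=0.99, eps=0.1):
        """Dyna-Q (sketch of Algorithm 12): direct RL + model learning + planning."""
        Q = defaultdict(float)
        model = {}                                   # (S, A) -> (R, S'): deterministic model
        s = env.reset()
        for _ in range(num_steps):
            # (b)-(c): act eps-greedily in the real environment
            actions = env.actions(s)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = env.step(a)
            # (d) direct RL: one-step Q-learning on the real transition
            max_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s_next)
            # (f) planning: n updates on transitions simulated from the model
            for _ in range(n_planning):
                (ps, pa), (pr, pnext) = random.choice(list(model.items()))
                p_max = max((Q[(pnext, a2)] for a2 in env.actions(pnext)), default=0.0)
                Q[(ps, pa)] += alpha * (pr + gamma * p_max - Q[(ps, pa)])
            s = env.reset() if done else s_next
        return Q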
On-policy Prediction with Function Approximation

We can approximate the value function not as a table but as a parametrized functional form: v̂(s, w) ∼ vπ(s), where w ∈ ℝ^d and the number of weights is much less than the number of states (d ≪ |S|). The value estimation can be framed as a supervised learning problem.
The Monte Carlo methods estimate the value function using samples of the return, so the input is the state and the targets are the returns (pairs (Si, Gi)).
For TD methods the targets are the one-step bootstrap returns (pairs (Si, Ri+1 + γv̂(Si+1, w))).
In the RL setting, the data is temporally correlated and the full dataset is not fixed and available from the beginning. Moreover, due to the bootstrapping methods (TD, DP), the target labels change.
In the tabular case the learned values at each state were decoupled - an update at one state affected no other. Now making one state's estimate more accurate means making others' less accurate.

    V̄E(w) ≐ Σ_{s∈S} µ(s) [vπ(s) − v̂(s, w)]²   [9.1] (39)

where µ(s) is a state distribution (µ(s) ≥ 0 and Σ_s µ(s) = 1) representing how much we care about the error in each state s. Usually, to minimize Eq. 39, Stochastic Gradient Descent (SGD) is used:

    wt+1 ≐ wt − ½ α ∇[vπ(St) − v̂(St, wt)]²   [9.4] (40)
         = wt + α [vπ(St) − v̂(St, wt)] ∇v̂(St, wt)   [9.5] (41)

Usually we have only an approximation Ut of vπ(St), but if Ut is an unbiased estimate of vπ(St), that is E[Ut|St = s] = vπ(St) for each t, then wt is guaranteed to converge to a local optimum (under the stochastic approximation conditions for decreasing α).
For MC, Ut ≐ Gt and hence Ut is an unbiased estimate of vπ(St).

    Inputs: π - the policy to be evaluated, a differentiable function v̂ : S × ℝ^d → ℝ
    Parameters: step size α > 0
    Initialize: w ∈ ℝ^d arbitrarily (e.g. w = 0)
    while forever - for each episode do
        Generate an episode following π: S0, A0, R1, S1, A1, ..., ST−1, AT−1, RT
        foreach step of episode, t = 0, 1, ..., T − 1 do
            w ← w + α [Gt − v̂(St, w)] ∇v̂(St, w)
        end
    end
    Algorithm 13: Gradient MC - Estimating v ∼ vπ [§9.3]
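A minimal sketch of gradient MC prediction (Algorithm 13) with a linear value function v̂(s, w) = wᵀx(s), whose gradient with respect to w is just the feature vector x(s); the feature map and episode generator are assumed helpers.

    import numpy as np

    def gradient_mc(generate_episode, features, d, num_episodes, alpha=0.01, gamma=0.99):
        """Gradient MC prediction (sketch of Algorithm 13) with linear v_hat(s, w) = w . x(s).
        `generate_episode()` returns [(S0, R1), (S1, R2), ...] following pi;
        `features(s)` returns a length-d numpy array x(s) (assumed helpers)."""
        w = np.zeros(d)
        for _ in range(num_episodes):
            episode = generate_episode()
            # Returns G_t for t = 0, ..., T-1, computed backward from the end of the episode.
            returns = []
            G = 0.0
            for _, r_next in reversed(episode):
                G = gamma * G + r_next
                returns.append(G)
            returns.reverse()
            for (s_t, _), G_t in zip(episode, returns):
                x = features(s_t)
                w += alpha * (G_t - w @ x) * x   # Eq. 41 with U_t = G_t; gradient of w.x is x(s)
        return w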
https://github.com/linker81/Reinforcement-Learning-CheatSheet