Reinforcement Learning Cheat Sheet: Return
The first-visit MC method averages the returns following first visits to s, whereas the every-visit MC method averages the returns following all visits to s. The every-visit MC prediction algorithm is obtained from the first-visit version by removing the "if" condition. In other words, we move backward from step T, compute G incrementally, associate the value of G with the current state, and take the average.

MC Estimation of Action Values

If a model is not available, state values alone are not sufficient to determine a policy, and we have to estimate the values of state–action pairs. The MC methods are essentially the same as just presented for state values, but now we have state–action pairs. The only complication is that many state–action pairs may never be visited. We need to estimate the value of all the actions from each state, not just the one we currently favor. We can specify that the episodes start in a state–action pair, and that every pair has a nonzero probability of being selected as the start (the assumption of exploring starts).
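As an illustration of the previous paragraph, here is a minimal Python sketch of every-visit MC estimation of action values under exploring starts. The environment interface (env.state_action_pairs(), env.generate_episode()) and the episode format are assumptions made for this sketch, not part of the cheat sheet.

    import random
    from collections import defaultdict

    def mc_action_values_exploring_starts(env, policy, num_episodes, gamma=0.99):
        """Every-visit MC estimate of Q(s, a) under `policy`, assuming episodes
        can start from any (state, action) pair (exploring starts).
        `env` and `policy` are hypothetical interfaces used for illustration."""
        returns = defaultdict(list)   # (s, a) -> list of observed returns
        Q = defaultdict(float)        # (s, a) -> estimated action value
        for _ in range(num_episodes):
            # Exploring start: every pair has a nonzero chance of starting the episode.
            s0, a0 = random.choice(env.state_action_pairs())
            episode = env.generate_episode(policy, start_state=s0, start_action=a0)
            # Walk backward from step T, accumulating the return G incrementally.
            G = 0.0
            for s_t, a_t, r_next in reversed(episode):   # episode = [(S0, A0, R1), ...]
                G = gamma * G + r_next
                returns[(s_t, a_t)].append(G)
                Q[(s_t, a_t)] = sum(returns[(s_t, a_t)]) / len(returns[(s_t, a_t)])
        return Q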
For off-policy prediction we evaluate the target policy π using episodes generated by a different behavior policy b. We require that every action taken under π is also taken, at least occasionally, under b. That is, we require that π(a|s) > 0 implies b(a|s) > 0. This is called the assumption of coverage.

Off-policy Every-visit MC Prediction

    Inputs: π - the policy to be evaluated
    Initialize: V(s) ∈ ℝ for all s ∈ S
                Return(s) ← an empty list for all s ∈ S
    while forever - for each episode do
        Generate an episode following b: S0, A0, R1, S1, A1, ..., ST−1, AT−1, RT
        G ← 0
        W ← 1
        foreach step of episode, t = T−1, T−2, ..., 0 do
            G ← γWG + Rt+1
            Append G to Return(St)
            V(St) ← average(Return(St))
            W ← W · π(At|St) / b(At|St)
        end
    end
    Algorithm 6: Off-policy Every-visit Monte Carlo prediction - estimating V ∼ vπ [Course2-Week2]

The control counterpart is Algorithm 7: Off-policy MC Control - estimating π ∼ π∗ [§5.7].
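A minimal Python sketch of Algorithm 6 above (off-policy every-visit MC prediction with importance-sampling ratios). The episode generator and the probability functions pi_prob(a, s) / b_prob(a, s) are assumed helpers; the update mirrors the G ← γWG + Rt+1 rule as written in the box.

    from collections import defaultdict

    def off_policy_mc_prediction(generate_episode, pi_prob, b_prob, num_episodes, gamma=0.99):
        """Off-policy every-visit MC prediction (sketch of Algorithm 6).
        `generate_episode()` returns [(S0, A0, R1), (S1, A1, R2), ...] following b;
        `pi_prob(a, s)` and `b_prob(a, s)` give action probabilities (assumed helpers)."""
        returns = defaultdict(list)
        V = defaultdict(float)
        for _ in range(num_episodes):
            episode = generate_episode()          # behavior policy b generates the data
            G, W = 0.0, 1.0
            for s_t, a_t, r_next in reversed(episode):   # t = T-1, ..., 0
                G = gamma * W * G + r_next               # G <- gamma * W * G + R_{t+1}
                returns[s_t].append(G)
                V[s_t] = sum(returns[s_t]) / len(returns[s_t])
                # Importance-sampling ratio; coverage guarantees b_prob > 0 wherever pi_prob > 0.
                W *= pi_prob(a_t, s_t) / b_prob(a_t, s_t)
        return V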
Temporal-Difference Learning

TD Prediction

Starting from Eq. 26, we can consider a generic update rule for V(St):

    V(St) ← V(St) + α(Gt − V(St))   [6.1] (27)

α is a constant step size, and we call this method constant-α MC. MC has to wait until the end of an episode to determine the increment to V(St). Unlike MC, TD updates the value at each step of the episode according to:

    V(St) ← V(St) + α(Rt+1 + γV(St+1) − V(St))   [6.2] (28)

    Inputs: π - the policy to be evaluated
    Params: step size α ∈ ]0, 1]
    Initialize: V(s) ∈ ℝ for all s ∈ S+, except V(terminal) = 0
    foreach episode do
        Initialize S
        foreach step of episode - until S is terminal do
            A ← action given by π for S
            Take action A, observe R, S'
            V(S) ← V(S) + α(R + γV(S') − V(S))
            S ← S'
        end
    end
    Algorithm 8: Tabular TD(0) - estimating vπ [§6.1]

Recall that:

    vπ(s) ≐ Eπ[Gt | St = s]   [3.12|6.3] (29)
          = Eπ[Rt+1 + γGt+1 | St = s]   [by 3.9] (30)
          = Eπ[Rt+1 + γvπ(St+1) | St = s]   [6.4] (31)

MC methods use an estimate of Eq. 29 as a target. The MC target is an estimate because the expected value in Eq. 29 is not known; a sample return is used in place of the real expected return.
DP and TD methods use an estimate of Eq. 31 as a target. The DP target is an estimate because vπ(St+1) is not known and the current estimate, V(St+1), is used instead.
The TD target is an estimate because it samples the expected values in Eq. 31 and it uses the current estimate V instead of the true vπ.
TD methods update their estimates based in part on other estimates. They learn a guess from a guess, i.e. they bootstrap. TD and MC methods have an advantage over DP methods in that they do not require a model of the environment, of its reward and next-state probability distributions.
The most obvious advantage of TD methods over MC methods is that they are naturally implemented in an online, fully incremental fashion. With MC methods one must wait until the end of an episode, because only then is the return known, whereas with TD methods one need wait only one time step. In practice, TD methods have usually been found to converge faster than constant-α MC methods on stochastic tasks.
The error, available at time t + 1, between V(St) and the better estimate Rt+1 + γV(St+1) is called the TD error:

    δt ≐ Rt+1 + γV(St+1) − V(St)   [6.5] (32)
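A minimal Python sketch of tabular TD(0) (Algorithm 8). The env interface (env.reset(), env.step(a) returning (R, S', done)) and policy(s) are assumptions for illustration.

    from collections import defaultdict

    def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
        """Tabular TD(0) prediction of v_pi (sketch of Algorithm 8)."""
        V = defaultdict(float)   # V(terminal) stays 0: it is only read, never updated
        for _ in range(num_episodes):
            s = env.reset()
            done = False
            while not done:
                a = policy(s)                          # action given by pi for S
                r, s_next, done = env.step(a)
                target = r + (0.0 if done else gamma * V[s_next])
                V[s] += alpha * (target - V[s])        # TD(0) update, Eq. 28
                s = s_next
        return V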
Sarsa - On-policy TD Control

Sarsa (state-action-reward-state-action) is an on-policy TD control method. Sarsa is a sample-based version of policy iteration which uses the Bellman equation for action values. The update rule:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γQ(St+1, At+1) − Q(St, At)]   [6.7] (33)

    Params: step size α ∈ ]0, 1], small ε > 0
    Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
    foreach episode do
        Initialize S
        Choose A from S using policy derived from Q (e.g. ε-greedy)
        foreach step of episode - until S is terminal do
            Take action A, observe R, S'
            Choose A' from S' using policy derived from Q (e.g. ε-greedy)
            Q(S, A) ← Q(S, A) + α[R + γQ(S', A') − Q(S, A)]
            S ← S'
            A ← A'
        end
    end
    Algorithm 9: Sarsa - On-policy TD Control - estimating Q ∼ q∗ [§6.4]
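A Python sketch of the Sarsa control loop (Algorithm 9), with a hypothetical ε-greedy helper and the same assumed env interface as in the TD(0) sketch above.

    import random
    from collections import defaultdict

    def epsilon_greedy(Q, s, actions, eps):
        """Random action with probability eps, otherwise greedy w.r.t. Q."""
        if random.random() < eps:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    def sarsa(env, num_episodes, alpha=0.1, gamma=0.99, eps=0.1):
        """On-policy TD control with Sarsa (sketch of Algorithm 9)."""
        Q = defaultdict(float)
        for _ in range(num_episodes):
            s = env.reset()
            a = epsilon_greedy(Q, s, env.actions(s), eps)
            done = False
            while not done:
                r, s_next, done = env.step(a)
                if done:
                    Q[(s, a)] += alpha * (r - Q[(s, a)])    # terminal: Q(S', .) = 0
                    break
                a_next = epsilon_greedy(Q, s_next, env.actions(s_next), eps)
                Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])   # Eq. 33
                s, a = s_next, a_next
        return Q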
Q-Learning - Off-policy TD Control

Q-Learning is an off-policy TD control method: a sample-based version of value iteration which iteratively applies the Bellman optimality equation. The update rule:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ max_a Q(St+1, a) − Q(St, At)]   [6.8] (34)

    Params: step size α ∈ ]0, 1], small ε > 0
    Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
    foreach episode do
        Initialize S
        foreach step of episode - until S is terminal do
            Choose A from S using policy derived from Q (e.g. ε-greedy)
            Take action A, observe R, S'
            Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
            S ← S'
        end
    end
    Algorithm 10: Q-Learning - Off-policy TD Control - estimating π ∼ π∗ [§6.5]
Expected Sarsa

Similar to Q-Learning, but the update rule of Expected Sarsa takes the expected value over the next state's actions instead of the maximum:

    Q(St, At) ← Q(St, At) + α[Rt+1 + γ Eπ[Q(St+1, At+1) | St+1] − Q(St, At)]
              = Q(St, At) + α[Rt+1 + γ Σa π(a|St+1) Q(St+1, a) − Q(St, At)]   [6.9] (35)

The next action is sampled from π. However, the expectation over actions is computed independently of the action actually selected in the next state. In fact, it is not necessary that π be equal to the behavior policy. This means that Expected Sarsa, like Q-learning, can be used to learn off-policy without importance sampling.
If the target policy is greedy with respect to its action-value estimates, we obtain Q-Learning. Hence Q-Learning is a special case of Expected Sarsa.
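A sketch of a single Expected Sarsa update (Eq. 35) with an ε-greedy target policy π; Q is a defaultdict(float) as in the sketches above, and the ε-greedy action probabilities are an assumption about the target policy, not part of the cheat sheet.

    def expected_sarsa_update(Q, s, a, r, s_next, next_actions, alpha, gamma, eps, done):
        """One Expected Sarsa update: expectation over next actions instead of a max."""
        if done or not next_actions:
            expected = 0.0
        else:
            greedy = max(next_actions, key=lambda a2: Q[(s_next, a2)])
            expected = 0.0
            for a2 in next_actions:
                # eps-greedy target policy: eps/|A| on every action, plus (1 - eps) on the greedy one
                prob = eps / len(next_actions) + ((1.0 - eps) if a2 == greedy else 0.0)
                expected += prob * Q[(s_next, a2)]
        Q[(s, a)] += alpha * (r + gamma * expected - Q[(s, a)])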
n-step Bootstrapping

n-step TD Prediction

MC methods update the estimate of vπ(St) for each state based on the entire sequence of observed rewards from that state until the end of the episode, using the full return:

    Gt ≐ Rt+1 + γRt+2 + ... + γ^(T−t−1) RT   (36)

In one-step TD, instead, the update is based on just the next reward, bootstrapping from the value of the state one step later as a proxy for the remaining rewards, using the one-step return:

    Gt:t+1 ≐ Rt+1 + γVt(St+1)   (37)

n-step TD uses the n-step return:

    Gt:t+n ≐ Rt+1 + γRt+2 + ... + γ^(n−1) Rt+n + γ^n Vt+n−1(St+n)   (38)
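As a worked example of Eq. 38, a small Python helper that computes the n-step return from stored rewards and the current value estimates; the data layout (rewards[k] holds R_{k+1}, states[k] holds S_k) is an assumption made for this sketch.

    def n_step_return(rewards, states, V, t, n, T, gamma=0.99):
        """n-step return G_{t:t+n} (Eq. 38); falls back to the full return when t + n >= T."""
        horizon = min(t + n, T)
        G = 0.0
        for k in range(t, horizon):                  # discounted sum of up to n rewards
            G += (gamma ** (k - t)) * rewards[k]     # rewards[k] holds R_{k+1}
        if t + n < T:                                # bootstrap from V(S_{t+n}) if the episode goes on
            G += (gamma ** n) * V[states[t + n]]
        return G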
Planning and Learning

Planning methods use simulated experience generated by a model; learning methods use real experience generated by the environment. Many ideas and algorithms can be transferred between planning and learning.

    Params: step size α ∈ ]0, 1], small ε > 0
    Initialize Q(s, a) for all s ∈ S+ and a ∈ A(s), arbitrarily except that Q(terminal-state, ·) = 0
    foreach episode do
        1. Select a state S ∈ S and an action A ∈ A(s) at random
        2. From a sample model obtain the sample reward and next state: R, S' = model(S, A)
        3. Apply one-step tabular Q-Learning to S, A, R, S':
           Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
    end
    Algorithm 11: Random-sample one-step tabular Q-planning [§8.1]
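A minimal Python sketch of random-sample one-step tabular Q-planning (Algorithm 11); sample_model(s, a) returning (R, S') and the states/actions lists are assumptions.

    import random
    from collections import defaultdict

    def q_planning(sample_model, states, actions, num_updates, alpha=0.1, gamma=0.99):
        """Random-sample one-step tabular Q-planning (sketch of Algorithm 11)."""
        Q = defaultdict(float)
        for _ in range(num_updates):
            s = random.choice(states)            # 1. random state and action
            a = random.choice(actions)
            r, s_next = sample_model(s, a)       # 2. sample model gives reward and next state
            max_next = max((Q[(s_next, a2)] for a2 in actions), default=0.0)
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])   # 3. one-step Q-learning update
        return Q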
Dyna

Within a planning agent, there are at least two roles for real experience:
(1) it can be used to improve the model (to match the real environment more accurately) - model learning;
(2) it can be used to directly improve the value function and policy using the kinds of reinforcement learning methods discussed before - direct RL [§8.2].
Experience can improve value functions and policies either directly or indirectly via the model. The real experience obtained from interaction with the environment can be used to improve the policy/value function directly (direct RL), or indirectly through model learning and planning with simulated experience (indirect RL) [§8.2].

    Initialize Q(s, a) and Model(s, a) for all s ∈ S+ and a ∈ A(s)
    while forever do
        (a) S ← current (nonterminal) state
        (b) A ← ε-greedy(S, Q)
        (c) Take action A; observe resultant reward, R, and state, S'
        (d) Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
        (e) Model(S, A) ← R, S' (assuming deterministic environment)
        (f) foreach n times do
            S ← random previously observed state
            A ← random action previously taken in S
            R, S' ← Model(S, A)
            Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
        end
    end
    Algorithm 12: Dyna-Q [§8.2]

(d) Direct reinforcement learning, (e) Model-learning, (f) Planning.

If (e) and (f) were omitted, the remaining algorithm would be one-step tabular Q-learning. The agent responds instantly to the latest sensory information and yet is always planning in the background. The model-learning process also runs in the background: as new information is gained, the model is updated to better match reality. As the model changes, the ongoing planning process gradually computes a different way of behaving to match the new model.
Models may be incorrect for many reasons: the environment is stochastic and only a limited number of samples have been observed; the model was learned using function approximation that has generalized imperfectly; the environment has changed and its new behavior has not yet been observed. When the model is incorrect, the planning process is likely to compute a suboptimal policy. In some cases, the suboptimal policy computed by planning quickly leads to the discovery and correction of the modeling error. This happens when the model is optimistic, predicting greater rewards or better state transitions than are actually possible. It is more difficult to correct a model when the environment becomes better than it was before.
The relations among the algorithms presented in the course on Coursera are shown in a figure in the original cheat sheet [Course3-Week1].
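A minimal Python sketch of the Dyna-Q loop (Algorithm 12) for a deterministic environment; the env interface is the same assumption as in the earlier sketches.

    import random
    from collections import defaultdict

    def dyna_q(env, num_steps, n_planning=10, alpha=0.1, gamma=0.99, eps=0.1):
        """Dyna-Q (sketch of Algorithm 12): direct RL + model learning + planning."""
        Q = defaultdict(float)
        model = {}                                   # (S, A) -> (R, S'): deterministic model
        s = env.reset()
        for _ in range(num_steps):
            # (b)-(c): act eps-greedily in the real environment
            actions = env.actions(s)
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = env.step(a)
            # (d) direct RL: one-step Q-learning on the real transition
            max_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in env.actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
            # (e) model learning (deterministic environment assumed)
            model[(s, a)] = (r, s_next)
            # (f) planning: n updates on transitions simulated from the model
            for _ in range(n_planning):
                (ps, pa), (pr, pnext) = random.choice(list(model.items()))
                p_max = max((Q[(pnext, a2)] for a2 in env.actions(pnext)), default=0.0)
                Q[(ps, pa)] += alpha * (pr + gamma * p_max - Q[(ps, pa)])
            s = env.reset() if done else s_next
        return Q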
On-policy Prediction with Function Approximation

We can approximate the value function not as a table but as a parametrized functional form: v̂(s, w) ∼ vπ(s), where w ∈ ℝ^d and the number of weights is much less than the number of states (d ≪ |S|). The value estimation can be framed as a supervised learning problem.
The Monte Carlo methods estimate the value function using samples of the return, so the input is the state and the targets are the returns (pairs (Si, Gi)).
For TD methods the targets are the one-step bootstrap returns (pairs (Si, Ri+1 + γv̂(Si+1, w))).
In the RL setting, the data is temporally correlated and the full dataset is not fixed and available from the beginning. Moreover, due to the bootstrapping methods (TD, DP), the target labels change.
In the tabular case the learned values at each state were decoupled - an update at one state affected no other. Now making one state's estimate more accurate means making others' less accurate.

    V̄E(w) ≐ Σ_{s∈S} µ(s) [vπ(s) − v̂(s, w)]²   [9.1] (39)

where µ(s) is a state distribution (µ(s) ≥ 0 and Σ_s µ(s) = 1) representing how much we care about the error in each state s. Usually, to minimize Eq. 39, Stochastic Gradient Descent (SGD) is used:

    wt+1 ≐ wt − ½ α ∇[vπ(St) − v̂(St, wt)]²   [9.4] (40)
         = wt + α [vπ(St) − v̂(St, wt)] ∇v̂(St, wt)   [9.5] (41)

Usually we have only an approximation Ut of vπ(St), but if Ut is an unbiased estimate of vπ(St), that is E[Ut|St = s] = vπ(St) for each t, then wt is guaranteed to converge to a local optimum (under the stochastic approximation conditions for decreasing α).
For MC, Ut ≐ Gt and hence Ut is an unbiased estimate of vπ(St).

    Inputs: π - the policy to be evaluated, a differentiable function v̂ : S × ℝ^d → ℝ
    Parameters: step size α > 0
    Initialize: w ∈ ℝ^d arbitrarily (e.g. w = 0)
    while forever - for each episode do
        Generate an episode following π: S0, A0, R1, S1, A1, ..., ST−1, AT−1, RT
        foreach step of episode, t = 0, 1, ..., T − 1 do
            w ← w + α [Gt − v̂(St, w)] ∇v̂(St, w)
        end
    end
    Algorithm 13: Gradient MC - Estimating v ∼ vπ [§9.3]
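A minimal sketch of gradient MC prediction (Algorithm 13) with a linear value function v̂(s, w) = wᵀx(s), whose gradient with respect to w is just the feature vector x(s); the feature map and episode generator are assumed helpers.

    import numpy as np

    def gradient_mc(generate_episode, features, d, num_episodes, alpha=0.01, gamma=0.99):
        """Gradient MC prediction (sketch of Algorithm 13) with linear v_hat(s, w) = w . x(s).
        `generate_episode()` returns [(S0, R1), (S1, R2), ...] following pi;
        `features(s)` returns a length-d numpy array x(s) (assumed helpers)."""
        w = np.zeros(d)
        for _ in range(num_episodes):
            episode = generate_episode()
            # Returns G_t for t = 0, ..., T-1, computed backward from the end of the episode.
            returns = []
            G = 0.0
            for _, r_next in reversed(episode):
                G = gamma * G + r_next
                returns.append(G)
            returns.reverse()
            for (s_t, _), G_t in zip(episode, returns):
                x = features(s_t)
                w += alpha * (G_t - w @ x) * x   # Eq. 41 with U_t = G_t; gradient of w.x is x(s)
        return w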
https://github.com/linker81/Reinforcement-Learning-CheatSheet