Lecture 3: Model-Free Policy Evaluation: Policy
Evaluation Without Knowing How the World Works
                                                  Emma Brunskill
                                            CS234 Reinforcement Learning
                                                     Winter 2025
           Material builds on structure from David Silver’s Lecture 4: Model-Free
           Prediction. Other resources: Sutton and Barto Jan 1 2018 draft
           Chapter/Sections: 5.1; 5.5; 6.1-6.3
  L3N1 Refresh Your Knowledge [Polleverywhere Poll]
           In a tabular MDP asymptotically value iteration will always yield a policy
           with the same value as the policy returned by policy iteration
               1   True.
               2   False
               3   Not sure
           Can value iteration require more iterations than |A|^{|S|} to compute the
           optimal value function? (Assume |A| and |S| are small enough that each
           round of value iteration can be done exactly).
               1   True.
               2   False
               3   Not sure
  L3N1 Refresh Your Knowledge
           In a tabular MDP asymptotically value iteration will always yield a policy
           with the same value as the policy returned by policy iteration
           Answer: True. Both are guaranteed to converge to the optimal value
           function and to a policy achieving that optimal value
           Can value iteration require more iterations than |A|^{|S|} to compute the
           optimal value function? (Assume |A| and |S| are small enough that each
           round of value iteration can be done exactly).
           Answer: True. As an example, consider a single state, single action MDP
           where r (s, a) = 1, γ = 0.9, and initialize V0 (s) = 0. Then V^*(s) = 1/(1 − γ) = 10,
           but after the first iteration of value iteration, V1 (s) = 1.
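
To make the rate concrete, here is a quick numerical sketch (an illustration added here, not part of the slides) of value iteration on this one-state, one-action MDP: the error shrinks only by a factor of γ per backup, so reaching a tolerance of 10^-3 takes on the order of 90 exact backups.

```python
# Sketch (not from the slides): value iteration on the single-state,
# single-action MDP above, with r(s, a) = 1, gamma = 0.9, V_0(s) = 0.
gamma, r = 0.9, 1.0
v_star = r / (1 - gamma)           # V*(s) = 1 / (1 - gamma) = 10
v, iters = 0.0, 0
while abs(v_star - v) > 1e-3:      # stop within 1e-3 of the optimal value
    v = r + gamma * v              # exact Bellman backup for the single state
    iters += 1
print(iters, round(v, 4))          # ~88 iterations, far more than |A|^|S| = 1
```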
  Today’s Plan
           Last Time:
                   Markov reward / decision processes
                   Policy evaluation & control when have true model (of how the world works)
           Today
                   Policy evaluation without known dynamics & reward models
           Next Time:
                   Control when don’t have a model of how the world works
  Evaluation through Direct Experience
           Estimate expected return of policy π
           Only using data from environment¹ (direct experience)
           Why is this important?
           What properties do we want from policy evaluation algorithms?
      1
        Assume today this experience comes from executing the policy π. Later will
   consider how to do policy evaluation using data gathered from other policies.
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
                   Policy evaluation when don’t have a model of how the world works
                         Given on-policy samples
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Recall
           Definition of Return, Gt (for an MRP)
                   Discounted sum of rewards from time step t to horizon
                                            Gt = rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · ·
           Definition of State Value Function, V π (s)
                   Expected return starting in state s under policy π
                       V π (s) = Eπ [Gt |st = s] = Eπ [rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · |st = s]
           Definition of State-Action Value Function, Q π (s, a)
                   Expected return starting in state s, taking action a and following policy π
           Q π (s, a) = Eπ [Gt |st = s, at = a]
                         = Eπ [rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · |st = s, at = a]
  Recall: Dynamic Programming for Policy Evaluation
           In a Markov decision process

                   V π (s) = Eπ [Gt |st = s]
                           = Eπ [rt + γrt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · |st = s]
                           = R(s, π(s)) + γ Σ_{s′∈S} P(s′ |s, π(s)) V π (s′ )

           If given dynamics and reward models, can do policy evaluation through
           dynamic programming

                   V_k^π (s) = r (s, π(s)) + γ Σ_{s′∈S} p(s′ |s, π(s)) V_{k−1}^π (s′ )                  (1)

           Note: before convergence, V_k^π is an estimate of V π
           In Equation 1 we are substituting Σ_{s′∈S} p(s′ |s, π(s)) V_{k−1}^π (s′ ) for
           Eπ [rt+1 + γrt+2 + γ^2 rt+3 + · · · |st = s].
           This substitution is an instance of bootstrapping
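
A minimal sketch of the iterative update in Equation 1 (an illustration, not the course's reference code), assuming the policy has already been folded into a reward vector R[s] = r(s, π(s)) and a transition matrix P[s, s′] = p(s′|s, π(s)):

```python
import numpy as np

def dp_policy_evaluation(R, P, gamma, tol=1e-6):
    """Iterate V_k(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V_{k-1}(s').

    R: shape (|S|,) rewards under pi; P: shape (|S|, |S|) transitions under pi.
    """
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V            # bootstrapped Bellman backup
        if np.max(np.abs(V_new - V)) < tol:  # stop when the estimate stabilizes
            return V_new
        V = V_new
```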
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
                    Policy evaluation when don’t have a model of how the world works
                         Given on-policy samples
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Monte Carlo (MC) Policy Evaluation
           Gt = rt + γrt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · + γ^{Ti −t} rTi in MDP M under policy π
           V π (s) = Eτ ∼π [Gt |st = s]
                   Expectation over trajectories τ generated by following π
           Simple idea: Value = mean return
           If trajectories are all finite, sample set of trajectories & average returns
           Note: trajectories may not all be the same length (e.g. consider an MDP with
           terminal states)
  Monte Carlo (MC) Policy Evaluation
           If trajectories are all finite, sample set of trajectories & average returns
           Does not require MDP dynamics/rewards
           Does not assume state is Markov
           Can be applied to episodic MDPs
                   Averaging over returns from a complete episode
                   Requires each episode to terminate
  First-Visit Monte Carlo (MC) On Policy Evaluation
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Define Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti as the return from time
            step t onwards in the ith episode
            For each time step t until Ti (the end of episode i)
                   If this is the first time t that state s is visited in episode i
                          Increment counter of total first visits: N(s) = N(s) + 1
                          Increment total return G (s) = G (s) + Gi,t
                          Update estimate V π (s) = G (s)/N(s)
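
A minimal sketch of the first-visit estimator above (an illustration, with episodes assumed to be lists of (state, action, reward) tuples generated under π):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo policy evaluation (illustrative sketch).

    episodes: list of episodes, each a list of (s, a, r) tuples under pi.
    Returns a dict mapping state -> average first-visit return.
    """
    N = defaultdict(int)            # number of first visits per state
    G_total = defaultdict(float)    # sum of first-visit returns per state
    for episode in episodes:
        # Backward pass computes G_t for every time step of the episode.
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()           # back to time order t = 1, ..., T_i
        seen = set()
        for s, G_t in returns:
            if s in seen:           # only the first visit to s counts
                continue
            seen.add(s)
            N[s] += 1
            G_total[s] += G_t
    return {s: G_total[s] / N[s] for s in N}
```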
  Every-Visit Monte Carlo (MC) On Policy Evaluation
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Define Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti as the return from time
            step t onwards in the ith episode
            For each time step t until Ti (the end of episode i)
                    Let s be the state visited at time step t in episode i
                   Increment counter of total visits: N(s) = N(s) + 1
                   Increment total return G (s) = G (s) + Gi,t
                   Update estimate V π (s) = G (s)/N(s)
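
The every-visit variant differs only in dropping the first-visit check; a sketch under the same assumptions as the first-visit code above:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma):
    """Every-visit Monte Carlo: every occurrence of s contributes its return."""
    N, G_total = defaultdict(int), defaultdict(float)
    for episode in episodes:
        G = 0.0
        for (s, a, r) in reversed(episode):  # backward pass accumulates G_t
            G = r + gamma * G
            N[s] += 1
            G_total[s] += G
    return {s: G_total[s] / N[s] for s in N}
```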
  Optional Worked Example: MC On Policy Evaluation
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
           Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti
            For each time step t until Ti (the end of episode i)
                   If this is the first time t that state s is visited in episode i (for first visit MC)
                          Increment counter of total first visits: N(s) = N(s) + 1
                          Increment total return G (s) = G (s) + Gi,t
                          Update estimate V π (s) = G (s)/N(s)
           Mars rover: R(s) = [ 1 0 0 0 0 0 +10]
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
           Let γ < 1. Compute the first visit & every visit MC estimates of s2 .
           See solutions at the end of the slides
  Incremental Monte Carlo (MC) On Policy Evaluation
   After each episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . .
           Define Gi,t = ri,t + γri,t+1 + γ 2 ri,t+2 + · · · as return from time step t
           onwards in ith episode
           For state s visited at time step t in episode i
                   Increment counter of total visits: N(s) = N(s) + 1
                    Update estimate
                        V π (s) = V π (s) · (N(s) − 1)/N(s) + Gi,t /N(s) = V π (s) + (1/N(s)) (Gi,t − V π (s))
  Incremental Monte Carlo (MC) On Policy Evaluation
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti
           for t = 1 : Ti where Ti is the length of the i-th episode
                   V π (sit ) = V π (sit ) + α(Gi,t − V π (sit ))
           We will see many algorithms of this form with a learning rate, target, and
           incremental update
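
A minimal sketch of this constant-α incremental update (an illustration, same episode format as above); α trades off the new sampled return against the running estimate:

```python
def incremental_mc(episodes, gamma, alpha, V=None):
    """Constant step-size incremental MC: V(s) <- V(s) + alpha * (G_t - V(s))."""
    V = {} if V is None else V
    for episode in episodes:
        # Backward pass to get the return G_t from each time step.
        G, updates = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            updates.append((s, G))
        for s, G_t in reversed(updates):     # apply updates in time order
            v = V.get(s, 0.0)
            V[s] = v + alpha * (G_t - v)
    return V
```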
   Policy Evaluation Diagram

           [Figure slides: policy evaluation diagrams; the images are not recoverable from the extracted text]
  MC Policy Evaluation
                                   V π (s) = V π (s) + α(Gi,t − V π (s))
  Evaluation of the Quality of a Policy Estimation Approach
           Consistency: with enough data, does the estimate converge to the true value
           of the policy?
           Computational complexity: as we get more data, what is the computational cost
           of updating the estimate?
           Memory requirements
           Statistical efficiency (intuitively, how does the accuracy of the estimate
           change with the amount of data)
           Empirical accuracy, often evaluated by mean squared error
  Evaluation of the Quality of a Policy Estimation Approach:
  Bias, Variance and MSE
           Consider a statistical model that is parameterized by θ and that determines
           a probability distribution over observed data P(x|θ)
           Consider a statistic θ̂ that provides an estimate of θ and is a function of
           observed data x
                   E.g. for a Gaussian distribution with known variance, the average of a set of
                    i.i.d. data points is an estimate of the mean of the Gaussian
           Definition: the bias of an estimator θ̂ is:
                                                  Biasθ (θ̂) = Ex|θ [θ̂] − θ
           Definition: the variance of an estimator θ̂ is:
                                               Var (θ̂) = Ex|θ [(θ̂ − E[θ̂])2 ]
           Definition: mean squared error (MSE) of an estimator θ̂ is:
                                             MSE (θ̂) = Var (θ̂) + Biasθ (θ̂)2
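
For reference, the standard identity behind this MSE formula (not derived on the slides), taking MSE(θ̂) = E_{x|θ}[(θ̂ − θ)²]:

```latex
\begin{align*}
\mathrm{MSE}(\hat\theta)
 &= \mathbb{E}_{x\mid\theta}\big[(\hat\theta - \theta)^2\big]
  = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)^2\big] \\
 &= \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\mathrm{Var}(\hat\theta)}
  + \underbrace{(\mathbb{E}[\hat\theta] - \theta)^2}_{\mathrm{Bias}_\theta(\hat\theta)^2}
  + 2\,(\mathbb{E}[\hat\theta] - \theta)\underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}
  = \mathrm{Var}(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2
\end{align*}
```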
  Evaluation of the Quality of a Policy Estimation Approach:
  Consistent Estimator
           Consider a statistical model that is parameterized by θ and that determines
           a probability distribution over observed data P(x|θ)
           Consider a statistic θ̂ that provides an estimate of θ and is a function of
           observed data x
           Definition: the bias of an estimator θ̂ is:
                                                  Biasθ (θ̂) = Ex|θ [θ̂] − θ
           Let n be the number of data points x used to estimate the parameter θ and
           call the resulting estimate of θ using that data θ̂n
           Then the estimator θ̂n is consistent if, for all ϵ > 0
                                                  lim_{n→∞} Pr (|θ̂n − θ| > ϵ) = 0
           If an estimator is unbiased (bias = 0) is it consistent?
  Properties of Monte Carlo On Policy Evaluators
   Properties:
           First-visit Monte Carlo
                   V π estimator is an unbiased estimator of true Eπ [Gt |st = s]
                   By law of large numbers, as N(s) → ∞, V π (s) → Eπ [Gt |st = s]
           Every-visit Monte Carlo
                   V π every-visit MC estimator is a biased estimator of V π
                   But consistent estimator and often has better MSE
           Incremental Monte Carlo
                    Properties depend on the learning rate α
  Properties of Monte Carlo On Policy Evaluators
           Update is: V π (sit ) = V π (sit ) + αk (sj )(Gi,t − V π (sit ))
           where we have allowed α to vary (let k be the total number of updates done
           so far, for state sit = sj )
            If
                        Σ_{n=1}^∞ αn (sj ) = ∞   and   Σ_{n=1}^∞ αn^2 (sj ) < ∞
           then incremental MC estimate will converge to true policy value V π (sj )
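
A standard example (an addition, not on this slide): the running-average step size αn(sj) = 1/n used on the earlier incremental MC slide satisfies both conditions, since

```latex
\sum_{n=1}^{\infty} \frac{1}{n} = \infty
\qquad \text{and} \qquad
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty .
```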
  Monte Carlo (MC) Policy Evaluation Key Limitations
           Generally high variance estimator
                   Reducing variance can require a lot of data
                   In cases where data is very hard or expensive to acquire, or the stakes are
                   high, MC may be impractical
           Requires episodic settings
                   Episode must end before data from episode can be used to update V
  Monte Carlo (MC) Policy Evaluation Summary
           Aim: estimate V π (s) given episodes generated under policy π
                   s1 , a1 , r1 , s2 , a2 , r2 , . . . where the actions are sampled from π
           Gt = rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · under policy π
           V π (s) = Eπ [Gt |st = s]
           Simple: Estimates expectation by empirical average (given episodes sampled
           from policy of interest)
           Updates V estimate using sample of return to approximate the expectation
           Does not assume Markov process
           Converges to true value under some (generally mild) assumptions
           Note: Sometimes is preferred over dynamic programming for policy
           evaluation even if know the true dynamics model and reward
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Temporal Difference Learning
           “If one had to identify one idea as central and novel to reinforcement
           learning, it would undoubtedly be temporal-difference (TD) learning.” –
           Sutton and Barto 2017
           Combination of Monte Carlo & dynamic programming methods
           Model-free
           Can be used in episodic or infinite-horizon non-episodic settings
           Immediately updates estimate of V after each (s, a, r , s ′ ) tuple
  Temporal Difference Learning for Estimating V
           Aim: estimate V π (s) given episodes generated under policy π
           Gt = rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · in MDP M under policy π
           V π (s) = Eπ [Gt |st = s]
           Recall Bellman operator (if know MDP models)
                          B π V (s) = r (s, π(s)) + γ Σ_{s′∈S} p(s′ |s, π(s)) V (s′ )
           In incremental every-visit MC, update estimate using 1 sample of return (for
           the current ith episode)
                                         V π (s) = V π (s) + α(Gi,t − V π (s))
           Idea: have an estimate of V π , use to estimate expected return
                                V π (s) = V π (s) + α([rt + γV π (st+1 )] − V π (s))
  Temporal Difference [TD(0)] Learning
           Aim: estimate V π (s) given episodes generated under policy π
                   s1 , a1 , r1 , s2 , a2 , r2 , . . . where the actions are sampled from π
           TD(0) learning / 1-step TD learning: update estimate towards target
                                V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
                                where rt + γV π (st+1 ) is the TD target
           TD(0) error:
                                             δt = rt + γV π (st+1 ) − V π (st )
           Can immediately update value estimate after (s, a, r , s ′ ) tuple
           Don’t need episodic setting
  Temporal Difference [TD(0)] Learning Algorithm
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
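
A minimal sketch of this loop (an illustration; the done flag for handling terminal transitions is an addition, with terminal states given value 0):

```python
def td0(transitions, gamma, alpha, V=None):
    """TD(0) policy evaluation from a stream of (s, a, r, s_next, done) tuples."""
    V = {} if V is None else V
    for (s, a, r, s_next, done) in transitions:
        v_next = 0.0 if done else V.get(s_next, 0.0)
        td_target = r + gamma * v_next          # TD target
        td_error = td_target - V.get(s, 0.0)    # TD(0) error delta_t
        V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```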
  Worked Example TD Learning
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
    Example Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
            π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
  Worked Example TD Learning
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
   Example:
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
            π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
            TD estimate of all states (initialized at 0) with α = 1, γ < 1 at the end of this
            episode?
            V = [1 0 0 0 0 0 0]
            First visit MC estimate of V of each state? [1 γ γ^2 0 0 0 0]
  Temporal Difference (TD) Policy Evaluation
                        V π (st ) = r (st , π(st )) + γ Σ_{st+1} P(st+1 |st , π(st )) V π (st+1 )

                        V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st ))
  Check Your Understanding L3N2: Polleverywhere Poll
  Temporal Difference [TD(0)] Learning Algorithm
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
   Select all that are true
       1   If α = 0 TD will weigh the TD target more than the past V estimate
       2   If α = 1 TD will update the V estimate to the TD target
       3   If α = 1 TD in MDPs where the policy goes through states with multiple
           possible next states, V may oscillate forever
       4   There exist deterministic MDPs where α = 1 TD will converge
  Check Your Understanding L3N2: Polleverywhere Poll
  Temporal Difference [TD(0)] Learning Algorithm
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
   Answers. If α = 1 TD will update to the TD target. If α = 1 TD in
   MDPs where the policy goes through states with multiple possible next
   states, V may oscillate forever. There exist deterministic MDPs where
   α = 1 TD will converge.
  Summary: Temporal Difference Learning
           Combination of Monte Carlo & dynamic programming methods
           Model-free
           Bootstraps and samples
           Can be used in episodic or infinite-horizon non-episodic settings
           Immediately updates estimate of V after each (s, a, r , s ′ ) tuple
            Biased estimator (early estimates are influenced by the initialization, so it is not
            an unbiased estimator)
           Generally lower variance than Monte Carlo policy evaluation
           Consistent estimator if learning rate α satisfies same conditions specified for
           incremental MC policy evaluation to converge
           Note: algorithm I introduced is TD(0). In general can have
           approaches that interpolate between TD(0) and Monte Carlo
           approach
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Certainty Equivalence V π MLE MDP Model Estimates
             Model-based option for policy evaluation without true models
             After each (si , ai , ri , si+1 ) tuple
                   Recompute maximum likelihood MDP model for (s, a)
                                     P̂(s′ |s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a, sk+1 = s′ )

                                     r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a) rk
                    Compute V π using the MLE MDP² (using any dynamic programming method
                    from Lecture 2)
             Optional worked example at end of slides for Mars rover domain.
        2
            Requires initializing for all (s, a) pairs
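
A minimal sketch of certainty-equivalence evaluation (an illustration, assuming integer state/action ids and, for simplicity, only counting transitions where the logged action matches π(s), since only V π is needed here):

```python
import numpy as np

def certainty_equivalence_v(transitions, policy, n_states, gamma, tol=1e-6):
    """Fit the MLE MDP from (s, a, r, s') data, then evaluate pi on it (sketch)."""
    counts = np.zeros((n_states, n_states))   # transition counts for (s, pi(s))
    r_sum = np.zeros(n_states)                # summed rewards for (s, pi(s))
    n_sa = np.zeros(n_states)                 # visit counts N(s, pi(s))
    for (s, a, r, s_next) in transitions:
        if a != policy[s]:                    # keep only data consistent with pi
            continue
        counts[s, s_next] += 1
        r_sum[s] += r
        n_sa[s] += 1
    n_safe = np.maximum(n_sa, 1)              # avoid dividing by zero counts
    P_hat = counts / n_safe[:, None]          # MLE transition model under pi
    r_hat = r_sum / n_safe                    # MLE reward model under pi
    # Dynamic-programming policy evaluation on the MLE model (as in Lecture 2).
    V = np.zeros(n_states)
    while True:
        V_new = r_hat + gamma * P_hat @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```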
  Certainty Equivalence V π MLE MDP Model Estimates
           Model-based option for policy evaluation without true models
           After each (s, a, r , s ′ ) tuple
                   Recompute maximum likelihood MDP model for (s, a)
                               P̂(s′ |s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Tk −1} 1(sk,t = s, ak,t = a, sk,t+1 = s′ )

                               r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Tk −1} 1(sk,t = s, ak,t = a) rk,t
                   Compute V π using MLE MDP
            Cost: Updating MLE model and MDP planning at each update (O(|S|^3 ) for
            analytic matrix solution, O(|S|^2 |A|) for iterative methods)
           Very data efficient and very computationally expensive
           Consistent (will converge to right estimate for Markov models)
           Can also easily be used for off-policy evaluation (which we will shortly define
           and discuss)
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
                    Policy evaluation when don’t have a model of how the world works
                         Given on-policy samples
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Batch MC and TD
           Batch (Offline) solution for finite dataset
                   Given set of K episodes
                   Repeatedly sample an episode from K
                   Apply MC or TD(0) to the sampled episode
           What do MC and TD(0) converge to?
  AB Example: (Ex. 6.4, Sutton & Barto, 2018)
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) =
  AB Example: (Ex. 6.4, Sutton & Barto, 2018)
         TD Update: V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
         where rt + γV π (st+1 ) is the TD target
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) = 0.75 by TD or MC
           What about V (A)?
  Check Your Understanding L3N3: AB Example: (Ex. 6.4,
  Sutton & Barto, 2018)
         TD Update: V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
         where rt + γV π (st+1 ) is the TD target
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) = 0.75 by TD or MC
           What about V (A)?
           Respond in Poll
  Check Your Understanding L3N3: AB Example: (Ex. 6.4,
  Sutton & Barto, 2018)
         TD Update: V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
         where rt + γV π (st+1 ) is the TD target
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) = 0.75 by TD or MC
           What about V (A)?
         V MC (A) = 0, V TD (A) = 0.75
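
A small sketch (mine, not from the slides) that replays the 8 AB episodes with batch TD(0) and computes the first-visit MC estimates, so the two answers above can be checked numerically; the episode encoding, step size, and number of sweeps are arbitrary choices:

    import random

    # The 8 AB episodes as (state, reward) steps; gamma = 1 and every episode ends in a terminal state.
    episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]
    gamma = 1.0

    # Batch TD(0): repeatedly sample an episode from the fixed set and apply TD updates.
    V = {"A": 0.0, "B": 0.0}
    alpha = 0.01
    for _ in range(50000):
        ep = random.choice(episodes)
        for i, (s, r) in enumerate(ep):
            v_next = V[ep[i + 1][0]] if i + 1 < len(ep) else 0.0   # terminal value is 0
            V[s] += alpha * (r + gamma * v_next - V[s])
    print("batch TD:", V)   # both V(A) and V(B) approach 0.75

    # First-visit MC: average the observed returns from each state.
    returns = {"A": [], "B": []}
    for ep in episodes:
        G, first_return = 0.0, {}
        for s, r in reversed(ep):
            G = r + gamma * G
            first_return[s] = G   # overwritten so only the earliest visit's return remains
        for s, g in first_return.items():
            returns[s].append(g)
    print("MC:", {s: sum(g) / len(g) for s, g in returns.items()})   # V(A) = 0, V(B) = 0.75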
  Batch MC and TD: Convergence
           Monte Carlo in batch setting converges to min MSE (mean squared error)
                   Minimize loss with respect to observed returns
                   In AB example, V (A) = 0
           TD(0) converges to DP policy V π for the MDP with the maximum
           likelihood model estimates
           Aka same as dynamic programming with certainty equivalence!
                   Maximum likelihood Markov decision process model
                                     P̂(s ′ |s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a, sk+1 = s ′ )

                                     r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a) rk
                   Compute V π using this model
                   In AB example, V (A) = 0.75
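
As a quick check on the AB example: the maximum likelihood model built from the 8 episodes is P̂(B|A) = 1, r̂(A) = 0, r̂(B) = 6/8 = 0.75, P̂(terminal|B) = 1, so solving the Bellman equations under this model gives V (B) = r̂(B) = 0.75 and V (A) = r̂(A) + γ P̂(B|A) V (B) = 0 + 1 · 0.75 = 0.75, which is exactly the batch TD(0) answer.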
  Some Important Properties to Evaluate Model-free Policy
  Evaluation Algorithms
           Data efficiency & Computational efficiency
           In simple TD(0), use (s, a, r , s ′ ) once to update V (s)
                   O(1) operation per update
                   In an episode of length L, O(L)
         In MC, we have to wait until the episode finishes; the update is then also O(L)
           MC can be more data efficient than simple TD
           But TD exploits Markov structure
                   If in Markov domain, leveraging this is helpful
           Dynamic programming with certainty equivalence also uses Markov structure
  Summary: Policy Evaluation
 Estimating the expected return of a particular policy if we don’t have access
 to the true MDP models, e.g., evaluating average purchases per session of a new
 product recommendation system
           Monte Carlo policy evaluation
                   Policy evaluation when we don’t have a model of how the world works
                       Given on-policy samples
                       Given off-policy samples
           Temporal Difference (TD)
           Dynamic Programming with certainty equivalence
           *Understand what MC vs TD methods compute in batch evaluations
           Metrics / Qualities to evaluate and compare algorithms
                   Uses Markov assumption
                   Accuracy / MSE / bias / variance
                   Data efficiency
                   Computational efficiency
  Today’s Plan
           Last Time:
                   Markov reward / decision processes
                   Policy evaluation & control when have true model (of how the world works)
           Today
                   Policy evaluation without known dynamics & reward models
           Next Time:
                   Control when don’t have a model of how the world works
  Optional Worked Example MC On Policy Evaluation
  Answers
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
           Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
         Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
           For each time step t until Ti ( the end of the episode i)
                   If this is the first time t that state s is visited in episode i
                          Increment counter of total first visits: N(s) = N(s) + 1
                          Increment total return G (s) = G (s) + Gi,t
                          Update estimate V π (s) = G (s)/N(s)
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
           Let γ < 1. Compare the first visit & every visit MC estimates of s2 .
         First visit: V MC (s2 ) = γ², Every visit: V MC (s2 ) = (γ² + γ)/2
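
A tiny sketch (mine, not part of the slides) that reproduces both numbers from this single trajectory; the state labels and the particular value of γ are just for illustration:

    # Mars rover trajectory from the worked example: (state, reward) pairs.
    gamma = 0.9   # any gamma < 1; the answers are gamma**2 and (gamma**2 + gamma) / 2
    traj = [("s3", 0), ("s2", 0), ("s2", 0), ("s1", 1)]

    # Compute the return G_t from every time step by working backwards from the end.
    G, G_t = 0.0, []
    for s, r in reversed(traj):
        G = r + gamma * G
        G_t.append((s, G))
    G_t.reverse()

    first_visit = next(G for s, G in G_t if s == "s2")        # gamma**2
    every_visit = [G for s, G in G_t if s == "s2"]
    print(first_visit, sum(every_visit) / len(every_visit))   # (gamma**2 + gamma) / 2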
  Optional Check Your Understanding L3: Incremental MC
  (State if each is True or False)
   First or Every Visit MC
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
          Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
                   For all s, for first or every time t that state s is visited in episode i
                         N(s) = N(s) + 1, G (s) = G (s) + Gi,t
                         Update estimate V π (s) = G (s)/N(s)
   Incremental MC
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
          Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
            for t = 1 : Ti where Ti is the length of the i-th episode
                   V π (sit ) = V π (sit ) + α(Gi,t − V π (sit ))
      1   Incremental MC with α = 1 is the same as first visit MC
      2   Incremental MC with α = 1/N(sit ) is the same as every visit MC
      3   Incremental MC with α > 1/N(sit ) could be helpful in non-stationary domains
  Optional Check Your Understanding L3 Incremental MC
  Answers
   First or Every Visit MC
          Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
          Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
                      For all s, for first or every time t that state s is visited in episode i
                               N(s) = N(s) + 1, G (s) = G (s) + Gi,t
                               Update estimate V π (s) = G (s)/N(s)
   Incremental MC
          Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            for t = 1 : Ti where Ti is the length of the i-th episode
                      V π (sit ) = V π (sit ) + α(Gi,t − V π (sit ))
        1   Incremental MC with α = 1 is the same as first visit MC
            false
      2   Incremental MC with α = 1/N(sit ) is the same as every visit MC
          true
      3   Incremental MC with α > 1/N(sit ) could help in non-stationary domains
          true
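
A compact sketch (illustrative only; the function name and episode format are assumptions) of incremental every-visit MC that covers both step-size choices from the questions above:

    from collections import defaultdict

    def incremental_mc(episodes, gamma=1.0, alpha=None):
        """Incremental every-visit MC. episodes: list of [(state, reward), ...].
        alpha=None uses the running-average step size 1/N(s); a constant alpha
        weights recent episodes more heavily."""
        V = defaultdict(float)
        N = defaultdict(int)
        for ep in episodes:
            # Returns G_t computed backwards over the episode.
            G, visits = 0.0, []
            for s, r in reversed(ep):
                G = r + gamma * G
                visits.append((s, G))
            for s, G in reversed(visits):
                N[s] += 1
                step = alpha if alpha is not None else 1.0 / N[s]
                V[s] += step * (G - V[s])   # move the estimate toward the sampled return
        return dict(V)

With alpha=None this reproduces the every-visit MC running average (statement 2); a constant alpha larger than 1/N(s) forgets old returns faster, which is why statement 3 holds in non-stationary domains.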
  Certainty Equivalence V π MLE MDP Worked Example
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
         First visit MC estimate of V of each state? [1 γ γ² 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
           Optional exercise: What is the certainty equivalent estimate?
  Certainty Equivalence V π MLE MDP Worked Ex Solution
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
         First visit MC estimate of V of each state? [1 γ γ² 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
           Optional exercise: What is the certainty equivalent estimate?
         r̂ = [1 0 0 0 0 0 0], p̂(terminate|s1 , a1 ) = p̂(s2 |s3 , a1 ) = 1
         p̂(s2 |s2 , a1 ) = 0.5 = p̂(s1 |s2 , a1 )
         V = [1   0.5γ/(1 − 0.5γ)   0.5γ²/(1 − 0.5γ)   0 0 0 0]
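
The last line follows from the Bellman equations under this MLE model (a worked check): V (s1 ) = r̂(s1 ) + γ · 0 = 1; V (s2 ) = 0 + γ[0.5 V (s1 ) + 0.5 V (s2 )], so V (s2 ) = 0.5γ/(1 − 0.5γ); and V (s3 ) = 0 + γ V (s2 ) = 0.5γ²/(1 − 0.5γ).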
  Recall: Dynamic Programming for Policy Evaluation
           If we knew dynamics and reward model, we can do policy evaluation
           Initialize V0π (s) = 0 for all s
           For k = 1 until convergence
                   For all s in S
                            Vkπ (s) = r (s, π(s)) + γ Σ_{s ′ ∈S} p(s ′ |s, π(s)) Vk−1 π (s ′ )
           Vkπ (s) is exactly the k-horizon value of state s under policy π
           Vkπ (s) is an estimate of the infinite horizon value of state s under policy π
                             V π (s) = Eπ [Gt |st = s] ≈ Eπ [rt + γVk−1 (st+1 )|st = s]
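
For reference, a minimal sketch of this iteration in Python (the array shapes and stopping tolerance are my assumptions, not part of the slides):

    import numpy as np

    def dp_policy_evaluation(P, r, policy, gamma, tol=1e-8):
        """P: (S, A, S) transition probabilities, r: (S, A) rewards,
        policy: length-S integer array of actions. Assumes gamma < 1."""
        S = r.shape[0]
        V = np.zeros(S)                         # V_0(s) = 0 for all s
        P_pi = P[np.arange(S), policy]          # p(s'|s, pi(s)), shape (S, S)
        r_pi = r[np.arange(S), policy]          # r(s, pi(s)), shape (S,)
        while True:
            V_new = r_pi + gamma * P_pi @ V     # one Bellman backup under pi
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new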
  Dynamic Programming Policy Evaluation
  V π (s) ← Eπ [rt + γVk−1 (st+1 )|st = s]
           Bootstrapping: Update for V uses an estimate
  What about when we don’t know the models?
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
         First visit MC estimate of V of each state? [1 γ γ² 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
           What is the certainty equivalent estimate?
         r̂ = [1 0 0 0 0 0 0], p̂(terminate|s1 , a1 ) = p̂(s2 |s3 , a1 ) = 1
           p̂(s1 |s2 , a1 ) = .5, p̂(s2 |s2 , a1 ) = .5, V = [1 1 1 0 0 0 0]
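
As a check, plugging γ = 1 into the general solution from the earlier worked example gives V (s2 ) = 0.5/(1 − 0.5) = 1 and V (s3 ) = 0.5/(1 − 0.5) = 1, consistent with V = [1 1 1 0 0 0 0].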
  Bias/Variance of Model-free Policy Evaluation Algorithms
           Return Gt is an unbiased estimate of V π (st )
           TD target [rt + γV π (st+1 )] is a biased estimate of V π (st )
           But often much lower variance than a single return Gt
         The return is a function of a multi-step sequence of random actions, states & rewards
         The TD target involves only one random action, reward and next state
           MC
                   Unbiased (for first visit)
                   High variance
                   Consistent (converges to true) even with function approximation
           TD
                   Some bias
                   Lower variance
                   TD(0) converges to true value with tabular representation
                   TD(0) does not always converge with function approximation
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
           First visit MC estimate of V of each state? [1 1 1 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
         TD(0) uses each data point (s, a, r , s ′ ) only once
         Monte Carlo uses the entire return from s to the end of the episode