Lecture 3: Model-Free Policy Evaluation: Policy
Evaluation Without Knowing How the World Works
                                                  Emma Brunskill
                                            CS234 Reinforcement Learning
                                                     Winter 2025
           Material builds on structure from David Silver’s Lecture 4: Model-Free
           Prediction. Other resources: Sutton and Barto Jan 1 2018 draft
           Chapter/Sections: 5.1; 5.5; 6.1-6.3
  L3N1 Refresh Your Knowledge [Polleverywhere Poll]
           In a tabular MDP asymptotically value iteration will always yield a policy
           with the same value as the policy returned by policy iteration
               1   True.
               2   False
               3   Not sure
           Can value iteration require more iterations than |A|^{|S|} to compute the
           optimal value function? (Assume |A| and |S| are small enough that each
           round of value iteration can be done exactly).
               1   True.
               2   False
               3   Not sure
  L3N1 Refresh Your Knowledge
           In a tabular MDP asymptotically value iteration will always yield a policy
           with the same value as the policy returned by policy iteration
           Answer: True. Both are guaranteed to converge to the optimal value
           function and to a policy achieving that optimal value
           Can value iteration require more iterations than |A|^{|S|} to compute the
           optimal value function? (Assume |A| and |S| are small enough that each
           round of value iteration can be done exactly).
           Answer: True. As an example, consider a single state, single action MDP
           where r (s, a) = 1, γ = 0.9, and initialize V0 (s) = 0. Then V^*(s) = 1/(1 − γ) = 10,
           but after the first iteration of value iteration, V1 (s) = 1.
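
To make the rate concrete, here is a quick numerical sketch (an illustration added here, not part of the slides) of value iteration on this one-state, one-action MDP: the error shrinks only by a factor of γ per backup, so reaching a tolerance of 10^-3 takes on the order of 90 exact backups.

```python
# Sketch (not from the slides): value iteration on the single-state,
# single-action MDP above, with r(s, a) = 1, gamma = 0.9, V_0(s) = 0.
gamma, r = 0.9, 1.0
v_star = r / (1 - gamma)           # V*(s) = 1 / (1 - gamma) = 10
v, iters = 0.0, 0
while abs(v_star - v) > 1e-3:      # stop within 1e-3 of the optimal value
    v = r + gamma * v              # exact Bellman backup for the single state
    iters += 1
print(iters, round(v, 4))          # ~88 iterations, far more than |A|^|S| = 1
```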
  Today’s Plan
           Last Time:
                   Markov reward / decision processes
                   Policy evaluation & control when have true model (of how the world works)
           Today
                   Policy evaluation without known dynamics & reward models
           Next Time:
                   Control when don’t have a model of how the world works
  Evaluation through Direct Experience
           Estimate expected return of policy π
           Only using data from environment¹ (direct experience)
           Why is this important?
           What properties do we want from policy evaluation algorithms?
      1
        Assume today this experience comes from executing the policy π. Later will
   consider how to do policy evaluation using data gathered from other policies.
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
                   Policy evaluation when don’t have a model of how the world works
                         Given on-policy samples
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Recall
           Definition of Return, Gt (for an MRP)
                   Discounted sum of rewards from time step t to horizon
                                            Gt = rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · ·
           Definition of State Value Function, V π (s)
                   Expected return starting in state s under policy π
                       V π (s) = Eπ [Gt |st = s] = Eπ [rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · |st = s]
           Definition of State-Action Value Function, Q π (s, a)
                   Expected return starting in state s, taking action a and following policy π
           Q π (s, a) = Eπ [Gt |st = s, at = a]
                         = Eπ [rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · |st = s, at = a]
  Recall: Dynamic Programming for Policy Evaluation
           In a Markov decision process

                   V π (s) = Eπ [Gt |st = s]
                           = Eπ [rt + γrt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · |st = s]
                           = R(s, π(s)) + γ Σ_{s′∈S} P(s′ |s, π(s)) V π (s′ )

           If given dynamics and reward models, can do policy evaluation through
           dynamic programming

                   V_k^π (s) = r (s, π(s)) + γ Σ_{s′∈S} p(s′ |s, π(s)) V_{k−1}^π (s′ )                  (1)

           Note: before convergence, V_k^π is an estimate of V π
           In Equation 1 we are substituting Σ_{s′∈S} p(s′ |s, π(s)) V_{k−1}^π (s′ ) for
           Eπ [rt+1 + γrt+2 + γ^2 rt+3 + · · · |st = s].
           This substitution is an instance of bootstrapping
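
A minimal sketch of the iterative update in Equation 1 (an illustration, not the course's reference code), assuming the policy has already been folded into a reward vector R[s] = r(s, π(s)) and a transition matrix P[s, s′] = p(s′|s, π(s)):

```python
import numpy as np

def dp_policy_evaluation(R, P, gamma, tol=1e-6):
    """Iterate V_k(s) = r(s, pi(s)) + gamma * sum_s' p(s'|s, pi(s)) V_{k-1}(s').

    R: shape (|S|,) rewards under pi; P: shape (|S|, |S|) transitions under pi.
    """
    V = np.zeros(len(R))
    while True:
        V_new = R + gamma * P @ V            # bootstrapped Bellman backup
        if np.max(np.abs(V_new - V)) < tol:  # stop when the estimate stabilizes
            return V_new
        V = V_new
```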
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
                    Policy evaluation when don’t have a model of how the world works
                         Given on-policy samples
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Monte Carlo (MC) Policy Evaluation
           Gt = rt + γrt+1 + γ^2 rt+2 + γ^3 rt+3 + · · · + γ^{Ti −t} rTi in MDP M under policy π
           V π (s) = Eτ ∼π [Gt |st = s]
                   Expectation over trajectories τ generated by following π
           Simple idea: Value = mean return
           If trajectories are all finite, sample set of trajectories & average returns
           Note: trajectories may not all be the same length (e.g. consider an MDP with
           terminal states)
  Monte Carlo (MC) Policy Evaluation
           If trajectories are all finite, sample set of trajectories & average returns
           Does not require MDP dynamics/rewards
           Does not assume state is Markov
           Can be applied to episodic MDPs
                   Averaging over returns from a complete episode
                   Requires each episode to terminate
  First-Visit Monte Carlo (MC) On Policy Evaluation
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Define Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti as the return from time
            step t onwards in the ith episode
            For each time step t until Ti (the end of episode i)
                   If this is the first time t that state s is visited in episode i
                          Increment counter of total first visits: N(s) = N(s) + 1
                          Increment total return G (s) = G (s) + Gi,t
                          Update estimate V π (s) = G (s)/N(s)
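
A minimal sketch of the first-visit estimator above (an illustration, with episodes assumed to be lists of (state, action, reward) tuples generated under π):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma):
    """First-visit Monte Carlo policy evaluation (illustrative sketch).

    episodes: list of episodes, each a list of (s, a, r) tuples under pi.
    Returns a dict mapping state -> average first-visit return.
    """
    N = defaultdict(int)            # number of first visits per state
    G_total = defaultdict(float)    # sum of first-visit returns per state
    for episode in episodes:
        # Backward pass computes G_t for every time step of the episode.
        G, returns = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            returns.append((s, G))
        returns.reverse()           # back to time order t = 1, ..., T_i
        seen = set()
        for s, G_t in returns:
            if s in seen:           # only the first visit to s counts
                continue
            seen.add(s)
            N[s] += 1
            G_total[s] += G_t
    return {s: G_total[s] / N[s] for s in N}
```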
  Every-Visit Monte Carlo (MC) On Policy Evaluation
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Define Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti as the return from time
            step t onwards in the ith episode
            For each time step t until Ti (the end of episode i)
                    Let s be the state visited at time step t in episode i
                   Increment counter of total visits: N(s) = N(s) + 1
                   Increment total return G (s) = G (s) + Gi,t
                   Update estimate V π (s) = G (s)/N(s)
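
The every-visit variant differs only in dropping the first-visit check; a sketch under the same assumptions as the first-visit code above:

```python
from collections import defaultdict

def every_visit_mc(episodes, gamma):
    """Every-visit Monte Carlo: every occurrence of s contributes its return."""
    N, G_total = defaultdict(int), defaultdict(float)
    for episode in episodes:
        G = 0.0
        for (s, a, r) in reversed(episode):  # backward pass accumulates G_t
            G = r + gamma * G
            N[s] += 1
            G_total[s] += G
    return {s: G_total[s] / N[s] for s in N}
```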
  Optional Worked Example: MC On Policy Evaluation
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
           Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti
            For each time step t until Ti (the end of episode i)
                   If this is the first time t that state s is visited in episode i (for first visit MC)
                          Increment counter of total first visits: N(s) = N(s) + 1
                          Increment total return G (s) = G (s) + Gi,t
                          Update estimate V π (s) = G (s)/N(s)
           Mars rover: R(s) = [ 1 0 0 0 0 0 +10]
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
           Let γ < 1. Compute the first visit & every visit MC estimates of s2 .
           See solutions at the end of the slides
  Incremental Monte Carlo (MC) On Policy Evaluation
   After each episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . .
           Define Gi,t = ri,t + γri,t+1 + γ 2 ri,t+2 + · · · as return from time step t
           onwards in ith episode
           For state s visited at time step t in episode i
                   Increment counter of total visits: N(s) = N(s) + 1
                    Update estimate
                        V π (s) = V π (s) · (N(s) − 1)/N(s) + Gi,t /N(s) = V π (s) + (1/N(s)) (Gi,t − V π (s))
  Incremental Monte Carlo (MC) On Policy Evaluation
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            Gi,t = ri,t + γri,t+1 + γ^2 ri,t+2 + · · · + γ^{Ti −t} ri,Ti
           for t = 1 : Ti where Ti is the length of the i-th episode
                   V π (sit ) = V π (sit ) + α(Gi,t − V π (sit ))
           We will see many algorithms of this form with a learning rate, target, and
           incremental update
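
A minimal sketch of this constant-α incremental update (an illustration, same episode format as above); α trades off the new sampled return against the running estimate:

```python
def incremental_mc(episodes, gamma, alpha, V=None):
    """Constant step-size incremental MC: V(s) <- V(s) + alpha * (G_t - V(s))."""
    V = {} if V is None else V
    for episode in episodes:
        # Backward pass to get the return G_t from each time step.
        G, updates = 0.0, []
        for (s, a, r) in reversed(episode):
            G = r + gamma * G
            updates.append((s, G))
        for s, G_t in reversed(updates):     # apply updates in time order
            v = V.get(s, 0.0)
            V[s] = v + alpha * (G_t - v)
    return V
```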
   Policy Evaluation Diagram

           [Figure slides: policy evaluation diagrams; the images are not recoverable from the extracted text]
  MC Policy Evaluation
                                   V π (s) = V π (s) + α(Gi,t − V π (s))
  Evaluation of the Quality of a Policy Estimation Approach
           Consistency: with enough data, does the estimate converge to the true value
           of the policy?
           Computational complexity: as we get more data, what is the computational cost
           of updating the estimate?
           Memory requirements
           Statistical efficiency (intuitively, how does the accuracy of the estimate
           change with the amount of data)
           Empirical accuracy, often evaluated by mean squared error
  Evaluation of the Quality of a Policy Estimation Approach:
  Bias, Variance and MSE
           Consider a statistical model that is parameterized by θ and that determines
           a probability distribution over observed data P(x|θ)
           Consider a statistic θ̂ that provides an estimate of θ and is a function of
           observed data x
                   E.g. for a Gaussian distribution with known variance, the average of a set of
                    i.i.d. data points is an estimate of the mean of the Gaussian
           Definition: the bias of an estimator θ̂ is:
                                                  Biasθ (θ̂) = Ex|θ [θ̂] − θ
           Definition: the variance of an estimator θ̂ is:
                                               Var (θ̂) = Ex|θ [(θ̂ − E[θ̂])2 ]
           Definition: mean squared error (MSE) of an estimator θ̂ is:
                                             MSE (θ̂) = Var (θ̂) + Biasθ (θ̂)2
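
For reference, the standard identity behind this MSE formula (not derived on the slides), taking MSE(θ̂) = E_{x|θ}[(θ̂ − θ)²]:

```latex
\begin{align*}
\mathrm{MSE}(\hat\theta)
 &= \mathbb{E}_{x\mid\theta}\big[(\hat\theta - \theta)^2\big]
  = \mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta] + \mathbb{E}[\hat\theta] - \theta)^2\big] \\
 &= \underbrace{\mathbb{E}\big[(\hat\theta - \mathbb{E}[\hat\theta])^2\big]}_{\mathrm{Var}(\hat\theta)}
  + \underbrace{(\mathbb{E}[\hat\theta] - \theta)^2}_{\mathrm{Bias}_\theta(\hat\theta)^2}
  + 2\,(\mathbb{E}[\hat\theta] - \theta)\underbrace{\mathbb{E}\big[\hat\theta - \mathbb{E}[\hat\theta]\big]}_{=\,0}
  = \mathrm{Var}(\hat\theta) + \mathrm{Bias}_\theta(\hat\theta)^2
\end{align*}
```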
  Evaluation of the Quality of a Policy Estimation Approach:
  Consistent Estimator
           Consider a statistical model that is parameterized by θ and that determines
           a probability distribution over observed data P(x|θ)
           Consider a statistic θ̂ that provides an estimate of θ and is a function of
           observed data x
           Definition: the bias of an estimator θ̂ is:
                                                  Biasθ (θ̂) = Ex|θ [θ̂] − θ
           Let n be the number of data points x used to estimate the parameter θ and
           call the resulting estimate of θ using that data θ̂n
           Then the estimator θ̂n is consistent if, for all ϵ > 0
                                                  lim_{n→∞} Pr (|θ̂n − θ| > ϵ) = 0
           If an estimator is unbiased (bias = 0) is it consistent?
  Properties of Monte Carlo On Policy Evaluators
   Properties:
           First-visit Monte Carlo
                   V π estimator is an unbiased estimator of true Eπ [Gt |st = s]
                   By law of large numbers, as N(s) → ∞, V π (s) → Eπ [Gt |st = s]
           Every-visit Monte Carlo
                   V π every-visit MC estimator is a biased estimator of V π
                   But consistent estimator and often has better MSE
           Incremental Monte Carlo
                    Properties depend on the learning rate α
  Properties of Monte Carlo On Policy Evaluators
           Update is: V π (sit ) = V π (sit ) + αk (sj )(Gi,t − V π (sit ))
           where we have allowed α to vary (let k be the total number of updates done
           so far, for state sit = sj )
            If
                        Σ_{n=1}^∞ αn (sj ) = ∞   and   Σ_{n=1}^∞ αn^2 (sj ) < ∞
           then incremental MC estimate will converge to true policy value V π (sj )
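
A standard example (an addition, not on this slide): the running-average step size αn(sj) = 1/n used on the earlier incremental MC slide satisfies both conditions, since

```latex
\sum_{n=1}^{\infty} \frac{1}{n} = \infty
\qquad \text{and} \qquad
\sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty .
```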
  Monte Carlo (MC) Policy Evaluation Key Limitations
           Generally high variance estimator
                   Reducing variance can require a lot of data
                   In cases where data is very hard or expensive to acquire, or the stakes are
                   high, MC may be impractical
           Requires episodic settings
                   Episode must end before data from episode can be used to update V
  Monte Carlo (MC) Policy Evaluation Summary
           Aim: estimate V π (s) given episodes generated under policy π
                   s1 , a1 , r1 , s2 , a2 , r2 , . . . where the actions are sampled from π
           Gt = rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · under policy π
           V π (s) = Eπ [Gt |st = s]
           Simple: Estimates expectation by empirical average (given episodes sampled
           from policy of interest)
           Updates V estimate using sample of return to approximate the expectation
           Does not assume Markov process
           Converges to true value under some (generally mild) assumptions
           Note: Sometimes is preferred over dynamic programming for policy
           evaluation even if know the true dynamics model and reward
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Temporal Difference Learning
           “If one had to identify one idea as central and novel to reinforcement
           learning, it would undoubtedly be temporal-difference (TD) learning.” –
           Sutton and Barto 2017
           Combination of Monte Carlo & dynamic programming methods
           Model-free
           Can be used in episodic or infinite-horizon non-episodic settings
           Immediately updates estimate of V after each (s, a, r , s ′ ) tuple
  Temporal Difference Learning for Estimating V
           Aim: estimate V π (s) given episodes generated under policy π
           Gt = rt + γrt+1 + γ 2 rt+2 + γ 3 rt+3 + · · · in MDP M under policy π
           V π (s) = Eπ [Gt |st = s]
           Recall Bellman operator (if know MDP models)
                          B π V (s) = r (s, π(s)) + γ Σ_{s′∈S} p(s′ |s, π(s)) V (s′ )
           In incremental every-visit MC, update estimate using 1 sample of return (for
           the current ith episode)
                                         V π (s) = V π (s) + α(Gi,t − V π (s))
           Idea: have an estimate of V π , use to estimate expected return
                                V π (s) = V π (s) + α([rt + γV π (st+1 )] − V π (s))
  Temporal Difference [TD(0)] Learning
           Aim: estimate V π (s) given episodes generated under policy π
                   s1 , a1 , r1 , s2 , a2 , r2 , . . . where the actions are sampled from π
           TD(0) learning / 1-step TD learning: update estimate towards target
                                V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
                                where rt + γV π (st+1 ) is the TD target
           TD(0) error:
                                             δt = rt + γV π (st+1 ) − V π (st )
           Can immediately update value estimate after (s, a, r , s ′ ) tuple
           Don’t need episodic setting
  Temporal Difference [TD(0)] Learning Algorithm
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
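
A minimal sketch of this loop (an illustration; the done flag for handling terminal transitions is an addition, with terminal states given value 0):

```python
def td0(transitions, gamma, alpha, V=None):
    """TD(0) policy evaluation from a stream of (s, a, r, s_next, done) tuples."""
    V = {} if V is None else V
    for (s, a, r, s_next, done) in transitions:
        v_next = 0.0 if done else V.get(s_next, 0.0)
        td_target = r + gamma * v_next          # TD target
        td_error = td_target - V.get(s, 0.0)    # TD(0) error delta_t
        V[s] = V.get(s, 0.0) + alpha * td_error
    return V
```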
  Worked Example TD Learning
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
    Example Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
            π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
  Worked Example TD Learning
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
   Example:
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
            π(s) = a1 ∀s, γ = 1. Any action from s1 and s7 terminates the episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
            TD estimate of all states (initialized at 0) with α = 1, γ < 1 at the end of this
            episode?
            V = [1 0 0 0 0 0 0]
            First visit MC estimate of V of each state? [1 γ γ^2 0 0 0 0]
  Temporal Difference (TD) Policy Evaluation
                        V π (st ) = r (st , π(st )) + γ Σ_{st+1} P(st+1 |st , π(st )) V π (st+1 )

                        V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st ))
  Check Your Understanding L3N2: Polleverywhere Poll
  Temporal Difference [TD(0)] Learning Algorithm
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
   Select all that are true
       1   If α = 0 TD will weigh the TD target more than the past V estimate
       2   If α = 1 TD will update the V estimate to the TD target
       3   If α = 1 TD in MDPs where the policy goes through states with multiple
           possible next states, V may oscillate forever
       4   There exist deterministic MDPs where α = 1 TD will converge
  Check Your Understanding L3N2: Polleverywhere Poll
  Temporal Difference [TD(0)] Learning Algorithm
   Input: α
   Initialize V π (s) = 0, ∀s ∈ S
   Loop
           Sample tuple (st , at , rt , st+1 )
            V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )), where rt + γV π (st+1 ) is the TD target
   Answers. If α = 1 TD will update to the TD target. If α = 1 TD in
   MDPs where the policy goes through states with multiple possible next
   states, V may oscillate forever. There exist deterministic MDPs where
   α = 1 TD will converge.
  Summary: Temporal Difference Learning
           Combination of Monte Carlo & dynamic programming methods
           Model-free
           Bootstraps and samples
           Can be used in episodic or infinite-horizon non-episodic settings
           Immediately updates estimate of V after each (s, a, r , s ′ ) tuple
            Biased estimator (early estimates are influenced by the initialization, so it is not
            an unbiased estimator)
           Generally lower variance than Monte Carlo policy evaluation
           Consistent estimator if learning rate α satisfies same conditions specified for
           incremental MC policy evaluation to converge
           Note: algorithm I introduced is TD(0). In general can have
           approaches that interpolate between TD(0) and Monte Carlo
           approach
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Certainty Equivalence V π MLE MDP Model Estimates
             Model-based option for policy evaluation without true models
             After each (si , ai , ri , si+1 ) tuple
                   Recompute maximum likelihood MDP model for (s, a)
                                     P̂(s′ |s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a, sk+1 = s′ )

                                     r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a) rk
                    Compute V π using the MLE MDP² (using any dynamic programming method
                    from Lecture 2)
             Optional worked example at end of slides for Mars rover domain.
        2
            Requires initializing for all (s, a) pairs
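
A minimal sketch of certainty-equivalence evaluation (an illustration, assuming integer state/action ids and, for simplicity, only counting transitions where the logged action matches π(s), since only V π is needed here):

```python
import numpy as np

def certainty_equivalence_v(transitions, policy, n_states, gamma, tol=1e-6):
    """Fit the MLE MDP from (s, a, r, s') data, then evaluate pi on it (sketch)."""
    counts = np.zeros((n_states, n_states))   # transition counts for (s, pi(s))
    r_sum = np.zeros(n_states)                # summed rewards for (s, pi(s))
    n_sa = np.zeros(n_states)                 # visit counts N(s, pi(s))
    for (s, a, r, s_next) in transitions:
        if a != policy[s]:                    # keep only data consistent with pi
            continue
        counts[s, s_next] += 1
        r_sum[s] += r
        n_sa[s] += 1
    n_safe = np.maximum(n_sa, 1)              # avoid dividing by zero counts
    P_hat = counts / n_safe[:, None]          # MLE transition model under pi
    r_hat = r_sum / n_safe                    # MLE reward model under pi
    # Dynamic-programming policy evaluation on the MLE model (as in Lecture 2).
    V = np.zeros(n_states)
    while True:
        V_new = r_hat + gamma * P_hat @ V
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```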
  Certainty Equivalence V π MLE MDP Model Estimates
           Model-based option for policy evaluation without true models
           After each (s, a, r , s ′ ) tuple
                   Recompute maximum likelihood MDP model for (s, a)
                               P̂(s′ |s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Tk −1} 1(sk,t = s, ak,t = a, sk,t+1 = s′ )

                               r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{Tk −1} 1(sk,t = s, ak,t = a) rk,t
                   Compute V π using MLE MDP
            Cost: Updating MLE model and MDP planning at each update (O(|S|^3 ) for
            analytic matrix solution, O(|S|^2 |A|) for iterative methods)
           Very data efficient and very computationally expensive
           Consistent (will converge to right estimate for Markov models)
           Can also easily be used for off-policy evaluation (which we will shortly define
           and discuss)
  This Lecture: Policy Evaluation
           Estimating the expected return of a particular policy if don’t have access to
           true MDP models
           Monte Carlo policy evaluation
                    Policy evaluation when don’t have a model of how the world works
                         Given on-policy samples
           Temporal Difference (TD)
           Certainty Equivalence with dynamic programming
           Batch policy evaluation
  Batch MC and TD
           Batch (Offline) solution for finite dataset
                   Given set of K episodes
                   Repeatedly sample an episode from K
                   Apply MC or TD(0) to the sampled episode
           What do MC and TD(0) converge to?
  AB Example: (Ex. 6.4, Sutton & Barto, 2018)
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) =
  AB Example: (Ex. 6.4, Sutton & Barto, 2018)
         TD Update: V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
         where rt + γV π (st+1 ) is the TD target
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) = 0.75 by TD or MC
           What about V (A)?
  Check Your Understanding L3N3: AB Example: (Ex. 6.4,
  Sutton & Barto, 2018)
         TD Update: V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
         where rt + γV π (st+1 ) is the TD target
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) = 0.75 by TD or MC
           What about V (A)?
           Respond in Poll
  Check Your Understanding L3N3: AB Example: (Ex. 6.4,
  Sutton & Barto, 2018)
         TD Update: V π (st ) = V π (st ) + α([rt + γV π (st+1 )] − V π (st )),
         where rt + γV π (st+1 ) is the TD target
           Two states A, B with γ = 1
           Given 8 episodes of experience:
                   A, 0, B, 0
                   B, 1 (observed 6 times)
                   B, 0
         Imagine running TD updates over the data an infinite number of times
           V (B) = 0.75 by TD or MC
           What about V (A)?
         V MC (A) = 0, V TD (A) = 0.75
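
A small sketch (mine, not from the slides) that replays the 8 AB episodes with batch TD(0) and computes the first-visit MC estimates, so the two answers above can be checked numerically; the episode encoding, step size, and number of sweeps are arbitrary choices:

    import random

    # The 8 AB episodes as (state, reward) steps; gamma = 1 and every episode ends in a terminal state.
    episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]
    gamma = 1.0

    # Batch TD(0): repeatedly sample an episode from the fixed set and apply TD updates.
    V = {"A": 0.0, "B": 0.0}
    alpha = 0.01
    for _ in range(50000):
        ep = random.choice(episodes)
        for i, (s, r) in enumerate(ep):
            v_next = V[ep[i + 1][0]] if i + 1 < len(ep) else 0.0   # terminal value is 0
            V[s] += alpha * (r + gamma * v_next - V[s])
    print("batch TD:", V)   # both V(A) and V(B) approach 0.75

    # First-visit MC: average the observed returns from each state.
    returns = {"A": [], "B": []}
    for ep in episodes:
        G, first_return = 0.0, {}
        for s, r in reversed(ep):
            G = r + gamma * G
            first_return[s] = G   # overwritten so only the earliest visit's return remains
        for s, g in first_return.items():
            returns[s].append(g)
    print("MC:", {s: sum(g) / len(g) for s, g in returns.items()})   # V(A) = 0, V(B) = 0.75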
  Batch MC and TD: Convergence
           Monte Carlo in batch setting converges to min MSE (mean squared error)
                   Minimize loss with respect to observed returns
                   In AB example, V (A) = 0
           TD(0) converges to DP policy V π for the MDP with the maximum
           likelihood model estimates
           Aka same as dynamic programming with certainty equivalence!
                   Maximum likelihood Markov decision process model
                                     P̂(s ′ |s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a, sk+1 = s ′ )

                                     r̂(s, a) = (1/N(s, a)) Σ_{k=1}^{i} 1(sk = s, ak = a) rk
                   Compute V π using this model
                   In AB example, V (A) = 0.75
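
As a quick check on the AB example: the maximum likelihood model built from the 8 episodes is P̂(B|A) = 1, r̂(A) = 0, r̂(B) = 6/8 = 0.75, P̂(terminal|B) = 1, so solving the Bellman equations under this model gives V (B) = r̂(B) = 0.75 and V (A) = r̂(A) + γ P̂(B|A) V (B) = 0 + 1 · 0.75 = 0.75, which is exactly the batch TD(0) answer.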
  Some Important Properties to Evaluate Model-free Policy
  Evaluation Algorithms
           Data efficiency & Computational efficiency
           In simple TD(0), use (s, a, r , s ′ ) once to update V (s)
                   O(1) operation per update
                   In an episode of length L, O(L)
         In MC, we have to wait until the episode finishes; the update is then also O(L)
           MC can be more data efficient than simple TD
           But TD exploits Markov structure
                   If in Markov domain, leveraging this is helpful
           Dynamic programming with certainty equivalence also uses Markov structure
  Summary: Policy Evaluation
 Estimating the expected return of a particular policy if we don’t have access
 to the true MDP models, e.g., evaluating average purchases per session of a new
 product recommendation system
           Monte Carlo policy evaluation
                   Policy evaluation when we don’t have a model of how the world works
                       Given on-policy samples
                       Given off-policy samples
           Temporal Difference (TD)
           Dynamic Programming with certainty equivalence
           *Understand what MC vs TD methods compute in batch evaluations
           Metrics / Qualities to evaluate and compare algorithms
                   Uses Markov assumption
                   Accuracy / MSE / bias / variance
                   Data efficiency
                   Computational efficiency
  Today’s Plan
           Last Time:
                   Markov reward / decision processes
                   Policy evaluation & control when have true model (of how the world works)
           Today
                   Policy evaluation without known dynamics & reward models
           Next Time:
                   Control when don’t have a model of how the world works
  Optional Worked Example MC On Policy Evaluation
  Answers
   Initialize N(s) = 0, G (s) = 0 ∀s ∈ S
   Loop
           Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
         Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
           For each time step t until Ti ( the end of the episode i)
                   If this is the first time t that state s is visited in episode i
                          Increment counter of total first visits: N(s) = N(s) + 1
                          Increment total return G (s) = G (s) + Gi,t
                          Update estimate V π (s) = G (s)/N(s)
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
           Let γ < 1. Compare the first visit & every visit MC estimates of s2 .
         First visit: V MC (s2 ) = γ², Every visit: V MC (s2 ) = (γ² + γ)/2
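
A tiny sketch (mine, not part of the slides) that reproduces both numbers from this single trajectory; the state labels and the particular value of γ are just for illustration:

    # Mars rover trajectory from the worked example: (state, reward) pairs.
    gamma = 0.9   # any gamma < 1; the answers are gamma**2 and (gamma**2 + gamma) / 2
    traj = [("s3", 0), ("s2", 0), ("s2", 0), ("s1", 1)]

    # Compute the return G_t from every time step by working backwards from the end.
    G, G_t = 0.0, []
    for s, r in reversed(traj):
        G = r + gamma * G
        G_t.append((s, G))
    G_t.reverse()

    first_visit = next(G for s, G in G_t if s == "s2")        # gamma**2
    every_visit = [G for s, G in G_t if s == "s2"]
    print(first_visit, sum(every_visit) / len(every_visit))   # (gamma**2 + gamma) / 2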
  Optional Check Your Understanding L3: Incremental MC
  (State if each is True or False)
   First or Every Visit MC
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
          Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
                   For all s, for first or every time t that state s is visited in episode i
                         N(s) = N(s) + 1, G (s) = G (s) + Gi,t
                         Update estimate V π (s) = G (s)/N(s)
   Incremental MC
            Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
          Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
            for t = 1 : Ti where Ti is the length of the i-th episode
                   V π (sit ) = V π (sit ) + α(Gi,t − V π (sit ))
      1   Incremental MC with α = 1 is the same as first visit MC
      2   Incremental MC with α = 1/N(sit ) is the same as every visit MC
      3   Incremental MC with α > 1/N(sit ) could be helpful in non-stationary domains
  Optional Check Your Understanding L3 Incremental MC
  Answers
   First or Every Visit MC
          Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
          Gi,t = ri,t + γri,t+1 + γ² ri,t+2 + · · · + γ^(Ti −t) ri,Ti
                      For all s, for first or every time t that state s is visited in episode i
                               N(s) = N(s) + 1, G (s) = G (s) + Gi,t
                               Update estimate V π (s) = G (s)/N(s)
   Incremental MC
          Sample episode i = si,1 , ai,1 , ri,1 , si,2 , ai,2 , ri,2 , . . . , si,Ti , ai,Ti , ri,Ti
            for t = 1 : Ti where Ti is the length of the i-th episode
                      V π (sit ) = V π (sit ) + α(Gi,t − V π (sit ))
        1   Incremental MC with α = 1 is the same as first visit MC
            false
      2   Incremental MC with α = 1/N(sit ) is the same as every visit MC
          true
      3   Incremental MC with α > 1/N(sit ) could help in non-stationary domains
          true
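
A compact sketch (illustrative only; the function name and episode format are assumptions) of incremental every-visit MC that covers both step-size choices from the questions above:

    from collections import defaultdict

    def incremental_mc(episodes, gamma=1.0, alpha=None):
        """Incremental every-visit MC. episodes: list of [(state, reward), ...].
        alpha=None uses the running-average step size 1/N(s); a constant alpha
        weights recent episodes more heavily."""
        V = defaultdict(float)
        N = defaultdict(int)
        for ep in episodes:
            # Returns G_t computed backwards over the episode.
            G, visits = 0.0, []
            for s, r in reversed(ep):
                G = r + gamma * G
                visits.append((s, G))
            for s, G in reversed(visits):
                N[s] += 1
                step = alpha if alpha is not None else 1.0 / N[s]
                V[s] += step * (G - V[s])   # move the estimate toward the sampled return
        return dict(V)

With alpha=None this reproduces the every-visit MC running average (statement 2); a constant alpha larger than 1/N(s) forgets old returns faster, which is why statement 3 holds in non-stationary domains.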
  Certainty Equivalence V π MLE MDP Worked Example
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
         First visit MC estimate of V of each state? [1 γ γ² 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
           Optional exercise: What is the certainty equivalent estimate?
  Certainty Equivalence V π MLE MDP Worked Ex Solution
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
         First visit MC estimate of V of each state? [1 γ γ² 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
           Optional exercise: What is the certainty equivalent estimate?
         r̂ = [1 0 0 0 0 0 0], p̂(terminate|s1 , a1 ) = p̂(s2 |s3 , a1 ) = 1
         p̂(s2 |s2 , a1 ) = 0.5 = p̂(s1 |s2 , a1 )
         V = [1   0.5γ/(1 − 0.5γ)   0.5γ²/(1 − 0.5γ)   0 0 0 0]
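
The last line follows from the Bellman equations under this MLE model (a worked check): V (s1 ) = r̂(s1 ) + γ · 0 = 1; V (s2 ) = 0 + γ[0.5 V (s1 ) + 0.5 V (s2 )], so V (s2 ) = 0.5γ/(1 − 0.5γ); and V (s3 ) = 0 + γ V (s2 ) = 0.5γ²/(1 − 0.5γ).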
  Recall: Dynamic Programming for Policy Evaluation
           If we knew dynamics and reward model, we can do policy evaluation
           Initialize V0π (s) = 0 for all s
           For k = 1 until convergence
                   For all s in S
                            Vkπ (s) = r (s, π(s)) + γ Σ_{s ′ ∈S} p(s ′ |s, π(s)) Vk−1 π (s ′ )
           Vkπ (s) is exactly the k-horizon value of state s under policy π
           Vkπ (s) is an estimate of the infinite horizon value of state s under policy π
                             V π (s) = Eπ [Gt |st = s] ≈ Eπ [rt + γVk−1 (st+1 )|st = s]
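
For reference, a minimal sketch of this iteration in Python (the array shapes and stopping tolerance are my assumptions, not part of the slides):

    import numpy as np

    def dp_policy_evaluation(P, r, policy, gamma, tol=1e-8):
        """P: (S, A, S) transition probabilities, r: (S, A) rewards,
        policy: length-S integer array of actions. Assumes gamma < 1."""
        S = r.shape[0]
        V = np.zeros(S)                         # V_0(s) = 0 for all s
        P_pi = P[np.arange(S), policy]          # p(s'|s, pi(s)), shape (S, S)
        r_pi = r[np.arange(S), policy]          # r(s, pi(s)), shape (S,)
        while True:
            V_new = r_pi + gamma * P_pi @ V     # one Bellman backup under pi
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new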
  Dynamic Programming Policy Evaluation
  V π (s) ← Eπ [rt + γVk−1 (st+1 )|st = s]
           Bootstrapping: Update for V uses an estimate
  What about when we don’t know the models?
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
         First visit MC estimate of V of each state? [1 γ γ² 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
           What is the certainty equivalent estimate?
         r̂ = [1 0 0 0 0 0 0], p̂(terminate|s1 , a1 ) = p̂(s2 |s3 , a1 ) = 1
           p̂(s1 |s2 , a1 ) = .5, p̂(s2 |s2 , a1 ) = .5, V = [1 1 1 0 0 0 0]
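
As a check, plugging γ = 1 into the general solution from the earlier worked example gives V (s2 ) = 0.5/(1 − 0.5) = 1 and V (s3 ) = 0.5/(1 − 0.5) = 1, consistent with V = [1 1 1 0 0 0 0].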
  Bias/Variance of Model-free Policy Evaluation Algorithms
           Return Gt is an unbiased estimate of V π (st )
           TD target [rt + γV π (st+1 )] is a biased estimate of V π (st )
           But often much lower variance than a single return Gt
         The return is a function of a multi-step sequence of random actions, states & rewards
         The TD target involves only one random action, reward and next state
           MC
                   Unbiased (for first visit)
                   High variance
                   Consistent (converges to true) even with function approximation
           TD
                   Some bias
                   Lower variance
                   TD(0) converges to true value with tabular representation
                   TD(0) does not always converge with function approximation
            [Figure: Mars rover MDP with states s1 , . . . , s7 ; R(s1 ) = +1, R(s7 ) = +10, all other rewards 0; s1 and s7 are terminal]
           Mars rover: R = [ 1 0 0 0 0 0 +10] for any action
           π(s) = a1 ∀s, γ = 1. any action from s1 and s7 terminates episode
           Trajectory = (s3 , a1 , 0, s2 , a1 , 0, s2 , a1 , 0, s1 , a1 , 1, terminal)
           First visit MC estimate of V of each state? [1 1 1 0 0 0 0]
           TD estimate of all states (init at 0) with α = 1 is [1 0 0 0 0 0 0]
         TD(0) uses each data point (s, a, r , s ′ ) only once
         Monte Carlo uses the entire return from s to the end of the episode