UNIT – III
Finite Markov Decision
                 Processes
The Agent–Environment Interface, Goals and
Rewards, Returns and Episodes, Unified Notation for
Episodic and Continuing Tasks, Policies and Value
Functions, Optimal Policies and Optimal Value
Functions, Optimality and Approximation.
               Finite Markov Decision
                      Processes
• MDPs are a mathematical framework for modeling
  sequential decision-making problems in reinforcement
  learning, where actions affect immediate rewards and
  future states, leading to delayed rewards.
• They are used to formalize problems where an agent
  learns to maximize cumulative rewards through
  interaction with an environment.
• It's called finite because it assumes:
  – A finite number of states,
  – A finite number of actions, and
  – A finite set of possible rewards.
The Agent–Environment Interface
• The agent is the learner and decision-maker, while the
  environment includes everything outside the agent.
• At each discrete time step t, the agent observes the
  environment’s state St, selects an action At, and receives a
  reward Rt+1 and the next state St+1.
• The environment’s response is governed by transition
  probabilities p(s′,r∣s,a), which define the likelihood of
  moving to state s′ and receiving reward r given state s and
  action a.
• States and actions can be physical or abstract (e.g., mental
  states like uncertainty or decisions about attention focus).
The Agent–Environment Interface
• A trajectory is a sequential list of states,
  actions, and rewards (e.g. S₀, A₀, R₁, S₁, A₁, R₂,
  …)
Figure: The agent–environment interaction in a Markov decision
   process.
   Environment Dynamics (Function p)
• The probabilistic behavior of the environment is described by the dynamics function:
  p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}
• This function completely characterizes the environment.
• It obeys the Markov property: the next state and reward only depend on
  the current state and action, not on any earlier history.
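• A minimal sketch of the agent–environment loop in Python, sampling the next state and reward from a dynamics table p(s′, r | s, a). The two states, two actions, and probabilities below are hypothetical, chosen only for illustration:

  import random

  # Hypothetical two-state MDP used only for illustration.
  # p[(s, a)] is a list of (next_state, reward, probability) triples.
  p = {
      ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
      ("s0", "a1"): [("s1", 0.0, 1.0)],
      ("s1", "a0"): [("s0", 2.0, 1.0)],
      ("s1", "a1"): [("s1", -1.0, 1.0)],
  }

  def step(state, action):
      """Sample (next_state, reward) from p(s', r | s, a)."""
      outcomes = p[(state, action)]
      probs = [prob for _, _, prob in outcomes]
      next_state, reward, _ = random.choices(outcomes, weights=probs, k=1)[0]
      return next_state, reward

  # Agent-environment loop producing a trajectory S0, A0, R1, S1, A1, R2, ...
  state = "s0"
  trajectory = []
  for t in range(5):
      action = random.choice(["a0", "a1"])      # stand-in for the agent's policy
      next_state, reward = step(state, action)  # environment responds via p
      trajectory.append((state, action, reward))
      state = next_state
  print(trajectory)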
Derived Quantities from p
• State-transition probabilities: p(s′ | s, a) = Σr p(s′, r | s, a)
• Expected reward for a state–action pair:
  r(s, a) = E[Rt+1 | St = s, At = a] = Σr r Σs′ p(s′, r | s, a)
      Abstraction and Flexibility
• Time steps can represent logical stages, not
  necessarily real-time.
• States can be low-level (sensor data) or high-
  level (symbolic info or memory).
• Actions can range from motor commands to
  mental choices (e.g., shifting attention).
• The reward is always a scalar number
  reflecting the goal.
     Where is the Agent–Environment
                Boundary?
• The boundary is not necessarily physical (e.g.,
  robot motors and sensors are often considered
  part of the environment).
• General rule:
  Anything the agent cannot arbitrarily change is
  part of the environment, even if the agent
  knows how it works.
• This boundary marks the limit of control, not
  knowledge.
     Why Use the Agent–Environment
               Interface?
• It simplifies goal-directed learning from
  interaction into just three signals:
  – State (what is perceived)
  – Action (what is done)
  – Reward (what is achieved)
• This abstraction has wide applicability across
  robotics, games, decision-making, and more.
                Examples
1. Bioreactor
2. Pick-and-Place Robot
3. Recycling Robot
                     1. Bioreactor
• Agent: A reinforcement learning system controlling a bioreactor (a
  vat for producing chemicals).
• Environment: The bioreactor, including nutrients, bacteria, and
  sensors.
• States: Sensor readings (e.g., thermocouple data) and symbolic
  inputs (e.g., ingredients and target chemical).
• Actions: Setting target temperatures and stirring rates, passed to
  lower-level control systems.
• Rewards: Moment-by-moment measures of the rate at which the useful
  chemical is produced; unlike the states and actions (which may be vectors),
  each reward is a single number.
• Interaction: The agent adjusts temperature/stirring based on sensor
  data, receiving feedback on chemical output to optimize production.
          2. Pick-and-Place Robot
• Agent: A reinforcement learning system controlling a robot
  arm for repetitive pick-and-place tasks.
• Environment: The robot arm and its workspace.
• States: Joint angles and velocities from low-latency sensors.
• Actions: Voltages applied to motors at each joint to control
  movement.
• Rewards: Small negative rewards per time step to encourage
  fast, smooth movements.
• Interaction: The agent applies voltages based on joint
  positions, receiving feedback on movement smoothness to
  learn optimal control.
                         3. Recycling Robot
•   Agent: A reinforcement learning system controlling a mobile robot that collects
    empty soda cans in an office.
•   Environment: The office environment, including cans, obstacles, and the robot’s
    battery.
•   States: The robot’s battery level, with two possible states: high or low
    (S = {high, low}).
•   Actions:
     –   In both states: (1) Search (actively seek cans), (2) Wait (stay in place for cans to be brought).
     –   In the low state only: (3) Recharge (return to a charging station, unavailable in the high state).
•   Rewards:
     –   Zero most of the time.
     –   Positive when a can is collected.
     –   Large negative if the battery depletes fully, requiring rescue.
•   Interaction: The agent observes the battery state (e.g., high or low) and chooses
    an action (e.g., search). The environment responds with a new state (e.g., battery
    drops to low after searching) and a reward (e.g., positive for collecting a can). For
    example, searching in the high state may keep the battery high or drop it to low,
    with a probability of collecting a can and earning a positive reward.
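•   A sketch of how the recycling robot’s dynamics could be written down as a table p(s′, r | s, a) in Python. The probabilities alpha and beta and the reward values are illustrative placeholders; the slides do not give numeric values:

  # Illustrative parameters (placeholders, not specified in the slides).
  alpha, beta = 0.8, 0.6        # chance battery stays high / stays low while searching
  r_search, r_wait = 2.0, 1.0   # expected cans collected while searching / waiting
  r_rescue = -3.0               # large penalty when the battery runs flat

  # p[(state, action)] -> list of (next_state, reward, probability)
  p = {
      ("high", "search"):  [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
      ("high", "wait"):    [("high", r_wait, 1.0)],
      ("low", "search"):   [("low", r_search, beta), ("high", r_rescue, 1 - beta)],
      ("low", "wait"):     [("low", r_wait, 1.0)],
      ("low", "recharge"): [("high", 0.0, 1.0)],
  }

  # The probabilities for each (state, action) pair must sum to 1, as p requires.
  assert all(abs(sum(pr for _, _, pr in outs) - 1.0) < 1e-9 for outs in p.values())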
           Goals and Rewards
• In reinforcement learning, the agent’s goal is
  defined in terms of a reward signal. This
  reward is a scalar number (Rt) passed from the
  environment to the agent at each time step.
        The Agent’s Objective
• The agent’s purpose is to maximize the total
  reward it receives over time, not just the
  immediate reward.
• This is formalized in the reward hypothesis:
  “All of what we mean by goals and purposes
  can be well thought of as the maximization of
  the expected value of the cumulative sum of
  a received scalar signal (called reward).”
              Why Rewards?
• Using rewards to define goals is a core idea in
  reinforcement learning.
• Although it might seem restrictive, it’s actually
  very flexible and widely applicable in
  practice.
    Examples of Reward Design
• Robot learning to walk: Reward is
  proportional to forward motion.
• Maze escape: Reward of –1 per time step
  encourages fast escape.
• Can-collecting robot: +1 per can, 0 otherwise.
• Chess-playing agent: +1 for winning, –1 for
  losing, 0 for draw or in-progress.
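• A sketch of these reward schemes as Python functions; the signatures and event flags are hypothetical, used only to make the goal-to-scalar mapping concrete:

  def maze_reward(escaped: bool) -> float:
      # -1 per time step until escape encourages the fastest route out.
      return 0.0 if escaped else -1.0

  def can_collector_reward(collected_can: bool) -> float:
      return 1.0 if collected_can else 0.0

  def chess_reward(outcome: str) -> float:
      # Reward only the true goal (winning), not subgoals such as captures.
      return {"win": 1.0, "loss": -1.0, "draw": 0.0, "in_progress": 0.0}[outcome]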
          Important Principle
• Rewards define what the agent should
  achieve, not how to achieve it.
• Don’t reward subgoals (like capturing a chess
  piece), because the agent might learn to
  optimize those at the expense of the main
  goal (winning the game).
• Subgoal rewards may mislead the agent.
              Key Takeaway
• The reward signal is how we communicate
  the desired behavior to the agent. It must
  align exactly with the actual goal, as the agent
  will learn only to maximize this signal —
  nothing more, nothing less.
         Returns and Episodes
• In reinforcement learning, the agent’s
  objective is to maximize the cumulative
  reward it receives over time. This accumulated
  total is called the return, denoted by Gt.
• The return helps quantify how well the agent
  is performing based on its actions over time.
  Episodic Tasks and Return Definition
• In tasks that naturally break into segments—called
  episodic tasks—the agent interacts with the
  environment over a finite sequence of steps that
  ends in a terminal state. The return in this case is
  defined as the sum of rewards received from time
  step t+1 until the end of the episode at time T:
  Gt = Rt+1 + Rt+2 + Rt+3 + ⋯ + RT
• Examples of episodic tasks include playing a game,
  completing a maze, or repeated attempts at a task
  like pole-balancing.
   Continuing Tasks and the Problem of
                 Infinity
• In contrast, continuing tasks have no terminal
  state—the interaction goes on indefinitely.
  These arise in settings like industrial control
  systems or long-lived autonomous robots.
• In such cases, the straightforward sum of
  future rewards could become infinite,
  especially if a constant reward is received at
  each step. This makes direct maximization of
  return problematic.
              Discounted Return
• To overcome this, reinforcement learning introduces the
  concept of discounted return, which gives less weight to
  rewards further in the future. The return is then defined
  as:
  Gt = Rt+1 + γRt+2 + γ²Rt+3 + ⋯ = Σ (k = 0 to ∞) γᵏ Rt+k+1
• Here, γ is the discount rate, a parameter between 0 and 1.
  When γ=0, the agent focuses only on immediate reward;
  when γ approaches 1, the agent becomes more farsighted,
  valuing long-term outcomes more strongly. Discounting
  ensures the return remains finite even in continuing tasks,
  provided the rewards are bounded.
        Recursive Relationship
• A fundamental property of the discounted
  return is its recursive structure, which is
  expressed as:
  Gt = Rt+1 + γGt+1
• This relation is central to many RL algorithms
  and simplifies calculations and learning
  updates. It holds for every time step before
  termination, and is especially useful for
  iterative learning methods.
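• A small Python check (the reward sequence and γ below are arbitrary example values) that computes the same discounted return by the direct sum and by the recursion Gt = Rt+1 + γGt+1:

  gamma = 0.9
  rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # R1, R2, ..., R5 (arbitrary example values)

  # Direct definition: G0 = R1 + gamma*R2 + gamma^2*R3 + ...
  g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

  # Recursive form: work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
  g = 0.0
  for r in reversed(rewards):
      g = r + gamma * g

  print(g0_direct, g)   # both print the same value
  assert abs(g0_direct - g) < 1e-12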
        Example: Pole-Balancing
• A clear example is the pole-balancing task.
• If modeled as episodic, the agent could receive a
  reward of +1 at each time step until the pole falls;
  hence, the return equals the number of steps before
  failure.
• Alternatively, if treated as a continuing task, the agent
  might receive 0 reward at each step and a penalty of –1
  on each failure.
• The return in this case would reflect how long the
  agent can prevent failure, but in a discounted manner.
                   Summary
• Whether the task is episodic or continuing, the
  return provides a formal objective for the agent: to
  maximize cumulative reward.
• For episodic problems, this is a straightforward
  sum; for continuing problems, discounting is used
  to maintain tractability and reflect time preference.
• This concept of return—both undiscounted and
  discounted—is fundamental to the design and
  understanding of reinforcement learning systems.
       Unified Notation for Episodic and
               Continuing Tasks
• In reinforcement learning, tasks can be classified as either
  episodic—where interactions naturally break into distinct
  episodes—or continuing, where interaction goes on
  indefinitely.
• Episodic tasks are often mathematically simpler because
  each action affects only a finite number of future rewards
  within the same episode.
• However, in practice, reinforcement learning deals with
  both kinds of tasks, and sometimes even needs to handle
  them simultaneously. To make this possible, it is useful to
  adopt a unified notation that works for both settings.
    Handling Episodes in Notation
• When modeling episodic tasks precisely, each episode is
  treated as a separate finite sequence of time steps, and we
  number these steps starting from zero for each episode.
• This would require distinguishing between states, actions,
  and rewards using double indices such as St,i, At,i, Rt,i, where t
  is the time step and i is the episode number.
• However, in practice, this level of detail is rarely needed.
• Most of the time, we deal with a single episode or express
  something true across all episodes, so it is convenient to
  abuse notation slightly by dropping the episode index and
  simply writing St, At, and so on.
 Absorbing State and Return Formulation
• The key to unifying episodic and continuing tasks lies in how we define the
  return.
• In episodic tasks, the return was originally defined as a finite sum, while in
  continuing tasks, it was an infinite sum possibly with discounting.
• To bring both under a single formulation, we imagine that at the end of an
  episode, the agent enters a special absorbing state. This state transitions
  only to itself and yields zero rewards indefinitely.
• For example, a reward sequence like +1, +1, +1, 0, 0, 0, … results from entering
  the absorbing state after three steps.
• Whether we sum over the finite episode or over the full sequence
  including the absorbing state, the total return remains the same, even
  with discounting.
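• A quick numerical check of this point in Python, using the +1, +1, +1 reward sequence above and an arbitrary discount factor:

  gamma = 0.9
  episode = [1.0, 1.0, 1.0]        # rewards up to the terminal state
  padded = episode + [0.0] * 100   # keep going in the absorbing state with zero reward

  def discounted_return(rs, gamma):
      return sum(gamma**k * r for k, r in enumerate(rs))

  # Both sums give the same return, so the episodic and continuing forms agree.
  print(discounted_return(episode, gamma), discounted_return(padded, gamma))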
General Unified Return Expression
• With this absorbing state idea, we can now define the return in
  a general form that covers both task types:
  Gt = Σ (k = t+1 to T) γ^(k−t−1) Rk
• Here, T can be finite (in episodic tasks) or infinite (in continuing
  tasks), and the discount factor γ can be equal to 1 (if the sum
  still converges, such as when episodes always terminate).
• This convention simplifies notation throughout reinforcement
  learning theory and reveals the deep similarities between
  episodic and continuing tasks, allowing us to use a common
  mathematical framework for both.
    Policies and Value Functions
• In reinforcement learning, the goal of an agent
  is to maximize long-term rewards. To reason
  about this, we use policies and value
  functions.
                    Policy (π)
• A policy tells an agent how to behave in an
  environment.
• It’s basically a strategy or rule for choosing actions.
  Formal Definition:
  A policy π is a mapping from states to a probability
  distribution over actions.
• This means: if you’re in state s, the probability of
  choosing action a is π(a|s).
           Types of Policies
• Deterministic policy: Always picks the same
  action in a state.
  Example: π(s) = go north.
• Stochastic policy: Chooses among actions
  with certain probabilities.
  Example: π(north|s) = 0.7, π(east|s) = 0.3.
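• A sketch of how the two kinds of policy might be represented in Python (the state name and probabilities are the illustrative ones from the example above):

  import random

  # Deterministic policy: each state maps to exactly one action.
  deterministic_pi = {"s": "north"}

  # Stochastic policy: each state maps to a distribution over actions, pi(a|s).
  stochastic_pi = {"s": {"north": 0.7, "east": 0.3}}

  def sample_action(pi, state):
      """Draw an action from pi(.|state) for a stochastic policy."""
      actions, probs = zip(*pi[state].items())
      return random.choices(actions, weights=probs, k=1)[0]

  print(deterministic_pi["s"], sample_action(stochastic_pi, "s"))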
             Value Functions
• Value functions answer the question:
  “How good is it to be in a state (or to take an
  action in a state) if I follow policy π?”
• We measure “goodness” in terms of expected
  return (future rewards).
       (a) State‑Value Function
• Tells us how good it is to be in a particular
  state s under policy π.
• Denoted as vπ(s):
  vπ(s) = Eπ[ Gt | St = s ]
  where Gt = Rt+1 + γRt+2 + γ²Rt+3 + … is the total discounted return.
      (b) Action‑Value Function
• Tells us how good it is to take action a in state
  s under policy π.
• Denoted as qπ(s, a):
  qπ(s, a) = Eπ[ Gt | St = s, At = a ]
Relationship Between vπ(s) and qπ(s, a)
• The value of a state equals the average
  action‑value, weighted by the probabilities of
  each action under π:
  vπ(s) = Σa π(a|s) qπ(s, a)
    The Bellman Equation for vπ
• The most important recursive relationship in RL:
  vπ(s) = Σa π(a|s) Σs′,r p(s′, r | s, a) [ r + γ vπ(s′) ]
• This says: The value of a state equals the
  expected reward plus the discounted value of
  the next state, averaged over all actions and
  outcomes.
        Backup Diagram for vπ
• The arrows show possible actions from state s,
  each leading to different next states with
  rewards. The Bellman equation combines
  them into the expected value.
                Estimation
• If the agent records the average returns
  observed after visiting states (or state-action
  pairs), these averages converge to vπ(s) and
  qπ(s, a).
• Such estimation is often done via Monte Carlo
  methods or function approximation when the
  state space is very large.
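• A minimal Monte Carlo sketch of this averaging idea. The env and policy objects are hypothetical: env.reset() is assumed to return a start state, env.step(state, action) to return (next_state, reward, done), and policy(state) to return an action:

  from collections import defaultdict

  def mc_estimate_v(env, policy, num_episodes, gamma=0.9):
      """Estimate v_pi(s) by averaging the returns observed after each visit to s."""
      returns_sum = defaultdict(float)
      returns_count = defaultdict(int)
      for _ in range(num_episodes):
          # Generate one episode as a list of (state, reward that followed).
          episode = []
          state, done = env.reset(), False
          while not done:
              action = policy(state)
              next_state, reward, done = env.step(state, action)
              episode.append((state, reward))
              state = next_state
          # Walk backwards, accumulating G_t = R_{t+1} + gamma * G_{t+1}.
          g = 0.0
          for state, reward in reversed(episode):
              g = reward + gamma * g
              returns_sum[state] += g
              returns_count[state] += 1
      return {s: returns_sum[s] / returns_count[s] for s in returns_sum}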
Example: Gridworld
                   Policy
• Suppose the agent chooses all four actions
  with equal probability in every state:
  π(a|s) = 0.25 for every action a and every state s.
• This is called an equiprobable random policy.
               Environment Setup
• The world is a rectangular grid where each cell = a state.
• At each cell, the agent can take four actions:
  – north, south, east, west.
• Movement rules:
  – Moving off the grid keeps the agent in the same cell but gives a
    reward of –1.
  – Normal moves give a reward of 0, except:
    •   From special state A, any action gives +10 reward and moves the
        agent to A′.
    •   From special state B, any action gives +5 reward and moves the agent
        to B′.
         State-Value Function vπ
• For each state, vπ(s) is the expected long-term reward
  starting from s and following the random policy.
• Near the edges, vπ(s) becomes negative, because the
  agent often bumps into the grid’s edge and receives
  –1 penalties.
• State A has high value (close to +10) because it gives a
  big immediate reward.
• State B has a value greater than 5, since after
  reaching B′, there’s a good chance to later get to A
  (earning +10).
        State-Value Function vπ
• In the figure (right), the state in the center has
  value +0.7, computed from the Bellman
  equation considering its four neighbors.
     Bellman Equation in Action
• For a given state s, the Bellman equation reads:
  vπ(s) = Σa π(a|s) Σs′,r p(s′, r | s, a) [ r + γ vπ(s′) ]
• For example, at the center state (valued at
  +0.7), its four neighbors have values +2.3, +0.4,
  –0.4, and +0.7.
• The equation balances the immediate reward
  (mostly 0) and the discounted future values of
  these neighbor states.
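• A Python sketch that evaluates the equiprobable random policy on such a gridworld by sweeping the Bellman equation until the values settle. The 5×5 grid size, the positions of A, A′, B, B′, and γ = 0.9 are assumptions made for illustration; the slides do not fix them:

  import numpy as np

  N, GAMMA = 5, 0.9
  A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)   # assumed layout
  ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # north, south, west, east

  def step(state, action):
      """Deterministic gridworld dynamics as described on the earlier slide."""
      if state == A:
          return A_PRIME, 10.0
      if state == B:
          return B_PRIME, 5.0
      r, c = state[0] + action[0], state[1] + action[1]
      if 0 <= r < N and 0 <= c < N:
          return (r, c), 0.0
      return state, -1.0                                    # bumped into the edge

  v = np.zeros((N, N))
  for _ in range(1000):                                     # iterate the Bellman equation
      v_new = np.zeros_like(v)
      for r in range(N):
          for c in range(N):
              for a in ACTIONS:                             # pi(a|s) = 0.25 for each action
                  (nr, nc), reward = step((r, c), a)
                  v_new[r, c] += 0.25 * (reward + GAMMA * v[nr, nc])
      v = v_new
  print(np.round(v, 1))   # with this assumed layout, A comes out near +8.8 and B above +5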
    Optimal Policies and Optimal Value
                Functions
• In reinforcement learning, “solving” a problem
  means finding a policy that maximizes long-
  term reward.
• A policy (π) is basically the agent’s “decision
  rule” — it says:
  – In each state, what action should I take?
• We’re looking for optimal policies — ones
  that are at least as good as every other policy,
  in every state.
Comparing Policies
• A policy π is at least as good as a policy π′ (written π ≥ π′)
  if and only if vπ(s) ≥ vπ′(s) for every state s.
• An optimal policy π* is one that is at least as good as every
  other policy; at least one optimal policy always exists, and all
  optimal policies share the same value functions.
      Optimal Value Functions
• Two key “best possible” value functions are
  defined:
1. Optimal State-Value Function
2. Optimal Action-Value Function
 Optimal State-Value Function v*
• This is the maximum expected return you can
  get starting in state s, over all possible policies π:
  v*(s) = maxπ vπ(s)
• It answers:
   – “If I start in state s and act optimally forever,
     what’s my expected long-term reward?”
Optimal Action-Value Function q*
• This is the maximum expected return from
  taking action a in state s, then following an
  optimal policy afterward:
  q*(s, a) = maxπ qπ(s, a)
• It answers:
  – “If I’m in state s, take action a now, and then act
    optimally, what’s my expected long-term reward?”
Relationship Between v* and q*
• q*(s, a) = E[ Rt+1 + γ v*(St+1) | St = s, At = a ]
• v*(s) = maxa q*(s, a)
   Bellman Optimality Equations
• These express a self-consistency rule:
  – The value of a state (or state–action pair) under
    optimal behavior equals the best possible
    immediate reward plus the discounted value of
    what comes next.
• v*(s) = maxa Σs′,r p(s′, r | s, a) [ r + γ v*(s′) ]
• q*(s, a) = Σs′,r p(s′, r | s, a) [ r + γ maxa′ q*(s′, a′) ]
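• Turning the v* equation into a repeated update gives a simple value-iteration sketch in Python; the two-state MDP table p below is hypothetical:

  # p[(s, a)] -> list of (next_state, reward, probability); hypothetical toy MDP.
  p = {
      ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
      ("s0", "a1"): [("s1", 0.0, 1.0)],
      ("s1", "a0"): [("s0", 2.0, 1.0)],
      ("s1", "a1"): [("s1", -1.0, 1.0)],
  }
  states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

  v = {s: 0.0 for s in states}
  for _ in range(200):   # repeatedly apply the Bellman optimality update
      v = {s: max(sum(pr * (r + gamma * v[s2]) for s2, r, pr in p[(s, a)])
                  for a in actions)
           for s in states}
  print(v)   # approximates v*(s) for this toy MDP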
       Why They’re Important
• If you know v*:
  – You can do a one-step lookahead to pick the best
    action in each state.
• If you know q*:
  – You can immediately pick the best action without
    even knowing the environment dynamics.
• They’re the key to defining optimal policies —
  any policy that’s greedy with respect to v* or
  q* is optimal.
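• A sketch of both ideas in Python, using small hypothetical tables for q*, v*, and the dynamics p:

  # Hypothetical optimal action-values q*(s, a) for one state of interest.
  q_star = {
      ("s0", "left"): 1.0, ("s0", "right"): 2.5,
  }

  def greedy_from_q(q, state, actions):
      """With q*, the best action is read off directly -- no model of p is needed."""
      return max(actions, key=lambda a: q[(state, a)])

  # With only v*, a one-step lookahead through the dynamics p(s', r | s, a) is needed.
  v_star = {"s0": 2.5, "s1": 0.3}
  p = {  # p[(s, a)] -> list of (next_state, reward, probability); hypothetical values
      ("s0", "left"):  [("s1", 0.0, 1.0)],
      ("s0", "right"): [("s0", 1.0, 0.5), ("s1", 2.0, 0.5)],
  }

  def greedy_from_v(v, p, state, actions, gamma=0.9):
      def lookahead(a):
          return sum(pr * (r + gamma * v[s2]) for s2, r, pr in p[(state, a)])
      return max(actions, key=lookahead)

  print(greedy_from_q(q_star, "s0", ["left", "right"]),
        greedy_from_v(v_star, p, "s0", ["left", "right"]))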
Backup diagrams for v* and q*
Figure: Backup diagrams for v* and q*. Unlike the backup for vπ, these
take a maximum over actions (for v*) and over the next action a′ of
q*(s′, a′) (for q*), rather than an expectation under a policy.
                     Example
• Imagine playing golf:
  – v*(s) tells you: “From here, if I play perfectly, I’ll
    finish in about 3 strokes.”
  – q*(s, a) tells you: “If I drive the ball now, and then
    play perfectly afterward, I’ll finish in about 3
    strokes; if I putt instead, I’ll finish in 10.”
             Practical Issues
• In theory, you can solve these equations
  exactly — but in real problems:
  – The environment might be unknown.
  – The number of states can be huge.
  – You might not have enough computation power.
• So in practice, we use approximation
  methods (like dynamic programming, Monte
  Carlo, TD learning) to estimate v* or q*.
  Optimality and Approximation
• Optimal value functions (v* and q*) and
  optimal policies (π*) are well-defined
  mathematically.
• In practice, computing them exactly is rarely
  possible for real-world problems due to
  computational and memory constraints.
• So, in RL, we settle for approximations.
   Why Exact Optimality is Hard?
• Even with a perfect environment model, solving the
  Bellman optimality equation exactly is computationally
  prohibitive.
• Example: Chess
  – Fully deterministic.
  – Small compared to real life.
  – But still unsolved optimally — even massive, custom-
    designed computers can’t compute the optimal move for
    every state.
• Real agents have time step limits → can’t do unlimited
  computation before choosing an action.
          Memory Constraints
• Tabular case: If the state space is small and
  finite, we can store value functions in an array
  (one entry per state or state–action pair).
• Large/continuous state spaces: Impossible to
  store full tables → must use function
  approximation (e.g., neural networks, linear
  approximators).
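• A minimal sketch of the linear-approximation idea in Python: instead of one table entry per state, the value is a dot product of a feature vector and a small weight vector. The feature map, states, and step size below are hypothetical:

  import numpy as np

  def features(state):
      """Hypothetical feature vector for a state described by two numbers."""
      x, y = state
      return np.array([1.0, x, y, x * y])

  w = np.zeros(4)   # one weight per feature, no matter how many states exist

  def v_hat(state):
      return features(state) @ w

  def update(state, target_return, alpha=0.1):
      """Gradient-style step moving v_hat(state) toward an observed return."""
      global w
      w += alpha * (target_return - v_hat(state)) * features(state)

  update((0.5, -1.0), target_return=2.0)
  print(v_hat((0.5, -1.0)))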
        Strategic Approximation
• Not all states are equally important:
  – Some states occur rarely → making mistakes there has
    little effect on total reward.
  – Some states occur often → must learn to act well there.
• Example: TD-Gammon
  – Plays world-class backgammon.
  – Likely makes poor moves in many never-encountered
    states.
  – Still succeeds because it plays optimally for the states it
    actually visits.
 RL’s Advantage in Approximation
• RL is online and experience-driven:
  – Focuses learning on frequently visited states.
  – Saves effort in unimportant regions of the state
    space.
• This focus distinguishes RL from many other
  methods for approximating MDP solutions,
  which often try to approximate over all states
  equally.