UNIT – III
Finite Markov Decision
Processes
The Agent–Environment Interface, Goals and
Rewards, Returns and Episodes, Unified Notation for
Episodic and Continuing Tasks, Policies and Value
Functions, Optimal Policies and Optimal Value
Functions, Optimality and Approximation.
Finite Markov Decision
Processes
• MDPs are a mathematical framework for modeling
sequential decision-making problems in reinforcement
learning, where actions affect immediate rewards and
future states, leading to delayed rewards.
• They are used to formalize problems where an agent
learns to maximize cumulative rewards through
interaction with an environment.
• An MDP is called finite because it assumes:
– A finite number of states,
– A finite number of actions, and
– A finite set of possible rewards.
The Agent–Environment Interface
• The agent is the learner and decision-maker, while the
environment includes everything outside the agent.
• At each discrete time step t, the agent observes the
environment’s state St, selects an action At, and receives a
reward Rt+1 and the next state St+1.
• The environment’s response is governed by transition
probabilities p(s′,r∣s,a), which define the likelihood of
moving to state s′ and receiving reward r given state s and
action a.
• States and actions can be physical or abstract (e.g., mental
states like uncertainty or decisions about attention focus).
The Agent–Environment Interface
• A trajectory is a sequential list of states,
actions, and rewards (e.g. S₀, A₀, R₁, S₁, A₁, R₂,
…)
Figure: The agent–environment interaction in a Markov decision
process.
Environment Dynamics (Function p)
• The probabilistic behavior of the environment is described by:
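• In the standard notation for finite MDPs (matching the indexing used above), this dynamics function is
$$p(s', r \mid s, a) \doteq \Pr\{S_{t+1} = s',\, R_{t+1} = r \mid S_t = s,\, A_t = a\},$$
defined for all states s, s′, rewards r, and actions a.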
• This function completely characterizes the environment.
• It obeys the Markov property: the next state and reward only depend on
the current state and action, not on any earlier history.
Derived Quantities from p
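• From the four-argument function p, the most commonly used derived quantities (in the standard notation) are the state-transition probabilities and the expected rewards:
$$p(s' \mid s, a) = \sum_{r} p(s', r \mid s, a), \qquad
r(s, a) = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{r} r \sum_{s'} p(s', r \mid s, a),$$
$$r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s'] = \sum_{r} r \, \frac{p(s', r \mid s, a)}{p(s' \mid s, a)}.$$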
Abstraction and Flexibility
• Time steps can represent logical stages, not
necessarily real-time.
• States can be low-level (sensor data) or high-
level (symbolic info or memory).
• Actions can range from motor commands to
mental choices (e.g., shifting attention).
• The reward is always a scalar number
reflecting the goal.
Where is the Agent–Environment
Boundary?
• The boundary is not necessarily physical (e.g.,
robot motors and sensors are often considered
part of the environment).
• General rule:
Anything the agent cannot arbitrarily change is
part of the environment, even if the agent
knows how it works.
• This boundary marks the limit of control, not
knowledge.
Why Use the Agent–Environment
Interface?
• It simplifies goal-directed learning from
interaction into just three signals:
– State (what is perceived)
– Action (what is done)
– Reward (what is achieved)
• This abstraction has wide applicability across
robotics, games, decision-making, and more.
Examples
1. Bioreactor
2. Pick-and-Place Robot
3. Recycling Robot
1. Bioreactor
• Agent: A reinforcement learning system controlling a bioreactor (a
vat for producing chemicals).
• Environment: The bioreactor, including nutrients, bacteria, and
sensors.
• States: Sensor readings (e.g., thermocouple data) and symbolic
inputs (e.g., ingredients and target chemical).
• Actions: Setting target temperatures and stirring rates, passed to
lower-level control systems.
• Rewards: Moment-by-moment measures of the rate at which the useful
chemical is produced. (States and actions may be lists or vectors, but the
reward is always a single number.)
• Interaction: The agent adjusts temperature/stirring based on sensor
data, receiving feedback on chemical output to optimize production.
2. Pick-and-Place Robot
• Agent: A reinforcement learning system controlling a robot
arm for repetitive pick-and-place tasks.
• Environment: The robot arm and its workspace.
• States: Joint angles and velocities from low-latency sensors.
• Actions: Voltages applied to motors at each joint to control
movement.
• Rewards: Small negative rewards per time step to encourage
fast, smooth movements.
• Interaction: The agent applies voltages based on joint
positions, receiving feedback on movement smoothness to
learn optimal control.
3. Recycling Robot
• Agent: A reinforcement learning system controlling a mobile robot that collects
empty soda cans in an office.
• Environment: The office environment, including cans, obstacles, and the robot’s
battery.
• States: The robot’s battery level, with two possible states: high or low
(S = {high, low}).
• Actions:
– In both states: (1) Search (actively seek cans), (2) Wait (stay in place for cans to be brought).
– In the low state only: (3) Recharge (return to a charging station, unavailable in the high state).
• Rewards:
– Zero most of the time.
– Positive when a can is collected.
– Large negative if the battery depletes fully, requiring rescue.
• Interaction: The agent observes the battery state (e.g., high or low) and chooses
an action (e.g., search). The environment responds with a new state (e.g., battery
drops to low after searching) and a reward (e.g., positive for collecting a can). For
example, searching in the high state may keep the battery high or drop it to low,
with a probability of collecting a can and earning a positive reward.
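• A minimal sketch of how this finite MDP can be written down as a dynamics table. The parameter values alpha, beta, r_search, r_wait and the rescue penalty below are illustrative placeholders, not values given in the slides:

```python
# Recycling robot as a finite MDP: p[(state, action)] lists the possible
# outcomes (probability, next_state, reward) for that state-action pair.
alpha, beta = 0.8, 0.6           # prob. battery stays high / stays low while searching
r_search, r_wait = 2.0, 1.0      # expected cans collected while searching / waiting

p = {
    ('high', 'search'):   [(alpha, 'high', r_search), (1 - alpha, 'low', r_search)],
    ('high', 'wait'):     [(1.0, 'high', r_wait)],
    ('low',  'search'):   [(beta, 'low', r_search), (1 - beta, 'high', -3.0)],  # battery died, robot rescued
    ('low',  'wait'):     [(1.0, 'low', r_wait)],
    ('low',  'recharge'): [(1.0, 'high', 0.0)],
}

# Sanity check: the outcome probabilities for every (state, action) pair sum to 1.
assert all(abs(sum(prob for prob, _, _ in outcomes) - 1.0) < 1e-9
           for outcomes in p.values())
```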
Goals and Rewards
• In reinforcement learning, the agent’s goal is
defined in terms of a reward signal. This
reward is a scalar number (Rt) passed from the
environment to the agent at each time step.
The Agent’s Objective
• The agent’s purpose is to maximize the total
reward it receives over time, not just the
immediate reward.
• This is formalized in the reward hypothesis:
“All of what we mean by goals and purposes
can be well thought of as the maximization of
the expected value of the cumulative sum of
a received scalar signal (called reward).”
Why Rewards?
• Using rewards to define goals is a core idea in
reinforcement learning.
• Although it might seem restrictive, it’s actually
very flexible and widely applicable in
practice.
Examples of Reward Design
• Robot learning to walk: Reward is
proportional to forward motion.
• Maze escape: Reward of –1 per time step
encourages fast escape.
• Can-collecting robot: +1 per can, 0 otherwise.
• Chess-playing agent: +1 for winning, –1 for
losing, 0 for draw or in-progress.
Important Principle
• Rewards define what the agent should
achieve, not how to achieve it.
• Don’t reward subgoals (like capturing a chess
piece), because the agent might learn to
optimize those at the expense of the main
goal (winning the game).
• Subgoal rewards may mislead the agent.
Key Takeaway
• The reward signal is how we communicate
the desired behavior to the agent. It must
align exactly with the actual goal, as the agent
will learn only to maximize this signal —
nothing more, nothing less.
Returns and Episodes
• In reinforcement learning, the agent’s
objective is to maximize the cumulative
reward it receives over time. This accumulated
total is called the return, denoted by Gt.
• The return helps quantify how well the agent
is performing based on its actions over time.
Episodic Tasks and Return Definition
• In tasks that naturally break into segments—called
episodic tasks—the agent interacts with the
environment over a finite sequence of steps that
ends in a terminal state. The return in this case is
defined as the sum of rewards received from time
step t+1 until the end of the episode at time T:
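• In symbols, the (undiscounted) episodic return is
$$G_t \doteq R_{t+1} + R_{t+2} + R_{t+3} + \cdots + R_T,$$
where T is the final time step of the episode.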
• Examples of episodic tasks include playing a game,
completing a maze, or repeated attempts at a task
like pole-balancing.
Continuing Tasks and the Problem of
Infinity
• In contrast, continuing tasks have no terminal
state—the interaction goes on indefinitely.
These arise in settings like industrial control
systems or long-lived autonomous robots.
• In such cases, the straightforward sum of
future rewards could become infinite,
especially if a constant reward is received at
each step. This makes direct maximization of
return problematic.
Discounted Return
• To overcome this, reinforcement learning introduces the
concept of discounted return, which gives less weight to
rewards further in the future. The return is then defined
as:
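• In symbols, the discounted return is
$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}.$$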
• Here, γ is the discount rate, a parameter between 0 and 1.
When γ=0, the agent focuses only on immediate reward;
when γ approaches 1, the agent becomes more farsighted,
valuing long-term outcomes more strongly. Discounting
ensures the return remains finite even in continuing tasks,
provided the rewards are bounded.
Recursive Relationship
• A fundamental property of the discounted
return is its recursive structure, which is
expressed as:
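• Writing the discounted sum out and factoring γ from every term after the first gives
$$G_t = R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \cdots\right) = R_{t+1} + \gamma\, G_{t+1}.$$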
• This relation is central to many RL algorithms
and simplifies calculations and learning
updates. It holds for every time step before
termination, and is especially useful for
iterative learning methods.
Example: Pole-Balancing
• A clear example is the pole-balancing task.
• If modeled as episodic, the agent could receive a
reward of +1 at each time step until the pole falls;
hence, the return equals the number of steps before
failure.
• Alternatively, if treated as a continuing task, the agent
might receive 0 reward at each step and a penalty of –1 on failure.
• The return in this case would reflect how long the
agent can prevent failure, but in a discounted manner.
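• As a quick check of the continuing, discounted formulation: if the pole falls K steps after time t (so R_{t+K} = −1 and all other rewards are 0), then
$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} = -\gamma^{K-1},$$
so the longer the agent keeps the pole balanced, the closer its (negative) return is to zero.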
Summary
• Whether the task is episodic or continuing, the
return provides a formal objective for the agent: to
maximize cumulative reward.
• For episodic problems, this is a straightforward
sum; for continuing problems, discounting is used
to maintain tractability and reflect time preference.
• This concept of return—both undiscounted and
discounted—is fundamental to the design and
understanding of reinforcement learning systems.
Unified Notation for Episodic and
Continuing Tasks
• In reinforcement learning, tasks can be classified as either
episodic—where interactions naturally break into distinct
episodes—or continuing, where interaction goes on
indefinitely.
• Episodic tasks are often mathematically simpler because
each action affects only a finite number of future rewards
within the same episode.
• However, in practice, reinforcement learning deals with
both kinds of tasks, and sometimes even needs to handle
them simultaneously. To make this possible, it is useful to
adopt a unified notation that works for both settings.
Handling Episodes in Notation
• When modeling episodic tasks precisely, each episode is
treated as a separate finite sequence of time steps, and we
number these steps starting from zero for each episode.
• This would require distinguishing between states, actions,
and rewards using double indices such as St,i, At,i, Rt,i, where t
is the time step and i is the episode number.
• However, in practice, this level of detail is rarely needed.
• Most of the time, we deal with a single episode or express
something true across all episodes, so it is convenient to
abuse notation slightly by dropping the episode index and
simply writing St, At, and so on.
Absorbing State and Return Formulation
• The key to unifying episodic and continuing tasks lies in how we define the
return.
• In episodic tasks, the return was originally defined as a finite sum, while in
continuing tasks, it was an infinite sum possibly with discounting.
• To bring both under a single formulation, we imagine that at the end of an
episode, the agent enters a special absorbing state. This state transitions
only to itself and yields zero rewards indefinitely.
• For example, a reward sequence like +1, +1, +1, 0, 0, 0, … results from entering
the absorbing state after three steps.
• Whether we sum over the finite episode or over the full sequence
including the absorbing state, the total return remains the same, even
with discounting.
General Unified Return Expression
• With this absorbing state idea, we can now define the return in
a general form that covers both task types:
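• Using the absorbing-state convention, the unified return can be written as
$$G_t \doteq \sum_{k=t+1}^{T} \gamma^{\,k-t-1} R_k,$$
allowing either T = ∞ or γ = 1, but not both at once.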
• Here, T can be finite (in episodic tasks) or infinite (in continuing
tasks), and the discount factor γ can be equal to 1 (if the sum
still converges, such as when episodes always terminate).
• This convention simplifies notation throughout reinforcement
learning theory and reveals the deep similarities between
episodic and continuing tasks, allowing us to use a common
mathematical framework for both.
Policies and Value Functions
• In reinforcement learning, the goal of an agent
is to maximize long-term rewards. To reason
about this, we use policies and value
functions.
Policy (π)
• A policy tells an agent how to behave in an
environment.
• It’s basically a strategy or rule for choosing actions.
Formal Definition:
A policy π is a mapping from states to a probability
distribution over actions.
• This means: if you’re in state s, the probability of
choosing action a is π(a|s).
Types of Policies
• Deterministic policy: Always picks the same
action in a state.
Example: π(s) = go north.
• Stochastic policy: Chooses among actions
with certain probabilities.
Example: π(north|s) = 0.7, π(east|s) = 0.3.
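• A small sketch of how the two policy types can be represented and sampled; the state and action names here are illustrative, not taken from the text:

```python
import random

deterministic_policy = {'s1': 'north', 's2': 'east'}      # pi(s) -> a

stochastic_policy = {                                     # pi(a | s)
    's1': {'north': 0.7, 'east': 0.3},
    's2': {'north': 0.5, 'south': 0.5},
}

def act(policy, state):
    """Select an action from either kind of policy."""
    choice = policy[state]
    if isinstance(choice, str):                 # deterministic: a single action
        return choice
    actions, probs = zip(*choice.items())       # stochastic: sample from pi(.|s)
    return random.choices(actions, weights=probs, k=1)[0]

print(act(deterministic_policy, 's1'), act(stochastic_policy, 's1'))
```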
Value Functions
• Value functions answer the question:
“How good is it to be in a state (or to take an
action in a state) if I follow policy π?”
• We measure “goodness” in terms of expected
return (future rewards).
(a) State‑Value Function
• Tells us how good it is to be in a particular
state s under policy π.
• Denoted as vπ(s): the expected return Gt when starting in
state s and following π thereafter, where Gt is the total
discounted return.
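• In the standard notation, the state-value function is
$$v_\pi(s) \doteq \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s\right].$$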
(b) Action‑Value Function
• Tells us how good it is to take action a in state
s under policy π.
• Denoted as qπ(s, a).
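• In the standard notation, the action-value function is
$$q_\pi(s, a) \doteq \mathbb{E}_\pi[G_t \mid S_t = s,\, A_t = a].$$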
Relationship Between vπ(s) and qπ(s, a)
• The value of a state equals the average
action‑value, weighted by the probabilities of
each action under π:
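• In symbols:
$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a).$$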
The Bellman Equation for vπ
• The most important recursive relationship in
RL:
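• For every state s, the Bellman equation for vπ is
$$v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr].$$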
• This says: The value of a state equals the
expected reward plus the discounted value of
the next state, averaged over all actions and
outcomes.
Backup Diagram for vπ
• The arrows show possible actions from state s,
each leading to different next states with
rewards. The Bellman equation combines
them into the expected value.
Estimation
• If the agent records the average returns
observed after visiting states (or state-action
pairs), these averages converge to vπ(s) and
qπ(s, a).
• Such estimation is often done via Monte Carlo
methods or function approximation when the
state space is very large.
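• A minimal sketch of such Monte Carlo averaging, assuming episodes have already been recorded as lists of (state, reward) pairs (an illustrative data layout, not one given in the text):

```python
from collections import defaultdict

def mc_state_values(episodes, gamma=0.9):
    """Every-visit Monte Carlo estimate of v_pi from recorded episodes.

    Each episode is a list of (state, reward) pairs, where the reward is the
    one received on leaving that state (R_{t+1} paired with S_t)."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so G_t = R_{t+1} + gamma * G_{t+1}
        # can be accumulated in a single pass.
        for state, reward in reversed(episode):
            g = reward + gamma * g
            returns[state].append(g)
    # The average of the observed returns converges to v_pi(s).
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```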
Example: Gridworld
Policy
• Suppose the agent chooses all four actions with equal
probability in every state: π(a|s) = 0.25 for every state s and
each of the four actions.
• This is called an equiprobable random policy.
Environment Setup
• The world is a rectangular grid where each cell = a state.
• At each cell, the agent can take four actions:
– north, south, east, west.
• Movement rules:
– Moving off the grid keeps the agent in the same cell but gives a
reward of –1.
– Normal moves give a reward of 0, except:
• From special state A, any action gives +10 reward and moves the
agent to A′.
• From special state B, any action gives +5 reward and moves the agent
to B′.
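• As a rough illustration, vπ for the equiprobable random policy can be computed by iterative policy evaluation. The sketch below assumes the standard 5×5 version of this gridworld (A at cell (0, 1) jumping to A′ at (4, 1), B at (0, 3) jumping to B′ at (2, 3), γ = 0.9); these coordinates and the discount rate are assumptions, not given in the slides:

```python
import numpy as np

N, GAMMA = 5, 0.9
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]          # north, south, west, east

def step(r, c, dr, dc):
    """Return (next_state, reward) for one deterministic move."""
    if (r, c) == (0, 1):
        return (4, 1), 10.0                          # special state A -> A'
    if (r, c) == (0, 3):
        return (2, 3), 5.0                           # special state B -> B'
    nr, nc = r + dr, c + dc
    if 0 <= nr < N and 0 <= nc < N:
        return (nr, nc), 0.0
    return (r, c), -1.0                              # bumped into the edge

v = np.zeros((N, N))
for _ in range(1000):                                # sweep until approximately converged
    v_new = np.zeros_like(v)
    for r in range(N):
        for c in range(N):
            for dr, dc in MOVES:                     # each action has probability 1/4
                (nr, nc), reward = step(r, c, dr, dc)
                v_new[r, c] += 0.25 * (reward + GAMMA * v[nr, nc])
    v = v_new
print(np.round(v, 1))                                # values comparable to the figure in the text
```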
State-Value Function vπ
• For each state, vπ(s) is the expected long-term reward
starting from s and following the random policy.
• Near the edges, vπ(s) becomes negative, because the
agent often bumps into the grid’s edge and receives –1 penalties.
• State A has high value (close to +10) because it gives a
big immediate reward.
• State B has a value greater than 5, since after
reaching B′, there’s a good chance to later get to A
(earning +10).
State-Value Function vπ
• In the figure (right), the state in the center has
value +0.7, computed from the Bellman
equation considering its four neighbors.
Bellman Equation in Action
• For a given state s:
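• For the equiprobable random policy (each action has probability 1/4), the Bellman equation specializes to
$$v_\pi(s) = \frac{1}{4} \sum_{a} \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr].$$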
• For example, at the center state (valued at +0.7), its
neighbors have values +2.3, +0.4, –0.4, and +0.7.
• The equation balances the immediate reward
(mostly 0) and the discounted future values of
these neighbor states.
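• As a worked check (assuming γ = 0.9, the discount rate used in the standard version of this example): all four moves from the center state stay on the grid and give reward 0, so
$$v_\pi(\text{center}) = \tfrac{1}{4}\,(0.9)\,(2.3 + 0.4 - 0.4 + 0.7) = 0.675 \approx +0.7.$$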
Optimal Policies and Optimal Value
Functions
• In reinforcement learning, “solving” a problem
means finding a policy that maximizes long-
term reward.
• A policy (π) is basically the agent’s “decision
rule” — it says:
– In each state, what action should I take?
• We’re looking for optimal policies — ones
that are at least as good as every other policy,
in every state.
Comparing Policies
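• A policy π is at least as good as a policy π′ when its value is no smaller in every state:
$$\pi \ge \pi' \iff v_\pi(s) \ge v_{\pi'}(s)\ \text{for all states } s.$$
• There is always at least one policy that is at least as good as all others; any such policy is an optimal policy, written π*.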
Optimal Value Functions
• Two key “best possible” value functions are
defined:
1. Optimal State-Value Function
2. Optimal Action-Value Function
Optimal State-Value Function v*
• This is the maximum expected return you can
get starting in state s, over all possible policies
π.
• It answers:
– “If I start in state s and act optimally forever,
what’s my expected long-term reward?”
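• In symbols:
$$v_*(s) \doteq \max_{\pi} v_\pi(s) \quad \text{for all } s.$$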
Optimal Action-Value Function q*
• This is the maximum expected return from
taking action a in state s, then following an
optimal policy afterward.
• It answers:
– “If I’m in state s, take action a now, and then act
optimally, what’s my expected long-term reward?”
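• In symbols:
$$q_*(s, a) \doteq \max_{\pi} q_\pi(s, a) \quad \text{for all } s \text{ and } a.$$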
Relationship Between v* and q*
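• The two optimal value functions are linked in the standard way: q* can be written in terms of v*, and v* is the best q* value available in each state:
$$q_*(s, a) = \mathbb{E}\bigl[R_{t+1} + \gamma\, v_*(S_{t+1}) \mid S_t = s,\, A_t = a\bigr], \qquad v_*(s) = \max_{a} q_*(s, a).$$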
Bellman Optimality Equations
• These express a self-consistency rule:
– The value of a state (or state-action pair) under
optimal behavior equals the best possible
immediate reward plus the discounted value of
what comes next.
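• In symbols, the Bellman optimality equations are
$$v_*(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr],$$
$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr].$$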
Why They’re Important
• If you know v*:
– You can do a one-step lookahead to pick the best
action in each state.
• If you know q*:
– You can immediately pick the best action without
even knowing the environment dynamics.
• They’re the key to defining optimal policies —
any policy that’s greedy with respect to v* or
q* is optimal.
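• A minimal sketch of this contrast, assuming q* is stored as a dict of per-state action values and the dynamics p map each (state, action) pair to a list of (probability, next_state, reward) triples (illustrative data layouts, not from the text):

```python
def greedy_from_q(q, s):
    """With q*, the best action is read off directly -- no model of the
    environment dynamics is needed."""
    return max(q[s], key=q[s].get)

def greedy_from_v(v, s, actions, p, gamma=0.9):
    """With v*, a one-step lookahead through the dynamics p is required."""
    def backup(a):
        return sum(prob * (r + gamma * v[s_next]) for prob, s_next, r in p[(s, a)])
    return max(actions, key=backup)
```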
Backup diagrams for v* and q*
Figure: In the backup diagram for v*, the value is a max over the available actions a; in the diagram for q*, the future rewards come from a max over next actions a′ of q*(s′, a′).
Example
• Imagine playing golf:
– v*(s) tells you: “From here, if I play perfectly, I’ll
finish in about 3 strokes.”
– q*(s, a) tells you: “If I drive the ball now, and then
play perfectly afterward, I’ll finish in about 3
strokes; if I putt instead, I’ll finish in 10.”
Practical Issues
• In theory, you can solve these equations
exactly — but in real problems:
– The environment might be unknown.
– The number of states can be huge.
– You might not have enough computation power.
• So in practice, we use approximation
methods (like dynamic programming, Monte
Carlo, TD learning) to estimate v* or q*.
Optimality and Approximation
• Optimal value functions (v* and q*) and
optimal policies (π*) are well-defined
mathematically.
• In practice, computing them exactly is rarely
possible for real-world problems due to
computational and memory constraints.
• So, in RL, we settle for approximations.
Why Exact Optimality is Hard?
• Even with a perfect environment model, solving the
Bellman optimality equation exactly is computationally
prohibitive.
• Example: Chess
– Fully deterministic.
– Small compared to real life.
– But still unsolved optimally — even massive, custom-
designed computers can’t compute the optimal move for
every state.
• Real agents have time step limits → can’t do unlimited
computation before choosing an action.
Memory Constraints
• Tabular case: If the state space is small and
finite, we can store value functions in an array
(one entry per state or state–action pair).
• Large/continuous state spaces: Impossible to
store full tables → must use function
approximation (e.g., neural networks, linear
approximators).
Strategic Approximation
• Not all states are equally important:
– Some states occur rarely → making mistakes there has
little effect on total reward.
– Some states occur often → must learn to act well there.
• Example: TD-Gammon
– Plays world-class backgammon.
– Likely makes poor moves in many never-encountered
states.
– Still succeeds because it plays optimally for the states it
actually visits.
RL’s Advantage in Approximation
• RL is online and experience-driven:
– Focuses learning on frequently visited states.
– Saves effort in unimportant regions of the state
space.
• This focus distinguishes RL from many other
methods for approximating MDP solutions,
which often try to approximate over all states
equally.