UNIT – III
Finite Markov Decision
                 Processes
The Agent–Environment Interface, Goals and
Rewards, Returns and Episodes, Unified Notation for
Episodic and Continuing Tasks, Policies and Value
Functions, Optimal Policies and Optimal Value
Functions, Optimality and Approximation.
               Finite Markov Decision
                      Processes
• MDPs are a mathematical framework for modeling
  sequential decision-making problems in reinforcement
  learning, where actions affect immediate rewards and
  future states, leading to delayed rewards.
• They are used to formalize problems where an agent
  learns to maximize cumulative rewards through
  interaction with an environment.
• It's called finite because it assumes:
  – A finite number of states,
  – A finite number of actions, and
  – A finite set of possible rewards.
The Agent–Environment Interface
• The agent is the learner and decision-maker, while the
  environment includes everything outside the agent.
• At each discrete time step t, the agent observes the
  environment’s state St, selects an action At, and receives a
  reward Rt+1 and the next state St+1.
• The environment’s response is governed by transition
  probabilities p(s′,r∣s,a), which define the likelihood of
  moving to state s′ and receiving reward r given state s and
  action a.
• States and actions can be physical or abstract (e.g., mental
  states like uncertainty or decisions about attention focus).
The Agent–Environment Interface
• A trajectory is a sequential list of states,
  actions, and rewards (e.g. S₀, A₀, R₁, S₁, A₁, R₂,
  …)
Figure: The agent–environment interaction in a Markov decision
   process.
   Environment Dynamics (Function p)
• The probabilistic behavior of the environment is described by the dynamics function:
  p(s′, r | s, a) = Pr{St+1 = s′, Rt+1 = r | St = s, At = a}
• This function completely characterizes the environment.
• It obeys the Markov property: the next state and reward only depend on
  the current state and action, not on any earlier history.
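• A minimal sketch of the agent–environment loop in Python, sampling the next state and reward from a dynamics table p(s′, r | s, a). The two states, two actions, and probabilities below are hypothetical, chosen only for illustration:

  import random

  # Hypothetical two-state MDP used only for illustration.
  # p[(s, a)] is a list of (next_state, reward, probability) triples.
  p = {
      ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
      ("s0", "a1"): [("s1", 0.0, 1.0)],
      ("s1", "a0"): [("s0", 2.0, 1.0)],
      ("s1", "a1"): [("s1", -1.0, 1.0)],
  }

  def step(state, action):
      """Sample (next_state, reward) from p(s', r | s, a)."""
      outcomes = p[(state, action)]
      probs = [prob for _, _, prob in outcomes]
      next_state, reward, _ = random.choices(outcomes, weights=probs, k=1)[0]
      return next_state, reward

  # Agent-environment loop producing a trajectory S0, A0, R1, S1, A1, R2, ...
  state = "s0"
  trajectory = []
  for t in range(5):
      action = random.choice(["a0", "a1"])      # stand-in for the agent's policy
      next_state, reward = step(state, action)  # environment responds via p
      trajectory.append((state, action, reward))
      state = next_state
  print(trajectory)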
Derived Quantities from p
• State-transition probabilities: p(s′ | s, a) = Σr p(s′, r | s, a)
• Expected reward for a state–action pair:
  r(s, a) = E[Rt+1 | St = s, At = a] = Σr r Σs′ p(s′, r | s, a)
      Abstraction and Flexibility
• Time steps can represent logical stages, not
  necessarily real-time.
• States can be low-level (sensor data) or high-
  level (symbolic info or memory).
• Actions can range from motor commands to
  mental choices (e.g., shifting attention).
• The reward is always a scalar number
  reflecting the goal.
     Where is the Agent–Environment
                Boundary?
• The boundary is not necessarily physical (e.g.,
  robot motors and sensors are often considered
  part of the environment).
• General rule:
  Anything the agent cannot arbitrarily change is
  part of the environment, even if the agent
  knows how it works.
• This boundary marks the limit of control, not
  knowledge.
     Why Use the Agent–Environment
               Interface?
• It simplifies goal-directed learning from
  interaction into just three signals:
  – State (what is perceived)
  – Action (what is done)
  – Reward (what is achieved)
• This abstraction has wide applicability across
  robotics, games, decision-making, and more.
                Examples
1. Bioreactor
2. Pick-and-Place Robot
3. Recycling Robot
                     1. Bioreactor
• Agent: A reinforcement learning system controlling a bioreactor (a
  vat for producing chemicals).
• Environment: The bioreactor, including nutrients, bacteria, and
  sensors.
• States: Sensor readings (e.g., thermocouple data) and symbolic
  inputs (e.g., ingredients and target chemical).
• Actions: Setting target temperatures and stirring rates, passed to
  lower-level control systems.
• Rewards: Moment-by-moment measures of the rate at which the useful
  chemical is produced; unlike the states and actions (which may be vectors),
  each reward is a single number.
• Interaction: The agent adjusts temperature/stirring based on sensor
  data, receiving feedback on chemical output to optimize production.
          2. Pick-and-Place Robot
• Agent: A reinforcement learning system controlling a robot
  arm for repetitive pick-and-place tasks.
• Environment: The robot arm and its workspace.
• States: Joint angles and velocities from low-latency sensors.
• Actions: Voltages applied to motors at each joint to control
  movement.
• Rewards: Small negative rewards per time step to encourage
  fast, smooth movements.
• Interaction: The agent applies voltages based on joint
  positions, receiving feedback on movement smoothness to
  learn optimal control.
                         3. Recycling Robot
•   Agent: A reinforcement learning system controlling a mobile robot that collects
    empty soda cans in an office.
•   Environment: The office environment, including cans, obstacles, and the robot’s
    battery.
•   States: The robot’s battery level, with two possible states: high or low
    (S = {high, low}).
•   Actions:
     –   In both states: (1) Search (actively seek cans), (2) Wait (stay in place for cans to be brought).
     –   In the low state only: (3) Recharge (return to a charging station, unavailable in the high state).
•   Rewards:
     –   Zero most of the time.
     –   Positive when a can is collected.
     –   Large negative if the battery depletes fully, requiring rescue.
•   Interaction: The agent observes the battery state (e.g., high or low) and chooses
    an action (e.g., search). The environment responds with a new state (e.g., battery
    drops to low after searching) and a reward (e.g., positive for collecting a can). For
    example, searching in the high state may keep the battery high or drop it to low,
    with a probability of collecting a can and earning a positive reward.
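•   A sketch of how the recycling robot’s dynamics could be written down as a table p(s′, r | s, a) in Python. The probabilities alpha and beta and the reward values are illustrative placeholders; the slides do not give numeric values:

  # Illustrative parameters (placeholders, not specified in the slides).
  alpha, beta = 0.8, 0.6        # chance battery stays high / stays low while searching
  r_search, r_wait = 2.0, 1.0   # expected cans collected while searching / waiting
  r_rescue = -3.0               # large penalty when the battery runs flat

  # p[(state, action)] -> list of (next_state, reward, probability)
  p = {
      ("high", "search"):  [("high", r_search, alpha), ("low", r_search, 1 - alpha)],
      ("high", "wait"):    [("high", r_wait, 1.0)],
      ("low", "search"):   [("low", r_search, beta), ("high", r_rescue, 1 - beta)],
      ("low", "wait"):     [("low", r_wait, 1.0)],
      ("low", "recharge"): [("high", 0.0, 1.0)],
  }

  # The probabilities for each (state, action) pair must sum to 1, as p requires.
  assert all(abs(sum(pr for _, _, pr in outs) - 1.0) < 1e-9 for outs in p.values())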
           Goals and Rewards
• In reinforcement learning, the agent’s goal is
  defined in terms of a reward signal. This
  reward is a scalar number (Rt) passed from the
  environment to the agent at each time step.
        The Agent’s Objective
• The agent’s purpose is to maximize the total
  reward it receives over time, not just the
  immediate reward.
• This is formalized in the reward hypothesis:
  “All of what we mean by goals and purposes
  can be well thought of as the maximization of
  the expected value of the cumulative sum of
  a received scalar signal (called reward).”
              Why Rewards?
• Using rewards to define goals is a core idea in
  reinforcement learning.
• Although it might seem restrictive, it’s actually
  very flexible and widely applicable in
  practice.
    Examples of Reward Design
• Robot learning to walk: Reward is
  proportional to forward motion.
• Maze escape: Reward of –1 per time step
  encourages fast escape.
• Can-collecting robot: +1 per can, 0 otherwise.
• Chess-playing agent: +1 for winning, –1 for
  losing, 0 for draw or in-progress.
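• A sketch of these reward schemes as Python functions; the signatures and event flags are hypothetical, used only to make the goal-to-scalar mapping concrete:

  def maze_reward(escaped: bool) -> float:
      # -1 per time step until escape encourages the fastest route out.
      return 0.0 if escaped else -1.0

  def can_collector_reward(collected_can: bool) -> float:
      return 1.0 if collected_can else 0.0

  def chess_reward(outcome: str) -> float:
      # Reward only the true goal (winning), not subgoals such as captures.
      return {"win": 1.0, "loss": -1.0, "draw": 0.0, "in_progress": 0.0}[outcome]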
          Important Principle
• Rewards define what the agent should
  achieve, not how to achieve it.
• Don’t reward subgoals (like capturing a chess
  piece), because the agent might learn to
  optimize those at the expense of the main
  goal (winning the game).
• Subgoal rewards may mislead the agent.
              Key Takeaway
• The reward signal is how we communicate
  the desired behavior to the agent. It must
  align exactly with the actual goal, as the agent
  will learn only to maximize this signal —
  nothing more, nothing less.
         Returns and Episodes
• In reinforcement learning, the agent’s
  objective is to maximize the cumulative
  reward it receives over time. This accumulated
  total is called the return, denoted by Gt.
• The return helps quantify how well the agent
  is performing based on its actions over time.
  Episodic Tasks and Return Definition
• In tasks that naturally break into segments—called
  episodic tasks—the agent interacts with the
  environment over a finite sequence of steps that
  ends in a terminal state. The return in this case is
  defined as the sum of rewards received from time
  step t+1 until the end of the episode at time T:
  Gt = Rt+1 + Rt+2 + Rt+3 + ⋯ + RT
• Examples of episodic tasks include playing a game,
  completing a maze, or repeated attempts at a task
  like pole-balancing.
   Continuing Tasks and the Problem of
                 Infinity
• In contrast, continuing tasks have no terminal
  state—the interaction goes on indefinitely.
  These arise in settings like industrial control
  systems or long-lived autonomous robots.
• In such cases, the straightforward sum of
  future rewards could become infinite,
  especially if a constant reward is received at
  each step. This makes direct maximization of
  return problematic.
              Discounted Return
• To overcome this, reinforcement learning introduces the
  concept of discounted return, which gives less weight to
  rewards further in the future. The return is then defined
  as:
  Gt = Rt+1 + γRt+2 + γ²Rt+3 + ⋯ = Σ (k = 0 to ∞) γᵏ Rt+k+1
• Here, γ is the discount rate, a parameter between 0 and 1.
  When γ=0, the agent focuses only on immediate reward;
  when γ approaches 1, the agent becomes more farsighted,
  valuing long-term outcomes more strongly. Discounting
  ensures the return remains finite even in continuing tasks,
  provided the rewards are bounded.
        Recursive Relationship
• A fundamental property of the discounted
  return is its recursive structure, which is
  expressed as:
  Gt = Rt+1 + γGt+1
• This relation is central to many RL algorithms
  and simplifies calculations and learning
  updates. It holds for every time step before
  termination, and is especially useful for
  iterative learning methods.
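• A small Python check (the reward sequence and γ below are arbitrary example values) that computes the same discounted return by the direct sum and by the recursion Gt = Rt+1 + γGt+1:

  gamma = 0.9
  rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # R1, R2, ..., R5 (arbitrary example values)

  # Direct definition: G0 = R1 + gamma*R2 + gamma^2*R3 + ...
  g0_direct = sum(gamma**k * r for k, r in enumerate(rewards))

  # Recursive form: work backwards using G_t = R_{t+1} + gamma * G_{t+1}.
  g = 0.0
  for r in reversed(rewards):
      g = r + gamma * g

  print(g0_direct, g)   # both print the same value
  assert abs(g0_direct - g) < 1e-12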
        Example: Pole-Balancing
• A clear example is the pole-balancing task.
• If modeled as episodic, the agent could receive a
  reward of +1 at each time step until the pole falls;
  hence, the return equals the number of steps before
  failure.
• Alternatively, if treated as a continuing task, the agent
  might receive 0 reward at each step and a penalty of –1
  on each failure.
• The return in this case would reflect how long the
  agent can prevent failure, but in a discounted manner.
                   Summary
• Whether the task is episodic or continuing, the
  return provides a formal objective for the agent: to
  maximize cumulative reward.
• For episodic problems, this is a straightforward
  sum; for continuing problems, discounting is used
  to maintain tractability and reflect time preference.
• This concept of return—both undiscounted and
  discounted—is fundamental to the design and
  understanding of reinforcement learning systems.
       Unified Notation for Episodic and
               Continuing Tasks
• In reinforcement learning, tasks can be classified as either
  episodic—where interactions naturally break into distinct
  episodes—or continuing, where interaction goes on
  indefinitely.
• Episodic tasks are often mathematically simpler because
  each action affects only a finite number of future rewards
  within the same episode.
• However, in practice, reinforcement learning deals with
  both kinds of tasks, and sometimes even needs to handle
  them simultaneously. To make this possible, it is useful to
  adopt a unified notation that works for both settings.
    Handling Episodes in Notation
• When modeling episodic tasks precisely, each episode is
  treated as a separate finite sequence of time steps, and we
  number these steps starting from zero for each episode.
• This would require distinguishing between states, actions,
  and rewards using double indices such as St,i, At,i, Rt,i, where t
  is the time step and i is the episode number.
• However, in practice, this level of detail is rarely needed.
• Most of the time, we deal with a single episode or express
  something true across all episodes, so it is convenient to
  abuse notation slightly by dropping the episode index and
  simply writing St, At, and so on.
 Absorbing State and Return Formulation
• The key to unifying episodic and continuing tasks lies in how we define the
  return.
• In episodic tasks, the return was originally defined as a finite sum, while in
  continuing tasks, it was an infinite sum possibly with discounting.
• To bring both under a single formulation, we imagine that at the end of an
  episode, the agent enters a special absorbing state. This state transitions
  only to itself and yields zero rewards indefinitely.
• For example, a reward sequence like +1, +1, +1, 0, 0, 0, … results from entering
  the absorbing state after three steps.
• Whether we sum over the finite episode or over the full sequence
  including the absorbing state, the total return remains the same, even
  with discounting.
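• A quick numerical check of this point in Python, using the +1, +1, +1 reward sequence above and an arbitrary discount factor:

  gamma = 0.9
  episode = [1.0, 1.0, 1.0]        # rewards up to the terminal state
  padded = episode + [0.0] * 100   # keep going in the absorbing state with zero reward

  def discounted_return(rs, gamma):
      return sum(gamma**k * r for k, r in enumerate(rs))

  # Both sums give the same return, so the episodic and continuing forms agree.
  print(discounted_return(episode, gamma), discounted_return(padded, gamma))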
General Unified Return Expression
• With this absorbing state idea, we can now define the return in
  a general form that covers both task types:
  Gt = Σ (k = t+1 to T) γ^(k−t−1) Rk
• Here, T can be finite (in episodic tasks) or infinite (in continuing
  tasks), and the discount factor γ can be equal to 1 (if the sum
  still converges, such as when episodes always terminate).
• This convention simplifies notation throughout reinforcement
  learning theory and reveals the deep similarities between
  episodic and continuing tasks, allowing us to use a common
  mathematical framework for both.
    Policies and Value Functions
• In reinforcement learning, the goal of an agent
  is to maximize long-term rewards. To reason
  about this, we use policies and value
  functions.
                    Policy (π)
• A policy tells an agent how to behave in an
  environment.
• It’s basically a strategy or rule for choosing actions.
  Formal Definition:
  A policy π is a mapping from states to a probability
  distribution over actions.
• This means: if you’re in state s, the probability of
  choosing action a is π(a|s).
           Types of Policies
• Deterministic policy: Always picks the same
  action in a state.
  Example: π(s) = go north.
• Stochastic policy: Chooses among actions
  with certain probabilities.
  Example: π(north|s) = 0.7, π(east|s) = 0.3.
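• A sketch of how the two kinds of policy might be represented in Python (the state name and probabilities are the illustrative ones from the example above):

  import random

  # Deterministic policy: each state maps to exactly one action.
  deterministic_pi = {"s": "north"}

  # Stochastic policy: each state maps to a distribution over actions, pi(a|s).
  stochastic_pi = {"s": {"north": 0.7, "east": 0.3}}

  def sample_action(pi, state):
      """Draw an action from pi(.|state) for a stochastic policy."""
      actions, probs = zip(*pi[state].items())
      return random.choices(actions, weights=probs, k=1)[0]

  print(deterministic_pi["s"], sample_action(stochastic_pi, "s"))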
             Value Functions
• Value functions answer the question:
  “How good is it to be in a state (or to take an
  action in a state) if I follow policy π?”
• We measure “goodness” in terms of expected
  return (future rewards).
       (a) State‑Value Function
• Tells us how good it is to be in a particular
  state s under policy π.
• Denoted as vπ(s):
  vπ(s) = Eπ[ Gt | St = s ]
  where Gt = Rt+1 + γRt+2 + γ²Rt+3 + … is the total discounted return.
      (b) Action‑Value Function
• Tells us how good it is to take action a in state
  s under policy π.
• Denoted as qπ(s, a):
  qπ(s, a) = Eπ[ Gt | St = s, At = a ]
Relationship Between vπ(s) and qπ(s, a)
• The value of a state equals the average
  action‑value, weighted by the probabilities of
  each action under π:
  vπ(s) = Σa π(a|s) qπ(s, a)
    The Bellman Equation for vπ
• The most important recursive relationship in RL:
  vπ(s) = Σa π(a|s) Σs′,r p(s′, r | s, a) [ r + γ vπ(s′) ]
• This says: The value of a state equals the
  expected reward plus the discounted value of
  the next state, averaged over all actions and
  outcomes.
        Backup Diagram for vπ
• The arrows show possible actions from state s,
  each leading to different next states with
  rewards. The Bellman equation combines
  them into the expected value.
                Estimation
• If the agent records the average returns
  observed after visiting states (or state-action
  pairs), these averages converge to vπ(s) and
  qπ(s, a).
• Such estimation is often done via Monte Carlo
  methods or function approximation when the
  state space is very large.
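• A minimal Monte Carlo sketch of this averaging idea. The env and policy objects are hypothetical: env.reset() is assumed to return a start state, env.step(state, action) to return (next_state, reward, done), and policy(state) to return an action:

  from collections import defaultdict

  def mc_estimate_v(env, policy, num_episodes, gamma=0.9):
      """Estimate v_pi(s) by averaging the returns observed after each visit to s."""
      returns_sum = defaultdict(float)
      returns_count = defaultdict(int)
      for _ in range(num_episodes):
          # Generate one episode as a list of (state, reward that followed).
          episode = []
          state, done = env.reset(), False
          while not done:
              action = policy(state)
              next_state, reward, done = env.step(state, action)
              episode.append((state, reward))
              state = next_state
          # Walk backwards, accumulating G_t = R_{t+1} + gamma * G_{t+1}.
          g = 0.0
          for state, reward in reversed(episode):
              g = reward + gamma * g
              returns_sum[state] += g
              returns_count[state] += 1
      return {s: returns_sum[s] / returns_count[s] for s in returns_sum}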
Example: Gridworld
                   Policy
• Suppose the agent chooses all four actions
  with equal probability in every state:
  π(a|s) = 0.25 for every action a and every state s.
• This is called an equiprobable random policy.
               Environment Setup
• The world is a rectangular grid where each cell = a state.
• At each cell, the agent can take four actions:
  – north, south, east, west.
• Movement rules:
  – Moving off the grid keeps the agent in the same cell but gives a
    reward of –1.
  – Normal moves give a reward of 0, except:
    •   From special state A, any action gives +10 reward and moves the
        agent to A′.
    •   From special state B, any action gives +5 reward and moves the agent
        to B′.
         State-Value Function vπ
• For each state, vπ(s) is the expected long-term reward
  starting from s and following the random policy.
• Near the edges, vπ(s) becomes negative, because the
  agent often bumps into the grid’s edge and receives
  –1 penalties.
• State A has high value (close to +10) because it gives a
  big immediate reward.
• State B has a value greater than 5, since after
  reaching B′, there’s a good chance to later get to A
  (earning +10).
        State-Value Function vπ
• In the figure (right), the state in the center has
  value +0.7, computed from the Bellman
  equation considering its four neighbors.
     Bellman Equation in Action
• For a given state s, the Bellman equation reads:
  vπ(s) = Σa π(a|s) Σs′,r p(s′, r | s, a) [ r + γ vπ(s′) ]
• For example, at the center state (valued at
  +0.7), its four neighbors have values +2.3, +0.4,
  –0.4, and +0.7.
• The equation balances the immediate reward
  (mostly 0) and the discounted future values of
  these neighbor states.
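• A Python sketch that evaluates the equiprobable random policy on such a gridworld by sweeping the Bellman equation until the values settle. The 5×5 grid size, the positions of A, A′, B, B′, and γ = 0.9 are assumptions made for illustration; the slides do not fix them:

  import numpy as np

  N, GAMMA = 5, 0.9
  A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)   # assumed layout
  ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # north, south, west, east

  def step(state, action):
      """Deterministic gridworld dynamics as described on the earlier slide."""
      if state == A:
          return A_PRIME, 10.0
      if state == B:
          return B_PRIME, 5.0
      r, c = state[0] + action[0], state[1] + action[1]
      if 0 <= r < N and 0 <= c < N:
          return (r, c), 0.0
      return state, -1.0                                    # bumped into the edge

  v = np.zeros((N, N))
  for _ in range(1000):                                     # iterate the Bellman equation
      v_new = np.zeros_like(v)
      for r in range(N):
          for c in range(N):
              for a in ACTIONS:                             # pi(a|s) = 0.25 for each action
                  (nr, nc), reward = step((r, c), a)
                  v_new[r, c] += 0.25 * (reward + GAMMA * v[nr, nc])
      v = v_new
  print(np.round(v, 1))   # with this assumed layout, A comes out near +8.8 and B above +5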
    Optimal Policies and Optimal Value
                Functions
• In reinforcement learning, “solving” a problem
  means finding a policy that maximizes long-
  term reward.
• A policy (π) is basically the agent’s “decision
  rule” — it says:
  – In each state, what action should I take?
• We’re looking for optimal policies — ones
  that are at least as good as every other policy,
  in every state.
Comparing Policies
• A policy π is at least as good as a policy π′ (written π ≥ π′)
  if and only if vπ(s) ≥ vπ′(s) for every state s.
• An optimal policy π* is one that is at least as good as every
  other policy; at least one optimal policy always exists, and all
  optimal policies share the same value functions.
      Optimal Value Functions
• Two key “best possible” value functions are
  defined:
1. Optimal State-Value Function
2. Optimal Action-Value Function
 Optimal State-Value Function v*
• This is the maximum expected return you can
  get starting in state s, over all possible policies π:
  v*(s) = maxπ vπ(s)
• It answers:
   – “If I start in state s and act optimally forever,
     what’s my expected long-term reward?”
Optimal Action-Value Function q*
• This is the maximum expected return from
  taking action a in state s, then following an
  optimal policy afterward:
  q*(s, a) = maxπ qπ(s, a)
• It answers:
  – “If I’m in state s, take action a now, and then act
    optimally, what’s my expected long-term reward?”
Relationship Between v* and q*
• q*(s, a) = E[ Rt+1 + γ v*(St+1) | St = s, At = a ]
• v*(s) = maxa q*(s, a)
   Bellman Optimality Equations
• These express a self-consistency rule:
  – The value of a state (or state–action pair) under
    optimal behavior equals the best possible
    immediate reward plus the discounted value of
    what comes next.
• v*(s) = maxa Σs′,r p(s′, r | s, a) [ r + γ v*(s′) ]
• q*(s, a) = Σs′,r p(s′, r | s, a) [ r + γ maxa′ q*(s′, a′) ]
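• Turning the v* equation into a repeated update gives a simple value-iteration sketch in Python; the two-state MDP table p below is hypothetical:

  # p[(s, a)] -> list of (next_state, reward, probability); hypothetical toy MDP.
  p = {
      ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
      ("s0", "a1"): [("s1", 0.0, 1.0)],
      ("s1", "a0"): [("s0", 2.0, 1.0)],
      ("s1", "a1"): [("s1", -1.0, 1.0)],
  }
  states, actions, gamma = ["s0", "s1"], ["a0", "a1"], 0.9

  v = {s: 0.0 for s in states}
  for _ in range(200):   # repeatedly apply the Bellman optimality update
      v = {s: max(sum(pr * (r + gamma * v[s2]) for s2, r, pr in p[(s, a)])
                  for a in actions)
           for s in states}
  print(v)   # approximates v*(s) for this toy MDP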
       Why They’re Important
• If you know v*:
  – You can do a one-step lookahead to pick the best
    action in each state.
• If you know q*:
  – You can immediately pick the best action without
    even knowing the environment dynamics.
• They’re the key to defining optimal policies —
  any policy that’s greedy with respect to v* or
  q* is optimal.
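• A sketch of both ideas in Python, using small hypothetical tables for q*, v*, and the dynamics p:

  # Hypothetical optimal action-values q*(s, a) for one state of interest.
  q_star = {
      ("s0", "left"): 1.0, ("s0", "right"): 2.5,
  }

  def greedy_from_q(q, state, actions):
      """With q*, the best action is read off directly -- no model of p is needed."""
      return max(actions, key=lambda a: q[(state, a)])

  # With only v*, a one-step lookahead through the dynamics p(s', r | s, a) is needed.
  v_star = {"s0": 2.5, "s1": 0.3}
  p = {  # p[(s, a)] -> list of (next_state, reward, probability); hypothetical values
      ("s0", "left"):  [("s1", 0.0, 1.0)],
      ("s0", "right"): [("s0", 1.0, 0.5), ("s1", 2.0, 0.5)],
  }

  def greedy_from_v(v, p, state, actions, gamma=0.9):
      def lookahead(a):
          return sum(pr * (r + gamma * v[s2]) for s2, r, pr in p[(state, a)])
      return max(actions, key=lookahead)

  print(greedy_from_q(q_star, "s0", ["left", "right"]),
        greedy_from_v(v_star, p, "s0", ["left", "right"]))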
Backup diagrams for v* and q*
Figure: Backup diagrams for v* and q*. Unlike the backup for vπ, these
take a maximum over actions (for v*) and over the next action a′ of
q*(s′, a′) (for q*), rather than an expectation under a policy.
                     Example
• Imagine playing golf:
  – v*(s) tells you: “From here, if I play perfectly, I’ll
    finish in about 3 strokes.”
  – q*(s, a) tells you: “If I drive the ball now, and then
    play perfectly afterward, I’ll finish in about 3
    strokes; if I putt instead, I’ll finish in 10.”
             Practical Issues
• In theory, you can solve these equations
  exactly — but in real problems:
  – The environment might be unknown.
  – The number of states can be huge.
  – You might not have enough computation power.
• So in practice, we use approximation
  methods (like dynamic programming, Monte
  Carlo, TD learning) to estimate v* or q*.
  Optimality and Approximation
• Optimal value functions (v* and q*) and
  optimal policies (π*) are well-defined
  mathematically.
• In practice, computing them exactly is rarely
  possible for real-world problems due to
  computational and memory constraints.
• So, in RL, we settle for approximations.
   Why Exact Optimality is Hard?
• Even with a perfect environment model, solving the
  Bellman optimality equation exactly is computationally
  prohibitive.
• Example: Chess
  – Fully deterministic.
  – Small compared to real life.
  – But still unsolved optimally — even massive, custom-
    designed computers can’t compute the optimal move for
    every state.
• Real agents have time step limits → can’t do unlimited
  computation before choosing an action.
          Memory Constraints
• Tabular case: If the state space is small and
  finite, we can store value functions in an array
  (one entry per state or state–action pair).
• Large/continuous state spaces: Impossible to
  store full tables → must use function
  approximation (e.g., neural networks, linear
  approximators).
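• A minimal sketch of the linear-approximation idea in Python: instead of one table entry per state, the value is a dot product of a feature vector and a small weight vector. The feature map, states, and step size below are hypothetical:

  import numpy as np

  def features(state):
      """Hypothetical feature vector for a state described by two numbers."""
      x, y = state
      return np.array([1.0, x, y, x * y])

  w = np.zeros(4)   # one weight per feature, no matter how many states exist

  def v_hat(state):
      return features(state) @ w

  def update(state, target_return, alpha=0.1):
      """Gradient-style step moving v_hat(state) toward an observed return."""
      global w
      w += alpha * (target_return - v_hat(state)) * features(state)

  update((0.5, -1.0), target_return=2.0)
  print(v_hat((0.5, -1.0)))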
        Strategic Approximation
• Not all states are equally important:
  – Some states occur rarely → making mistakes there has
    little effect on total reward.
  – Some states occur often → must learn to act well there.
• Example: TD-Gammon
  – Plays world-class backgammon.
  – Likely makes poor moves in many never-encountered
    states.
  – Still succeeds because it plays optimally for the states it
    actually visits.
 RL’s Advantage in Approximation
• RL is online and experience-driven:
  – Focuses learning on frequently visited states.
  – Saves effort in unimportant regions of the state
    space.
• This focus distinguishes RL from many other
  methods for approximating MDP solutions,
  which often try to approximate over all states
  equally.