
Reinforcement Learning Refresher

Dr. Somdyuti Paul


Department of Artificial Intelligence
IIT Kharagpur
https://somdyuti2.github.io/

Hands-on approach to AI for real-world applications
What is Reinforcement Learning?
● Reinforcement Learning (RL) is, simply put, learning by doing, or in other words, learning from experience.

● The objective of a learner, also referred to as an agent, is to figure out the best ways to achieve specific goals
under the constraints imposed by the system in which it acts, referred to as the environment.

(Figure: the interaction loop between the agent and the environment.)

What is Reinforcement Learning?
● At any instance, the information about the environment available to the agent is encapsulated by a random variable
called the state.
● The agent interacts with the environment by taking specific actions, which correspond to the permissible moves or
decisions in the environment.
● Actions result in rewards (or penalties), which the agent receives as feedback from the environment.

(Figure: the agent navigates toward a goal worth +100 points.)
What is Reinforcement Learning?
● The reward (positive or negative) allows the agent to update its rule for deciding on actions to
execute at each state, which is called its policy.

(Figure: a positive reward signals "Great! Follow the same path again.")

RL Notations and Markov Decision Process
● Notations used:
● St : state of the agent at step t
● At: action taken at step t
● Rt+1: reward received after executing At at St.

● An RL problem is typically formulated as a Markov Decision Process (MDP), which has the Markov property of being
memoryless, i.e. the next state depends only on the current state and action taken, and not on the history of past states.

Markov Property

Pr(St+1 = st+1 | St = st, St−1 = st−1, ⋯, S1 = s1) = Pr(St+1 = st+1 | St = st)
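
As a small illustration of this property (the states, probabilities, and rewards below are made up, not taken from the lecture), a tabular transition model only needs to be indexed by the current state and action; no history is required:

```python
import random

# Made-up transition model: for each (state, action) pair, a list of
# (probability, next_state, reward) outcomes. The distribution over next
# states depends only on the current state and action -- the Markov property.
P = {
    ("s1", "right"): [(0.9, "s2", -1.0), (0.1, "s1", -1.0)],
    ("s2", "right"): [(0.8, "s3", -1.0), (0.2, "s2", -1.0)],
}

def step(state, action):
    """Sample (next_state, reward) using only the current state and action."""
    outcomes = P[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward
```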

RL Assignment: Question 1

You are developing a reinforcement learning-based system for an autonomous drone tasked with delivering
packages in an urban environment. The drone must navigate through a city, avoiding obstacles such as
buildings, trees, and other flying objects. It must also optimize its flight path to minimize delivery time while
ensuring safety and battery efficiency. The drone is equipped with sensors for detecting obstacles and GPS for
navigation.

Formulate this problem as an RL task. Specify the following components of the associated MDP:
• State Space (S): Describe how you would define the state space for this problem.
• Action Space (A): Define the action space available for the drone.
• Reward Function (R): Propose a reward function that encourages efficient and safe deliveries.

RL Assignment: Question 1
Solution:

State Space (S): The state space should capture all relevant information about the drone's current situation and environment. A possible state
representation could include:
• The current position and altitude of the drone.
• The current velocity of the drone.
• The position of the delivery destination.
• Obstacles in the vicinity.
• The remaining battery level of the drone.
• The current weather conditions (e.g., wind speed, rain).

Action Space (A): The action space consists of the set of possible maneuvers the drone can perform. These actions include:
• Move Forward: Increase the drone's forward velocity.
• Move Backward: Decrease the drone's forward velocity.
• Ascend: Increase the drone's altitude.
• Descend: Decrease the drone's altitude.
• Turn Left: Adjust the drone's direction to the left.
• Turn Right: Adjust the drone's direction to the right.
• Hover: Maintain the current position and altitude.
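
As an illustrative sketch only (the class and field names below are hypothetical and not tied to any particular simulator), the proposed state and action spaces could be encoded as:

```python
from dataclasses import dataclass
from enum import Enum, auto

class DroneAction(Enum):
    """Discrete action space from the solution above."""
    MOVE_FORWARD = auto()
    MOVE_BACKWARD = auto()
    ASCEND = auto()
    DESCEND = auto()
    TURN_LEFT = auto()
    TURN_RIGHT = auto()
    HOVER = auto()

@dataclass
class DroneState:
    """One possible state representation; fields mirror the bullet list above."""
    position: tuple[float, float, float]            # (x, y, altitude)
    velocity: tuple[float, float, float]
    destination: tuple[float, float]
    nearby_obstacles: list[tuple[float, float, float]]
    battery_level: float                            # remaining charge, 0-100 %
    wind_speed: float                               # simple stand-in for weather
```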

RL Assignment: Question 1

Reward Function (R): The reward function should incentivize efficient, safe, and timely deliveries. A possible reward
function could be:
• +100 for successfully delivering the package to the destination.
• -1 for each second of flight time to encourage minimizing delivery time.
• -10 for collisions with obstacles or other drones.
• -0.1 point for each percent drop in battery level to encourage energy efficiency.
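
A minimal sketch of this reward scheme, assuming a hypothetical info dictionary with delivered, collided, and battery_drop_pct fields supplied by the simulator at each step:

```python
def drone_reward(info: dict, dt_seconds: float = 1.0) -> float:
    """Reward for one time step, following the scheme proposed above."""
    reward = 0.0
    if info.get("delivered"):
        reward += 100.0                                 # successful delivery
    reward -= 1.0 * dt_seconds                          # per-second flight-time penalty
    if info.get("collided"):
        reward -= 10.0                                  # collision with obstacle or drone
    reward -= 0.1 * info.get("battery_drop_pct", 0.0)   # energy-efficiency penalty
    return reward
```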

Discounted Rewards and Return
● Often, an immediate reward is more valuable
to an agent than the same or even higher
reward received in the future.
● e.g., in the task of managing an investment
portfolio with RL, an action that yields $100
today might be preferred over another
action that yields $100 after 12 months.

● To account for this effect, a discount factor γ ∈ (0, 1] is introduced to scale future rewards.
● The cumulative discounted reward is called the return:

Gt = Rt+1 + γRt+2 + γ²Rt+3 + ⋯ + γ^(T−t−1)RT = Rt+1 + γGt+1
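
A short sketch of computing the return for a finite episode, both directly from the definition and via the recursion Gt = Rt+1 + γGt+1 (the reward values are illustrative):

```python
def discounted_return(rewards, gamma=0.9):
    """G_t computed directly from the definition, with rewards = [R_{t+1}, ..., R_T]."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def returns_from_rewards(rewards, gamma=0.9):
    """All returns G_t via the backward recursion G_t = R_{t+1} + gamma * G_{t+1}."""
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

rewards = [-1, -1, -1, 10]   # example episode
assert abs(discounted_return(rewards) - returns_from_rewards(rewards)[0]) < 1e-9
```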

State and Action Value Functions
● Value functions are used to quantify the desirability of particular states or (state, action) pairs from the agent's perspective.
● State Value Function V(s): how good is it to be in state s.
● Action Value Function Q(s, a): how good is it to be in state s and take action a in state s.

● The value functions are learned with respect to a policy (π).


● Learning the value functions enables the agent to find the optimal policy, which is the
policy that maximizes the agent’s return.

Exploration vs. Exploitation Tradeoff
● While learning from the environment, the agent has the following choices:
● Exploration: trying new actions to explore new states with unknown rewards.
● Exploitation: following past actions to visit known states with known rewards.
(Figure: "Where to dine?")
● An agent that does not explore the environment enough may miss out on discovering actions that could lead to
potentially higher rewards.
● An agent that keeps exploring cannot effectively maximize its return, as it keeps on selecting suboptimal actions
without leveraging any information learned about the environment.

Exploration vs. Exploitation Tradeoff
● A simple, yet effective strategy to balance exploration and exploitation is to randomly choose between
exploration and exploitation at each learning step:
● With probability ϵ, the agent selects a random action at a given state (exploration).
● With probability 1 − ϵ, the agent selects the greedy action, i.e. the action which has the maximum action
value (q-value) at a given state (exploitation).
● The above policy is called ϵ-greedy.
(Figure: draw Z ∼ U[0, 1); if Z ≤ ϵ, pick a random action; if Z > ϵ, pick the greedy action.)
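
A minimal sketch of ϵ-greedy action selection, assuming the q-values are stored in a NumPy array of shape (n_states, n_actions):

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # exploration: random action
    return int(np.argmax(Q[state]))            # exploitation: greedy action
```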

Exploration vs. Exploitation Tradeoff
● To encourage effective exploration in the early stages of learning, a large ϵ is chosen initially.
● ϵ is gradually reduced as learning progresses, i.e. as more information about the environment becomes
available to the agent for effective exploitation.

Epsilon Decay

ϵt = max(ϵmin, ϵ0 ⋅ δ^t), where δ is the decay rate (e.g., δ = 0.99)

● A fully trained agent relies on exploitation to maximize its return.
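
The decay schedule above as a small function; the values of ϵ0, ϵmin, and δ are illustrative defaults, not prescribed by the lecture:

```python
def epsilon_at(t, eps0=1.0, eps_min=0.05, delta=0.99):
    """Epsilon at learning step t: eps_t = max(eps_min, eps0 * delta**t)."""
    return max(eps_min, eps0 * delta ** t)

# epsilon_at(0) == 1.0, epsilon_at(100) is about 0.366, and it bottoms out at 0.05
```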

The Q-Table
For finite state-action spaces, the q-value function could be represented as a table:
Consider a 4 × 4 gridworld with 16 states, where s1 corresponds to cell (0, 0) and s16 to cell (3, 3), and 4 actions
(a1: ↑, a2: ↓, a3: ←, a4: →). The corresponding Q-table is a 16 × 4 matrix:

States       a1: ↑        a2: ↓        a3: ←        a4: →
s1: (0,0)    Q(s1, a1)    Q(s1, a2)    Q(s1, a3)    Q(s1, a4)
s2: (0,1)    Q(s2, a1)    Q(s2, a2)    Q(s2, a3)    Q(s2, a4)
s3: (0,2)    Q(s3, a1)    Q(s3, a2)    Q(s3, a3)    Q(s3, a4)
s4: (0,3)    Q(s4, a1)    Q(s4, a2)    Q(s4, a3)    Q(s4, a4)
⋮            ⋮            ⋮            ⋮            ⋮
s16: (3,3)   Q(s16, a1)   Q(s16, a2)   Q(s16, a3)   Q(s16, a4)
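
A minimal sketch of this Q-table as a NumPy array, with a small helper (hypothetical, for illustration) that maps a grid cell to its row in the table:

```python
import numpy as np

n_rows, n_cols, n_actions = 4, 4, 4
Q = np.zeros((n_rows * n_cols, n_actions))   # 16 x 4 table, initialized to zero

def state_index(row, col):
    """Map grid cell (row, col) to its row in the Q-table: s1 -> 0, ..., s16 -> 15."""
    return row * n_cols + col

Q[state_index(0, 0), 3] = -0.1   # e.g. set Q(s1, a4)
```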

Q-learning
● The goal of Q-learning is to learn the action values in the Q-table, such that the learned q-values give a policy
that maximizes the agent's return.
● This learning is accomplished using a recursive formulation called the Bellman equation, which allows the q-value
function of a given state to be expressed in terms of the q-value function of the next state, under a given policy π.

The Bellman Equation

Qπ(s, a) = 𝔼[Rt+1 + γQπ(St+1, At+1) | St = s, At = a]

This is the expected return on taking action a in state s, and following policy π thereafter.

Q-learning
The Q-learning loop (the behavior policy is taken to be ϵ-greedy):

1. Initialize the Q-table.
2. Pick an action A for the current state S using the behavior policy.
3. Perform action A.
4. Observe the reward R and the next state S′.
5. Update the Q-table using the update rule (based on the Bellman equation):
   Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
6. Update the current state and repeat from step 2.



Understanding the Q-learning Update Rule
From the Bellman equation, we can estimate the q-value of taking action A at state S and following a greedy policy
thereafter as follows:

Q̂(S, A) = R + γ max_a Q(S′, a)

The difference between the q-value estimated from the Bellman equation and the current q-value is called the
temporal difference error, or TD-error:

TD-Error = Q̂(S, A) − Q(S, A)

The q-value is updated to correct the previous estimate according to the new estimate:

Q(S, A) ← Q(S, A) + α ⋅ TD-Error, where α is the learning rate, i.e.
Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
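
Putting the last two slides together, here is a minimal tabular Q-learning sketch. It assumes the Gymnasium package listed in the resources slide and uses its FrozenLake-v1 environment as a convenient stand-in with discrete states and actions; the environment choice and hyperparameter values are illustrative, not part of the original example.

```python
import numpy as np
import gymnasium as gym   # assumes the Gymnasium package from the resources slide

env = gym.make("FrozenLake-v1")
Q = np.zeros((env.observation_space.n, env.action_space.n))
alpha, gamma = 0.1, 0.9                 # learning rate and discount factor
eps, eps_min, decay = 1.0, 0.05, 0.99   # epsilon-greedy schedule
rng = np.random.default_rng(0)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        # epsilon-greedy behavior policy
        if rng.random() < eps:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # TD error and update rule from the slide
        td_error = reward + gamma * np.max(Q[next_state]) - Q[state, action]
        Q[state, action] += alpha * td_error
        state = next_state
    eps = max(eps_min, eps * decay)     # decay epsilon between episodes
```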



Q-learning Example
Consider the following 3 × 3 gridworld, where the robot’s goal is to navigate from the initial state, s1,
to the goal state, s9, to make a delivery while avoiding potential obstacles or hazards.

• On visiting any state, the robot's battery depletes by 1%, which corresponds to a reward of -1.
• On visiting the state s5, which has a hazard, the robot gets damaged, cannot continue its task, and has to start
again from s1, which corresponds to a reward of -10.
• On making a successful delivery by reaching the goal state, the robot receives a reward of +10.

Gridworld layout (reward received on visiting each state):
s1: -1   s2: -1    s3: -1
s4: -1   s5: -10   s6: -1
s7: -1   s8: -1    s9: +10

Q-learning Example
• The Q-table is initialized with 0s. Let γ = 0.9 and α = 0.1.
• Let the agent take the right action (a4) from its initial state s1, reaching s2 and getting a reward of -1.
• The TD error for Q(s1, a4) is thus
  (R + γ ⋅ max_a Q(s2, a)) − Q(s1, a4) = (−1 + 0.9 × 0) − 0 = −1
• So, Q(s1, a4) is updated as
  Q(s1, a4) ← Q(s1, a4) + α ⋅ TD-Error
  Q(s1, a4) ← 0 + 0.1 ⋅ (−1) = −0.1

(Actions: a1: up, a2: down, a3: left, a4: right.)
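
The same single update reproduced numerically, as a quick check:

```python
alpha, gamma = 0.1, 0.9
Q_s1_a4, max_Q_s2, reward = 0.0, 0.0, -1.0   # initial Q-table is all zeros

td_error = reward + gamma * max_Q_s2 - Q_s1_a4   # -1.0
Q_s1_a4 += alpha * td_error                      # -0.1
print(td_error, Q_s1_a4)                         # -1.0 -0.1
```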

Q-learning Example
Suppose the Q-table learned by the agent after 1 episode is as shown. (Figure: Q-table after 1 episode.)

In the next episode, if the agent starting at state s1 takes the right action, the corresponding update to
Q(s1, a4) would be performed as follows:

TD-Error = (R + γ ⋅ max_a Q(s2, a)) − Q(s1, a4)
         = (−1 + 0.9 × 0) − (−0.19)
         = −0.81

Q(s1, a4) ← Q(s1, a4) + α ⋅ TD-Error
Q(s1, a4) ← −0.19 + 0.1 ⋅ (−0.81) = −0.271

Q-learning Example
After several episodes, the q-values converge to the values shown in the learned Q-table.

(Figures: the learned Q-table, and the optimal policy of the agent using the learned Q-table, shown on the
3 × 3 gridworld.)
RL Assignment Question 3
In a simple MDP, an agent is in a state s, and the actions it can take can lead to the following outcomes:
With probability 0.4, the agent transitions to state s′, with reward R = 10, and v(s′) = 5.
With probability 0.6, the agent transitions to state s′′, with reward R = 2, and v(s′′) = 3.

The discount factor γ is 0.5. Using the Bellman equation, find the expected value of state s.

Solution:
vπ(s) = 𝔼π[Rt+1 + γvπ(St+1) | St = s]
According to the given policy,
𝔼[Rt+1] = 0.4 × 10 + 0.6 × 2 = 5.2
𝔼[vπ(St+1)] = 0.4 × 5 + 0.6 × 3 = 3.8
So, we have:
vπ(s) = 5.2 + 0.5 × 3.8 = 7.1
Thus, the expected value of state s is 7.1.
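
A quick numerical check of this result:

```python
gamma = 0.5
outcomes = [(0.4, 10, 5), (0.6, 2, 3)]   # (probability, reward, v(next state))

expected_reward = sum(p * r for p, r, _ in outcomes)       # 5.2
expected_next_value = sum(p * v for p, _, v in outcomes)   # 3.8
print(expected_reward + gamma * expected_next_value)       # 7.1
```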
Key Learnings

This lecture should enable you to:


• Identify if a particular problem is suitable to be formulated as an RL problem.
• Identify states, actions and rewards in a particular environment.
• Understand the exploration-exploitation trade-off involved in RL
• Understand the basics of Q-learning
• Apply concepts to simple scenarios

RL Resources
• This was a very high-level overview!
• Resources for learning RL:
• Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto
• RL course lectures by David Silver
• Resources for implementing RL algorithms:
• Gymnasium - a diverse collection of RL environments, including classic control, robotics, games etc.
• Stable Baselines3 - implementations of well-known and useful RL algorithms in PyTorch with trained
models and example code snippets.
• TensorFlow Agents - A TensorFlow based library for implementing and testing RL algorithms.
• https://www.reddit.com/r/reinforcementlearning/ - community for discussing everything related to RL
(such as project ideas, help in understanding research papers, etc.)
Thank You!!
Presentation link:
https://www.icloud.com/keynote/08fwyHyWUKQb32IpZIR2Nvoww#RL%5FOctober%5F19

