
NPTEL

NPTEL ONLINE COURSE

REINFORCEMENT LEARNING

MDP Modelling

Prof. Balaraman Ravindran


Department of Computer Science and Engineering
Indian Institute of Technology Madras

So we will continue looking at the full reinforcement learning problem. We left off where we defined a Markov decision process. Today's class will probably be a short one, because the next topic I need to start requires me to define a lot of notation and such. If I start that today, I will probably end up just defining the notation, and then when I start again next week I will have to ask you to recall the notation, or spend time writing it all out again. So I will not do that.

So if I finish whatever I want to cover before the proofs start, I will just stop today; I will not start on the notation. Let us see how it goes.

(Refer Slide Time: 01:09)

So we started looking at Markov decision processes last time. We said that an MDP is specified by a set of states, a set of actions, transition probabilities, rewards, and possibly a γ. If you look at the RL textbook, they leave out the γ; if you look at any OR textbook, where MDPs are also widely used, they would have the γ in there. For them, γ is a very inherent part of the problem. But in the RL books you will see that they leave the γ out of the problem definition, even though γ is an inherent part of the problem and changes the solution; somehow they just do not acknowledge that explicitly in the problem definition itself. So this is essentially what an MDP is all about. What I will do now is walk through one of the examples from the textbook.
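To make the pieces concrete, here is a minimal sketch of what such a specification looks like as data. This is my own illustration, not the book's or the lecture's; the container layout and names are arbitrary choices for a finite MDP.

from typing import NamedTuple, Dict, Set, Tuple

State = str
Action = str

class MDP(NamedTuple):
    states: Set[State]                                  # S
    actions: Dict[State, Set[Action]]                   # A(s): actions legal in state s
    P: Dict[Tuple[State, Action], Dict[State, float]]   # P(s' | s, a)
    R: Dict[Tuple[State, Action], float]                # expected one-step reward r(s, a)
    gamma: float                                        # discount factor γ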

I am sticking to the textbook example so that you can go back and read the book, and also because it makes the course more relatable for the minor-stream students here. So let us take the example of the recycling robot from the book. How many of you have been reading the book? Okay, the number went up by two since last time, good. So, do you remember the recycling robot? What was it about? [A student answers.] Okay, it is interesting that you used the description in terms of states. People heard him, but I will repeat it since I seem to have the louder mike. So there is this robot that has been created for the express purpose of Swachh Bharat: it goes around the building, finds all the empty cans, gathers them, and puts them into some kind of trash recovery or recycling system.

For every can the bot gathers, it is going to get a reward of 1. This is a number I am pulling out of a hat, so let us not worry about the exact value now. The robot is supposed to gather cans; that is essentially what we want it to do. But being a robot, it has a limited amount of battery power it can spend, so it has to choose carefully how it spends that battery. It can go around looking for cans, and if it is about to run out of battery it can do one of two things. It can recharge itself.

Or it can just sit in one place, like a trash can, and expect people to drop their empty cans into it. So this is essentially the setup: I have this bot, and I would like it to go actively gather cans, but I also do not want it to run out of battery. If it runs out of battery, somebody has to physically either roll the robot or lift and carry it to a charging station and charge it, and that is a very painful operation, so I really do not want it to run out of battery.

But then I do not want it to just sit idle either, because what it is supposed to do is gather cans. So how would you model this problem? What are the states, what are the actions, and what should the reward be? The reward seems easy: you want it to gather cans, so whenever it gathers a can you give it a reward of +1. I am going to call this r_can = +1. Then, what do you not want it to do? You do not want it to run out of battery.

So, minus something large? I would have preferred to give it -10 or something, but since I am walking through the example in the book, it is -3. So it is not that bad: if it can collect 15 cans and then die, it still comes out ahead. What else? If it chooses to recharge, what do we do? Do we give it a positive reward, and if not, why does no reward make sense? Because if recharging itself earned a reward, the robot would keep recharging just for the heck of getting that reward. But recharging is also good, and we want to convey the fact that recharging is good.

Recharging is better than running out of battery, and if you recharge you have the potential for getting more rewards in the future, because you can go out and seek more cans. That way you have an indirect incentive for recharging: in the future it allows you to do something better. If you just look at the immediate reward, recharging looks like a wasted step, but if you look at the future outcomes, recharging is also very attractive.

What is below the threshold? No, that question is more to do with actions, or with outcomes. Battery levels, yes; you could make it more detailed, you can make it as detailed as you want. And now comes the crux: how detailed do you want the representation to be? Do you want to make it a real number saying exactly how much battery charge I have? Then you get a continuous state. So what do you want to do? You want to do some kind of discretization. And if I want to discretize, how do I discretize? Ah, but that brings us to a question we completely forgot to answer: what is the state?

The battery charge level should be the state. Is there anything else? The number of cans? No, let us just assume that magically the robot takes a can and transports it away. But that is an interesting point: if it has a finite carrying capacity, then you probably need to throw in how many cans it is holding, because at some point it has to make a trip to the dumping ground. So let us assume for the time being that as soon as a can is put into the bot, there is a teleporter that takes the can to the dumping ground.

Or the bot has infinite carrying capacity; either way, I am just trying to make the problem simpler. If you want to make the problem more and more realistic, you will have to think about a whole bunch of other things. For example, I am modeling this at a very high level: I am not talking about the pose of the robot, or the velocity with which the bot is moving. You would need all of those things if you were actually going to control the bot; here we are assuming all of that is somehow magically taken care of.

So we are assuming you are only worried about the charge of the bot. Yes, you had a question. Right: we have been talking about only discrete states and discrete actions so far, and we have not talked about continuous states and continuous actions, but you can certainly have continuous state, continuous action MDPs. So the state is going to be the charge, and possibly the number of cans, and other things; but for the time being I am not going to care about all of that. We will just look at charge, to keep the example simple.

I am also keeping it in sync with the book. So what do I do with charge? I can treat it as a number, in which case I will have to figure out how to handle a continuous value, or I can discretize it. How will I discretize it? Well, percentage is not really a discretization unless you round it off. You could do something like that, or you could be even more aggressive: I could do low, medium, high. And if you really stop and think about it, with low, medium, high, when would I actually change my decision about charging versus not charging? Only when I am low. So in some sense you could even operate with just high and low, or "sufficient" and "low", whatever.
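As a throwaway illustration of that last point, the whole discretization is just a bucketing function. This is my own sketch; the 40% cutoff is the one I bring up a little later, and is otherwise arbitrary:

def charge_state(charge_percent: float, cutoff: float = 40.0) -> str:
    """Map a continuous battery percentage onto the two-level state."""
    return "high" if charge_percent >= cutoff else "low"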

What is it? Ah, good point. Actually, I am not modeling the robot's location at all here; I am assuming there is always a convenient charging point located nearby. It is not like there is a single base station it has to travel to. So as soon as it finds out it is running low on charge, let us assume that as long as the battery is low it can still get to a charging point. In fact, I am going to assume something even stronger: with some probability it can keep operating even when the battery is low.

And still not run out. The assumption I make is the following: whenever I say recharge, the robot will have enough battery power left to reach the nearest recharge station and recharge, so after it picks the recharge action it will not run out of battery power. This is again a very simplifying assumption; it makes it easier for us to draw pictures, as I will do in a minute. But all the points that are being raised are very good points: when you start trying to model a real system, these are all calls that you will have to make.

So: what is it that I would throw into my state, what is it that I can ignore, and so on. And Pallavi, do you remember the homework assignment? The third one: we are giving them a word problem and asking them to construct an MDP out of it. Yes, the new assignment, the third assignment, the one that was released this morning; if it was released yesterday, then you are time travelling. Released this morning.

Yes, so the new assignment is due on Feb 14th. Which is a Sunday. Well, plan ahead: if the day is important to you, finish on the 13th; or, as is more likely for a lot of people who have nothing to do on the 14th, you have one more day. Last question: yes, there is a word problem and you are expected to construct an MDP from it. You can think along similar lines to what we are doing here, but it is a more constrained word problem, so you will see that there are not too many ways in which you can construct an MDP out of it.

But the idea is that there are a lot of design choices to make when you go from a problem specification to an MDP. When you are designing a reinforcement learning solution for a problem, the things you have to decide are the states, the actions, and the rewards; those are the three things you always have to decide. When you are constructing an MDP, you will additionally have to figure out what the transition probabilities are going to be, because without them you cannot solve the problem.

When we use reinforcement learning techniques to solve these problems, as we will see later on, you do not need the transition dynamics; you do not need to model the dynamics explicitly. But you still need to make choices for states, actions, and rewards, and that is where we are right now. After we finish this, I will show you a simple way of constructing the transition probabilities for the robot. So, essentially, the charge is now going to be low or high.

Then what about actions? There is no pick-up action: let us say that whenever the robot actively finds a can, it automatically picks it up. So one action is to search, or as they say in the book, to actively seek cans; I avoid calling it "explore" because explore is an overloaded word, we already have explore/exploit and those kinds of things, and I do not want an action named explore. So you have: actively seek cans, wait for cans, or recharge. Is there anything else that we need to define here? Yes, that is a good point: maybe you need a zero-charge state.
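In code, the modeling decisions up to this point amount to nothing more than a few small sets and constants. This is a sketch in my own notation; the per-state action sets anticipate a restriction (no recharging from high) that comes up shortly.

# States: the discretized battery charge.
STATES = {"high", "low"}

# Actions available in each state ("seek" = actively seek cans).
ACTIONS = {
    "high": {"seek", "wait"},
    "low":  {"seek", "wait", "recharge"},
}

# Rewards attached to events, as chosen in the lecture.
R_CAN = +1        # for each can gathered
R_DEPLETED = -3   # for running out of battery and needing a rescue
R_RECHARGE = 0    # for choosing to recharge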

About that zero-charge state: I will also tell you how to get around it. The assumption we are going to make is that if you run out of charge, somebody is going to pick you up and charge you, and you get the penalty of -3. So what does that mean? It means that from low I can actually go to high, but with a penalty of -3; that is equivalent to having a zero-charge state. You can make it more complex, but right now I am assuming that once the robot is plugged in to charge, we do not worry about it until it is fully charged. Now, the discretization raises another interesting point, because I have discretized the charge into just low and high.

So when will the robot go from the high state to the low state? If high means 100% charged, it is probably not going to drop to low when I take a single action. But suppose 40% is the cutoff point: at or above 40% I say high, below 40% I say low. Then if high means barely above 40%, almost surely, as soon as I take an action, the robot is going to go to the low state.

But I am not making that distinction; I have made a discretization. So what will it look like to me if I run the robot multiple times? Sometimes it goes from high to low, and sometimes it stays in high. What is going to happen is that I will translate this into transition probabilities: what is the probability of going from high to low? In fact, if I knew everything about the battery dynamics, if I knew how much power each and every action takes, and if I were modeling this as a continuous-valued state, there might not be any stochasticity here at all.

I might be able to tell you exactly: at this point, if this is the action you are taking, this is how much power you will consume, and therefore the battery will go from high to low. So inherently this might not be a stochastic system. I say "might not" only because people here have worked with robots, and you know that you can specify as many initial conditions as you want, and the bot will still do what it wants to do.

So it may still be a stochastic system, but it could potentially be deterministic if you knew everything about the world. We are making it stochastic by virtue of ignoring certain things that we could, in principle, know about the system. Ignoring them makes the system design a lot simpler; maybe we are giving up something in terms of how optimal our solutions can be, but that is a trade-off you will often have to make. So the formulation itself is not an easy task; you have to think about it a lot to figure out the right place to draw the line.
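The lecture does not spell out where the numbers for these induced probabilities come from, but one natural way to obtain them is simply to count transitions over many runs of the robot. A sketch, under that assumption:

from collections import Counter, defaultdict

def estimate_transitions(transitions):
    """Estimate P(s' | s, a) as relative frequencies of logged (s, a, s') triples."""
    counts = defaultdict(Counter)
    for s, a, s_next in transitions:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: n / sum(c.values()) for s2, n in c.items()}
            for sa, c in counts.items()}

# Hypothetical log of three seek steps started in 'high':
log = [("high", "seek", "high"),
       ("high", "seek", "low"),
       ("high", "seek", "high")]
print(estimate_transitions(log))
# {('high', 'seek'): {'high': 0.666..., 'low': 0.333...}}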

So these are the actions we can take: actively seek cans, wait for cans, recharge. Is there something else we have to specify? We have specified the states, we have specified the actions, and we have specified rewards for events: if I collect a can I get +1, if I run out of battery I get -3, and if I recharge I get 0. But these are events that happen in the world, so I have to relate them to states and actions; only then can I tell you what the expected reward is.

We will not talk about time at all; I told you that, at least for the initial part of the course, we will not worry about how long things take. So what do we need? Suppose I am actively seeking cans: what is the probability that I will get a reward of +1? In other words, if I am actively seeking cans, what is the probability that I will find a can? I still need to specify that. Likewise, if I am waiting for cans to come to me, what is the probability that I will get a can? [A student asks something.] The derivative of which one?

Yes, maybe, maybe not; I do not know why you would choose to move from one action to the other. I could simply stop seeking because I am running low on battery. Or, if I am in some facility where people are constantly taking cans and throwing them in the dustbin, I might just happily go sit next to the can-dispensing place and act as a dustbin. I would happily keep collecting rewards; I might not even worry about recharging or anything.

So I might just choose to stop; it depends on the kind of environment you are operating in. Even acting as a dustbin, do I get enough cans? If not, I might want to seek. Maybe there is some time of day, say between 12:00 and 1:30, when a lot of cans will come if I just stand outside the cafeteria, and I do not have to actively seek; maybe after 1:30 I have to go around seeking cans. I do not know; it depends on the system.

I am not trying to be overly realistic here; I am just showing you the factors that you have to take into account. If you start talking about what the optimal policies might be, you are going down a dangerous path: it means you are going to start designing the problem itself based on what you think the solution should be, which is a bad thing. You should put all the factors about the problem into the design, and then let the algorithm figure out what the solution should be.

You should not start from what you think the solution should be and then try to engineer the rewards and the model to produce it; I will give you an example of when that became a problem after I finish this. So, is it clear? Now I need to figure out: if I actively seek cans, what is the probability that I will get a can? Is there anything else I need to know? Yes, we will come to that.

It is related to that: if I am actively seeking cans, what is the probability that I will run out of battery? That also determines my reward. Or, if I am waiting for cans, what is the probability that I will run out of battery? Maybe 0, because I have turned everything off and am just waiting; or maybe there is some leakage, so there is a very small probability that I might actually run out of battery. So I need both of these quantities to figure out the expected reward for a state-action pair.

So I need to know the expected reward, and for that I have to think about both of these things. If I am in the high state and actively seeking cans, what is the probability that I will find a can? Does that depend on whether I am in the high state or not? No. So: if I am actively seeking cans, what is the probability I will get a can? That is one thing I need. Likewise, if I am waiting rather than actively seeking, what is the probability I will get a can? That is the second thing. On the other hand, suppose I am in the high state and actively seeking cans.

What is the probability that I will run out of battery? There, the state matters. Likewise, if I am in the low state and actively seeking cans, what is the probability I will run out of battery? Again the state matters. And likewise for waiting: if I am in the high state and waiting for a can, the probability of running out of battery is minuscule, almost 0; but if I am in the low state and waiting, then even with just the leakage power draw I might actually run out of battery. So you need to know all of these.

For the time being I will say α is the probability of getting a can if you are actively seeking, and β is the probability of getting a can if you are waiting, with α greater than β. Hopefully, typically, that is; there might be moments where β is higher than α, like standing outside a cafeteria during lunch time, as I said. But since we have not actually modeled the time of day anywhere, we are just going to say α is greater than β.

Now, what else do we need? We still need to define P, the transition probabilities. For that, what do we need? Well, if I am in state high, what is the probability that I will go to state low? Forget about running out of battery; that was what we needed for the rewards. For the state transitions, the question is: from high, if I am actively seeking a can, what is the probability that I will stay in high, and what is the probability that I will go to low?

And likewise down the list: if I am in state low and actively seeking a can, what is the probability that I stay in low and what is the probability that I go to high? If I am in state low and waiting for a can, what is the probability that I stay in low and what is the probability that I go to high? And so on. These are all the things you need to define the state transition probabilities. Of course, if I am in state high and I take the action recharge, what do we want? Exactly.

If I am in state low and I take the action recharge, then with probability 1 I go to state high. And we will actually prohibit the recharge action in state high: it does not make sense, if I am already fully charged, to go and plug in again, so we will disallow it. Well, it all depends; I could have a stochastic policy, I could say that in state high, with probability 0.8 I will search and with probability 0.2 I will just wait. But we are not talking about solutions here.

I am just talking about defining the problem. This is what I was mentioning earlier as well: do not get drawn into what the solution will look like. Sometimes you do have to think a little about the solution, because you need to know whether you are giving the state enough expressiveness for the policy to change when it should; but do not worry too much about the structure of the solution. Now let us draw this as some kind of state transition diagram.

(Refer Slide Time: 28:10)

Let us start with state high. There are two actions from state high; one is to actively seek cans. So what are the possible outcomes of seeking? What is the probability that I remain in high? Some value; equal probabilities are also possible. It depends on how wide my high band is and how much power my action is going to consume; I do not know, so let us just write some numbers here. What is the other action I can take here, and what are its numbers? Now, is there anything else I need to specify on the diagram? Yes: I need to mark what the reward is going to be for having taken each action.

So what will the reward be here? I get +1 for a can, but the expected number of cans I am going to collect in a step is α, since that is the probability of getting a can; so the expected reward is α. Likewise here, for waiting, the expected reward is going to be β. Now, from low I have three actions, not two. I will do the easy one first: from low, if I recharge, then with probability 1 I go to high and I get 0 reward.

Now comes the more interesting case. Suppose I wait in the low state: is there any chance that I will go to high? I could, but with very low probability, because I am not doing anything. So if I set 0.1 as the probability of discharging while waiting, I will say 1 - 0.1 = 0.9 for staying in low, and along the way I might still get a can. Likewise if I am actively seeking from low: there is some small probability that I will not discharge, in which case I keep accruing rewards at rate α; but with high probability I will actually discharge, and then I go to the high state (after being rescued and recharged) and I also pick up the additional -3 along the way.

Whether you add the +β or +α on those discharge transitions is again a choice. You could say that once I run out, nobody is actually going to give me a can, or that people might snatch away the can I already have; I do not know. You can choose whatever; I just chose to add the α and β here, and you can choose not to. But you get the sense of what we are doing. This is a very, very simple problem, and it still took us a while to identify all the factors; you identified a lot of useful factors, and we chose to ignore a whole bunch of them in order to come up with a simple, tractable MDP. Great.
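Putting the whole diagram together as data gives something like the sketch below. The α/β expected rewards and the -3 follow the lecture; the specific transition numbers (0.7, 0.6, 0.9) are placeholders of my own, standing in for the "some numbers" written on the board.

alpha, beta = 0.8, 0.2   # P(can | seek), P(can | wait); illustrative
p_hh = 0.7               # P(stay high | high, seek); placeholder
p_ll_seek = 0.6          # P(stay low | low, seek); placeholder
p_ll_wait = 0.9          # P(stay low | low, wait); placeholder

# Transition probabilities P[(s, a)] -> {s': prob}. Recharge is
# disallowed in 'high', and waiting in 'high' never drains the battery.
P = {
    ("high", "seek"):     {"high": p_hh,     "low": 1 - p_hh},
    ("high", "wait"):     {"high": 1.0},
    ("low",  "seek"):     {"low": p_ll_seek, "high": 1 - p_ll_seek},
    ("low",  "wait"):     {"low": p_ll_wait, "high": 1 - p_ll_wait},
    ("low",  "recharge"): {"high": 1.0},
}

# Expected one-step rewards r(s, a). From 'low', the discharge branch
# carries the -3 rescue penalty; the can reward is still credited,
# which is the (optional) choice made in the lecture.
R = {
    ("high", "seek"):     alpha,
    ("high", "wait"):     beta,
    ("low",  "seek"):     alpha + (1 - p_ll_seek) * (-3),
    ("low",  "wait"):     beta + (1 - p_ll_wait) * (-3),
    ("low",  "recharge"): 0.0,
}

# Sanity check: every outgoing distribution must sum to 1.
assert all(abs(sum(d.values()) - 1.0) < 1e-9 for d in P.values())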

IIT Madras Production

Funded by
Department of Higher Education
Ministry of Human Resource Development
Government of India

www.nptel.ac.in

Copyrights reserved
