Assignment #3: Group 15

This document is a group assignment by Ajay Guru, Heet Sankesara, and Mahima Arora, submitted on January 31, 2019. It contains three questions. The first asks for the optimal value function of an agent moving in a 4x3 gridworld, computed by value iteration under different reward functions. The second formulates an MDP for managing bicycle rentals between two locations. The third modifies the bicycle rental problem by allowing one bike to be moved from the first location to the second for free and by adding a cost if more than 10 bikes are kept at a location overnight.


Assignment #3:

Group 15

Ajay Guru(201651005)|Heet Sankesara(201651018)|Mahima Arora(201651055)

January 31, 2019



Question 1
Suppose that an agent is situated in the 4x3 environment shown in the figure. Beginning in the
start state, it must choose an action at each time step. The interaction with the environment
terminates when the agent reaches one of the goal states, marked +1 or -1. We assume that
the environment is fully observable, so the agent always knows where it is. Four actions are
available in every state: Up, Down, Left and Right. However, the environment is stochastic,
meaning that the action taken may not lead to the desired state. Each action achieves the
intended effect with probability 0.8; the rest of the time, the action moves the agent at right
angles to the intended direction, with probability 0.1 for each side. Furthermore, if the agent
bumps into a wall, it stays in the same square. The immediate reward for moving to any state s
other than the terminal states is r(s) = -0.04, and the rewards for moving to the terminal
states are +1 and -1 respectively. Find the value function corresponding to the optimal policy
using value iteration. Then find the value function corresponding to the optimal policy for
each of the following rewards: r(s) = -2, r(s) = 0.1, r(s) = 0.02, r(s) = 1.
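
The stochastic motion described above can be sketched in Python as a small transition model. This is only an illustration under stated assumptions: the figure is not reproduced in the text, so the grid coordinates (columns 1 to 4, rows 1 to 3) and the blocked cell at (2, 2) are assumed, and the names `step` and `transitions` are hypothetical helpers, not part of the assignment.

```python
# Assumed layout: columns 1..4, rows 1..3. The figure is missing from the
# text, so the blocked cell at (2, 2) is an assumption, not a given.
COLS, ROWS = 4, 3
WALL = {(2, 2)}

MOVES = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
# Directions at right angles to each intended action.
SIDEWAYS = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
            "Left": ("Up", "Down"), "Right": ("Up", "Down")}


def step(state, direction):
    """Deterministic move; bumping into a wall or the border keeps the agent in place."""
    x, y = state
    dx, dy = MOVES[direction]
    nxt = (x + dx, y + dy)
    inside = 1 <= nxt[0] <= COLS and 1 <= nxt[1] <= ROWS
    return nxt if inside and nxt not in WALL else state


def transitions(state, action):
    """Return {next_state: probability} for taking `action` in `state`."""
    probs = {}
    # Intended direction with probability 0.8, each perpendicular one with 0.1.
    outcomes = [(action, 0.8)] + [(d, 0.1) for d in SIDEWAYS[action]]
    for direction, p in outcomes:
        nxt = step(state, direction)
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs
```

For example, `transitions((1, 1), "Left")` puts probability 0.9 on staying in (1, 1), because both the intended move and the downward slip hit the border, and 0.1 on moving up to (1, 2).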

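Optimal Value Function:

\[ V^*(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] \tag{1} \]

T(s, a, s'): Probability of Transition from state s to state s' under action a
R(s, a, s'): Reward for that transition
\gamma: Discount factor

Value Iteration:

Initialize V_0(s) = 0 for all s.

While V_{n+1} \neq V_n:

\[ \forall s: \quad V_{n+1}(s) = \max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V_n(s') \right] \]

Policy Extraction:

\[ \pi^*(s) = \arg\max_{a} \sum_{s'} T(s, a, s') \left[ R(s, a, s') + \gamma V^*(s') \right] \tag{2} \]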
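
The update rule and the policy extraction step above can be sketched in Python for a generic finite MDP. This is a minimal illustration, not the group's program: the dictionary layout `T[s][a] = [(probability, next_state, reward), ...]`, the function names, the default discount `gamma=0.9`, the stopping tolerance `tol` (used in place of exact equality of successive value functions), and the two-state `toy` MDP are all assumptions made for the example.

```python
def value_iteration(T, gamma=0.9, tol=1e-9, max_iters=10_000):
    """Value iteration on T[s][a] = [(probability, next_state, reward), ...].

    States with no actions (T[s] == {}) are treated as terminal with value 0.
    """
    V = {s: 0.0 for s in T}
    for _ in range(max_iters):
        V_new = {}
        for s, actions in T.items():
            if not actions:
                V_new[s] = 0.0
                continue
            # Bellman backup: best expected one-step return over all actions.
            V_new[s] = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
        delta = max(abs(V_new[s] - V[s]) for s in T)
        V = V_new
        if delta < tol:  # stop once successive value functions agree
            break
    return V


def extract_policy(T, V, gamma=0.9):
    """Greedy policy with respect to V (equation 2)."""
    policy = {}
    for s, actions in T.items():
        if actions:
            policy[s] = max(
                actions,
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in actions[a]),
            )
    return policy


# Hypothetical two-state MDP, used only to exercise the functions.
toy = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)]},
}
V = value_iteration(toy)
pi = extract_policy(toy, V)
```

In the toy problem, state B earns 2 per step forever, so its value approaches 2 / (1 - 0.9) = 20, and the best action in A is `go`, worth 1 + 0.9 * 20 = 19. When the per-step reward is positive, a discount below 1 (or some other stopping condition) is needed, since otherwise the values grow without bound.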


(a) r(s)=-2

(b) r(s)=0.1

(c) r(s)=0.02


(d) r(s)=1

Question 2
[Gbike bicycle rental] You are managing two locations for Gbike. Each day, some number of
customers arrive at each location to rent bicycles. If you have a bike available, you rent it out
and earn INR 10 from Gbike. If you are out of bikes at that location, then the business is lost.
Bikes become available for renting the day after they are returned. To help ensure that bikes
are available where they are needed, you can move them between the two locations overnight,
at a cost of INR 2 per bike moved.
Assumptions: Assume that the numbers of bikes requested and returned at each location are
Poisson random variables. The expected numbers of rental requests are 3 and 4, and of returns
3 and 2, at the first and second locations respectively. No more than 20 bikes can be parked at
either of the locations. You may move a maximum of 5 bikes from one location to the other in one
night. Consider the discount rate to be 0.9.
Formulate the continuing finite MDP, where time steps are days, the state is the number of
bikes at each location at the end of the day, and the actions are the net number of bikes moved
between the two locations overnight.
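
The ingredients of this formulation can be written down in a few lines of Python. The sketch below is only one possible encoding: the sign convention (positive actions move bikes from the first location to the second), the rule that an action is valid only if both locations stay within 0 to 20 bikes after the move, and the helper names are assumptions, not part of the problem statement.

```python
from math import exp, factorial

MAX_BIKES = 20          # parking capacity per location
MAX_MOVE = 5            # bikes that may be moved in one night
RENT_REWARD = 10        # INR earned per rental
MOVE_COST = 2           # INR per bike moved
GAMMA = 0.9             # discount rate
REQUEST_RATE = (3, 4)   # expected rental requests at locations 1 and 2
RETURN_RATE = (3, 2)    # expected returns at locations 1 and 2

# State: (bikes at location 1, bikes at location 2) at the end of the day.
STATES = [(n1, n2) for n1 in range(MAX_BIKES + 1) for n2 in range(MAX_BIKES + 1)]


def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return exp(-lam) * lam ** k / factorial(k)


def valid_actions(state):
    """Net bikes moved from location 1 to 2 (negative means 2 to 1).

    Assumption: a move is allowed only if both locations end up with
    between 0 and MAX_BIKES bikes.
    """
    n1, n2 = state
    return [a for a in range(-MAX_MOVE, MAX_MOVE + 1)
            if 0 <= n1 - a <= MAX_BIKES and 0 <= n2 + a <= MAX_BIKES]


def expected_rental_income(n, lam):
    """E[RENT_REWARD * min(n, X)] for X ~ Poisson(lam), with n bikes on hand."""
    below = sum(poisson_pmf(k, lam) * k for k in range(n))   # demand fully met
    tail = 1.0 - sum(poisson_pmf(k, lam) for k in range(n))  # P(X >= n)
    return RENT_REWARD * (below + n * tail)
```

With this encoding there are 21 x 21 = 441 states and at most 11 actions per state. The expected income saturates quickly: with 20 bikes and a request rate of 3, `expected_rental_income(20, 3)` is essentially 30, since demand almost never exceeds the stock.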

Question 3
Write a program for policy iteration and re-solve the Gbike bicycle rental problem with the
following changes. One of your employees at the first location rides a bus home each night and
lives near the second location. She is happy to shuttle one bike to the second location for free.
Each additional bike still costs INR 2, as do all bikes moved in the other direction. In addition,
you have limited parking space at each location. If more than 10 bikes are kept overnight at a
location (after any moving of bikes), then an additional cost of INR 4 must be incurred to use a
second parking lot (independent of how many bikes are kept there).
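
The two changes affect only the overnight cost, which could be isolated in one function and subtracted from the expected reward inside policy evaluation and policy improvement. The sketch below assumes that a positive action moves bikes from the first location to the second; the function name and constants are hypothetical.

```python
MOVE_COST = 2           # INR per paid bike moved
FREE_MOVES_1_TO_2 = 1   # one bike per night is shuttled from location 1 to 2 for free
PARKING_LIMIT = 10      # bikes that fit in the first parking lot
PARKING_COST = 4        # INR flat fee for using a second lot at a location


def overnight_cost(state, action):
    """Cost in INR of moving `action` bikes from `state` = (n1, n2).

    Assumption: positive `action` moves bikes from location 1 to 2,
    negative from 2 to 1. Parking is charged on the counts after the move.
    """
    n1, n2 = state
    if action > 0:
        paid = max(action - FREE_MOVES_1_TO_2, 0)  # first bike 1 -> 2 is free
    else:
        paid = -action                             # every bike 2 -> 1 is paid
    cost = MOVE_COST * paid
    for bikes in (n1 - action, n2 + action):
        if bikes > PARKING_LIMIT:                  # flat fee, independent of the excess
            cost += PARKING_COST
    return cost
```

For instance, `overnight_cost((12, 5), 2)` is 2: one of the two moved bikes is free, and the first location ends the night with exactly 10 bikes, so no second lot is needed.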


Optimal Policy:
