UNIT-5 PART C
1) Explain the Q function and the Q-learning algorithm, assuming deterministic rewards and actions, with an example.
ans)
https://www.freecodecamp.org/news/an-introduction-to-q-learning-reinforcement-learning-14ac0b4493cc/
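The answer above is only a reference link; as a minimal sketch (not from the source), the deterministic Q-learning rule can be written as Q(s, a) <- r(s, a) + gamma * max_a' Q(delta(s, a), a'), where r is the reward function and delta the deterministic state transition function. The tiny Python example below uses a hypothetical three-state chain (A -> B -> goal G) only to illustrate the update.

# Minimal sketch: tabular Q-learning with deterministic rewards and transitions.
# The environment (states A, B, G and actions left/right) is hypothetical.
GAMMA = 0.9
ACTIONS = ("left", "right")

# Deterministic environment: (state, action) -> (reward, next_state).
# Entering the goal G pays 100; G is absorbing with zero further reward.
ENV = {
    ("A", "right"): (0, "B"),   ("A", "left"): (0, "A"),
    ("B", "right"): (100, "G"), ("B", "left"): (0, "A"),
    ("G", "right"): (0, "G"),   ("G", "left"): (0, "G"),
}

Q = {key: 0.0 for key in ENV}  # Q-table initialised to zero

def update(state, action):
    # Deterministic update: Q(s,a) = r(s,a) + gamma * max_a' Q(delta(s,a), a')
    reward, next_state = ENV[(state, action)]
    Q[(state, action)] = reward + GAMMA * max(Q[(next_state, a)] for a in ACTIONS)

# In real Q-learning the agent experiences (s, a) pairs by acting; in a
# deterministic world, sweeping over all pairs a few times gives the same values.
for _ in range(10):
    for state, action in ENV:
        update(state, action)

print(Q)  # Q[("B","right")] -> 100.0, Q[("A","right")] -> 90.0, Q[("A","left")] -> 81.0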
2) Explain the k-nearest neighbor algorithm for approximating a discrete-valued function f : ℝⁿ → V with pseudocode.
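No answer is recorded for this question; as a rough sketch only, the discrete-valued k-NN rule stores all training examples and classifies a query x_q by a majority vote over the target values of its k nearest neighbours. The Python below is a hypothetical illustration (toy 2-D data, Euclidean distance), not taken from the source.

# Sketch of k-nearest neighbour for a discrete-valued target f: R^n -> V.
# Training is just storing the examples; classification is a majority vote
# among the k stored examples closest to the query under Euclidean distance.
from collections import Counter
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(training, x_q, k=3):
    # training: list of (feature_vector, label) pairs; x_q: query vector.
    nearest = sorted(training, key=lambda ex: euclidean(ex[0], x_q))[:k]
    labels = [label for _, label in nearest]
    return Counter(labels).most_common(1)[0][0]   # most common label wins

# Hypothetical toy data: two classes in a 2-D feature space.
train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((4.0, 4.2), "B"), ((3.8, 4.0), "B")]
print(knn_classify(train, (1.1, 0.9), k=3))  # -> "A"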
3)Compare unsupervised learning and reinforcement learning
with examples.
4) Develop a Q-learning task for the recommendation system of an online shopping website. What will be the environment of the system? Write the cost function and value function for the system.
A Q-learning task has these components: Agent, Environment, State, Action, Reward Function, Value Function and Policy.
1) To simplify the problem, we assume a hypothetical user whose experience on an online shopping store is pooled from all the actual users.
2) Our recommender model is going to be the agent of the system, proposing products to this hypothetical user, who will buy or not buy the recommendation.
3) The user behaves as the system's environment, responding to the system's recommendation depending on the state of the system.
4) User feedback determines our reward: a score of 1 only if the user buys, and 0 otherwise.
5) The agent's action is the product it recommends.
6) Our state is defined as the product features and the corresponding user reactions over the past 5 steps, excluding the current step.
7) Therefore, the feedback and the action together give us the next state.
The goal of the agent is to learn a policy that maximizes
accumulated rewards.
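As a rough sketch of the reward and value functions asked for above, under the assumptions already listed (a reward of 1 on a purchase and 0 otherwise); the discount factor and all names below are illustrative, not from the source:

# Sketch of the reward function and value (discounted return) for the recommender.
GAMMA = 0.9  # assumed discount factor for future purchases

def reward(user_bought):
    # Reward function: 1 only if the user buys the recommended product, else 0.
    return 1.0 if user_bought else 0.0

def discounted_return(rewards):
    # Value of a state under the current policy: discounted sum of future rewards,
    # here estimated from a single observed episode r_0, r_1, ...
    return sum((GAMMA ** t) * r for t, r in enumerate(rewards))

# Example: the hypothetical user buys at steps 1 and 3 of a 5-step episode.
episode = [reward(b) for b in (False, True, False, True, False)]
print(discounted_return(episode))  # 0.9 + 0.9**3 = 1.629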
5)Identify the suitable learning method for training a robotic
arm and explain it.
ans) Industrial robots deployed today across various industries are mostly doing repetitive tasks, basically moving or placing objects along predefined trajectories. But the reality is that the ability of robots to handle different or complex environments is very limited in today's manufacturing. The main challenge we have to overcome is designing control algorithms that can easily adapt to new environments.
Reinforcement learning (RL) is a type of Machine Learning where
we can teach an agent how to behave in an environment by
performing actions and seeing the results.
The concept of reinforcement learning has been around for a while, but early algorithms were not very adaptable and were incapable of handling continuous tasks.
For RL, we use a framework called the Markov Decision Process (MDP), which provides a simple framework for a really complex problem. An agent (e.g. a robotic arm) first observes the environment it is in and then takes actions accordingly. Rewards are given out according to the result.
For robotic control, the state is measured using sensors that record the joint angles, joint velocities, and the end-effector pose.
Policy
The main objective is to find a policy. A policy is something that tells us how to act in a particular state. The objective is to find a policy that makes the most rewarding decisions.
Putting the objective together: we want to find a sequence of actions that maximizes the expected reward, or equivalently minimizes the cost.
Q-Learning
Q-learning is a model-free reinforcement learning algorithm, which means that it does not require a model of the environment. It is especially effective because it can handle problems with stochastic transitions and rewards without requiring adaptations. The most common Q-learning method repeats these steps (a sketch of the loop follows the list):
1. From the current state, choose an action, typically the one with the highest Q-value, with some exploration.
2. Perform the action and observe the reward and the next state.
3. Update the Q-value of that state-action pair using the observed reward and the highest Q-value of the next state.
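A minimal sketch of that loop in Python, assuming a hypothetical environment object with Gym-style reset()/step() methods and discrete states and actions (these interfaces and parameter values are assumptions, not from the source):

# Sketch of the Q-learning loop described above. Assumed interface:
# env.reset() -> state, env.step(action) -> (next_state, reward, done).
import random
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)], default 0.0
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # 1. Choose an action: mostly the greedy one, sometimes explore.
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            # 2. Perform it and observe the reward and the next state.
            next_state, reward, done = env.step(action)
            # 3. Move Q(s,a) towards reward + gamma * best Q of the next state.
            best_next = max(Q[(next_state, a)] for a in actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q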
7) How does the Q function learn with and without complete knowledge of the reward function and the state transition function?
Q-learning is a model-free reinforcement learning algorithm, which means that it does not require a model of the environment. It is especially effective because it can handle problems with stochastic transitions and rewards without requiring adaptations.
Q-learning is an off-policy learner. This means it learns the value of the optimal policy independently of the agent's actions. On the other hand, an on-policy learner learns the value of the policy being carried out by the agent, including the exploration steps, and it will find a policy that is optimal taking into account the exploration inherent in that policy.
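To make the contrast concrete: with complete knowledge of the reward function r(s, a) and transition function delta(s, a), the Q function can be computed offline by iterating the Bellman equation; without that knowledge, Q-learning estimates the same values from observed (s, a, r, s') transitions. The Python below is a rough sketch for a small deterministic MDP; the dictionaries R and T and all parameter values are assumptions for illustration.

# Rough sketch contrasting the two cases (deterministic finite MDP for simplicity).
# R[(s, a)] = reward, T[(s, a)] = next state; both are hypothetical inputs.
GAMMA = 0.9

def q_with_model(states, actions, R, T, sweeps=50):
    # Complete knowledge: iterate Q(s,a) = R(s,a) + gamma * max_a' Q(T(s,a), a').
    Q = {(s, a): 0.0 for s in states for a in actions}
    for _ in range(sweeps):
        for s in states:
            for a in actions:
                Q[(s, a)] = R[(s, a)] + GAMMA * max(Q[(T[(s, a)], b)] for b in actions)
    return Q

def q_update_without_model(Q, s, a, r, s_next, actions, alpha=0.1):
    # No model: the same target is estimated from one observed transition (s, a, r, s').
    target = r + GAMMA * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])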
8) How does setting up a Reinforcement Learning problem require an understanding of the following parameters of the problem?
(a) Delayed reward
(b) Exploring unknown states and actions, or exploiting already learned ones
(c) The number of old states that should be considered when deciding an action
Ans. (a) Delayed reward:
In the general case of the reinforcement learning problem, the
agent's actions determine not only its immediate reward, but
also the next state of the environment. The agent must take into
account the next state as well as the immediate reward when it
decides which action to take. The model of long-run optimality
the agent is using determines exactly how it should take the
value of the future into account. The agent will have to be able
to learn from delayed reinforcement: it may take a long
sequence of actions, receiving insignificant reinforcement, then
finally arrive at a state with high reinforcement. The agent must
be able to learn which of its actions are desirable based on
reward that can take place arbitrarily far in the future.
(b) The agent has to explore in order to discover states and actions that potentially yield higher rewards in the future, or exploit the state that yields the highest reward based on its existing knowledge. Pure exploration degrades the agent's learning but increases the flexibility of the agent to adapt in a dynamic environment. On the other hand, pure exploitation drives the agent's learning process to locally optimal solutions.
(c) The state is the current board position, the actions are the different places in which you can place an 'X' or 'O' in a game of Tic Tac Toe, and the reward is +1 or -1 depending on whether you win or lose the game. The "state space" is the total number of possible states in a particular RL setup. Tic Tac Toe has a small enough state space (one reasonable estimate being 593) that we can actually remember a value for each individual state, using a table; this is called a tabular method for this reason. For games like chess we use value function approximation, as the total number of possibilities is around 10^49.