Contents
1 Introduction
2 Background
3 Literature Survey
  3.1 TD-Gammon Neural Network
  3.2 Neural Fitted Q-Iteration
4 Method
5 Model Architecture
6 Modifications
7 Experiments and Results
  7.1 Pong
  7.2 Breakout
  7.3 Boxing
8 Discussion and Future Work
References
1 Introduction
Controlling agents with high dimensional sensory inputs is a long-standing challenge
of Reinforcement Learning (RL). Deep Learning (DL) has made it possible to
extract high-level features from raw sensory data by utilizing a wide range of neural
network architectures (convolutional networks, multi-layer perceptrons, restricted Boltzmann
machines and recurrent neural networks). Applying similar techniques in
RL, although potentially very beneficial when dealing with sensory data, can be quite
challenging from a DL perspective. Two of these challenges are listed below:
• Typically, DL requires large amounts of hand-labelled data, whereas RL requires
us to learn from a scalar reward signal which can be sparse, noisy and delayed.
The delay between actions and rewards can be quite long, compared to the
direct, instantaneous association between inputs and targets in conventional DL
applications.
• Another major concern is that the data in DL is almost always assumed to be
independent, whereas in RL, the states can be highly correlated. Moreover, this
data distribution is subject to change as our RL algorithm learns, whereas DL
assumes a fixed underlying data distribution.
The paper (Mnih et al., 2013) demonstrates how all of the above listed challenges can be
overcome using a convolutional neural network to learn successful control policies from
sensory input data in complex reinforcement learning environments. This convolutional
neural network is trained with a variant of the Q-learning algorithm, and stochastic gradient
descent is used to update the weights. An experience replay mechanism is implemented
to deal with the correlations in the data. The methods are applied primarily to
the Atari 2600 game Pong. Results are also obtained on the Atari games Breakout
and Boxing.
2 Background
In Reinforcement Learning (RL), the agent interacts with the environment in a sequence
of actions, observations and rewards. At each time-step, the agent chooses an action at
from the set of legal game actions, A = {1, ..., K}, which modifies the internal state of
the emulator. This is observed in the form of an image xt , which is a vector of raw pixel
values representing the current screen. In addition it receives a reward rt representing
the change in game score.
Since it is impossible to fully understand the current situation from only the current
screen xt , we consider sequences of actions and observations, st = (x1 , a1 , x2 , ..., at−1 , xt ),
and learn game strategies that depend upon these sequences. Each of these sequences
is assumed to terminate in a finite number of time steps. This formalism leads to a
large Markov Decision Process (MDP) in which the goal of the agent is to learn the
actions that maximize future rewards.
The future discounted return at time step t with a discounting factor γ per time-step, is
defined as:
R_t = Σ_{t′=t}^{T} γ^{t′−t} r_{t′}
where T is the time step at which the game terminates.
The optimal action-value function, defined as the maximum expected return achievable
by following any strategy, after seeing some sequence s and then taking some action a, is
given by:
Q∗(s, a) = max_π E[R_t | s_t = s, a_t = a, π]
where π is a policy mapping sequences to actions.
In RL problems such as learning to play Atari games, where the input state space is
extremely large, it is not feasible to maintain a tabular estimate of the action-value
function for each state-action pair. Instead, a function approximator is used to estimate
the action-value function, Q(s, a; θ) ≈ Q∗(s, a). Here, a neural network function
approximator (referred to as a Q-network) is trained by minimising a sequence of loss
functions L_i(θ_i) that change at each iteration i:
L_i(θ_i) = E_{s,a∼ρ(·)}[(y_i − Q(s, a; θ_i))^2]
where θ_i are the weights, y_i is the target for the i-th iteration and ρ(s, a) is the probability
(behaviour) distribution over sequences s and actions a.
This loss function is optimised using stochastic gradient descent. If the weights are
updated after every time-step, and the expectations are replaced by single samples from
the behaviour distribution and the emulator, the algorithm reduces to the familiar Q-learning
update. This algorithm is model-free and off-policy.
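As a concrete illustration of this single-sample update, the sketch below performs one Q-learning gradient step with a simple linear function approximator. It is a minimal example written for this report (not code from Mnih et al., 2013); the feature size, action count, learning rate and discount factor are arbitrary illustrative choices.

import numpy as np

# Single-sample Q-learning update with a linear approximator Q(s, a; theta) = theta[a] . phi(s).
n_features, n_actions = 8, 4
gamma, alpha = 0.99, 0.01                              # discount and learning rate (illustrative)
theta = np.zeros((n_actions, n_features))              # weights theta_i

def q_values(phi_s, theta):
    """Q(s, a; theta) for all actions, given the feature vector phi_s."""
    return theta @ phi_s

def q_learning_step(phi_s, a, r, phi_s_next, terminal, theta):
    """One stochastic gradient step on the squared loss (y - Q(s, a; theta))^2."""
    y = r if terminal else r + gamma * np.max(q_values(phi_s_next, theta))
    td_error = y - q_values(phi_s, theta)[a]
    theta[a] += alpha * td_error * phi_s               # negative gradient of the squared loss
    return td_error

# Random feature vectors stand in for transitions sampled from the emulator.
rng = np.random.default_rng(0)
phi_s, phi_s_next = rng.normal(size=n_features), rng.normal(size=n_features)
q_learning_step(phi_s, a=1, r=1.0, phi_s_next=phi_s_next, terminal=False, theta=theta)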
3 Literature Survey
3.1 TD-Gammon Neural Network
Temporal difference methods are based on the idea that learning relies on the difference
between temporally successive predictions, with the goal of making the learner's
prediction for the current input closely match the prediction at the subsequent time step.
The TD-Gammon neural network (Tesauro, 1994) starts with little knowledge of the
game of backgammon and learns to play at master level. Organised as a standard
multi-layer perceptron architecture, TD-Gammon was designed to learn complex non-linear
functions. The neural network observes a sequence of board positions, which are
fed in as input vectors. For each of these input vectors, the neural network generates an
output vector indicating the expected outcome of that particular board position. The weight
update at each time step is done using the TD(λ) algorithm as follows:
w_{t+1} − w_t = α (Y_{t+1} − Y_t) Σ_{k=1}^{t} λ^{t−k} ∇_w Y_k
where α is the learning rate, w is the weight vector, Y_k is the output of the network at
time step k and ∇_w Y_k is the gradient of the network output with respect to the weights.
When the game ends, a reward r is assigned and the final weight update uses (r − Y_t)
instead of (Y_{t+1} − Y_t).
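As an illustration, this update can be computed incrementally with an eligibility trace that accumulates the λ-discounted gradients. The sketch below is our own, written for a linear evaluator Y_t = w · x_t (so that ∇_w Y_t = x_t); it is not Tesauro's original code, and the values of α and λ are arbitrary.

import numpy as np

alpha, lam = 0.1, 0.7                                 # learning rate and trace decay (illustrative)

def td_lambda_episode(observations, w, final_reward):
    """Run the TD(lambda) weight updates over one episode of board feature vectors."""
    trace = np.zeros_like(w)                          # sum_k lambda^(t-k) grad_w Y_k
    y_prev = None
    for x_t in observations:
        y_t = w @ x_t                                 # network output Y_t (linear here)
        if y_prev is not None:
            w += alpha * (y_t - y_prev) * trace       # w_{t+1} - w_t from the formula above
        trace = lam * trace + x_t                     # fold grad_w Y_t into the trace
        y_prev = y_t
    w += alpha * (final_reward - y_prev) * trace      # terminal update uses (r - Y_t)
    return w

# Example: a short episode of random board features.
rng = np.random.default_rng(0)
boards = [rng.normal(size=10) for _ in range(5)]
w = td_lambda_episode(boards, np.zeros(10), final_reward=1.0)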
3.2 Neural Fitted Q-Iteration
The Neural Fitted Q Iteration algorithm (NFQ) (Riedmiller, 2005) belongs to the family
of fitted value iteration algorithms and can be seen as a special form of the experience
replay technique, with the Q-function represented by a multi-layer perceptron. The
updates are performed off-line by considering an entire set of transition experiences of
the form (s, a, s′), where s is the original state, a is the action and s′ is the successor
state. RPROP, a supervised learning method for batch training, is used to fit the
network. The target is calculated as the sum of the transition cost and the expected
minimal path cost from the state s′, which is computed on the basis of the current
estimate of the Q-function. This memory-based method of training, which stores and
reuses all transition experiences, makes the neural learning process quite data efficient
and reliable.
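The sketch below illustrates one NFQ outer iteration as we understand it: build a supervised training set over the entire stored experience, using the current Q estimate for the bootstrap term, and then refit the network in batch mode. The q_net.predict / q_net.fit interface, the explicit cost term and the discount factor are our own assumptions standing in for any batch regressor (a multi-layer perceptron trained with RPROP in the original work).

def nfq_iteration(experience, q_net, n_actions, gamma=0.95):
    """One fitted Q iteration over a stored set of (s, a, cost, s', terminal) experiences."""
    inputs, targets = [], []
    for s, a, cost, s_next, terminal in experience:
        # Expected minimal path cost from s', under the current Q estimate.
        q_next = min(q_net.predict(s_next, b) for b in range(n_actions))
        target = cost if terminal else cost + gamma * q_next
        inputs.append((s, a))
        targets.append(target)
    q_net.fit(inputs, targets)        # batch supervised regression on the fixed targets
    return q_net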
4 Method
In this project, we use the algorithm Deep Q-Learning with Experience Replay
to learn to play Atari games. We store the agent's experiences at each time-step,
e_t = (s_t, a_t, r_t, s_{t+1}), in a data-set D = {e_1, ..., e_N}, pooled over many episodes into a replay
memory. During the inner loop of the algorithm, we apply Q-learning minibatch updates
to samples e ∼ D drawn at random from the pool of stored transitions. After performing
the experience replay update, the agent selects and executes an action according to an ε-greedy policy. The
Q-function works on a fixed-length representation of histories produced by a function φ.
Algorithm: Deep Q-Learning with Experience Replay
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights
for episode = 1 to M do
    Initialise sequence s_1 = {x_1} and preprocessed sequence φ_1 = φ(s_1)
    for t = 1 to T do
        With probability ε select a random action a_t,
        otherwise select a_t = argmax_a Q(φ(s_t), a; θ)
        Execute action a_t in the emulator and observe reward r_t and image x_{t+1}
        Set s_{t+1} = (s_t, a_t, x_{t+1}) and preprocess φ_{t+1} = φ(s_{t+1})
        Store transition (φ_t, a_t, r_t, φ_{t+1}) in D
        Sample random minibatch of transitions (φ_j, a_j, r_j, φ_{j+1}) from D
        Set y_j = r_j                                  for terminal φ_{j+1}
            y_j = r_j + γ max_{a′} Q(φ_{j+1}, a′; θ)   for non-terminal φ_{j+1}
        Perform a gradient descent step on (y_j − Q(φ_j, a_j; θ))^2
    end for
end for
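The sketch below shows how the inner loop above maps onto PyTorch: a replay memory D, ε-greedy action selection and the minibatch gradient step on (y_j − Q(φ_j, a_j; θ))^2. It is an illustration rather than our exact training script; q_net and optimizer stand for any Q-network and optimiser, and the capacity, batch size and γ are assumed hyper-parameters.

import random
from collections import deque

import torch
import torch.nn.functional as F

class ReplayMemory:
    """Replay memory D = {e_1, ..., e_N}; states are float tensors of shape (4, 84, 84)."""
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, phi_t, a_t, r_t, phi_next, done):
        self.buffer.append((phi_t, a_t, r_t, phi_next, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        phi, a, r, phi_next, done = zip(*batch)
        return (torch.stack(phi), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
                torch.stack(phi_next), torch.tensor(done, dtype=torch.float32))

def epsilon_greedy(q_net, phi_t, epsilon, n_actions):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(phi_t.unsqueeze(0)).argmax(dim=1))

def dqn_update(q_net, optimizer, memory, batch_size=32, gamma=0.99):
    """One minibatch gradient descent step of the algorithm above."""
    phi, a, r, phi_next, done = memory.sample(batch_size)
    q_sa = q_net(phi).gather(1, a.unsqueeze(1)).squeeze(1)                # Q(phi_j, a_j; theta)
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q_net(phi_next).max(dim=1).values  # y_j (just r_j if terminal)
    loss = F.mse_loss(q_sa, y)                                            # (y_j - Q(phi_j, a_j; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()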
5 Model Architecture
Raw Atari frames are 210 × 160 pixel images with a 128-colour palette. These raw frames
are preprocessed by gray-scaling, down-sampling and cropping into an (84 × 84) image.
We have used the OpenAI Gym environment PongNoFrameskip-v4. For the experiments
here, this preprocessing is applied to the last 4 frames of a history, which are stacked together
to produce the (4 × 84 × 84) input to the Q-network.
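A minimal sketch of this preprocessing, written with OpenCV for illustration (the intermediate 110-pixel resize and the exact crop offsets are our own assumptions, not necessarily the precise values used in our wrappers):

import collections

import cv2
import numpy as np

def preprocess_frame(frame):
    """Raw (210, 160, 3) Atari RGB frame -> (84, 84) gray-scale image."""
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    resized = cv2.resize(gray, (84, 110), interpolation=cv2.INTER_AREA)   # down-sample
    return resized[18:102, :]                                             # crop to the playing area

class FrameStack:
    """Keeps the last k processed frames and stacks them into the network input."""
    def __init__(self, k=4):
        self.frames = collections.deque(maxlen=k)

    def reset(self, frame):
        processed = preprocess_frame(frame)
        for _ in range(self.frames.maxlen):
            self.frames.append(processed)
        return np.stack(self.frames)                  # (4, 84, 84)

    def step(self, frame):
        self.frames.append(preprocess_frame(frame))
        return np.stack(self.frames)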
The way we have parameterized the Q-function is by using an architecture in which there
is a separate output unit for each possible action, and only the state representation is an
input to the neural network. The outputs correspond to the predicted Q-values of the
individual actions for the input state.
The (4×84×84) input image to the neural network is acted upon by the first hidden layer,
consisting of 32 (8 × 8) convolutional filters with stride 4, followed by the application of
a rectifier nonlinearity. The second hidden layer convolves 64 (4 × 4) filters with stride
2, again followed by a rectifier nonlinearity. The next hidden layer convolves 64 (3 × 3)
filters with stride 1, again followed by ReLU. The final hidden layer is fully-connected,
consisting of (512×(64×7×7)) linear weights and 512 rectifier units. The output layer is
a fully-connected linear layer with a single output for each valid action. The valid actions
in the case of Pong were the following 6 actions: [‘NOOP’, ‘FIRE’, ‘RIGHT’, ‘LEFT’,
‘RIGHTFIRE’, ‘LEFTFIRE’].
Figure 1: Deep Q-Network (DQN) Architecture
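The architecture described above can be written as the following PyTorch module; the layer sizes follow the text, while the default action count (6, for Pong) and the scaling of raw pixels to [0, 1] are our own conveniences.

import torch
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, n_actions=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),   # -> (32, 20, 20)
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),  # -> (64, 9, 9)
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),  # -> (64, 7, 7)
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),                  # 512 x (64 x 7 x 7) weights
            nn.Linear(512, n_actions),                              # one Q-value per valid action
        )

    def forward(self, x):
        return self.head(self.features(x / 255.0))                  # scale raw pixels to [0, 1]

q_net = DQN()
print(q_net(torch.zeros(1, 4, 84, 84)).shape)                       # torch.Size([1, 6])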
6 Modifications
Q-Learning algorithms suffer from substantial over-estimation of Q-values because they
include a maximization step over estimated Q-values. These over-estimations not only
increase the number of training episodes required to reach optimal policies, but can
even lead to sub-optimal policies asymptotically (Thrun and Schwartz, 1993). We propose
using a Double Deep Q-Learning Network (DDQN) to tackle this problem. DDQN
uses two separate Q-value estimators (two neural networks): a target model (Q′) for action
selection and a primary model (Q) for action evaluation. Using these independent estimators,
we can avoid the maximization bias by disentangling our updates from biased
estimates.
For updating the model Q, we use the following target values:
Q∗(s_t, a_t) ≈ r_t + γ Q(s_{t+1}, argmax_{a′} Q′(s_{t+1}, a′))
And then perform a gradient descent step on Q using:
Loss = (Q∗(s_t, a_t) − Q(s_t, a_t))^2
To update the target model, we periodically set its weights equal to the weights of the
primary model (Q).
Algorithm: Double Deep Q-Learning
Initialize primary network Q_θ, target network Q_θ′, replay buffer D, τ ≪ 1
for each iteration do
    for each environment step do
        Observe state s_t and select a_t ∼ π(a_t | s_t)
        Execute a_t and observe next state s_{t+1} and reward r_t = R(s_t, a_t)
        Store (s_t, a_t, r_t, s_{t+1}) in replay buffer D
    for each update step do
        Sample e_t = (s_t, a_t, r_t, s_{t+1}) ∼ D
        Compute target Q value:
            Q∗(s_t, a_t) ≈ r_t + γ Q_θ(s_{t+1}, argmax_{a′} Q_θ′(s_{t+1}, a′))
        Perform gradient descent step on (Q∗(s_t, a_t) − Q_θ(s_t, a_t))^2
        Update target network parameters:
            θ′ ← τ ∗ θ + (1 − τ) ∗ θ′
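A minimal PyTorch sketch of this update (our own illustration, not the exact training code; the default γ and τ are assumptions): the target network Q′ selects the greedy next action, the primary network Q evaluates it, and the target weights are then soft-updated towards the primary weights.

import torch
import torch.nn.functional as F

def ddqn_update(q_net, target_net, optimizer, batch, gamma=0.99, tau=0.005):
    """One DDQN update from a minibatch (s, a, r, s', done) of replay-buffer tensors."""
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)             # Q(s_t, a_t)
    with torch.no_grad():
        a_star = target_net(s_next).argmax(dim=1, keepdim=True)      # argmax_a' Q'(s_{t+1}, a')
        q_eval = q_net(s_next).gather(1, a_star).squeeze(1)          # evaluate with the primary model
        y = r + gamma * (1.0 - done) * q_eval                        # target Q*(s_t, a_t)
    loss = F.mse_loss(q_sa, y)                                       # (Q*(s_t, a_t) - Q(s_t, a_t))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Soft target update: theta' <- tau * theta + (1 - tau) * theta'
    with torch.no_grad():
        for p, p_tgt in zip(q_net.parameters(), target_net.parameters()):
            p_tgt.mul_(1.0 - tau).add_(tau * p)
    return loss.item()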
7 Experiments and Results
7.1 Pong
Both DQN and DDQN are able to achieve the SOTA score of 20, with the average reward
over the last 10 episodes being 19.6 for DQN and 18.6 for DDQN. DQN takes around
600 episodes of training to achieve peak performance, while DDQN achieves the same in
200 episodes. Playing at various model checkpoints reveals how good our agent is after a
certain number of episodes.
Figure 2: Pong-DQN.
Figure 3: Pong-DDQN.
7.2 Breakout
Both DQN and DDQN show steady improvement in the incurred reward as they train. However,
due to a lack of time and computational resources, we were not able to reach the SOTA
score. The test performance clearly shows the agent's ability to play well.
Figure 4: Breakout-DQN.
Figure 5: Breakout-DDQN.
7.3 Boxing
The agent did not achieve significant breakthroughs in just 400 episodes of training, although
it maintains an average score close to 0, which implies that it has learned to play at least as
well as the opponent. In the few tests that we ran, the agent beat the opponent.
Figure 6: Boxing-DQN.
8 Discussion and Future Work
• Using a Deep Q-Learning model with an experience replay memory and only raw
pixels as input, we were able to achieve the SOTA score on Atari Pong.
• The over-optimistic nature of Q-Learning models can affect their performance. A
Double Deep Q-Learning model was used to successfully reduce this overoptimism,
resulting in more stable and faster learning with no adjustment of the architecture
or hyper-parameters.
• The same DQN and DDQN networks were used on the Atari games Breakout and Boxing,
with decent results considering the limited time and computational resources.
• Further improvements can be made by using Prioritized Replay instead of uniform
random sampling from the memory buffer. Experiences (transitions) that have a higher
TD-error, and thus a higher impact on learning, will have a higher chance of being
sampled, making training faster and more efficient (see the sketch below).
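As a sketch of the proportional prioritization mentioned in the last point (our own illustration; α and the small ε constant are assumed hyper-parameters), transitions would be sampled with probability proportional to a power of their absolute TD-error:

import numpy as np

def prioritized_sample(td_errors, batch_size, alpha=0.6, eps=1e-5):
    """Sample minibatch indices with P(i) = p_i^alpha / sum_k p_k^alpha, where p_i = |delta_i| + eps."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# Example: the transition with the largest TD-error dominates the sample.
indices = prioritized_sample(np.array([0.1, 0.5, 2.0, 0.05]), batch_size=32)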
References
V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and
M. Riedmiller. Playing Atari with Deep Reinforcement Learning. arXiv e-prints, art.
arXiv:1312.5602, Dec. 2013.
M. Riedmiller. Neural fitted q iteration – first experiences with a data efficient neural re-
inforcement learning method. In J. Gama, R. Camacho, P. B. Brazdil, A. M. Jorge, and
L. Torgo, editors, Machine Learning: ECML 2005, pages 317–328, Berlin, Heidelberg,
2005. Springer Berlin Heidelberg. ISBN 978-3-540-31692-3.
G. Tesauro. TD-Gammon, a Self-Teaching Backgammon Program, Achieves Master-Level
Play. Neural Computation, 6(2):215–219, 03 1994. ISSN 0899-7667. doi: 10.1162/neco.
1994.6.2.215. URL https://doi.org/10.1162/neco.1994.6.2.215.
S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement
learning. In M. C. Mozer, P. Smolensky, D. S. Touretzky, J. L. Elman, and A. S. Weigend,
editors, Proceedings of the 1993 Connectionist Models Summer School. Erlbaum Associates, June 1993.