time than previous GPU-based algorithms, using far less resource than massively distributed approaches. The best of the proposed methods, asynchronous advantage actor-critic (A3C), also mastered a variety of continuous motor control tasks as well as learned general strategies for exploring 3D mazes purely from visual inputs. We believe that the success of A3C on both 2D and 3D games, discrete and continuous action spaces, as well as its ability to train feedforward and recurrent agents, makes it the most general and successful reinforcement learning agent to date.

2. Related Work

The General Reinforcement Learning Architecture (Gorila) of (Nair et al., 2015) performs asynchronous training of reinforcement learning agents in a distributed setting. In Gorila, each process contains an actor that acts in its own copy of the environment, a separate replay memory, and a learner that samples data from the replay memory and computes gradients of the DQN loss (Mnih et al., 2015) with respect to the policy parameters. The gradients are asynchronously sent to a central parameter server which updates a central copy of the model. The updated policy parameters are sent to the actor-learners at fixed intervals. By using 100 separate actor-learner processes and 30 parameter server instances, a total of 130 machines, Gorila was able to significantly outperform DQN over 49 Atari games. On many games Gorila reached the score achieved by DQN over 20 times faster than DQN. We also note that a similar way of parallelizing DQN was proposed by (Chavez et al., 2015).

In earlier work, (Li & Schuurmans, 2011) applied the MapReduce framework to parallelizing batch reinforcement learning methods with linear function approximation. Parallelism was used to speed up large matrix operations but not to parallelize the collection of experience or stabilize learning. (Grounds & Kudenko, 2008) proposed a parallel version of the Sarsa algorithm that uses multiple separate actor-learners to accelerate training. Each actor-learner learns separately and periodically sends updates to weights that have changed significantly to the other learners using peer-to-peer communication.

(Tsitsiklis, 1994) studied convergence properties of Q-learning in the asynchronous optimization setting. These results show that Q-learning is still guaranteed to converge when some of the information is outdated, as long as outdated information is always eventually discarded and several other technical assumptions are satisfied. Even earlier, (Bertsekas, 1982) studied the related problem of distributed dynamic programming.

Another related area of work is in evolutionary methods, which are often straightforward to parallelize by distributing fitness evaluations over multiple machines or threads (Tomassini, 1999). Such parallel evolutionary approaches have recently been applied to some visual reinforcement learning tasks. In one example, (Koutník et al., 2014) evolved convolutional neural network controllers for the TORCS driving simulator by performing fitness evaluations on 8 CPU cores in parallel.

3. Reinforcement Learning Background

We consider the standard reinforcement learning setting where an agent interacts with an environment $\mathcal{E}$ over a number of discrete time steps. At each time step $t$, the agent receives a state $s_t$ and selects an action $a_t$ from some set of possible actions $\mathcal{A}$ according to its policy $\pi$, where $\pi$ is a mapping from states $s_t$ to actions $a_t$. In return, the agent receives the next state $s_{t+1}$ and a scalar reward $r_t$. The process continues until the agent reaches a terminal state, after which the process restarts. The return $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$ is the total accumulated return from time step $t$ with discount factor $\gamma \in (0, 1]$. The goal of the agent is to maximize the expected return from each state $s_t$.
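To make the definition of the return concrete, the following minimal sketch (ours, not part of the paper) computes $R_t$ for every step of a finite episode; the rewards and discount factor used in the example are illustrative.

```python
# Discounted return R_t = sum_k gamma^k r_{t+k}, computed for a finite episode
# by a single backwards pass. Rewards and gamma below are made up for the example.
def discounted_returns(rewards, gamma):
    """Return R_t for every time step t of a finite episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

if __name__ == "__main__":
    rewards = [0.0, 0.0, 1.0, 0.0, 2.0]   # illustrative episode
    print(discounted_returns(rewards, gamma=0.99))
```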
The action value $Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a]$ is the expected return for selecting action $a$ in state $s$ and following policy $\pi$. The optimal value function $Q^*(s, a) = \max_{\pi} Q^{\pi}(s, a)$ gives the maximum action value for state $s$ and action $a$ achievable by any policy. Similarly, the value of state $s$ under policy $\pi$ is defined as $V^{\pi}(s) = \mathbb{E}[R_t \mid s_t = s]$ and is simply the expected return for following policy $\pi$ from state $s$.

In value-based model-free reinforcement learning methods, the action value function is represented using a function approximator, such as a neural network. Let $Q(s, a; \theta)$ be an approximate action-value function with parameters $\theta$. The updates to $\theta$ can be derived from a variety of reinforcement learning algorithms. One example of such an algorithm is Q-learning, which aims to directly approximate the optimal action value function: $Q^*(s, a) \approx Q(s, a; \theta)$. In one-step Q-learning, the parameters $\theta$ of the action value function $Q(s, a; \theta)$ are learned by iteratively minimizing a sequence of loss functions, where the $i$th loss function is defined as

$$L_i(\theta_i) = \mathbb{E}\left[\left( r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) - Q(s, a; \theta_i) \right)^2\right],$$

where $s'$ is the state encountered after state $s$.

We refer to the above method as one-step Q-learning because it updates the action value $Q(s, a)$ toward the one-step return $r + \gamma \max_{a'} Q(s', a'; \theta)$. One drawback of using one-step methods is that obtaining a reward $r$ only directly affects the value of the state-action pair $s, a$ that led to the reward. The values of other state-action pairs are affected only indirectly through the updated value $Q(s, a)$. This can make the learning process slow, since many updates are required to propagate a reward to the relevant preceding states and actions.
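As a concrete reading of the loss above, the sketch below (ours, not the authors' code) computes the one-step Q-learning target and squared error for a single transition; the inputs `q_sa` and `next_q` are illustrative stand-ins for the outputs of the current and older parameter copies.

```python
import numpy as np

# One-step Q-learning loss for a single transition:
#   (r + gamma * max_a' Q(s', a'; theta_{i-1}) - Q(s, a; theta_i))^2
def one_step_q_loss(q_sa, next_q, reward, terminal, gamma=0.99):
    """q_sa: Q(s, a) under the current parameters; next_q: Q(s', .) under the older parameters."""
    target = reward if terminal else reward + gamma * np.max(next_q)
    td_error = target - q_sa
    return 0.5 * td_error ** 2, td_error

loss, td = one_step_q_loss(q_sa=1.2, next_q=np.array([0.5, 1.7, 0.1]),
                           reward=1.0, terminal=False)
print(loss, td)
```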
One way of propagating rewards faster is by using n-step returns (Watkins, 1989; Peng & Williams, 1996). In n-step Q-learning, $Q(s, a)$ is updated toward the n-step return defined as $r_t + \gamma r_{t+1} + \cdots + \gamma^{n-1} r_{t+n-1} + \max_{a} \gamma^{n} Q(s_{t+n}, a)$. This results in a single reward $r$ directly affecting the values of $n$ preceding state-action pairs. This makes the process of propagating rewards to relevant state-action pairs potentially much more efficient.
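A small sketch of the n-step target described above, assuming the rewards $r_t, \dots, r_{t+n-1}$ and the bootstrap values $Q(s_{t+n}, \cdot)$ are already available; the function and argument names are illustrative, not from the paper.

```python
import numpy as np

def n_step_q_target(rewards, bootstrap_q, gamma=0.99):
    """Compute r_t + gamma*r_{t+1} + ... + gamma^{n-1}*r_{t+n-1} + gamma^n * max_a Q(s_{t+n}, a)."""
    n = len(rewards)
    target = (gamma ** n) * np.max(bootstrap_q)
    for i, r in enumerate(rewards):
        target += (gamma ** i) * r
    return target

# Example with n = 3 illustrative rewards and bootstrap action values.
print(n_step_q_target([0.0, 1.0, 0.0], bootstrap_q=np.array([0.2, 0.9])))
```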
In contrast to value-based methods, policy-based model-free methods directly parameterize the policy $\pi(a|s; \theta)$ and update the parameters $\theta$ by performing, typically approximate, gradient ascent on $\mathbb{E}[R_t]$. One example of such a method is the REINFORCE family of algorithms due to Williams (1992). Standard REINFORCE updates the policy parameters $\theta$ in the direction $\nabla_{\theta} \log \pi(a_t|s_t; \theta) R_t$, which is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$. It is possible to reduce the variance of this estimate while keeping it unbiased by subtracting a learned function of the state $b_t(s_t)$, known as a baseline (Williams, 1992), from the return. The resulting gradient is $\nabla_{\theta} \log \pi(a_t|s_t; \theta) \left( R_t - b_t(s_t) \right)$.

A learned estimate of the value function is commonly used as the baseline $b_t(s_t) \approx V^{\pi}(s_t)$, leading to a much lower variance estimate of the policy gradient. When an approximate value function is used as the baseline, the quantity $R_t - b_t$ used to scale the policy gradient can be seen as an estimate of the advantage of action $a_t$ in state $s_t$, or $A(a_t, s_t) = Q(a_t, s_t) - V(s_t)$, because $R_t$ is an estimate of $Q^{\pi}(a_t, s_t)$ and $b_t$ is an estimate of $V^{\pi}(s_t)$. This approach can be viewed as an actor-critic architecture where the policy $\pi$ is the actor and the baseline $b_t$ is the critic (Sutton & Barto, 1998; Degris et al., 2012).
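The sketch below illustrates the REINFORCE-with-baseline direction $\nabla_\theta \log \pi(a_t|s_t; \theta)(R_t - b_t(s_t))$ for a simple linear softmax policy over discrete actions; the linear features and scalar baseline are illustrative stand-ins, not the architecture used in the paper.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_with_baseline_grad(theta, features, action, R_t, baseline):
    """Gradient of log pi(a_t|s_t; theta) scaled by the advantage estimate (R_t - baseline).

    theta: (num_actions, num_features) parameters of a linear softmax policy.
    features: feature vector for state s_t (illustrative).
    """
    probs = softmax(theta @ features)            # pi(.|s_t; theta)
    grad_log_pi = -np.outer(probs, features)     # -pi(a')*phi(s) for every action a'
    grad_log_pi[action] += features              # indicator term for the action actually taken
    return grad_log_pi * (R_t - baseline)        # scale by the advantage estimate

theta = np.zeros((3, 4))
g = reinforce_with_baseline_grad(theta, np.array([1.0, 0.5, -0.2, 0.3]),
                                 action=1, R_t=2.0, baseline=0.7)
print(g)
```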
4. Asynchronous RL Framework

We now present multi-threaded asynchronous variants of one-step Sarsa, one-step Q-learning, n-step Q-learning, and advantage actor-critic. The aim in designing these methods was to find RL algorithms that can train deep neural network policies reliably and without large resource requirements. While the underlying RL methods are quite different, with actor-critic being an on-policy policy search method and Q-learning being an off-policy value-based method, we use two main ideas to make all four algorithms practical given our design goal.

First, we use asynchronous actor-learners, similarly to the Gorila framework (Nair et al., 2015), but instead of using separate machines and a parameter server, we use multiple CPU threads on a single machine. Keeping the learners on a single machine removes the communication costs of sending gradients and parameters and enables us to use Hogwild! (Recht et al., 2011) style updates for training.
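As a rough illustration of Hogwild!-style training on a single machine, the sketch below has several Python threads apply gradient updates to a shared NumPy parameter vector without any locking. The toy quadratic objective, thread count and learning rate are illustrative, and the real benefit of lock-free updates depends on the runtime (the Python GIL limits true parallelism in this pure-Python demo).

```python
import threading
import numpy as np

# Shared parameters updated by all threads without locks (Hogwild!-style sketch).
shared_theta = np.zeros(4)

def worker(data, lr=0.01, steps=1000):
    for _ in range(steps):
        x, y = data[np.random.randint(len(data))]
        # Gradient of a toy squared error (theta . x - y)^2 at the current shared parameters.
        grad = 2.0 * (shared_theta @ x - y) * x
        shared_theta[:] -= lr * grad      # in-place write to the shared vector, no lock

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0, 0.5, 3.0])
data = [(x, float(x @ true_theta)) for x in rng.normal(size=(256, 4))]

threads = [threading.Thread(target=worker, args=(data,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(shared_theta)   # should approach true_theta despite unsynchronized updates
```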
Second, we make the observation that multiple actor-learners running in parallel are likely to be exploring different parts of the environment. Moreover, one can explicitly use different exploration policies in each actor-learner to maximize this diversity. By running different exploration policies in different threads, the overall changes being made to the parameters by multiple actor-learners applying online updates in parallel are likely to be less correlated in time than a single agent applying online updates. Hence, we do not use a replay memory and rely on parallel actors employing different exploration policies to perform the stabilizing role undertaken by experience replay in the DQN training algorithm.

In addition to stabilizing learning, using multiple parallel actor-learners has multiple practical benefits. First, we obtain a reduction in training time that is roughly linear in the number of parallel actor-learners. Second, since we no longer rely on experience replay for stabilizing learning, we are able to use on-policy reinforcement learning methods such as Sarsa and actor-critic to train neural networks in a stable way. We now describe our variants of one-step Q-learning, one-step Sarsa, n-step Q-learning and advantage actor-critic.

Asynchronous one-step Q-learning: Pseudocode for our variant of Q-learning, which we call Asynchronous one-step Q-learning, is shown in Algorithm 1. Each thread interacts with its own copy of the environment and at each step computes a gradient of the Q-learning loss. We use a shared and slowly changing target network in computing the Q-learning loss, as was proposed in the DQN training method. We also accumulate gradients over multiple timesteps before they are applied, which is similar to using minibatches. This reduces the chances of multiple actor-learners overwriting each other's updates. Accumulating updates over several steps also provides some ability to trade off computational efficiency for data efficiency.

Algorithm 1 Asynchronous one-step Q-learning - pseudocode for each actor-learner thread.
  // Assume global shared $\theta$, $\theta^-$, and counter $T = 0$.
  Initialize thread step counter $t \leftarrow 0$
  Initialize target network weights $\theta^- \leftarrow \theta$
  Initialize network gradients $d\theta \leftarrow 0$
  Get initial state $s$
  repeat
    Take action $a$ with $\epsilon$-greedy policy based on $Q(s, a; \theta)$
    Receive new state $s'$ and reward $r$
    $y = r$ for terminal $s'$; $y = r + \gamma \max_{a'} Q(s', a'; \theta^-)$ for non-terminal $s'$
    Accumulate gradients wrt $\theta$: $d\theta \leftarrow d\theta + \frac{\partial (y - Q(s, a; \theta))^2}{\partial \theta}$
    $s = s'$
    $T \leftarrow T + 1$ and $t \leftarrow t + 1$
    if $T \bmod I_{target} == 0$ then
      Update the target network $\theta^- \leftarrow \theta$
    end if
    if $t \bmod I_{AsyncUpdate} == 0$ or $s$ is terminal then
      Perform asynchronous update of $\theta$ using $d\theta$.
      Clear gradients $d\theta \leftarrow 0$.
    end if
  until $T > T_{max}$
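For readers who prefer code to pseudocode, the following is a compact Python rendering of one actor-learner thread in the spirit of Algorithm 1. It is a sketch under simplifying assumptions, not the authors' implementation: a tabular Q array stands in for the neural network, the toy chain environment and all interval values are illustrative, and in the full method several threads would run this function concurrently against the shared arrays.

```python
import numpy as np

# Sketch of one asynchronous one-step Q-learning actor-learner (cf. Algorithm 1),
# using a tabular Q "network" and a toy chain environment; all names are illustrative.
N_STATES, N_ACTIONS = 6, 2
shared_q = np.zeros((N_STATES, N_ACTIONS))     # shared parameters theta
target_q = shared_q.copy()                     # shared target parameters theta^-
global_counter = {"T": 0}                      # shared counter T

class ChainEnv:
    """Toy environment: move left/right on a chain, reward 1 at the right end."""
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):
        self.s = max(0, min(N_STATES - 1, self.s + (1 if a == 1 else -1)))
        done = self.s == N_STATES - 1
        return self.s, (1.0 if done else 0.0), done

def actor_learner(epsilon=0.1, gamma=0.99, lr=0.1,
                  i_target=100, i_async_update=5, T_max=20000):
    env, grads, t = ChainEnv(), np.zeros_like(shared_q), 0
    s = env.reset()
    while global_counter["T"] <= T_max:
        # epsilon-greedy action from the shared Q
        a = np.random.randint(N_ACTIONS) if np.random.rand() < epsilon else int(np.argmax(shared_q[s]))
        s_next, r, done = env.step(a)
        y = r if done else r + gamma * np.max(target_q[s_next])
        grads[s, a] += y - shared_q[s, a]        # accumulate TD error (negative half-gradient of the loss)
        s = env.reset() if done else s_next
        global_counter["T"] += 1
        t += 1
        if global_counter["T"] % i_target == 0:
            target_q[:] = shared_q               # update the target network
        if t % i_async_update == 0 or done:
            shared_q[:] += lr * grads            # asynchronous (lock-free) update of theta
            grads[:] = 0.0

actor_learner()   # in the full method, several threads would run this concurrently
print(shared_q)
```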
Finally, we found that giving each thread a different exploration policy helps improve robustness. Adding diversity to exploration in this manner also generally improves performance through better exploration. While there are many possible ways of making the exploration policies differ, we experiment with using $\epsilon$-greedy exploration with $\epsilon$ periodically sampled from some distribution by each thread.
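A minimal sketch of this kind of per-thread exploration diversity: each actor-learner thread draws its own $\epsilon$ from a small candidate set. The particular values and probabilities below are illustrative, not the distribution used in the paper.

```python
import numpy as np

# Each actor-learner thread samples its own exploration epsilon, so different
# threads behave more or less greedily. Values and probabilities are illustrative.
def sample_epsilon(rng):
    return rng.choice([0.5, 0.1, 0.01], p=[0.3, 0.4, 0.3])

rng = np.random.default_rng(seed=123)
thread_epsilons = [sample_epsilon(rng) for _ in range(16)]
print(thread_epsilons)
```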
Asynchronous one-step Sarsa: The asynchronous one-step Sarsa algorithm is the same as asynchronous one-step Q-learning as given in Algorithm 1 except that it uses a different target value for $Q(s, a)$. The target value used by one-step Sarsa is $r + \gamma Q(s', a'; \theta^-)$, where $a'$ is the action taken in state $s'$ (Rummery & Niranjan, 1994; Sutton & Barto, 1998). We again use a target network and updates accumulated over multiple timesteps to stabilize learning.

Asynchronous n-step Q-learning: Pseudocode for our variant of multi-step Q-learning is shown in Supplementary Algorithm S2. The algorithm is somewhat unusual because it operates in the forward view by explicitly computing n-step returns, as opposed to the more common backward view used by techniques like eligibility traces (Sutton & Barto, 1998). We found that using the forward view is easier when training neural networks with momentum-based methods and backpropagation through time. In order to compute a single update, the algorithm first selects actions using its exploration policy for up to $t_{max}$ steps or until a terminal state is reached. This process results in the agent receiving up to $t_{max}$ rewards from the environment since its last update. The algorithm then computes gradients for n-step Q-learning updates for each of the state-action pairs encountered since the last update. Each n-step update uses the longest possible n-step return, resulting in a one-step update for the last state, a two-step update for the second last state, and so on for a total of up to $t_{max}$ updates. The accumulated updates are applied in a single gradient step.
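The forward-view bookkeeping described above amounts to the short loop below: roll out up to $t_{max}$ steps, then walk backwards accumulating the return so that the last state gets a one-step target, the one before it a two-step target, and so on. This is a sketch with placeholder inputs, not the supplementary pseudocode itself.

```python
import numpy as np

def n_step_q_targets(rewards, bootstrap_value, gamma=0.99):
    """Targets for each visited state-action pair in a rollout of up to t_max steps.

    rewards: r_t ... r_{t+k-1} collected since the last update (k <= t_max).
    bootstrap_value: 0 for a terminal state, else max_a Q(s_{t+k}, a; theta^-).
    Returns targets ordered to match the visited states s_t ... s_{t+k-1}.
    """
    R = bootstrap_value
    targets = []
    for r in reversed(rewards):    # backwards pass: last state gets a 1-step return, etc.
        R = r + gamma * R
        targets.append(R)
    return list(reversed(targets))

print(n_step_q_targets([0.0, 0.0, 1.0], bootstrap_value=0.8))
```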
Asynchronous advantage actor-critic: The algorithm, which we call asynchronous advantage actor-critic (A3C), maintains a policy $\pi(a_t|s_t; \theta)$ and an estimate of the value function $V(s_t; \theta_v)$. Like our variant of n-step Q-learning, our variant of actor-critic also operates in the forward view and uses the same mix of n-step returns to update both the policy and the value function. The policy and the value function are updated after every $t_{max}$ actions or when a terminal state is reached. The update performed by the algorithm can be seen as $\nabla_{\theta'} \log \pi(a_t|s_t; \theta') A(s_t, a_t; \theta, \theta_v)$, where $A(s_t, a_t; \theta, \theta_v)$ is an estimate of the advantage function given by $\sum_{i=0}^{k-1} \gamma^i r_{t+i} + \gamma^k V(s_{t+k}; \theta_v) - V(s_t; \theta_v)$, where $k$ can vary from state to state and is upper-bounded by $t_{max}$. The pseudocode for the algorithm is presented in Supplementary Algorithm S3.

As with the value-based methods, we rely on parallel actor-learners and accumulated updates for improving training stability. Note that while the parameters $\theta$ of the policy and $\theta_v$ of the value function are shown as being separate for generality, we always share some of the parameters in practice. We typically use a convolutional neural network that has one softmax output for the policy $\pi(a_t|s_t; \theta)$ and one linear output for the value function $V(s_t; \theta_v)$, with all non-output layers shared.

We also found that adding the entropy of the policy $\pi$ to the objective function improved exploration by discouraging premature convergence to suboptimal deterministic policies. This technique was originally proposed by (Williams & Peng, 1991), who found that it was particularly helpful on tasks requiring hierarchical behavior. The gradient of the full objective function including the entropy regularization term with respect to the policy parameters takes the form $\nabla_{\theta'} \log \pi(a_t|s_t; \theta')(R_t - V(s_t; \theta_v)) + \beta \nabla_{\theta'} H(\pi(s_t; \theta'))$, where $H$ is the entropy. The hyperparameter $\beta$ controls the strength of the entropy regularization term.
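To make the combined objective concrete, the sketch below computes the quantities that an A3C-style update uses for one rollout: n-step returns, advantages, the log-policy term, and the entropy bonus weighted by $\beta$. It assumes the policy probabilities and value estimates are already given (in practice they come from the shared network and the losses are backpropagated); all inputs and hyperparameter values are illustrative, and the advantage is treated as a constant in the policy term.

```python
import numpy as np

def a3c_rollout_losses(rewards, values, bootstrap_value, action_probs, actions,
                       gamma=0.99, beta=0.01):
    """Scalar loss pieces for one rollout of length k <= t_max.

    rewards, values, actions: per-step data; bootstrap_value: V(s_{t+k}) or 0 if terminal.
    action_probs: array of shape (k, num_actions) with pi(.|s_i; theta).
    """
    k = len(rewards)
    returns = np.empty(k)
    R = bootstrap_value
    for i in reversed(range(k)):               # n-step returns, longest possible for each state
        R = rewards[i] + gamma * R
        returns[i] = R
    advantages = returns - values              # R_t - V(s_t; theta_v)
    log_pi_a = np.log(action_probs[np.arange(k), actions])     # log pi(a_t|s_t; theta)
    entropy = -np.sum(action_probs * np.log(action_probs), axis=1)
    policy_loss = -np.sum(log_pi_a * advantages + beta * entropy)   # minimize the negative objective
    value_loss = 0.5 * np.sum(advantages ** 2)                      # regress V toward the returns
    return policy_loss, value_loss

probs = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
print(a3c_rollout_losses([0.0, 0.0, 1.0], np.array([0.1, 0.2, 0.5]), 0.0, probs, [1, 0, 1]))
```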
Optimization: We investigated three different optimization algorithms in our asynchronous framework: SGD with momentum, RMSProp (Tieleman & Hinton, 2012) without shared statistics, and RMSProp with shared statistics. We used the standard non-centered RMSProp update given by

$$g = \alpha g + (1 - \alpha) \Delta\theta^2 \quad \text{and} \quad \theta \leftarrow \theta - \eta \frac{\Delta\theta}{\sqrt{g + \epsilon}}, \qquad (1)$$

where all operations are performed elementwise. A comparison on a subset of Atari 2600 games showed that a variant of RMSProp where statistics $g$ are shared across threads is considerably more robust than the other two methods. Full details of the methods and comparisons are included in Supplementary Section 7.
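A minimal sketch of the non-centered RMSProp update of Eq. (1), where the moving average $g$ can be shared across threads simply by passing every caller the same array; the hyperparameter values below are illustrative, not those used in the paper.

```python
import numpy as np

def rmsprop_update(theta, dtheta, g, lr=7e-4, alpha=0.99, eps=0.1):
    """Non-centered RMSProp as in Eq. (1); all operations are elementwise.

    g is the running average of squared gradients. Passing the same g array from
    every thread gives the 'shared statistics' variant; per-thread g arrays give
    the unshared variant. lr, alpha and eps are illustrative values.
    """
    g[:] = alpha * g + (1.0 - alpha) * dtheta ** 2
    theta[:] = theta - lr * dtheta / np.sqrt(g + eps)
    return theta

theta = np.zeros(3)
shared_g = np.zeros(3)                 # shared across all actor-learner threads
rmsprop_update(theta, dtheta=np.array([0.1, -0.2, 0.05]), g=shared_g)
print(theta, shared_g)
```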
5. Experiments

We use four different platforms for assessing the properties of the proposed framework. We perform most of our experiments using the Arcade Learning Environment (Bellemare et al., 2012), which provides a simulator for Atari 2600 games. This is one of the most commonly used benchmark environments for RL algorithms. We use the Atari domain to compare against state of the art results (Van Hasselt et al., 2015; Wang et al., 2015; Schaul et al., 2015; Nair et al., 2015; Mnih et al., 2015), as well as to carry out a detailed stability and scalability analysis of the proposed methods. We performed further comparisons using the TORCS 3D car racing simulator (Wymann et al., 2013). We also use two additional domains to evaluate only the A3C algorithm: MuJoCo and Labyrinth. MuJoCo (Todorov, 2015) is a physics simulator for evaluating agents on continuous motor control tasks with contact dynamics. Labyrinth is a new 3D environment where the agent must learn to find rewards in randomly generated mazes from a visual input. The precise details of our experimental setup can be found in Supplementary Section 8.

Figure 1. Learning speed comparison for DQN and the new asynchronous algorithms on five Atari 2600 games. DQN was trained on a single Nvidia K40 GPU while the asynchronous methods were trained using 16 CPU cores. The plots are averaged over 5 runs. In the case of DQN the runs were for different seeds with fixed hyperparameters. For asynchronous methods we average over the best 5 models from 50 experiments with learning rates sampled from $LogUniform(10^{-4}, 10^{-2})$ and all other hyperparameters fixed.
Method             Training Time           Mean      Median
DQN                8 days on GPU           121.9%     47.5%
Gorila             4 days, 100 machines    215.2%     71.3%
D-DQN              8 days on GPU           332.9%    110.9%
Dueling D-DQN      8 days on GPU           343.8%    117.1%
Prioritized DQN    8 days on GPU           463.6%    127.6%
A3C, FF            1 day on CPU            344.1%     68.2%
A3C, FF            4 days on CPU           496.8%    116.6%
A3C, LSTM          4 days on CPU           623.0%    112.6%
Figure 2. Scatter plots of scores obtained by asynchronous advantage actor-critic on five games (Beamrider, Breakout, Pong, Q*bert, Space Invaders) for 50 different learning rates and random initializations. On each game, there is a wide range of learning rates for which all random initializations achieve good scores. This shows that A3C is quite robust to learning rates and initial random weights.
numbers of actor-learners and training methods on five Atari games, and Figure 4, which shows plots of the average score against wall-clock time.

5.6. Robustness and Stability

Finally, we analyzed the stability and robustness of the four proposed asynchronous algorithms. For each of the four algorithms we trained models on five games (Breakout, Beamrider, Pong, Q*bert, Space Invaders) using 50 different learning rates and random initializations. Figure 2 shows scatter plots of the resulting scores for A3C, while Supplementary Figure S11 shows plots for the other three methods. There is usually a range of learning rates for each method and game combination that leads to good scores, indicating that all methods are quite robust to the choice of learning rate and random initialization. The fact that there are virtually no points with scores of 0 in regions with good learning rates indicates that the methods are stable and do not collapse or diverge once they are learning.

6. Conclusions and Discussion

We have presented asynchronous versions of four standard reinforcement learning algorithms and showed that they are able to train neural network controllers on a variety of domains in a stable manner. Our results show that in our proposed framework stable training of neural networks through reinforcement learning is possible with both value-based and policy-based methods, off-policy as well as on-policy methods, and in discrete as well as continuous domains. When trained on the Atari domain using 16 CPU cores, the proposed asynchronous algorithms train faster than DQN trained on an Nvidia K40 GPU, with A3C surpassing the current state-of-the-art in half the training time.

One of our main findings is that using parallel actor-learners to update a shared model had a stabilizing effect on the learning process of the three value-based methods we considered. While this shows that stable online Q-learning is possible without experience replay, which was used for this purpose in DQN, it does not mean that experience replay is not useful. Incorporating experience replay into the asynchronous reinforcement learning framework could substantially improve the data efficiency of these methods by reusing old data. This could in turn lead to much faster training times in domains like TORCS where interacting with the environment is more expensive than updating the model for the architecture we used.

Combining other existing reinforcement learning methods or recent advances in deep reinforcement learning with our asynchronous framework presents many possibilities for immediate improvements to the methods we presented. While our n-step methods operate in the forward view (Sutton & Barto, 1998) by using corrected n-step returns directly as targets, it has been more common to use the backward view to implicitly combine different returns through eligibility traces (Watkins, 1989; Sutton & Barto, 1998; Peng & Williams, 1996). The asynchronous advantage actor-critic method could potentially be improved by using other ways of estimating the advantage function, such as the generalized advantage estimation of (Schulman et al., 2015b). All of the value-based methods we investigated could benefit from different ways of reducing overestimation bias of Q-values (Van Hasselt et al., 2015; Bellemare et al., 2016). Yet another, more speculative, direction is to try and combine the recent work on true online temporal difference methods (van Seijen et al., 2015) with nonlinear function approximation.

In addition to these algorithmic improvements, a number of complementary improvements to the neural network architecture are possible. The dueling architecture of (Wang et al., 2015) has been shown to produce more accurate estimates of Q-values by including separate streams for the state value and advantage in the network. The spatial softmax proposed by (Levine et al., 2015) could improve both value-based and policy-based methods by making it easier for the network to represent feature coordinates.

ACKNOWLEDGMENTS

We thank Thomas Degris, Remi Munos, Marc Lanctot, Sasha Vezhnevets and Joseph Modayil for many helpful discussions, suggestions and comments on the paper. We also thank the DeepMind evaluation team for setting up the environments used to evaluate the agents in the paper.
[Figure 3: score vs. training epochs for 1-step Q (top row), n-step Q (middle row) and A3C (bottom row) with 1, 2, 4, 8 and 16 threads on Beamrider, Breakout, Pong, Q*bert and Space Invaders.]

Figure 3. Data efficiency comparison of different numbers of actor-learners for three asynchronous methods on five Atari games. The x-axis shows the total number of training epochs where an epoch corresponds to four million frames (across all threads). The y-axis shows the average score. Each curve shows the average over the three best learning rates. Single step methods show increased data efficiency from more parallel workers. Results for Sarsa are shown in Supplementary Figure S9.
[Figure 4 plots: rows for 1-step Q, n-step Q and A3C, each with curves for 1, 2, 4, 8 and 16 threads, on Beamrider, Breakout, Pong, Q*bert and Space Invaders; axes show score against training time in hours.]
Figure 4. Training speed comparison of different numbers of actor-learners on five Atari games. The x-axis shows training time in
hours while the y-axis shows the average score. Each curve shows the average over the three best learning rates. All asynchronous
methods show significant speedups from using greater numbers of parallel actor-learners. Results for Sarsa are shown in Supplementary
Figure S10.
7. Optimization Details
We investigated two different optimization algorithms with our asynchronous framework: stochastic gradient
descent and RMSProp. Our implementations of these algorithms do not use any locking in order to maximize
throughput when using a large number of threads.
Momentum SGD: The implementation of SGD in an asynchronous setting is relatively straightforward and
well studied (Recht et al., 2011). Let θ be the parameter vector that is shared across all threads and let Δθᵢ
be the accumulated gradients of the loss with respect to the parameters θ computed by thread number i. Each
thread i independently applies the standard momentum SGD update mᵢ = αmᵢ + (1 − α)Δθᵢ followed by
θ ← θ − ηmᵢ with learning rate η, momentum α and without any locks. Note that in this setting, each thread
maintains its own separate gradient and momentum vector.
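As a rough illustration, a single lock-free update for one actor-learner thread could be written as in the following minimal NumPy sketch; the function name, the use of in-place updates on a shared array, and any concrete hyperparameter values are illustrative assumptions rather than a description of the paper's implementation.

    import numpy as np

    def momentum_sgd_step(shared_theta, grad, m_i, lr, momentum):
        """One hypothetical lock-free momentum SGD step for a single thread.

        shared_theta : parameter vector shared across all threads (no locks)
        grad         : accumulated gradient computed by this thread
        m_i          : this thread's private momentum vector
        """
        # Per-thread momentum accumulation: m_i = alpha * m_i + (1 - alpha) * grad
        m_i *= momentum
        m_i += (1.0 - momentum) * grad
        # Hogwild-style update of the shared parameters, applied without locking
        shared_theta -= lr * m_i

In a multi-threaded implementation shared_theta would be a buffer visible to all threads, while grad and m_i live in thread-local storage.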
RMSProp: While RMSProp (Tieleman & Hinton, 2012) has been widely used in the deep learning literature,
it has not been extensively studied in the asynchronous optimization setting. The standard non-centered
RMSProp update is given by
                                        g = αg + (1 − α)Δθ²                                           (S2)
                                        θ ← θ − η Δθ / √(g + ε)                                       (S3)
where all operations are performed elementwise. In order to apply RMSProp in the asynchronous optimiza-
tion setting one must decide whether the moving average of elementwise squared gradients g is shared or
per-thread. We experimented with two versions of the algorithm. In one version, which we refer to as RM-
SProp, each thread maintains its own g shown in Equation S2. In the other version, which we call Shared
RMSProp, the vector g is shared among threads and is updated asynchronously and without locking. Sharing
statistics among threads also reduces memory requirements by using one fewer copy of the parameter vector
per thread.
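A minimal sketch of the Shared RMSProp update under the same lock-free scheme follows; the array names and the epsilon constant are illustrative assumptions, and the non-shared variant would simply keep g private to each thread.

    import numpy as np

    def shared_rmsprop_step(shared_theta, shared_g, grad, lr, decay, eps):
        # Shared statistics, updated in place and without locking (Equation S2):
        # g = alpha * g + (1 - alpha) * grad^2
        shared_g *= decay
        shared_g += (1.0 - decay) * grad * grad
        # Parameter update (Equation S3): theta <- theta - lr * grad / sqrt(g + eps)
        shared_theta -= lr * grad / np.sqrt(shared_g + eps)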
We compared these three asynchronous optimization algorithms in terms of their sensitivity to different learn-
ing rates and random network initializations. Figure S5 shows a comparison of the methods for two different
reinforcement learning methods (Async n-step Q and Async Advantage Actor-Critic) on four different games
(Breakout, Beamrider, Seaquest and Space Invaders). Each curve shows the scores for 50 experiments that
correspond to 50 different random learning rates and initializations. The x-axis shows the rank of the model
after sorting in descending order by final average score and the y-axis shows the final average score achieved
by the corresponding model. In this representation, the algorithm that performs better would achieve higher
maximum rewards on the y-axis and the algorithm that is most robust would have its slope closest to horizon-
tal, thus maximizing the area under the curve. RMSProp with shared statistics tends to be more robust than
RMSProp with per-thread statistics, which is in turn more robust than Momentum SGD.
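For concreteness, the ranking curves in Figure S5 can be reproduced with a sketch along the following lines, where final_scores is a hypothetical list holding the 50 final average scores obtained for one optimization method.

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_robustness_curve(final_scores, label):
        # Sort the final scores in descending order and plot score against rank.
        ranked = np.sort(np.asarray(final_scores, dtype=float))[::-1]
        plt.plot(np.arange(1, ranked.size + 1), ranked, label=label)
        plt.xlabel("Model Rank")
        plt.ylabel("Score")
        plt.legend()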
8. Experimental Setup
The experiments performed on a subset of Atari games (Figures 1, 3, 4 and Table 2) as well as the TORCS
experiments (Figure S6) used the following setup. Each experiment used 16 actor-learner threads running
on a single machine and no GPUs. All methods performed updates after every 5 actions (t_max = 5 and
I_Update = 5) and shared RMSProp was used for optimization. The three asynchronous value-based methods
used a shared target network that was updated every 40000 frames. The Atari experiments used the same
input preprocessing as (Mnih et al., 2015) and an action repeat of 4. The agents used the network architecture
from (Mnih et al., 2013). The network used a convolutional layer with 16 filters of size 8 × 8 with stride
4, followed by a convolutional layer with 32 filters of size 4 × 4 with stride 2, followed by a fully
connected layer with 256 hidden units. All three hidden layers were followed by a rectifier nonlinearity. The
value-based methods had a single linear output unit for each action representing the action-value. The model
used by actor-critic agents had two sets of outputs: a softmax output with one entry per action representing the
probability of selecting the action, and a single linear output representing the value function. All experiments
used a discount of γ = 0.99 and an RMSProp decay factor of α = 0.99.
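A minimal PyTorch sketch of this architecture is shown below, assuming 4 stacked 84 × 84 input frames as produced by the preprocessing of (Mnih et al., 2015); the class and attribute names are illustrative, and the original experiments did not use PyTorch.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ActorCriticNet(nn.Module):
        def __init__(self, num_actions):
            super().__init__()
            self.conv1 = nn.Conv2d(4, 16, kernel_size=8, stride=4)   # 16 filters, 8x8, stride 4
            self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)  # 32 filters, 4x4, stride 2
            self.fc = nn.Linear(32 * 9 * 9, 256)                     # 256 hidden units
            self.policy = nn.Linear(256, num_actions)                # softmax policy head
            self.value = nn.Linear(256, 1)                           # scalar value head

        def forward(self, x):
            x = F.relu(self.conv1(x))
            x = F.relu(self.conv2(x))
            x = F.relu(self.fc(x.flatten(start_dim=1)))
            return F.softmax(self.policy(x), dim=-1), self.value(x)

For the value-based methods, the two heads would be replaced by a single linear layer with one action-value output per action.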
The value-based methods sampled the exploration rate ε from a distribution taking three values ε₁, ε₂, ε₃ with
probabilities 0.4, 0.3, 0.3. The values of ε₁, ε₂, ε₃ were annealed from 1 to 0.1, 0.01, 0.5 respectively over
the first four million frames. Advantage actor-critic used entropy regularization with a weight β = 0.01 for
all Atari and TORCS experiments. We performed a set of 50 experiments for five Atari games and every
TORCS level, each using a different random initialization and initial learning rate. The initial learning rate
was sampled from a LogUniform(10⁻⁴, 10⁻²) distribution and annealed to 0 over the course of training.
Note that in comparisons to prior work (Tables 1 and S3) we followed standard evaluation protocol and used
fixed hyperparameters.
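The sampling procedure can be illustrated with the following NumPy sketch; the function names are hypothetical and the linear form of the ε annealing schedule is an assumption.

    import numpy as np

    rng = np.random.default_rng()

    def sample_final_epsilon():
        # Final exploration rate: 0.1, 0.01 or 0.5 with probabilities 0.4, 0.3, 0.3
        return rng.choice([0.1, 0.01, 0.5], p=[0.4, 0.3, 0.3])

    def epsilon_at(frame, final_epsilon, anneal_frames=4_000_000):
        # Anneal epsilon from 1 to its final value over the first four million frames
        frac = min(frame / anneal_frames, 1.0)
        return 1.0 + frac * (final_epsilon - 1.0)

    def sample_initial_learning_rate(low=1e-4, high=1e-2):
        # Initial learning rate drawn from LogUniform(1e-4, 1e-2)
        return float(np.exp(rng.uniform(np.log(low), np.log(high))))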
For the continuous action (MuJoCo) tasks we used a cost on the differential entropy of the normal distribution
defined by the output of the actor network, −½(log(2πσ²) + 1), with a constant multiplier of 10⁻⁴ for this cost
across all of the tasks examined. The asynchronous advantage actor-critic algorithm finds solutions for all the
domains. Figure S8 shows learning curves against wall-clock time, and demonstrates that most of the domains
can be solved from state observations within a few hours. All of the experiments, including those done from
pixel-based observations, were run on CPU. Even in the case of solving the domains directly from pixel inputs
we found that it was possible to reliably discover solutions within 24 hours. Figure S7 shows scatter plots of
the top scores against the sampled learning rates. In most of the domains there is a large range of learning
rates that consistently achieve good performance on the task.
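As an illustration, the entropy term for a diagonal Gaussian policy could be computed as in the sketch below; the function name is hypothetical, and the convention of subtracting the weighted entropy from the loss is an assumption about how the cost is applied.

    import numpy as np

    def weighted_gaussian_entropy(sigma, weight=1e-4):
        # Differential entropy of a Gaussian, 0.5 * (log(2 * pi * sigma^2) + 1),
        # per action dimension, scaled by the constant multiplier of 1e-4.
        entropy = 0.5 * (np.log(2.0 * np.pi * np.square(sigma)) + 1.0)
        return weight * np.sum(entropy)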
Figure S5. Comparison of three different optimization methods (Momentum SGD, RMSProp, Shared RMSProp) tested
using two different algorithms (Async n-step Q and Async Advantage Actor-Critic) on four different Atari games (Break-
out, Beamrider, Seaquest and Space Invaders). Each curve shows the final scores for 50 experiments, sorted in
descending order, covering a search over 50 random initializations and learning rates. The top row shows results
using the Async n-step Q algorithm and the bottom row shows results with Async Advantage Actor-Critic. Each individual graph shows results for
one of the four games and three different optimization methods. Shared RMSProp tends to be more robust to different
learning rates and random initializations than Momentum SGD and RMSProp without sharing.
[Figure S6 plots: score against training time in hours on four TORCS configurations, with curves for Async 1-step Q, Async SARSA, Async n-step Q and Async actor-critic, and the score of a human tester shown for reference.]
Figure S6. Comparison of algorithms on the TORCS car racing simulator. Four different configurations of car speed and
opponent presence or absence are shown. In each plot, all four algorithms (one-step Q, one-step Sarsa, n-step Q and
Advantage Actor-Critic) are compared on score vs training time in wall clock hours. Multi-step algorithms achieve better
policies much faster than one-step algorithms on all four levels. The curves show averages over the 5 best runs from 50
experiments with learning rates sampled from LogUniform(10⁻⁴, 10⁻²) and all other hyperparameters fixed.
Figure S7. Performance for the Mujoco continuous action domains. Scatter plot of the best score obtained against
learning rates sampled from LogUniform(10⁻⁵, 10⁻¹). For nearly all of the tasks there is a wide range of learning
rates that lead to good performance on the task.
Figure S8. Score per episode vs wall-clock time plots for the Mujoco domains. Each plot shows error bars for the top 5
experiments.
Figure S9. Data efficiency comparison of different numbers of actor-learners for one-step Sarsa on five Atari games. The
x-axis shows the total number of training epochs where an epoch corresponds to four million frames (across all threads).
The y-axis shows the average score. Each curve shows the average of the three best performing agents from a search over
50 random learning rates. Sarsa shows increased data efficiency with increased numbers of parallel workers.
Figure S10. Training speed comparison of different numbers of actor-learners for one-step Sarsa on five Atari games.
The x-axis shows training time in hours while the y-axis shows the average score. Each curve shows the average of the
three best performing agents from a search over 50 random learning rates. Sarsa shows significant speedups from using
greater numbers of parallel actor-learners.
Figure S11. Scatter plots of scores obtained by one-step Q, one-step Sarsa, and n-step Q on five games (Beamrider,
Breakout, Pong, Q*bert, Space Invaders) for 50 different learning rates and random initializations. All algorithms exhibit
some level of robustness to the choice of learning rate.
    Game                    DQN       Gorila    Double     Dueling   Prioritized   A3C FF, 1 day    A3C FF     A3C LSTM
    Alien                   570.2      813.5     1033.4     1486.5        900.5            182.1       518.4         945.3
    Amidar                  133.4      189.2      169.1      172.7        218.4            283.9       263.9         173.0
    Assault                3332.3     1195.8     6060.8     3994.8       7748.5           3746.1      5474.9       14497.9
    Asterix                 124.5     3324.7    16837.0    15840.0      31907.5           6723.0     22140.5       17244.5
    Asteroids               697.1      933.6     1193.2     2035.4       1654.0           3009.4      4474.5        5093.1
    Atlantis              76108.0   629166.5   319688.0   445360.0     593642.0        772392.0     911091.0     875822.0
    Bank Heist              176.3      399.4      886.0     1129.3        816.8            946.0       970.1         932.8
    Battle Zone           17560.0    19938.0    24740.0    31320.0      29100.0          11340.0     12950.0       20760.0
    Beam Rider             8672.4     3822.1    17417.2    14591.3      26172.7          13235.9     22707.9       24622.2
    Berzerk                                      1011.1      910.6       1165.6           1433.4       817.9         862.2
    Bowling                  41.2       54.0       69.6       65.7          65.8            36.2        35.1          41.8
    Boxing                   25.8       74.2       73.5       77.3          68.6            33.7        59.8          37.3
    Breakout                303.9      313.0      368.9      411.6        371.6            551.6       681.9         766.8
    Centipede              3773.1     6296.9     3853.5     4881.0       3421.9           3306.5      3755.8        1997.0
    Chopper Command        3046.0     3191.8     3495.0     3784.0       6604.0           4669.0      7021.0       10150.0
    Crazy Climber         50992.0    65451.0   113782.0   124566.0     131086.0        101624.0     112646.0     138518.0
    Defender                                    27510.0    33996.0      21093.5          36242.5     56533.0     233021.5
    Demon Attack          12835.2    14880.1    69803.4    56322.8      73185.8          84997.5    113308.4     115201.9
    Double Dunk             -21.6      -11.3       -0.3       -0.8           2.7              0.1       -0.1           0.1
    Enduro                  475.6       71.0     1216.6     2077.4       1884.4            -82.2       -82.5         -82.5
    Fishing Derby            -2.3        4.6        3.2       -4.1           9.2            13.6        18.8          22.6
    Freeway                  25.8       10.2       28.8        0.2          27.9              0.1        0.1           0.1
    Frostbite               157.4      426.6     1448.1     2332.4       2930.2            180.1       190.5         197.6
    Gopher                 2731.8     4373.0    15253.0    20051.4      57783.8           8442.8     10022.8       17106.8
    Gravitar                216.5      538.4      200.5      297.0        218.0            269.5       303.5         320.0
    H.E.R.O.              12952.5     8963.4    14892.5    15207.9      20506.4          28765.8     32464.1       28889.5
    Ice Hockey               -3.8       -1.7       -2.5       -1.3          -1.0             -4.7       -2.8          -1.7
    James Bond              348.5      444.0      573.0      835.5       3511.5            351.5       541.0         613.0
    Kangaroo               2696.0     1431.0    11204.0    10334.0      10241.0            106.0        94.0         125.0
    Krull                  3864.0     6363.1     6796.1     8051.6       7406.5           8066.6      5560.0        5911.4
    Kung-Fu Master        11875.0    20620.0    30207.0    24288.0      31244.0           3046.0     28819.0       40835.0
    Montezuma's Revenge     50.0       84.0       42.0       22.0          13.0            53.0        67.0          41.0
    Ms. Pacman              763.5     1263.0     1241.3     2250.6       1824.6            594.4       653.7         850.7
    Name This Game         5439.9     9238.5     8960.3    11185.1      11836.1           5614.0     10476.1       12093.7
    Phoenix                                     12366.5    20410.5      27430.1          28181.8     52894.1       74786.7
    Pit Fall                                     -186.7      -46.9         -14.8          -123.0       -78.5        -135.7
    Pong                     16.2       16.7       19.1       18.8          18.9            11.4         5.6          10.7
    Private Eye             298.2     2598.6     -575.5      292.6        179.0            194.4       206.9         421.1
    Q*Bert                 4589.8     7089.8    11020.8    14175.8      11277.0          13752.3     15148.8       21307.5
    River Raid             4065.3     5310.3    10838.4    16569.4      18184.4          10001.2     12201.8        6591.9
    Road Runner            9264.0    43079.8    43156.0    58549.0      56990.0          31769.0     34216.0       73949.0
    Robotank                 58.5       61.8       59.1       62.0          55.4              2.3       32.8           2.6
    Seaquest               2793.9    10145.9    14498.0    37361.6      39096.7           2300.2      2355.4        1326.1
    Skiing                                     -11490.4   -11928.0     -10852.8         -13700.0    -10911.1      -14863.8
    Solaris                                       810.0     1768.4       2238.2           1884.8      1956.0        1936.4
    Space Invaders         1449.7     1183.3     2628.7     5993.1       9063.0           2214.7     15730.5       23846.0
    Star Gunner           34081.0    14919.2    58365.0    90804.0      51959.0          64393.0    138218.0     164766.0
    Surround                                        1.9        4.0          -0.9             -9.6       -9.7          -8.3
    Tennis                   -2.3       -0.7       -7.8        4.4          -2.0           -10.2        -6.3          -6.4
    Time Pilot             5640.0     8267.8     6608.0     6601.0       7448.0           5825.0     12679.0       27202.0
    Tutankham                32.4      118.5       92.2       48.0          33.6            26.1       156.3         144.2
    Up and Down            3311.3     8747.7    19086.9    24759.2      29443.7          54525.4     74705.7     105728.7
    Venture                  54.0      523.4       21.0      200.0        244.0             19.0        23.0          25.0
    Video Pinball         20228.1   112093.4   367823.7   110976.2     374886.9        185852.6     331628.1     470310.5
    Wizard of Wor           246.0    10431.0     6201.0     7054.0       7451.0           5278.0     17244.0       18082.0
    Yars' Revenge                                6270.6    25976.5       5965.1           7270.8      7157.5        5615.5
    Zaxxon                  831.0     6159.4     8593.0    10164.0       9501.0           2659.0     24622.0       23519.0
Table S3. Raw scores for the human start condition (30 minutes emulator time). DQN scores taken from (Nair et al.,
2015). Double DQN scores taken from (Van Hasselt et al., 2015), Dueling scores from (Wang et al., 2015), and Prioritized
scores from (Schaul et al., 2015).