Reinforcement Learning
Mila - Québec AI Institute, Université de Montréal, Google DeepMind
Abstract
Scaling deep reinforcement learning networks is challenging and often results in degraded performance,
yet the root causes of this failure mode remain poorly understood. Several recent works have proposed
mechanisms to address this, but they are often complex and fail to highlight the causes underlying this
difficulty. In this work, we conduct a series of empirical analyses which suggest that the combination of
non-stationarity and gradient pathologies, arising from suboptimal architectural choices, underlies the challenges of
scale. We propose a series of direct interventions that stabilize gradient flow, enabling robust performance
across a range of network depths and widths. Our interventions are simple to implement and compatible
with well-established algorithms, and result in an effective mechanism that enables strong performance even
at large scales. We validate our findings on a variety of agents and suites of environments. We make our code
publicly available.
“We must be able to look at the world and see it as a dynamic process, not a static picture.”
— David Bohm
1 Introduction
Recent advances in deep reinforcement learning (deep RL) have demonstrated the ability of deep
neural networks to solve complex decision-making tasks from robotics to game play and resource
optimization (Bellemare et al., 2020, Fawzi et al., 2022, Mnih et al., 2015, Vinyals et al., 2019).
Motivated by successes in supervised and generative learning, recent works have explored scaling
architectures in deep RL, showing gains in representation quality and generalization across tasks
(Farebrother et al., 2023, Taiga et al., 2023). However, scaling neural networks in deep RL remains
fundamentally challenging (Ceron et al., 2024a,b).
A central cause of this instability lies in the unique optimization challenges of RL. Unlike
supervised learning, where data distributions are fixed, deep RL involves policy-dependent data
that constantly change during training (Lyle et al., 2022). Each update of the policy πθ alters
future states and rewards, making the training objective inherently non-stationary. Value-based
methods exacerbate these issues via bootstrapping, recursively using predicted values as targets.
Estimation errors compound over time (Fujimoto et al., 2018), especially under sparse or delayed
rewards (Zheng et al., 2018), leading to unstable updates, policy collapse, or value divergence (Lyle
et al., 2023, 2024, Van Hasselt et al., 2016). These dynamics are tightly coupled with architectural
vulnerabilities. Deep networks face well-known pathologies such as vanishing/exploding gradients
(Pascanu et al., 2013), ill-conditioned Jacobians (Pennington et al., 2017), and activation saturation
(Glorot and Bengio, 2010). In deep RL, these are magnified by the “deadly triad” (Sutton and Barto,
2018, Van Hasselt et al., 2018), off-policy corrections, and changing targets. As networks scale, the
risk of signal distortion and misalignment increases, resulting in underutilized capacity and brittle
learning (Ceron et al., 2024a, Obando Ceron et al., 2023).
One overlooked source of these failures lies in how gradients propagate through the network.
Specifically, the gradient decomposition (the layer-wise structure of backpropagation as a chain of
Jacobians and weight matrices) determines how information flows during learning (Lee et al., 2020). While
gradient signal preservation has been studied in supervised learning (Jacot et al., 2018, Schoenholz
et al., 2017), its role in deep RL, where both inputs and targets shift continually, remains poorly
understood.
In this work, we investigate how gradient decomposition interacts with non-stationarity and
network scaling in deep RL. We demonstrate that in non-stationary settings like RL – where targets
are bootstrapped, policies evolve continually, and data distributions shift – gradient signals pro-
gressively degrade across depth. This motivates the need for methods that explicitly preserve the
structure of gradient information across layers. We explore this through a series of controlled ex-
periments and ablations across multiple algorithms and environments, demonstrating that actively
encouraging gradient propagation significantly improves stability and performance, even with
large networks. Our work offers a promising approach for scaling deep RL architectures, yielding
substantial performance gains across a variety of agents and training regimes.
2 Preliminaries
Deep Reinforcement Learning A deep reinforcement learning agent interacts with an environ-
ment through sequences of actions (a ∈ A), which produce corresponding sequences of observations
(s ∈ S) and rewards (r ∈ R), resulting in trajectories of the form τ := {s_0, a_0, r_0, s_1, a_1, r_1, . . .}.
The agent’s behavior is often represented by a neural network with parameters θ, composed of
convolutional layers {ϕ1 , ϕ2 , . . . , ϕLc } and dense (fully connected) layers {ψ1 , ψ2 , . . . , ψLd }, where
ψLd has an output dimensionality of |A|. At every timestep t, an observation st ∈ S is fed
through the network to obtain an estimate of the long-term value of each action: Qθ (st , ·) =
ψLd (ψLd −1 (. . . (ϕLc (. . . (ϕ1 (st )) . . .)) . . .)). The agent’s policy πθ (· | st ) specifies the probability of
selecting each action, for instance by taking the softmax over the estimated values as in Equation 1.
The training objective is typically defined as the maximization of expected cumulative reward as in
Equation 2,
"∞ #
eQθ (st ,at ) X
πθ (at | st ) = P Qθ (st ,a)
(1) J(θ) = Eτ ∼πθ t
γ rt (2)
a∈A e t=0
where γ ∈ [0, 1) is a discount factor and τ denotes a trajectory generated by following policy πθ .
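To make Equations 1 and 2 concrete, the short sketch below (illustrative values only, not code from the paper) computes the softmax policy from a vector of estimated action values and the discounted sum of rewards of one sampled trajectory.

```python
import numpy as np

q = np.array([1.0, 2.0, 0.5])                 # Q_theta(s_t, .) for three actions
pi = np.exp(q) / np.sum(np.exp(q))            # Equation 1: softmax over the estimated values
rewards = np.array([0.0, 0.0, 1.0, 1.0])      # r_0, ..., r_3 from one sampled trajectory
gamma = 0.99
ret = np.sum(gamma ** np.arange(len(rewards)) * rewards)  # discounted sum inside Equation 2
print(pi, ret)  # J(theta) is the expectation of `ret` over trajectories generated by pi_theta
```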
Optimization proceeds by minimizing a surrogate loss L(θ), which may be derived from temporal-
difference (TD) errors, policy gradients, or actor-critic estimators (Sutton and Barto, 2018). In
TD-based methods, the TD error at timestep t is defined as $\delta_t = r_t + \gamma \max_{a'} Q_\theta(s_{t+1}, a') - Q_\theta(s_t, a_t)$. Optimization
is performed by collecting trajectories, computing gradients ∇L(θ), and updating parameters
via θ ← θ − η∇L(θ), where η > 0 is the learning rate. Following conventions from supervised
learning, deep RL algorithms often use adaptive variants of stochastic gradient descent, such as
Adam (Kingma and Ba, 2014) or RMSprop (Hinton, 2012), which adjust learning rates based on
running estimates of gradient statistics. The gradients with respect to each layer are denoted by
$$\nabla \phi_i = \frac{\partial L}{\partial \phi_i}, \qquad \nabla \psi_j = \frac{\partial L}{\partial \psi_j},$$
where ϕi and ψj represent the parameters (i.e., weight matrices or bias vectors) of layer i and j
respectively. The structure and magnitude of these gradients (∇ϕi and ∇ψj ) are influenced by the
loss function, data distribution collected from the environment, and the architecture itself. These
per-layer gradients determine how effectively different parts of the network adapt during training.
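As a concrete illustration of these per-layer gradients, the sketch below builds a toy two-layer Q-network on random data (the layer names psi1/psi2, the semi-gradient TD(0) loss, and all sizes are illustrative assumptions, not the paper's implementation) and reports the norm of each layer's gradient.

```python
import jax
import jax.numpy as jnp

def init_params(key, obs_dim=8, hidden=64, num_actions=4):
    k1, k2 = jax.random.split(key)
    return {
        "psi1": {"W": jax.random.normal(k1, (obs_dim, hidden)) / jnp.sqrt(obs_dim),
                 "b": jnp.zeros(hidden)},
        "psi2": {"W": jax.random.normal(k2, (hidden, num_actions)) / jnp.sqrt(hidden),
                 "b": jnp.zeros(num_actions)},
    }

def q_values(params, s):
    h = jax.nn.relu(s @ params["psi1"]["W"] + params["psi1"]["b"])
    return h @ params["psi2"]["W"] + params["psi2"]["b"]

def td_loss(params, s, a, r, s_next, gamma=0.99):
    # Semi-gradient TD(0): the bootstrapped target is treated as a constant.
    target = r + gamma * jnp.max(q_values(params, s_next), axis=-1)
    pred = jnp.take_along_axis(q_values(params, s), a[:, None], axis=-1)[:, 0]
    return jnp.mean((jax.lax.stop_gradient(target) - pred) ** 2)

key = jax.random.PRNGKey(0)
params = init_params(key)
s, s_next = jax.random.normal(key, (32, 8)), jax.random.normal(jax.random.PRNGKey(1), (32, 8))
a, r = jax.random.randint(key, (32,), 0, 4), jax.random.normal(key, (32,))
grads = jax.grad(td_loss)(params, s, a, r, s_next)
# One scalar per weight matrix / bias vector: the per-layer gradient norms discussed above.
print({name: {k: float(jnp.linalg.norm(v)) for k, v in layer.items()}
       for name, layer in grads.items()})
```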
While training large models in supervised learning settings presents challenges, advances in
initialization, normalization, and scaling strategies have enabled relatively stable optimization (Ba
et al., 2016, Glorot and Bengio, 2010, Ioffe and Szegedy, 2015). Scaling up model size has been
a central driver of progress across domains, improving generalization, enhancing representation
learning, and boosting downstream performance (Kaplan et al., 2020).
Deep RL differs substantially from supervised learning. First, the data distribution is non-
stationary, continually shifting as πθ updates. Second, learning signals are often sparse, delayed, or
noisy, which introduces variance in the estimated gradients (Fujimoto et al., 2018, Han et al., 2022).
These factors destabilize optimization and lead to loss surfaces with sharp curvature and complex
local structure (Achiam et al., 2019, Ilyas et al., 2020). Moreover, increasing model capacity often
degrades performance unless regularization or architectural interventions are applied (Bjorck et al.,
2021, Gogianu et al., 2021, Schwarzer et al., 2023, Wang et al., 2025).
These challenges are further compounded by both architectural and environmental factors. Net-
work depth, width, initialization, and nonlinearity affect how gradients are propagated across layers.
Meanwhile, reward sparsity, exploration difficulty, and transition stochasticity impose additional
structure on the optimization landscape. The resulting geometry reflects the joint dynamics of policy,
environment, and architecture, making deep RL optimization uniquely complex.
Gradient Propagation Training deep networks poses fundamental challenges for effective gradi-
ent propagation (Glorot and Bengio, 2010). As network depth increases, gradients may either vanish
or explode as they are backpropagated through multiple layers, impeding the optimization of early
layers and destabilizing learning dynamics (Ba et al., 2016). These issues arise from repeated appli-
cations of the chain rule. For a network with intermediate hidden representations {h0 , h1 , . . . , hL },
where $h_k \in \mathbb{R}^{d_k}$, the gradient of the loss L with respect to a hidden layer $h_\ell$ is
$$\frac{\partial L}{\partial h_\ell} = \left(\prod_{k=\ell+1}^{L} \frac{\partial h_k}{\partial h_{k-1}}\right) \frac{\partial L}{\partial h_L},$$
so the gradient reaching layer ℓ is a product of L − ℓ layer-to-layer Jacobians; when their norms are consistently below (or above) one, this product vanishes (or explodes) with depth.
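The sketch below (random 1/√n-scaled weights and tanh activations, purely illustrative) traces how the norm of this Jacobian product behaves as depth grows, computing ∂L/∂h_0 directly with reverse-mode autodiff; in this regime the norm shrinks rapidly with depth.

```python
import jax
import jax.numpy as jnp

def forward(weights, x):
    h = x
    for W in weights:
        h = jnp.tanh(W @ h)
    return jnp.sum(h)                              # scalar stand-in for the loss

dim = 128
x = jax.random.normal(jax.random.PRNGKey(0), (dim,))
for depth in [2, 8, 32]:
    keys = jax.random.split(jax.random.PRNGKey(depth), depth)
    weights = [jax.random.normal(k, (dim, dim)) / jnp.sqrt(dim) for k in keys]
    grad_at_input = jax.grad(forward, argnums=1)(weights, x)   # dL/dh_0
    print(depth, float(jnp.linalg.norm(grad_at_input)))        # norm decays as depth grows
```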
Width introduces a related failure mode. For a linear layer with weight matrix $W \in \mathbb{R}^{m \times n}$ and input $h \in \mathbb{R}^n$, producing output $Wh \in \mathbb{R}^m$, and under the assumption that $W$ and $h$ have i.i.d. zero-mean entries with finite variances $\sigma_W$ and $\sigma_h$, respectively, the variance of each output entry is $\mathrm{Var}[Wh] = n\,\sigma_W \sigma_h$. Thus, scaling the width $n$ without adjusting $\sigma_W$ and $\sigma_h$ destabilizes forward and backward signal propagation, affecting gradient norms and optimization trajectories.
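A quick numerical check of this scaling rule (illustrative, not from the paper): with unit-variance i.i.d. entries, the output variance of Wh grows linearly with the fan-in n unless the weight variance is rescaled, for example by 1/n as standard initialization schemes do.

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [64, 256, 1024]:
    W = rng.normal(0.0, 1.0, size=(512, n))       # sigma_W = 1 (entry variance)
    h = rng.normal(0.0, 1.0, size=(n, 10_000))    # sigma_h = 1 (entry variance)
    print(n, np.var(W @ h))                       # approximately n * sigma_W * sigma_h = n
```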
Beyond depth and width, the choice of nonlinearity also plays a central role in determining
how gradients propagate. In a typical feedforward network, hidden activations evolve as $h_k = \zeta(W_k h_{k-1})$, where $\zeta(\cdot)$ is a nonlinear activation function (e.g., ReLU, tanh, sigmoid) and $W_k$ is the weight matrix at layer $k$. During backpropagation, the gradient with respect to a hidden layer includes the product of the Jacobian of the linear transformation and the derivative of the nonlinearity:
$$\frac{\partial L}{\partial h_{k-1}} = W_k^{\top}\left(\zeta'(W_k h_{k-1}) \odot \frac{\partial L}{\partial h_k}\right),$$
where ζ ′ (·) denotes the elementwise derivative of the activation function, and ⊙ represents
elementwise multiplication. For ReLU, ζ ′ (x) = 1x>0 , so the gradient is entirely blocked wherever
the neuron is inactive. This leads to the well-known dying ReLU problem, where a significant portion
of the network ceases to update and becomes untrainable (Lu et al., 2019, Shin and Karniadakis,
2020).
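The masking effect of ζ′ under ReLU can be seen with a tiny autodiff example (random weights and a single input, illustrative only): every row of the weight matrix whose unit is inactive for this input receives exactly zero gradient.

```python
import jax
import jax.numpy as jnp

def relu_layer_sum(W, h):
    return jnp.sum(jax.nn.relu(W @ h))             # scalar surrogate for a downstream loss

W = jax.random.normal(jax.random.PRNGKey(0), (256, 128))
h = jax.random.normal(jax.random.PRNGKey(1), (128,))
pre_activations = W @ h
grad_W = jax.grad(relu_layer_sum)(W, h)
inactive = jnp.mean(pre_activations <= 0)                # fraction of inactive units
zero_grad_rows = jnp.mean(jnp.all(grad_W == 0, axis=1))  # rows receiving no gradient at all
print(float(inactive), float(zero_grad_rows))            # the two fractions coincide
```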
Figure 1: Training dynamics under stationary and non-stationary supervised learning. (Left) In the stationary setting,
both shallow and deep models fit the data effectively across widths. Under non-stationarity only shallow networks
partially recover during training, while deeper ones collapse. (Right) This collapse correlates with degraded gradient flow.
In stationary settings, gradient norms remain stable across all network scales (shaded boxes), while in non-stationary
settings (solid-colored boxes), gradient magnitudes diminish with depth and width, suggesting poor adaptability.
Figure 2: Mean episode returns and gradient norms across increasing MLP depths and widths on two ALE games using
PQN. (Left) Only shallow networks achieve high episode returns; performance collapses for deeper networks. (Right) The
collapse correlates with vanishing gradient norms, suggesting that deeper models fail to adapt to non-stationarity in deep
RL.
Our base algorithm, PQN (Gallici et al., 2025), improves stability and convergence using Layer Normalization (Ba et al., 2016) and supports GPU-based
training through vectorized environments for online parallel data collection. In subsection C.1 we
extend our investigation to DQN (Mnih et al., 2015) and Rainbow (Hessel et al., 2018), demonstrating
the generality of our observations. As shown in Figure 2, deeper networks trained with PQN exhibit
a collapse in both episode returns and gradient norms, highlighting the fragility of deep models
under non-stationarity.
Figure 3: Training pathologies emerge as MLP depth increases. Deeper networks exhibit a higher fraction of inactive
neurons, reduced representation rank (SRank), vanishing Hessian trace (loss curvature), and degraded learning perfor-
mance (mean Q-values and episode returns). These trends indicate that scaling depth limits expressivity and plasticity,
impairing policy quality.
First, we measure the fraction of inactive (dead) neurons and find that it grows with network depth, indicating underutilized capacity. Next, we assess representational expressivity using SRank, the effective rank of penultimate-layer activations (Kumar et al., 2020), observing that deeper networks tend to collapse state representations into lower-dimensional, less expressive subspaces (as evidenced by declining returns). To study loss curvature, we compute the Hessian trace of the temporal-difference
loss. This metric serves as a proxy for sharpness or smoothness in optimization (Ghorbani et al.,
2019), similarly to tracking the largest eigenvalue. Figure 3 shows that only shallow networks
exhibit high Hessian trace values, suggesting access to sharper regions of the loss surface with
pronounced directions of improvement. In contrast, deeper architectures consistently show near-
zero trace, indicating poorly conditioned geometry that hinders effective gradient-based updates.
These findings suggest a breakdown in representation, plasticity, and optimization as networks
scale, ultimately impeding learning.
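For reference, the sketch below implements the SRank metric as we read it from Kumar et al. (2020): the smallest number of singular values of the penultimate-layer feature matrix whose cumulative sum reaches a 1 − δ fraction of the total (the threshold δ = 0.01 and the random stand-in features are assumptions for illustration). The Hessian-trace proxy can be estimated in a similar spirit with Hutchinson-style random probes, which we omit here.

```python
import numpy as np

def srank(features: np.ndarray, delta: float = 0.01) -> int:
    # features: [batch, feature_dim] penultimate-layer activations
    s = np.linalg.svd(features, compute_uv=False)
    cumulative = np.cumsum(s) / np.sum(s)
    return int(np.searchsorted(cumulative, 1.0 - delta) + 1)

rng = np.random.default_rng(0)
healthy = rng.normal(size=(512, 256))
collapsed = healthy @ rng.normal(size=(256, 8)) @ rng.normal(size=(8, 256))
print(srank(healthy))    # close to the feature dimension for random features
print(srank(collapsed))  # at most 8: the representation occupies a low-dimensional subspace
```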
4 Stabilizing Gradients
Having identified the pathologies that emerge in non-stationary regimes, particularly under large-
scale architectures, we investigate strategies to mitigate these instabilities. We focus on two com-
plementary interventions: skip connections (He et al., 2016) and optimizers (Martens and Grosse,
2015), as these directly improve gradient flow. We continue to use PQN as our base RL algorithm
and evaluate on the Atari-10 suite (Aitchison et al., 2023). In section 5, we demonstrate that the
effectiveness of our proposed gradient interventions generalizes beyond this specific algorithm and
environment suite.
[Figure 4: Architecture schematic comparing the Baseline encoder, a DenseNet encoder, and the MultiSkip encoder; the dense layers following the encoder are scaled by increasing depth and width.]
Figure 5: Gradient-stabilizing interventions improve scalability in deep RL. (Left) Standard fully connected networks
trained with PQN collapse at greater depths due to vanishing gradients. In contrast, multi-skip architectures maintain
gradient flow and scale effectively. (Right) The default RAdam optimizer leads to instability in deep networks, while
switching to the Kron optimizer preserves gradient signal and enables stable learning without architectural changes.
Standard residual connections typically span only one or two layers, which can be insufficient in the presence of severe gradient disruption due to non-
stationarity. We introduce multi-skip residual connections, in which the flattened convolutional features
are broadcast directly to all subsequent MLP layers. This design ensures that gradients can propagate
from any depth back to the shared encoder without obstruction.
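A minimal sketch of this design as we read it (whether the encoder features are concatenated into or added to each layer's input is an implementation choice; we sketch concatenation here, and all sizes are illustrative):

```python
import jax
import jax.numpy as jnp

def init_multiskip(key, enc_dim, hidden, depth, num_actions):
    # Every layer after the first also receives the flattened encoder features.
    dims_in = [enc_dim] + [hidden + enc_dim] * depth
    dims_out = [hidden] * depth + [num_actions]
    keys = jax.random.split(key, depth + 1)
    return [{"W": jax.random.normal(k, (din, dout)) / jnp.sqrt(din), "b": jnp.zeros(dout)}
            for k, din, dout in zip(keys, dims_in, dims_out)]

def multiskip_forward(params, enc_features):
    h = enc_features
    for i, layer in enumerate(params):
        x = h if i == 0 else jnp.concatenate([h, enc_features], axis=-1)  # skip from the encoder
        h = x @ layer["W"] + layer["b"]
        if i < len(params) - 1:
            h = jax.nn.relu(h)
    return h                                        # Q-values

key = jax.random.PRNGKey(0)
params = init_multiskip(key, enc_dim=512, hidden=256, depth=4, num_actions=18)
feats = jax.random.normal(key, (32, 512))           # flattened conv features for a batch
print(multiskip_forward(params, feats).shape)       # (32, 18)
```

Because each dense layer sees the encoder output directly, the gradient of the loss reaches the shared encoder through a path of length one from every depth, rather than only through the full chain of Jacobians.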
We compare our network architecture against the standard fully connected baseline across
varying depths. As shown in Figure 5 (left), performance collapses with increased depth in the
baseline, while the multi-skip architecture maintains stable learning and continues to improve across
widths. This improvement is accompanied by consistently higher gradient magnitudes. Complete
results across all network depths and widths are presented in section C.3.
Let L(θ) denote the loss function and g = ∇L(θ) its gradient. A second-order update takes
the form $\theta_{t+1} = \theta_t - \eta H^{-1} g$, where H is the curvature matrix, typically the Hessian or the Fisher
Information Matrix (FIM) (Martens, 2020). Directly inverting H is computationally infeasible for
deep neural networks, so Kronecker-factored approximations such as K-FAC (Martens and Grosse,
2015) address this challenge by approximating H with Kronecker products of smaller factors.
The Kronecker-factored optimizer (Kron for short) approximates the FIM and applies structured
preconditioning that captures inter-parameter dependencies, unlike Adam’s diagonal scaling. This
yields directionally aware preconditioning that better aligns with the curvature of the loss surface
(Martens, 2020). In non-stationary settings, such as deep RL, where both the data distribution and
curvature evolve over time, curvature-aware updates can help preserve gradient signal by main-
taining stable update magnitudes and directions. As shown in Figure 5 (right), replacing RAdam
with Kron prevents performance collapse at greater depths, even in standard MLP architectures.
Complete results across all network depths and widths are presented in section C.3.
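To convey the intuition, here is a didactic single-layer sketch of Kronecker-factored preconditioning in the spirit of K-FAC (this is not the Kron implementation used in our experiments; the damping value, sizes, and plain gradient step are illustrative assumptions): the layer gradient is preconditioned by the inverses of the input and output-gradient second-moment factors instead of Adam-style elementwise scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out = 256, 64, 32
a = rng.normal(size=(batch, d_in))            # layer inputs (activations)
g = rng.normal(size=(batch, d_out))           # backpropagated output gradients
dW = g.T @ a / batch                          # gradient of the loss w.r.t. W (d_out x d_in)

damping = 1e-3                                # keeps the factors well conditioned
A = a.T @ a / batch + damping * np.eye(d_in)  # input second-moment factor
G = g.T @ g / batch + damping * np.eye(d_out) # output-gradient second-moment factor
precond_dW = np.linalg.solve(G, dW) @ np.linalg.inv(A)  # = (A kron G)^{-1} applied to vec(dW)

lr = 1e-2
W = rng.normal(size=(d_out, d_in))
W = W - lr * precond_dW                       # curvature-aware update for this layer
```

In practice the factors are tracked with running averages and inverted only periodically; the relevant contrast with Adam is the structured, direction-aware scaling rather than an elementwise one.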
Figure 6: Gradient-stabilized PQN achieves superior scalability. (Left) On Atari-10, the combined interventions lead to
high HNS even at greater depths, outperforming either intervention alone (see Figure 5), with increased gradient
flow. (Right) On the full ALE suite, our agent outperforms the baseline in 90% of the games with a median performance
improvement of 83.27%.
Figure 7: Gradient interventions enable rapid recovery in non-stationary SL. (Left) Models with combined gradient
interventions rapidly recover accuracy after label reshuffling, demonstrating robust adaptation in non-stationary settings.
(Right) This is supported by stable gradient flow across depth. Dashed curves and shaded boxes indicate MLP baselines.
Figure 8: PPO with gradient interventions. Left: On the full ALE suite, applying the combined gradient interventions to
PPO yields a median performance improvement of 31.40% and outperforms the baseline in 83.64% of the games. Right: In
the Cartpole and Anymal tasks from IsaacGym, only the augmented PPO maintains stable performance across depths and
widths.
We validate the generality of our interventions beyond PQN on Atari-10 by (i) applying them to PPO (Schulman et al., 2017) on the full ALE and on IsaacGym continuous-control tasks (Makoviychuk et al., 2021); (ii) pairing them with the scaled Impala CNN encoder (Espeholt et al., 2018); and (iii) augmenting Simba (Lee et al., 2025) with our proposed techniques and evaluating
performance on the DeepMind Control Suite (DMC) (Tassa et al., 2018).
PPO with Gradient Interventions. Figure 8 (left) shows that augmenting PPO with the same
strategies as in PQN (Layer Normalization, which PQN already uses by default, multi-skip residual connections,
and Kronecker-factored optimization) significantly boosts performance. On the ALE benchmark,
the augmented PPO outperforms the baseline in 83.64% of the environments, achieving a median
relative improvement of 31.40%. In Isaac Gym’s continuous control tasks, including Cartpole and
Anymal (Figure 8, right), the baseline PPO collapses as model size increases, while the augmented
variant remains stable and achieves superior performance at all depths and widths.
Gradient Interventions in Scaled Encoder Variants The Impala CNN is a scalable convolutional
architecture that has demonstrated strong performance gains in agents such as Impala (Espeholt et al.,
2018) and Rainbow (Hessel et al., 2018). We investigate whether, given its capacity to extract richer
representations from visual input, combining Impala CNN with our gradient flow interventions
enables effective scaling of the MLP component. As shown in Figure 9, PPO and PQN benefit
Figure 9: Scaling performance with standard vs. Impala CNN encoders on PQN (left) and PPO (right). Each agent is
evaluated using both the Atari CNN (left sub-panels) and the Impala CNN (right sub-panels) as the encoder. Gradient
interventions enable successful scaling in both cases.
significantly from replacing the standard CNN with the Impala CNN. For PQN, the Impala encoder
enables successful scaling of the MLP, in contrast to the performance collapse seen without our
interventions. These results suggest that the expressivity of richer visual encoders is more effectively
leveraged by deeper networks when gradient flow is preserved.
Simba with Kron Optimizer. Simba (Lee et al., 2025) is a scalable actor-critic framework that
integrates observation normalization, residual connections, and layer normalization. We augment Simba by replacing its
default AdamW optimizer with Kron while keeping all other hyperparameters fixed. We evaluate
SAC (Haarnoja et al., 2018) and DDPG (Lillicrap et al., 2015) on challenging DMC tasks, using
the Simba architectures of varying depth and width. Despite its design for scalability, default
Simba collapses across all tasks as networks grow, as shown in Figure 10 (additional results in
subsection C.5). In contrast, the Kron-augmented version successfully scales in both depth and
width, achieving consistent and stable performance gains. These findings underscore the generality
of our approach in enabling parameter scaling in deep RL agents.
[Figure 10: Episode returns on Humanoid Walk, Humanoid Run, Dog Trot, and Dog Run when scaling depth, width, or both, comparing Simba with AdamW and with Kron.]
6 Related Work
A central challenge in scaling deep RL lies in the inefficient use of model capacity. Increasing
parameter counts often fails to yield proportional gains due to under-utilization. Sokar et al. (2023)
show that online RL induces a growing fraction of inactive neurons, a phenomenon also observed in
offline settings. Ceron et al. (2024a) report that up to 95% of parameters can be pruned post-training
with negligible performance drop, underscoring substantial redundancy. These findings have
motivated techniques such as weight resetting (Schwarzer et al., 2023), tokenized computation (Sokar
et al., 2025), and sparse architectures (Ceron et al., 2024b, Liu et al., 2025, Willi et al., 2024), along
with auxiliary objectives to promote capacity utilization (Farebrother et al., 2023). While scaling
model size offers greater expressivity, its benefits depend on appropriate training strategies (Ota
et al., 2021). Architectural interventions such as SimBa (Lee et al., 2025) improve robustness by
regularizing signal propagation through components such as observation normalization, residual
feedforward blocks, and layer normalization. Complementarily, BRO (Nauman et al., 2024) shows
that scaling the critic network yields substantial gains in sample and compute efficiency, provided it
is paired with strong regularization and optimistic exploration strategies.
Gradient flow, however, remains a central bottleneck. We complement prior efforts by explicitly
targeting vanishing gradients as a mechanism for improving scalability. Our approach builds on the
role of LayerNorm in stabilizing training and enhancing plasticity (Lyle et al., 2024), and leverages its
theoretical effect on gradient preservation as formalized in PQN (Gallici et al., 2025). Optimization-
level interventions such as second-order methods (Martens and Grosse, 2015, Muppidi et al., 2024)
and adaptive optimizers (Bengio et al., 2021, Ellis et al., 2024, Wu et al., 2017) also address instability
under non-stationarity. Our approach integrates architectural and optimizer-level interventions to
enable stable gradient flow and unlock parameter scaling in deep RL agents.
7 Discussion
Our analyses in section 3 suggest that the difficulty in scaling networks in deep RL stems from the in-
teraction between inherent non-stationarity and gradient pathologies that worsen with network size.
In section 4, we introduced targeted interventions to address these challenges, and in subsection 4.3,
we demonstrated their effectiveness. We validated the generality of our approach across agents and
environment suites, consistently observing similar trends. These findings reaffirm the critical role
of network design and optimization dynamics in training scalable RL agents. While our proposed
solutions may not be optimal, they establish a strong baseline and provide a foundation for future
work on gradient stabilization in deep RL. More broadly, our findings suggest that scaling limitations
in deep RL are not solely attributable to algorithmic instability or insufficient exploration, but also
stem from gradient pathologies amplified by architectural and optimization choices. Addressing
these issues directly, without altering the learning algorithm, yields substantial gains in scalability
and performance. This suggests that ensuring stable gradient flow is a necessary precondition for
effective parameter scaling in deep RL.
Limitations. Our study is constrained by computational resources, which limited our ability to
explore architectures beyond a certain size. While our interventions show consistent improvements
across agents and environments, further scaling remains an open question. Although second-order
optimizers introduce additional computational overhead (see Table 10), this cost is mitigated
by leveraging vectorized environments and efficient deep RL algorithms, narrowing the gap relative
to standard methods. These limitations highlight promising directions for future work, including
the development of more computationally efficient gradient stabilization strategies and scalable
optimization techniques.
8 Acknowledgment
The authors would like to thank João Guilherme Madeira Araújo, Evan Walters, Olya Mastikhina,
Dhruv Sreenivas, Ali Saheb Pasand, Ayoub Echchahed and Gandharv Patil for valuable discussions
during the preparation of this work. João Araújo deserves a special mention for providing us
valuable feedback on an early draft of the paper. We want to acknowledge funding support from
Google, CIFAR AI and compute support from Digital Research Alliance of Canada and Mila IDT. We
would also like to thank the Python community (Oliphant, 2007, Van Rossum and Drake Jr, 1995) for
developing tools that enabled this work, including NumPy (Harris et al., 2020), Matplotlib (Hunter,
2007), Jupyter (Kluyver et al., 2016), and Pandas (McKinney, 2013).
References
Joshua Achiam, Ethan Knight, and Pieter Abbeel. Towards characterizing divergence in deep
q-learning. arXiv preprint arXiv:1903.08894, 2019.
Rishabh Agarwal, Max Schwarzer, Pablo Samuel Castro, Aaron C Courville, and Marc Bellemare.
Deep reinforcement learning at the edge of the statistical precipice. Advances in neural information
processing systems, 34:29304–29320, 2021.
Matthew Aitchison, Penny Sweetser, and Marcus Hutter. Atari-5: Distilling the arcade learning
environment down to five games. In International Conference on Machine Learning, pages 421–438.
PMLR, 2023.
Kavosh Asadi, Rasool Fakoor, and Shoham Sabach. Resetting the optimizer in deep rl: An empirical
study. Advances in Neural Information Processing Systems, 36:72284–72324, 2023.
Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint
arXiv:1607.06450, 2016.
Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning envi-
ronment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:
253–279, 2013.
Marc G. Bellemare, Salvatore Candido, Pablo Samuel Castro, Jun Gong, Marlos C. Machado, Sub-
hodeep Moitra, Sameera S. Ponda, and Ziyun Wang. Autonomous navigation of stratospheric
balloons using reinforcement learning. Nature, 588:77 – 82, 2020.
Emmanuel Bengio, Joelle Pineau, and Doina Precup. Correcting momentum in temporal difference
learning. arXiv preprint arXiv:2106.03955, 2021.
Nils Bjorck, Carla P Gomes, and Kilian Q Weinberger. Towards deeper deep reinforcement learning
with spectral normalization. Advances in neural information processing systems, 34:8242–8255, 2021.
Johan Samir Obando Ceron and Pablo Samuel Castro. Revisiting rainbow: Promoting more insightful
and inclusive deep reinforcement learning research. In International Conference on Machine Learning,
pages 1373–1383. PMLR, 2021.
Johan Samir Obando Ceron, Aaron Courville, and Pablo Samuel Castro. In value-based deep
reinforcement learning, a pruned network is a good network. In International Conference on Machine
Learning, pages 38495–38519. PMLR, 2024a.
Johan Samir Obando Ceron, Ghada Sokar, Timon Willi, Clare Lyle, Jesse Farebrother, Jakob Nicolaus
Foerster, Gintare Karolina Dziugaite, Doina Precup, and Pablo Samuel Castro. Mixtures of experts
unlock parameter scaling for deep rl. In International Conference on Machine Learning, pages
38520–38540. PMLR, 2024b.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam
Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm:
Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113,
2023.
Benjamin Ellis, Matthew T Jackson, Andrei Lupu, Alexander D Goldie, Mattie Fellows, Shimon
Whiteson, and Jakob Foerster. Adam on local time: Addressing nonstationarity in rl with relative
adam timesteps. Advances in Neural Information Processing Systems, 37:134567–134590, 2024.
Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Vlad Mnih, Tom Ward, Yotam Doron,
Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance
weighted actor-learner architectures. In International conference on machine learning, pages 1407–1416.
PMLR, 2018.
Jesse Farebrother, Joshua Greaves, Rishabh Agarwal, Charline Le Lan, Ross Goroshin, Pablo Samuel
Castro, and Marc G Bellemare. Proto-value networks: Scaling representation learning with
auxiliary tasks. In The Eleventh International Conference on Learning Representations, 2023. URL
https://openreview.net/forum?id=oGDKSt9JrZi.
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Moham-
madamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz
Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning.
Nature, 610(7930):47–53, 2022.
Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-
critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018.
Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster,
and Mario Martin. Simplifying deep temporal difference learning. In The Thirteenth International
Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=7IzeL0kflu.
Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization
via hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241.
PMLR, 2019.
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics,
pages 249–256. JMLR Workshop and Conference Proceedings, 2010.
Florin Gogianu, Tudor Berariu, Mihaela C Rosca, Claudia Clopath, Lucian Busoniu, and Razvan
Pascanu. Spectral normalisation for deep reinforcement learning: an optimisation perspective. In
International Conference on Machine Learning, pages 3734–3744. PMLR, 2021.
Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor opti-
mization. In International Conference on Machine Learning, pages 1842–1850. PMLR, 2018.
Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor. In International conference
on machine learning, pages 1861–1870. PMLR, 2018.
Beining Han, Zhizhou Ren, Zuofan Wu, Yuan Zhou, and Jian Peng. Off-policy reinforcement
learning with delayed rewards. In International conference on machine learning, pages 8280–8303.
PMLR, 2022.
Charles R Harris, K Jarrod Millman, Stéfan J Van Der Walt, Ralf Gommers, Pauli Virtanen, David
Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, et al. Array program-
ming with numpy. Nature, 585(7825):357–362, 2020.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages
770–778, 2016.
Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan
Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in
deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32,
2018.
Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent
magnitude. COURSERA: Neural networks for machine learning, 4(2):26, 2012.
Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected con-
volutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 4700–4708, 2017.
John D Hunter. Matplotlib: A 2d graphics environment. Computing in science & engineering, 9(03):
90–95, 2007.
Andrew Ilyas, Logan Engstrom, Shibani Santurkar, Dimitris Tsipras, Firdaus Janoos, Larry Rudolph,
and Aleksander Madry. A closer look at deep policy gradients. In International Conference on
Learning Representations, 2020. URL https://openreview.net/forum?id=ryxdEkHtPS.
Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In International conference on machine learning, pages 448–456.
PMLR, 2015.
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and
generalization in neural networks. Advances in neural information processing systems, 31, 2018.
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott
Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models.
arXiv preprint arXiv:2001.08361, 2020.
Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier,
Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián
Avila, Safia Abdalla, Carol Willing, and Jupyter Development Team. Jupyter Notebooks—a
publishing format for reproducible computational workflows. In IOS Press, pages 87–90. 2016. doi:
10.3233/978-1-61499-649-1-87.
Alex Krizhevsky et al. Learning multiple layers of features from tiny images. 2009.
Aviral Kumar, Rishabh Agarwal, Dibya Ghosh, and Sergey Levine. Implicit under-parameterization
inhibits data-efficient deep reinforcement learning. arXiv preprint arXiv:2010.14498, 2020.
Hojoon Lee, Dongyoon Hwang, Donghu Kim, Hyunseung Kim, Jun Jet Tai, Kaushik Subramanian,
Peter R. Wurman, Jaegul Choo, Peter Stone, and Takuma Seno. Simba: Simplicity bias for scaling
up parameters in deep reinforcement learning. In The Thirteenth International Conference on Learning
Representations, 2025. URL https://openreview.net/forum?id=jXLiDKsuDo.
Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip H. S. Torr. A signal propagation
perspective for pruning neural networks at initialization. In International Conference on Learning
Representations, 2020. URL https://openreview.net/forum?id=HJeTo2VFwH.
Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa,
David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv
preprint arXiv:1509.02971, 2015.
Jiashun Liu, Johan Samir Obando Ceron, Aaron Courville, and Ling Pan. Neuroplastic expansion in
deep reinforcement learning. In The Thirteenth International Conference on Learning Representations,
2025. URL https://openreview.net/forum?id=20qZK2T7fa.
Lu Lu, Yeonjong Shin, Yanhui Su, and George Em Karniadakis. Dying relu and initialization: Theory
and numerical examples. arXiv preprint arXiv:1903.06733, 2019.
Clare Lyle, Mark Rowland, and Will Dabney. Understanding and preventing capacity loss in
reinforcement learning. In International Conference on Learning Representations, 2022. URL https:
//openreview.net/forum?id=ZkC8wKoLbQ7.
Clare Lyle, Zeyu Zheng, Evgenii Nikishin, Bernardo Avila Pires, Razvan Pascanu, and Will Dabney.
Understanding plasticity in neural networks. In International Conference on Machine Learning, pages
23190–23211. PMLR, 2023.
Clare Lyle, Zeyu Zheng, Khimya Khetarpal, James Martens, Hado P van Hasselt, Razvan Pascanu,
and Will Dabney. Normalization and effective learning rates in reinforcement learning. Advances
in Neural Information Processing Systems, 37:106440–106473, 2024.
Xuezhe Ma. Apollo: An adaptive parameter-wise diagonal quasi-newton method for nonconvex
stochastic optimization. arXiv preprint arXiv:2009.13586, 2020.
Viktor Makoviychuk, Lukasz Wawrzyniak, Yunrong Guo, Michelle Lu, Kier Storey, Miles Macklin,
David Hoeller, Nikita Rudin, Arthur Allshire, Ankur Handa, et al. Isaac gym: High performance
gpu-based physics simulation for robot learning. arXiv preprint arXiv:2108.10470, 2021.
James Martens. New insights and perspectives on the natural gradient method. Journal of Machine
Learning Research, 21(146):1–76, 2020.
James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate
curvature. In International conference on machine learning, pages 2408–2417. PMLR, 2015.
Wes McKinney. Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython. O’Reilly
Media, 1 edition, February 2013. ISBN 9789351100065. URL http://www.amazon.com/exec/obidos/
redirect?tag=citeulike07-20&path=ASIN/1449319793.
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare,
Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles
Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane
Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature,
518(7540):529–533, February 2015.
Aneesh Muppidi, Zhiyu Zhang, and Heng Yang. Fast trac: A parameter-free optimizer for lifelong
reinforcement learning. Advances in Neural Information Processing Systems, 37:51169–51195, 2024.
Michal Nauman, Mateusz Ostaszewski, Krzysztof Jankowski, Piotr Miłoś, and Marek Cygan. Bigger,
regularized, optimistic: scaling for compute and sample efficient continuous control. In The
Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
Johan Obando Ceron, Marc Bellemare, and Pablo Samuel Castro. Small batch deep reinforcement
learning. Advances in Neural Information Processing Systems, 36:26003–26024, 2023.
Travis E. Oliphant. Python for scientific computing. Computing in Science & Engineering, 9(3):10–20,
2007. doi: 10.1109/MCSE.2007.58.
Kei Ota, Devesh K Jha, and Asako Kanezaki. Training larger networks for deep reinforcement
learning. arXiv preprint arXiv:2102.07920, 2021.
Kei Ota, Devesh K Jha, and Asako Kanezaki. A framework for training larger networks for deep
reinforcement learning. Machine Learning, 113(9):6115–6139, 2024.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural
networks. In International conference on machine learning, pages 1310–1318. PMLR, 2013.
Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. Resurrecting the sigmoid in deep
learning through dynamical isometry: theory and practice. Advances in neural information processing
systems, 30, 2017.
Samuel S Schoenholz, Justin Gilmer, Surya Ganguli, and Jascha Sohl-Dickstein. Deep information
propagation. In International Conference on Learning Representations, 2017.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Max Schwarzer, Johan Samir Obando Ceron, Aaron Courville, Marc G Bellemare, Rishabh Agarwal,
and Pablo Samuel Castro. Bigger, better, faster: Human-level Atari with human-level efficiency.
In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and
Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning,
volume 202 of Proceedings of Machine Learning Research, pages 30365–30380. PMLR, 23–29 Jul 2023.
URL https://proceedings.mlr.press/v202/schwarzer23a.html.
Yeonjong Shin and George Em Karniadakis. Trainability of relu networks and data-dependent
initialization. Journal of Machine Learning for Modeling and Computing, 1(1), 2020.
Ghada Sokar, Rishabh Agarwal, Pablo Samuel Castro, and Utku Evci. The dormant neuron phe-
nomenon in deep reinforcement learning. In International Conference on Machine Learning, pages
32145–32168. PMLR, 2023.
Ghada Sokar, Johan Samir Obando Ceron, Aaron Courville, Hugo Larochelle, and Pablo Samuel
Castro. Don’t flatten, tokenize! unlocking the key to softmoe’s efficacy in deep RL. In The Thirteenth
International Conference on Learning Representations, 2025. URL https://openreview.net/forum?
id=8oCrlOaYcc.
Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book,
Cambridge, MA, USA, 2018. ISBN 0262039249.
Adrien Ali Taiga, Rishabh Agarwal, Jesse Farebrother, Aaron Courville, and Marc G Bellemare.
Investigating multi-task pretraining and generalization in reinforcement learning. In The Eleventh
International Conference on Learning Representations, 2023. URL https://openreview.net/forum?
id=sSt9fROSZRO.
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden,
Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, et al. Deepmind control suite. arXiv preprint
arXiv:1801.00690, 2018.
Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double
q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016.
Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph
Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.
Guido Van Rossum and Fred L Drake Jr. Python reference manual. Centrum voor Wiskunde en
Informatica Amsterdam, 1995.
Oriol Vinyals, Igor Babuschkin, Wojciech M Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung
Chung, David H Choi, Richard Powell, Timo Ewalds, Petko Georgiev, et al. Grandmaster level in
starcraft ii using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
Kevin Wang, Ishaan Javali, Michał Bortkiewicz, Benjamin Eysenbach, et al. 1000 layer networks
for self-supervised rl: Scaling depth can enable new goal-reaching capabilities. arXiv preprint
arXiv:2503.14858, 2025.
Timon Willi, Johan Obando-Ceron, Jakob Foerster, Karolina Dziugaite, and Pablo Samuel Castro.
Mixture of experts in a mixture of rl settings. arXiv preprint arXiv:2406.18420, 2024.
Yuhuai Wu, Elman Mansimov, Roger B Grosse, Shun Liao, and Jimmy Ba. Scalable trust-region
method for deep reinforcement learning using kronecker-factored approximation. Advances in
neural information processing systems, 30, 2017.
Zeyu Zheng, Junhyuk Oh, and Satinder Singh. On learning intrinsic rewards for policy gradient
methods. Advances in neural information processing systems, 31, 2018.
Juntang Zhuang, Tommy Tang, Yifan Ding, Sekhar C Tatikonda, Nicha Dvornek, Xenophon Pa-
pademetris, and James Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed
gradients. Advances in neural information processing systems, 33:18795–18806, 2020.
A Environment Details
Throughout the paper, we evaluate the deep reinforcement learning agents’ performance on the
Atari-10 suite (Aitchison et al., 2023), a curated subset of games from the Arcade Learning Environ-
ment (ALE) (Bellemare et al., 2013). Atari-10 consists of 10 games selected to capture the maximum
variance in algorithm performance, achieving over 90% correlation with results on the full ALE
benchmark. This makes it a computationally efficient yet representative testbed for deep reinforce-
ment learning. We follow the experimental protocol of Agarwal et al. (2021), Ceron et al. (2024b),
Obando Ceron et al. (2023), running each experiment with three random seeds and reporting the
aggregate human-normalized score across games.
The games in Atari-10 are:
• Amidar, Battle Zone, Bowling, Double Dunk, Frostbite, Kung Fu Master, Name This Game,
Phoenix, Q*Bert and River Raid.
Additionally, to further support the generality of our findings, we evaluate the proposed
combined gradient interventions on the full ALE benchmark. We also assess their effectiveness on
continuous control tasks from the IsaacGym simulator (Makoviychuk et al., 2021) and the DeepMind
Control Suite (DMC) (Tassa et al., 2018), extending our analysis to robotics-based environments. We
conduct experiments on four challenging DMC tasks: Humanoid Walk, Humanoid Run, Dog Trot, and Dog Run.
B Network Sizes
Throughout the paper, we experiment with models of varying depths and widths. Unless stated
otherwise (e.g. in section 5, where we evaluate the Impala CNN), the convolutional feature extractors
are kept fixed. Consequently, our experiments focus primarily on scaling strategies and architectural
variations in the MLP components of the networks.
To enable meaningful comparisons across different learning regimes, the MLP architectures are
kept consistent across supervised learning (SL), non-stationary SL, and reinforcement learning (RL)
experiments. This consistency ensures that observed differences in gradient behavior arise from the
learning setting itself, rather than confounding factors due to domain-specific architectures.
Table 1 provides detailed information on the number of parameters for each depth–width
configuration, categorized as small, medium, or large, as used throughout the paper.
C Additional Experiments
C.1 Scaling with DQN and Rainbow
To further support our hypothesis on the emergence of gradient pathologies in deep reinforcement
learning, we investigate whether similar issues arise in algorithms beyond PQN and PPO, as
discussed in the main paper. Specifically, we study the effects of architectural scaling on two widely
used value-based algorithms: DQN (Mnih et al., 2015) and Rainbow (Hessel et al., 2018).
DQN is a foundational deep RL algorithm that learns action-value functions using temporal
difference updates and experience replay, serving as a standard baseline for value-based methods.
Rainbow extends DQN by integrating several enhancements, such as double Q-learning, prioritized
experience replay, dueling networks, multi-step learning, distributional value functions, and noisy
exploration, to achieve improved sample efficiency and stability.
In Figure 11, we report the performance of DQN and Rainbow as we scale the depth and width
of their networks. As with PQN and PPO, we observe consistent degradation in performance at
larger scales. In Figure 12, we present the corresponding gradient behavior, which reveals the same
vanishing and destabilization phenomena discussed in this work. These findings reinforce the
generality of the identified gradient pathologies across both policy-based and value-based deep RL
algorithms.
Figure 11: Median human normalized scores for DQN (left) and Rainbow (right) as a function of total network
parameters. Lines of different colors denote varying network depths, while marker shapes indicate different widths. For
both agents, performance consistently declines as network size increases, highlighting the adverse effects of scaling.
Figure 12: Gradient magnitudes during training for DQN (top) and Rainbow (bottom). As network depth increases,
gradient flow systematically diminishes, ultimately collapsing to near-zero values. This consistent decay mirrors the
performance degradation observed at larger scales.
continuous adaptation. The models quickly adapt to the changing optimization problem following
label reshuffling, with gradient magnitudes remaining stable throughout the process.
We present the results for PPO and PQN across all tested optimizers in Figure 13.
Figure 13: Median human normalized scores on Atari-10 for PPO (top row) and PQN (bottom row), comparing a range
of optimizers including RAdam, AdaBelief, Shampoo, Apollo, and Kron (shown in the main curves). While adaptive
optimizers like AdaBelief show some robustness, only Kron consistently enables stable and performant training as models
scale. Each curve represents the mean performance across three random seeds per algorithm, with shaded areas indicating
95% bootstrap confidence intervals.
Results with the Multi-Skip Architecture. We present the full learning curves comparing the
proposed multi-skip architecture to the baseline fully connected architecture across all depths and
widths studied in the paper. We follow the experimental protocol of Agarwal et al. (2021), Ceron
et al. (2024b), Obando Ceron et al. (2023), running each experiment with three random seeds.
Results with the Kron Optimizer. We present the full learning curves comparing the Kron
optimizer to the baseline RAdam optimizer originally used in PQN (Gallici et al., 2025), across all
depths and widths studied in the paper. We follow the experimental protocol of Agarwal et al.
(2021), Ceron et al. (2024b), Obando Ceron et al. (2023), running each experiment with three random
seeds.
Figure 14: Median human-normalized scores with PQN on the Atari-10 benchmark, comparing the baseline agent and
the proposed multi-skip architecture across varying depths and widths. The multi-skip architecture not only improves
performance at shallow depths, but also enables PQN to remain trainable across all scales considered, whereas the baseline
MLP rapidly collapses as depth and width increase. Each curve represents the mean performance across three random
seeds per algorithm, with shaded areas indicating 95% bootstrap confidence intervals.
Figure 15: Median human-normalized scores with PQN on the Atari-10 benchmark, comparing the Kron optimizer
to the baseline RAdam optimizer across varying depths and widths. Similar to the multi-skip architecture, Kron not
only improves performance at shallow depths, but also enables PQN to remain trainable across all scales considered.
In contrast, performance with RAdam rapidly collapses as depth and width increase. Each curve represents the mean
performance across three random seeds per algorithm, with shaded areas indicating 95% bootstrap confidence intervals.
We follow the experimental protocol of Agarwal et al. (2021), Ceron et al. (2024b), Obando Ceron et al.
(2023), running each experiment with three random seeds.
Figure 16: Mean human-normalized score on the full ALE suite, comparing the baseline PQN agent (light curves) with the
augmented agent using our combined gradient interventions (dark curves).
Figure 17: Mean human-normalized score on the full ALE suite, comparing the baseline PPO agent (light curves) with the
augmented agent using our combined gradient interventions (dark curves).
C.5 Simba on DMC
In this section, we present the full results accompanying the experiments combining Simba (Lee
et al., 2025) with our proposed gradient interventions, as introduced in subsection 4.3. For these
experiments, we retain Simba’s original architectural choices but replace the AdamW optimizer
with Kron.
We compare Simba using both SAC and DDPG as the underlying RL algorithms. While SAC
generally outperforms DDPG, we consistently observe that scaling depth and width, either inde-
pendently or jointly, leads to a degradation in performance with Simba. However, this degradation
is mitigated, and in many cases reversed, when using the Kron optimizer, resulting in improved
performance as model capacity increases.
The following figures illustrate these findings:
[Figures: Episode return versus parameter count (in millions) on Humanoid Walk, Humanoid Run, Dog Trot, and Dog Run, comparing Simba with AdamW against Simba with Kron under SAC and under DDPG, when scaling depth, width, or both.]
D Hyper-parameters
Below, we provide details of the hyperparameters used throughout the paper for each algorithm. In
general, they match those proposed in the corresponding original papers.
Table 2: PQN Hyperparameters
Hyperparameter Value / Description
Learning rate 2.5e-4
Anneal lr False (no learning rate annealing)
Num envs 128 (parallel environments)
Num steps 32 (steps per rollout per environment)
Gamma 0.99 (discount factor)
Num minibatches 32
Update epochs 2 (policy update epochs)
Max grad norm 10.0 (gradient clipping)
Start e 1.0 (initial exploration rate)
End e 0.005 (final exploration rate)
Exploration fraction 0.10 (exploration annealing fraction)
Q lambda 0.65 (Q(λ) parameter)
Use ln True (use layer normalization)
Activation fn relu (activation function)
Table 4: PPO Hyperparameters for IsaacGym
Hyperparameter Value / Description
Total timesteps 30,000,000
Learning rate 0.0026
Num envs 4096 (parallel environments)
Num steps 16 (steps per rollout)
Anneal lr False (disable learning rate annealing)
Gamma 0.99 (discount factor)
Gae lambda 0.95 (GAE lambda)
Num minibatches 2
Update epochs 4 (update epochs per PPO iteration)
Norm adv True (normalize advantages)
Clip coef 0.2 (policy clipping coefficient)
Clip vloss False (disable value function clipping)
Ent coef 0.0 (entropy coefficient)
Vf coef 2.0 (value function loss coefficient)
Max grad norm 1.0 (max gradient norm)
Use ln False (no layer normalization)
Activation fn relu (activation function)
Table 6: Rainbow Hyperparameters
Hyperparameter Value / Description
Learning rate 6.25e-5
Num envs 1
Buffer size 1,000,000 (replay memory size)
Gamma 0.99 (discount factor)
Tau 1.0 (target network update rate)
Target network frequency 8000 (timesteps per target update)
Batch size 32
Start e 1.0 (initial exploration epsilon)
End e 0.01 (final exploration epsilon)
Exploration fraction 0.10 (fraction of total timesteps for decay)
Learning starts 80,000 (timesteps before training starts)
Train frequency 4 (training frequency)
N step 3 (n-step Q-learning horizon)
Prioritized replay alpha 0.5
Prioritized replay beta 0.4
Prioritized replay eps 1e-6
N atoms 51 (number of atoms in distributional RL)
V min -10 (value distribution lower bound)
V max 10 (value distribution upper bound)
Use ln False (no layer normalization)
Activation fn relu (activation function)
Table 8: SAC Hyperparameters
Hyperparameter Value / Description
Critic block type SimBa
Critic num blocks {2, 4, 6, 8}
Critic hidden dim {512, 1024, 1536, 2048}
Target critic momentum (τ ) 5e-3
Actor block type SimBa
Actor num blocks {1, 2, 3, 4}
Actor hidden dim {128, 256, 384, 512}
Initial temperature (α0 ) 1e-2
Temperature learning rate 1e-4
Target entropy (H∗ ) |A|/2
Batch size 256
Optimizer {AdamW, Kron}
AdamW’s learning rate 1e-4
Kron’s learning rate 5e-5
Optimizer momentum (β1 , β2 ) (0.9, 0.999)
Weight decay (λ) 1e-2
Discount (γ ) Heuristic
Replay ratio 2
Clipped Double Q False
Table 9: DDPG Hyperparameters
Hyperparameter Value / Description
Critic block type SimBa
Critic num blocks {2, 4, 6, 8}
Critic hidden dim {512, 1024, 1536, 2048}
Critic learning rate 1e-4
Target critic momentum (τ ) 5e-3
Actor block type SimBa
Actor num blocks {1, 2, 3, 4}
Actor hidden dim {128, 256, 384, 512}
Actor learning rate 1e-4
Exploration noise N (0, 0.12 )
Batch size 256
Optimizer {AdamW, Kron}
AdamW’s learning rate 1e-4
Kron’s learning rate 5e-5
Optimizer momentum (β1 , β2 ) (0.9, 0.999)
Weight decay (λ) 1e-2
Discount (γ ) Heuristic
Replay ratio 2
Clipped Double Q False
E Compute Details
All experiments were conducted on a single-GPU setup using an NVIDIA RTX 8000, 12 CPU workers,
and 50GB of RAM.
Table 10: Training times across model scales for two optimizers. K-FAC (Kron) shows increased cost as depth and width grow.
Depth Width Optimizer Time
RAdam
Small Small Adam 51m
Small Medium Adam 53m
Small Large Adam 57m
Medium Small Adam 1h 4m
Medium Medium Adam 1h 10m
Medium Large Adam 1h 11m
Large Small Adam 1h 18m
Large Medium Adam 1h 18m
Large Large Adam 1h 27m
Kron
Small Small Kron 1h 59m
Small Medium Kron 2h 27m
Small Large Kron 3h 38m
Medium Small Kron 2h 44m
Medium Medium Kron 3h 32m
Medium Large Kron 5h 59m
Large Small Kron 3h 27m
Large Medium Kron 4h 36m
Large Large Kron 7h 42m