Abstract: This study investigates how weight decay affects the update behavior of individual neurons in deep neural networks through a combination of applied analysis and experimentation. Weight decay can cause the expected magnitude and angular updates of a neuron's weight vector to converge to a steady state we call rotational equilibrium. These states can be highly homogeneous, effectively balancing the average rotation---a proxy for the effective learning rate---across different layers and neurons. Our work analyzes these dynamics across optimizers like Adam, Lion, and SGD with momentum, offering a new simple perspective on training that elucidates the efficacy of widely used but poorly understood methods in deep learning. We demonstrate how balanced rotation plays a key role in the effectiveness of normalization like Weight Standardization, as well as that of AdamW over Adam with L2-regularization. Finally, we show that explicitly controlling the rotation provides the benefits of weight decay while substantially reducing the need for learning rate warmup.
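To make the abstract's central quantity concrete, the sketch below shows one plausible way to measure the per-neuron angular update (rotation) of a layer's weight vectors between optimizer steps, the quantity the abstract describes as converging to a steady state under weight decay. This is a minimal illustration under assumed conventions, not the paper's implementation: the function `neuron_rotation`, the toy `Linear` layer, the stand-in objective, and all hyperparameters are hypothetical, and the row-per-neuron weight layout is simply the usual `torch.nn.Linear` convention.

```python
import torch

def neuron_rotation(w_prev: torch.Tensor, w_curr: torch.Tensor) -> torch.Tensor:
    """Angle (radians) between each neuron's weight vector before and after a step.

    Assumes a weight matrix of shape [out_features, in_features], so each row is
    one neuron's weight vector (the torch.nn.Linear convention).
    """
    cos = torch.nn.functional.cosine_similarity(w_prev, w_curr, dim=1)
    return torch.acos(cos.clamp(-1.0, 1.0))

# Hypothetical usage: track the average per-neuron rotation of one layer per step.
layer = torch.nn.Linear(128, 64)
opt = torch.optim.AdamW(layer.parameters(), lr=1e-3, weight_decay=0.1)

for step in range(1000):
    w_prev = layer.weight.detach().clone()
    x = torch.randn(32, 128)
    loss = layer(x).pow(2).mean()  # stand-in objective for illustration only
    opt.zero_grad()
    loss.backward()
    opt.step()
    mean_rot = neuron_rotation(w_prev, layer.weight.detach()).mean()
    # Per the abstract, this average rotation (a proxy for the effective learning
    # rate) is expected to approach a steady value, the rotational equilibrium,
    # as training progresses with weight decay.
```

A follow-up measurement one could make with the same helper is to compare `mean_rot` across layers, since the abstract claims the equilibrium rotation can be highly homogeneous across layers and neurons.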
Submission Number: 7691