research-article

Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks

Published: 29 March 2024

Abstract

L2 regularization for weights in neural networks is widely used as a standard training technique. Beyond the weights, batch normalization introduces an additional trainable parameter γ, which acts as a scaling factor. However, L2 regularization for γ remains largely undiscussed and is applied inconsistently across libraries and practitioners. In this article, we study whether L2 regularization for γ is justified. To explore this issue, we consider two approaches: (1) variance control, which makes the residual network behave like an identity mapping, and (2) stable optimization through an improved effective learning rate. Through these two analyses, we identify the γ parameters for which L2 regularization is desirable and those for which it is not, and we propose four guidelines for managing them. In several experiments, applying L2 regularization to the applicable γ increased classification accuracy by 1% to 4%, whereas applying it to the inapplicable γ decreased classification accuracy by 1% to 3%, consistent with our four guidelines. The proposed guidelines were further validated across various tasks and architectures, including variants of residual networks and transformers.
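The γ discussed above is the per-channel scale in batch normalization, y = γ · x̂ + β, where x̂ is the normalized activation. As a minimal sketch of the mechanics involved (assuming PyTorch; the backbone and hyperparameter values are illustrative and not the authors' exact recipe), weight decay, i.e., L2 regularization, can be applied to or withheld from the γ parameters by placing them in a separate optimizer parameter group:

import torch
import torch.nn as nn
from torchvision.models import resnet50  # backbone chosen only for illustration

model = resnet50()

# Collect the BN gamma (scale) parameters separately from all other parameters.
bn_gammas, others = [], []
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        bn_gammas.append(module.weight)          # gamma
        if module.bias is not None:
            others.append(module.bias)           # beta
    else:
        others.extend(p for p in module.parameters(recurse=False))

# Two parameter groups let L2 regularization (weight decay) be applied to gamma
# independently of the convolutional and fully connected weights.
optimizer = torch.optim.SGD(
    [
        {"params": others,    "weight_decay": 1e-4},
        {"params": bn_gammas, "weight_decay": 0.0},  # set > 0 to regularize gamma
    ],
    lr=0.1,
    momentum=0.9,
)

Whether the second group's weight_decay should be zero or nonzero for a given γ is exactly the question the guidelines in this article address.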


Cited By

  • (2024) Parallelization of Recurrent Neural Network-Based Equalizer for Coherent Optical Systems via Knowledge Distillation. Journal of Lightwave Technology 42, 7 (2275–2284). Online publication date: 1-Apr-2024. https://doi.org/10.1109/JLT.2023.3337604

      Information & Contributors

      Information

      Published In

      ACM Transactions on Intelligent Systems and Technology  Volume 15, Issue 3
      June 2024
      646 pages
      EISSN: 2157-6912
      DOI: 10.1145/3613609
      • Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 March 2024
      Online AM: 01 February 2024
      Accepted: 29 January 2024
      Revised: 25 January 2024
      Received: 19 July 2023
      Published in TIST Volume 15, Issue 3

      Author Tags

      1. L2 regularization
      2. weight decay
      3. batch normalization
      4. residual network
      5. effective learning rate
      6. deep learning

      Qualifiers

      • Research-article

      Funding Sources

      • Samsung Electronics Co., Ltd
