research-article

Guidelines for the Regularization of Gammas in Batch Normalization for Deep Residual Networks

Published: 29 March 2024

Abstract

L2 regularization for weights in neural networks is widely used as a standard training technique. Beyond the weights, batch normalization introduces an additional trainable parameter γ, which acts as a scaling factor. However, L2 regularization for γ remains largely undiscussed and is applied inconsistently across libraries and practitioners. In this article, we study whether L2 regularization for γ is justified. To explore this issue, we consider two approaches: (1) variance control, which makes the residual network behave like an identity mapping, and (2) stable optimization through an improved effective learning rate. Through these two analyses, we identify the γ parameters for which L2 regularization is desirable and those for which it is not, and we propose four guidelines for managing them. In several experiments, applying L2 regularization to the applicable γ increased classification accuracy by 1% to 4%, whereas applying it to the inapplicable γ decreased classification accuracy by 1% to 3%, consistent with our four guidelines. The proposed guidelines were further validated across various tasks and architectures, including variants of residual networks and transformers.
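The γ discussed above is the per-channel scale in batch normalization, y = γ · x̂ + β, where x̂ is the normalized activation. As a minimal sketch of the mechanics involved (assuming PyTorch; the backbone and hyperparameter values are illustrative and not the authors' exact recipe), weight decay, i.e., L2 regularization, can be applied to or withheld from the γ parameters by placing them in a separate optimizer parameter group:

import torch
import torch.nn as nn
from torchvision.models import resnet50  # backbone chosen only for illustration

model = resnet50()

# Collect the BN gamma (scale) parameters separately from all other parameters.
bn_gammas, others = [], []
for module in model.modules():
    if isinstance(module, nn.BatchNorm2d):
        bn_gammas.append(module.weight)          # gamma
        if module.bias is not None:
            others.append(module.bias)           # beta
    else:
        others.extend(p for p in module.parameters(recurse=False))

# Two parameter groups let L2 regularization (weight decay) be applied to gamma
# independently of the convolutional and fully connected weights.
optimizer = torch.optim.SGD(
    [
        {"params": others,    "weight_decay": 1e-4},
        {"params": bn_gammas, "weight_decay": 0.0},  # set > 0 to regularize gamma
    ],
    lr=0.1,
    momentum=0.9,
)

Whether the second group's weight_decay should be zero or nonzero for a given γ is exactly the question the guidelines in this article address.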


Cited By

  • (2024) Parallelization of Recurrent Neural Network-Based Equalizer for Coherent Optical Systems via Knowledge Distillation. Journal of Lightwave Technology 42, 7 (2275–2284). Online publication date: 1-Apr-2024. https://doi.org/10.1109/JLT.2023.3337604

      Information & Contributors

      Information

      Published In

      ACM Transactions on Intelligent Systems and Technology  Volume 15, Issue 3
      June 2024
      646 pages
      EISSN: 2157-6912
      DOI: 10.1145/3613609
      • Editor: Huan Liu

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 29 March 2024
      Online AM: 01 February 2024
      Accepted: 29 January 2024
      Revised: 25 January 2024
      Received: 19 July 2023
      Published in TIST Volume 15, Issue 3

      Author Tags

      1. L2 regularization
      2. weight decay
      3. batch normalization
      4. residual network
      5. effective learning rate
      6. deep learning

      Qualifiers

      • Research-article

      Funding Sources

      • Samsung Electronics Co., Ltd
