0% found this document useful (0 votes)

19 views31 pages

Convergence

This document presents a convergence analysis of a new algorithm called diminishing batch normalization (DBN), which generalizes the traditional batch normalization (BN) method used in deep neural networks. The authors demonstrate that DBN converges to a stationary point under certain conditions and outperforms BN in numerical experiments on various datasets. The paper provides theoretical guarantees for the convergence of DBN and discusses its implementation and advantages over the original BN algorithm.

Uploaded by

juntaoduan1

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

19 views31 pages

Convergence

Uploaded by

juntaoduan1

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 31

Convergence Analysis of Batch Normalization for

Deep Neural Nets

Yintai Ma, Diego Klabjan

Northwestern University, Evanston, IL 60201
yintaima2020@u.northwestern.edu, d-klabjan@northwestern.edu
arXiv:1705.08011v1 [cs.LG] 22 May 2017

Abstract

Batch normalization (BN) is very effective in accelerating the convergence of a

neural network training phase that it has become a common practice. We propose a
generalization of BN, the diminishing batch normalization (DBN) algorithm. We
provide an analysis of the convergence of the DBN algorithm that converges to
a stationary point with respect to trainable parameters. We analyze a two layer
model with linear activation. The main challenge of the analysis is the fact that
some parameters are updated by gradient while others are not. In the numerical
experiments, we use models with more layers and ReLU activation. We observe
that DBN outperforms the original BN algorithm on MNIST, NI and CIFAR-10
datasets with reasonable complex FNN and CNN models.

1 Introduction

Deep neural networks (DNN) have shown unprecedented success in various applications such as
object detection. However, it still takes a long time to train a DNN until it converges. Ioffe and
Szegedy [12] identified an important problem involved in training deep networks, internal covariate
shift, and then proposed batch normalization (BN) to decrease this phenomenon. BN addresses
this problem by normalizing the distribution of every hidden layer’s input. In order to do so, it
calculates the pre-activation mean and standard deviation using mini-batch statistics at each iteration
of training and uses these estimates to normalize the input to the next layer. The output of a layer is
normalized by using the batch statistics and two new trainable parameters per neuron are introduced
that capture the inverse operation. It is now a standard practice [6, 11]. While this approach leads to
a significant performance jump, to the best of our knowledge, there is no known theoretical guarantee
for the convergence of an algorithm with BN. The difficulty of analyzing the convergence of the BN
algorithm comes from the fact that not all of the BN parameters are updated by gradients. Thus it
invalidates most of the classical studies of convergence for gradient methods.
In this paper, we propose a generalization of the BN algorithm, diminishing batch normalization
(DBN), where we update the BN parameters in a diminishing moving average way. It essentially
means that the BN layer adjusts its output according to all past mini-batches instead of only the
current one. It helps to reduce the problem of the original BN that the output of a BN layer on a
particular training pattern depends on the other patterns in the current mini-batch, which is pointed
out in Bottou et al. [6]. By setting the layer parameter we introduce into DBN to a specific value, we
recover the original BN algorithm.
We give a convergence analysis of the algorithm with a two-layer batch-normalized neural network
and diminishing stepsizes. We assume two layers (the generalization to multiple layers can be
done by using the same approach but substantially complicating the notation) and an arbitrary loss
function. The convergence analysis does require linear activations. The main result shows that under
diminishing stepsizes on gradient updates and updates on mini-batch statistics, and standard Lipschitz
conditions on loss functions DBN converges to a stationary point. As already pointed out the main
challenge is the fact that some trainable parameters are updated by gradient while others are updated
by a mere recalculation. We also study the convex case where we assume that the overall loss function
is convex (which holds if the regularization parameter is large). In this case, we show a sublinear
convergence rate towards the global minimum.
Contributions. The main contribution of this paper is in providing a general convergence guarantee
for DBN. Specifically, we make the following contributions.

• In section 4, we show the sufficient and necessary conditions for the stepsizes and diminish-
ing weights to ensure the convergence of BN parameters.

• We show that the algorithm converges to a stationary point under a general nonconvex
objective function.

• In section 5, we show that decreasing the stepsizes η (m) for trainable parameters sublinearly
(i.e., at rate O(1/m)) ensures sublinear convergence to the optimal value with a convex
objective function.

This paper is organized as follows. In Section 2, we review the related works and the development of
the BN algorithm. We formally state our model and algorithm in Section 3. We present our main
results in Sections 4 and 5. In Section 6, we numerically show that the DBN algorithm outperforms
the original BN algorithm. Proofs are collected in Appendixes A and B in supplemental material.

2 Literature Review

Before the introduction of BN, it has long been known in the deep learning community that input
whitening and decorrelation help to speed up the training process. In fact, Orr and Müller [16] show
that preprocessing the data by subtracting the mean, normalizing the variance, and decorrelating the
input has various beneficial effects for back-propagation. Krizhevsky et al. [14] propose a method
called local response normalization which is inspired by computational neuroscience and acts as a
form of lateral inhibition, i.e., the capacity of an excited neuron to reduce the activity of its neighbors.
Gülçehre and Bengio [10] propose a standardization layer that bears significant resemblance to
batch normalization, except that the two methods are motivated by very different goals and perform
different tasks.
Inspired by BN, there are several new works taking BN as a basis for further improvements. Layer
normalization [3] is much like the BN except that it uses all of the summed inputs to compute the
mean and variance instead of the mini-batch statistics. Besides, unlike BN, layer normalization
performs exactly the same computation at training and test times. Normalization propagation [2] uses
data-independent estimations for the mean and standard deviation in every layer to reduce the internal
covariate shift and make the estimation more accurate for the validation phase. Weight normalization
also removes the dependencies between the examples in a minibatch so that it can be applied to
recurrent models, reinforcement learning or generative models [19]. Cooijmans et al. [7] propose a
new way to apply batch normalization to RNN and LSTM models.
In view of all these flavors, the original BN method is the most popular technique and for this reason
our choice of the analysis. To the best of our knowledge, we are not aware of any prior analysis of
BN.
BN has gradient and non-gradient updates. Thus nonconvex convergence results do not immediately
transfer. Our analysis explicitly takes into account the workings of BN. However, nonconvex
convergence proofs are relevant since some small portions of our analysis rely on known proofs and
approaches.
Neural nets are not convex, even if the loss function is convex. For classical convergence results with
a nonconvex objective function and diminishing learning rate, we refer to survey papers [4, 5, 6].
Bertsekas and Tsitsiklis [5] provide a convergence result with deterministic gradient with errors.
Bottou et al. [6] provide a convergence result with stochastic gradient. The classic analyses showing
the norm of gradients of the objective function going to zero date back to [9, 18, 17]. For strongly
convex objective functions with a diminishing learning rate, we learn the classic convergence results
from [6].

2
3 Model and Algorithm
The optimization problem for a network is a objective function consisting of a large number of
component functions, that reads:
N
X
min f¯(θ, λ) = fi (Xi : θ, λ),
i=1
subject to θ ∈ P, λ ∈ Q,
where fi : Rn1 × Rn2 → R, i = 1, ..., N , are real-valued functions for any data record Xi . Index i
associates with data record Xi and target response yi (hidden behind the dependency of f on i) in the
training set. Parameters θ include the usual parameters updated by gradients directly associated with
the loss function, i.e. behind the part that we have a parametric model, while BN parameters λ are
introduced by the BN algorithm and not updated by gradient methods but by the mini-batch statistics.
We define that the derivative of fi is always taken with respect to θ:
∇fi (Xi : θ, λ) := ∇θ fi (Xi : θ, λ).
The deep network we analyze here has 2 fully-connected layers with D1 neurons each. Each hidden
layer computes y = kW u with slope k for linear activation, i.e. we assume that the activation
functions are linear, and u is the input vector of the layer. We do not need to include an intercept term
since the BN algorithm automatically adjusts for it. BN is applied to the output of the first hidden
layer.

Figure 1: The structure of our batch-normalized network model in the analysis.

We next describe the computation in each layer to show how we obtain the output of the network.
The notations introduced here is used in the analysis. Figure 1 shows the full structure of the network.
The input data is vector X, which is one of {Xi }N = (µj )D
i=1 . Vector λ
D
j=1 , (σj )j=1 is the set of all

(1) (1) D
BN parameters and vector θ = W1 , W2 , (βj )D j=1 , (γj )j=1 is the set of all trainable parameters
which are updated by gradients. Matrices W1 , W2 are the actual model parameters and β, γ are
introduced by BN. The j th entry of output of the first hidden layer is
(1)
zj (X : θ) = k (1) W1,j,· X,
where W1,j,· denotes the weights of linear transformations for the j th neuron and k (1) is the slope of
the linear activation function for the first hidden layer. The j th entry of batch-normalized output of
the first layer is
(1)
!
(1) (1) zj (X, θ) − µj (1)
yj (X : θ, λ) = γj + βj ,
σ j + B
(1) (1)
where βj and γj are trainable parameters updated by gradient and µj and σj are batch normal-
(1) (1)
ization parameters for zj . Trainable parameter µj is the mini-batch mean of zj and trainable
(1)
parameter σj is the mini-batch sample deviation of zj .
Constant B is a small offset term to keep
the denominator from zero. The output of j th entry of the output layer reads:
(2)
zj (X : θ, λ) = k (2) W2,j,· y (1) (X : θ, λ).

3
The objective function for the ith data record is

(2)
fi (Xi : θ, λ) = li zj (Xi : θ, λ) + c2 kθk22 ,
j

where li (·) is the loss function associated with the target response yi . We have the following complete
expression for the objective function for the ith data
 
D (1)

X (1) k W 1,j,· Xi − µj (1)
fi (Xi : θ, λ) = li (k (2) W2,k,j γj + βj )k  + c2 kθk22 .
j=1
σ j + B

The objective function fi (Xi : θ, λ) is nonconvex with respect to θ and λ. However, when c2 is large
enough, the regularization term prevails and makes the objective function convex.

3.1 Algorithm

Algorithm 1 shows the algorithm studied herein. There are two deviations from the standard BN
algorithm, one of them actually being a generalization. We use the full gradient instead of the more
popular stochastic gradient (SG) method. It essentially means that each batch contains the entire
training set instead of a randomly chosen subset of the training set. An analysis of SG is potential
future research. Although the primary motivation for full gradient update is to reduce the burdensome
in showing the convergence, the full gradient method is similar to SG in the sense that both of them
go through the entire training set, while full gradient goes through it deterministically and the SG
goes through it in expectation. Therefore, it is reasonable to speculate that the SG method has similar
convergence property as the full algorithm studied herein.

Algorithm 1 DBN: Diminishing Batch-Normalized Network Update Algorithm

1: Initialize θ ∈ Rn1 and λ ∈ Rn2
2: for iteration m=1,2,... do
PN
3: θ(m+1) := θ(m) − η (m) i=1 ∇fi (Xi : θ(m) , λ(m) )
4: for j=1,...,D1 do
(m+1) PN (1)
5: µj := N1 i=1 zj (Xi : θ(m+1) )
r 2
(m+1) 1
PN
(1) (m+1)
6: σj := N zj (Xi : θ(m+1) ) − µj
i=1

(m+1) D1 (m+1) D1
7: λ(m+1) := α(m+1) (µj )j=1 , (σj )j=1 + (1 − α(m+1) )λ(m)

The second difference is that we update the BN parameters (θ, λ) by their moving averages with
respect to diminishing α(m) . The original BN algorithm can be recovered by setting α(m) = 1
for every m. After introducing diminishing α(m) , λ(m) and hence the output of the BN layer is
determined by the history of all past data records, instead of those solely in the last batch. Thus the
output of the BN layer becomes more general that better reflects the distribution of the entire dataset.
We use two strategies to decide the values of α(m) . One is to use a constant smaller than 1 for all m,
the other one is to decay the α(m) gradually, such as α(m) = 1/m.
In our numerical experiment, we show that Algorithm 1 outperforms the original BN algorithm,
where both are based on SG and non-linear activation functions with many layers FNN and CNN
models.

4 General Case
The main purpose of our work is to show that Algorithm 1 converges. In the general case, we focus
on the nonconvex objective function.

4.1 Assumptions

Here are the assumptions we used for the convergence analysis.

4
Assumption 4.1 (Lipschitz continuity on θ and λ). For every i we have
k∇fi (X : θ̃, λ) − ∇fi (X : θ̂, λ)k2 ≤ L̄kθ̃ − θ̂k2 , ∀θ̃, θ̂, λ, X.

k∇W1,j,· fi (X : θ̃, λ) − ∇W1,j,· fi (X : θ̂, λ)k2 ≤ L̄kW̃1,j,· − Ŵ1,j,· k2 ,

∀λ, θ̃, θ̂, X, j ∈ {1, ..., D1 }.
k∇fi (X : θ, λ̃) − ∇fi (X : θ, λ̂)k2 ≤ L̄kλ̃ − λ̂k2 , ∀θ, λ̃, λ̂, X, j ∈ {1, ..., D1 }.

Noted that the Lipschitz constants associates with each of the above inequalities are not necessarily
the same. Here L̄ is an upper bound for these Lipschitz constants for simplicity.

Assumption 4.2 (bounded parameters). Sets P and Q are compact set, where θ ∈ P and λ ∈ Q.
Thus there exists constant M that weights W and parameters λ are bounded element-wise by this
constant M .
kW1 k M and kW2 k M and kλk M.

This also implies that the updated θ, λ in Algorithm 1 remain in P and Q, respectively.

Assumption 4.3 (diminishing update on θ). The stepsizes of θ update satisfy

X∞ X∞
η (m) = ∞ and (η (m) )2 < ∞. (1)
m=1 m=1

This is a common assumption for diminishing stepsizes in optimization problems.

Assumption 4.4 (Lipschitz continuity of li (·)). Assume the loss function li (·) for every i is continu-
ously differentiable. It implies that there exists M̂ such that
kli (x) − li (y)k ≤ M̂ kx − yk, ∀x, y.

Assumption 4.5 (existance of a stationary point). There exists a stationary point (θ∗ , λ∗ ) such that
k∇f¯(θ∗ , λ∗ )k = 0.

We note that all these are standard assumptions in convergence proofs. We also stress that Assumption
4.4 does not directly imply 4.1. Since we assume that P and Q are compact, then Assumptions 4.1,
4.4 and 4.5 hold for many standard loss function such as softmax and MSE.

4.2 Convergence Analysis

We first have the following lemma specifying sufficient conditions for λ to converge. All the proofs
are given in Appendix A.

Theorem 4.6 Under Assumptions 4.1, 4.2 and 4.3, if {α(m) } satisfies
X∞ ∞ X
X m
α(m) < ∞ and α(m) η (n) < ∞,
m=1 m=1 n=1
then sequence {λ(m) } converges to λ̄.

We give a discussion of the above conditions for α(m) and η (m) in the end of this section. With the
help of Theorem 4.6, we can show the following convergence result.

Lemma 4.7 Under Assumptions 4.4, 4.5 and the assumptions of Theorem 4.6, when
∞ X
X ∞ X i X∞ X ∞
α(i) η (n) < ∞ and α(n) < ∞, (2)
m=1 i=m n=1 m=1 n=m
we have
M
X
lim sup η (m) k∇f¯(θ(m) , λ̄)k22 < ∞.
M →∞ m=1

5
This result is similar to the classical convergence rate analysis for the non-convex objective function
with diminishing stepsizes, which can be found in [6].
Lemma 4.8 Under the assumptions of Lemma 4.7, we have
lim inf k∇f¯(θ(m) , λ̄)k22 = 0.
m→∞

The statement of this theorem is that for the full gradient method with diminishing stepsizes the
gradient norms cannot stay bounded away from zero. The following result characterizes more
precisely the convergence property of Algorithm 1.
Lemma 4.9 Under the assumptions stated in Lemma 4.7, we have
lim k∇f¯(θ(m) , λ̄)k22 = 0.
m→∞

Our main result is listed next.

Theorem 4.10 Under the assumptions stated in Lemma 4.7, we have
lim k∇f¯(θ(m) , λ(m) )k22 = 0.
m→∞

We cannot show that {θ(m) }’s converge (standard convergence proofs are also unable to show such a
stronger statement). For this reason Theorem 4.10 does not immediately follow from Lemma 4.9
together with Theorem 4.6. The statement of Theorem 4.10 would easily follow from Lemma 4.9 if
convergence of {θ(m) } is established and the gradient being continuous.
Considering the cases η (m) = O( m1k ) and α(m) = O( m1h ). We show in Appendix B that the set of
sufficient and necessary conditions to satisfy the assumptions of Theorem 4.6 are h > 1 and k ≥ 1.
The set of sufficient and necessary conditions to satisfy the assumptions of Lemma 4.7 are h > 2
1 1
and k ≥ 1. For example, we can pick η (m) = O( m ) and α(m) = O( m2.001 ) to achieve the above
convergence result in Theorem 4.10.

5 Convex Case
In this section, we discuss the convergence of a convex objective function. We stress that even if
li (·)’s are convex, the overall fi (·)’s are not convex. However, if the l2 penalty coefficient c2 is large
enough, then fi (·)’s become convex since the regularization term prevails. We have the following
extra assumption for the convexity of our model.
Assumption 5.1 (strong convexity). The objective function fi is strongly convex in θ for every i. It
implies that there exists constant c > 0 such that
1
fi (Xi : θ̃, λ) ≥ fi (Xi : θ̂, λ) + ∇fi (Xi : θ̂, λ)T (θ̃ − θ̂) + ckθ̃ − θ̂k22 , ∀θ̃, θ̂, i.
2

This inequality is a common result for strongly convex functions. Unfortunately, this convexity
assumption is not met by most popular neural network models unless c2 is large. In this section, we
assume that c2 is large enough to force such strong convexity.

5.1 Convex Objective Convergence Analysis

Here we perform the convergence analysis with the assumption that f (θ, λ) is jointly convex in θ
for any fixed λ (which is implied by convexity of each fi (Xi : θ, λ)). The main result establishes a
1
sublinear convergence rate if η (m) are selected as O( m ).
Theorem 5.2 Under Assumptions 4.1, 4.2, 4.3 and 5.1, suppose the stepsizes satisfy that, for
∞
ζ 1 X
all m, η (m) = for some ζ > and ϑ > 0. If {α(m) } satisfies α(m) ln(m) <
ϑ+m cµ m=1
v
∞, then the optimality gap satisfies f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄) ≤ , where v :=
2 ϑ+m
ζ (N L̄M + M1 )
max , (ϑ + 1)[f¯(θ(1) , λ(1) ) − f¯(θ∗ , λ̄)] .
2(ζcc1 − 1)

6
Note that θ∗ is the optimal θ value with respect to the fixed λ̄. Point (θ∗ , λ̄) is the point where f¯
attains its minimum. We learn from Theorem 5.2 that decreasing the stepsizes η (m) sublinearly (i.e.
at a rate O(1/m)) ensures the sublinear convergence to the optimal value. We also observe that the
initial point determines the initial optimality gap, which appears prominently in the second term
defining v. With an appropriate initialization phase, we can easily diminish the role played by this
term. Related discussions about the choice of constant parameters to ensure a sublinear convergence
rate and the initialization can be found in Bottou et al. [6].
1
Considering the cases η (m) = O( m ) and α(m) = O( m1h ). We show in Appendix B that the set of
sufficient and necessary condition to satisfy the assumptions of Theorem 5.2 is h > 1. For example,
1 1
we can pick η (m) = O( m ) and α(m) = O( m1.001 ) to achieve the above convergence result.

6 Computational Experiments
We conduct the computational experiments with Theano and Lasagne on a Linux server with an
Nvidia Titan-X GPU. We use MNIST [15], CIFAR-10 [13] and Network Intrusion (NI) [1] datasets
to compare the performance between DBN and the original BN algorithm. For the MNIST dataset,
we use a four-layer fully connected FNN (784 × 300 × 300 × 10) with the ReLU activation function
and for the NI dataset, we use a four-layer fully connected FNN (784 × 50 × 50 × 10) with the ReLU
activation function. For the CIFAR-10 dataset, we use a reasonably complex CNN network that has a
structure of (Conv-Conv-MaxPool-Dropout-Conv-Conv-MaxPool-Dropout-FC-Dropout-FC), where
all four convolution layers and the first fully connected layers are batch normalized. We use the
softmax loss function and l2 regularization with for all three models. All the trainable parameters are
randomly initialized before training. For all 3 datasets, we use the standard epoch/minibatch setting
with the minibatch size of 100, i.e., we do not compute the full gradient and the statistics are over the
minibatch. We use AdaGrad [8] to update the learning rates η (m) for trainable parameters, starting
from 0.01.
We use two different strategies to decide the values of α(m) in DBN: constant values of α(m)
and diminishing α(m) where α(m) = 1/m and α(m) = 1/m2 . We test the choices of constant
α(m) ∈ {1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001, 0}.

(a) (b) (c)

Figure 2: Comparison of predicted accuracy on test datasets for different choices of α(m) . From left to right are
FNN on MNIST, FNN on NI and CNN on CIFAR-10.

(a) (b) (c)

Figure 3: Comparison of predicted accuracy on test datasets for the most efficient choices of α(m) . From left to
right are FNN on MNIST, FNN on NI and CNN on CIFAR-10.

7
(a) (b) (c)
Figure 4: Comparison of the convergence of the loss function value on validation set for different choices of
α(m) . From left to right are FNN on MNIST, FNN on NI and CNN on CIFAR-10.

We test all the choices of α(m) with the performances presented in Figure 2. Figure 2 shows that all
the non-zero choices of α(m) converge properly. The algorithms converge without much difference
even when α(m) in DBN is very small, e.g., 1/m2 . However, if we select α(m) = 0, the algorithm is
erratic. Besides, we observe that all the non-zero choices of α(m) converge at a similar rate. The fact
that DBN keeps the batch normalization layer stable with a very small α(m) suggests that the BN
parameters do not have to be depended on the latest minibatch, i.e. the original BN.
We compare a selected sets of the most efficient choices of α(m) in Figures 3 and 4. They show that
DBN with α(m) < 1 is more stable than the original BN algorithm. The variances with respect to
epochs of the DBN algorithm are obviously smaller than those of the original BN algorithm’s in each
figure.

Table 1: Best results for different choices of α(m) on each dataset, showing top three with heat map.
MNIST (200 epochs) NI (100 epochs) CIFAR-10 (50 epochs)
Model Epoch Test Error Epoch Test Error Epoch Test Error

α(m) = 1 52 2.70% 58 7.69% 45 17.31%

α(m) = 0.75 69 1.91% 67 7.37% 49 17.03%
α(m) = 0.5 69 1.84% 80 7.46% 44 17.11%
α(m) = 0.25 46 1.91% 38 7.32% 43 17.00%
α(m) = 0.1 48 1.90% 66 7.36% 48 17.10%
α(m) = 0.01 51 1.94% 74 7.47% 43 16.82%
α(m) = 0.001 48 1.95% 98 7.43% 46 16.28%
α(m) = 1/m 59 2.10% 78 7.45% 37 17.26%
α(m) = 1/m2 53 2.00% 74 7.59% 44 17.23%
α(m) = 0 199 24.27% 53 26.09% 2 79.34%

Table 1 shows the best result obtained from each choice of α(m) . Most importantly, it suggests that
the choices of α(m) = 1/m and 1/m2 perform better than the original BN algorithm. Besides, all the
constant less-than-one choices of α(m) perform better than the original BN, showing the importance
of taking the mini-batch history into consideration for the update of the BN parameters. The BN
algorithm in each figure converges to similar error rates on test datasets with different choices of
α(m) except for the α(m) = 0 case. Among all the models we tested, α(m) = 0.25 is the only one
that performs top 3 for all three datasets, thus the most robust choice.
To summarize, our numerical experiments show that the DBN algorithm outperforms the original BN
algorithm on the MNIST, NI and CIFAT-10 datasets with typical deep FNN and CNN models.
Future Directions. On the analytical side, we believe an extension to more than 2 layers is doable
with significant augmentations of the notation. A stochastic gradient version is likely to be much
more challenging to analyze. A second open question concerns more general activation functions.
Instead of the linear activation function assumed in this paper, it would be interesting to analyze more
common choices such as Sigmoid, ReLU, and Maxout.

8
References
[1] KDD Cup 1999 Data, 1999. URL http://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data.

[2] Devansh Arpit, Yingbo Zhou, Bhargava U. Kota, and Venu Govindaraju. Normalization Propagation:
A Parametric Technique for Removing Internal Covariate Shift in Deep Networks. In International
Conference on Machine Learning, volume 48, page 11, 2016.

[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint
arXiv:1607.06450, 2016.

[4] Dimitri P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A
Survey. Optimization for Machine Learning, 2010(3):1–38, 2011.

[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient Convergence in Gradient Methods with Errors. SIAM
Journal on Optimization, 10:627–642, 2000.

[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning.
arXiv preprint arXiv:1606.04838, 2016.

[7] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent Batch Normalization.
arXiv preprint arXiv:1603.09025, 2016.

[8] Yoram Duchi, John and Hazan, Elad and Singer. Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[9] L. Grippo. A Class of Unconstrained Minimization Methods for Neural Network Training. Optimization
Methods and Software, 4(2):135–150, 1994.

[10] Çaglar Gülçehre and Yoshua Bengio. Knowledge Matters: Importance of Prior Information for Optimiza-
tion. Journal of Machine Learning Research, 17(8):1–32, 2016.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition.
In Computer Vision and Pattern Recognition, pages 770–778, dec 2016.

[12] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift. In International Conference on Machine Learning, pages 448–456,
2015.

[13] Alex Krizhevsky and Geoffrey E. Hinton. Learning Multiple Layers of Features from Tiny Images. PhD
thesis, 2009.

[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet Classification with Deep Convolutional
Neural Networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[15] Yann LeCun, Léon Bottou, and Yoshua Bengio. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.

[16] Genevieve B. Orr and Klaus-Robert Müller. Neural Networks: Tricks of the Trade. Springer, New York,
2003.

[17] B. T. Polyak. Introduction to optimization. Translations series in mathematics and engineering. Optimiza-
tion Software, 1987.

[18] B. T. Polyak and Y. Z. Tsypkin. Pseudogradient Adaption and Training Algorithms. Automation and
Remote Control, 34:45–67, 1973.

[19] Tim Salimans and Diederik P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate
Training of Deep Neural Networks. In Advances in Neural Information Processing Systems, pages 901–901,
2016.

9
Supplementary material for “Convergence Analysis of Batch Normalization
for Deep Neural Nets”

7 Appendix A: Proofs
7.1 Nomenclature

Symbol Explanation
L̄ The Lipschitz constant
η Learning rate for gradient algorithm
α Weight parameter for batch-normalization update
~x The vector for one record of input data
yj The target output vector for the input record ~xj
N Size of training set, number of all ~x in one epoch
D1 Dimension of the hidden layer
(1) (2)
zj , zj j th entry of output of the first and second hidden layer, respectively
n1 , n2 Numbers of parameters in θ and λ, respectively
θ Set of all trainable parameters updated by its gradient
W (1) , W (2) ∈ θ, weights of linear transformation between layers
(1) (1) (1)
γj , βj ∈ θ, trainable parameters for batch-normalized output yj
λ Set of all batch normalization parameters determined by previous updates
(1)
µj ∈ λ, mean of previous values of zj
(1)
σj ∈ λ, standard deviation of previous values of zj
B The offset for batch normalization transformation

7.2 Preliminary Results

Proposition 7.1 There exists a constant M such that, for any θ and fixed λ, we have
k∇f¯(θ, λ)k22 ≤ M.

Proof. By Assumption 4.5, we know there exists (θ∗ , λ∗ ) such that k∇f¯(θ∗ , λ∗ )k2 = 0. Then we
have
k∇f¯(θ, λ)k2
=k∇f¯(θ, λ)k2 − k∇f¯(θ∗ , λ∗ )k2
=k∇f¯(θ, λ)k2 − k∇f¯(θ, λ∗ )k2 + k∇f¯(θ, λ∗ )k2 − k∇f¯(θ∗ , λ∗ )k2
≤k∇f¯(θ, λ) − ∇f¯(θ, λ∗ )k2 + k∇f¯(θ, λ∗ ) − ∇f¯(θ∗ , λ∗ )k2
N
X N
X
∗
≤ k∇fi (Xi : θ, λ) − ∇fi (Xi : θ, λ )k2 + k∇fi (Xi : θ, λ∗ ) − ∇fi (Xi : θ∗ , λ∗ )k2
i=1 i=1
∗ ∗
≤N L̄(kλ − λ k2 + kθ − θ k2 ),
where the last inequality is by Assumption 4.1. We then have
k∇f¯(θ, λ)k22 ≤ N 2 L̄2 (kλ − λ∗ k2 + kθ − θ∗ k2 )2 ≤ M,
because sets P and Q are compact by Assumption 4.2.

Proposition 7.2 We have

1
fi (X : θ̃, λ) ≤ fi (X : θ̂, λ) + ∇fi (X : θ̂, λ)T (θ̃ − θ̂) + L̄kθ̃ − θ̂k22 , ∀θ̃, θ̂, X.
2

Proof. This is a known result of the Lipschitz-continuous condition that can be found in [6]. We have
this result together with Assumption 4.1.

10
7.3 Proof of Theorem 4.6
P∞ P∞ Pm
Lemma 7.3 When m=1 α(m) < ∞ and m=1 n=1 α(m) η (n) < ∞,
(m)
(m) µj
µ̃j := is a Cauchy series.
(1 − α(1) )(1 − α(2) )...(1 − α(m) )

Proof. By Algorithm 1, we have

N
(m) 1 X (1) (m) (m−1)
µj = α(m) k W1,j,· Xi + (1 − α(m) )µj . (3)
N i=1

α(m) (m) (m) (m−1)

We define α̃(m) := and ∆W1,j,· := W1,j,· − W1,j,· . After
(1 − α(1) )(1
− α(2) )...(1 − α(m) )
dividing (3) by (1 − α(1) )(1 − α(2) )...(1 − α(m) ), we obtain
N
(m) 1 X (m) (m−1)
µ̃j = α̃(m) k (1) W Xi + µ̃j .
N i=1 1,j,·

Then we have
N
(m+1) (m) (m) (1) 1 X (m)
|µ̃j − µ̃j | = α̃ |k | |W1,j,· Xi |
N i=1

N m
1 XX (n)
= α̃(m) |k (1) | | ∆W1,j,· Xi | (4)
N i=1 n=1

N m N
!
(m) (1) 1 X X X
= α̃ |k | η (n) ∇W1,j,· fl (Xl : θ(n) , λ(n) ) · Xi
N i=1 n=1
l=1

N m N
! !
1 XX X
= α̃(m) |k (1) | η (n) ∇W1,j,· fl (Xl : θ(n) , λ(n) ) · Xi
N i=1 n=1
l=1

N m N
!
(m) (1) 1 XX (n)
X
(n) (n)
≤ α̃ |k | η k ∇W1,j,· fl (Xl : θ , λ )k · kXi k (5)
N i=1 n=1
l=1
1
=α̃(m) |k (1) | ·
N
N X
m N h
!
X X i
η (n) k ∇W1,j,· fl (Xl : θ(n) , λ(n) ) − ∇W1,j,· f¯(Xl : θ∗ , λ∗ ) k · kXi k
i=1 n=1 l=1
N m N
1 X X (n) X
≤ α̃(m) |k (1) | η ( [k∇W1,j,· fl (Xl : θ(n) , λ(n) ) − ∇W1,j,· fl (Xl : θ∗ , λ(n) )k2 +
N i=1 n=1
l=1

k∇W1,j,· fl (Xl : θ∗ , λ(n) ) − ∇W1,j,· fl (Xl : θ∗ , λ∗ )k2 ] · kXi k2 )

N X
m
(n) (n)
X
∗
≤ α̃(m) |k (1) | η (n) L̄ · (kW1,j,· − W1,j,· k2 + kλj,· − λ∗j,· k2 ) · kXi k2
i=1 n=1
m
X N
X
≤ α̃(m) η (n) |k (1) |

2L̄M kXi k2 (6)
n=1 i=1
m
X
≤ α̃(m) η (n) M̃L̄,M .
n=1

11
Equation (4) is due to
m
(m) (n)
X
W1,i,j = ∆W1,i,j .
n=1
Therefore,

q m
(p) (q)
X X
|µ̃j − µ̃j | ≤ M̃L̄,M · α̃(m) η (n)
m=p n=1
q m q Xm
(7)
X X X
(m) (n) (m) (n)
= M̃L̄,M · α̃ η = M̃L̄,M · α̃ η .
m=p n=1 m=p n=1

It remains to show that

∞
X
α(m) < ∞, (8)
m=1
∞ X
X m
α(m) η (n) < ∞, (9)
m=1 n=1

implies the convergence of {µ̃(m) }. By (8), we have

Π∞
m=1 (1 − α
(m)
) > 0,

since
∞
X ∞
X
ln(Π∞
m=1 (1 − α
(m)
)) = ln(1 − α(m) ) > −α(m) > −∞.
m=1 m=1

It is also easy to show that there exists C and Mc such that for all m ≥ Mc , we have
(1 − α(1) )(1 − α(2) ) . . . (1 − α(m) ) ≥ C. (10)
Therefore,
lim (1 − α(1) )(1 − α(2) ) . . . (1 − α(m) ) ≥ C.
m→∞

Thus the following holds:

1 (m)
α̃(m) ≤ α (11)
C
and
q X
m q m
X 1 X X (m) (n)
α̃(m) η (n) ≤ α η . (12)
m=p n=1
C m=p n=1

(m)
From (9) and (12) it follows that the sequence {µ̃j } is a Cauchy series.
(m) (m)
Lemma 7.4 Since {µ̃j } is a Cauchy series, {µj } is a Cauchy series.
Proof. We know that
(m) (m)
µj = µ̃j (1 − α(1) )...(1 − α(m) ).
Since
(m)
lim µ̃j → µ̃j
m→∞

and
lim (1 − α(1) )...(1 − α(m) ) → C̃,
m→∞

we have
(m)
lim µj → µ̃j · C̃.
m→∞

(m)
Thus µj is a Cauchy series.

12
P∞ P∞ Pm (m)
Lemma 7.5 If m=1 α(m) < ∞ and m=1 n=1 α(m) η (n) < ∞, {σj } is a Cauchy series.

(m) (m)
Proof. We define σj := σ̃j (1 − α(1) )...(1 − α(m) ). Then we have
v
u N 2
(m) t 1
(m+1) (m)
u X (m) (m)
|σ̃j − σ̃j | = α̃ k (1) W1,j,· Xi − µj
N i=1

v
uN 2
(m) 1 t
uX (m) (m)
= α̃ √ k (1) W1,j,· Xi − µj
N i=1
v
uN (m)
!2
(1) µj
(m) |k |u
tX (m)
= α̃ √ W1,j,· Xi − .
N i=1
k (1)

(m)
Since {µj } is convergent, there exists c1 , c2 and N1 such that for any m > N1 , −∞ < c1 <
(m)
µj < c2 < ∞. Therefore,
(m+1) (m)
|σ̃j − σ̃j |
vuN
v
uN 
|k | (1)
uX (m) c 1
2 uX (m) c2
2  (13)
≤α̃(m) √ · max t W1,j,· Xi − (1) , t W1,j,· Xi − (1) .
N 
i=1
k i=1
k 

n c c2 o
1
For any C̄ ∈ , , we have
k (1) k (1)
v
uN
(1)
(m+1) (m) |k | uX (m) 2
|σ̃j − σ̃j | ≤ α̃(m) √ · t W1,j,· Xi − C̄ (14)
N i=1

v
uN
(1)
(m) |k | u 2
(m)
X
≤ α̃ √ ·t |W1,j,· Xi | + |C̄| (15)
N i=1

v ! !2
uN m
(1)
(m) |k | u (n)
X X
= α̃ √ · t | ∆W1,j,· Xi | + |C̄|
N i=1 n=1
v
uN m N
! !2
(1)
(m) |k | uX X X
= α̃ √ · t | (n)
η · (n) (n)
∇W1,j,· fl (Xl : θ , λ ) · Xi | + |C̄|
N i=1 n=1 l=1

v
uN m N
! !2
(1)
(m) |k | uX X X
≤ α̃ √ · t (n)
η ·| (n) (n)
∇W1,j,· fl (Xl : θ , λ ) · Xi | + |C̄| (16)
N i=1 n=1 l=1
v
uN m N
!2
(1)
|k | uX X X
(17)
≤ α̃(m) √ · t η (n) k ∇W1,j,· fl (Xl : θ(n) , λ(n) )k · kXi k + |C̄|
N i=1 n=1 l=1
v !2
uN m
(1)
(m) |k | uX X
(18)

≤ α̃ √ · t η (n) 2N L̄M kXi k2 + |C̄|
N i=1 n=1

13
v !2
m
u
(1)
(m) |k | u X
≤ α̃ √ · tN · M̃L̄,M η (n) + |C̄| (19)
N n=1
v !2
m
u
u X
(m) (1)
= α̃ |k | · t M̃L̄,M η (n) + |C̄|
n=1

m
!
X
= α̃(m) |k (1) | · M̃L̄,M η (n) + |C̄| . (20)
n=1
n c c2 o
1
Inequality (14) is by plugging C̄ ∈ , into (13). Inequality (15) is by the following fact:
k (1) k (1)
v v v  v
u n u n u n  u n
uX uX uX uX
t (ai − c)2 ≤ max t (|ai | − c)2 , t (|ai | + c)2 = t (|ai | + |c|)2 , (21)
 
i=1 i=1 i=1 i=1

where b and ai for every i are arbitrary real scalars. Besides, (21) is due to
−2ai c ≤ max{−2|ai |c, 2|ai |c}.
Inequalities (16), (17) and (18) follow from the square function being increasing for nonnegative
numbers. Besides these facts, (18) is also by the same techniques we used in (5)-(6) where we bound
the derivatives with the Lipschitz continuity in the following inequality:
XN
k ∇W1,j,· fl (Xl : θ(n) , λ(n) )k ≤ 2N L̄M.
l=1

Inequality (19) is by collecting the bounded terms into a single bound M̃L̄,M . Therefore,
q−1 q−1 m
!
(q) (p) (m+1) (m)
X X X
(m) (1) (n)
|σ̃j − σ̃j | ≤ |σ̃j − σ̃j | ≤ α̃ |k | · M̃L̄,M η + |C̄| . (22)
m=p m=p n=1

Using the similar methods in deriving (8) and (9), it can be seen that a set of sufficient conditions
(m)
ensuring the convergence for {σ̃j } is:

∞
X
α(m) < ∞,
m=1
∞ X
X m
α(m) η (n) < ∞.
m=1 n=1

(m) (m)
Therefore, the convergence conditions for {σj } are the same as for {µj }.
It is clear that these lemmas establish the proof of Theorem 4.6.

7.4 Consequences of Theorem 4.6

Proposition 7.6 Under the assumptions of Theorem 4.6, we have

|λ(m) − λ̄|∞ ≤ am ,

where
∞ X
X i ∞
X
(i) (j)
am = M1 α η + M2 α(i) (23)
i=m j=1 i=m
and M1 and M2 are constants.

14
(m)
Proof. For the upper bound of σj , by (20), we have

q−1 m
!
(q) (p)
X X
(m) (1) (n)
|σ̃j − σ̃j | ≤ α̃ |k | M̃L̄,M η + |C̄| .
m=p n=1

σ̄j
We define˜
¯σj := . Therefore,
(1 − α(1) )...(1 − α(u) )...

 
∞ i
(m)
X X
|˜
¯σj − σ̃j | ≤ α̃(i) |k (1) | M̃L̄,M η (j) + |C̄|
i=m j=1
  (24)
(1) ∞ i
|k |X X
≤ α(i) M̃L̄,M η (j) + |C̄| .
C i=m j=1

The first inequality comes by substituting p by m and by taking lim as q → ∞ in (22). The second
inequality comes from (10). We then obtain,

(m)
σj − σ̄j
(m)
σj σ̄j
=(1 − α(1) )...(1 − α(m) ) −
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(m) )
(m)
(1) (m)
σj σ̄j
≤(1 − α )...(1 − α )[ − +
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(u) )...
σ̄j σ̄j
− ]
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(u) )...
(m) (∞) σ̄j (∞) (25)
≤ σ̃j − σ̃j + − σ̃j
(1 − α(1) )...(1 − α(m) )
(m) (∞) σ̄j σ̄j
= σ̃j − σ̃j + −
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(u) )...
(m) (∞) (1 − α(m+1) )...(1 − α(u) )... − 1
= σ̃j − σ̃j + σ̄j
(1 − α(1) )...(1 − α(u) )...
(m) (∞) σ̄j
≤ σ̃j − σ̃j + |1 − (1 − α(m+1) )...(1 − α(u) )...|
C
∞
(m) (∞) σ̄j X
≤ σ̃j − σ̃j + α(n) .
C n=m+1

The second inequality is by (1 − α(1) )...(1 − α(m) ) < 1, the third inequality is by (10) and the last
inequality can be easily seen by induction. By (25), we obtain

15
Therefore, we have
(m)
|σ̄j − σj |
∞
(m) σ̄j X
≤|˜
¯σj − σ̃j | + α(n)
C n=m+1
 
∞ i ∞
X
(i) (1) 
X
(j) σ̄j X (i)
≤ α̃ |k | · M̃L̄,M η + |C̄| +  α
i=m j=1
C i=m+1
 
∞ i ∞
X 1 (i) (1)  X
(j) σ̄j X (i) (27)
≤ α |k | · M̃L̄,M η + |C̄| +  α
i=m
C j=1
C i=m+1
 
∞ i ∞
X 1 (i) (1)  X σ̄j X (i)
≤ α |k | · M̃L̄,M η (j) + |C̄| + α
i=m
C j=1
C i=m
∞ X i ∞
M̃L̄,M |k (1) | X |k (1) ||C̄| X (i)

(i) (j) σ̄j
= α η + + α .
C i=m j=1
C C i=m
The first inequality is by (26), the second inequality is by (22), the third inequality is by (11) and the
σ̄j
fourth inequality is by adding the nonnegative term α(m) to the right-hand side.
C
(m)
For the upper bound of µj , we have
(m)
µj − µ̄j
(m)
µj µ̄j
=(1 − α(1) )...(1 − α(m) ) −
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(m) )
(m)
(1) (m)
µj µ̄j
≤(1 − α )...(1 − α )[ − + (28)
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(∞) )
µ̄j µ̄j
− ]
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(∞) )
µ̄j
≤ µ̃(m) − µ̃(∞) + − µ̃(∞) .
(1 − α(1) )...(1 − α(m) )
µ̄j
Let us define Am := µ̃(m) − µ̃(∞) and Bm := − µ̃(∞) . Recall from
(1 − α(1) )...(1 − α(m) )
(m)
Theorem 4.6 that {µj } is a Cauchy series, by (7),
q X
m
(p) (q)
X
|µ̃j − µ̃j | ≤ M̄L̄,M · α(m) η (n) .
m=p n=1
Therefore, the first term in (28) is bounded by
∞ X
i
(m)
X
|µ̃j − µ̃∞
j | ≤ M̃L̄,M · α(i) η (n) < ∞. (29)
i=m n=1

For the second term in (28), recall that C := (1 − α(1) )...(1 − α(u) ).... Then we have
µ̄j
C· − µ̃(∞)
(1 − α(1) )...(1 − α(m) )
=µ̄j |1 − (1 − α(m+1) )...(1 − α(u) )...|
X∞
≤µ̄j α(i) ,
i=m+1

16
where the last inequality can be easily seen by induction. Therefore, the second term in (28) is
bounded by
∞
µ̄j (∞) µ̄j X (i)
− µ̃ ≤ α . (30)
(1 − α(1) )...(1 − α(m) ) C i=m+1
From these we obtain
(m)
µj − µ̄j
µ̄j
≤ µ̃(m) − µ̃(∞) + (1)
− µ̃(∞)
(1 − α )...(1 − α(m) ) (31)
∞ X
i ∞
X µ̄j X
≤M̃L̄,M α(i) η (n) + α(i) .
i=m n=1
C i=m+1
The first inequality is by (28) and the second inequality is by (29) and (30). Combining (27) and (31),
we have that
X∞ X i ∞
X
|λ(m) − λ̄|∞ = max(|µ(m) − µ̄|∞ , |σ (m) − σ̄|∞ ) ≤ M1 α(i) η (j) + M2 α(i) ,
i=m j=1 i=m

where M1 and M2 are constants defined as

M̃L̄,M |k (1) |
M1 = max( , M̄L̄,M )
C
and
σ̄j + |k (1) ||C̄| µ̄j
M2 = max( , ).
C C
Proposition 7.7 Under the assumptions of Theorem 4.6,
√
−∇f¯(θ(m) , λ̄)T · ∇f¯(θ(m) , λ(m) ) ≤ −k∇f¯(θ(m) , λ̄)k2 + L̄M n2 am ,
where am is defined in Proposition 7.6.
Proof. For simplicity of the proof, let us define
x(m) := ∇f¯(θ(m) , λ̄), y (m) := ∇f¯(θ(m) , λ(m) ).
We have
√ √
|x(m) − y (m) |∞ ≤ kx(m) − y (m) k2 ≤ L̄kλ(m) − λ̄k2 ≤ L̄ n2 kλ(m) − λ̄k∞ ≤ L̄ n2 am , (32)
√
where n2 is the dimension of λ. The second inequality is by Assumption 4.1 and the fourth
inequality is by Proposition 7.6. Inequality (32) implies that for all m and i, we have
(m) (m) √
|xi − yi | ≤ L̄ n2 am .

It remains to show
X (m) (m)
X (m) 2 √
− yi xi ≤− xi + L̄M n2 am , ∀i, m. (33)
i i
This is established by the following four cases.
(m) (m) (m) (m) √ (m) (m) (m) (m) 2
1) If xi ≥ 0, xi − yi ≥ 0, then xi ≤ L̄ n2 am + yi . Thus −xi yi ≤ −xi +
√
L̄M n2 am by Proposition 7.1.
(m) (m) (m) (m) (m) (m) 2 (m) (m) (m) (m) (m) 2
2) If xi ≥ 0, xi −yi ≤ 0, then xi ≤ yi , xi ≤ xi ·yi and −xi yi ≤ −xi .
(m) (m) (m) (m) (m) (m) 2 (m) (m) (m) (m) (m) 2
3) If xi < 0, xi −yi ≥ 0, then xi ≥ yi , xi ≤ xi ·yi and −xi yi ≤ −xi .
(m) (m) (m) (m) (m) √ (m) (m) (m) 2 √ (m)
4) If xi < 0, xi −yi ≤ 0, then yi −xi ≤ L̄ n2 am , yi xi −xi ≥ L̄ n2 am xi
(m) (m) (m) 2 √ (m) (m) 2 √
and −yi xi ≤ −xi − L̄ n2 am xi ≤ −xi + L̄M n2 am . The last inequality is by
Proposition 7.1.
All these four cases yield (33).

17
Proposition 7.8 Under the assumptions of Theorem 4.6, we have

√ 1
f¯(θ(m+1) , λ̄) ≤ f¯(θ(m) , λ̄) − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M,
2
where M is a constant and am is defined in Proposition 7.6.

Proof. By Proposition 7.2,

1
fi (Xi : θ̃, λ) ≤ fi (Xi : θ̂, λ) + ∇fi (Xi : θ̂, λ)T (θ̃ − θ̂) + L̄kθ̃ − θ̂k22 .
2
Therefore, we can sum it over the entire training set from i = 1 to N to obtain
N
f¯(θ̃, λ) ≤ f¯(θ̂, λ) + ∇f¯(θ̂, λ)T (θ̃ − θ̂) + L̄kθ̃ − θ̂k22 . (34)
2
In Algorithm 1, we define the update of θ in the following full gradient way:
N
X
θ(m+1) := θ(m) − η (m) · ·∇fi (Xi : θ(m) , λ(m) ),
i=1
which implies
θ(m+1) − θ(m) = −η (m) · ∇f¯(θ(m) , λ(m) ). (35)

By (35) we have θ̃ − θ̂ = θ(m+1) − θ(m) = −η (m) ∇f¯(θ(m) , λ(m) ). We now substitute θ̃ := θ(m+1) ,
θ̂ := θ(m) and λ := λ̄ into (34) to obtain
f¯(θ(m+1) , λ̄)
N L̄
≤ f¯(θ(m) , λ̄) − η (m) ∇f¯(θ(m) , λ̄)T ∇f¯(θ(m) , λ(m) ) + (η (m) )2 · k∇f¯(θ(m) , λ(m) )k22
2
N L̄M
≤ f¯(θ(m) , λ̄) − η (m) ∇f¯(θ(m) , λ̄)T ∇f¯(θ(m) , λ(m) ) + (η (m) )2 · (36)
2
√ 1
≤ f¯(θ(m) , λ̄) + η (m) −k∇f¯(θ(m) , λ̄)k22 + L̄M n2 am + (η (m) )2 · N L̄M
2
√ 1
= f¯(θ(m) , λ̄) − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M.
2
The first inequality is by plugging (35) into (34), the second inequality comes from Proposition 7.1
and the third inequality comes from Proposition 7.7.

7.5 Proof of Theorem 4.10

Here we show Theorem 4.10 as the consequence of Theorem 4.6 and Lemmas 4.7, 4.8 and 4.9.

7.5.1 Proof of Lemma 4.7

Here we show Lemma 4.7 as the consequence of Lemmas 7.9, 7.10 and 7.11.
Lemma 7.9
∞ X
X ∞ X
i
α(i) η (n) < ∞
m=1 i=m n=1
and
∞ X
X ∞
α(n) < ∞
m=1 n=m
is a set of sufficient condition to ensure
∞
(m)
X
|σ¯j − σj | < ∞, ∀j. (37)
m=1

18
Proof. By plugging (26) and (24) into (37), we have the following for all j:

∞
(m)
X
σ̄j − σj
m=1
∞ ∞
!
X (m) σ̄j X (n)
≤ |˜
¯σj − σ̃j |
+ α
m=1
C n=m+1
∞

∞
 
i

∞
 (38)
(1) X
X |k | α(i) M̃L̄,M
X σ̄j X
≤  η (j) + |C̄| + α (n) 

m=1
C i=m j=1
C n=m+1
∞ X ∞ i ∞ ∞
|k (1) | · M̃L̄,M X X σ̄j + |k (1) ||C̄| X X
≤ α(i) η (j) + α(n) .
C m=1 i=m j=1
C m=1 n=m+1

It is easy to see that the the following conditions are sufficient for right-hand side of (38) to be finite:

∞ X
X ∞ X
i
α(i) η (n) < ∞
m=1 i=m n=1

and

∞ X
X ∞
α(n) < ∞.
m=1 n=m

Therefore, we obtain
∞
(m)
X
|σ¯j − σj | < ∞, ∀j.
m=1

Lemma 7.10 Under Assumption 4.4,

∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞
m=1 i=m n=1 m=1 n=m

is a set of sufficient conditions to ensure

M
X
lim sup f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄) < ∞.
M →∞ m=1

Proof. By Assumption 4.4, we have

D
X
kli (x) − li (y)k ≤ M̂ kx − yk ≤ M̂ |xi − yi |. (39)
i=1

19
By the definition of fi (·), we then have
X∞
f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄)
m=1
∞ X
X N XN
= li (Xi : θ(m) , λ(m) ) + c2 kθ(m) k22 − li (Xi : θ(m) , λ̄) + c2 kθ(m) k22
m=1 i=1 i=1
∞
X N
X
= li (Xi : θ(m) , λ(m) ) − li (Xi : θ(m) , λ̄)
m=1 i=1
∞ X
X N
≤ li (Xi : θ(m) , λ(m) ) − li (Xi : θ(m) , λ̄)
m=1 i=1
∞ X
D X
N (m) (m) (m)
X k (1) W1,j,· Xi − µj k (1) W1,j,· Xi − µ̄j
≤M2 (m)
−
m=1 j=1 i=1 σj + B σ̄j + B
∞ X (m)
D X
N
! !
X (m) 1 1 µ̄j µj
=M2 (k (1) W1,j,· Xi ) (m)
− + − (m)
m=1 j=1 i=1 σ j + B σ̄j + B σ̄j + B σ j + B
∞ X (m)
D X
N
! !
X (m) 1 1 µ̄j µj
≤M2 (k (1) W1,j,· Xi ) (m)
− + − (m)
m=1 j=1 i=1 σ j + B σ̄j + B σ̄j + B σ j + B
∞ X (m)
D N N
!
X X (m) 1 1 X µ̄j µj
≤M3 |(k (1) W1,j,· Xi )| (m)
− + − (m)
m=1 j=1 i=1 σ j + B σ̄j + B i=1
σ̄j + B σ j + B
∞ X (m) (m)
D N
!
X X (m) σ̄j − σj µ̄j µj
=M3 |(k (1) W1,j,· Xi )| (m)
+N − (m)
m=1 j=1 i=1 (σj + B )(σ̄j + B ) σ̄j + B σ j + B
∞ X (m) (m)
D N
!
X X (m) σ̄j − σj µ̄j µj
≤M3 |k (1) ||W1,j,· Xi | + N − . (40)
m=1 j=1 i=1
2B σ̄j + B σ
(m)
+ B
j

The first inequality is by the Cauchy-Schwarz inequality, and the second one is by (39). To show the
finiteness of (40), we only need to show the following two statements:

∞ X
N (m)
X (m) σ̄j − σj
|k (1) ||W1,j,· Xi | < ∞, ∀j (41)
m=1 i=1
2B
and
∞ (m)
X µ̄j µj
− (m) < ∞, ∀j. (42)
m=1
σ̄j + B σj + B
Proof of (41): For all j we have

∞ X
N (m)
X (m) σ̄j − σj
|k (1) ||W1,j,· Xi |
m=1 i=1
2B
∞
X
(1) 1 (m)
≤ |k |N DM maxkXi k σ̄j − σj (43)
m=1
i 2B
∞
1 X (m)
=|k (1) |N DM maxkXi k σ̄j − σj .
i 2B m=1
(m)
The inequality comes from |W1,j,· Xi | ≤ DM kXi k2 , where D is the dimension of Xi and M is the
(m)
element-wise upper bound for W1,j,· in Assumption 4.2.

20
P∞ (m)
Finally, we invoke Lemma 7.3 to assert that m=1 σ̄j − σj is finite.
Proof of (42): For all j we have
∞ (m)
X µ̄j µj
− (m)
m=1
σ̄j + B σ j + B
(m) (m) (m)
(44)
∞ ∞
X µ̄j µj X µj µj
≤ − + − (m) .
m=1
σ̄j + B σ̄j + B m=1
σ̄j + B σj + B

(m)
The first term in (44) is finite since {µj } is a Cauchy series. For the second term, we know that
(m) (m)
there exists a constant M such that for all m ≥ M , µj ≤ µ̄ + 1. This is also by the fact that {µj }
is a Cauchy series and it converges to µ̄. Therefore, the second term in (44) becomes
M −1 (m) (m) ∞ (m) (m)
X µj µj X µj µj
− (m) + − (m)
m=1
σ̄j + B σ j + B σ̄j + B σ j + B
m=M
(m) (m)
(45)
M −1 ∞
X µj µj X 1 1
≤ − (m) + (µ̄ + 1) − (m) .
m=1
σ̄j + B σ j + B σ̄j + B σ j + B
m=M

1 1
Noted that function f (σ) = is Lipschitz continuous since its gradient is bounded by 2 .
σ + B B
1
Therefore we can choose 2 as the Lipschitz constant for f (σ). We then have the following
B
inequality:

1 1 1 (m)
− (m) ≤ 2 |σ̄j − σj |. (46)
σ̄j + B σ j + B B

Plugging (46) into (45), we obtain

M −1 (m) (m) ∞
X µj µj X 1 1
− (m) + (µ̄ + 1) − (m)
m=1
σ̄j + B σ j + B σ̄ j + B σ j + B
m=M
M −1 (m) (m) ∞
X µj µj X (µ̄ + 1) (m)
≤ − (m) + 2 |σ̄j − σj |,
m=1
σ̄j + B σ j + B B
m=M
where the first term is finite by the fact that M is a finite constant. We have shown the condition for
the second term to be finite in Lemma 7.9. Therefore,
∞ (m)
X µ̄j µj
− (m) < ∞, ∀j.
m=1
σ̄j + B σj + B

By (41) and (42), we have that the right-hand side of (40) is finite. It means that the left-hand side of
(40) is finite. Thus,
∞
X
f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄) < ∞.
m=1

Lemma 7.11 If
∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞,
m=1 i=m n=1 m=1 n=m
then
M
X
lim sup η (m) k∇f¯(θ(m) , λ̄)k22 < ∞.
M →∞ m=1

21
Proof. For simplicity of the proof, we define
M
X
T (M ) := η (m) k∇f¯(θ(m) , λ̄)k22 ,
m=1

O(m) := f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m) , λ(m) ),

(m+1)
∆1 := f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m+1) , λ̄),
(m)
∆2 := f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄),

where λ̄ is the converged value of λ in Theorem 4.6. Therefore,

(m+1) (m) (m) (m+1) (m) (m)

O(m) = ∆1 + ∆1 + ∆2 ≤ |∆1 | + |∆1 | + ∆2 . (47)

By Proposition 7.8,
(m) √ 1
∆2 ≤ −η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M. (48)
2

We sum the inequality (47) from 1 to K with respect to m and plug (48) into it to obtain
K K K K
(m+1) (m)
X X X X
O(m) ≤ |∆1 |+ |∆1 | − {η (m) k∇f¯(θ(m) , λ̄)k22 }
m=1 m=1 m=1 m=1
K K
X √ X 1
+ η (m) L̄M n2 am + { (η (m) )2 N L̄M }
m=1 m=1
2
K K
(m+1) (m)
X X
= |∆1 |+ |∆1 | − T (K)
m=1 m=1
K K
√ X X 1
+ L̄2 n2 · η (m) am + { (η (m) )2 N L̄M }.
m=1 m=1
2
From this, we have:
−1 ¯ (K) (K)
lim sup T (K) ≤ lim sup (f (θ , λ ) − f¯(θ(1) , λ(1) ))
K→∞ K→∞ c1
K
1 X (m+1) (m)
+ lim sup (|∆1 | + |∆1 |)
K→∞ c1 m=1
K (49)
√ X (m)
+ lim sup L̄2 n2 η am
K→∞ m=1
K
N L̄K X (m) 2
+ lim sup η .
K→∞ 2c1 m=1

Next we show that each of the four terms in the right-hand side of (49) is finite, respectively. For the
first term,
−1 ¯ (K) (K)
lim sup (f (θ , λ ) − f¯(θ(1) , λ(1) )) < ∞
K→∞ c1

is by the fact that the parameters {θ, λ} are in compact sets, which implies that the image of fi (·) is
in a bounded set.
For the second term, we showed its finiteness in Lemma 7.10.

22
For the third term, by (23), we have
K
X
lim sup η (m) am
K→∞ m=1
 
K
X ∞ X
X i ∞
X
= lim sup η (m) K1 α(i) η (j) + K2 α(i)  (50)
K→∞ m=1 i=m j=1 i=m
 
K
X ∞ X
X i K
X ∞
X
(m) (i) (j)  (m)
=K1 lim sup η  α η + K2 lim sup η α(i) .
K→∞ m=1 i=m j=1 K→∞ m=1 i=m

The right-hand side of (50) is finite because

   
∞
X ∞ X
X i ∞
X ∞ X
X i
η (m)  α(i) η (j)  <  α(i) η (j)  < ∞ (51)
m=1 i=m j=1 m=1 i=m j=1

and
∞
X ∞
X ∞ X
X ∞
η (m) α(i) < α(i) < ∞. (52)
m=1 i=m m=1 i=m

The second inequalities in (51) and (52) come from the stated assumptions of this lemma.
For the fourth term,
K
N L̄M X (m) 2
lim sup η <∞
K→∞ 2c m=1
P∞ (m) 2
holds, because we have m=1 (η ) < ∞ in Assumption 4.3. Therefore, T (∞) =
P∞ (m) ¯ (m) 2
m=1 η k∇ f (θ , λ̄)k2 < ∞ holds.
In Lemmas 7.9, 7.10 and 7.11, we show that {σ (m) } and {µ(m) } are Cauchy series, hence Lemma
4.7 holds.

7.5.2 Proof of Lemma 4.8

This proof is similar to the the proof by Bertsekas and Tsitsiklis [5].
Proof. By Theorem 4.7, we have
M
X
lim sup η (m) k∇f¯(θ(m) , λ̄)k22 < ∞. (53)
M →∞ m=1

If there exists a > 0 and an integer m̄ such that

k∇f¯(θ(m) , λ̄)k2 ≥
for all m ≥ m̄, we would have
M
X M
X
lim inf η (m) k∇f¯(θ(m) , λ̄)k22 ≥ lim inf 2 η (m) = ∞
M →∞ M →∞
m=m̄ m=m̄

which contradicts (53). Therefore, lim inf k∇f¯(θ(m) , λ̄)k2 = 0.

m→∞

7.5.3 Proof of Lemma 4.9

Lemma 7.12 Let Yt , W, t and Zt be three sequences such that Wt is nonnegative for all t. Assume
that
Yt+1 ≤ Yt − Wt + Zt , t = 0, 1, ...,
PT
P∞ t=0 Zt converges as T → ∞. Then either Yt → ∞ or else Yt converges to a
and that the series
finite value and t=0 Wt < ∞.

23
This lemma has been proven by Bertsekas and Tsitsiklis [5].

Lemma 7.13 When

∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞,
m=1 i=m n=1 m=1 n=m
it follows that f¯(θ(m) , λ̄) converge to a finite value.

Proof. By Proposition 7.8, we have

√ 1
f¯(θ(m+1) , λ̄) ≤ f¯(θ(m) , λ̄) − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M.
2
√
Let Y (m) := f¯(θ(m) , λ̄), W (m) := η (m) k∇f¯(θ(m) , λ̄)k22 and Z (m) := η (m) L̄M n2 am +
1 (m) 2 PM
(η ) · N L̄M . By (1) and (50)- (52), it is easy to see that m=0 Z (m) converges as M → ∞.
2
Therefore, by Lemma 7.12, Y (m) converges to a finite value. The infinite case can not occur in our
setting due to Assumptions 4.1 and 4.2.

Lemma 7.14 If
∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞,
m=1 i=m n=1 m=1 n=m

then lim k∇f¯(θ(m) , λ̄)k2 = 0.

m→∞

Proof. To show that lim k∇f¯(θ(m) , λ̄)k2 = 0, assume the contrary; that is,
m→∞

lim supk∇f¯(θ(m) , λ̄)k2 > 0.

m→∞

Then there exists an > 0 such that k∇f¯(θ(m) , λ̄)k < /2 for infinitely many m and also
k∇f¯(θ(m) , λ̄)k > for infinitely many m. Therefore, there is an infinite subset of integers M,
such that for each m ∈ M, there exists an integer q(m) > m such that

k∇f¯(θ(m) , λ̄)k < /2,

k∇f¯(θ(i(m)) , λ̄)k > ,
(54)
/2 ≤ k∇f¯(θ(i) , λ̄)k ≤ ,
if m < i < q(m).

From
k∇f¯(θ(m+1) , λ̄)k − k∇f¯(θ(m) , λ̄)k ≤ k∇f¯(θ(m+1) , λ̄) − ∇f¯(θ(m) , λ̄)k
≤ L̄kθ(m+1) − θ(m) k
= L̄η (m) k∇f¯(θ(m) , λ(m) )k,
it follows that for all m ∈ M that are sufficiently large so that L̄η (m) < /4, we have

/4 ≤ k∇f¯(θ(m) , λ(m) )k. (55)

Otherwise the condition /2 ≤ k∇f¯(θ(m+1) , λ̄)k would be violated. Without loss of generality, we
assume that the above relations as well as (36) hold for all m ∈ M. With the above observations, we
have for all m ∈ M,

24

≤ k∇f¯(θq(m) , λ̄)k − k∇f¯(θ(m) , λ̄)k
2
≤ k∇f¯(θq(m) , λ̄) − ∇f¯(θ(m) , λ̄)k
≤ L̄kθq(m) − θ(m) k
q(m)−1
X
≤ L̄ kθ(i+1) − θ(i) k
i=m
q(m)−1
X
≤ L̄ η (i) k∇f¯(θ(i) , λ(i) )k
i=m
q(m)−1
X
≤ L̄ η (i) (k∇f¯(θ(i) , λ̄)k + k∇f¯(θ(i) , λ(i) ) − ∇f¯(θ(i) , λ̄)k)
i=m
q(m)−1
X √
≤ L̄ η (i) (k∇f¯(θ(i) , λ̄)k + L̄ n2 am )
i=m
q(m)−1 q(m)−1
X √ X
≤ L̄ η (i) + L̄2 n2 η (i) am
i=m i=m
 
q(m)−1 q(m)−1 j
∞ X ∞
X √ X X X
= L̄ η (i) + L̄2 n2 η (i) M1 α(j) η (k) + M2 α(j) 
i=m i=m j=m k=1 j=m
q(m)−1 q(m)−1 ∞ j q(m)−1 ∞
X √ X X X √ X X
= L̄ η (i) + L̄2 n2 M1 η (i) α(j) η (k) + L̄2 n2 M2 η (i) α(j)
i=m i=m j=m k=1 i=m j=m

The first inequality is by (54) and the third one is by the Lipschitz condition assumption. The seventh
one is by (32). By (2), we have for all m ∈ M,

q(m)−1 j
∞ X ∞ X j
∞ X
X X X
η (i) α(j) η (k) < α(j) η (k) < ∞
i=m j=m k=1 i=1 j=i k=1

and
q(m)−1 ∞ ∞ X
∞
X X X
η (i) α(j) < α(j) < ∞.
i=m j=m i=1 j=i

P∞ P∞
It is easy to see that for any sequence {αi } with i=1 αi < ∞, if follows that lim inf i=M αi = 0.
M →∞
Therefore,
q(m)−1 j
∞ X
X X
(i)
lim inf η α(j) η (k) = 0
m→∞
i=m j=m k=1

and
q(m)−1 ∞
X X
lim inf η (i) α(j) = 0.
m→∞
i=m j=m

From this it follows that

q(m)−1
X 1
lim inf η (i) ≥ . (56)
m→∞
i=m
2L̄

25
By the triangle inequality, we have

k∇f¯(θ(m) , λ̄)k
=k∇f¯(θ(m) , λ(m) ) − ∇f¯(θ(m) , λ(m) ) + ∇f¯(θ(m) , λ̄)k
≥ k∇f¯(θ(m) , λ(m) )k − k∇f¯(θ(m) , λ̄) − ∇f¯(θ(m) , λ(m) )k .

√
By (32) and (55), if we pick m ∈ M such that L n2 am ≤ , we have k∇f¯(θ(m) , λ̄)k ≥ . Using
8 8
(36), we observe that
q(m)−1 1 q(m)−1
X X
f¯(θq(m) , λ̄) ≤ f¯(θ(m) , λ̄) − η (i) c1 k∇f¯(θ(i) , λ̄)k22 + · N L̄M (η (i) )2
i=m
2 i=m
2 q(m)−1
X 1
q(m)−1
X
≤ f¯(θ(m) , λ̄) − c1 η (i) + · N L̄M (η (i) )2 , ∀m ∈ M,
8 i=m
2 i=m
where the second inequality is by (55). By Lemma 7.13, f¯(θq(m) , λ̄) and ¯ (m) , λ̄) converge to
Pf∞(θ (m)
the same finite value. Using this convergence result and the assumption m=0 (η )2 < ∞, this
relation implies that

q(m)−1
X
lim sup η (i) = 0
m→∞,m∈M i=m

and contradicts (56).

By Lemmas 7.12, 7.13 and 7.14, we show that Theorem 4.10 holds. To this end we write

lim k∇f¯(θ(m) , λ(m) )k2

m→∞

≤ lim k∇f¯(θ(m) , λ(m) ) − ∇f¯(θ(m) , λ̄)k2 + lim k∇f¯(θ(m) , λ̄)k2

m→∞ m→∞

≤ lim L̄kλ(m) − λ̄k2 + lim k∇f¯(θ(m) , λ̄)k2 .

m→∞ m→∞

By Theorem 4.6, we have

lim L̄kλ(m) − λ̄k2 = 0
m→∞

and by Lemma 4.9, we have

lim k∇f¯(θ(m) , λ̄)k2 = 0.
m→∞

Therefore, we have
lim k∇f¯(θ(m) , λ(m) )k22 = 0,
m→∞

which is the statement in Theorem 4.10.

7.6 Proof of Theorem 5.2

In this section we assume that fi (·) is strongly convex.

Lemma 7.15 If
∞ X
X i
m2 α(i) η (n) < ∞, (57)
i=m n=1

there exists a constant M such that , for every m we have

(m+1) (m) √
|∆1 | + (1 + η (m) c)|∆1 | + η (m) L̄M n2 am
≤ M. (58)
1 (m) 2
2η

26
Proof. The notation here is the same as the one used in the proof of Lemma 7.11. Showing (58) is
(m)
|∆ | am
equivalent to showing constant upper bounds for 1 2 and (m) .
η (m) η
(m)
|∆1 |
For an upper bound of 2 , by (40) and (43), we have
η (m)
(m)
|∆1 |
2
η (m)
|f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄)|
= 2 (59)
η (m)
(m)
D
!
M3 X (1) 1 (m) µ̄j µj
≤ 2 |k |N DM 2 σ̄j − σj +N − (m) .
η (m) j=1
B σ̄j + B σj + B

(m) (m)
|σ¯j − σj | |µ¯j − µj |
We can see that it is equivalent to show that and
have constant upper
η η (m) 2 (m) 2
bounds because all other terms in the right-hand side of (59) are finite constants.
By (38), we have

∞ i ∞
(m) |k (1) | · M̃L̄,M X X σ̄j + |k (1) ||C̄| X
|σ¯j − σj |≤ α(i) η (j) + α(n) .
C i=m j=1
C n=m+1

ζ 2
Note that we have η (m) = and thus η (m) = O( m12 ). Therefore, (57) implies that
ϑ+m
∞ X
i
1 X
2 α(i) η (n) < ∞. (60)
η (m) i=m n=1

Inequality (60) implies

∞
1 X
2 α(i) < ∞. (61)
η (m) i=m+1
P∞
This is by the fact that we assume n=1 η (n) = ∞ in Assumption 4.3. We now apply the same kind
(m)
of analysis to |µ¯j − µj | to establish
(m)
µj − µ̄j
(m)
(1) (m)
µj µ̄j
=(1 − α )...(1 − α ) −
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(m) )
(m)
µj µ̄j
≤(1 − α(1) )...(1 − α(m) )[ − + (62)
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(∞) )
µ̄j µ̄j
(1)
− ]
(1 − α )...(1 − α(m) ) (1 − − α(∞) ) (1)
α )...(1
(m) (∞) µ̄j (∞)
≤ µ̃j − µ̃j + − µ̃j .
(1 − α )...(1 − α(m) )
(1)

(m) (∞) µ̄j (∞)

We define Am := µ̃j − µ̃j and Bm := − µ̃j . Recall from
(1 − α(1) )...(1 − α(m) )
(m)
Theorem 4.6 that {µj } is a Cauchy series. By (7), we have

27
q X
m
(p) (q)
X
|µ̃j − µ̃j | ≤ M̃L̄,M · α(m) η (n) .
m=p n=1

Therefore, the first term in (62) is bounded by

∞ X
i
(m)
X
|µ̃j − µ̃∞
j | ≤ M̃L̄,M · α(i) η (n) < ∞.
i=m n=1

For the second term in (62), we first define C := (1 − α(1) )...(1 − α(u) )... . Then we have
µ̄j (∞)
C· − µ̃j
(1 − α(1) )...(1 − α(m) )
=µ̄j |1 − (1 − α(m+1) )...(1 − α(∞) )|
X∞
≤µ̄j α(i) ,
i=m+1
where the last inequality can be easily checked by induction. Therefore, the second term in (62) is
bounded by
∞
µ̄j (∞) µ̄j X (i)
− µ̃j ≤ α .
(1 − α(1) )...(1 − α(m) ) C i=m+1
(m)
|µ¯j − µj |
Hence (60) and (61) ensure to be finite.
η (m) 2

am
For an upper bound of , by (23), we have
η (m)
P∞ Pi P∞
am M1 i=m j=1 α(i) η (j) + M2 i=m α(i)
= .
η (m) η (m)
We know that
P∞ Pi ∞ X
i
M1 i=m j=1 α(i) η (j) 1 X
< M1 2 α(i) η (j) < ∞ (63)
η (m) η (m) i=m j=1

and P∞ ∞
M2 α(i) 1 X
i=m
< M2 2 α(i) < ∞. (64)
η (m) η (m) i=m

The second inequalities in (63) and (64) are by (60) and (61). Note that given that η (m) = 1/m, (60)
is equivalent to
∞ X i ∞ ∞
X 1 X X
α(i) < α(i) ln(i) < α(i) ln(i) < ∞
i=m j=1
j i=m i=1
This concludes the proof.

Lemma 7.16 Under the assumptions of Lemma 7.15, Theorem 5.2 holds.

The proof for this Lemma of the high level follows the proof of Theorem 4.7 in Bottou et al. [6].
Proof. Assumption 5.1 implies that
1
f¯(θ̃, λ) ≥ f¯(θ̂, λ) + ∇f¯(θ̂, λ)T (θ̃ − θ̂) + ckθ̃ − θ̂k22 , ∀θ̃, θ̂.
2
¯ ¯ ¯
Therefore, f has a unique minimizer f := f (θ , λ) for any λ fixed. Note that θ∗ = θ∗ (λ) but this
∗ ∗

dependency is irrelevant in the rest of the proof. Standard convex analysis argument establishes

28

2c f¯(θ(m) , λ) − f¯(θ∗ , λ) ≤ k∇f¯(θ(m) , λ)k22 . (65)

(m+1)
Recall that ∆1 := f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m+1) , λ̄). We then have
f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m) , λ(m) )
h i h i
= f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄)
(66)
+ f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄)
(m+1) (m)
≤|∆1 | + |∆1 | + f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄).
Therefore,
f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄)
√ 1 2
≤ − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + η (m) N L̄M
2
¯ ¯ ∗ √ 1 2
(m) (m) (m)
≤ − η c(f (θ , λ̄) − f (θ , λ̄)) + η L̄M n2 am + η (m) N L̄M
2
¯ ¯ ∗ ¯ ¯
= − η c f (θ , λ ) − f (θ , λ̄) + f (θ , λ̄) − f (θ(m) , λ(m) )
(m) (m) (m) (m)

√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M (67)
2
≤ − η (m) c f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄) + η (m) c f¯(θ(m) , λ̄) − f¯(θ(m) , λ(m) )
√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M
2
(m)
= − η c f (θ , λ ) − f¯(θ∗ , λ̄) + η (m) c|∆1 |
(m) ¯ (m) (m)

√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M.
2
The first inequality is by Proposition 7.8, while the second inequality is by the strong convexity
property (65). Combining (66) and (67) yields
f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m) , λ(m) )

(m+1) (m)
≤ − η (m) c f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄) + |∆1 | + (1 + η (m) c)|∆1 |
√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M.
2
By Lemma 7.15, there exists an upper bound M4 such that for all m sufficiently large,
(m+1) (m) √
|∆1 | + (1 + η (m) c)|∆1 | + η (m) L̄M n2 am
≤ M4 .
1 (m) 2
2η
By subtracting f¯(θ∗ , λ̄) from both side of (7.6), we obtain

f¯(θ(m+1) , λ(m+1) ) − f¯(θ∗ , λ̄)

1 2 (68)
≤(1 − η (m) c)(f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄)) + η (m) (N L̄M + M4 ).
2
Inequality (68) has the exact same form used in classic convergence proofs for the strongly convex,
diminishing step size case.
We finally show by induction that

v
f¯(θm , λm ) − f¯(θ∗ , λ̄) ≤ (69)
ϑ+m

29
holds for all m, where
ζ 2 (N L̄M + M4 )
v := max{ , (ϑ + 1)[f¯(θ(1) , λ(1) ) − f¯(θ∗ , λ̄)]}.
2(ζc − 1)
First, the definition of ζ ensures that it holds for m = 1. Assuming (69) holds for some m ≥ 1, it
follows from (68) that
1 2
f¯(θ(m+1) , λ(m+1) ) − f¯(θ∗ , λ̄) ≤ (1 − η (m) c)(f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄)) + η (m) (LM + M4 )
2
(m) v 1 (m) 2
≤ (1 − η c) + η (LM + M4 )
ϑ+m 2
ζc v ζ 2 (LM + M4 )
= (1 − ) +
ϑ+m ϑ+m 2(ϑ + m)2
ϑ + m − ζc ζ 2 (LM + M4 )
= 2
v+
(ϑ + m) 2(ϑ + m)2
ζ 2 (LM + M4 )

ϑ+m−1 ζc − 1
= v − v +
(ϑ + m)2 (ϑ + m)2 2(ϑ + m)2
ϑ+m−1
≤ v
(ϑ + m)2
v
≤ .
ϑ+m+1
The first inequality is by (68), the second inequality is by the definition of η (m) , the third inequality
is by the definition of v, the sum of the latter two terms is non-positive, and the fourth inequality
is because (ϑ + m)2 ≥ (ϑ + m + 1)(ϑ + m − 1). This shows that the algorithm converges at a
sublinear rate.

8 Appendix B: Discussions of conditions for stepsizes

Here we discuss the actual conditions for η (m) and α(m) to satisfy the assumptions of Theorem 4.6,
Lemma 4.7 and Theorem 5.2, respectively. We only consider the cases η (m) = m1k and α(m) = m1h ,
but the same analysis applies to the cases η (m) = O( m1k ) and α(m) = O( m1h ).

8.1 Assumptions of Theorem 4.6

For the assumptions of Theorem 4.6, the first condition

X∞
α(m) < ∞
m=1
requires h > 1.
Besides, the second condition
X∞ X m X∞ X ∞ ∞
X ∞
X
(m) (n) (m) (n) (n)
α η = α η = η α(m)
m=1 n=1 n=1 m=n n=1 m=n
∞ ∞
1 X 1 1 X 1
≈ η (n) h−1 = k+h−1
<∞
h − 1 n=1 n h − 1 n=1 n
requires k + h > 2. The approximation comes from the fact that for every p > 1, we have
∞ Z ∞ ∞
X 1 1 1
k −p ≈ k −p dx = x1−p = p−1
.
k=n 1−p n p−1n
k=n
Since k ≥ 1 due to Assumption 4.3, we conclude that k + h > 2.
Therefore, the conditions for η (m) and α(m) to satisfy the assumptions of Theorem 4.6 are h > 1 and
k ≥ 1.

30
8.2 Assumptions of Lemma 4.7

For the assumptions of Theorem 4.6, the first condition

∞ X
∞ ∞
X X 1
α(n) ≈ h−1
<∞
m=1 n=m m=1
m

requires h > 2.
Besides, the second condition is
∞ X
X ∞ X
i ∞ X
X ∞ i
X ∞ X
X ∞
α(i) η (n) = α(i) η (n) ≤ C α(i) < ∞.
m=1 i=m n=1 m=1 i=m n=1 m=1 i=m

The inequality holds because for any p > 1, we have

n Z n n
X
−p 1 1
k ≈ k −p dk = k 1−p = (1 − n1−p ) ≤ C
k=1 1 − p 1 p−1
k=1

Therefore, the conditions for η (m) and α(m) to satisfy the assumptions of Lemma 4.7 are h > 2 and
k ≥ 1.

8.3 Assumptions of Theorem 5.2

Recall that we have let η (m) = 1/m. For the assumptions of Theorem 5.2, the condition
∞
X
α(m) ln(m) < ∞
m=1

requires h > 1. To see this, note that ln(m) ≤ Cm for any > 0. Thus
∞
X ∞
X
α(m) ln(m) ≤ C m−h m < ∞
m=1 m=1

if − h < −1. This yields h > 1.

Therefore, the condition for α(m) to satisfy the assumptions of Theorem 5.2 is h > 1.

Chen, Deng Et Al 2021 - Effective and Efficient Batch Normalization
No ratings yet
Chen, Deng Et Al 2021 - Effective and Efficient Batch Normalization
15 pages
Batch Normalization Preconditioning For Neural Network Training
No ratings yet
Batch Normalization Preconditioning For Neural Network Training
41 pages
8.5 Batch Normalization: 8.4.5 Exercises
No ratings yet
8.5 Batch Normalization: 8.4.5 Exercises
11 pages
Batch Normalization in Deep Learning
No ratings yet
Batch Normalization in Deep Learning
17 pages
465-Lecture 10-11
No ratings yet
465-Lecture 10-11
79 pages
Adl Unit 1 2
No ratings yet
Adl Unit 1 2
67 pages
SNGAN 5th Module
No ratings yet
SNGAN 5th Module
12 pages
Theoretical of Batch
No ratings yet
Theoretical of Batch
32 pages
UNIT 2 Notes
No ratings yet
UNIT 2 Notes
19 pages
5.batch Normalisation
No ratings yet
5.batch Normalisation
10 pages
How Does It Help
No ratings yet
How Does It Help
23 pages
Batch Normalization Separate
No ratings yet
Batch Normalization Separate
20 pages
Exponential Convergence Rates For Batch Normalization - 1
No ratings yet
Exponential Convergence Rates For Batch Normalization - 1
1 page
Exponential Convergence Rates For Batch Normalization - 2
No ratings yet
Exponential Convergence Rates For Batch Normalization - 2
1 page
Batch Normalization
No ratings yet
Batch Normalization
6 pages
Convolutional Neural Network
100% (1)
Convolutional Neural Network
59 pages
Notes For - Batch Normalization - Accelerating Deep Network Training by Reducing Internal Covariate Shift - Paper GitHub
No ratings yet
Notes For - Batch Normalization - Accelerating Deep Network Training by Reducing Internal Covariate Shift - Paper GitHub
3 pages
03 Optim Batch Normalization
No ratings yet
03 Optim Batch Normalization
13 pages
Training Neural Netwok: Data Set
No ratings yet
Training Neural Netwok: Data Set
35 pages
Revisiting Internal Covariate Shiftfor BN
No ratings yet
Revisiting Internal Covariate Shiftfor BN
12 pages
12.batch Normalization
No ratings yet
12.batch Normalization
12 pages
A Quantitative Analysis of The Effect of Batch Normalization On Gradient Descent
No ratings yet
A Quantitative Analysis of The Effect of Batch Normalization On Gradient Descent
9 pages
BN Free
No ratings yet
BN Free
11 pages
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
No ratings yet
Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization
1 page
Different Activation Functions With The Equations
No ratings yet
Different Activation Functions With The Equations
6 pages
Eait 2018 8470438
No ratings yet
Eait 2018 8470438
5 pages
Batch Normalization
No ratings yet
Batch Normalization
2 pages
L2 Regularization Versus Batch and Weight Normalization
No ratings yet
L2 Regularization Versus Batch and Weight Normalization
9 pages
On Deep Learning For Inverse Problems: Jaweria Amjad Jure Sokoli C Miguel R.D. Rodrigues
No ratings yet
On Deep Learning For Inverse Problems: Jaweria Amjad Jure Sokoli C Miguel R.D. Rodrigues
5 pages
Normalization Techniques
No ratings yet
Normalization Techniques
23 pages
Deep Learning Tutorial 9
No ratings yet
Deep Learning Tutorial 9
70 pages
AN L G N: EW Ook at Host Ormalization
No ratings yet
AN L G N: EW Ook at Host Ormalization
10 pages
DLbook
No ratings yet
DLbook
165 pages
Lecture 8.7
No ratings yet
Lecture 8.7
9 pages
Deep Learning Optimization Guide
100% (1)
Deep Learning Optimization Guide
105 pages
Ch2-Training, Optimization and Regularization of DNN-new
No ratings yet
Ch2-Training, Optimization and Regularization of DNN-new
114 pages
6 Batchnorm
No ratings yet
6 Batchnorm
30 pages
CNN 02 Batch Normalization
No ratings yet
CNN 02 Batch Normalization
19 pages
NeurIPS 2020 Batch Normalization Provably Avoids Ranks Collapse For Randomly Initialised Deep Networks Paper
No ratings yet
NeurIPS 2020 Batch Normalization Provably Avoids Ranks Collapse For Randomly Initialised Deep Networks Paper
12 pages
Cours 5
No ratings yet
Cours 5
23 pages
How To Use Batch Normalization With TensorFlow and TF - Keras To Train Deep Neural Networks Faster
No ratings yet
How To Use Batch Normalization With TensorFlow and TF - Keras To Train Deep Neural Networks Faster
11 pages
Batch Normalization
No ratings yet
Batch Normalization
11 pages
Neural Network BatchNorm Analysis
No ratings yet
Neural Network BatchNorm Analysis
11 pages
BDA Unit 2
No ratings yet
BDA Unit 2
48 pages
DL Notes
No ratings yet
DL Notes
16 pages
Deep Learning Basics Lecture 8 Autoencoder & DBM
No ratings yet
Deep Learning Basics Lecture 8 Autoencoder & DBM
28 pages
DL Lecture8 Autoencoder
No ratings yet
DL Lecture8 Autoencoder
28 pages
BN Layer
No ratings yet
BN Layer
4 pages
8.augmentation and Regularization
No ratings yet
8.augmentation and Regularization
62 pages
Deep Learning & Neural Networks Guide
No ratings yet
Deep Learning & Neural Networks Guide
64 pages
Lecture 04 - Optimization - 4p
No ratings yet
Lecture 04 - Optimization - 4p
11 pages
Mathematics in AI: Neural Networks
No ratings yet
Mathematics in AI: Neural Networks
10 pages
Layer Normalization: Jimmy@psi - Toronto.edu Rkiros@cs - Toronto.edu Hinton@cs - Toronto.edu
No ratings yet
Layer Normalization: Jimmy@psi - Toronto.edu Rkiros@cs - Toronto.edu Hinton@cs - Toronto.edu
14 pages
DL UNIT II PART II (IMP) Optimization For Training Deep Model
No ratings yet
DL UNIT II PART II (IMP) Optimization For Training Deep Model
81 pages
3 DL
No ratings yet
3 DL
15 pages
2021mini Alexnet
No ratings yet
2021mini Alexnet
73 pages
幼儿识字卡片1200配图打印版巧宝识字1
No ratings yet
幼儿识字卡片1200配图打印版巧宝识字1
100 pages
Homework 3: X P (X X)
No ratings yet
Homework 3: X P (X X)
1 page
Homework 4
No ratings yet
Homework 4
1 page
Best Practices For Modernizing Enterprise Decision Making: Conclusions Paper
No ratings yet
Best Practices For Modernizing Enterprise Decision Making: Conclusions Paper
12 pages
Solving Triangles: Six Different Types
No ratings yet
Solving Triangles: Six Different Types
5 pages
Matrix Stiffness Method English
No ratings yet
Matrix Stiffness Method English
14 pages
L1 Logical Thinking
No ratings yet
L1 Logical Thinking
27 pages
Digital SAT Prep Online - Master The SAT
No ratings yet
Digital SAT Prep Online - Master The SAT
37 pages
7 Deadly Sins of Research
No ratings yet
7 Deadly Sins of Research
33 pages
Chapter 4 Fluid Flow in Porous Media - 2
100% (1)
Chapter 4 Fluid Flow in Porous Media - 2
72 pages
Final Assignment MAT1004 Code 9
No ratings yet
Final Assignment MAT1004 Code 9
2 pages
An Introduction To Data Analysis Using IBM SPSS, 1st Edition ISBN 1032891793, 9781032891798 Direct Ebook Download
No ratings yet
An Introduction To Data Analysis Using IBM SPSS, 1st Edition ISBN 1032891793, 9781032891798 Direct Ebook Download
15 pages
Bhadra Binu Remya - Tasks - Operators
No ratings yet
Bhadra Binu Remya - Tasks - Operators
4 pages
QFFF Thesis
No ratings yet
QFFF Thesis
116 pages
1 Question Bank CBSE
No ratings yet
1 Question Bank CBSE
17 pages
Ma221-6 1
No ratings yet
Ma221-6 1
7 pages
Chapter 5 Forecasting: Quantitative Analysis For Management, 11e (Render)
No ratings yet
Chapter 5 Forecasting: Quantitative Analysis For Management, 11e (Render)
27 pages
Grade 6 Maths Revision Test
No ratings yet
Grade 6 Maths Revision Test
3 pages
Lower Secondary Mathematics Answers Stage 7
100% (4)
Lower Secondary Mathematics Answers Stage 7
23 pages
Vectors Gcse and Igcse
No ratings yet
Vectors Gcse and Igcse
8 pages
Response of A Damped System Under The Harmonic Motion of The Base
No ratings yet
Response of A Damped System Under The Harmonic Motion of The Base
28 pages
MIMO Module 3 Notes
No ratings yet
MIMO Module 3 Notes
22 pages
Probability Concepts and Application
No ratings yet
Probability Concepts and Application
44 pages
Nat 5 Notes
No ratings yet
Nat 5 Notes
51 pages
1001 Math Problems - Spatial Puzzles (3D)
No ratings yet
1001 Math Problems - Spatial Puzzles (3D)
15 pages
S102502 Asi en Ac
No ratings yet
S102502 Asi en Ac
72 pages
Calculating Bonus Toll
100% (1)
Calculating Bonus Toll
4 pages
See Other Formats
No ratings yet
See Other Formats
23 pages
B-Spline Curves: N K T T T T T
No ratings yet
B-Spline Curves: N K T T T T T
5 pages
Power Flow Management Through Interline Power Flow Controller
No ratings yet
Power Flow Management Through Interline Power Flow Controller
6 pages
Pletser - Lagrangian and Hamiltonian Analytical Mechanics PDF
100% (10)
Pletser - Lagrangian and Hamiltonian Analytical Mechanics PDF
138 pages
Douglas Mooney
No ratings yet
Douglas Mooney
85 pages
Deconstructive Design at Villette
100% (2)
Deconstructive Design at Villette
8 pages