Convergence
Convergence
Abstract
1 Introduction
Deep neural networks (DNN) have shown unprecedented success in various applications such as
object detection. However, it still takes a long time to train a DNN until it converges. Ioffe and
Szegedy [12] identified an important problem involved in training deep networks, internal covariate
shift, and then proposed batch normalization (BN) to decrease this phenomenon. BN addresses
this problem by normalizing the distribution of every hidden layer’s input. In order to do so, it
calculates the pre-activation mean and standard deviation using mini-batch statistics at each iteration
of training and uses these estimates to normalize the input to the next layer. The output of a layer is
normalized by using the batch statistics and two new trainable parameters per neuron are introduced
that capture the inverse operation. It is now a standard practice [6, 11]. While this approach leads to
a significant performance jump, to the best of our knowledge, there is no known theoretical guarantee
for the convergence of an algorithm with BN. The difficulty of analyzing the convergence of the BN
algorithm comes from the fact that not all of the BN parameters are updated by gradients. Thus it
invalidates most of the classical studies of convergence for gradient methods.
In this paper, we propose a generalization of the BN algorithm, diminishing batch normalization
(DBN), where we update the BN parameters in a diminishing moving average way. It essentially
means that the BN layer adjusts its output according to all past mini-batches instead of only the
current one. It helps to reduce the problem of the original BN that the output of a BN layer on a
particular training pattern depends on the other patterns in the current mini-batch, which is pointed
out in Bottou et al. [6]. By setting the layer parameter we introduce into DBN to a specific value, we
recover the original BN algorithm.
We give a convergence analysis of the algorithm with a two-layer batch-normalized neural network
and diminishing stepsizes. We assume two layers (the generalization to multiple layers can be
done by using the same approach but substantially complicating the notation) and an arbitrary loss
function. The convergence analysis does require linear activations. The main result shows that under
diminishing stepsizes on gradient updates and updates on mini-batch statistics, and standard Lipschitz
conditions on loss functions DBN converges to a stationary point. As already pointed out the main
challenge is the fact that some trainable parameters are updated by gradient while others are updated
by a mere recalculation. We also study the convex case where we assume that the overall loss function
is convex (which holds if the regularization parameter is large). In this case, we show a sublinear
convergence rate towards the global minimum.
Contributions. The main contribution of this paper is in providing a general convergence guarantee
for DBN. Specifically, we make the following contributions.
• In section 4, we show the sufficient and necessary conditions for the stepsizes and diminish-
ing weights to ensure the convergence of BN parameters.
• We show that the algorithm converges to a stationary point under a general nonconvex
objective function.
• In section 5, we show that decreasing the stepsizes η (m) for trainable parameters sublinearly
(i.e., at rate O(1/m)) ensures sublinear convergence to the optimal value with a convex
objective function.
This paper is organized as follows. In Section 2, we review the related works and the development of
the BN algorithm. We formally state our model and algorithm in Section 3. We present our main
results in Sections 4 and 5. In Section 6, we numerically show that the DBN algorithm outperforms
the original BN algorithm. Proofs are collected in Appendixes A and B in supplemental material.
2 Literature Review
Before the introduction of BN, it has long been known in the deep learning community that input
whitening and decorrelation help to speed up the training process. In fact, Orr and Müller [16] show
that preprocessing the data by subtracting the mean, normalizing the variance, and decorrelating the
input has various beneficial effects for back-propagation. Krizhevsky et al. [14] propose a method
called local response normalization which is inspired by computational neuroscience and acts as a
form of lateral inhibition, i.e., the capacity of an excited neuron to reduce the activity of its neighbors.
Gülçehre and Bengio [10] propose a standardization layer that bears significant resemblance to
batch normalization, except that the two methods are motivated by very different goals and perform
different tasks.
Inspired by BN, there are several new works taking BN as a basis for further improvements. Layer
normalization [3] is much like the BN except that it uses all of the summed inputs to compute the
mean and variance instead of the mini-batch statistics. Besides, unlike BN, layer normalization
performs exactly the same computation at training and test times. Normalization propagation [2] uses
data-independent estimations for the mean and standard deviation in every layer to reduce the internal
covariate shift and make the estimation more accurate for the validation phase. Weight normalization
also removes the dependencies between the examples in a minibatch so that it can be applied to
recurrent models, reinforcement learning or generative models [19]. Cooijmans et al. [7] propose a
new way to apply batch normalization to RNN and LSTM models.
In view of all these flavors, the original BN method is the most popular technique and for this reason
our choice of the analysis. To the best of our knowledge, we are not aware of any prior analysis of
BN.
BN has gradient and non-gradient updates. Thus nonconvex convergence results do not immediately
transfer. Our analysis explicitly takes into account the workings of BN. However, nonconvex
convergence proofs are relevant since some small portions of our analysis rely on known proofs and
approaches.
Neural nets are not convex, even if the loss function is convex. For classical convergence results with
a nonconvex objective function and diminishing learning rate, we refer to survey papers [4, 5, 6].
Bertsekas and Tsitsiklis [5] provide a convergence result with deterministic gradient with errors.
Bottou et al. [6] provide a convergence result with stochastic gradient. The classic analyses showing
the norm of gradients of the objective function going to zero date back to [9, 18, 17]. For strongly
convex objective functions with a diminishing learning rate, we learn the classic convergence results
from [6].
2
3 Model and Algorithm
The optimization problem for a network is a objective function consisting of a large number of
component functions, that reads:
N
X
min f¯(θ, λ) = fi (Xi : θ, λ),
i=1
subject to θ ∈ P, λ ∈ Q,
where fi : Rn1 × Rn2 → R, i = 1, ..., N , are real-valued functions for any data record Xi . Index i
associates with data record Xi and target response yi (hidden behind the dependency of f on i) in the
training set. Parameters θ include the usual parameters updated by gradients directly associated with
the loss function, i.e. behind the part that we have a parametric model, while BN parameters λ are
introduced by the BN algorithm and not updated by gradient methods but by the mini-batch statistics.
We define that the derivative of fi is always taken with respect to θ:
∇fi (Xi : θ, λ) := ∇θ fi (Xi : θ, λ).
The deep network we analyze here has 2 fully-connected layers with D1 neurons each. Each hidden
layer computes y = kW u with slope k for linear activation, i.e. we assume that the activation
functions are linear, and u is the input vector of the layer. We do not need to include an intercept term
since the BN algorithm automatically adjusts for it. BN is applied to the output of the first hidden
layer.
We next describe the computation in each layer to show how we obtain the output of the network.
The notations introduced here is used in the analysis. Figure 1 shows the full structure of the network.
The input data is vector X, which is one of {Xi }N = (µj )D
i=1 . Vector λ
D
j=1 , (σj )j=1 is the set of all
(1) (1) D
BN parameters and vector θ = W1 , W2 , (βj )D j=1 , (γj )j=1 is the set of all trainable parameters
which are updated by gradients. Matrices W1 , W2 are the actual model parameters and β, γ are
introduced by BN. The j th entry of output of the first hidden layer is
(1)
zj (X : θ) = k (1) W1,j,· X,
where W1,j,· denotes the weights of linear transformations for the j th neuron and k (1) is the slope of
the linear activation function for the first hidden layer. The j th entry of batch-normalized output of
the first layer is
(1)
!
(1) (1) zj (X, θ) − µj (1)
yj (X : θ, λ) = γj + βj ,
σ j + B
(1) (1)
where βj and γj are trainable parameters updated by gradient and µj and σj are batch normal-
(1) (1)
ization parameters for zj . Trainable parameter µj is the mini-batch mean of zj and trainable
(1)
parameter σj is the mini-batch sample deviation of zj .
Constant B is a small offset term to keep
the denominator from zero. The output of j th entry of the output layer reads:
(2)
zj (X : θ, λ) = k (2) W2,j,· y (1) (X : θ, λ).
3
The objective function for the ith data record is
(2)
fi (Xi : θ, λ) = li zj (Xi : θ, λ) + c2 kθk22 ,
j
where li (·) is the loss function associated with the target response yi . We have the following complete
expression for the objective function for the ith data
D (1)
X (1) k W 1,j,· Xi − µj (1)
fi (Xi : θ, λ) = li (k (2) W2,k,j γj + βj )k + c2 kθk22 .
j=1
σ j + B
The objective function fi (Xi : θ, λ) is nonconvex with respect to θ and λ. However, when c2 is large
enough, the regularization term prevails and makes the objective function convex.
3.1 Algorithm
Algorithm 1 shows the algorithm studied herein. There are two deviations from the standard BN
algorithm, one of them actually being a generalization. We use the full gradient instead of the more
popular stochastic gradient (SG) method. It essentially means that each batch contains the entire
training set instead of a randomly chosen subset of the training set. An analysis of SG is potential
future research. Although the primary motivation for full gradient update is to reduce the burdensome
in showing the convergence, the full gradient method is similar to SG in the sense that both of them
go through the entire training set, while full gradient goes through it deterministically and the SG
goes through it in expectation. Therefore, it is reasonable to speculate that the SG method has similar
convergence property as the full algorithm studied herein.
The second difference is that we update the BN parameters (θ, λ) by their moving averages with
respect to diminishing α(m) . The original BN algorithm can be recovered by setting α(m) = 1
for every m. After introducing diminishing α(m) , λ(m) and hence the output of the BN layer is
determined by the history of all past data records, instead of those solely in the last batch. Thus the
output of the BN layer becomes more general that better reflects the distribution of the entire dataset.
We use two strategies to decide the values of α(m) . One is to use a constant smaller than 1 for all m,
the other one is to decay the α(m) gradually, such as α(m) = 1/m.
In our numerical experiment, we show that Algorithm 1 outperforms the original BN algorithm,
where both are based on SG and non-linear activation functions with many layers FNN and CNN
models.
4 General Case
The main purpose of our work is to show that Algorithm 1 converges. In the general case, we focus
on the nonconvex objective function.
4.1 Assumptions
4
Assumption 4.1 (Lipschitz continuity on θ and λ). For every i we have
k∇fi (X : θ̃, λ) − ∇fi (X : θ̂, λ)k2 ≤ L̄kθ̃ − θ̂k2 , ∀θ̃, θ̂, λ, X.
Noted that the Lipschitz constants associates with each of the above inequalities are not necessarily
the same. Here L̄ is an upper bound for these Lipschitz constants for simplicity.
Assumption 4.2 (bounded parameters). Sets P and Q are compact set, where θ ∈ P and λ ∈ Q.
Thus there exists constant M that weights W and parameters λ are bounded element-wise by this
constant M .
kW1 k M and kW2 k M and kλk M.
This also implies that the updated θ, λ in Algorithm 1 remain in P and Q, respectively.
Assumption 4.4 (Lipschitz continuity of li (·)). Assume the loss function li (·) for every i is continu-
ously differentiable. It implies that there exists M̂ such that
kli (x) − li (y)k ≤ M̂ kx − yk, ∀x, y.
Assumption 4.5 (existance of a stationary point). There exists a stationary point (θ∗ , λ∗ ) such that
k∇f¯(θ∗ , λ∗ )k = 0.
We note that all these are standard assumptions in convergence proofs. We also stress that Assumption
4.4 does not directly imply 4.1. Since we assume that P and Q are compact, then Assumptions 4.1,
4.4 and 4.5 hold for many standard loss function such as softmax and MSE.
We first have the following lemma specifying sufficient conditions for λ to converge. All the proofs
are given in Appendix A.
Theorem 4.6 Under Assumptions 4.1, 4.2 and 4.3, if {α(m) } satisfies
X∞ ∞ X
X m
α(m) < ∞ and α(m) η (n) < ∞,
m=1 m=1 n=1
then sequence {λ(m) } converges to λ̄.
We give a discussion of the above conditions for α(m) and η (m) in the end of this section. With the
help of Theorem 4.6, we can show the following convergence result.
Lemma 4.7 Under Assumptions 4.4, 4.5 and the assumptions of Theorem 4.6, when
∞ X
X ∞ X i X∞ X ∞
α(i) η (n) < ∞ and α(n) < ∞, (2)
m=1 i=m n=1 m=1 n=m
we have
M
X
lim sup η (m) k∇f¯(θ(m) , λ̄)k22 < ∞.
M →∞ m=1
5
This result is similar to the classical convergence rate analysis for the non-convex objective function
with diminishing stepsizes, which can be found in [6].
Lemma 4.8 Under the assumptions of Lemma 4.7, we have
lim inf k∇f¯(θ(m) , λ̄)k22 = 0.
m→∞
The statement of this theorem is that for the full gradient method with diminishing stepsizes the
gradient norms cannot stay bounded away from zero. The following result characterizes more
precisely the convergence property of Algorithm 1.
Lemma 4.9 Under the assumptions stated in Lemma 4.7, we have
lim k∇f¯(θ(m) , λ̄)k22 = 0.
m→∞
We cannot show that {θ(m) }’s converge (standard convergence proofs are also unable to show such a
stronger statement). For this reason Theorem 4.10 does not immediately follow from Lemma 4.9
together with Theorem 4.6. The statement of Theorem 4.10 would easily follow from Lemma 4.9 if
convergence of {θ(m) } is established and the gradient being continuous.
Considering the cases η (m) = O( m1k ) and α(m) = O( m1h ). We show in Appendix B that the set of
sufficient and necessary conditions to satisfy the assumptions of Theorem 4.6 are h > 1 and k ≥ 1.
The set of sufficient and necessary conditions to satisfy the assumptions of Lemma 4.7 are h > 2
1 1
and k ≥ 1. For example, we can pick η (m) = O( m ) and α(m) = O( m2.001 ) to achieve the above
convergence result in Theorem 4.10.
5 Convex Case
In this section, we discuss the convergence of a convex objective function. We stress that even if
li (·)’s are convex, the overall fi (·)’s are not convex. However, if the l2 penalty coefficient c2 is large
enough, then fi (·)’s become convex since the regularization term prevails. We have the following
extra assumption for the convexity of our model.
Assumption 5.1 (strong convexity). The objective function fi is strongly convex in θ for every i. It
implies that there exists constant c > 0 such that
1
fi (Xi : θ̃, λ) ≥ fi (Xi : θ̂, λ) + ∇fi (Xi : θ̂, λ)T (θ̃ − θ̂) + ckθ̃ − θ̂k22 , ∀θ̃, θ̂, i.
2
This inequality is a common result for strongly convex functions. Unfortunately, this convexity
assumption is not met by most popular neural network models unless c2 is large. In this section, we
assume that c2 is large enough to force such strong convexity.
Here we perform the convergence analysis with the assumption that f (θ, λ) is jointly convex in θ
for any fixed λ (which is implied by convexity of each fi (Xi : θ, λ)). The main result establishes a
1
sublinear convergence rate if η (m) are selected as O( m ).
Theorem 5.2 Under Assumptions 4.1, 4.2, 4.3 and 5.1, suppose the stepsizes satisfy that, for
∞
ζ 1 X
all m, η (m) = for some ζ > and ϑ > 0. If {α(m) } satisfies α(m) ln(m) <
ϑ+m cµ m=1
v
∞, then the optimality gap satisfies f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄) ≤ , where v :=
2 ϑ+m
ζ (N L̄M + M1 )
max , (ϑ + 1)[f¯(θ(1) , λ(1) ) − f¯(θ∗ , λ̄)] .
2(ζcc1 − 1)
6
Note that θ∗ is the optimal θ value with respect to the fixed λ̄. Point (θ∗ , λ̄) is the point where f¯
attains its minimum. We learn from Theorem 5.2 that decreasing the stepsizes η (m) sublinearly (i.e.
at a rate O(1/m)) ensures the sublinear convergence to the optimal value. We also observe that the
initial point determines the initial optimality gap, which appears prominently in the second term
defining v. With an appropriate initialization phase, we can easily diminish the role played by this
term. Related discussions about the choice of constant parameters to ensure a sublinear convergence
rate and the initialization can be found in Bottou et al. [6].
1
Considering the cases η (m) = O( m ) and α(m) = O( m1h ). We show in Appendix B that the set of
sufficient and necessary condition to satisfy the assumptions of Theorem 5.2 is h > 1. For example,
1 1
we can pick η (m) = O( m ) and α(m) = O( m1.001 ) to achieve the above convergence result.
6 Computational Experiments
We conduct the computational experiments with Theano and Lasagne on a Linux server with an
Nvidia Titan-X GPU. We use MNIST [15], CIFAR-10 [13] and Network Intrusion (NI) [1] datasets
to compare the performance between DBN and the original BN algorithm. For the MNIST dataset,
we use a four-layer fully connected FNN (784 × 300 × 300 × 10) with the ReLU activation function
and for the NI dataset, we use a four-layer fully connected FNN (784 × 50 × 50 × 10) with the ReLU
activation function. For the CIFAR-10 dataset, we use a reasonably complex CNN network that has a
structure of (Conv-Conv-MaxPool-Dropout-Conv-Conv-MaxPool-Dropout-FC-Dropout-FC), where
all four convolution layers and the first fully connected layers are batch normalized. We use the
softmax loss function and l2 regularization with for all three models. All the trainable parameters are
randomly initialized before training. For all 3 datasets, we use the standard epoch/minibatch setting
with the minibatch size of 100, i.e., we do not compute the full gradient and the statistics are over the
minibatch. We use AdaGrad [8] to update the learning rates η (m) for trainable parameters, starting
from 0.01.
We use two different strategies to decide the values of α(m) in DBN: constant values of α(m)
and diminishing α(m) where α(m) = 1/m and α(m) = 1/m2 . We test the choices of constant
α(m) ∈ {1, 0.75, 0.5, 0.25, 0.1, 0.01, 0.001, 0}.
Figure 2: Comparison of predicted accuracy on test datasets for different choices of α(m) . From left to right are
FNN on MNIST, FNN on NI and CNN on CIFAR-10.
Figure 3: Comparison of predicted accuracy on test datasets for the most efficient choices of α(m) . From left to
right are FNN on MNIST, FNN on NI and CNN on CIFAR-10.
7
(a) (b) (c)
Figure 4: Comparison of the convergence of the loss function value on validation set for different choices of
α(m) . From left to right are FNN on MNIST, FNN on NI and CNN on CIFAR-10.
We test all the choices of α(m) with the performances presented in Figure 2. Figure 2 shows that all
the non-zero choices of α(m) converge properly. The algorithms converge without much difference
even when α(m) in DBN is very small, e.g., 1/m2 . However, if we select α(m) = 0, the algorithm is
erratic. Besides, we observe that all the non-zero choices of α(m) converge at a similar rate. The fact
that DBN keeps the batch normalization layer stable with a very small α(m) suggests that the BN
parameters do not have to be depended on the latest minibatch, i.e. the original BN.
We compare a selected sets of the most efficient choices of α(m) in Figures 3 and 4. They show that
DBN with α(m) < 1 is more stable than the original BN algorithm. The variances with respect to
epochs of the DBN algorithm are obviously smaller than those of the original BN algorithm’s in each
figure.
Table 1: Best results for different choices of α(m) on each dataset, showing top three with heat map.
MNIST (200 epochs) NI (100 epochs) CIFAR-10 (50 epochs)
Model Epoch Test Error Epoch Test Error Epoch Test Error
Table 1 shows the best result obtained from each choice of α(m) . Most importantly, it suggests that
the choices of α(m) = 1/m and 1/m2 perform better than the original BN algorithm. Besides, all the
constant less-than-one choices of α(m) perform better than the original BN, showing the importance
of taking the mini-batch history into consideration for the update of the BN parameters. The BN
algorithm in each figure converges to similar error rates on test datasets with different choices of
α(m) except for the α(m) = 0 case. Among all the models we tested, α(m) = 0.25 is the only one
that performs top 3 for all three datasets, thus the most robust choice.
To summarize, our numerical experiments show that the DBN algorithm outperforms the original BN
algorithm on the MNIST, NI and CIFAT-10 datasets with typical deep FNN and CNN models.
Future Directions. On the analytical side, we believe an extension to more than 2 layers is doable
with significant augmentations of the notation. A stochastic gradient version is likely to be much
more challenging to analyze. A second open question concerns more general activation functions.
Instead of the linear activation function assumed in this paper, it would be interesting to analyze more
common choices such as Sigmoid, ReLU, and Maxout.
8
References
[1] KDD Cup 1999 Data, 1999. URL http://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data.
[2] Devansh Arpit, Yingbo Zhou, Bhargava U. Kota, and Venu Govindaraju. Normalization Propagation:
A Parametric Technique for Removing Internal Covariate Shift in Deep Networks. In International
Conference on Machine Learning, volume 48, page 11, 2016.
[3] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer Normalization. arXiv preprint
arXiv:1607.06450, 2016.
[4] Dimitri P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A
Survey. Optimization for Machine Learning, 2010(3):1–38, 2011.
[5] Dimitri P. Bertsekas and John N. Tsitsiklis. Gradient Convergence in Gradient Methods with Errors. SIAM
Journal on Optimization, 10:627–642, 2000.
[6] Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning.
arXiv preprint arXiv:1606.04838, 2016.
[7] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron Courville. Recurrent Batch Normalization.
arXiv preprint arXiv:1603.09025, 2016.
[8] Yoram Duchi, John and Hazan, Elad and Singer. Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
[9] L. Grippo. A Class of Unconstrained Minimization Methods for Neural Network Training. Optimization
Methods and Software, 4(2):135–150, 1994.
[10] Çaglar Gülçehre and Yoshua Bengio. Knowledge Matters: Importance of Prior Information for Optimiza-
tion. Journal of Machine Learning Research, 17(8):1–32, 2016.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition.
In Computer Vision and Pattern Recognition, pages 770–778, dec 2016.
[12] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by
Reducing Internal Covariate Shift. In International Conference on Machine Learning, pages 448–456,
2015.
[13] Alex Krizhevsky and Geoffrey E. Hinton. Learning Multiple Layers of Features from Tiny Images. PhD
thesis, 2009.
[14] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet Classification with Deep Convolutional
Neural Networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[15] Yann LeCun, Léon Bottou, and Yoshua Bengio. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[16] Genevieve B. Orr and Klaus-Robert Müller. Neural Networks: Tricks of the Trade. Springer, New York,
2003.
[17] B. T. Polyak. Introduction to optimization. Translations series in mathematics and engineering. Optimiza-
tion Software, 1987.
[18] B. T. Polyak and Y. Z. Tsypkin. Pseudogradient Adaption and Training Algorithms. Automation and
Remote Control, 34:45–67, 1973.
[19] Tim Salimans and Diederik P. Kingma. Weight Normalization: A Simple Reparameterization to Accelerate
Training of Deep Neural Networks. In Advances in Neural Information Processing Systems, pages 901–901,
2016.
9
Supplementary material for “Convergence Analysis of Batch Normalization
for Deep Neural Nets”
7 Appendix A: Proofs
7.1 Nomenclature
Symbol Explanation
L̄ The Lipschitz constant
η Learning rate for gradient algorithm
α Weight parameter for batch-normalization update
~x The vector for one record of input data
yj The target output vector for the input record ~xj
N Size of training set, number of all ~x in one epoch
D1 Dimension of the hidden layer
(1) (2)
zj , zj j th entry of output of the first and second hidden layer, respectively
n1 , n2 Numbers of parameters in θ and λ, respectively
θ Set of all trainable parameters updated by its gradient
W (1) , W (2) ∈ θ, weights of linear transformation between layers
(1) (1) (1)
γj , βj ∈ θ, trainable parameters for batch-normalized output yj
λ Set of all batch normalization parameters determined by previous updates
(1)
µj ∈ λ, mean of previous values of zj
(1)
σj ∈ λ, standard deviation of previous values of zj
B The offset for batch normalization transformation
Proposition 7.1 There exists a constant M such that, for any θ and fixed λ, we have
k∇f¯(θ, λ)k22 ≤ M.
Proof. By Assumption 4.5, we know there exists (θ∗ , λ∗ ) such that k∇f¯(θ∗ , λ∗ )k2 = 0. Then we
have
k∇f¯(θ, λ)k2
=k∇f¯(θ, λ)k2 − k∇f¯(θ∗ , λ∗ )k2
=k∇f¯(θ, λ)k2 − k∇f¯(θ, λ∗ )k2 + k∇f¯(θ, λ∗ )k2 − k∇f¯(θ∗ , λ∗ )k2
≤k∇f¯(θ, λ) − ∇f¯(θ, λ∗ )k2 + k∇f¯(θ, λ∗ ) − ∇f¯(θ∗ , λ∗ )k2
N
X N
X
∗
≤ k∇fi (Xi : θ, λ) − ∇fi (Xi : θ, λ )k2 + k∇fi (Xi : θ, λ∗ ) − ∇fi (Xi : θ∗ , λ∗ )k2
i=1 i=1
∗ ∗
≤N L̄(kλ − λ k2 + kθ − θ k2 ),
where the last inequality is by Assumption 4.1. We then have
k∇f¯(θ, λ)k22 ≤ N 2 L̄2 (kλ − λ∗ k2 + kθ − θ∗ k2 )2 ≤ M,
because sets P and Q are compact by Assumption 4.2.
Proof. This is a known result of the Lipschitz-continuous condition that can be found in [6]. We have
this result together with Assumption 4.1.
10
7.3 Proof of Theorem 4.6
P∞ P∞ Pm
Lemma 7.3 When m=1 α(m) < ∞ and m=1 n=1 α(m) η (n) < ∞,
(m)
(m) µj
µ̃j := is a Cauchy series.
(1 − α(1) )(1 − α(2) )...(1 − α(m) )
Then we have
N
(m+1) (m) (m) (1) 1 X (m)
|µ̃j − µ̃j | = α̃ |k | |W1,j,· Xi |
N i=1
N m
1 XX (n)
= α̃(m) |k (1) | | ∆W1,j,· Xi | (4)
N i=1 n=1
N m N
!
(m) (1) 1 X X X
= α̃ |k | η (n) ∇W1,j,· fl (Xl : θ(n) , λ(n) ) · Xi
N i=1 n=1
l=1
N m N
! !
1 XX X
= α̃(m) |k (1) | η (n) ∇W1,j,· fl (Xl : θ(n) , λ(n) ) · Xi
N i=1 n=1
l=1
N m N
!
(m) (1) 1 XX (n)
X
(n) (n)
≤ α̃ |k | η k ∇W1,j,· fl (Xl : θ , λ )k · kXi k (5)
N i=1 n=1
l=1
1
=α̃(m) |k (1) | ·
N
N X
m N h
!
X X i
η (n) k ∇W1,j,· fl (Xl : θ(n) , λ(n) ) − ∇W1,j,· f¯(Xl : θ∗ , λ∗ ) k · kXi k
i=1 n=1 l=1
N m N
1 X X (n) X
≤ α̃(m) |k (1) | η ( [k∇W1,j,· fl (Xl : θ(n) , λ(n) ) − ∇W1,j,· fl (Xl : θ∗ , λ(n) )k2 +
N i=1 n=1
l=1
11
Equation (4) is due to
m
(m) (n)
X
W1,i,j = ∆W1,i,j .
n=1
Therefore,
q m
(p) (q)
X X
|µ̃j − µ̃j | ≤ M̃L̄,M · α̃(m) η (n)
m=p n=1
q m q Xm
(7)
X X X
(m) (n) (m) (n)
= M̃L̄,M · α̃ η = M̃L̄,M · α̃ η .
m=p n=1 m=p n=1
since
∞
X ∞
X
ln(Π∞
m=1 (1 − α
(m)
)) = ln(1 − α(m) ) > −α(m) > −∞.
m=1 m=1
It is also easy to show that there exists C and Mc such that for all m ≥ Mc , we have
(1 − α(1) )(1 − α(2) ) . . . (1 − α(m) ) ≥ C. (10)
Therefore,
lim (1 − α(1) )(1 − α(2) ) . . . (1 − α(m) ) ≥ C.
m→∞
(m)
From (9) and (12) it follows that the sequence {µ̃j } is a Cauchy series.
(m) (m)
Lemma 7.4 Since {µ̃j } is a Cauchy series, {µj } is a Cauchy series.
Proof. We know that
(m) (m)
µj = µ̃j (1 − α(1) )...(1 − α(m) ).
Since
(m)
lim µ̃j → µ̃j
m→∞
and
lim (1 − α(1) )...(1 − α(m) ) → C̃,
m→∞
we have
(m)
lim µj → µ̃j · C̃.
m→∞
(m)
Thus µj is a Cauchy series.
12
P∞ P∞ Pm (m)
Lemma 7.5 If m=1 α(m) < ∞ and m=1 n=1 α(m) η (n) < ∞, {σj } is a Cauchy series.
(m) (m)
Proof. We define σj := σ̃j (1 − α(1) )...(1 − α(m) ). Then we have
v
u N 2
(m) t 1
(m+1) (m)
u X (m) (m)
|σ̃j − σ̃j | = α̃ k (1) W1,j,· Xi − µj
N i=1
v
uN 2
(m) 1 t
uX (m) (m)
= α̃ √ k (1) W1,j,· Xi − µj
N i=1
v
uN (m)
!2
(1) µj
(m) |k |u
tX (m)
= α̃ √ W1,j,· Xi − .
N i=1
k (1)
(m)
Since {µj } is convergent, there exists c1 , c2 and N1 such that for any m > N1 , −∞ < c1 <
(m)
µj < c2 < ∞. Therefore,
(m+1) (m)
|σ̃j − σ̃j |
vuN
v
uN
|k | (1)
uX (m) c 1
2 uX (m) c2
2 (13)
≤α̃(m) √ · max t W1,j,· Xi − (1) , t W1,j,· Xi − (1) .
N
i=1
k i=1
k
n c c2 o
1
For any C̄ ∈ , , we have
k (1) k (1)
v
uN
(1)
(m+1) (m) |k | uX (m) 2
|σ̃j − σ̃j | ≤ α̃(m) √ · t W1,j,· Xi − C̄ (14)
N i=1
v
uN
(1)
(m) |k | u 2
(m)
X
≤ α̃ √ ·t |W1,j,· Xi | + |C̄| (15)
N i=1
v ! !2
uN m
(1)
(m) |k | u (n)
X X
= α̃ √ · t | ∆W1,j,· Xi | + |C̄|
N i=1 n=1
v
uN m N
! !2
(1)
(m) |k | uX X X
= α̃ √ · t | (n)
η · (n) (n)
∇W1,j,· fl (Xl : θ , λ ) · Xi | + |C̄|
N i=1 n=1 l=1
v
uN m N
! !2
(1)
(m) |k | uX X X
≤ α̃ √ · t (n)
η ·| (n) (n)
∇W1,j,· fl (Xl : θ , λ ) · Xi | + |C̄| (16)
N i=1 n=1 l=1
v
uN m N
!2
(1)
|k | uX X X
(17)
≤ α̃(m) √ · t η (n) k ∇W1,j,· fl (Xl : θ(n) , λ(n) )k · kXi k + |C̄|
N i=1 n=1 l=1
v !2
uN m
(1)
(m) |k | uX X
(18)
≤ α̃ √ · t η (n) 2N L̄M kXi k2 + |C̄|
N i=1 n=1
13
v !2
m
u
(1)
(m) |k | u X
≤ α̃ √ · tN · M̃L̄,M η (n) + |C̄| (19)
N n=1
v !2
m
u
u X
(m) (1)
= α̃ |k | · t M̃L̄,M η (n) + |C̄|
n=1
m
!
X
= α̃(m) |k (1) | · M̃L̄,M η (n) + |C̄| . (20)
n=1
n c c2 o
1
Inequality (14) is by plugging C̄ ∈ , into (13). Inequality (15) is by the following fact:
k (1) k (1)
v v v v
u n u n u n u n
uX uX uX uX
t (ai − c)2 ≤ max t (|ai | − c)2 , t (|ai | + c)2 = t (|ai | + |c|)2 , (21)
i=1 i=1 i=1 i=1
where b and ai for every i are arbitrary real scalars. Besides, (21) is due to
−2ai c ≤ max{−2|ai |c, 2|ai |c}.
Inequalities (16), (17) and (18) follow from the square function being increasing for nonnegative
numbers. Besides these facts, (18) is also by the same techniques we used in (5)-(6) where we bound
the derivatives with the Lipschitz continuity in the following inequality:
XN
k ∇W1,j,· fl (Xl : θ(n) , λ(n) )k ≤ 2N L̄M.
l=1
Inequality (19) is by collecting the bounded terms into a single bound M̃L̄,M . Therefore,
q−1 q−1 m
!
(q) (p) (m+1) (m)
X X X
(m) (1) (n)
|σ̃j − σ̃j | ≤ |σ̃j − σ̃j | ≤ α̃ |k | · M̃L̄,M η + |C̄| . (22)
m=p m=p n=1
Using the similar methods in deriving (8) and (9), it can be seen that a set of sufficient conditions
(m)
ensuring the convergence for {σ̃j } is:
∞
X
α(m) < ∞,
m=1
∞ X
X m
α(m) η (n) < ∞.
m=1 n=1
(m) (m)
Therefore, the convergence conditions for {σj } are the same as for {µj }.
It is clear that these lemmas establish the proof of Theorem 4.6.
where
∞ X
X i ∞
X
(i) (j)
am = M1 α η + M2 α(i) (23)
i=m j=1 i=m
and M1 and M2 are constants.
14
(m)
Proof. For the upper bound of σj , by (20), we have
q−1 m
!
(q) (p)
X X
(m) (1) (n)
|σ̃j − σ̃j | ≤ α̃ |k | M̃L̄,M η + |C̄| .
m=p n=1
σ̄j
We define˜
¯σj := . Therefore,
(1 − α(1) )...(1 − α(u) )...
∞ i
(m)
X X
|˜
¯σj − σ̃j | ≤ α̃(i) |k (1) | M̃L̄,M η (j) + |C̄|
i=m j=1
(24)
(1) ∞ i
|k |X X
≤ α(i) M̃L̄,M η (j) + |C̄| .
C i=m j=1
The first inequality comes by substituting p by m and by taking lim as q → ∞ in (22). The second
inequality comes from (10). We then obtain,
(m)
σj − σ̄j
(m)
σj σ̄j
=(1 − α(1) )...(1 − α(m) ) −
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(m) )
(m)
(1) (m)
σj σ̄j
≤(1 − α )...(1 − α )[ − +
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(u) )...
σ̄j σ̄j
− ]
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(u) )...
(m) (∞) σ̄j (∞) (25)
≤ σ̃j − σ̃j + − σ̃j
(1 − α(1) )...(1 − α(m) )
(m) (∞) σ̄j σ̄j
= σ̃j − σ̃j + −
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(u) )...
(m) (∞) (1 − α(m+1) )...(1 − α(u) )... − 1
= σ̃j − σ̃j + σ̄j
(1 − α(1) )...(1 − α(u) )...
(m) (∞) σ̄j
≤ σ̃j − σ̃j + |1 − (1 − α(m+1) )...(1 − α(u) )...|
C
∞
(m) (∞) σ̄j X
≤ σ̃j − σ̃j + α(n) .
C n=m+1
The second inequality is by (1 − α(1) )...(1 − α(m) ) < 1, the third inequality is by (10) and the last
inequality can be easily seen by induction. By (25), we obtain
∞
(m) (M ) (m) (m) σ¯j X
|σ̄j − σj | = lim |σj − σj | ≤ |σ̃j − σ̃j |+ α(n) . (26)
M →∞ C n=m+1
15
Therefore, we have
(m)
|σ̄j − σj |
∞
(m) σ̄j X
≤|˜
¯σj − σ̃j | + α(n)
C n=m+1
∞ i ∞
X
(i) (1)
X
(j) σ̄j X (i)
≤ α̃ |k | · M̃L̄,M η + |C̄| + α
i=m j=1
C i=m+1
∞ i ∞
X 1 (i) (1) X
(j) σ̄j X (i) (27)
≤ α |k | · M̃L̄,M η + |C̄| + α
i=m
C j=1
C i=m+1
∞ i ∞
X 1 (i) (1) X σ̄j X (i)
≤ α |k | · M̃L̄,M η (j) + |C̄| + α
i=m
C j=1
C i=m
∞ X i ∞
M̃L̄,M |k (1) | X |k (1) ||C̄| X (i)
(i) (j) σ̄j
= α η + + α .
C i=m j=1
C C i=m
The first inequality is by (26), the second inequality is by (22), the third inequality is by (11) and the
σ̄j
fourth inequality is by adding the nonnegative term α(m) to the right-hand side.
C
(m)
For the upper bound of µj , we have
(m)
µj − µ̄j
(m)
µj µ̄j
=(1 − α(1) )...(1 − α(m) ) −
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(m) )
(m)
(1) (m)
µj µ̄j
≤(1 − α )...(1 − α )[ − + (28)
(1 − α(1) )...(1 − α(m) ) (1 − (1)
α )...(1 − α(∞) )
µ̄j µ̄j
− ]
(1 − α(1) )...(1 − α(m) ) (1 − α(1) )...(1 − α(∞) )
µ̄j
≤ µ̃(m) − µ̃(∞) + − µ̃(∞) .
(1 − α(1) )...(1 − α(m) )
µ̄j
Let us define Am := µ̃(m) − µ̃(∞) and Bm := − µ̃(∞) . Recall from
(1 − α(1) )...(1 − α(m) )
(m)
Theorem 4.6 that {µj } is a Cauchy series, by (7),
q X
m
(p) (q)
X
|µ̃j − µ̃j | ≤ M̄L̄,M · α(m) η (n) .
m=p n=1
Therefore, the first term in (28) is bounded by
∞ X
i
(m)
X
|µ̃j − µ̃∞
j | ≤ M̃L̄,M · α(i) η (n) < ∞. (29)
i=m n=1
For the second term in (28), recall that C := (1 − α(1) )...(1 − α(u) ).... Then we have
µ̄j
C· − µ̃(∞)
(1 − α(1) )...(1 − α(m) )
=µ̄j |1 − (1 − α(m+1) )...(1 − α(u) )...|
X∞
≤µ̄j α(i) ,
i=m+1
16
where the last inequality can be easily seen by induction. Therefore, the second term in (28) is
bounded by
∞
µ̄j (∞) µ̄j X (i)
− µ̃ ≤ α . (30)
(1 − α(1) )...(1 − α(m) ) C i=m+1
From these we obtain
(m)
µj − µ̄j
µ̄j
≤ µ̃(m) − µ̃(∞) + (1)
− µ̃(∞)
(1 − α )...(1 − α(m) ) (31)
∞ X
i ∞
X µ̄j X
≤M̃L̄,M α(i) η (n) + α(i) .
i=m n=1
C i=m+1
The first inequality is by (28) and the second inequality is by (29) and (30). Combining (27) and (31),
we have that
X∞ X i ∞
X
|λ(m) − λ̄|∞ = max(|µ(m) − µ̄|∞ , |σ (m) − σ̄|∞ ) ≤ M1 α(i) η (j) + M2 α(i) ,
i=m j=1 i=m
It remains to show
X (m) (m)
X (m) 2 √
− yi xi ≤− xi + L̄M n2 am , ∀i, m. (33)
i i
This is established by the following four cases.
(m) (m) (m) (m) √ (m) (m) (m) (m) 2
1) If xi ≥ 0, xi − yi ≥ 0, then xi ≤ L̄ n2 am + yi . Thus −xi yi ≤ −xi +
√
L̄M n2 am by Proposition 7.1.
(m) (m) (m) (m) (m) (m) 2 (m) (m) (m) (m) (m) 2
2) If xi ≥ 0, xi −yi ≤ 0, then xi ≤ yi , xi ≤ xi ·yi and −xi yi ≤ −xi .
(m) (m) (m) (m) (m) (m) 2 (m) (m) (m) (m) (m) 2
3) If xi < 0, xi −yi ≥ 0, then xi ≥ yi , xi ≤ xi ·yi and −xi yi ≤ −xi .
(m) (m) (m) (m) (m) √ (m) (m) (m) 2 √ (m)
4) If xi < 0, xi −yi ≤ 0, then yi −xi ≤ L̄ n2 am , yi xi −xi ≥ L̄ n2 am xi
(m) (m) (m) 2 √ (m) (m) 2 √
and −yi xi ≤ −xi − L̄ n2 am xi ≤ −xi + L̄M n2 am . The last inequality is by
Proposition 7.1.
All these four cases yield (33).
17
Proposition 7.8 Under the assumptions of Theorem 4.6, we have
√ 1
f¯(θ(m+1) , λ̄) ≤ f¯(θ(m) , λ̄) − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M,
2
where M is a constant and am is defined in Proposition 7.6.
By (35) we have θ̃ − θ̂ = θ(m+1) − θ(m) = −η (m) ∇f¯(θ(m) , λ(m) ). We now substitute θ̃ := θ(m+1) ,
θ̂ := θ(m) and λ := λ̄ into (34) to obtain
f¯(θ(m+1) , λ̄)
N L̄
≤ f¯(θ(m) , λ̄) − η (m) ∇f¯(θ(m) , λ̄)T ∇f¯(θ(m) , λ(m) ) + (η (m) )2 · k∇f¯(θ(m) , λ(m) )k22
2
N L̄M
≤ f¯(θ(m) , λ̄) − η (m) ∇f¯(θ(m) , λ̄)T ∇f¯(θ(m) , λ(m) ) + (η (m) )2 · (36)
2
√ 1
≤ f¯(θ(m) , λ̄) + η (m) −k∇f¯(θ(m) , λ̄)k22 + L̄M n2 am + (η (m) )2 · N L̄M
2
√ 1
= f¯(θ(m) , λ̄) − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M.
2
The first inequality is by plugging (35) into (34), the second inequality comes from Proposition 7.1
and the third inequality comes from Proposition 7.7.
Here we show Theorem 4.10 as the consequence of Theorem 4.6 and Lemmas 4.7, 4.8 and 4.9.
18
Proof. By plugging (26) and (24) into (37), we have the following for all j:
∞
(m)
X
σ̄j − σj
m=1
∞ ∞
!
X (m) σ̄j X (n)
≤ |˜
¯σj − σ̃j |
+ α
m=1
C n=m+1
∞
∞
i
∞
(38)
(1) X
X |k | α(i) M̃L̄,M
X σ̄j X
≤ η (j) + |C̄| + α (n)
m=1
C i=m j=1
C n=m+1
∞ X ∞ i ∞ ∞
|k (1) | · M̃L̄,M X X σ̄j + |k (1) ||C̄| X X
≤ α(i) η (j) + α(n) .
C m=1 i=m j=1
C m=1 n=m+1
It is easy to see that the the following conditions are sufficient for right-hand side of (38) to be finite:
∞ X
X ∞ X
i
α(i) η (n) < ∞
m=1 i=m n=1
and
∞ X
X ∞
α(n) < ∞.
m=1 n=m
Therefore, we obtain
∞
(m)
X
|σ¯j − σj | < ∞, ∀j.
m=1
∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞
m=1 i=m n=1 m=1 n=m
M
X
lim sup f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄) < ∞.
M →∞ m=1
D
X
kli (x) − li (y)k ≤ M̂ kx − yk ≤ M̂ |xi − yi |. (39)
i=1
19
By the definition of fi (·), we then have
X∞
f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄)
m=1
∞ X
X N XN
= li (Xi : θ(m) , λ(m) ) + c2 kθ(m) k22 − li (Xi : θ(m) , λ̄) + c2 kθ(m) k22
m=1 i=1 i=1
∞
X N
X
= li (Xi : θ(m) , λ(m) ) − li (Xi : θ(m) , λ̄)
m=1 i=1
∞ X
X N
≤ li (Xi : θ(m) , λ(m) ) − li (Xi : θ(m) , λ̄)
m=1 i=1
∞ X
D X
N (m) (m) (m)
X k (1) W1,j,· Xi − µj k (1) W1,j,· Xi − µ̄j
≤M2 (m)
−
m=1 j=1 i=1 σj + B σ̄j + B
∞ X (m)
D X
N
! !
X (m) 1 1 µ̄j µj
=M2 (k (1) W1,j,· Xi ) (m)
− + − (m)
m=1 j=1 i=1 σ j + B σ̄j + B σ̄j + B σ j + B
∞ X (m)
D X
N
! !
X (m) 1 1 µ̄j µj
≤M2 (k (1) W1,j,· Xi ) (m)
− + − (m)
m=1 j=1 i=1 σ j + B σ̄j + B σ̄j + B σ j + B
∞ X (m)
D N N
!
X X (m) 1 1 X µ̄j µj
≤M3 |(k (1) W1,j,· Xi )| (m)
− + − (m)
m=1 j=1 i=1 σ j + B σ̄j + B i=1
σ̄j + B σ j + B
∞ X (m) (m)
D N
!
X X (m) σ̄j − σj µ̄j µj
=M3 |(k (1) W1,j,· Xi )| (m)
+N − (m)
m=1 j=1 i=1 (σj + B )(σ̄j + B ) σ̄j + B σ j + B
∞ X (m) (m)
D N
!
X X (m) σ̄j − σj µ̄j µj
≤M3 |k (1) ||W1,j,· Xi | + N − . (40)
m=1 j=1 i=1
2B σ̄j + B σ
(m)
+ B
j
The first inequality is by the Cauchy-Schwarz inequality, and the second one is by (39). To show the
finiteness of (40), we only need to show the following two statements:
∞ X
N (m)
X (m) σ̄j − σj
|k (1) ||W1,j,· Xi | < ∞, ∀j (41)
m=1 i=1
2B
and
∞ (m)
X µ̄j µj
− (m) < ∞, ∀j. (42)
m=1
σ̄j + B σj + B
Proof of (41): For all j we have
∞ X
N (m)
X (m) σ̄j − σj
|k (1) ||W1,j,· Xi |
m=1 i=1
2B
∞
X
(1) 1 (m)
≤ |k |N DM maxkXi k σ̄j − σj (43)
m=1
i 2B
∞
1 X (m)
=|k (1) |N DM maxkXi k σ̄j − σj .
i 2B m=1
(m)
The inequality comes from |W1,j,· Xi | ≤ DM kXi k2 , where D is the dimension of Xi and M is the
(m)
element-wise upper bound for W1,j,· in Assumption 4.2.
20
P∞ (m)
Finally, we invoke Lemma 7.3 to assert that m=1 σ̄j − σj is finite.
Proof of (42): For all j we have
∞ (m)
X µ̄j µj
− (m)
m=1
σ̄j + B σ j + B
(m) (m) (m)
(44)
∞ ∞
X µ̄j µj X µj µj
≤ − + − (m) .
m=1
σ̄j + B σ̄j + B m=1
σ̄j + B σj + B
(m)
The first term in (44) is finite since {µj } is a Cauchy series. For the second term, we know that
(m) (m)
there exists a constant M such that for all m ≥ M , µj ≤ µ̄ + 1. This is also by the fact that {µj }
is a Cauchy series and it converges to µ̄. Therefore, the second term in (44) becomes
M −1 (m) (m) ∞ (m) (m)
X µj µj X µj µj
− (m) + − (m)
m=1
σ̄j + B σ j + B σ̄j + B σ j + B
m=M
(m) (m)
(45)
M −1 ∞
X µj µj X 1 1
≤ − (m) + (µ̄ + 1) − (m) .
m=1
σ̄j + B σ j + B σ̄j + B σ j + B
m=M
1 1
Noted that function f (σ) = is Lipschitz continuous since its gradient is bounded by 2 .
σ + B B
1
Therefore we can choose 2 as the Lipschitz constant for f (σ). We then have the following
B
inequality:
1 1 1 (m)
− (m) ≤ 2 |σ̄j − σj |. (46)
σ̄j + B σ j + B B
By (41) and (42), we have that the right-hand side of (40) is finite. It means that the left-hand side of
(40) is finite. Thus,
∞
X
f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄) < ∞.
m=1
Lemma 7.11 If
∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞,
m=1 i=m n=1 m=1 n=m
then
M
X
lim sup η (m) k∇f¯(θ(m) , λ̄)k22 < ∞.
M →∞ m=1
21
Proof. For simplicity of the proof, we define
M
X
T (M ) := η (m) k∇f¯(θ(m) , λ̄)k22 ,
m=1
By Proposition 7.8,
(m) √ 1
∆2 ≤ −η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M. (48)
2
We sum the inequality (47) from 1 to K with respect to m and plug (48) into it to obtain
K K K K
(m+1) (m)
X X X X
O(m) ≤ |∆1 |+ |∆1 | − {η (m) k∇f¯(θ(m) , λ̄)k22 }
m=1 m=1 m=1 m=1
K K
X √ X 1
+ η (m) L̄M n2 am + { (η (m) )2 N L̄M }
m=1 m=1
2
K K
(m+1) (m)
X X
= |∆1 |+ |∆1 | − T (K)
m=1 m=1
K K
√ X X 1
+ L̄2 n2 · η (m) am + { (η (m) )2 N L̄M }.
m=1 m=1
2
From this, we have:
−1 ¯ (K) (K)
lim sup T (K) ≤ lim sup (f (θ , λ ) − f¯(θ(1) , λ(1) ))
K→∞ K→∞ c1
K
1 X (m+1) (m)
+ lim sup (|∆1 | + |∆1 |)
K→∞ c1 m=1
K (49)
√ X (m)
+ lim sup L̄2 n2 η am
K→∞ m=1
K
N L̄K X (m) 2
+ lim sup η .
K→∞ 2c1 m=1
Next we show that each of the four terms in the right-hand side of (49) is finite, respectively. For the
first term,
−1 ¯ (K) (K)
lim sup (f (θ , λ ) − f¯(θ(1) , λ(1) )) < ∞
K→∞ c1
is by the fact that the parameters {θ, λ} are in compact sets, which implies that the image of fi (·) is
in a bounded set.
For the second term, we showed its finiteness in Lemma 7.10.
22
For the third term, by (23), we have
K
X
lim sup η (m) am
K→∞ m=1
K
X ∞ X
X i ∞
X
= lim sup η (m) K1 α(i) η (j) + K2 α(i) (50)
K→∞ m=1 i=m j=1 i=m
K
X ∞ X
X i K
X ∞
X
(m) (i) (j) (m)
=K1 lim sup η α η + K2 lim sup η α(i) .
K→∞ m=1 i=m j=1 K→∞ m=1 i=m
and
∞
X ∞
X ∞ X
X ∞
η (m) α(i) < α(i) < ∞. (52)
m=1 i=m m=1 i=m
The second inequalities in (51) and (52) come from the stated assumptions of this lemma.
For the fourth term,
K
N L̄M X (m) 2
lim sup η <∞
K→∞ 2c m=1
P∞ (m) 2
holds, because we have m=1 (η ) < ∞ in Assumption 4.3. Therefore, T (∞) =
P∞ (m) ¯ (m) 2
m=1 η k∇ f (θ , λ̄)k2 < ∞ holds.
In Lemmas 7.9, 7.10 and 7.11, we show that {σ (m) } and {µ(m) } are Cauchy series, hence Lemma
4.7 holds.
23
This lemma has been proven by Bertsekas and Tsitsiklis [5].
√ 1
f¯(θ(m+1) , λ̄) ≤ f¯(θ(m) , λ̄) − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + (η (m) )2 · N L̄M.
2
√
Let Y (m) := f¯(θ(m) , λ̄), W (m) := η (m) k∇f¯(θ(m) , λ̄)k22 and Z (m) := η (m) L̄M n2 am +
1 (m) 2 PM
(η ) · N L̄M . By (1) and (50)- (52), it is easy to see that m=0 Z (m) converges as M → ∞.
2
Therefore, by Lemma 7.12, Y (m) converges to a finite value. The infinite case can not occur in our
setting due to Assumptions 4.1 and 4.2.
Lemma 7.14 If
∞ X
X ∞ X
i ∞ X
X ∞
α(i) η (n) < ∞ and α(n) < ∞,
m=1 i=m n=1 m=1 n=m
Proof. To show that lim k∇f¯(θ(m) , λ̄)k2 = 0, assume the contrary; that is,
m→∞
Then there exists an > 0 such that k∇f¯(θ(m) , λ̄)k < /2 for infinitely many m and also
k∇f¯(θ(m) , λ̄)k > for infinitely many m. Therefore, there is an infinite subset of integers M,
such that for each m ∈ M, there exists an integer q(m) > m such that
From
k∇f¯(θ(m+1) , λ̄)k − k∇f¯(θ(m) , λ̄)k ≤ k∇f¯(θ(m+1) , λ̄) − ∇f¯(θ(m) , λ̄)k
≤ L̄kθ(m+1) − θ(m) k
= L̄η (m) k∇f¯(θ(m) , λ(m) )k,
it follows that for all m ∈ M that are sufficiently large so that L̄η (m) < /4, we have
Otherwise the condition /2 ≤ k∇f¯(θ(m+1) , λ̄)k would be violated. Without loss of generality, we
assume that the above relations as well as (36) hold for all m ∈ M. With the above observations, we
have for all m ∈ M,
24
≤ k∇f¯(θq(m) , λ̄)k − k∇f¯(θ(m) , λ̄)k
2
≤ k∇f¯(θq(m) , λ̄) − ∇f¯(θ(m) , λ̄)k
≤ L̄kθq(m) − θ(m) k
q(m)−1
X
≤ L̄ kθ(i+1) − θ(i) k
i=m
q(m)−1
X
≤ L̄ η (i) k∇f¯(θ(i) , λ(i) )k
i=m
q(m)−1
X
≤ L̄ η (i) (k∇f¯(θ(i) , λ̄)k + k∇f¯(θ(i) , λ(i) ) − ∇f¯(θ(i) , λ̄)k)
i=m
q(m)−1
X √
≤ L̄ η (i) (k∇f¯(θ(i) , λ̄)k + L̄ n2 am )
i=m
q(m)−1 q(m)−1
X √ X
≤ L̄ η (i) + L̄2 n2 η (i) am
i=m i=m
q(m)−1 q(m)−1 j
∞ X ∞
X √ X X X
= L̄ η (i) + L̄2 n2 η (i) M1 α(j) η (k) + M2 α(j)
i=m i=m j=m k=1 j=m
q(m)−1 q(m)−1 ∞ j q(m)−1 ∞
X √ X X X √ X X
= L̄ η (i) + L̄2 n2 M1 η (i) α(j) η (k) + L̄2 n2 M2 η (i) α(j)
i=m i=m j=m k=1 i=m j=m
The first inequality is by (54) and the third one is by the Lipschitz condition assumption. The seventh
one is by (32). By (2), we have for all m ∈ M,
q(m)−1 j
∞ X ∞ X j
∞ X
X X X
η (i) α(j) η (k) < α(j) η (k) < ∞
i=m j=m k=1 i=1 j=i k=1
and
q(m)−1 ∞ ∞ X
∞
X X X
η (i) α(j) < α(j) < ∞.
i=m j=m i=1 j=i
P∞ P∞
It is easy to see that for any sequence {αi } with i=1 αi < ∞, if follows that lim inf i=M αi = 0.
M →∞
Therefore,
q(m)−1 j
∞ X
X X
(i)
lim inf η α(j) η (k) = 0
m→∞
i=m j=m k=1
and
q(m)−1 ∞
X X
lim inf η (i) α(j) = 0.
m→∞
i=m j=m
25
By the triangle inequality, we have
k∇f¯(θ(m) , λ̄)k
=k∇f¯(θ(m) , λ(m) ) − ∇f¯(θ(m) , λ(m) ) + ∇f¯(θ(m) , λ̄)k
≥ k∇f¯(θ(m) , λ(m) )k − k∇f¯(θ(m) , λ̄) − ∇f¯(θ(m) , λ(m) )k .
√
By (32) and (55), if we pick m ∈ M such that L n2 am ≤ , we have k∇f¯(θ(m) , λ̄)k ≥ . Using
8 8
(36), we observe that
q(m)−1 1 q(m)−1
X X
f¯(θq(m) , λ̄) ≤ f¯(θ(m) , λ̄) − η (i) c1 k∇f¯(θ(i) , λ̄)k22 + · N L̄M (η (i) )2
i=m
2 i=m
2 q(m)−1
X 1
q(m)−1
X
≤ f¯(θ(m) , λ̄) − c1 η (i) + · N L̄M (η (i) )2 , ∀m ∈ M,
8 i=m
2 i=m
where the second inequality is by (55). By Lemma 7.13, f¯(θq(m) , λ̄) and ¯ (m) , λ̄) converge to
Pf∞(θ (m)
the same finite value. Using this convergence result and the assumption m=0 (η )2 < ∞, this
relation implies that
q(m)−1
X
lim sup η (i) = 0
m→∞,m∈M i=m
Therefore, we have
lim k∇f¯(θ(m) , λ(m) )k22 = 0,
m→∞
Lemma 7.15 If
∞ X
X i
m2 α(i) η (n) < ∞, (57)
i=m n=1
26
Proof. The notation here is the same as the one used in the proof of Lemma 7.11. Showing (58) is
(m)
|∆ | am
equivalent to showing constant upper bounds for 1 2 and (m) .
η (m) η
(m)
|∆1 |
For an upper bound of 2 , by (40) and (43), we have
η (m)
(m)
|∆1 |
2
η (m)
|f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄)|
= 2 (59)
η (m)
(m)
D
!
M3 X (1) 1 (m) µ̄j µj
≤ 2 |k |N DM 2 σ̄j − σj +N − (m) .
η (m) j=1
B σ̄j + B σj + B
(m) (m)
|σ¯j − σj | |µ¯j − µj |
We can see that it is equivalent to show that and
have constant upper
η η (m) 2 (m) 2
bounds because all other terms in the right-hand side of (59) are finite constants.
By (38), we have
∞ i ∞
(m) |k (1) | · M̃L̄,M X X σ̄j + |k (1) ||C̄| X
|σ¯j − σj |≤ α(i) η (j) + α(n) .
C i=m j=1
C n=m+1
ζ 2
Note that we have η (m) = and thus η (m) = O( m12 ). Therefore, (57) implies that
ϑ+m
∞ X
i
1 X
2 α(i) η (n) < ∞. (60)
η (m) i=m n=1
27
q X
m
(p) (q)
X
|µ̃j − µ̃j | ≤ M̃L̄,M · α(m) η (n) .
m=p n=1
For the second term in (62), we first define C := (1 − α(1) )...(1 − α(u) )... . Then we have
µ̄j (∞)
C· − µ̃j
(1 − α(1) )...(1 − α(m) )
=µ̄j |1 − (1 − α(m+1) )...(1 − α(∞) )|
X∞
≤µ̄j α(i) ,
i=m+1
where the last inequality can be easily checked by induction. Therefore, the second term in (62) is
bounded by
∞
µ̄j (∞) µ̄j X (i)
− µ̃j ≤ α .
(1 − α(1) )...(1 − α(m) ) C i=m+1
(m)
|µ¯j − µj |
Hence (60) and (61) ensure to be finite.
η (m) 2
am
For an upper bound of , by (23), we have
η (m)
P∞ Pi P∞
am M1 i=m j=1 α(i) η (j) + M2 i=m α(i)
= .
η (m) η (m)
We know that
P∞ Pi ∞ X
i
M1 i=m j=1 α(i) η (j) 1 X
< M1 2 α(i) η (j) < ∞ (63)
η (m) η (m) i=m j=1
and P∞ ∞
M2 α(i) 1 X
i=m
< M2 2 α(i) < ∞. (64)
η (m) η (m) i=m
The second inequalities in (63) and (64) are by (60) and (61). Note that given that η (m) = 1/m, (60)
is equivalent to
∞ X i ∞ ∞
X 1 X X
α(i) < α(i) ln(i) < α(i) ln(i) < ∞
i=m j=1
j i=m i=1
This concludes the proof.
Lemma 7.16 Under the assumptions of Lemma 7.15, Theorem 5.2 holds.
The proof for this Lemma of the high level follows the proof of Theorem 4.7 in Bottou et al. [6].
Proof. Assumption 5.1 implies that
1
f¯(θ̃, λ) ≥ f¯(θ̂, λ) + ∇f¯(θ̂, λ)T (θ̃ − θ̂) + ckθ̃ − θ̂k22 , ∀θ̃, θ̂.
2
¯ ¯ ¯
Therefore, f has a unique minimizer f := f (θ , λ) for any λ fixed. Note that θ∗ = θ∗ (λ) but this
∗ ∗
dependency is irrelevant in the rest of the proof. Standard convex analysis argument establishes
28
2c f¯(θ(m) , λ) − f¯(θ∗ , λ) ≤ k∇f¯(θ(m) , λ)k22 . (65)
(m+1)
Recall that ∆1 := f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m+1) , λ̄). We then have
f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m) , λ(m) )
h i h i
= f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ(m) ) − f¯(θ(m) , λ̄)
(66)
+ f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄)
(m+1) (m)
≤|∆1 | + |∆1 | + f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄).
Therefore,
f¯(θ(m+1) , λ̄) − f¯(θ(m) , λ̄)
√ 1 2
≤ − η (m) k∇f¯(θ(m) , λ̄)k22 + η (m) L̄M n2 am + η (m) N L̄M
2
¯ ¯ ∗ √ 1 2
(m) (m) (m)
≤ − η c(f (θ , λ̄) − f (θ , λ̄)) + η L̄M n2 am + η (m) N L̄M
2
¯ ¯ ∗ ¯ ¯
= − η c f (θ , λ ) − f (θ , λ̄) + f (θ , λ̄) − f (θ(m) , λ(m) )
(m) (m) (m) (m)
√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M (67)
2
≤ − η (m) c f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄) + η (m) c f¯(θ(m) , λ̄) − f¯(θ(m) , λ(m) )
√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M
2
(m)
= − η c f (θ , λ ) − f¯(θ∗ , λ̄) + η (m) c|∆1 |
(m) ¯ (m) (m)
√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M.
2
The first inequality is by Proposition 7.8, while the second inequality is by the strong convexity
property (65). Combining (66) and (67) yields
f¯(θ(m+1) , λ(m+1) ) − f¯(θ(m) , λ(m) )
(m+1) (m)
≤ − η (m) c f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄) + |∆1 | + (1 + η (m) c)|∆1 |
√ 1 2
+ η (m) L̄M n2 am + η (m) N L̄M.
2
By Lemma 7.15, there exists an upper bound M4 such that for all m sufficiently large,
(m+1) (m) √
|∆1 | + (1 + η (m) c)|∆1 | + η (m) L̄M n2 am
≤ M4 .
1 (m) 2
2η
By subtracting f¯(θ∗ , λ̄) from both side of (7.6), we obtain
v
f¯(θm , λm ) − f¯(θ∗ , λ̄) ≤ (69)
ϑ+m
29
holds for all m, where
ζ 2 (N L̄M + M4 )
v := max{ , (ϑ + 1)[f¯(θ(1) , λ(1) ) − f¯(θ∗ , λ̄)]}.
2(ζc − 1)
First, the definition of ζ ensures that it holds for m = 1. Assuming (69) holds for some m ≥ 1, it
follows from (68) that
1 2
f¯(θ(m+1) , λ(m+1) ) − f¯(θ∗ , λ̄) ≤ (1 − η (m) c)(f¯(θ(m) , λ(m) ) − f¯(θ∗ , λ̄)) + η (m) (LM + M4 )
2
(m) v 1 (m) 2
≤ (1 − η c) + η (LM + M4 )
ϑ+m 2
ζc v ζ 2 (LM + M4 )
= (1 − ) +
ϑ+m ϑ+m 2(ϑ + m)2
ϑ + m − ζc ζ 2 (LM + M4 )
= 2
v+
(ϑ + m) 2(ϑ + m)2
ζ 2 (LM + M4 )
ϑ+m−1 ζc − 1
= v − v +
(ϑ + m)2 (ϑ + m)2 2(ϑ + m)2
ϑ+m−1
≤ v
(ϑ + m)2
v
≤ .
ϑ+m+1
The first inequality is by (68), the second inequality is by the definition of η (m) , the third inequality
is by the definition of v, the sum of the latter two terms is non-positive, and the fourth inequality
is because (ϑ + m)2 ≥ (ϑ + m + 1)(ϑ + m − 1). This shows that the algorithm converges at a
sublinear rate.
Here we discuss the actual conditions for η (m) and α(m) to satisfy the assumptions of Theorem 4.6,
Lemma 4.7 and Theorem 5.2, respectively. We only consider the cases η (m) = m1k and α(m) = m1h ,
but the same analysis applies to the cases η (m) = O( m1k ) and α(m) = O( m1h ).
30
8.2 Assumptions of Lemma 4.7
requires h > 2.
Besides, the second condition is
∞ X
X ∞ X
i ∞ X
X ∞ i
X ∞ X
X ∞
α(i) η (n) = α(i) η (n) ≤ C α(i) < ∞.
m=1 i=m n=1 m=1 i=m n=1 m=1 i=m
Therefore, the conditions for η (m) and α(m) to satisfy the assumptions of Lemma 4.7 are h > 2 and
k ≥ 1.
Recall that we have let η (m) = 1/m. For the assumptions of Theorem 5.2, the condition
∞
X
α(m) ln(m) < ∞
m=1
requires h > 1. To see this, note that ln(m) ≤ Cm for any > 0. Thus
∞
X ∞
X
α(m) ln(m) ≤ C m−h m < ∞
m=1 m=1
31