
Gradient Descent Learns One-hidden-layer CNN:
Don't be Afraid of Spurious Local Minima

Simon S. Du^1, Jason D. Lee^2, Yuandong Tian^3, Barnabás Póczos^1, and Aarti Singh^1

^1 Machine Learning Department, Carnegie Mellon University
^2 Department of Data Sciences and Operations, University of Southern California
^3 Facebook Artificial Intelligence Research

arXiv:1712.00779v1 [cs.LG] 3 Dec 2017

December 5, 2017

Abstract
We consider the problem of learning a one-hidden-layer neural network with a non-overlapping
convolutional layer and ReLU activation function, i.e., f(Z; w, a) = Σ_j a_j σ(w^T Z_j), in which
both the convolutional weights w and the output weights a are parameters to be learned. We
prove that with Gaussian input Z, there is a spurious local minimum that is not a global
minimum. Surprisingly, in the presence of the spurious local minimum, starting from randomly initialized
weights, gradient descent with weight normalization can still be proven to recover the true
parameters with constant probability (which can be boosted to arbitrarily high accuracy with
multiple restarts). We also show that with constant probability, the same procedure could also
converge to the spurious local minimum, showing that the local minimum plays a non-trivial
role in the dynamics of gradient descent. Furthermore, a quantitative analysis shows that the
gradient descent dynamics has two phases: it starts off slow, but converges much faster after
several iterations.

1 Introduction
Deep convolutional neural networks (CNN) have achieved the state-of-the-art performance in many
applications such as computer vision [Krizhevsky et al., 2012], natural language processing [Dauphin
et al., 2016] and reinforcement learning applied in classic games like Go [Silver et al., 2016]. Despite
the highly non-convex nature of the objective function, simple first-order algorithms like stochastic
gradient descent and its variants often train such networks successfully. The success of such
simple methods in learning convolutional neural networks remains elusive from an optimization
perspective.
Recently, a line of research [Tian, 2017, Brutzkus and Globerson, 2017, Li and Yuan, 2017,
Soltanolkotabi, 2017, Shalev-Shwartz et al., 2017b] assumed the input distribution is Gaussian and
showed that stochastic gradient descent with random or 0 initialization is able to train a neural
network f(Z; {w_j}) = Σ_j a_j σ(w_j^T Z) with ReLU activation σ(x) = max(x, 0) in polynomial time.
However, these results all assume there is only one unknown layer {wj }, while a is a fixed vector.
A natural question thus arises:

Does randomly initialized (stochastic) gradient descent learn deep neural networks?

[Figure 1: (a) Convolutional neural network with an unknown non-overlapping filter and an unknown output layer. In the first (hidden) layer, a filter w is applied to non-overlapping parts of the input x, which then passes through a ReLU activation function. The final output is the inner product between an output weight vector a and the hidden layer outputs. (b) The convergence of gradient descent for learning the convolutional neural network described in Figure 1a with filter size p = 20 and number of non-overlapping patches k = 25, plotting the logarithm of the prediction error against epochs. The success case and the failure case correspond to convergence to the global minimum and the spurious local minimum, respectively. In the first ~50 iterations the convergence is slow. After that gradient descent converges at a fast linear rate.]

Figure 1: Network architecture we consider in this paper and convergence of gradient descent for
learning the parameters of this network.

In this paper, we take an important step by showing that randomly initialized gradient descent
learns a simple non-linear convolutional neural network with two unknown layers w and a. To our
knowledge, our work is the first of its kind.
Formally, we consider the convolutional case, in which the weights w_i = w are shared among different hidden
nodes. Let x ∈ R^d be an input sample, e.g., an image. We generate k patches from x, each of size
p: Z ∈ R^{p×k}, where the i-th column is the i-th patch generated by some known function Z_i = Z_i(x).
Any two patches are non-overlapping. Thus, the neural network function has the following form:

f(Z, w, a) = Σ_{i=1}^{k} a_i σ(w^T Z_i).

We focus on the realizable case, i.e., the label is generated according to y = f(Z, w*, a*) for
some true parameters w* and a*. We use the ℓ2 loss to learn the parameters:
min_{w,a} ℓ(Z, w, a) := (1/2) (f(Z, w, a) − f(Z, w*, a*))^2.
We assume x is sampled from a Gaussian distribution and there is no overlap between patches. This
assumption is equivalent to assuming that each entry of Z is sampled from a Gaussian distribution [Brutzkus
and Globerson, 2017, Zhong et al., 2017b]. Following [Zhong et al., 2017a,b, Li and Yuan, 2017,
Tian, 2017, Brutzkus and Globerson, 2017, Shalev-Shwartz et al., 2017b], in this paper we mainly
focus on the population loss:
ℓ(w, a) := (1/2) E_Z[(f(Z, w, a) − f(Z, w*, a*))^2].
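To fix notation, the network f and the per-sample ℓ2 loss above can be written down directly. A minimal NumPy sketch follows; the dimensions p = 20 and k = 25 are the ones used in Figure 1b, and everything else (seed, parameter values) is an arbitrary illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def f(Z, w, a):
    """One-hidden-layer CNN: f(Z; w, a) = sum_i a_i * relu(w^T Z_i).
    Z has shape (p, k); column i is the i-th non-overlapping patch."""
    return a @ relu(Z.T @ w)

def loss(Z, w, a, w_star, a_star):
    """Realizable l2 loss: the label is y = f(Z; w*, a*)."""
    return 0.5 * (f(Z, w, a) - f(Z, w_star, a_star)) ** 2

rng = np.random.default_rng(0)
p, k = 20, 25                            # sizes taken from Figure 1b
Z = rng.standard_normal((p, k))          # one sample of Gaussian patches
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)

# In the realizable setting, the loss at the true parameters is exactly zero.
assert loss(Z, w_star, a_star, w_star, a_star) == 0.0
```

The loss is zero exactly at the ground truth and non-negative everywhere else, which is what makes the recovery question of this paper well posed.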

Algorithm 1 GD for Learning One-Hidden-Layer CNN with Weight Normalization
1: Input: Initialization v^0 ∈ R^p, a^0 ∈ R^k, learning rate η.
2: for t = 0, 1, 2, . . . do
3:   v^{t+1} ← v^t − η ∂ℓ(v^t, a^t)/∂v^t,  a^{t+1} ← a^t − η ∂ℓ(v^t, a^t)/∂a^t.
4: end for

We study whether the global convergence w → w∗ and a → a∗ can be achieved when optimizing
`(w, a) using gradient descent.
A crucial difference between our two-layer network and previous one-layer models is that there is a
non-uniqueness issue: for any c > 0, f(Z, cw, a/c) = f(Z, w, a). This interesting property
allows the network to be rescaled without changing the function computed by the network. As
reported by Neyshabur et al. [2015], it is desirable to have a scaling-invariant learning algorithm
to stabilize the training process. Similar observations have appeared in the matrix factorization
(equivalent to linear activation) literature; for example, see Tu et al. [2016].
One commonly used technique to achieve stability is weight normalization, introduced by Salimans and Kingma [2016]. In our setting, we re-parametrize the first layer as w = v/‖v‖_2 and the
prediction function becomes

f(Z, v, a) = Σ_{i=1}^{k} a_i σ(Z_i^T v) / ‖v‖_2.    (1)

The loss function is

ℓ(v, a) = (1/2) E_Z[(f(Z, v, a) − f(Z, v*, a*))^2].    (2)
The pseudo-code for optimizing the objective function in Equation (2) is listed in Algorithm 1.
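Algorithm 1 is a few lines once the (sub)gradients of the normalized prediction in Equation (1) are written out. The following NumPy sketch runs it on fresh Gaussian samples; the dimensions, step size, and iteration count are illustrative choices, not values from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def predict(Z, v, a):
    # f(Z; v, a) = sum_i a_i * relu(Z_i^T v) / ||v||_2   (Equation (1))
    return a @ relu(Z.T @ v) / np.linalg.norm(v)

def gradients(Z, v, a, y):
    """(Sub)gradients of the single-sample l2 loss under weight normalization."""
    nv = np.linalg.norm(v)
    r = predict(Z, v, a) - y                  # residual
    grad_a = r * relu(Z.T @ v) / nv           # dl/da: hidden-layer outputs times residual
    active = (Z.T @ v > 0).astype(float)      # ReLU activation pattern
    g_w = Z @ (a * active)                    # df/dw for w = v / ||v||_2
    # Chain rule through the normalization: project out the v direction.
    grad_v = r * (g_w - (v @ g_w) * v / nv**2) / nv
    return grad_v, grad_a

# A run of Algorithm 1 on fresh Gaussian samples in the realizable setting.
rng = np.random.default_rng(1)
p, k, eta = 8, 10, 0.01
w_star = rng.standard_normal(p); w_star /= np.linalg.norm(w_star)
a_star = rng.standard_normal(k)
v, a = rng.standard_normal(p), 0.1 * rng.standard_normal(k)
for _ in range(500):
    Z = rng.standard_normal((p, k))
    y = a_star @ relu(Z.T @ w_star)           # realizable label
    gv, ga = gradients(Z, v, a, y)
    v, a = v - eta * gv, a - eta * ga
```

Note that grad_v is orthogonal to v, so ‖v‖_2 never shrinks under the update; this is one of the stabilizing effects of the weight-normalized parametrization. Depending on the random draw, such a run may approach either the global minimum or the spurious local minimum, which is exactly the phenomenon the following sections analyze.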
We show that with random initialization, gradient descent converges to the target convolutional
neural network with probability at least 0.25. By further exploiting the symmetry, we can boost
the success probability to 1 with only 3 additional deterministic restarts. Further, perhaps
surprisingly, we prove that the objective function (Equation (2)) does have a spurious local minimum
and with constant probability, using the same random initialization scheme gradient descent can
converge to this bad local minimum as well. In contrast to previous works on guarantees for non-
convex objective functions whose landscape satisfies “no spurious local minima” property [Li et al.,
2016, Sun et al., 2016, Ge et al., 2017a, 2016, Bhojanapalli et al., 2016, Ge et al., 2017b], our result
highlights a conceptually surprising phenomenon:

Randomly initialized local search can find a global minimum in the presence of spurious local minima.

At the core of our analysis is a series of invariant qualitative characterizations of gradient descent
dynamics, which determines the convergence to the global or the spurious local minimum. This
analysis emphasizes that for non-convex optimization problems, we need to carefully characterize
both the trajectory of the algorithm and the initialization. We believe that this idea is applicable
to other non-convex problems.
Next, we conduct a quantitative study of the dynamics of gradient descent. We show that
the dynamics of Algorithm 1 contain two phases. At the beginning (around the first 50 iterations in
Figure 1b), because the magnitude of the initial signal (the angle between v and w* and the inner
product a^T a*) is small, the prediction error drops slowly. After that, when the signal becomes
stronger, gradient descent converges at a much faster linear rate and the prediction error drops quickly.

1.1 Organization
This paper is organized as follows. In Section 2 we introduce necessary notations and analytical
formulas of gradient updates in Algorithm 1. In Section 3, we provide our main theorems on the
performance of the algorithms and their implications. In Section 4, we give a proof sketch of our main
theorem, laying out the key technical insights of learning one-hidden-layer CNN. We conclude and
list future directions in Section 5. We place most of our detailed proofs in the appendix.

1.2 Related Works


From the point of view of learning theory, it is well known that training a neural network is hard
in the worst case [Blum and Rivest, 1989, Livni et al., 2014, Šı́ma, 2002, Shalev-Shwartz et al.,
2017a,b] and recently, Shamir [2016] showed that assumptions on both the target function and the
input distribution are needed for optimization algorithms used in practice to succeed. With some
additional assumptions, many works tried to design algorithms that provably learn a neural network
with polynomial time and sample complexity [Goel et al., 2016, Zhang et al., 2015, Sedghi and
Anandkumar, 2014, Janzamin et al., 2015, Goel and Klivans, 2017a,b]. However, these algorithms
are specially designed for certain architectures and cannot explain why (stochastic) gradient-based
optimization algorithms work well in practice.
Focusing on gradient-based algorithms, a line of research analyzed the behavior of (stochastic)
gradient descent for Gaussian input distributions. Tian [2017] showed that population gradient
descent with random initialization is able to find the true weight vector for a one-layer, one-neuron
model. Soltanolkotabi [2017] later improved this result by showing the true weights can be exactly
recovered by empirical projected gradient descent with enough samples in linear time. Brutzkus
and Globerson [2017] showed population gradient descent recovers the true weights of a convolution
filter with non-overlapping input in polynomial time. Zhong et al. [2017b] and later work [Zhong
et al., 2017a] proved that with a sufficiently good initialization, which can be implemented by the
tensor method, gradient descent can find the true weights of a one-hidden-layer fully connected or
convolutional neural network. Li and Yuan [2017] showed SGD can recover the true weights of a
one-layer ResNet model with ReLU activation under the assumption that the spectral norm of the
true weights is within a small constant of the identity mapping. This paper also follows this line
of research, studying the behavior of the gradient descent algorithm with Gaussian inputs.
Finding the optimal weights of a neural network is a non-convex problem. Recently, researchers
found that if the objective function satisfies the following two key properties:

1. All saddle points and local maxima are strict, i.e., there exists a direction of negative curvature;

2. No spurious local minimum, i.e., all local minima are global,

then perturbed (stochastic) gradient descent [Ge et al., 2015, Jin et al., 2017, Levy, 2016] or
methods with second order information [Nesterov and Polyak, 2006, Curtis et al., 2014, Carmon
et al., 2016, Allen-Zhu, 2017] can find a global minimum in polynomial time. Combined with
geometric analyses, these algorithmic results have shown that a large number of problems, including tensor
decomposition [Ge et al., 2015], dictionary learning [Sun et al., 2017], phase retrieval [Sun et al.,
2016], matrix sensing [Bhojanapalli et al., 2016, Park et al., 2017], matrix completion [Ge et al.,
2017a, 2016] and matrix factorization [Li et al., 2016] can be solved in polynomial time with local
search algorithms.
This motivates the research of studying the landscape of neural networks [Kawaguchi, 2016,
Choromanska et al., 2015a, Hardt and Ma, 2016, Haeffele and Vidal, 2015, Mei et al., 2016, Freeman
and Bruna, 2016, Safran and Shamir, 2016, Zhou and Feng, 2017, Nguyen and Hein, 2017a,b,
Ge et al., 2017b, Zhou and Feng, 2017]. In particular, Kawaguchi [2016], Hardt and Ma [2016],
Zhou and Feng [2017], Nguyen and Hein [2017a,b], Feizi et al. [2017] showed that under some
conditions, all local minima are global. Recently, Ge et al. [2017b] showed that, using a modified objective
function satisfying the two properties above, a one-hidden-layer neural network can be learned by
noisy perturbed gradient descent.^1 However, for nonlinear activation functions, when the number
of samples is larger than the number of nodes at every layer (the usual case in most deep neural
networks), and for natural objective functions like ℓ2, it is still unclear whether the strict saddle and
"all local minima are global" properties are satisfied. In this paper, we show that even for a one-hidden-layer
neural network with ReLU activation, there exists a spurious local minimum. However, we further show
that randomly initialized local search can achieve a global minimum with constant probability.

2 Preliminaries
We use bold-faced letters for vectors and matrices. We use ‖·‖_2 to denote the Euclidean norm of a
finite-dimensional vector. Let O(·) and Θ(·) denote the standard Big-O and Big-Theta notations,
hiding only absolute constants. We let w^t and a^t be the parameters at the t-th iteration and w* and
a* be the optimal weights. a_i is the i-th coordinate of a and Z_i is the transpose of the i-th row
of Z (thus a column vector). We denote by S^{p−1} the (p−1)-dimensional unit sphere and by B(0, r) the
ball centered at 0 with radius r.
In this paper we assume every patch Z_i is a vector of i.i.d. Gaussian random variables. The
following theorem gives an explicit formula for the population loss. The proof uses the basic rotational
invariance and the polar decomposition of Gaussian random variables. See Section A for details.

Theorem 2.1. For Z ∼ N(0, I), the population loss is

ℓ(v, a) = (1/2) [ ((π−1)/(2π)) ‖w*‖_2^2 ‖a*‖_2^2 + ((π−1)/(2π)) ‖a‖_2^2 − ((g(φ)−1)/π) ‖w*‖_2 a^T a*
        + (1/(2π)) ‖w*‖_2^2 (1^T a*)^2 + (1/(2π)) (1^T a)^2 − (1/π) ‖w*‖_2 (1^T a)(1^T a*) ]    (3)

where φ = θ(v, w*) and g(φ) = cos(φ)(π − φ + sin(φ) cos(φ)) + sin^3(φ).
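The closed form in Equation (3) can be checked against a Monte Carlo estimate of the population loss. The sketch below does so for one random parameter setting; the dimensions, seed, and sample size are arbitrary:

```python
import numpy as np

def g(phi):
    return np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3

def population_loss(v, a, w_star, a_star):
    """Closed-form population loss of Theorem 2.1 (Equation (3))."""
    nw = np.linalg.norm(w_star)
    phi = np.arccos(np.clip(v @ w_star / (np.linalg.norm(v) * nw), -1.0, 1.0))
    sa, ss = np.sum(a), np.sum(a_star)
    return 0.5 * (
        (np.pi - 1) / (2 * np.pi) * (nw**2 * (a_star @ a_star) + a @ a)
        - (g(phi) - 1) / np.pi * nw * (a @ a_star)
        + (nw**2 * ss**2 + sa**2 - 2 * nw * sa * ss) / (2 * np.pi)
    )

# Monte Carlo check of the formula on random parameters.
rng = np.random.default_rng(2)
p, k, n = 4, 5, 200_000
v, a = rng.standard_normal(p), rng.standard_normal(k)
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)
w = v / np.linalg.norm(v)                       # weight normalization
Z = rng.standard_normal((n, p, k))              # n i.i.d. Gaussian inputs
f = np.maximum(np.einsum('npk,p->nk', Z, w), 0.0) @ a
f_star = np.maximum(np.einsum('npk,p->nk', Z, w_star), 0.0) @ a_star
mc = 0.5 * np.mean((f - f_star) ** 2)
cf = population_loss(v, a, w_star, a_star)
assert abs(mc - cf) < 0.05 * cf + 0.05          # agree up to sampling error
```

The agreement hinges on the standard identities E[σ(u^T z) σ(w^T z)] = (‖u‖ ‖w‖ / 2π)(sin θ + (π − θ) cos θ) and E[σ(u^T z)] = ‖u‖/√(2π) for Gaussian z, which is where g(φ) comes from.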

Using similar techniques, we can show the gradient also has an analytical form.

^1 Lee et al. [2016] showed that vanilla gradient descent only converges to minimizers, with no convergence rate
guarantees. Recently, Du et al. [2017a] gave an exponential time lower bound for vanilla gradient descent. In this
paper, we give a polynomial convergence guarantee for vanilla gradient descent.

Theorem 2.2. If Z ∼ N(0, I) and we denote θ(v, w*) = φ, then the expected gradients of v and a
can be written as

E_Z[∂ℓ(Z, v, a)/∂v] = −(1/(2π‖v‖_2)) (I − vv^T/‖v‖_2^2) a^T a* (π − φ) w*,

E_Z[∂ℓ(Z, v, a)/∂a] = (1/(2π)) (11^T + (π−1)I) a − (1/(2π)) ‖w*‖_2 (11^T + (g(φ)−1)I) a*.

As a remark, if the second layer is fixed, upon proper scaling, the formulas for the population loss
and the gradient of v are equivalent to the corresponding formulas derived in [Brutzkus and Globerson,
2017, Cho and Saul, 2009]. However, when the second layer is not fixed, the gradient of v depends
on a^T a*, which plays an important role in deciding whether gradient descent converges to the global or the local
minimum.
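As a sanity check, the analytic population gradients of Theorem 2.2 can be compared against finite differences of the closed-form loss of Theorem 2.1. A NumPy sketch with arbitrary small dimensions:

```python
import numpy as np

def g(phi):
    return np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3

def population_loss(v, a, w_star, a_star):
    # Closed-form population loss of Theorem 2.1 (Equation (3)).
    nw = np.linalg.norm(w_star)
    phi = np.arccos(np.clip(v @ w_star / (np.linalg.norm(v) * nw), -1.0, 1.0))
    sa, ss = np.sum(a), np.sum(a_star)
    return 0.5 * (
        (np.pi - 1) / (2 * np.pi) * (nw**2 * (a_star @ a_star) + a @ a)
        - (g(phi) - 1) / np.pi * nw * (a @ a_star)
        + (nw**2 * ss**2 + sa**2 - 2 * nw * sa * ss) / (2 * np.pi)
    )

def population_grads(v, a, w_star, a_star):
    """Analytic population gradients from Theorem 2.2."""
    nv, nw = np.linalg.norm(v), np.linalg.norm(w_star)
    phi = np.arccos(np.clip(v @ w_star / (nv * nw), -1.0, 1.0))
    proj_w = w_star - (v @ w_star) * v / nv**2       # (I - vv^T/||v||^2) w*
    grad_v = -(a @ a_star) * (np.pi - phi) / (2 * np.pi * nv) * proj_w
    k = a.size
    M = np.ones((k, k)) + (np.pi - 1) * np.eye(k)    # 11^T + (pi-1) I
    N = np.ones((k, k)) + (g(phi) - 1) * np.eye(k)   # 11^T + (g(phi)-1) I
    grad_a = (M @ a - nw * N @ a_star) / (2 * np.pi)
    return grad_v, grad_a

rng = np.random.default_rng(0)
v, w_star = rng.standard_normal(5), rng.standard_normal(5)
a, a_star = rng.standard_normal(4), rng.standard_normal(4)
gv, ga = population_grads(v, a, w_star, a_star)
eps = 1e-6
for i in range(5):   # check grad_v coordinate-wise by central differences
    e = np.zeros(5); e[i] = eps
    fd = (population_loss(v + e, a, w_star, a_star)
          - population_loss(v - e, a, w_star, a_star)) / (2 * eps)
    assert abs(fd - gv[i]) < 1e-4
for i in range(4):   # same for grad_a
    e = np.zeros(4); e[i] = eps
    fd = (population_loss(v, a + e, w_star, a_star)
          - population_loss(v, a - e, w_star, a_star)) / (2 * eps)
    assert abs(fd - ga[i]) < 1e-4
```

This check also makes the remark above concrete: grad_v carries the factor a^T a*, so the sign of the second-layer inner product directly scales and orients the first-layer update.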

3 Main Result
We begin with our main theorem about the convergence of gradient descent.
Theorem 3.1. Suppose the initialization satisfies (a^0)^T a* > 0, 1^T a^0 ≤ 1^T a*, φ^0 < π/2, and the
step size satisfies

η = O( min{ (a^0)^T a* cos φ^0 / ((‖a*‖_2^2 + (1^T a*)^2) ‖w*‖_2^2 ‖a*‖_2^2),
            (g(φ^0) − 1) ‖a*‖_2^2 cos φ^0 / ((‖a*‖_2^2 + (1^T a*)^2) ‖w*‖_2^2 ‖a*‖_2^2),
            cos φ^0 / ((‖a*‖_2^2 + (1^T a*)^2) ‖w*‖_2^2 ‖a*‖_2^2),
            1/k } ).

Then the convergence of gradient descent has two phases.
(Phase I: Slow Initial Rate) There exists T_1 = O(1/(η cos φ^0 · β^0) + 1/η) such that φ^{T_1} = Θ(1)
and (a^{T_1})^T a* ‖w*‖_2 = Θ(‖a*‖_2^2 ‖w*‖_2^2), where β^0 = min{(a^0)^T a* ‖w*‖_2, (g(φ^0) − 1) ‖a*‖_2^2} ‖a*‖_2^2.
(Phase II: Fast Rate) There exists T_2 = Õ(1/(η ‖w*‖_2^2 ‖a*‖_2^2) + (1/η) log(1/ε)) such that
ℓ(v^{T_1+T_2}, a^{T_1+T_2}) ≤ ε ‖w*‖_2^2 ‖a*‖_2^2.

Theorem 3.1 shows that under certain conditions on the initialization, gradient descent converges to
the global minimum. The convergence has two phases: at the beginning, because the initial signal is
small, the convergence is quite slow; after T_1 iterations, the signal becomes stronger and we enter
a regime with a faster convergence rate. More technical insights are provided in Section 4.
Initialization plays an important role in the convergence. First, Theorem 3.1 requires that the initialization
satisfy (a^0)^T a* > 0, 1^T a^0 ≤ 1^T a* and φ^0 < π/2. Second, the step size η and the
convergence rate in the first phase also depend on the initialization. If the initial signal is very
small, for example if φ^0 ≈ π/2, which makes cos φ^0 close to 0, we can only choose a very small step
size, and because T_1 depends on the inverse of cos φ^0, we need a large number of iterations to enter
Phase II. We provide the following initialization scheme, which ensures the conditions required by
Theorem 3.1 and a large enough initial signal.
  
Theorem 3.2. Let v ∼ unif(S^{p−1}) and a ∼ unif(B(0, |1^T a*| ‖w*‖_2 / √k)). Then there exists

(v^0, a^0) ∈ {(v, a), (v, −a), (−v, a), (−v, −a)}

such that (a^0)^T a* > 0, 1^T a^0 ≤ 1^T a* and φ^0 < π/2. Further, with high probability, the initialization
satisfies (a^0)^T a* ‖w*‖_2 = Θ(|1^T a*| ‖a*‖_2 ‖w*‖_2^2 / k) and cos φ^0 = Θ(1/√p).

^2 Õ(·) hides factors depending on 1^T a* ‖w*‖_2 and ‖a*‖_2 ‖w*‖_2.

Theorem 3.2 shows after generating a pair of random vectors (v, a), trying out all 4 sign com-
binations of (v, a), we can find the global minimum by gradient descent. Further, because the
initial signal is not too small, we only need to set the step size to be O(1/poly(k,
e p)) and the
number of iterations in phase I is at most O (poly (k, p)). Therefore, Theorem 3.1 and Theorem 3.2
e
together show that randomly initialized gradient descent learns an one-hidden-layer convolutional
neural network in polynomial time. The proof of the first part of Theorem 3.2 just uses the sym-
metry of unit sphere and ball and the second part is a standard application of random vector in
high-dimensional spaces. See Lemma 2.5 of [Hardt and Price, 2014] for example.
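The initialization scheme of Theorem 3.2 is easy to state in code. The sketch below draws (v, a) as in the theorem and enumerates the four sign combinations; for the demo we take all entries of a* positive so that 1^T a* > 0, an assumption made only for this illustration:

```python
import numpy as np

def four_candidates(p, k, a_star, w_star, rng):
    """Initialization of Theorem 3.2: v uniform on the unit sphere, a uniform
    in the ball B(0, |1^T a*| ||w*||_2 / sqrt(k)), then all four sign flips."""
    v = rng.standard_normal(p)
    v /= np.linalg.norm(v)
    radius = abs(np.sum(a_star)) * np.linalg.norm(w_star) / np.sqrt(k)
    u = rng.standard_normal(k)
    a = radius * rng.random() ** (1.0 / k) * u / np.linalg.norm(u)  # unif in ball
    return [(s * v, t * a) for s in (1.0, -1.0) for t in (1.0, -1.0)]

def phi(v, w):
    return np.arccos(np.clip(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)), -1, 1))

rng = np.random.default_rng(3)
p, k = 20, 25
w_star = rng.standard_normal(p)
a_star = np.abs(rng.standard_normal(k))   # demo assumption: 1^T a* > 0
cands = four_candidates(p, k, a_star, w_star, rng)
# One of the four sign combinations always has a^T a* > 0 and phi < pi/2;
# Theorem 3.2 shows the remaining condition 1^T a^0 <= 1^T a* holds as well.
good = [(v, a) for v, a in cands if a @ a_star > 0 and phi(v, w_star) < np.pi / 2]
assert len(good) >= 1
```

The sign argument is just symmetry: flipping v flips the side of the hemisphere, and flipping a flips the sign of a^T a*, so some combination satisfies both conditions simultaneously.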
Remark 1: For the second layer we use an O(1/√k)-type initialization, consistent with common initialization
techniques [Glorot and Bengio, 2010, He et al., 2015, LeCun et al., 1998].
Remark 2: The Gaussian input assumption is not necessarily true in practice, although it is
a common assumption in previous papers [Brutzkus and Globerson, 2017, Li and
Yuan, 2017, Zhong et al., 2017a,b, Tian, 2017, Choromanska et al., 2015a, Xie et al., 2017, Shalev-Shwartz
et al., 2017b] and is also considered plausible in [Choromanska et al., 2015b]. Our result
can be easily generalized to rotation-invariant distributions. However, extending it to more general
distributional assumptions, e.g., the structural conditions used in [Du et al., 2017b], remains a challenging
open problem.

3.1 Gradient Descent Can Converge to the Spurious Local Minimum


Theorem 3.2 shows that among {(v, a), (v, −a), (−v, a), (−v, −a)}, there is one pair that enables gradient
descent to converge to the global minimum. Perhaps surprisingly, the next theorem shows that, under
some conditions on the underlying truth, there is also a pair that makes gradient descent converge
to the spurious local minimum. See Figure 1b for the difference between the global minimum and
the spurious local minimum.
Theorem 3.3. Without loss of generality, let ‖w*‖_2 = 1. Suppose (1^T a*)^2 < (1/poly(p)) ‖a*‖_2^2
and η is sufficiently small. Let v ∼ unif(S^{p−1}) and a ∼ unif(B(0, |1^T a*| ‖w*‖_2 / √k)). Then with
high probability, there exists (v^0, a^0) ∈ {(v, a), (v, −a), (−v, a), (−v, −a)} such that (a^0)^T a* < 0,
1^T a^0 ≤ 1^T a* and g(φ^0) ≤ −2 (1^T a*)^2 / ‖a*‖_2^2 + 1. If (v^0, a^0) is used as the initialization, when Algorithm 1
converges, we have ℓ(v, a) = Ω(‖w*‖_2^2 ‖a*‖_2^2).

4 Overview of Proofs and Technical Insights


In this section we outline the main ideas for proving Theorem 3.1. In Section 4.1, we give
high-level qualitative intuition for why the initial conditions are sufficient for gradient descent to converge to
the global minimum. In Section 4.2, we explain why gradient descent has two phases.

7
4.1 Qualitative Analysis of Convergence
The convergence to the global optimum relies on a geometric characterization of stationary points and a
series of invariants maintained throughout the gradient descent dynamics. The next lemma gives the analysis
of stationary points. The proof is by checking the first-order condition of stationary points using
Theorem 2.2.

Lemma 4.1 (Stationary Point Analysis). When gradient descent converges, if a^T a* ≠ 0 and
‖v‖_2 < ∞, we have either

θ(v, w*) = 0, a = ‖w*‖_2 a*,  or  θ(v, w*) = π, a = (11^T + (π−1)I)^{−1} (11^T − I) ‖w*‖_2 a*.
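Both stationary points in Lemma 4.1 can be verified numerically against the population gradient formulas of Theorem 2.2. A NumPy sketch with arbitrary small dimensions:

```python
import numpy as np

def g(phi):
    return np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3

def population_grads(v, a, w_star, a_star):
    """Analytic population gradients from Theorem 2.2."""
    nv, nw = np.linalg.norm(v), np.linalg.norm(w_star)
    phi = np.arccos(np.clip(v @ w_star / (nv * nw), -1.0, 1.0))
    proj_w = w_star - (v @ w_star) * v / nv**2       # (I - vv^T/||v||^2) w*
    grad_v = -(a @ a_star) * (np.pi - phi) / (2 * np.pi * nv) * proj_w
    k = a.size
    M = np.ones((k, k)) + (np.pi - 1) * np.eye(k)    # 11^T + (pi-1) I
    N = np.ones((k, k)) + (g(phi) - 1) * np.eye(k)   # 11^T + (g(phi)-1) I
    grad_a = (M @ a - nw * N @ a_star) / (2 * np.pi)
    return grad_v, grad_a

rng = np.random.default_rng(6)
p, k = 5, 7
w_star, a_star = rng.standard_normal(p), rng.standard_normal(k)
nw = np.linalg.norm(w_star)
M = np.ones((k, k)) + (np.pi - 1) * np.eye(k)

# Global minimum: theta(v, w*) = 0 and a = ||w*||_2 a* (note g(0) = pi).
gv, ga = population_grads(2.0 * w_star, nw * a_star, w_star, a_star)
assert np.linalg.norm(gv) < 1e-10 and np.linalg.norm(ga) < 1e-10

# Spurious local minimum: theta(v, w*) = pi (where g(pi) = 0) and the stated a.
a_bad = np.linalg.solve(M, (np.ones((k, k)) - np.eye(k)) @ (nw * a_star))
gv_bad, ga_bad = population_grads(-2.0 * w_star, a_bad, w_star, a_star)
assert np.linalg.norm(gv_bad) < 1e-10 and np.linalg.norm(ga_bad) < 1e-10
```

At both points the gradient of v vanishes because the projection of w* orthogonal to v is zero when v is (anti-)parallel to w*; the two cases differ only in which linear system the second layer solves.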

This lemma shows that when the algorithm converges and a and a* are not orthogonal, we
arrive at either a globally optimal point or a local minimum. Now recall the gradient formula of
v: ∂ℓ(v, a)/∂v = −(1/(2π‖v‖_2)) (I − vv^T/‖v‖_2^2) a^T a* (π − φ) w*. Notice that φ ≤ π and that
I − vv^T/‖v‖_2^2 is just the projection matrix onto the complement of v. Therefore, the sign of the
inner product between a and a* plays a crucial role in the dynamics of Algorithm 1: if the inner
product is positive, the gradient update will decrease the angle between v and w*, and if it is
negative, the angle will increase. This observation is formalized in the lemma below.
Lemma 4.2 (Invariance I: The angle between v and w* always decreases). If (a^t)^T a* > 0, then
φ^{t+1} ≤ φ^t.
This lemma shows that if (a^t)^T a* > 0 for all t, we have convergence toward the global minimum.
Thus, we need to study the dynamics of (a^t)^T a*. For ease of presentation, without loss of
generality, we assume ‖w*‖_2 = 1. By the gradient formula of a, we have

(a^{t+1})^T a* = (1 − η(π−1)/(2π)) (a^t)^T a* + (η(g(φ^t)−1)/(2π)) ‖a*‖_2^2 + (η/(2π)) ((1^T a*)^2 − (1^T a^t)(1^T a*)).    (4)
We can use induction to prove the invariance. If (a^t)^T a* > 0 and φ^t < π/2, the first term of
Equation (4) is non-negative. For the second term, notice that if φ^t < π/2, we have g(φ^t) > 1, so the
second term is non-negative. Therefore, as long as (1^T a*)^2 − (1^T a^t)(1^T a*) is also non-negative,
we have the desired invariance. The next lemma summarizes the above analysis.
Lemma 4.3 (Invariance II: Positive Signal from the Second Layer). If (a^t)^T a* > 0, 0 ≤ 1^T a* ·
1^T a^t ≤ (1^T a*)^2, 0 < φ^t < π/2 and η < 2, then (a^{t+1})^T a* > 0.
It remains to prove (1^T a*)^2 − (1^T a^t)(1^T a*) > 0. Again, we study the dynamics of this
quantity. Using the gradient formula and some routine algebra, we have

1^T a^{t+1} · 1^T a* ≤ (1 − η(k+π−1)/(2π)) 1^T a^t · 1^T a* + (η(k+g(φ^t)−1)/(2π)) (1^T a*)^2
                     ≤ (1 − η(k+π−1)/(2π)) 1^T a^t · 1^T a* + (η(k+π−1)/(2π)) (1^T a*)^2,

where we have used the fact that g(φ) ≤ π for all 0 ≤ φ ≤ π/2. Therefore we have

(1^T a* − 1^T a^{t+1}) · 1^T a* ≥ (1 − η(k+π−1)/(2π)) (1^T a* − 1^T a^t) · 1^T a*.
Now we have the third invariance.
Lemma 4.4 (Invariance III: Summation of Second Layer Always Small). If 1^T a* · 1^T a^t ≤ (1^T a*)^2
and η < 2/(k+π−1), then 1^T a* · 1^T a^{t+1} ≤ (1^T a*)^2.
To sum up, if the initialization satisfies (1) φ^0 < π/2, (2) (a^0)^T a* > 0 and (3) 1^T a* · 1^T a^0 ≤
(1^T a*)^2, then with Lemmas 4.2, 4.3 and 4.4, by induction we can show the convergence to the global minimum.
Further, Theorem 3.2 shows these three conditions hold with constant probability under
random initialization.
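The three invariances can be observed directly by running Algorithm 1 on the population gradients of Theorem 2.2 from an initialization satisfying the three conditions. A NumPy sketch with ‖w*‖_2 = 1 (as in this section) and illustrative small dimensions; the step size is chosen so that η < 2/(k + π − 1) as Lemma 4.4 requires:

```python
import numpy as np

def g(phi):
    return np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3

def angle(v, w):
    return np.arccos(np.clip(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)), -1.0, 1.0))

def population_grads(v, a, w_star, a_star):
    # Population gradients of Theorem 2.2, specialized to ||w*||_2 = 1.
    nv = np.linalg.norm(v)
    phi = angle(v, w_star)
    proj_w = w_star - (v @ w_star) * v / nv**2        # (I - vv^T/||v||^2) w*
    grad_v = -(a @ a_star) * (np.pi - phi) / (2 * np.pi * nv) * proj_w
    grad_a = (np.sum(a) + (np.pi - 1) * a
              - np.sum(a_star) - (g(phi) - 1) * a_star) / (2 * np.pi)
    return grad_v, grad_a

rng = np.random.default_rng(7)
p, k, eta, T = 6, 10, 0.01, 500           # eta < 2/(k + pi - 1)
w_star = rng.standard_normal(p); w_star /= np.linalg.norm(w_star)
a_star = np.abs(rng.standard_normal(k))   # demo assumption: 1^T a* > 0
u = rng.standard_normal(p); u /= np.linalg.norm(u)
v = w_star + 0.5 * u                      # guarantees phi^0 < pi/2
a = 0.5 * a_star                          # a^T a* > 0 and 1^T a^0 <= 1^T a*
s_star = np.sum(a_star)
phi0 = angle(v, w_star)
phi = phi0
for _ in range(T):
    gv, ga = population_grads(v, a, w_star, a_star)
    v, a = v - eta * gv, a - eta * ga
    assert a @ a_star > 0                              # Invariance II
    assert angle(v, w_star) <= phi + 1e-9              # Invariance I
    assert np.sum(a) * s_star <= s_star**2 + 1e-9      # Invariance III
    phi = angle(v, w_star)
assert phi < phi0   # strict progress of the angle toward w*
```

The loop asserts the three invariances at every iteration, mirroring the induction in the proof sketch above.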

4.2 Quantitative Analysis of the Two-Phase Phenomenon


In this section we demonstrate why there is a two-phase phenomenon. Throughout this section,
we assume the conditions in Section 4.1 hold. We first consider the convergence of the first layer.
Because we are using weight normalization, only the angle between v and w* affects the prediction.
Therefore, we study the dynamics of sin^2 φ^t. The following lemma quantitatively
characterizes the shrinkage of this quantity in one iteration.
Lemma 4.5 (Convergence of the Angle between v and w*). Under the same assumptions as in
Theorem 3.1, let β^0 = min{(a^0)^T a*, (g(φ^0) − 1) ‖a*‖_2^2} ‖w*‖_2^2. If the step size satisfies

η = O( min{ β^0 cos φ^0 / ((‖a*‖_2^2 + (1^T a*)^2) ‖w*‖_2^2), cos^2 φ^0 / ((‖a*‖_2^2 + (1^T a*)^2) ‖w*‖_2^2 k), 1 } ),

we have

sin^2 φ^{t+1} ≤ (1 − η cos φ^t · λ^t) sin^2 φ^t,

where λ^t = ((π − φ^t) (a^t)^T a* ‖w*‖_2) / (2π ‖v^t‖_2^2).
This lemma shows that the convergence rate depends on two crucial quantities, cos φ^t and λ^t. At
the beginning, both cos φ^t and λ^t are small. Nevertheless, Lemma C.3 shows that λ^t is universally lower
bounded by Ω(β^0). Therefore, after O(1/(η cos φ^0 · β^0)) iterations we have cos φ^t = Ω(1). Once cos φ^t = Ω(1),
Lemma C.2 shows that, after O(1/η) iterations, (a^t)^T a* ‖w*‖_2 = Ω(‖w*‖_2^2 ‖a*‖_2^2). Combining the facts
that ‖v^t‖_2 ≤ 2 (Lemma C.3) and φ^t < π/2, we have cos φ^t · λ^t = Ω(‖w*‖_2^2 ‖a*‖_2^2). Now we enter Phase
II.
In Phase II, Lemma 4.5 shows

sin^2 φ^{t+1} ≤ (1 − ηC ‖w*‖_2^2 ‖a*‖_2^2) sin^2 φ^t

for some positive absolute constant C. Therefore, we have a much faster convergence rate than
in Phase I. After only O(1/(η ‖w*‖_2^2 ‖a*‖_2^2) · log(1/ε)) iterations, we obtain φ^t ≤ ε. Once we have this,
using Lemma C.4, Lemma C.5 and Lemma C.6, we can show that after Õ((1/η) log(1/ε)) additional iterations, the population
loss is smaller than ε ‖w*‖_2^2 ‖a*‖_2^2.

5 Conclusions and Future Works
In this paper we give the first polynomial convergence guarantee for a randomly initialized gradient
descent algorithm learning a one-hidden-layer convolutional neural network. Our result reveals
an interesting phenomenon: a randomly initialized local search algorithm can converge to a
global minimum or to a spurious local minimum, and both events have constant probability. We give
a complete quantitative characterization of the gradient descent dynamics to explain the two-phase
convergence phenomenon. Here we list some future directions.
Our analysis focused on the population loss with Gaussian input. In practice one uses (stochas-
tic) gradient descent on the empirical loss. Concentration results in [Mei et al., 2016, Soltanolkotabi,
2017, Daskalakis et al., 2016, Xu et al., 2016] are useful to generalize our results to the empirical
version. A more challenging question is how to extend the analysis of gradient dynamics beyond
rotationally invariant input distributions. Du et al. [2017b] proved the convergence of gradient
descent under some structural input distribution assumptions in the one-layer convolutional neural
network. It would be interesting to bring their insights to our setting.
Another interesting direction is to generalize our result to deeper and wider architectures.
Specifically, an open problem is under what conditions randomly initialized gradient descent algo-
rithms can learn one-hidden-layer fully connected neural network or a convolutional neural network
with multiple kernels. Existing results often require sufficiently good initialization [Zhong et al.,
2017a,b]. We believe the insights from this paper, especially the invariance principles in Section 4.1,
are helpful for understanding the behavior of gradient-based algorithms in these settings.

6 Acknowledgment
The authors thank Xiaolong Wang and Kai Zhong for useful discussions.

References
Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint
arXiv:1708.08694, 2017.

Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search
for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages
3873–3881, 2016.

Avrim Blum and Ronald L Rivest. Training a 3-node neural network is NP-complete. In Advances
in neural information processing systems, pages 494–501, 1989.

Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian
inputs. arXiv preprint arXiv:1702.07966, 2017.

Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-
convex optimization. arXiv preprint arXiv:1611.00756, 2016.

Youngmin Cho and Lawrence K Saul. Kernel methods for deep learning. In Advances in neural
information processing systems, pages 342–350, 2009.

Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, and Yann LeCun. The
loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204, 2015a.

Anna Choromanska, Yann LeCun, and Gérard Ben Arous. Open problem: The landscape of the
loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760, 2015b.

Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a
worst-case iteration complexity of O(ε^{−3/2}) for nonconvex optimization. Mathematical Programming,
pages 1–32, 2014.

Constantinos Daskalakis, Christos Tzamos, and Manolis Zampetakis. Ten steps of EM suffice for
mixtures of two gaussians. arXiv preprint arXiv:1609.00368, 2016.

Yann N Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated
convolutional networks. arXiv preprint arXiv:1612.08083, 2016.

Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Barnabas Poczos, and Aarti Singh. Gradient
descent can take exponential time to escape saddle points. arXiv preprint arXiv:1705.10412,
2017a.

Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv
preprint arXiv:1709.06129, 2017b.

Soheil Feizi, Hamid Javadi, Jesse Zhang, and David Tse. Porcupine neural networks:(almost) all
local optima are global. arXiv preprint arXiv:1710.02196, 2017.

C Daniel Freeman and Joan Bruna. Topology and geometry of half-rectified network optimization.
arXiv preprint arXiv:1611.01540, 2016.

Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points − online stochastic
gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory,
pages 797–842, 2015.

Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In
Advances in Neural Information Processing Systems, pages 2973–2981, 2016.

Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems:
A unified geometric analysis. In Proceedings of the 34th International Conference on Machine
Learning, pages 1233–1242, 2017a.

Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape
design. arXiv preprint arXiv:1711.00501, 2017b.

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence
and Statistics, pages 249–256, 2010.

Surbhi Goel and Adam Klivans. Eigenvalue decay implies polynomial-time learnability for neural
networks. arXiv preprint arXiv:1708.03708, 2017a.

Surbhi Goel and Adam Klivans. Learning depth-three neural networks in polynomial time. arXiv
preprint arXiv:1709.06010, 2017b.

Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the ReLU in
polynomial time. arXiv preprint arXiv:1611.10258, 2016.

Benjamin D Haeffele and René Vidal. Global optimality in tensor factorization, deep learning, and
beyond. arXiv preprint arXiv:1506.07540, 2015.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231,
2016.

Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. In
Advances in Neural Information Processing Systems, pages 2861–2869, 2014.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing
human-level performance on imagenet classification. In Proceedings of the IEEE international
conference on computer vision, pages 1026–1034, 2015.

Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-convexity:
Guaranteed training of neural networks using tensor methods. arXiv preprint arXiv:1506.08473,
2015.

Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to es-
cape saddle points efficiently. In Proceedings of the 34th International Conference on Machine
Learning, pages 1724–1732, 2017.

Kenji Kawaguchi. Deep learning without poor local minima. In Advances In Neural Information
Processing Systems, pages 586–594, 2016.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolu-
tional neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.

Yann LeCun, Léon Bottou, Genevieve B Orr, and Klaus-Robert Müller. Efficient backprop. In
Neural networks: Tricks of the trade, pages 9–50. Springer, 1998.

Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only
converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.

Kfir Y Levy. The power of normalization: Faster evasion of saddle points. arXiv preprint
arXiv:1611.04831, 2016.

Xingguo Li, Zhaoran Wang, Junwei Lu, Raman Arora, Jarvis Haupt, Han Liu, and Tuo Zhao.
Symmetry, saddle points, and global geometry of nonconvex matrix factorization. arXiv preprint
arXiv:1612.09296, 2016.

Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with ReLU activa-
tion. arXiv preprint arXiv:1705.09886, 2017.

Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training
neural networks. In Advances in Neural Information Processing Systems, pages 855–863, 2014.

Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for non-convex losses.
arXiv preprint arXiv:1607.06534, 2016.

Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global perfor-
mance. Mathematical Programming, 108(1):177–205, 2006.

Behnam Neyshabur, Ruslan R Salakhutdinov, and Nati Srebro. Path-SGD: Path-normalized opti-
mization in deep neural networks. In Advances in Neural Information Processing Systems, pages
2422–2430, 2015.

Quynh Nguyen and Matthias Hein. The loss surface of deep and wide neural networks. arXiv
preprint arXiv:1704.08045, 2017a.

Quynh Nguyen and Matthias Hein. The loss surface and expressivity of deep convolutional neural
networks. arXiv preprint arXiv:1710.10928, 2017b.

Dohyung Park, Anastasios Kyrillidis, Constantine Carmanis, and Sujay Sanghavi. Non-square
matrix sensing without spurious local minima via the Burer-Monteiro approach. In Artificial
Intelligence and Statistics, pages 65–74, 2017.

Itay Safran and Ohad Shamir. On the quality of the initial basin in overspecified neural networks.
In International Conference on Machine Learning, pages 774–782, 2016.

Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to ac-
celerate training of deep neural networks. In Advances in Neural Information Processing Systems,
pages 901–909, 2016.

Hanie Sedghi and Anima Anandkumar. Provable methods for training neural networks with sparse
connectivity. arXiv preprint arXiv:1412.2693, 2014.

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning.
In International Conference on Machine Learning, pages 3067–3075, 2017a.

Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Weight sharing is crucial to succesful
optimization. arXiv preprint arXiv:1706.00687, 2017b.

Ohad Shamir. Distribution-specific hardness of learning neural networks. arXiv preprint
arXiv:1609.01037, 2016.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering
the game of go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Jiří Šíma. Training a single sigmoidal neuron is hard. Neural Computation, 14(11):2709–2728, 2002.

Mahdi Soltanolkotabi. Learning ReLUs via gradient descent. arXiv preprint arXiv:1705.04591,
2017.

Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In Information Theory
(ISIT), 2016 IEEE International Symposium on, pages 2379–2383. IEEE, 2016.

Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and
the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.

Yuandong Tian. An analytical formula of population gradient for two-layered ReLU network and its
applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017.

Stephen Tu, Ross Boczar, Max Simchowitz, Mahdi Soltanolkotabi, and Benjamin Recht. Low-rank
solutions of linear matrix equations via Procrustes flow. In Proceedings of the 33rd International
Conference on International Conference on Machine Learning-Volume 48, pages 964–973. JMLR.
org, 2016.

Bo Xie, Yingyu Liang, and Le Song. Diverse neural network learns true target functions. In
Artificial Intelligence and Statistics, pages 1216–1224, 2017.

Ji Xu, Daniel J Hsu, and Arian Maleki. Global analysis of expectation maximization for mixtures
of two gaussians. In Advances in Neural Information Processing Systems, pages 2676–2684, 2016.

Yuchen Zhang, Jason D Lee, Martin J Wainwright, and Michael I Jordan. Learning halfspaces and
neural networks with random initialization. arXiv preprint arXiv:1511.07948, 2015.

Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolutional neural
networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017a.

Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees
for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017b.

Pan Zhou and Jiashi Feng. The landscape of deep learning algorithms. arXiv preprint
arXiv:1705.07038, 2017.

A Proofs of Section 2
Proof of Theorem 2.1. We first expand the loss function directly:
$$\ell(v, a) = \frac{1}{2}\,\mathbb{E}\Big[\big(y - a^\top \sigma(Zw)\big)^2\Big]$$
$$= \frac{1}{2}\Big[(a^*)^\top \mathbb{E}\big[\sigma(Zw^*)\sigma(Zw^*)^\top\big]\, a^* + a^\top \mathbb{E}\big[\sigma(Zw)\sigma(Zw)^\top\big]\, a - 2\, a^\top \mathbb{E}\big[\sigma(Zw)\sigma(Zw^*)^\top\big]\, a^*\Big]$$
$$= \frac{1}{2}\Big[(a^*)^\top A(w^*)\, a^* + a^\top A(w)\, a - 2\, a^\top B(w, w^*)\, a^*\Big],$$
where for simplicity we denote
$$A(w) = \mathbb{E}\big[\sigma(Zw)\sigma(Zw)^\top\big], \qquad (5)$$
$$B(w, w^*) = \mathbb{E}\big[\sigma(Zw)\sigma(Zw^*)^\top\big]. \qquad (6)$$
For $i \neq j$, using the second identity of Lemma A.1 together with the independence of $Z_i$ and $Z_j$, we can compute
$$A(w)_{ij} = \mathbb{E}\big[\sigma(Z_i^\top w)\big]\,\mathbb{E}\big[\sigma(Z_j^\top w)\big] = \frac{1}{2\pi}\|w\|_2^2.$$
For $i = j$, using the second-moment formula of the half-Gaussian distribution, we can compute
$$A(w)_{ii} = \frac{1}{2}\|w\|_2^2.$$
Therefore
$$A(w) = \frac{\|w\|_2^2}{2\pi}\big(\mathbf{1}\mathbf{1}^\top + (\pi - 1)\, I\big).$$
Now let us compute $B(w, w^*)$. For $i \neq j$, similarly to $A(w)_{ij}$, using the independence properties of the Gaussian, we have
$$B(w, w^*)_{ij} = \frac{1}{2\pi}\|w\|_2\|w^*\|_2.$$
Next, using the fourth identity of Lemma A.1, we have
$$B(w, w^*)_{ii} = \frac{1}{2\pi}\big(\cos\phi\,(\pi - \phi + \sin\phi\cos\phi) + \sin^3\phi\big)\,\|w\|_2\|w^*\|_2.$$
Therefore, we can also write $B(w, w^*)$ in a compact form:
$$B(w, w^*) = \frac{\|w\|_2\|w^*\|_2}{2\pi}\Big(\mathbf{1}\mathbf{1}^\top + \big(\cos\phi\,(\pi - \phi + \sin\phi\cos\phi) + \sin^3\phi - 1\big)\, I\Big).$$
Plugging in the formulas for $A(w)$ and $B(w, w^*)$ and $w = \frac{v}{\|v\|_2}$, we obtain the desired result.
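The closed forms for $A(w)$ and $B(w, w^*)$ can be checked numerically. The sketch below estimates both matrices by Monte Carlo and compares them with the derived expressions; all concrete values ($k$, $d$, the sample size, and the two unit directions) are arbitrary test choices, not values from the paper:

```python
import numpy as np

# Monte Carlo sanity check of the closed forms for A(w) and B(w, w*) derived above.
# All concrete values (k, d, n, the directions w and w*) are arbitrary test choices.
rng = np.random.default_rng(0)
k, d, n = 3, 4, 400_000

w = np.array([1.0, 2.0, -1.0, 0.5]);  w /= np.linalg.norm(w)
w_star = np.array([0.5, 1.0, 1.0, -0.2]);  w_star /= np.linalg.norm(w_star)
phi = np.arccos(w @ w_star)

Z = rng.standard_normal((n, k, d))       # rows Z_i are i.i.d. Gaussian patches
s = np.maximum(Z @ w, 0.0)               # sigma(Z w), shape (n, k)
s_star = np.maximum(Z @ w_star, 0.0)     # sigma(Z w*)

A_hat = s.T @ s / n                      # estimate of E[sigma(Zw) sigma(Zw)^T]
B_hat = s.T @ s_star / n                 # estimate of E[sigma(Zw) sigma(Zw*)^T]

ones = np.ones((k, k))
A_closed = (ones + (np.pi - 1) * np.eye(k)) / (2 * np.pi)      # ||w||_2 = 1
g = np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3
B_closed = (ones + (g - 1) * np.eye(k)) / (2 * np.pi)          # ||w||_2 = ||w*||_2 = 1

err_A = np.abs(A_hat - A_closed).max()
err_B = np.abs(B_hat - B_closed).max()
print(err_A, err_B)
```

For this sample size the entrywise errors should sit at the Monte Carlo noise level (a few times $10^{-3}$), consistent with both closed forms.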

Proof of Theorem 2.2. We first compute the expected gradient with respect to $v$. From [Salimans and Kingma, 2016], we know
$$\frac{\partial \ell(v, a)}{\partial v} = \frac{1}{\|v\|_2}\left(I - \frac{vv^\top}{\|v\|_2^2}\right)\frac{\partial \ell(w, a)}{\partial w}.$$
Recall the gradient formula:
$$\frac{\partial \ell(Z, w, a)}{\partial w} = \left(\sum_{i=1}^k a_i\,\sigma(Z_i^\top w) - \sum_{i=1}^k a_i^*\,\sigma(Z_i^\top w^*)\right)\sum_{i=1}^k a_i Z_i\,\mathbb{I}\{Z_i^\top w \ge 0\}$$
$$= \left(\sum_{i=1}^k a_i^2\, Z_i Z_i^\top\,\mathbb{I}\{Z_i^\top w \ge 0\} + \sum_{i \neq j} a_i a_j\, Z_i Z_j^\top\,\mathbb{I}\{Z_i^\top w \ge 0,\, Z_j^\top w \ge 0\}\right) w \qquad (7)$$
$$\; - \left(\sum_{i=1}^k a_i a_i^*\, Z_i Z_i^\top\,\mathbb{I}\{Z_i^\top w \ge 0,\, Z_i^\top w^* \ge 0\} + \sum_{i \neq j} a_i a_j^*\, Z_i Z_j^\top\,\mathbb{I}\{Z_i^\top w \ge 0,\, Z_j^\top w^* \ge 0\}\right) w^*. \qquad (8)$$
Now we calculate the expectations of (7) and (8) separately. For (7), by the first two identities of Lemma A.1, we have
$$\mathbb{E}\left[\left(\sum_{i=1}^k a_i^2\, Z_i Z_i^\top\,\mathbb{I}\{Z_i^\top w \ge 0\} + \sum_{i \neq j} a_i a_j\, Z_i Z_j^\top\,\mathbb{I}\{Z_i^\top w \ge 0,\, Z_j^\top w \ge 0\}\right) w\right] = \sum_{i=1}^k a_i^2\,\frac{w}{2} + \sum_{i \neq j} a_i a_j\,\frac{w}{2\pi}.$$
For (8), we use the second and third identities of Lemma A.1 to obtain
$$\mathbb{E}\left[\left(\sum_{i=1}^k a_i a_i^*\, Z_i Z_i^\top\,\mathbb{I}\{Z_i^\top w \ge 0,\, Z_i^\top w^* \ge 0\} + \sum_{i \neq j} a_i a_j^*\, Z_i Z_j^\top\,\mathbb{I}\{Z_i^\top w \ge 0,\, Z_j^\top w^* \ge 0\}\right) w^*\right]$$
$$= a^\top a^*\left(\frac{\pi - \phi}{2\pi}\, w^* + \frac{\sin\phi}{2\pi}\cdot\frac{\|w^*\|_2}{\|w\|_2}\, w\right) + \sum_{i \neq j} a_i a_j^*\,\frac{\|w^*\|_2}{2\pi\|w\|_2}\, w.$$
In summary, aggregating them together (expectation of (7) minus expectation of (8)), we have
$$\mathbb{E}_Z\left[\frac{\partial \ell(Z, w, a)}{\partial w}\right] = -\frac{a^\top a^* (\pi - \phi)}{2\pi}\, w^* + \left(\frac{\|a\|_2^2}{2} + \frac{\sum_{i \neq j} a_i a_j}{2\pi} - \frac{a^\top a^* \sin\phi}{2\pi}\cdot\frac{\|w^*\|_2}{\|w\|_2} - \frac{\sum_{i \neq j} a_i a_j^*}{2\pi}\cdot\frac{\|w^*\|_2}{\|w\|_2}\right) w.$$
As a sanity check, this formula matches Equation (16) of [Brutzkus and Globerson, 2017] when $a = a^* = \mathbf{1}$.

Next, we calculate the expected gradient with respect to $a$. Recall the gradient formula
$$\frac{\partial \ell(Z, w, a)}{\partial a} = \left(a^\top \sigma(Zw) - (a^*)^\top \sigma(Zw^*)\right)\sigma(Zw) = \sigma(Zw)\sigma(Zw)^\top a - \sigma(Zw)\sigma(Zw^*)^\top a^*.$$
Taking expectation, we have
$$\frac{\partial \ell(w, a)}{\partial a} = A(w)\, a - B(w, w^*)\, a^*,$$
where $A(w)$ and $B(w, w^*)$ are defined in Equations (5) and (6). Plugging in the formulas for $A(w)$ and $B(w, w^*)$ derived in the proof of Theorem 2.1, we obtain the desired result.
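The aggregated formula for $\mathbb{E}_Z[\partial\ell/\partial w]$ can also be checked by averaging the per-sample gradient. The sketch below does so; all concrete values ($k$, $d$, $a$, $a^*$, and the unit directions) are arbitrary test choices:

```python
import numpy as np

# Monte Carlo check of the aggregated formula for E_Z[d ell / d w] above. All concrete
# values (k, d, n, a, a*, and the unit directions w, w*) are arbitrary test choices.
rng = np.random.default_rng(1)
k, d, n = 3, 4, 400_000

w = np.array([1.0, 0.5, -0.5, 2.0]);  w /= np.linalg.norm(w)
w_star = np.array([0.2, 1.0, 0.3, 1.0]);  w_star /= np.linalg.norm(w_star)
phi = np.arccos(w @ w_star)
a = np.array([1.0, -0.5, 2.0])
a_star = np.array([0.5, 1.0, 1.5])

Z = rng.standard_normal((n, k, d))
resid = np.maximum(Z @ w, 0.0) @ a - np.maximum(Z @ w_star, 0.0) @ a_star
act = (Z @ w >= 0).astype(float)                       # I{Z_i^T w >= 0}
multiplier = np.einsum('nk,nkd->nd', a * act, Z)       # sum_i a_i I{...} Z_i
grad_mc = (resid[:, None] * multiplier).mean(axis=0)   # average per-sample gradient

S_a, S_as = a.sum(), a_star.sum()
off_aa = S_a ** 2 - a @ a                              # sum_{i != j} a_i a_j
off_aas = S_a * S_as - a @ a_star                      # sum_{i != j} a_i a_j^*
coef_w = (a @ a) / 2 + off_aa / (2 * np.pi) \
    - (a @ a_star) * np.sin(phi) / (2 * np.pi) - off_aas / (2 * np.pi)
grad_closed = coef_w * w - (a @ a_star) * (np.pi - phi) / (2 * np.pi) * w_star
err = np.abs(grad_mc - grad_closed).max()
print(err)
```

The componentwise error should again be at the Monte Carlo noise level, confirming in particular the sign structure of the $w^*$ and $w$ terms.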

Lemma A.1 (Useful Identities). Given $w, w^*$ with angle $\phi$ between them, and $z$ a standard Gaussian random vector, then
$$\mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0\}\big]\, w = \frac{1}{2}\, w,$$
$$\mathbb{E}\big[z\,\mathbb{I}\{z^\top w \ge 0\}\big] = \frac{1}{\sqrt{2\pi}}\,\frac{w}{\|w\|_2},$$
$$\mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0,\, z^\top w^* \ge 0\}\big]\, w^* = \frac{1}{2\pi}(\pi - \phi)\, w^* + \frac{1}{2\pi}\sin\phi\,\frac{\|w^*\|_2}{\|w\|_2}\, w,$$
$$\mathbb{E}\big[\sigma(z^\top w)\,\sigma(z^\top w^*)\big] = \frac{1}{2\pi}\big(\cos\phi\,(\pi - \phi + \sin\phi\cos\phi) + \sin^3\phi\big)\,\|w\|_2\|w^*\|_2.$$

Proof. Consider an orthonormal basis $\{e_i e_j^\top\}$ of $\mathbb{R}^{d \times d}$ with $e_1 \parallel w$. Then for $i \neq j$, we know
$$\langle e_i e_j^\top,\; \mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0\}\big]\rangle = 0$$
by the independence properties of the Gaussian random vector. For $i = j = 1$,
$$\langle e_1 e_1^\top,\; \mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0\}\big]\rangle = \mathbb{E}\big[(z^\top e_1)^2\,\mathbb{I}\{z^\top w \ge 0\}\big] = \frac{1}{2},$$
where the last step is by the second moment of the half-Gaussian. For $i = j \neq 1$, $\langle e_i e_i^\top, \mathbb{E}[z z^\top \mathbb{I}\{z^\top w \ge 0\}]\rangle = \frac{1}{2}$ by the standard Gaussian second-moment formula. Therefore $\mathbb{E}[z z^\top \mathbb{I}\{z^\top w \ge 0\}] = \frac{1}{2} I$, so $\mathbb{E}[z z^\top \mathbb{I}\{z^\top w \ge 0\}]\, w = \frac{1}{2} w$. The identity $\mathbb{E}[z\,\mathbb{I}\{z^\top w \ge 0\}] = \frac{1}{\sqrt{2\pi}}\frac{w}{\|w\|_2}$ follows from the mean formula of the half-normal distribution.

To prove the third identity, consider an orthonormal basis $\{e_i e_j^\top\}$ with $e_1 \parallel w^*$ and $w$ lying in the plane spanned by $e_1$ and $e_2$. Using the polar representation of the two-dimensional Gaussian ($r$ the radius and $\theta$ the angle, with $dP_r = r\exp(-r^2/2)\,dr$ and $dP_\theta = \frac{1}{2\pi}\,d\theta$):
$$\langle e_1 e_1^\top,\; \mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0, z^\top w^* \ge 0\}\big]\rangle = \frac{1}{2\pi}\int_0^\infty r^3 e^{-r^2/2}\,dr \cdot \int_{-\pi/2 + \phi}^{\pi/2} \cos^2\theta\,d\theta = \frac{1}{2\pi}\big(\pi - \phi + \sin\phi\cos\phi\big),$$
$$\langle e_1 e_2^\top,\; \mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0, z^\top w^* \ge 0\}\big]\rangle = \frac{1}{2\pi}\int_0^\infty r^3 e^{-r^2/2}\,dr \cdot \int_{-\pi/2 + \phi}^{\pi/2} \sin\theta\cos\theta\,d\theta = \frac{1}{2\pi}\sin^2\phi,$$
$$\langle e_2 e_2^\top,\; \mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0, z^\top w^* \ge 0\}\big]\rangle = \frac{1}{2\pi}\int_0^\infty r^3 e^{-r^2/2}\,dr \cdot \int_{-\pi/2 + \phi}^{\pi/2} \sin^2\theta\,d\theta = \frac{1}{2\pi}\big(\pi - \phi - \sin\phi\cos\phi\big).$$
Also note that $e_2 = \frac{\bar{w} - \cos\phi\, e_1}{\sin\phi}$, where $\bar{w} = \frac{w}{\|w\|_2}$. Therefore
$$\mathbb{E}\big[z z^\top\,\mathbb{I}\{z^\top w \ge 0, z^\top w^* \ge 0\}\big]\, w^* = \frac{1}{2\pi}\big(\pi - \phi + \sin\phi\cos\phi\big)\, w^* + \frac{1}{2\pi}\sin^2\phi\cdot\frac{\bar{w} - \cos\phi\, e_1}{\sin\phi}\,\|w^*\|_2 = \frac{1}{2\pi}(\pi - \phi)\, w^* + \frac{1}{2\pi}\sin\phi\,\frac{\|w^*\|_2}{\|w\|_2}\, w.$$
For the fourth identity, focusing on the plane spanned by $w$ and $w^*$ and using the polar decomposition, we have
$$\mathbb{E}\big[\sigma(z^\top w)\,\sigma(z^\top w^*)\big] = \frac{1}{2\pi}\int_0^\infty r^3 e^{-r^2/2}\,dr \cdot \int_{-\pi/2 + \phi}^{\pi/2} (\cos\theta\cos\phi + \sin\theta\sin\phi)\cos\theta\,d\theta\;\|w\|_2\|w^*\|_2$$
$$= \frac{1}{2\pi}\big(\cos\phi\,(\pi - \phi + \sin\phi\cos\phi) + \sin^3\phi\big)\,\|w\|_2\|w^*\|_2.$$
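The third identity is the least obvious one; it can be checked by simulation in two dimensions (the identity is planar, so $d = 2$ suffices). The vectors below are arbitrary test choices:

```python
import numpy as np

# Monte Carlo check of the third identity of Lemma A.1. The identity is planar, so
# d = 2 suffices; the vectors w and w* below are arbitrary test choices.
rng = np.random.default_rng(2)
n = 1_000_000
w = np.array([2.0, 1.0])
w_star = np.array([0.5, 1.5])
phi = np.arccos(w @ w_star / (np.linalg.norm(w) * np.linalg.norm(w_star)))

z = rng.standard_normal((n, 2))
mask = (z @ w >= 0) & (z @ w_star >= 0)
lhs = (z * ((z @ w_star) * mask)[:, None]).mean(axis=0)   # estimate of E[z z^T I{...}] w*

rhs = ((np.pi - phi) / (2 * np.pi)) * w_star \
    + (np.sin(phi) / (2 * np.pi)) * (np.linalg.norm(w_star) / np.linalg.norm(w)) * w
print(lhs, rhs)
```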

B Proofs of Qualitative Convergence Results

Proof of Lemma 4.1. When Algorithm 1 converges, since $a^\top a^* \neq 0$ and $\|v\|_2 < \infty$, the gradient formula in Theorem 2.2 shows that either $\pi - \phi = 0$ or $\big(I - \frac{vv^\top}{\|v\|_2^2}\big) w^* = 0$. In the second case, since $I - \frac{vv^\top}{\|v\|_2^2}$ is the projection matrix onto the orthogonal complement of $v$, the condition $\big(I - \frac{vv^\top}{\|v\|_2^2}\big) w^* = 0$ is equivalent to $\theta(v, w^*) = 0$. Once the angle between $v$ and $w^*$ is fixed, using the gradient formula for $a$ we obtain the stated formulas for the saddle points.

Proof of Lemma 4.2. By the gradient formula for $v$, if $a^\top a^* > 0$, the gradient is of the form $-c\big(I - \frac{vv^\top}{\|v\|_2^2}\big) w^*$ where $c > 0$. Because $I - \frac{vv^\top}{\|v\|_2^2}$ is the projection matrix onto the orthogonal complement of $v$, each gradient descent update adds a positive multiple of the component of $w^*$ orthogonal to $v$, so the update always makes the angle smaller.
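The mechanism in Lemma 4.2 can be seen in a few lines: repeatedly adding a positive multiple of $(I - vv^\top/\|v\|_2^2)\, w^*$ to $v$ monotonically shrinks the angle. All values below are arbitrary test choices:

```python
import numpy as np

# Deterministic illustration of Lemma 4.2: stepping v in the direction
# +(I - vv^T/||v||^2) w* (i.e., descending a gradient of the form -c (I - P) w*)
# monotonically shrinks the angle theta(v, w*). All values are arbitrary test choices.
w_star = np.array([1.0, 0.0, 0.0])
v = np.array([0.3, 1.0, -0.8])
eta_c = 0.1                                  # plays the role of eta * c > 0

angles = []
for _ in range(200):
    proj = np.eye(3) - np.outer(v, v) / (v @ v)
    v = v + eta_c * (proj @ w_star)
    angles.append(np.arccos(v @ w_star / np.linalg.norm(v)))
print(angles[0], angles[-1])
```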

C Proofs of Quantitative Convergence Results

C.1 Useful Technical Lemmas

We first prove the lemma about the convergence of $\phi^t$.

Proof of Lemma 4.5. Recall $\lambda^t = \frac{\|w^*\|_2\,(a^t)^\top a^* (\pi - \phi^t)}{2\pi \|v^t\|_2^2}$ and that the gradient with respect to $v$,
$$\frac{\partial \ell}{\partial v^t} = -\frac{(a^t)^\top a^* (\pi - \phi^t)}{2\pi \|v^t\|_2}\Big(I - \frac{v^t (v^t)^\top}{\|v^t\|_2^2}\Big) w^*,$$
is orthogonal to $v^t$. We consider the dynamics of $\sin^2\phi^t$:
$$\sin^2\phi^{t+1} = 1 - \frac{\big((v^{t+1})^\top w^*\big)^2}{\|v^{t+1}\|_2^2\,\|w^*\|_2^2} = 1 - \frac{\Big((v^t)^\top w^* + \eta\,\frac{(a^t)^\top a^*(\pi - \phi^t)}{2\pi\|v^t\|_2}\sin^2\phi^t\,\|w^*\|_2^2\Big)^2}{\Big(\|v^t\|_2^2 + \eta^2\big\|\frac{\partial \ell}{\partial v^t}\big\|_2^2\Big)\|w^*\|_2^2}.$$
Using $(v^t)^\top w^* = \|v^t\|_2 \|w^*\|_2 \cos\phi^t$ and $\big\|\frac{\partial \ell}{\partial v^t}\big\|_2^2 = \Big(\frac{(a^t)^\top a^*(\pi - \phi^t)}{2\pi\|v^t\|_2}\Big)^2 \sin^2\phi^t\,\|w^*\|_2^2$, and dividing numerator and denominator by $\|v^t\|_2^2\|w^*\|_2^2$, we obtain
$$\sin^2\phi^{t+1} = 1 - \frac{\big(\cos\phi^t + \eta\lambda^t \sin^2\phi^t\big)^2}{1 + \eta^2 (\lambda^t)^2 \sin^2\phi^t} = \frac{\big(1 - \eta\lambda^t\cos\phi^t\big)^2}{1 + \eta^2(\lambda^t)^2\sin^2\phi^t}\,\sin^2\phi^t \le \Big(1 - 2\eta\lambda^t\cos\phi^t + \eta^2(\lambda^t)^2\Big)\sin^2\phi^t, \qquad (9)$$
where in the inequality we dropped the denominator (which is at least $1$) and used $\cos^2\phi^t \le 1$.

To make this a contraction we need $\eta \le \frac{\cos\phi^t}{\lambda^t}$, since then $\eta^2(\lambda^t)^2 \le \eta\lambda^t\cos\phi^t$. Note that $\|v^t\|_2$ is monotonically increasing, so it is lower bounded by $\|v^0\|_2 \ge 1$. Next, $\phi^t \le \pi/2$ implies $\pi - \phi^t \le \pi$. Finally, from Lemma C.2, we know $(a^t)^\top a^* \le \big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^2$. Combining these, we have the upper bound
$$\lambda^t \le \frac{\big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^3}{2}.$$
Plugging this back into Equation (9) and using our assumption on $\eta$, we have
$$\sin^2\phi^{t+1} \le \big(1 - \eta\cos\phi^t\,\lambda^t\big)\sin^2\phi^t.$$
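The one-step bound can be iterated numerically to see the linear convergence it implies. The sketch below freezes the rate $\lambda$ for simplicity; $\eta$ and $\lambda$ are arbitrary values satisfying $\eta \le \cos\phi^t/\lambda$:

```python
import numpy as np

# Deterministic iteration of the sin^2(phi) recursion bound (Equation (9)) with the
# rate lambda frozen; eta and lambda are arbitrary values satisfying eta <= cos(phi)/lambda.
eta, lam = 0.1, 0.5
s = np.sin(1.4) ** 2                     # sin^2(phi^0), phi^0 close to pi/2
history = [s]
for _ in range(500):
    c = np.sqrt(1.0 - s)                 # cos(phi^t) for phi^t in [0, pi/2]
    s = (1 - 2 * eta * lam * c + (eta * lam) ** 2) * s
    history.append(s)
print(history[0], history[-1])
```

Note the contraction factor improves as $\cos\phi^t$ grows, which is exactly the two-phase behavior: slow progress near $\phi \approx \pi/2$, then fast linear convergence.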




Lemma C.1. $(a^{t+1})^\top a^* \ge \min\Big\{(a^t)^\top a^* + \frac{\eta(\pi-1)}{2\pi}\Big(\frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2 - (a^t)^\top a^*\Big),\; \frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2\Big\}$.

Proof. Recall the dynamics of $(a^t)^\top a^*$:
$$(a^{t+1})^\top a^* = \Big(1 - \frac{\eta(\pi-1)}{2\pi}\Big)(a^t)^\top a^* + \frac{\eta\big(g(\phi^t)-1\big)}{2\pi}\|a^*\|_2^2 + \frac{\eta}{2\pi}\Big((\mathbf{1}^\top a^*)^2 - (\mathbf{1}^\top a^t)(\mathbf{1}^\top a^*)\Big)$$
$$\ge \Big(1 - \frac{\eta(\pi-1)}{2\pi}\Big)(a^t)^\top a^* + \frac{\eta\big(g(\phi^t)-1\big)}{2\pi}\|a^*\|_2^2,$$
where the inequality is due to Lemma 4.4. If $(a^t)^\top a^* \ge \frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2$, then
$$(a^{t+1})^\top a^* \ge \Big(1 - \frac{\eta(\pi-1)}{2\pi}\Big)\frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2 + \frac{\eta\big(g(\phi^t)-1\big)}{2\pi}\|a^*\|_2^2 = \frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2.$$
If $(a^t)^\top a^* \le \frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2$, simple algebra shows $(a^{t+1})^\top a^*$ increases by at least
$$\frac{\eta(\pi-1)}{2\pi}\Big(\frac{g(\phi^t)-1}{\pi-1}\|a^*\|_2^2 - (a^t)^\top a^*\Big).$$

A simple corollary is that $a^\top a^*$ is uniformly lower bounded.

Corollary C.1. For all $t = 1, 2, \ldots$, $(a^t)^\top a^* \ge \min\Big\{(a^0)^\top a^*,\; \frac{g(\phi^0)-1}{\pi-1}\|a^*\|_2^2\Big\}$.

This lemma also gives an upper bound on the number of iterations needed to make $a^\top a^* = \Theta\big(\|a^*\|_2^2\big)$.

Corollary C.2. If $g(\phi) - 1 = \Omega(1)$, then after $O\big(\frac{1}{\eta}\big)$ iterations, $a^\top a^* = \Theta\big(\|a^*\|_2^2\big)$.

Proof. Note that if $g(\phi) - 1 = \Omega(1)$ and $a^\top a^* \le \frac{1}{2}\cdot\frac{g(\phi)-1}{\pi-1}\|a^*\|_2^2$, then by Lemma C.1 each iteration increases $a^\top a^*$ by at least $\frac{\eta(\pi-1)}{2\pi}\cdot\frac{1}{2}\cdot\frac{g(\phi)-1}{\pi-1}\|a^*\|_2^2 = \Omega\big(\eta\|a^*\|_2^2\big)$.

We also need an upper bound on $(a^t)^\top a^*$.

Lemma C.2. For $t = 0, 1, \ldots$, $(a^t)^\top a^* \le \big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^2$.

Proof. Without loss of generality, assume $\|w^*\|_2 = 1$. Again, recall the dynamics of $(a^t)^\top a^*$:
$$(a^{t+1})^\top a^* = \Big(1 - \frac{\eta(\pi-1)}{2\pi}\Big)(a^t)^\top a^* + \frac{\eta\big(g(\phi^t)-1\big)}{2\pi}\|a^*\|_2^2 + \frac{\eta}{2\pi}\Big((\mathbf{1}^\top a^*)^2 - (\mathbf{1}^\top a^t)(\mathbf{1}^\top a^*)\Big)$$
$$\le \Big(1 - \frac{\eta(\pi-1)}{2\pi}\Big)(a^t)^\top a^* + \frac{\eta(\pi-1)}{2\pi}\|a^*\|_2^2 + \frac{\eta(\pi-1)}{2\pi}(\mathbf{1}^\top a^*)^2.$$
Now we prove by induction. Suppose the conclusion holds at iteration $t$, i.e., $(a^t)^\top a^* \le \|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2$. Plugging this in, the right-hand side is a convex combination of $(a^t)^\top a^*$ and $\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2$, so $(a^{t+1})^\top a^* \le \|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2$, which is the desired result.
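Both the lower bound of Corollary C.1 and the upper bound of Lemma C.2 can be observed on the exact population dynamics of $a$ (unit norms, angle frozen for simplicity). All concrete values are arbitrary test choices:

```python
import numpy as np

# Exact population dynamics of a (unit ||w||, ||w*||, frozen angle phi), checking the
# lower bound of Corollary C.1 and the upper bound of Lemma C.2 along the trajectory.
# k, eta, phi, a^0, and a* are arbitrary test choices.
k, eta, phi = 4, 0.05, 1.0
ones = np.ones((k, k))
g = np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3
A = (ones + (np.pi - 1) * np.eye(k)) / (2 * np.pi)
B = (ones + (g - 1) * np.eye(k)) / (2 * np.pi)

a_star = np.array([1.0, 0.5, -0.3, 0.8])
a = 0.1 * a_star                               # a^0 with (a^0)^T a* > 0
lower = min(a @ a_star, (g - 1) / (np.pi - 1) * (a_star @ a_star))
upper = a_star @ a_star + a_star.sum() ** 2

inner = []
for _ in range(2000):
    a = a - eta * (A @ a - B @ a_star)         # gradient step on a
    inner.append(a @ a_star)
print(min(inner), max(inner))
```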

C.2 Convergence of Phase I

In this section we prove the convergence of Phase I.

Proof of Convergence of Phase I. Lemma C.3 implies that after $O\big(\frac{1}{\eta\cos\phi^0\beta^0}\big)$ iterations, $\cos\phi^t = \Omega(1)$, which implies $\frac{g(\phi^t)-1}{\pi-1} = \Omega(1)$. Using Corollary C.2, we know that after $O\big(\frac{1}{\eta}\big)$ further iterations we have
$$(a^t)^\top a^*\,\|w^*\|_2 = \Omega\big(\|w^*\|_2^2\,\|a^*\|_2^2\big).$$

The main ingredient of the proof of Phase I is the following lemma, where we use a joint induction argument to show the convergence of $\phi^t$ together with a uniform upper bound on $\|v^t\|_2$.

Lemma C.3. Let $\beta^0 = \min\big\{(a^0)^\top a^*,\; \frac{g(\phi^0)-1}{\pi-1}\|a^*\|_2^2\big\}\,\|w^*\|_2$. If the step size satisfies
$$\eta \le \min\left\{\frac{\beta^0\cos\phi^0}{8\Big(\big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^3\Big)^2},\; \frac{\cos\phi^0}{\big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^3},\; \frac{2\pi}{k+\pi-1}\right\},$$
then for $t = 0, 1, \ldots$
$$\sin^2\phi^t \le \Big(1 - \eta\cdot\frac{\cos\phi^0\,\beta^0}{8}\Big)^t \quad\text{and}\quad \|v^t\|_2^2 \le 2.$$

Proof. We prove by induction. The initialization ensures the conclusion holds when $t = 0$. Now consider the dynamics of $\|v^t\|_2^2$. Because the gradient with respect to $v$ is orthogonal to $v$ [Salimans and Kingma, 2016], we have the simple dynamics
$$\|v^t\|_2^2 = \|v^{t-1}\|_2^2 + \eta^2\left\|\frac{\partial\ell(v^{t-1}, a^{t-1})}{\partial v}\right\|_2^2 = \|v^{t-1}\|_2^2 + \eta^2\left(\frac{(a^{t-1})^\top a^*\,(\pi - \phi^{t-1})}{2\pi\|v^{t-1}\|_2}\right)^2 \sin^2\phi^{t-1}\,\|w^*\|_2^2$$
$$\le \|v^{t-1}\|_2^2 + \eta^2\Big(\big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^3\Big)^2\sin^2\phi^{t-1}$$
$$= 1 + \eta^2\Big(\big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^3\Big)^2\sum_{i=0}^{t-1}\sin^2\phi^i \le 1 + \eta^2\Big(\big(\|a^*\|_2^2 + (\mathbf{1}^\top a^*)^2\big)\|w^*\|_2^3\Big)^2\cdot\frac{8}{\eta\cos\phi^0\beta^0} \le 2,$$
where the first inequality uses Lemma C.2 and $\|v^{t-1}\|_2 \ge 1$, the second uses the induction hypothesis (summing the geometric series), and the last uses the first condition on $\eta$.

Recall $\lambda^t = \frac{\|w^*\|_2\,(a^t)^\top a^*(\pi - \phi^t)}{2\pi\|v^t\|_2^2}$. The uniform upper bound $\|v^t\|_2^2 \le 2$, the lower bound on $(a^t)^\top a^*$ from Corollary C.1, and $\pi - \phi^t \ge \pi/2$ together imply the lower bound $\lambda^t \ge \frac{\beta^0}{8}$. Plugging this into Lemma 4.5, we have
$$\sin^2\phi^{t+1} \le \Big(1 - \eta\,\frac{\cos\phi^0\beta^0}{8}\Big)\sin^2\phi^t \le \Big(1 - \eta\,\frac{\cos\phi^0\beta^0}{8}\Big)^{t+1}.$$
This finishes the joint induction.

C.3 Analysis of Phase II

In this section we prove the convergence of Phase II and the necessary auxiliary lemmas.

Proof of Convergence of Phase II. At the beginning of Phase II, $(a^{T_1})^\top a^*\|w^*\|_2 = \Omega\big(\|w^*\|_2^2\|a^*\|_2^2\big)$ and $g(\phi^{T_1}) - 1 = \Omega(1)$. Therefore, Lemma C.1 implies that for all $t = T_1, T_1 + 1, \ldots$, $(a^t)^\top a^*\|w^*\|_2 = \Omega\big(\|w^*\|_2^2\|a^*\|_2^2\big)$. Combining this with the fact that $\|v^t\|_2^2 \le 2$ (c.f. Lemma C.3), we obtain the lower bound $\lambda^t = \Omega\big(\|w^*\|_2^2\|a^*\|_2^2\big)$. We also know that $\cos\phi^{T_1} = \Omega(1)$ and that $\cos\phi^t$ is monotonically increasing (c.f. Lemma 4.2), so for all $t = T_1, T_1 + 1, \ldots$, $\cos\phi^t = \Omega(1)$. Plugging these two lower bounds into Lemma 4.5, we have
$$\sin^2\phi^{t+1} \le \big(1 - \eta C\|w^*\|_2^2\|a^*\|_2^2\big)\sin^2\phi^t$$
for some absolute constant $C$. Thus, after $O\big(\frac{1}{\eta\|w^*\|_2^2\|a^*\|_2^2}\log\frac{1}{\epsilon}\big)$ iterations, we have $\sin^2\phi^t \le \min\big\{\frac{\epsilon}{10},\; \frac{\epsilon}{10}\cdot\frac{\|a^*\|_2}{|\mathbf{1}^\top a^*|}\big\}$, which implies $\pi - g(\phi^t) \le \min\big\{\epsilon,\; \epsilon\,\frac{\|a^*\|_2}{|\mathbf{1}^\top a^*|}\big\}$. Now using Lemma C.4, Lemma C.5 and Lemma C.6, after $\widetilde{O}\big(\frac{1}{\eta k}\log\frac{1}{\epsilon}\big)$ further iterations we have $\ell(v, a) \le C_1\epsilon\|a^*\|_2^2\|w^*\|_2^2$ for some absolute constant $C_1$. Rescaling $\epsilon$ properly, we obtain the desired result.
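The two-phase behavior can be reproduced end to end by running gradient descent on the closed-form population gradients from Appendix A (only the $w^*$-component of the $w$-gradient survives the projection $I - ww^\top$). All concrete values below (dimensions, $w^*$, $a^*$, step size, initialization with $(a^0)^\top a^* > 0$) are arbitrary test choices:

```python
import numpy as np

# Population gradient descent on (v, a) using the closed-form gradients derived in
# Appendix A; a small end-to-end illustration of the two-phase convergence.
# All concrete values are arbitrary test choices, with ||w*||_2 = 1.
def g_fun(phi):
    return np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3

k, d, eta = 4, 6, 0.05
ones = np.ones((k, k))
A = (ones + (np.pi - 1) * np.eye(k)) / (2 * np.pi)

w_star = np.zeros(d);  w_star[0] = 1.0
a_star = np.array([1.0, 0.5, 2.0, 0.8])
v = np.array([0.3, 1.0, 0.5, -0.5, 0.2, 0.1])    # large initial angle to w*
a = 0.2 * a_star                                 # small positive inner product with a*

for _ in range(20000):
    nv = np.linalg.norm(v)
    w = v / nv
    phi = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    B = (ones + (g_fun(phi) - 1) * np.eye(k)) / (2 * np.pi)
    # only the w*-component of the w-gradient survives the projection I - ww^T
    grad_v = -(a @ a_star) * (np.pi - phi) / (2 * np.pi * nv) * (w_star - (w @ w_star) * w)
    grad_a = A @ a - B @ a_star
    v = v - eta * grad_v
    a = a - eta * grad_a
print(phi, np.linalg.norm(a - a_star))
```

With a positive initial inner product, the iterates recover $(w^*, a^*)$: the angle shrinks to essentially zero and $a$ converges to $a^*$.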

C.3.1 Technical Lemmas for Analyzing Phase II

In this section we provide some technical lemmas for analyzing Phase II. Because of the positive homogeneity property, without loss of generality we assume $\|w^*\|_2 = 1$.

Lemma C.4. If $\pi - g(\phi^0) \le \epsilon\,\frac{\|a^*\|_2}{|\mathbf{1}^\top a^*|}$, then after $T = O\big(\frac{1}{\eta k}\log\frac{|\mathbf{1}^\top a^* - \mathbf{1}^\top a^0|}{\epsilon\|a^*\|_2}\big)$ iterations, $\big|\mathbf{1}^\top a^* - \mathbf{1}^\top a^T\big| \le 2\epsilon\|a^*\|_2$.

Proof. Recall the dynamics of $\mathbf{1}^\top a^t$:
$$\mathbf{1}^\top a^{t+1} = \Big(1 - \frac{\eta(k+\pi-1)}{2\pi}\Big)\mathbf{1}^\top a^t + \frac{\eta\big(k + g(\phi^t) - 1\big)}{2\pi}\,\mathbf{1}^\top a^*.$$
Assume $\mathbf{1}^\top a^* > 0$ (the other case is similar). By Lemma 4.4 we know $\mathbf{1}^\top a^t < \mathbf{1}^\top a^*$ for all $t$. Consider
$$\mathbf{1}^\top a^* - \mathbf{1}^\top a^{t+1} = \Big(1 - \frac{\eta(k+\pi-1)}{2\pi}\Big)\big(\mathbf{1}^\top a^* - \mathbf{1}^\top a^t\big) + \frac{\eta\big(\pi - g(\phi^t)\big)}{2\pi}\,\mathbf{1}^\top a^*.$$
Therefore we have
$$\mathbf{1}^\top a^* - \mathbf{1}^\top a^{t+1} - \frac{\big(\pi - g(\phi^t)\big)\mathbf{1}^\top a^*}{k+\pi-1} = \Big(1 - \frac{\eta(k+\pi-1)}{2\pi}\Big)\left(\mathbf{1}^\top a^* - \mathbf{1}^\top a^t - \frac{\big(\pi - g(\phi^t)\big)\mathbf{1}^\top a^*}{k+\pi-1}\right).$$
After $T = O\big(\frac{1}{\eta k}\log\frac{|\mathbf{1}^\top a^* - \mathbf{1}^\top a^0|}{\epsilon\|a^*\|_2}\big)$ iterations, we have $\Big|\mathbf{1}^\top a^* - \mathbf{1}^\top a^T - \frac{(\pi - g(\phi^T))\mathbf{1}^\top a^*}{k+\pi-1}\Big| \le \epsilon\|a^*\|_2$; since $\frac{(\pi - g(\phi^T))\mathbf{1}^\top a^*}{k+\pi-1} \le \epsilon\|a^*\|_2$ (by the assumption on $\pi - g(\phi^0)$ and $g(\phi^t) \ge g(\phi^0)$), this implies $\big|\mathbf{1}^\top a^* - \mathbf{1}^\top a^T\big| \le 2\epsilon\|a^*\|_2$.
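The contraction toward the fixed point $\frac{(\pi - g(\phi))\mathbf{1}^\top a^*}{k+\pi-1}$ is easy to observe numerically; the sketch below freezes $g(\phi)$ for illustration, and $k$, $\eta$, and the initial values are arbitrary test choices:

```python
import numpy as np

# Deterministic check of the 1^T a recursion in Lemma C.4 with g(phi) frozen;
# k, eta, the frozen value of pi - g(phi), and 1^T a* are arbitrary test choices.
k, eta = 10, 0.05
g = np.pi - 0.01                           # frozen, so pi - g(phi) = 0.01
S_star, S = 3.0, 0.0                       # 1^T a* and 1^T a^0
fixed_gap = (np.pi - g) * S_star / (k + np.pi - 1)

for _ in range(500):
    S = (1 - eta * (k + np.pi - 1) / (2 * np.pi)) * S \
        + eta * (k + g - 1) / (2 * np.pi) * S_star
gap = S_star - S
print(gap, fixed_gap)
```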
  
Lemma C.5. If $\pi - g(\phi^0) \le \epsilon$ and $\big|\mathbf{1}^\top a^* - \mathbf{1}^\top a^t\big| \le \frac{2\epsilon}{\sqrt{k}}\|a^*\|_2$ for all $t \ge 0$, then after $T = O\big(\frac{1}{\eta}\log\frac{\|a^* - a^0\|_2}{\epsilon\|a^*\|_2}\big)$ iterations, $\|a^T - a^*\|_2 \le C\epsilon\|a^*\|_2$ for some absolute constant $C$.

Proof. Writing $\frac{\partial\ell(v^t, a^t)}{\partial a} = \frac{1}{2\pi}\Big((\pi-1)(a^t - a^*) + \big(\pi - g(\phi^t)\big)a^* + \mathbf{1}\mathbf{1}^\top(a^t - a^*)\Big)$, we first lower bound the inner product
$$\Big\langle \frac{\partial\ell(v^t, a^t)}{\partial a},\; a^t - a^*\Big\rangle = \frac{\pi-1}{2\pi}\|a^t - a^*\|_2^2 + \frac{\pi - g(\phi^t)}{2\pi}(a^*)^\top(a^t - a^*) + \frac{1}{2\pi}\big(\mathbf{1}^\top(a^t - a^*)\big)^2$$
$$\ge \frac{\pi-1}{2\pi}\|a^t - a^*\|_2^2 - \frac{\pi - g(\phi^t)}{2\pi}\|a^*\|_2\|a^t - a^*\|_2 \ge \frac{\pi-1}{4\pi}\|a^t - a^*\|_2^2 - \epsilon^2\|a^*\|_2^2,$$
where the last step uses $\pi - g(\phi^t) \le \pi - g(\phi^0) \le \epsilon$ ($\phi^t$ is decreasing) and the AM–GM inequality. Next we bound the squared norm of the gradient:
$$\left\|\frac{\partial\ell(v^t, a^t)}{\partial a}\right\|_2^2 = \frac{1}{4\pi^2}\Big\|(\pi-1)(a^t - a^*) + \big(\pi - g(\phi^t)\big)a^* + \mathbf{1}\mathbf{1}^\top(a^t - a^*)\Big\|_2^2$$
$$\le \frac{3}{4\pi^2}\Big((\pi-1)^2\|a^t - a^*\|_2^2 + \big(\pi - g(\phi^t)\big)^2\|a^*\|_2^2 + k\big(\mathbf{1}^\top a^t - \mathbf{1}^\top a^*\big)^2\Big) \le \|a^t - a^*\|_2^2 + \epsilon^2\|a^*\|_2^2,$$
using the hypotheses $\pi - g(\phi^t) \le \epsilon$ and $k\big(\mathbf{1}^\top a^t - \mathbf{1}^\top a^*\big)^2 \le 4\epsilon^2\|a^*\|_2^2$. Therefore, for $\eta \le \frac{\pi-1}{4\pi}$,
$$\|a^{t+1} - a^*\|_2^2 = \|a^t - a^*\|_2^2 - 2\eta\Big\langle\frac{\partial\ell}{\partial a},\, a^t - a^*\Big\rangle + \eta^2\left\|\frac{\partial\ell}{\partial a}\right\|_2^2 \le \Big(1 - \frac{\eta(\pi-1)}{4\pi}\Big)\|a^t - a^*\|_2^2 + 3\eta\epsilon^2\|a^*\|_2^2,$$
which implies
$$\|a^{t+1} - a^*\|_2^2 - \frac{12\pi\,\epsilon^2\|a^*\|_2^2}{\pi-1} \le \Big(1 - \frac{\eta(\pi-1)}{4\pi}\Big)\left(\|a^t - a^*\|_2^2 - \frac{12\pi\,\epsilon^2\|a^*\|_2^2}{\pi-1}\right).$$
Thus after $T = O\big(\frac{1}{\eta}\log\frac{\|a^0 - a^*\|_2}{\epsilon\|a^*\|_2}\big)$ iterations, we must have $\|a^T - a^*\|_2^2 \le C^2\epsilon^2\|a^*\|_2^2$, i.e., $\|a^T - a^*\|_2 \le C\epsilon\|a^*\|_2$, for some absolute constant $C$. Rescaling $\epsilon$, we obtain the desired result.

Lemma C.6. If $\pi - g(\phi) \le \epsilon$ and $\big\|a - \|w^*\|_2\, a^*\big\|_2 \le \epsilon\|a^*\|_2\|w^*\|_2$, then the population loss satisfies $\ell(v, a) \le C\epsilon\|a^*\|_2^2\|w^*\|_2^2$ for some absolute constant $C > 0$.

Proof. The result follows by plugging these two bounds into the loss expansion of Theorem 2.1.

D Proofs of Initialization Scheme

Proof of Theorem 3.2. The first part of Theorem 3.2 uses only the symmetry of the unit sphere and the unit ball, and the second part is a direct application of Lemma 2.5 of [Hardt and Price, 2014]. Lastly, since $a^0 \sim B\big(0,\; \frac{|\mathbf{1}^\top a^*|\,\|w^*\|_2}{\sqrt{k}}\big)$, we have
$$\big|\mathbf{1}^\top a^0\big| \le \|a^0\|_1 \le \sqrt{k}\,\|a^0\|_2 \le \big|\mathbf{1}^\top a^*\big|\,\|w^*\|_2,$$
where the second inequality is due to Hölder's inequality.
E Proofs of Converging to Spurious Local Minimum

Proof of Theorem 3.3. The main idea is similar to that of Theorem 3.1, but here we show $w \to -w^*$ (without loss of generality, we assume $\|w^*\|_2 = 1$). Different from Theorem 3.1, here we need to prove the invariance $(a^t)^\top a^* < 0$, which implies our desired result. We prove this by induction: suppose $(a^t)^\top a^* < 0$, $\big|\mathbf{1}^\top a^t\big| \le \big|\mathbf{1}^\top a^*\big|$, $g(\phi^t) \le 1 - \frac{2(\mathbf{1}^\top a^*)^2}{\|a^*\|_2^2}$, and $\eta < \frac{2\pi}{k+\pi-1}$. Note that $\big|\mathbf{1}^\top a^t\big| \le \big|\mathbf{1}^\top a^*\big|$ is satisfied by Lemma 4.4, and $g(\phi^t) \le 1 - \frac{2(\mathbf{1}^\top a^*)^2}{\|a^*\|_2^2}$ follows from our initialization condition together with the induction hypothesis, which implies $\phi^t$ is increasing (so $g(\phi^t) \le g(\phi^0)$). Recall the dynamics of $(a^t)^\top a^*$:
$$(a^{t+1})^\top a^* = \Big(1 - \frac{\eta(\pi-1)}{2\pi}\Big)(a^t)^\top a^* + \frac{\eta\big(g(\phi^t)-1\big)}{2\pi}\|a^*\|_2^2 + \frac{\eta}{2\pi}\Big((\mathbf{1}^\top a^*)^2 - (\mathbf{1}^\top a^t)(\mathbf{1}^\top a^*)\Big)$$
$$\le \frac{\eta}{2\pi}\Big(\big(g(\phi^t)-1\big)\|a^*\|_2^2 + 2(\mathbf{1}^\top a^*)^2\Big) < 0,$$
where the first inequality uses the induction hypotheses $(a^t)^\top a^* < 0$ and $\big|\mathbf{1}^\top a^t\big| \le \big|\mathbf{1}^\top a^*\big|$, and the second inequality uses the induction hypothesis on $g(\phi^t)$. Thus when the gradient descent algorithm converges, according to Lemma 4.1, $\theta(v, w^*) = \pi$ and
$$a = \big(\mathbf{1}\mathbf{1}^\top + (\pi-1)I\big)^{-1}\big(\mathbf{1}\mathbf{1}^\top - I\big)\,\|w^*\|_2\, a^*.$$
Plugging these into Theorem 2.1, with some routine algebra, we show $\ell(v, a) = \Omega\big(\|w^*\|_2^2\|a^*\|_2^2\big)$.
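Running gradient descent on the closed-form population gradients of Appendix A, but starting from the bad region, illustrates this convergence to the spurious minimum. Here $a^*$ is chosen with $\mathbf{1}^\top a^* = 0$ so the invariance $a^\top a^* < 0$ is easy to see; all concrete values are arbitrary test choices:

```python
import numpy as np

# Closed-form population gradient descent started with (a^0)^T a* < 0 and an obtuse
# initial angle: the iterates are driven to the spurious minimum phi -> pi.
# a* has 1^T a* = 0 so the invariance a^T a* < 0 is transparent; all values are
# arbitrary test choices, with ||w*||_2 = 1.
def g_fun(phi):
    return np.cos(phi) * (np.pi - phi + np.sin(phi) * np.cos(phi)) + np.sin(phi) ** 3

k, d, eta = 4, 3, 0.05
ones = np.ones((k, k))
A = (ones + (np.pi - 1) * np.eye(k)) / (2 * np.pi)

w_star = np.array([1.0, 0.0, 0.0])
a_star = np.array([1.0, -1.0, 0.5, -0.5])       # 1^T a* = 0
v = np.array([-0.5, 1.0, 0.2])                  # phi^0 > pi/2
a = -0.2 * a_star                               # (a^0)^T a* < 0

for _ in range(20000):
    nv = np.linalg.norm(v)
    w = v / nv
    phi = np.arccos(np.clip(w @ w_star, -1.0, 1.0))
    B = (ones + (g_fun(phi) - 1) * np.eye(k)) / (2 * np.pi)
    grad_v = -(a @ a_star) * (np.pi - phi) / (2 * np.pi * nv) * (w_star - (w @ w_star) * w)
    grad_a = A @ a - B @ a_star
    v = v - eta * grad_v
    a = a - eta * grad_a
print(phi, a @ a_star)
```

With the sign of $a^\top a^*$ flipped, the same dynamics that attract $v$ to $w^*$ in Theorem 3.1 now repel it, so $\phi$ climbs toward $\pi$ while $a^\top a^*$ stays negative.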
