For nonconvex optimization in machine learning, this article proves that every local minimum achieves the globally optimal value of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this article improves or complements several state-of-the-art theoretical results on deep neural networks, deep residual networks, and overparameterized deep neural networks with a unified proof technique and novel geometric insights. A special case of our results also contributes to the theoretical foundation of representation learning.

Deep learning has achieved considerable empirical success in machine learning applications. However, insufficient work has been done on theoretically understanding deep learning, partly because of the nonconvexity and high-dimensionality of the objective functions used to train deep models. In general, theoretical understanding of nonconvex, high-dimensional optimization is challenging. Indeed, finding a global minimum of a general nonconvex function (Murty & Kabadi, 1987) and training certain types of neural networks (Blum & Rivest, 1992) are both NP-hard. Considering the NP-hardness for a general set of relevant problems, it is necessary to use additional assumptions to guarantee efficient global optimality in deep learning. Accordingly, recent theoretical studies have proven global optimality in deep learning by using additional strong assumptions such as linear activation, random activation, semirandom activation, gaussian inputs, single hidden-layer network, and significant overparameterization (Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016; Hardt & Ma, 2017; Nguyen & Hein, 2017, 2018; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Ge, Lee, & Ma, 2017; Goel & Klivans, 2017; Zhong, Song, Jain, Bartlett, & Dhillon, 2017; Li & Yuan, 2017; Kawaguchi, Xie, & Song, 2018; Du & Lee, 2018).

A study proving efficient global optimality in deep learning is thus closely related to the search for additional assumptions that might not hold in many practical applications. Toward widely applicable practical theory, we can also ask a different type of question: If standard global optimality requires additional assumptions, then what type of global optimality does not? In other words, instead of searching for additional assumptions to guarantee standard global optimality, we can also search for another type of global optimality under mild assumptions. Furthermore, instead of an arbitrary type of global optimality, it is preferable to develop a general theory of global optimality that not only works under mild assumptions but also produces the previous results with the previous additional assumptions, while predicting new results with future additional assumptions. This type of general theory may help not only to explain when and why an existing machine learning method works but also to predict the types of future methods that will or will not work.

As a step toward this goal, this article proves a series of theoretical results. The major contributions are summarized as follows:

  • For nonconvex optimization in machine learning with mild assumptions, we prove that every differentiable local minimum achieves global optimality of the perturbable gradient basis model class. This result is directly applicable to many existing machine learning models, including practical deep learning models, and to new models to be proposed in the future, nonconvex and convex.

  • The proposed general theory with a simple and unified proof technique is shown to be able to prove several concrete guarantees that improve or complement several state-of-the-art results.

  • In general, the proposed theory allows us to see the effects of the design of models, methods, and assumptions on the optimization landscape through the lens of the global optima of the perturbable gradient basis model class.

Because a local minimum $\theta \in \mathbb{R}^{d_\theta}$ is only required to be locally optimal in $\mathbb{R}^{d_\theta}$, it is nontrivial that the local minimum is guaranteed to achieve the global optimality in $\mathbb{R}^{d_\theta}$ of the induced perturbable gradient basis model class. The reason we can prove more than the worst-case results of general nonconvex optimization is that we explicitly take advantage of mild assumptions that commonly hold in machine learning and deep learning. In particular, we assume that the objective function to be optimized is structured as a sum of weighted errors, where each error is the output of a composition of a loss function and a function of a hypothesis class. Moreover, we make mild assumptions on the loss function and the hypothesis class, all of which typically hold in practice.

This section defines the problem setting and common notation.

2.1  Problem Description

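Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be an input vector and a target vector, respectively. Define $((x_i, y_i))_{i=1}^{m}$ as a training data set of size $m$. Let $\theta \in \mathbb{R}^{d_\theta}$ be a parameter vector to be optimized. Let $f(x;\theta) \in \mathbb{R}^{d_y}$ be the output of a model or a hypothesis, and let $\ell: \mathbb{R}^{d_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be a loss function. Here, $d_\theta, d_y \in \mathbb{N}_{>0}$. We consider the following standard objective function $L$ to train a model $f(x;\theta)$:
$$L(\theta) = \sum_{i=1}^{m} \lambda_i \, \ell(f(x_i;\theta), y_i).$$
This article allows the weights $\lambda_1, \dots, \lambda_m > 0$ to be arbitrarily fixed. With $\lambda_1 = \cdots = \lambda_m = \frac{1}{m}$, all of our results hold true for the standard average loss $L$ as a special case.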
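
As a concrete reading of the weighted objective, the following minimal sketch evaluates $L(\theta)$ for a toy model with the squared loss and uniform weights $\lambda_i = 1/m$. The names `objective`, `squared_loss`, and `toy_model` are hypothetical and chosen only for illustration; they do not come from the article.

```python
import numpy as np

def squared_loss(q, y):
    """Squared loss l(q, y) = ||q - y||_2^2."""
    return float(np.sum((q - y) ** 2))

def objective(theta, f, xs, ys, lambdas, loss=squared_loss):
    """L(theta) = sum_i lambda_i * loss(f(x_i; theta), y_i)."""
    return sum(lam * loss(f(x, theta), y) for lam, x, y in zip(lambdas, xs, ys))

def toy_model(x, theta):
    """Hypothetical model f(x; theta) = theta[0] * tanh(theta[1] * x), d_y = 1."""
    return np.array([theta[0] * np.tanh(theta[1] * x)])

xs = np.array([-1.0, 0.5, 2.0])
ys = [np.array([0.0]), np.array([1.0]), np.array([1.0])]
theta = np.array([1.0, 0.0])          # theta[1] = 0 makes every output 0

m = len(xs)
uniform = np.full(m, 1.0 / m)         # lambda_1 = ... = lambda_m = 1/m
avg_loss = objective(theta, toy_model, xs, ys, uniform)
```

Any other positive weights can be passed in place of `uniform`; the average loss is just one choice.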

2.2  Notation

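Because the focus of this article is the optimization of the vector $\theta$, the following notation is convenient: $\ell_y(q) = \ell(q, y)$ and $f_x(q) = f(x; q)$. Then we can write
$$L(\theta) = \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}(f_{x_i}(\theta)) = \sum_{i=1}^{m} \lambda_i \, (\ell_{y_i} \circ f_{x_i})(\theta).$$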

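We use the following standard notation for differentiation. Given a scalar-valued or vector-valued function $\phi: \mathbb{R}^{d} \to \mathbb{R}^{d'}$ with components $\phi = (\phi_1, \dots, \phi_{d'})$ and variables $(v_1, \dots, v_{\bar d})$, let $\partial_v \phi: \mathbb{R}^{d} \to \mathbb{R}^{d' \times \bar d}$ be the matrix-valued function with each entry $(\partial_v \phi)_{i,j} = \frac{\partial \phi_i}{\partial v_j}$. Note that if $\phi$ is a scalar-valued function, $\partial_v \phi$ outputs a row vector. In addition, $\partial \phi = \partial_v \phi$ if $(v_1, \dots, v_d)$ are the input variables of $\phi$. Given a function $\phi: \mathbb{R}^{d} \to \mathbb{R}^{d'}$, let $\partial_k \phi$ be the partial derivative of $\phi$ with respect to its $k$th variable. For the syntax of any differentiation map $\partial$, given functions $\phi$ and $\zeta$, let $\partial \phi(\zeta(q)) = (\partial \phi)(\zeta(q))$ be the (partial) derivative $\partial \phi$ evaluated at an output $\zeta(q)$ of a function $\zeta$.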

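Given a matrix $M \in \mathbb{R}^{d \times d'}$, $\mathrm{vec}(M) = [M_{1,1}, \dots, M_{d,1}, M_{1,2}, \dots, M_{d,2}, \dots, M_{1,d'}, \dots, M_{d,d'}]^\top$ represents the standard vectorization of the matrix $M$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^{n}$, define $[M^{(j)}]_{j=1}^{n} = [M^{(1)}, M^{(2)}, \dots, M^{(n)}]$ to be a block matrix with column blocks $M^{(1)}, M^{(2)}, \dots, M^{(n)}$. Similarly, given a set $I = \{i_1, \dots, i_n\}$ with $(i_1, \dots, i_n)$ increasing, define $[M^{(j)}]_{j \in I} = [M^{(i_1)} \cdots M^{(i_n)}]$.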
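
The vectorization above stacks columns, which differs from NumPy's default row-major flattening. A small sketch, with the helper names `vec` and `hblock` being hypothetical:

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization: [M_11, ..., M_d1, M_12, ..., M_dd']."""
    return np.asarray(M).reshape(-1, order="F")

def hblock(blocks):
    """[M^(1), ..., M^(n)]: 2-D blocks placed side by side as column blocks."""
    return np.hstack(blocks)

M = np.array([[1, 2, 3],
              [4, 5, 6]])
v = vec(M)                                  # columns of M stacked in order
B = hblock([M, np.array([[7], [8]])])       # a 2 x 4 block matrix
```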

This section shows our first main result that under mild assumptions, every differentiable local minimum achieves the global optimality of the perturbable gradient basis model class.

3.1  Assumptions

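Given a hypothesis class $f$ and data set, let $\Omega$ be the set of nondifferentiable points $\theta$, defined as $\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\exists i \in \{1, \dots, m\})[f_{x_i} \text{ is not differentiable at } \theta]\}$. Similarly, define $\tilde\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\forall \varepsilon > 0)(\exists \theta' \in B(\theta, \varepsilon))(\exists i \in \{1, \dots, m\})[f_{x_i} \text{ is not differentiable at } \theta']\}$. Here, $B(\theta, \varepsilon)$ is the open ball with center $\theta$ and radius $\varepsilon$. In common nondifferentiable models $f$ such as neural networks with rectified linear units (ReLUs) and pooling operations, we have that $\Omega = \tilde\Omega$, and the Lebesgue measure of $\Omega$ $(= \tilde\Omega)$ is zero.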

This section uses the following mild assumptions.

Assumption 1

(Use of Common Loss Criteria). For all $i \in \{1, \dots, m\}$, the function $\ell_{y_i}: q \mapsto \ell(q, y_i) \in \mathbb{R}_{\ge 0}$ is differentiable and convex (e.g., the squared loss, cross-entropy loss, or polynomial hinge loss satisfies this assumption).

Assumption 2

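(Use of Common Model Structures). There exists a function $g: \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_\theta}$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \, \partial_k f_{x_i}(\theta)$ for all $i \in \{1, \dots, m\}$ and all $\theta \in \mathbb{R}^{d_\theta} \setminus \Omega$.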

Assumption 1 is satisfied by simply using common loss criteria that include the squared loss $\ell(q, y) = \|q - y\|_2^2$, cross-entropy loss $\ell(q, y) = -\sum_{k=1}^{d_y} y_k \log \frac{\exp(q_k)}{\sum_{k'} \exp(q_{k'})}$, and smoothed hinge loss $\ell(q, y) = (\max\{0, 1 - yq\})^p$ with $p \ge 2$ (the hinge loss with $d_y = 1$). Although the objective function $L: \theta \mapsto L(\theta)$ used to train a complex machine learning model (e.g., a neural network) is nonconvex in $\theta$, the loss criterion $\ell_{y_i}: q \mapsto \ell(q, y_i)$ is usually convex in $q$. In this article, the cross-entropy loss includes the softmax function, and thus $f_x(\theta)$ is the pre-softmax output of the last layer in related deep learning models.

Assumption 2 is satisfied by simply using a common architecture in deep learning or a classical machine learning model. For example, consider a deep neural network of the form $f_x(\theta) = W h(x;u) + b$, where $h(x;u)$ is an output of an arbitrary representation at the last hidden layer and $\theta = \mathrm{vec}([W, b, u])$. Then assumption 2 holds because $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \, \partial_k f_{x_i}(\theta)$, where $g(\theta)_k = \theta_k$ for all $k$ corresponding to the parameters $(W, b)$ in the last layer and $g(\theta)_k = 0$ for all other $k$ corresponding to $u$. In general, because $g$ is a function of $\theta$, assumption 2 is easily satisfiable. Assumption 2 does not require the model $f(x;\theta)$ to be linear in $\theta$ or $x$.
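
The identity in assumption 2 can be checked numerically for the last-layer example. The sketch below assumes a hypothetical one-hidden-layer representation $h(x;u) = \tanh(Ux)$ and a finite-difference Jacobian; the flat parameter packing used here is an implementation convenience and not the article's vec ordering.

```python
import numpy as np

d_x, d_h, d_y = 3, 4, 2
n_last = d_y * d_h + d_y                      # number of (W, b) parameters
rng = np.random.default_rng(0)

def unpack(theta):
    W = theta[: d_y * d_h].reshape(d_y, d_h)
    b = theta[d_y * d_h : n_last]
    U = theta[n_last:].reshape(d_h, d_x)      # plays the role of u
    return W, b, U

def f(x, theta):
    """f_x(theta) = W h(x; u) + b with h(x; u) = tanh(U x) (an assumed h)."""
    W, b, U = unpack(theta)
    return W @ np.tanh(U @ x) + b

def jacobian(x, theta, eps=1e-6):
    """Central finite differences; column k approximates d f / d theta_k."""
    J = np.zeros((d_y, theta.size))
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        J[:, k] = (f(x, theta + e) - f(x, theta - e)) / (2 * eps)
    return J

theta = rng.normal(size=n_last + d_h * d_x)
x = rng.normal(size=d_x)

g = theta.copy()
g[n_last:] = 0.0                              # g(theta)_k = 0 on the u-coordinates

lhs = f(x, theta)                             # f_x(theta)
rhs = jacobian(x, theta) @ g                  # sum_k g(theta)_k * d_k f_x(theta)
```

The two sides agree because the output is linear in $(W, b)$, even though it is nonlinear in $u$ and in $x$.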

Note that we allow the nondifferentiable points to exist in L(θ); for example, the use of ReLU is allowed. For a nonconvex and nondifferentiable function, we can still have first-order and second-order necessary conditions of local minima (e.g., Rockafellar & Wets, 2009, theorem 13.24). However, subdifferential calculus of a nonconvex function requires careful treatment at nondifferentiable points (see Rockafellar & Wets, 2009; Kakade & Lee, 2018; Davis, Drusvyatskiy, Kakade, & Lee, 2019), and deriving guarantees at nondifferentiable points is left to a future study.

3.2  Theory for Critical Points

Before presenting the first main result, this section provides a simpler result for critical points to illustrate the ideas behind the main result for local minima. We define the (theoretical) objective function $L_\theta$ of the gradient basis model class as
$$L_\theta(\alpha) = \sum_{i=1}^{m} \lambda_i \, \ell(f_\theta(x_i;\alpha), y_i),$$
where $\{f_\theta(x_i;\alpha) = \sum_{k=1}^{d_\theta} \alpha_k \, \partial_k f_{x_i}(\theta) : \alpha \in \mathbb{R}^{d_\theta}\}$ is the induced gradient basis model class. The following theorem shows that every differentiable critical point of our original objective $L$ (including every differentiable local minimum and saddle point) achieves the global minimum value of $L_\theta$. The complete proofs of all the theoretical results are presented in appendix A.
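Theorem 1.
Let assumptions 1 and 2 hold. Then for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds:
$$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha).$$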

An important aspect in theorem 1 is that $L_\theta$ on the right-hand side is convex, while $L$ on the left-hand side can be nonconvex or convex. Here, following convention, $\inf S$ is defined to be the infimum of a subset $S$ of $\bar{\mathbb{R}}$ (the set of affinely extended real numbers); that is, if $S$ has no lower bound, $\inf S = -\infty$, and $\inf \emptyset = \infty$. Note that theorem 1 vacuously holds true if there is no critical point for $L$. To guarantee the existence of a minimizer in a (nonempty) subspace $S \subseteq \mathbb{R}^{d_\theta}$ for $L$ (or $L_\theta$), a classical proof requires two conditions: a lower semicontinuity of $L$ (or $L_\theta$) and the existence of a $q \in S$ for which the set $\{q' \in S : L(q') \le L(q)\}$ (or $\{q' \in S : L_\theta(q') \le L_\theta(q)\}$) is compact (see Bertsekas, 1999, for different conditions).
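
The reasoning behind theorem 1 can be outlined in a few lines. The following is only an informal sketch, assuming the chain rule applies at the differentiable point $\theta$; it is not a substitute for the complete proof in appendix A.

```latex
% Step 1 (critical point): for every k = 1, ..., d_\theta,
\partial_k L(\theta)
  = \sum_{i=1}^{m} \lambda_i \, \partial \ell_{y_i}(f_{x_i}(\theta)) \, \partial_k f_{x_i}(\theta) = 0 .

% Step 2 (convexity of \ell_{y_i}, assumption 1): for every \alpha,
L_\theta(\alpha)
  \ge \sum_{i=1}^{m} \lambda_i \Big[ \ell_{y_i}(f_{x_i}(\theta))
      + \partial \ell_{y_i}(f_{x_i}(\theta)) \big( f_\theta(x_i;\alpha) - f_{x_i}(\theta) \big) \Big] .

% Step 3 (assumption 2): f_{x_i}(\theta) = \sum_k g(\theta)_k \partial_k f_{x_i}(\theta), so
f_\theta(x_i;\alpha) - f_{x_i}(\theta)
  = \sum_{k=1}^{d_\theta} \big( \alpha_k - g(\theta)_k \big) \, \partial_k f_{x_i}(\theta) ,

% and the first-order term vanishes by Step 1:
L_\theta(\alpha)
  \ge L(\theta) + \sum_{k=1}^{d_\theta} \big( \alpha_k - g(\theta)_k \big) \, \partial_k L(\theta)
  = L(\theta) .

% Step 4 (attainment): choosing \alpha = g(\theta) gives L_\theta(g(\theta)) = L(\theta),
% hence L(\theta) = \inf_{\alpha} L_\theta(\alpha).
```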

3.2.1  Geometric View

This section presents the geometric interpretation of theorem 1 that provides an intuitive yet formal description of the gradient basis model class. Figure 1 illustrates the gradient basis model class and theorem 1 with $\theta \in \mathbb{R}^{2}$ and $f_X(\theta) \in \mathbb{R}^{3}$. Here, we consider the following map from the parameter space to the concatenation of the output of the model at $x_1, x_2, \dots, x_m$:
$$f_X: \theta \in \mathbb{R}^{d_\theta} \mapsto (f_{x_1}(\theta), f_{x_2}(\theta), \dots, f_{x_m}(\theta)) \in \mathbb{R}^{m d_y}.$$
Figure 1: Illustration of gradient basis model class and theorem 1 with $\theta \in \mathbb{R}^{2}$ and $f_X(\theta) \in \mathbb{R}^{3}$ ($d_y = 1$). Theorem 1 translates the local condition of $\theta$ in the parameter space $\mathbb{R}^{2}$ (on the left) to the global optimality in the output space $\mathbb{R}^{3}$ (on the right). The subspace $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class. Theorem 1 states that $f_X(\theta)$ is globally optimal in the subspace as $f_X(\theta) \in \arg\min_{f \in T_{f_X(\theta)}} \mathrm{dist}(f, y)$ for any differentiable critical point $\theta$ of $L$.

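In the output space $\mathbb{R}^{m d_y}$ of $f_X$, the objective function $L$ induces the notion of distance from the target vector $y = (y_1, \dots, y_m) \in \mathbb{R}^{m d_y}$ to a vector $f = (f_1, \dots, f_m) \in \mathbb{R}^{m d_y}$ as
$$\mathrm{dist}(f, y) = \sum_{i=1}^{m} \lambda_i \, \ell(f_i, y_i).$$
We consider the affine subspace $T_{f_X(\theta)}$ of $\mathbb{R}^{m d_y}$ that passes through the point $f_X(\theta)$ and is spanned by the set of vectors $\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}$,
$$T_{f_X(\theta)} = \mathrm{span}(\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}) + \{f_X(\theta)\},$$
where the sum of the two sets represents the Minkowski sum of the sets.
Then the subspace $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class in general beyond the low-dimensional illustration. This is because by assumption 2, for any given $\theta$,
$$T_{f_X(\theta)} = \left\{ \sum_{k=1}^{d_\theta} (g(\theta)_k + \alpha_k) \, \partial_k f_X(\theta) : \alpha \in \mathbb{R}^{d_\theta} \right\} = \left\{ \sum_{k=1}^{d_\theta} \alpha_k \, \partial_k f_X(\theta) : \alpha \in \mathbb{R}^{d_\theta} \right\}, \tag{3.1}$$
and $\sum_{k=1}^{d_\theta} \alpha_k \, \partial_k f_X(\theta) = (f_\theta(x_1;\alpha), \dots, f_\theta(x_m;\alpha))$. In other words, $T_{f_X(\theta)} = \mathrm{span}(\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}) \ni (f_\theta(x_1;\alpha), \dots, f_\theta(x_m;\alpha))$.
Therefore, in general, theorem 1 states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $T_{f_X(\theta)}$ as
$$f_X(\theta) \in \arg\min_{f \in T_{f_X(\theta)}} \mathrm{dist}(f, y),$$
for any differentiable critical point $\theta$ of $L$. Theorem 1 concludes this global optimality in the affine subspace of the output space based on the local condition in the parameter space (i.e., differentiable critical point). A key idea behind theorem 1 is to consider the map between the parameter space and the output space, which enables us to take advantage of assumptions 1 and 2.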

Figure 2 illustrates the gradient basis model class and theorem 1 with a union of manifolds and a tangent space. Under the constant rank condition, the image of the map $f_X$ locally forms a single manifold. More precisely, if there exists a small neighborhood $U(\theta)$ of $\theta$ such that $f_X$ is differentiable in $U(\theta)$ and $\mathrm{rank}(\partial f_X(\theta')) = r$ is constant with some $r$ for all $\theta' \in U(\theta)$ (the constant rank condition), then the rank theorem states that the image $f_X(U(\theta))$ is a manifold of dimension $r$ (Lee, 2013, theorem 4.12). We note that the rank map $\theta \mapsto \mathrm{rank}(\partial f_X(\theta))$ is lower semicontinuous (i.e., if $\mathrm{rank}(\partial f_X(\theta)) = r$, then there exists a neighborhood $U(\theta)$ of $\theta$ such that $\mathrm{rank}(\partial f_X(\theta')) \ge r$ for any $\theta' \in U(\theta)$). Therefore, if $\partial f_X(\theta)$ at $\theta$ has the maximum rank in a small neighborhood of $\theta$, then the constant rank condition is satisfied.

Figure 2: Illustration of gradient basis model class and theorem 1 with manifold and tangent space. The space $\mathbb{R}^{2} \ni \theta$ on the left is the parameter space, and the space $\mathbb{R}^{3} \ni f_X(\theta)$ on the right is the output space. The surface $M \subset \mathbb{R}^{3}$ on the right is the image of $f_X$, which is a union of finitely many manifolds. The tangent space $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class. Theorem 1 states that if $\theta$ is a differentiable critical point of $L$, then $f_X(\theta)$ is globally optimal in the tangent space $T_{f_X(\theta)}$.

For points $\theta$ where the constant rank condition is violated, the image of the map $f_X$ is no longer a single manifold. However, locally it decomposes as a union of finitely many manifolds. More precisely, if there exists a small neighborhood $U(\theta)$ of $\theta$ such that $f_X$ is analytic over $U(\theta)$ (this condition is satisfied for commonly used activation functions such as ReLU, sigmoid, and hyperbolic tangent at any differentiable point), then the image $f_X(U(\theta))$ admits a locally finite partition $\mathcal{M}$ into connected submanifolds such that whenever $M \ne M' \in \mathcal{M}$ with $\bar{M} \cap M' \ne \emptyset$ ($\bar{M}$ is the closure of $M$), we have
$$M' \subset \bar{M}, \quad \dim(M') < \dim(M).$$
See Hardt (1975) for the proof.
If the point $\theta$ satisfies the constant rank condition, then $T_{f_X(\theta)}$ is exactly the tangent space of the manifold formed by the image $f_X(U(\theta))$. Otherwise, locally the image decomposes into a finite union $\mathcal{M}$ of submanifolds. In this case, $T_{f_X(\theta)}$ belongs to the span of the tangent spaces of those manifolds in $\mathcal{M}$ as
$$T_{f_X(\theta)} \subseteq \mathrm{span}\{T_p M : p = f_X(\theta),\ M \in \mathcal{M}\},$$
where $T_p M$ is the tangent space of the manifold $M$ at the point $p$.

3.2.2  Examples

In this section, we show through examples that theorem 1 generalizes the previous results in special cases while providing new theoretical insights based on the gradient basis model class and its geometric view. In the following, whenever the form of f is specified, we require only assumption 1 because assumption 2 is automatically satisfied by a given f.

For classical machine learning models, example 1 shows that the gradient basis model class is indeed equivalent to a given model class. From the geometric view, this means that for any $\theta$, the tangent space $T_{f_X(\theta)}$ is equal to the whole image $M$ of $f_X$ (i.e., $T_{f_X(\theta)}$ does not depend on $\theta$). This reduces theorem 1 to the statement that every critical point of $L$ is a global minimum of $L$.

Example 1: Classical Machine Learning Models.

For any basis function model $f(x;\theta) = \sum_{k=1}^{d_\theta} \theta_k \varphi(x)_k$ in classical machine learning with any fixed feature map $\varphi: \mathcal{X} \to \mathbb{R}^{d_\theta}$, we have that $f_\theta(x;\alpha) = f(x;\alpha)$, and hence $\inf_{\theta \in \mathbb{R}^{d_\theta}} L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$, as well as $\Omega = \emptyset$. In other words, in this special case, theorem 1 states that every critical point of $L$ is a global minimum of $L$. Here, we do not assume that a critical point or a global minimum exists or is attainable. Instead, the statement logically means that if a point is a critical point, then the point is a global minimum. This type of statement vacuously holds true if there is no critical point.
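
Example 1 can be illustrated numerically under extra assumptions that are not part of the example itself: the squared loss, $d_y = 1$, uniform weights, and a random fixed feature matrix `Phi` standing in for $\varphi$. In that setting a critical point solves the weighted normal equations, and its loss is not exceeded by any sampled competitor.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 20, 3
Phi = rng.normal(size=(m, d))        # row i is the fixed feature vector phi(x_i)
y = rng.normal(size=m)
lam = np.full(m, 1.0 / m)            # weights lambda_i

def L(theta):
    """L(theta) = sum_i lambda_i * (phi(x_i)^T theta - y_i)^2."""
    return float(np.sum(lam * (Phi @ theta - y) ** 2))

def grad_L(theta):
    return 2.0 * Phi.T @ (lam * (Phi @ theta - y))

# A critical point solves Phi^T diag(lam) Phi theta = Phi^T diag(lam) y.
A = Phi.T @ (lam[:, None] * Phi)
b = Phi.T @ (lam * y)
theta_crit = np.linalg.solve(A, b)

# Compare against many other parameter vectors.
others = 3.0 * rng.normal(size=(1000, d))
min_other = min(L(t) for t in others)
```

Sampling competitors is of course not a proof; it only shows the convex case behaving as the example states.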

For overparameterized deep neural networks, example 2 shows that the induced gradient basis model class is highly expressive such that it must contain the globally optimal model of a given model class of deep neural networks. In this example, the tangent space $T_{f_X(\theta)}$ is equal to the whole output space $\mathbb{R}^{m d_y}$. This reduces theorem 1 to the statement that every critical point of $L$ is a global minimum of $L$ for overparameterized deep neural networks.

Intuitively, in Figure 1 or 2, we can increase the number of parameters and raise the number of partial derivatives $\partial_k f_X(\theta)$ in order to increase the dimensionality of the tangent space $T_{f_X(\theta)}$ so that $T_{f_X(\theta)} = \mathbb{R}^{m d_y}$. This is indeed what happens in example 2, as well as in the previous studies of significantly overparameterized deep neural networks (Allen-Zhu, Li, & Song, 2018; Du, Lee, Li, Wang, & Zhai, 2018; Zou et al., 2018). In the previous studies, the significant overparameterization is required so that the tangent space $T_{f_X(\theta)}$ does not change from the initial tangent space $T_{f_X(\theta^{(0)})} = \mathbb{R}^{m d_y}$ during training. Thus, theorem 1, with its geometric view, provides novel algebraic and geometric insights into the results of the previous studies and into the reason why overparameterized deep neural networks are easy to optimize despite nonconvexity.

Example 2: Overparameterized Deep Neural Networks.

Theorem 1 implies that every critical point (and every local minimum) is a global minimum for sufficiently overparameterized deep neural networks. Let $n$ be the number of units in each layer of a fully connected feedforward deep neural network. Let us consider a significant overparameterization such that $n \ge m$. Let us write a fully connected feedforward deep neural network with the trainable parameters $(\theta, u)$ by $f(x;\theta) = W \varphi(x;u)$, where $W \in \mathbb{R}^{d_y \times n}$ is the weight matrix in the last layer, $\theta = \mathrm{vec}(W)$, $u$ contains the rest of the parameters, and $\varphi(x;u)$ is the output of the last hidden layer. Denote $x_i = [(x_i^{(\mathrm{raw})})^\top, 1]^\top$ to contain the constant term to account for the bias term in the first layer. Assume that the input samples are normalized as $\|x_i^{(\mathrm{raw})}\|_2 = 1$ for all $i \in \{1, \dots, m\}$ and distinct as $(x_i^{(\mathrm{raw})})^\top x_{i'}^{(\mathrm{raw})} < 1 - \delta$ with some $\delta > 0$ for all $i' \ne i$. Assume that the activation functions are ReLU activation functions. Then we can efficiently set $u$ to guarantee $\mathrm{rank}([\varphi(x_i;u)]_{i=1}^{m}) \ge m$ (e.g., by choosing $u$ to make each unit of the last layer active only for each sample $x_i$).$^{1}$ Theorem 1 implies that every critical point $\theta$ with this $u$ is a global minimum of the whole set of trainable parameters $(\theta, u)$ because $\inf_{\alpha} L_\theta(\alpha) = \inf_{f_1, \dots, f_m} \sum_{i=1}^{m} \lambda_i \, \ell(f_i, y_i)$ (with assumption 1).
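
The rank condition in example 2 can be made concrete with a simplified sketch. It assumes a single hidden ReLU layer with $n = m$ units, where unit $j$ uses weight $x_j^{(\mathrm{raw})}$ and bias $-(1-\delta)$ so that it is active only on sample $j$; this particular construction and the variable names are illustrative assumptions, and deeper networks are not covered.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_raw = 6, 3
X_raw = rng.normal(size=(m, d_raw))
X_raw /= np.linalg.norm(X_raw, axis=1, keepdims=True)    # ||x_i^(raw)||_2 = 1

gram = X_raw @ X_raw.T
max_off = gram[~np.eye(m, dtype=bool)].max()              # largest x_i . x_i' with i != i'
delta = (1.0 - max_off) / 2.0                             # ensures x_i . x_i' < 1 - delta

# Unit j: ReLU(x_j^(raw) . x - (1 - delta)); the bias enters through the appended 1.
Phi = np.maximum(gram - (1.0 - delta), 0.0)               # Phi[i, j] = phi_j(x_i)
rank = np.linalg.matrix_rank(Phi)

# With rank m, the last layer can fit arbitrary targets (here d_y = 1, squared loss).
y = rng.normal(size=m)
W, *_ = np.linalg.lstsq(Phi, y, rcond=None)
residual = float(np.sum((Phi @ W - y) ** 2))
```

Here `Phi` is a positive multiple of the identity, so the span of the last-layer partial derivatives already covers the whole output space $\mathbb{R}^{m}$.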

For deep neural networks, example 3 shows that standard networks have the global optimality guarantee with respect to the representation learned at the last layer, and skip connections further ensure the global optimality with respect to the representation learned at each hidden layer. This is because adding the skip connections introduces new partial derivatives $\{\partial_k f_X(\theta)\}_k$ that span the tangent space containing the output of the best model with the corresponding learned representation.

Example 3: Deep Neural Networks and Learned Representations.
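Consider a feedforward deep neural network, and let $I^{(\mathrm{skip})} \subseteq \{1, \dots, H\}$ be the set of indices such that there exists a skip connection from the $(l-1)$th layer to the last layer for all $l \in I^{(\mathrm{skip})}$; that is, in this example,
$$f(x;\theta) = \sum_{l \in I^{(\mathrm{skip})}} W^{(l+1)} h^{(l)}(x;u),$$
where $\theta = \mathrm{vec}([[W^{(l+1)}]_{l \in I^{(\mathrm{skip})}}, u]) \in \mathbb{R}^{d_\theta}$ with $W^{(l+1)} \in \mathbb{R}^{d_y \times d_l}$ and $u \in \mathbb{R}^{d_u}$.
The conclusion in this example holds for standard deep neural networks without skip connections too, since we always have $H \in I^{(\mathrm{skip})}$ for standard deep neural networks. Let assumption 1 hold. Then theorem 1 implies that for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds:
$$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta^{(\mathrm{skip})}(\alpha),$$
where
$$L_\theta^{(\mathrm{skip})}(\alpha) = \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}\!\left( \sum_{l \in I^{(\mathrm{skip})}} \alpha_w^{(l+1)} h^{(l)}(x_i;u) + \sum_{k=1}^{d_u} (\alpha_u)_k \, \partial_{u_k} f_{x_i}(\theta) \right),$$
with $\alpha = \mathrm{vec}([[\alpha_w^{(l+1)}]_{l \in I^{(\mathrm{skip})}}, \alpha_u]) \in \mathbb{R}^{d_\theta}$, $\alpha_w^{(l+1)} \in \mathbb{R}^{d_y \times d_l}$, and $\alpha_u \in \mathbb{R}^{d_u}$. This is because $f(x;\theta) = (\partial_{\mathrm{vec}(W^{(H+1)})} f(x;\theta)) \, \mathrm{vec}(W^{(H+1)})$, and thus assumption 2 is automatically satisfied. Here, $h^{(l)}(x_i;u)$ is the representation learned at the $l$th layer. Therefore, $\inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta^{(\mathrm{skip})}(\alpha)$ is at most the global minimum value of the basis models with the learned representations of the last layer and all hidden layers with the skip connections.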

3.3  Theory for Local Minima

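We are now ready to present our first main result. We define the (theoretical) objective function $\tilde{L}_\theta$ of the perturbable gradient basis model class as
$$\tilde{L}_\theta(\alpha, \varepsilon, S) = \sum_{i=1}^{m} \lambda_i \, \ell(\tilde{f}_\theta(x_i;\alpha,\varepsilon,S), y_i),$$
where $\tilde{f}_\theta(x_i;\alpha,\varepsilon,S)$ is a perturbed gradient basis model defined as
$$\tilde{f}_\theta(x_i;\alpha,\varepsilon,S) = \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j} \, \partial_k f_{x_i}(\theta + \varepsilon S_j).$$
Here, $S$ is a finite set of vectors $S_1, \dots, S_{|S|} \in \mathbb{R}^{d_\theta}$ and $\alpha \in \mathbb{R}^{d_\theta \times |S|}$. Let $V[\theta, \varepsilon]$ be the set of all vectors $v \in \mathbb{R}^{d_\theta}$ such that $\|v\|_2 \le 1$ and $f_{x_i}(\theta + \varepsilon v) = f_{x_i}(\theta)$ for any $i \in \{1, \dots, m\}$. Let $S \subseteq_{\mathrm{fin}} S'$ denote a finite subset $S$ of a set $S'$. For an $S_j \in V[\theta, \varepsilon]$, we have $f_{x_i}(\theta + \varepsilon S_j) = f_{x_i}(\theta)$, but it is possible to have $\partial_k f_{x_i}(\theta + \varepsilon S_j) \ne \partial_k f_{x_i}(\theta)$. This enables the greater expressivity of $\tilde{f}_\theta(x_i;\alpha,\varepsilon,S)$ with an $S \subseteq_{\mathrm{fin}} V[\theta, \varepsilon]$ when compared with $f_\theta(x_i;\alpha)$.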

The following theorem shows that every differentiable local minimum of L achieves the global minimum value of L˜θ:

Theorem 2.
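Let assumptions 1 and 2 hold. Then, for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde\Omega)$ of $L$, the following holds: there exists $\varepsilon_0 > 0$ such that for any $\varepsilon \in [0, \varepsilon_0)$,
$$L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta,\varepsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde{L}_\theta(\alpha, \varepsilon, S). \tag{3.2}$$
To understand the relationship between theorems 1 and 2, let us consider the following general inequalities: for any $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde\Omega)$ with $\varepsilon \ge 0$ being sufficiently small,
$$L(\theta) \ge \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) \ge \inf_{S \subseteq_{\mathrm{fin}} V[\theta,\varepsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde{L}_\theta(\alpha, \varepsilon, S).$$
Here, whereas theorem 1 states that the first inequality becomes equality as $L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$ at every differentiable critical point, theorem 2 states that both inequalities become equality as
$$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta,\varepsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde{L}_\theta(\alpha, \varepsilon, S)$$
at every differentiable local minimum.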

From theorem 1 to theorem 2, the power of increasing the number of parameters (including overparameterization) is further improved. The right-hand side in equation 3.2 is the global minimum value over the variables $S \subseteq_{\mathrm{fin}} V[\theta, \varepsilon]$ and $\alpha \in \mathbb{R}^{d_\theta \times |S|}$. Here, as $d_\theta$ increases, we may obtain the global minimum value of a larger search space $\mathbb{R}^{d_\theta \times |S|}$, which is similar to theorem 1. A concern in theorem 1 is that as $d_\theta$ increases, we may also significantly increase the redundancy among the elements in $\{\partial_k f_{x_i}(\theta)\}_{k=1}^{d_\theta}$. Although this remains a valid concern, theorem 2 allows us to break the redundancy to some degree through the globally optimal $S \subseteq_{\mathrm{fin}} V[\theta, \varepsilon]$.

For example, consider $f(x;\theta) = g(W^{(l)} h^{(l)}(x;u); u)$, which represents a deep neural network, with some $l$th-layer output $h^{(l)}(x;u) \in \mathbb{R}^{d_l}$, a trainable weight matrix $W^{(l)}$, and an arbitrary function $g$ to compute the rest of the forward pass. Here, $\theta = \mathrm{vec}([W^{(l)}, u])$. Let $h^{(l)}(X;u) = [h^{(l)}(x_i;u)]_{i=1}^{m} \in \mathbb{R}^{d_l \times m}$ and, similarly, $f(X;\theta) = g(W^{(l)} h^{(l)}(X;u); u) \in \mathbb{R}^{d_y \times m}$. Then, all vectors $v$ corresponding to any elements in the left null space of $h^{(l)}(X;u)$ are in $V[\theta, \varepsilon]$ (i.e., $v_k = 0$ for all $k$ corresponding to $u$, and the rest of $v_k$ is set to perturb $W^{(l)}$ by an element in the left null space). Thus, as the redundancy increases such that the dimension of the left null space of $h^{(l)}(X;u)$ increases, we have a larger space of $V[\theta, \varepsilon]$, for which a global minimum value is guaranteed at a local minimum.
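
The left-null-space construction can be sketched numerically. The example assumes a random matrix `H` standing in for $h^{(l)}(X;u)$ with $d_l > m$ and checks only the pre-activation $W^{(l)} h^{(l)}(X;u)$; since the rest of the forward pass $g$ receives the same input, the network outputs on the training inputs are unchanged as well. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_l, m, d_out = 8, 5, 3
H = rng.normal(size=(d_l, m))           # stand-in for h^(l)(X; u), one column per sample
W = rng.normal(size=(d_out, d_l))       # stand-in for W^(l)

# Left null space of H: vectors a with a^T H = 0, read off from the full SVD.
U, s, _ = np.linalg.svd(H, full_matrices=True)
r = int(np.sum(s > 1e-10))              # numerical rank of H
N = U[:, r:]                            # columns span the left null space

# Perturbation of W whose rows lie in the left null space, scaled to unit norm.
Delta = rng.normal(size=(d_out, d_l - r)) @ N.T
Delta /= np.linalg.norm(Delta)

eps = 0.1
pre_before = W @ H
pre_after = (W + eps * Delta) @ H       # identical on the training inputs
```

Although the outputs agree, partial derivatives with respect to parameters upstream of $W^{(l)}$ generally differ at the perturbed point, which is the source of the extra basis vectors.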

3.3.1  Geometric View

This section presents the geometric interpretation of the perturbable gradient basis model class and theorem 2. Figure 3 illustrates the perturbable gradient basis model class and theorem 2 with $\theta \in \mathbb{R}^{2}$ and $f_X(\theta) \in \mathbb{R}^{3}$. Figure 4 illustrates them with a union of manifolds and tangent spaces at a singular point. Given an $\varepsilon$ ($\le \varepsilon_0$), define the affine subspace $\tilde{T}_{f_X(\theta)}$ of the output space $\mathbb{R}^{m d_y}$ by
$$\tilde{T}_{f_X(\theta)} = \mathrm{span}(\{f \in \mathbb{R}^{m d_y} : (\exists v \in V[\theta, \varepsilon])[f \in T_{f_X(\theta + \varepsilon v)}]\}).$$
Then the subspace $\tilde{T}_{f_X(\theta)}$ is the space of the outputs of the perturbable gradient basis model class in general beyond the low-dimensional illustration (this follows from equation 3.1 and the definition of the perturbable gradient basis model). Therefore, in general, theorem 2 states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $\tilde{T}_{f_X(\theta)}$ as
$$f_X(\theta) \in \arg\min_{f \in \tilde{T}_{f_X(\theta)}} \mathrm{dist}(f, y)$$
for any differentiable local minimum $\theta$ of $L$. Theorem 2 concludes the global optimality in the affine subspace of the output space based on the local condition in the parameter space (i.e., a differentiable local minimum). Here, a (differentiable) local minimum $\theta$ is required to be optimal only in an arbitrarily small local neighborhood in the parameter space, and yet $f_X(\theta)$ is guaranteed to be globally optimal in the affine subspace of the output space. This illuminates the fact that nonconvex optimization in machine learning has a particular structure beyond general nonconvex optimization.
Figure 3: Illustration of perturbable gradient basis model class and theorem 2 with $\theta \in \mathbb{R}^{2}$ and $f_X(\theta) \in \mathbb{R}^{3}$ ($d_y = 1$). Theorem 2 translates the local condition of $\theta$ in the parameter space $\mathbb{R}^{2}$ (on the left) to the global optimality in the output space $\mathbb{R}^{3}$ (on the right). The subspace $\tilde{T}_{f_X(\theta)}$ is the space of the outputs of the perturbable gradient basis model class. Theorem 2 states that $f_X(\theta)$ is globally optimal in the subspace as $f_X(\theta) \in \arg\min_{f \in \tilde{T}_{f_X(\theta)}} \mathrm{dist}(f, y)$ for any differentiable local minimum $\theta$ of $L$. In this example, $\tilde{T}_{f_X(\theta)}$ is the whole output space $\mathbb{R}^{3}$, while $T_{f_X(\theta)}$ is not, illustrating the advantage of the perturbable gradient basis over the gradient basis. Since $\tilde{T}_{f_X(\theta)} = \mathbb{R}^{3}$, $f_X(\theta)$ must be globally optimal in the whole output space $\mathbb{R}^{3}$.

Figure 4: Illustration of perturbable gradient basis model class and theorem 2 with manifold and tangent space at a singular point. The surface $M \subset \mathbb{R}^{3}$ is the image of $f_X$, which is a union of finitely many manifolds. The line $T_{f_X(\theta)}$ on the left panel is the space of the outputs of the gradient basis model class. The whole space $\tilde{T}_{f_X(\theta)} = \mathbb{R}^{3}$ on the right panel is the space of the outputs of the perturbable gradient basis model class. The space $\tilde{T}_{f_X(\theta)}$ is the span of the set of the vectors in the tangent spaces $T_{f_X(\theta)}$, $T_{f_X(\theta')}$, and $T_{f_X(\theta'')}$. Theorem 2 states that if $\theta$ is a differentiable local minimum of $L$, then $f_X(\theta)$ is globally optimal in the space $\tilde{T}_{f_X(\theta)}$.


The previous section showed that all local minima achieve the global optimality of the perturbable gradient basis model class with several direct consequences for special cases. In this section, as consequences of theorem 2, we complement or improve the state-of-the-art results in the literature.

4.1  Example: ResNets

As an example of theorem 2, we set $f$ to be the function of a certain type of residual networks (ResNets) that Shamir (2018) studied. That is, both Shamir (2018) and this section set $f$ as
$$f(x;\theta) = W(x + R z(x;u)), \tag{4.1}$$
where $\theta = \mathrm{vec}([W, R, u]) \in \mathbb{R}^{d_\theta}$ with $W \in \mathbb{R}^{d_y \times d_x}$, $R \in \mathbb{R}^{d_x \times d_z}$, and $u \in \mathbb{R}^{d_u}$. Here, $z(x;u) \in \mathbb{R}^{d_z}$ represents an output of deep residual functions with a parameter vector $u$. No assumption is imposed on the form of $z(x;u)$, and $z(x;u)$ can represent an output of possibly complicated deep residual functions that arise in ResNets. For example, the function $f$ can represent deep preactivation ResNets (He, Zhang, Ren, & Sun, 2016), which are widely used in practice. To simplify theoretical study, Shamir (2018) assumed that every entry of the matrix $R$ is unconstrained (e.g., instead of $R$ representing convolutions). We adopt this assumption, following the previous study (Shamir, 2018).

4.1.1  Background

Along with an analysis of approximate critical points, Shamir (2018) proved the following main result, proposition 1, under the assumptions PA1, PA2, and PA3:

  • PA1: The output dimension $d_y = 1$.

  • PA2: For any $y$, the function $\ell_y$ is convex and twice differentiable.

  • PA3: On any bounded subset of the domain of $L$, the function $L_u(W, R)$, its gradient $\nabla L_u(W, R)$, and its Hessian $\nabla^2 L_u(W, R)$ are all Lipschitz continuous in $(W, R)$, where $L_u(W, R) = L(\theta)$ with a fixed $u$.

Proposition 1
(Shamir, 2018). Let $f$ be specified by equation 4.1. Let assumptions PA1, PA2, and PA3 hold. Then for any local minimum $\theta$ of $L$,
$$L(\theta) \le \inf_{W \in \mathbb{R}^{d_y \times d_x}} \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}(W x_i).$$

Shamir (2018) remarked that it is an open problem whether proposition 1 and another main result in the article can be extended to networks with $d_y > 1$ (multiple output units). Note that Shamir (2018) also provided proposition 1 with an expected loss and an analysis for a simpler decoupled model, $Wx + Vz(x;u)$. For the simpler decoupled model, our theorem 1 immediately concludes that given any $u$, every critical point with respect to $\theta_{-u} = (W, R)$ achieves a global minimum value with respect to $\theta_{-u}$ as $L(\theta_{-u}) = \inf\{\sum_{i=1}^{m} \lambda_i \, \ell_{y_i}(W x_i + R z(x_i;u)) : W \in \mathbb{R}^{d_y \times d_x}, R \in \mathbb{R}^{d_x \times d_z}\}$ $(\le \inf_{W \in \mathbb{R}^{d_y \times d_x}} \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}(W x_i))$. This holds for every critical point $\theta$ since any critical point $\theta$ must be a critical point with respect to $\theta_{-u}$.
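
The inequality for the decoupled model can be illustrated with a small least-squares sketch. It assumes the squared loss, $d_y = 1$, uniform weights, and a hypothetical fixed residual feature $z(x;u) = \tanh(u^\top x)$; the second coefficient block is given shape $d_y \times d_z$ here so that the sum is well defined, which is also an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x, d_z = 30, 3, 4
X = rng.normal(size=(m, d_x))                 # row i is x_i
u = rng.normal(size=(d_x, d_z))               # fixed parameters of the residual feature
Z = np.tanh(X @ u)                            # row i is z(x_i; u), a stand-in feature
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=m)

def best_avg_sq_loss(F, targets):
    """Global minimum of the average squared loss over linear coefficients on F."""
    coef, *_ = np.linalg.lstsq(F, targets, rcond=None)
    return float(np.mean((F @ coef - targets) ** 2))

loss_linear = best_avg_sq_loss(X, y)                     # inf over W only
loss_joint = best_avg_sq_loss(np.hstack([X, Z]), y)      # inf over both coefficient blocks
```

Because the problem is convex in the coefficients for a fixed $u$, every critical point attains `loss_joint`, which can never exceed `loss_linear`.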

4.1.2  Result

The following theorem shows that every differentiable local minimum achieves the global minimum value of $\tilde{L}_\theta^{(\mathrm{ResNet})}$ (the right-hand side in equation 4.2), which is no worse than the upper bound in proposition 1 and is strictly better than the upper bound as long as $z(x_i;u)$ or $\tilde{f}_\theta(x_i;\alpha,\epsilon,S)$ is nonnegligible. Indeed, the global minimum value of $\tilde{L}_\theta^{(\mathrm{ResNet})}$ (the right-hand side in equation 4.2) is no worse than the global minimum value of all models parameterized by the coefficients of the basis $x$ and $z(x;u)$, and further improvement is guaranteed through a nonnegligible $\tilde{f}_\theta(x_i;\alpha,\epsilon,S)$.

Theorem 3.
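Let $f$ be specified by equation 4.1. Let assumption 1 hold. Assume that $d_y \le \min\{d_x, d_z\}$. Then for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ of $L$, the following holds: there exists $\epsilon_0 > 0$ such that for any $\epsilon \in (0, \epsilon_0)$,
\[
L(\theta) = \inf_{\substack{S \subseteq_{\mathrm{fin}} V[\theta,\epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}, \\ \alpha_w \in \mathbb{R}^{d_y \times d_x},\ \alpha_r \in \mathbb{R}^{d_y \times d_z}}} \tilde{L}_\theta^{(\mathrm{ResNet})}(\alpha, \alpha_w, \alpha_r, \epsilon, S),
\tag{4.2}
\]
where
\[
\tilde{L}_\theta^{(\mathrm{ResNet})}(\alpha, \alpha_w, \alpha_r, \epsilon, S) = \sum_{i=1}^{m} \lambda_i \ell_{y_i}\big(\alpha_w x_i + \alpha_r z(x_i;u) + \tilde{f}_\theta(x_i;\alpha,\epsilon,S)\big).
\]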

Theorem 3 also successfully solved the first part of the open problem in the literature (Shamir, 2018) by discarding the assumption of $d_y = 1$. From the geometric view, theorem 3 states that the span $\tilde{T}_{f_X(\theta)}$ of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta,\epsilon]\}$ contains the output of the best basis model with the linear feature $x$ and the learned nonlinear feature $z(x_i;u)$. Similar to the examples in Figures 3 and 4, $\tilde{T}_{f_X(\theta)} \neq T_{f_X(\theta)}$ and the output of the best basis model with these features is contained in $\tilde{T}_{f_X(\theta)}$ but not in $T_{f_X(\theta)}$.
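The ordering between the bound in proposition 1 and the basis model over $x$ and $z(x;u)$ can be illustrated numerically. The following is a minimal sketch, assuming a squared loss, uniform weights $\lambda_i = 1/m$, and a random `tanh` feature map as a stand-in for $z(x_i;u)$; the data, the names `best_squared_loss`, `loss_linear`, and `loss_augmented`, and all sizes are hypothetical. It only compares the two infima and does not train a ResNet or verify theorem 3 itself.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x, d_z = 50, 3, 4
X = rng.standard_normal((m, d_x))                  # rows are x_i
Z = np.tanh(X @ rng.standard_normal((d_x, d_z)))   # stand-in for z(x_i; u)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(m)

def best_squared_loss(features, targets):
    """Smallest mean squared loss of a linear model on the given features."""
    coef, *_ = np.linalg.lstsq(features, targets, rcond=None)
    residual = features @ coef - targets
    return float(np.mean(residual ** 2))

# inf over alpha_w of the loss of alpha_w x_i (the linear-predictor baseline)
loss_linear = best_squared_loss(X, y)
# inf over (alpha_w, alpha_r) of the loss of alpha_w x_i + alpha_r z(x_i; u)
loss_augmented = best_squared_loss(np.hstack([X, Z]), y)
```

Because the feature set of the second problem contains that of the first, `loss_augmented` can never exceed `loss_linear`; the gap is what a nonnegligible $z(x_i;u)$ contributes in this toy setting.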

Unlike the recent study on ResNets (Kawaguchi & Bengio, 2019), our theorem 3 predicts the value of $L$ through the global minimum value of a large search space (i.e., the domain of $\tilde{L}_\theta^{(\mathrm{ResNet})}$) and is proven as a consequence of our general theory (i.e., theorem 2) with a significantly different proof idea (see section 4.3) and with the novel geometric insight.

4.2.1  Example: Deep Nonlinear Networks with Locally Induced Partial Linear Structures

We specify f to represent fully connected feedforward networks with arbitrary nonlinearity σ and arbitrary depth H as follows:
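\[
f(x;\theta) = W^{(H+1)} h^{(H)}(x;\theta),
\tag{4.3}
\]
where
\[
h^{(l)}(x;\theta) = \sigma^{(l)}\big(W^{(l)} h^{(l-1)}(x;\theta)\big),
\]
for all $l \in \{1,\dots,H\}$ with $h^{(0)}(x;\theta) = x$. Here, $\theta = \mathrm{vec}([W^{(l)}]_{l=1}^{H+1}) \in \mathbb{R}^{d_\theta}$ with $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$, $d_{H+1} = d_y$, and $d_0 = d_x$. In addition, $\sigma^{(l)}: \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$ represents an arbitrary nonlinear activation function per layer $l$ and is allowed to differ among different layers.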
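As a concrete reading of equation 4.3, the following minimal NumPy sketch evaluates the forward recursion. The function name `feedforward`, the choice of `np.tanh` as activation, and the layer sizes are illustrative assumptions rather than part of the original text.

```python
import numpy as np

def feedforward(x, weights, activations):
    """Evaluate f(x; theta) = W^(H+1) h^(H)(x; theta) as in equation 4.3.

    weights:     list [W^(1), ..., W^(H+1)], W^(l) of shape (d_l, d_{l-1}).
    activations: list [sigma^(1), ..., sigma^(H)] of callables.
    """
    h = x  # h^(0) = x
    for W, sigma in zip(weights[:-1], activations):
        h = sigma(W @ h)  # h^(l) = sigma^(l)(W^(l) h^(l-1))
    return weights[-1] @ h  # the output layer applies no activation

rng = np.random.default_rng(0)
dims = [3, 5, 4, 2]  # d_0 = d_x, d_1, d_2, d_3 = d_y (hypothetical, H = 2)
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(3)]
x = rng.standard_normal(dims[0])
y_hat = feedforward(x, weights, [np.tanh, np.tanh])
```

Passing a list of callables mirrors the statement that $\sigma^{(l)}$ may differ from layer to layer.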

4.2.2  Background

Given the difficulty of theoretically understanding deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that theoretically studying simplified networks (i.e., deep linear networks) is worthwhile. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that deep linear networks may exhibit several properties analogous to those of deep nonlinear networks. Accordingly, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett, Helmbold, & Long, 2019; Du & Hu, 2019).

Along this line, Laurent and Brecht (2018) recently proved the following main result, proposition 2, under the assumptions PA4, PA5, and PA6:

  • PA4: Every activation function is identity as $\sigma^{(l)}(q) = q$ for every $l \in \{1,\dots,H\}$ (i.e., deep linear networks).

  • PA5: For any $y$, the function $\ell_y$ is convex and differentiable.

  • PA6: The thinnest layer is either the input layer or the output layer as $\min\{d_x, d_y\} \le \min\{d_1,\dots,d_H\}$.

Proposition 2

(Laurent & Brecht, 2018). Let f be specified by equation 4.3. Let assumptions PA4, PA5, and PA6 hold. Then every local minimum θ of L is a global minimum.

4.2.3  Result

Instead of studying deep linear networks, we now consider a partial linear structure locally induced by a parameter vector with nonlinear activation functions. This relaxes the linearity assumption and extends our understanding of deep linear networks to deep nonlinear networks.

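Intuitively, $\mathcal{J}_{n,t}[\theta]$ is a set of partial linear structures locally induced by a vector $\theta$, which is now formally defined as follows. Given a $\theta \in \mathbb{R}^{d_\theta}$, let $\mathcal{J}_{n,t}[\theta]$ be a set of all sets $J = \{J^{(t+1)},\dots,J^{(H+1)}\}$ such that each set $J = \{J^{(t+1)},\dots,J^{(H+1)}\} \in \mathcal{J}_{n,t}[\theta]$ satisfies the following conditions: there exists $\epsilon > 0$ such that for all $l \in \{t+1, t+2, \dots, H+1\}$,

  1. $J^{(l)} \subseteq \{1,\dots,d_l\}$ with $|J^{(l)}| \ge n$.

  2. $h^{(l)}(x_i,\theta')_k = (W^{(l)} h^{(l-1)}(x_i,\theta'))_k$ for all $(k,\theta',i) \in J^{(l)} \times B(\theta,\epsilon) \times \{1,\dots,m\}$.

  3. $W^{(l+1)}_{i,j} = 0$ for all $(i,j) \in (\{1,\dots,d_{l+1}\} \setminus J^{(l+1)}) \times J^{(l)}$ if $l \le H-1$.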

Let $\Theta_{n,t}$ be the set of all parameter vectors $\theta$ such that $\mathcal{J}_{n,t}[\theta]$ is nonempty. As the definition reveals, a neural network with a $\theta \in \Theta_{d_y,t}$ can be a standard deep nonlinear neural network (with no linear units).
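Condition 2 of the definition above can be made concrete for ReLU activations, which is an assumption of this sketch and not a restriction in the text. A hidden unit whose preactivation is strictly positive on every training input computes $(W^{(l)} h^{(l-1)})_k$ exactly, and by continuity it keeps doing so in a small ball around $\theta$. The helper name `locally_linear_units`, the `margin` parameter, and the toy data are hypothetical, and condition 3 (the zero pattern of the next layer's weights) is not checked here.

```python
import numpy as np

def relu(q):
    return np.maximum(q, 0.0)

def locally_linear_units(X, hidden_weights, margin=1e-6):
    """For each hidden layer, return the indices k whose preactivation exceeds
    `margin` on every data point, so that ReLU acts as the identity there."""
    H = X  # columns are the data points x_1, ..., x_m
    index_sets = []
    for W in hidden_weights:
        pre = W @ H
        index_sets.append(np.flatnonzero((pre > margin).all(axis=1)))
        H = relu(pre)
    return index_sets

X = np.array([[1.0, 2.0],
              [1.0, 3.0]])           # two inputs: (1, 1) and (2, 3)
W1 = np.array([[1.0, 1.0],           # positive on both inputs: locally linear
               [-1.0, -1.0],         # negative on both inputs: inactive
               [1.0, -1.0]])         # touches zero on the first input
index_sets = locally_linear_units(X, [W1])
```

In this toy case only the first hidden unit qualifies, so a candidate $J^{(1)}$ would be the index set containing that unit alone.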

Theorem 4.
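Let $f$ be specified by equation 4.3. Let assumption 1 hold. Then for any $t \in \{1,\dots,H\}$, at every local minimum $\theta \in (\Theta_{d_y,t} \setminus \tilde{\Omega})$ of $L$, the following holds. There exists $\epsilon_0 > 0$ such that for any $\epsilon \in (0, \epsilon_0)$,
\[
L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta,\epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|},\ \alpha_h \in \mathbb{R}^{d_t}} \tilde{L}_{\theta,t}^{(\mathrm{ff})}(\alpha, \alpha_h, \epsilon, S),
\]
where
\[
\tilde{L}_{\theta,t}^{(\mathrm{ff})}(\alpha, \alpha_h, \epsilon, S) = \sum_{i=1}^{m} \lambda_i \ell_{y_i}\Big(\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u) + \tilde{f}_\theta(x_i;\alpha,\epsilon,S)\Big),
\]
with $\alpha_h = \mathrm{vec}([\alpha_h^{(l+1)}]_{l=t}^{H}) \in \mathbb{R}^{d_t}$, $\alpha_h^{(l+1)} \in \mathbb{R}^{d_y \times d_l}$, and $d_t = d_y \sum_{l=t}^{H} d_l$.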

Theorem 4 is a special case of theorem 2. A special case of theorem 4 then results in one of the main results in the literature regarding deep linear neural networks, that is, every local minimum is a global minimum. Consider any deep linear network with $d_y \le \min\{d_1,\dots,d_H\}$. Then every local minimum $\theta$ is in $\Theta_{d_y,0} \setminus \tilde{\Omega} = \Theta_{d_y,0}$. Hence, theorem 4 is reduced to the statement that for any local minimum, $L(\theta) = \inf_{\alpha_h \in \mathbb{R}^{d_t}} \sum_{i=1}^{m} \lambda_i \ell_{y_i}(\sum_{l=0}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u)) = \inf_{\alpha_x \in \mathbb{R}^{d_x}} \sum_{i=1}^{m} \lambda_i \ell_{y_i}(\alpha_x x_i)$, which is the global minimum value. Thus, every local minimum is a global minimum for any deep linear neural network with $d_y \le \min\{d_1,\dots,d_H\}$. Therefore, theorem 4 successfully generalizes the recent previous result in the literature (proposition 2) for a common scenario of $d_y \le d_x$.
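The reduction to a single linear predictor $\alpha_x x_i$ rests on the fact that a deep linear network computes the product of its weight matrices. A minimal numerical check of that fact is sketched below; the layer sizes and variable names are hypothetical.

```python
import numpy as np
from functools import reduce

rng = np.random.default_rng(0)
dims = [4, 6, 5, 3]  # d_x, d_1, d_2, d_y with d_y <= min(d_1, d_2)
weights = [rng.standard_normal((dims[l + 1], dims[l])) for l in range(len(dims) - 1)]
x = rng.standard_normal(dims[0])

# Forward pass with identity activations (assumption PA4).
h = x
for W in weights:
    h = W @ h

# Collapse the network into one matrix W^(3) W^(2) W^(1).
product = reduce(lambda acc, W: W @ acc, weights)
```

Here `product` has shape `(d_y, d_x)`, so the deep linear network and the single-layer model `product @ x` produce the same output for every input.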

Beyond deep linear networks, theorem 4 illustrates both the benefit of the locally induced structure and the overparameterization for deep nonlinear networks. In the first term, $\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u)$, in $\tilde{L}_{\theta,t}^{(\mathrm{ff})}$, we benefit by decreasing $t$ (a more locally induced structure) and increasing the width of the $l$th layer for any $l \ge t$ (overparameterization). The second term, $\tilde{f}_\theta(x_i;\alpha,\epsilon,S)$ in $\tilde{L}_{\theta,t}^{(\mathrm{ff})}$, is the general term that is always present from theorem 2, where we benefit from increasing $d_\theta$ because $\alpha \in \mathbb{R}^{d_\theta \times |S|}$.

From the geometric view, theorem 4 captures the intuition that the span $\tilde{T}_{f_X(\theta)}$ of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta,\epsilon]\}$ contains the best basis model with the linear feature for deep linear networks, as well as the best basis models with more nonlinear features as more local structures arise. Similar to the examples in Figures 3 and 4, $\tilde{T}_{f_X(\theta)} \neq T_{f_X(\theta)}$ and the output of the best basis models with those features are contained in $\tilde{T}_{f_X(\theta)}$ but not in $T_{f_X(\theta)}$.

A similar local structure was recently considered in Kawaguchi, Huang, and Kaelbling (2019). However, both the problem settings and the obtained results largely differ from those in Kawaguchi et al. (2019). Furthermore, theorem 4 is proven as a consequence of our general theory (theorem 2), and accordingly, the proofs largely differ from each other as well. Theorem 4 also differs from recent results on the gradient descent algorithm for deep linear networks (Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett et al., 2019; Du & Hu, 2019), since we analyze the loss surface instead of a specific algorithm and theorem 4 applies to deep nonlinear networks as well.

4.3  Proof Idea in Applications of Theorem 2

Theorems 3 and 4 are simple consequences of theorem 2, and their proof is illustrative as a means of using theorem 2 in future studies with different additional assumptions. The high-level idea behind the proofs in the applications of theorem 2 is captured in the geometric view of theorem 2 (see Figures 3 and 4). That is, given a desired guarantee, we check whether the space $\tilde{T}_{f_X(\theta)}$ is expressive enough to contain the output of the desired model corresponding to the desired guarantee.

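To simplify the use of theorem 2, we provide the following lemma. This lemma states that the expressivity of the model $\tilde{f}_\theta(x;\alpha,\epsilon,S)$ with respect to $(\alpha,S)$ is the same as that of $\tilde{f}_\theta(x;\alpha,\epsilon,S) + \tilde{f}_\theta(x;\alpha',\epsilon,S')$ with respect to $(\alpha,\alpha',S,S')$. As shown in its proof, this is essentially because $\tilde{f}_\theta$ is linear in $\alpha$, and a union of two sets $S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ and $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ remains a finite subset of $V[\theta,\epsilon]$.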

Lemma 1.

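For any $\theta$, any $\epsilon \ge 0$, any $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$, and any $x$, it holds that
\[
\{\tilde{f}_\theta(x;\alpha,\epsilon,S) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} = \{\tilde{f}_\theta(x;\alpha,\epsilon,S) + \tilde{f}_\theta(x;\alpha',\epsilon,S') : \alpha \in \mathbb{R}^{d_\theta \times |S|}, \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\}.
\]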

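Based on theorem 2 and lemma 1, the proofs of theorems 3 and 4 are reduced to a simple search for finding $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ such that the expressivity of $\tilde{f}_\theta(x_i;\alpha',\epsilon,S')$ with respect to $\alpha'$ is no worse than the expressivity of $\alpha_w x_i + \alpha_r z(x_i;u)$ with respect to $(\alpha_w,\alpha_r)$ (see theorem 3) and that of $\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u)$ with respect to $\alpha_h^{(l+1)}$ (see theorem 4). In other words, $\{\tilde{f}_\theta(x_i;\alpha',\epsilon,S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\} \supseteq \{\alpha_w x_i + \alpha_r z(x_i;u) : \alpha_w \in \mathbb{R}^{d_y \times d_x}, \alpha_r \in \mathbb{R}^{d_y \times d_z}\}$ (see theorem 3) and $\{\tilde{f}_\theta(x_i;\alpha',\epsilon,S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\} \supseteq \{\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u) : \alpha_h \in \mathbb{R}^{d_t}\}$ (see theorem 4). Only with this search for $S'$, theorem 2 together with lemma 1 implies the desired statements for theorems 3 and 4 (see sections A.4 and A.5 in the appendix for further details). Thus, theorem 2 also enables simple proofs.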

This study provided a general theory for nonconvex machine learning and demonstrated its power by proving new competitive theoretical results with it. In general, the proposed theory provides a mathematical tool to study the effects of hypothesis classes f, methods, and assumptions through the lens of the global optima of the perturbable gradient basis model class.

In convex machine learning with a model output $f(x;\theta) = \theta^\top x$ with a (nonlinear) feature output $x = \varphi(x^{(\mathrm{raw})})$, achieving a critical point ensures the global optimality in the span of the fixed basis $x = \varphi(x^{(\mathrm{raw})})$. In nonconvex machine learning, we have shown that achieving a critical point ensures the global optimality in the span of the gradient basis $\partial f_x(\theta)$, which coincides with the fixed basis $x = \varphi(x^{(\mathrm{raw})})$ in the case of the convex machine learning. Thus, whether convex or nonconvex, achieving a critical point ensures the global optimality in the span of some basis, which might be arbitrarily bad (or good) depending on the choice of the handcrafted basis $\varphi(x^{(\mathrm{raw})}) = \partial f_x(\theta)$ (for the convex case) or the induced basis $\partial f_x(\theta)$ (for the nonconvex case). Therefore, in terms of the loss values at critical points, nonconvex machine learning is theoretically as justified as the convex one, except in the case when a preference is given to $\varphi(x^{(\mathrm{raw})})$ over $\partial f_x(\theta)$ (both of which can be arbitrarily bad or good). The same statement holds for local minima and perturbable gradient basis.
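The convex half of this comparison is easy to check numerically. The sketch below assumes a squared loss with uniform weights and random features standing in for $\varphi(x^{(\mathrm{raw})})$; all names and data are hypothetical. It computes a critical point of the linear model by least squares, confirms that the gradient vanishes there, and compares its loss against randomly perturbed coefficient vectors in the same span.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 40, 3
Phi = rng.standard_normal((m, d))  # rows stand in for phi(x_i^(raw))
y = rng.standard_normal(m)

def loss(theta):
    """Mean squared loss of the linear model theta^T phi(x)."""
    return float(np.mean((Phi @ theta - y) ** 2))

# A critical point of the convex problem, obtained by least squares.
theta_star, *_ = np.linalg.lstsq(Phi, y, rcond=None)
grad = (2.0 / m) * Phi.T @ (Phi @ theta_star - y)

# Other coefficient vectors in the span of the same fixed basis.
trial_losses = [loss(theta_star + rng.standard_normal(d)) for _ in range(100)]
```

None of the perturbed coefficient vectors attains a smaller loss than `theta_star`, which is the sense in which a critical point is globally optimal within the span of the fixed basis.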

In this appendix, we provide complete proofs of the theoretical results.

A.1  Proof of Theorem 1

The proof of theorem 1 combines lemma 2 with assumptions 1 and 2 by taking advantage of the structure of the objective function $L$. Although lemma 2 is rather weak and assumptions 1 and 2 are mild (in the sense that they usually hold in practice), the right combination of these with the structure of $L$ can prove the desired statement.

Lemma 2.
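Assume that for any $i \in \{1,\dots,m\}$, the function $\ell_{y_i}: q \mapsto \ell(q, y_i)$ is differentiable. Then for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds: for any $k \in \{1,\dots,d_\theta\}$,
\[
\sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta) = 0.
\]
Proof of Lemma 2.

Let $\theta$ be an arbitrary critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$. Since $\ell_{y_i}: \mathbb{R}^{d_y} \to \mathbb{R}$ is assumed to be differentiable and $f_{x_i} \in \mathbb{R}^{d_y}$ is differentiable at the given $\theta$, the composition $(\ell_{y_i} \circ f_{x_i})$ is also differentiable, and $\partial_k (\ell_{y_i} \circ f_{x_i}) = \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta)$. In addition, $L$ is differentiable because a sum of differentiable functions is differentiable. Therefore, for any critical point $\theta$ of $L$, we have that $\nabla L(\theta) = 0$, and, hence, $\partial_k L(\theta) = \sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta) = 0$, for any $k \in \{1,\dots,d_\theta\}$, from linearity of the differentiation operation.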
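The chain-rule identity used in this proof holds at every differentiable point, not only at critical points, which makes it easy to check numerically. The sketch below assumes a small one-hidden-layer `tanh` network with scalar output, a squared loss (so that the derivative of $\ell_{y_i}$ at $q$ is $2(q - y_i)$), and central finite differences; every name and size is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x, d_h = 6, 2, 3
X = rng.standard_normal((m, d_x))
y = rng.standard_normal(m)
lam = np.full(m, 1.0 / m)                      # lambda_i
theta = rng.standard_normal(d_h * d_x + d_h)   # vec of (W1, w2)

def f(x, theta):
    W1 = theta[: d_h * d_x].reshape(d_h, d_x)
    w2 = theta[d_h * d_x:]
    return w2 @ np.tanh(W1 @ x)                # scalar output, d_y = 1

def L(theta):
    return sum(lam[i] * (f(X[i], theta) - y[i]) ** 2 for i in range(m))

def central_diff(fun, theta, h=1e-6):
    """Finite-difference approximation of the gradient of a scalar function."""
    grad = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = h
        grad[k] = (fun(theta + e) - fun(theta - e)) / (2 * h)
    return grad

# Right-hand side of the identity: sum_i lambda_i * dl(f_i) * d_k f_i.
chain = sum(
    lam[i] * 2.0 * (f(X[i], theta) - y[i]) * central_diff(lambda t: f(X[i], t), theta)
    for i in range(m)
)
# Left-hand side: d_k L computed directly.
direct = central_diff(L, theta)
```

At a critical point both vectors vanish, which is exactly the statement of lemma 2.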

Proof of Theorem 1.
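Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ be an arbitrary critical point of $L$. From assumption 2, there exists a function $g$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta)$ for all $i \in \{1,\dots,m\}$. Then, for any $\alpha \in \mathbb{R}^{d_\theta}$,
\[
\begin{aligned}
L_\theta(\alpha) &\ge \sum_{i=1}^{m} \lambda_i \ell_{y_i}(f_{x_i}(\theta)) + \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \big(f_\theta(x_i;\alpha) - f(x_i;\theta)\big) \\
&= \sum_{i=1}^{m} \lambda_i \ell_{y_i}(f_{x_i}(\theta)) + \sum_{k=1}^{d_\theta} \alpha_k \underbrace{\sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta)}_{=0 \text{ from lemma 2}} - \sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) f(x_i;\theta) \\
&= \sum_{i=1}^{m} \lambda_i \ell_{y_i}(f_{x_i}(\theta)) - \sum_{k=1}^{d_\theta} g(\theta)_k \underbrace{\sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta)}_{=0 \text{ from lemma 2}} \\
&= L(\theta),
\end{aligned}
\]
where the first line follows from assumption 1 (differentiable and convex $\ell_{y_i}$), the second line follows from linearity of summation, and the third line follows from assumption 2. Thus, on the one hand, we have that $L(\theta) \le \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$. On the other hand, since $f(x_i;\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta) \in \{f_\theta(x_i;\alpha) = \sum_{k=1}^{d_\theta} \alpha_k \partial_k f_{x_i}(\theta) : \alpha \in \mathbb{R}^{d_\theta}\}$, we have that $L(\theta) \ge \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$. Combining these yields the desired statement of $L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$.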

A.2  Proof of Theorem 2

The proof of theorem 2 uses lemma 3, the structure of the objective function L, and assumptions 1 and 2.

Lemma 3.
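Assume that for any $i \in \{1,\dots,m\}$, the function $\ell_{y_i}: q \mapsto \ell(q, y_i)$ is differentiable. Then for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ of $L$, the following holds: there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, any $v \in V[\theta,\epsilon]$, and any $k \in \{1,\dots,d_\theta\}$,
\[
\sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta + \epsilon v) = 0.
\]
Proof of Lemma 3.
Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ be an arbitrary local minimum of $L$. Since $\theta$ is a local minimum of $L$, by the definition of a local minimum, there exists $\epsilon_1 > 0$ such that $L(\theta) \le L(\theta')$ for all $\theta' \in B(\theta, \epsilon_1)$. Then for any $\epsilon \in [0, \epsilon_1/2)$ and any $v \in V[\theta,\epsilon]$, the vector $(\theta + \epsilon v)$ is also a local minimum because
\[
L(\theta + \epsilon v) = L(\theta) \le L(\theta')
\]
for all $\theta' \in B(\theta + \epsilon v, \epsilon_1/2) \subseteq B(\theta, \epsilon_1)$ (the inclusion follows from the triangle inequality), which satisfies the definition of a local minimum for $(\theta + \epsilon v)$.

Since $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$, there exists $\epsilon_2 > 0$ such that $f_{x_1},\dots,f_{x_m}$ are differentiable in $B(\theta, \epsilon_2)$. Since $\ell_{y_i}: \mathbb{R}^{d_y} \to \mathbb{R}$ is assumed to be differentiable and $f_{x_i} \in \mathbb{R}^{d_y}$ is differentiable in $B(\theta, \epsilon_2)$, the composition $(\ell_{y_i} \circ f_{x_i})$ is also differentiable, and $\partial_k (\ell_{y_i} \circ f_{x_i}) = \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta)$ in $B(\theta, \epsilon_2)$. In addition, $L$ is differentiable in $B(\theta, \epsilon_2)$ because a sum of differentiable functions is differentiable.

Therefore, with $\epsilon_0 = \min(\epsilon_1/2, \epsilon_2)$, we have that for any $\epsilon \in [0, \epsilon_0)$ and any $v \in V[\theta,\epsilon]$, the vector $(\theta + \epsilon v)$ is a differentiable local minimum, and hence the first-order necessary condition of differentiable local minima implies that
\[
\partial_k L(\theta + \epsilon v) = \sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta + \epsilon v) = 0,
\]
for any $k \in \{1,\dots,d_\theta\}$, where we used the fact that $f_{x_i}(\theta) = f_{x_i}(\theta + \epsilon v)$ for any $v \in V[\theta,\epsilon]$.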
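The proof relies on directions $v$ for which $f_{x_i}(\theta + \epsilon v) = f_{x_i}(\theta)$ on every training input. One simple source of such directions, sketched below under the assumption of a two-layer ReLU network (not required by the lemma), is a hidden unit that is inactive on all data: changing its outgoing weight leaves every output unchanged. The data, weights, and the name `outputs` are hypothetical.

```python
import numpy as np

X = np.array([[1.0, 2.0],
              [0.5, 1.5]])          # rows are the inputs x_1, x_2
W1 = np.array([[1.0, 1.0],
               [-1.0, -1.0]])       # second hidden unit is inactive on the data
w2 = np.array([1.0, 3.0])

def outputs(W1, w2):
    """Network outputs f(x_i; theta) for all rows x_i of X."""
    return np.maximum(X @ W1.T, 0.0) @ w2

eps = 0.1
v_w2 = np.array([0.0, 1.0])         # unit-norm direction on the dead unit's weight
base = outputs(W1, w2)
perturbed = outputs(W1, w2 + eps * v_w2)
```

Since `base` and `perturbed` coincide, the direction that is zero everywhere except for `v_w2` keeps the outputs on the data fixed, which is the property the proof uses for elements of $V[\theta,\epsilon]$.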
Proof of Theorem 2.
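Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ be an arbitrary local minimum of $L$. Since $(\mathbb{R}^{d_\theta} \setminus \tilde{\Omega}) \subseteq (\mathbb{R}^{d_\theta} \setminus \Omega)$, from assumption 2, there exists a function $g$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta)$ for all $i \in \{1,\dots,m\}$. Then from lemma 3, there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, any $S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ and any $\alpha \in \mathbb{R}^{d_\theta \times |S|}$,
\[
\begin{aligned}
\tilde{L}_\theta(\alpha,\epsilon,S) &\ge \sum_{i=1}^{m} \lambda_i \ell_{y_i}(f_{x_i}(\theta)) + \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \big(\tilde{f}_\theta(x_i;\alpha,\epsilon,S) - f(x_i;\theta)\big) \\
&= \sum_{i=1}^{m} \lambda_i \ell_{y_i}(f_{x_i}(\theta)) + \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j} \underbrace{\sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta + \epsilon S_j)}_{=0 \text{ from lemma 3}} - \sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) f(x_i;\theta) \\
&= \sum_{i=1}^{m} \lambda_i \ell_{y_i}(f_{x_i}(\theta)) - \sum_{k=1}^{d_\theta} g(\theta)_k \underbrace{\sum_{i=1}^{m} \lambda_i \partial \ell_{y_i}(f_{x_i}(\theta)) \partial_k f_{x_i}(\theta)}_{=0 \text{ from lemma 3}} \\
&= L(\theta),
\end{aligned}
\]
where the first line follows from assumption 1 (differentiable and convex $\ell_{y_i}$), the second line follows from linearity of summation and the definition of $\tilde{f}_\theta(x_i;\alpha,\epsilon,S)$, and the third line follows from assumption 2. Thus, on the one hand, there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, $L(\theta) \le \inf\{\tilde{L}_\theta(\alpha,\epsilon,S) : S \subseteq_{\mathrm{fin}} V[\theta,\epsilon], \alpha \in \mathbb{R}^{d_\theta \times |S|}\}$. On the other hand, since $f(x_i;\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta) \in \{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) : \alpha \in \mathbb{R}^{d_\theta}, S = 0\}$, we have that $L(\theta) \ge \inf\{\tilde{L}_\theta(\alpha,\epsilon,S) : S \subseteq_{\mathrm{fin}} V[\theta,\epsilon], \alpha \in \mathbb{R}^{d_\theta \times |S|}\}$. Combining these yields the desired statement.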

A.3  Proof of Lemma 1

As the proof shows, lemma 1 is a simple consequence of two facts: $\tilde{f}_\theta$ is linear in $\alpha$, and a union of two sets $S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ and $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ is still a finite subset of $V[\theta,\epsilon]$.

Proof of Lemma 1.
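Let $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ be fixed. Then,
\[
\begin{aligned}
&\{\tilde{f}_\theta(x;\alpha,\epsilon,S) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&= \{\tilde{f}_\theta(x;\alpha,\epsilon,S \cup S') : \alpha \in \mathbb{R}^{d_\theta \times |S \cup S'|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&= \{\tilde{f}_\theta(x;\alpha,\epsilon,S \setminus S') + \tilde{f}_\theta(x;\alpha',\epsilon,S') : \alpha \in \mathbb{R}^{d_\theta \times |S \setminus S'|}, \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&= \{\tilde{f}_\theta(x;\alpha,\epsilon,S \cup S') + \tilde{f}_\theta(x;\alpha',\epsilon,S') : \alpha \in \mathbb{R}^{d_\theta \times |S \cup S'|}, \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&= \{\tilde{f}_\theta(x;\alpha,\epsilon,S) + \tilde{f}_\theta(x;\alpha',\epsilon,S') : \alpha \in \mathbb{R}^{d_\theta \times |S|}, \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\},
\end{aligned}
\]
where the second line follows from the facts that a finite union of finite sets is finite and hence $S \cup S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ (i.e., the set in the first line is a superset of, $\supseteq$, the set in the second line), and that $\alpha \in \mathbb{R}^{d_\theta \times |S \cup S'|}$ can vanish the extra terms due to $S'$ in $\tilde{f}_\theta(x;\alpha,\epsilon,S \cup S')$ (i.e., the set in the first line is a subset of, $\subseteq$, the set in the second line). The last line follows from the same facts. The third line follows from the definition of $\tilde{f}_\theta(x;\alpha,\epsilon,S)$. The fourth line follows from the following equality due to the linearity of $\tilde{f}_\theta$ in $\alpha$:
\[
\begin{aligned}
\{\tilde{f}_\theta(x;\alpha',\epsilon,S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\}
&= \Big\{\sum_{k=1}^{d_\theta} \sum_{j=1}^{|S'|} (\alpha'_{k,j} + \bar{\alpha}'_{k,j}) \partial_k f_x(\theta + \epsilon S'_j) : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, \bar{\alpha}' \in \mathbb{R}^{d_\theta \times |S'|}\Big\} \\
&= \{\tilde{f}_\theta(x;\alpha',\epsilon,S') + \tilde{f}_\theta(x;\bar{\alpha}',\epsilon,S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, \bar{\alpha}' \in \mathbb{R}^{d_\theta \times |S'|}\}.
\end{aligned}
\]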
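The linearity argument can be checked with a few lines of NumPy. In the sketch below, random matrices stand in for the Jacobians of $f_x$ at the points $\theta + \epsilon S_j$ (an assumption made only for illustration), and the two finite sets are taken to be disjoint. The function name `basis_output` and all sizes are hypothetical.

```python
import numpy as np

def basis_output(jacobians, alpha):
    """Compute sum_j J_j @ alpha[:, j], where J_j (shape d_y x d_theta) stands in
    for the Jacobian of f_x at theta + eps * S_j."""
    return sum(J @ alpha[:, j] for j, J in enumerate(jacobians))

rng = np.random.default_rng(0)
d_y, d_theta = 2, 5
jac_S = [rng.standard_normal((d_y, d_theta)) for _ in range(3)]        # for S
jac_S_prime = [rng.standard_normal((d_y, d_theta)) for _ in range(2)]  # for S'
alpha = rng.standard_normal((d_theta, 3))
alpha_prime = rng.standard_normal((d_theta, 2))

# Sum of two models, one over S and one over S' ...
lhs = basis_output(jac_S, alpha) + basis_output(jac_S_prime, alpha_prime)
# ... equals a single model over the union with concatenated coefficients.
rhs = basis_output(jac_S + jac_S_prime, np.hstack([alpha, alpha_prime]))
```

The equality of `lhs` and `rhs` is the reason adding a second term $\tilde{f}_\theta(x;\alpha',\epsilon,S')$ does not enlarge the set of attainable outputs.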

A.4  Proof of Theorem 3

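As shown in the proof of theorem 3, thanks to theorem 2 and lemma 1, the remaining task to prove theorem 3 is to find a set $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ such that $\{\tilde{f}_\theta(x_i;\alpha',\epsilon,S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\} \supseteq \{\alpha_w x_i + \alpha_r z(x_i;u) : \alpha_w \in \mathbb{R}^{d_y \times d_x}, \alpha_r \in \mathbb{R}^{d_y \times d_z}\}$. Let $\mathrm{Null}(M)$ be the null space of a matrix $M$.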

Proof of Theorem 3.
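Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ be an arbitrary local minimum of $L$. Since $f$ is specified by equation 4.1, and hence $f(x;\theta) = (\partial_{\mathrm{vec}(W)} f(x;\theta)) \mathrm{vec}(W)$, assumption 2 is satisfied. Thus, from theorem 2, there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$,
\[
L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta,\epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \sum_{i=1}^{m} \lambda_i \ell(\tilde{f}_\theta(x_i;\alpha,\epsilon,S), y_i),
\]
where
\[
\tilde{f}_\theta(x_i;\alpha,\epsilon,S) = \sum_{j=1}^{|S|} \alpha_{w,j}\big(x_i + (R + \epsilon v_{r,j}) z_{i,j}\big) + (W + \epsilon v_{w,j}) \alpha_{r,j} z_{i,j} + \big(\partial_u f_{x_i}(\theta + \epsilon S_j)\big) \alpha_{u,j},
\]
with $\alpha = [\alpha_{\cdot 1},\dots,\alpha_{\cdot |S|}] \in \mathbb{R}^{d_\theta \times |S|}$, $\alpha_{\cdot j} = \mathrm{vec}([\alpha_{w,j}, \alpha_{r,j}, \alpha_{u,j}]) \in \mathbb{R}^{d_\theta}$, $S_j = \mathrm{vec}([v_{w,j}, v_{r,j}, v_{u,j}]) \in \mathbb{R}^{d_\theta}$, and $z_{i,j} = z(x_i, u + \epsilon v_{u,j})$ for all $j \in \{1,\dots,|S|\}$. Here, $\alpha_{w,j}, v_{w,j} \in \mathbb{R}^{d_y \times d_x}$, $\alpha_{r,j}, v_{r,j} \in \mathbb{R}^{d_x \times d_z}$, and $\alpha_{u,j}, v_{u,j} \in \mathbb{R}^{d_u}$. Let $\epsilon \in (0, \epsilon_0)$ be fixed.
Consider the case of $\mathrm{rank}(W) \ge d_y$. Define $\bar{S}$ such that $|\bar{S}| = 1$ and $\bar{S}_1 = 0 \in \mathbb{R}^{d_\theta}$, which is in $V[\theta,\epsilon]$. Then by setting $\alpha_{u,1} = 0$ and rewriting $\alpha_{r,1}$ such that $W \alpha_{r,1} = \alpha_{r,1}^{(1)} - \alpha_{w,1} R$ with an arbitrary matrix $\alpha_{r,1}^{(1)} \in \mathbb{R}^{d_y \times d_z}$ (this is possible since $\mathrm{rank}(W) \ge d_y$), we have that
\[
\{\tilde{f}_\theta(x_i;\alpha,\epsilon,\bar{S}) : \alpha \in \mathbb{R}^{d_\theta \times |\bar{S}|}\} \supseteq \{\alpha_{w,1} x_i + \alpha_{r,1}^{(1)} z_{i,1} : \alpha_{w,1} \in \mathbb{R}^{d_y \times d_x}, \alpha_{r,1}^{(1)} \in \mathbb{R}^{d_y \times d_z}\}.
\]
Consider the case of $\mathrm{rank}(W) < d_y$. Since $W \in \mathbb{R}^{d_y \times d_x}$ and $\mathrm{rank}(W) < d_y \le \min(d_x, d_z) \le d_x$, we have that $\mathrm{Null}(W) \neq \{0\}$, and there exists a vector $a \in \mathbb{R}^{d_x}$ such that $a \in \mathrm{Null}(W)$ and $\|a\|_2 = 1$. Let $a$ be such a vector. Define $\bar{S}'$ as follows: $|\bar{S}'| = d_y d_z + 1$, $\bar{S}'_1 = 0 \in \mathbb{R}^{d_\theta}$, and set $\bar{S}'_j$ for all $j \in \{2,\dots,d_y d_z + 1\}$ such that $v_{w,j} = 0$, $v_{u,j} = 0$, and $v_{r,j} = a b_j^\top$ where $b_j \in \mathbb{R}^{d_z}$ is an arbitrary column vector with $\|b_j\|_2 \le 1$. Then $\bar{S}'_j \in V[\theta,\epsilon]$ for all $j \in \{1,\dots,d_y d_z + 1\}$. By setting $\alpha_{r,j} = 0$ and $\alpha_{u,j} = 0$ for all $j \in \{1,\dots,d_y d_z + 1\}$ and by rewriting $\alpha_{w,1} = \alpha_{w,1}^{(1)} - \sum_{j=2}^{d_y d_z + 1} \alpha_{w,j}$ and $\alpha_{w,j} = \frac{1}{\epsilon} q_j a^\top$ for all $j \in \{2,\dots,d_y d_z + 1\}$ with an arbitrary vector $q_j \in \mathbb{R}^{d_y}$ (this is possible since $\epsilon > 0$ is fixed first and $\alpha_{w,j}$ is arbitrary), we have that
\[
\{\tilde{f}_\theta(x_i;\alpha,\epsilon,\bar{S}') : \alpha \in \mathbb{R}^{d_\theta \times |\bar{S}'|}\} \supseteq \Big\{\alpha_{w,1}^{(1)} x_i + \Big(\alpha_{w,1}^{(1)} R + \sum_{j=2}^{d_y d_z + 1} q_j b_j^\top\Big) z_{i,1} : q_j \in \mathbb{R}^{d_y}, b_j \in \mathbb{R}^{d_z}\Big\}.
\]
Since $q_j \in \mathbb{R}^{d_y}$ and $b_j \in \mathbb{R}^{d_z}$ are arbitrary, we can rewrite $\sum_{j=2}^{d_y d_z + 1} q_j b_j^\top = \alpha_{w,1}^{(2)} - \alpha_{w,1}^{(1)} R$ with an arbitrary matrix $\alpha_{w,1}^{(2)} \in \mathbb{R}^{d_y \times d_z}$, yielding
\[
\{\tilde{f}_\theta(x_i;\alpha,\epsilon,\bar{S}') : \alpha \in \mathbb{R}^{d_\theta \times |\bar{S}'|}\} \supseteq \{\alpha_{w,1}^{(1)} x_i + \alpha_{w,1}^{(2)} z_{i,1} : \alpha_{w,1}^{(1)} \in \mathbb{R}^{d_y \times d_x}, \alpha_{w,1}^{(2)} \in \mathbb{R}^{d_y \times d_z}\}.
\]
Summarizing the above, in both cases of $\mathrm{rank}(W)$, there exists a set $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ such that
\[
\begin{aligned}
&\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&= \{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) + \tilde{f}_\theta(x_i;\alpha',\epsilon,S') : \alpha \in \mathbb{R}^{d_\theta \times |S|}, \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&\supseteq \{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) + \alpha_w x_i + \alpha_r z(x_i,u) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, \alpha_w \in \mathbb{R}^{d_y \times d_x}, \alpha_r \in \mathbb{R}^{d_y \times d_z}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\},
\end{aligned}
\]
where the second line follows from lemma 1. On the other hand, since the set in the first line is a subset of the set in the last line, $\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} = \{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) + \alpha_w x_i + \alpha_r z(x_i,u) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, \alpha_w \in \mathbb{R}^{d_y \times d_x}, \alpha_r \in \mathbb{R}^{d_y \times d_z}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\}$. This immediately implies the desired statement from theorem 2.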
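The rank-deficient case of this proof can be traced numerically. The sketch below assumes that equation 4.1 has the form $f(x;\theta) = W(x + R z(x;u))$, as the expression for $\tilde{f}_\theta$ in the proof suggests, and uses a random vector as a stand-in for $z(x;u)$; all sizes and variable names are hypothetical. It checks that perturbing $R$ along $a b^\top$ with $a \in \mathrm{Null}(W)$ leaves the output unchanged, and that the choice $\alpha_w = \frac{1}{\epsilon} q a^\top$ turns the perturbation into the term $q b^\top z$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_y, d_x, d_z, eps = 2, 3, 3, 0.1

# A rank-one W, so that rank(W) < d_y and Null(W) is nontrivial.
W = np.outer(rng.standard_normal(d_y), rng.standard_normal(d_x))
R = rng.standard_normal((d_x, d_z))
x = rng.standard_normal(d_x)
z = rng.standard_normal(d_z)  # stand-in for z(x; u)

# Unit vector a in Null(W): last right singular vector of W.
_, _, Vt = np.linalg.svd(W)
a = Vt[-1]

b = rng.standard_normal(d_z)
b /= max(1.0, np.linalg.norm(b))  # enforce ||b||_2 <= 1
v_r = np.outer(a, b)              # perturbation direction for R

f_original = W @ (x + R @ z)
f_perturbed = W @ (x + (R + eps * v_r) @ z)

# With alpha_w = (1/eps) q a^T, the perturbation contributes q (b^T z).
q = rng.standard_normal(d_y)
alpha_w = np.outer(q, a) / eps
extra_term = alpha_w @ (eps * v_r) @ z
```

Summing such terms over several pairs $(q_j, b_j)$ is what lets the proof realize an arbitrary coefficient matrix in front of $z_{i,1}$.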

A.5  Proof of Theorem 4

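As shown in the proof of theorem 4, thanks to theorem 2 and lemma 1, the remaining task to prove theorem 4 is to find a set $S' \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ such that $\{\tilde{f}_\theta(x_i;\alpha',\epsilon,S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\} \supseteq \{\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u) : \alpha_h \in \mathbb{R}^{d_t}\}$. Let $M^{(l')} \cdots M^{(l+1)} M^{(l)} = I$ if $l > l'$.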

Proof of Theorem 4.
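Since $f$ is specified by equation 4.3 and, hence,
\[
f(x;\theta) = \big(\partial_{\mathrm{vec}(W^{(H+1)})} f(x;\theta)\big) \mathrm{vec}(W^{(H+1)}),
\]
assumption 2 is satisfied. Let $t \in \{0,\dots,H\}$ be fixed. Let $\theta \in (\Theta_{d_y,t} \setminus \tilde{\Omega})$ be an arbitrary local minimum of $L$. Then from theorem 2, there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, $L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta,\epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \sum_{i=1}^{m} \lambda_i \ell(\tilde{f}_\theta(x_i;\alpha,\epsilon,S), y_i)$, where $\tilde{f}_\theta(x_i;\alpha,\epsilon,S) = \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j} \partial_k f_{x_i}(\theta + \epsilon S_j)$.
Let $J = \{J^{(t+1)},\dots,J^{(H+1)}\} \in \mathcal{J}_{n,t}[\theta]$ be fixed. Without loss of generality, for simplicity of notation, we can permute the indices of the units of each layer such that $J^{(t+1)},\dots,J^{(H+1)} \supseteq \{1,\dots,d_y\}$. Let $\tilde{B}(\theta,\epsilon_1) = B(\theta,\epsilon_1) \cap \{\theta' \in \mathbb{R}^{d_\theta} : W^{(l+1)}_{i,j} = 0 \text{ for all } l \in \{t+1,\dots,H-1\} \text{ and all } (i,j) \in (\{1,\dots,d_{l+1}\} \setminus J^{(l+1)}) \times J^{(l)}\}$. Because of the definition of the set $J$, in $\tilde{B}(\theta,\epsilon_1)$ with $\epsilon_1 > 0$ being sufficiently small, we have that for any $l \in \{t,\dots,H\}$,
\[
f_{x_i}(\theta) = A^{(H+1)} \cdots A^{(l+2)} [A^{(l+1)} \ \ C^{(l+1)}] h^{(l)}(x_i;\theta) + \phi_{x_i}^{(l)}(\theta),
\]
where
\[
\phi_{x_i}^{(l)}(\theta) = \sum_{l'=l}^{H-1} A^{(H+1)} \cdots A^{(l'+3)} C^{(l'+2)} \tilde{h}^{(l'+1)}(x_i;\theta)
\]
and
\[
\tilde{h}^{(l)}(x_i;\theta) = \sigma^{(l)}\big(B^{(l)} \tilde{h}^{(l-1)}(x_i;\theta)\big),
\]
for all $l \ge t+2$ with $\tilde{h}^{(t+1)}(x_i;\theta) = \sigma^{(t+1)}([\xi^{(l)} \ \ B^{(l)}] h^{(t)}(x_i;\theta))$. Here,
\[
\begin{bmatrix} A^{(l)} & C^{(l)} \\ \xi^{(l)} & B^{(l)} \end{bmatrix} = W^{(l)}
\]
with $A^{(l)} \in \mathbb{R}^{d_y \times d_y}$, $C^{(l)} \in \mathbb{R}^{d_y \times (d_{l-1} - d_y)}$, $B^{(l)} \in \mathbb{R}^{(d_l - d_y) \times (d_{l-1} - d_y)}$, and $\xi^{(l)} \in \mathbb{R}^{(d_l - d_y) \times d_y}$. Let $\epsilon_1 > 0$ be such a number, and let $\epsilon \in (0, \min(\epsilon_0, \epsilon_1/2))$ be fixed so that both the equality from theorem 2 and the above form of $f_{x_i}$ hold in $\tilde{B}(\theta,\epsilon)$. Let $R^{(l)} = [A^{(l)} \ \ C^{(l)}]$.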
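We will now find sets $S^{(t)},\dots,S^{(H)} \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ such that
\[
\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) : \alpha \in \mathbb{R}^{d_\theta}\} \supseteq \{\alpha_h^{(l+1)} h^{(l)}(x_i;u) : \alpha_h^{(l+1)} \in \mathbb{R}^{d_y \times d_l}\}.
\]
  • Find $S^{(l)}$ with $l = H$: Since
    \[
    \big(\partial_{\mathrm{vec}(R^{(H+1)})} f_{x_i}(\theta)\big) \mathrm{vec}(\alpha_h^{(H+1)}) = \alpha_h^{(H+1)} h^{(H)}(x_i;\theta),
    \]
    $S^{(H)} = \{0\} \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ (where $0 \in \mathbb{R}^{d_\theta}$) is the desired set.
  • Find $S^{(l)}$ with $l \in \{t,\dots,H-1\}$: With $\alpha_r^{(l+1)} \in \mathbb{R}^{d_{l+1} \times d_l}$, we have that
    \[
    \big(\partial_{\mathrm{vec}(R^{(l+1)})} f_{x_i}(\theta)\big) \mathrm{vec}(\alpha_r^{(l+1)}) = A^{(H+1)} \cdots A^{(l+2)} \alpha_r^{(l+1)} h^{(l)}(x_i;\theta).
    \]
Therefore, if $\mathrm{rank}(A^{(H+1)} \cdots A^{(l+2)}) \ge d_y$, since $\{A^{(H+1)} \cdots A^{(l+2)} \alpha_r^{(l+1)} : \alpha_r^{(l+1)} \in \mathbb{R}^{d_{l+1} \times d_l}\} \supseteq \{\alpha_h^{(l+1)} \in \mathbb{R}^{d_y \times d_l}\}$, $S^{(l)} = \{0\} \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ (where $0 \in \mathbb{R}^{d_\theta}$) is the desired set. Let us consider the remaining case: let $\mathrm{rank}(A^{(H+1)} \cdots A^{(l+2)}) < d_y$ and let $l \in \{t,\dots,H-1\}$ be fixed. Let $l^* = \min\{l' \in \mathbb{Z}_+ : l+3 \le l' \le H+2 \wedge \mathrm{rank}(A^{(H+1)} \cdots A^{(l')}) \ge d_y\}$, where $A^{(H+1)} \cdots A^{(H+2)} \triangleq I_{d_y}$ and the minimum exists since the set is finite and contains at least $H+2$ (nonempty). Then $\mathrm{rank}(A^{(H+1)} \cdots A^{(l^*)}) \ge d_{H+1}$ and $\mathrm{rank}(A^{(H+1)} \cdots A^{(l')}) < d_{H+1}$ for all $l' \in \{l+2, l+3, \dots, l^*-1\}$. Thus, for all $l' \in \{l+1, l+2, \dots, l^*-2\}$, there exists a vector $a_{l'} \in \mathbb{R}^{d_y}$ such that
\[
a_{l'} \in \mathrm{Null}(A^{(H+1)} \cdots A^{(l'+1)}) \quad \text{and} \quad \|a_{l'}\|_2 = 1.
\]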
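Let $a_{l'}$ denote such a vector. Consider $S^{(l)}$ such that the weight matrices $W$ are perturbed with $\bar{\theta} + \epsilon S_j^{(l)}$ as
\[
\tilde{A}_j^{(l')} = A^{(l')} + \epsilon a_{l'} b_{l',j}^\top \quad \text{and} \quad \tilde{R}_j^{(l+1)} = R^{(l+1)} + \epsilon a_{l+1} b_{l+1,j}^\top
\]
for all $l' \in \{l+2, l+3, \dots, l^*-2\}$, where $\|b_{l',j}\|_2$ is bounded such that $\|S_j^{(l)}\|_2 \le 1$. That is, the entries of $S_j$ are all zeros except the entries corresponding to $A^{(l')}$ (for $l' \in \{l+2, l+3, \dots, l^*-2\}$) and $R^{(l+1)}$. Then $S_j^{(l)} \in V[\theta,\epsilon]$, since $A^{(H+1)} \cdots A^{(l'+1)} \tilde{A}_j^{(l')} = A^{(H+1)} \cdots A^{(l'+1)} A^{(l')}$ for all $l' \in \{l+2, l+3, \dots, l^*-2\}$ and $A^{(H+1)} \cdots A^{(l+2)} \tilde{R}_j^{(l+1)} = A^{(H+1)} \cdots A^{(l+2)} R^{(l+1)}$. Let $|S^{(l)}| = 2N$ with some integer $N$ to be chosen later. Define $S_{j+N}^{(l)}$ for $j = 1,\dots,N$ by setting $S_{j+N}^{(l)} = S_j^{(l)}$ except that $b_{l+1,j+N} = 0$ whereas $b_{l+1,j}$ is not necessarily zero. By setting $\alpha_{j+N} = -\alpha_j$ for all $j \in \{1,\dots,N\}$, with $\alpha_j \in \mathbb{R}^{d_{l^*} \times d_{l^*-1}}$,
\[
\begin{aligned}
\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) &= \sum_{j=1}^{N} A^{(H+1)} \cdots A^{(l^*)} (\alpha_j + \alpha_{j+N}) \tilde{A}^{(l^*-2)} \cdots \tilde{A}^{(l+2)} R^{(l+1)} h^{(l)}(x_i;\theta) \\
&\quad + \sum_{j=1}^{N} \big(\partial_{\mathrm{vec}(A^{(l^*-1)})} \phi_{x_i}^{(l)}(\theta + \epsilon S_j)\big) \mathrm{vec}(\alpha_j + \alpha_{j+N}) \\
&\quad + \epsilon \sum_{j=1}^{N} A^{(H+1)} \cdots A^{(l^*)} \alpha_j \tilde{A}^{(l^*-2)} \cdots \tilde{A}^{(l+2)} a_{l+1} b_{l+1,j}^\top h^{(l)}(x_i;\theta) \\
&= \epsilon \sum_{j=1}^{N} A^{(H+1)} \cdots A^{(l^*)} \alpha_j \tilde{A}^{(l^*-2)} \cdots \tilde{A}^{(l+2)} a_{l+1} b_{l+1,j}^\top h^{(l)}(x_i;\theta),
\end{aligned}
\]
where we used the fact that $\partial_{\mathrm{vec}(A^{(l^*-1)})} \phi_{x_i}^{(l)}(\theta + \epsilon S_j)$ does not contain $b_{l+1,j}$. Since $\mathrm{rank}(A^{(H+1)} \cdots A^{(l^*)}) \ge d_y$ and $\{A^{(H+1)} \cdots A^{(l^*)} \alpha_j : \alpha_j \in \mathbb{R}^{d_{l^*} \times d_{l^*-1}}\} = \{\frac{1}{\epsilon} \alpha_j' : \alpha_j' \in \mathbb{R}^{d_y \times d_{l^*-1}}\}$, we have that $\forall \alpha_j' \in \mathbb{R}^{d_y \times d_{l^*-1}}$, $\exists \alpha \in \mathbb{R}^{d_\theta \times |S|}$,
\[
\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) = \sum_{j=1}^{N} \alpha_j' \tilde{A}^{(l^*-2)} \cdots \tilde{A}^{(l+2)} a_{l+1} b_{l+1,j}^\top h^{(l)}(x_i;\theta).
\]
Let $N = 2N_1$. Define $S_{j+N_1}^{(l)}$ for $j = 1,\dots,N_1$ by setting $S_{j+N_1}^{(l)} = S_j^{(l)}$ except that $b_{l^*-2,j+N_1} = 0$, whereas $b_{l^*-2,j}$ is not necessarily zero. By setting $\alpha_{j+N_1}' = -\alpha_j'$ for all $j \in \{1,\dots,N_1\}$,
\[
\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) = \epsilon \sum_{j=1}^{N_1} \alpha_j' a_{l^*-2} b_{l^*-2,j}^\top \tilde{A}^{(l^*-3)} \cdots \tilde{A}^{(l+2)} a_{l+1} b_{l+1,j}^\top h^{(l)}(x_i;\theta).
\]
By induction,
\[
\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) = \epsilon^{t} \sum_{j=1}^{N_t} \alpha_j' a_{l^*-2} b_{l^*-2,j}^\top a_{l^*-3} b_{l^*-3,j}^\top \cdots a_{l+1} b_{l+1,j}^\top h^{(l)}(x_i;\theta),
\]
where $t = (l^*-2) - (l+2) + 1$ is finite. By setting $\alpha_j' = \frac{1}{\epsilon^{t}} q_j a_{l^*-2}^\top$ and $b_{l,j} = a_{l-1}$ for all $l = l^*-2,\dots,l$ ($\epsilon > 0$),
\[
\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) = \sum_{j=1}^{N_t} q_j b_{l+1,j}^\top h^{(l)}(x_i;\theta).
\]
Since $q_j b_{l+1,j}^\top$ are arbitrary, with sufficiently large $N_t$ ($N_t = d_y d_l$ suffices), we can set $\sum_{j=1}^{N_t} q_j b_{l+1,j}^\top = \alpha_h^{(l)}$ for any $\alpha_h^{(l)} \in \mathbb{R}^{d_\theta \times d_l}$, and hence
\[
\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) : \alpha \in \mathbb{R}^{d_\theta \times |S^{(l)}|}\} \supseteq \{\alpha_h^{(l)} h^{(l)}(x_i;\theta) : \alpha_h^{(l)} \in \mathbb{R}^{d_\theta \times d_l}\}.
\]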
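Thus far, we have found the sets $S^{(t)},\dots,S^{(H)} \subseteq_{\mathrm{fin}} V[\theta,\epsilon]$ such that $\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S^{(l)}) : \alpha \in \mathbb{R}^{d_\theta}\} \supseteq \{\alpha_h^{(l+1)} h^{(l)}(x_i;u) : \alpha_h^{(l+1)} \in \mathbb{R}^{d_y \times d_l}\}$. From lemma 1, we can combine these, yielding
\[
\begin{aligned}
&\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S) : \alpha \in \mathbb{R}^{d_\theta}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\} \\
&= \Big\{\sum_{l=t}^{H} \tilde{f}_\theta(x_i;\alpha^{(l)},\epsilon,S^{(l)}) + \tilde{f}_\theta(x_i;\alpha,\epsilon,S) : \alpha^{(t)},\dots,\alpha^{(H)} \in \mathbb{R}^{d_\theta}, \alpha \in \mathbb{R}^{d_\theta}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\Big\} \\
&\supseteq \Big\{\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i;u) + \tilde{f}_\theta(x_i;\alpha,\epsilon,S) : \alpha_h^{(l+1)} \in \mathbb{R}^{d_y \times d_l}, \alpha \in \mathbb{R}^{d_\theta \times |S|}, S \subseteq_{\mathrm{fin}} V[\theta,\epsilon]\Big\}.
\end{aligned}
\]
Since the set in the first line is a subset of the set in the last line, the equality holds in the above equation. This immediately implies the desired statement from theorem 2.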
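One step of this proof sets a sum of rank-one terms $\sum_j q_j b_{l+1,j}^\top$ equal to an arbitrary matrix and notes that $d_y d_l$ terms suffice. The sketch below shows one explicit choice, using standard basis vectors for $b_j$ (which have unit norm) and scaled standard basis vectors for $q_j$; the sizes and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_y, d_l = 2, 3
target = rng.standard_normal((d_y, d_l))  # arbitrary matrix to be realized

terms = []
for r in range(d_y):
    for c in range(d_l):
        q = target[r, c] * np.eye(d_y)[r]  # q_j in R^{d_y}
        b = np.eye(d_l)[c]                 # b_j in R^{d_l}, ||b_j||_2 = 1
        terms.append(np.outer(q, b))       # rank-one term q_j b_j^T

reconstructed = sum(terms)
```

Each term fills a single entry of the target, so the norm bound on $b_j$ is not restrictive: the scale is carried entirely by $q_j$.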
1

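For example, choose the first layer's weight matrix $W^{(1)}$ such that for all $i \in \{1,\dots,m\}$, $(W^{(1)} x_i)_i > 0$ and $(W^{(1)} x_i)_{i'} \le 0$ for all $i' \neq i$. This can be achieved by choosing the $i$th row of $W^{(1)}$ to be $[(x_i^{(\mathrm{raw})})^\top, \epsilon - 1]$ with $0 < \epsilon \le \delta$ for $i \le m$. Then choose the weight matrices for the $l$th layer for all $l \ge 2$ such that for all $j$, $W^{(l)}_{j,j} \neq 0$ and $W^{(l)}_{j',j} = 0$ for all $j' \neq j$. This guarantees $\mathrm{rank}([\varphi(x_i;u)]_{i=1}^{m}) \ge m$.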

We gratefully acknowledge support from NSF grants 1523767 and 1723381, AFOSR grant FA9550-17-1-0165, ONR grant N00014-18-1-2847, Honda Research, and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.

Allen-Zhu, Z., Li, Y., & Song, Z. (2018). A convergence theory for deep learning via over-parameterization. arXiv:1811.03962.
Arora, S., Cohen, N., Golowich, N., & Hu, W. (2018). A convergence analysis of gradient descent for deep linear neural networks. arXiv:1810.02281.
Arora, S., Cohen, N., & Hazan, E. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the International Conference on Machine Learning.
Bartlett, P. L., Helmbold, D. P., & Long, P. M. (2019). Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks. Neural Computation, 31(3), 477–502.
Bertsekas, D. P. (1999). Nonlinear programming. Belmont, MA: Athena Scientific.
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.
Brutzkus, A., & Globerson, A. (2017). Globally optimal gradient descent for a convnet with gaussian inputs. In Proceedings of the International Conference on Machine Learning (pp. 605–614).
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015). The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
Davis, D., Drusvyatskiy, D., Kakade, S., & Lee, J. D. (2019). Stochastic subgradient method converges on tame functions. In M. Overton (Ed.), Foundations of computational mathematics (pp. 1–36). Berlin: Springer.
Du, S. S., & Hu, W. (2019). Width provably matters in optimization for deep linear neural networks. arXiv:1901.08572.
Du, S. S., & Lee, J. D. (2018). On the power of over-parameterization in neural networks with quadratic activation. arXiv:1803.01206.
Du, S. S., Lee, J. D., Li, H., Wang, L., & Zhai, X. (2018). Gradient descent finds global minima of deep neural networks. arXiv:1811.03804.
Ge, R., Lee, J. D., & Ma, T. (2017). Learning one-hidden-layer neural networks with landscape design. arXiv:1711.00501.
Goel, S., & Klivans, A. (2017). Learning depth-three neural networks in polynomial time. arXiv:1709.06010.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.
Hardt, M., & Ma, T. (2017). Identity matters in deep learning. arXiv:1611.04231.
Hardt, R. M. (1975). Stratification of real analytic mappings and images. Invent. Math., 28, 193–208.
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision (pp. 630–645). Berlin: Springer.
Kakade, S. M., & Lee, J. D. (2018). Provably correct automatic subdifferentiation for qualified programs. In S. Bengio, H. Wallach, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 7125–7135). Red Hook, NY: Curran.
Kawaguchi, K. (2016). Deep learning without poor local minima. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 586–594). Red Hook, NY: Curran.
Kawaguchi, K., & Bengio, Y. (2019). Depth with nonlinearity creates no bad local minima in ResNets. Neural Networks, 118, 167–174.
Kawaguchi, K., Huang, J., & Kaelbling, L. P. (2019). Effect of depth and width on local minima in deep learning. Neural Computation, 31(6), 1462–1498.
Kawaguchi, K., Xie, B., & Song, L. (2018). Deep semi-random features for nonlinear function approximation. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence. Palo Alto, CA: AAAI Press.
Laurent, T., & Brecht, J. (2018). Deep linear networks with arbitrary loss: All local minima are global. In Proceedings of the International Conference on Machine Learning (pp. 2908–2913).
Lee, J. M. (2013). Introduction to smooth manifolds (2nd ed.). New York: Springer.
Li, Y., & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In I. Guyon, U. V. Luxburg, S. Bengio, N. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 597–607). Red Hook, NY: Curran.
Murty, K. G., & Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2), 117–129.
Nguyen, Q., & Hein, M. (2017). The loss surface of deep and wide neural networks. In Proceedings of the International Conference on Machine Learning (pp. 2603–2612).
Nguyen, Q., & Hein, M. (2018). Optimization landscape and expressivity of deep CNNs. In Proceedings of the International Conference on Machine Learning (pp. 3727–3736).
Rockafellar, R. T., & Wets, R. J.-B. (2009). Variational analysis. New York: Springer Science & Business Media.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120.
Shamir, O. (2018). Are ResNets provably better than linear predictors? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31. Red Hook, NY: Curran.
Soltanolkotabi, M. (2017). Learning ReLUs via gradient descent. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 2007–2017). Red Hook, NY: Curran.
Zhong, K., Song, Z., Jain, P., Bartlett, P. L., & Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the International Conference on Machine Learning (pp. 4140–4149).
Zou, D., Cao, Y., Zhou, D., & Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888.

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.