For nonconvex optimization in machine learning, this article proves that every local minimum achieves the globally optimal value of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this article improves or complements several state-of-the-art theoretical results on deep neural networks, deep residual networks, and overparameterized deep neural networks with a unified proof technique and novel geometric insights. A special case of our results also contributes to the theoretical foundation of representation learning.

Deep learning has achieved considerable empirical success in machine learning applications. However, insufficient work has been done on theoretically understanding deep learning, partly because of the nonconvexity and high-dimensionality of the objective functions used to train deep models. In general, theoretical understanding of nonconvex, high-dimensional optimization is challenging. Indeed, finding a global minimum of a general nonconvex function (Murty & Kabadi, 1987) and training certain types of neural networks (Blum & Rivest, 1992) are both NP-hard. Considering the NP-hardness for a general set of relevant problems, it is necessary to use additional assumptions to guarantee efficient global optimality in deep learning. Accordingly, recent theoretical studies have proven global optimality in deep learning by using additional strong assumptions such as linear activation, random activation, semirandom activation, gaussian inputs, single hidden-layer network, and significant overparameterization (Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016; Hardt & Ma, 2017; Nguyen & Hein, 2017, 2018; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Ge, Lee, & Ma, 2017; Goel & Klivans, 2017; Zhong, Song, Jain, Bartlett, & Dhillon, 2017; Li & Yuan, 2017; Kawaguchi, Xie, & Song, 2018; Du & Lee, 2018).

A study proving efficient global optimality in deep learning is thus closely related to the search for additional assumptions that might not hold in many practical applications. Toward widely applicable practical theory, we can also ask a different type of question: If standard global optimality requires additional assumptions, then what type of global optimality does not? In other words, instead of searching for additional assumptions to guarantee standard global optimality, we can also search for another type of global optimality under mild assumptions. Furthermore, instead of an arbitrary type of global optimality, it is preferable to develop a general theory of global optimality that not only works under mild assumptions but also produces the previous results with the previous additional assumptions, while predicting new results with future additional assumptions. This type of general theory may help not only to explain when and why an existing machine learning method works but also to predict the types of future methods that will or will not work.

As a step toward this goal, this article proves a series of theoretical results. The major contributions are summarized as follows:

• For nonconvex optimization in machine learning with mild assumptions, we prove that every differentiable local minimum achieves global optimality of the perturbable gradient basis model class. This result is directly applicable to many existing machine learning models, including practical deep learning models, and to new models to be proposed in the future, nonconvex and convex.

• The proposed general theory with a simple and unified proof technique is shown to be able to prove several concrete guarantees that improve or complement several state-of-the-art results.

• In general, the proposed theory allows us to see the effects of the design of models, methods, and assumptions on the optimization landscape through the lens of the global optima of the perturbable gradient basis model class.

Because a local minimum $\theta$ in $\mathbb{R}^{d_\theta}$ is required only to be locally optimal in $\mathbb{R}^{d_\theta}$, it is nontrivial that the local minimum is guaranteed to achieve the global optimality in $\mathbb{R}^{d_\theta}$ of the induced perturbable gradient basis model class. The reason we can prove more than the many worst-case results in general nonconvex optimization is that we explicitly take advantage of mild assumptions that commonly hold in machine learning and deep learning. In particular, we assume that the objective function to be optimized is structured as a sum of weighted errors, where each error is the output of a composition of a loss function and a function in a hypothesis class. Moreover, we make mild assumptions on the loss function and the hypothesis class, all of which typically hold in practice.

This section defines the problem setting and common notation.

### 2.1  Problem Description

Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be an input vector and a target vector, respectively. Define $((x_i, y_i))_{i=1}^m$ as a training data set of size $m$. Let $\theta \in \mathbb{R}^{d_\theta}$ be a parameter vector to be optimized. Let $f(x;\theta) \in \mathbb{R}^{d_y}$ be the output of a model or a hypothesis, and let $\ell: \mathbb{R}^{d_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be a loss function. Here, $d_\theta, d_y \in \mathbb{N}_{>0}$. We consider the following standard objective function $L$ to train a model $f(x;\theta)$:
$L(\theta) = \sum_{i=1}^m \lambda_i \ell(f(x_i;\theta), y_i).$
This article allows the weights $\lambda_1, \dots, \lambda_m > 0$ to be arbitrarily fixed. With $\lambda_1 = \cdots = \lambda_m = \frac{1}{m}$, all of our results hold true for the standard average loss $L$ as a special case.
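As a minimal sketch of the objective just defined, the weighted sum can be written directly in NumPy. The names `objective`, `linear_model`, and `squared_loss`, and the toy data, are hypothetical choices made only for illustration; they are not part of the article.

```python
import numpy as np

def objective(theta, xs, ys, f, loss, weights=None):
    """Weighted objective L(theta) = sum_i lambda_i * loss(f(x_i; theta), y_i)."""
    m = len(xs)
    if weights is None:
        weights = np.full(m, 1.0 / m)  # lambda_i = 1/m gives the average loss
    return sum(w * loss(f(x, theta), y) for w, x, y in zip(weights, xs, ys))

# Hypothetical toy model and loss, chosen only for illustration.
def linear_model(x, theta):
    return theta @ x  # scalar output, d_y = 1

def squared_loss(q, y):
    return float(np.sum((q - y) ** 2))

xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
ys = [1.0, 3.0]
theta = np.array([1.0, 1.0])
avg = objective(theta, xs, ys, linear_model, squared_loss)
```

Passing a different positive `weights` vector changes only the relative importance of the samples, which is the freedom the text allows for $\lambda_1, \dots, \lambda_m$.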

### 2.2  Notation

Because the focus of this article is the optimization of the vector $\theta$, the following notation is convenient: $\ell_y(q) = \ell(q, y)$ and $f_x(q) = f(x; q)$. Then we can write
$L(\theta) = \sum_{i=1}^m \lambda_i \ell_{y_i}(f_{x_i}(\theta)) = \sum_{i=1}^m \lambda_i (\ell_{y_i} \circ f_{x_i})(\theta).$

We use the following standard notation for differentiation. Given a scalar-valued or vector-valued function $\phi: \mathbb{R}^d \to \mathbb{R}^{d'}$ with components $\phi = (\phi_1, \dots, \phi_{d'})$ and variables $(v_1, \dots, v_{\bar{d}})$, let $\partial_v \phi: \mathbb{R}^d \to \mathbb{R}^{d' \times \bar{d}}$ be the matrix-valued function with each entry $(\partial_v \phi)_{i,j} = \frac{\partial \phi_i}{\partial v_j}$. Note that if $\phi$ is a scalar-valued function, $\partial_v \phi$ outputs a row vector. In addition, $\partial \phi = \partial_v \phi$ if $(v_1, \dots, v_d)$ are the input variables of $\phi$. Given a function $\phi: \mathbb{R}^d \to \mathbb{R}^{d'}$, let $\partial_k \phi: \mathbb{R}^d \to \mathbb{R}^{d'}$ be the partial derivative of $\phi$ with respect to its $k$th variable. For the syntax of any differentiation map $\partial$, given functions $\phi$ and $\zeta$, let $\partial \phi(\zeta(q)) = (\partial \phi)(\zeta(q))$ be the (partial) derivative $\partial \phi$ evaluated at an output $\zeta(q)$ of the function $\zeta$.

Given a matrix $M \in \mathbb{R}^{d \times d'}$, $\mathrm{vec}(M) = [M_{1,1}, \dots, M_{d,1}, M_{1,2}, \dots, M_{d,2}, \dots, M_{1,d'}, \dots, M_{d,d'}]^\top$ represents the standard vectorization of the matrix $M$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^n$, define $[M^{(j)}]_{j=1}^n = [M^{(1)}, M^{(2)}, \dots, M^{(n)}]$ to be the block matrix whose column blocks are $M^{(1)}, M^{(2)}, \dots, M^{(n)}$. Similarly, given a set $I = \{i_1, \dots, i_n\}$ with $(i_1, \dots, i_n)$ increasing, define $[M^{(j)}]_{j \in I} = [M^{(i_1)} \cdots M^{(i_n)}]$.
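The column-stacking convention of $\mathrm{vec}$ and the block-matrix notation can be sketched in NumPy as follows. The helper names `vec` and `block` are hypothetical and introduced here only for illustration.

```python
import numpy as np

def vec(M):
    """Column-stacking vectorization: (M11, ..., Md1, M12, ..., Md2, ...)."""
    return np.asarray(M).flatten(order="F")  # "F" selects column-major order

def block(mats):
    """[M(1), M(2), ..., M(n)] as a block matrix with one column block per entry."""
    return np.hstack(mats)

M = np.array([[1, 2, 3],
              [4, 5, 6]])
v = vec(M)                                  # columns stacked: 1, 4, 2, 5, 3, 6
B = block([M, np.array([[7], [8]])])        # a 2 x 4 block matrix
```

NumPy flattens in row-major order by default, so the explicit `order="F"` is what matches the definition in the text.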

This section shows our first main result that under mild assumptions, every differentiable local minimum achieves the global optimality of the perturbable gradient basis model class.

### 3.1  Assumptions

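Given a hypothesis class $f$ and a data set, let $\Omega$ be the set of nondifferentiable points $\theta$, defined as $\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\exists i \in \{1, \dots, m\})[f_{x_i} \text{ is not differentiable at } \theta]\}$. Similarly, define $\tilde{\Omega} = \{\theta \in \mathbb{R}^{d_\theta} : (\forall \epsilon > 0)(\exists \theta' \in B(\theta, \epsilon))(\exists i \in \{1, \dots, m\})[f_{x_i} \text{ is not differentiable at } \theta']\}$. Here, $B(\theta, \epsilon)$ is the open ball with center $\theta$ and radius $\epsilon$. In common nondifferentiable models $f$ such as neural networks with rectified linear units (ReLUs) and pooling operations, we have that $\Omega = \tilde{\Omega}$, and the Lebesgue measure of $\Omega$ ($= \tilde{\Omega}$) is zero.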

This section uses the following mild assumptions.

Assumption 1

(Use of Common Loss Criteria). For all $i \in \{1, \dots, m\}$, the function $\ell_{y_i}: q \mapsto \ell(q, y_i) \in \mathbb{R}_{\ge 0}$ is differentiable and convex (e.g., the squared loss, cross-entropy loss, or polynomial hinge loss satisfies this assumption).

Assumption 2

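(Use of Common Model Structures). There exists a function $g: \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_\theta}$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta)$ for all $i \in \{1, \dots, m\}$ and all $\theta \in \mathbb{R}^{d_\theta} \setminus \Omega$.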

Assumption 1 is satisfied by simply using common loss criteria that include the squared loss $\ell(q, y) = \|q - y\|_2^2$, cross-entropy loss $\ell(q, y) = -\sum_{k=1}^{d_y} y_k \log \frac{\exp(q_k)}{\sum_{k'} \exp(q_{k'})}$, and smoothed hinge loss $\ell(q, y) = (\max\{0, 1 - yq\})^p$ with $p \ge 2$ (the hinge loss with $d_y = 1$). Although the objective function $L: \theta \mapsto L(\theta)$ used to train a complex machine learning model (e.g., a neural network) is nonconvex in $\theta$, the loss criterion $\ell_{y_i}: q \mapsto \ell(q, y_i)$ is usually convex in $q$. In this article, the cross-entropy loss includes the softmax function, and thus $f_x(\theta)$ is the pre-softmax output of the last layer in related deep learning models.

Assumption 2 is satisfied by simply using a common architecture in deep learning or a classical machine learning model. For example, consider a deep neural network of the form $f_x(\theta) = W h(x; u) + b$, where $h(x; u)$ is an output of an arbitrary representation at the last hidden layer and $\theta = \mathrm{vec}([W, b, u])$. Then assumption 2 holds because $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta)$, where $g(\theta)_k = \theta_k$ for all $k$ corresponding to the parameters $(W, b)$ in the last layer and $g(\theta)_k = 0$ for all other $k$ corresponding to $u$. In general, because $g$ is a function of $\theta$, assumption 2 is easily satisfiable. Assumption 2 does not require the model $f(x; \theta)$ to be linear in $\theta$ or $x$.
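The identity in assumption 2 can be checked numerically for a last-layer-linear model. The sketch below assumes a hypothetical representation $h(x;u) = \tanh(Ux)$, small made-up dimensions, a row-major parameter layout (the ordering does not affect the check), and a finite-difference Jacobian; none of these choices come from the article.

```python
import numpy as np

d_x, n, d_y = 3, 4, 2  # hypothetical sizes

def unpack(theta):
    W = theta[: d_y * n].reshape(d_y, n)
    b = theta[d_y * n : d_y * n + d_y]
    U = theta[d_y * n + d_y :].reshape(n, d_x)
    return W, b, U

def f(x, theta):
    W, b, U = unpack(theta)
    return W @ np.tanh(U @ x) + b  # f_x(theta) = W h(x; u) + b

def jacobian(x, theta, eps=1e-6):
    """Central finite-difference Jacobian of theta -> f(x; theta), shape (d_y, d_theta)."""
    J = np.zeros((d_y, theta.size))
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        J[:, k] = (f(x, theta + e) - f(x, theta - e)) / (2 * eps)
    return J

def g(theta):
    out = theta.copy()
    out[d_y * n + d_y :] = 0.0  # g is zero on the coordinates of u
    return out

rng = np.random.default_rng(0)
theta = rng.normal(size=d_y * n + d_y + n * d_x)
x = rng.normal(size=d_x)

# Assumption 2: f_x(theta) = sum_k g(theta)_k * d_k f_x(theta).
residual = np.max(np.abs(jacobian(x, theta) @ g(theta) - f(x, theta)))
```

The residual is at the level of floating-point error because $f$ is linear in $(W, b)$, even though it is nonlinear in $u$ and in $x$.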

Note that we allow the nondifferentiable points to exist in $L(θ)$; for example, the use of ReLU is allowed. For a nonconvex and nondifferentiable function, we can still have first-order and second-order necessary conditions of local minima (e.g., Rockafellar & Wets, 2009, theorem 13.24). However, subdifferential calculus of a nonconvex function requires careful treatment at nondifferentiable points (see Rockafellar & Wets, 2009; Kakade & Lee, 2018; Davis, Drusvyatskiy, Kakade, & Lee, 2019), and deriving guarantees at nondifferentiable points is left to a future study.

### 3.2  Theory for Critical Points

Before presenting the first main result, this section provides a simpler result for critical points to illustrate the ideas behind the main result for local minima. We define the (theoretical) objective function $L_\theta$ of the gradient basis model class as
$L_\theta(\alpha) = \sum_{i=1}^m \lambda_i \ell(f_\theta(x_i; \alpha), y_i),$
where $\{f_\theta(x_i; \alpha) = \sum_{k=1}^{d_\theta} \alpha_k \partial_k f_{x_i}(\theta) : \alpha \in \mathbb{R}^{d_\theta}\}$ is the induced gradient basis model class. The following theorem shows that every differentiable critical point of our original objective $L$ (including every differentiable local minimum and saddle point) achieves the global minimum value of $L_\theta$. The complete proofs of all the theoretical results are presented in appendix A.
Theorem 1.
Let assumptions 1 and 2 hold. Then for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds:
$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha).$

An important aspect of theorem 1 is that $L_\theta$ on the right-hand side is convex, while $L$ on the left-hand side can be nonconvex or convex. Here, following convention, $\inf S$ is defined to be the infimum of a subset $S$ of $\bar{\mathbb{R}}$ (the set of affinely extended real numbers); that is, if $S$ has no lower bound, $\inf S = -\infty$, and $\inf \emptyset = \infty$. Note that theorem 1 vacuously holds true if there is no critical point for $L$. To guarantee the existence of a minimizer in a (nonempty) subspace $S \subseteq \mathbb{R}^{d_\theta}$ for $L$ (or $L_\theta$), a classical proof requires two conditions: lower semicontinuity of $L$ (or $L_\theta$) and the existence of a $q \in S$ for which the set $\{q' \in S : L(q') \le L(q)\}$ (or $\{q' \in S : L_\theta(q') \le L_\theta(q)\}$) is compact (see Bertsekas, 1999, for different conditions).
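The following sketch indicates one way to see why theorem 1 can hold. It is an informal reading based only on assumptions 1 and 2 and the chain rule, not a reproduction of the proof in appendix A. For a differentiable critical point $\theta$ and any $\alpha \in \mathbb{R}^{d_\theta}$:

```latex
\begin{align*}
L_\theta(\alpha)
  &= \sum_{i=1}^m \lambda_i\, \ell_{y_i}\big(f_\theta(x_i;\alpha)\big) \\
  &\ge \sum_{i=1}^m \lambda_i \Big[ \ell_{y_i}\big(f_{x_i}(\theta)\big)
       + \partial \ell_{y_i}\big(f_{x_i}(\theta)\big)
         \big(f_\theta(x_i;\alpha) - f_{x_i}(\theta)\big) \Big]
       && \text{(assumption 1: convex, differentiable)} \\
  &= L(\theta) + \sum_{k=1}^{d_\theta} \big(\alpha_k - g(\theta)_k\big)
       \sum_{i=1}^m \lambda_i\, \partial \ell_{y_i}\big(f_{x_i}(\theta)\big)\,
       \partial_k f_{x_i}(\theta)
       && \text{(assumption 2)} \\
  &= L(\theta) + \sum_{k=1}^{d_\theta} \big(\alpha_k - g(\theta)_k\big)\, \partial_k L(\theta)
       && \text{(chain rule)} \\
  &= L(\theta)
       && \text{(critical point: } \partial L(\theta) = 0 \text{)}.
\end{align*}
```

The lower bound is attained at $\alpha = g(\theta)$, because assumption 2 gives $f_\theta(x_i; g(\theta)) = f_{x_i}(\theta)$ and hence $L_\theta(g(\theta)) = L(\theta)$.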

#### 3.2.1  Geometric View

This section presents the geometric interpretation of theorem 1, which provides an intuitive yet formal description of the gradient basis model class. Figure 1 illustrates the gradient basis model class and theorem 1 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$. Here, we consider the following map from the parameter space to the concatenation of the outputs of the model at $x_1, x_2, \dots, x_m$:
$f_X: \theta \in \mathbb{R}^{d_\theta} \mapsto (f_{x_1}(\theta)^\top, f_{x_2}(\theta)^\top, \dots, f_{x_m}(\theta)^\top)^\top \in \mathbb{R}^{m d_y}.$
Figure 1:

Illustration of the gradient basis model class and theorem 1 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$ ($d_y = 1$). Theorem 1 translates the local condition of $\theta$ in the parameter space $\mathbb{R}^2$ (on the left) to the global optimality in the output space $\mathbb{R}^3$ (on the right). The subspace $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class. Theorem 1 states that $f_X(\theta)$ is globally optimal in the subspace as $f_X(\theta) \in \arg\min_{f \in T_{f_X(\theta)}} \mathrm{dist}(f, y)$ for any differentiable critical point $\theta$ of $L$.

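In the output space $\mathbb{R}^{m d_y}$ of $f_X$, the objective function $L$ induces the notion of distance from the target vector $y = (y_1^\top, \dots, y_m^\top)^\top \in \mathbb{R}^{m d_y}$ to a vector $f = (f_1^\top, \dots, f_m^\top)^\top \in \mathbb{R}^{m d_y}$ as
$\mathrm{dist}(f, y) = \sum_{i=1}^m \lambda_i \ell(f_i, y_i).$
We consider the affine subspace $T_{f_X(\theta)}$ of $\mathbb{R}^{m d_y}$ that passes through the point $f_X(\theta)$ and is spanned by the set of vectors $\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}$,
$T_{f_X(\theta)} = \mathrm{span}(\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}) + \{f_X(\theta)\},$
where the sum of the two sets represents the Minkowski sum of the sets.

Then the subspace $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class in general beyond the low-dimensional illustration. This is because by assumption 2, for any given $\theta$,
$T_{f_X(\theta)} = \left\{ \sum_{k=1}^{d_\theta} (g(\theta)_k + \alpha_k) \partial_k f_X(\theta) : \alpha \in \mathbb{R}^{d_\theta} \right\} = \left\{ \sum_{k=1}^{d_\theta} \alpha_k \partial_k f_X(\theta) : \alpha \in \mathbb{R}^{d_\theta} \right\},$ (3.1)
and $\sum_{k=1}^{d_\theta} \alpha_k \partial_k f_X(\theta) = (f_\theta(x_1; \alpha)^\top, \dots, f_\theta(x_m; \alpha)^\top)^\top$. In other words, $T_{f_X(\theta)} = \mathrm{span}(\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}) \ni (f_\theta(x_1; \alpha)^\top, \dots, f_\theta(x_m; \alpha)^\top)^\top$.

Therefore, in general, theorem 1 states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $T_{f_X(\theta)}$ as
$f_X(\theta) \in \arg\min_{f \in T_{f_X(\theta)}} \mathrm{dist}(f, y),$
for any differentiable critical point $\theta$ of $L$. Theorem 1 concludes this global optimality in the affine subspace of the output space based on the local condition in the parameter space (i.e., differentiable critical point). A key idea behind theorem 1 is to consider the map between the parameter space and the output space, which enables us to take advantage of assumptions 1 and 2.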
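The geometric statement can be illustrated on a toy problem. The sketch below assumes a hypothetical two-parameter model $f(x;\theta) = \theta_1 \theta_2 x$ with the squared loss, $\lambda_i = 1$, and made-up data; for this model, assumption 2 holds with $g(\theta) = (\theta_1, 0)$. With the squared loss, the infimum of $L_\theta$ is a least-squares problem over the columns of the Jacobian of $f_X$.

```python
import numpy as np

x = np.array([1.0, 2.0])   # hypothetical inputs, m = 2
y = np.array([1.0, 3.0])   # hypothetical targets

def f_X(theta):
    return theta[0] * theta[1] * x           # model outputs stacked over the data

def jac_f_X(theta):
    # Columns are the partial derivatives of f_X w.r.t. theta_1 and theta_2.
    return np.column_stack([theta[1] * x, theta[0] * x])

def L(theta):
    return float(np.sum((f_X(theta) - y) ** 2))

def inf_L_theta(theta):
    """min over alpha of sum_i (sum_k alpha_k d_k f_{x_i}(theta) - y_i)^2."""
    J = jac_f_X(theta)
    alpha, *_ = np.linalg.lstsq(J, y, rcond=None)
    return float(np.sum((J @ alpha - y) ** 2))

saddle = np.array([0.0, 0.0])    # critical point: the gradient of L vanishes
minimum = np.array([1.0, 1.4])   # critical point: theta_1 * theta_2 = (x.y)/(x.x)
generic = np.array([1.0, 1.0])   # not a critical point

for name, th in [("saddle", saddle), ("minimum", minimum), ("generic", generic)]:
    print(name, L(th), inf_L_theta(th))
```

At both critical points the two values coincide, as theorem 1 predicts, whereas at the generic point $L(\theta)$ is strictly larger. At the saddle, all partial derivatives vanish, so $T_{f_X(\theta)}$ collapses to a single point and the guarantee, while true, is weak.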

Figure 2 illustrates the gradient basis model class and theorem 1 with a union of manifolds and a tangent space. Under the constant rank condition, the image of the map $fX$ locally forms a single manifold. More precisely, if there exists a small neighborhood $U(θ)$ of $θ$ such that $fX$ is differentiable in $U(θ)$ and $rank(∂fX(θ'))=r$ is constant with some $r$ for all $θ'∈U(θ)$ (the constant rank condition), then the rank theorem states that the image $fX(U(θ))$ is a manifold of dimension $r$ (Lee, 2013, theorem 4.12). We note that the rank map $θ↦rank(∂fX(θ))$ is lower semicontinuous (i.e., if $rank(∂fX(θ))=r$, then there exists a neighborhood $U(θ)$ of $θ$ such that $rank(∂fX(θ'))≥r$ for any $θ'∈U(θ)$). Therefore, if $∂fX(θ)$ at $θ$ has the maximum rank in a small neighborhood of $θ$, then the constant rank condition is satisfied.

Figure 2:

Illustration of gradient basis model class and theorem 1 with manifold and tangent space. The space $R2∋θ$ on the left is the parameter space, and the space $R3∋fX(θ)$ on the right is the output space. The surface $M⊂R3$ on the right is the image of $fX$, which is a union of finitely many manifolds. The tangent space $TfX(θ)$ is the space of the outputs of the gradient basis model class. Theorem 1 states that if $θ$ is a differentiable critical point of $L$, then $fX(θ)$ is globally optimal in the tangent space $TfX(θ)$.

For points $\theta$ where the constant rank condition is violated, the image of the map $f_X$ is no longer a single manifold. However, locally it decomposes as a union of finitely many manifolds. More precisely, if there exists a small neighborhood $U(\theta)$ of $\theta$ such that $f_X$ is analytic over $U(\theta)$ (this condition is satisfied for commonly used activation functions such as ReLU, sigmoid, and hyperbolic tangent at any differentiable point), then the image $f_X(U(\theta))$ admits a locally finite partition $\mathcal{M}$ into connected submanifolds such that whenever $M \ne M' \in \mathcal{M}$ with $\bar{M} \cap M' \ne \emptyset$ ($\bar{M}$ is the closure of $M$), we have
$M' \subset \bar{M}, \quad \dim(M') < \dim(M).$
See Hardt (1975) for the proof.

If the point $\theta$ satisfies the constant rank condition, then $T_{f_X(\theta)}$ is exactly the tangent space of the manifold formed by the image $f_X(U(\theta))$. Otherwise, locally the image decomposes into a finite union $\mathcal{M}$ of submanifolds. In this case, $T_{f_X(\theta)}$ belongs to the span of the tangent spaces of those manifolds in $\mathcal{M}$ as
$T_{f_X(\theta)} \subset \{T_p M : p = f_X(\theta), M \in \mathcal{M}\},$
where $T_p M$ is the tangent space of the manifold $M$ at the point $p$.

#### 3.2.2  Examples

In this section, we show through examples that theorem 1 generalizes the previous results in special cases while providing new theoretical insights based on the gradient basis model class and its geometric view. In the following, whenever the form of $f$ is specified, we require only assumption 1 because assumption 2 is automatically satisfied by a given $f$.

For classical machine learning models, example 1 shows that the gradient basis model class is indeed equivalent to a given model class. From the geometric view, this means that for any $θ$, the tangent space $TfX(θ)$ is equal to the whole image $M$ of $fX$ (i.e., $TfX(θ)$ does not depend on $θ$). This reduces theorem 1 to the statement that every critical point of $L$ is a global minimum of $L$.

Example 1: Classical Machine Learning Models.

For any basis function model $f(x;θ)=∑k=1dθθkφ(x)k$ in classical machine learning with any fixed feature map $φ:X→Rdθ$, we have that $fθ(x;α)=f(x;α)$, and hence $infθ∈RdθL(θ)=infα∈RdθLθ(α)$, as well as $Ω=∅$. In other words, in this special case, theorem 1 states that every critical point of $L$ is a global minimum of $L$. Here, we do not assume that a critical point or a global minimum exists or can be attainable. Instead, the statement logically means that if a point is a critical point, then the point is a global minimum. This type of statement vacuously holds true if there is no critical point.

For overparameterized deep neural networks, example 2 shows that the induced gradient basis model class is highly expressive such that it must contain the globally optimal model of a given model class of deep neural networks. In this example, the tangent space $TfX(θ)$ is equal to the whole output space $Rmdy$. This reduces theorem 1 to the statement that every critical point of $L$ is a global minimum of $L$ for overparameterized deep neural networks.

Intuitively, in Figure 1 or 2, we can increase the number of parameters, and thereby the number of partial derivatives $\partial_k f_X(\theta)$, in order to increase the dimensionality of the tangent space $T_{f_X(\theta)}$ so that $T_{f_X(\theta)} = \mathbb{R}^{m d_y}$. This is indeed what happens in example 2, as well as in the previous studies of significantly overparameterized deep neural networks (Allen-Zhu, Li, & Song, 2018; Du, Lee, Li, Wang, & Zhai, 2018; Zou et al., 2018). In the previous studies, the significant overparameterization is required so that the tangent space $T_{f_X(\theta)}$ does not change from the initial tangent space $T_{f_X(\theta^{(0)})} = \mathbb{R}^{m d_y}$ during training. Thus, theorem 1, with its geometric view, provides novel algebraic and geometric insights into the results of the previous studies and into the reason why overparameterized deep neural networks are easy to optimize despite nonconvexity.

Example 2: Overparameterized Deep Neural Networks.

Theorem 1 implies that every critical point (and every local minimum) is a global minimum for sufficiently overparameterized deep neural networks. Let $n$ be the number of units in each layer of a fully connected feedforward deep neural network. Let us consider a significant overparameterization such that $n \ge m$. Let us write a fully connected feedforward deep neural network with the trainable parameters $(\theta, u)$ as $f(x; \theta) = W \varphi(x; u)$, where $W \in \mathbb{R}^{d_y \times n}$ is the weight matrix in the last layer, $\theta = \mathrm{vec}(W)$, $u$ contains the rest of the parameters, and $\varphi(x; u)$ is the output of the last hidden layer. Denote $x_i = [(x_i^{(\mathrm{raw})})^\top, 1]^\top$ to contain the constant term that accounts for the bias term in the first layer. Assume that the input samples are normalized as $\|x_i^{(\mathrm{raw})}\|_2 = 1$ for all $i \in \{1, \dots, m\}$ and distinct as $(x_i^{(\mathrm{raw})})^\top x_{i'}^{(\mathrm{raw})} < 1 - \delta$ with some $\delta > 0$ for all $i' \ne i$. Assume that the activation functions are ReLU activation functions. Then we can efficiently set $u$ to guarantee $\mathrm{rank}([\varphi(x_i; u)]_{i=1}^m) \ge m$ (e.g., by choosing $u$ so that each unit of the last layer is active only for one sample $x_i$).$^{1}$ Theorem 1 implies that every critical point $\theta$ with this $u$ is a global minimum of the whole set of trainable parameters $(\theta, u)$ because $\inf_\alpha L_\theta(\alpha) = \inf_{f_1, \dots, f_m} \sum_{i=1}^m \lambda_i \ell(f_i, y_i)$ (with assumption 1).
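The role of the rank condition can be sketched numerically. The construction below is a simplified stand-in, not the one used in the example: it assumes scalar inputs and one ReLU unit per sample with hypothetical thresholds, which yields a triangular feature matrix of full rank. Once the rank equals $m$, the last layer alone can match any targets, so the gradient basis model class covers the whole output space.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])      # m = 3 hypothetical scalar inputs
t = np.array([0.5, 1.5, 2.5])      # one ReLU threshold per hidden unit, n = m
Y = np.array([[2.0, -1.0, 4.0]])   # arbitrary targets, d_y = 1

# Phi[j, i] = relu(x_i - t_j): last-hidden-layer features, shape (n, m).
Phi = np.maximum(x[None, :] - t[:, None], 0.0)
rank = np.linalg.matrix_rank(Phi)

# With rank(Phi) = m, the last layer W can interpolate any targets: W Phi = Y.
W = np.linalg.solve(Phi.T, Y.T).T
fit_error = np.max(np.abs(W @ Phi - Y))
```

Here `Phi` is upper triangular with a positive diagonal, which is why its rank is $m$ regardless of the targets chosen.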

For deep neural networks, example 3 shows that standard networks have the global optimality guarantee with respect to the representation learned at the last layer, and skip connections further ensure the global optimality with respect to the representation learned at each hidden layer. This is because adding the skip connections incurs new partial derivatives ${∂kfX(θ)}k$ that span the tangent space containing the output of the best model with the corresponding learned representation.

Example 3: Deep Neural Networks and Learned Representations.
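Consider a feedforward deep neural network, and let $I^{(\mathrm{skip})} \subseteq \{1, \dots, H\}$ be the set of indices such that there exists a skip connection from the $(l-1)$th layer to the last layer for all $l \in I^{(\mathrm{skip})}$; that is, in this example,
$f(x; \theta) = \sum_{l \in I^{(\mathrm{skip})}} W^{(l+1)} h^{(l)}(x; u),$
where $\theta = \mathrm{vec}([[W^{(l+1)}]_{l \in I^{(\mathrm{skip})}}, u]) \in \mathbb{R}^{d_\theta}$ with $W^{(l+1)} \in \mathbb{R}^{d_y \times d_l}$ and $u \in \mathbb{R}^{d_u}$.

The conclusion in this example holds for standard deep neural networks without skip connections too, since we always have $H \in I^{(\mathrm{skip})}$ for standard deep neural networks. Let assumption 1 hold. Then theorem 1 implies that for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds:
$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta^{(\mathrm{skip})}(\alpha),$
where
$L_\theta^{(\mathrm{skip})}(\alpha) = \sum_{i=1}^m \lambda_i \ell_{y_i}\left( \sum_{l \in I^{(\mathrm{skip})}} \alpha^{(l+1)} h^{(l)}(x_i; u) + \sum_{k=1}^{d_u} (\alpha_u)_k \partial_{u_k} f_{x_i}(\theta) \right),$
with $\alpha = \mathrm{vec}([[\alpha^{(l+1)}]_{l \in I^{(\mathrm{skip})}}, \alpha_u]) \in \mathbb{R}^{d_\theta}$, $\alpha^{(l+1)} \in \mathbb{R}^{d_y \times d_l}$, and $\alpha_u \in \mathbb{R}^{d_u}$. This is because $f(x; \theta) = (\partial_{\mathrm{vec}(W^{(H+1)})} f(x; \theta)) \, \mathrm{vec}(W^{(H+1)})$, and thus assumption 2 is automatically satisfied. Here, $h^{(l)}(x_i; u)$ is the representation learned at the $l$th layer. Therefore, $\inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta^{(\mathrm{skip})}(\alpha)$ is at most the global minimum value of the basis models with the learned representations of the last layer and all hidden layers with the skip connections.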

### 3.3  Theory for Local Minima

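We are now ready to present our first main result. We define the (theoretical) objective function $\tilde{L}_\theta$ of the perturbable gradient basis model class as
$\tilde{L}_\theta(\alpha, \epsilon, S) = \sum_{i=1}^m \lambda_i \ell(\tilde{f}_\theta(x_i; \alpha, \epsilon, S), y_i),$
where $\tilde{f}_\theta(x_i; \alpha, \epsilon, S)$ is a perturbed gradient basis model defined as
$\tilde{f}_\theta(x_i; \alpha, \epsilon, S) = \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j} \partial_k f_{x_i}(\theta + \epsilon S_j).$
Here, $S$ is a finite set of vectors $S_1, \dots, S_{|S|} \in \mathbb{R}^{d_\theta}$ and $\alpha \in \mathbb{R}^{d_\theta \times |S|}$. Let $V[\theta, \epsilon]$ be the set of all vectors $v \in \mathbb{R}^{d_\theta}$ such that $\|v\|_2 \le 1$ and $f_{x_i}(\theta + \epsilon v) = f_{x_i}(\theta)$ for any $i \in \{1, \dots, m\}$. Let $S \subseteq_{\mathrm{fin}} S'$ denote a finite subset $S$ of a set $S'$. For an $S_j \in V[\theta, \epsilon]$, we have $f_{x_i}(\theta + \epsilon S_j) = f_{x_i}(\theta)$, but it is possible to have $\partial_k f_{x_i}(\theta + \epsilon S_j) \ne \partial_k f_{x_i}(\theta)$. This enables the greater expressivity of $\tilde{f}_\theta(x_i; \alpha, \epsilon, S)$ with an $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ when compared with $f_\theta(x_i; \alpha)$.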

The following theorem shows that every differentiable local minimum of $L$ achieves the global minimum value of $L˜θ$:

Theorem 2.
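Let assumptions 1 and 2 hold. Then, for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ of $L$, the following holds: there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$,
$L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde{L}_\theta(\alpha, \epsilon, S).$ (3.2)

To understand the relationship between theorems 1 and 2, let us consider the following general inequalities: for any $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ with $\epsilon \ge 0$ being sufficiently small,
$L(\theta) \ge \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) \ge \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde{L}_\theta(\alpha, \epsilon, S).$
Here, whereas theorem 1 states that the first inequality becomes equality as $L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$ at every differentiable critical point, theorem 2 states that both inequalities become equality as
$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde{L}_\theta(\alpha, \epsilon, S)$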
at every differentiable local minimum.
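The two general inequalities can be read off from the definitions. The following is a short sketch of that reasoning, offered as an informal reading rather than as the argument in appendix A; it uses only assumption 2 and the definition of $V[\theta, \epsilon]$.

```latex
\begin{align*}
\inf_{\alpha} L_\theta(\alpha)
  &\le L_\theta\big(g(\theta)\big) = L(\theta)
  && \text{since } f_\theta(x_i; g(\theta)) = f_{x_i}(\theta)
     \text{ by assumption 2}, \\
\inf_{S \subseteq_{\mathrm{fin}} V[\theta,\epsilon],\, \alpha}
  \tilde{L}_\theta(\alpha, \epsilon, S)
  &\le \inf_{\alpha} \tilde{L}_\theta(\alpha, \epsilon, \{0\})
   = \inf_{\alpha} L_\theta(\alpha)
  && \text{since } 0 \in V[\theta,\epsilon] \text{ and }
     \partial_k f_{x_i}(\theta + \epsilon \cdot 0) = \partial_k f_{x_i}(\theta).
\end{align*}
```

In words, the gradient basis model class contains the original model output through the choice $\alpha = g(\theta)$, and the perturbable class contains the gradient basis class through the choice $S = \{0\}$.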

From theorem 1 to theorem 2, the power of increasing the number of parameters (including overparameterization) is further improved. The right-hand side in equation 3.2 is the global minimum value over the variables $S⊆finV[θ,ε]$ and $α∈Rdθ×|S|$. Here, as $dθ$ increases, we may obtain the global minimum value of a larger search space $Rdθ×|S|$, which is similar to theorem 1. A concern in theorem 1 is that as $dθ$ increases, we may also significantly increase the redundancy among the elements in ${∂kfxi(θ)}k=1dθ$. Although this remains a valid concern, theorem 2 allows us to break the redundancy by the globally optimal $S⊆finV[θ,ε]$ to some degree.

For example, consider $f(x; \theta) = g(W^{(l)} h^{(l)}(x; u); u)$, which represents a deep neural network, with some $l$th-layer output $h^{(l)}(x; u) \in \mathbb{R}^{d_l}$, a trainable weight matrix $W^{(l)}$, and an arbitrary function $g$ to compute the rest of the forward pass. Here, $\theta = \mathrm{vec}([W^{(l)}, u])$. Let $h^{(l)}(X; u) = [h^{(l)}(x_i; u)]_{i=1}^m \in \mathbb{R}^{d_l \times m}$ and, similarly, $f(X; \theta) = g(W^{(l)} h^{(l)}(X; u); u) \in \mathbb{R}^{d_y \times m}$. Then, all vectors $v$ corresponding to any elements in the left null space of $h^{(l)}(X; u)$ are in $V[\theta, \epsilon]$ (i.e., $v_k = 0$ for all $k$ corresponding to $u$, and the rest of $v_k$ is set to perturb $W^{(l)}$ by an element in the left null space). Thus, as the redundancy increases such that the dimension of the left null space of $h^{(l)}(X; u)$ increases, we have a larger space of $V[\theta, \epsilon]$, for which a global minimum value is guaranteed at a local minimum.
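The left-null-space perturbation can be illustrated with a small numerical sketch. The matrices below are hypothetical, and the left null space is obtained from a singular value decomposition, which is one standard way to compute it. Perturbing the weight matrix along a left null vector leaves $W^{(l)} h^{(l)}(X; u)$ unchanged, so every later layer receives the same input and the outputs on the training data are unchanged.

```python
import numpy as np

# Hypothetical representation h^(l)(X; u): d_l = 3 units, m = 2 samples.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[1.0, 2.0, 3.0]])          # one output row of W^(l)

# Left null space of H: vectors n with n^T H = 0, taken from the SVD of H.
U, s, _ = np.linalg.svd(H)               # full_matrices=True by default
r = int(np.sum(s > 1e-12))               # numerical rank of H
N = U[:, r:]                             # shape (d_l, d_l - r)

eps = 0.1
W_perturbed = W + eps * N[:, 0][None, :] # perturb W along a unit left null vector
same_output = np.allclose(W_perturbed @ H, W @ H)
moved = not np.allclose(W_perturbed, W)
```

The perturbation direction has unit norm, so it meets the requirement $\|v\|_2 \le 1$ in the definition of $V[\theta, \epsilon]$; the parameters move while the pre-activation seen by the rest of the network does not.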

#### 3.3.1  Geometric View

This section presents the geometric interpretation of the perturbable gradient basis model class and theorem 2. Figure 3 illustrates the perturbable gradient basis model class and theorem 2 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$. Figure 4 illustrates them with a union of manifolds and tangent spaces at a singular point. Given an $\epsilon$ ($\le \epsilon_0$), define the affine subspace $\tilde{T}_{f_X(\theta)}$ of the output space $\mathbb{R}^{m d_y}$ by
$\tilde{T}_{f_X(\theta)} = \mathrm{span}(\{f \in \mathbb{R}^{m d_y} : (\exists v \in V[\theta, \epsilon])[f \in T_{f_X(\theta + \epsilon v)}]\}).$
Then the subspace $\tilde{T}_{f_X(\theta)}$ is the space of the outputs of the perturbable gradient basis model class in general beyond the low-dimensional illustration (this follows from equation 3.1 and the definition of the perturbable gradient basis model). Therefore, in general, theorem 2 states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $\tilde{T}_{f_X(\theta)}$ as
$f_X(\theta) \in \arg\min_{f \in \tilde{T}_{f_X(\theta)}} \mathrm{dist}(f, y)$
for any differentiable local minimum $\theta$ of $L$. Theorem 2 concludes the global optimality in the affine subspace of the output space based on the local condition in the parameter space (i.e., differentiable local minimum). Here, a (differentiable) local minimum $\theta$ is required to be optimal only in an arbitrarily small local neighborhood in the parameter space, and yet $f_X(\theta)$ is guaranteed to be globally optimal in the affine subspace of the output space. This illuminates the fact that nonconvex optimization in machine learning has a particular structure beyond general nonconvex optimization.
Figure 3:

Illustration of the perturbable gradient basis model class and theorem 2 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$ ($d_y = 1$). Theorem 2 translates the local condition of $\theta$ in the parameter space $\mathbb{R}^2$ (on the left) to the global optimality in the output space $\mathbb{R}^3$ (on the right). The subspace $\tilde{T}_{f_X(\theta)}$ is the space of the outputs of the perturbable gradient basis model class. Theorem 2 states that $f_X(\theta)$ is globally optimal in the subspace as $f_X(\theta) \in \arg\min_{f \in \tilde{T}_{f_X(\theta)}} \mathrm{dist}(f, y)$ for any differentiable local minimum $\theta$ of $L$. In this example, $\tilde{T}_{f_X(\theta)}$ is the whole output space $\mathbb{R}^3$, while $T_{f_X(\theta)}$ is not, illustrating the advantage of the perturbable gradient basis over the gradient basis. Since $\tilde{T}_{f_X(\theta)} = \mathbb{R}^3$, $f_X(\theta)$ must be globally optimal in the whole output space $\mathbb{R}^3$.

Figure 4:

Illustration of perturbable gradient basis model class and theorem 2 with manifold and tangent space at a singular point. The surface $M⊂R3$ is the image of $fX$, which is a union of finitely many manifolds. The line $TfX(θ)$ on the left panel is the space of the outputs of the gradient basis model class. The whole space $T˜fX(θ)=R3$ on the right panel is the space of the outputs of the perturbable gradient basis model class. The space $T˜fX(θ)$ is the span of the set of the vectors in the tangent spaces $TfX(θ),TfX(θ')$, and $TfX(θ'')$. Theorem 2 states that if $θ$ is a differentiable local minimum of $L$, then $fX(θ)$ is globally optimal in the space $T˜fX(θ)$.


The previous section showed that all local minima achieve the global optimality of the perturbable gradient basis model class with several direct consequences for special cases. In this section, as consequences of theorem 2, we complement or improve the state-of-the-art results in the literature.

### 4.1  Example: ResNets

As an example of theorem 2, we set $f$ to be the function of a certain type of residual networks (ResNets) that Shamir (2018) studied. That is, both Shamir (2018) and this section set $f$ as
$f(x; \theta) = W(x + R z(x; u)),$ (4.1)
where $\theta = \mathrm{vec}([W, R, u]) \in \mathbb{R}^{d_\theta}$ with $W \in \mathbb{R}^{d_y \times d_x}$, $R \in \mathbb{R}^{d_x \times d_z}$, and $u \in \mathbb{R}^{d_u}$. Here, $z(x; u) \in \mathbb{R}^{d_z}$ represents an output of deep residual functions with a parameter vector $u$. No assumption is imposed on the form of $z(x; u)$, and $z(x; u)$ can represent an output of possibly complicated deep residual functions that arise in ResNets. For example, the function $f$ can represent deep preactivation ResNets (He, Zhang, Ren, & Sun, 2016), which are widely used in practice. To simplify theoretical study, Shamir (2018) assumed that every entry of the matrix $R$ is unconstrained (e.g., instead of $R$ representing convolutions). We adopt this assumption based on the previous study (Shamir, 2018).

#### 4.1.1  Background

Along with an analysis of approximate critical points, Shamir (2018) proved the following main result, proposition 1, under the assumptions PA1, PA2, and PA3:

• PA1: The output dimension $d_y=1$.

• PA2: For any $y$, the function $\ell_y$ is convex and twice differentiable.

• PA3: On any bounded subset of the domain of $L$, the function $L_u(W,R)$, its gradient $\nabla L_u(W,R)$, and its Hessian $\nabla^2L_u(W,R)$ are all Lipschitz continuous in $(W,R)$, where $L_u(W,R)=L(\theta)$ with a fixed $u$.

Proposition 1
(Shamir, 2018). Let $f$ be specified by equation 4.1. Let assumptions PA1, PA2, and PA3 hold. Then for any local minimum $\theta$ of $L$,
$L(\theta)\le\inf_{W\in\mathbb{R}^{d_y\times d_x}}\sum_{i=1}^{m}\lambda_i\ell_{y_i}(Wx_i).$

Shamir (2018) remarked that it is an open problem whether proposition 1 and another main result in the article can be extended to networks with $d_y>1$ (multiple output units). Note that Shamir (2018) also provided proposition 1 with an expected loss and an analysis for a simpler decoupled model, $Wx+Vz(x;u)$. For the simpler decoupled model, our theorem 1 immediately concludes that given any $u$, every critical point with respect to $\theta_{-u}=(W,R)$ achieves a global minimum value with respect to $\theta_{-u}$ as $L(\theta_{-u})=\inf\{\sum_{i=1}^{m}\lambda_i\ell_{y_i}(Wx_i+Rz(x_i;u)):W\in\mathbb{R}^{d_y\times d_x},R\in\mathbb{R}^{d_x\times d_z}\}$ ($\le\inf_{W\in\mathbb{R}^{d_y\times d_x}}\sum_{i=1}^{m}\lambda_i\ell_{y_i}(Wx_i)$). This holds for every critical point $\theta$ since any critical point $\theta$ must be a critical point with respect to $\theta_{-u}$.
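The claim for the decoupled model can be checked numerically on a toy instance. The following is a minimal sketch under stated assumptions: scalar input and output, squared loss with uniform weights, and a hypothetical one-unit ReLU feature `z(x, u)`; none of these choices come from Shamir (2018) or from this article. With $u$ fixed, the problem is convex in the two coefficients, so solving the normal equations gives a critical point, whose loss can then be compared with the best linear predictor.

```python
# Toy check of the decoupled model w*x + v*z(x; u) with a fixed u.
# Data, the ReLU feature, and the squared loss are illustrative assumptions.

def z(x, u):
    """Hypothetical residual feature: a single ReLU unit with offset u."""
    return max(0.0, x - u)

def squared_loss(preds, ys):
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def fit_linear(xs, ys):
    """Best predictor w*x (least squares, scalar input and output)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def fit_decoupled(xs, ys, u):
    """Critical point of the convex problem in (w, v): solve the 2x2 normal equations."""
    zs = [z(x, u) for x in xs]
    sxx = sum(x * x for x in xs)
    sxz = sum(x * zi for x, zi in zip(xs, zs))
    szz = sum(zi * zi for zi in zs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    szy = sum(zi * y for zi, y in zip(zs, ys))
    det = sxx * szz - sxz * sxz  # assumed nonzero (features not collinear)
    w = (sxy * szz - sxz * szy) / det
    v = (sxx * szy - sxz * sxy) / det
    return w, v

xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [abs(x) for x in xs]  # target |x| = 2*relu(x) - x
u = 0.0

w_lin = fit_linear(xs, ys)
loss_linear = squared_loss([w_lin * x for x in xs], ys)

w, v = fit_decoupled(xs, ys, u)
loss_decoupled = squared_loss([w * x + v * z(x, u) for x in xs], ys)

print(loss_linear, loss_decoupled)
```

On this data the critical point of the decoupled model is no worse than the best linear predictor, which is the inequality stated in parentheses above.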

#### 4.1.2  Result

The following theorem shows that every differentiable local minimum achieves the global minimum value of $L˜θ(ResNet)$ (the right-hand side in equation 4.2), which is no worse than the upper bound in proposition 1 and is strictly better than the upper bound as long as $z(xi,u)$ or $f˜θ(xi;α,ε,S)$ is nonnegligible. Indeed, the global minimum value of $L˜θ(ResNet)$ (the right-hand side in equation 4.2) is no worse than the global minimum value of all models parameterized by the coefficients of the basis $x$ and $z(x;u)$, and further improvement is guaranteed through a nonnegligible $f˜θ(xi;α,ε,S)$.

Theorem 3.
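Let $f$ be specified by equation 4.1. Let assumption 1 hold. Assume that $d_y\le\min\{d_x,d_z\}$. Then for any local minimum $\theta\in(\mathbb{R}^{d_\theta}\setminus\tilde\Omega)$ of $L$, the following holds: there exists $\epsilon_0>0$ such that for any $\epsilon\in(0,\epsilon_0)$,
$L(\theta)=\inf_{S\subseteq_{\mathrm{fin}}V[\theta,\epsilon],\ \alpha\in\mathbb{R}^{d_\theta\times|S|},\ \alpha_w\in\mathbb{R}^{d_y\times d_x},\ \alpha_r\in\mathbb{R}^{d_y\times d_z}}\tilde L_\theta^{(\mathrm{ResNet})}(\alpha,\alpha_w,\alpha_r,\epsilon,S),$
(4.2)
where
$\tilde L_\theta^{(\mathrm{ResNet})}(\alpha,\alpha_w,\alpha_r,\epsilon,S)=\sum_{i=1}^{m}\lambda_i\ell_{y_i}\bigl(\alpha_wx_i+\alpha_rz(x_i;u)+\tilde f_\theta(x_i;\alpha,\epsilon,S)\bigr).$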

Theorem 3 also solves the first part of the open problem in the literature (Shamir, 2018) by discarding the assumption of $d_y=1$. From the geometric view, theorem 3 states that the span $\tilde T_{f_X}(\theta)$ of the set of the vectors in the tangent spaces $\{T_{f_X}(\theta+\epsilon v):v\in V[\theta,\epsilon]\}$ contains the output of the best basis model with the linear feature $x$ and the learned nonlinear feature $z(x_i;u)$. Similar to the examples in Figures 3 and 4, $\tilde T_{f_X}(\theta)\ne T_{f_X}(\theta)$, and the output of the best basis model with these features is contained in $\tilde T_{f_X}(\theta)$ but not in $T_{f_X}(\theta)$.

Unlike the recent study on ResNets (Kawaguchi & Bengio, 2019), our theorem 3 predicts the value of $L$ through the global minimum value of a large search space (i.e., the domain of $L˜θ(ResNet)$) and is proven as a consequence of our general theory (i.e., theorem 2) with a significantly different proof idea (see section 4.3) and with the novel geometric insight.

### 4.2  Example: Deep Nonlinear Networks with Locally Induced Partial Linear Structures

We specify $f$ to represent fully connected feedforward networks with arbitrary nonlinearity $σ$ and arbitrary depth $H$ as follows:
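$f(x;\theta)=W^{(H+1)}h^{(H)}(x;\theta),$
(4.3)
where
$h^{(l)}(x;\theta)=\sigma^{(l)}(W^{(l)}h^{(l-1)}(x;\theta)),$
for all $l\in\{1,\dots,H\}$ with $h^{(0)}(x;\theta)=x$. Here, $\theta=\mathrm{vec}([W^{(l)}]_{l=1}^{H+1})\in\mathbb{R}^{d_\theta}$ with $W^{(l)}\in\mathbb{R}^{d_l\times d_{l-1}}$, $d_{H+1}=d_y$, and $d_0=d_x$. In addition, $\sigma^{(l)}:\mathbb{R}^{d_l}\to\mathbb{R}^{d_l}$ represents an arbitrary nonlinear activation function per layer $l$ and is allowed to differ among different layers.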

#### 4.2.1  Background

Given the difficulty of theoretically understanding deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that theoretically studying simplified networks (i.e., deep linear networks) is worthwhile. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that deep linear networks may exhibit several properties analogous to those of deep nonlinear networks. Accordingly, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett, Helmbold, & Long, 2019; Du & Hu, 2019).

Along this line, Laurent and Brecht (2018) recently proved the following main result, proposition 2, under the assumptions PA4, PA5, and PA6:

• PA4: Every activation function is the identity, $\sigma^{(l)}(q)=q$ for every $l\in\{1,\dots,H\}$ (i.e., deep linear networks).

• PA5: For any $y$, the function $\ell_y$ is convex and differentiable.

• PA6: The thinnest layer is either the input layer or the output layer, $\min\{d_x,d_y\}\le\min\{d_1,\dots,d_H\}$.

Proposition 2

(Laurent & Brecht, 2018). Let $f$ be specified by equation 4.3. Let assumptions PA4, PA5, and PA6 hold. Then every local minimum $θ$ of $L$ is a global minimum.

#### 4.2.2  Result

Instead of studying deep linear networks, we now consider a partial linear structure locally induced by a parameter vector with nonlinear activation functions. This relaxes the linearity assumption and extends our understanding of deep linear networks to deep nonlinear networks.

Intuitively, $\mathcal{J}_{n,t}[\theta]$ is a set of partial linear structures locally induced by a vector $\theta$, which is now formally defined as follows. Given a $\theta\in\mathbb{R}^{d_\theta}$, let $\mathcal{J}_{n,t}[\theta]$ be the set of all sets $J=\{J^{(t+1)},\dots,J^{(H+1)}\}$ that satisfy the following conditions: there exists $\epsilon>0$ such that for all $l\in\{t+1,t+2,\dots,H+1\}$,

1. $J^{(l)}\subseteq\{1,\dots,d_l\}$ with $|J^{(l)}|\ge n$.

2. $h^{(l)}(x_i,\theta')_k=(W^{(l)}h^{(l-1)}(x_i,\theta'))_k$ for all $(k,\theta',i)\in J^{(l)}\times B(\theta,\epsilon)\times\{1,\dots,m\}$.

3. $W^{(l+1)}_{i,j}=0$ for all $(i,j)\in(\{1,\dots,d_{l+1}\}\setminus J^{(l+1)})\times J^{(l)}$ if $l\le H-1$.

Let $\Theta_{n,t}$ be the set of all parameter vectors $\theta$ such that $\mathcal{J}_{n,t}[\theta]$ is nonempty. As the definition reveals, a neural network with a $\theta\in\Theta_{d_y,t}$ can be a standard deep nonlinear neural network (with no linear units).
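Condition 2 can be made concrete for ReLU activations. The sketch below is only an illustration under assumptions of our own: it uses ReLU hidden layers, a made-up two-layer network, and 0-based unit indices, and it checks a sufficient condition for condition 2 only (conditions 1 and 3 are not checked). If a unit's pre-activation is strictly positive on every data point at $\theta$, then by continuity it stays positive for all $\theta'$ in a small enough ball, so the unit acts as a linear unit there.

```python
# Sufficient check for condition 2 with ReLU units; weights and data are illustrative.

def matvec(W, h):
    return [sum(w * v for w, v in zip(row, h)) for row in W]

def forward(weights, x):
    """Return hidden pre-activations (one list per hidden layer) and the network output."""
    h = list(x)
    pre_acts = []
    for W in weights[:-1]:
        a = matvec(W, h)
        pre_acts.append(a)
        h = [max(0.0, t) for t in a]  # ReLU
    return pre_acts, matvec(weights[-1], h)

def strictly_active_units(weights, xs):
    """For each hidden layer, indices of units with pre-activation > 0 on every data point."""
    active = None
    for x in xs:
        pre_acts, _ = forward(weights, x)
        flags = [[t > 0.0 for t in a] for a in pre_acts]
        if active is None:
            active = flags
        else:
            active = [[p and q for p, q in zip(r, s)] for r, s in zip(active, flags)]
    return [[k for k, ok in enumerate(layer) if ok] for layer in active]

W1 = [[1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]  # hidden layer, 3 units
W2 = [[1.0, 1.0, 1.0]]                         # linear output layer
xs = [[1.0, 2.0], [2.0, 1.0], [0.5, 0.5]]

linear_units = strictly_active_units([W1, W2], xs)
print(linear_units)
```

In this example only the first hidden unit is strictly active on all three inputs, so it is the only candidate for membership in $J^{(1)}$ under this sufficient check.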

Theorem 4.
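Let $f$ be specified by equation 4.3. Let assumption 1 hold. Then for any $t\in\{1,\dots,H\}$, at every local minimum $\theta\in(\Theta_{d_y,t}\setminus\tilde\Omega)$ of $L$, the following holds. There exists $\epsilon_0>0$ such that for any $\epsilon\in(0,\epsilon_0)$,
$L(\theta)=\inf_{S\subseteq_{\mathrm{fin}}V[\theta,\epsilon],\ \alpha\in\mathbb{R}^{d_\theta\times|S|},\ \alpha_h\in\mathbb{R}^{d_t}}\tilde L_{\theta,t}^{(\mathrm{ff})}(\alpha,\alpha_h,\epsilon,S),$
where
$\tilde L_{\theta,t}^{(\mathrm{ff})}(\alpha,\alpha_h,\epsilon,S)=\sum_{i=1}^{m}\lambda_i\ell_{y_i}\Bigl(\sum_{l=t}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u)+\tilde f_\theta(x_i;\alpha,\epsilon,S)\Bigr),$
with $\alpha_h=\mathrm{vec}([\alpha_h^{(l+1)}]_{l=t}^{H})\in\mathbb{R}^{d_t}$, $\alpha_h^{(l+1)}\in\mathbb{R}^{d_y\times d_l}$, and $d_t=d_y\sum_{l=t}^{H}d_l$.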

Theorem 4 is a special case of theorem 2. A special case of theorem 4 then yields one of the main results in the literature regarding deep linear neural networks, that is, every local minimum is a global minimum. Consider any deep linear network with $d_y\le\min\{d_1,\dots,d_H\}$. Then every local minimum $\theta$ is in $\Theta_{d_y,0}\setminus\tilde\Omega=\Theta_{d_y,0}$. Hence, theorem 4 reduces to the statement that for any local minimum, $L(\theta)=\inf_{\alpha_h\in\mathbb{R}^{d_t}}\sum_{i=1}^{m}\lambda_i\ell_{y_i}(\sum_{l=0}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u))=\inf_{\alpha_x\in\mathbb{R}^{d_y\times d_x}}\sum_{i=1}^{m}\lambda_i\ell_{y_i}(\alpha_xx_i)$, which is the global minimum value. Thus, every local minimum is a global minimum for any deep linear neural network with $d_y\le\min\{d_1,\dots,d_H\}$. Therefore, theorem 4 generalizes the recent result in the literature (proposition 2) for the common scenario of $d_y\le d_x$.
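The last equality in the reduction above can be spelled out in one step. This is a short worked derivation, assuming identity activations (PA4) and writing $h^{(l)}(x_i)$ for the hidden output at layer $l$:

$h^{(0)}(x_i)=x_i,\qquad h^{(l)}(x_i)=W^{(l)}\cdots W^{(1)}x_i\quad(l\ge1),$

$\sum_{l=0}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i)=\Bigl(\alpha_h^{(1)}+\sum_{l=1}^{H}\alpha_h^{(l+1)}W^{(l)}\cdots W^{(1)}\Bigr)x_i.$

The matrix in parentheses is always a $d_y\times d_x$ matrix, and it already ranges over every $d_y\times d_x$ matrix as $\alpha_h^{(1)}$ varies with $\alpha_h^{(l+1)}=0$ for $l\ge1$. Hence both infima are taken over the same set of linear predictors.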

Beyond deep linear networks, theorem 4 illustrates the benefit of both the locally induced structure and overparameterization for deep nonlinear networks. In the first term, $\sum_{l=t}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u)$, in $\tilde L_{\theta,t}^{(\mathrm{ff})}$, we benefit by decreasing $t$ (a more locally induced structure) and increasing the width of the $l$th layer for any $l\ge t$ (overparameterization). The second term, $\tilde f_\theta(x_i;\alpha,\epsilon,S)$ in $\tilde L_{\theta,t}^{(\mathrm{ff})}$, is the general term that is always present from theorem 2, where we benefit from increasing $d_\theta$ because $\alpha\in\mathbb{R}^{d_\theta\times|S|}$.

From the geometric view, theorem 4 captures the intuition that the span $\tilde T_{f_X}(\theta)$ of the set of the vectors in the tangent spaces $\{T_{f_X}(\theta+\epsilon v):v\in V[\theta,\epsilon]\}$ contains the best basis model with the linear feature for deep linear networks, as well as the best basis models with more nonlinear features as more local structures arise. Similar to the examples in Figures 3 and 4, $\tilde T_{f_X}(\theta)\ne T_{f_X}(\theta)$, and the outputs of the best basis models with those features are contained in $\tilde T_{f_X}(\theta)$ but not in $T_{f_X}(\theta)$.

A similar local structure was recently considered in Kawaguchi, Huang, and Kaelbling (2019). However, both the problem settings and the obtained results largely differ from those in Kawaguchi et al. (2019). Furthermore, theorem 4 is proven as a consequence of our general theory (theorem 2), and accordingly, the proofs largely differ from each other as well. Theorem 4 also differs from recent results on the gradient descent algorithm for deep linear networks (Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett et al., 2019; Du & Hu, 2019), since we analyze the loss surface instead of a specific algorithm and theorem 4 applies to deep nonlinear networks as well.

### 4.3  Proof Idea in Applications of Theorem 2

Theorems 3 and 4 are simple consequences of theorem 2, and their proofs illustrate how theorem 2 can be used in future studies with different additional assumptions. The high-level idea behind the proofs in the applications of theorem 2 is captured in the geometric view of theorem 2 (see Figures 3 and 4). That is, given a desired guarantee, we check whether the space $\tilde T_{f_X}(\theta)$ is expressive enough to contain the output of the desired model corresponding to the desired guarantee.

To simplify the use of theorem 2, we provide the following lemma. This lemma states that the expressivity of the model $f˜θ(x;α,ε,S)$ with respect to $(α,S)$ is the same as that of $f˜θ(x;α,ε,S)+f˜θ(x;α',ε,S')$ with respect to $(α,α',S,S')$. As shown in its proof, this is essentially because $f˜θ$ is linear in $α$, and a union of two sets $S⊆finV[θ,ε]$ and $S'⊆finV[θ,ε]$ remains a finite subset of $V[θ,ε]$.

Lemma 1.

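For any $\theta$, any $\epsilon\ge0$, any $S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$, and any $x$, it holds that
$\{\tilde f_\theta(x;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta\times|S|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}=\{\tilde f_\theta(x;\alpha,\epsilon,S)+\tilde f_\theta(x;\alpha',\epsilon,S'):\alpha\in\mathbb{R}^{d_\theta\times|S|},\alpha'\in\mathbb{R}^{d_\theta\times|S'|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}.$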

Based on theorem 2 and lemma 1, the proofs of theorems 3 and 4 reduce to a simple search for a set $S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ such that the expressivity of $\tilde f_\theta(x_i;\alpha',\epsilon,S')$ with respect to $\alpha'$ is no worse than the expressivity of $\alpha_wx_i+\alpha_rz(x_i;u)$ with respect to $(\alpha_w,\alpha_r)$ (see theorem 3) and that of $\sum_{l=t}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u)$ with respect to $\alpha_h^{(l+1)}$ (see theorem 4). In other words, $\{\tilde f_\theta(x_i;\alpha',\epsilon,S'):\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\}\supseteq\{\alpha_wx_i+\alpha_rz(x_i;u):\alpha_w\in\mathbb{R}^{d_y\times d_x},\alpha_r\in\mathbb{R}^{d_y\times d_z}\}$ (see theorem 3) and $\{\tilde f_\theta(x_i;\alpha',\epsilon,S'):\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\}\supseteq\{\sum_{l=t}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u):\alpha_h\in\mathbb{R}^{d_t}\}$ (see theorem 4). With this search for $S'$ alone, theorem 2 together with lemma 1 implies the desired statements for theorems 3 and 4 (see sections A.4 and A.5 in the appendix for further details). Thus, theorem 2 also enables simple proofs.

This study provided a general theory for nonconvex machine learning and demonstrated its power by proving new competitive theoretical results with it. In general, the proposed theory provides a mathematical tool to study the effects of hypothesis classes $f$, methods, and assumptions through the lens of the global optima of the perturbable gradient basis model class.

In convex machine learning with a model output $f(x;θ)=θ⊤x$ with a (nonlinear) feature output $x=φ(x(raw))$, achieving a critical point ensures the global optimality in the span of the fixed basis $x=φ(x(raw))$. In nonconvex machine learning, we have shown that achieving a critical point ensures the global optimality in the span of the gradient basis $∂fx(θ)$, which coincides with the fixed basis $x=φ(x(raw))$ in the case of the convex machine learning. Thus, whether convex or nonconvex, achieving a critical point ensures the global optimality in the span of some basis, which might be arbitrarily bad (or good) depending on the choice of the handcrafted basis $φ(x(raw))=∂fx(θ)$ (for the convex case) or the induced basis $∂fx(θ)$ (for the nonconvex case). Therefore, in terms of the loss values at critical points, nonconvex machine learning is theoretically as justified as the convex one, except in the case when a preference is given to $φ(x(raw))$ over $∂fx(θ)$ (both of which can be arbitrarily bad or good). The same statement holds for local minima and perturbable gradient basis.

In this appendix, we provide complete proofs of the theoretical results.

### A.1  Proof of Theorem 1

The proof of theorem 1 combines lemma 2 with assumptions 1 and 2 by taking advantage of the structure of the objective function $L$. Although lemma 2 is rather weak and assumptions 1 and 2 are mild (in the sense that they usually hold in practice), the right combination of these with the structure of $L$ proves the desired statement.

Lemma 2.
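Assume that for any $i\in\{1,\dots,m\}$, the function $\ell_{y_i}:q\mapsto\ell(q,y_i)$ is differentiable. Then for any critical point $\theta\in(\mathbb{R}^{d_\theta}\setminus\Omega)$ of $L$, the following holds: for any $k\in\{1,\dots,d_\theta\}$,
$\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_kf_{x_i}(\theta)=0.$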
Proof of Lemma 2.

Let $θ$ be an arbitrary critical point $θ∈(Rdθ∖Ω)$ of $L$. Since $ℓyi:Rdy→R$ is assumed to be differentiable and $fxi∈Rdy$ is differentiable at the given $θ$, the composition $(ℓyi∘fxi)$ is also differentiable, and $∂k(ℓyi∘fxi)=∂ℓyi(fxi(θ))∂kfxi(θ)$. In addition, $L$ is differentiable because a sum of differentiable functions is differentiable. Therefore, for any critical point $θ$ of $L$, we have that $∂L(θ)=0$, and, hence, $∂kL(θ)=∑i=1mλi∂ℓyi(fxi(θ))∂kfxi(θ)=0,$ for any $k∈{1,…,dθ},$ from linearity of differentiation operation.$□$

Proof of Theorem 1.
Let $θ∈(Rdθ∖Ω)$ be an arbitrary critical point of $L$. From assumption 2, there exists a function $g$ such that $fxi(θ)=∑k=1dθg(θ)k∂kfxi(θ)$ for all $i∈{1,…,m}$. Then, for any $α∈Rdθ$,
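$\begin{aligned}
L_\theta(\alpha)&\ge\sum_{i=1}^{m}\lambda_i\ell_{y_i}(f_{x_i}(\theta))+\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\bigl(f_\theta(x_i;\alpha)-f(x_i;\theta)\bigr)\\
&=\sum_{i=1}^{m}\lambda_i\ell_{y_i}(f_{x_i}(\theta))+\sum_{k=1}^{d_\theta}\alpha_k\underbrace{\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_kf_{x_i}(\theta)}_{=0\text{ from lemma 2}}-\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))f(x_i;\theta)\\
&=\sum_{i=1}^{m}\lambda_i\ell_{y_i}(f_{x_i}(\theta))-\sum_{k=1}^{d_\theta}g(\theta)_k\underbrace{\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_kf_{x_i}(\theta)}_{=0\text{ from lemma 2}}\\
&=L(\theta),
\end{aligned}$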
where the first line follows from assumption 1 (differentiable and convex $ℓyi$), the second line follows from linearity of summation, and the third line follows from assumption 2. Thus, on the one hand, we have that $L(θ)≤infα∈RdθLθ(α)$. On the other hand, since $f(xi;θ)=∑k=1dθg(θ)k∂kfxi(θ)∈{fθ(xi;α)=∑k=1dθαk∂kfxi(θ):α∈Rdθ}$, we have that $L(θ)≥infα∈RdθLθ(α)$. Combining these yields the desired statement of $L(θ)=infα∈RdθLθ(α)$.$□$
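The equality $L(\theta)=\inf_\alpha L_\theta(\alpha)$ at critical points can be checked by hand on a very small model. The sketch below assumes a toy scalar model $f(x;\theta)=\theta_1\theta_2x$, squared loss, and uniform weights $\lambda_i=1/m$; these choices are ours, not the article's. For this model the gradient basis model is $f_\theta(x;\alpha)=(\alpha_1\theta_2+\alpha_2\theta_1)x$, and assumption 2 holds with $g(\theta)=(\theta_1/2,\theta_2/2)$.

```python
# Toy model f(x; theta) = theta1 * theta2 * x with squared loss and uniform weights.
# Model, data, and loss are illustrative assumptions, not taken from the article.

xs = [1.0, -1.0, 2.0]
ys = [3.0, -1.0, 4.0]
m = len(xs)

def loss(theta):
    t1, t2 = theta
    return sum((t1 * t2 * x - y) ** 2 for x, y in zip(xs, ys)) / m

def grad_loss(theta):
    """Gradient of the loss with respect to (theta1, theta2)."""
    t1, t2 = theta
    r = [t1 * t2 * x - y for x, y in zip(xs, ys)]
    g1 = sum(2.0 * ri * t2 * x for ri, x in zip(r, xs)) / m
    g2 = sum(2.0 * ri * t1 * x for ri, x in zip(r, xs)) / m
    return g1, g2

def best_gradient_basis_loss(theta):
    """inf over alpha of the loss of f_theta(x; alpha) = (alpha1*theta2 + alpha2*theta1) * x."""
    t1, t2 = theta
    if t1 == 0.0 and t2 == 0.0:
        c = 0.0  # both partial derivatives vanish, so only the zero function is reachable
    else:
        # any slope c is reachable, so take the least-squares slope
        c = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
    return sum((c * x - y) ** 2 for x, y in zip(xs, ys)) / m

for theta in [(0.0, 0.0), (1.0, 2.0), (1.0, 1.0)]:
    print(theta, grad_loss(theta), loss(theta), best_gradient_basis_loss(theta))
```

The first two points are critical points and the two losses agree at each of them; the third point is not critical and its loss is strictly larger than the infimum. At the origin the gradient basis is identically zero, which is an instance of the basis being arbitrarily bad.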

### A.2  Proof of Theorem 2

The proof of theorem 2 uses lemma 3, the structure of the objective function $L$, and assumptions 1 and 2.

Lemma 3.
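Assume that for any $i\in\{1,\dots,m\}$, the function $\ell_{y_i}:q\mapsto\ell(q,y_i)$ is differentiable. Then for any local minimum $\theta\in(\mathbb{R}^{d_\theta}\setminus\tilde\Omega)$ of $L$, the following holds: there exists $\epsilon_0>0$ such that for any $\epsilon\in[0,\epsilon_0)$, any $v\in V[\theta,\epsilon]$, and any $k\in\{1,\dots,d_\theta\}$,
$\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_kf_{x_i}(\theta+\epsilon v)=0.$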
Proof of Lemma 3.
Let $\theta\in(\mathbb{R}^{d_\theta}\setminus\tilde\Omega)$ be an arbitrary local minimum of $L$. Since $\theta$ is a local minimum of $L$, by the definition of a local minimum, there exists $\epsilon_1>0$ such that $L(\theta)\le L(\theta')$ for all $\theta'\in B(\theta,\epsilon_1)$. Then for any $\epsilon\in[0,\epsilon_1/2)$ and any $v\in V[\theta,\epsilon]$, the vector $(\theta+\epsilon v)$ is also a local minimum because
$L(θ+εv)=L(θ)≤L(θ')$
for all $θ'∈B(θ+εv,ε1/2)⊆B(θ,ε1)$ (the inclusion follows from the triangle inequality), which satisfies the definition of a local minimum for $(θ+εv)$.

Since $θ∈(Rdθ∖Ω˜)$, there exists $ε2>0$ such that $fx1,…,fxm$ are differentiable in $B(θ,ε2)$. Since $ℓyi:Rdy→R$ is assumed to be differentiable and $fxi∈Rdy$ is differentiable in $B(θ,ε2)$, the composition $(ℓyi∘fxi)$ is also differentiable, and $∂k(ℓyi∘fxi)=∂ℓyi(fxi(θ))∂kfxi(θ)$ in $B(θ,ε2)$. In addition, $L$ is differentiable in $B(θ,ε2)$ because a sum of differentiable functions is differentiable.

Therefore, with $\epsilon_0=\min(\epsilon_1/2,\epsilon_2)$, we have that for any $\epsilon\in[0,\epsilon_0)$ and any $v\in V[\theta,\epsilon]$, the vector $(\theta+\epsilon v)$ is a differentiable local minimum, and hence the first-order necessary condition of differentiable local minima implies that
$∂kL(θ+εv)=∑i=1mλi∂ℓyi(fxi(θ))∂kfxi(θ+εv)=0,$
for any $k∈{1,…,dθ},$ where we used the fact that $fxi(θ)=fxi(θ+εv)$ for any $v∈V[θ,ε]$.$□$
Proof of Theorem 2.
Let $θ∈(Rdθ∖Ω˜)$ be an arbitrary local minimum of $L$. Since $(Rdθ∖Ω˜)⊆(Rdθ∖Ω)$, from assumption 2, there exists a function $g$ such that $fxi(θ)=∑k=1dθg(θ)k∂kfxi(θ)$ for all $i∈{1,…,m}$. Then from lemma 3, there exists $ε0>0$ such that for any $ε∈[0,ε0)$, any $S⊆finV[θ,ε]$ and any $α∈Rdθ×|S|$,
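$\begin{aligned}
\tilde L_\theta(\alpha,\epsilon,S)&\ge\sum_{i=1}^{m}\lambda_i\ell_{y_i}(f_{x_i}(\theta))+\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\bigl(\tilde f_\theta(x_i;\alpha,\epsilon,S)-f(x_i;\theta)\bigr)\\
&=\sum_{i=1}^{m}\lambda_i\ell_{y_i}(f_{x_i}(\theta))+\sum_{k=1}^{d_\theta}\sum_{j=1}^{|S|}\alpha_{k,j}\underbrace{\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_kf_{x_i}(\theta+\epsilon S_j)}_{=0\text{ from lemma 3}}-\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))f(x_i;\theta)\\
&=\sum_{i=1}^{m}\lambda_i\ell_{y_i}(f_{x_i}(\theta))-\sum_{k=1}^{d_\theta}g(\theta)_k\underbrace{\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_kf_{x_i}(\theta)}_{=0\text{ from lemma 3}}\\
&=L(\theta),
\end{aligned}$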
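where the first line follows from assumption 1 (differentiable and convex $\ell_{y_i}$), the second line follows from linearity of summation and the definition of $\tilde f_\theta(x_i;\alpha,\epsilon,S)$, and the third line follows from assumption 2. Thus, on the one hand, there exists $\epsilon_0>0$ such that for any $\epsilon\in[0,\epsilon_0)$, $L(\theta)\le\inf\{\tilde L_\theta(\alpha,\epsilon,S):S\subseteq_{\mathrm{fin}}V[\theta,\epsilon],\alpha\in\mathbb{R}^{d_\theta\times|S|}\}$. On the other hand, since $f(x_i;\theta)=\sum_{k=1}^{d_\theta}g(\theta)_k\partial_kf_{x_i}(\theta)\in\{\tilde f_\theta(x_i;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta},S=\{0\}\}$, we have that $L(\theta)\ge\inf\{\tilde L_\theta(\alpha,\epsilon,S):S\subseteq_{\mathrm{fin}}V[\theta,\epsilon],\alpha\in\mathbb{R}^{d_\theta\times|S|}\}$. Combining these yields the desired statement.$\square$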
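The role of the perturbations in $V[\theta,\epsilon]$ can be seen on a toy example. The sketch below assumes a scalar model $f(x;\theta)=\theta_1\theta_2x$ with squared loss and uniform weights, which are our own illustrative choices. The origin is a critical point of this model but not a local minimum. Perturbing along $v=(1,0)$ leaves the outputs unchanged, yet the gradient at the perturbed point is no longer zero, so the infimum over the perturbable gradient basis falls strictly below $L(\theta)$. This is consistent with theorem 2, whose equality is only claimed at local minima. The helper names (`in_V`, `best_perturbable_loss`) are hypothetical.

```python
# Toy model f(x; theta) = theta1 * theta2 * x (an illustrative assumption).
xs = [1.0, -1.0, 2.0]
ys = [3.0, -1.0, 4.0]
m = len(xs)

def f(theta, x):
    return theta[0] * theta[1] * x

def grad_f(theta, x):
    """Partial derivatives of f with respect to theta1 and theta2."""
    return (theta[1] * x, theta[0] * x)

def loss(theta):
    return sum((f(theta, x) - y) ** 2 for x, y in zip(xs, ys)) / m

def in_V(theta, eps, v, tol=1e-12):
    """Membership test for V[theta, eps]: unit ball, and outputs unchanged on the data."""
    small = sum(vi * vi for vi in v) <= 1.0
    shifted = [t + eps * vi for t, vi in zip(theta, v)]
    same = all(abs(f(shifted, x) - f(theta, x)) <= tol for x in xs)
    return small and same

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def best_perturbable_loss(theta, eps, S):
    """inf over alpha of the squared loss over span{d_k f(theta + eps*S_j)}, via Gram-Schmidt."""
    basis = []
    for v in S:
        shifted = [t + eps * vi for t, vi in zip(theta, v)]
        grads = [grad_f(shifted, x) for x in xs]
        basis.extend([[g[k] for g in grads] for k in range(2)])
    ortho = []
    for b in basis:
        for q in ortho:
            c = dot(b, q)
            b = [bi - c * qi for bi, qi in zip(b, q)]
        norm = dot(b, b) ** 0.5
        if norm > 1e-12:  # skip vectors already in the span
            ortho.append([bi / norm for bi in b])
    r = list(ys)
    for q in ortho:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return dot(r, r) / m

eps = 0.1
saddle = (0.0, 0.0)
best_at_saddle = best_perturbable_loss(saddle, eps, [(0.0, 0.0), (1.0, 0.0)])
global_min = (1.0, 2.0)
best_at_min = best_perturbable_loss(global_min, eps, [(0.0, 0.0)])
print(loss(saddle), best_at_saddle, loss(global_min), best_at_min)
```

At the global minimum the two values coincide, while at the origin the gap is strictly positive, which certifies through theorem 2 that the origin cannot be a local minimum for this data.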

### A.3  Proof of Lemma 1

As shown in the proof of lemma 1, lemma 1 is a simple consequence of the following facts: $f˜θ$ is linear in $α$ and a union of two sets $S⊆finV[θ,ε]$ and $S'⊆finV[θ,ε]$ is still a finite subset of $V[θ,ε]$.

Proof of Lemma 1.
Let $S'⊆finV[θ,ε]$ be fixed. Then,
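$\begin{aligned}
&\{\tilde f_\theta(x;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta\times|S|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&=\{\tilde f_\theta(x;\alpha,\epsilon,S\cup S'):\alpha\in\mathbb{R}^{d_\theta\times|S\cup S'|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&=\{\tilde f_\theta(x;\alpha,\epsilon,S\setminus S')+\tilde f_\theta(x;\alpha',\epsilon,S'):\alpha\in\mathbb{R}^{d_\theta\times|S\setminus S'|},\alpha'\in\mathbb{R}^{d_\theta\times|S'|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&=\{\tilde f_\theta(x;\alpha,\epsilon,S\cup S')+\tilde f_\theta(x;\alpha',\epsilon,S'):\alpha\in\mathbb{R}^{d_\theta\times|S\cup S'|},\alpha'\in\mathbb{R}^{d_\theta\times|S'|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&=\{\tilde f_\theta(x;\alpha,\epsilon,S)+\tilde f_\theta(x;\alpha',\epsilon,S'):\alpha\in\mathbb{R}^{d_\theta\times|S|},\alpha'\in\mathbb{R}^{d_\theta\times|S'|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\},
\end{aligned}$
where the second line follows from the facts that a finite union of finite sets is finite and hence $S\cup S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ (i.e., the set in the first line is a superset of, $\supseteq$, the set in the second line), and that $\alpha\in\mathbb{R}^{d_\theta\times|S\cup S'|}$ can zero out the extra terms due to $S'$ in $\tilde f_\theta(x;\alpha,\epsilon,S\cup S')$ (i.e., the set in the first line is a subset of, $\subseteq$, the set in the second line). The last line follows from the same facts. The third line follows from the definition of $\tilde f_\theta(x;\alpha,\epsilon,S)$. The fourth line follows from the following equality due to the linearity of $\tilde f_\theta$ in $\alpha$:
$\{\tilde f_\theta(x;\alpha',\epsilon,S'):\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\}=\Bigl\{\sum_{k=1}^{d_\theta}\sum_{j=1}^{|S'|}(\alpha'_{k,j}+\bar\alpha'_{k,j})\partial_kf_x(\theta+\epsilon S'_j):\alpha'\in\mathbb{R}^{d_\theta\times|S'|},\bar\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\Bigr\}=\{\tilde f_\theta(x;\alpha',\epsilon,S')+\tilde f_\theta(x;\bar\alpha',\epsilon,S'):\alpha'\in\mathbb{R}^{d_\theta\times|S'|},\bar\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\}.$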
$□$

### A.4  Proof of Theorem 3

As shown in the proof of theorem 3, thanks to theorem 2 and lemma 1, the remaining task to prove theorem 3 is to find a set $S'⊆finV[θ,ε]$ such that ${f˜θ(xi;α',ε,S'):α'∈Rdθ×|S'|}⊇{αwxi+αrz(xi;u):αw∈Rdy×dx,αr∈Rdy×dz}$. Let $Null(M)$ be the null space of a matrix $M$.

Proof of Theorem 3.
Let $θ∈(Rdθ∖Ω˜)$ be an arbitrary local minimum of $L$. Since $f$ is specified by equation 4.1, and hence $f(x;θ)=(∂vec(W)f(x;θ))vec(W)$, assumption 2 is satisfied. Thus, from theorem 2, there exists $ε0>0$ such that for any $ε∈[0,ε0)$,
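$L(\theta)=\inf_{S\subseteq_{\mathrm{fin}}V[\theta,\epsilon],\ \alpha\in\mathbb{R}^{d_\theta\times|S|}}\sum_{i=1}^{m}\lambda_i\ell(\tilde f_\theta(x_i;\alpha,\epsilon,S),y_i),$
where
$\tilde f_\theta(x_i;\alpha,\epsilon,S)=\sum_{j=1}^{|S|}\Bigl[\alpha_{w,j}\bigl(x_i+(R+\epsilon v_{r,j})z_{i,j}\bigr)+(W+\epsilon v_{w,j})\alpha_{r,j}z_{i,j}+\bigl(\partial_uf_{x_i}(\theta+\epsilon S_j)\bigr)\alpha_{u,j}\Bigr],$
with $\alpha=[\alpha_{\cdot1},\dots,\alpha_{\cdot|S|}]\in\mathbb{R}^{d_\theta\times|S|}$, $\alpha_{\cdot j}=\mathrm{vec}([\alpha_{w,j},\alpha_{r,j},\alpha_{u,j}])\in\mathbb{R}^{d_\theta}$, $S_j=\mathrm{vec}([v_{w,j},v_{r,j},v_{u,j}])\in\mathbb{R}^{d_\theta}$, and $z_{i,j}=z(x_i,u+\epsilon v_{u,j})$ for all $j\in\{1,\dots,|S|\}$. Here, $\alpha_{w,j},v_{w,j}\in\mathbb{R}^{d_y\times d_x}$, $\alpha_{r,j},v_{r,j}\in\mathbb{R}^{d_x\times d_z}$, and $\alpha_{u,j},v_{u,j}\in\mathbb{R}^{d_u}$. Let $\epsilon\in(0,\epsilon_0)$ be fixed.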
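Consider the case of $\mathrm{rank}(W)\ge d_y$. Define $\bar S$ such that $|\bar S|=1$ and $\bar S_1=0\in\mathbb{R}^{d_\theta}$, which is in $V[\theta,\epsilon]$. Then by setting $\alpha_{u,1}=0$ and rewriting $\alpha_{r,1}$ such that $W\alpha_{r,1}=\alpha_{r,1}^{(1)}-\alpha_{w,1}R$ with an arbitrary matrix $\alpha_{r,1}^{(1)}\in\mathbb{R}^{d_y\times d_z}$ (this is possible since $\mathrm{rank}(W)\ge d_y$), we have that
$\{\tilde f_\theta(x_i;\alpha,\epsilon,\bar S):\alpha\in\mathbb{R}^{d_\theta\times|\bar S|}\}\supseteq\{\alpha_{w,1}x_i+\alpha_{r,1}^{(1)}z_{i,1}:\alpha_{w,1}\in\mathbb{R}^{d_y\times d_x},\alpha_{r,1}^{(1)}\in\mathbb{R}^{d_y\times d_z}\}.$
Consider the case of $\mathrm{rank}(W)<d_y$. Since $W\in\mathbb{R}^{d_y\times d_x}$ and $\mathrm{rank}(W)<d_y\le d_x$, we have that $\mathrm{Null}(W)\ne\{0\}$, and there exists a vector $a\in\mathbb{R}^{d_x}$ such that $a\in\mathrm{Null}(W)$ and $\|a\|_2=1$. Let $a$ be such a vector. Define $\bar S'$ as follows: $|\bar S'|=d_yd_z+1$, $\bar S'_1=0\in\mathbb{R}^{d_\theta}$, and set $\bar S'_j$ for all $j\in\{2,\dots,d_yd_z+1\}$ such that $v_{w,j}=0$, $v_{u,j}=0$, and $v_{r,j}=ab_j^\top$, where $b_j\in\mathbb{R}^{d_z}$ is an arbitrary column vector with $\|b_j\|_2\le1$. Then $\bar S'_j\in V[\theta,\epsilon]$ for all $j\in\{1,\dots,d_yd_z+1\}$. By setting $\alpha_{r,j}=0$ and $\alpha_{u,j}=0$ for all $j\in\{1,\dots,d_yd_z+1\}$ and by rewriting $\alpha_{w,1}=\alpha_{w,1}^{(1)}-\sum_{j=2}^{d_yd_z+1}\alpha_{w,j}$ and $\alpha_{w,j}=\frac{1}{\epsilon}q_ja^\top$ for all $j\in\{2,\dots,d_yd_z+1\}$ with an arbitrary vector $q_j\in\mathbb{R}^{d_y}$ (this is possible since $\epsilon>0$ is fixed first and $\alpha_{w,j}$ is arbitrary), we have that
$\{\tilde f_\theta(x_i;\alpha,\epsilon,\bar S'):\alpha\in\mathbb{R}^{d_\theta\times|\bar S'|}\}\supseteq\Bigl\{\alpha_{w,1}^{(1)}x_i+\Bigl(\alpha_{w,1}^{(1)}R+\sum_{j=2}^{d_yd_z+1}q_jb_j^\top\Bigr)z_{i,1}:q_j\in\mathbb{R}^{d_y},b_j\in\mathbb{R}^{d_z}\Bigr\}.$
Since $q_j\in\mathbb{R}^{d_y}$ and $b_j\in\mathbb{R}^{d_z}$ are arbitrary, we can rewrite $\sum_{j=2}^{d_yd_z+1}q_jb_j^\top=\alpha_{w,1}^{(2)}-\alpha_{w,1}^{(1)}R$ with an arbitrary matrix $\alpha_{w,1}^{(2)}\in\mathbb{R}^{d_y\times d_z}$, yielding
$\{\tilde f_\theta(x_i;\alpha,\epsilon,\bar S'):\alpha\in\mathbb{R}^{d_\theta\times|\bar S'|}\}\supseteq\{\alpha_{w,1}^{(1)}x_i+\alpha_{w,1}^{(2)}z_{i,1}:\alpha_{w,1}^{(1)}\in\mathbb{R}^{d_y\times d_x},\alpha_{w,1}^{(2)}\in\mathbb{R}^{d_y\times d_z}\}.$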
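Summarizing the above, in both cases of $\mathrm{rank}(W)$, there exists a set $S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ such that
$\begin{aligned}
&\{\tilde f_\theta(x_i;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta\times|S|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&=\{\tilde f_\theta(x_i;\alpha,\epsilon,S)+\tilde f_\theta(x_i;\alpha',\epsilon,S'):\alpha\in\mathbb{R}^{d_\theta\times|S|},\alpha'\in\mathbb{R}^{d_\theta\times|S'|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&\supseteq\{\tilde f_\theta(x_i;\alpha,\epsilon,S)+\alpha_wx_i+\alpha_rz(x_i,u):\alpha\in\mathbb{R}^{d_\theta\times|S|},\alpha_w\in\mathbb{R}^{d_y\times d_x},\alpha_r\in\mathbb{R}^{d_y\times d_z},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\},
\end{aligned}$
where the second line follows from lemma 1. On the other hand, since the set in the first line is a subset of the set in the last line (take $\alpha_w=0$ and $\alpha_r=0$), $\{\tilde f_\theta(x_i;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta\times|S|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}=\{\tilde f_\theta(x_i;\alpha,\epsilon,S)+\alpha_wx_i+\alpha_rz(x_i,u):\alpha\in\mathbb{R}^{d_\theta\times|S|},\alpha_w\in\mathbb{R}^{d_y\times d_x},\alpha_r\in\mathbb{R}^{d_y\times d_z},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}$. This immediately implies the desired statement from theorem 2.$\square$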
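The rank-deficient case relies on two facts that are easy to verify numerically: perturbing $R$ by $\epsilon ab^\top$ with $a\in\mathrm{Null}(W)$ leaves the network output unchanged, and the choice $\alpha_w=\frac{1}{\epsilon}qa^\top$ turns that perturbation into the term $q\,b^\top z$. The sketch below uses made-up small matrices and a fixed vector standing in for $z(x;u)$; all concrete values are illustrative assumptions.

```python
# Illustrative check of the rank-deficient case; all matrices and vectors are made-up examples.

def matvec(M, v):
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in M]

def outer(a, b):
    return [[ai * bj for bj in b] for ai in a]

def add_scaled(M, c, N):
    """Return M + c * N for matrices stored as lists of rows."""
    return [[mij + c * nij for mij, nij in zip(rm, rn)] for rm, rn in zip(M, N)]

def resnet_features(R, x, z):
    """x + R z, the vector that W (or alpha_w) multiplies."""
    return [xi + ri for xi, ri in zip(x, matvec(R, z))]

W = [[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]    # d_y = 2, d_x = 3, rank(W) = 1 < d_y
R = [[1.0, 2.0], [0.0, 1.0], [3.0, 0.0]]  # d_x = 3, d_z = 2
x = [1.0, -1.0, 2.0]
z = [0.5, 2.0]            # stands in for z(x; u), held fixed since v_u = 0
a = [0.0, 1.0, 0.0]       # unit vector with W a = 0
b = [0.6, 0.8]            # any vector with norm at most 1
q = [1.0, -2.0]           # arbitrary vector in R^{d_y}
eps = 0.5

R_pert = add_scaled(R, eps, outer(a, b))  # R + eps * a b^T

# Fact 1: the output W (x + R z) is unchanged by the perturbation.
out_base = matvec(W, resnet_features(R, x, z))
out_pert = matvec(W, resnet_features(R_pert, x, z))

# Fact 2: with alpha_w = (1/eps) q a^T, the perturbation contributes q (b^T z).
alpha_w = [[qi * aj / eps for aj in a] for qi in q]
gain = [p - o for p, o in zip(matvec(alpha_w, resnet_features(R_pert, x, z)),
                              matvec(alpha_w, resnet_features(R, x, z)))]
target = [qi * sum(bj * zj for bj, zj in zip(b, z)) for qi in q]

print(out_base, out_pert, gain, target)
```

The difference of the two $\alpha_w$ terms mirrors the rewriting $\alpha_{w,1}=\alpha_{w,1}^{(1)}-\sum_j\alpha_{w,j}$ in the proof, which cancels everything except the $q_jb_j^\top z_{i,1}$ contribution.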

### A.5  Proof of Theorem 4

As shown in the proof of theorem 4, thanks to theorem 2 and lemma 1, the remaining task to prove theorem 4 is to find a set $S'⊆finV[θ,ε]$ such that ${f˜θ(xi;α',ε,S'):α'∈Rdθ×|S'|}⊇{∑l=tHαh(l+1)h(l)(xi;u):αh∈Rdt}$. Let $M(l')⋯M(l+1)M(l)=I$ if $l>l'$.

Proof of Theorem 4.
Since $f$ is specified by equation 4.3 and, hence,
$f(x;θ)=(∂vec(W(H+1))f(x;θ))vec(W(H+1)),$
assumption 2 is satisfied. Let $t∈{0,…,H}$ be fixed. Let $θ∈(Θdy,t∖Ω˜)$ be an arbitrary local minimum of $L$. Then from theorem 2, there exists $ε0>0$ such that for any $ε∈[0,ε0)$, $L(θ)=infS⊆finV[θ,ε],α∈Rdθ×|S|∑i=1mλiℓ(f˜θ(xi;α,ε,S),yi),$ where $f˜θ(xi;α,ε,S)=∑k=1dθ∑j=1|S|αk,j∂kfxi(θ+εSj).$
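Let $J=\{J^{(t+1)},\dots,J^{(H+1)}\}\in\mathcal{J}_{n,t}[\theta]$ be fixed. Without loss of generality, for simplicity of notation, we can permute the indices of the units of each layer such that $J^{(t+1)},\dots,J^{(H+1)}\supseteq\{1,\dots,d_y\}$. Let $\tilde B(\theta,\epsilon_1)=B(\theta,\epsilon_1)\cap\{\theta'\in\mathbb{R}^{d_\theta}:W^{(l+1)}_{i,j}=0\text{ for all }l\in\{t+1,\dots,H-1\}\text{ and all }(i,j)\in(\{1,\dots,d_{l+1}\}\setminus J^{(l+1)})\times J^{(l)}\}$. Because of the definition of the set $J$, in $\tilde B(\theta,\epsilon_1)$ with $\epsilon_1>0$ being sufficiently small, we have that for any $l\in\{t,\dots,H\}$,
$f_{x_i}(\theta)=A^{(H+1)}\cdots A^{(l+2)}\begin{bmatrix}A^{(l+1)}&C^{(l+1)}\end{bmatrix}h^{(l)}(x_i;\theta)+\phi^{(l)}_{x_i}(\theta),$
where
$\phi^{(l)}_{x_i}(\theta)=\sum_{l'=l}^{H-1}A^{(H+1)}\cdots A^{(l'+3)}C^{(l'+2)}\tilde h^{(l'+1)}(x_i;\theta)$
and
$\tilde h^{(l)}(x_i;\theta)=\sigma^{(l)}(B^{(l)}\tilde h^{(l-1)}(x_i;\theta)),$
for all $l\ge t+2$ with $\tilde h^{(t+1)}(x_i;\theta)=\sigma^{(t+1)}(\begin{bmatrix}\xi^{(l)}&B^{(l)}\end{bmatrix}h^{(t)}(x_i;\theta))$. Here,
$\begin{bmatrix}A^{(l)}&C^{(l)}\\\xi^{(l)}&B^{(l)}\end{bmatrix}=W^{(l)}$
with $A^{(l)}\in\mathbb{R}^{d_y\times d_y}$, $C^{(l)}\in\mathbb{R}^{d_y\times(d_{l-1}-d_y)}$, $B^{(l)}\in\mathbb{R}^{(d_l-d_y)\times(d_{l-1}-d_y)}$, and $\xi^{(l)}\in\mathbb{R}^{(d_l-d_y)\times d_y}$. Let $\epsilon_1>0$ be such a number, and let $\epsilon\in(0,\min(\epsilon_0,\epsilon_1/2))$ be fixed so that both the equality from theorem 2 and the above form of $f_{x_i}$ hold in $\tilde B(\theta,\epsilon)$. Let $R^{(l)}=\begin{bmatrix}A^{(l)}&C^{(l)}\end{bmatrix}$.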
We will now find sets $S(t),…,S(H)⊆finV[θ,ε]$ such that
${f˜θ(xi;α,ε,S(l)):α∈Rdθ}⊇{αh(l+1)h(l)(xi;u):αh(l+1)∈Rdy×dl}.$
• Find $S(l)$ with $l=H$: Since
$(∂vec(R(H+1))fxi(θ))vec(αh(H+1))=αh(H+1)h(H)(xi;θ),$
$S(H)={0}⊆finV[θ,ε]$ (where $0∈Rdθ$) is the desired set.
• Find $S(l)$ with $l∈{t,…,H-1}$: With $αr(l+1)∈Rdl+1×dl$, we have that
$(∂vec(R(l+1))fxi(θ))vec(αr(l+1))=A(H+1)⋯A(l+2)αr(l+1)h(l)(xi;θ).$
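Therefore, if $\mathrm{rank}(A^{(H+1)}\cdots A^{(l+2)})\ge d_y$, since $\{A^{(H+1)}\cdots A^{(l+2)}\alpha_r^{(l+1)}:\alpha_r^{(l+1)}\in\mathbb{R}^{d_{l+1}\times d_l}\}\supseteq\{\alpha_h^{(l+1)}\in\mathbb{R}^{d_y\times d_l}\}$, $S^{(l)}=\{0\}\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ (where $0\in\mathbb{R}^{d_\theta}$) is the desired set. Let us consider the remaining case: let $\mathrm{rank}(A^{(H+1)}\cdots A^{(l+2)})<d_y$, and let $l\in\{t,\dots,H-1\}$ be fixed. Let $l^*=\min\{l'\in\mathbb{Z}^+:l+3\le l'\le H+2\wedge\mathrm{rank}(A^{(H+1)}\cdots A^{(l')})\ge d_y\},$ where $A^{(H+1)}\cdots A^{(H+2)}\triangleq I_{d_y}$ and the minimum exists since the set is finite and contains at least $H+2$ (nonempty). Then $\mathrm{rank}(A^{(H+1)}\cdots A^{(l^*)})\ge d_{H+1}$ and $\mathrm{rank}(A^{(H+1)}\cdots A^{(l')})<d_{H+1}$ for all $l'\in\{l+2,l+3,\dots,l^*-1\}$. Thus, for all $l'\in\{l+1,l+2,\dots,l^*-2\}$, there exists a vector $a_{l'}\in\mathbb{R}^{d_y}$ such that
$a_{l'}\in\mathrm{Null}(A^{(H+1)}\cdots A^{(l'+1)})\quad\text{and}\quad\|a_{l'}\|_2=1.$
Let $a_{l'}$ denote such a vector. Consider $S^{(l)}$ such that the weight matrices $W$ are perturbed with $\bar\theta+\epsilon S_j^{(l)}$ as
$\tilde A_j^{(l')}=A^{(l')}+\epsilon a_{l'}b_{l',j}^\top\quad\text{and}\quad\tilde R_j^{(l+1)}=R^{(l+1)}+\epsilon a_{l+1}b_{l+1,j}^\top$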
for all $l'∈{l+2,l+3,…,l*-2}$, where $∥bl',j∥2$ is bounded such that $∥Sj(l)∥2≤1$. That is, the entries of $Sj$ are all zeros except the entries corresponding to $A(l')$ (for $l'∈{l+2,l+3,…,l*-2}$) and $R(l+1)$. Then $Sj(l)∈V[θ,ε]$, since $A(H+1)⋯A(l'+1)A˜j(l')=A(H+1)⋯A(l'+1)A(l')$ for all $l'∈{l+2,l+3,…,l*-2}$ and $A(H+1)⋯A(l+2)R˜j(l+1)=A(H+1)⋯A(l+2)R(l+1)$. Let $|S(l)|=2N$ with some integer $N$ to be chosen later. Define $Sj+N(l)$ for $j=1,…,N$ by setting $Sj+N(l)=Sj(l)$ except that $bl+1,j+N=0$ whereas $bl+1,j$ is not necessarily zero. By setting $αj+N=-αj$ for all $j∈{1,…,N}$, with $αj∈Rdl*×dl*-1$,
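$\begin{aligned}
\tilde f_\theta(x_i;\alpha,\epsilon,S^{(l)})&=\sum_{j=1}^{N}A^{(H+1)}\cdots A^{(l^*)}(\alpha_j+\alpha_{j+N})\tilde A^{(l^*-2)}\cdots\tilde A^{(l+2)}R^{(l+1)}h^{(l)}(x_i;\theta)\\
&\quad+\sum_{j=1}^{N}\bigl(\partial_{\mathrm{vec}(A^{(l^*-1)})}\phi^{(l)}_{x_i}(\theta+\epsilon S_j)\bigr)\mathrm{vec}(\alpha_j+\alpha_{j+N})\\
&\quad+\epsilon\sum_{j=1}^{N}A^{(H+1)}\cdots A^{(l^*)}\alpha_j\tilde A^{(l^*-2)}\cdots\tilde A^{(l+2)}a_{l+1}b_{l+1,j}^\top h^{(l)}(x_i;\theta)\\
&=\epsilon\sum_{j=1}^{N}A^{(H+1)}\cdots A^{(l^*)}\alpha_j\tilde A^{(l^*-2)}\cdots\tilde A^{(l+2)}a_{l+1}b_{l+1,j}^\top h^{(l)}(x_i;\theta),
\end{aligned}$
where we used the fact that $\partial_{\mathrm{vec}(A^{(l^*-1)})}\phi^{(l)}_{x_i}(\theta+\epsilon S_j)$ does not contain $b_{l+1,j}$. Since $\mathrm{rank}(A^{(H+1)}\cdots A^{(l^*)})\ge d_y$ and $\{A^{(H+1)}\cdots A^{(l^*)}\alpha_j:\alpha_j\in\mathbb{R}^{d_{l^*}\times d_{l^*-1}}\}=\{\frac{1}{\epsilon}\alpha'_j:\alpha'_j\in\mathbb{R}^{d_y\times d_{l^*-1}}\}$, we have that $\forall\alpha'_j\in\mathbb{R}^{d_y\times d_{l^*-1}}$, $\exists\alpha\in\mathbb{R}^{d_\theta\times|S|}$,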
$f˜θ(xi;α,ε,S(l))=∑j=1Nαj'A˜(l*-2)⋯A˜(l+2)al+1bl+1,j⊤h(l)(xi;θ).$
Let $N=2N1$. Define $Sj+N1(l)$ for $j=1,…,N1$ by setting $Sj+N1(l)=Sj(l)$ except that $bl*-2,j+N1=0$, whereas $bl*-2,j$ is not necessarily zero. By setting $αj+N1'=-αj'$ for all $j∈{1,…,N1}$,
$f˜θ(xi;α,ε,S(l))=ε∑j=1N1αj'al*-2bl*-2,j⊤A˜(l*-3)⋯A˜(l+2)al+1bl+1,j⊤h(l)(xi;θ).$
By induction,
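$\tilde f_\theta(x_i;\alpha,\epsilon,S^{(l)})=\epsilon^t\sum_{j=1}^{N_t}\alpha'_ja_{l^*-2}b_{l^*-2,j}^\top a_{l^*-3}b_{l^*-3,j}^\top\cdots a_{l+1}b_{l+1,j}^\top h^{(l)}(x_i;\theta),$
where $t=(l^*-2)-(l+2)+1$ is finite. By setting $\alpha'_j=\frac{1}{\epsilon^t}q_ja_{l^*-2}^\top$ and $b_{l,j}=a_{l-1}$ for all $l=l^*-2,\dots,l$ ($\epsilon>0$),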
$f˜θ(xi;α,ε,S(l))=∑j=1Ntqjbl+1,j⊤h(l)(xi;θ).$
Since $qjbl+1,j$ are arbitrary, with sufficiently large $Nt$ ($Nt=dydl$ suffices), we can set $∑j=1Ntqjbl+1,j=αh(l)$ for any $αh(l)∈Rdθ×dl$, and hence
${f˜θ(xi;α,ε,S(l)):α∈Rdθ×|S(l)|}⊇{αh(l)h(l)(xi;θ):αh(l)∈Rdθ×dl}.$
Thus far, we have found the sets $S(t),…,S(H)⊆finV[θ,ε]$ such that ${f˜θ(xi;α,ε,S(l)):α∈Rdθ}⊇{αh(l+1)h(l)(xi;u):αh(l+1)∈Rdy×dl}$. From lemma 1, we can combine these, yielding
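$\begin{aligned}
&\{\tilde f_\theta(x_i;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}\\
&=\Bigl\{\sum_{l=t}^{H}\tilde f_\theta(x_i;\alpha^{(l)},\epsilon,S^{(l)})+\tilde f_\theta(x_i;\alpha,\epsilon,S):\alpha^{(t)},\dots,\alpha^{(H)}\in\mathbb{R}^{d_\theta},\alpha\in\mathbb{R}^{d_\theta},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\Bigr\}\\
&\supseteq\Bigl\{\sum_{l=t}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u)+\tilde f_\theta(x_i;\alpha,\epsilon,S):\alpha_h^{(l+1)}\in\mathbb{R}^{d_y\times d_l},\alpha\in\mathbb{R}^{d_\theta\times|S|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\Bigr\}.
\end{aligned}$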
Since the set in the first line is a subset of the set in the last line, the equality holds in the above equation. This immediately implies the desired statement from theorem 2.$□$
1

For example, choose the first layer's weight matrix $W(1)$ such that for all $i∈{1,…,m}$, $(W(1)xi)i>0$ and $(W(1)xi)i'≤0$ for all $i'≠i$. This can be achieved by choosing the $i$th row of $W(1)$ to be $[(xi(raw))⊤,ε-1]$ with $0<ε≤δ$ for $i≤m$. Then choose the weight matrices for the $l$th layer for all $l≥2$ such that for all $j$, $Wj,j(l)≠0$ and $Wj',j(l)=0$ for all $j'≠j$. This guarantees $rank([φ(xi;u)]i=1m)≥m$.

We gratefully acknowledge support from NSF grants 1523767 and 1723381, AFOSR grant FA9550-17-1-0165, ONR grant N00014-18-1-2847, Honda Research, and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.

Allen-Zhu
,
Z.
,
Li
,
Y.
, &
Song
,
Z.
(
2018
).
A convergence theory for deep learning via over-parameterization
.
arXiv:1811.03962
.
Arora
,
S.
,
Cohen
,
N.
,
Golowich
,
N.
, &
Hu
,
W.
(
2018
).
A convergence analysis of gradient descent for deep linear neural networks
.
arXiv:1810.02281
.
Arora
,
S.
,
Cohen
,
N.
, &
Hazan
,
E.
(
2018
).
On the optimization of deep networks: Implicit acceleration by overparameterization
. In
Proceedings of the International Conference on Machine Learning
.
Bartlett
,
P. L.
,
Helmbold
,
D. P.
, &
Long
,
P. M.
(
2019
).
Gradient descent with identity initialization efficiently learns positive-definite linear transformations by deep residual networks
.
Neural Computation
,
31
(
3
),
477
502
.
Bertsekas
,
D. P.
(
1999
).
Nonlinear programming
.
Belmont, MA
:
Athena Scientific
.
Blum
,
A. L.
, &
Rivest
,
R. L.
(
1992
).
Training a 3-node neural network is NP-complete
.
Neural Networks
,
5
(
1
),
117
127
.
Brutzkus
,
A.
, &
Globerson
,
A.
(
2017
).
Globally optimal gradient descent for a convnet with gaussian inputs
. In
Proceedings of the International Conference on Machine Learning
(pp.
605
614
).
Choromanska
,
A.
,
Henaff
,
M.
,
Mathieu
,
M.
,
Ben Arous
,
G.
, &
LeCun
,
Y.
(
2015
).
The loss surfaces of multilayer networks
. In
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics
(pp.
192
204
).
Davis
,
D.
,
Drusvyatskiy
,
D.
,
,
S.
, &
Lee
,
J. D.
(
2019
). Stochastic subgradient method converges on tame functions. In
M.
Overton
(Ed.),
Foundations of computational mathematics
(pp.
1
36
).
Berlin
:
Springer
.
Du
,
S. S.
, &
Hu
,
W.
(
2019
).
Width provably matters in optimization for deep linear neural networks
.
arXiv:1901.08572
.
Du
,
S. S.
, &
Lee
,
J. D.
(
2018
).
On the power of over-parameterization in neural networks with quadratic activation
.
arXiv:1803.01206
.
Du
,
S. S.
,
Lee
,
J. D.
,
Li
,
H.
,
Wang
,
L.
, &
Zhai
,
X.
(
2018
).
Gradient descent finds global minima of deep neural networks
.
arXiv:1811.03804
.
Ge
,
R.
,
Lee
,
J. D.
, &
Ma
,
T.
(
2017
).
Learning one-hidden-layer neural networks with landscape design
.
arXiv:1711.00501
.
Goel
,
S.
, &
Klivans
,
A.
(
2017
).
Learning depth-three neural networks in polynomial time
.
arXiv:1709.06010
.
Goodfellow
,
I.
,
Bengio
,
Y.
, &
Courville
,
A.
(
2016
).
Deep learning
.
Cambridge, MA
:
MIT Press
.
Hardt
,
M.
, &
Ma
,
T.
(
2017
).
Identity matters in deep learning
.
arXiv:1611.04231
.
Hardt
,
R. M.
(
1975
).
Stratification of real analytic mappings and images
.
Invent. Math.
,
28
,
193
208
.
He
,
K.
,
Zhang
,
X.
,
Ren
,
S.
, &
Sun
,
J.
(
2016
). Identity mappings in deep residual networks. In
Proceedings of the European Conference on Computer Vision
(pp.
630
645
).
Berlin
:
Springer
.
,
S. M.
, &
Lee
,
J. D.
(
2018
). Provably correct automatic subdifferentiation for qualified programs. In
S.
Bengio
,
H.
Wallach
,
K.
Grauman
,
N.
Cesa-Bianchi
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
31
(pp.
7125
7135
).
Red Hook, NY
:
Curran
.
Kawaguchi
,
K.
(
2016
). Deep learning without poor local minima. In
D.
Lee
,
M.
Sugiyama
,
U.
Luxburg
,
I.
Guyon
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
29
(pp.
586
594
).
Red Hook, NY
:
Curran
.
Kawaguchi
,
K.
, &
Bengio
,
Y.
(
2019
).
Depth with nonlinearity creates no bad local minima in ResNets
.
Neural Networks
,
118
,
167
174
.
Kawaguchi
,
K.
,
Huang
,
J.
, &
Kaelbling
,
L. P.
(
2019
).
Effect of depth and width on local minima in deep learning
.
Neural Computation
,
31
(
6
),
1462
1498
.
Kawaguchi
,
K.
,
Xie
,
B.
, &
Song
,
L.
(
2018
).
Deep semi-random features for nonlinear function approximation
. In
Proceedings of the 32nd AAAI Conference on Artificial Intelligence
.
Palo Alto, CA
:
AAAI Press
.
Laurent
,
T.
, &
Brecht
,
J.
(
2018
).
Deep linear networks with arbitrary loss: All local minima are global
. In
Proceedings of the International Conference on Machine Learning
(pp.
2908
2913
).
Lee
,
J. M.
(
2013
).
Introduction to smooth manifolds
(2nd ed.).
New York
:
Springer
.
Li, Y., & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 597–607). Red Hook, NY: Curran.
Murty, K. G., & Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2), 117–129.
Nguyen, Q., & Hein, M. (2017). The loss surface of deep and wide neural networks. In Proceedings of the International Conference on Machine Learning (pp. 2603–2612).
Nguyen, Q., & Hein, M. (2018). Optimization landscape and expressivity of deep CNNs. In Proceedings of the International Conference on Machine Learning (pp. 3727–3736).
Rockafellar, R. T., & Wets, R. J.-B. (2009). Variational analysis. New York: Springer.
Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120.
Shamir, O. (2018). Are ResNets provably better than linear predictors? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31. Red Hook, NY: Curran.
Soltanolkotabi, M. (2017). Learning ReLUs via gradient descent. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 2007–2017). Red Hook, NY: Curran.
Zhong, K., Song, Z., Jain, P., Bartlett, P. L., & Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. In Proceedings of the International Conference on Machine Learning (pp. 4140–4149).
Zou, D., Cao, Y., Zhou, D., & Gu, Q. (2018). Stochastic gradient descent optimizes over-parameterized deep ReLU networks. arXiv:1811.08888.
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.