## Abstract

For nonconvex optimization in machine learning, this article proves that every local minimum achieves the globally optimal value of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this article improves or complements several state-of-the-art theoretical results on deep neural networks, deep residual networks, and overparameterized deep neural networks with a unified proof technique and novel geometric insights. A special case of our results also contributes to the theoretical foundation of representation learning.

## 1 Introduction

Deep learning has achieved considerable empirical success in machine learning applications. However, insufficient work has been done on theoretically understanding deep learning, partly because of the nonconvexity and high-dimensionality of the objective functions used to train deep models. In general, theoretical understanding of nonconvex, high-dimensional optimization is challenging. Indeed, finding a global minimum of a general nonconvex function (Murty & Kabadi, 1987) and training certain types of neural networks (Blum & Rivest, 1992) are both NP-hard. Considering the NP-hardness for a general set of relevant problems, it is necessary to use additional assumptions to guarantee efficient global optimality in deep learning. Accordingly, recent theoretical studies have proven global optimality in deep learning by using additional strong assumptions such as linear activation, random activation, semirandom activation, gaussian inputs, single hidden-layer network, and significant overparameterization (Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016; Hardt & Ma, 2017; Nguyen & Hein, 2017, 2018; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Ge, Lee, & Ma, 2017; Goel & Klivans, 2017; Zhong, Song, Jain, Bartlett, & Dhillon, 2017; Li & Yuan, 2017; Kawaguchi, Xie, & Song, 2018; Du & Lee, 2018).

A study proving efficient global optimality in deep learning is thus closely related to the search for additional assumptions that might not hold in many practical applications. Toward widely applicable practical theory, we can also ask a different type of question: If standard global optimality requires additional assumptions, then what type of global optimality does not? In other words, instead of searching for additional assumptions to guarantee standard global optimality, we can also search for another type of global optimality under mild assumptions. Furthermore, instead of an arbitrary type of global optimality, it is preferable to develop a general theory of global optimality that not only works under mild assumptions but also produces the previous results with the previous additional assumptions, while predicting new results with future additional assumptions. This type of general theory may help not only to explain when and why an existing machine learning method works but also to predict the types of future methods that will or will not work.

As a step toward this goal, this article proves a series of theoretical results. The major contributions are summarized as follows:

- For nonconvex optimization in machine learning with mild assumptions, we prove that every differentiable local minimum achieves global optimality of the perturbable gradient basis model class. This result is directly applicable to many existing machine learning models, including practical deep learning models, and to new models to be proposed in the future, nonconvex and convex.
- The proposed general theory with a simple and unified proof technique is shown to be able to prove several concrete guarantees that improve or complement several state-of-the-art results.
- In general, the proposed theory allows us to see the effects of the design of models, methods, and assumptions on the optimization landscape through the lens of the global optima of the perturbable gradient basis model class.

Because a local minimum $\theta$ in $\mathbb{R}^{d_\theta}$ only requires $\theta$ to be locally optimal in $\mathbb{R}^{d_\theta}$, it is nontrivial that the local minimum is guaranteed to achieve the global optimality in $\mathbb{R}^{d_\theta}$ of the induced perturbable gradient basis model class. The reason we can possibly prove something more than many worst-case results in general nonconvex optimization is that we explicitly take advantage of mild assumptions that commonly hold in machine learning and deep learning. In particular, we assume that an objective function to be optimized is structured as a sum of weighted errors, where each error is an output of the composition of a loss function and a function of a hypothesis class. Moreover, we make mild assumptions on the loss function and the hypothesis class, all of which typically hold in practice.

## 2 Preliminaries

This section defines the problem setting and common notation.

### 2.1 Problem Description

### 2.2 Notation

We use the following standard notation for differentiation. Given a scalar-valued or vector-valued function $\varphi : \mathbb{R}^d \to \mathbb{R}^{d'}$ with components $\varphi = (\varphi_1, \ldots, \varphi_{d'})$ and variables $(v_1, \ldots, v_{\bar d})$, let $\partial_v \varphi : \mathbb{R}^d \to \mathbb{R}^{d' \times \bar d}$ be the matrix-valued function with each entry $(\partial_v \varphi)_{i,j} = \frac{\partial \varphi_i}{\partial v_j}$. Note that if $\varphi$ is a scalar-valued function, $\partial_v \varphi$ outputs a row vector. In addition, $\partial \varphi = \partial_v \varphi$ if $(v_1, \ldots, v_d)$ are the input variables of $\varphi$. Given a function $\varphi$ on $\mathbb{R}^d$, let $\partial_k \varphi$ be the partial derivative with respect to the $k$th variable of $\varphi$. For the syntax of any differentiation map $\partial$, given functions $\varphi$ and $\zeta$, let $\partial \varphi(\zeta(q)) = (\partial \varphi)(\zeta(q))$ be the (partial) derivative $\partial \varphi$ evaluated at an output $\zeta(q)$ of a function $\zeta$.

Given a matrix $M \in \mathbb{R}^{d \times d'}$, $\mathrm{vec}(M) = [M_{1,1}, \ldots, M_{d,1}, M_{1,2}, \ldots, M_{d,2}, \ldots, M_{1,d'}, \ldots, M_{d,d'}]^\top$ represents the standard vectorization of the matrix $M$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^n$, define $[M^{(j)}]_{j=1}^n = [M^{(1)}, M^{(2)}, \ldots, M^{(n)}]$ to be the block matrix whose column blocks are $M^{(1)}, M^{(2)}, \ldots, M^{(n)}$. Similarly, given a set $I = \{i_1, \ldots, i_n\}$ with $(i_1, \ldots, i_n)$ increasing, define $[M^{(j)}]_{j \in I} = [M^{(i_1)} \cdots M^{(i_n)}]$.
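As a concrete reading of this notation, the following minimal sketch implements the column-major vectorization and the column-block concatenation with plain Python lists. The helper names `vec`, `block_row`, and `block_row_indexed` are illustrative choices, not names used in the article.

```python
def vec(M):
    """Column-major vectorization of a matrix given as a list of rows."""
    n_rows, n_cols = len(M), len(M[0])
    return [M[i][j] for j in range(n_cols) for i in range(n_rows)]


def block_row(blocks):
    """Place matrices with the same number of rows side by side."""
    n_rows = len(blocks[0])
    return [[entry for B in blocks for entry in B[i]] for i in range(n_rows)]


def block_row_indexed(blocks_by_index, index_set):
    """[M^(j)]_{j in I}: column blocks taken in increasing index order."""
    return block_row([blocks_by_index[j] for j in sorted(index_set)])


M = [[1, 2, 3],
     [4, 5, 6]]
vec_M = vec(M)  # columns of M stacked on top of each other

blocks = {1: [[1], [2]], 2: [[3, 4], [5, 6]], 3: [[7], [8]]}
all_blocks = block_row([blocks[1], blocks[2], blocks[3]])
picked = block_row_indexed(blocks, {3, 1})
print(vec_M, all_blocks, picked)
```

The index-set variant sorts the indices first, matching the requirement that $(i_1, \ldots, i_n)$ be increasing.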

## 3 Nonconvex Optimization Landscapes for Machine Learning

This section shows our first main result that under mild assumptions, every differentiable local minimum achieves the global optimality of the perturbable gradient basis model class.

### 3.1 Assumptions

Given a hypothesis class $f$ and data set, let $\Omega$ be the set of nondifferentiable points $\theta$, defined as $\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\exists i \in \{1, \ldots, m\})[f_{x_i} \text{ is not differentiable at } \theta]\}$. Similarly, define $\tilde\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\forall \epsilon > 0)(\exists \theta' \in B(\theta, \epsilon))(\exists i \in \{1, \ldots, m\})[f_{x_i} \text{ is not differentiable at } \theta']\}$. Here, $B(\theta, \epsilon)$ is the open ball with center $\theta$ and radius $\epsilon$. In common nondifferentiable models $f$ such as neural networks with rectified linear units (ReLUs) and pooling operations, we have that $\Omega = \tilde\Omega$, and the Lebesgue measure of $\Omega$ ($= \tilde\Omega$) is zero.

This section uses the following mild assumptions.

Assumption 1 (Use of Common Loss Criteria). For all $i \in \{1, \ldots, m\}$, the function $\ell_{y_i} : q \mapsto \ell(q, y_i) \in \mathbb{R}_{\ge 0}$ is differentiable and convex (e.g., the squared loss, cross-entropy loss, or polynomial hinge loss satisfies this assumption).

Assumption 2 (Use of Common Model Structures). There exists a function $g : \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_\theta}$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta)$ for all $i \in \{1, \ldots, m\}$ and all $\theta \in \mathbb{R}^{d_\theta} \setminus \Omega$.

Assumption 1 is satisfied by simply using common loss criteria that include the squared loss $\ell(q, y) = \|q - y\|_2^2$, cross-entropy loss $\ell(q, y) = -\sum_{k=1}^{d_y} y_k \log \frac{\exp(q_k)}{\sum_{k'} \exp(q_{k'})}$, and smoothed hinge loss $\ell(q, y) = (\max\{0, 1 - yq\})^p$ with $p \ge 2$ (the hinge loss with $d_y = 1$). Although the objective function $L : \theta \mapsto L(\theta)$ used to train a complex machine learning model (e.g., a neural network) is nonconvex in $\theta$, the loss criterion $\ell_{y_i} : q \mapsto \ell(q, y_i)$ is usually convex in $q$. In this article, the cross-entropy loss includes the softmax function, and thus $f_x(\theta)$ is the pre-softmax output of the last layer in related deep learning models.
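The three loss criteria named above can be written down directly. The sketch below is only an illustration: it spot-checks midpoint convexity in $q$ at a single pair of points, which is a sanity check and not a proof. The label convention $y \in \{-1, +1\}$ for the hinge loss and the helper name `midpoint_convex` are assumptions made for this example.

```python
import math


def squared_loss(q, y):
    """||q - y||_2^2 for vectors q and y."""
    return sum((qk - yk) ** 2 for qk, yk in zip(q, y))


def cross_entropy_loss(q, y):
    """-sum_k y_k log softmax(q)_k, with q the pre-softmax output."""
    top = max(q)  # subtract the max before exponentiating, for stability
    log_z = top + math.log(sum(math.exp(qk - top) for qk in q))
    return -sum(yk * (qk - log_z) for qk, yk in zip(q, y))


def smoothed_hinge_loss(q, y, p=2):
    """(max{0, 1 - y q})^p for scalar q; y in {-1, +1} is assumed here."""
    return max(0.0, 1.0 - y * q) ** p


def midpoint_convex(loss, a, b, y, tol=1e-12):
    """Check loss((a + b) / 2) <= (loss(a) + loss(b)) / 2 at one pair of points."""
    mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
    return loss(mid, y) <= (loss(a, y) + loss(b, y)) / 2 + tol


a, b = [2.0, -1.0, 0.5], [-3.0, 4.0, 1.0]
one_hot = [0.0, 1.0, 0.0]
checks = [
    midpoint_convex(squared_loss, a, b, one_hot),
    midpoint_convex(cross_entropy_loss, a, b, one_hot),
    # scalar case: the midpoint of q = -3 and q = 2 is q = -0.5
    smoothed_hinge_loss(-0.5, 1)
    <= (smoothed_hinge_loss(-3.0, 1) + smoothed_hinge_loss(2.0, 1)) / 2,
]
print(checks)
```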

Assumption 2 is satisfied by simply using a common architecture in deep learning or a classical machine learning model. For example, consider a deep neural network of the form $f_x(\theta) = W h(x; u) + b$, where $h(x; u)$ is an output of an arbitrary representation at the last hidden layer and $\theta = \mathrm{vec}([W, b, u])$. Then assumption 2 holds because $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \partial_k f_{x_i}(\theta)$, where $g(\theta)_k = \theta_k$ for all $k$ corresponding to the parameters $(W, b)$ in the last layer and $g(\theta)_k = 0$ for all other $k$ corresponding to $u$. In general, because $g$ is a function of $\theta$, assumption 2 is easily satisfiable. Assumption 2 does not require the model $f(x; \theta)$ to be linear in $\theta$ or $x$.
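The identity in assumption 2 can be checked numerically for a network of this form. The sketch below assumes a scalar output, a single hidden layer $h(x; u) = \tanh(Ux)$, hand-picked parameter values, and central finite differences for the partial derivatives; all of these are illustrative choices rather than part of the article.

```python
import math

D_HIDDEN, D_IN = 3, 2
N_LAST = D_HIDDEN + 1  # entries of theta holding the last-layer W and b


def model(theta, x):
    """Scalar f_x(theta) = W h(x; u) + b with h(x; u) = tanh(U x)."""
    W = theta[:D_HIDDEN]
    b = theta[D_HIDDEN]
    U = theta[N_LAST:]  # D_HIDDEN x D_IN matrix, stored row by row
    h = [math.tanh(sum(U[j * D_IN + i] * x[i] for i in range(D_IN)))
         for j in range(D_HIDDEN)]
    return sum(W[j] * h[j] for j in range(D_HIDDEN)) + b


def partial(theta, x, k, step=1e-6):
    """Central finite-difference estimate of the k-th partial derivative."""
    up, down = list(theta), list(theta)
    up[k] += step
    down[k] -= step
    return (model(up, x) - model(down, x)) / (2 * step)


theta = [0.5, -1.2, 0.8, 0.3, 0.1, -0.4, 0.7, 0.2, -0.6, 0.9]
x = [1.5, -0.5]

# g(theta)_k = theta_k on the last-layer entries and 0 on the entries of u.
g = [theta[k] if k < N_LAST else 0.0 for k in range(len(theta))]
reconstructed = sum(g[k] * partial(theta, x, k) for k in range(len(theta)))
print(model(theta, x), reconstructed)
```

The two printed values agree up to finite-difference error because the output is linear in $(W, b)$, even though it is nonlinear in $u$ and in $x$.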

Note that we allow the nondifferentiable points to exist in $L(\theta )$; for example, the use of ReLU is allowed. For a nonconvex and nondifferentiable function, we can still have first-order and second-order necessary conditions of local minima (e.g., Rockafellar & Wets, 2009, theorem 13.24). However, subdifferential calculus of a nonconvex function requires careful treatment at nondifferentiable points (see Rockafellar & Wets, 2009; Kakade & Lee, 2018; Davis, Drusvyatskiy, Kakade, & Lee, 2019), and deriving guarantees at nondifferentiable points is left to a future study.

### 3.2 Theory for Critical Points

An important aspect in theorem ^{3} is that $L_\theta$ on the right-hand side is convex, while $L$ on the left-hand side can be nonconvex or convex. Here, following convention, $\inf S$ is defined to be the infimum of a subset $S$ of $\overline{\mathbb{R}}$ (the set of affinely extended real numbers); that is, if $S$ has no lower bound, $\inf S = -\infty$, and $\inf \emptyset = \infty$. Note that theorem ^{3} vacuously holds true if there is no critical point for $L$. To guarantee the existence of a minimizer in a (nonempty) subspace $S \subseteq \mathbb{R}^{d_\theta}$ for $L$ (or $L_\theta$), a classical proof requires two conditions: a lower semicontinuity of $L$ (or $L_\theta$) and the existence of a $q \in S$ for which the set $\{q' \in S : L(q') \le L(q)\}$ (or $\{q' \in S : L_\theta(q') \le L_\theta(q)\}$) is compact (see Bertsekas, 1999, for different conditions).

#### 3.2.1 Geometric View

This section presents a geometric view of theorem ^{3} that provides an intuitive yet formal description of the gradient basis model class. Figure 1 illustrates the gradient basis model class and theorem ^{3} with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$. Here, we consider the following map from the parameter space to the concatenation of the output of the model at $x_1, x_2, \ldots, x_m$: $f_X : \theta \in \mathbb{R}^{d_\theta} \mapsto (f_{x_1}(\theta)^\top, f_{x_2}(\theta)^\top, \ldots, f_{x_m}(\theta)^\top)^\top \in \mathbb{R}^{m d_y}$.

Theorem ^{3} states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $T_{f_X(\theta)}$. Theorem ^{3} concludes this global optimality in the affine subspace of the output space based on the local condition in the parameter space (i.e., differentiable critical point). A key idea behind theorem ^{3} is to consider the map between the parameter space and the output space, which enables us to take advantage of assumptions 1 and 2.

Figure 2 illustrates the gradient basis model class and theorem ^{3} with a union of manifolds and a tangent space. Under the constant rank condition, the image of the map $f_X$ locally forms a single manifold. More precisely, if there exists a small neighborhood $U(\theta)$ of $\theta$ such that $f_X$ is differentiable in $U(\theta)$ and $\mathrm{rank}(\partial f_X(\theta')) = r$ is constant with some $r$ for all $\theta' \in U(\theta)$ (the constant rank condition), then the rank theorem states that the image $f_X(U(\theta))$ is a manifold of dimension $r$ (Lee, 2013, theorem 4.12). We note that the rank map $\theta \mapsto \mathrm{rank}(\partial f_X(\theta))$ is lower semicontinuous (i.e., if $\mathrm{rank}(\partial f_X(\theta)) = r$, then there exists a neighborhood $U(\theta)$ of $\theta$ such that $\mathrm{rank}(\partial f_X(\theta')) \ge r$ for any $\theta' \in U(\theta)$). Therefore, if $\partial f_X(\theta)$ at $\theta$ has the maximum rank in a small neighborhood of $\theta$, then the constant rank condition is satisfied.
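The claim that a differentiable critical point is globally optimal within the span of the partial derivatives $\partial_k f_X(\theta)$ can be checked by hand on a toy model. The sketch below is a hedged illustration under several assumptions that are not taken from the article: the model is $f(x; \theta) = \theta_1 \theta_2 x$, the loss is the squared loss with all weights $\lambda_i = 1$, and the gradient basis model is taken to be $f_\theta(x; \alpha) = \sum_k \alpha_k \partial_k f_x(\theta)$, which is consistent with the basis-function example in section 3.2.2. The data values are made up.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def squared_distance_to_span(vectors, target, tol=1e-12):
    """Squared distance from target to span(vectors), via Gram-Schmidt."""
    ortho = []
    for v in vectors:
        w = list(v)
        for q in ortho:
            c = dot(w, q)
            w = [wi - c * qi for wi, qi in zip(w, q)]
        norm_sq = dot(w, w)
        if norm_sq > tol:  # skip zero or linearly dependent directions
            ortho.append([wi / norm_sq ** 0.5 for wi in w])
    r = list(target)
    for q in ortho:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return dot(r, r)


xs = [1.0, 2.0, 3.0]
ys = [2.1, 3.9, 6.2]


def loss(theta):
    """L(theta) = sum_i (theta_1 theta_2 x_i - y_i)^2."""
    return sum((theta[0] * theta[1] * x - y) ** 2 for x, y in zip(xs, ys))


def gradient_basis_infimum(theta):
    """inf over alpha of sum_i (alpha_1 d_1 f + alpha_2 d_2 f - y_i)^2."""
    d1 = [theta[1] * x for x in xs]  # partial derivative in theta_1
    d2 = [theta[0] * x for x in xs]  # partial derivative in theta_2
    return squared_distance_to_span([d1, d2], ys)


w_star = dot(xs, ys) / dot(xs, xs)  # best slope for the product theta_1 theta_2
for name, theta in [("saddle", (0.0, 0.0)),
                    ("global minimum", (2.0, w_star / 2.0)),
                    ("not critical", (1.0, 1.0))]:
    print(name, loss(theta), gradient_basis_infimum(theta))
```

At the saddle point $(0, 0)$ both partial derivatives vanish, so the span is trivial and the two values coincide even though the point is far from a global minimum of $L$. At the non-critical point $(1, 1)$ the loss is strictly larger than the infimum over the gradient basis.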

#### 3.2.2 Examples

In this section, we show through examples that theorem ^{3} generalizes the previous results in special cases while providing new theoretical insights based on the gradient basis model class and its geometric view. In the following, whenever the form of $f$ is specified, we require only assumption 1 because assumption 2 is automatically satisfied by a given $f$.

For classical machine learning models, example 1 shows that the gradient basis model class is indeed equivalent to a given model class. From the geometric view, this means that for any $\theta$, the tangent space $T_{f_X(\theta)}$ is equal to the whole image $M$ of $f_X$ (i.e., $T_{f_X(\theta)}$ does not depend on $\theta$). This reduces theorem ^{3} to the statement that every critical point of $L$ is a global minimum of $L$.

Example 1. For any basis function model $f(x; \theta) = \sum_{k=1}^{d_\theta} \theta_k \phi(x)_k$ in classical machine learning with any fixed feature map $\phi : \mathcal{X} \to \mathbb{R}^{d_\theta}$, we have that $f_\theta(x; \alpha) = f(x; \alpha)$, and hence $\inf_{\theta \in \mathbb{R}^{d_\theta}} L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$, as well as $\Omega = \emptyset$. In other words, in this special case, theorem ^{3} states that every critical point of $L$ is a global minimum of $L$. Here, we do not assume that a critical point or a global minimum exists or is attainable. Instead, the statement logically means that if a point is a critical point, then the point is a global minimum. This type of statement vacuously holds true if there is no critical point.
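The equality $f_\theta(x; \alpha) = f(x; \alpha)$ for basis function models can be seen numerically: the partial derivatives with respect to $\theta$ are the features themselves, regardless of where $\theta$ sits. The sketch below assumes, for illustration only, the quadratic feature map $\phi(x) = (1, x, x^2)$ and the form $f_\theta(x; \alpha) = \sum_k \alpha_k \partial_k f(x; \theta)$ for the gradient basis model.

```python
def features(x):
    """A fixed, hand-chosen feature map phi(x) (an illustrative choice)."""
    return [1.0, x, x * x]


def model(x, theta):
    """f(x; theta) = sum_k theta_k phi(x)_k."""
    return sum(t * p for t, p in zip(theta, features(x)))


def partial(x, theta, k, step=1e-6):
    """Central finite-difference estimate of the k-th partial derivative."""
    up, down = list(theta), list(theta)
    up[k] += step
    down[k] -= step
    return (model(x, up) - model(x, down)) / (2 * step)


def gradient_basis_model(x, theta, alpha):
    """f_theta(x; alpha) = sum_k alpha_k d_k f(x; theta)."""
    return sum(a * partial(x, theta, k) for k, a in enumerate(alpha))


x = 1.7
alpha = [0.4, -2.0, 3.0]
# Evaluate the gradient basis model around two unrelated parameter vectors.
values = [gradient_basis_model(x, theta, alpha)
          for theta in ([0.0, 0.0, 0.0], [5.0, -1.0, 2.5])]
print(values, model(x, alpha))
```

Both entries of `values` match `model(x, alpha)` up to finite-difference error, which is the sense in which the induced basis does not depend on $\theta$ in this special case.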

For overparameterized deep neural networks, example 2 shows that the induced gradient basis model class is highly expressive such that it must contain the globally optimal model of a given model class of deep neural networks. In this example, the tangent space $T_{f_X(\theta)}$ is equal to the whole output space $\mathbb{R}^{m d_y}$. This reduces theorem ^{3} to the statement that every critical point of $L$ is a global minimum of $L$ for overparameterized deep neural networks.

Intuitively, in Figure 1 or 2, we can increase the number of parameters and raise the number of partial derivatives $\partial_k f_X(\theta)$ in order to increase the dimensionality of the tangent space $T_{f_X(\theta)}$ so that $T_{f_X(\theta)} = \mathbb{R}^{m d_y}$. This is indeed what happens in example 2, as well as in the previous studies of significantly overparameterized deep neural networks (Allen-Zhu, Li, & Song, 2018; Du, Lee, Li, Wang, & Zhai, 2018; Zou et al., 2018). In the previous studies, the significant overparameterization is required so that the tangent space $T_{f_X(\theta)}$ does not change from the initial tangent space $T_{f_X(\theta^{(0)})} = \mathbb{R}^{m d_y}$ during training. Thus, theorem ^{3}, with its geometric view, provides novel algebraic and geometric insights into the results of the previous studies and the reason why overparameterized deep neural networks are easy to optimize despite nonconvexity.

Example 2. Theorem ^{3} implies that every critical point (and every local minimum) is a global minimum for sufficiently overparameterized deep neural networks. Let $n$ be the number of units in each layer of a fully connected feedforward deep neural network. Let us consider a significant overparameterization such that $n \ge m$. Let us write a fully connected feedforward deep neural network with the trainable parameters $(\theta, u)$ by $f(x; \theta) = W \phi(x; u)$, where $W \in \mathbb{R}^{d_y \times n}$ is the weight matrix in the last layer, $\theta = \mathrm{vec}(W)$, $u$ contains the rest of the parameters, and $\phi(x; u)$ is the output of the last hidden layer. Denote $x_i = [(x_i^{(\mathrm{raw})})^\top, 1]^\top$ to contain the constant term to account for the bias term in the first layer. Assume that the input samples are normalized as $\|x_i^{(\mathrm{raw})}\|_2 = 1$ for all $i \in \{1, \ldots, m\}$ and distinct as $(x_i^{(\mathrm{raw})})^\top x_{i'}^{(\mathrm{raw})} < 1 - \delta$ with some $\delta > 0$ for all $i' \ne i$. Assume that the activation functions are ReLU activation functions. Then we can efficiently set $u$ to guarantee $\mathrm{rank}([\phi(x_i; u)]_{i=1}^m) \ge m$ (e.g., by choosing $u$ to make each unit of the last layer active only for its own sample $x_i$).^{1} Theorem ^{3} implies that every critical point $\theta$ with this $u$ is a global minimum of the whole set of trainable parameters $(\theta, u)$ because $\inf_\alpha L_\theta(\alpha) = \inf_{f_1, \ldots, f_m} \sum_{i=1}^m \lambda_i \ell(f_i, y_i)$ (with assumption 1).
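One way to realize the parenthetical remark about $u$ is sketched below for a single hidden ReLU layer with $n = m$ units. This is an assumed construction for illustration, not necessarily the one in the article's footnote: unit $j$ uses the raw input $x_j^{(\mathrm{raw})}$ as its weights and $-(1 - \delta)$ as its bias, so it fires only on sample $j$. The data set (points on the unit circle) is also made up.

```python
import math

m = 5
# Distinct unit-norm raw inputs on the circle (an illustrative data set).
raw = [(math.cos(2 * math.pi * i / m), math.sin(2 * math.pi * i / m))
       for i in range(m)]

largest_overlap = max(raw[i][0] * raw[j][0] + raw[i][1] * raw[j][1]
                      for i in range(m) for j in range(m) if i != j)
delta = (1.0 - largest_overlap) / 2.0  # so that every overlap is < 1 - delta

inputs = [(r[0], r[1], 1.0) for r in raw]  # append the constant term
# Unit j: weights x_j^(raw) and bias -(1 - delta).
units = [(r[0], r[1], -(1.0 - delta)) for r in raw]


def relu(z):
    return max(0.0, z)


# features[j][i] = output of unit j on sample i
features = [[relu(sum(w * v for w, v in zip(unit, x))) for x in inputs]
            for unit in units]
for row in features:
    print([round(v, 3) for v in row])
```

The resulting feature matrix is diagonal with strictly positive diagonal entries, so its rank equals $m$.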

For deep neural networks, example 3 shows that standard networks have the global optimality guarantee with respect to the representation learned at the last layer, and skip connections further ensure the global optimality with respect to the representation learned at each hidden layer. This is because adding the skip connections incurs new partial derivatives $\{\partial_k f_X(\theta)\}_k$ that span the tangent space containing the output of the best model with the corresponding learned representation.

Theorem ^{3} implies that for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds:

### 3.3 Theory for Local Minima

The following theorem shows that every differentiable local minimum of $L$ achieves the global minimum value of $\tilde L_\theta$:

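To relate theorems ^{3} and ^{7}, let us consider the following general inequalities: for any $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde\Omega)$ with $\epsilon \ge 0$ being sufficiently small, $L(\theta) \ge \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) \ge \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\, \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde L_\theta(\alpha, \epsilon, S)$. Whereas theorem ^{3} states that the first inequality becomes equality as $L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$ at every differentiable critical point, theorem ^{7} states that both inequalities become equality at every differentiable local minimum.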

From theorem ^{3} to theorem ^{7}, the power of increasing the number of parameters (including overparameterization) is further improved. The right-hand side in equation 3.2 is the global minimum value over the variables $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ and $\alpha \in \mathbb{R}^{d_\theta \times |S|}$. Here, as $d_\theta$ increases, we may obtain the global minimum value of a larger search space $\mathbb{R}^{d_\theta \times |S|}$, which is similar to theorem ^{3}. A concern in theorem ^{3} is that as $d_\theta$ increases, we may also significantly increase the redundancy among the elements in $\{\partial_k f_{x_i}(\theta)\}_{k=1}^{d_\theta}$. Although this remains a valid concern, theorem ^{7} allows us to break the redundancy by the globally optimal $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ to some degree.

For example, consider $f(x; \theta) = g(W^{(l)} h^{(l)}(x; u); u)$, which represents a deep neural network, with some $l$th-layer output $h^{(l)}(x; u) \in \mathbb{R}^{d_l}$, a trainable weight matrix $W^{(l)}$, and an arbitrary function $g$ to compute the rest of the forward pass. Here, $\theta = \mathrm{vec}([W^{(l)}, u])$. Let $h^{(l)}(X; u) = [h^{(l)}(x_i; u)]_{i=1}^m \in \mathbb{R}^{d_l \times m}$ and, similarly, $f(X; \theta) = g(W^{(l)} h^{(l)}(X; u); u) \in \mathbb{R}^{d_y \times m}$. Then, all vectors $v$ corresponding to any elements in the left null space of $h^{(l)}(X; u)$ are in $V[\theta, \epsilon]$ (i.e., $v_k = 0$ for all $k$ corresponding to $u$ and the rest of $v_k$ is set to perturb $W^{(l)}$ by an element in the left null space). Thus, as the redundancy increases such that the dimension of the left null space of $h^{(l)}(X; u)$ increases, we have a larger space of $V[\theta, \epsilon]$, for which a global minimum value is guaranteed at a local minimum.
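The role of the left null space can be illustrated with a small integer example. The sketch below only demonstrates the invariance that makes such directions special: perturbing the rows of $W^{(l)}$ along a left null vector of $h^{(l)}(X; u)$ leaves $W^{(l)} h^{(l)}(X; u)$, and hence the model output, unchanged. It assumes $g$ is the identity, and the matrices are made up; it does not reproduce the definition of $V[\theta, \epsilon]$.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


# h^(l)(X; u): d_l = 3 units, m = 2 samples.
# The third row is the sum of the first two, so the rows are redundant.
h = [[1, 2],
     [3, 4],
     [4, 6]]
a = [1, 1, -1]  # left null vector: a^T h = 0

W = [[1, 2, 3],
     [0, -1, 4]]  # d_y = 2 outputs
c = [5, -2]       # arbitrary coefficients, one per output row

# Perturb each row of W along the left null vector a.
W_perturbed = [[W[i][k] + c[i] * a[k] for k in range(3)] for i in range(2)]

null_check = matmul([a], h)
before = matmul(W, h)
after = matmul(W_perturbed, h)
print(null_check, before, after)
```

More redundancy among the rows of $h^{(l)}(X; u)$ means more such directions, which matches the remark that a larger left null space yields a larger set $V[\theta, \epsilon]$.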

#### 3.3.1 Geometric View

This section presents the geometric view of theorem ^{7}. Figure 3 illustrates the perturbable gradient basis model class and theorem ^{7} with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$. Figure 4 illustrates them with a union of manifolds and tangent spaces at a singular point. Given an $\epsilon$ ($\le \epsilon_0$), the affine subspace $\tilde T_{f_X(\theta)}$ of the output space $\mathbb{R}^{m d_y}$ is the span of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta, \epsilon]\}$.

Theorem ^{7} states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $\tilde T_{f_X(\theta)}$. Theorem ^{7} concludes the global optimality in the affine subspace of the output space based on the local condition in the parameter space (that is, differentiable local minima). Here, a (differentiable) local minimum $\theta$ is required to be optimal only in an arbitrarily small local neighborhood in the parameter space, and yet $f_X(\theta)$ is guaranteed to be globally optimal in the affine subspace of the output space. This illuminates the fact that nonconvex optimization in machine learning has a particular structure beyond general nonconvex optimization.

## 4 Applications to Deep Neural Networks

The previous section showed that all local minima achieve the global optimality of the perturbable gradient basis model class with several direct consequences for special cases. In this section, as consequences of theorem ^{7}, we complement or improve the state-of-the-art results in the literature.

### 4.1 Example: ResNets

As an application of theorem ^{7}, we set $f$ to be the function of a certain type of residual networks (ResNets) that Shamir (2018) studied. That is, this section uses the same $f$ as Shamir (2018).

#### 4.1.1 Background

Along with an analysis of approximate critical points, Shamir (2018) proved the following main result, proposition 1, under the assumptions PA1, PA2, and PA3:

PA1: The output dimension $d_y = 1$.

PA2: For any $y$, the function $\ell_y$ is convex and twice differentiable.

PA3: On any bounded subset of the domain of $L$, the function $L_u(W, R)$, its gradient $\nabla L_u(W, R)$, and its Hessian $\nabla^2 L_u(W, R)$ are all Lipschitz continuous in $(W, R)$, where $L_u(W, R) = L(\theta)$ with a fixed $u$.

Shamir (2018) remarked that it is an open problem whether proposition 1 and another main result in the article can be extended to networks with $d_y > 1$ (multiple output units). Note that Shamir (2018) also provided proposition 1 with an expected loss and an analysis for a simpler decoupled model, $Wx + Vz(x; u)$. For the simpler decoupled model, our theorem ^{3} immediately concludes that given any $u$, every critical point with respect to $\theta_{-u} = (W, R)$ achieves a global minimum value with respect to $\theta_{-u}$ as $L(\theta_{-u}) = \inf\{\sum_{i=1}^m \lambda_i \ell_{y_i}(W x_i + R z(x_i; u)) : W \in \mathbb{R}^{d_y \times d_x}, R \in \mathbb{R}^{d_x \times d_z}\}$ ($\le \inf_{W \in \mathbb{R}^{d_y \times d_x}} \sum_{i=1}^m \lambda_i \ell_{y_i}(W x_i)$). This holds for every critical point $\theta$ since any critical point $\theta$ must be a critical point with respect to $\theta_{-u}$.
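The inequality in parentheses says that, for a fixed $u$, the best decoupled model is never worse than the best purely linear model. The sketch below illustrates this in the simplest setting, under assumptions made only for this example: scalar input, output, and feature; squared loss with $\lambda_i = 1$; a hypothetical feature $z(x; u) = \max\{0, ux\}$; and targets $y = |x|$.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def relu(t):
    return max(0.0, t)


u = 1.0  # fixed hidden parameter
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [2.0, 1.0, 0.0, 1.0, 2.0]   # targets y = |x|
zs = [relu(u * x) for x in xs]   # z(x; u), a hypothetical learned feature

# Linear feature only: minimize sum_i (w x_i - y_i)^2 over w.
w_only = dot(xs, ys) / dot(xs, xs)
loss_linear = sum((w_only * x - y) ** 2 for x, y in zip(xs, ys))

# Decoupled model w x + v z(x; u): solve the 2 x 2 normal equations.
xx, xz, zz = dot(xs, xs), dot(xs, zs), dot(zs, zs)
xy, zy = dot(xs, ys), dot(zs, ys)
det = xx * zz - xz * xz
w = (xy * zz - xz * zy) / det
v = (xx * zy - xz * xy) / det
loss_decoupled = sum((w * x + v * z - y) ** 2
                     for x, z, y in zip(xs, zs, ys))
print(loss_linear, loss_decoupled)
```

Because the objective is convex in the two coefficients for fixed $u$, the solution of the normal equations is both a critical point and a global minimum, and here it fits the targets exactly while the linear-only model cannot.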

#### 4.1.2 Result

The following theorem shows that every differentiable local minimum achieves the global minimum value of $\tilde L_\theta^{(\mathrm{ResNet})}$ (the right-hand side in equation 4.2), which is no worse than the upper bound in proposition 1 and is strictly better than the upper bound as long as $z(x_i, u)$ or $\tilde f_\theta(x_i; \alpha, \epsilon, S)$ is nonnegligible. Indeed, the global minimum value of $\tilde L_\theta^{(\mathrm{ResNet})}$ (the right-hand side in equation 4.2) is no worse than the global minimum value of all models parameterized by the coefficients of the basis $x$ and $z(x; u)$, and further improvement is guaranteed through a nonnegligible $\tilde f_\theta(x_i; \alpha, \epsilon, S)$.

Theorem ^{9} also solves the first part of the open problem in the literature (Shamir, 2018) by discarding the assumption of $d_y = 1$. From the geometric view, theorem ^{9} states that the span $\tilde T_{f_X(\theta)}$ of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta, \epsilon]\}$ contains the output of the best basis model with the linear feature $x$ and the learned nonlinear feature $z(x_i; u)$. Similar to the examples in Figures 3 and 4, $\tilde T_{f_X(\theta)} \ne T_{f(\theta)}$ and the output of the best basis model with these features is contained in $\tilde T_{f_X(\theta)}$ but not in $T_{f(\theta)}$.

Unlike the recent study on ResNets (Kawaguchi & Bengio, 2019), our theorem ^{9} predicts the value of $L$ through the global minimum value of a large search space (i.e., the domain of $\tilde L_\theta^{(\mathrm{ResNet})}$) and is proven as a consequence of our general theory (i.e., theorem ^{7}) with a significantly different proof idea (see section 4.3) and with the novel geometric insight.

### 4.2 Example: Deep Nonlinear Networks with Locally Induced Partial Linear Structures

#### 4.2.1 Background

Given the difficulty of theoretically understanding deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that theoretically studying simplified networks (i.e., deep linear networks) is worthwhile. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that deep linear networks may exhibit several properties analogous to those of deep nonlinear networks. Accordingly, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett, Helmbold, & Long, 2019; Du & Hu, 2019).

Along this line, Laurent and Brecht (2018) recently proved the following main result, proposition 2, under the assumptions PA4, PA5, and PA6:

PA4: Every activation function is identity as $\sigma^{(l)}(q) = q$ for every $l \in \{1, \ldots, H\}$ (i.e., deep linear networks).

PA5: For any $y$, the function $\ell_y$ is convex and differentiable.

PA6: The thinnest layer is either the input layer or the output layer as $\min\{d_x, d_y\} \le \min\{d_1, \ldots, d_H\}$.

#### 4.2.2 Result

Instead of studying deep linear networks, we now consider a partial linear structure locally induced by a parameter vector with nonlinear activation functions. This relaxes the linearity assumption and extends our understanding of deep linear networks to deep nonlinear networks.

Intuitively, $\mathcal{J}_{n,t}[\theta]$ is a set of partial linear structures locally induced by a vector $\theta$, which is now formally defined as follows. Given a $\theta \in \mathbb{R}^{d_\theta}$, let $\mathcal{J}_{n,t}[\theta]$ be the set of all sets $J = \{J^{(t+1)}, \ldots, J^{(H+1)}\}$ such that each set $J = \{J^{(t+1)}, \ldots, J^{(H+1)}\} \in \mathcal{J}_{n,t}[\theta]$ satisfies the following conditions: there exists $\epsilon > 0$ such that for all $l \in \{t+1, t+2, \ldots, H+1\}$,

1. $J^{(l)} \subseteq \{1, \ldots, d_l\}$ with $|J^{(l)}| \ge n$.
2. $h^{(l)}(x_i, \theta')_k = (W^{(l)} h^{(l-1)}(x_i, \theta'))_k$ for all $(k, \theta', i) \in J^{(l)} \times B(\theta, \epsilon) \times \{1, \ldots, m\}$.
3. $W^{(l+1)}_{i,j} = 0$ for all $(i, j) \in (\{1, \ldots, d_{l+1}\} \setminus J^{(l+1)}) \times J^{(l)}$ if $l \le H - 1$.

Let $\Theta_{n,t}$ be the set of all parameter vectors $\theta$ such that $\mathcal{J}_{n,t}[\theta]$ is nonempty. As the definition reveals, a neural network with a $\theta \in \Theta_{d_y,t}$ can be a standard deep nonlinear neural network (with no linear units).

Theorem ^{11} is a special case of theorem ^{7}. A special case of theorem ^{11} then results in one of the main results in the literature regarding deep linear neural networks, that is, every local minimum is a global minimum. Consider any deep linear network with $d_y \le \min\{d_1, \ldots, d_H\}$. Then every local minimum $\theta$ is in $\Theta_{d_y,0} \setminus \tilde\Omega = \Theta_{d_y,0}$. Hence, theorem ^{11} is reduced to the statement that for any local minimum, $L(\theta) = \inf_{\alpha_h \in \mathbb{R}^{d_t}} \sum_{i=1}^m \lambda_i \ell_{y_i}(\sum_{l=0}^H \alpha_h^{(l+1)} h^{(l)}(x_i; u)) = \inf_{\alpha_x \in \mathbb{R}^{d_x}} \sum_{i=1}^m \lambda_i \ell_{y_i}(\alpha_x x_i)$, which is the global minimum value. Thus, every local minimum is a global minimum for any deep linear neural network with $d_y \le \min\{d_1, \ldots, d_H\}$. Therefore, theorem ^{11} successfully generalizes the recent previous result in the literature (proposition 2) for a common scenario of $d_y \le d_x$.

Beyond deep linear networks, theorem ^{11} illustrates both the benefit of the locally induced structure and the overparameterization for deep nonlinear networks. In the first term, $\sum_{l=t}^H \alpha_h^{(l+1)} h^{(l)}(x_i; u)$, in $L_{\theta,t}^{(\mathrm{ff})}$, we benefit by decreasing $t$ (a more locally induced structure) and increasing the width of the $l$th layer for any $l \ge t$ (overparameterization). The second term, $\tilde f_\theta(x_i; \alpha, \epsilon, S)$ in $L_{\theta,t}^{(\mathrm{ff})}$, is the general term that is always present from theorem ^{7}, where we benefit from increasing $d_\theta$ because $\alpha \in \mathbb{R}^{d_\theta \times |S|}$.

From the geometric view, theorem ^{11} captures the intuition that the span $\tilde T_{f_X(\theta)}$ of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta, \epsilon]\}$ contains the best basis model with the linear feature for deep linear networks, as well as the best basis models with more nonlinear features as more local structures arise. Similar to the examples in Figures 3 and 4, $\tilde T_{f_X(\theta)} \ne T_{f(\theta)}$ and the outputs of the best basis models with those features are contained in $\tilde T_{f_X(\theta)}$ but not in $T_{f(\theta)}$.

A similar local structure was recently considered in Kawaguchi, Huang, and Kaelbling (2019). However, both the problem settings and the obtained results largely differ from those in Kawaguchi et al. (2019). Furthermore, theorem ^{11} is proven as a consequence of our general theory (theorem ^{7}), and accordingly, the proofs largely differ from each other as well. Theorem ^{11} also differs from recent results on the gradient descent algorithm for deep linear networks (Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett et al., 2019; Du & Hu, 2019), since we analyze the loss surface instead of a specific algorithm and theorem ^{11} applies to deep nonlinear networks as well.

### 4.3 Proof Idea in Applications of Theorem ^{7}

Theorems ^{9} and ^{11} are simple consequences of theorem ^{7}, and their proof is illustrative as a means of using theorem ^{7} in future studies with different additional assumptions. The high-level idea behind the proofs in the applications of theorem ^{7} is captured in the geometric view of theorem ^{7} (see Figures 3 and 4). That is, given a desired guarantee, we check whether the space $T\u02dcfX(\theta )$ is expressive enough to contain the output of the desired model corresponding to the desired guarantee.

To simplify the use of theorem ^{7}, we provide the following lemma. This lemma states that the expressivity of the model $\tilde f_\theta(x; \alpha, \epsilon, S)$ with respect to $(\alpha, S)$ is the same as that of $\tilde f_\theta(x; \alpha, \epsilon, S) + \tilde f_\theta(x; \alpha', \epsilon, S')$ with respect to $(\alpha, \alpha', S, S')$. As shown in its proof, this is essentially because $\tilde f_\theta$ is linear in $\alpha$, and a union of two sets $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ and $S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ remains a finite subset of $V[\theta, \epsilon]$.

For any $\theta$, any $\epsilon \ge 0$, any $S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$, and any $x$, it holds that $\{\tilde f_\theta(x; \alpha, \epsilon, S) : \alpha \in \mathbb{R}^{d_\theta \times |S|}, S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]\} = \{\tilde f_\theta(x; \alpha, \epsilon, S) + \tilde f_\theta(x; \alpha', \epsilon, S') : \alpha \in \mathbb{R}^{d_\theta \times |S|}, \alpha' \in \mathbb{R}^{d_\theta \times |S'|}, S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]\}.$

Based on theorem ^{7} and lemma ^{12}, the proofs of theorems ^{9} and ^{11} are reduced to a simple search for $S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ such that the expressivity of $\tilde f_\theta(x_i; \alpha', \epsilon, S')$ with respect to $\alpha'$ is no worse than the expressivity of $\alpha_w x_i + \alpha_r z(x_i; u)$ with respect to $(\alpha_w, \alpha_r)$ (see theorem ^{9}) and that of $\sum_{l=t}^H \alpha_h^{(l+1)} h^{(l)}(x_i; u)$ with respect to $\alpha_h^{(l+1)}$ (see theorem ^{11}). In other words, $\{\tilde f_\theta(x_i; \alpha', \epsilon, S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\} \supseteq \{\alpha_w x_i + \alpha_r z(x_i; u) : \alpha_w \in \mathbb{R}^{d_y \times d_x}, \alpha_r \in \mathbb{R}^{d_y \times d_z}\}$ (see theorem ^{9}) and $\{\tilde f_\theta(x_i; \alpha', \epsilon, S') : \alpha' \in \mathbb{R}^{d_\theta \times |S'|}\} \supseteq \{\sum_{l=t}^H \alpha_h^{(l+1)} h^{(l)}(x_i; u) : \alpha_h \in \mathbb{R}^{d_t}\}$ (see theorem ^{11}). With only this search for $S'$, theorem ^{7} together with lemma ^{12} implies the desired statements for theorems ^{9} and ^{11} (see sections A.4 and A.5 in the appendix for further details). Thus, theorem ^{7} also enables simple proofs.

## 5 Conclusion

This study provided a general theory for nonconvex machine learning and demonstrated its power by proving new competitive theoretical results with it. In general, the proposed theory provides a mathematical tool to study the effects of hypothesis classes $f$, methods, and assumptions through the lens of the global optima of the perturbable gradient basis model class.

In convex machine learning with a model output $f(x;\theta)=\theta^\top x$ with a (nonlinear) feature output $x=\phi(x^{(\mathrm{raw})})$, achieving a critical point ensures the global optimality in the span of the fixed basis $x=\phi(x^{(\mathrm{raw})})$. In nonconvex machine learning, we have shown that achieving a critical point ensures the global optimality in the span of the gradient basis $\partial f_x(\theta)$, which coincides with the fixed basis $x=\phi(x^{(\mathrm{raw})})$ in the case of convex machine learning. Thus, whether convex or nonconvex, achieving a critical point ensures the global optimality in the span of some basis, which might be arbitrarily bad (or good) depending on the choice of the handcrafted basis $\phi(x^{(\mathrm{raw})})=\partial f_x(\theta)$ (for the convex case) or the induced basis $\partial f_x(\theta)$ (for the nonconvex case). Therefore, in terms of the loss values at critical points, nonconvex machine learning is theoretically as justified as the convex one, except in the case when a preference is given to $\phi(x^{(\mathrm{raw})})$ over $\partial f_x(\theta)$ (both of which can be arbitrarily bad or good). The same statement holds for local minima and the perturbable gradient basis.
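The coincidence of the gradient basis with the fixed basis in the convex case can be illustrated with a small numerical check. This is a sketch under stated assumptions: the feature map `phi` below is a hypothetical handcrafted basis chosen only for illustration, and the gradient is estimated by central finite differences rather than computed symbolically.

```python
def phi(x_raw):
    """Hypothetical handcrafted feature map (an assumption for illustration)."""
    return [1.0, x_raw, x_raw ** 2]

def f(x_raw, theta):
    """Convex-case model f(x; theta) = theta^T phi(x_raw)."""
    return sum(t * p for t, p in zip(theta, phi(x_raw)))

def numerical_gradient(x_raw, theta, h=1e-6):
    """Central finite-difference estimate of the gradient of f with respect to theta."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += h
        down = list(theta)
        down[k] -= h
        grad.append((f(x_raw, up) - f(x_raw, down)) / (2.0 * h))
    return grad

theta = [0.3, -1.2, 2.0]
x_raw = 1.5

gradient_basis = numerical_gradient(x_raw, theta)  # estimate of the gradient basis
fixed_basis = phi(x_raw)                           # handcrafted basis

max_gap = max(abs(g - p) for g, p in zip(gradient_basis, fixed_basis))
print(max_gap)
```

Because the model is linear in $\theta$, the estimated gradient matches $\phi(x^{(\mathrm{raw})})$ up to rounding, and it does not depend on the value of $\theta$ used.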

## Appendix: Proofs of Theoretical Results

In this appendix, we provide complete proofs of the theoretical results.

### A.1 Proof of Theorem 3

The proof of theorem 3 combines lemma 13 with assumptions 1 and 2 by taking advantage of the structure of the objective function $L$. Although lemma 13 is rather weak and assumptions 1 and 2 are mild (in the sense that they usually hold in practice), the right combination of these with the structure of $L$ can prove the desired statement.

Proof of Lemma 13.

Let $\theta$ be an arbitrary critical point $\theta\in(\mathbb{R}^{d_\theta}\setminus\Omega)$ of $L$. Since $\ell_{y_i}:\mathbb{R}^{d_y}\to\mathbb{R}$ is assumed to be differentiable and $f_{x_i}\in\mathbb{R}^{d_y}$ is differentiable at the given $\theta$, the composition $(\ell_{y_i}\circ f_{x_i})$ is also differentiable, and $\partial_k(\ell_{y_i}\circ f_{x_i})=\partial\ell_{y_i}(f_{x_i}(\theta))\partial_k f_{x_i}(\theta)$. In addition, $L$ is differentiable because a sum of differentiable functions is differentiable. Therefore, for any critical point $\theta$ of $L$, we have that $\partial L(\theta)=0$, and, hence, $\partial_k L(\theta)=\sum_{i=1}^{m}\lambda_i\partial\ell_{y_i}(f_{x_i}(\theta))\partial_k f_{x_i}(\theta)=0$ for any $k\in\{1,\dots,d_\theta\}$, from the linearity of the differentiation operation. $\square$
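The chain-rule expression for $\partial_k L(\theta)$ used in this proof can be sanity-checked numerically. The sketch below assumes a hypothetical toy model $f_x(\theta)=\theta_1\tanh(\theta_2 x)$, the squared loss, and made-up data and weights $\lambda_i$; none of these come from the article. It compares the chain-rule formula with a central finite-difference estimate at an arbitrary differentiable point (at a critical point both would be zero).

```python
import math

def f(x, theta):
    """Toy scalar model f_x(theta) = theta[0] * tanh(theta[1] * x) (an assumption)."""
    return theta[0] * math.tanh(theta[1] * x)

def grad_f(x, theta):
    """Analytic partial derivatives d_k f_x(theta) for k = 0, 1."""
    t = math.tanh(theta[1] * x)
    return [t, theta[0] * x * (1.0 - t * t)]

def objective(theta, data, lam):
    """L(theta) = sum_i lam_i * (f_{x_i}(theta) - y_i)^2, using the squared loss."""
    return sum(l * (f(x, theta) - y) ** 2 for l, (x, y) in zip(lam, data))

def grad_chain_rule(theta, data, lam):
    """d_k L(theta) = sum_i lam_i * l'(f_{x_i}(theta)) * d_k f_{x_i}(theta)."""
    grad = [0.0] * len(theta)
    for l, (x, y) in zip(lam, data):
        outer = 2.0 * (f(x, theta) - y)  # derivative of the squared loss
        inner = grad_f(x, theta)
        for k in range(len(theta)):
            grad[k] += l * outer * inner[k]
    return grad

def grad_numeric(theta, data, lam, h=1e-6):
    """Central finite-difference estimate of the gradient of L."""
    grad = []
    for k in range(len(theta)):
        up = list(theta)
        up[k] += h
        down = list(theta)
        down[k] -= h
        grad.append((objective(up, data, lam) - objective(down, data, lam)) / (2.0 * h))
    return grad

data = [(-1.0, 0.4), (0.5, -0.2), (2.0, 1.1)]  # made-up (x_i, y_i) pairs
lam = [1.0 / 3.0] * 3
theta = [0.7, -1.3]

chain = grad_chain_rule(theta, data, lam)
numeric = grad_numeric(theta, data, lam)
max_gap = max(abs(a - b) for a, b in zip(chain, numeric))
print(max_gap)
```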

Proof of Theorem 3.

### A.2 Proof of Theorem 7

The proof of theorem 7 uses lemma 14, the structure of the objective function $L$, and assumptions 1 and 2.

Proof of Lemma 14.

Since $\theta\in(\mathbb{R}^{d_\theta}\setminus\tilde{\Omega})$, there exists $\epsilon_2>0$ such that $f_{x_1},\dots,f_{x_m}$ are differentiable in $B(\theta,\epsilon_2)$. Since $\ell_{y_i}:\mathbb{R}^{d_y}\to\mathbb{R}$ is assumed to be differentiable and $f_{x_i}\in\mathbb{R}^{d_y}$ is differentiable in $B(\theta,\epsilon_2)$, the composition $(\ell_{y_i}\circ f_{x_i})$ is also differentiable, and $\partial_k(\ell_{y_i}\circ f_{x_i})=\partial\ell_{y_i}(f_{x_i}(\theta))\partial_k f_{x_i}(\theta)$ in $B(\theta,\epsilon_2)$. In addition, $L$ is differentiable in $B(\theta,\epsilon_2)$ because a sum of differentiable functions is differentiable.

Proof of Theorem 7.

From lemma 14, there exists $\epsilon_0>0$ such that for any $\epsilon\in[0,\epsilon_0)$, any $S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$, and any $\alpha\in\mathbb{R}^{d_\theta\times|S|}$,

### A.3 Proof of Lemma 12

As its proof shows, lemma 12 is a simple consequence of the following facts: $\tilde{f}_\theta$ is linear in $\alpha$, and a union of two sets $S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ and $S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ is still a finite subset of $V[\theta,\epsilon]$.

Proof of Lemma 12.

### A.4 Proof of Theorem 9

As its proof shows, thanks to theorem 7 and lemma 12, the remaining task in proving theorem 9 is to find a set $S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ such that $\{\tilde{f}_\theta(x_i;\alpha',\epsilon,S'):\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\}\supseteq\{\alpha_w x_i+\alpha_r z(x_i;u):\alpha_w\in\mathbb{R}^{d_y\times d_x},\alpha_r\in\mathbb{R}^{d_y\times d_z}\}$. Let $\mathrm{Null}(M)$ be the null space of a matrix $M$.

Proof of Theorem 9.

From theorem 7, there exists $\epsilon_0>0$ such that for any $\epsilon\in[0,\epsilon_0)$,

lemma 12. On the other hand, since the set in the first line is a subset of the set in the last line, $\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S):\alpha\in\mathbb{R}^{d_\theta\times|S|},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}=\{\tilde{f}_\theta(x_i;\alpha,\epsilon,S)+\alpha_w x_i+\alpha_r z(x_i;u):\alpha\in\mathbb{R}^{d_\theta\times|S|},\alpha_w\in\mathbb{R}^{d_y\times d_x},\alpha_r\in\mathbb{R}^{d_y\times d_z},S\subseteq_{\mathrm{fin}}V[\theta,\epsilon]\}$. This immediately implies the desired statement from theorem 7. $\square$

### A.5 Proof of Theorem 11

As its proof shows, thanks to theorem 7 and lemma 12, the remaining task in proving theorem 11 is to find a set $S'\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ such that $\{\tilde{f}_\theta(x_i;\alpha',\epsilon,S'):\alpha'\in\mathbb{R}^{d_\theta\times|S'|}\}\supseteq\{\sum_{l=t}^{H}\alpha_h^{(l+1)}h^{(l)}(x_i;u):\alpha_h\in\mathbb{R}^{d_t}\}$. Let $M^{(l')}\cdots M^{(l+1)}M^{(l)}=I$ if $l>l'$.

Proof of Theorem 11.

From theorem 7, there exists $\epsilon_0>0$ such that for any $\epsilon\in[0,\epsilon_0)$, $L(\theta)=\inf_{S\subseteq_{\mathrm{fin}}V[\theta,\epsilon],\,\alpha\in\mathbb{R}^{d_\theta\times|S|}}\sum_{i=1}^{m}\lambda_i\ell(\tilde{f}_\theta(x_i;\alpha,\epsilon,S),y_i),$ where $\tilde{f}_\theta(x_i;\alpha,\epsilon,S)=\sum_{k=1}^{d_\theta}\sum_{j=1}^{|S|}\alpha_{k,j}\partial_k f_{x_i}(\theta+\epsilon S_j).$

theorem 7 and the above form of $f_{x_i}$ hold in $\tilde{B}(\theta,\epsilon)$. Let $R^{(l)}=[A^{(l)}\ C^{(l)}]$.

- Find $S^{(l)}$ with $l=H$: Since $(\partial_{\mathrm{vec}(R^{(H+1)})}f_{x_i}(\theta))\mathrm{vec}(\alpha_h^{(H+1)})=\alpha_h^{(H+1)}h^{(H)}(x_i;\theta)$, $S^{(H)}=\{0\}\subseteq_{\mathrm{fin}}V[\theta,\epsilon]$ (where $0\in\mathbb{R}^{d_\theta}$) is the desired set.
- Find $S^{(l)}$ with $l\in\{t,\dots,H-1\}$: With $\alpha_r^{(l+1)}\in\mathbb{R}^{d_{l+1}\times d_l}$, we have that $(\partial_{\mathrm{vec}(R^{(l+1)})}f_{x_i}(\theta))\mathrm{vec}(\alpha_r^{(l+1)})=A^{(H+1)}\cdots A^{(l+2)}\alpha_r^{(l+1)}h^{(l)}(x_i;\theta).$
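The identity used in the $l=H$ step can be illustrated on a simplified network. The sketch below assumes a hypothetical two-layer model $f(x)=R\tanh(W_1x)$, where `R` plays the role of the last-layer weights and `hidden` plays the role of $h^{(H)}$; the sizes and numbers are made up. Because the output is linear in `R`, the directional derivative with respect to `R` along any matrix `alpha` equals `alpha` applied to the hidden representation.

```python
import math

def matvec(M, v):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def hidden(W1, x):
    """Hidden representation h(x) = tanh(W1 x), applied elementwise."""
    return [math.tanh(z) for z in matvec(W1, x)]

def f(R, W1, x):
    """Network output f(x) = R h(x), linear in the last-layer weights R."""
    return matvec(R, hidden(W1, x))

W1 = [[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]]
R = [[1.0, -1.0, 0.5], [0.2, 0.4, -0.6]]
alpha = [[0.3, 0.0, -1.1], [2.0, -0.5, 0.7]]  # arbitrary direction, same shape as R
x = [1.0, -2.0]

# Directional derivative of f with respect to R along alpha (central differences).
t = 1e-5
R_plus = [[r + t * a for r, a in zip(rr, ar)] for rr, ar in zip(R, alpha)]
R_minus = [[r - t * a for r, a in zip(rr, ar)] for rr, ar in zip(R, alpha)]
directional = [(p - q) / (2.0 * t)
               for p, q in zip(f(R_plus, W1, x), f(R_minus, W1, x))]

# Closed form from the identity: alpha applied to the hidden representation.
target = matvec(alpha, hidden(W1, x))

gap = max(abs(d - s) for d, s in zip(directional, target))
print(gap)
```

The derivative is taken at the unperturbed parameters, which corresponds to the choice $S^{(H)}=\{0\}$ in the first step.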

From lemma 12, we can combine these, yielding

theorem 7. $\square$

## Note

1. For example, choose the first layer's weight matrix $W^{(1)}$ such that for all $i\in\{1,\dots,m\}$, $(W^{(1)}x_i)_i>0$ and $(W^{(1)}x_i)_{i'}\le 0$ for all $i'\ne i$. This can be achieved by choosing the $i$th row of $W^{(1)}$ to be $[(x_i^{(\mathrm{raw})})^\top,\epsilon-1]$ with $0<\epsilon\le\delta$ for $i\le m$. Then choose the weight matrices for the $l$th layer for all $l\ge 2$ such that for all $j$, $W^{(l)}_{j,j}\ne 0$ and $W^{(l)}_{j',j}=0$ for all $j'\ne j$. This guarantees $\mathrm{rank}([\phi(x_i;u)]_{i=1}^{m})\ge m$.
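The first-layer construction in this note can be sketched in a few lines. The example assumes, beyond what the note states, that the raw inputs have unit norm, that each input is the raw input with a constant 1 appended, that $\delta$ satisfies $(x_i^{(\mathrm{raw})})^\top x_{i'}^{(\mathrm{raw})}\le 1-\delta$ for $i'\ne i$, and that the activation is ReLU; the data points are made up.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Hypothetical unit-norm raw inputs: distinct points on the unit circle.
angles = [0.0, 1.0, 2.5]
x_raw = [[math.cos(a), math.sin(a)] for a in angles]
m = len(x_raw)

# delta is chosen so that <x_i, x_j> <= 1 - delta for all i != j (assumed separation).
max_inner = max(dot(x_raw[i], x_raw[j])
                for i in range(m) for j in range(m) if i != j)
delta = 1.0 - max_inner
eps = 0.5 * delta  # any value with 0 < eps <= delta

# Inputs carry a constant 1 as the last coordinate (assumption).
x = [xi + [1.0] for xi in x_raw]

# The i-th row of the first-layer weight matrix is [(x_i_raw)^T, eps - 1].
W1 = [xi + [eps - 1.0] for xi in x_raw]

# pre[i][r] is the r-th coordinate of W1 applied to x_i.
pre = [[dot(W1[r], x[i]) for r in range(m)] for i in range(m)]

# After ReLU, only the i-th unit is active for the i-th input.
features = [[max(z, 0.0) for z in row] for row in pre]

for row in features:
    print(row)
```

The resulting first-layer feature matrix is diagonal with positive diagonal entries, so it has rank $m$; the diagonal structure chosen for the later layers in the note is what carries this rank through to the last hidden layer.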

## Acknowledgments

We gratefully acknowledge support from NSF grants 1523767 and 1723381, AFOSR grant FA9550-17-1-0165, ONR grant N00014-18-1-2847, Honda Research, and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.