## Abstract

In this paper, we analyze the effects of depth and width on the quality of local minima, without the strong overparameterization and simplification assumptions made in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic data set, as well as the MNIST, CIFAR-10, and SVHN data sets. When compared to previous studies with strong overparameterization assumptions, the results in this letter do not require overparameterization and instead show the gradual effects of overparameterization as consequences of general results.

## 1 Introduction

Deep learning with neural networks has been a significant practical success in many fields, including computer vision, machine learning, and artificial intelligence. Along with its practical success, deep learning has been theoretically analyzed and shown to be attractive in terms of its expressive power. For example, neural networks with one hidden layer can approximate any continuous function (Leshno, Lin, Pinkus, & Schocken, 1993; Barron, 1993), and deeper neural networks enable us to approximate functions of certain classes with fewer parameters (Montufar, Pascanu, Cho, & Bengio, 2014; Livni, Shalev-Shwartz, & Shamir, 2014; Telgarsky, 2016). However, training deep learning models requires us to work with a seemingly intractable problem: nonconvex and high-dimensional optimization. Finding a global minimum of a general nonconvex function is NP-hard (Murty & Kabadi, 1987), and nonconvex optimization to train certain types of neural networks is also known to be NP-hard (Blum & Rivest, 1992). These hardness results pose a serious concern only for high-dimensional problems, because global optimization methods can efficiently approximate global minima without convexity in relatively low-dimensional problems (Kawaguchi, Kaelbling, & Lozano-Pérez, 2015).

A hope is that beyond the worst-case scenarios, practical deep learning allows some additional structure or assumption to make nonconvex high-dimensional optimization tractable. Recently, it has been shown with strong simplification assumptions that there are novel loss landscape structures in deep learning optimization that may play a role in making the optimization tractable (Dauphin et al., 2014; Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016). Another key observation is that if a neural network is strongly overparameterized so that it can memorize any data set of a fixed size, then all stationary points (including all local minima and saddle points) become global minima, with some nondegeneracy assumptions. This observation was explained by Livni et al. (2014) and further refined by Nguyen and Hein (2017, 2018). However, these previous results (Livni et al., 2014; Nguyen and Hein, 2017, 2018) require strong overparameterization by assuming not only that a network's width is larger than the data set size but also that optimizing only a single layer (the last layer or some hidden layer) can memorize any data set based on an assumed condition on the rank or nondegeneracy of other layers.

In this letter, we analyze the effects of depth and width on the values of local minima, without the strong overparameterization and simplification assumptions in the literature. As a result, we prove quantitative upper bounds on the quality of local minima, which shows that the values of local minima of neural networks are guaranteed to be no worse than the globally optimal values of corresponding classical machine learning models, and the guarantee can improve as depth and width increase.

## 2 Preliminaries

This section defines the optimization problem considered in this letter and introduces the basic notation.

### 2.1 Problem Formulation

Let $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ be an input vector and a target vector, respectively. Let $\{(x_i, y_i)\}_{i=1}^m$ be a training data set of size $m$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^n$, define $[M^{(j)}]_{j=1}^n := [M^{(1)} \ M^{(2)} \ \cdots \ M^{(n)}]$ to be the block matrix with column blocks $M^{(1)}, M^{(2)}, \ldots, M^{(n)}$. Define the training data matrices as $X := ([x_i]_{i=1}^m)^\top \in \mathbb{R}^{m \times d_x}$ and $Y := ([y_i]_{i=1}^m)^\top \in \mathbb{R}^{m \times d_y}$.

### 2.2 Additional Notation

Define $P[M]$ to be the orthogonal projection matrix onto the column space (or range space) of a matrix $M$. Let $P_N[M]$ be the orthogonal projection matrix onto the null space (or kernel space) of a matrix $M^\top$. For a matrix $M \in \mathbb{R}^{d \times d'}$, we denote the standard vectorization of the matrix $M$ as $\mathrm{vec}(M) = [M_{1,1}, \ldots, M_{d,1}, M_{1,2}, \ldots, M_{d,2}, \ldots, M_{1,d'}, \ldots, M_{d,d'}]^\top$.
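These operators behave as expected in a quick numerical check; the following NumPy sketch (our illustration, not part of the letter) verifies the projection identities and the column-stacking order of $\mathrm{vec}(\cdot)$:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))

# P[M]: orthogonal projection onto the column space of M.
P = M @ np.linalg.pinv(M)
# P_N[M]: orthogonal projection onto the null space of M^T,
# i.e., the orthogonal complement of the column space of M.
P_N = np.eye(5) - P

# Orthogonal projections are symmetric and idempotent.
assert np.allclose(P, P.T) and np.allclose(P @ P, P)

# Any vector splits into its column-space and null-space components.
v = rng.standard_normal(5)
assert np.allclose(P @ v + P_N @ v, v)

# vec(M) stacks columns: [M_{1,1}, ..., M_{5,1}, M_{1,2}, ...].
vec_M = M.flatten(order="F")  # column-major (Fortran) order
assert np.allclose(vec_M[:5], M[:, 0])
```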

## 3 Shallow Nonlinear Neural Networks with Scalar-Valued Output

Before presenting our main results for deep nonlinear neural networks, this section provides the results for shallow networks with a single hidden layer (or three-layer networks, counting the input and output layers) and scalar-valued output (i.e., $d_y = 1$) to illustrate some of the ideas behind the discussed effects of depth and width on local minima.

### 3.1 Analysis with ReLU Activations

Under this setting, proposition 1 provides an equation that holds at local minima and illustrates the effect of width for shallow ReLU neural networks.

Proposition 1 is an immediate consequence of our general result (see theorem 1) in the next section (the proof is provided in section A.1). In the rest of this section, we provide a proof sketch of proposition 1.

### 3.2 Probabilistic Bound

From equation 2.2 in proposition 1, the loss $L(\theta)$ at differentiable local minima is expected to tend to get smaller as the width $d$ of the hidden layer gets larger. To further support this theoretical observation, this section obtains a probabilistic upper bound on the loss $L(\theta)$ for white noise data by fixing the activation patterns $\Lambda^{1,k}$ for $k \in \{1, 2, \ldots, d\}$ and assuming that the data matrix $[X \ Y]$ is a random gaussian matrix, with each entry having mean zero and variance one.

This definition of $\Lambda^{1,k}_{ii}$ generalizes the corresponding definition in section 3.1. Proposition 1 holds for this generalized activation pattern by simply replacing the previous definition of $\Lambda^{1,k}_{ii}$ with this more general definition. This can be seen from the proof sketch in section 3.1 and is later formalized in the proof of theorem 1.

We denote the vector consisting of the diagonal entries of $\Lambda^{1,k}$ by $\Lambda_k \in \mathbb{R}^m$ for $k \in \{1, 2, \ldots, d\}$. Define the activation pattern matrix as $\Lambda := [\Lambda_k]_{k=1}^d \in \mathbb{R}^{m \times d}$. For any index set $I \subseteq \{1, 2, \ldots, m\}$, let $\Lambda_I$ denote the submatrix of $\Lambda$ that consists of its rows with indices in $I$. Let $s_{\min}(\Lambda_I)$ be the smallest singular value of $\Lambda_I$.

Proposition 2 proves that $L(\theta) \approx (1 - d_x d / m)\|Y\|_2^2 / 2$ in the regime $d_x d \ll m$, and $L(\theta) = 0$ in the regime $d_x d \gg m$, under the corresponding conditions on $\Lambda$; that is, $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \ldots, m\}$ such that $|I| \ge m/2$ in the regime $d_x d \ll m$, and $|I| \le d/2$ in the regime $d_x d \gg m$. This supports our theoretical observation that increasing width helps improve the quality of local minima.
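A small simulation illustrates both regimes. The sketch below is our construction (the block structure of the feature matrix follows the proof outline; the sizes are arbitrary): it solves the least-squares problem that a differentiable local minimum with fixed activation patterns corresponds to and compares the residual loss with $(1 - d_x d / m)\|Y\|_2^2 / 2$:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_min_loss(m, d_x, d):
    # Gaussian data and a fixed 0/1 activation pattern matrix Lambda.
    X = rng.standard_normal((m, d_x))
    Y = rng.standard_normal(m)
    Lam = (rng.random((m, d)) < 0.5).astype(float)
    # Feature matrix with one d_x-column block per hidden unit k:
    # block k is diag(Lambda_k) X.
    D = np.hstack([Lam[:, [k]] * X for k in range(d)])  # m x (d_x d)
    # Loss with these fixed patterns: squared residual of the
    # best linear fit on these features, halved.
    Y_hat = D @ np.linalg.lstsq(D, Y, rcond=None)[0]
    return 0.5 * np.sum((Y - Y_hat) ** 2), 0.5 * np.sum(Y**2)

# Regime d_x d << m: loss close to (1 - d_x d / m) * ||Y||_2^2 / 2.
loss_under, half_Y2 = local_min_loss(m=2000, d_x=5, d=20)  # d_x d = 100
print(loss_under / half_Y2)  # close to 1 - 100/2000 = 0.95

# Regime d_x d >> m: loss is (numerically) zero.
loss_over, _ = local_min_loss(m=100, d_x=5, d=200)         # d_x d = 1000
print(loss_over)
```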

Fix the activation pattern matrix $\Lambda = [\Lambda_k]_{k=1}^d \in \mathbb{R}^{m \times d}$. Let $[X \ Y]$ be a random $m \times (d_x + 1)$ gaussian matrix, with each entry having mean zero and variance one. Then the loss $L(\theta)$ as in equation 3.2 satisfies both of the following statements:

- i. If $m \ge 64 \ln^2(d_x d m / \delta^2)\, d_x d$ and $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \ldots, m\}$ with $|I| \ge m/2$, then $L(\theta) \le \left(1 + \frac{6t}{m}\right) \frac{m - d_x d}{2m} \|Y\|_2^2$ with probability at least $1 - e^{-m/(64 \ln(d_x d m / \delta^2))} - 2e^{-t}$.
- ii. If $d d_x \ge 2m \ln^2(m d / \delta)$ with $d_x \ge \ln^2(d m)$ and $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \ldots, m\}$ with $|I| \le d/2$, then $L(\theta) = 0$ with probability at least $1 - 2e^{-d_x/20}$.

The proof of proposition 2 is provided in appendix B. In that proof, we first rewrite the loss $L(\theta)$ as the projection of $Y$ onto the null space of an $m \times d d_0$ matrix $\tilde{D}$, with an explicit expression in terms of the activation pattern matrix $\Lambda$ and the data matrix $X$. By our assumption, the data matrix $X$ is a random gaussian matrix, so the projection matrix $\tilde{D}$ is also a random matrix. Proposition 2 then boils down to understanding the rank of the projection matrix $\tilde{D}$, and we proceed to show that $\tilde{D}$ has the largest possible rank, $\min\{d d_0, m\}$, with high probability. In fact, we derive quantitative estimates on the smallest singular value of $\tilde{D}$. The main difficulties are that the columns of the matrix $\tilde{D}$ are correlated and that the variances of different entries vary. Our approach to obtaining quantitative estimates on the smallest singular value of $\tilde{D}$ combines an epsilon-net argument with an iterative argument.
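The rank claim can be checked numerically. In the sketch below (our illustration, with arbitrary small sizes), $\tilde{D}$ is built from a random gaussian $X$ and a fixed random 0/1 activation pattern matrix, and it attains the largest possible rank $\min\{d d_0, m\}$ in both regimes:

```python
import numpy as np

rng = np.random.default_rng(2)

def dtilde_rank(m, d0, d):
    # One d0-column block per hidden unit k: diag(Lambda_k) X.
    X = rng.standard_normal((m, d0))
    Lam = (rng.random((m, d)) < 0.5).astype(float)
    D = np.hstack([Lam[:, [k]] * X for k in range(d)])  # m x (d d0)
    return np.linalg.matrix_rank(D)

# Regime d d0 << m: full column rank d d0 (with high probability).
assert dtilde_rank(m=300, d0=4, d=10) == 40
# Regime d d0 >> m: full row rank m (with high probability).
assert dtilde_rank(m=50, d0=4, d=40) == 50
```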

In the regime $d d_0 \gg m$, results similar to proposition 2ii were obtained under certain diversity assumptions on the entries of the weight matrices in a previous study (Xie, Liang, & Song, 2017). When compared with that study, proposition 2 specifies precise relations between the size $d d_0$ of the neural network and the size $m$ of the data set and also holds true in the regime $d d_0 \ll m$. Moreover, our proof arguments for proposition 2ii are different. Xie et al. (2017), under the assumption that $d d_0 \gg m$, show that $\tilde{D}\tilde{D}^\top$ is close to its expectation in the sense of spectral norm. As a consequence, the lower bound on the smallest eigenvalue of $\mathbb{E}[\tilde{D}\tilde{D}^\top]$ gives the lower bound for the smallest singular value of $\tilde{D}$.

However, proposition 2 assumes a gaussian data matrix, which may be a substantial limitation. The proof of proposition 2 relies on the concentration properties of the gaussian distribution. Whereas a similar proof could extend proposition 2 to a nongaussian distribution with these properties (e.g., distributions with subgaussian tails), it would be challenging to use a similar proof for a general distribution without such properties.

## 4 Deep Nonlinear Neural Networks

Let $H$ be the number of hidden layers and $d_l$ be the width (or, equivalently, the number of units) of the $l$th hidden layer. To theoretically analyze concrete phenomena, the rest of this letter focuses on fully connected feedforward networks with various depths $H \ge 1$ and widths $d_l \ge 1$, using rectified linear units (ReLUs), leaky ReLUs, and absolute value activations, evaluated with the squared loss function. In the rest of this letter, the (finite) depth $H$ can be arbitrarily large and the (finite) widths $d_l$ can arbitrarily differ among different layers.

### 4.1 Model and Notation

This definition of $\Lambda^{l,k}_{ii}$ generalizes the corresponding definition in section 3. Let $I_d$ be the identity matrix of size $d$ by $d$. Define $M \otimes M'$ to be the Kronecker product of matrices $M$ and $M'$. Given a matrix $M$, $M_{\cdot,j}$ and $M_{i,\cdot}$ denote the $j$th column vector of $M$ and the $i$th row vector of $M$, respectively.

### 4.2 Theoretical Result

For the standard deep nonlinear neural networks, theorem 1 provides an equation that holds at local minima and illustrates the effect of depth and width. Let $d_l' := d_l$ for all $l \in \{1, \ldots, H\}$ and $d_{H+1}' := 1$.

The complete proof of theorem 1 is provided in section A.1. Theorem 1 is a generalization of proposition 1. Accordingly, its proof follows the proof sketch presented in the previous section for proposition 1.

Unlike previous studies (Livni et al., 2014; Nguyen & Hein, 2017, 2018), theorem 1 requires no overparameterization such as $d_l \ge m$. Instead, it provides quantitative gradual effects of depth and width on local minima, from no overparameterization to overparameterization. Notably, theorem 1 shows the effect of overparameterization in terms of depth as well as width, which also differs from the results of previous studies that consider overparameterization in terms of width (Livni et al., 2014; Nguyen & Hein, 2017, 2018).

The proof idea behind these previous studies with strong overparameterization is captured in the discussion after equation 3.3: with strong overparameterization such that $d_l \ge m$ and $\mathrm{rank}(D^{(1)}) \ge m$, $\nabla_{\mathrm{vec}(W)} \hat{Y}(X, \theta) \in \mathbb{R}^{d_l \times m}$ is left-invertible, and hence every local minimum is a global minimum with zero training error. Here, $\mathrm{rank}(M)$ represents the rank of a matrix $M$. The proof idea behind theorem 1 differs from those, as shown in section 3.1. What is still missing in theorem 1 is the ability to provide a prior guarantee on $L(\theta)$ without strong overparameterization, which is addressed in sections 3.2 and 5 for some special cases but left as an open problem for other cases.

### 4.3 Experiments

In the synthetic data set, the data points $\{(x_i, y_i)\}_{i=1}^m$ were randomly generated by a ground-truth, fully connected feedforward neural network with $H = 7$, $d_l = 50$ for all $l \in \{1, 2, \ldots, H\}$, the tanh activation function, $(x, y) \in \mathbb{R}^{10} \times \mathbb{R}$, and $m = 5000$. MNIST (LeCun, Bottou, Bengio, & Haffner, 1998), a popular data set for recognizing handwritten digits, contains $28 \times 28$ gray-scale images. The CIFAR-10 (Krizhevsky & Hinton, 2009) data set consists of $32 \times 32$ color images that contain different types of objects such as “airplane,” “automobile,” and “cat.” The Street View House Numbers (SVHN) data set (Netzer et al., 2011) contains house digits collected by Google Street View, and we used the $32 \times 32$ color image version for the standard task of predicting the digits in the middle of these images. In order to reduce the computational cost, for the image data sets (MNIST, CIFAR-10, and SVHN), we center-cropped the images ($24 \times 24$ for MNIST and $28 \times 28$ for CIFAR-10 and SVHN), then resized them to smaller gray-scale images ($8 \times 8$ for MNIST and $12 \times 12$ for CIFAR-10 and SVHN), and used randomly selected subsets of the data sets with size $m = 10,000$ as the training data sets.

For all the data sets, the network architecture was fixed to be a fully connected feedforward network with the ReLU activation function. For each data set, the values of $J(\theta)$ were computed with initial random weights drawn from a normal distribution with zero mean and normalized standard deviation ($1/d_l$) and with trained weights at the end of 40 training epochs. (Additional experimental details are presented in appendix C.)

Figure 1 shows the results with the synthetic data set, as well as the MNIST, CIFAR-10, and SVHN data sets. As can be seen, the values of $J(\theta)$ tend to decrease toward zero (and hence the global minimum value) as the width or depth of neural networks increases. In theory, the values of $J(\theta)$ may not improve as much as desired along depth and width if the representations corresponding to each unit and each layer are redundant in the sense of linear dependence of the columns of $D^{(l)}_{k_l}(\theta)$ (see theorem 1). Intuitively, at initial random weights, this redundancy is mitigated by the randomness of the weights, and hence a major concern is whether such redundancy arises and $J(\theta)$ degrades along with training. From Figure 1, it can also be noticed that the values of $J(\theta)$ tend to decrease along with training. These empirical results partially support our theoretical observation that increasing the depth and width can improve the quality of local minima.

## 5 Deep Nonlinear Neural Networks with Local Structure

Given the scarcity of theoretical understanding of the optimality of deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that it is valuable to theoretically study simplified models: deep linear neural networks. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that in terms of optimization, deep linear networks exhibited several properties similar to those of deep nonlinear networks. Following these observations, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018), as a step toward the goal of establishing the optimization theory of deep learning.

As another step toward the goal, this section discards the strong linearity assumption and considers a locally induced nonlinear-linear structure in deep nonlinear networks with piecewise linear activation functions such as ReLUs, leaky ReLUs, and absolute value activations.

### 5.1 Locally Induced Nonlinear-Linear Structure

In this section, we describe how a standard deep nonlinear neural network can induce nonlinear-linear structure. The nonlinear-linear structure considered in this letter is defined in definition 1: condition i simply defines the index subsets $S(l)$ that pick out the relevant subset of units at each layer $l$, condition ii requires the existence of $n$ linearly acting units, and condition iii imposes weak separability of edges.

A parameter vector $\theta$ is said to induce $(n, t)$ weakly separated linear units on a training input data set $X$ if there exist $(H + 1 - t)$ sets $S^{(t+1)}, S^{(t+2)}, \ldots, S^{(H+1)}$ such that for all $l \in \{t+1, t+2, \ldots, H+1\}$, the following three conditions hold:

- i.
$S^{(l)} \subseteq \{1, \ldots, d_l\}$ with $|S^{(l)}| \ge n$.

- ii.
$\Phi^{(l)}(X, \theta)_{\cdot,k} = \Phi^{(l-1)}(X, \theta) W^{(l)}(\theta)_{\cdot,k}$ for all $k \in S^{(l)}$.

- iii.
$W^{(l+1)}(\theta)_{k',k} = 0$ for all $(k', k) \in S^{(l)} \times (\{1, \ldots, d_{l+1}\} \setminus S^{(l+1)})$ if $l \le H - 1$.

Given a training input data set $X$, let $\Theta_{n,t}$ be the set of all parameter vectors that induce $(n, t)$ weakly separated linear units on the training input data set $X$ that defines the total loss $L(\theta)$ in equation 2.1. For standard deep nonlinear neural networks, all parameter vectors $\theta$ are in $\Theta_{d_{H+1},H}$, and some parameter vectors $\theta$ are in $\Theta_{n,t}$ for different values of $(n, t)$. Figure 2a illustrates locally induced structures for $\theta \in \Theta_{1,0}$. For a parameter $\theta$ to be in $\Theta_{n,t}$, definition 1 requires only the existence of a portion $n/d_l$ of units to act linearly on the particular training data set merely at the particular $\theta$. Thus, all units can be nonlinear, act nonlinearly on the training data set outside of some parameters $\theta$, and operate nonlinearly always on other inputs $x$, for example, in a test data set or a different training data set. The weak separability requires that the edges going from the $n$ units to the rest of the network are negligible. The weak separability does not require the $n$ units to be separated from the rest of the neural network.

Here, a neural network with $\theta \in \Theta_{n,t}$ can be a standard deep nonlinear neural network (without any linear units in its architecture), a deep linear neural network (with all activation functions being linear), or a combination of these cases. Whereas a standard deep nonlinear neural network can naturally have parameters $\theta \in \Theta_{n,t}$, it is possible to guarantee that all parameters $\theta$ are in $\Theta_{n,t}$ with desired $(n, t)$ simply by using corresponding network architectures. For standard deep nonlinear neural networks, one can also restrict all relevant convergent solution parameters $\theta$ to be in $\Theta_{n,t}$ by using some corresponding learning algorithms. Our theoretical results hold for all of these cases.
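As a concrete illustration, the three conditions of definition 1 can be checked mechanically. The following sketch is our code (it assumes a bias-free ReLU network with a linear output layer, and all names and sizes are ours); it verifies that a hand-built parameter vector induces $(1, 0)$ weakly separated linear units:

```python
import numpy as np

def induces_weakly_separated(X, Ws, S, n, t):
    """Check conditions i-iii of definition 1 for index sets S[l]
    (l = t+1, ..., H+1), for a bias-free ReLU network with weights
    Ws = [W^(1), ..., W^(H+1)] and a linear output layer."""
    H = len(Ws) - 1
    Phi = X
    for l in range(1, t + 1):                 # propagate the first t layers
        Phi = np.maximum(Phi @ Ws[l - 1], 0.0)
    for l in range(t + 1, H + 2):
        pre = Phi @ Ws[l - 1]
        post = pre if l == H + 1 else np.maximum(pre, 0.0)
        if len(S[l]) < n:                                   # condition i
            return False
        if not np.allclose(post[:, S[l]], pre[:, S[l]]):    # condition ii:
            return False                                    # units act linearly on X
        if l <= H - 1:                                      # condition iii: no edges
            comp = [k for k in range(Ws[l].shape[1]) if k not in S[l + 1]]
            if comp and not np.allclose(Ws[l][np.ix_(S[l], comp)], 0.0):
                return False
        Phi = post
    return True

# Hand-built example: positive inputs and positive incoming weights make
# unit 0 of each hidden layer act linearly on X; zeroed outgoing edges
# from unit 0 to units outside S^(2) give weak separation.
rng = np.random.default_rng(3)
m, d0, d1, d2 = 20, 3, 4, 4
X = np.abs(rng.standard_normal((m, d0)))
W1 = rng.standard_normal((d0, d1))
W1[:, 0] = np.abs(W1[:, 0]) + 0.1
W2 = rng.standard_normal((d1, d2))
W2[:, 0] = np.abs(W2[:, 0]) + 0.1
W2[0, 1:] = 0.0
W3 = rng.standard_normal((d2, 1))
S = {1: [0], 2: [0], 3: [0]}                 # S^(1), S^(2), S^(3)
print(induces_weakly_separated(X, [W1, W2, W3], S, n=1, t=0))  # True
```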

### 5.2 Theoretical Result

We state our main theoretical result in theorem 2 and corollary 1; a simplified statement is presented in remark 1. Here, a classical machine learning method, basis function regression, is used as a baseline to be compared with neural networks. The global minimum value of basis function regression with an arbitrary basis matrix $M(X)$ is $\inf_R \frac{1}{2}\|M(X)R - Y\|_F^2$, where the basis matrix $M(X)$ does not depend on $R$ and can represent nonlinear maps, for example, by setting $M = ([\phi(x_i)]_{i=1}^m)^\top \in \mathbb{R}^{m \times d_\phi}$ with any nonlinear basis functions $\phi$ and any finite $d_\phi$. In theorem 2, the expression $P_N[\Phi^{(S)}]Y$ represents the projection of $Y$ onto the null space of $(\Phi^{(S)})^\top$, which equals $Y$ minus the projection of $Y$ onto the column space of $\Phi^{(S)}$. Given matrices $(M^{(j)})_{j \in S}$ with a sequence $S = (s_1, s_2, \ldots, s_n)$, define $[M^{(j)}]_{j \in S} := [M^{(s_1)} \ M^{(s_2)} \ \cdots \ M^{(s_n)}]$ to be the block matrix with columns being $M^{(s_1)}, M^{(s_2)}, \ldots, M^{(s_n)}$. Let $S \subseteq (s_1, s_2, \ldots, s_n)$ denote a subsequence of $(s_1, s_2, \ldots, s_n)$.
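The baseline value can be computed directly; the following sketch (our illustration, with an arbitrary random basis matrix) confirms that $\inf_R \frac{1}{2}\|M(X)R - Y\|_F^2$ equals $\frac{1}{2}\|P_N[M(X)]Y\|_F^2$:

```python
import numpy as np

rng = np.random.default_rng(4)
m, d_phi, d_y = 50, 8, 2
M = rng.standard_normal((m, d_phi))  # basis matrix M(X), e.g., nonlinear features
Y = rng.standard_normal((m, d_y))

# Global minimum value of basis function regression: inf_R (1/2)||M R - Y||_F^2.
R = np.linalg.lstsq(M, Y, rcond=None)[0]
opt = 0.5 * np.sum((M @ R - Y) ** 2)

# It equals (1/2)||P_N[M] Y||_F^2: the optimal residual is Y minus its
# projection onto the column space of M.
P_N = np.eye(m) - M @ np.linalg.pinv(M)
assert np.allclose(opt, 0.5 * np.sum((P_N @ Y) ** 2))
```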

From theorem 2 (or corollary 1), one can see the following properties of the loss landscape:

- i.
Every differentiable local minimum $\theta \in \Theta_{d_{H+1},t}$ has a loss value $L(\theta)$ better than or equal to any global minimum value of basis function regression with any combination of the basis matrices in the set $\{\Phi^{(l)}\}_{l=t}^H$ of fixed deep hierarchical representation matrices. In particular, with $t = 0$, every differentiable local minimum $\theta \in \Theta_{d_{H+1},0}$ has a loss value $L(\theta)$ no worse than the global minimum values of standard basis function regression with the handcrafted basis matrix $\Phi^{(0)} = X$, and of basis function regression with the larger basis matrix $[\Phi^{(l)}]_{l=0}^H$.

- ii.
As $d_l$ and $H$ increase (or, equivalently, as a neural network gets wider and deeper), the upper bound on the loss values of local minima can further improve.

The proof of theorem 2 is provided in section A.2. The proof is based on the combination of the idea presented in section 3.1 and perturbations of a local minimum candidate. That is, if $\theta$ is a local minimum, then $\theta$ is a global minimum within a local region (i.e., a neighborhood of $\theta$). Thus, after perturbing $\theta$ to $\theta' = \theta + \Delta\theta$ such that $\|\Delta\theta\|$ is sufficiently small (so that $\theta'$ stays in the local region) and $L(\theta') = L(\theta)$, the perturbed point $\theta'$ must still be a global minimum within the local region and, hence, $\theta'$ is also a local minimum. The proof idea of theorem 2 is to apply the proof sketch in section 3.1 not only to a local minimum candidate $\theta$ but also to its perturbations $\theta' = \theta + \Delta\theta$.

In terms of overparameterization, theorem 2 states that local minima of deep neural networks are as good as global minima of the corresponding basis function regression even without overparameterization, and overparameterization helps to further improve the guarantee on local minima. The effect of overparameterization is captured in both the first and second terms on the right-hand side of equation 5.1. As depth and width increase, the second term tends to increase, and hence the guarantee on local minima can improve. Moreover, as depth and width increase (for some of the $(t+1)$th, $(t+2)$th, $\ldots$, $H$th layers in theorem 2), the first term tends to decrease, and the guarantee on local minima can also improve. For example, if $[\Phi^{(l)}]_{l=t}^H$ has rank at least $m$, then the first term is zero and, hence, every local minimum is a global minimum with zero loss value. As a special case of this example, since every $\theta$ is automatically in $\Theta_{d_{H+1},H}$, if $\Phi^{(H)}$ is forced to have rank at least $m$, every local minimum becomes a global minimum for standard deep nonlinear neural networks, which coincides with the observation about overparameterization by Livni et al. (2014).

Without overparameterization, theorem 2 also recovers one of the main results in the literature of deep linear neural networks as a special case: that is, every local minimum is a global minimum. If $d_{H+1} \le \min\{d_l : 1 \le l \le H\}$, every local minimum $\theta$ for deep linear networks is differentiable and in $\Theta_{d_{H+1},0}$, and hence theorem 1 yields that $L(\theta) \le \frac{1}{2}\|P_N[X]Y\|_F^2$. Because $\frac{1}{2}\|P_N[X]Y\|_F^2$ is the global minimum value, this implies that every local minimum is a global minimum for deep linear neural networks.

Corollary 1 states that the same conclusion and discussions as in theorem 2 hold true even if we fix the edges in condition iii in definition 1 to be zero (by removing them as an architectural design or by forcing it with a learning algorithm) and consider optimization problems only with remaining edges.

The proof of corollary 1 is provided in section A.3 and follows the proof of theorem 2. Here, $\Phi^{(0)} = X$ consists of the training inputs $x_i$ in an arbitrary given feature space embedded in $\mathbb{R}^{d_0}$; for example, given a raw input $x_{\mathrm{raw}}$ and any feature map $\phi : x_{\mathrm{raw}} \mapsto \phi(x_{\mathrm{raw}}) \in \mathbb{R}^{d_0}$ (including the identity $\phi(x_{\mathrm{raw}}) = x_{\mathrm{raw}}$), we write $x = \phi(x_{\mathrm{raw}})$. Therefore, theorem 2 and corollary 1 state that every differentiable local minimum of deep neural networks can be guaranteed to be no worse than any given basis function regression model with a handcrafted basis taking values in $\mathbb{R}^d$ for some finite $d$, such as polynomial regression with a finite degree and radial basis function regression with a finite number of centers.

To illustrate an advantage of the notion of weakly separated edges in definition 1, one can consider the following alternative definition that requires strongly separated edges.

A parameter vector $\theta$ is said to *induce $(n, t)$ strongly separated linear units on the training input data set $X$* if there exist $(H + 1 - t)$ sets $S^{(t+1)}, S^{(t+2)}, \ldots, S^{(H+1)}$ such that for all $l \in \{t+1, t+2, \ldots, H+1\}$, conditions i to iii in definition 1 hold and $\Phi^{(l)}(X, \theta) W^{(l+1)}(\theta)_{\cdot,k} = \sum_{k' \in S^{(l)}} \Phi^{(l)}(X, \theta)_{\cdot,k'} W^{(l+1)}(\theta)_{k',k}$ for all $k \in S^{(l+1)}$ if $l \notin \{H, H+1\}$.

Let $\Theta^{\mathrm{strong}}_{n,t}$ be the set of all parameter vectors that induce $(n, t)$ strongly separated linear units on the particular training input data set $X$ that defines the total loss $L(\theta)$ in equation 2.1. Figure 2 shows a comparison of weakly separated edges and strongly separated edges. Under this stronger restriction on the local structure, we can obtain corollary 2.

The proof of corollary 2 is provided in section A.4 and follows the proof of theorem 2. As a special case, corollary 2 also recovers the statement that every local minimum is a global minimum for deep linear neural networks in the same way as theorem 2 does. When compared with theorem 2, one can see that the statement in corollary 2 is weaker, producing the upper bound only in terms of $S \subseteq (t, \ldots, H)$. This is because the restriction of strongly separated units forces neural networks to have less expressive power with fewer effective edges. This illustrates an advantage of the notion of weakly separated edges in definition 1.

A limitation of theorems 1 and 2 and corollary 1 is the lack of treatment of nondifferentiable local minima. The Lebesgue measure of the nondifferentiable points is zero, but this does not imply that the appropriate measure of nondifferentiable points is small. For example, if $L(\theta) = |\theta|$, the Lebesgue measure of the nondifferentiable point ($\theta = 0$) is zero, but the nondifferentiable point is the only local and global minimum. Thus, the treatment of nondifferentiable points in this context is a nonnegligible problem. The proofs of theorems 1 and 2 and corollary 1 are all based on the proof sketch in section 3.1, which relies heavily on differentiability. Thus, the current proofs do not trivially extend to address this open problem.

## 6 Conclusion

In this letter, we have theoretically and empirically analyzed the effect of depth and width on the loss values of local minima, with and without a possible local nonlinear-linear structure. The local nonlinear-linear structure we have considered might naturally arise during training and is also guaranteed to emerge by using specific learning algorithms or architecture designs. With the local nonlinear-linear structure, we have proved that the values of local minima of neural networks are no worse than the global minimum values of corresponding basis function regression and can improve as depth and width increase. In the general case without the possible local structure, we have theoretically shown that increasing the depth and width can improve the quality of local minima, and we have empirically supported this theoretical observation. Furthermore, without the local structure but with a shallow neural network and a gaussian data matrix, we have proven probabilistic bounds on the rates of improvement of the local minimum values with respect to width. Moreover, we have discussed a major limitation of this letter: all of its results focus on the differentiable points on the loss surfaces. Additional treatments of the nondifferentiable points are left to future research.

Our results suggest that the values of local minima are not arbitrarily poor (unless one crafts a pathological worst-case example) and can be guaranteed to some desired degree in practice, depending on the degree of overparameterization, as well as the local or global structural assumption. Indeed, a structural assumption, namely the existence of an identity map, was recently used to analyze the quality of local minima (Shamir, 2018; Kawaguchi & Bengio, 2018). When compared with these previous studies (Shamir, 2018; Kawaguchi & Bengio, 2018), we have shown the effect of depth and width, as well as considered a different type of neural network without the explicit identity map.

In practice, we often “overparameterize” a hypothesis space in deep learning in a certain sense (e.g., in terms of expressive power). Theoretically, with strong overparameterization assumptions, we can show that every stationary point (including all local minima) with respect to a single layer is a global minimum with zero training error and can memorize any data set. However, “overparameterization” in practice may not satisfy such strong overparameterization assumptions in the theoretical literature. In contrast, our results in this letter do not require overparameterization and show the gradual effects of overparameterization as consequences of general results.

## Appendix A: Proofs for Nonprobabilistic Statements

Let $D^{(l)}_{k_l}$ be defined as in theorem 2. Let $D^{(l)} := [D^{(l)}_k]_{k=1}^{d_l} \in \mathbb{R}^{m d_{H+1} \times d_l d_{l-1}}$ and $D := [D^{(l)}]_{l=1}^H \in \mathbb{R}^{m d_{H+1} \times \sum_{l=1}^H d_{l-1} d_l}$. Given a matrix-valued function $f(\theta) \in \mathbb{R}^{d' \times d}$, let $\partial_{W^{(l)}} f(\theta) := \frac{\partial \mathrm{vec}(f)}{\partial \mathrm{vec}(W^{(l)})} \in \mathbb{R}^{d'd \times d_{l-1} d_l}$ be the partial derivative of $\mathrm{vec}(f)$ with respect to $\mathrm{vec}(W^{(l)})$. Let $\{j, j+1, \ldots, j'\} := \emptyset$ if $j > j'$. Let $M^{(l)} M^{(l+1)} \cdots M^{(l')} = I$ if $l > l'$. Let $\mathrm{Null}(M)$ be the null space of a matrix $M$. Let $B(\theta, \epsilon)$ be an open ball of radius $\epsilon$ centered at $\theta$.

The following lemma decomposes the model output $\hat{Y}$ in terms of the weight matrix $W^{(l)}$ and a matrix $D^{(l)}$ that coincides with its derivative at differentiable points.
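Such a decomposition is linear in $\mathrm{vec}(W^{(l)})$ once the activation patterns are fixed, via the standard column-stacking identity $\mathrm{vec}(AWB) = (B^\top \otimes A)\,\mathrm{vec}(W)$. A quick numerical check of that identity (our illustration, not the lemma itself):

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 3))
W = rng.standard_normal((3, 5))
B = rng.standard_normal((5, 2))

vec = lambda M: M.flatten(order="F")  # column-stacking vectorization

# vec(A W B) = (B^T kron A) vec(W): the product is linear in vec(W),
# with coefficient matrix B^T kron A.
lhs = vec(A @ W @ B)
rhs = np.kron(B.T, A) @ vec(W)
assert np.allclose(lhs, rhs)
```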

Lemma 2 generalizes part of theorem A.45 in Rao, Toutenburg, Shalabh, and Heumann (2007) by discarding invertibility assumptions.

Lemma 3 decomposes a norm of a projected target vector into a form that clearly shows an effect of depth and width.

The following lemma plays a major role in the proof of theorem 2.

### A.1 Proof of Theorem 1

### A.2 Proof of Theorem 2

### A.3 Proof of Corollary 1

The statement follows the proof of theorem 2 by noticing that lemma 4 still holds for the restriction of $L$ to $I$ with $\tilde{B}(\theta, \epsilon_1) = B(\theta, \epsilon_1) \cap I$, and by replacing $D^{(l)}_{k_l}$ with $\hat{D}^{(l)}_{k_l}$ in the proof, where $\hat{D}^{(l)}_{k_l}$ is obtained from the proof of lemma 1 by setting $W^{(l+1)}(\theta)_{k',k} = 0$ for $(k', k) \in S^{(l)} \times (\{1, \ldots, d_{l+1}\} \setminus S^{(l+1)})$ $(l = t+1, t+2, \ldots, H-1)$ and by not considering their derivatives. $\square$

### A.4 Proof of Corollary 2

## Appendix B: Proofs for Probabilistic Statements

In the following lemma, we rewrite equation 3.2 in terms of the activation pattern and the data matrices $X$ and $Y$.

From equation B.1, we expect that the larger the rank of the projection matrix $\tilde{D}$, the smaller the loss $L(\theta)$. In the following lemma, we prove this under conditions on the activation pattern matrix $\Lambda$. In the regime $d_x d \ll m$, we have $\mathrm{rank}\, \tilde{D} = d_x d$. In the regime $d_x d \gg m$, we have $\mathrm{rank}\, \tilde{D} = m$. As we show later, proposition 2 follows easily from these rank estimates of $\tilde{D}$.

Fix the activation pattern matrix $\Lambda := [\Lambda_k]_{k=1}^{d} \in \mathbb{R}^{m \times d}$. Let $X$ be a random $m \times d_x$ gaussian matrix, with each entry having mean zero and variance one. Then the matrix $\tilde{D}$ as defined in equation B.2 satisfies both of the following statements:

- i.
If $m \ge 64 \ln^2(d_x d m / \delta^2)\, d_x d$ and $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \dots, m\}$ with $|I| \ge m/2$, then $\mathrm{rank}\, \tilde{D} = d_x d$ with probability at least $1 - e^{-m/(64 \ln(d_x d m/\delta^2))} - 2e^{-t}$.

- ii.
If $d d_x \ge 2m \ln^2(md/\delta)$ with $d_x \ge \ln^2(dm)$ and $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \dots, m\}$ with $|I| \le d/2$, then $\mathrm{rank}\, \tilde{D} = m$ with probability at least $1 - 2e^{-d_x/20}$.

The following concentration inequalities for squares of gaussian random variables are from Laurent and Massart (2000).

## Appendix C: Additional Experimental Details

By using the ground-truth network described in section 4.3, the synthetic data set was generated with i.i.d. random inputs $x$ and i.i.d. random weight matrices $W^{(l)}$. Each input $x$ was randomly sampled from the standard normal distribution, and each entry of the weight matrix $W^{(l)}$ was randomly sampled from a normal distribution with zero mean and normalized standard deviation ($\sqrt{2/d_l}$).
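As a concrete illustration, this sampling procedure can be sketched as follows. The layer widths, ReLU activation, and random seed below are hypothetical stand-ins; the actual ground-truth network is the one specified in section 4.3.

```python
import numpy as np

rng = np.random.default_rng(0)

m, d_x = 1000, 10          # number of samples and input dimension (hypothetical)
widths = [d_x, 16, 16, 1]  # layer widths d_0, ..., d_{H+1} (hypothetical)

# i.i.d. standard normal inputs
X = rng.standard_normal((m, d_x))

# i.i.d. weight matrices with zero mean and normalized standard deviation sqrt(2/d_l)
Ws = [rng.normal(0.0, np.sqrt(2.0 / widths[l]), size=(widths[l], widths[l + 1]))
      for l in range(len(widths) - 1)]

# forward pass through the ground-truth network (ReLU hidden layers, linear output)
H = X
for W in Ws[:-1]:
    H = np.maximum(H @ W, 0.0)   # ReLU activation
Y = H @ Ws[-1]                   # targets for the synthetic data set
```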

For training, we used a standard training procedure: mini-batch stochastic gradient descent (SGD) with momentum. The learning rate was set to 0.01. The momentum coefficient was set to 0.9 for the synthetic data set and 0.5 for the image data sets. The mini-batch size was set to 200 for the synthetic data set and 64 for the image data sets.
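For reference, the classical SGD-with-momentum update can be sketched in a few lines of NumPy; the function name and in-place update form below are our own illustration, not the original training code.

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.01, momentum=0.9):
    """One mini-batch SGD-with-momentum update (classical velocity form)."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum      # decay the accumulated velocity
        v -= lr * g        # add the scaled negative gradient
        p += v             # move the parameter along the velocity
    return params, velocities
```

With `lr=0.01` and `momentum=0.9`, this matches the hyperparameters used for the synthetic data set; the image data sets use `momentum=0.5` instead.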

From the proof of theorem 1, $J(\theta) = \|(I - P_{[D\; D_{1}^{(H+1)}]})\, \mathrm{vec}(Y)\|_2^2$ for all $\theta$, which was used to numerically compute the values of $J(\theta)$. This is mainly because the form of $J(\theta)$ in theorem 1 may accumulate positive numerical errors for each $l \le H$ and $k_l \le d_l$ in the sum in its second term, which may easily cause a numerical overestimation of the effect of depth and width. To compute the projections, we adopted a method of computing a numerical cutoff criterion on singular values from Press, Teukolsky, Vetterling, and Flannery (2007): for a matrix $M \in \mathbb{R}^{d' \times d}$, (the numerical cutoff criterion) $= \frac{1}{2} \times$ (maximum singular value of $M$) $\times$ (machine precision of $M$) $\times$ $(d' + d + 1)$. We also confirmed that the reported experimental results remained qualitatively unchanged with two other cutoff criteria: a criterion based on Golub and Van Loan (1996), (the numerical cutoff criterion) $= \frac{1}{2}\|M\|_\infty \times$ (machine precision of $M$), where $\|M\|_\infty = \max_{1 \le i \le d'} \sum_{j=1}^{d} |M_{i,j}|$ for a matrix $M \in \mathbb{R}^{d' \times d}$; as well as a criterion based on the Netlib Repository LAPACK documentation, (the numerical cutoff criterion) $=$ (maximum singular value of $M$) $\times$ (machine precision of $M$).
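A minimal sketch of the numerical rank computation with the Press et al. (2007) cutoff, assuming a dense SVD is affordable for the matrices involved:

```python
import numpy as np

def numerical_rank(M):
    """Rank of M via SVD, counting singular values above the
    Press et al. (2007) cutoff: 0.5 * s_max * eps * (d' + d + 1)."""
    d_prime, d = M.shape
    s = np.linalg.svd(M, compute_uv=False)
    eps = np.finfo(M.dtype).eps                      # machine precision of M
    cutoff = 0.5 * s.max() * eps * (d_prime + d + 1)
    return int(np.sum(s > cutoff))
```

The two alternative criteria mentioned above differ only in the `cutoff` line (e.g., `0.5 * np.abs(M).sum(axis=1).max() * eps` for the Golub and Van Loan variant).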

## Acknowledgments

We gratefully acknowledge support from NSF grants 1523767 and 1723381; AFOSR grant FA9550-17-1-0165; ONR grant N00014-18-1-2847; Honda Research; and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.