In this letter, we analyze the effects of depth and width on the quality of local minima, without the strong overparameterization and simplification assumptions made in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observations with a synthetic data set, as well as the MNIST, CIFAR-10, and SVHN data sets. When compared with previous studies that rely on strong overparameterization assumptions, the results in this letter do not require overparameterization and instead show the gradual effects of overparameterization as consequences of general results.

Deep learning with neural networks has achieved significant practical success in many fields, including computer vision, machine learning, and artificial intelligence. Along with this practical success, deep learning has been theoretically analyzed and shown to be attractive in terms of its expressive power. For example, neural networks with one hidden layer can approximate any continuous function (Leshno, Lin, Pinkus, & Schocken, 1993; Barron, 1993), and deeper neural networks enable us to approximate functions of certain classes with fewer parameters (Montufar, Pascanu, Cho, & Bengio, 2014; Livni, Shalev-Shwartz, & Shamir, 2014; Telgarsky, 2016). However, training deep learning models requires us to work with a seemingly intractable problem: nonconvex and high-dimensional optimization. Finding a global minimum of a general nonconvex function is NP-hard (Murty & Kabadi, 1987), and nonconvex optimization to train certain types of neural networks is also known to be NP-hard (Blum & Rivest, 1992). These hardness results pose a serious concern only for high-dimensional problems, because global optimization methods can efficiently approximate global minima without convexity in relatively low-dimensional problems (Kawaguchi, Kaelbling, & Lozano-Pérez, 2015).

A hope is that beyond the worst-case scenarios, practical deep learning allows some additional structure or assumption to make nonconvex high-dimensional optimization tractable. Recently, it has been shown with strong simplification assumptions that there are novel loss landscape structures in deep learning optimization that may play a role in making the optimization tractable (Dauphin et al., 2014; Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016). Another key observation is that if a neural network is strongly overparameterized so that it can memorize any data set of a fixed size, then all stationary points (including all local minima and saddle points) become global minima, with some nondegeneracy assumptions. This observation was explained by Livni et al. (2014) and further refined by Nguyen and Hein (2017, 2018). However, these previous results (Livni et al., 2014; Nguyen and Hein, 2017, 2018) require strong overparameterization by assuming not only that a network's width is larger than the data set size but also that optimizing only a single layer (the last layer or some hidden layer) can memorize any data set based on an assumed condition on the rank or nondegeneracy of other layers.

In this letter, we analyze the effects of depth and width on the values of local minima, without the strong overparameterization and simplification assumptions made in the literature. As a result, we prove quantitative upper bounds on the quality of local minima, which show that the values of local minima of neural networks are guaranteed to be no worse than the globally optimal values of corresponding classical machine learning models and that this guarantee can improve as depth and width increase.

This section defines the optimization problem considered in this letter and introduces the basic notation.

### 2.1  Problem Formulation

Let $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ be an input vector and a target vector, respectively. Let $\{(x_i, y_i)\}_{i=1}^{m}$ be a training data set of size $m$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^{n}$, define $[M^{(j)}]_{j=1}^{n} := [M^{(1)} \; M^{(2)} \; \cdots \; M^{(n)}]$ to be the block matrix whose column blocks are $M^{(1)}, M^{(2)}, \dots, M^{(n)}$. Define the training data matrices as $X := ([x_i]_{i=1}^{m})^\top \in \mathbb{R}^{m \times d_x}$ and $Y := ([y_i]_{i=1}^{m})^\top \in \mathbb{R}^{m \times d_y}$.

This letter considers the squared loss function, with which the training objective of the neural networks can be formulated as the following optimization problem:
$\underset{\theta}{\operatorname{minimize}} \;\; L(\theta) := \frac{1}{2}\left\|\hat{Y}(X,\theta) - Y\right\|_F^2,$
(2.1)
where $\|\cdot\|_F$ is the Frobenius norm, $\hat{Y}(X,\theta) \in \mathbb{R}^{m \times d_y}$ is the output prediction matrix of a neural network, and $\theta \in \mathbb{R}^{d_\theta}$ is the vector consisting of all trainable parameters. Here, $\frac{2}{m}L(\theta)$ is the standard mean squared error, for which all of our results hold true as well, because multiplying $L(\theta)$ by the constant $\frac{2}{m}$ (in $\theta$) changes only the overall scale of the optimization landscape.

The output prediction matrix $\hat{Y}(X,\theta) \in \mathbb{R}^{m \times d_y}$ is specified for shallow networks with rectified linear units (ReLUs) in section 3 and generalized to deep nonlinear neural networks in section 4.

### 2.2  Additional Notation

Define $P[M]$ to be the orthogonal projection matrix onto the column space (or range space) of a matrix $M$. Let $P_N[M]$ be the orthogonal projection matrix onto the null space (or kernel space) of the matrix $M^\top$. For a matrix $M \in \mathbb{R}^{d \times d'}$, we denote the standard (column-major) vectorization of $M$ by $\operatorname{vec}(M) = [M_{1,1}, \dots, M_{d,1}, M_{1,2}, \dots, M_{d,2}, \dots, M_{1,d'}, \dots, M_{d,d'}]^\top$.
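These two projections and the vectorization convention can be sketched concretely in NumPy (a minimal sketch with toy sizes of our choosing; the helper names `P` and `PN` mirror the notation above but are not from the letter):

```python
import numpy as np

def P(M):
    """Orthogonal projection onto the column space (range) of M."""
    return M @ np.linalg.pinv(M)

def PN(M):
    """Orthogonal projection onto the null space of M^T,
    i.e., the orthogonal complement of the column space of M."""
    return np.eye(M.shape[0]) - P(M)

rng = np.random.default_rng(0)
M = rng.standard_normal((6, 3))

# PN[M] annihilates the column space of M, and projections are idempotent.
assert np.allclose(PN(M) @ M, 0.0)
assert np.allclose(P(M) @ P(M), P(M))

# vec() here is the standard column-major (column-stacking) vectorization.
A = np.arange(6).reshape(2, 3)          # [[0, 1, 2], [3, 4, 5]]
assert np.allclose(A.flatten(order="F"), [0, 3, 1, 4, 2, 5])
```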

Before presenting our main results for deep nonlinear neural networks, this section provides the results for shallow networks with a single hidden layer (or three-layer networks with the input and output layers) and scalar-valued output (i.e., $d_y = 1$) to illustrate some of the ideas behind the discussed effects of the depth and width on local minima.

In this section, the vector $\theta \in \mathbb{R}^{d_\theta}$ of all trainable parameters determines the entries of the weight matrices $W^{(1)} := W^{(1)}(\theta) \in \mathbb{R}^{d_x \times d}$ and $W^{(2)} := W^{(2)}(\theta) \in \mathbb{R}^{d}$ as $\operatorname{vec}([W^{(1)}(\theta), W^{(2)}(\theta)]) = \theta$. Given an input matrix $X$ and a parameter vector $\theta$, the output prediction matrix $\hat{Y}(X,\theta) \in \mathbb{R}^{m}$ of a fully connected feedforward network with a single hidden layer can be written as
$\hat{Y}(X,\theta) := \sigma(XW^{(1)})W^{(2)},$
(3.1)
where $\sigma: \mathbb{R}^{m \times d} \to \mathbb{R}^{m \times d}$ is defined by coordinate-wise nonlinear activation functions $\sigma_{i,j}$ as $(\sigma(M))_{i,j} := \sigma_{i,j}(M_{i,j})$ for each $(i,j)$.
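As a concrete reference, equations 2.1 and 3.1 for this shallow model take only a few lines of NumPy (a sketch with hypothetical toy dimensions; the names `relu`, `Y_hat`, and `loss` are our own):

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def Y_hat(X, W1, W2):
    """Output of the one-hidden-layer network, eq. 3.1: sigma(X W^(1)) W^(2)."""
    return relu(X @ W1) @ W2

def loss(X, Y, W1, W2):
    """Squared loss of eq. 2.1: L(theta) = (1/2) ||Y_hat(X, theta) - Y||^2."""
    r = Y_hat(X, W1, W2) - Y
    return 0.5 * np.sum(r ** 2)

rng = np.random.default_rng(0)
m, dx, d = 20, 5, 8                      # sample size, input dim, hidden width
X = rng.standard_normal((m, dx))
Y = rng.standard_normal(m)               # scalar-valued output (d_y = 1)
W1 = rng.standard_normal((dx, d))
W2 = rng.standard_normal(d)
print(loss(X, Y, W1, W2))                # a nonnegative scalar
```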

### 3.1  Analysis with ReLU Activations

In this section, the nonlinear activation function $\sigma_{i,j}$ is assumed to be ReLU: $\sigma_{i,j}(z) = \max(0, z)$. Let $\Lambda^{1,k} \in \mathbb{R}^{m \times m}$ represent a diagonal matrix with diagonal elements corresponding to the activation pattern of the $k$th unit at the hidden layer over the $m$ samples; that is, for all $i \in \{1,\dots,m\}$ and all $k \in \{1,\dots,d\}$,
$\Lambda^{1,k}_{ii} = \begin{cases} 1 & \text{if } (XW^{(1)})_{i,k} > 0, \\ 0 & \text{otherwise.} \end{cases}$
Let $\Phi^{(1)} := \Phi^{(1)}(X,\theta) := \sigma(XW^{(1)}) \in \mathbb{R}^{m \times d}$ be the postactivation output of the hidden layer.
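The identity underlying $\Lambda^{1,k}$ can be checked numerically: once the 0/1 pattern is fixed, each ReLU unit acts linearly on the given data. The following sketch (toy sizes of our choosing, not from the letter) verifies $\sigma(XW^{(1)})_{\cdot,k} = \Lambda^{1,k}XW^{(1)}_{\cdot,k}$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, dx, d = 10, 4, 3
X = rng.standard_normal((m, dx))
W1 = rng.standard_normal((dx, d))

pre = X @ W1                              # preactivation of the hidden layer
Phi1 = np.maximum(0.0, pre)               # Phi^(1) = sigma(X W^(1))
for k in range(d):
    # Diagonal of Lambda^{1,k}: the 0/1 ReLU pattern of unit k over the m samples.
    lam = (pre[:, k] > 0).astype(float)
    # With the pattern fixed, the unit is linear on this data:
    # sigma(X W^(1))_{., k} = Lambda^{1,k} X W^(1)_{., k}.
    assert np.allclose(Phi1[:, k], lam * pre[:, k])
```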

Under this setting, proposition 1 provides an equation that holds at local minima and illustrates the effect of width for shallow ReLU neural networks.

Proposition 1.
Every differentiable local minimum $\theta$ of $L$ satisfies
$L(\theta) = \frac{1}{2}\|Y\|_2^2 - \frac{1}{2}\left\|P[N_1^{(2)}\Phi^{(1)}]Y\right\|_2^2 - \underbrace{\sum_{k=1}^{d}\frac{1}{2}\left\|P[N_k^{(1)}D_k^{(1)}]Y\right\|_2^2}_{\substack{\geq 0 \\ \text{further improvement as a network gets wider}}},$
(3.2)
where $D_k^{(1)} = W_k^{(2)}\Lambda^{1,k}X$. Here, $N_1^{(1)} := I_m$, $N_k^{(1)} := P_N[\bar{Q}_{k-1}^{(1)}]$ for any $k \in \{2,\dots,d\}$, and $N_1^{(2)} := P_N[\bar{Q}_{d}^{(1)}]$, where $\bar{Q}_{k}^{(1)} := [Q_1^{(1)}, \dots, Q_k^{(1)}]$ and $Q_k^{(1)} := N_k^{(1)}D_k^{(1)}$ for any $k \in \{1,\dots,d\}$.

Proposition 1 is an immediate consequence of our general result (see theorem 1) in the next section (the proof is provided in section A.1). In the rest of this section, we provide a proof sketch of proposition 1.

A geometric intuition behind proposition 1 is that a local minimum is a global minimum within a local region in $\mathbb{R}^{d_\theta}$ (i.e., a neighborhood of the local minimum), whose dimension increases as a network gets wider (i.e., as the number of parameters increases). Thus, a local minimum of a wider network is a global minimum over a search space of larger dimension. One can also see this geometric intuition in the following analysis. If $\theta$ is a differentiable local minimum, then $\theta$ must be a critical point, and thus
$\nabla_\theta L(\theta) = \nabla_\theta \hat{Y}(X,\theta)\,\bigl(\hat{Y}(X,\theta) - Y\bigr) = 0.$
By rearranging this,
$\nabla_\theta \hat{Y}(X,\theta)\,\hat{Y}(X,\theta) = \nabla_\theta \hat{Y}(X,\theta)\,Y,$
(3.3)
where we can already see the power of strong overparameterization: if the matrix $\nabla_\theta \hat{Y}(X,\theta) \in \mathbb{R}^{d_\theta \times m}$ is left-invertible, then $\hat{Y}(X,\theta) = Y$, and hence every differentiable local minimum is a global minimum. Here, $\nabla_\theta \hat{Y}(X,\theta)$ is a $d_\theta$ by $m$ matrix, so significantly increasing $d_\theta$ (strong overparameterization) can ensure the left invertibility.
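This memorization effect of extreme width is easy to see numerically. The following sketch (our own toy setup, not an experiment from the letter) fits only the last layer on top of $d$ random ReLU features; once $d$ greatly exceeds $m$, the feature matrix generically has rank $m$ and the training residual vanishes:

```python
import numpy as np

rng = np.random.default_rng(2)
m, dx = 50, 10
X = rng.standard_normal((m, dx))
Y = rng.standard_normal(m)

def residual_norm(d):
    """Best last-layer fit of Y using d random ReLU features."""
    Phi = np.maximum(0.0, X @ rng.standard_normal((dx, d)))
    W2, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
    return np.linalg.norm(Phi @ W2 - Y)

print(residual_norm(5))    # d << m: a nonzero residual remains
print(residual_norm(200))  # d >> m: Phi generically has rank m, residual ~ 0
```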
Beyond the strong overparameterization, we proceed with the proof sketch of proposition 1 by taking advantage of the special neural network structure in $\hat{Y}(X,\theta)$ and $\nabla_\theta \hat{Y}(X,\theta)$. We first observe that $\hat{Y}(X,\theta) = \Phi^{(1)}W^{(2)}$ and $\hat{Y}(X,\theta) = D^{(1)}\operatorname{vec}(W^{(1)})$, where $D^{(1)} := [D_k^{(1)}]_{k=1}^{d}$. Moreover, at any differentiable point, we have that $\nabla_{W^{(2)}}\hat{Y}(X,\theta) = (\Phi^{(1)})^\top$ and $\nabla_{\operatorname{vec}(W^{(1)})}\hat{Y}(X,\theta) = (D^{(1)})^\top$. Combining these with equation 3.3 yields
$[\Phi^{(1)} \; D^{(1)}]^\top \, \frac{1}{2}[\Phi^{(1)} \; D^{(1)}]\begin{bmatrix} W^{(2)} \\ \operatorname{vec}(W^{(1)}) \end{bmatrix} = [\Phi^{(1)} \; D^{(1)}]^\top Y,$
where
$\hat{Y}(X,\theta) = \frac{1}{2}[\Phi^{(1)} \; D^{(1)}]\begin{bmatrix} W^{(2)} \\ \operatorname{vec}(W^{(1)}) \end{bmatrix}.$
By solving for the stacked parameter vector $[(W^{(2)})^\top \;\; \operatorname{vec}(W^{(1)})^\top]^\top$,
$\hat{Y}(X,\theta) = P[[D^{(1)} \; \Phi^{(1)}]]\,Y.$
Therefore,
$L(\theta) = \frac{1}{2}\left\|Y - P[[D^{(1)} \; \Phi^{(1)}]]Y\right\|_2^2 = \frac{1}{2}\|Y\|_2^2 - \frac{1}{2}\left\|P[[D^{(1)} \; \Phi^{(1)}]]Y\right\|_2^2,$
where the second equality follows from the idempotence and symmetry of the orthogonal projection. Finally, decomposing the second term $\frac{1}{2}\left\|P[[D^{(1)} \; \Phi^{(1)}]]Y\right\|_2^2$ by following the Gram-Schmidt process on the set of column vectors of $[D^{(1)} \; \Phi^{(1)}]$ yields the desired statement of proposition 1, completing its proof sketch. In proposition 1, the matrices $N_k^{(l)}$ (and $Q_k^{(l)}$) are by-products of this Gram-Schmidt process.
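The Gram-Schmidt decomposition used in this last step can be verified numerically. In the sketch below (a generic random matrix $A$ stands in for the column blocks of $[D^{(1)} \; \Phi^{(1)}]$; the variable names are ours), the per-column terms $\|P[N_k A_{\cdot,k}]Y\|_2^2$ sum exactly to $\|P[A]Y\|_2^2$:

```python
import numpy as np

def P(M):
    """Orthogonal projection onto the column space of M."""
    return M @ np.linalg.pinv(M)

rng = np.random.default_rng(3)
m, n = 12, 5
A = rng.standard_normal((m, n))     # stands in for the columns of [D^(1)  Phi^(1)]
Y = rng.standard_normal(m)

total, Q = 0.0, None                # Q collects the orthogonalized columns
for k in range(n):
    N_k = np.eye(m) if Q is None else np.eye(m) - P(Q)
    q_k = N_k @ A[:, [k]]           # Gram-Schmidt: component orthogonal to span(Q)
    total += np.linalg.norm(P(q_k) @ Y) ** 2
    Q = q_k if Q is None else np.hstack([Q, q_k])

# The per-column terms sum exactly to the full projection norm, as in eq. 3.2.
assert np.isclose(total, np.linalg.norm(P(A) @ Y) ** 2)
```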

### 3.2  Probabilistic Bound

From equation 3.2 in proposition 1, the loss $L(\theta)$ at differentiable local minima is expected to get smaller as the width $d$ of the hidden layer gets larger. To further support this theoretical observation, this section obtains a probabilistic upper bound on the loss $L(\theta)$ for white noise data by fixing the activation patterns $\Lambda^{1,k}$ for $k \in \{1,2,\dots,d\}$ and assuming that the data matrix $[X \; Y]$ is a random gaussian matrix, with each entry having mean zero and variance one.

In this section, each nonlinear activation function $\sigma_{i,j}$ is assumed to be ReLU ($\sigma_{i,j}(z) = \max(0,z)$), leaky ReLU ($\sigma_{i,j}(z) = \max(az, z)$ with any fixed $a \leq 1$), or absolute value activation ($\sigma_{i,j}(z) = |z|$). Let $\Lambda^{1,k} \in \mathbb{R}^{m \times m}$ represent a diagonal matrix with diagonal elements corresponding to the activation pattern of the $k$th unit at the hidden layer over the $m$ samples as
$\Lambda^{1,k}_{ii} := \begin{cases} \left.\frac{\partial \sigma_{i,k}^{(1)}(z)}{\partial z}\right|_{z=(XW^{(1)})_{i,k}} & \text{if } \left.\frac{\partial \sigma_{i,k}^{(1)}(z)}{\partial z}\right|_{z=(XW^{(1)})_{i,k}} \text{ exists}, \\ 0 & \text{otherwise.} \end{cases}$

This definition of $\Lambda^{1,k}_{ii}$ generalizes the corresponding definition in section 3.1. Proposition 1 holds for this generalized activation pattern by simply replacing the previous definition of $\Lambda^{1,k}_{ii}$ with this more general one. This can be seen from the proof sketch in section 3.1 and is later formalized in the proof of theorem 1.

We denote the vector consisting of the diagonal entries of $\Lambda^{1,k}$ by $\Lambda_k \in \mathbb{R}^{m}$ for $k \in \{1,2,\dots,d\}$. Define the activation pattern matrix as $\Lambda := [\Lambda_k]_{k=1}^{d} \in \mathbb{R}^{m \times d}$. For any index set $I \subseteq \{1,2,\dots,m\}$, let $\Lambda_I$ denote the submatrix of $\Lambda$ consisting of the rows with indices in $I$. Let $s_{\min}(\Lambda_I)$ be the smallest singular value of $\Lambda_I$.
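Concretely, $s_{\min}(\Lambda_I)$ can be computed as follows (a minimal sketch with a hypothetical random 0/1 pattern matrix and an arbitrary index set of our choosing):

```python
import numpy as np

rng = np.random.default_rng(4)
m, d = 8, 3
# A hypothetical 0/1 ReLU activation pattern matrix Lambda (one column per unit).
Lam = (rng.standard_normal((m, d)) > 0).astype(float)

I = [0, 2, 5, 7]                    # an index set I, a subset of {1, ..., m}
Lam_I = Lam[I, :]                   # rows of Lambda with indices in I
s_min = np.linalg.svd(Lam_I, compute_uv=False).min()
print(s_min)                        # s_min(Lambda_I) >= delta is the assumption below
```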

Proposition 2 proves that $L(\theta) \approx (1 - d_x d/m)\|Y\|_2^2/2$ in the regime $d_x d \ll m$ and that $L(\theta) = 0$ in the regime $d_x d \gg m$, under the corresponding conditions on $\Lambda$; that is, $s_{\min}(\Lambda_I) \geq \delta$ for any index set $I \subseteq \{1,2,\dots,m\}$ such that $|I| \geq m/2$ in the regime $d_x d \ll m$, and $|I| \leq d/2$ in the regime $d_x d \gg m$. This supports our theoretical observation that increasing width helps improve the quality of local minima.

Proposition 2.

Fix the activation pattern matrix $\Lambda = [\Lambda_k]_{k=1}^{d} \in \mathbb{R}^{m \times d}$. Let $[X \; Y]$ be a random $m \times (d_x + 1)$ gaussian matrix, with each entry having mean zero and variance one. Then the loss $L(\theta)$ as in equation 3.2 satisfies both of the following statements:

• i.
If $m \geq 64\ln^2(d_x d m/\delta^2)\, d_x d$ and $s_{\min}(\Lambda_I) \geq \delta$ for any index set $I \subseteq \{1,2,\dots,m\}$ with $|I| \geq m/2$, then
$L(\theta) \leq \left(1 + 6\sqrt{\tfrac{t}{m}}\right)\frac{m - d_x d}{2m}\|Y\|_2^2,$
with probability at least $1 - e^{-m/(64\ln(d_x d m/\delta^2))} - 2e^{-t}$.
• ii.
If $dd_x \geq 2m\ln^2(md/\delta)$ with $d_x \geq \ln^2(dm)$ and $s_{\min}(\Lambda_I) \geq \delta$ for any index set $I \subseteq \{1,2,\dots,m\}$ with $|I| \leq d/2$, then
$L(\theta) = 0$
with probability at least $1 - 2e^{-d_x/20}$.

The proof of proposition 2 is provided in appendix B. In that proof, we first rewrite the loss $L(\theta)$ in terms of the projection of $Y$ onto the null space of an $m \times dd_0$ matrix $\tilde{D}$, with an explicit expression in terms of the activation pattern matrix $\Lambda$ and the data matrix $X$. By our assumption, the data matrix $X$ is a random gaussian matrix, so $\tilde{D}$ is also a random matrix. Proposition 2 then boils down to understanding the rank of $\tilde{D}$, and we proceed to show that $\tilde{D}$ has the largest possible rank, $\min\{dd_0, m\}$, with high probability. In fact, we derive quantitative estimates on the smallest singular value of $\tilde{D}$. The main difficulties are that the columns of $\tilde{D}$ are correlated and that the variances of different entries vary. Our approach to obtaining quantitative estimates on the smallest singular value of $\tilde{D}$ combines the epsilon-net argument with an iterative argument.

In the regime $dd_0 \gg m$, results similar to proposition 2ii were obtained under certain diversity assumptions on the entries of the weight matrices in a previous study (Xie, Liang, & Song, 2017). Compared with that study, proposition 2 specifies precise relations between the size $dd_0$ of the neural network and the size $m$ of the data set and also holds in the regime $dd_0 \ll m$. Moreover, our proof arguments for proposition 2ii are different. Xie et al. (2017), under the assumption that $dd_0 \gg m$, show that $\tilde{D}\tilde{D}^\top$ is close to its expectation in the sense of spectral norm. As a consequence, the lower bound on the smallest eigenvalue of $E[\tilde{D}\tilde{D}^\top]$ gives a lower bound on the smallest singular value of $\tilde{D}$.

However, proposition 2 assumes a gaussian data matrix, which may be a substantial limitation. The proof of proposition 2 relies on the concentration properties of the gaussian distribution. Whereas a similar proof would extend proposition 2 to a nongaussian distribution with these properties (e.g., distributions with subgaussian tails), it would be challenging to use a similar proof for a general distribution without such properties.

Let $H$ be the number of hidden layers and $d_l$ the width (or, equivalently, the number of units) of the $l$th hidden layer. To theoretically analyze concrete phenomena, the rest of this letter focuses on fully connected feedforward networks with various depths $H \geq 1$ and widths $d_l \geq 1$, using rectified linear units (ReLUs), leaky ReLUs, and absolute value activations, evaluated with the squared loss function. In the rest of this letter, the (finite) depth $H$ can be arbitrarily large, and the (finite) widths $d_l$ can arbitrarily differ among layers.

### 4.1  Model and Notation

Let $\theta \in \mathbb{R}^{d_\theta}$ be the vector consisting of all trainable parameters, which determines the entries of the weight matrix $W^{(l)} := W^{(l)}(\theta) \in \mathbb{R}^{d_{l-1} \times d_l}$ at every layer $l$ as $\operatorname{vec}([W^{(l)}(\theta)]_{l=1}^{H+1}) = \theta$. Here, $d_\theta := \sum_{l=1}^{H+1} d_{l-1}d_l$ is the number of trainable parameters. Given an input matrix $X$ and a parameter vector $\theta$, the output prediction matrix $\hat{Y}(X,\theta) \in \mathbb{R}^{m \times d_{H+1}}$ of a fully connected feedforward network can be written as
$\hat{Y}(X,\theta) := \Phi^{(H)}W^{(H+1)},$
(4.1)
where $\Phi^{(l)} := \Phi^{(l)}(X,\theta) \in \mathbb{R}^{m \times d_l}$ is the postactivation output of the $l$th hidden layer,
$\Phi^{(l)}(X,\theta) := \sigma^{(l)}(\Phi^{(l-1)}W^{(l)}),$
where $\Phi^{(0)}(X,\theta) := X$, $\Phi^{(H+1)}(X,\theta) := \hat{Y}(X,\theta)$, and $\sigma^{(l)}: \mathbb{R}^{m \times d_l} \to \mathbb{R}^{m \times d_l}$ is defined by coordinate-wise nonlinear activation functions $\sigma_{i,j}^{(l)}$ as $(\sigma^{(l)}(M))_{i,j} := \sigma_{i,j}^{(l)}(M_{i,j})$ for each $(l,i,j)$. Each nonlinear activation function $\sigma_{i,j}^{(l)}$ is allowed to differ among layers and among units within each layer but is assumed to be ReLU ($\sigma_{i,j}^{(l)}(z) = \max(0,z)$), leaky ReLU ($\sigma_{i,j}^{(l)}(z) = \max(az,z)$ with any fixed $a \leq 1$), or absolute value activation ($\sigma_{i,j}^{(l)}(z) = |z|$). Here, $d_{H+1} = d_y$ and $d_0 = d_x$. Let $\Lambda^{l,k} \in \mathbb{R}^{m \times m}$ represent a diagonal matrix with diagonal elements corresponding to the activation pattern of the $k$th unit at the $l$th layer over the $m$ samples as
$\Lambda^{l,k}_{ii} := \begin{cases} \left.\frac{\partial \sigma_{i,k}^{(l)}(z)}{\partial z}\right|_{z=(\Phi^{(l-1)}W^{(l)})_{i,k}} & \text{if } \left.\frac{\partial \sigma_{i,k}^{(l)}(z)}{\partial z}\right|_{z=(\Phi^{(l-1)}W^{(l)})_{i,k}} \text{ exists}, \\ 0 & \text{otherwise.} \end{cases}$

This definition of $\Lambda^{l,k}_{ii}$ generalizes the corresponding definition in section 3. Let $I_d$ be the identity matrix of size $d$ by $d$. Define $M \otimes M'$ to be the Kronecker product of matrices $M$ and $M'$. Given a matrix $M$, $M_{\cdot,j}$ and $M_{i,\cdot}$ denote the $j$th column vector and the $i$th row vector of $M$, respectively.
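For reference, the forward computation of equation 4.1 can be sketched in a few lines (toy dimensions and the `forward` helper are our own, not from the letter):

```python
import numpy as np

def forward(X, Ws, act=lambda z: np.maximum(0.0, z)):
    """Eq. 4.1: Phi^(0) = X, Phi^(l) = sigma^(l)(Phi^(l-1) W^(l)) for l <= H,
    and Y_hat = Phi^(H) W^(H+1) (the output layer is linear)."""
    Phi = X
    for W in Ws[:-1]:                   # hidden layers 1, ..., H
        Phi = act(Phi @ W)
    return Phi @ Ws[-1]                 # output layer W^(H+1)

rng = np.random.default_rng(5)
m, dims = 16, [6, 10, 10, 2]            # d_0 = d_x = 6, H = 2 hidden layers, d_y = 2
X = rng.standard_normal((m, dims[0]))
Ws = [rng.standard_normal((a, b)) for a, b in zip(dims[:-1], dims[1:])]
print(forward(X, Ws).shape)             # (m, d_{H+1}) = (16, 2)
```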

### 4.2  Theoretical Result

For the standard deep nonlinear neural networks, theorem 1 provides an equation that holds at local minima and illustrates the effect of depth and width. Let $d_l' := d_l$ for all $l \in \{1,\dots,H\}$ and $d_{H+1}' := 1$.

Theorem 1.
Every differentiable local minimum $\theta$ of $L$ satisfies
$L(\theta) = \frac{1}{2}\|Y\|_F^2 - \underbrace{\sum_{l=1}^{H+1}\sum_{k_l=1}^{d_l'}\frac{1}{2}\left\|P[N_{k_l}^{(l)}D_{k_l}^{(l)}]\operatorname{vec}(Y)\right\|_2^2}_{\substack{\geq 0 \\ \text{further improvement as a network gets wider and deeper}}},$
(4.2)
where $D_{k_l}^{(l)} := D_{k_l}^{(l)}(\theta)$ and $N_{k_l}^{(l)} := N_{k_l}^{(l)}(\theta)$ are defined as follows. For any $l \in \{1,\dots,H\}$ and any $k_l \in \{1,\dots,d_l\}$,
$D_{k_l}^{(l)} := \sum_{k_{l+1}=1}^{d_{l+1}}\cdots\sum_{k_H=1}^{d_H}\left(W_{k_l,k_{l+1}}^{(l+1)}\cdots W_{k_{H-1},k_H}^{(H)}W_{k_H,\cdot}^{(H+1)}\right)^{\top} \otimes \Lambda^{l,k_l}\cdots\Lambda^{H,k_H}\Phi^{(l-1)},$
with $D_{k_H}^{(H)} := (W_{k_H,\cdot}^{(H+1)})^\top \otimes \Lambda^{H,k_H}\Phi^{(H-1)}$. For any $l \in \{1,\dots,H\}$ and any $k_l \in \{1,\dots,d_l\}$, $N_{k_l}^{(l)} := P_N[\bar{Q}_{k_l-1}^{(l)}]$ with $N_1^{(1)} := I_m$, where $\bar{Q}_{k_l}^{(l)} := [Q_1^{(1)}, \dots, Q_{d_1}^{(1)}, Q_1^{(2)}, \dots, Q_{d_2}^{(2)}, \dots, Q_1^{(l)}, \dots, Q_{k_l}^{(l)}]$, $Q_{k_l}^{(l)} := N_{k_l}^{(l)}D_{k_l}^{(l)}$, and $\bar{Q}_0^{(l)} := \bar{Q}_{d_{l-1}}^{(l-1)}$. Here, $D_1^{(H+1)}(\theta) := I_{d_{H+1}} \otimes \Phi^{(H)}$ and $N_1^{(H+1)}(\theta) := P_N[\bar{Q}_{d_H}^{(H)}]$.

The complete proof of theorem 1 is provided in section A.1. Theorem 1 is a generalization of proposition 1. Accordingly, its proof follows the proof sketch presented in the previous section for proposition 1.
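A small check of the last-layer block in theorem 1: by the column-major vec convention of section 2.2, $\operatorname{vec}(\Phi^{(H)}W^{(H+1)}) = (I_{d_{H+1}} \otimes \Phi^{(H)})\operatorname{vec}(W^{(H+1)})$, which is why $D_1^{(H+1)} = I_{d_{H+1}} \otimes \Phi^{(H)}$ makes the network output linear in the last layer's parameters. A numerical sketch (toy sizes of our choosing):

```python
import numpy as np

rng = np.random.default_rng(6)
m, dH, dy = 7, 4, 3
Phi = rng.standard_normal((m, dH))      # Phi^(H): last hidden representation
W = rng.standard_normal((dH, dy))       # W^(H+1): output layer weights

vec = lambda M: M.flatten(order="F")    # column-major vec, as in section 2.2

D = np.kron(np.eye(dy), Phi)            # D_1^(H+1) = I_{d_{H+1}} (kron) Phi^(H)
# vec(Phi W) = (I (kron) Phi) vec(W): the output is linear in the last layer.
assert np.allclose(vec(Phi @ W), D @ vec(W))
```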

Unlike previous studies (Livni et al., 2014; Nguyen & Hein, 2017, 2018), theorem 1 requires no overparameterization such as $d_l \geq m$. Instead, it provides quantitative gradual effects of depth and width on local minima, from no overparameterization to overparameterization. Notably, theorem 1 shows the effect of overparameterization in terms of depth as well as width, which also differs from the results of previous studies that consider overparameterization in terms of width (Livni et al., 2014; Nguyen & Hein, 2017, 2018).

The proof idea behind these previous studies with strong overparameterization is captured in the discussion after equation 3.3: with strong overparameterization such that $d_l \geq m$ and $\operatorname{rank}(D^{(1)}) \geq m$, the matrix $\nabla_{\operatorname{vec}(W)}\hat{Y}(X,\theta)$ is left-invertible and hence every local minimum is a global minimum with zero training error. Here, $\operatorname{rank}(M)$ represents the rank of a matrix $M$. The proof idea behind theorem 1 differs from those, as shown in section 3.1. What is still missing in theorem 1 is a prior guarantee on $L(\theta)$ without strong overparameterization, which is addressed in sections 3.2 and 5 for some special cases but left as an open problem for the other cases.

### 4.3  Experiments

In theorem 1, we have shown that at every differentiable local minimum $\theta$, the total training loss value $L(\theta)$ has an analytical formula $L(\theta) = J(\theta)$, where
$J(\theta) := \frac{1}{2}\|Y\|_F^2 - \sum_{l=1}^{H+1}\sum_{k_l=1}^{d_l'}\frac{1}{2}\left\|P[N_{k_l}^{(l)}(\theta)D_{k_l}^{(l)}(\theta)]\operatorname{vec}(Y)\right\|_2^2$
denotes the right-hand side of equation 4.2. In this section, we investigate the actual numerical values of the formula $J(\theta)$ with a synthetic data set and standard benchmark data sets for neural networks with various depths $H$ and hidden-layer widths $d_l$, $l \in \{1,2,\dots,H\}$.

In the synthetic data set, the data points $\{(x_i, y_i)\}_{i=1}^{m}$ were randomly generated by a ground-truth fully connected feedforward neural network with $H = 7$, $d_l = 50$ for all $l \in \{1,2,\dots,H\}$, tanh activation function, $(x,y) \in \mathbb{R}^{10} \times \mathbb{R}$, and $m = 5000$. MNIST (LeCun, Bottou, Bengio, & Haffner, 1998), a popular data set for recognizing handwritten digits, contains $28 \times 28$ gray-scale images. The CIFAR-10 (Krizhevsky & Hinton, 2009) data set consists of $32 \times 32$ color images that contain different types of objects such as “airplane,” “automobile,” and “cat.” The Street View House Numbers (SVHN) data set (Netzer et al., 2011) contains house digits collected by Google Street View, and we used the $32 \times 32$ color image version for the standard task of predicting the digits in the middle of these images. To reduce the computational cost, for the image data sets (MNIST, CIFAR-10, and SVHN), we center-cropped the images ($24 \times 24$ for MNIST and $28 \times 28$ for CIFAR-10 and SVHN), resized them to smaller gray-scale images ($8 \times 8$ for MNIST and $12 \times 12$ for CIFAR-10 and SVHN), and used randomly selected subsets of the data sets of size $m = 10{,}000$ as the training data sets.

For all the data sets, the network architecture was fixed to be a fully connected feedforward network with the ReLU activation function. For each data set, the values of $J(\theta)$ were computed with initial random weights drawn from a normal distribution with zero mean and normalized standard deviation ($1/d_l$) and with trained weights at the end of 40 training epochs. (Additional experimental details are presented in appendix C.)

Figure 1 shows the results with the synthetic data set, as well as the MNIST, CIFAR-10, and SVHN data sets. As can be seen, the values of $J(\theta)$ tend to decrease toward zero (and hence toward the global minimum value) as the width or depth of the neural network increases. In theory, the values of $J(\theta)$ may not improve as much as desired along depth and width if the representations corresponding to each unit and each layer are redundant, in the sense of linear dependence of the columns of $D_{k_l}^{(l)}(\theta)$ (see theorem 1). Intuitively, at initial random weights, this redundancy is mitigated by the randomness of the weights, and hence a major concern is whether such redundancy arises and $J(\theta)$ degrades during training. From Figure 1, it can also be noticed that the values of $J(\theta)$ tend to decrease along with training. These empirical results partially support our theoretical observation that increasing the depth and width can improve the quality of local minima.

Figure 1:

The values of $J(\theta)$ for the training data sets ($J(\theta)$ is the right-hand side of equation 4.2) with varying depth $H$ ($y$-axis) and width $d_l$ for all $l \in \{1,2,\dots,H\}$ ($x$-axis). The heat map colors represent the values of $J(\theta)$. In all panels of this figure, the left heat map (initial) is computed with initial random weights, and the right heat map (trained) is calculated after training. It can be seen that both depth and width helped improve the values of $J(\theta)$.


Given the scarcity of theoretical understanding of the optimality of deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that it is valuable to theoretically study simplified models: deep linear neural networks. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that in terms of optimization, deep linear networks exhibited several properties similar to those of deep nonlinear networks. Following these observations, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018), as a step toward the goal of establishing the optimization theory of deep learning.

As another step toward the goal, this section discards the strong linearity assumption and considers a locally induced nonlinear-linear structure in deep nonlinear networks with piecewise linear activation functions such as ReLUs, leaky ReLUs, and absolute value activations.

### 5.1  Locally Induced Nonlinear-Linear Structure

In this section, we describe how a standard deep nonlinear neural network can induce a nonlinear-linear structure. The structure considered in this letter is defined in definition 1: condition i simply defines the index subsets $S^{(l)}$ that pick out the relevant subset of units at each layer $l$, condition ii requires the existence of $n$ linearly acting units, and condition iii imposes weak separability of edges.

Definition 1.

A parameter vector $\theta$ is said to induce $(n,t)$ weakly separated linear units on a training input data set $X$ if there exist $(H+1-t)$ sets $S^{(t+1)}, S^{(t+2)}, \dots, S^{(H+1)}$ such that for all $l \in \{t+1, t+2, \dots, H+1\}$, the following three conditions hold:

• i.

$S^{(l)} \subseteq \{1,\dots,d_l\}$ with $|S^{(l)}| \geq n$.

• ii.

$\Phi^{(l)}(X,\theta)_{\cdot,k} = \Phi^{(l-1)}(X,\theta)\,W^{(l)}(\theta)_{\cdot,k}$ for all $k \in S^{(l)}$.

• iii.

$W^{(l+1)}(\theta)_{k',k} = 0$ for all $(k',k) \in S^{(l)} \times (\{1,\dots,d_{l+1}\} \setminus S^{(l+1)})$ if $l \leq H-1$.

Given a training input data set $X$, let $\Theta_{n,t}$ be the set of all parameter vectors that induce $(n,t)$ weakly separated linear units on the training input data set $X$ that defines the total loss $L(\theta)$ in equation 2.1. For standard deep nonlinear neural networks, all parameter vectors $\theta$ are in $\Theta_{d_{H+1},H}$, and some parameter vectors $\theta$ are in $\Theta_{n,t}$ for other values of $(n,t)$. Figure 2a illustrates locally induced structures for $\theta \in \Theta_{1,0}$. For a parameter $\theta$ to be in $\Theta_{n,t}$, definition 1 requires only a fraction $n/d_l$ of the units to act linearly on the particular training data set merely at the particular $\theta$. Thus, all units can be nonlinear, can act nonlinearly on the training data set outside of some parameters $\theta$, and can always operate nonlinearly on other inputs $x$ (for example, in a test data set or a different training data set). The weak separability requires that the edges going from the $n$ units to the rest of the network be negligible. It does not require the $n$ units to be separated from the rest of the neural network.
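Conditions ii and iii of definition 1 are directly checkable for a given $\theta$. The following sketch is our own helper (not from the letter); it encodes the sets $S^{(l)}$ as a dict from layer index to unit indices, which is an assumption of this illustration, and tests whether a ReLU network's parameters induce the weakly separated linear units:

```python
import numpy as np

def induces_weakly_separated(X, Ws, S, act=lambda z: np.maximum(0.0, z), tol=1e-10):
    """Check conditions ii and iii of definition 1 for the given index sets.
    S maps each layer l in {t+1, ..., H+1} to the list of unit indices S^(l)."""
    H = len(Ws) - 1
    Phi = {0: X}
    for l in range(1, H + 1):
        Phi[l] = act(Phi[l - 1] @ Ws[l - 1])
    Phi[H + 1] = Phi[H] @ Ws[H]              # linear output layer
    for l, units in S.items():
        pre = Phi[l - 1] @ Ws[l - 1]
        # Condition ii: the selected units act linearly on this data at this theta.
        if not np.allclose(Phi[l][:, units], pre[:, units], atol=tol):
            return False
        # Condition iii: edges from S^(l) to units outside S^(l+1) are zero.
        if l <= H - 1:
            outside = [k for k in range(Ws[l].shape[1]) if k not in S[l + 1]]
            if np.any(np.abs(Ws[l][np.ix_(units, outside)]) > tol):
                return False
    return True
```

For instance, a plain ReLU network whose inputs are positive and whose selected unit per layer has nonnegative incoming weights (with the condition iii edges zeroed) passes this check with $n = 1$ and $t = 0$.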

Figure 2:

Illustration of locally induced nonlinear-linear structures. (a) Simple examples of the structure with weakly separated edges considered in this section (see definition 1). (b) Examples of a simpler structure with strongly separated edges (see definition 2). The red nodes represent the linearly acting units on a training data set at a particular $\theta$, and the white nodes are the remaining units. The black dashed edges represent standard edges without any assumptions. The red nodes are allowed to depend on all nodes from the previous layer in panel a, whereas they are not in panel b, except for the input layer. In both panels a and b, two examples of parameters $\theta$ are presented with the exact same network architecture (including activation functions and edges). Even if the network architecture (or parameterization) is identical, different parameters $\theta$ can induce different local structures. With $\Theta_{1,4}$, this local structure always holds in standard deep nonlinear networks with four hidden layers.


Here, a neural network with $\theta \in \Theta_{n,t}$ can be a standard deep nonlinear neural network (without any linear units in its architecture), a deep linear neural network (with all activation functions being linear), or a combination of these. Whereas a standard deep nonlinear neural network can naturally have parameters $\theta \in \Theta_{n,t}$, it is possible to guarantee that all parameters $\theta$ are in $\Theta_{n,t}$ with a desired $(n,t)$ simply by using corresponding network architectures. For standard deep nonlinear neural networks, one can also restrict all relevant convergent solutions $\theta$ to be in $\Theta_{n,t}$ by using corresponding learning algorithms. Our theoretical results hold for all of these cases.

### 5.2  Theoretical Result

We state our main theoretical result in theorem 2 and corollary 1; a simplified statement is presented in remark 1. Here, a classical machine learning method, basis function regression, is used as a baseline against which neural networks are compared. The global minimum value of basis function regression with an arbitrary basis matrix $M(X)$ is $\inf_R \frac{1}{2}\|M(X)R - Y\|_F^2$, where the basis matrix $M(X)$ does not depend on $R$ and can represent nonlinear maps, for example, by setting $M = ([\varphi(x_i)]_{i=1}^{m})^\top \in \mathbb{R}^{m \times d_\varphi}$ with any nonlinear basis functions $\varphi$ and any finite $d_\varphi$. In theorem 2, the expression $P_N[\Phi^{(S)}]Y$ represents the projection of $Y$ onto the null space of $(\Phi^{(S)})^\top$, which equals $Y$ minus the projection of $Y$ onto the column space of $\Phi^{(S)}$. Given matrices $(M^{(j)})_{j \in S}$ with a sequence $S = (s_1, s_2, \dots, s_n)$, define $[M^{(j)}]_{j \in S} := [M^{(s_1)} \; M^{(s_2)} \; \cdots \; M^{(s_n)}]$ to be the block matrix whose column blocks are $M^{(s_1)}, M^{(s_2)}, \dots, M^{(s_n)}$. Let $S \subseteq (s_1, s_2, \dots, s_n)$ denote a subsequence of $(s_1, s_2, \dots, s_n)$.
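The baseline quantity is easy to compute: for any fixed basis matrix, the global minimum of $\frac{1}{2}\|M(X)R - Y\|_F^2$ over $R$ equals $\frac{1}{2}\|P_N[M(X)]Y\|_F^2$. A quick numerical check (toy sizes of our choosing):

```python
import numpy as np

rng = np.random.default_rng(7)
m, dphi, dy = 30, 6, 2
M = rng.standard_normal((m, dphi))      # basis matrix M(X), fixed with respect to R
Y = rng.standard_normal((m, dy))

# Global minimum of (1/2) ||M R - Y||_F^2 over R, via least squares.
R, *_ = np.linalg.lstsq(M, Y, rcond=None)
global_min = 0.5 * np.linalg.norm(M @ R - Y) ** 2

# The same value, written with the null-space projection P_N[M] Y.
PN_M = np.eye(m) - M @ np.linalg.pinv(M)
assert np.isclose(global_min, 0.5 * np.linalg.norm(PN_M @ Y) ** 2)
```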

Theorem 2.
For any $t \in \{0,1,\dots,H\}$, every differentiable local minimum $\theta \in \Theta_{d_{H+1},t}$ of $L$ satisfies, for any subsequence $S \subseteq (t, t+1, \dots, H)$ (including the case of $S$ being the empty sequence),
$L(\theta) \leq \underbrace{\frac{1}{2}\left\|P_N[\Phi^{(S)}]Y\right\|_F^2}_{\text{global minimum value of basis function regression with basis matrix }\Phi^{(S)}} - \underbrace{\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\frac{1}{2}\left\|P[N_{k_l}^{(l)}P_N[\bar{\Phi}^{(S)}]D_{k_l}^{(l)}]\operatorname{vec}(Y)\right\|_2^2}_{\substack{\geq 0 \\ \text{further improvement as a network gets wider and deeper}}},$
(5.1)
where $P_N[\Phi^{(S)}] \in \mathbb{R}^{m \times m}$, $P_N[\bar{\Phi}^{(S)}] \in \mathbb{R}^{md_{H+1} \times md_{H+1}}$, $\Phi^{(S)} = [\Phi^{(l)}]_{l \in S}$, and $\bar{\Phi}^{(S)} = [I_{d_{H+1}} \otimes \Phi^{(l)}]_{l \in S}$. If $S$ is empty, $P_N[\Phi^{(S)}] = I_m$ and $P_N[\bar{\Phi}^{(S)}] = I_{md_{H+1}}$. The matrices $D_{k_l}^{(l)}$ and $N_{k_l}^{(l)}$ are defined as in theorem 1, with the exception that $Q_{k_l}^{(l)} := N_{k_l}^{(l)}P_N[\bar{\Phi}^{(S)}]D_{k_l}^{(l)}$ (instead of $Q_{k_l}^{(l)} := N_{k_l}^{(l)}D_{k_l}^{(l)}$).
Remark 1.

From theorem 2 (or corollary 1), one can see the following properties of the loss landscape:

• i.

Every differentiable local minimum $\theta \in \Theta_{d_{H+1},t}$ has a loss value $L(\theta)$ better than or equal to any global minimum value of basis function regression with any combination of the basis matrices in the set $\{\Phi^{(l)}\}_{l=t}^{H}$ of fixed deep hierarchical representation matrices. In particular, with $t = 0$, every differentiable local minimum $\theta \in \Theta_{d_{H+1},0}$ has a loss value $L(\theta)$ no worse than the global minimum value of standard basis function regression with the handcrafted basis matrix $\Phi^{(0)} = X$, and of basis function regression with the larger basis matrix $[\Phi^{(l)}]_{l=0}^{H}$.

• ii.

As $d_l$ and $H$ increase (or, equivalently, as a neural network gets wider and deeper), the upper bound on the loss values of local minima can further improve.

The proof of theorem 2 is provided in section A.2. The proof combines the idea presented in section 3.1 with perturbations of a local minimum candidate. That is, if $\theta$ is a local minimum, then $\theta$ is a global minimum within a local region (i.e., a neighborhood of $\theta$). Thus, after perturbing $\theta$ as $\theta' = \theta + \Delta\theta$ such that $\|\Delta\theta\|$ is sufficiently small (so that $\theta'$ stays in the local region) and $L(\theta') = L(\theta)$, the perturbed point $\theta'$ must still be a global minimum within the local region and hence is also a local minimum. The proof idea of theorem 2 is to apply the proof sketch in section 3.1 not only to a local minimum candidate $\theta$ but also to its perturbations $\theta' = \theta + \Delta\theta$.

In terms of overparameterization, theorem 2 states that local minima of deep neural networks are as good as global minima of the corresponding basis function regression even without overparameterization, and overparameterization helps to further improve the guarantee on local minima. The effect of overparameterization is captured in both the first and second terms on the right-hand side of equation 5.1. As depth and width increase, the second term tends to increase, and hence the guarantee on local minima can improve. Moreover, as depth and width increase (for some of the $(t+1)$th to $H$th layers in theorem 2), the first term tends to decrease, and the guarantee on local minima can also improve. For example, if $[\Phi^{(l)}]_{l=t}^{H}$ has rank at least $m$, then the first term is zero and, hence, every local minimum is a global minimum with zero loss value. As a special case of this example, since every $\theta$ is automatically in $\Theta_{d_{H+1},H}$, if $\Phi^{(H)}$ is forced to have rank at least $m$, every local minimum becomes a global minimum for standard deep nonlinear neural networks, which coincides with the observation about overparameterization by Livni et al. (2014).

Without overparameterization, theorem 2 also recovers one of the main results in the literature of deep linear neural networks as a special case: every local minimum is a global minimum. If $d_{H+1}\leq\min\{d_l:1\leq l\leq H\}$, every local minimum $\theta$ for deep linear networks is differentiable and in $\Theta_{d_{H+1},0}$, and hence theorem 1 yields that $L(\theta)\leq\frac{1}{2}\|P_{N[X]}Y\|_F^2$. Because $\frac{1}{2}\|P_{N[X]}Y\|_F^2$ is the global minimum value, this implies that every local minimum is a global minimum for deep linear neural networks.
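The quantity $\frac{1}{2}\|P_{N[X]}Y\|_F^2$ above is exactly the global minimum value of linear regression with design matrix $X$. The following NumPy sketch (with arbitrary illustrative sizes, not taken from the experiments) checks this identity numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d0, dy = 50, 5, 3
X = rng.standard_normal((m, d0))   # basis matrix Phi(0) = X
Y = rng.standard_normal((m, dy))   # targets

# Projection onto the null space of X^T, i.e., the orthogonal
# complement of the column space of X.
P_null = np.eye(m) - X @ np.linalg.pinv(X)

# Global minimum value of (1/2)||X W - Y||_F^2 over W, two ways.
W_star, *_ = np.linalg.lstsq(X, Y, rcond=None)
direct = 0.5 * np.linalg.norm(X @ W_star - Y, "fro") ** 2
via_projection = 0.5 * np.linalg.norm(P_null @ Y, "fro") ** 2
assert np.isclose(direct, via_projection)
```

The agreement of the two values illustrates why the projected-target norm can serve as the reference level against which loss values of local minima are compared.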

Corollary 1 states that the same conclusion and discussions as in theorem 2 hold true even if we fix the edges in condition iii in definition 1 to be zero (by removing them as an architectural design or by forcing it with a learning algorithm) and consider optimization problems only with remaining edges.

Corollary 1.
For any $t\in\{0,1,\dots,H\}$, every differentiable local minimum $\theta\in\Theta_{d_{H+1},t}$ of $L|_{\mathcal{I}}$ satisfies, for any subsequence $S\subseteq(t,t+1,\dots,H)$ (including the case of $S$ being the empty sequence),
$L(\theta) \leq \underbrace{\tfrac{1}{2}\left\|P_{N[\Phi(S)]}Y\right\|_F^2}_{\text{global minimum value of basis function regression with basis matrix }\Phi(S)} - \underbrace{\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\underbrace{\tfrac{1}{2}\left\|P\left[\hat{N}^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}\hat{D}^{(l)}_{k_l}\right]\operatorname{vec}(Y)\right\|_2^2}_{\geq 0}}_{\text{further improvement as a network gets wider and deeper}},$
(5.2)
where $L|_{\mathcal{I}}$ is the restriction of $L$ to $\mathcal{I}=\{\theta'\in\mathbb{R}^{d_\theta}:\forall l\in\{t+1,\dots,H-1\},\ \forall(k',k)\in S^{(l)}\times\overline{S^{(l+1)}},\ W^{(l+1)}(\theta')_{k',k}=0\}$ with the index sets $S^{(t+1)},S^{(t+2)},\dots,S^{(H+1)}$ of the $\theta\in\Theta_{d_{H+1},t}$ in definition 1 and $\overline{S^{(l)}}:=\{1,\dots,d_l\}\setminus S^{(l)}$. Here, $\Phi(S)$ and $\bar{\Phi}(S)$ are defined in theorem 2, and the matrices $\hat{D}^{(l)}_{k_l}$ and $\hat{N}^{(l)}_{k_l}$ are defined as follows. For all $l\in\{1,\dots,t+1\}$, $\hat{D}^{(l)}_{k_l}:=D^{(l)}_{k_l}$ for all $k_l\in\{1,\dots,d_l\}$ (where $D^{(l)}_{k_l}$ is defined in theorem 2). For all $l\in\{t+2,\dots,H\}$, $\hat{D}^{(l)}_{k_l}:=D^{(l)}_{k_l}$ for all $k_l\in S^{(l)}$, and
$\hat{D}^{(l)}_{k_l}:=\sum_{k_{l+1}=1}^{d_{l+1}}\cdots\sum_{k_H=1}^{d_H}\left(W^{(l+1)}_{k_l,k_{l+1}}\cdots W^{(H)}_{k_{H-1},k_H}W^{(H+1)}_{k_H,\cdot}\right)^{\top}\otimes\Lambda_{l,k_l}\cdots\Lambda_{H,k_H}\left[\Phi^{(l-1)}_{\cdot,j}\right]_{j\in\overline{S^{(l-1)}}},$
with $\hat{D}^{(H)}_{k_H}:=(W^{(H+1)}_{k_H,\cdot})^{\top}\otimes\Lambda_{H,k_H}[\Phi^{(H-1)}_{\cdot,j}]_{j\in\overline{S^{(H-1)}}}$ for all $k_l\in\overline{S^{(l)}}$. For any $l\in\{1,\dots,H\}$ and any $k_l\in\{1,\dots,d_l\}$, $\hat{N}^{(l)}_{k_l}:=P_{N[\bar{Q}^{(l)}_{k_l-1}]}$ with $\hat{N}^{(1)}_{1}:=I_m$, where $\bar{Q}^{(l)}_{k_l}:=[Q^{(1)}_{1},\dots,Q^{(1)}_{d_1},Q^{(2)}_{1},\dots,Q^{(2)}_{d_2},\dots,Q^{(l)}_{1},\dots,Q^{(l)}_{k_l}]$, $Q^{(l)}_{k_l}:=\hat{N}^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}\hat{D}^{(l)}_{k_l}$, and $\bar{Q}^{(l)}_{0}:=\bar{Q}^{(l-1)}_{d_{l-1}}$.

The proof of corollary 1 is provided in section A.3 and follows the proof of theorem 2. Here, $\Phi^{(0)}=X$ consists of the training inputs $x_i$ in an arbitrary given feature space embedded in $\mathbb{R}^{d_0}$; for example, given a raw input $x_{\mathrm{raw}}$ and any feature map $\varphi:x_{\mathrm{raw}}\mapsto\varphi(x_{\mathrm{raw}})\in\mathbb{R}^{d_0}$ (including the identity $\varphi(x_{\mathrm{raw}})=x_{\mathrm{raw}}$), we write $x=\varphi(x_{\mathrm{raw}})$. Therefore, theorem 2 and corollary 1 state that every differentiable local minimum of deep neural networks can be guaranteed to be no worse than any given basis function regression model with a handcrafted basis taking values in $\mathbb{R}^{d}$ for some finite $d$, such as polynomial regression with a finite degree or radial basis function regression with a finite number of centers.
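For instance, with a handcrafted polynomial feature map $\varphi$, the reference value $\frac{1}{2}\|P_{N[\Phi^{(0)}]}Y\|_F^2$ that bounds every differentiable local minimum in $\Theta_{d_{H+1},0}$ can be computed directly. The degree, sample size, and target function below are illustrative assumptions, not from the letter:

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
x_raw = rng.uniform(-1.0, 1.0, size=m)

# Hypothetical handcrafted feature map: polynomial basis of degree 3,
# so each row of Phi0 is [1, x, x^2, x^3].
Phi0 = np.vander(x_raw, N=4, increasing=True)
Y = np.sin(2.0 * x_raw)                       # an arbitrary target

# Global minimum value of basis function regression with basis Phi0;
# differentiable local minima of the network are no worse than this.
P_null = np.eye(m) - Phi0 @ np.linalg.pinv(Phi0)
bound = 0.5 * np.linalg.norm(P_null @ Y) ** 2
```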

To illustrate an advantage of the notion of weakly separated edges in definition 1, one can consider the following alternative definition that requires strongly separated edges.

Definition 2.

A parameter vector $\theta$ is said to induce $(n,t)$ strongly separated linear units on the training input data set $X$ if there exist $(H+1-t)$ sets $S^{(t+1)},S^{(t+2)},\dots,S^{(H+1)}$ such that for all $l\in\{t+1,t+2,\dots,H+1\}$, conditions i to iii in definition 1 hold and $\Phi^{(l)}(X,\theta)W^{(l+1)}(\theta)_{\cdot,k}=\sum_{k'\in S^{(l)}}\Phi^{(l)}(X,\theta)_{\cdot,k'}W^{(l+1)}(\theta)_{k',k}$ for all $k\in S^{(l+1)}$ if $l\notin\{H,H+1\}$.

Let $\Theta^{\mathrm{strong}}_{n,t}$ be the set of all parameter vectors that induce $(n,t)$ strongly separated linear units on the particular training input data set $X$ that defines the total loss $L(\theta)$ in equation 2.1. Figure 2 shows a comparison of weakly separated edges and strongly separated edges. Under this stronger restriction on the local structure, we can obtain corollary 2.

Corollary 2.
For any $t\in\{0,1,\dots,H\}$, every differentiable local minimum $\theta\in\Theta_{d_{H+1},t}$ of $L$ satisfies that for any $S\subseteq(t,H)$,
$L(\theta)\leq\frac{1}{2}\left\|P_{N[\Phi(S)]}Y\right\|_F^2-\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\frac{1}{2}\left\|P\left[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}\right]\operatorname{vec}(Y)\right\|_2^2,$
where $\Phi(S)$, $\bar{\Phi}(S)$, $D^{(l)}_{k_l}$, and $N^{(l)}_{k_l}$ are defined in theorem 2.

The proof of corollary 2 is provided in section A.4 and follows the proof of theorem 2. As a special case, corollary 2 also recovers the statement that every local minimum is a global minimum for deep linear neural networks, in the same way as theorem 2. When compared with theorem 2, one can see that the statement in corollary 2 is weaker, producing the upper bound only in terms of $S\subseteq(t,H)$. This is because the restriction to strongly separated units forces neural networks to have less expressive power with fewer effective edges. This illustrates an advantage of the notion of weakly separated edges in definition 1.

A limitation in theorems 1 and 2 and corollary 1 is the lack of treatment of nondifferentiable local minima. The Lebesgue measure of nondifferentiable points is zero, but this does not imply that the appropriate measure of nondifferentiable points is small. For example, if $L(θ)=|θ|$, the Lebesgue measure of the nondifferentiable point ($θ=0$) is zero, but the nondifferentiable point is the only local and global minimum. Thus, the treatment of nondifferentiable points in this context is a nonnegligible problem. The proofs of theorems 1 and 2 and corollary 1 are all based on the proof sketch in section 3.1, which heavily relies on the differentiability. Thus, the current proofs do not trivially extend to address this open problem.

In this letter, we have theoretically and empirically analyzed the effect of depth and width on the loss values of local minima, with and without a possible local nonlinear-linear structure. The local nonlinear-linear structure we have considered might naturally arise during training and is also guaranteed to emerge by using specific learning algorithms or architecture designs. With the local nonlinear-linear structure, we have proved that the values of local minima of neural networks are no worse than the global minimum values of corresponding basis function regression and can improve as depth and width increase. In the general case without the possible local structure, we have theoretically shown that increasing the depth and width can improve the quality of local minima, and we have empirically supported this theoretical observation. Furthermore, without the local structure but with a shallow neural network and a gaussian data matrix, we have proved probabilistic bounds on the rates of improvement of the local minimum values with respect to width. Moreover, we have discussed a major limitation of this letter: all of its results focus on the differentiable points on the loss surfaces. Additional treatments of the nondifferentiable points are left to future research.

Our results suggest that the values of local minima are not arbitrarily poor (unless one crafts a pathological worst-case example) and can be guaranteed to some desired degree in practice, depending on the degree of overparameterization, as well as the local or global structural assumption. Indeed, a structural assumption, namely the existence of an identity map, was recently used to analyze the quality of local minima (Shamir, 2018; Kawaguchi & Bengio, 2018). When compared with these previous studies (Shamir, 2018; Kawaguchi & Bengio, 2018), we have shown the effect of depth and width, as well as considered a different type of neural network without the explicit identity map.

In practice, we often “overparameterize” a hypothesis space in deep learning in a certain sense (e.g., in terms of expressive power). Theoretically, with strong overparameterization assumptions, we can show that every stationary point (including all local minima) with respect to a single layer is a global minimum with the zero training error and can memorize any data set. However, “overparameterization” in practice may not satisfy such strong overparameterization assumptions in the theoretical literature. In contrast, our results in this letter do not require overparameterization and show the gradual effects of overparameterization as consequences of general results.

Let $D^{(l)}_{k_l}$ be defined as in theorem 2. Let $D^{(l)}:=[D^{(l)}_{k}]_{k=1}^{d_l}\in\mathbb{R}^{md_{H+1}\times d_ld_{l-1}}$ and $D:=[D^{(l)}]_{l=1}^{H}\in\mathbb{R}^{md_{H+1}\times\sum_{l=1}^{H}d_{l-1}d_l}$. Given a matrix-valued function $f(\theta)\in\mathbb{R}^{d'\times d}$, let $\partial_{W^{(l)}}f(\theta):=\frac{\partial\operatorname{vec}(f)}{\partial\operatorname{vec}(W^{(l)})}\in\mathbb{R}^{d'd\times d_{l-1}d_l}$ be the partial derivative of $\operatorname{vec}(f)$ with respect to $\operatorname{vec}(W^{(l)})$. Let $\{j,j+1,\dots,j'\}:=\emptyset$ if $j>j'$. Let $M^{(l)}M^{(l+1)}\cdots M^{(l')}=I$ if $l>l'$. Let $\operatorname{Null}(M)$ be the null space of a matrix $M$. Let $B(\theta,\varepsilon)$ be an open ball of radius $\varepsilon$ centered at $\theta$.

The following lemma decomposes the model output $Y^$ in terms of the weight matrix $W(l)$ and $D(l)$ that coincides with its derivatives at differentiable points.

Lemma 1.
For all $l\in\{1,\dots,H\}$,
$\operatorname{vec}(\hat{Y}(X,\theta))=D^{(l)}\operatorname{vec}(W^{(l)}(\theta)),$
and at any differentiable $\theta$,
$\partial_{W^{(l)}}\hat{Y}(X,\theta)=D^{(l)}.$
Proof.
Define $G^{(l)}$ to be the preactivation output of the $l$th hidden layer as $G^{(l)}:=G^{(l)}(X,\theta):=\Phi^{(l-1)}(X,\theta)W^{(l)}$. By the linearity of the vec operation and the definition of $G^{(l)}$, we have that
$\operatorname{vec}[G^{(l+1)}(X,\theta)]=\operatorname{vec}\left[\sum_{k=1}^{d_l}\Lambda_{l,k}G^{(l)}(X,\theta)_{\cdot,k}W^{(l+1)}_{k,\cdot}\right]=\sum_{k=1}^{d_l}\left[(W^{(l+1)}_{k,\cdot})^{\top}\otimes\Lambda_{l,k}\right]\operatorname{vec}\left[G^{(l)}(X,\theta)_{\cdot,k}\right]=F^{(l+1)}\operatorname{vec}\left[G^{(l)}(X,\theta)\right],$
where $F^{(l+1)}:=[(W^{(l+1)}_{k,\cdot})^{\top}\otimes\Lambda_{l,k}]_{k=1}^{d_l}$. Therefore,
$\operatorname{vec}(\hat{Y})=F^{(H+1)}F^{(H)}\cdots F^{(l+1)}\operatorname{vec}(G^{(l)})=F^{(H+1)}\cdots F^{(l+1)}[I_{d_l}\otimes\Phi^{(l-1)}]\operatorname{vec}(W^{(l)}),$
where $F^{(H+1)}\cdots F^{(l+1)}[I_{d_l}\otimes\Phi^{(l-1)}]=[D^{(l)}_{1}\ D^{(l)}_{2}\ \cdots\ D^{(l)}_{d_l}]=D^{(l)}$, which proves the first statement that $\operatorname{vec}(\hat{Y})=D^{(l)}\operatorname{vec}(W^{(l)})$. The second statement follows from the fact that the derivatives of $D^{(l)}$ with respect to $\operatorname{vec}(W^{(l)})$ are zero at any differentiable point, and hence $\partial_{W^{(l)}}\hat{Y}=D^{(l)}+0$.$□$
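The manipulation above repeatedly uses the identity $\operatorname{vec}(AXB)=(B^{\top}\otimes A)\operatorname{vec}(X)$ with column-stacking $\operatorname{vec}$. A quick NumPy check of this identity (note the Fortran order for column stacking):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
Xm = rng.standard_normal((3, 2))
B = rng.standard_normal((2, 5))

def vec(M):
    # Column-stacking vec operator.
    return M.reshape(-1, order="F")

lhs = vec(A @ Xm @ B)
rhs = np.kron(B.T, A) @ vec(Xm)
assert np.allclose(lhs, rhs)
```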

Lemma 2 generalizes part of theorem A.45 in Rao, Toutenburg, Shalabh, and Heumann (2007) by discarding invertibility assumptions.

Lemma 2.
For any block matrix $[A\ \ B]$ with real submatrices $A$ and $B$ such that $A^{\top}B=0$,
$P[A\ \ B]=P[A]+P[B].$
Proof.
It follows from a straightforward calculation:
$P[A\ \ B]=[A\ \ B]\begin{bmatrix}A^{\top}A&0\\0&B^{\top}B\end{bmatrix}^{\dagger}[A\ \ B]^{\top}=[A\ \ B]\begin{bmatrix}(A^{\top}A)^{\dagger}&0\\0&(B^{\top}B)^{\dagger}\end{bmatrix}[A\ \ B]^{\top}=P[A]+P[B].$
$□$
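Lemma 2 can be checked numerically by building a block whose columns are orthogonal to those of a random matrix (the sizes below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m = 8
A = rng.standard_normal((m, 3))

# Columns of B orthogonal to col(A): project random vectors onto the
# orthogonal complement of col(A).
P_A = A @ np.linalg.pinv(A)
B = (np.eye(m) - P_A) @ rng.standard_normal((m, 2))
assert np.allclose(A.T @ B, 0.0)

AB = np.hstack([A, B])
P_AB = AB @ np.linalg.pinv(AB)
P_B = B @ np.linalg.pinv(B)
assert np.allclose(P_AB, P_A + P_B)    # P[[A B]] = P[A] + P[B]
```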

Lemma 3 decomposes a norm of a projected target vector into a form that clearly shows an effect of depth and width.

Lemma 3.
For any $t\in\{0,1,\dots,H\}$ and any $S\subseteq(t,t+1,\dots,H)$,
$\left\|P\left[P_{N[\bar{\Phi}(S)]}D\right]\operatorname{vec}(Y)\right\|_2^2=\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\left\|P\left[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}\right]\operatorname{vec}(Y)\right\|_2^2.$
Proof.
Since the span of the columns of $[A\ \ B]$ is the same as the span of the columns of $[A\ \ P_{N[A]}B]$ for submatrices $A$ and $B$, the span of the columns of $P_{N[\bar{\Phi}(S)]}D=[[P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}]_{k_l=1}^{d_l}]_{l=1}^{H}$ is the same as the span of the columns of $[[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}]_{k_l=1}^{d_l}]_{l=1}^{H}$. Then, by repeatedly applying lemma 2 to each block of $[[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}]_{k_l=1}^{d_l}]_{l=1}^{H}$, we have that
$P\left[P_{N[\bar{\Phi}(S)]}D\right]=P\left[\left[[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}]_{k_l=1}^{d_l}\right]_{l=1}^{H}\right]=\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}P\left[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}\right].$
From the construction of $N^{(l)}_{k_l}$, we have that for all $(l,k)\neq(l',k')$,
$P\left[N^{(l)}_{k}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k}\right]P\left[N^{(l')}_{k'}P_{N[\bar{\Phi}(S)]}D^{(l')}_{k'}\right]=0.$
Therefore,
$\left\|P\left[P_{N[\bar{\Phi}(S)]}D\right]\operatorname{vec}(Y)\right\|_2^2=\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\left\|P\left[N^{(l)}_{k_l}P_{N[\bar{\Phi}(S)]}D^{(l)}_{k_l}\right]\operatorname{vec}(Y)\right\|_2^2.$
$□$
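The resulting decomposition of the squared norm of a projection into a sum over mutually orthogonal blocks can be illustrated as follows (three orthonormal blocks built by QR, an illustrative construction):

```python
import numpy as np

rng = np.random.default_rng(4)
m = 10
# Mutually orthogonal blocks from a QR factorization.
Q, _ = np.linalg.qr(rng.standard_normal((m, 6)))
blocks = [Q[:, :2], Q[:, 2:4], Q[:, 4:6]]

def proj(M):
    # Orthogonal projection onto the column space of M.
    return M @ np.linalg.pinv(M)

y = rng.standard_normal(m)
total = np.linalg.norm(proj(np.hstack(blocks)) @ y) ** 2
parts = sum(np.linalg.norm(proj(M) @ y) ** 2 for M in blocks)
assert np.isclose(total, parts)
```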

The following lemma plays a major role in the proof of theorem 2.

Lemma 4.
For any $t\in\{0,1,\dots,H\}$, every differentiable local minimum $\theta\in\Theta_{d_{H+1},t}$ satisfies that for any $l\in\{t,t+1,\dots,H\}$,
$(\Phi^{(l)})^{\top}(\hat{Y}(X,\theta)-Y)=0.$
Proof.
Fix $t$ to be a number in $\{0,1,\dots,H\}$. Let $\theta$ be a differentiable local minimum in $\Theta_{d_{H+1},t}$. Then, from the definition of a local minimum, there exists $\varepsilon_1>0$ such that $L(\theta)\leq L(\theta')$ for all $\theta'\in B(\theta,\varepsilon_1)$, and hence $L(\theta)\leq L(\theta')$ for all $\theta'\in\tilde{B}(\theta,\varepsilon_1)\subseteq B(\theta,\varepsilon_1)$, where $\tilde{B}(\theta,\varepsilon_1):=B(\theta,\varepsilon_1)\cap\{\theta\in\mathbb{R}^{d_\theta}:W^{(l+1)}(\theta)_{k',k}=0\text{ for all }l\in\{t+1,t+2,\dots,H-1\}\text{ and all }(k',k)\in S^{(l)}\times(\{1,\dots,d_{l+1}\}\setminus S^{(l+1)})\}$ with the index sets $S^{(t+1)},S^{(t+2)},\dots,S^{(H+1)}$ of the $\theta\in\Theta_{d_{H+1},t}$ in definition 1. Without loss of generality, we can permute the indices of the units within each layer such that for all $l\in\{t+1,t+2,\dots,H+1\}$, $S^{(l)}\supseteq\{1,2,\dots,d_L\}$ with some $d_L\geq d_{H+1}$ in the definition of $\Theta_{d_{H+1},t}$ (see definition 1). Note that the considered activation functions $\sigma^{(l)}_{i,j}(z)$ are continuous and act linearly on $z\geq0$. Thus, from the definition of $\Theta_{d_{H+1},t}$, there exists $\varepsilon_2>0$ such that for all $\theta'\in\tilde{B}(\theta,\varepsilon_2)$ and all $l\in\{t,t+1,\dots,H\}$,
$\hat{Y}(X,\theta')=\Phi^{(l)}\begin{bmatrix}A^{(l+1)}\\C^{(l+1)}\end{bmatrix}A^{(l+2)}\cdots A^{(H+1)}+\sum_{l'=l}^{H-1}Z^{(l'+1)}C^{(l'+2)}A^{(l'+3)}\cdots A^{(H+1)},$
(A.1)
where $A^{(l)}$, $B^{(l)}$, and $C^{(l)}$ are submatrices of $W^{(l)}(\theta')$, and $Z^{(l)}$ is a submatrix of $\Phi^{(l)}(X,\theta')$, as defined below:
$\begin{bmatrix}A^{(l)}&\xi^{(l)}\\C^{(l)}&B^{(l)}\end{bmatrix}:=W^{(l)}(\theta'),$
and
$Z^{(t+1)}:=\sigma^{(t+1)}\left(\Phi^{(t)}\begin{bmatrix}\xi^{(t+1)}\\B^{(t+1)}\end{bmatrix}\right)\quad\text{with}\quad Z^{(l)}:=\sigma^{(l)}(Z^{(l-1)}B^{(l)})\ \text{ for }l\geq t+2.$
Note that $Z^{(l)}$ depends only on $\Phi^{(t)}$, $\xi^{(t+1)}$, and $B^{(k)}$ for all $k\leq l$. Here, $\Phi^{(t)}$ does not depend on $A^{(l)}$ and $C^{(l)}$ for all $l\geq t+1$. That is, at each layer $l\in\{t+2,t+3,\dots,H\}$, $A^{(l)}\in\mathbb{R}^{d_L\times d_L}$ connects $d_L$ linearly acting units to the next $d_L$ linearly acting units, $B^{(l)}\in\mathbb{R}^{(d_{l-1}-d_L)\times(d_l-d_L)}$ connects the other units to the next other units (other units can include both nonlinear and linearly acting units), and $C^{(l)}\in\mathbb{R}^{(d_{l-1}-d_L)\times d_L}$ connects the other units to the next linearly acting units, with $d_L\geq d_{H+1}$. Here, $A^{(t+1)}$, $B^{(t+1)}$, $C^{(t+1)}$, and $\xi^{(t+1)}$ connect the possibly unstructured layer $\Phi^{(t)}$ to the next structured layer, $C^{(H+1)}\in\mathbb{R}^{(d_H-d_L)\times d_{H+1}}$ connects the other units in the last hidden layer to the output units, and $A^{(H+1)}\in\mathbb{R}^{d_L\times d_{H+1}}$ connects the linearly acting units in the last hidden layer to the output units.
Let $\varepsilon_3=\min(\varepsilon_1,\varepsilon_2)$. Let $l$ be an arbitrary fixed number in $\{t,t+1,\dots,H\}$ in the following. Let $r:=\hat{Y}(X,\theta)-Y$. Define
$R^{(l+1)}:=\begin{bmatrix}A^{(l+1)}\\C^{(l+1)}\end{bmatrix}.$
From the condition of a differentiable local minimum, we have that
$0=\partial_{R^{(l+1)}}L(\theta)=\operatorname{vec}\left((\Phi^{(l)})^{\top}r\,(A^{(l+2)}\cdots A^{(H+1)})^{\top}\right),$
since otherwise $R^{(l+1)}$ could be moved in the direction of $-\partial_{R^{(l+1)}}L(\theta)$ with a sufficiently small magnitude $\varepsilon_3'\in(0,\varepsilon_3)$ to decrease the loss value. This implies that
$(\Phi^{(l)})^{\top}r\,(A^{(l+2)}\cdots A^{(H+1)})^{\top}=0.$
If $\operatorname{rank}(A^{(l+2)}\cdots A^{(H+1)})\geq d_{H+1}$ or $l=H$, then this equation yields the desired statement of this lemma as $(\Phi^{(l)})^{\top}r=0$. Hence, the rest of this proof considers the case of
$\operatorname{rank}((A^{(l+2)}\cdots A^{(H+1)})^{\top})<d_{H+1}.$
Define an index $l^*$ as
$l^*:=\min\{l'\in\mathbb{Z}_+:l+3\leq l'\leq H+2\ \wedge\ \operatorname{rank}(A^{(l')}\cdots A^{(H+1)})\geq d_{H+1}\},$
where $A^{(H+2)}\cdots A^{(H+1)}:=I_{d_{H+1}}$. This minimum exists since the set contains at least $H+2$ (it is nonempty) and is finite. Then we have that $\operatorname{rank}(A^{(l^*)}\cdots A^{(H+1)})\geq d_{H+1}$ and $\operatorname{rank}(A^{(l')}\cdots A^{(H+1)})<d_{H+1}$ for all $l'\in\{l+2,l+3,\dots,l^*-1\}$, since $\operatorname{rank}(M_1M_2)\leq\min(\operatorname{rank}(M_1),\operatorname{rank}(M_2))$. Therefore, for all $l'\in\{l+1,l+2,\dots,l^*-2\}$, we have that $\operatorname{Null}((A^{(l'+1)}\cdots A^{(H+1)})^{\top})\neq\{0\}$, and there exists a vector $u_{l'}\in\mathbb{R}^{d_L}$ such that
$u_{l'}\in\operatorname{Null}((A^{(l'+1)}\cdots A^{(H+1)})^{\top})\quad\text{and}\quad\|u_{l'}\|_2=1.$
Let $u_{l'}$ denote such a vector for all $l'\in\{l+1,l+2,\dots,l^*-2\}$. For all $l'\in\{l+2,l+3,\dots,l^*-2\}$, define
$\tilde{A}^{(l')}(\nu_{l'}):=A^{(l')}+\nu_{l'}u_{l'}^{\top}\quad\text{and}\quad\tilde{R}^{(l+1)}(\nu_{l+1}):=R^{(l+1)}+\nu_{l+1}u_{l+1}^{\top},$
where $\nu_{l'}\in\mathbb{R}^{d_L}$ and $\nu_{l+1}\in\mathbb{R}^{d_l}$. Let $\tilde{\theta}(\nu_{l+1},\nu_{l+2},\dots,\nu_{l^*-2})$ be $\theta$ with $A^{(l')}$ and $R^{(l+1)}$ replaced by $\tilde{A}^{(l')}(\nu_{l'})$ and $\tilde{R}^{(l+1)}(\nu_{l+1})$ for all $l'\in\{l+2,l+3,\dots,l^*-2\}$. Then for any $(\nu_{l+1},\nu_{l+2},\dots,\nu_{l^*-2})$,
$\hat{Y}(X,\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))=\hat{Y}(X,\theta)\quad\text{and}\quad L(\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))=L(\theta),$
since $\tilde{A}^{(l')}(\nu_{l'})A^{(l'+1)}\cdots A^{(H+1)}=A^{(l')}A^{(l'+1)}\cdots A^{(H+1)}$ for all $l'\in\{l+2,l+3,\dots,l^*-2\}$ and $\tilde{R}^{(l+1)}(\nu_{l+1})A^{(l+2)}\cdots A^{(H+1)}=R^{(l+1)}A^{(l+2)}\cdots A^{(H+1)}$.
For any sufficiently small vector $(\nu_{l+1},\dots,\nu_{l^*-2})$ such that $\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2})\in\tilde{B}(\theta,\varepsilon_3/2)$, if $\theta$ is a local minimum, every $\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2})$ is also a local minimum with respect to the entries of $A^{(l')}$, $B^{(l')}$, and $C^{(l')}$ for all $l'$, because there exists $\varepsilon_3'=\varepsilon_3/2>0$ such that
$L(\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))=L(\theta)\leq L(\theta')$
for all $\theta'\in\tilde{B}(\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}),\varepsilon_3')\subseteq\tilde{B}(\theta,\varepsilon_3)\subseteq\tilde{B}(\theta,\varepsilon_1)\subseteq B(\theta,\varepsilon_1)$, where the first inclusion follows from the triangle inequality. Thus, for any such $\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2})$ in the sufficiently small open ball, we have that
$\partial_{A^{(l^*-1)}}L(\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))=0,$
where $\partial_{A^{(l^*-1)}}L(\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))$ exists within the sufficiently small open ball from equation A.1 (composed with the squared loss). In particular, by setting $\nu_{l+1}=0$ and noticing that $\hat{Y}(X,\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))-Y=\hat{Y}(X,\theta)-Y=r$,
$0=\partial_{A^{(l^*-1)}}L(\tilde{\theta}(0,\nu_{l+2},\dots,\nu_{l^*-2}))=\left[\partial_{A^{(l^*-1)}}\hat{Y}(X,\tilde{\theta}(0,\nu_{l+2},\dots,\nu_{l^*-2}))\right]^{\top}\operatorname{vec}(r),$
and hence
$0=\partial_{A^{(l^*-1)}}L(\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2}))=\left[\partial_{A^{(l^*-1)}}\left(\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\bar{A}^{(l+2)}\cdots\bar{A}^{(H+1)}\right)+\partial_{A^{(l^*-1)}}\hat{Y}(X,\tilde{\theta}(0,\nu_{l+2},\dots,\nu_{l^*-2}))\right]^{\top}\operatorname{vec}(r)=\left[\partial_{A^{(l^*-1)}}\left(\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\bar{A}^{(l+2)}\cdots\bar{A}^{(H+1)}\right)\right]^{\top}\operatorname{vec}(r),$
where
$\bar{A}^{(l')}=\begin{cases}\tilde{A}^{(l')}(\nu_{l'})&\text{if }l'\in\{l+2,\dots,l^*-2\},\\A^{(l')}&\text{if }l'\notin\{l+2,\dots,l^*-2\}.\end{cases}$
Since
$\partial_{A^{(l^*-1)}}\left(\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\bar{A}^{(l+2)}\cdots\bar{A}^{(H+1)}\right)=(A^{(l^*)}\cdots A^{(H+1)})^{\top}\otimes\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}(\nu_{l+2})\cdots\tilde{A}^{(l^*-2)}(\nu_{l^*-2}),$
this implies that
$A^{(l^*)}\cdots A^{(H+1)}r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}(\nu_{l+2})\cdots\tilde{A}^{(l^*-2)}(\nu_{l^*-2})=0.$
By the definition of $l^*$, this implies that
$r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}(\nu_{l+2})\cdots\tilde{A}^{(l^*-2)}(\nu_{l^*-2})=0,$
where $\tilde{A}^{(l+2)}(\nu_{l+2})\cdots\tilde{A}^{(l+1)}(\nu_{l+1}):=I_{d_L}$.
We now show that for any sufficiently small vector $(\nu_{l+1},\dots,\nu_{l^*-2})$ such that $\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2})\in\tilde{B}(\theta,\varepsilon_3/2)$,
$r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}(\nu_{l+2})\cdots\tilde{A}^{(j)}(\nu_j)=0,$
by induction on the index $j$ in decreasing order $j=l^*-2,l^*-3,\dots,l+1$. The base case with $j=l^*-2$ is proven above. Let $\tilde{A}^{(l')}:=\tilde{A}^{(l')}(\nu_{l'})$. For the inductive step, assuming that the statement holds for $j$, we show that it holds for $j-1$:
$0=r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}\cdots\tilde{A}^{(j)}=r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}\cdots A^{(j)}+r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}\cdots\tilde{A}^{(j-1)}\nu_ju_j^{\top}=r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}\cdots\tilde{A}^{(j-1)}\nu_ju_j^{\top},$
where the last line follows from the fact that the first term in the second line is zero because of the inductive hypothesis with $\nu_j=0$. Since $\|u_j\|_2=1$, by multiplying both sides by $u_j$ from the right, we have that for any sufficiently small $\nu_j\in\mathbb{R}^{d_L}$,
$r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}\cdots\tilde{A}^{(j-1)}\nu_j=0,$
which implies that
$r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})\tilde{A}^{(l+2)}\cdots\tilde{A}^{(j-1)}=0.$
This completes the inductive step and proves that
$r^{\top}\Phi^{(l)}(\nu_{l+1}u_{l+1}^{\top})=0.$
Since $\|u_{l+1}\|_2=1$, by multiplying both sides by $u_{l+1}$ from the right, we have that for any sufficiently small $\nu_{l+1}\in\mathbb{R}^{d_l}$ such that $\tilde{\theta}(\nu_{l+1},\dots,\nu_{l^*-2})\in\tilde{B}(\theta,\varepsilon_3/2)$,
$(\hat{Y}(X,\theta)-Y)^{\top}\Phi^{(l)}\nu_{l+1}=0,$
which implies that
$(\Phi^{(l)})^{\top}(\hat{Y}(X,\theta)-Y)=0.$
$□$

### A.1  Proof of Theorem 1

From the first-order necessary condition of differentiable local minima,
$0=\partial_{W^{(H+1)}}L(\theta)=(D^{(H+1)}_{1})^{\top}\operatorname{vec}(\hat{Y}(X,\theta)-Y),$
and $\hat{Y}(X,\theta)=D^{(H+1)}_{1}W^{(H+1)}$. From lemma 1 and the first-order necessary condition of differentiable local minima, $D^{\top}\operatorname{vec}(\hat{Y}(X,\theta)-Y)=0$ and $\operatorname{vec}(\hat{Y})=\frac{1}{H}D\theta_{1:H}$, where $\theta_{1:H}:=\operatorname{vec}([W^{(l)}]_{l=1}^{H})$. Combining these, we have that
$\left[D\ \ D^{(H+1)}_{1}\right]^{\top}\left(\frac{1}{H+1}\left[D\ \ D^{(H+1)}_{1}\right]\theta-\operatorname{vec}(Y)\right)=0,$
where $\frac{1}{H+1}[D\ \ D^{(H+1)}_{1}]\theta=\operatorname{vec}(\hat{Y}(X,\theta))$. This implies that
$\operatorname{vec}(\hat{Y}(X,\theta))=P\left[D\ \ D^{(H+1)}_{1}\right]\operatorname{vec}(Y).$
Therefore,
$2L(\theta)=\left\|\operatorname{vec}(Y)-P\left[D\ \ D^{(H+1)}_{1}\right]\operatorname{vec}(Y)\right\|_2^2=\|\operatorname{vec}(Y)\|_2^2-\left\|P\left[D\ \ D^{(H+1)}_{1}\right]\operatorname{vec}(Y)\right\|_2^2,$
where the second equality follows from the idempotence and symmetry of the projection. Finally, decomposing the second term by directly following the proof of lemma 3 with $P_{N[\bar{\Phi}(S)]}D$ replaced by $[D\ \ D^{(H+1)}_{1}]$ yields the desired statement of this theorem.$□$
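The step $\|\operatorname{vec}(Y)-P\operatorname{vec}(Y)\|_2^2=\|\operatorname{vec}(Y)\|_2^2-\|P\operatorname{vec}(Y)\|_2^2$ is the Pythagorean identity for an orthogonal projection $P$; a minimal numerical check:

```python
import numpy as np

rng = np.random.default_rng(5)
m = 20
M = rng.standard_normal((m, 6))
P = M @ np.linalg.pinv(M)        # orthogonal projection onto col(M)
y = rng.standard_normal(m)

lhs = np.linalg.norm(y - P @ y) ** 2
rhs = np.linalg.norm(y) ** 2 - np.linalg.norm(P @ y) ** 2
assert np.isclose(lhs, rhs)      # uses idempotence and symmetry of P
```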

### A.2  Proof of Theorem 2

From lemma 4, we have that for any $l\in\{t,t+1,\dots,H\}$,
$(I_{d_{H+1}}\otimes\Phi^{(l)})^{\top}\operatorname{vec}(\hat{Y}(X,\theta)-Y)=0.$
(A.2)
From equation A.1, by noticing that $Z^{(l+1)}C^{(l+2)}=\Phi^{(l+1)}\left[0\ \ (C^{(l+2)})^{\top}\right]^{\top}$, we also have that
$\operatorname{vec}(\hat{Y}(X,\theta))=\sum_{l=t}^{H}(I_{d_{H+1}}\otimes\Phi^{(l)})\operatorname{vec}(\bar{R}^{(l+1)}A^{(l+2)}\cdots A^{(H+1)}),$
(A.3)
where $\bar{R}^{(t+1)}:=\left[(A^{(t+1)})^{\top}\ \ (C^{(t+1)})^{\top}\right]^{\top}$ and $\bar{R}^{(l)}:=\left[0\ \ (C^{(l)})^{\top}\right]^{\top}$ for $l\geq t+2$. From lemma 1 and the first-order necessary condition of differentiable local minima, we also have that
$D^{\top}\operatorname{vec}(\hat{Y}(X,\theta)-Y)=0$
(A.4)
and
$\operatorname{vec}(\hat{Y})=\frac{1}{H}D\theta_{1:H},$
(A.5)
where $\theta_{1:H}:=\operatorname{vec}([W^{(l)}]_{l=1}^{H})$.
Combining equations A.2 to A.5 yields
$\left[\bar{\Phi}\ \ D\right]^{\top}\left(\left[\bar{\Phi}\ \ D\right]\bar{\theta}-\operatorname{vec}(Y)\right)=0,$
where $[\bar{\Phi}\ \ D]\bar{\theta}=\operatorname{vec}(\hat{Y}(X,\theta))$, $\bar{\Phi}:=[I_{d_{H+1}}\otimes\Phi^{(l)}]_{l=t}^{H}$, and
$\bar{\theta}:=\frac{1}{2}\left[\left[\operatorname{vec}(\bar{R}^{(l+1)}A^{(l+2)}\cdots A^{(H+1)})^{\top}\right]_{l=t}^{H}\ \ \frac{1}{H}\theta_{1:H}^{\top}\right]^{\top}.$
This implies that
$\operatorname{vec}(\hat{Y}(X,\theta))=P\left[\bar{\Phi}\ \ D\right]\operatorname{vec}(Y).$
Therefore, for any $S\subseteq(t,t+1,\dots,H)$,
$2L(\theta)=\left\|\operatorname{vec}(Y)-P[\bar{\Phi}\ \ D]\operatorname{vec}(Y)\right\|_2^2\leq\left\|\operatorname{vec}(Y)-P[\bar{\Phi}(S)\ \ D]\operatorname{vec}(Y)\right\|_2^2=\left\|\operatorname{vec}(Y)-P[\bar{\Phi}(S)]\operatorname{vec}(Y)-P[P_{N[\bar{\Phi}(S)]}D]\operatorname{vec}(Y)\right\|_2^2=\left\|P_{N[\Phi(S)]}Y\right\|_F^2-\left\|P[P_{N[\bar{\Phi}(S)]}D]\operatorname{vec}(Y)\right\|_2^2,$
(A.6)
where the inequality in the second line holds because the column space of $[\bar{\Phi}\ \ D]$ includes the column space of $[\bar{\Phi}(S)\ \ D]$. The third line follows from lemma 2. The last line follows from the fact that $P_{N[\bar{\Phi}(S)]}=(I-P[\bar{\Phi}(S)])$ and
$\operatorname{vec}(Y)^{\top}P_{N[\bar{\Phi}(S)]}^{\top}P[P_{N[\bar{\Phi}(S)]}D]\operatorname{vec}(Y)=\operatorname{vec}(Y)^{\top}P[P_{N[\bar{\Phi}(S)]}D]\operatorname{vec}(Y)=\left\|P[P_{N[\bar{\Phi}(S)]}D]\operatorname{vec}(Y)\right\|_2^2.$
By applying lemma 3 to the second term on the right-hand side of equation A.6, we obtain the desired upper bound in theorem 2. Finally, we complete the proof by noticing that $\frac{1}{2}\|P_{N[\Phi(S)]}Y\|_F^2$ is the global minimum value of basis function regression with the basis $\Phi(S)$ for all $S\subseteq(0,1,\dots,H)$. This is because $\frac{1}{2}\|\Phi(S)W-Y\|_F^2$ is convex in $W$ and, hence, $\partial_W\frac{1}{2}\|\Phi(S)W-Y\|_F^2=0$ is a necessary and sufficient condition for a global minimum, solving which yields the global minimum value $\frac{1}{2}\|P_{N[\Phi(S)]}Y\|_F^2$.$□$

### A.3  Proof of Corollary 1

The statement follows the proof of theorem 2 by noticing that lemma 4 still holds for the restriction of $L$ to $\mathcal{I}$ as $\tilde{B}(\theta,\varepsilon_1)=B(\theta,\varepsilon_1)\cap\mathcal{I}$, and by replacing $D^{(l)}_{k_l}$ with $\hat{D}^{(l)}_{k_l}$ in the proof, where $\hat{D}^{(l)}_{k_l}$ is obtained from the proof of lemma 1 by setting $W^{(l+1)}(\theta)_{k',k}=0$ for $(k',k)\in S^{(l)}\times(\{1,\dots,d_{l+1}\}\setminus S^{(l+1)})$ ($l=t+1,t+2,\dots,H-1$) and by not considering their derivatives.$□$

### A.4  Proof of Corollary 2

The statement follows the proof of theorem 2 by setting $C^{(l')}:=0$ for all $l'\notin\{t+1,H+1\}$ and setting $l\in\{t,H\}$ in the proof of lemma 4 (instead of $l\in\{t,t+1,\dots,H\}$).$□$

In the following lemma, we rewrite equation 3.2 in terms of the activation pattern and the data matrices $X$ and $Y$.

Lemma 5.
Every differentiable local minimizer $\theta$ of $L$ with the neural network 3.1 satisfies
$L(\theta)=\frac{1}{2}\|Y\|_2^2-\frac{1}{2}\|P[\tilde{D}]Y\|_2^2,$
(B.1)
where
$\tilde{D}=\left[\Lambda_{1,1}X\ \ \Lambda_{1,2}X\ \cdots\ \Lambda_{1,d}X\right].$
(B.2)
Proof.
With $r:=\hat{Y}(X,\theta)-Y$, we have $L(\theta)=r^{\top}r/2$. For expression 3.1, we have
$\hat{Y}(X,\theta)=\sum_{j=1}^{d}W^{(2)}_{j}\Lambda_{1,j}XW^{(1)}_{\cdot j}.$
(B.3)
For any differentiable local minimum $\theta$, from the first-order condition,
$0=\partial_{W^{(1)}_{ij}}r^{\top}r/2=W^{(2)}_{j}r^{\top}\Lambda_{1,j}X_{\cdot i}.$
(B.4)
We conclude that if $W^{(2)}_{j}\neq0$, then $r\perp\Lambda_{1,j}X_{\cdot i}$ for $1\leq i\leq d_x$. In fact, we have the same conclusion even if $W^{(2)}_{j}=0$. To prove it, we use the second-order condition as follows. We notice that if $W^{(2)}_{j}=0$, then
$\begin{bmatrix}\partial^2_{W^{(1)}_{ij}}r^{\top}r/2&\partial_{W^{(2)}_{j}}\partial_{W^{(1)}_{ij}}r^{\top}r/2\\\partial_{W^{(2)}_{j}}\partial_{W^{(1)}_{ij}}r^{\top}r/2&\partial^2_{W^{(2)}_{j}}r^{\top}r/2\end{bmatrix}=\begin{bmatrix}0&r^{\top}\Lambda_{1,j}X_{\cdot i}\\r^{\top}\Lambda_{1,j}X_{\cdot i}&*\end{bmatrix}.$
(B.5)
By the second-order condition, the above matrix must be positive semidefinite, and we conclude that $r^{\top}\Lambda_{1,j}X_{\cdot i}=0$. Therefore, $\hat{Y}(X,\theta)-Y$ is perpendicular to the column space of $\tilde{D}$. Moreover, from expression B.3, $\hat{Y}(X,\theta)$ is in the column space of $\tilde{D}$; thus $\hat{Y}(X,\theta)$ is the projection of $Y$ onto the column space of $\tilde{D}$, that is, $\hat{Y}(X,\theta)=P[\tilde{D}]Y$, and
$L(\theta)=\frac{1}{2}\|\hat{Y}(X,\theta)-Y\|_2^2=\frac{1}{2}\|(I-P[\tilde{D}])Y\|_2^2=\frac{1}{2}\|Y\|_2^2-\frac{1}{2}\|P[\tilde{D}]Y\|_2^2.$
(B.6)
$□$
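Equations B.2 and B.3 can be verified directly for a random one-hidden-layer ReLU network: the output equals its activation-pattern form and lies in the column space of $\tilde{D}$ (the sizes below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(6)
m, dx, d = 30, 4, 6
X = rng.standard_normal((m, dx))
W1 = rng.standard_normal((dx, d))
W2 = rng.standard_normal(d)

Yhat = np.maximum(X @ W1, 0.0) @ W2          # ReLU network output

# Activation patterns: Lambda_{1,j} = diag(1{(X W1)_{., j} > 0}).
masks = (X @ W1 > 0.0).astype(float)
Yhat_pattern = sum(W2[j] * (masks[:, j] * (X @ W1[:, j])) for j in range(d))
assert np.allclose(Yhat, Yhat_pattern)       # eq. B.3

# D~ = [Lambda_{1,1} X, ..., Lambda_{1,d} X]; Yhat is in col(D~).
Dt = np.hstack([masks[:, j:j + 1] * X for j in range(d)])
P_D = Dt @ np.linalg.pinv(Dt)
assert np.allclose(P_D @ Yhat, Yhat)
```

At a differentiable local minimum, the residual is additionally orthogonal to $\operatorname{col}(\tilde{D})$, which is what turns this membership fact into the closed form B.1.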

From equation B.1, we expect that the larger the rank of $\tilde{D}$, the smaller the loss $L(\theta)$. In the following lemma, we prove this under conditions on the activation pattern matrix $\Lambda$: in the regime $d_xd\ll m$, we have $\operatorname{rank}\tilde{D}=d_xd$, and in the regime $d_xd\gg m$, we have $\operatorname{rank}\tilde{D}=m$. As we show later, proposition 2 follows easily from these rank estimates of $\tilde{D}$.

Lemma 6.

Fix the activation pattern matrix $\Lambda:=[\Lambda_k]_{k=1}^{d}\in\mathbb{R}^{m\times d}$. Let $X$ be a random $m\times d_x$ gaussian matrix, with each entry having mean zero and variance one. Then the matrix $\tilde{D}$ as defined in equation B.2 satisfies both of the following statements:

• i.

If $m\geq64\ln^2(d_xdm/\delta^2)d_xd$ and $s_{\min}(\Lambda_I)\geq\delta$ for any index set $I\subseteq\{1,2,\dots,m\}$ with $|I|\geq m/2$, then $\operatorname{rank}\tilde{D}=d_xd$ with probability at least $1-e^{-m/(64\ln(d_xdm/\delta^2))}$.

• ii.

If $dd_x\geq2m\ln^2(md/\delta)$ with $d_x\geq\ln^2(dm)$ and $s_{\min}(\Lambda_I)\geq\delta$ for any index set $I\subseteq\{1,2,\dots,m\}$ with $|I|\leq d/2$, then $\operatorname{rank}\tilde{D}=m$ with probability at least $1-2e^{-d_x/20}$.

Proof of Lemma 6.
We denote by $\Omega_{sum}$ the event
$\Omega_{sum}=\{X:\|X\|_F^2\leq2md_x\}.$
(B.7)
Thanks to equation B.24 in lemma 7, $P(\Omega_{sum})\geq1-e^{-d_xm/8}$.
In the following, we first prove case i: that $\operatorname{rank}\tilde{D}=d_xd$ with high probability. We identify the space $\mathbb{R}^{dd_x}$ with the space of $d\times d_x$ matrices and fix $L=2\lceil\ln(dm/\delta^2)\rceil$. We first prove that for any $V$ in the unit sphere in $\mathbb{R}^{dd_x}$, with probability at least $1-e^{-m/(16L)}$, we have
$\|\tilde{D}\operatorname{vec}(V)\|_2^2\geq\delta^2/(2L).$
(B.8)
We notice that
$\tilde{D}\operatorname{vec}(V)=\sum_{j=1}^{d_x}\sum_{i=1}^{d}\Lambda_{1,i}V_{ij}X_{\cdot j}=:u.$
Then $u$ is a gaussian vector in $\mathbb{R}^m$ with $k$th entry
$u_k=\sum_{i=1}^{d_x}(\Lambda V)_{ki}X_{ki}\sim N(0,a_k^2),\qquad a_k^2:=\sum_{i=1}^{d_x}(\Lambda V)_{ki}^2.$
Since by our assumption the entries of $\Lambda$ are bounded by 1, we get
$a_k^2=\sum_{i=1}^{d_x}(\Lambda V)_{ki}^2\leq\sum_{j=1}^{d}\Lambda_{kj}^2\|V\|_F^2\leq d.$
We denote the sets $I_0=\{1\leq k\leq m:a_k^2\leq\delta^2/m\}$ and
$I_\ell=\{1\leq k\leq m:e^{\ell-1}\delta^2/m<a_k^2\leq e^{\ell}\delta^2/m\},\qquad1\leq\ell\leq L.$
There are two cases. If there exists some $\ell\geq1$ such that $|I_\ell|\geq m/L$, then thanks to equation B.25 in lemma 7, we have that with probability at least $1-e^{-m/(16L)}$,
$\|\tilde{D}\operatorname{vec}(V)\|_2^2\geq\sum_{k\in I_\ell}u_k^2\geq\frac{1}{2}\sum_{k\in I_\ell}a_k^2\geq e^{\ell-1}\delta^2/(2L).$
(B.9)
Otherwise, we have that $|I_0|\geq m(1-\lceil\ln(dm/\delta^2)\rceil/L)\geq m/2$. Then
$\sum_{k\in I_0}a_k^2\leq\sum_{k\in I_0}\delta^2/m<\delta^2.$
However, by our assumption that $s_{\min}(\Lambda_{I_0})\geq\delta$,
$\sum_{k\in I_0}a_k^2=\|\Lambda_{I_0}V\|_F^2\geq s_{\min}^2(\Lambda_{I_0})\|V\|_F^2\geq\delta^2.$
This leads to a contradiction. Claim B.8 follows from claim B.9.
We take an $\varepsilon$-net of the unit sphere in $\mathbb{R}^{dd_x}$ and denote it by $E$. The cardinality of the set $E$ is at most $(2/\varepsilon)^{dd_x}$. We denote by $\Omega$ the event that the following holds:
$\min_{V\in E}\|\tilde{D}\operatorname{vec}(V)\|_2^2\geq\delta^2/(2L).$
(B.10)
Then by using a union bound, we get that the event $\Omega\cap\Omega_{sum}$ holds with probability at least $1-e^{-m/(16L)}(2/\varepsilon)^{dd_x}-e^{-md_x/8}$.
Let $\hat{V}$ be a vector in the unit sphere of $\mathbb{R}^{dd_x}$. Then there exists a vector $V\in E$ such that $\|V-\hat{V}\|_2\leq\varepsilon$, and we have
$\tilde{D}\operatorname{vec}(\hat{V})=\tilde{D}\operatorname{vec}(V)+\tilde{D}\operatorname{vec}(\hat{V}-V).$
(B.11)
From equation B.10, for $X\in\Omega\cap\Omega_{sum}$, we have that
$\|\tilde{D}\operatorname{vec}(V)\|_2^2\geq\delta^2/(2L)$
(B.12)
and
$\|\tilde{D}\operatorname{vec}(\hat{V}-V)\|_2^2\leq\sum_{k=1}^{m}\sum_{i=1}^{d_x}(\Lambda(\hat{V}-V))_{ki}^2\|x_k\|_2^2\leq\sum_{k=1}^{m}\sum_{j=1}^{d}\Lambda_{kj}^2\|\hat{V}-V\|_F^2\|x_k\|_2^2\leq\sum_{k=1}^{m}d\varepsilon^2\|x_k\|_2^2\leq2md_xd\varepsilon^2.$
(B.13)
Combining equations B.11 to B.13, we get that on the event $\Omega\cap\Omega_{sum}$,
$\|\tilde{D}\operatorname{vec}(\hat{V})\|_2^2\geq\delta^2/(4L),$
provided that $\varepsilon\leq\delta/(12\sqrt{d_xdmL})$. This implies that the squared smallest singular value of the matrix $\tilde{D}$ is at least $\delta^2/(4L)$, with probability at least
$1-e^{-m/(16L)}(2/\varepsilon)^{dd_x}-e^{-md_x/8}\geq1-e^{-m/(32L)},$
provided that $m\geq32L\ln(d_xdm/\delta^2)d_xd$. This finishes the proof of case i.
In the following, we prove case ii: that $\operatorname{rank}\tilde{D}=m$ with high probability. We notice that for any vector $v\in\mathbb{R}^m$, by the Cauchy-Schwarz inequality,
$\|\tilde{D}^{\top}v\|_2^2\leq\sum_{i=1}^{d}\sum_{j=1}^{d_x}\left(\sum_{k=1}^{m}(v_k\Lambda_{ki})^2\right)\left(\sum_{k=1}^{m}X_{kj}^2\right)\leq d\|v\|_2^2\|X\|_F^2.$
(B.14)
In the event $\Omega_{sum}$ as defined in equation B.7, we have that $\|\tilde{D}^{\top}v\|_2^2\leq2dd_xm\|v\|_2^2$ for any vector $v\in\mathbb{R}^m$.
In the following, we prove that for any vector $v\in\mathbb{R}^m$, if its $L$th largest entry (in absolute value) is at least $a$ for some $L\leq d/2$, then
$P(\|\tilde{D}^{\top}v\|_2^2\geq a^2\delta^2Ld_x/2)\geq1-e^{-Ld_x/16}.$
(B.15)
We denote the vectors $u_i=[X_{\cdot i}^{\top}\Lambda_{1,1}v,X_{\cdot i}^{\top}\Lambda_{1,2}v,\dots,X_{\cdot i}^{\top}\Lambda_{1,d}v]^{\top}$ for $i=1,2,\dots,d_x$. Then $\tilde{D}^{\top}v=[u_1^{\top},u_2^{\top},\dots,u_{d_x}^{\top}]^{\top}$ up to a permutation of the entries. Moreover, $u_1,u_2,\dots,u_{d_x}\in\mathbb{R}^{d}$ are independent and identically distributed (i.i.d.) gaussian vectors, with mean zero and covariance matrix
$\Sigma=\Lambda^{\top}V^2\Lambda,$
where $V$ is the $m\times m$ diagonal matrix with diagonal entries given by $v$. We denote the eigenvalues of $\Sigma$ by $\lambda_1(\Sigma)\geq\lambda_2(\Sigma)\geq\cdots\geq\lambda_d(\Sigma)\geq0$. Then, in distribution,
$\|u_i\|_2^2=\lambda_1(\Sigma)z_{i1}^2+\lambda_2(\Sigma)z_{i2}^2+\cdots+\lambda_d(\Sigma)z_{id}^2,$
(B.16)
where $\{z_{ij}\}_{1\leq i\leq d_x,1\leq j\leq d}$ are independent gaussian random variables with mean zero and variance one. If the $L$th largest entry of $v$ (in absolute value) is at least $a$ for some $L\leq d/2$, we denote the index set $I=\{1\leq k\leq m:|v_k|\geq a\}$; then
$\Sigma=\Lambda^{\top}V^2\Lambda\succeq\Lambda^{\top}V_I^2\Lambda\succeq a^2\Lambda_I^{\top}\Lambda_I.$
Therefore, the $j$th largest eigenvalue of $\Sigma$ is at least the $j$th largest eigenvalue of $a^2\Lambda_I^{\top}\Lambda_I$ for any $1\leq j\leq d$. From our assumption, $s_{\min}(\Lambda_I)\geq\delta$, and the $L$th largest eigenvalue of $a^2\Lambda_I^{\top}\Lambda_I$ is at least $a^2\delta^2$. Therefore, the $L$th largest eigenvalue of $\Sigma$ is at least $a^2\delta^2$; that is, $\lambda_L(\Sigma)\geq a^2\delta^2$. We can rewrite equation B.16 as
$\|u_i\|_2^2=\sum_{j=1}^{d}\lambda_j(\Sigma)z_{ij}^2\geq a^2\delta^2\sum_{j=1}^{L}z_{ij}^2.$
Thanks to equation B.25 in lemma 7,
$P\left(\|\tilde{D}^{\top}v\|_2^2\geq a^2\delta^2Ld_x/2\right)=P\left(\sum_{i=1}^{d_x}\|u_i\|_2^2\geq a^2\delta^2Ld_x/2\right)\geq P\left(\sum_{i=1}^{d_x}a^2\delta^2\sum_{j=1}^{L}z_{ij}^2\geq a^2\delta^2Ld_x/2\right)=P\left(\sum_{i=1}^{d_x}\sum_{j=1}^{L}z_{ij}^2\geq Ld_x/2\right)\geq1-e^{-Ld_x/16}.$
This finishes the proof of claim B.15.
We take an $\varepsilon$-net of the unit sphere in $\mathbb{R}^m$ and denote it by $E$. Let $\hat{v}$ be a vector in the unit sphere of $\mathbb{R}^m$; then there exists a vector $v\in E$ such that $\|v-\hat{v}\|_2\leq\varepsilon$, and we have
$\tilde{D}^{\top}\hat{v}=\tilde{D}^{\top}v+\tilde{D}^{\top}(\hat{v}-v),$
(B.17)
and in the event $\Omega_{sum}$, using equation B.14, we have
$\|\tilde{D}^{\top}(\hat{v}-v)\|_2^2\leq2md_xd\varepsilon^2.$
(B.18)
In the rest of the proof, we show that, with high probability, $\|\tilde{D}^{\top}v\|_2^2$ is bounded away from zero uniformly over $v\in E$.
For any given vector $v$ in the unit sphere of $\mathbb{R}^m$, we sort its entries in absolute value:
$|v_1^*|\geq|v_2^*|\geq\cdots\geq|v_m^*|.$
We denote the sequence $1=L_0\leq L_1\leq\cdots\leq L_p\leq L_{p+1}=m$, where $L_i=\lceil\ln^2(md/\delta)L_{i+1}/d_x\rceil$ for $1\leq i\leq p$ and $L_1\leq d_x/\ln^2(md/\delta)$. Then
$p=\lceil\ln m/\ln(d_x/\ln^2(md))\rceil.$
Thanks to our assumption that $dd_x\geq2m\ln^2(md/\delta)$, we have $L_p\leq d/2$. We fix $\varepsilon$ as
$\varepsilon:=\frac{1}{2}\left(\frac{\delta}{4dm}\right)^{p+1}.$
(B.19)
We denote $q(v)=\min\{0\leq i\leq p:|v^*_{L_i}|\geq4dm|v^*_{L_{i+1}+1}|/\delta\}$, where $v^*_{m+1}=0$. We decompose the vector $v=v_1+v_2$, where $v_1$ corresponds to the largest (in absolute value) $L_{q(v)+1}$ entries of $v$, and $v_2$ corresponds to the rest of the entries of $v$. Taking $L=L_{q(v)}$ and $a=v^*_{L_{q(v)}}$ in equation B.15, we get
$P(\|\tilde{D}^{\top}v_1\|_2^2\geq a^2\delta^2Ld_x/2)\geq1-e^{-Ld_x/16}.$
(B.20)
By the definition of $q(v)$, we have
$|a|=|v^*_{L_{q(v)}}|\geq\frac{\delta}{4dm}|v^*_{L_{q(v)-1}}|\geq\cdots\geq\left(\frac{\delta}{4dm}\right)^{q(v)}\frac{1}{\sqrt{m}}=:a_{q(v)}.$
We denote by $\Omega_q$ the event that equation B.20 holds for any $v\in E$ with $q(v)=q$. Since equation B.20 depends only on $L_{q+1}$ entries of $v$, by a union bound, we get
$P(\Omega_q)\geq1-e^{-L_qd_x/16}m^{L_{q+1}}(2/\varepsilon)^{L_{q+1}}\geq1-e^{-L_qd_x/16+L_{q+1}(\ln m+\ln(2/\varepsilon))}\geq1-e^{-L_qd_x/20}.$
(B.21)
Moreover, $\|v_2\|_2^2\leq a^2\delta^2/(16dm)$, and in the event $\Omega_{sum}$, using equation B.14, we have
$\|\tilde{D}^{\top}v_2\|_2^2\leq2dd_xm\|v_2\|_2^2\leq a^2\delta^2d_x/8.$
(B.22)
Combining equations B.20 and B.22, in the event $\Omega_q\cap\Omega_{sum}$, for any $v\in E$ with $q(v)=q$, we get
$\|\tilde{D}^{\top}v\|_2^2\geq(\|\tilde{D}^{\top}v_1\|_2-\|\tilde{D}^{\top}v_2\|_2)^2\geq a^2\delta^2d_x/8\geq a_q^2\delta^2d_x/8.$
(B.23)
In the event $\Omega_{sum}\cap\bigcap_{q=0}^{p}\Omega_q$, it follows from combining equations B.17, B.18, and B.23 that
$\|\tilde{D}^{\top}\hat{v}\|_2\geq\|\tilde{D}^{\top}v\|_2-\sqrt{2md_xd}\,\varepsilon\geq a_p\delta\sqrt{d_x/8}-\sqrt{2md_xd}\,\varepsilon\geq\left(\frac{\delta}{4dm}\right)^{p+1}.$
Moreover, thanks to equation B.21, $\Omega_{sum}\cap\bigcap_{q=0}^{p}\Omega_q$ holds with probability at least
$P\left(\Omega_{sum}\cap\bigcap_{q=0}^{p}\Omega_q\right)\geq1-\sum_{q=0}^{p}e^{-L_qd_x/20}-e^{-d_xm/8}\geq1-2e^{-d_x/20}.$
This finishes the proof of case ii.$□$

The following concentration inequalities for the square of gaussian random variables are from Laurent and Massart (2000).

Lemma 7.
Let the weights satisfy $0\leq a_1,a_2,\dots,a_n\leq K$, and let $g_1,g_2,\dots,g_n$ be independent gaussian random variables with mean zero and variance one. Then the following inequalities hold for any positive $t$:
$P\left(\sum_{i=1}^{n}a_i^2(g_i^2-1)\geq2\sqrt{t}\left(\sum_{i=1}^{n}a_i^4\right)^{1/2}+2K^2t\right)\leq e^{-t},$
(B.24)
$P\left(\sum_{i=1}^{n}a_i^2(g_i^2-1)\leq-2\sqrt{t}\left(\sum_{i=1}^{n}a_i^4\right)^{1/2}\right)\leq e^{-t}.$
(B.25)
Proof of Proposition 2.
In case i of lemma 6, $\operatorname{rank}\tilde{D}=d_xd$ with probability at least $1-e^{-m/(64\ln(d_xdm/\delta^2))}$. Since the statement immediately follows from theorem 1 if $\|Y\|_2=0$, we can focus on the case of $\|Y\|_2\neq0$. Conditioning on the event $\operatorname{rank}\tilde{D}=d_xd$,
$\frac{L(\theta)}{\|Y\|_2^2/2}=\frac{\|P_{N[\tilde{D}]}Y\|_2^2}{\|Y\|_2^2}.$
(B.26)
The quantity in equation B.26 has the same law as
$\frac{z_1^2+z_2^2+\cdots+z_{m-d_xd}^2}{z_1^2+z_2^2+\cdots+z_m^2},$
where $z_1,z_2,\dots,z_m$ are independent gaussian random variables with mean zero and variance one. From lemma 7, we get that with probability at least $1-2e^{-t}$,
$\frac{z_1^2+z_2^2+\cdots+z_{m-d_xd}^2}{z_1^2+z_2^2+\cdots+z_m^2}\leq\left(1+6\sqrt{\frac{t}{m}}\right)\frac{m-d_xd}{m}.$
(B.27)
Case i follows from combining equations B.26 and B.27.
In case ii, thanks to lemma 6, $\operatorname{rank}\tilde{D}=m$ with probability at least $1-2e^{-d_x/20}$. Conditioning on the event $\operatorname{rank}\tilde{D}=m$, we have $P[\tilde{D}]Y=Y$, and
$L(\theta)=\frac{1}{2}\|Y\|_2^2-\frac{1}{2}\|P[\tilde{D}]Y\|_2^2=0.$
This finishes the proof of case ii.$□$

By using the ground-truth network described in section 4.3, the synthetic data set was generated with i.i.d. random inputs $x$ and i.i.d. random weight matrices $W^{(l)}$. Each input $x$ was randomly sampled from the standard normal distribution, and each entry of the weight matrix $W^{(l)}$ was randomly sampled from a normal distribution with zero mean and normalized standard deviation ($\sqrt{2/d_l}$).
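A sketch of this data-generating process in NumPy; the ReLU ground-truth architecture and the $\sqrt{2/d_l}$ weight normalization are assumptions based on the description above, and the sizes in the usage line are arbitrary:

```python
import numpy as np

def generate_synthetic(m, widths, rng):
    """Sample m i.i.d. standard normal inputs and propagate them through
    a random network with layer widths `widths` (input to output)."""
    X = rng.standard_normal((m, widths[0]))
    H = X
    for i in range(len(widths) - 1):
        # Normalized random weights (assumed scaling: sqrt(2 / fan_in)).
        W = rng.standard_normal((widths[i], widths[i + 1])) * np.sqrt(2.0 / widths[i])
        H = H @ W
        if i < len(widths) - 2:          # nonlinearity on hidden layers only
            H = np.maximum(H, 0.0)
    return X, H

X, Y = generate_synthetic(100, [5, 8, 8, 2], np.random.default_rng(0))
```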

For training, we used a standard training procedure with mini-batch stochastic gradient descent (SGD) with momentum. The learning rate was set to 0.01. The momentum coefficient was set to 0.9 for the synthetic data set and 0.5 for the image data sets. The mini-batch size was set to 200 for the synthetic data set and 64 for the image data sets.
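The update rule here is plain SGD with heavy-ball momentum; a minimal sketch of one parameter update with the reported synthetic-data hyperparameters (the dictionary-of-arrays layout is an implementation choice, not from the letter):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocity, lr=0.01, momentum=0.9):
    """One mini-batch update of SGD with momentum."""
    for name in params:
        velocity[name] = momentum * velocity[name] - lr * grads[name]
        params[name] = params[name] + velocity[name]
    return params, velocity

# Toy usage: minimize f(w) = w^2 / 2, whose gradient is w.
params = {"w": np.array(1.0)}
velocity = {"w": np.array(0.0)}
for _ in range(200):
    grads = {"w": params["w"]}
    params, velocity = sgd_momentum_step(params, grads, velocity)
```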

From the proof of theorem 1, $J(\theta)=\|(I-P[[D\ \ D^{(H+1)}_{1}]])\operatorname{vec}(Y)\|_2^2$ for all $\theta$, which was used to numerically compute the values of $J(\theta)$. This is mainly because the form of $J(\theta)$ in theorem 1 may accumulate positive numerical errors for each $l\leq H$ and $k_l\leq d_l$ in the sum in its second term, which may easily cause a numerical overestimation of the effect of depth and width. To compute the projections, we adopted a method of computing a numerical cutoff criterion on singular values from Press, Teukolsky, Vetterling, and Flannery (2007): (the numerical cutoff criterion) $=\frac{1}{2}\times$ (maximum singular value of $M$) $\times$ (machine precision of $M$) $\times\sqrt{d'+d+1}$, for a matrix $M\in\mathbb{R}^{d'\times d}$. We also confirmed that the reported experimental results remained qualitatively unchanged with two other cutoff criteria: a criterion based on Golub and Van Loan (1996), (the numerical cutoff criterion) $=\frac{1}{2}\|M\|_\infty\times$ (machine precision of $M$) (where $\|M\|_\infty=\max_{1\leq i\leq d'}\sum_{j=1}^{d}|M_{i,j}|$ for a matrix $M\in\mathbb{R}^{d'\times d}$), as well as another criterion based on the Netlib Repository LAPACK documentation, (the numerical cutoff criterion) $=$ (maximum singular value of $M$) $\times$ (machine precision of $M$).
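A sketch of a numerical rank computation with a Press et al.-style relative cutoff on singular values; the exact form of the cutoff factor follows the description above and should be treated as an approximation, not the authors' exact code:

```python
import numpy as np

def numerical_rank(M):
    """Number of singular values of M above a relative cutoff
    (0.5 * s_max * machine_eps * sqrt(d' + d + 1), an assumed form)."""
    d_rows, d_cols = M.shape
    s = np.linalg.svd(M, compute_uv=False)   # singular values, descending
    cutoff = 0.5 * s[0] * np.finfo(M.dtype).eps * np.sqrt(d_rows + d_cols + 1)
    return int(np.sum(s > cutoff))

# Deterministic checks: a diagonal matrix with one zero singular value,
# and the identity.
M = np.diag([1.0, 0.5, 0.0])
assert numerical_rank(M) == 2
assert numerical_rank(np.eye(4)) == 4
```

Such a relative cutoff makes the computed projections insensitive to round-off-level singular values, which is the reason a cutoff is needed at all when forming $P[M]$ numerically.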

We gratefully acknowledge support from NSF grants 1523767 and 1723381; AFOSR grant FA9550-17-1-0165; ONR grant N00014-18-1-2847; Honda Research; and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.

Arora, S., Cohen, N., Golowich, N., & Hu, W. (2018). A convergence analysis of gradient descent for deep linear neural networks. arXiv:1810.02281.

Arora, S., Cohen, N., & Hazan, E. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the International Conference on Machine Learning.

Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.

Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015). The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).

Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2933–2941).

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore: Johns Hopkins University Press.

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.

Hardt, M., & Ma, T. (2017). Identity matters in deep learning. arXiv:1611.04231.

Kawaguchi, K. (2016). Deep learning without poor local minima. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 586–594). Red Hook, NY: Curran.

Kawaguchi, K., & Bengio, Y. (2018). Depth with nonlinearity creates no bad local minima in ResNets. arXiv:1810.09038.

Kawaguchi, K., Kaelbling, L. P., & Lozano-Pérez, T. (2015). Bayesian optimization with exponential convergence. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 2809–2817). Red Hook, NY: Curran.

Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). Toronto: University of Toronto.

Laurent, B., & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5), 1302–1338.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861–867.

Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014). On the computational efficiency of training neural networks. In Z. Ghahramani, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 855–863). Red Hook, NY: Curran.

Montufar, G. F., Pascanu, R., Cho, K., & Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2924–2932). Red Hook, NY: Curran.

Murty, K. G., & Kabadi, S. N. (1987). Some NP-complete problems in quadratic and nonlinear programming. Mathematical Programming, 39(2), 117–129.

Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., & Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning.

Nguyen, Q., & Hein, M. (2017). The loss surface of deep and wide neural networks. In Proceedings of the International Conference on Machine Learning (pp. 2603–2612).

Nguyen, Q., & Hein, M. (2018). Optimization landscape and expressivity of deep CNNs. In Proceedings of the International Conference on Machine Learning (pp. 3727–3736).

Press, W. H., Teukolsky, S. A., Vetterling, W. T., & Flannery, B. P. (2007). Numerical recipes: The art of scientific computing (3rd ed.). Cambridge: Cambridge University Press.

Rao, C. R., Toutenburg, H., Shalabh, & Heumann, C. (2007). Linear models and generalizations: Least squares and alternatives. Berlin: Springer.

Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120.

Shamir, O. (2018). Are ResNets provably better than linear predictors? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31. Red Hook, NY: Curran.

Telgarsky, M. (2016). Benefits of depth in neural networks. In Proceedings of the Conference on Learning Theory (pp. 1517–1539).

Xie, B., Liang, Y., & Song, L. (2017). Diverse neural network learns true target functions. In Proceedings of the Conference on Artificial Intelligence and Statistics (pp. 1216–1224).
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.