Abstract

In this paper, we analyze the effects of depth and width on the quality of local minima, without strong overparameterization and simplification assumptions in the literature. Without any simplification assumption, for deep nonlinear neural networks with the squared loss, we theoretically show that the quality of local minima tends to improve toward the global minimum value as depth and width increase. Furthermore, with a locally induced structure on deep nonlinear neural networks, the values of local minima of neural networks are theoretically proven to be no worse than the globally optimal values of corresponding classical machine learning models. We empirically support our theoretical observation with a synthetic data set, as well as MNIST, CIFAR-10, and SVHN data sets. When compared to previous studies with strong overparameterization assumptions, the results in this letter do not require overparameterization and instead show the gradual effects of overparameterization as consequences of general results.

1  Introduction

Deep learning with neural networks has been a significant practical success in many fields, including computer vision, machine learning, and artificial intelligence. Along with its practical success, deep learning has been theoretically analyzed and shown to be attractive in terms of its expressive power. For example, neural networks with one hidden layer can approximate any continuous function (Leshno, Lin, Pinkus, & Schocken, 1993; Barron, 1993), and deeper neural networks enable us to approximate functions of certain classes with fewer parameters (Montufar, Pascanu, Cho, & Bengio, 2014; Livni, Shalev-Shwartz, & Shamir, 2014; Telgarsky, 2016). However, training deep learning models requires us to work with a seemingly intractable problem: nonconvex and high-dimensional optimization. Finding a global minimum of a general nonconvex function is NP-hard (Murty & Kabadi, 1987), and nonconvex optimization to train certain types of neural networks is also known to be NP-hard (Blum & Rivest, 1992). These hardness results pose a serious concern only for high-dimensional problems, because global optimization methods can efficiently approximate global minima without convexity in relatively low-dimensional problems (Kawaguchi, Kaelbling, & Lozano-Pérez, 2015).

A hope is that beyond the worst-case scenarios, practical deep learning allows some additional structure or assumption to make nonconvex high-dimensional optimization tractable. Recently, it has been shown with strong simplification assumptions that there are novel loss landscape structures in deep learning optimization that may play a role in making the optimization tractable (Dauphin et al., 2014; Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016). Another key observation is that if a neural network is strongly overparameterized so that it can memorize any data set of a fixed size, then all stationary points (including all local minima and saddle points) become global minima, with some nondegeneracy assumptions. This observation was explained by Livni et al. (2014) and further refined by Nguyen and Hein (2017, 2018). However, these previous results (Livni et al., 2014; Nguyen and Hein, 2017, 2018) require strong overparameterization by assuming not only that a network's width is larger than the data set size but also that optimizing only a single layer (the last layer or some hidden layer) can memorize any data set based on an assumed condition on the rank or nondegeneracy of other layers.

In this letter, we analyze the effects of depth and width on the values of local minima, without the strong overparameterization and simplification assumptions in the literature. As a result, we prove quantitative upper bounds on the quality of local minima, which show that the values of local minima of neural networks are guaranteed to be no worse than the globally optimal values of corresponding classical machine learning models, and that this guarantee can improve as depth and width increase.

2  Preliminaries

This section defines the optimization problem considered in this letter and introduces the basic notation.

2.1  Problem Formulation

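Let $x \in \mathbb{R}^{d_x}$ and $y \in \mathbb{R}^{d_y}$ be an input vector and a target vector, respectively. Let $\{(x_i, y_i)\}_{i=1}^{m}$ be a training data set of size $m$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^{n}$, define $[M^{(j)}]_{j=1}^{n} := [M^{(1)} \; M^{(2)} \; \cdots \; M^{(n)}]$ to be the block matrix whose column blocks are $M^{(1)}, M^{(2)}, \ldots, M^{(n)}$. Define the training data matrices as $X := ([x_i]_{i=1}^{m})^\top \in \mathbb{R}^{m \times d_x}$ and $Y := ([y_i]_{i=1}^{m})^\top \in \mathbb{R}^{m \times d_y}$.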

This letter considers the squared loss function, with which the training objective of the neural networks can be formulated as the following optimization problem:
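$$\underset{\theta}{\text{minimize}} \;\; L(\theta) := \frac{1}{2}\,\|\hat{Y}(X,\theta) - Y\|_F^2, \tag{2.1}$$
where $\|\cdot\|_F$ is the Frobenius norm, $\hat{Y}(X,\theta) \in \mathbb{R}^{m \times d_y}$ is the output prediction matrix of a neural network, and $\theta \in \mathbb{R}^{d_\theta}$ is the vector consisting of all trainable parameters. Here, $\frac{2}{m} L(\theta)$ is the standard mean squared error, for which all of our results hold true as well, because multiplying $L(\theta)$ by the factor $\frac{2}{m}$ (which is constant in $\theta$) changes only the overall scale of the optimization landscape.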
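The relation between the objective in equation 2.1 and the mean squared error can be checked numerically. The following is a minimal sketch, assuming that "mean squared error" means the per-sample average of squared Euclidean errors; the sizes and the random stand-in for the network predictions are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_y = 5, 3                               # hypothetical sizes, for illustration only
Y = rng.standard_normal((m, d_y))           # target matrix
Y_hat = rng.standard_normal((m, d_y))       # stand-in for network predictions


def squared_loss(Y_hat, Y):
    """L = 0.5 * ||Y_hat - Y||_F^2, as in equation 2.1."""
    return 0.5 * np.linalg.norm(Y_hat - Y, ord="fro") ** 2


L = squared_loss(Y_hat, Y)
# Per-sample average of squared Euclidean errors (assumed meaning of "MSE").
mse = np.mean(np.sum((Y_hat - Y) ** 2, axis=1))
```

Under this convention the two quantities differ only by the positive factor 2/m, so they share the same local and global minimizers.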

The output prediction matrix $\hat{Y}(X,\theta) \in \mathbb{R}^{m \times d_y}$ is specified for shallow networks with rectified linear units (ReLUs) in section 3 and generalized to deep nonlinear neural networks in section 4.

2.2  Additional Notation

Define $P[M]$ to be the orthogonal projection matrix onto the column space (or range space) of a matrix $M$. Let $P_N[M]$ be the orthogonal projection matrix onto the null space (or kernel space) of the matrix $M^\top$, so that $P_N[M] = I - P[M]$. For a matrix $M \in \mathbb{R}^{d \times d'}$, we denote the standard vectorization of the matrix $M$ as $\mathrm{vec}(M) = [M_{1,1}, \ldots, M_{d,1}, M_{1,2}, \ldots, M_{d,2}, \ldots, M_{1,d'}, \ldots, M_{d,d'}]^\top$.
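The two projections and the vectorization can be sketched in NumPy as follows. This is an illustrative sketch, not the authors' code: the function names are hypothetical, and the null-space projection is taken to act on the same space as the column-space projection (that is, it projects onto the null space of the transposed matrix), which is the reading under which it equals the identity minus the column-space projection.

```python
import numpy as np


def proj_col(M):
    """Orthogonal projection onto the column space of M (via the pseudoinverse)."""
    return M @ np.linalg.pinv(M)


def proj_null(M):
    """Orthogonal projection onto the orthogonal complement of col(M),
    i.e. onto the null space of M transposed (assumed reading)."""
    return np.eye(M.shape[0]) - proj_col(M)


def vec(M):
    """Column-stacking vectorization: M[0,0], ..., M[d-1,0], M[0,1], ..."""
    return M.reshape(-1, order="F")


rng = np.random.default_rng(0)
M = rng.standard_normal((6, 2))
P, PN = proj_col(M), proj_null(M)
```

Both matrices are symmetric and idempotent, and they sum to the identity, which is the property used repeatedly in the derivations of this letter.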

3  Shallow Nonlinear Neural Networks with Scalar-Valued Output

Before presenting our main results for deep nonlinear neural networks, this section provides the results for shallow networks with a single hidden layer (or three-layer networks with the input and output layers) and scalar-valued output (i.e., dy=1) to illustrate some of the ideas behind the discussed effects of the depth and width on local minima.

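In this section, the vector $\theta \in \mathbb{R}^{d_\theta}$ of all trainable parameters determines the entries of the weight matrices $W^{(1)} := W^{(1)}(\theta) \in \mathbb{R}^{d_x \times d}$ and $W^{(2)} := W^{(2)}(\theta) \in \mathbb{R}^{d}$ as $\mathrm{vec}([W^{(1)}(\theta), W^{(2)}(\theta)]) = \theta$. Given an input matrix $X$ and a parameter vector $\theta$, the output prediction matrix $\hat{Y}(X,\theta) \in \mathbb{R}^{m}$ of a fully connected feedforward network with a single hidden layer can be written as
$$\hat{Y}(X,\theta) := \sigma(XW^{(1)})\,W^{(2)}, \tag{3.1}$$
where $\sigma : \mathbb{R}^{m \times d} \to \mathbb{R}^{m \times d}$ is defined by coordinate-wise nonlinear activation functions $\sigma_{i,j}$ as $(\sigma(M))_{i,j} := \sigma_{i,j}(M_{i,j})$ for each $(i,j)$.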

3.1  Analysis with ReLU Activations

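In this section, the nonlinear activation function $\sigma_{i,j}$ is assumed to be ReLU, that is, $\sigma_{i,j}(z) = \max(0, z)$. Let $\Lambda^{1,k} \in \mathbb{R}^{m \times m}$ represent a diagonal matrix whose diagonal elements correspond to the activation pattern of the $k$th unit at the hidden layer over the $m$ different samples: for all $i \in \{1, \ldots, m\}$ and all $k \in \{1, \ldots, d\}$,
$$\Lambda^{1,k}_{ii} = \begin{cases} 1 & \text{if } (XW^{(1)})_{i,k} > 0, \\ 0 & \text{otherwise.} \end{cases}$$
Let $\Phi^{(1)} := \Phi^{(1)}(X,\theta) := \sigma(XW^{(1)}) \in \mathbb{R}^{m \times d}$ be the postactivation output of the hidden layer.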
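The role of the activation patterns can be made concrete with a small numerical sketch. The sizes and random weights below are hypothetical; the code only illustrates that, for ReLU, each hidden unit's output equals its activation-pattern matrix applied to the linear preactivation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x, d = 8, 3, 4                      # hypothetical sizes
X = rng.standard_normal((m, d_x))
W1 = rng.standard_normal((d_x, d))       # first-layer weights
W2 = rng.standard_normal(d)              # second-layer weights (scalar output)

Z = X @ W1                               # preactivations of the hidden layer
Phi1 = np.maximum(Z, 0.0)                # postactivation output (ReLU)
Y_hat = Phi1 @ W2                        # network output, as in equation 3.1

# Lambda[k] is the m x m diagonal 0/1 activation-pattern matrix of unit k.
Lambda = [np.diag((Z[:, k] > 0).astype(float)) for k in range(d)]

# ReLU acts on column k as multiplication by the pattern matrix of unit k.
Phi1_from_patterns = np.column_stack(
    [Lambda[k] @ X @ W1[:, k] for k in range(d)]
)
```

Because the patterns are piecewise constant in the weights, the network is linear in the first-layer weights on each region where the patterns do not change, which is what the analysis in this section exploits.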

Under this setting, proposition 1 provides an equation that holds at local minima and illustrates the effect of width for shallow ReLU neural networks.

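Proposition 1.
Every differentiable local minimum $\theta$ of $L$ satisfies
$$L(\theta) = \frac{1}{2}\|Y\|_2^2 - \frac{1}{2}\big\|P[N_1^{(2)}\Phi^{(1)}]\,Y\big\|_2^2 - \underbrace{\sum_{k=1}^{d} \frac{1}{2}\big\|P[N_k^{(1)} D_k^{(1)}]\,Y\big\|_2^2}_{\ge 0,\ \text{further improvement as a network gets wider}}, \tag{3.2}$$
where $D_k^{(1)} = W_k^{(2)} \Lambda^{1,k} X$. Here, $N_1^{(1)} := I_m$, $N_k^{(1)} := P_N[\bar{Q}_{k-1}^{(1)}]$ for any $k \in \{2, \ldots, d\}$, and $N_1^{(2)} := P_N[\bar{Q}_d^{(1)}]$, where $\bar{Q}_k^{(1)} := [Q_1^{(1)}, \ldots, Q_k^{(1)}]$ and $Q_k^{(1)} := N_k^{(1)} D_k^{(1)}$ for any $k \in \{1, \ldots, d\}$.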

Proposition 1 is an immediate consequence of our general result (see theorem 3) in the next section (the proof is provided in section A.1). In the rest of this section, we provide a proof sketch of proposition 1.

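A geometric intuition behind proposition 1 is that a local minimum is a global minimum within a local region in $\mathbb{R}^{d_\theta}$ (i.e., a neighborhood of the local minimum), the dimension of which increases as a network gets wider (or as the number of parameters increases). Thus, a local minimum is a global minimum of a search space with a larger dimension for a wider network. One can also see this geometric intuition in the following analysis. If $\theta$ is a differentiable local minimum, then $\theta$ must be a critical point, and thus
$$\partial_\theta L(\theta) = \big(\partial_\theta \hat{Y}(X,\theta)\big)^\top \big(\hat{Y}(X,\theta) - Y\big) = 0.$$
By rearranging this,
$$\big(\partial_\theta \hat{Y}(X,\theta)\big)^\top \hat{Y}(X,\theta) = \big(\partial_\theta \hat{Y}(X,\theta)\big)^\top Y, \tag{3.3}$$
where we can already see the power of strong overparameterization: if the matrix $(\partial_\theta \hat{Y}(X,\theta))^\top \in \mathbb{R}^{d_\theta \times m}$ is left-invertible, then $\hat{Y}(X,\theta) = Y$, and hence every differentiable local minimum is a global minimum. Because this matrix is $d_\theta$ by $m$, significantly increasing $d_\theta$ (strong overparameterization) can ensure the left invertibility.
Beyond the strong overparameterization, we proceed with the proof sketch of proposition 1 by taking advantage of the special neural network structures in $\hat{Y}(X,\theta)$ and $\partial_\theta \hat{Y}(X,\theta)$. We first observe that $\hat{Y}(X,\theta) = \Phi^{(1)} W^{(2)}$ and $\hat{Y}(X,\theta) = D^{(1)} \mathrm{vec}(W^{(1)})$, where $D^{(1)} := [D_k^{(1)}]_{k=1}^{d}$. Moreover, at any differentiable point, we have that $(\partial_{W^{(2)}} \hat{Y}(X,\theta))^\top = (\Phi^{(1)})^\top$ and $(\partial_{\mathrm{vec}(W^{(1)})} \hat{Y}(X,\theta))^\top = (D^{(1)})^\top$. Combining these with equation 3.3 yields
$$[\Phi^{(1)} \; D^{(1)}]^\top \, \frac{1}{2} [\Phi^{(1)} \; D^{(1)}] \begin{bmatrix} W^{(2)} \\ \mathrm{vec}(W^{(1)}) \end{bmatrix} = [\Phi^{(1)} \; D^{(1)}]^\top Y,$$
where
$$\hat{Y}(X,\theta) = \frac{1}{2} [\Phi^{(1)} \; D^{(1)}] \begin{bmatrix} W^{(2)} \\ \mathrm{vec}(W^{(1)}) \end{bmatrix}.$$
By solving for the vector $[(W^{(2)})^\top, \mathrm{vec}(W^{(1)})^\top]^\top$,
$$\hat{Y}(X,\theta) = P\big[[D^{(1)} \; \Phi^{(1)}]\big]\, Y.$$
Therefore,
$$L(\theta) = \frac{1}{2}\big\|Y - P\big[[D^{(1)} \; \Phi^{(1)}]\big] Y\big\|_2^2 = \frac{1}{2}\|Y\|_2^2 - \frac{1}{2}\big\|P\big[[D^{(1)} \; \Phi^{(1)}]\big] Y\big\|_2^2,$$
where the second equality follows from the idempotence of the orthogonal projection. Finally, decomposing the second term $\|P[[D^{(1)} \; \Phi^{(1)}]] Y\|_2^2$ by following the Gram-Schmidt process on the set of column vectors of $[D^{(1)} \; \Phi^{(1)}]$ yields the desired statement of proposition 1, completing its proof sketch. In proposition 1, the matrices $N_k^{(l)}$ (and $Q_k^{(l)}$) are by-products of this Gram-Schmidt process.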
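The algebraic identities used in this proof sketch can be checked numerically at an arbitrary parameter vector. The sketch below uses hypothetical sizes and random weights; it verifies that the network output can be written either through the hidden-layer output and the second-layer weights or through the pattern-weighted blocks and the vectorized first-layer weights, and that a block-wise Gram-Schmidt pass splits the squared norm of the projected target into one nonnegative term per block. It does not search for a local minimum, so it does not test the proposition itself. The tolerance-based projection helper is an implementation choice made here so that numerically zero blocks contribute nothing.

```python
import numpy as np


def proj_col(M, tol=1e-10):
    """Orthogonal projection onto col(M); singular values below tol count as zero."""
    if M.shape[1] == 0:
        return np.zeros((M.shape[0], M.shape[0]))
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    U = U[:, s > tol]
    return U @ U.T


rng = np.random.default_rng(1)
m, d_x, d = 12, 2, 3                                 # hypothetical sizes
X = rng.standard_normal((m, d_x))
Y = rng.standard_normal(m)
W1 = rng.standard_normal((d_x, d))
W2 = rng.standard_normal(d)

Z = X @ W1
Phi1 = np.maximum(Z, 0.0)
# Block k equals W2[k] * (pattern of unit k) applied row-wise to X.
D_blocks = [W2[k] * (Z[:, k] > 0).astype(float)[:, None] * X for k in range(d)]
D1 = np.hstack(D_blocks)                             # m x (d * d_x)

Y_hat = Phi1 @ W2
Y_hat_via_D = D1 @ W1.reshape(-1, order="F")         # blocks times vec(W1)

# Block Gram-Schmidt: remove from each block what earlier blocks already span.
Q_bar = np.zeros((m, 0))
terms = []
for B in D_blocks + [Phi1]:
    N = np.eye(m) - proj_col(Q_bar)
    Q = N @ B
    terms.append(np.linalg.norm(proj_col(Q) @ Y) ** 2)
    Q_bar = np.hstack([Q_bar, Q])

total = np.linalg.norm(proj_col(np.hstack([D1, Phi1])) @ Y) ** 2
```

Each entry of `terms` is nonnegative, so adding a hidden unit can only add a further nonnegative term to the amount subtracted from the squared norm of the target.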

3.2  Probabilistic Bound

From equation 3.2 in proposition 1, the loss $L(\theta)$ at differentiable local minima is expected to get smaller as the width $d$ of the hidden layer gets larger. To further support this theoretical observation, this section obtains a probabilistic upper bound on the loss $L(\theta)$ for white noise data by fixing the activation patterns $\Lambda^{1,k}$ for $k \in \{1, 2, \ldots, d\}$ and assuming that the data matrix $[X \; Y]$ is a random gaussian matrix, with each entry having mean zero and variance one.

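In this section, each nonlinear activation function $\sigma_{i,j}$ is assumed to be ReLU ($\sigma_{i,j}(z) = \max(0, z)$), leaky ReLU ($\sigma_{i,j}(z) = \max(az, z)$ with any fixed $a \le 1$), or absolute value activation ($\sigma_{i,j}(z) = |z|$). Let $\Lambda^{1,k} \in \mathbb{R}^{m \times m}$ represent a diagonal matrix whose diagonal elements correspond to the activation pattern of the $k$th unit at the hidden layer over the $m$ different samples as
$$\Lambda^{1,k}_{ii} := \begin{cases} \left.\dfrac{\partial \sigma^{(1)}_{i,k}(z)}{\partial z}\right|_{z = (XW^{(1)})_{i,k}} & \text{if } \left.\dfrac{\partial \sigma^{(1)}_{i,k}(z)}{\partial z}\right|_{z = (XW^{(1)})_{i,k}} \text{ exists,} \\ 0 & \text{otherwise.} \end{cases}$$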

This definition of $\Lambda^{1,k}_{ii}$ generalizes the corresponding definition in section 3.1. Proposition 1 holds for this generalized activation pattern by simply replacing the previous definition of $\Lambda^{1,k}_{ii}$ with this more general definition. This can be seen from the proof sketch in section 3.1 and is later formalized in the proof of theorem 3.

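We denote the vector consisting of the diagonal entries of $\Lambda^{1,k}$ by $\Lambda_k \in \mathbb{R}^{m}$ for $k \in \{1, 2, \ldots, d\}$. Define the activation pattern matrix as $\Lambda := [\Lambda_k]_{k=1}^{d} \in \mathbb{R}^{m \times d}$. For any index set $I \subseteq \{1, 2, \ldots, m\}$, let $\Lambda_I$ denote the submatrix of $\Lambda$ that consists of its rows with indices in $I$. Let $s_{\min}(\Lambda_I)$ be the smallest singular value of $\Lambda_I$.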

Proposition 2 proves that $L(\theta) \lesssim (1 - d_x d/m)\|Y\|_2^2/2$ in the regime $d_x d \ll m$, and $L(\theta) = 0$ in the regime $d_x d \gg m$, under the corresponding conditions on $\Lambda$; that is, $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \ldots, m\}$ such that $|I| \ge m/2$ in the regime $d_x d \ll m$, and $|I| \ge d/2$ in the regime $d_x d \gg m$. This supports our theoretical observation that increasing width helps improve the quality of local minima.

Proposition 2.

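Fix the activation pattern matrix $\Lambda = [\Lambda_k]_{k=1}^{d} \in \mathbb{R}^{m \times d}$. Let $[X \; Y]$ be a random $m \times (d_x + 1)$ gaussian matrix, with each entry having mean zero and variance one. Then the loss $L(\theta)$ as in equation 3.2 satisfies both of the following statements:

  • i.
    If $m \ge 64 \ln^2(d_x d m/\delta^2)\, d_x d$ and $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \ldots, m\}$ with $|I| \ge m/2$, then
    $$L(\theta) \le \left(1 + 6\sqrt{\frac{t}{m}}\right) \frac{m - d_x d}{2m}\, \|Y\|_2^2,$$
    with probability at least $1 - e^{-m/(64 \ln(d_x d m/\delta^2))} - 2e^{-t}$.
  • ii.
    If $d d_x \ge 2m \ln^2(md/\delta)$ with $d_x \ge \ln^2(dm)$ and $s_{\min}(\Lambda_I) \ge \delta$ for any index set $I \subseteq \{1, 2, \ldots, m\}$ with $|I| \ge d/2$, then
    $$L(\theta) = 0$$
    with probability at least $1 - 2e^{-d_x/20}$.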

The proof of proposition 2 is provided in appendix B. In that proof, we first rewrite the loss $L(\theta)$ in terms of the projection of $Y$ onto the null space of $\tilde{D}^\top$, where $\tilde{D}$ is an $m \times d d_x$ matrix with an explicit expression in terms of the activation pattern matrix $\Lambda$ and the data matrix $X$. By our assumption, the data matrix $X$ is a random gaussian matrix, so the matrix $\tilde{D}$ is also a random matrix. Proposition 2 then boils down to understanding the rank of $\tilde{D}$, and we proceed to show that $\tilde{D}$ has the largest possible rank, $\min\{d d_x, m\}$, with high probability. In fact, we derive quantitative estimates on the smallest singular value of $\tilde{D}$. The main difficulties are that the columns of the matrix $\tilde{D}$ are correlated and that the variances of different entries vary. Our approach to obtaining quantitative estimates on the smallest singular value of $\tilde{D}$ combines the epsilon net argument with an iterative argument.

In the regime $d d_x \gg m$, results similar to proposition 2ii were obtained under certain diversity assumptions on the entries of the weight matrices in a previous study (Xie, Liang, & Song, 2017). When compared with the previous study (Xie et al., 2017), proposition 2 specifies precise relations between the size $d d_x$ of the neural network and the size $m$ of the data set and also holds true in the regime $d d_x \ll m$. Moreover, our proof arguments for proposition 2ii are different. Xie et al. (2017), under the assumption that $d d_x \gg m$, show that $\tilde{D}\tilde{D}^\top$ is close to its expectation in the sense of spectral norm. As a consequence, the lower bound of the smallest eigenvalue of $\mathbb{E}[\tilde{D}\tilde{D}^\top]$ gives the lower bound for the smallest singular value of $\tilde{D}$.

However, proposition 2 assumes a gaussian data matrix, which may be a substantial limitation. The proof of proposition 2 relies on the concentration properties of gaussian distribution. Whereas a similar proof would be able to extend proposition 2 to a nongaussian distribution with these properties (e.g., distributions with subgaussian tails), it would be challenging to use a similar proof for a general distribution without the properties similar to those.
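The first regime of proposition 2 can be illustrated with a small simulation. The sketch below rests on assumptions that are not taken from the text: the activation patterns are drawn once as random 0/1 values and then held fixed, and the loss is taken to be half the squared norm of the part of the target lying outside the column space of the horizontally stacked blocks formed by scaling the rows of the input matrix with each unit's pattern (the explicit matrix used in the actual proof is given in appendix B and may differ in details). The sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_x, d = 400, 4, 10                    # hypothetical sizes with d_x * d = 40 << m

# Fixed 0/1 activation patterns (an assumption; drawn once, then held fixed).
Lam = (rng.random((m, d)) < 0.5).astype(float)

# White-noise data: every entry of [X Y] is standard gaussian.
XY = rng.standard_normal((m, d_x + 1))
X, Y = XY[:, :d_x], XY[:, d_x]

# Assumed form of the stacked matrix: block k scales row i of X by Lam[i, k].
D_tilde = np.hstack([Lam[:, [k]] * X for k in range(d)])     # m x (d * d_x)

# Least-squares residual = part of Y orthogonal to the column space of D_tilde.
coef, *_ = np.linalg.lstsq(D_tilde, Y, rcond=None)
resid = Y - D_tilde @ coef
loss = 0.5 * resid @ resid

ratio = loss / (0.5 * Y @ Y)              # loss relative to ||Y||^2 / 2
reference = 1.0 - d_x * d / m             # the factor (1 - d_x d / m)
```

In this setting one would expect `ratio` to fluctuate around `reference`, with fluctuations that shrink as the number of samples grows; increasing the width `d` lowers `reference` and thus the typical loss.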

4  Deep Nonlinear Neural Networks

Let $H$ be the number of hidden layers and $d_l$ be the width (or, equivalently, the number of units) of the $l$th hidden layer. To theoretically analyze concrete phenomena, the rest of this letter focuses on fully connected feedforward networks with various depths $H \ge 1$ and widths $d_l \ge 1$, using rectified linear units (ReLUs), leaky ReLUs, and absolute value activations, evaluated with the squared loss function. In the rest of this letter, the (finite) depth $H$ can be arbitrarily large and the (finite) widths $d_l$ can arbitrarily differ among different layers.

4.1  Model and Notation

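Let $\theta \in \mathbb{R}^{d_\theta}$ be the vector consisting of all trainable parameters, which determines the entries of the weight matrix $W^{(l)} := W^{(l)}(\theta) \in \mathbb{R}^{d_{l-1} \times d_l}$ at every $l$th hidden layer as $\mathrm{vec}([W^{(l)}(\theta)]_{l=1}^{H+1}) = \theta$. Here, $d_\theta := \sum_{l=1}^{H+1} d_{l-1} d_l$ is the number of trainable parameters. Given an input matrix $X$ and a parameter vector $\theta$, the output prediction matrix $\hat{Y}(X,\theta) \in \mathbb{R}^{m \times d_{H+1}}$ of a fully connected feedforward network can be written as
$$\hat{Y}(X,\theta) := \Phi^{(H)} W^{(H+1)}, \tag{4.1}$$
where $\Phi^{(l)} := \Phi^{(l)}(X,\theta) \in \mathbb{R}^{m \times d_l}$ is the postactivation output of the $l$th hidden layer,
$$\Phi^{(l)}(X,\theta) := \sigma^{(l)}\big(\Phi^{(l-1)} W^{(l)}\big),$$
where $\Phi^{(0)}(X,\theta) := X$, $\Phi^{(H+1)}(X,\theta) := \hat{Y}(X,\theta)$, and $\sigma^{(l)} : \mathbb{R}^{m \times d_l} \to \mathbb{R}^{m \times d_l}$ is defined by coordinate-wise nonlinear activation functions $\sigma^{(l)}_{i,j}$ as $(\sigma^{(l)}(M))_{i,j} := \sigma^{(l)}_{i,j}(M_{i,j})$ for each $(l, i, j)$. Each nonlinear activation function $\sigma^{(l)}_{i,j}$ is allowed to differ among different layers and different units within each layer, but is assumed to be ReLU ($\sigma^{(l)}_{i,j}(z) = \max(0, z)$), leaky ReLU ($\sigma^{(l)}_{i,j}(z) = \max(az, z)$ with any fixed $a \le 1$), or absolute value activation ($\sigma^{(l)}_{i,j}(z) = |z|$). Here, $d_{H+1} = d_y$ and $d_0 = d_x$. Let $\Lambda^{l,k} \in \mathbb{R}^{m \times m}$ represent a diagonal matrix whose diagonal elements correspond to the activation pattern of the $k$th unit at the $l$th layer over the $m$ different samples as
$$\Lambda^{l,k}_{ii} := \begin{cases} \left.\dfrac{\partial \sigma^{(l)}_{i,k}(z)}{\partial z}\right|_{z = (\Phi^{(l-1)} W^{(l)})_{i,k}} & \text{if } \left.\dfrac{\partial \sigma^{(l)}_{i,k}(z)}{\partial z}\right|_{z = (\Phi^{(l-1)} W^{(l)})_{i,k}} \text{ exists,} \\ 0 & \text{otherwise.} \end{cases}$$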

This definition of $\Lambda^{l,k}_{ii}$ generalizes the corresponding definition in section 3. Let $I_d$ be the identity matrix of size $d$ by $d$. Define $M \otimes M'$ to be the Kronecker product of matrices $M$ and $M'$. Given a matrix $M$, $M_{\cdot,j}$ and $M_{i,\cdot}$ denote the $j$th column vector of $M$ and the $i$th row vector of $M$, respectively.

4.2  Theoretical Result

For the standard deep nonlinear neural networks, theorem 3 provides an equation that holds at local minima and illustrates the effect of depth and width. Let $d_l' := d_l$ for all $l \in \{1, \ldots, H\}$ and $d_{H+1}' := 1$.

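Theorem 3.
Every differentiable local minimum $\theta$ of $L$ satisfies
$$L(\theta) = \frac{1}{2}\|Y\|_F^2 - \underbrace{\sum_{l=1}^{H+1} \sum_{k_l=1}^{d_l'} \frac{1}{2}\big\|P[N_{k_l}^{(l)} D_{k_l}^{(l)}]\, \mathrm{vec}(Y)\big\|_2^2}_{\ge 0,\ \text{further improvement as a network gets wider and deeper}}, \tag{4.2}$$
where $D_{k_l}^{(l)} := D_{k_l}^{(l)}(\theta)$ and $N_{k_l}^{(l)} := N_{k_l}^{(l)}(\theta)$ are defined as follows. For any $l \in \{1, \ldots, H\}$ and any $k_l \in \{1, \ldots, d_l\}$,
$$D_{k_l}^{(l)} := \sum_{k_{l+1}=1}^{d_{l+1}} \cdots \sum_{k_H=1}^{d_H} \Big(W^{(l+1)}_{k_l,k_{l+1}} \cdots W^{(H)}_{k_{H-1},k_H} W^{(H+1)}_{k_H,\cdot}\Big)^\top \otimes \Lambda^{l,k_l} \cdots \Lambda^{H,k_H} \Phi^{(l-1)},$$
with $D_{k_H}^{(H)} := (W^{(H+1)}_{k_H,\cdot})^\top \otimes \Lambda^{H,k_H} \Phi^{(H-1)}$. For any $l \in \{1, \ldots, H\}$ and any $k_l \in \{1, \ldots, d_l\}$, $N_{k_l}^{(l)} := P_N[\bar{Q}_{k_l-1}^{(l)}]$ with $N_1^{(1)} := I_{m d_{H+1}}$, where $\bar{Q}_{k_l}^{(l)} := [Q_1^{(1)}, \ldots, Q_{d_1}^{(1)}, Q_1^{(2)}, \ldots, Q_{d_2}^{(2)}, \ldots, Q_1^{(l)}, \ldots, Q_{k_l}^{(l)}]$, $Q_{k_l}^{(l)} := N_{k_l}^{(l)} D_{k_l}^{(l)}$, and $\bar{Q}_0^{(l)} := \bar{Q}_{d_{l-1}}^{(l-1)}$. Here, $D_1^{(H+1)}(\theta) := I_{d_{H+1}} \otimes \Phi^{(H)}$ and $N_1^{(H+1)}(\theta) := P_N[\bar{Q}_{d_H}^{(H)}]$.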

The complete proof of theorem 3 is provided in section A.1. Theorem 3 is a generalization of proposition 1. Accordingly, its proof follows the proof sketch presented in the previous section for proposition 1.
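To make the Kronecker-product blocks in the theorem concrete, the following sketch builds the blocks of the last hidden layer for a small ReLU network and checks that, multiplied by the corresponding weight columns and summed over units, they reproduce the vectorized network output. This identity is our reading of the definitions rather than a statement quoted from the text, and the sizes, weights, and helper name are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 6
widths = [3, 4, 5, 2]                 # d_0, d_1, d_2, d_3 = d_{H+1}; hypothetical
H = len(widths) - 2                   # number of hidden layers (here 2)
X = rng.standard_normal((m, widths[0]))

# W[l] has shape d_{l-1} x d_l for l = 1..H+1 (index 0 is unused).
W = [None] + [
    rng.standard_normal((widths[l - 1], widths[l])) for l in range(1, H + 2)
]

Phi = [X]                             # Phi[l] is the output of layer l
Lam = [None]                          # Lam[l][:, k] is the diagonal of the pattern of unit k
for l in range(1, H + 1):
    Z = Phi[-1] @ W[l]
    Phi.append(np.maximum(Z, 0.0))    # ReLU
    Lam.append((Z > 0).astype(float))
Y_hat = Phi[H] @ W[H + 1]             # equation 4.1


def D_last_hidden(k):
    """Block for unit k of layer H: (row k of W[H+1])^T kron (pattern_k * Phi[H-1])."""
    col = W[H + 1][k, :].reshape(-1, 1)                   # d_{H+1} x 1
    return np.kron(col, Lam[H][:, [k]] * Phi[H - 1])      # (m * d_{H+1}) x d_{H-1}


vec_Y_hat = Y_hat.reshape(-1, order="F")                  # column-stacking vec
recon = sum(D_last_hidden(k) @ W[H][:, k] for k in range(widths[H]))
```

The same pattern extends to earlier layers, where each block additionally sums over all paths from the chosen unit to the output.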

Unlike previous studies (Livni et al., 2014; Nguyen & Hein, 2017, 2018), theorem 3 requires no overparameterization such as $d_l \ge m$. Instead, it provides quantitative gradual effects of depth and width on local minima, from no overparameterization to overparameterization. Notably, theorem 3 shows the effect of overparameterization in terms of depth as well as width, which also differs from the results of previous studies that consider overparameterization in terms of width (Livni et al., 2014; Nguyen & Hein, 2017, 2018).

The proof idea behind these previous studies with strong overparameterization is captured in the discussion after equation 3.3: with strong overparameterization such that $d_l \ge m$ and $\mathrm{rank}(D^{(1)}) \ge m$, the matrix $(\partial_{\mathrm{vec}(W)} \hat{Y}(X,\theta))^\top \in \mathbb{R}^{d_l \times m}$ is left-invertible, and hence every local minimum is a global minimum with zero training error. Here, $\mathrm{rank}(M)$ represents the rank of a matrix $M$. The proof idea behind theorem 3 differs from those, as shown in section 3.1. What is still missing in theorem 3 is the ability to provide an a priori guarantee on $L(\theta)$ without strong overparameterization, which is addressed in sections 3.2 and 5 for some special cases but left as an open problem for other cases.

4.3  Experiments

In theorem 3, we have shown that at every differentiable local minimum $\theta$, the total training loss value $L(\theta)$ has an analytical formula $L(\theta) = J(\theta)$, where
$$J(\theta) := \frac{1}{2}\|Y\|_F^2 - \sum_{l=1}^{H+1} \sum_{k_l=1}^{d_l'} \frac{1}{2}\big\|P[N_{k_l}^{(l)}(\theta) D_{k_l}^{(l)}(\theta)]\, \mathrm{vec}(Y)\big\|_2^2$$
denotes the right-hand side of equation 4.2. In this section, we investigate the actual numerical values of the formula $J(\theta)$ with a synthetic data set and standard benchmark data sets for neural networks with different depths $H$ and hidden-layer widths $d_l$ for $l \in \{1, 2, \ldots, H\}$.

In the synthetic data set, the data points $\{(x_i, y_i)\}_{i=1}^{m}$ were randomly generated by a ground-truth, fully connected feedforward neural network with $H = 7$, $d_l = 50$ for all $l \in \{1, 2, \ldots, H\}$, tanh activation function, $(x, y) \in \mathbb{R}^{10} \times \mathbb{R}$, and $m = 5000$. MNIST (LeCun, Bottou, Bengio, & Haffner, 1998), a popular data set for recognizing handwritten digits, contains 28 × 28 gray-scale images. The CIFAR-10 (Krizhevsky & Hinton, 2009) data set consists of 32 × 32 color images that contain different types of objects such as “airplane,” “automobile,” and “cat.” The Street View House Numbers (SVHN) data set (Netzer et al., 2011) contains house digits collected by Google Street View, and we used the 32 × 32 color image version for the standard task of predicting the digits in the middle of these images. To reduce the computational cost, for the image data sets (MNIST, CIFAR-10, and SVHN), we center-cropped the images (24 × 24 for MNIST and 28 × 28 for CIFAR-10 and SVHN), then resized them to smaller gray-scale images (8 × 8 for MNIST and 12 × 12 for CIFAR-10 and SVHN), and used randomly selected subsets of the data sets with size $m = 10{,}000$ as the training data sets.

For all the data sets, the network architecture was fixed to be a fully connected feedforward network with the ReLU activation function. For each data set, the values of $J(\theta)$ were computed with initial random weights drawn from a normal distribution with zero mean and normalized standard deviation ($1/\sqrt{d_l}$) and with trained weights at the end of 40 training epochs. (Additional experimental details are presented in appendix C.)

Figure 1 shows the results with the synthetic data set, as well as the MNIST, CIFAR-10, and SVHN data sets. As can be seen, the values of $J(\theta)$ tend to decrease toward zero (and hence the global minimum value) as the width or depth of the neural networks increases. In theory, the values of $J(\theta)$ may not improve as much as desired along depth and width if the representations corresponding to each unit and each layer are redundant in the sense of linear dependence of the columns of $D_{k_l}^{(l)}(\theta)$ (see theorem 3). Intuitively, at initial random weights, this redundancy is mitigated by the randomness of the weights, and hence a major concern is whether such redundancy arises and $J(\theta)$ degrades during training. From Figure 1, it can also be noticed that the values of $J(\theta)$ tend to decrease during training. These empirical results partially support our theoretical observation that increasing the depth and width can improve the quality of local minima.

Figure 1:

The values of $J(\theta)$ for the training data sets ($J(\theta)$ is the right-hand side of equation 4.2) with varying depth $H$ (y-axis) and width $d_l$ for all $l \in \{1, 2, \ldots, H\}$ (x-axis). The heat map colors represent the values of $J(\theta)$. In all panels of this figure, the left heat map (initial) is computed with initial random weights and the right heat map (trained) is calculated after training. It can be seen that both depth and width helped improve the values of $J(\theta)$.

5  Deep Nonlinear Neural Networks with Local Structure

Given the scarcity of theoretical understanding of the optimality of deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that it is valuable to theoretically study simplified models: deep linear neural networks. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that in terms of optimization, deep linear networks exhibited several properties similar to those of deep nonlinear networks. Following these observations, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018), as a step toward the goal of establishing the optimization theory of deep learning.

As another step toward the goal, this section discards the strong linearity assumption and considers a locally induced nonlinear-linear structure in deep nonlinear networks with the piecewise linear activation functions such as ReLUs, leaky ReLUs, and absolute value activations.

5.1  Locally Induced Nonlinear-Linear Structure

In this section, we describe how a standard deep nonlinear neural network can induce nonlinear-linear structure. The nonlinear-linear structure considered in this letter is defined in definition 4: condition i simply defines the index subsets S(l) that pick out the relevant subset of units at each layer l, condition ii requires the existence of n linearly acting units, and condition iii imposes weak separability of edges.

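Definition 4.

A parameter vector $\theta$ is said to induce $(n, t)$ weakly separated linear units on a training input data set $X$ if there exist $(H + 1 - t)$ sets $S^{(t+1)}, S^{(t+2)}, \ldots, S^{(H+1)}$ such that for all $l \in \{t+1, t+2, \ldots, H+1\}$, the following three conditions hold:

  • i.

    $S^{(l)} \subseteq \{1, \ldots, d_l\}$ with $|S^{(l)}| \ge n$.

  • ii.

    $\Phi^{(l)}(X,\theta)_{\cdot,k} = \Phi^{(l-1)}(X,\theta)\, W^{(l)}(\theta)_{\cdot,k}$ for all $k \in S^{(l)}$.

  • iii.

    $W^{(l+1)}(\theta)_{k',k} = 0$ for all $(k', k) \in S^{(l)} \times (\{1, \ldots, d_{l+1}\} \setminus S^{(l+1)})$ if $l \le H - 1$.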
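The three conditions of this definition can be checked mechanically for given index sets. The following is a minimal sketch with a hypothetical function name and a hand-built two-hidden-layer ReLU example; it assumes 0-based unit indices and compares values up to a small absolute tolerance. It checks candidate index sets supplied by the caller and does not search for them.

```python
import numpy as np


def induces_units(Phi, W, S, n, t, atol=1e-9):
    """Check conditions i-iii for candidate sets S[l], l = t+1..H+1.

    Phi[l]: m x d_l layer outputs for l = 0..H+1 (Phi[H+1] is the prediction).
    W[l]:   d_{l-1} x d_l weights for l = 1..H+1 (W[0] is unused).
    S[l]:   set of 0-based unit indices selected at layer l.
    """
    H = len(W) - 2
    for l in range(t + 1, H + 2):
        d_l = W[l].shape[1]
        idx = sorted(S[l])
        # i. S[l] is a subset of the units of layer l with at least n elements.
        if len(idx) < n or any(k < 0 or k >= d_l for k in idx):
            return False
        # ii. The selected units act linearly on the training inputs.
        if not np.allclose(Phi[l][:, idx], Phi[l - 1] @ W[l][:, idx], atol=atol):
            return False
        # iii. No edges from S[l] into units outside S[l+1] (only if l <= H-1).
        if l <= H - 1:
            outside = [k for k in range(W[l + 1].shape[1]) if k not in S[l + 1]]
            if idx and outside:
                if np.any(np.abs(W[l + 1][np.ix_(idx, outside)]) > atol):
                    return False
    return True


# Hand-built ReLU network with H = 2 and widths (1, 2, 2, 1).
X = np.array([[1.0], [2.0]])
W = [
    None,
    np.array([[1.0, -1.0]]),               # unit 0 stays positive, unit 1 is clipped
    np.array([[1.0, 1.0], [0.0, 0.0]]),
    np.array([[1.0], [1.0]]),
]
Phi = [X]
for l in (1, 2):
    Phi.append(np.maximum(Phi[-1] @ W[l], 0.0))
Phi.append(Phi[-1] @ W[3])                 # linear output layer
```

For instance, `induces_units(Phi, W, {3: {0}}, n=1, t=2)` only inspects the linear output layer, which illustrates why the case with t equal to the depth imposes no restriction on the hidden units.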

Given a training input data set $X$, let $\Theta_{n,t}$ be the set of all parameter vectors that induce $(n, t)$ weakly separated linear units on the training input data set $X$ that defines the total loss $L(\theta)$ in equation 2.1. For standard deep nonlinear neural networks, all parameter vectors $\theta$ are in $\Theta_{d_{H+1},H}$, and some parameter vectors $\theta$ are in $\Theta_{n,t}$ for different values of $(n, t)$. Figure 2a illustrates locally induced structures for $\theta \in \Theta_{1,0}$. For a parameter $\theta$ to be in $\Theta_{n,t}$, definition 4 requires only the existence of a portion $n/d_l$ of units that act linearly on the particular training data set merely at the particular $\theta$. Thus, all units can be nonlinear, act nonlinearly on the training data set outside of some parameters $\theta$, and always operate nonlinearly on other inputs $x$ (for example, in a test data set or a different training data set). The weak separability requires that the edges going from the $n$ units to the rest of the network are negligible. The weak separability does not require the $n$ units to be separated from the rest of the neural network.

Figure 2:

Illustration of locally induced nonlinear-linear structures. (a) Simple examples of the structure with weakly separated edges considered in this section (see definition 4). (b) Examples of a simpler structure with strongly separated edges (see definition 8). The red nodes represent the linearly acting units on a training data set at a particular $\theta$, and the white nodes are the remaining units. The black dashed edges represent standard edges without any assumptions. The red nodes are allowed to depend on all nodes from the previous layer in panel a, whereas they are not allowed in panel b except for the input layer. In both panels a and b, two examples of parameters $\theta$ are presented with the exact same network architecture (including activation functions and edges). Even if the network architecture (or parameterization) is identical, different parameters $\theta$ can induce different local structures. With $\Theta_{1,4}$, this local structure always holds in standard deep nonlinear networks with four hidden layers.

Here, a neural network with $\theta \in \Theta_{n,t}$ can be a standard deep nonlinear neural network (without any linear units in its architecture), a deep linear neural network (with all activation functions being linear), or a combination of these cases. Whereas a standard deep nonlinear neural network can naturally have parameters $\theta \in \Theta_{n,t}$, it is possible to guarantee that all parameters $\theta$ are in $\Theta_{n,t}$ with desired $(n, t)$ simply by using corresponding network architectures. For standard deep nonlinear neural networks, one can also restrict all relevant convergent solution parameters $\theta$ to be in $\Theta_{n,t}$ by using some corresponding learning algorithms. Our theoretical results hold for all of these cases.

5.2  Theoretical Result

We state our main theoretical result in theorem 5 and corollary 7; a simplified statement is presented in remark 6. Here, a classical machine learning method, basis function regression, is used as a baseline to be compared with neural networks. The global minimum value of basis function regression with an arbitrary basis matrix $M(X)$ is $\inf_R \frac{1}{2}\|M(X)R - Y\|_F^2$, where the basis matrix $M(X)$ does not depend on $R$ and can represent nonlinear maps, for example, by setting $M = ([\varphi(x_i)]_{i=1}^{m})^\top \in \mathbb{R}^{m \times d_\varphi}$ with any nonlinear basis functions $\varphi$ and any finite $d_\varphi$. In theorem 5, the expression $P_N[\Phi^{(S)}]Y$ represents the projection of $Y$ onto the null space of $(\Phi^{(S)})^\top$, which is also $Y$ minus the projection of $Y$ onto the column space of $\Phi^{(S)}$. Given matrices $(M^{(j)})_{j \in S}$ with a sequence $S = (s_1, s_2, \ldots, s_n)$, define $[M^{(j)}]_{j \in S} := [M^{(s_1)} \; M^{(s_2)} \; \cdots \; M^{(s_n)}]$ to be a block matrix with column blocks $M^{(s_1)}, M^{(s_2)}, \ldots, M^{(s_n)}$. Let $S \subseteq (s_1, s_2, \ldots, s_n)$ denote a subsequence of $(s_1, s_2, \ldots, s_n)$.
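The baseline quantity used in the theorem, the global minimum value of basis function regression, can be computed either by least squares or through the projection onto the orthogonal complement of the basis matrix's column space. The sketch below checks that the two agree; the polynomial basis, the sizes, and the random data are hypothetical choices made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d_phi, d_y = 20, 4, 2                        # hypothetical sizes
x = rng.standard_normal(m)                      # scalar raw inputs
Y = rng.standard_normal((m, d_y))               # targets

# Hypothetical basis matrix: polynomial features 1, x, x^2, x^3 (one row per sample).
M = np.vander(x, N=d_phi, increasing=True)      # m x d_phi

# Global minimum of 0.5 * ||M R - Y||_F^2 over R, obtained by least squares.
R_opt, *_ = np.linalg.lstsq(M, Y, rcond=None)
global_min = 0.5 * np.linalg.norm(M @ R_opt - Y, "fro") ** 2

# Same value via the projection onto the orthogonal complement of col(M).
P_N = np.eye(m) - M @ np.linalg.pinv(M)
proj_value = 0.5 * np.linalg.norm(P_N @ Y, "fro") ** 2

# Any other coefficient matrix can only do worse (or equally well).
R_other = rng.standard_normal((d_phi, d_y))
other_value = 0.5 * np.linalg.norm(M @ R_other - Y, "fro") ** 2
```

In the theorem, the role of the basis matrix is played by the concatenated hidden representations of the network at the local minimum rather than by a handcrafted basis.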

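Theorem 5.
For any $t \in \{0, 1, \ldots, H\}$, every differentiable local minimum $\theta \in \Theta_{d_{H+1},t}$ of $L$ satisfies that for any subsequence $S \subseteq (t, t+1, \ldots, H)$ (including the case of $S$ being the empty sequence),
$$L(\theta) \le \underbrace{\frac{1}{2}\big\|P_N[\Phi^{(S)}]\,Y\big\|_F^2}_{\text{global minimum value of basis function regression with basis matrix } \Phi^{(S)}} - \underbrace{\sum_{l=1}^{H} \sum_{k_l=1}^{d_l} \frac{1}{2}\big\|P\big[N_{k_l}^{(l)} P_N[\bar{\Phi}^{(S)}] D_{k_l}^{(l)}\big]\, \mathrm{vec}(Y)\big\|_2^2}_{\ge 0,\ \text{further improvement as a network gets wider and deeper}}, \tag{5.1}$$
where $P_N[\Phi^{(S)}] \in \mathbb{R}^{m \times m}$, $P_N[\bar{\Phi}^{(S)}] \in \mathbb{R}^{m d_{H+1} \times m d_{H+1}}$, $\Phi^{(S)} = [\Phi^{(l)}]_{l \in S}$, and $\bar{\Phi}^{(S)} = [I_{d_{H+1}} \otimes \Phi^{(l)}]_{l \in S}$. If $S$ is empty, $P_N[\Phi^{(S)}] = I_m$ and $P_N[\bar{\Phi}^{(S)}] = I_{m d_{H+1}}$. The matrices $D_{k_l}^{(l)}$ and $N_{k_l}^{(l)}$ are defined in theorem 3 with the exception that $Q_{k_l}^{(l)} = N_{k_l}^{(l)} P_N[\bar{\Phi}^{(S)}] D_{k_l}^{(l)}$ (instead of $Q_{k_l}^{(l)} := N_{k_l}^{(l)} D_{k_l}^{(l)}$).

Remark 6.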

From theorem 5 (or corollary 7), one can see the following properties of the loss landscape:

  • i.

    Every differentiable local minimum $\theta \in \Theta_{d_{H+1},t}$ has a loss value $L(\theta)$ better than or equal to any global minimum value of basis function regression with any combination of the basis matrices in the set $\{\Phi^{(l)}\}_{l=t}^{H}$ of fixed deep hierarchical representation matrices. In particular, with $t = 0$, every differentiable local minimum $\theta \in \Theta_{d_{H+1},0}$ has a loss value $L(\theta)$ no worse than the global minimum values of standard basis function regression with the handcrafted basis matrix $\Phi^{(0)} = X$, and of basis function regression with the larger basis matrix $[\Phi^{(l)}]_{l=0}^{H}$.

  • ii.

    As $d_l$ and $H$ increase (or, equivalently, as a neural network gets wider and deeper), the upper bound on the loss values of local minima can further improve.

The proof of theorem 5 is provided in section A.2. The proof is based on the combination of the idea presented in section 3.1 and perturbations of a local minimum candidate. That is, if a θ is a local minimum, then the θ is a global minimum within a local region (i.e., a neighborhood of θ). Thus, after perturbing θ as θ'=θ+Δθ such that Δθ is sufficiently small (so that θ' stays in the local region) and L(θ')=L(θ), the θ' must be still a global minimum within the local region and, hence, the θ' is also a local minimum. The proof idea of theorem 5 is to apply the proof sketch in section 3.1 to not only a local minimum candidate θ but also its perturbations θ'=θ+Δθ.

In terms of overparameterization, theorem 5 states that local minima of deep neural networks are as good as global minima of the corresponding basis function regression even without overparameterization, and overparameterization helps to further improve the guarantee on local minima. The effect of overparameterization is captured in both the first and second terms on the right-hand side of equation 5.1. As depth and width increase, the second term tends to increase, and hence the guarantee on local minima can improve. Moreover, as depth and width increase (for some of the $(t+1)$th, $(t+2)$th, $\ldots$, $H$th layers in theorem 5), the first term tends to decrease, and the guarantee on local minima can also improve. For example, if $[\Phi^{(l)}]_{l=t}^{H}$ has rank at least $m$, then the first term is zero and, hence, every local minimum is a global minimum with zero loss value. As a special case of this example, since every $\theta$ is automatically in $\Theta_{d_{H+1},H}$, if $\Phi^{(H)}$ is forced to have rank at least $m$, every local minimum becomes a global minimum for standard deep nonlinear neural networks, which coincides with the observation about overparameterization by Livni et al. (2014).

Without overparameterization, theorem 5 also recovers one of the main results in the literature of deep linear neural networks as a special case: every local minimum is a global minimum. If $d_{H+1}\le\min\{d_l: 1\le l\le H\}$, every local minimum $\theta$ for deep linear networks is differentiable and in $\Theta_{d_{H+1},0}$, and hence theorem 3 yields that $L(\theta)\le\frac{1}{2}\|P_N[X]Y\|_F^2$. Because $\frac{1}{2}\|P_N[X]Y\|_F^2$ is the global minimum value, this implies that every local minimum is a global minimum for deep linear neural networks.

Corollary 7 states that the same conclusion and discussions as in theorem 5 hold true even if we fix the edges in condition iii in definition 4 to be zero (by removing them as an architectural design or by forcing it with a learning algorithm) and consider optimization problems only with remaining edges.

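Corollary 7.
For any $t\in\{0,1,\dots,H\}$, every differentiable local minimum $\theta\in\Theta_{d_{H+1},t}$ of $L|_{I}$ satisfies that for any subsequence $S\subseteq(t,t+1,\dots,H)$ (including the case of $S$ being the empty sequence),
$$L(\theta)\le\underbrace{\tfrac{1}{2}\big\|P_N[\Phi^{(S)}]Y\big\|_F^2}_{\text{global minimum value of basis function regression with basis matrix }\Phi^{(S)}}-\underbrace{\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\tfrac{1}{2}\Big\|P\big[\hat N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]\hat D_{k_l}^{(l)}\big]\mathrm{vec}(Y)\Big\|_2^2}_{\ge 0,\ \text{further improvement as a network gets wider and deeper}},\tag{5.2}$$
where $L|_{I}$ is the restriction of $L$ to $I=\{\theta'\in\mathbb{R}^{d_\theta}:\forall l\in\{t+1,\dots,H-1\},\ \forall(k',k)\in S^{(l)}\times\overline{S^{(l+1)}},\ W^{(l+1)}(\theta')_{k',k}=0\}$ with the index sets $S^{(t+1)},S^{(t+2)},\dots,S^{(H+1)}$ of the $\theta\in\Theta_{d_{H+1},t}$ in definition 4 and $\overline{S^{(l)}}:=\{1,\dots,d_l\}\setminus S^{(l)}$. Here, $\Phi^{(S)}$ and $\bar\Phi^{(S)}$ are defined in theorem 5, and the matrices $\hat D_{k_l}^{(l)}$ and $\hat N_{k_l}^{(l)}$ are defined as follows. For all $l\in\{1,\dots,t+1\}$, $\hat D_{k_l}^{(l)}:=D_{k_l}^{(l)}$ for all $k_l\in\{1,\dots,d_l\}$ (where $D_{k_l}^{(l)}$ is defined in theorem 5). For all $l\in\{t+2,\dots,H\}$, $\hat D_{k_l}^{(l)}:=D_{k_l}^{(l)}$ for all $k_l\in S^{(l)}$, and
$$\hat D_{k_l}^{(l)}:=\sum_{k_{l+1}=1}^{d_{l+1}}\cdots\sum_{k_H=1}^{d_H}\big(W^{(l+1)}_{k_l,k_{l+1}}\cdots W^{(H)}_{k_{H-1},k_H}W^{(H+1)}_{k_H,\cdot}\big)^\top\otimes\Lambda_{l,k_l}\cdots\Lambda_{H,k_H}\big[\Phi^{(l-1)}_{\cdot,j}\big]_{j\in\overline{S^{(l-1)}}},$$
with $\hat D_{k_l}^{(H)}:=(W^{(H+1)}_{k_H,\cdot})^\top\otimes\Lambda_{H,k_H}[\Phi^{(H-1)}_{\cdot,j}]_{j\in\overline{S^{(H-1)}}}$ for all $k_l\in\overline{S^{(l)}}$. For any $l\in\{1,\dots,H\}$ and any $k_l\in\{1,\dots,d_l\}$, $\hat N_{k_l}^{(l)}:=P_N[\bar Q_{k_l-1}^{(l)}]$ with $N_1^{1}:=I_m$, where $\bar Q_{k_l}^{(l)}:=[Q_1^{(1)},\dots,Q_{d_1}^{(1)},Q_1^{(2)},\dots,Q_{d_2}^{(2)},\dots,Q_1^{(l)},\dots,Q_{k_l}^{(l)}]$, $Q_{k_l}^{(l)}:=\hat N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]\hat D_{k_l}^{(l)}$, and $\bar Q_0^{(l)}:=\bar Q_{d_{l-1}}^{(l-1)}$.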

The proof of corollary 7 is provided in section A.3 and follows the proof of theorem 5. Here, $\Phi^{(0)}=X$ consists of training inputs $x_i$ in an arbitrary given feature space embedded in $\mathbb{R}^{d_0}$; for example, given a raw input $x_{\mathrm{raw}}$ and any feature map $\phi:x_{\mathrm{raw}}\mapsto\phi(x_{\mathrm{raw}})\in\mathbb{R}^{d_0}$ (including the identity $\phi(x_{\mathrm{raw}})=x_{\mathrm{raw}}$), we write $x=\phi(x_{\mathrm{raw}})$. Therefore, theorem 5 and corollary 7 state that every differentiable local minimum of deep neural networks can be guaranteed to be no worse than any given basis function regression model with a handcrafted basis taking values in $\mathbb{R}^{d}$ with some finite $d$, such as polynomial regression with a finite degree and radial basis function regression with a finite number of centers.

To illustrate an advantage of the notion of weakly separated edges in definition 4, one can consider the following alternative definition that requires strongly separated edges.

Definition 8.

A parameter vector $\theta$ is said to induce $(n,t)$ strongly separated linear units on the training input data set $X$ if there exist $(H+1-t)$ sets $S^{(t+1)},S^{(t+2)},\dots,S^{(H+1)}$ such that for all $l\in\{t+1,t+2,\dots,H+1\}$, conditions i to iii in definition 4 hold and $\Phi^{(l)}(X,\theta)W^{(l+1)}(\theta)_{\cdot,k}=\sum_{k'\in S^{(l)}}\Phi^{(l)}(X,\theta)_{\cdot,k'}W^{(l+1)}(\theta)_{k',k}$ for all $k\in S^{(l+1)}$ if $l\notin\{H,H+1\}$.

Let $\Theta_{n,t}^{\mathrm{strong}}$ be the set of all parameter vectors that induce $(n,t)$ strongly separated linear units on the particular training input data set $X$ that defines the total loss $L(\theta)$ in equation 2.1. Figure 2 shows a comparison of weakly separated edges and strongly separated edges. Under this stronger restriction on the local structure, we can obtain corollary 9.

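Corollary 9.
For any $t\in\{0,1,\dots,H\}$, every differentiable local minimum $\theta\in\Theta_{d_{H+1},t}$ of $L$ satisfies that for any $S\subseteq(t,H)$,
$$L(\theta)\le\tfrac{1}{2}\big\|P_N[\Phi^{(S)}]Y\big\|_F^2-\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\tfrac{1}{2}\Big\|P\big[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}\big]\mathrm{vec}(Y)\Big\|_2^2,$$
where $\Phi^{(S)}$, $\bar\Phi^{(S)}$, $D_{k_l}^{(l)}$, and $N_{k_l}^{(l)}$ are defined in theorem 5.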

The proof of corollary 9 is provided in section A.4 and follows the proof of theorem 5. As a special case, corollary 9 also recovers the statement that every local minimum is a global minimum for deep linear neural networks in the same way as in theorem 5. When compared with theorem 5, one can see that the statement in corollary 9 is weaker, producing the upper bound only in terms of $S\subseteq(t,H)$. This is because the restriction of strongly separated units forces neural networks to have less expressive power with fewer effective edges. This illustrates an advantage of the notion of weakly separated edges in definition 4.

A limitation in theorems 3 and 5 and corollary 7 is the lack of treatment of nondifferentiable local minima. The Lebesgue measure of nondifferentiable points is zero, but this does not imply that the appropriate measure of nondifferentiable points is small. For example, if L(θ)=|θ|, the Lebesgue measure of the nondifferentiable point (θ=0) is zero, but the nondifferentiable point is the only local and global minimum. Thus, the treatment of nondifferentiable points in this context is a nonnegligible problem. The proofs of theorems 3 and 5 and corollary 7 are all based on the proof sketch in section 3.1, which heavily relies on the differentiability. Thus, the current proofs do not trivially extend to address this open problem.

6  Conclusion

In this letter, we have theoretically and empirically analyzed the effect of depth and width on the loss values of local minima, with and without a possible local nonlinear-linear structure. The local nonlinear-linear structure we have considered might naturally arise during training and is also guaranteed to emerge by using specific learning algorithms or architecture designs. With the local nonlinear-linear structure, we have proved that the values of local minima of neural networks are no worse than the global minimum values of corresponding basis function regression and can improve as depth and width increase. In the general case without the possible local structure, we have theoretically shown that increasing the depth and width can improve the quality of local minima, and we empirically supported this theoretical observation. Furthermore, without the local structure but with a shallow neural network and a gaussian data matrix, we have proven probabilistic bounds on the rates of the improvements on the local minimum values with respect to width. Moreover, we have discussed a major limitation of this letter: all of its results focus on the differentiable points on the loss surfaces. Additional treatments of the nondifferentiable points are left to future research.

Our results suggest that the values of local minima are not arbitrarily poor (unless one crafts a pathological worst-case example) and can be guaranteed to some desired degree in practice, depending on the degree of overparameterization, as well as the local or global structural assumption. Indeed, a structural assumption, namely the existence of an identity map, was recently used to analyze the quality of local minima (Shamir, 2018; Kawaguchi & Bengio, 2018). When compared with these previous studies (Shamir, 2018; Kawaguchi & Bengio, 2018), we have shown the effect of depth and width, as well as considered a different type of neural network without the explicit identity map.

In practice, we often “overparameterize” a hypothesis space in deep learning in a certain sense (e.g., in terms of expressive power). Theoretically, with strong overparameterization assumptions, we can show that every stationary point (including all local minima) with respect to a single layer is a global minimum with the zero training error and can memorize any data set. However, “overparameterization” in practice may not satisfy such strong overparameterization assumptions in the theoretical literature. In contrast, our results in this letter do not require overparameterization and show the gradual effects of overparameterization as consequences of general results.

Appendix A:  Proofs for Nonprobabilistic Statements

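Let $D_{k_l}^{(l)}$ be defined in theorem 5. Let $D^{(l)}:=[D_k^{(l)}]_{k=1}^{d_l}\in\mathbb{R}^{md_{H+1}\times d_ld_{l-1}}$ and $D:=[D^{(l)}]_{l=1}^{H}\in\mathbb{R}^{md_{H+1}\times\sum_{l=1}^{H}d_{l-1}d_l}$. Given a matrix-valued function $f(\theta)\in\mathbb{R}^{d'\times d}$, let $\partial_{W^{(l)}}f(\theta):=\frac{\partial\mathrm{vec}(f)}{\partial\mathrm{vec}(W^{(l)})}\in\mathbb{R}^{d'd\times d_{l-1}d_l}$ be the partial derivative of $\mathrm{vec}(f)$ with respect to $\mathrm{vec}(W^{(l)})$. Let $\{j,j+1,\dots,j'\}:=\emptyset$ if $j>j'$. Let $M^{(l)}M^{(l+1)}\cdots M^{(l')}=I$ if $l>l'$. Let $\mathrm{Null}(M)$ be the null space of a matrix $M$. Let $B(\theta,\epsilon)$ be an open ball of radius $\epsilon$ with the center at $\theta$.

The following lemma decomposes the model output $\hat Y$ in terms of the weight matrix $W^{(l)}$ and $D^{(l)}$, which coincides with its derivatives at differentiable points.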

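Lemma 10.
For all $l\in\{1,\dots,H\}$,
$$\mathrm{vec}(\hat Y(X,\theta))=D^{(l)}\mathrm{vec}(W^{(l)}(\theta)),$$
and at any differentiable $\theta$,
$$\partial_{W^{(l)}}\hat Y(X,\theta)=D^{(l)}.$$
Proof.
Define $G^{(l)}$ to be the preactivation output of the $l$th hidden layer as $G^{(l)}:=G^{(l)}(X,\theta):=\Phi^{(l-1)}(X,\theta)W^{(l)}$. By the linearity of the vec operation and the definition of $G^{(l)}$, we have that
$$\mathrm{vec}[G^{(l+1)}(X,\theta)]=\mathrm{vec}\Big[\sum_{k=1}^{d_l}\Lambda_{l,k}G^{(l)}(X,\theta)_{\cdot,k}W^{(l+1)}_{k,\cdot}\Big]=\sum_{k=1}^{d_l}\big[(W^{(l+1)}_{k,\cdot})^\top\otimes\Lambda_{l,k}\big]\mathrm{vec}\big[G^{(l)}(X,\theta)_{\cdot,k}\big]=F^{(l+1)}\mathrm{vec}[G^{(l)}(X,\theta)],$$
where $F^{(l+1)}:=[(W^{(l+1)}_{k,\cdot})^\top\otimes\Lambda_{l,k}]_{k=1}^{d_l}$. Therefore,
$$\mathrm{vec}(\hat Y)=F^{(H+1)}F^{(H)}\cdots F^{(l+1)}\mathrm{vec}(G^{(l)})=F^{(H+1)}\cdots F^{(l+1)}[I_{d_l}\otimes\Phi^{(l-1)}]\mathrm{vec}(W^{(l)}),$$
where $F^{(H+1)}\cdots F^{(l+1)}[I_{d_l}\otimes\Phi^{(l-1)}]=[D_1^{(l)}\ D_2^{(l)}\ \cdots\ D_{d_l}^{(l)}]=D^{(l)}$, which proves the first statement that $\mathrm{vec}(\hat Y)=D^{(l)}\mathrm{vec}(W^{(l)})$. The second statement follows from the fact that the derivatives of $D^{(l)}$ with respect to $\mathrm{vec}(W^{(l)})$ are zero at any differentiable point, and hence $\partial_{W^{(l)}}\hat Y=D^{(l)}+0$.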
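The first statement of the lemma can be checked numerically in the simplest setting. The following is a minimal sketch, assuming a one-hidden-layer ReLU network ($H=1$, $l=1$), the column-stacking convention for vec, and $\Lambda_{1,k}$ taken to be the diagonal 0/1 activation pattern of hidden unit $k$; the sizes and variable names are illustrative and not from the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d0, d1, d2 = 7, 3, 4, 2            # samples, input, hidden, and output sizes (illustrative)
X = rng.standard_normal((m, d0))
W1 = rng.standard_normal((d0, d1))
W2 = rng.standard_normal((d1, d2))

pre = X @ W1                          # preactivation G^(1)
Y_hat = np.maximum(pre, 0.0) @ W2     # output of a one-hidden-layer ReLU network

blocks = []
for k in range(d1):
    # Lambda_{1,k}: diagonal 0/1 activation pattern of hidden unit k (assumed ReLU)
    Lam_k = np.diag((pre[:, k] > 0).astype(float))
    # block D_k^(1) = (W2[k, :])^T kron (Lambda_{1,k} X), of shape (m*d2, d0)
    blocks.append(np.kron(W2[k, :].reshape(-1, 1), Lam_k @ X))
D1 = np.hstack(blocks)                # D^(1), of shape (m*d2, d0*d1)

def vec(M):
    """Column-stacking vec operator."""
    return M.flatten(order="F")

lhs = vec(Y_hat)
rhs = D1 @ vec(W1)
print(np.max(np.abs(lhs - rhs)))      # numerically zero
```

The check relies only on the identity $\mathrm{vec}(AXB)=(B^\top\otimes A)\mathrm{vec}(X)$ used in the proof above.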

Lemma 11 generalizes part of theorem A.45 in Rao, Toutenburg, Shalabh, and Heumann (2007) by discarding invertibility assumptions.

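Lemma 11.
For any block matrix $[A\ B]$ with real submatrices $A$ and $B$ such that $A^\top B=0$,
$$P[[A\ B]]=P[A]+P[B].$$
Proof.
It follows from a straightforward calculation:
$$P[[A\ B]]=[A\ B]\begin{bmatrix}A^\top A&0\\0&B^\top B\end{bmatrix}^{\dagger}\begin{bmatrix}A^\top\\B^\top\end{bmatrix}=[A\ B]\begin{bmatrix}(A^\top A)^{\dagger}&0\\0&(B^\top B)^{\dagger}\end{bmatrix}\begin{bmatrix}A^\top\\B^\top\end{bmatrix}=P[A]+P[B].$$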
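As a quick numerical sanity check of this identity, the sketch below builds random matrices $A$ and $B$ with $A^\top B=0$ and compares both sides. The helper name `proj` and the construction of $B$ are illustrative assumptions; the projection is computed through the Moore-Penrose pseudoinverse.

```python
import numpy as np

def proj(M):
    """Orthogonal projection onto the column space of M, via the pseudoinverse."""
    return M @ np.linalg.pinv(M)

rng = np.random.default_rng(1)
n = 8
A = rng.standard_normal((n, 3))
# Force A^T B = 0 by projecting a random matrix onto the orthogonal complement of col(A).
B = (np.eye(n) - proj(A)) @ rng.standard_normal((n, 2))

orthogonality = np.max(np.abs(A.T @ B))
gap = np.max(np.abs(proj(np.hstack([A, B])) - (proj(A) + proj(B))))
print(orthogonality, gap)             # both numerically zero
```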

Lemma 12 decomposes a norm of a projected target vector into a form that clearly shows an effect of depth and width.

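Lemma 12.
For any $t\in\{0,1,\dots,H\}$ and any $S\subseteq(t,t+1,\dots,H)$,
$$\Big\|P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)\Big\|_2^2=\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\Big\|P\big[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}\big]\mathrm{vec}(Y)\Big\|_2^2.$$
Proof.
Since the span of the columns of $[A\ B]$ is the same as the span of the columns of $[A\ \ P_N[A]B]$ for submatrices $A$ and $B$, the span of the columns of $P_N[\bar\Phi^{(S)}]D=[[P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}]_{k_l=1}^{d_l}]_{l=1}^{H}$ is the same as the span of the columns of $[[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}]_{k_l=1}^{d_l}]_{l=1}^{H}$. Then, by repeatedly applying lemma 11 to each block of $[[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}]_{k_l=1}^{d_l}]_{l=1}^{H}$, we have that
$$P\big[P_N[\bar\Phi^{(S)}]D\big]=P\Big[\big[[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}]_{k_l=1}^{d_l}\big]_{l=1}^{H}\Big]=\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}P\big[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}\big].$$
From the construction of $N_{k_l}^{(l)}$, we have that for all $(l,k)\ne(l',k')$,
$$P\big[N_{k}^{(l)}P_N[\bar\Phi^{(S)}]D_{k}^{(l)}\big]\,P\big[N_{k'}^{(l')}P_N[\bar\Phi^{(S)}]D_{k'}^{(l')}\big]=0.$$
Therefore,
$$\Big\|P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)\Big\|_2^2=\Big\|\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}P\big[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}\big]\mathrm{vec}(Y)\Big\|_2^2=\sum_{l=1}^{H}\sum_{k_l=1}^{d_l}\Big\|P\big[N_{k_l}^{(l)}P_N[\bar\Phi^{(S)}]D_{k_l}^{(l)}\big]\mathrm{vec}(Y)\Big\|_2^2.$$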

The following lemma plays a major role in the proof of theorem 5.

Lemma 13.
For any $t\in\{0,1,\dots,H\}$, every differentiable local minimum $\theta\in\Theta_{d_{H+1},t}$ satisfies that for any $l\in\{t,t+1,\dots,H\}$,
$$(\Phi^{(l)})^\top(\hat Y(X,\theta)-Y)=0.$$
Proof.
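Fix $t$ to be a number in $\{0,1,\dots,H\}$. Let $\theta$ be a differentiable local minimum in $\Theta_{d_{H+1},t}$. Then, from the definition of a local minimum, there exists $\epsilon_1>0$ such that $L(\theta)\le L(\theta')$ for all $\theta'\in B(\theta,\epsilon_1)$, and hence $L(\theta)\le L(\theta')$ for all $\theta'\in\tilde B(\theta,\epsilon_1)\subseteq B(\theta,\epsilon_1)$, where $\tilde B(\theta,\epsilon_1):=B(\theta,\epsilon_1)\cap\{\theta'\in\mathbb{R}^{d_\theta}:W^{(l+1)}(\theta')_{k',k}=0\text{ for all }l\in\{t+1,t+2,\dots,H-1\}\text{ and all }(k',k)\in S^{(l)}\times(\{1,\dots,d_{l+1}\}\setminus S^{(l+1)})\}$ with the index sets $S^{(t+1)},S^{(t+2)},\dots,S^{(H+1)}$ of the $\theta\in\Theta_{d_{H+1},t}$ in definition 4. Without loss of generality, we can permute the indices of the units within each layer such that for all $l\in\{t+1,t+2,\dots,H+1\}$, $S^{(l)}\supseteq\{1,2,\dots,d_L\}$ with some $d_L\ge d_{H+1}$ in the definition of $\Theta_{d_{H+1},t}$ (see definition 4). Note that the considered activation functions $\sigma_{i,j}^{(l)}(z)$ are continuous and act linearly on $z\ge0$. Thus, from the definition of $\Theta_{d_{H+1},t}$, there exists $\epsilon_2>0$ such that for all $\theta'\in\tilde B(\theta,\epsilon_2)$ and all $l\in\{t,t+1,\dots,H\}$,
$$\hat Y(X,\theta')=\Phi^{(l)}\begin{bmatrix}A^{(l+1)}\\C^{(l+1)}\end{bmatrix}A^{(l+2)}\cdots A^{(H+1)}+\sum_{l'=l}^{H-1}Z^{(l'+1)}C^{(l'+2)}A^{(l'+3)}\cdots A^{(H+1)},\tag{A.1}$$
where $A^{(l)}$, $B^{(l)}$, and $C^{(l)}$ are submatrices of $W^{(l)}(\theta')$, and $Z^{(l)}$ is a submatrix of $\Phi^{(l)}(X,\theta')$ as defined below:
$$\begin{bmatrix}A^{(l)}&\xi^{(l)}\\C^{(l)}&B^{(l)}\end{bmatrix}:=W^{(l)}(\theta'),$$
and
$$Z^{(t+1)}:=\sigma^{(t+1)}\Big(\Phi^{(t)}\begin{bmatrix}\xi^{(t+1)}\\B^{(t+1)}\end{bmatrix}\Big)\quad\text{with}\quad Z^{(l)}:=\sigma^{(l)}(Z^{(l-1)}B^{(l)})\ \text{for }l\ge t+2.$$
Note that $Z^{(l)}$ depends only on $\Phi^{(t)}$, $\xi^{(t+1)}$, and $B^{(k)}$ for all $k\le l$. Here, $\Phi^{(t)}$ does not depend on $A^{(l)}$ and $C^{(l)}$ for all $l\ge t+1$. That is, at each layer $l\in\{t+2,t+3,\dots,H\}$, $A^{(l)}\in\mathbb{R}^{d_L\times d_L}$ connects $d_L$ linearly acting units to the next $d_L$ linearly acting units, $B^{(l)}\in\mathbb{R}^{(d_{l-1}-d_L)\times(d_l-d_L)}$ connects other units to the next other units (other units can include both nonlinear and linearly acting units), and $C^{(l)}\in\mathbb{R}^{(d_{l-1}-d_L)\times d_L}$ connects other units to the next linearly acting units, with $d_L\ge d_{H+1}$. Here, $A^{(t+1)}$, $B^{(t+1)}$, $C^{(t+1)}$, and $\xi^{(t+1)}$ connect the possibly unstructured layer $\Phi^{(t)}$ to the next structured layer, $C^{(H+1)}\in\mathbb{R}^{(d_H-d_L)\times d_{H+1}}$ connects other units in the last hidden layer to the output units, and $A^{(H+1)}\in\mathbb{R}^{d_L\times d_{H+1}}$ connects linearly acting units in the last hidden layer to the output units.
Let $\epsilon_3=\min(\epsilon_1,\epsilon_2)$. Let $l$ be an arbitrary fixed number in $\{t,t+1,\dots,H\}$ in the following. Let $r:=\hat Y(X,\theta)-Y$. Define
$$R^{(l+1)}:=\begin{bmatrix}A^{(l+1)}\\C^{(l+1)}\end{bmatrix}.$$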
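From the condition of differentiable local minimum, we have that
$$0=\partial_{R^{(l+1)}}L(\theta)=\mathrm{vec}\big((\Phi^{(l)})^\top r\,(A^{(l+2)}\cdots A^{(H+1)})^\top\big),$$
since otherwise, $R^{(l+1)}$ can be moved in the direction of $\partial_{R^{(l+1)}}L(\theta)$ with a sufficiently small magnitude $\epsilon_3'\in(0,\epsilon_3)$ and decrease the loss value. This implies that
$$(\Phi^{(l)})^\top r\,(A^{(l+2)}\cdots A^{(H+1)})^\top=0.$$
If $\mathrm{rank}(A^{(l+2)}\cdots A^{(H+1)})\ge d_{H+1}$ or $l=H$, then this equation yields the desired statement of this lemma as $(\Phi^{(l)})^\top r=0$. Hence, the rest of this proof considers the case of
$$\mathrm{rank}(A^{(l+2)}\cdots A^{(H+1)})<d_{H+1}\quad\text{and}\quad l\in\{t,t+1,\dots,H-1\}.$$
Define an index $l^*$ as
$$l^*:=\min\{l'\in\mathbb{Z}^+:l+3\le l'\le H+2\ \wedge\ \mathrm{rank}(A^{(l')}\cdots A^{(H+1)})\ge d_{H+1}\},$$
where $A^{(H+2)}\cdots A^{(H+1)}:=I_{d_{H+1}}$. This minimum exists since the set contains at least $H+2$ (nonempty) and is finite. Then we have that $\mathrm{rank}(A^{(l^*)}\cdots A^{(H+1)})\ge d_{H+1}$ and $\mathrm{rank}(A^{(l')}\cdots A^{(H+1)})<d_{H+1}$ for all $l'\in\{l+2,l+3,\dots,l^*-1\}$, since $\mathrm{rank}(M_1M_2)\le\min(\mathrm{rank}(M_1),\mathrm{rank}(M_2))$. Therefore, for all $l'\in\{l+1,l+2,\dots,l^*-2\}$, we have that $\mathrm{Null}((A^{(l'+1)}\cdots A^{(H+1)})^\top)\ne\{0\}$, and there exists a vector $u_{l'}\in\mathbb{R}^{d_L}$ such that
$$u_{l'}\in\mathrm{Null}\big((A^{(l'+1)}\cdots A^{(H+1)})^\top\big)\quad\text{and}\quad\|u_{l'}\|_2=1.$$
Let $u_{l'}$ denote such a vector for all $l'\in\{l+1,l+2,\dots,l^*-2\}$. For all $l'\in\{l+2,l+3,\dots,l^*-2\}$, define
$$\tilde A^{(l')}(\nu_{l'}):=A^{(l')}+\nu_{l'}u_{l'}^\top\quad\text{and}\quad\tilde R^{(l+1)}(\nu_{l+1}):=R^{(l+1)}+\nu_{l+1}u_{l+1}^\top,$$
where $\nu_{l'}\in\mathbb{R}^{d_L}$ and $\nu_{l+1}\in\mathbb{R}^{d_l}$. Let $\tilde\theta(\nu_{l+1},\nu_{l+2},\dots,\nu_{l^*-2})$ be $\theta$ with $A^{(l')}$ and $R^{(l+1)}$ being replaced by $\tilde A^{(l')}(\nu_{l'})$ and $\tilde R^{(l+1)}(\nu_{l+1})$ for all $l'\in\{l+2,l+3,\dots,l^*-2\}$. Then for any $(\nu_{l+1},\nu_{l+2},\dots,\nu_{l^*-2})$,
$$\hat Y(X,\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))=\hat Y(X,\theta)\quad\text{and}\quad L(\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))=L(\theta),$$
since $\tilde A^{(l')}(\nu_{l'})A^{(l'+1)}\cdots A^{(H+1)}=A^{(l')}A^{(l'+1)}\cdots A^{(H+1)}$ for all $l'\in\{l+2,l+3,\dots,l^*-2\}$ and $\tilde R^{(l+1)}(\nu_{l+1})A^{(l+2)}\cdots A^{(H+1)}=R^{(l+1)}A^{(l+2)}\cdots A^{(H+1)}$.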
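For any sufficiently small vector $(\nu_{l+1},\dots,\nu_{l^*-2})$ such that $\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2})\in\tilde B(\theta,\epsilon_3/2)$, if $\theta$ is a local minimum, every $\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2})$ is also a local minimum with respect to the entries of $A^{(l')}$, $B^{(l')}$, and $C^{(l')}$ for all $l'$ because there exists $\epsilon_3'=\epsilon_3/2>0$ such that
$$L(\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))=L(\theta)\le L(\theta')$$
for all $\theta'\in\tilde B(\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}),\epsilon_3')\subseteq\tilde B(\theta,\epsilon_3)\subseteq\tilde B(\theta,\epsilon_1)\subseteq B(\theta,\epsilon_1)$, where the first inclusion follows from the triangle inequality. Thus, for any such $\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2})$ in the sufficiently small open ball, we have that
$$\partial_{A^{(l^*-1)}}L(\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))=0,$$
where $\partial_{A^{(l^*-1)}}L(\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))$ exists within the sufficiently small open ball from equation A.1 (composed with the squared loss). In particular, by setting $\nu_{l+1}=0$ and noticing that $\hat Y(X,\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))-Y=\hat Y(X,\theta)-Y=r$,
$$0=\partial_{A^{(l^*-1)}}L(\tilde\theta(0,\nu_{l+2},\dots,\nu_{l^*-2}))=\big(\partial_{A^{(l^*-1)}}\hat Y(X,\tilde\theta(0,\nu_{l+2},\dots,\nu_{l^*-2}))\big)^\top\mathrm{vec}(r),$$
and hence
$$\begin{aligned}0&=\partial_{A^{(l^*-1)}}L(\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2}))\\&=\big(\partial_{A^{(l^*-1)}}\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\bar A^{(l+2)}\cdots\bar A^{(H+1)}+\partial_{A^{(l^*-1)}}\hat Y(X,\tilde\theta(0,\nu_{l+2},\dots,\nu_{l^*-2}))\big)^\top\mathrm{vec}(r)\\&=\big(\partial_{A^{(l^*-1)}}\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\bar A^{(l+2)}\cdots\bar A^{(H+1)}\big)^\top\mathrm{vec}(r),\end{aligned}$$
where
$$\bar A^{(l')}=\begin{cases}\tilde A^{(l')}(\nu_{l'})&\text{if }l'\in\{l+2,\dots,l^*-2\}\\A^{(l')}&\text{if }l'\notin\{l+2,\dots,l^*-2\}.\end{cases}$$
Since
$$\partial_{A^{(l^*-1)}}\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\bar A^{(l+2)}\cdots\bar A^{(H+1)}=(A^{(l^*)}\cdots A^{(H+1)})^\top\otimes\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}(\nu_{l+2})\cdots\tilde A^{(l^*-2)}(\nu_{l^*-2}),$$
this implies that
$$(A^{(l^*)}\cdots A^{(H+1)})\,r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}(\nu_{l+2})\cdots\tilde A^{(l^*-2)}(\nu_{l^*-2})=0.$$
By the definition of $l^*$, this implies that
$$r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}(\nu_{l+2})\cdots\tilde A^{(l^*-2)}(\nu_{l^*-2})=0,$$
where $\tilde A^{(l+2)}(\nu_{l+2})\cdots\tilde A^{(l+1)}(\nu_{l+1}):=I_{d_L}$.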
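We now show that for any sufficiently small vector $(\nu_{l+1},\dots,\nu_{l^*-2})$ such that $\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2})\in\tilde B(\theta,\epsilon_3/2)$,
$$r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}(\nu_{l+2})\cdots\tilde A^{(j)}(\nu_j)=0,$$
by induction on the index $j$ in the decreasing order $j=l^*-2,l^*-3,\dots,l+1$. The base case with $j=l^*-2$ is proven above. Let $\tilde A^{(l')}:=\tilde A^{(l')}(\nu_{l'})$. For the inductive step, assuming that the statement holds for $j$, we show that it holds for $j-1$ as
$$\begin{aligned}0&=r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}\cdots\tilde A^{(j)}\\&=r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}\cdots\tilde A^{(j-1)}A^{(j)}+r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}\cdots\tilde A^{(j-1)}\nu_ju_j^\top\\&=r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}\cdots\tilde A^{(j-1)}\nu_ju_j^\top,\end{aligned}$$
where the last line follows from the fact that the first term in the second line is zero because of the inductive hypothesis with $\nu_j=0$. Since $\|u_j\|_2=1$, by multiplying both sides by $u_j$ from the right, we have that for any sufficiently small $\nu_j\in\mathbb{R}^{d_L}$,
$$r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}\cdots\tilde A^{(j-1)}\nu_j=0,$$
which implies that
$$r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)\tilde A^{(l+2)}\cdots\tilde A^{(j-1)}=0.$$
This completes the inductive step and proves that
$$r^\top\Phi^{(l)}(\nu_{l+1}u_{l+1}^\top)=0.$$
Since $\|u_{l+1}\|_2=1$, by multiplying both sides by $u_{l+1}$ from the right, we have that for any sufficiently small $\nu_{l+1}\in\mathbb{R}^{d_l}$ such that $\tilde\theta(\nu_{l+1},\dots,\nu_{l^*-2})\in\tilde B(\theta,\epsilon_3/2)$,
$$(\hat Y(X,\theta)-Y)^\top\Phi^{(l)}\nu_{l+1}=0,$$
which implies that
$$(\Phi^{(l)})^\top(\hat Y(X,\theta)-Y)=0.$$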

A.1  Proof of Theorem 3

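From the first-order necessary condition of differentiable local minima,
$$0=\partial_{W^{(H+1)}}L(\theta)=(D_1^{(H+1)})^\top\mathrm{vec}(\hat Y(X,\theta)-Y),$$
and $\mathrm{vec}(\hat Y(X,\theta))=D_1^{(H+1)}\mathrm{vec}(W^{(H+1)})$. From lemma 10 and the first-order necessary condition of differentiable local minima, $D^\top\mathrm{vec}(\hat Y(X,\theta)-Y)=0$ and $\mathrm{vec}(\hat Y)=\frac{1}{H}D\theta_{1:H}$ where $\theta_{1:H}:=\mathrm{vec}([W^{(l)}]_{l=1}^{H})$. Combining these, we have that
$$[D\ \ D_1^{(H+1)}]^\top\Big(\tfrac{1}{H+1}[D\ \ D_1^{(H+1)}]\theta-\mathrm{vec}(Y)\Big)=0,$$
where $\frac{1}{H+1}[D\ \ D_1^{(H+1)}]\theta=\mathrm{vec}(\hat Y(X,\theta))$. This implies that
$$\mathrm{vec}(\hat Y(X,\theta))=P\big[[D\ \ D_1^{(H+1)}]\big]\mathrm{vec}(Y).$$
Therefore,
$$2L(\theta)=\Big\|\mathrm{vec}(Y)-P\big[[D\ \ D_1^{(H+1)}]\big]\mathrm{vec}(Y)\Big\|_2^2=\|\mathrm{vec}(Y)\|_2^2-\Big\|P\big[[D\ \ D_1^{(H+1)}]\big]\mathrm{vec}(Y)\Big\|_2^2,$$
where the second equality follows from the idempotence of the projection. Finally, decomposing the second term by directly following the proof of lemma 12 with $P_N[\bar\Phi^{(S)}]D$ being replaced by $[D\ \ D_1^{(H+1)}]$ yields the desired statement of this theorem.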

A.2  Proof of Theorem 5

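From lemma 13, we have that for any $l\in\{t,t+1,\dots,H\}$,
$$(I_{d_{H+1}}\otimes\Phi^{(l)})^\top\mathrm{vec}(\hat Y(X,\theta)-Y)=0.\tag{A.2}$$
From equation A.1, by noticing that $Z^{(l+1)}C^{(l+2)}=\Phi^{(l+1)}[0\ \ (C^{(l+2)})^\top]^\top$, we also have that
$$\mathrm{vec}(\hat Y(X,\theta))=\sum_{l=t}^{H}(I_{d_{H+1}}\otimes\Phi^{(l)})\mathrm{vec}(\bar R^{(l+1)}A^{(l+2)}\cdots A^{(H+1)}),\tag{A.3}$$
where $\bar R^{(t+1)}:=[(A^{(t+1)})^\top\ \ (C^{(t+1)})^\top]^\top$ and $\bar R^{(l)}:=[0\ \ (C^{(l)})^\top]^\top$ for $l\ge t+2$. From lemma 10 and the first-order necessary condition of differentiable local minima, we also have that
$$D^\top\mathrm{vec}(\hat Y(X,\theta)-Y)=0\tag{A.4}$$
and
$$\mathrm{vec}(\hat Y)=\tfrac{1}{H}D\theta_{1:H},\tag{A.5}$$
where $\theta_{1:H}:=\mathrm{vec}([W^{(l)}]_{l=1}^{H})$.
Combining equations A.2 to A.5 yields
$$[\bar\Phi\ \ D]^\top\big([\bar\Phi\ \ D]\bar\theta-\mathrm{vec}(Y)\big)=0,$$
where $[\bar\Phi\ \ D]\bar\theta=\mathrm{vec}(\hat Y(X,\theta))$, $\bar\Phi:=[I_{d_{H+1}}\otimes\Phi^{(l)}]_{l=t}^{H}$, and
$$\bar\theta:=\frac{1}{2}\begin{bmatrix}[\mathrm{vec}(\bar R^{(l+1)}A^{(l+2)}\cdots A^{(H+1)})]_{l=t}^{H}\\\frac{1}{H}\theta_{1:H}\end{bmatrix}.$$
This implies that
$$\mathrm{vec}(\hat Y(X,\theta))=P\big[[\bar\Phi\ \ D]\big]\mathrm{vec}(Y).$$
Therefore, for any $S\subseteq(t,t+1,\dots,H)$,
$$\begin{aligned}2L(\theta)&=\Big\|\mathrm{vec}(Y)-P\big[[\bar\Phi\ \ D]\big]\mathrm{vec}(Y)\Big\|_2^2\\&\le\Big\|\mathrm{vec}(Y)-P\big[[\bar\Phi^{(S)}\ \ D]\big]\mathrm{vec}(Y)\Big\|_2^2\\&=\Big\|\mathrm{vec}(Y)-P[\bar\Phi^{(S)}]\mathrm{vec}(Y)-P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)\Big\|_2^2\\&=\big\|P_N[\Phi^{(S)}]Y\big\|_F^2-\Big\|P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)\Big\|_2^2,\end{aligned}\tag{A.6}$$
where the inequality in the second line holds because the column space of $[\bar\Phi\ \ D]$ includes the column space of $[\bar\Phi^{(S)}\ \ D]$. The third line follows from lemma 11. The last line follows from the fact that $P_N[\bar\Phi^{(S)}]=(I-P[\bar\Phi^{(S)}])$ and
$$\mathrm{vec}(Y)^\top P_N[\bar\Phi^{(S)}]P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)=\mathrm{vec}(Y)^\top P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)=\Big\|P\big[P_N[\bar\Phi^{(S)}]D\big]\mathrm{vec}(Y)\Big\|_2^2.$$
By applying lemma 12 to the second term on the right-hand side of equation A.6, we obtain the desired upper bound in theorem 5. Finally, we complete the proof by noticing that $\frac{1}{2}\|P_N[\Phi^{(S)}]Y\|_F^2$ is the global minimum value of basis function regression with the basis $\Phi^{(S)}$ for all $S\subseteq(0,1,\dots,H)$. This is because $\frac{1}{2}\|\Phi^{(S)}W-Y\|_F^2$ is convex in $W$ and, hence, $\partial_W\frac{1}{2}\|\Phi^{(S)}W-Y\|_F^2=0$ is a necessary and sufficient condition of global minima, solving which yields the global minimum value of $\frac{1}{2}\|P_N[\Phi^{(S)}]Y\|_F^2$.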
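The last step, that the global minimum value of basis function regression equals $\frac{1}{2}\|P_N[\Phi]Y\|_F^2$, is easy to confirm numerically. The sketch below uses a random stand-in for a fixed basis matrix (the names `Phi`, `W_star`, and the sizes are illustrative assumptions) and compares a least-squares solve with the closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
m, k, dy = 20, 5, 3
Phi = rng.standard_normal((m, k))     # stand-in for a fixed basis matrix (illustrative)
Y = rng.standard_normal((m, dy))

# Global minimizer of the convex problem min_W 0.5 * ||Phi W - Y||_F^2
W_star, *_ = np.linalg.lstsq(Phi, Y, rcond=None)
loss_star = 0.5 * np.linalg.norm(Phi @ W_star - Y, "fro") ** 2

# Closed form 0.5 * ||P_N[Phi] Y||_F^2 with P_N[Phi] = I - P[Phi]
P_N = np.eye(m) - Phi @ np.linalg.pinv(Phi)
closed_form = 0.5 * np.linalg.norm(P_N @ Y, "fro") ** 2

# Any other weight matrix can only do worse.
W_other = W_star + 0.1 * rng.standard_normal(W_star.shape)
loss_other = 0.5 * np.linalg.norm(Phi @ W_other - Y, "fro") ** 2
print(loss_star, closed_form, loss_other)
```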

A.3  Proof of Corollary 7

The statement follows the proof of theorem 5 by noticing that lemma 13 still holds for the restriction of $L$ to $I$ as $\tilde B(\theta,\epsilon_1)=B(\theta,\epsilon_1)\cap I$, and by replacing $D_{k_l}^{(l)}$ by $\hat D_{k_l}^{(l)}$ in the proof, where $\hat D_{k_l}^{(l)}$ is obtained from the proof of lemma 10 by setting $W^{(l+1)}(\theta)_{k',k}=0$ for $(k',k)\in S^{(l)}\times(\{1,\dots,d_{l+1}\}\setminus S^{(l+1)})$ $(l=t+1,t+2,\dots,H-1)$ and by not considering their derivatives.

A.4  Proof of Corollary 9

The statement follows the proof of theorem 5 by setting $C^{(l')}:=0$ for all $l'\notin\{t+1,H+1\}$ and setting $l\in\{t,H\}$ in the proof of lemma 13 (instead of $\{t,t+1,\dots,H\}$).

Appendix B:  Proofs for Probabilistic Statements

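In the following lemma, we rewrite equation 3.2 in terms of the activation pattern and the data matrices $X$ and $Y$.

Lemma 14.
Every differentiable local minimizer $\theta$ of $L$ with the neural network 3.1 satisfies
$$L(\theta)=\tfrac{1}{2}\|Y\|_2^2-\tfrac{1}{2}\|P[\tilde D]Y\|_2^2,\tag{B.1}$$
where
$$\tilde D=[\Lambda_{1,1}X\ \ \Lambda_{1,2}X\ \ \cdots\ \ \Lambda_{1,d}X].\tag{B.2}$$
Proof.
With $r:=\hat Y(X,\theta)-Y$, we have $L(\theta)=r^\top r/2$. For expression 3.1, we have
$$\hat Y(X,\theta)=\sum_{j=1}^{d}W_j^{(2)}\Lambda_{1,j}XW^{(1)}_{\cdot j}.\tag{B.3}$$
For any differentiable local minimum $\theta$, from the first-order condition,
$$0=\partial_{W_{ij}^{(1)}}r^\top r/2=W_j^{(2)}r^\top\Lambda_{1,j}X_{\cdot i}.\tag{B.4}$$
We conclude that if $W_j^{(2)}\ne0$, then $r^\top\Lambda_{1,j}X_{\cdot i}=0$ for $1\le i\le d_x$. In fact, we have the same conclusion even if $W_j^{(2)}=0$. To prove it, we use the second-order condition as follows. We notice that if $W_j^{(2)}=0$, then
$$\begin{bmatrix}\partial^2_{W_{ij}^{(1)}}r^\top r/2&\partial_{W_j^{(2)}}\partial_{W_{ij}^{(1)}}r^\top r/2\\\partial_{W_j^{(2)}}\partial_{W_{ij}^{(1)}}r^\top r/2&\partial^2_{W_j^{(2)}}r^\top r/2\end{bmatrix}=\begin{bmatrix}0&r^\top\Lambda_{1,j}X_{\cdot i}\\r^\top\Lambda_{1,j}X_{\cdot i}&*\end{bmatrix}.\tag{B.5}$$
By the second-order condition, the above matrix must be positive semidefinite, and we conclude that $r^\top\Lambda_{1,j}X_{\cdot i}=0$. Therefore, $\hat Y(X,\theta)-Y$ is perpendicular to the column space of $\tilde D$. Moreover, from expression B.3, $\hat Y(X,\theta)$ is in the column space of $\tilde D$; $\hat Y(X,\theta)$ is the projection of $Y$ onto the column space of $\tilde D$, $\hat Y(X,\theta)=P[\tilde D]Y$; and
$$L(\theta)=\tfrac{1}{2}\|\hat Y(X,\theta)-Y\|_2^2=\tfrac{1}{2}\|(I-P[\tilde D])Y\|_2^2=\tfrac{1}{2}\|Y\|_2^2-\tfrac{1}{2}\|P[\tilde D]Y\|_2^2.\tag{B.6}$$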
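The right-hand side of equation B.1 can be evaluated directly for any given activation pattern. The sketch below is only an illustration under assumed inputs: the 0/1 patterns are drawn at random rather than produced by a trained network, and all names and sizes are hypothetical. It builds $\tilde D$ as in equation B.2 and confirms the final equality in equation B.6.

```python
import numpy as np

rng = np.random.default_rng(3)
m, dx, d = 30, 3, 4
X = rng.standard_normal((m, dx))
Y = rng.standard_normal(m)
patterns = rng.integers(0, 2, size=(m, d))   # hypothetical 0/1 activation patterns

# D_tilde = [Lambda_{1,1} X, ..., Lambda_{1,d} X]; a diagonal 0/1 matrix rescales the rows of X.
D_tilde = np.hstack([patterns[:, [j]] * X for j in range(d)])

P = D_tilde @ np.linalg.pinv(D_tilde)        # projection onto the column space of D_tilde
loss_residual = 0.5 * np.sum(((np.eye(m) - P) @ Y) ** 2)
loss_pythagoras = 0.5 * np.sum(Y ** 2) - 0.5 * np.sum((P @ Y) ** 2)
rank = np.linalg.matrix_rank(D_tilde)
print(loss_residual, loss_pythagoras, rank)
```

A larger rank of $\tilde D$ leaves a smaller orthogonal complement for $Y$, which is the mechanism the subsequent rank estimates quantify.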

From equation B.1, we expect that the larger the rank of the matrix $\tilde D$, the smaller the loss $L(\theta)$. In the following lemma, we prove rank estimates for $\tilde D$ under conditions on the activation pattern matrix $\Lambda$: in the regime $d_xd\ll m$, we have $\mathrm{rank}\,\tilde D=d_xd$, and in the regime $d_xd\gg m$, we have $\mathrm{rank}\,\tilde D=m$. As we show later, proposition 2 follows easily from the rank estimates of $\tilde D$.

Lemma 15.

Fix the activation pattern matrix $\Lambda:=[\Lambda_k]_{k=1}^{d}\in\mathbb{R}^{m\times d}$. Let $X$ be a random $m\times d_x$ gaussian matrix, with each entry having mean zero and variance one. Then the matrix $\tilde D$ as defined in equation B.2 satisfies both of the following statements:

  • i.

    If $m\ge64\ln^2(d_xdm/\delta^2)d_xd$ and $s_{\min}(\Lambda_I)\ge\delta$ for any index set $I\subseteq\{1,2,\dots,m\}$ with $|I|\ge m/2$, then $\mathrm{rank}\,\tilde D=d_xd$ with probability at least $1-e^{-m/(64\ln(d_xdm/\delta^2))}-2e^{-t}$.

  • ii.

    If $dd_x\ge2m\ln^2(md/\delta)$ with $d_x\ge\ln^2(dm)$ and $s_{\min}(\Lambda_I)\ge\delta$ for any index set $I\subseteq\{1,2,\dots,m\}$ with $|I|\le d/2$, then $\mathrm{rank}\,\tilde D=m$ with probability at least $1-2e^{-d_x/20}$.

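Proof of Lemma 15.
We denote the event $\Omega_{\mathrm{sum}}$ such that
$$\Omega_{\mathrm{sum}}=\{X:\|X\|_F^2\le2md_x\}.\tag{B.7}$$
Thanks to equation B.24 in lemma 16, $\mathbb{P}(\Omega_{\mathrm{sum}})\ge1-e^{-d_xm/8}$.
In the following, we first prove case i: that $\mathrm{rank}\,\tilde D=d_xd$ with high probability. We identify the space $\mathbb{R}^{dd_x}$ with $d\times d_x$ matrices and fix $L=2\ln(dm/\delta^2)$. We first prove that for any $V$ in the unit sphere in $\mathbb{R}^{dd_0}$, with probability at least $1-e^{-m/(16L)}$, we have
$$\|\tilde D\,\mathrm{vec}(V)\|_2^2\ge\delta^2/(2L).\tag{B.8}$$
We notice that
$$\tilde D\,\mathrm{vec}(V)=\sum_{j=1}^{d_x}\sum_{i=1}^{d}\Lambda_iV_{ji}X_{\cdot j}=:u.$$
Then $u$ is a gaussian vector in $\mathbb{R}^m$ with $k$th entry
$$u_k=\sum_{i=1}^{d_x}(\Lambda V)_{ki}X_{ki}\sim N(0,a_k^2),\qquad a_k^2:=\sum_{i=1}^{d_0}(\Lambda V)_{ki}^2.$$
Since by our assumption the entries of $\Lambda$ are bounded by 1, we get
$$a_k^2=\sum_{i=1}^{d_0}(\Lambda V)_{ki}^2\le\sum_{j=1}^{d}\Lambda_{kj}^2\|V\|_F^2\le d.$$
We denote the sets $I_0=\{1\le k\le m:a_k^2\le\delta^2/m\}$ and
$$I_\ell=\{1\le k\le m:e^{\ell-1}\delta^2/m<a_k^2\le e^{\ell}\delta^2/m\},\qquad1\le\ell\le\ln(dm/\delta^2).$$
There are two cases: if there exists some $\ell\ge1$ such that $|I_\ell|\ge m/L$, then thanks to equation B.25 in lemma 16, we have that with probability at least $1-e^{-m/(16L)}$,
$$\|\tilde D\,\mathrm{vec}(V)\|_2^2\ge\sum_{k\in I_\ell}u_k^2\ge\frac{1}{2}\sum_{k\in I_\ell}a_k^2\ge e^{\ell-1}\delta^2/(2L).\tag{B.9}$$
Otherwise, we have that $|I_0|\ge m(1-\ln(dm/\delta^2)/L)=m/2$. Then
$$\sum_{k\in I_0}a_k^2\le\sum_{k\in I_0}\delta^2/m<\delta^2.$$
However, by our assumption that $s_{\min}(\Lambda_{I_0})\ge\delta$,
$$\sum_{k\in I_0}a_k^2=\|\Lambda_{I_0}V\|_F^2\ge s_{\min}^2(\Lambda_{I_0})\|V\|_F^2\ge\delta^2.$$
This leads to a contradiction. Claim B.8 follows from claim B.9.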
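We take an $\epsilon$-net of the unit sphere in $\mathbb{R}^{dd_x}$ and denote it by $E$. The cardinality of the set $E$ is at most $(2/\epsilon)^{dd_x}$. We denote the event $\Omega$ such that the following holds:
$$\min_{V\in E}\|\tilde D\,\mathrm{vec}(V)\|_2^2\ge\delta^2/(2L).\tag{B.10}$$
Then by using a union bound, we get that the event $\Omega\cap\Omega_{\mathrm{sum}}$ holds with probability at least $1-e^{-m/(16L)}(2/\epsilon)^{dd_x}-e^{-md_0/4}$.
Let $\hat V$ be a vector in the unit sphere of $\mathbb{R}^{dd_0}$. Then there exists a vector $V\in E$ such that $\|V-\hat V\|_2\le\epsilon$, and we have
$$\tilde D\,\mathrm{vec}(\hat V)=\tilde D\,\mathrm{vec}(V)+\tilde D\,\mathrm{vec}(\hat V-V).\tag{B.11}$$
From equations B.5 and B.10, for $X\in\Omega\cap\Omega_{\mathrm{sum}}$, we have that
$$\|\tilde D\,\mathrm{vec}(V)\|_2^2\ge\delta^2/(2L)\tag{B.12}$$
and
$$\|\tilde D\,\mathrm{vec}(\hat V-V)\|_2^2\le\sum_{k=1}^{m}\sum_{i=1}^{d_0}(\Lambda(\hat V-V))_{ki}^2\|x_k\|_2^2\le\sum_{k=1}^{m}\sum_{j=1}^{d}\Lambda_{kj}^2\|\hat V-V\|_F^2\|x_k\|_2^2\le\sum_{k=1}^{m}d\epsilon^2\|x_k\|_2^2\le2md_xd\epsilon^2.\tag{B.13}$$
Combining equations B.11 to B.13, we get that on the event $\Omega\cap\Omega_{\mathrm{sum}}$,
$$\|\tilde D\,\mathrm{vec}(\hat V)\|_2^2\ge\delta^2/(4L),$$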
provided that $\epsilon\le\delta/(12\sqrt{d_xdmL})$. This implies that the smallest singular value of the matrix $\tilde D$ is at least $\delta^2/(4L)$, with probability
$$1-e^{-m/(16L)}(2/\epsilon)^{dd_x}-e^{-md_0/4}\ge1-e^{-m/(32L)},$$
provided that $m\ge32L\ln(d_xdm/\delta^2)d_xd$. This finishes the proof of case i.
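In the following, we prove case ii: that $\mathrm{rank}\,\tilde D=m$ with high probability. We notice that for any vector $v\in\mathbb{R}^m$,
$$\|\tilde D^\top v\|_2^2=\sum_{i=1}^{d}\sum_{j=1}^{d_0}\Big(\sum_{k=1}^{m}v_k\Lambda_{ki}X_{kj}\Big)^2\le\sum_{j=1}^{d_0}\sum_{i=1}^{d}\Big(\sum_{k=1}^{m}(v_k\Lambda_{ki})^2\Big)\Big(\sum_{k=1}^{m}X_{kj}^2\Big)\le d\|v\|_2^2\|X\|_F^2.\tag{B.14}$$
In the event $\Omega_{\mathrm{sum}}$ as defined in equation B.7, we have that $\|\tilde D^\top v\|_2^2\le2dd_0m\|v\|_2^2$ for any vector $v\in\mathbb{R}^m$.
In the following, we prove that for any vector $v\in\mathbb{R}^m$, if its $L$th largest entry (in absolute value) is at least $a$ for some $L\le d/2$, then
$$\mathbb{P}\big(\|\tilde D^\top v\|_2^2\ge a^2\delta^2Ld_0/2\big)\ge1-e^{-Ld_x/16}.\tag{B.15}$$
We denote the vectors $u_i=[X_{\cdot i}^\top\Lambda_{1,1}v,X_{\cdot i}^\top\Lambda_{1,2}v,\dots,X_{\cdot i}^\top\Lambda_{1,d}v]^\top$, for any $i=1,2,\dots,d_x$. Then $\tilde D^\top v=[u_1,u_2,\dots,u_{d_x}]$. Moreover, $u_1,u_2,\dots,u_{d_x}\in\mathbb{R}^d$ are independent and identically distributed (i.i.d.) gaussian vectors, with mean zero and covariance matrix
$$\Sigma=\Lambda^\top V^2\Lambda,$$
where $V$ is the $m\times m$ diagonal matrix, with diagonal entries given by $v$. We denote the eigenvalues of $\Sigma$ as $\lambda_1(\Sigma)\ge\lambda_2(\Sigma)\ge\cdots\ge\lambda_d(\Sigma)\ge0$. Then in distribution
$$\|u_i\|_2^2=\lambda_1(\Sigma)z_{i1}^2+\lambda_2(\Sigma)z_{i2}^2+\cdots+\lambda_d(\Sigma)z_{id}^2,\tag{B.16}$$
where $\{z_{ij}\}_{1\le i\le d_x,1\le j\le d}$ are independent gaussian random variables with mean zero and variance one. If the $L$th largest entry of $v$ (in absolute value) is at least $a$ for some $L\le d/2$, we denote the index set $I=\{1\le k\le m:|v_k|\ge a\}$; then
$$\Sigma=\Lambda^\top V^2\Lambda\succeq\Lambda^\top V_I^2\Lambda\succeq a^2\Lambda_I^\top\Lambda_I.$$
Therefore, the $j$th largest eigenvalue of $\Sigma$ is at least the $j$th largest eigenvalue of $a^2\Lambda_I^\top\Lambda_I$ for any $1\le j\le d$. From our assumption, $s_{\min}(\Lambda_I)\ge\delta$, and the $L$th largest eigenvalue of $a^2\Lambda_I^\top\Lambda_I$ is at least $a^2\delta^2$. Therefore, the $L$th largest eigenvalue of $\Sigma$ is at least $a^2\delta^2$, that is, $\lambda_L(\Sigma)\ge a^2\delta^2$. We can rewrite equation B.16 as
$$\|u_i\|_2^2=\sum_{j=1}^{d}\lambda_j(\Sigma)z_{ij}^2\ge a^2\delta^2\sum_{j=1}^{L}z_{ij}^2.$$
Thanks to equation B.25 in lemma 16,
$$\begin{aligned}\mathbb{P}\big(\|\tilde D^\top v\|_2^2\ge a^2\delta^2Ld_0/2\big)&=\mathbb{P}\Big(\sum_{i=1}^{d_x}\|u_i\|_2^2\ge a^2\delta^2Ld_0/2\Big)\\&\ge\mathbb{P}\Big(\sum_{i=1}^{d_x}a^2\delta^2\sum_{j=1}^{L}z_{ij}^2\ge a^2\delta^2Ld_0/2\Big)\\&=\mathbb{P}\Big(\sum_{i=1}^{d_x}\sum_{j=1}^{L}z_{ij}^2\ge Ld_0/2\Big)\ge1-e^{-Ld_0/16}.\end{aligned}$$
This finishes the proof of claim B.15.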
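We take an $\epsilon$-net of the unit sphere in $\mathbb{R}^m$ and denote it by $E$. Let $\hat v$ be a vector in the unit sphere of $\mathbb{R}^m$; then there exists a vector $v\in E$ such that $\|v-\hat v\|_2\le\epsilon$, and we have
$$\tilde D^\top\hat v=\tilde D^\top v+\tilde D^\top(\hat v-v),\tag{B.17}$$
and in the event $\Omega_{\mathrm{sum}}$, using equation B.14, we have
$$\|\tilde D^\top(\hat v-v)\|_2^2\le2md_xd\epsilon^2.\tag{B.18}$$
In the rest of the proof, we show that with high probability, $\|\tilde D^\top v\|_2^2$ is bounded away from zero uniformly for any $v\in E$.
For any given vector $v$ in the unit sphere of $\mathbb{R}^m$, we sort its entries in absolute value:
$$|v_1^*|\ge|v_2^*|\ge\cdots\ge|v_m^*|.$$
We denote the sequence $1=L_0\le L_1\le\cdots\le L_p\le L_{p+1}=m$, where $L_i=\ln^2(md/\delta)L_{i+1}/d_x$ for $1\le i\le p$ and $L_1\le d_x/\ln^2(md/\delta)$. Then,
$$p=\ln m/\ln(d_x/\ln^2(md)).$$
Thanks to our assumption that $dd_x\ge2m\ln^2(md/\delta)$, we have $L_p\le d/2$. We fix $\epsilon$ as
$$\epsilon:=\frac{1}{2}\Big(\frac{\delta}{4dm}\Big)^{p+1}.\tag{B.19}$$
We denote $q(v)=\min\{0\le i\le p:|v_{L_i}^*|\ge4dm|v_{L_{i+1}+1}^*|/\delta\}$, where $v_{m+1}^*=0$. We decompose the vector $v=v_1+v_2$, where $v_1$ corresponds to the largest (in absolute value) $L_{q(v)+1}$ terms of $v$, and $v_2$ corresponds to the rest of the terms of $v$. Letting $L=L_{q(v)}$ and $a=v_{L_{q(v)}}^*$ in equation B.15, we get
$$\mathbb{P}\big(\|\tilde D^\top v_1\|_2^2\ge a^2\delta^2Ld_x/2\big)\ge1-e^{-Ld_0/16}.\tag{B.20}$$
By the definition of $q(v)$, we have
$$|a|=|v_{L_{q(v)}}^*|\ge\frac{\delta}{4dm}|v_{L_{q(v)-1}}^*|\ge\Big(\frac{\delta}{4dm}\Big)^{q(v)}\frac{1}{\sqrt m}=:a_{q(v)}.$$
We denote the event $\Omega_q$ such that equation B.20 holds for any $v\in E$ with $q(v)=q$. Since equation B.20 depends only on $L_{q+1}$ entries of $v$, by a union bound, we get
$$\mathbb{P}(\Omega_q)\ge1-e^{-L_qd_x/16}\binom{m}{L_{q+1}}(2/\epsilon)^{L_{q+1}}\ge1-e^{-L_qd_x/16+L_{q+1}(\ln m+\ln(2/\epsilon))}\ge1-e^{-L_qd_x/20}.\tag{B.21}$$
Moreover, $\|v_2\|_2^2\le a^2\delta^2/(16dm)$; in the event $\Omega_{\mathrm{sum}}$, using equation B.14, we have
$$\|\tilde D^\top v_2\|_2^2\le2dd_xm\|v_2\|_2^2\le a^2\delta^2d_x/8.\tag{B.22}$$
Combining equations B.20 and B.22, in the event $\Omega_q\cap\Omega_{\mathrm{sum}}$, for any $v\in E$ with $q(v)=q$, we get
$$\|\tilde D^\top v\|_2^2\ge\big(\|\tilde D^\top v_1\|_2-\|\tilde D^\top v_2\|_2\big)^2\ge a^2\delta^2d_x/8\ge a_q^2\delta^2d_x/8.\tag{B.23}$$
In the event $\Omega_{\mathrm{sum}}\cap\bigcap_{q=0}^{p}\Omega_q$, it follows from combining equations B.17, B.18, and B.23 that
$$\|\tilde D^\top\hat v\|_2\ge\sqrt{\|\tilde D^\top v\|_2^2}-\sqrt{2md_0d}\,\epsilon\ge a_p\delta\sqrt{d_x/8}-\sqrt{2md_0d}\,\epsilon\ge\Big(\frac{\delta}{4dm}\Big)^{p+1}.$$
Moreover, thanks to equation B.21, $\Omega_{\mathrm{sum}}\cap\bigcap_{q=0}^{p}\Omega_q$ holds with probability at least
$$\mathbb{P}\Big(\Omega_{\mathrm{sum}}\cap\bigcap_{q=0}^{p}\Omega_q\Big)\ge1-\sum_{q=0}^{p}e^{-L_qd_x/20}-e^{-d_xm/8}\ge1-2e^{-d_x/20}.$$
This finishes the proof of case ii.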

The following concentration inequalities for the square of gaussian random variables are from Laurent and Massart (2000).

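Lemma 16.
Let the weights satisfy $0\le a_1,a_2,\dots,a_n\le K$, and let $g_1,g_2,\dots,g_n$ be independent gaussian random variables with mean zero and variance one. Then the following inequalities hold for any positive $t$:
$$\mathbb{P}\Big(\sum_{i=1}^{n}a_i^2(g_i^2-1)\ge2\sqrt t\Big(\sum_{i=1}^{n}a_i^4\Big)^{1/2}+2K^2t\Big)\le e^{-t},\tag{B.24}$$
$$\mathbb{P}\Big(\sum_{i=1}^{n}a_i^2(g_i^2-1)\le-2\sqrt t\Big(\sum_{i=1}^{n}a_i^4\Big)^{1/2}\Big)\le e^{-t}.\tag{B.25}$$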
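The two tail bounds can be illustrated with a small Monte Carlo experiment. This is a sketch under assumed settings (unit weights, $n=50$, $t=2$, and 20,000 trials, all chosen only for illustration); it estimates both tail frequencies and compares them with $e^{-t}$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, t, trials = 50, 2.0, 20000         # illustrative settings
a = np.ones(n)                        # weights with 0 <= a_i <= K, here K = 1
K = a.max()

g = rng.standard_normal((trials, n))
S = (a ** 2 * (g ** 2 - 1.0)).sum(axis=1)     # sum_i a_i^2 (g_i^2 - 1), one value per trial

l2 = np.sqrt(np.sum(a ** 4))
upper_freq = np.mean(S >= 2.0 * np.sqrt(t) * l2 + 2.0 * K ** 2 * t)   # event in B.24
lower_freq = np.mean(S <= -2.0 * np.sqrt(t) * l2)                     # event in B.25
bound = np.exp(-t)
print(upper_freq, lower_freq, bound)
```

In this setting the empirical frequencies sit well below the bound, which is expected since the inequalities are not tight for moderate $t$.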
Proof of Proposition 2.
In case i, from lemma 15, $\mathrm{rank}\,\tilde D=d_xd$ with probability at least $1-e^{-m/(64\ln(d_xdm/\delta^2))}$. Since the statement immediately follows from theorem 3 if $\|Y\|_2=0$, we can focus on the case of $\|Y\|_2\ne0$. Conditioning on the event $\mathrm{rank}\,\tilde D=d_xd$,
$$\frac{L(\theta)}{\|Y\|_2^2/2}=\frac{\|P_N[\tilde D]Y\|_2^2}{\|Y\|_2^2}.\tag{B.26}$$
The quantity in equation B.26 has the same law as
$$\frac{z_1^2+z_2^2+\cdots+z_{m-d_xd}^2}{z_1^2+z_2^2+\cdots+z_m^2},$$
where $z_1,z_2,\dots,z_m$ are independent gaussian random variables with mean zero and variance one. From lemma 16, we get that with probability at least $1-2e^{-t}$,
z12+z22++zm-dxd2z12+z22++zm21+6tmm-dxdm.
(B.27)
Case i follows from combining equations B.26 and B.27.
In case ii, thanks to lemma 15, $\mathrm{rank}\,\tilde D=m$ with probability at least $1-2e^{-d_x/20}$. Conditioning on the event $\mathrm{rank}\,\tilde D=m$, we have $P[\tilde D]Y=Y$, and
$$L(\theta)=\tfrac{1}{2}\|Y\|_2^2-\tfrac{1}{2}\|P[\tilde D]Y\|_2^2=0.$$
This finishes the proof of case ii.

Appendix C:  Additional Experimental Details

By using the ground-truth network described in section 4.3, the synthetic data set was generated with i.i.d. random inputs $x$ and i.i.d. random weight matrices $W^{(l)}$. Each input $x$ was randomly sampled from the standard normal distribution, and each entry of the weight matrix $W^{(l)}$ was randomly sampled from a normal distribution with zero mean and normalized standard deviation $\sqrt{2/d_l}$.

For training, we used a standard training procedure with mini-batch stochastic gradient descent (SGD) with momentum. The learning rate was set to 0.01. The momentum coefficient was set to 0.9 for the synthetic data set and 0.5 for the image data sets. The mini-batch size was set to 200 for the synthetic data set and 64 for the image data sets.
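For readers unfamiliar with the update rule, the following is a minimal sketch of mini-batch SGD with momentum using the hyperparameters quoted above for the synthetic data set. Everything else is an assumption made for illustration: the model is a linear least-squares problem rather than the networks used in the experiments, the momentum update follows one common (heavy-ball) convention, and the data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
m, d = 1000, 5
X = rng.standard_normal((m, d))       # hypothetical training inputs
w_true = rng.standard_normal(d)
y = X @ w_true                        # noise-free targets from a linear ground truth

def loss(w):
    """Mean squared loss over the full training set."""
    return 0.5 * np.mean((X @ w - y) ** 2)

lr, momentum, batch_size = 0.01, 0.9, 200     # values quoted for the synthetic data set
w = np.zeros(d)
velocity = np.zeros(d)
initial_loss = loss(w)

for step in range(500):
    idx = rng.choice(m, size=batch_size, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch_size   # mini-batch gradient
    velocity = momentum * velocity - lr * grad             # heavy-ball momentum (one common convention)
    w = w + velocity

final_loss = loss(w)
print(initial_loss, final_loss)
```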

From the proof of theorem 3, $J(\theta)=\|(I-P[[D\ \ D_1^{(H+1)}]])\mathrm{vec}(Y)\|_2^2$ for all $\theta$, which was used to numerically compute the values of $J(\theta)$. This is mainly because the form of $J(\theta)$ in theorem 3 may accumulate positive numerical errors for each $l\le H$ and $k_l\le d_l$ in the sum in its second term, which may easily cause a numerical overestimation of the effect of depth and width. To compute the projections, we adopted a method of computing a numerical cutoff criterion on singular values from Press, Teukolsky, Vetterling, and Flannery (2007) as (the numerical cutoff criterion) $=\frac{1}{2}\times$ (maximum singular value of $M$) $\times$ (machine precision of $M$) $\times\sqrt{d'+d+1}$, for a matrix $M\in\mathbb{R}^{d'\times d}$. We also confirmed that the reported experimental results remained qualitatively unchanged with two other cutoff criteria: a criterion based on Golub and Van Loan (1996) as (the numerical cutoff criterion) $=\frac{1}{2}\|M\|_\infty\times$ (machine precision of $M$) (where $\|M\|_\infty=\max_{1\le i\le d'}\sum_{j=1}^{d}|M_{i,j}|$ for a matrix $M\in\mathbb{R}^{d'\times d}$), as well as another criterion based on the Netlib Repository LAPACK documentation as (the numerical cutoff criterion) $=$ (maximum singular value of $M$) $\times$ (machine precision of $M$).
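A projection with a singular value cutoff can be sketched as follows. This is an illustrative implementation rather than the code used for the reported experiments: it assumes the first criterion takes the form $\frac{1}{2}\times s_{\max}\times\text{eps}\times\sqrt{d'+d+1}$, and the function name `projection` and the test matrix are hypothetical.

```python
import numpy as np

def projection(M):
    """Orthogonal projection onto the column space of M, discarding singular values below a cutoff."""
    U, s, _ = np.linalg.svd(M, full_matrices=False)
    d_rows, d_cols = M.shape
    eps = np.finfo(M.dtype).eps
    # Assumed criterion: 1/2 * (max singular value) * (machine precision) * sqrt(d' + d + 1)
    cutoff = 0.5 * s.max() * eps * np.sqrt(d_rows + d_cols + 1)
    U_kept = U[:, s > cutoff]         # left singular vectors of the numerically nonzero part
    return U_kept @ U_kept.T

rng = np.random.default_rng(6)
M = rng.standard_normal((6, 3))
P = projection(M)
y = rng.standard_normal(6)
residual_sq = np.sum(((np.eye(6) - P) @ y) ** 2)   # ||(I - P[M]) y||_2^2
print(residual_sq)
```

Computing $\|(I-P[M])y\|_2^2$ in one step, as here, avoids summing many small nonnegative terms, which is the accumulation concern raised above.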

Acknowledgments

We gratefully acknowledge support from NSF grants 1523767 and 1723381; AFOSR grant FA9550-17-1-0165; ONR grant N00014-18-1-2847; Honda Research; and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.

References

Arora, S., Cohen, N., Golowich, N., & Hu, W. (2018). A convergence analysis of gradient descent for deep linear neural networks. arXiv:1810.02281.
Arora, S., Cohen, N., & Hazan, E. (2018). On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the International Conference on Machine Learning.
Barron, A. R. (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3), 930–945.
Blum, A. L., & Rivest, R. L. (1992). Training a 3-node neural network is NP-complete. Neural Networks, 5(1), 117–127.
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., & LeCun, Y. (2015). The loss surfaces of multilayer networks. In Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics (pp. 192–204).
Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 2933–2941).
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore: Johns Hopkins University Press.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. Cambridge, MA: MIT Press.
Hardt, M., & Ma, T. (2017). Identity matters in deep learning. arXiv:1611.04231.
Kawaguchi, K. (2016). Deep learning without poor local minima. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 586–594). Red Hook, NY: Curran.
Kawaguchi, K., & Bengio, Y. (2018). Depth with nonlinearity creates no bad local minima in ResNets. arXiv:1810.09038.
Kawaguchi, K., Kaelbling, L. P., & Lozano-Pérez, T. (2015). Bayesian optimization with exponential convergence. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 2809–2817). Red Hook, NY: Curran.
Krizhevsky, A., & Hinton, G. (2009). Learning multiple layers of features from tiny images (Technical Report). Toronto: University of Toronto.
Laurent, B., & Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5), 1302–1338.
LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
Leshno, M., Lin, V. Y., Pinkus, A., & Schocken, S. (1993). Multilayer feedforward networks with a nonpolynomial activation function can approximate any function. Neural Networks, 6(6), 861–867.
Livni, R., Shalev-Shwartz, S., & Shamir, O. (2014). On the computational efficiency of training neural networks. In Z. Ghahramani, C. Cortes, N. Lawrence, & K. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 855–863). Red Hook, NY: Curran.
Montufar
,
G. F.
,
Pascanu
,
R.
,
Cho
,
K.
, &
Bengio
,
Y.
(
2014
). On the number of linear regions of deep neural networks. In
Z.
Ghahramani
,
Z.
Welling
,
C.
Cortes
,
N.
Lawrence
, &
K.
Weinberger
(Eds.),
Advances in neural information processing systems
,
27
(pp.
2924
2932
).
Red Hook, NY
:
Curran
.
Murty
,
K. G.
, &
Kabadi
,
S. N.
(
1987
).
Some NP-complete problems in quadratic and nonlinear programming
.
Mathematical Programming
,
39
(
2
),
117
129
.
Netzer
,
Y.
,
Wang
,
T.
,
Coates
,
A.
,
Bissacco
,
A.
,
Wu
,
B.
, &
Ng
,
A. Y.
(
2011
).
Reading digits in natural images with unsupervised feature learning
. In
NIPS Workshop on Deep Learning and Unsupervised Feature Learning
.
Nguyen
,
Q.
, &
Hein
,
M.
(
2017
).
The loss surface of deep and wide neural networks
. In
Proceedings of the International Conference on Machine Learning
(pp.
2603
2612
).
Nguyen
,
Q.
, &
Hein
,
M.
(
2018
).
Optimization landscape and expressivity of deep CNNS
. In
Proceedings of the International Conference on Machine Learning
(pp.
3727
3736
).
Press
,
W. H.
,
Teukolsky
,
S. A.
,
Vetterling
,
W. T.
, &
Flannery
,
B. P.
(
2007
).
Numerical recipes: The art of scientific computing
. (3rd ed.).
Cambridge
:
Cambridge University Press
.
Rao
,
C. R.
,
Toutenburg
,
H.
, Shalabh, &
Heumann
,
C.
(
2007
).
Linear models and generalizations: Least squares and alternatives
.
Berlin
:
Springer
.
Saxe
,
A. M.
,
McClelland
,
J. L.
, &
Ganguli
,
S.
(
2014
).
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
.
arXiv:1312.6120
.
Shamir
,
O.
(
2018
). Are ResNets provably better than linear predictors? In
S.
Bengio
,
H.
Wallach
,
H.
Larochelle
,
K.
Grauman
,
N.
Cesa-Bianchi
, &
R.
Garnett
(Eds.),
Advances in neural information processing systems
,
31
.
Red Hook, NY
:
Curran
.
Telgarsky
,
M.
(
2016
).
Benefits of depth in neural networks
. In
Proceedings of the Conference on Learning Theory
(pp.
1517
1539
).
Xie
,
B.
,
Liang
,
Y.
, &
Song
,
L.
(
2017
).
Diverse neural network learns true target functions
. In
Proceedings of the Conference on Artificial Intelligence and Statistics
(pp.
1216
1224
).
This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.