Every Local Minimum Value Is the Global Minimum Value of Induced Model in Nonconvex Machine Learning

For nonconvex optimization in machine learning, this article proves that every local minimum achieves the globally optimal value of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this article improves or complements several state-of-the-art theoretical results on deep neural networks, deep residual networks, and overparameterized deep neural networks with a unified proof technique and novel geometric insights. A special case of our results also contributes to the theoretical foundation of representation learning.


Introduction
Deep learning has achieved considerable empirical success in machine learning applications. However, insufficient work has been done on theoretically understanding deep learning, partly because of the nonconvexity and high dimensionality of the objective functions used to train deep models. In general, theoretical understanding of nonconvex, high-dimensional optimization is challenging. Indeed, finding a global minimum of a general nonconvex function (Murty & Kabadi, 1987) and training certain types of neural networks (Blum & Rivest, 1992) are both NP-hard. Considering the NP-hardness for a general set of relevant problems, it is necessary to use additional assumptions to guarantee efficient global optimality in deep learning. Accordingly, recent theoretical studies have proven global optimality in deep learning by using additional strong assumptions such as linear activation, random activation, semirandom activation, gaussian inputs, single hidden-layer network, and significant overparameterization (Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016; Hardt & Ma, 2017; Nguyen & Hein, 2017, 2018; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Ge, Lee, & Ma, 2017; Goel & Klivans, 2017; Zhong, Song, Jain, Bartlett, & Dhillon, 2017; Li & Yuan, 2017; Kawaguchi, Xie, & Song, 2018; Du & Lee, 2018).
A study proving efficient global optimality in deep learning is thus closely related to the search for additional assumptions that might not hold in many practical applications. Toward widely applicable practical theory, we can also ask a different type of question: If standard global optimality requires additional assumptions, then what type of global optimality does not? In other words, instead of searching for additional assumptions to guarantee standard global optimality, we can also search for another type of global optimality under mild assumptions. Furthermore, instead of an arbitrary type of global optimality, it is preferable to develop a general theory of global optimality that not only works under mild assumptions but also produces the previous results with the previous additional assumptions, while predicting new results with future additional assumptions. This type of general theory may help not only to explain when and why an existing machine learning method works but also to predict the types of future methods that will or will not work.
As a step toward this goal, this article proves a series of theoretical results. The major contributions are summarized as follows:
• For nonconvex optimization in machine learning with mild assumptions, we prove that every differentiable local minimum achieves global optimality of the perturbable gradient basis model class. This result is directly applicable to many existing machine learning models, including practical deep learning models, and to new models to be proposed in the future, nonconvex and convex.
• The proposed general theory with a simple and unified proof technique is shown to be able to prove several concrete guarantees that improve or complement several state-of-the-art results.
• In general, the proposed theory allows us to see the effects of the design of models, methods, and assumptions on the optimization landscape through the lens of the global optima of the perturbable gradient basis model class.
Because a local minimum $\theta$ in $\mathbb{R}^{d_\theta}$ only requires $\theta$ to be locally optimal in $\mathbb{R}^{d_\theta}$, it is nontrivial that the local minimum is guaranteed to achieve the global optimality in $\mathbb{R}^{d_\theta}$ of the induced perturbable gradient basis model class. The reason we can prove something more than many worst-case results in general nonconvex optimization is that we explicitly take advantage of mild assumptions that commonly hold in machine learning and deep learning. In particular, we assume that an objective function to be optimized is structured as a sum of weighted errors, where each error is the output of a composition of a loss function and a function from a hypothesis class. Moreover, we make mild assumptions on the loss function and the hypothesis class, all of which typically hold in practice.

Preliminaries
This section defines the problem setting and common notation.
2.1 Problem Description. Let $x \in \mathcal{X}$ and $y \in \mathcal{Y}$ be an input vector and a target vector, respectively. Define $((x_i, y_i))_{i=1}^{m}$ as a training data set of size $m$. Let $\theta \in \mathbb{R}^{d_\theta}$ be a parameter vector to be optimized. Let $f(x;\theta) \in \mathbb{R}^{d_y}$ be the output of a model or a hypothesis, and let $\ell : \mathbb{R}^{d_y} \times \mathcal{Y} \to \mathbb{R}_{\ge 0}$ be a loss function. Here, $d_\theta, d_y \in \mathbb{N}_{>0}$. We consider the following standard objective function $L$ to train a model $f(x;\theta)$:
$$L(\theta) = \sum_{i=1}^{m} \lambda_i \, \ell(f(x_i;\theta), y_i).$$
This article allows the weights $\lambda_1, \dots, \lambda_m > 0$ to be arbitrarily fixed. With $\lambda_1 = \cdots = \lambda_m = 1/m$, all of our results hold true for the standard average loss $L$ as a special case.
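As a concrete reading of the weighted objective described above, the following minimal Python sketch evaluates a sum of weighted per-sample losses. The toy linear model, the squared loss, the data, and all function names are illustrative assumptions and are not taken from the article.

```python
import numpy as np

def weighted_objective(theta, xs, ys, lambdas, model, loss):
    """Return sum_i lambda_i * loss(model(x_i, theta), y_i)."""
    return sum(lam * loss(model(x, theta), y)
               for x, y, lam in zip(xs, ys, lambdas))

def linear_model(x, theta):
    # Hypothetical toy model f(x; theta) = <theta, x>, chosen only for illustration.
    return float(np.dot(theta, x))

def squared_loss(q, y):
    # Squared loss for a scalar output.
    return (q - y) ** 2

# Toy data (assumed values).
xs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
ys = np.array([1.0, 2.0, 2.0])
theta = np.array([0.5, 1.5])
m = len(xs)

# Uniform weights 1/m give the standard average loss as a special case.
average_loss = weighted_objective(theta, xs, ys, np.full(m, 1.0 / m),
                                  linear_model, squared_loss)

# Any fixed positive weights are allowed.
weighted_loss = weighted_objective(theta, xs, ys, np.array([1.0, 2.0, 3.0]),
                                   linear_model, squared_loss)
```

Passing the model and the loss as arguments keeps the sketch independent of any particular architecture, mirroring the generality of the setting.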

Notation.
Because the focus of this article is the optimization of the vector $\theta$, the following notation is convenient: $\ell_y(q) = \ell(q, y)$ and $f_x(q) = f(x; q)$. Then we can write
$$L(\theta) = \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}(f_{x_i}(\theta)).$$
We use the following standard notation for differentiation. Given a scalar-valued or vector-valued function $\varphi$ […] $\ldots, M_{d,d'}]^\top$ represents the standard vectorization of the matrix $M$. Given a set of $n$ matrices or vectors $\{M^{(j)}\}_{j=1}^{n}$, define $[M^{(j)}]_{j=1}^{n} = [M^{(1)}, M^{(2)}, \dots, M^{(n)}]$ to be a block matrix of each column block being $M^{(1)}, M^{(2)}, \dots, M^{(n)}$. Similarly, given a set […]

Nonconvex Optimization Landscapes for Machine Learning
This section shows our first main result that under mild assumptions, every differentiable local minimum achieves the global optimality of the perturbable gradient basis model class.
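3.1 Assumptions. Given a hypothesis class $f$ and data set, let $\Omega$ be a set of nondifferentiable points $\theta$ as
$$\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\exists i \in \{1,\dots,m\})[f_{x_i} \text{ is not differentiable at } \theta]\}.$$
Similarly, define
$$\tilde\Omega = \{\theta \in \mathbb{R}^{d_\theta} : (\forall \epsilon > 0)(\exists \theta' \in B(\theta,\epsilon))(\exists i \in \{1,\dots,m\})[f_{x_i} \text{ is not differentiable at } \theta']\}.$$
Here, $B(\theta, \epsilon)$ is the open ball with the center $\theta$ and the radius $\epsilon$. In common nondifferentiable models $f$ such as neural networks with rectified linear units (ReLUs) and pooling operations, we have that $\Omega = \tilde\Omega$, and the Lebesgue measure of $\Omega$ ($= \tilde\Omega$) is zero. This section uses the following mild assumptions.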

Assumption 2 (Use of Common Model Structures). There exists a function $g : \mathbb{R}^{d_\theta} \to \mathbb{R}^{d_\theta}$ such that for all $i \in \{1,\dots,m\}$ and all $\theta \in \mathbb{R}^{d_\theta} \setminus \Omega$,
$$f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \, \partial_k f_{x_i}(\theta).$$
Assumption 1 is satisfied by simply using common loss criteria that include the squared loss $\ell(q, y) = \|q - y\|_2^2$, the cross-entropy loss $\ell(q, y) = -\sum_{k=1}^{d_y} y_k \log \frac{\exp(q_k)}{\sum_{k'} \exp(q_{k'})}$, and the smoothed hinge loss $\ell(q, y) = (\max\{0, 1 - yq\})^p$ with $p \ge 2$ (the hinge loss with $d_y = 1$). Although the objective function $L : \theta \mapsto L(\theta)$ used to train a complex machine learning model (e.g., a neural network) is nonconvex in $\theta$, the loss criterion $\ell_{y_i} : q \mapsto \ell(q, y_i)$ is usually convex in $q$. In this article, the cross-entropy loss includes the softmax function, and thus $f_x(\theta)$ is the pre-softmax output of the last layer in related deep learning models.
Assumption 2 is satisfied by simply using a common architecture in deep learning or a classical machine learning model. For example, consider a deep neural network of the form $f_x(\theta) = W h(x; u) + b$, where $h(x; u)$ is an output of an arbitrary representation at the last hidden layer and $\theta = \mathrm{vec}([W, b, u])$. Then assumption 2 holds because $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k \, \partial_k f_{x_i}(\theta)$, where $g(\theta)_k = \theta_k$ for all $k$ corresponding to the parameters $(W, b)$ in the last layer and $g(\theta)_k = 0$ for all other $k$ corresponding to $u$. In general, because $g$ is a function of $\theta$, assumption 2 is easily satisfiable. Assumption 2 does not require the model $f(x; \theta)$ to be linear in $\theta$ or $x$.
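The identity behind assumption 2 for a network with a linear last layer can be checked numerically. The sketch below is illustrative only: the hidden representation $h(x;u) = \tanh(Ux)$, the dimensions, the parameter ordering, and all function names are assumptions made for the example, and the Jacobian is approximated by central finite differences.

```python
import numpy as np

def unpack(theta, d_y, d_h, d_x):
    # Parameter layout (an assumption of this sketch): W, then b, then U.
    n_w = d_y * d_h
    W = theta[:n_w].reshape(d_y, d_h)
    b = theta[n_w:n_w + d_y]
    U = theta[n_w + d_y:].reshape(d_h, d_x)
    return W, b, U

def f(x, theta, d_y, d_h, d_x):
    # f_x(theta) = W h(x; u) + b with the assumed representation h = tanh(U x).
    W, b, U = unpack(theta, d_y, d_h, d_x)
    return W @ np.tanh(U @ x) + b

def jacobian_fd(x, theta, dims, eps=1e-6):
    # Column k approximates the partial derivative of f_x with respect to theta_k.
    cols = []
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps
        cols.append((f(x, theta + e, *dims) - f(x, theta - e, *dims)) / (2 * eps))
    return np.stack(cols, axis=1)

def assumption_2_gap(seed=0):
    rng = np.random.default_rng(seed)
    d_y, d_h, d_x = 2, 3, 4
    dims = (d_y, d_h, d_x)
    theta = rng.normal(size=d_y * d_h + d_y + d_h * d_x)
    x = rng.normal(size=d_x)
    # g(theta)_k = theta_k on the last-layer parameters (W, b), and 0 on u.
    g = theta.copy()
    g[d_y * d_h + d_y:] = 0.0
    J = jacobian_fd(x, theta, dims)
    # Largest entrywise gap between sum_k g_k * d_k f_x(theta) and f_x(theta).
    return float(np.max(np.abs(J @ g - f(x, theta, *dims))))

gap = assumption_2_gap()
```

The gap is at the level of finite-difference rounding error because the output is linear in $(W, b)$, even though it is nonlinear in $u$ and $x$.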
Note that we allow the nondifferentiable points to exist in $L(\theta)$; for example, the use of ReLU is allowed. For a nonconvex and nondifferentiable function, we can still have first-order and second-order necessary conditions of local minima (e.g., Rockafellar & Wets, 2009, theorem 13.24). However, subdifferential calculus of a nonconvex function requires careful treatment at nondifferentiable points (see Rockafellar & Wets, 2009; Kakade & Lee, 2018; Davis, Drusvyatskiy, Kakade, & Lee, 2019), and deriving guarantees at nondifferentiable points is left to a future study.

Theory for Critical Points.
Before presenting the first main result, this section provides a simpler result for critical points to illustrate the ideas behind the main result for local minima. We define the (theoretical) objective function $L_\theta$ of the gradient basis model class as
$$L_\theta(\alpha) = \sum_{i=1}^{m} \lambda_i \, \ell(f_\theta(x_i; \alpha), y_i),$$
where $\{f_\theta(x_i; \alpha) = \sum_{k=1}^{d_\theta} \alpha_k \, \partial_k f_{x_i}(\theta) : \alpha \in \mathbb{R}^{d_\theta}\}$ is the induced gradient basis model class. The following theorem shows that every differentiable critical point of our original objective $L$ (including every differentiable local minimum and saddle point) achieves the global minimum value of $L_\theta$. The complete proofs of all the theoretical results are presented in appendix A.
Theorem 1. Let assumptions 1 and 2 hold. Then for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds:
$$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha).$$
An important aspect in theorem 1 is that $L_\theta$ on the right-hand side is convex, while $L$ on the left-hand side can be nonconvex or convex. Here, following convention, $\inf S$ is defined to be the infimum of a subset $S$ of $\bar{\mathbb{R}}$ (the set of affinely extended real numbers); that is, if $S$ has no lower bound, $\inf S = -\infty$, and $\inf \emptyset = \infty$. Note that theorem 1 vacuously holds true if there is no critical point for $L$. To guarantee the existence of a minimizer in a (nonempty) subspace $S \subseteq \mathbb{R}^{d_\theta}$ for $L$ (or $L_\theta$), a classical proof requires two conditions: a lower semicontinuity of $L$ (or $L_\theta$) and the existence of a $q \in S$ for which the set $\{q' \in S : L(q') \le L(q)\}$ (or $\{q' \in S : L_\theta(q') \le L_\theta(q)\}$) is compact (see Bertsekas, 1999, for different conditions).

Figure 1: Illustration of gradient basis model class and theorem 1 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$ ($d_y = 1$). Theorem 1 translates the local condition of $\theta$ in the parameter space $\mathbb{R}^2$ (on the left) to the global optimality in the output space $\mathbb{R}^3$ (on the right). The subspace $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class. Theorem 1 states that $f_X(\theta)$ is globally optimal in the subspace as $f_X(\theta) \in \mathrm{argmin}_{f \in T_{f_X(\theta)}} \mathrm{dist}(f, y)$ for any differentiable critical point $\theta$ of $L$.
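For intuition, the following short derivation sketches why the statement of theorem 1 is plausible. It is an informal reading and not the proof given in appendix A. It assumes that each $\ell_{y_i}$ is convex and differentiable (as holds for the loss criteria listed for assumption 1), that $\theta$ is a differentiable critical point, and that $g$ is the function from assumption 2.

```latex
\begin{align*}
% (1) theta is a critical point of L (chain rule):
0 = \partial L(\theta)
  &= \sum_{i=1}^{m} \lambda_i \,
     \partial \ell_{y_i}\bigl(f_{x_i}(\theta)\bigr)\, \partial f_{x_i}(\theta). \\
% (2) gradient of the convex function L_theta at alpha = g(theta):
\partial_\alpha L_\theta(\alpha)\Big|_{\alpha = g(\theta)}
  &= \sum_{i=1}^{m} \lambda_i \,
     \partial \ell_{y_i}\Bigl(\textstyle\sum_{k=1}^{d_\theta} g(\theta)_k\,
     \partial_k f_{x_i}(\theta)\Bigr)\, \partial f_{x_i}(\theta) \\
  &= \sum_{i=1}^{m} \lambda_i \,
     \partial \ell_{y_i}\bigl(f_{x_i}(\theta)\bigr)\, \partial f_{x_i}(\theta)
   = 0
  && \text{(assumption 2, then (1))}. \\
% (3) a stationary point of a convex differentiable function is a global minimizer:
\inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)
  &= L_\theta\bigl(g(\theta)\bigr)
   = \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}\bigl(f_{x_i}(\theta)\bigr)
   = L(\theta).
\end{align*}
```

Step (3) uses the fact that $L_\theta$ is the composition of the convex functions $\ell_{y_i}$ with maps that are linear in $\alpha$, so $L_\theta$ is convex and differentiable in $\alpha$.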

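Geometric View.
This section presents the geometric interpretation of theorem 1 that provides an intuitive yet formal description of the gradient basis model class. Figure 1 illustrates the gradient basis model class and theorem 1 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$. Here, we consider the following map from the parameter space to the concatenation of the output of the model at $x_1, x_2, \dots, x_m$:
$$f_X : \theta \in \mathbb{R}^{d_\theta} \mapsto (f_{x_1}(\theta)^\top, \dots, f_{x_m}(\theta)^\top)^\top \in \mathbb{R}^{m d_y}.$$
In the output space $\mathbb{R}^{m d_y}$ of $f_X$, the objective function $L$ induces the notion of distance from the target vector $y = (y_1, \dots, y_m) \in \mathbb{R}^{m d_y}$ to a vector $f = (f_1, \dots, f_m) \in \mathbb{R}^{m d_y}$ as
$$\mathrm{dist}(f, y) = \sum_{i=1}^{m} \lambda_i \, \ell(f_i, y_i).$$
We consider the affine subspace $T_{f_X(\theta)}$ of $\mathbb{R}^{m d_y}$ that passes through the point $f_X(\theta)$ and is spanned by the set of vectors $\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}$,
$$T_{f_X(\theta)} = \mathrm{span}(\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}) + \{f_X(\theta)\},$$
where the sum of the two sets represents the Minkowski sum of the sets.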
Then the subspace $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class in general beyond the low-dimensional illustration. This is because by assumption 2, for any given $\theta$,
$$T_{f_X(\theta)} = \mathrm{span}(\{\partial_1 f_X(\theta), \dots, \partial_{d_\theta} f_X(\theta)\}) = \Bigl\{\sum_{k=1}^{d_\theta} \alpha_k \, \partial_k f_X(\theta) : \alpha \in \mathbb{R}^{d_\theta}\Bigr\}.$$
Therefore, in general, theorem 1 states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $T_{f_X(\theta)}$ as
$$f_X(\theta) \in \mathrm{argmin}_{f \in T_{f_X(\theta)}} \mathrm{dist}(f, y)$$
for any differentiable critical point $\theta$ of $L$. Theorem 1 concludes this global optimality in the affine subspace of the output space based on the local condition in the parameter space (i.e., differentiable critical point). A key idea behind theorem 1 is to consider the map between the parameter space and the output space, which enables us to take advantage of assumptions 1 and 2.
Figure 2 illustrates the gradient basis model class and theorem 1 with a union of manifolds and a tangent space. Under the constant rank condition, the image of the map $f_X$ locally forms a single manifold. More precisely, if there exists a small neighborhood $U(\theta)$ of $\theta$ such that $f_X$ is differentiable in $U(\theta)$ and $\mathrm{rank}(\partial f_X(\theta')) = r$ is constant with some $r$ for all $\theta' \in U(\theta)$ (the constant rank condition), then the rank theorem states that the image $f_X(U(\theta))$ is a manifold of dimension $r$ (Lee, 2013, theorem 4.12). We note that the rank map $\theta \mapsto \mathrm{rank}(\partial f_X(\theta))$ is lower semicontinuous (i.e., if $\mathrm{rank}(\partial f_X(\theta)) = r$, then there exists a neighborhood $U(\theta)$ of $\theta$ such that $\mathrm{rank}(\partial f_X(\theta')) \ge r$ for any $\theta' \in U(\theta)$). Therefore, if $\partial f_X(\theta)$ at $\theta$ has the maximum rank in a small neighborhood of $\theta$, then the constant rank condition is satisfied.
For points $\theta$ where the constant rank condition is violated, the image of the map $f_X$ is no longer a single manifold. However, locally it decomposes as a union of finitely many manifolds. More precisely, if there exists a small neighborhood $U(\theta)$ of $\theta$ such that $f_X$ is analytic over $U(\theta)$ (this condition is satisfied for commonly used activation functions such as ReLU, sigmoid, and hyperbolic tangent at any differentiable point), then the image $f_X(U(\theta))$ admits a locally finite partition $\mathcal{M}$ into connected submanifolds such that whenever […] See Hardt (1975) for the proof.

Figure 2: Illustration of the gradient basis model class and theorem 1 with a union of manifolds and a tangent space. The space $\mathbb{R}^2 \ni \theta$ on the left is the parameter space, and the space $\mathbb{R}^3 \ni f_X(\theta)$ on the right is the output space. The surface $\mathcal{M} \subset \mathbb{R}^3$ on the right is the image of $f_X$, which is a union of finitely many manifolds. The tangent space $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class. Theorem 1 states that if $\theta$ is a differentiable critical point of $L$, then $f_X(\theta)$ is globally optimal in the tangent space $T_{f_X(\theta)}$.
If the point $\theta$ satisfies the constant rank condition, then $T_{f_X(\theta)}$ is exactly the tangent space of the manifold formed by the image $f_X(U(\theta))$. Otherwise, locally the image decomposes into a finite union $\mathcal{M}$ of submanifolds. In this case, $T_{f_X(\theta)}$ belongs to the span of the tangent space of those manifolds in $\mathcal{M}$ as […] where $T_p M$ is the tangent space of the manifold $M$ at the point $p$.

Examples.
In this section, we show through examples that theorem 1 generalizes the previous results in special cases while providing new theoretical insights based on the gradient basis model class and its geometric view.In the following, whenever the form of f is specified, we require only assumption 1 because assumption 2 is automatically satisfied by a given f .
For classical machine learning models, example 1 shows that the gradient basis model class is indeed equivalent to a given model class. From the geometric view, this means that for any $\theta$, the tangent space $T_{f_X(\theta)}$ is equal to the whole image $\mathcal{M}$ of $f_X$ (i.e., $T_{f_X(\theta)}$ does not depend on $\theta$). This reduces theorem 1 to the statement that every critical point of $L$ is a global minimum of $L$.
Example 1: Classical Machine Learning Models. For any basis function […], as well as $\Omega = \emptyset$. In other words, in this special case, theorem 1 states that every critical point of $L$ is a global minimum of $L$. Here, we do not assume that a critical point or a global minimum exists or is attainable. Instead, the statement logically means that if a point is a critical point, then the point is a global minimum. This type of statement vacuously holds true if there is no critical point.
For overparameterized deep neural networks, example 2 shows that the induced gradient basis model class is so expressive that it must contain the globally optimal model of a given model class of deep neural networks. In this example, the tangent space $T_{f_X(\theta)}$ is equal to the whole output space $\mathbb{R}^{m d_y}$. This reduces theorem 1 to the statement that every critical point of $L$ is a global minimum of $L$ for overparameterized deep neural networks.
Intuitively, in Figure 1 or 2, we can increase the number of parameters and raise the number of partial derivatives $\partial_k f_X(\theta)$ in order to increase the dimensionality of the tangent space $T_{f_X(\theta)}$ so that $T_{f_X(\theta)} = \mathbb{R}^{m d_y}$. This is indeed what happens in example 2, as well as in the previous studies of significantly overparameterized deep neural networks (Allen-Zhu, Li, & Song, 2018; Du, Lee, Li, Wang, & Zhai, 2018; Zou et al., 2018). In the previous studies, the significant overparameterization is required so that the tangent space $T_{f_X(\theta)}$ does not change from the initial tangent space $T_{f_X(\theta^{(0)})} = \mathbb{R}^{m d_y}$ during training. Thus, theorem 1, with its geometric view, provides novel algebraic and geometric insights into the results of the previous studies and into the reason why overparameterized deep neural networks are easy to optimize despite nonconvexity.
Example 2: Overparameterized Deep Neural Networks. Theorem 1 implies that every critical point (and every local minimum) is a global minimum for sufficiently overparameterized deep neural networks. Let $n$ be the number of units in each layer of a fully connected feedforward deep neural network. Let us consider a significant overparameterization such that $n \ge m$. Let us write a fully connected feedforward deep neural network with the trainable parameters $(\theta, u)$ by $f(x; \theta) = W \phi(x; u)$, where $W \in \mathbb{R}^{d_y \times n}$ is the weight matrix in the last layer, $\theta = \mathrm{vec}(W)$, $u$ contains the rest of the parameters, and $\phi(x; u)$ is the output of the last hidden layer. Denote $x_i = [(x_i^{(\mathrm{raw})})^\top, 1]^\top$ to contain the constant term to account for the bias term in the first layer. Assume that the input samples are normalized as $\|x_i^{(\mathrm{raw})}\|_2 = 1$ for all $i \in \{1, \dots, m\}$ and distinct as $(x_i^{(\mathrm{raw})})^\top x_{i'}^{(\mathrm{raw})} < 1 - \delta$ with some $\delta > 0$ for all $i' \ne i$. Assume that the activation functions are ReLU activation functions. Then we can efficiently set $u$ to guarantee $\mathrm{rank}([\phi(x_i; u)]_{i=1}^{m}) \ge m$ (e.g., by choosing $u$ to make each unit of the last layer active only for each sample $x_i$; see footnote 1). Theorem 1 implies that every critical point $\theta$ with this $u$ is a global minimum of the whole set of trainable parameters $(\theta, u)$ because […]

For deep neural networks, example 3 shows that standard networks have the global optimality guarantee with respect to the representation learned at the last layer, and skip connections further ensure the global optimality with respect to the representation learned at each hidden layer. This is because adding the skip connections introduces new partial derivatives $\{\partial_k f_X(\theta)\}_k$ that span the tangent space containing the output of the best model with the corresponding learned representation.
Example 3: Deep Neural Networks and Learned Representations. Consider a feedforward deep neural network, and let $I^{(\mathrm{skip})} \subseteq \{1, \dots, H\}$ be the set of indices such that there exists a skip connection from the $(l-1)$th layer to the last layer for all $l \in I^{(\mathrm{skip})}$; that is, in this example, […] The conclusion in this example holds for standard deep neural networks without skip connections too, since we always have $H \in I^{(\mathrm{skip})}$ for standard deep neural networks. Let assumption 1 hold. Then theorem 1 implies that for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds: […] where […], and thus assumption 2 is automatically satisfied. Here, $h^{(l)}(x_i; u)$ is the representation learned at the $l$th layer. Therefore, $\inf_{\alpha \in \mathbb{R}^{d_\theta}} L^{(\mathrm{skip})}_\theta(\alpha)$ is at most the global minimum value of the basis models with the learned representations of the last layer and all hidden layers with the skip connections.

Footnote 1: For example, choose the first layer's weight matrix $W^{(1)}$ such that for all $i \in \{1, \dots, m\}$, $(W^{(1)} x_i)_i > 0$ and $(W^{(1)} x_i)_{i'} \le 0$ for all $i' \ne i$. This can be achieved by choosing the $i$th row of $W^{(1)}$ to be $[(x_i^{(\mathrm{raw})})^\top, \epsilon - 1]$ with $0 < \epsilon \le \delta$ for $i \le m$. Then choose the weight matrices for the $l$th layer for all $l \ge 2$ such that for all $j$, $W^{(l)}_{j,j} \ne 0$ and $W^{(l)}_{j',j} = 0$ for all $j' \ne j$. This guarantees $\mathrm{rank}([\phi(x_i; u)]_{i=1}^{m}) \ge m$.
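The weight construction described in the footnote of example 2 can be checked numerically. The sketch below is a hedged illustration: it uses exactly $n = m$ hidden units, identity matrices for layers $l \ge 2$ (one admissible choice with nonzero diagonal and zero off-diagonal entries), randomly drawn unit-norm inputs, and hypothetical function names.

```python
import numpy as np

def last_hidden_features(X_raw, delta, depth=2):
    """Rows are phi(x_i; u) for the ReLU network built as in the footnote."""
    m, _ = X_raw.shape
    eps = delta  # any value with 0 < eps <= delta is admissible
    X = np.hstack([X_raw, np.ones((m, 1))])               # x_i = [x_i_raw, 1]
    W1 = np.hstack([X_raw, np.full((m, 1), eps - 1.0)])   # ith row: [x_i_raw, eps - 1]
    H = np.maximum(X @ W1.T, 0.0)                         # H[i, j] = relu(<W1[j], x_i>)
    for _ in range(depth - 1):
        H = np.maximum(H @ np.eye(m).T, 0.0)              # later layers: identity weights
    return H

def demo_rank(seed=0, m=5, d=3):
    rng = np.random.default_rng(seed)
    X_raw = rng.normal(size=(m, d))
    X_raw /= np.linalg.norm(X_raw, axis=1, keepdims=True)  # normalized inputs
    gram = X_raw @ X_raw.T
    max_off_diag = gram[~np.eye(m, dtype=bool)].max()
    delta = (1.0 - max_off_diag) / 2.0                     # so <x_i, x_j> < 1 - delta
    Phi = last_hidden_features(X_raw, delta)
    return int(np.linalg.matrix_rank(Phi)), Phi

rank, Phi = demo_rank()
```

In this sketch each hidden unit is active for exactly one sample, so the feature matrix is diagonal with positive diagonal entries, which gives rank $m$ even though $m$ exceeds the input dimension.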

Theory for Local Minima.
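We are now ready to present our first main result. We define the (theoretical) objective function $\tilde L_\theta$ of the perturbable gradient basis model class as
$$\tilde L_\theta(\alpha, \epsilon, S) = \sum_{i=1}^{m} \lambda_i \, \ell(\tilde f_\theta(x_i; \alpha, \epsilon, S), y_i),$$
where $\tilde f_\theta(x_i; \alpha, \epsilon, S)$ is a perturbed gradient basis model defined as
$$\tilde f_\theta(x_i; \alpha, \epsilon, S) = \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j} \, \partial_k f_{x_i}(\theta + \epsilon S_j).$$
Here, $S$ is a finite set of vectors $S_1, \dots, S_{|S|} \in \mathbb{R}^{d_\theta}$ and $\alpha \in \mathbb{R}^{d_\theta \times |S|}$. Let $V[\theta, \epsilon]$ be the set of all vectors $v \in \mathbb{R}^{d_\theta}$ such that $\|v\|_2 \le 1$ and $f_{x_i}(\theta + \epsilon v) = f_{x_i}(\theta)$ for any $i \in \{1, \dots, m\}$. Let $S \subseteq_{\mathrm{fin}} S'$ denote a finite subset $S$ of a set $S'$. For an $S_j \in V[\theta, \epsilon]$, we have $f_{x_i}(\theta + \epsilon S_j) = f_{x_i}(\theta)$, but it is possible to have $\partial_k f_{x_i}(\theta + \epsilon S_j) \ne \partial_k f_{x_i}(\theta)$. This enables the greater expressivity of $\tilde f_\theta(x_i; \alpha, \epsilon, S)$ with an $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ when compared with $f_\theta(x_i; \alpha)$.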
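The following theorem shows that every differentiable local minimum of $L$ achieves the global minimum value of $\tilde L_\theta$:

Theorem 2. Let assumptions 1 and 2 hold. Then, for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde\Omega)$ of $L$, the following holds: there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$,
$$L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde L_\theta(\alpha, \epsilon, S). \quad (3.2)$$
To understand the relationship between theorems 1 and 2, let us consider the following general inequalities: for any $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde\Omega)$ with $\epsilon \ge 0$ being sufficiently small,
$$L(\theta) \ge \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) \ge \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde L_\theta(\alpha, \epsilon, S).$$
Here, whereas theorem 1 states that the first inequality becomes equality as $L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha)$ at every differentiable critical point, theorem 2 states that both inequalities become equality as
$$L(\theta) = \inf_{\alpha \in \mathbb{R}^{d_\theta}} L_\theta(\alpha) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|}} \tilde L_\theta(\alpha, \epsilon, S)$$
at every differentiable local minimum.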
From theorem 1 to theorem 2, the power of increasing the number of parameters (including overparameterization) is further improved. The right-hand side in equation 3.2 is the global minimum value over the variables $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ and $\alpha \in \mathbb{R}^{d_\theta \times |S|}$. Here, as $d_\theta$ increases, we may obtain the global minimum value of a larger search space $\mathbb{R}^{d_\theta \times |S|}$, which is similar to theorem 1. A concern in theorem 1 is that as $d_\theta$ increases, we may also significantly increase the redundancy among the elements in $\{\partial_k f_{x_i}(\theta)\}_{k=1}^{d_\theta}$. Although this remains a valid concern, theorem 2 allows us to break the redundancy by the globally optimal $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ to some degree.
For example, consider $f(x; \theta) = g(W^{(l)} h^{(l)}(x; u); u)$, which represents a deep neural network, with some $l$th-layer output $h^{(l)}(x; u) \in \mathbb{R}^{d_l}$, a trainable weight matrix $W^{(l)}$, and an arbitrary function $g$ to compute the rest of the forward pass. Here, $\theta = \mathrm{vec}([W^{(l)}, u])$. Let $h^{(l)}(X; u) = [h^{(l)}(x_i; u)]_{i=1}^{m} \in \mathbb{R}^{d_l \times m}$ and, similarly, $f(X; \theta) = g(W^{(l)} h^{(l)}(X; u); u) \in \mathbb{R}^{d_y \times m}$. Then, all vectors $v$ corresponding to any elements in the left null space of $h^{(l)}(X; u)$ are in $V[\theta, \epsilon]$ (i.e., $v_k = 0$ for all $k$ corresponding to $u$, and the rest of $v_k$ is set to perturb $W^{(l)}$ by an element in the left null space). Thus, as the redundancy increases such that the dimension of the left null space of $h^{(l)}(X; u)$ increases, we have a larger space of $V[\theta, \epsilon]$, for which a global minimum value is guaranteed at a local minimum.
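The left-null-space perturbation in the example above can be illustrated numerically. The following sketch uses randomly generated stand-ins for the hidden representation matrix and the weight matrix; the dimensions, the perturbation scale, and the function names are assumptions for illustration. It checks that perturbing the weight matrix by rows taken from the left null space leaves the pre-activation on all training samples unchanged, so everything computed after it is unchanged as well.

```python
import numpy as np

def left_null_space(H, tol=1e-10):
    """Orthonormal basis (as columns) of {n : n^T H = 0}, computed via the SVD."""
    U, s, _ = np.linalg.svd(H, full_matrices=True)
    rank = int(np.sum(s > tol * s.max())) if s.size else 0
    return U[:, rank:]

def demo_null_space_perturbation(seed=0, d_l=6, m=3, d_out=2, eps=0.1):
    rng = np.random.default_rng(seed)
    H = rng.normal(size=(d_l, m))      # stand-in for h^(l)(X; u), one column per sample
    W = rng.normal(size=(d_out, d_l))  # stand-in for W^(l)
    N = left_null_space(H)             # shape (d_l, d_l - rank(H))
    coeffs = rng.normal(size=(d_out, N.shape[1]))
    Delta = coeffs @ N.T               # every row of Delta lies in the left null space
    # The pre-activation W h(X) is unchanged, hence so is any later computation g(.; u).
    change = float(np.max(np.abs((W + eps * Delta) @ H - W @ H)))
    return N.shape[1], change

null_dim, change = demo_null_space_perturbation()
```

When the layer width exceeds the number of samples, as in this sketch, the left null space is nontrivial, which matches the remark that more redundancy yields more admissible perturbation directions.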

Geometric View.
This section presents the geometric interpretation of the perturbable gradient basis model class and theorem 2. Figure 3 illustrates the perturbable gradient basis model class and theorem 2 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$. Figure 4 illustrates them with a union of manifolds and tangent spaces at a singular point. Given an $\epsilon$ ($\le \epsilon_0$), define the affine subspace $\tilde T_{f_X(\theta)}$ of the output space $\mathbb{R}^{m d_y}$ by
$$\tilde T_{f_X(\theta)} = \mathrm{span}\Bigl(\bigcup_{v \in V[\theta, \epsilon]} T_{f_X(\theta + \epsilon v)}\Bigr).$$
Then the subspace $\tilde T_{f_X(\theta)}$ is the space of the outputs of the perturbable gradient basis model class in general beyond the low-dimensional illustration (this follows equation 3.1 and the definition of the perturbable gradient basis model). Therefore, in general, theorem 2 states that under assumptions 1 and 2, $f_X(\theta)$ is globally optimal in the subspace $\tilde T_{f_X(\theta)}$ as
$$f_X(\theta) \in \mathrm{argmin}_{f \in \tilde T_{f_X(\theta)}} \mathrm{dist}(f, y)$$
for any differentiable local minimum $\theta$ of $L$. Theorem 2 concludes the global optimality in the affine subspace of the output space based on the local condition in the parameter space, that is, differentiable local minima. Here, a (differentiable) local minimum $\theta$ is required to be optimal only in an arbitrarily small local neighborhood in the parameter space, and yet $f_X(\theta)$ is guaranteed to be globally optimal in the affine subspace of the output space. This illuminates the fact that nonconvex optimization in machine learning has a particular structure beyond general nonconvex optimization.

Figure 3: Illustration of perturbable gradient basis model class and theorem 2 with $\theta \in \mathbb{R}^2$ and $f_X(\theta) \in \mathbb{R}^3$ ($d_y = 1$). Theorem 2 translates the local condition of $\theta$ in the parameter space $\mathbb{R}^2$ (on the left) to the global optimality in the output space $\mathbb{R}^3$ (on the right). The subspace $\tilde T_{f_X(\theta)}$ is the space of the outputs of the perturbable gradient basis model class. Theorem 2 states that $f_X(\theta)$ is globally optimal in the subspace as $f_X(\theta) \in \mathrm{argmin}_{f \in \tilde T_{f_X(\theta)}} \mathrm{dist}(f, y)$ for any differentiable local minimum $\theta$ of $L$. In this example, $\tilde T_{f_X(\theta)}$ is the whole output space $\mathbb{R}^3$, while $T_{f_X(\theta)}$ is not, illustrating the advantage of the perturbable gradient basis over the gradient basis. Since $\tilde T_{f_X(\theta)} = \mathbb{R}^3$, $f_X(\theta)$ must be globally optimal in the whole output space $\mathbb{R}^3$.

Figure 4: Illustration of perturbable gradient basis model class and theorem 2 with a union of manifolds and tangent spaces at a singular point. The whole space $\tilde T_{f_X(\theta)} = \mathbb{R}^3$ on the right panel is the space of the outputs of the perturbable gradient basis model class. The space $\tilde T_{f_X(\theta)}$ is the span of the set of the vectors in the tangent spaces $T_{f_X(\theta)}$, $T_{f_X(\theta')}$, and $T_{f_X(\theta'')}$. Theorem 2 states that if $\theta$ is a differentiable local minimum of $L$, then $f_X(\theta)$ is globally optimal in the space $\tilde T_{f_X(\theta)}$.

Applications to Deep Neural Networks
The previous section showed that all local minima achieve the global optimality of the perturbable gradient basis model class with several direct consequences for special cases.In this section, as consequences of theorem 2, we complement or improve the state-of-the-art results in the literature.

Example: ResNets.
As an example of theorem 2, we set $f$ to be the function of a certain type of residual networks (ResNets) that Shamir (2018) studied. That is, both Shamir (2018) and this section set $f$ as
$$f(x; \theta) = W(x + R z(x; u)). \quad (4.1)$$
Here, $z(x; u) \in \mathbb{R}^{d_z}$ represents an output of deep residual functions with a parameter vector $u$. No assumption is imposed on the form of $z(x; u)$, and $z(x; u)$ can represent an output of possibly complicated deep residual functions that arise in ResNets. For example, the function $f$ can represent deep preactivation ResNets (He, Zhang, Ren, & Sun, 2016), which are widely used in practice. To simplify theoretical study, Shamir (2018) assumed that every entry of the matrix $R$ is unconstrained (e.g., instead of $R$ representing convolutions). We adopt this assumption based on the previous study (Shamir, 2018).

Background.
Along with an analysis of approximate critical points, Shamir (2018) proved the following main result, proposition 1, under the assumptions PA1, PA2, and PA3:
PA1: The output dimension $d_y = 1$.
PA2: For any $y$, the function $\ell_y$ is convex and twice differentiable.
PA3: On any bounded subset of the domain of $L$, the function $L_u(W, R)$, its gradient $\nabla L_u(W, R)$, and its Hessian $\nabla^2 L_u(W, R)$ are all Lipschitz continuous in $(W, R)$, where $L_u(W, R) = L(\theta)$ with a fixed $u$.
Proposition 1 (Shamir, 2018). Let $f$ be specified by equation 4.1. Let assumptions PA1, PA2, and PA3 hold. Then for any local minimum $\theta$ of $L$,
$$L(\theta) \le \inf_{W \in \mathbb{R}^{d_y \times d_x}} \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}(W x_i).$$
Shamir (2018) remarked that it is an open problem whether proposition 1 and another main result in the article can be extended to networks with $d_y > 1$ (multiple output units). Note that Shamir (2018) also provided proposition 1 with an expected loss and an analysis for a simpler decoupled model, $Wx + Vz(x; u)$. For the simpler decoupled model, our theorem 1 immediately concludes that given any $u$, every critical point with respect to $\theta_{-u} = (W, R)$ achieves a global minimum value with respect to $\theta_{-u}$ as $L(\theta) = $ […]. This holds for every critical point $\theta$ since any critical point $\theta$ must be a critical point with respect to $\theta_{-u}$.

Result.
The following theorem shows that every differentiable local minimum achieves the global minimum value of $\tilde L^{(\mathrm{ResNet})}_\theta$ (the right-hand side in equation 4.2), which is no worse than the upper bound in proposition 1 and is strictly better than the upper bound as long as $z(x_i; u)$ or $\tilde f_\theta(x_i; \alpha, \epsilon, S)$ is nonnegligible. Indeed, the global minimum value of $\tilde L^{(\mathrm{ResNet})}_\theta$ (the right-hand side in equation 4.2) is no worse than the global minimum value of all models parameterized by the coefficients of the basis $x$ and $z(x; u)$, and further improvement is guaranteed through a nonnegligible $\tilde f_\theta(x_i; \alpha, \epsilon, S)$.
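Theorem 3. Let $f$ be specified by equation 4.1. Let assumption 1 hold. Assume that $d_y \le \min\{d_x, d_z\}$. Then for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde\Omega)$ of $L$, the following holds: there exists $\epsilon_0 > 0$ such that for any $\epsilon \in (0, \epsilon_0)$,
$$L(\theta) = \inf_{S \subseteq_{\mathrm{fin}} V[\theta, \epsilon],\ \alpha \in \mathbb{R}^{d_\theta \times |S|},\ \alpha_w \in \mathbb{R}^{d_y \times d_x},\ \alpha_r \in \mathbb{R}^{d_y \times d_z}} \tilde L^{(\mathrm{ResNet})}_\theta(\alpha, \alpha_w, \alpha_r, \epsilon, S), \quad (4.2)$$
where
$$\tilde L^{(\mathrm{ResNet})}_\theta(\alpha, \alpha_w, \alpha_r, \epsilon, S) = \sum_{i=1}^{m} \lambda_i \, \ell_{y_i}\bigl(\alpha_w x_i + \alpha_r z(x_i; u) + \tilde f_\theta(x_i; \alpha, \epsilon, S)\bigr).$$
Theorem 3 also solves the first part of the open problem in the literature (Shamir, 2018) by discarding the assumption of $d_y = 1$. From the geometric view, theorem 3 states that the span $\tilde T_{f_X(\theta)}$ of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta, \epsilon]\}$ contains the output of the best basis model with the linear feature $x$ and the learned nonlinear feature $z(x_i; u)$. Similar to the examples in Figures 3 and 4, $\tilde T_{f_X(\theta)} \ne T_{f_X(\theta)}$, and the output of the best basis model with these features is contained in $\tilde T_{f_X(\theta)}$ but not in $T_{f_X(\theta)}$.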
Unlike the recent study on ResNets (Kawaguchi & Bengio, 2019), our theorem 3 predicts the value of $L$ through the global minimum value of a large search space (i.e., the domain of $\tilde L^{(\mathrm{ResNet})}_\theta$) and is proven as a consequence of our general theory (i.e., theorem 2) with a significantly different proof idea (see section 4.3) and with the novel geometric insight.

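Example: Deep Nonlinear Networks with Locally Induced Partial Linear Structures.
We specify $f$ to represent fully connected feedforward networks with arbitrary nonlinearity $\sigma$ and arbitrary depth $H$ as follows:
$$f(x; \theta) = W^{(H+1)} h^{(H)}(x; \theta), \quad (4.3)$$
where
$$h^{(l)}(x; \theta) = \sigma^{(l)}\bigl(W^{(l)} h^{(l-1)}(x; \theta)\bigr)$$
for all $l \in \{1, \dots, H\}$ with $h^{(0)}(x; \theta) = x$. Here, $\theta = \mathrm{vec}([W^{(l)}]_{l=1}^{H+1}) \in \mathbb{R}^{d_\theta}$ with $W^{(l)} \in \mathbb{R}^{d_l \times d_{l-1}}$, $d_{H+1} = d_y$, and $d_0 = d_x$. In addition, $\sigma^{(l)} : \mathbb{R}^{d_l} \to \mathbb{R}^{d_l}$ represents an arbitrary nonlinear activation function per layer $l$ and is allowed to differ among different layers.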
Along this line, Laurent and Brecht (2018) recently proved the following main result, proposition 2, under the assumptions PA4, PA5, and PA6:
PA4: Every activation function is identity as $\sigma^{(l)}(q) = q$ for every $l \in \{1, \dots, H\}$ (i.e., deep linear networks).
PA5: For any $y$, the function $\ell_y$ is convex and differentiable.
PA6: The thinnest layer is either the input layer or the output layer as $\min\{d_x, d_y\} \le \min\{d_1, \dots, d_H\}$.

Result.
Instead of studying deep linear networks, we now consider a partial linear structure locally induced by a parameter vector with nonlinear activation functions.This relaxes the linearity assumption and extends our understanding of deep linear networks to deep nonlinear networks.
Intuitively, $\mathcal{J}_{n,t}[\theta]$ is a set of partial linear structures locally induced by a vector $\theta$, which is now formally defined as follows. Given a $\theta \in \mathbb{R}^{d_\theta}$, let $\mathcal{J}_{n,t}[\theta]$ be a set of all sets $J = \{J^{(t+1)}, \dots, J^{(H+1)}\}$ such that each set $J = \{J^{(t+1)}, \dots, J^{(H+1)}\} \in \mathcal{J}_{n,t}[\theta]$ satisfies the following conditions: there exists $\epsilon > 0$ such that for all $l \in \{t+1, t+2, \dots, H+1\}$, […] Let $\Theta_{n,t}$ be the set of all parameter vectors $\theta$ such that $\mathcal{J}_{n,t}[\theta]$ is nonempty. As the definition reveals, a neural network with a $\theta \in \Theta_{d_y,t}$ can be a standard deep nonlinear neural network (with no linear units).
Theorem 4. Let $f$ be specified by equation 4.3. Let assumption 1 hold. Then for any $t \in \{1, \dots, H\}$, at every local minimum $\theta \in (\Theta_{d_y,t} \setminus \tilde\Omega)$ of $L$, the following holds. There exists $\epsilon_0 > 0$ such that for any $\epsilon \in (0, \epsilon_0)$, […] where […]

Theorem 4 is a special case of theorem 2. A special case of theorem 4 then results in one of the main results in the literature regarding deep linear neural networks, that is, every local minimum is a global minimum. Consider any deep linear network with $d_y \le \min\{d_1, \dots, d_H\}$. Then every local minimum $\theta$ is in $\Theta_{d_y,0} \setminus \tilde\Omega = \Theta_{d_y,0}$. Hence, theorem 4 is reduced to the statement that for any local minimum, $L(\theta) = $ […], which is the global minimum value. Thus, every local minimum is a global minimum for any deep linear neural network with $d_y \le \min\{d_1, \dots, d_H\}$. Therefore, theorem 4 generalizes the recent previous result in the literature (proposition 2) for a common scenario of $d_y \le d_x$.
Beyond deep linear networks, theorem 4 illustrates both the benefit of the locally induced structure and the overparameterization for deep nonlinear networks. In the first term, $\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i; u)$ in $L^{(\mathrm{ff})}_{\theta,t}$, we benefit by decreasing $t$ (a more locally induced structure) and increasing the width of the $l$th layer for any $l \ge t$ (overparameterization). The second term, $\tilde{f}_\theta(x_i; \alpha, \epsilon, S)$ in $L^{(\mathrm{ff})}_{\theta,t}$, is the general term that is always present from theorem 2, where we benefit from increasing $d_\theta$ because $\alpha \in \mathbb{R}^{d_\theta \times |S|}$.
From the geometric view, theorem 4 captures the intuition that the span $\tilde{T}_{f_X(\theta)}$ of the set of the vectors in the tangent spaces $\{T_{f_X(\theta + \epsilon v)} : v \in V[\theta, \epsilon]\}$ contains the best basis model with the linear feature for deep linear networks, as well as the best basis models with more nonlinear features as more local structures arise. Similar to the examples in Figures 3 and 4, $\tilde{T}_{f_X(\theta)} \neq T_{f_X(\theta)}$, and the outputs of the best basis models with those features are contained in $\tilde{T}_{f_X(\theta)}$ but not in $T_{f_X(\theta)}$.
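The gap between the two spaces can be checked numerically on a toy case. The following is a minimal sketch, not the construction used in the paper: it assumes a two-parameter deep linear model $f(x; \theta) = w_2 w_1 x$ at the singular point $\theta = 0$, and it assumes that $V[\theta, \epsilon]$ consists of directions $v$ with $\|v\|_2 \le 1$ that leave the outputs unchanged, $f_X(\theta + \epsilon v) = f_X(\theta)$. The names `f_X`, `jac`, and the chosen inputs are illustrative only.

```python
import numpy as np

def f_X(theta, x):
    """Outputs of the toy deep linear model f(x; theta) = w2 * w1 * x on all inputs."""
    w1, w2 = theta
    return w2 * w1 * x

def jac(theta, x):
    """Gradient basis: column k holds the partial derivative of f w.r.t. parameter k."""
    w1, w2 = theta
    return np.stack([w2 * x, w1 * x], axis=1)

x = np.array([-1.0, 0.5, 2.0])   # hypothetical inputs
theta = np.zeros(2)              # singular point: both gradients vanish here
eps = 0.1

# Assumed perturbation set S: each direction keeps the outputs at f_X(theta).
S = [np.array([0.0, 0.0]), np.array([1.0, 0.0]), np.array([0.0, 1.0])]
assert all(np.allclose(f_X(theta + eps * v, x), f_X(theta, x)) for v in S)

T = jac(theta, x)                                           # tangent space at theta
T_tilde = np.hstack([jac(theta + eps * v, x) for v in S])   # perturbed bases stacked

rank_T = np.linalg.matrix_rank(T)
rank_T_tilde = np.linalg.matrix_rank(T_tilde)
# The linear feature x lies in the span of the perturbed bases if adding it
# as an extra column does not increase the rank.
rank_with_x = np.linalg.matrix_rank(np.hstack([T_tilde, x[:, None]]))
print(rank_T, rank_T_tilde, rank_with_x)
```

In this toy case the tangent space at the origin is trivial, while the span over the perturbed points already contains every linear model in $x$, which is the behavior the geometric view describes for deep linear networks.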
A similar local structure was recently considered in Kawaguchi, Huang, and Kaelbling (2019). However, both the problem settings and the obtained results largely differ from those in Kawaguchi et al. (2019). Furthermore, theorem 4 is proven as a consequence of our general theory (theorem 2), and accordingly, the proofs largely differ from each other as well. Theorem 4 also differs from recent results on the gradient descent algorithm for deep linear networks (Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett et al., 2019; Du & Hu, 2019), since we analyze the loss surface instead of a specific algorithm and theorem 4 applies to deep nonlinear networks as well.

Proof Idea in Applications of Theorem 2
Theorems 3 and 4 are simple consequences of theorem 2, and their proofs are illustrative as a means of using theorem 2 in future studies with different additional assumptions. The high-level idea behind the proofs in the applications of theorem 2 is captured in the geometric view of theorem 2 (see Figures 3 and 4). That is, given a desired guarantee, we check whether the space $\tilde{T}_{f_X(\theta)}$ is expressive enough to contain the output of the desired model corresponding to the desired guarantee.
To simplify the use of theorem 2, we provide the following lemma. This lemma states that the expressivity of the model $\tilde{f}_\theta(x; \alpha, \epsilon, S)$ with respect to $(\alpha, S)$ is the same as that of $\tilde{f}_\theta(x; \alpha, \epsilon, S) + \tilde{f}_\theta(x; \alpha', \epsilon, S')$ with respect to $(\alpha, \alpha', S, S')$. As shown in its proof, this is essentially because $\tilde{f}_\theta$ is linear in $\alpha$, and a union of two sets $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ and $S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ is again a finite subset of $V[\theta, \epsilon]$.
Lemma 1. For any $\theta$, any $\epsilon \ge 0$, any $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$, and any $x$, it holds that $\{\tilde{f}_\theta(x; \alpha, \epsilon, S)$ [...]
Based on theorem 2 and lemma 1, the proofs of theorems 3 and 4 are reduced to a simple search for finding $S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ such that the expressivity of $\tilde{f}_\theta(x_i; \alpha', \epsilon, S')$ with respect to $\alpha'$ is no worse than the expressivity of $\alpha_w x_i + \alpha_r z(x_i; u)$ with respect to $(\alpha_w, \alpha_r)$ (see theorem 3) and that of $\sum_{l=t}^{H} \alpha_h^{(l+1)} h^{(l)}(x_i; u)$ with respect to $\alpha_h^{(l+1)}$ (see theorem 4). In other words, $\{\tilde{f}_\theta(x_i; \alpha', \epsilon, S') :$ [...] Only with this search for $S'$, theorem 2 together with lemma 1 implies the desired statements for theorems 3 and 4 (see sections A.4 and A.5 in the appendix for further details). Thus, theorem 2 also enables simple proofs.
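The role of linearity in $\alpha$ can be made explicit with a short worked identity. This is a sketch under an assumption: it takes the perturbable gradient basis model to have the form $\tilde{f}_\theta(x; \alpha, \epsilon, S) = \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j}\, \partial_k f_x(\theta + \epsilon S_j)$, which is not restated in this section, and it treats $S$ and $S'$ as disjoint (for a shared element, the two coefficients simply add).

```latex
\begin{align*}
\tilde{f}_\theta(x; \alpha, \epsilon, S) + \tilde{f}_\theta(x; \alpha', \epsilon, S')
  &= \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S|} \alpha_{k,j}\, \partial_k f_x(\theta + \epsilon S_j)
   + \sum_{k=1}^{d_\theta} \sum_{j=1}^{|S'|} \alpha'_{k,j}\, \partial_k f_x(\theta + \epsilon S'_j) \\
  &= \tilde{f}_\theta\bigl(x; [\alpha \;\; \alpha'], \epsilon, S \cup S'\bigr),
\end{align*}
% Here [\alpha \;\; \alpha'] \in \mathbb{R}^{d_\theta \times (|S| + |S'|)} is the
% column-wise concatenation, with S \cup S' enumerated as S followed by S'.
```

Under this assumed form, adding a second model over $S'$ never leaves the model class, because the sum is itself a single model over the finite set $S \cup S'$; setting $\alpha' = 0$ recovers the original model.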

Conclusion
This study provided a general theory for nonconvex machine learning and demonstrated its power by proving new competitive theoretical results with it. In general, the proposed theory provides a mathematical tool to study the effects of hypothesis classes $f$, methods, and assumptions through the lens of the global optima of the perturbable gradient basis model class.
In convex machine learning with a model output $f(x; \theta) = \theta^\top x$ with a (nonlinear) feature output $x = \phi(x^{(\mathrm{raw})})$, achieving a critical point ensures the global optimality in the span of the fixed basis $x = \phi(x^{(\mathrm{raw})})$. In nonconvex machine learning, we have shown that achieving a critical point ensures the global optimality in the span of the gradient basis $\partial f_x(\theta)$, which coincides with the fixed basis $x = \phi(x^{(\mathrm{raw})})$ in the case of convex machine learning. Thus, whether convex or nonconvex, achieving a critical point ensures the global optimality in the span of some basis, which might be arbitrarily bad (or good) depending on the choice of the handcrafted basis $\phi(x^{(\mathrm{raw})}) = \partial f_x(\theta)$ (for the convex case) or the induced basis $\partial f_x(\theta)$ (for the nonconvex case). Therefore, in terms of the loss values at critical points, nonconvex machine learning is theoretically as justified as the convex one, except in the case when a preference is given to $\phi(x^{(\mathrm{raw})})$ over $\partial f_x(\theta)$ (both of which can be arbitrarily bad or good). The same statement holds for local minima and the perturbable gradient basis.
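The statement about critical points can be illustrated numerically. The following is a minimal sketch under stated assumptions, not the paper's experiment: the model $f(x; \theta) = w_2 \tanh(w_1 x)$, the squared loss averaged over the data, the inputs, and the targets are all hypothetical choices. The targets are constructed so that the chosen $\theta$ is an exact critical point (the residual is made orthogonal to the gradient basis), which avoids relying on an optimizer. The model satisfies $f = w_2\, \partial f / \partial w_2$, so its output lies in the span of its own gradient basis.

```python
import numpy as np

def model(theta, x):
    """Tiny nonconvex model f(x; theta) = w2 * tanh(w1 * x)."""
    w1, w2 = theta
    return w2 * np.tanh(w1 * x)

def jacobian(theta, x):
    """Gradient basis: columns are df/dw1 and df/dw2 evaluated on every input."""
    w1, w2 = theta
    t = np.tanh(w1 * x)
    return np.stack([w2 * x * (1.0 - t**2), t], axis=1)

def loss(pred, y):
    """Squared loss averaged over the data (convex and differentiable in pred)."""
    return 0.5 * np.mean((pred - y) ** 2)

x = np.linspace(-2.0, 2.0, 9)
theta = np.array([0.8, 1.5])
J = jacobian(theta, x)
f = model(theta, x)

# Construct targets so that theta is a critical point: take an arbitrary
# vector n and keep only its component orthogonal to the columns of J.
n = np.sin(5.0 * x) + 0.3 * x**2
coef, *_ = np.linalg.lstsq(J, n, rcond=None)
r = n - J @ coef
y = f - r                      # residual f - y equals r, so J^T (f - y) = 0

grad = J.T @ (f - y) / len(x)  # gradient of the loss w.r.t. theta

# Best model in the span of the gradient basis: least squares over alpha.
alpha, *_ = np.linalg.lstsq(J, y, rcond=None)
loss_theta = loss(f, y)
loss_basis = loss(J @ alpha, y)
print(np.linalg.norm(grad), loss_theta, loss_basis)
```

In this sketch the loss at $\theta$ matches the best loss attainable in the span of the gradient basis even though the loss itself is far from zero, which is the sense in which a critical point is optimal relative to its induced basis rather than in absolute terms.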

Appendix: Proofs of Theoretical Results
In this appendix, we provide complete proofs of the theoretical results.
A.1 Proof of Theorem 1. The proof of theorem 1 combines lemma 2 with assumptions 1 and 2 by taking advantage of the structure of the objective function $L$. Although lemma 2 is rather weak and assumptions 1 and 2 are mild (in the sense that they usually hold in practice), the right combination of these with the structure of $L$ can prove the desired statement.
Lemma 2. Assume that for any $i \in \{1, \dots, m\}$, the function $\ell_{y_i} : q \mapsto \ell(q, y_i)$ is differentiable. Then for any critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$, the following holds: for any $k \in \{1, \dots,$ [...]
Proof of Lemma 2. Let $\theta$ be an arbitrary critical point $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ of $L$. Since $\ell_{y_i} : \mathbb{R}^{d_y} \to \mathbb{R}$ is assumed to be differentiable and $f_{x_i} \in \mathbb{R}^{d_y}$ is differentiable at the given $\theta$, the composition $(\ell_{y_i} \circ f_{x_i})$ [...] In addition, $L$ is differentiable because a sum of differentiable functions is differentiable. Therefore, for any critical point $\theta$ of $L$, we have that $\partial L(\theta) = 0$, and, hence, [...]
Proof of Theorem 1. Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \Omega)$ be an arbitrary critical point of $L$. From assumption 2, there exists a function $g$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k\, \partial_k f_{x_i}(\theta)$ for all $i \in \{1, \dots, m\}$. [...] where the first line follows from assumption 1 (differentiable and convex $\ell_{y_i}$), the second line follows from linearity of summation, and the third line follows from assumption 2. Thus, on the one hand, we have that [...]
A.2 Proof of Theorem 2. The proof of theorem 2 uses lemma 3, the structure of the objective function $L$, and assumptions 1 and 2.
Lemma 3. Assume that for any $i \in \{1, \dots, m\}$, the function $\ell_{y_i} : q \mapsto \ell(q, y_i)$ is differentiable. Then for any local minimum $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ of $L$, the following holds: there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, any $v \in V[\theta, \epsilon]$, and any $k \in \{1, \dots,$ [...]
Proof of Lemma 3. Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ be an arbitrary local minimum of $L$. Since $\theta$ is a local minimum of $L$, by the definition of a local minimum, there exists $\epsilon_1 > 0$ such that $L(\theta) \le L(\theta')$ for all $\theta' \in B(\theta, \epsilon_1)$. Then for any $\epsilon \in [0, \epsilon_1/2)$ and any $v \in V[\theta, \epsilon]$, the vector $(\theta + \epsilon v)$ is also a local minimum because [...] for all $\theta' \in B(\theta + \epsilon v, \epsilon_1/2) \subseteq B(\theta, \epsilon_1)$ (the inclusion follows from the triangle inequality), which satisfies the definition of a local minimum for $(\theta + \epsilon v)$.
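The ball inclusion used in the proof of lemma 3 can be spelled out in one line. This is a sketch that assumes every $v \in V[\theta, \epsilon]$ satisfies $\|v\|_2 \le 1$ (a property of the definition of $V[\theta, \epsilon]$ that is not restated in this appendix) and that $B(\cdot, \cdot)$ denotes a Euclidean ball.

```latex
% For any \theta' \in B(\theta + \epsilon v, \epsilon_1/2), with \epsilon < \epsilon_1/2
% and \|v\|_2 \le 1 (assumed):
\begin{align*}
\|\theta' - \theta\|_2
  &\le \|\theta' - (\theta + \epsilon v)\|_2 + \epsilon \|v\|_2 \\
  &<  \tfrac{\epsilon_1}{2} + \tfrac{\epsilon_1}{2} = \epsilon_1,
\end{align*}
% so \theta' \in B(\theta, \epsilon_1), that is,
% B(\theta + \epsilon v, \epsilon_1/2) \subseteq B(\theta, \epsilon_1).
```

The strict inequality comes from $\epsilon \|v\|_2 \le \epsilon < \epsilon_1/2$, so the conclusion does not depend on whether the balls are taken to be open or closed.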
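Since $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$, there exists $\epsilon_2 > 0$ such that $f_{x_1}, \dots, f_{x_m}$ are differentiable in $B(\theta, \epsilon_2)$. Since $\ell_{y_i} : \mathbb{R}^{d_y} \to \mathbb{R}$ is assumed to be differentiable and [...] Therefore, with $\epsilon_0 = \min(\epsilon_1/2, \epsilon_2)$, we have that for any $\epsilon \in [0, \epsilon_0)$ and any $v \in V[\theta, \epsilon]$, the vector $(\theta + \epsilon v)$ is a differentiable local minimum, and hence the first-order necessary condition of differentiable local minima implies that for any $k \in \{1, \dots, d_\theta\}$, [...] where we used the fact that [...]
Proof of Theorem 2. Let $\theta \in (\mathbb{R}^{d_\theta} \setminus \tilde{\Omega})$ be an arbitrary local minimum of $L$. Since $(\mathbb{R}^{d_\theta} \setminus \tilde{\Omega}) \subseteq (\mathbb{R}^{d_\theta} \setminus \Omega)$, from assumption 2, there exists a function $g$ such that $f_{x_i}(\theta) = \sum_{k=1}^{d_\theta} g(\theta)_k\, \partial_k f_{x_i}(\theta)$ for all $i \in \{1, \dots, m\}$. Then from lemma 3, there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, any $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ and any [...] where the first line follows from assumption 1 (differentiable and convex $\ell_{y_i}$), the second line follows from linearity of summation and the definition of $\tilde{f}_\theta(x_i; \alpha, \epsilon, S)$, and the third line follows from assumption 2.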
Thus, on the one hand, there exists $\epsilon_0 > 0$ such that for any $\epsilon \in [0, \epsilon_0)$, [...] Combining these yields the desired statement.
A.3 Proof of Lemma 1. As shown in the proof of lemma 1, lemma 1 is a simple consequence of the following facts: $\tilde{f}_\theta$ is linear in $\alpha$, and a union of two sets $S \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ and $S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ is again a finite subset of $V[\theta, \epsilon]$. [...] where the second line follows from the facts that a finite union of finite sets is finite and hence $S \cup S' \subseteq_{\mathrm{fin}} V[\theta, \epsilon]$ (i.e., the set in the first line is a superset of, $\supseteq$, the set in the second line), and that $\alpha \in \mathbb{R}^{d_\theta \times |S \cup S'|}$ can be chosen to zero out the extra terms due to $S'$ in $\tilde{f}_\theta(x; \alpha, \epsilon, S \cup S')$ (i.e., the set in the first line is a subset of, $\subseteq$, the set in the second line). The last line follows from the same facts. The third line follows from the definition of $\tilde{f}_\theta(x; \alpha, \epsilon, S)$. The fourth line follows from the following equality due to the linearity of $\tilde{f}_\theta$ in $\alpha$: [...]
A.4 Proof of Theorem 3. As shown in the proof of theorem 3, thanks to theorem 2 and lemma 1, the remaining task to prove theorem 3 is to find a set [...] Let $\mathrm{Null}(M)$ be the null space of a matrix $M$.
Here, $\alpha_{w,j}, v_{w,j} \in \mathbb{R}^{d_y \times d_x}$, $\alpha_{r,j}, v_{r,j} \in \mathbb{R}^{d_x \times d_z}$, and $\alpha_{u,j}, v_{u,j} \in \mathbb{R}^{d_u}$. Let $\epsilon \in (0, \epsilon_0)$ be fixed.
Consider the case of $\mathrm{rank}(W) \ge d_y$. Define $S'$ such that $|S'| = 1$ and $S'_1 = 0 \in \mathbb{R}^{d_\theta}$, which is in $V[\theta, \epsilon]$. Then by setting $\alpha_{u,1} = 0$ and rewriting $\alpha_{r,1}$ such that $W \alpha_{r,1} = \alpha^{(1)}_{r,1} - \alpha_{w,1} R$ with an arbitrary matrix $\alpha^{(1)}_{r,1} \in \mathbb{R}^{d_y \times d_z}$ (this is possible since $\mathrm{rank}(W) \ge d_y$), we have that [...]
Consider the case of $\mathrm{rank}(W) < d_y$. Since $W \in \mathbb{R}^{d_y \times d_x}$ and $\mathrm{rank}(W) < d_y \le \min(d_x, d_z) \le d_x$, we have that $\mathrm{Null}(W) \neq \{0\}$, and there exists a vector $a \in \mathbb{R}^{d_x}$ such that $a \in \mathrm{Null}(W)$ and $\|a\|_2 = 1$. Let $a$ be such a vector. Define $S'$ as follows: $|S'| = d_y d_z + 1$, $S'_1 = 0 \in \mathbb{R}^{d_\theta}$, and set $S'_j$ for all $j \in \{2, \dots, d_y d_z + 1\}$ such that $v_{w,j} = 0$, $v_{u,j} = 0$, and $v_{r,j} = a b_j^\top$, where $b_j \in \mathbb{R}^{d_z}$ is an arbitrary column vector with $\|b_j\|_2 \le 1$. Then $S'_j \in V[\theta, \epsilon]$ for all $j \in \{1, \dots, d_y d_z + 1\}$. By setting $\alpha_{r,j} = 0$ and $\alpha_{u,j} = 0$ for all $j \in \{1, \dots, d_y d_z + 1\}$ and by rewriting [...] and $\alpha_{w,j} = \frac{1}{\epsilon} q_j a^\top$ for all $j \in \{2, \dots, d_y d_z + 1\}$ with an arbitrary vector $q_j \in \mathbb{R}^{d_y}$ (this is possible since $\epsilon > 0$ is fixed first and $\alpha_{w,j}$ is arbitrary), we have that [...] Since $q_j \in \mathbb{R}^{d_y}$ and $b_j \in \mathbb{R}^{d_z}$ are arbitrary, we can rewrite [...]
Summarizing the above, in both cases of $\mathrm{rank}(W)$, there exists a set $S' \subseteq_{\mathrm{fin}}$ [...] where the second line follows from lemma 1. On the other hand, since the set in the first line is a subset of the set in the last line, $\{\tilde{f}_\theta(x_i; \alpha, \epsilon, S) :$ [...] This immediately implies the desired statement from theorem 2.
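The rank-deficient case can be sanity-checked numerically. The following is a minimal sketch under assumptions: it takes the model of theorem 3 to have the form $f(x; \theta) = W(x + R\, z(x; u))$, which is not restated in this appendix, and it uses random hypothetical values for $W$, $R$, $x$, and a fixed vector `z` standing in for $z(x; u)$. It checks that perturbing $R$ by $\epsilon\, a b^\top$ with $a \in \mathrm{Null}(W)$ leaves the output unchanged, and that the choice $\alpha_w = \frac{1}{\epsilon} q a^\top$ turns the perturbation into the extra term $q b^\top z$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_y, d_x, d_z = 2, 3, 4          # hypothetical sizes with d_y <= min(d_x, d_z)
eps = 0.05

# Rank-deficient W (rank 1 < d_y), so Null(W) is nontrivial.
W = np.outer(rng.normal(size=d_y), rng.normal(size=d_x))
R = rng.normal(size=(d_x, d_z))
x = rng.normal(size=d_x)
z = rng.normal(size=d_z)         # stands in for z(x; u)

# Unit vector a in Null(W): the last right-singular vector of W.
a = np.linalg.svd(W)[2][-1]
b = rng.normal(size=d_z)
b /= max(1.0, np.linalg.norm(b)) # enforce ||b||_2 <= 1
q = rng.normal(size=d_y)

# Perturbing R by eps * a b^T leaves the model output unchanged because W a = 0.
R_pert = R + eps * np.outer(a, b)
out = W @ (x + R @ z)
out_pert = W @ (x + R_pert @ z)

# The model is linear in W, so the gradient basis term for W at a point is
# alpha_w @ (x + R z). With alpha_w = (1/eps) q a^T, the perturbed basis
# contributes the extra term q (b^T z) relative to the unperturbed one.
alpha_w = np.outer(q, a) / eps
extra = alpha_w @ (x + R_pert @ z) - alpha_w @ (x + R @ z)
target = q * (b @ z)
print(np.allclose(out, out_pert), np.allclose(extra, target))
```

Because $q$ and $b$ are free, sums of such terms can realize arbitrary linear maps of $z$ under the assumed model form, which is the kind of expressivity the proof searches for in the set $S'$.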
are the input variables of ϕ.Given a function ϕ : R d → R d , let ∂ k ϕ : R d → R be the partial derivative ∂ k ϕ with respect to the kth variable of ϕ.For the syntax of any differentiation map ∂, given functions ϕ and ζ , let 1

Figure 2: Illustration of gradient basis model class and theorem 1 with manifold and tangent space. The space $\mathbb{R}^2 \ni \theta$ on the left is the parameter space, and the space $\mathbb{R}^3 \ni f_X(\theta)$ on the right is the output space. The surface $M \subset \mathbb{R}^3$ on the right is the image of $f_X$, which is a union of finitely many manifolds. The tangent space $T_{f_X(\theta)}$ is the space of the outputs of the gradient basis model class. Theorem 1 states that if $\theta$ is a differentiable critical point of $L$, then $f_X(\theta)$ is globally optimal in the tangent space $T_{f_X(\theta)}$.

Figure 4: Illustration of perturbable gradient basis model class and theorem 2 with manifold and tangent space at a singular point. The surface $M \subset \mathbb{R}^3$ is the image of $f_X$, which is a union of finitely many manifolds. The line $T_{f_X(\theta)}$ on the left panel is the space of the outputs of the gradient basis model class. The whole space $\tilde{T}_{f_X(\theta)} = \mathbb{R}^3$ on the right panel is the space of the outputs of the perturbable gradient basis model class. The space $\tilde{T}_{f_X(\theta)}$ is the span of the set of the vectors in the tangent spaces $T_{f_X(\theta)}$, $T_{f_X(\theta')}$, and $T_{f_X(\theta'')}$. Theorem 2 states that if $\theta$ is a differentiable local minimum of $L$, then $f_X(\theta)$ is globally optimal in the space $\tilde{T}_{f_X(\theta)}$.