For nonconvex optimization in machine learning, this article proves that every local minimum achieves the globally optimal value of the perturbable gradient basis model at any differentiable point. As a result, nonconvex machine learning is theoretically as supported as convex machine learning with a handcrafted basis in terms of the loss at differentiable local minima, except in the case when a preference is given to the handcrafted basis over the perturbable gradient basis. The proofs of these results are derived under mild assumptions. Accordingly, the proven results are directly applicable to many machine learning models, including practical deep neural networks, without any modification of practical methods. Furthermore, as special cases of our general results, this article improves or complements several state-of-the-art theoretical results on deep neural networks, deep residual networks, and overparameterized deep neural networks with a unified proof technique and novel geometric insights. A special case of our results also contributes to the theoretical foundation of representation learning.
Deep learning has achieved considerable empirical success in machine learning applications. However, insufficient work has been done on theoretically understanding deep learning, partly because of the nonconvexity and high dimensionality of the objective functions used to train deep models. In general, theoretical understanding of nonconvex, high-dimensional optimization is challenging. Indeed, finding a global minimum of a general nonconvex function (Murty & Kabadi, 1987) and training certain types of neural networks (Blum & Rivest, 1992) are both NP-hard. Considering the NP-hardness of a general set of relevant problems, it is necessary to use additional assumptions to guarantee efficient global optimality in deep learning. Accordingly, recent theoretical studies have proven global optimality in deep learning by using additional strong assumptions such as linear activation, random activation, semirandom activation, Gaussian inputs, single-hidden-layer networks, and significant overparameterization (Choromanska, Henaff, Mathieu, Ben Arous, & LeCun, 2015; Kawaguchi, 2016; Hardt & Ma, 2017; Nguyen & Hein, 2017, 2018; Brutzkus & Globerson, 2017; Soltanolkotabi, 2017; Ge, Lee, & Ma, 2017; Goel & Klivans, 2017; Zhong, Song, Jain, Bartlett, & Dhillon, 2017; Li & Yuan, 2017; Kawaguchi, Xie, & Song, 2018; Du & Lee, 2018).
A study proving efficient global optimality in deep learning is thus closely related to the search for additional assumptions that might not hold in many practical applications. Toward widely applicable practical theory, we can also ask a different type of question: If standard global optimality requires additional assumptions, then what type of global optimality does not? In other words, instead of searching for additional assumptions to guarantee standard global optimality, we can also search for another type of global optimality under mild assumptions. Furthermore, instead of an arbitrary type of global optimality, it is preferable to develop a general theory of global optimality that not only works under mild assumptions but also produces the previous results with the previous additional assumptions, while predicting new results with future additional assumptions. This type of general theory may help not only to explain when and why an existing machine learning method works but also to predict the types of future methods that will or will not work.
As a step toward this goal, this article proves a series of theoretical results. The major contributions are summarized as follows:
For nonconvex optimization in machine learning with mild assumptions, we prove that every differentiable local minimum achieves global optimality of the perturbable gradient basis model class. This result is directly applicable to many existing machine learning models, including practical deep learning models, and to new models to be proposed in the future, whether nonconvex or convex.
The proposed general theory, with a simple and unified proof technique, is shown to prove several concrete guarantees that improve or complement several state-of-the-art results.
In general, the proposed theory allows us to see the effects of the design of models, methods, and assumptions on the optimization landscape through the lens of the global optima of the perturbable gradient basis model class.
Because a local minimum only requires a point to be locally optimal, it is nontrivial that every local minimum is guaranteed to achieve the global optimality of the induced perturbable gradient basis model class. The reason we can prove something stronger than many worst-case results in general nonconvex optimization is that we explicitly take advantage of mild assumptions that commonly hold in machine learning and deep learning. In particular, we assume that the objective function to be optimized is structured as a sum of weighted errors, where each error is an output of the composition of a loss function and a function from a hypothesis class. Moreover, we make mild assumptions on the loss function and the hypothesis class, all of which typically hold in practice.
This section defines the problem setting and common notation.
2.1 Problem Description
We use the following standard notation for differentiation. Given a scalar-valued or vector-valued function f with components f_1, …, f_m and variables θ_1, …, θ_d, let ∂f be the matrix-valued function with each entry (∂f)_{i,j} = ∂f_i/∂θ_j. Note that if f is a scalar-valued function, ∂f outputs a row vector. In addition, we write ∂_θ f for ∂f if θ_1, …, θ_d are the input variables of f. Given a function f, let ∂_k f be the partial derivative with respect to the k-th variable of f. For the syntax of any differentiation map ∂, given functions f and g, let ∂f(g(x)) be the (partial) derivative ∂f evaluated at the output g(x) of the function g.
Given a matrix M, vec(M) represents the standard vectorization of the matrix M: the columns of M stacked into a single column vector. Given a set of matrices or vectors M_1, …, M_k, define [M_1 M_2 ⋯ M_k] to be the block matrix whose j-th column block is M_j. Similarly, given an index set S = {s_1, …, s_k} with s_1 < s_2 < ⋯ < s_k increasing, define [M_s]_{s∈S} = [M_{s_1} M_{s_2} ⋯ M_{s_k}].
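As a concrete illustration of this notation, the following NumPy sketch (the matrices are arbitrary examples, not from the article) shows column-major vectorization and column-block concatenation:

```python
import numpy as np

# vec(M): the columns of M stacked into a single vector.
# (NumPy flattens row-major by default, so request column-major order.)
M = np.array([[1, 2],
              [3, 4]])
vec_M = M.flatten(order="F")            # [1, 3, 2, 4]

# [M_1 M_2 ... M_k]: block matrix whose j-th column block is M_j.
M1 = np.ones((2, 2))
M2 = np.zeros((2, 3))
block = np.hstack([M1, M2])             # shape (2, 5)
```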
3 Nonconvex Optimization Landscapes for Machine Learning
This section presents our first main result: under mild assumptions, every differentiable local minimum achieves the global optimality of the perturbable gradient basis model class.
Given a hypothesis class and data set, let the set of nondifferentiable points be the set of parameter vectors at which some model output fails to be differentiable. Similarly, define its neighborhood version using the open ball B(θ, ε) with center θ and radius ε: the set of parameter vectors θ such that some model output fails to be differentiable at some point of B(θ, ε). In common nondifferentiable models such as neural networks with rectified linear units (ReLUs) and pooling operations, the set of nondifferentiable points is nonempty, but its Lebesgue measure is zero.
This section uses the following mild assumptions.
(Use of Common Loss Criteria). For each data point, the loss criterion is differentiable and convex in the model output (e.g., the squared loss, cross-entropy loss, or polynomial hinge loss satisfies this assumption).
(Use of Common Model Structures). There exists a function g such that the model output can be written as f(x; θ) = Σ_k g_k(x, θ) θ_k for all inputs x and all parameter vectors θ.
Assumption 1 is satisfied simply by using common loss criteria, including the squared loss (q − y)², the cross-entropy loss, and the smoothed hinge loss max(0, 1 − qy)^p with p ≥ 2 (the hinge loss with p = 1). Although the objective function used to train a complex machine learning model (e.g., a neural network) is nonconvex in the parameters, the loss criterion is usually convex in the model output q. In this article, the cross-entropy loss includes the softmax function, and thus q is the pre-softmax output of the last layer in related deep learning models.
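The distinction drawn here can be checked numerically. The sketch below (a toy one-parameter model, not from the article) verifies midpoint convexity of the squared loss in the model output while exhibiting a violation of convexity in the parameter once the loss is composed with a nonlinear model:

```python
import numpy as np

rng = np.random.default_rng(0)

def sq_loss(q, y):              # squared loss: convex in the model output q
    return (q - y) ** 2

# Midpoint-convexity check of the loss in q (holds for every pair).
y = 1.0
for _ in range(1000):
    q1, q2 = rng.uniform(-5, 5, size=2)
    assert sq_loss(0.5 * (q1 + q2), y) <= 0.5 * (sq_loss(q1, y) + sq_loss(q2, y)) + 1e-12

# The same loss composed with a nonlinear model is nonconvex in the
# parameter: L(theta) = (sin(theta) - 1)^2 violates midpoint convexity.
L = lambda t: sq_loss(np.sin(t), 1.0)
t1, t2 = np.pi / 2, 5 * np.pi / 2
assert L(0.5 * (t1 + t2)) > 0.5 * (L(t1) + L(t2))   # convexity fails in theta
```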
Assumption 2 is satisfied simply by using a common architecture in deep learning or a classical machine learning model. For example, consider a deep neural network of the form f(x; θ) = wᵀh(x; u), where h(x; u) is the output of an arbitrary representation at the last hidden layer and θ = (u, w). Then assumption 2 holds with g_k(x, θ) = ∂f(x; θ)/∂θ_k for all θ_k corresponding to the last-layer parameters w and g_k(x, θ) = 0 for all θ_k corresponding to u. In general, because g may itself be a function of θ, assumption 2 is easily satisfiable. Assumption 2 does not require the model to be linear in θ or x.
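A minimal sketch of this structure, with an arbitrary tanh representation standing in for the last hidden layer (names and shapes are illustrative, not from the article), confirms that the network output equals the inner product of the last-layer weights with the corresponding partial derivatives:

```python
import numpy as np

rng = np.random.default_rng(1)

def hidden(x, U):                       # arbitrary nonlinear representation
    return np.tanh(U @ x)               # h(x; u): last-hidden-layer output

def f(x, U, w):                         # network output: w^T h(x; u)
    return w @ hidden(x, U)

x = rng.normal(size=4)
U = rng.normal(size=(3, 4))
w = rng.normal(size=3)

# The output equals the inner product of the last-layer weights with
# the partial derivatives of f w.r.t. those weights: f = <df/dw, w>.
grad_w = hidden(x, U)                   # df/dw_k = h_k(x; u)
assert np.isclose(f(x, U, w), grad_w @ w)
```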
Note that we allow the nondifferentiable points to exist in ; for example, the use of ReLU is allowed. For a nonconvex and nondifferentiable function, we can still have first-order and second-order necessary conditions of local minima (e.g., Rockafellar & Wets, 2009, theorem 13.24). However, subdifferential calculus of a nonconvex function requires careful treatment at nondifferentiable points (see Rockafellar & Wets, 2009; Kakade & Lee, 2018; Davis, Drusvyatskiy, Kakade, & Lee, 2019), and deriving guarantees at nondifferentiable points is left to a future study.
3.2 Theory for Critical Points
An important aspect of theorem 1 is that the minimization on the right-hand side is convex, while the objective on the left-hand side can be nonconvex or convex. Here, following convention, the infimum of a subset of the affinely extended real numbers is defined so that it equals −∞ if the set has no lower bound and +∞ if the set is empty. Note that theorem 1 vacuously holds true if there is no critical point of the objective. To guarantee the existence of a minimizer in a (nonempty) subspace, a classical proof requires two conditions: lower semicontinuity of the function being minimized and the existence of a point whose sublevel set is compact (see Bertsekas, 1999, for different conditions).
3.2.1 Geometric View
Figure 2 illustrates the gradient basis model class and theorem 1 with a union of manifolds and a tangent space. Under a constant rank condition, the image of the model map locally forms a single manifold. More precisely, if there exists a small neighborhood of θ in which the map is differentiable and its Jacobian has constant rank r (the constant rank condition), then the rank theorem states that the image of that neighborhood is a manifold of dimension r (Lee, 2013, theorem 4.12). We note that the rank map is lower semicontinuous (i.e., if the rank at θ is r, then there exists a neighborhood of θ in which the rank is at least r at every point). Therefore, if the Jacobian at θ attains the maximum rank over a small neighborhood of θ, then the constant rank condition is satisfied.
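The lower semicontinuity of the rank map can be seen in a short numerical example (the matrix family here is an arbitrary illustration): the rank at a point bounds the rank nearby from below, while the rank can still jump up arbitrarily close to that point:

```python
import numpy as np

# Lower semicontinuity of the rank: if rank(M(t0)) = r, then
# rank(M(t)) >= r for all t sufficiently close to t0.
# Illustration with the family M(t) = diag(1, t) at t0 = 0.
M = lambda t: np.diag([1.0, t])

assert np.linalg.matrix_rank(M(0.0)) == 1
for t in (1e-6, -1e-6, 1e-3):
    assert np.linalg.matrix_rank(M(t)) >= 1     # nearby rank never drops
# The rank can jump up arbitrarily close to t0 = 0, so a point of
# locally maximal rank satisfies the constant rank condition nearby.
assert np.linalg.matrix_rank(M(1e-6)) == 2
```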
In this section, we show through examples that theorem 1 generalizes previous results in special cases while providing new theoretical insights based on the gradient basis model class and its geometric view. In the following, whenever the form of the model is specified, we require only assumption 1, because assumption 2 is automatically satisfied by the given model.
For classical machine learning models, example 1 shows that the gradient basis model class is indeed equivalent to the given model class. From the geometric view, this means that for any parameter vector, the tangent space is equal to the whole image of the model map (i.e., it does not depend on the parameter vector). This reduces theorem 1 to the statement that every critical point of the objective is a global minimum.
For any basis function model f(x; θ) = θᵀφ(x) in classical machine learning with any fixed feature map φ, we have that ∂_θ f(x; θ) = φ(x)ᵀ, and hence the gradient basis coincides with the handcrafted basis φ. In other words, in this special case, theorem 1 states that every critical point of the objective is a global minimum. Here, we do not assume that a critical point or a global minimum exists or is attainable. Instead, the statement logically means that if a point is a critical point, then the point is a global minimum. This type of statement vacuously holds true if there is no critical point.
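For the squared loss, this special case can be checked directly: a critical point of a basis function model satisfies the normal equations, and no other parameter achieves a lower loss. The following sketch uses made-up random data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Basis function model f(x; theta) = theta^T phi(x) with a fixed feature
# map phi.  With the squared loss, setting the gradient to zero (a
# critical point) gives the normal equations, whose solution attains the
# global minimum of this convex problem.
n, d = 50, 5
Phi = rng.normal(size=(n, d))           # rows are phi(x_i)^T
y = rng.normal(size=n)

theta_crit = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)    # gradient = 0
assert np.allclose(Phi.T @ (Phi @ theta_crit - y), 0, atol=1e-6)

# No other parameter achieves a lower loss.
loss = lambda th: np.sum((Phi @ th - y) ** 2)
for _ in range(100):
    assert loss(theta_crit) <= loss(theta_crit + rng.normal(size=d)) + 1e-9
```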
For overparameterized deep neural networks, example 2 shows that the induced gradient basis model class is so expressive that it must contain the globally optimal model of a given model class of deep neural networks. In this example, the tangent space is equal to the whole output space. This reduces theorem 1 to the statement that every critical point of the objective is a global minimum for overparameterized deep neural networks.
Intuitively, in Figure 1 or 2, we can increase the number of parameters, and hence the number of partial derivatives, in order to increase the dimensionality of the tangent space until it spans the whole output space. This is indeed what happens in example 2, as well as in previous studies of significantly overparameterized deep neural networks (Allen-Zhu, Li, & Song, 2018; Du, Lee, Li, Wang, & Zhai, 2018; Zou et al., 2018). In those studies, the significant overparameterization is required so that the tangent space does not change from the initial tangent space during training. Thus, theorem 1, with its geometric view, provides novel algebraic and geometric insights into the results of the previous studies and into the reason that overparameterized deep neural networks are easy to optimize despite nonconvexity.
Theorem 1 implies that every critical point (and every local minimum) is a global minimum for sufficiently overparameterized deep neural networks. Consider a fully connected feedforward deep neural network and a significant overparameterization such that the width of the last hidden layer is at least the number of samples n. Let us write the network with trainable parameters (W, u) as f(x; W, u) = W h(x; u), where W is the weight matrix in the last layer, u contains the rest of the parameters, and h(x; u) is the output of the last hidden layer. The input is augmented with a constant term to account for the bias in the first layer. Assume that the input samples are normalized as ‖x_i‖ = 1 for all i and distinct in the sense that x_iᵀx_j < 1 for all i ≠ j. Assume that the activation functions are ReLU activation functions. Then we can efficiently set u to guarantee that the matrix [h(x_1; u) ⋯ h(x_n; u)] has rank n (e.g., by choosing u to make each unit of the last hidden layer active only for its own sample x_i).1 Theorem 1 then implies that every critical point with this u is a global minimum over the whole set of trainable parameters (with assumption 1).
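A hedged numerical sketch of this construction (dimensions are made up, and a simple choice of first-layer weights stands in for the full argument): with normalized, distinct inputs, one ReLU unit per sample can be made active only on its own sample, which makes the matrix of last-hidden-layer outputs full rank:

```python
import numpy as np

rng = np.random.default_rng(3)

n, d = 5, 8
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # normalized, distinct inputs

relu = lambda z: np.maximum(z, 0.0)

# Since x_i^T x_i = 1 > x_i^T x_j for j != i, taking the i-th row of the
# weights to be x_i^T with a bias strictly between the largest
# off-diagonal inner product and 1 makes unit i fire only on sample i.
G = X @ X.T
margin = (G - np.eye(n)).max()                  # bounded above by max_{i != j} x_i^T x_j < 1
b = -(1.0 + margin) / 2.0                       # threshold strictly between margin and 1
H = relu(G + b)                                 # hidden features: one unit per sample

assert np.allclose(H, np.diag(np.diag(H)))      # diagonal activation pattern
assert np.linalg.matrix_rank(H) == n            # rank n, so theorem 1 applies
```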
For deep neural networks, example 3 shows that standard networks have the global optimality guarantee with respect to the representation learned at the last layer, and that skip connections further ensure global optimality with respect to the representation learned at each hidden layer. This is because adding skip connections introduces new partial derivatives that span a tangent space containing the output of the best model with the corresponding learned representation.
3.3 Theory for Local Minima
The following theorem shows that every differentiable local minimum of the objective achieves the global minimum value of the perturbable gradient basis model class:
From theorem 1 to theorem 2, the power of increasing the number of parameters (including overparameterization) is further improved. The right-hand side in equation 3.2 is the global minimum value over both the coefficients and the perturbations. Here, as the number of parameters increases, we may obtain the global minimum value of a larger search space, similar to theorem 1. A concern in theorem 1 is that as the number of parameters increases, we may also significantly increase the redundancy among the elements of the gradient basis. Although this remains a valid concern, theorem 2 allows us to break the redundancy through the globally optimal perturbation to some degree.
For example, consider a deep neural network whose forward pass multiplies a trainable weight matrix W by the output of some l-th layer, with an arbitrary function computing the rest of the forward pass, and let H denote the matrix whose columns are the l-th-layer outputs for the n samples. Then every element of the left null space of H yields an admissible perturbation (i.e., the entries corresponding to W are perturbed by an element of the left null space, and the rest of the perturbation is set to zero). Thus, as the redundancy increases such that the dimension of the left null space of H increases, we have a larger space of perturbations for which a global minimum value is guaranteed at a local minimum.
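The left-null-space perturbations described above can be checked numerically. In the sketch below (shapes are illustrative, not from the article), perturbing the weight matrix by any combination of left-null-space directions of the layer-output matrix leaves the product, and hence the network output, unchanged:

```python
import numpy as np

rng = np.random.default_rng(4)

# l-th-layer output matrix H (columns = samples); the layer is
# overparameterized, so H has a nontrivial left null space.
H = rng.normal(size=(6, 3)) @ rng.normal(size=(3, 10))  # 6 rows, rank 3
W = rng.normal(size=(4, 6))                             # trainable weights

# Left null space of H: vectors v with v @ H = 0, read off from the SVD.
U, s, _ = np.linalg.svd(H)
r = int(np.sum(s > 1e-10))
N = U[:, r:].T                                          # basis, one row per direction
assert np.allclose(N @ H, 0)

# Perturbing W by any combination of left-null-space directions leaves
# the layer's contribution W @ H (hence the network output) unchanged.
Delta = rng.normal(size=(4, N.shape[0])) @ N
assert np.allclose((W + Delta) @ H, W @ H)
```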
3.3.1 Geometric View
4 Applications to Deep Neural Networks
The previous section showed that all local minima achieve the global optimality of the perturbable gradient basis model class with several direct consequences for special cases. In this section, as consequences of theorem 2, we complement or improve the state-of-the-art results in the literature.
4.1 Example: ResNets
PA1: The output dimension is 1.
PA2: For any target value, the loss criterion is convex and twice differentiable in the model output.
PA3: On any bounded subset of the domain, the objective function, its gradient, and its Hessian are all Lipschitz continuous in the parameters.
Shamir (2018) remarked that it is an open problem whether proposition 1 and another main result in that article can be extended to networks with multiple output units. Note that Shamir (2018) also provided proposition 1 with an expected loss and an analysis for a simpler decoupled model. For the simpler decoupled model, our theorem 1 immediately concludes that, given any fixed value of the other parameters, every critical point with respect to the linear coefficients achieves a global minimum value with respect to those coefficients. This holds at every critical point, since any critical point of the whole objective must in particular be a critical point with respect to that subset of coefficients.
The following theorem shows that every differentiable local minimum achieves the global minimum value of the right-hand side in equation 4.2, which is no worse than the upper bound in proposition 1 and is strictly better than that upper bound whenever the remaining terms are nonnegligible. Indeed, the global minimum value of the right-hand side in equation 4.2 is no worse than the global minimum value over all models parameterized by the coefficients of the given basis, and further improvement is guaranteed through a nonnegligible perturbation term.
Theorem 3 also successfully solves the first part of the open problem in the literature (Shamir, 2018) by discarding the assumption that the output dimension is 1. From the geometric view, theorem 3 states that the span of the set of vectors in the tangent spaces contains the output of the best basis model with the linear feature and the learned nonlinear feature. Similar to the examples in Figures 3 and 4, the output of the best basis model with these features is contained in this span but not in a single tangent space.
Unlike the recent study on ResNets (Kawaguchi & Bengio, 2019), our theorem 3 predicts the loss value at local minima through the global minimum value of a large search space and is proven as a consequence of our general theory (i.e., theorem 2) with a significantly different proof idea (see section 4.3) and with a novel geometric insight.
4.2.1 Example: Deep Nonlinear Networks with Locally Induced Partial Linear Structures
Given the difficulty of theoretically understanding deep neural networks, Goodfellow, Bengio, and Courville (2016) noted that theoretically studying simplified networks (i.e., deep linear networks) is worthwhile. For example, Saxe, McClelland, and Ganguli (2014) empirically showed that deep linear networks may exhibit several properties analogous to those of deep nonlinear networks. Accordingly, the theoretical study of deep linear neural networks has become an active area of research (Kawaguchi, 2016; Hardt & Ma, 2017; Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett, Helmbold, & Long, 2019; Du & Hu, 2019).
PA4: Every activation function is the identity (i.e., deep linear networks).
PA5: For any target value, the loss criterion is convex and differentiable in the model output.
PA6: The thinnest layer is either the input layer or the output layer.
Instead of studying deep linear networks, we now consider a partial linear structure locally induced by a parameter vector with nonlinear activation functions. This relaxes the linearity assumption and extends our understanding of deep linear networks to deep nonlinear networks.
Intuitively, the structure is a set of units whose activations are locally linear, induced by a parameter vector; this is formally defined as follows. Given a parameter vector, consider the collection of all sets of hidden units such that each set satisfies the following condition: there exists a neighborhood of the parameter vector in which, for every point of the neighborhood, each unit in the set computes a linear function of its preactivation, and the activation pattern of these units does not change. Let the relevant parameter set be the set of all parameter vectors for which this collection is nonempty. As the definition reveals, a neural network with such a parameter vector can be a standard deep nonlinear neural network (with no linear units).
Theorem 4 is a special case of theorem 2. A special case of theorem 4 in turn recovers one of the main results in the literature regarding deep linear neural networks, that is, every local minimum is a global minimum. Consider any deep linear network satisfying the above assumptions. Then every local minimum lies in the relevant parameter set, and theorem 4 reduces to the statement that every local minimum achieves the global minimum value. Thus, every local minimum is a global minimum for any such deep linear neural network. Therefore, theorem 4 successfully generalizes the recent previous result in the literature (proposition 2) for this common scenario.
Beyond deep linear networks, theorem 4 illustrates the benefit of both the locally induced structure and overparameterization for deep nonlinear networks. In the first term, we benefit from a more locally induced structure and from increasing the width of the corresponding layer (overparameterization). The second term is the general term that is always present from theorem 2, where we benefit from increasing the number of parameters.
From the geometric view, theorem 4 captures the intuition that the span of the set of vectors in the tangent spaces contains the best basis model with the linear feature for deep linear networks, as well as the best basis models with more nonlinear features as more local structures arise. Similar to the examples in Figures 3 and 4, the output of the best basis models with those features is contained in this span but not in a single tangent space.
A similar local structure was recently considered in Kawaguchi, Huang, and Kaelbling (2019). However, both the problem settings and the obtained results largely differ from those in Kawaguchi et al. (2019). Furthermore, theorem 4 is proven as a consequence of our general theory (theorem 2), and accordingly, the proofs largely differ from each other as well. Theorem 4 also differs from recent results on the gradient descent algorithm for deep linear networks (Arora, Cohen, Golowich, & Hu, 2018; Arora, Cohen, & Hazan, 2018; Bartlett et al., 2019; Du & Hu, 2019), since we analyze the loss surface instead of a specific algorithm and theorem 4 applies to deep nonlinear networks as well.
4.3 Proof Idea in Applications of Theorem 2
Theorems 3 and 4 are simple consequences of theorem 2, and their proofs illustrate how to use theorem 2 in future studies with different additional assumptions. The high-level idea behind the proofs in the applications of theorem 2 is captured in the geometric view of theorem 2 (see Figures 3 and 4). That is, given a desired guarantee, we check whether the relevant span is expressive enough to contain the output of the desired model corresponding to the desired guarantee.
To simplify the use of theorem 2, we provide the following lemma. This lemma states that the expressivity of the perturbable gradient basis model class is preserved under the relevant reparameterization. As shown in its proof, this is essentially because the model is linear in the coefficients, and a union of two finite sets remains a finite set.
For any , any , any , and any , it holds that
Based on theorem 2 and lemma 1, the proofs of theorems 3 and 4 reduce to a simple search for a perturbation such that the expressivity of the perturbable gradient basis model class is no worse than that of the target model class in theorem 3 and that of the corresponding class in theorem 4. With only this search, theorem 2 together with lemma 1 implies the desired statements for theorems 3 and 4 (see sections A.4 and A.5 in the appendix for further details). Thus, theorem 2 also enables simple proofs.
This study provided a general theory for nonconvex machine learning and demonstrated its power by using it to prove new, competitive theoretical results. In general, the proposed theory provides a mathematical tool for studying the effects of hypothesis classes, methods, and assumptions through the lens of the global optima of the perturbable gradient basis model class.
In convex machine learning with a model output f(x; θ) = θᵀφ(x) and a (nonlinear) feature output φ(x), achieving a critical point ensures global optimality in the span of the fixed basis φ. In nonconvex machine learning, we have shown that achieving a critical point ensures global optimality in the span of the gradient basis, which coincides with the fixed basis in the convex case. Thus, whether convex or nonconvex, achieving a critical point ensures global optimality in the span of some basis, which might be arbitrarily bad (or good) depending on the choice of the handcrafted basis (in the convex case) or the induced basis (in the nonconvex case). Therefore, in terms of the loss values at critical points, nonconvex machine learning is theoretically as justified as convex machine learning, except when a preference is given to the handcrafted basis over the gradient basis (both of which can be arbitrarily bad or good). The same statement holds for local minima and the perturbable gradient basis.
Appendix: Proofs of Theoretical Results
In this appendix, we provide complete proofs of the theoretical results.
A.1 Proof of Theorem 1
The proof of theorem 1 combines lemma 2 with assumptions 1 and 2 by taking advantage of the structure of the objective function. Although lemma 2 is rather weak and assumptions 1 and 2 are mild (in the sense that they usually hold in practice), the right combination of these with the structure of the objective proves the desired statement.
Let θ be an arbitrary critical point of the objective. Since the loss criterion is assumed to be differentiable and the model output is differentiable at the given θ, the composition of the two is also differentiable, and the chain rule applies. In addition, the objective is differentiable because a sum of differentiable functions is differentiable. Therefore, for any critical point of the objective, the gradient vanishes, and hence every partial derivative vanishes by linearity of the differentiation operation.
A.2 Proof of Theorem 2
Since θ is not in the set of nondifferentiable points, there exists ε > 0 such that the model outputs are differentiable in the open ball B(θ, ε). Since the loss criterion is assumed to be differentiable and the model outputs are differentiable in B(θ, ε), their composition is also differentiable in B(θ, ε). In addition, the objective is differentiable in B(θ, ε) because a sum of differentiable functions is differentiable.
A.3 Proof of Lemma 1
A.4 Proof of Theorem 3
A.5 Proof of Theorem 4
- Find with : Since(where ) is the desired set.
- Find with : With , we have that
For example, choose the first layer's weight matrix such that the i-th unit is active only for the i-th sample; this can be achieved by choosing the i-th row of the matrix to be a scaled copy of the i-th input with an appropriate bias. Then choose the weight matrices of the subsequent layers such that this activation pattern is preserved for all samples. This guarantees the desired rank condition.
We gratefully acknowledge support from NSF grants 1523767 and 1723381, AFOSR grant FA9550-17-1-0165, ONR grant N00014-18-1-2847, Honda Research, and the MIT-Sensetime Alliance on AI. Any opinions, findings, and conclusions or recommendations expressed in this material are our own and do not necessarily reflect the views of our sponsors.