Abstract
Representations of the world environment play a crucial role in artificial intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a deep neural network learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. This happens implicitly during training through minimizing a supervised or unsupervised loss. In this letter, we study the dynamics of such implicit nonlinear representation learning. We identify a pair of a new assumption and a novel condition, called the common model structure assumption and the data-architecture alignment condition. Under the common model structure assumption, the data-architecture alignment condition is shown to be sufficient for the global convergence and necessary for global optimality. Moreover, our theory explains how and when increasing network size does and does not improve the training behaviors in the practical regime. Our results provide practical guidance for designing a model structure; for example, the common model structure assumption can be used as a justification for using a particular model structure instead of others. As an application, we then derive a new training framework, which satisfies the data-architecture alignment condition without assuming it by automatically modifying any given training algorithm depending on the data and architecture. Given a standard training algorithm, the framework running its modified version is empirically shown to maintain competitive (practical) test performances while providing global convergence guarantees for deep residual neural networks with convolutions, skip connections, and batch normalization on standard benchmark data sets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN.
1 Introduction
LeCun, Bengio, and Hinton (2015) described deep learning as a form of hierarchical nonlinear representation learning:
Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level (p. 436).
In applications such as computer vision and natural language processing, the success of deep learning can be attributed to its ability to learn hierarchical nonlinear representations by automatically changing nonlinear features and kernels during training based on the given data. This is in contrast to classical machine-learning methods, where representations (or, equivalently, nonlinear features and kernels) are fixed during training.
Deep learning in practical regimes, which has the ability to learn nonlinear representation (Bengio, Courville, & Vincent, 2013), has had a profound impact in many areas, including object recognition in computer vision (Rifai et al., 2011; Hinton, Osindero, & Teh, 2006; Bengio, Lamblin, Popovici, & Larochelle, 2007; Ciregan, Meier, & Schmidhuber, 2012; Krizhevsky, Sutskever, & Hinton, 2012), style transfer (Gatys, Ecker, & Bethge, 2016; Luan, Paris, Shechtman, & Bala, 2017), image super-resolution (Dong, Loy, He, & Tang, 2014), speech recognition (Dahl, Ranzato, Mohamed, & Hinton, 2010; Deng et al., 2010; Seide, Li, & Yu, 2011; Mohamed, Dahl, & Hinton, 2011; Dahl, Yu, Deng, & Acero, 2011; Hinton et al., 2012), machine translation (Schwenk, Rousseau, & Attik, 2012; Le, Oparin, Allauzen, Gauvain, & Yvon, 2012), paraphrase detection (Socher, Huang, Pennington, Ng, & Manning, 2011), word sense disambiguation (Bordes, Glorot, Weston, & Bengio, 2012), and sentiment analysis (Glorot, Bordes, & Bengio, 2011; Socher, Pennington, Huang, Ng, & Manning, 2011). However, we do not yet know the precise condition that makes deep learning tractable in the practical regime of representation learning.
For example, one of the simplest models is the linear model of the form $f(x, \theta) = \theta^\top \phi(x)$, where $\phi$ is a fixed function and $\phi(x)$ is a nonlinear representation of input data $x$. This is a classical machine learning model where much of the effort goes into the design of the handcrafted feature map $\phi$ via feature engineering (Turner, Fuggetta, Lavazza, & Wolf, 1999; Zheng & Casari, 2018). In this linear model, we do not learn the representation because the feature map $\phi$ is fixed without dependence on the model parameter $\theta$ that is optimized with the training data set.
Similar to many definitions in mathematics, where an intuitive notion in a special case is formalized to a definition for a more general case, we now abstract and generalize this intuitive notion of the representation of the linear model to that of all differentiable models as follows:
Given any $\theta$ and differentiable function $f$, we define the gradient $\frac{\partial f(x, \theta)}{\partial \theta}$ to be the gradient representation of the data $x$ under the model $f$ at $\theta$.
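To make this definition concrete, the following sketch computes the gradient representation of a single input under a small differentiable model in PyTorch. The two-layer architecture and its sizes are illustrative assumptions, not the models studied later in this letter.

```python
import torch

# A small differentiable model f(x, theta); this architecture is only an illustrative choice.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 8),
    torch.nn.Tanh(),
    torch.nn.Linear(8, 1),
)

x = torch.randn(3)            # one raw input
f_value = model(x)[0]         # scalar output f(x, theta)

# Gradient representation: the gradient of f(x, theta) with respect to all parameters theta.
grads = torch.autograd.grad(f_value, list(model.parameters()))
gradient_representation = torch.cat([g.reshape(-1) for g in grads])

print(gradient_representation.shape)  # a vector of length d (the number of parameters)
```

For the linear model $f(x, \theta) = \theta^\top \phi(x)$, this vector is exactly the fixed feature $\phi(x)$; for nonlinear models, it changes as $\theta$ changes during training.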
In this letter, we initiate the study of the dynamics of learning gradient representations that are nonlinear in $\theta$. That is, we focus on the regime where the gradient representation $\frac{\partial f(x, \theta^t)}{\partial \theta}$ at the end of training differs greatly from the initial representation $\frac{\partial f(x, \theta^0)}{\partial \theta}$. This regime was studied in the past for the case where the function $\theta \mapsto f(x, \theta)$ is affine, as in $f(x, \theta) = \theta^\top \phi(x)$ for some fixed feature map $\phi$ (Saxe, McClelland, & Ganguli, 2014; Kawaguchi, 2016, 2021; Laurent & Brecht, 2018; Bartlett, Helmbold, & Long, 2019; Zou, Long, & Gu, 2020; Xu et al., 2021). Unlike any previous studies, we focus on the problem setting where the function $\theta \mapsto f(x, \theta)$ is nonlinear and nonaffine, with the effect of nonlinear (gradient) representation learning. The results of this letter avoid the curse of dimensionality by studying the global convergence of the gradient-based dynamics instead of the dynamics of global optimization (Kawaguchi et al., 2016) and Bayesian optimization (Kawaguchi, Kaelbling, & Lozano-Pérez, 2015). Importantly, we do not require any wide layer or large input dimension throughout this letter. Our main contributions are summarized as follows:
In section 2, we identify a pair of a novel assumption and a new condition, called the common model structure assumption and the data-architecture alignment condition. Under the common model structure assumption, the data-architecture alignment condition is shown to be a necessary condition for the globally optimal model and a sufficient condition for the global convergence. The condition is dependent on both data and architecture. Moreover, we empirically verify and deepen this new understanding. When we apply representation learning in practice, we often have overwhelming options regarding which model structure to use. Our results provide practical guidance for choosing or designing a model structure via the common model structure assumption, which is indeed satisfied by many representation learning models used in practice.
In section 3, we discard the assumption of the data-architecture alignment condition. Instead, we derive a novel training framework, called the exploration-exploitation wrapper (EE wrapper), which satisfies the data-architecture alignment condition time-independently a priori. The EE wrapper is then proved to have global convergence guarantees under the safe-exploration condition. The safe-exploration condition is what allows us to explore various gradient representations safely without getting stuck in the states where we cannot provably satisfy the data-architecture alignment condition. The safe-exploration condition is shown to hold true for ResNet-18 with standard benchmark data sets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN time-independently.
In section 3.4, the EE wrapper is shown not to degrade the practical performance of ResNet-18 on the standard data sets MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN. To our knowledge, this letter provides the first practical algorithm with global convergence guarantees that does not degrade the practical performance of ResNet-18 on these standard data sets while using convolutions, skip connections, and batch normalization without any extremely wide layer of width larger than the number of data points. We are not aware of any similar algorithm with global convergence guarantees in the regime of learning nonlinear representations that does not degrade practical performance.
2 Understanding Dynamics via Common Model Structure and Data-Architecture Alignment
In this section, we identify the common model structure assumption and study the data-architecture alignment condition for the global convergence in nonlinear representation learning. We present an overview of our results in section 2.1, deepen our understanding with experiments in section 2.2, discuss implications of our results in section 2.3, and establish mathematical theories in section 2.4.
2.1 Overview
2.1.1 Common Model Structure Assumption
Through examinations of representation learning models used in applications, we identified and formalized one of their common properties as follows:
(Common Model Structure Assumption). There exists a subset $\mathcal{S} \subseteq \{1, \dots, d\}$ such that $f(x, \theta) = \sum_{j \in \mathcal{S}} \theta_j \frac{\partial f(x, \theta)}{\partial \theta_j}$ for any $\theta \in \mathbb{R}^d$ and $x$.
Assumption 1 is satisfied by common machine learning models, such as kernel models and multilayer neural networks, with or without convolutions, batch normalization, pooling, and skip connections. For example, consider a multilayer neural network of the form $f(x, \theta) = W h(x, u)$, where $h(x, u)$ is an output of its last hidden layer and the parameter vector $\theta = (\operatorname{vec}(W)^\top, u^\top)^\top$ consists of the parameters $W$ at the last layer and the parameters $u$ in all other layers. Here, for any matrix $M$, we let $\operatorname{vec}(M)$ be the standard vectorization of the matrix by stacking columns. Then, assumption 1 holds because $f(x, \theta) = \sum_{j \in \mathcal{S}} \theta_j \frac{\partial f(x, \theta)}{\partial \theta_j}$, where $\mathcal{S}$ is the index set of the last-layer parameters $\operatorname{vec}(W)$ within $\theta$. Since $h$ is arbitrary in this example, the common model structure assumption holds, for example, for any multilayer neural networks with a fully connected last layer. In general, because the nonlinearity at the output layer can be treated as a part of the loss function while preserving the convexity of the loss (e.g., cross-entropy loss with softmax), this assumption is satisfied by many machine learning models, including ResNet-18 and all models used in the experiments in this letter (as well as all linear models). Moreover, assumption 1 is automatically satisfied in the next section by using the EE wrapper.
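As a minimal illustration of assumption 1, the following sketch numerically checks the identity $f(x, \theta) = \sum_{j \in \mathcal{S}} \theta_j \frac{\partial f(x, \theta)}{\partial \theta_j}$ for a small network whose last layer is fully connected (with no output bias). The architecture and sizes are assumptions made only for illustration.

```python
import torch

# A network whose output layer is fully connected: f(x, theta) = W h(x, u).
hidden = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Softplus())
last = torch.nn.Linear(16, 3, bias=False)  # the last-layer parameters W form the subset S

x = torch.randn(4)
f = last(hidden(x))  # model output, shape (3,)

# Sum over j in S of theta_j * d f_k / d theta_j, one output coordinate k at a time.
recovered = []
for k in range(f.shape[0]):
    grads = torch.autograd.grad(f[k], list(last.parameters()), retain_graph=True)
    recovered.append(sum((p * g).sum() for p, g in zip(last.parameters(), grads)))
recovered = torch.stack(recovered)

print(torch.allclose(f, recovered, atol=1e-5))  # True: assumption 1 holds for this structure
```

The same check passes when convolutions, pooling, skip connections, or batch normalization are used before the fully connected last layer, since only the structure of the last layer matters for the identity.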
2.1.2 Data-Architecture Alignment Condition
Given a target matrix $Y$ (whose $i$th row is the target output for the $i$th training input) and a loss function $\ell$, we define the modified target matrix $Y^*$ by $Y^* = Y$ for the squared loss, and by the corresponding transformation of $Y$ for the (binary and multiclass) cross-entropy losses. Given the input matrix $X$, the output matrix $f_X(\theta)$ is defined by stacking the model outputs $f(x_1, \theta), \dots, f(x_n, \theta)$ as its rows. For any matrix $M$, we let $\operatorname{Col}(M)$ be its column space. With these notations, we are now ready to introduce the data-architecture alignment condition:
(Data-Architecture Alignment Condition). Given any data set $((x_i, y_i))_{i=1}^n$, differentiable function $f$, and loss function $\ell$, the data-architecture alignment condition is said to be satisfied at $\theta$ if $\operatorname{vec}(Y^*) \in \operatorname{Col}\!\left(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\right)$.
The data-architecture alignment condition depends on both the data (through the target $Y$ and the input $X$) and the architecture (through the model $f$). It is satisfied only when the data and architecture align well with each other. For example, in the case of the linear model $f(x, \theta) = \theta^\top \phi(x)$, the condition can be written as $Y^* \in \operatorname{Col}(\Phi)$, where $\Phi$ is the feature matrix whose $i$th row is $\phi(x_i)^\top$. In definition 2, $f_X(\theta)$ is a matrix of the preactivation outputs of the last layer. Thus, in the case of classification tasks with a nonlinear activation at the output layer, $Y$ and $f_X(\theta)$ are not in the same space, which is the reason we use $Y^*$ here instead of $Y$.
Importantly, the data-architecture alignment condition does not make any requirements on the rank of the Jacobian matrix $\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}$: its rank is allowed to be smaller than both the number of data points $n$ and the number of parameters $d$. Thus, for example, the data-architecture alignment condition can be satisfied, depending on the given data and architecture, even if the minimum eigenvalue of the matrix $\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big)^{\top}$ is zero, in both cases of overparameterization (e.g., $d \ge n$) and underparameterization (e.g., $d < n$). This is further illustrated in section 2.2 and discussed in section 2.3. We note that we further discard the assumption of the data-architecture alignment condition in section 3, as it is automatically satisfied by using the EE wrapper.
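One direct way to check the data-architecture alignment condition at a given $\theta$ is to form the Jacobian of the stacked training outputs with respect to $\theta$ and test whether $\operatorname{vec}(Y^*)$ lies in its column space, for example via a projection residual. The sketch below does this for a small illustrative model and toy data; the sizes and the use of the squared loss (for which $Y^* = Y$) are assumptions made only for illustration.

```python
import torch

torch.manual_seed(0)
n, d_in, d_out = 12, 5, 2
X, Y = torch.randn(n, d_in), torch.randn(n, d_out)   # toy data; with the squared loss, Y* = Y

model = torch.nn.Sequential(
    torch.nn.Linear(d_in, 7), torch.nn.Softplus(), torch.nn.Linear(7, d_out)
)
params = list(model.parameters())

# Jacobian of the stacked training outputs with respect to theta, of size (n * d_out) x d.
rows = []
for out in model(X).reshape(-1):
    grads = torch.autograd.grad(out, params, retain_graph=True)
    rows.append(torch.cat([g.reshape(-1) for g in grads]))
J = torch.stack(rows)

# vec(Y*) lies in Col(J) iff the residual of the least-squares problem J v = vec(Y*) is zero.
target = Y.reshape(-1, 1)
residual = torch.linalg.norm(J @ (torch.linalg.pinv(J) @ target) - target)
print("data-architecture alignment condition holds:", residual.item() < 1e-4)
```

Note that this check depends only on the column space of the Jacobian, not on its minimum eigenvalue, which is why the condition can hold in both the overparameterized and underparameterized cases.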
2.1.3 Global Convergence
Under the common model structure assumption, the data-architecture alignment condition is shown to be what lets us avoid the failure of the global convergence and suboptimal local minima. More concretely, we prove a global convergence guarantee under the data-architecture alignment condition as well as the necessity of the condition for the global optimality:
(Informal Version). Let assumption 1 hold. Then the following two statements hold for gradient-based dynamics:
The bound on the global optimality gap decreases per iteration toward zero at the rate of $O(1/t)$ over the iterations $t$ at which the data-architecture alignment condition is satisfied.
For any $\theta$, the data-architecture alignment condition at $\theta$ is necessary for the model $f(\cdot, \theta)$ to be globally optimal.
Theorem 1i guarantees the global convergence without the need to satisfy the data-architecture alignment condition at every iteration or at the limit point. Instead, it shows that the bound on the global optimality gap decreases toward zero per iteration whenever the data-architecture alignment condition holds. Theorem 1ii shows that the data-architecture alignment condition is necessary for global optimality. Intuitively, this is because the expressivity of a model class satisfying the common model structure assumption is restricted in such a way that the architecture must align with the data in order for the model class to contain the globally optimal model.
To better understand the statement of theorem 1i, consider a counterexample with a data set consisting of the single point , the model , and the squared loss . In this example, we have , which has multiple suboptimal local minima of different values. Then, via gradient descent, the model converges to the closest local minimum and, in particular, does not necessarily converge to a global minimum. Indeed, this example violates the common model structure assumption (assumption 1) (although it satisfies the data-architecture alignment condition), showing the importance of the common model structure assumption along with the data-architecture alignment. This also illustrates the nontriviality of theorem 1i in that the data-architecture alignment is not sufficient, and we needed to understand what types of model structures are commonly used in practice and formalize the understanding as the common model structure assumption.
Because the gradient of the objective is the product of the transposed Jacobian $\big(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big)^{\top}$ and the gradient of the loss with respect to the model outputs, any stationary point is a global minimum if the minimum eigenvalue of the matrix $\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big)^{\top}$ is nonzero, without the common model structure assumption (see assumption 1). Indeed, in the preceding example, the common model structure assumption is violated, but we still have global optimality at any stationary point at which the minimum eigenvalue of this matrix is nonzero. In contrast, theorem 1 allows the global convergence even when the minimum eigenvalue of this matrix is zero, by utilizing the common model structure assumption.
The formal version of theorem 1 is presented in section 2.4 and proved in appendix A in the supplementary information; the proof relies on the additional previous works of Clanuwat et al. (2019), Krizhevsky and Hinton (2009), Mityagin (2015), Netzer et al. (2011), Paszke et al. (2019a, 2019b), and Poggio et al. (2017). Before proving the statement, we first examine the meaning and implications of our results through illustrative examples in sections 2.2 and 2.3.
2.2 Illustrative Examples in Experiments
Theorem 1 suggests that the data-architecture alignment condition has the ability to distinguish the success and failure cases, even when the minimum eigenvalue of the matrix $\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big)^{\top}$ is zero for all $\theta$. In this section, we conduct experiments to further verify and deepen this theoretical understanding.
We employ a fully connected network having four layers with 300 neurons per hidden layer, and a convolutional network, LeNet (LeCun et al., 1998), with five layers. For the fully connected network, we use the two-moons data set (Pedregosa et al., 2011) and a sine wave data set. To create the sine wave data set, we randomly generated each input $x_i$ from the uniform distribution on a fixed interval and set each target $y_i$ to the value of a sine wave at $x_i$. For the convolutional network, we use the Semeion data set (Srl & Brescia, 1994) and a random data set. The random data set was created by randomly generating each pixel of the input image from the standard normal distribution and by sampling each label uniformly at random. We set the activation functions of all layers to be softplus with a large value of the hyperparameter $\beta$, which approximately behaves as the ReLU activation, as shown in appendix C in the supplementary information. See appendix B in the supplementary information for more details of the experimental settings.
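For concreteness, the two synthetic data sets can be generated along the following lines. The sample size, interval, image shape, and number of classes below are illustrative assumptions rather than the exact values used in our experiments (see appendix B in the supplementary information).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                               # sample size (illustrative assumption)

# Sine wave regression data: inputs drawn uniformly from an interval, targets on a sine wave.
x_sine = rng.uniform(0.0, 2.0 * np.pi, size=(n, 1))   # the interval is an assumption
y_sine = np.sin(x_sine)

# Random data set: standard normal pixels with labels drawn uniformly at random.
num_classes, image_shape = 10, (1, 16, 16)            # illustrative assumptions
x_rand = rng.standard_normal(size=(n, *image_shape))
y_rand = rng.integers(low=0, high=num_classes, size=n)
```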
Figures 1b and 1c show the results for the convolutional networks with two random initial points obtained using two different random seeds. In the figure panels, we report the training behaviors with different network sizes, where the number of convolutional filters per convolutional layer and the number of neurons per fully connected hidden layer scale with the network size. As can be seen, with the Semeion data set, the networks of all sizes achieved zero training error, with the data-architecture alignment condition holding at every step $t$. With the random data set, the deep networks yielded zero training error whenever the data-architecture alignment condition held at sufficiently many steps $t$. This is consistent with our theory.
Finally, we also confirmed that the gradient representation changed significantly from the initial one in our experiments. That is, the distances between the gradient representations at the end of training and the initial gradient representations were significantly large and tended to increase as the network size increases. Table 1 summarizes these values at the end of the training.
Table 1. Change of the gradient representation from initialization at the end of training. (a) Fully connected network: two-moons and sine wave data sets. (b) Convolutional network: Semeion and random data sets, with two random seeds per network size.
2.3 Implications
In section 2.1.3, we showed that an uncommon model structure does not satisfy assumption 1 and that assumption 1 is not required for global convergence if the minimum eigenvalue is nonzero. However, in practice, we typically use machine learning models that satisfy assumption 1 instead of models such as the one in the counterexample above, and the minimum eigenvalue is zero in many cases. In this context, theorem 1 provides the justification for common practice in nonlinear representation learning. Furthermore, theorem 1i contributes to the literature by identifying the common model structure assumption (assumption 1) and the data-architecture alignment condition (definition 2) as the novel and practical conditions that ensure the global convergence even when the minimum eigenvalue becomes zero. Moreover, theorem 1ii shows that this condition is not arbitrary in the sense that it is also necessary to obtain the globally optimal models. Furthermore, the data-architecture alignment condition is strictly more general than the condition of the minimum eigenvalue being nonzero, in the sense that the latter implies the former but not vice versa.
Our new theoretical understanding based on the data-architecture alignment condition can explain and deepen the previously known empirical observation that increasing network size tends to improve training behaviors. Indeed, the size of networks seems to correlate well with the training error to a certain degree in Figure 1b. However, the size and the training error do not correlate well in Figure 1c. Our new theoretical understanding explains that the training behaviors correlate more directly with the data-architecture alignment condition instead. The seeming correlation with the network size is indirect and caused by another correlation between the network size and the data-architecture alignment condition. That is, the condition more likely tends to hold when the network size is larger because the Jacobian matrix $\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}$ has $d$ columns, where $d$ is the number of parameters: that is, by increasing $d$, we can enlarge the column space and increase the chance of satisfying the condition.
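The following small numpy experiment illustrates this mechanism: as the number of columns $d$ of a random Jacobian-like matrix grows, the residual of projecting a fixed target onto its column space shrinks and reaches zero once the column space is large enough. The random matrices here are stand-ins for the Jacobian, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
target = rng.standard_normal(n)          # plays the role of vec(Y*)

for d in [5, 20, 50, 200]:               # number of parameters (columns of the Jacobian)
    J = rng.standard_normal((n, d))      # random stand-in for the Jacobian
    coef, *_ = np.linalg.lstsq(J, target, rcond=None)
    residual = np.linalg.norm(J @ coef - target)
    print(f"d = {d:3d}   projection residual = {residual:.3e}")
```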
Note that the minimum eigenvalue of the matrix $\frac{\partial \operatorname{vec}(f_X(\theta^t))}{\partial \theta^\top}\big(\frac{\partial \operatorname{vec}(f_X(\theta^t))}{\partial \theta^\top}\big)^{\top}$ is zero at all iterations in Figures 1b and 1c for all network sizes. Thus, Figures 1b and 1c also illustrate the fact that the dynamics can achieve the global convergence under the data-architecture alignment condition while the minimum eigenvalue of this matrix is zero. Moreover, because a multilayer neural network in the lazy training regime (Kawaguchi & Sun, 2021) achieves zero training errors for all of these data sets, Figure 1 additionally illustrates that our theoretical and empirical results apply to models outside of the lazy training regime and can distinguish "good" data sets from "bad" data sets given a learning algorithm.
In sum, our new theoretical understanding has the ability to explain and distinguish the successful case and the failure case based on the data-architecture alignment condition for the common machine learning models. Because the data-architecture alignment condition is dependent on data and architecture, theorem 1, along with our experimental results, shows why and when the global convergence in nonlinear representation learning is achieved based on the relationship between the data and the architecture. This new understanding is used in section 3 to derive a practical algorithm and is expected to be a basis for many future algorithms.
2.4 Details and Formalization of Theorem 1
2.4.1 Preliminaries
Let $(\theta^t)_{t \in \mathbb{N}_0}$ be the sequence defined by $\theta^{t+1} = \theta^t - \eta_t \bar{g}_t$ with an initial parameter vector $\theta^0$, a learning rate $\eta_t$, and an update vector $\bar{g}_t$. The analysis in this section relies on the following assumption on the update vector $\bar{g}_t$:
There exist $c_1, c_2 > 0$ such that $\langle \nabla L(\theta^t), \bar{g}_t \rangle \ge c_1 \|\nabla L(\theta^t)\|^2$ and $\|\bar{g}_t\| \le c_2 \|\nabla L(\theta^t)\|$ for any $t$, where $L$ denotes the training objective.
Assumption 2 is satisfied by using $\bar{g}_t = D_t \nabla L(\theta^t)$, where $D_t$ is any positive-definite symmetric matrix with eigenvalues in the interval $[c_1, c_2]$. If we set $D_t = I$, we have gradient descent, and assumption 2 is satisfied with $c_1 = c_2 = 1$. This section also uses the standard assumption of differentiability and Lipschitz continuity:
For every $i \in \{1, \dots, n\}$, the function $q \mapsto \ell(q, y_i)$ is differentiable and convex, the map $\theta \mapsto f(x_i, \theta)$ is differentiable, and $\|\nabla_q \ell(q, y_i) - \nabla_q \ell(q', y_i)\| \le \zeta \|q - q'\|$ for all $q, q'$ in the domain of $\ell(\cdot, y_i)$ for some $\zeta > 0$.
The assumptions on the loss function in assumption 3 are satisfied by using standard loss functions, including the squared loss, logistic loss, and cross-entropy loss. Although the objective function $\theta \mapsto L(\theta)$ is nonconvex and noninvex, the function $q \mapsto \ell(q, y_i)$ is typically convex.
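As a concrete illustration of the dynamics and of assumption 2, the following sketch runs the update $\theta^{t+1} = \theta^t - \eta_t \bar{g}_t$ with a preconditioned update vector $\bar{g}_t = D_t \nabla L(\theta^t)$, where $D_t$ is a positive-definite diagonal matrix with eigenvalues in $[c_1, c_2]$. The quadratic objective, step size, and constants are arbitrary stand-ins chosen for illustration; setting $D_t = I$ recovers plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
A, b = rng.standard_normal((10, 4)), rng.standard_normal(10)

def grad_L(theta):
    # Gradient of a stand-in objective L(theta) = 0.5 * ||A theta - b||^2.
    return A.T @ (A @ theta - b)

c1, c2 = 0.5, 2.0
theta, eta = np.zeros(4), 0.01

for t in range(200):
    g = grad_L(theta)
    # Positive-definite symmetric (here diagonal) preconditioner with eigenvalues in [c1, c2].
    D = np.diag(rng.uniform(c1, c2, size=theta.shape))
    g_bar = D @ g
    # The two inequalities of assumption 2 hold by construction of D.
    assert g @ g_bar >= c1 * (g @ g) - 1e-9
    assert np.linalg.norm(g_bar) <= c2 * np.linalg.norm(g) + 1e-9
    theta = theta - eta * g_bar

print("final objective value:", 0.5 * np.linalg.norm(A @ theta - b) ** 2)
```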
Suppose assumption 1 holds. If , then for any .
All proofs of this letter are presented in appendix A in the supplementary information.
2.4.2 Global Optimality at the Limit Point
The following theorem shows that every limit point of the sequence achieves a loss value no worse than for any such that for all with some :
In practice, one can easily satisfy all the assumptions in theorem 2 except for the condition that the data-architecture alignment condition holds for all $t$. Accordingly, we now weaken this condition by analyzing optimality at each iteration so that the condition is verifiable in experiments.
2.4.3 Global Optimality Gap at Each Iteration
The following theorem states that under standard settings, the optimality gap of the sequence decreases at the rate of $O(1/t)$ over the iterations $t$ at which the data-architecture alignment condition is satisfied:
3 Application to the Design of Training Framework
The results in the previous section show that the bound on the global optimality gap decreases per iteration whenever the data-architecture alignment condition holds. Using this theoretical understanding, in this section, we propose a new training framework with prior guarantees while learning hierarchical nonlinear representations without assuming the data-architecture alignment condition. As a result, we make significant improvements over the most closely related study on global convergence guarantees (Kawaguchi & Sun, 2021). In particular, whereas the related study requires a wide layer with a width larger than the number of data points $n$, our results reduce the requirement to a layer whose width can be far smaller than $n$. For example, the MNIST data set has $n = 60{,}000$, and hence previous studies require 60,000 neurons at a layer, whereas we require only a few hundred neurons at a layer (see Table 2). Our requirement is consistent with and satisfied by the models used in practice, which typically have from 256 to 1024 neurons at some layers.
3.1 Additional Notations
For an arbitrary matrix $M \in \mathbb{R}^{p \times q}$, we let $M_{*j}$ be its $j$th column vector in $\mathbb{R}^p$, $M_{i*}$ be its $i$th row vector in $\mathbb{R}^q$, and $\operatorname{rank}(M)$ be its matrix rank. We define $M \circ M'$ to be the Hadamard product of any matrices $M$ and $M'$ of the same size. For any vector $v \in \mathbb{R}^p$, we let $\operatorname{diag}(v) \in \mathbb{R}^{p \times p}$ be the diagonal matrix with $\operatorname{diag}(v)_{ii} = v_i$ for $i \in \{1, \dots, p\}$. We denote by $I_p$ the $p \times p$ identity matrix.
3.2 Exploration-Exploitation Wrapper
In this section, we introduce the exploration-exploitation (EE) wrapper . The EE wrapper is not a stand-alone training algorithm. Instead, it takes any training algorithm as its input and runs the algorithm in a particular way to guarantee global convergence. We note that the exploitation phase in the EE wrapper does not optimize the last layer; instead, it optimizes hidden layers, whereas the exploration phase optimizes all layers. The EE wrapper allows us to learn the representation that differs significantly from the initial representation without making assumptions on the minimum eigenvalue of the matrix by leveraging the data-architecture alignment condition. The data-architecture alignment condition is ensured by the safe-exploration condition (defined in section 3.3.1), which is time independent and holds in practical common architectures (as demonstrated in section 3.4).
3.2.1 Main Mechanisms
Algorithm 1 outlines the EE wrapper. During the exploration phase in lines 3 to 7 of algorithm 1, the EE wrapper freely explores hierarchical nonlinear representations to be learned without any restrictions. Then, during the exploitation phase in lines 8 to 12, it starts exploiting the current knowledge to ensure the data-architecture alignment condition for all subsequent iterations and thereby guarantee global convergence. The transition time is the hyperparameter that controls when the wrapper transitions from the exploration phase to the exploitation phase.
In the exploitation phase, the wrapper optimizes only the parameter vector at a designated hidden layer, instead of the parameter vector at the last layer. Despite this, the EE wrapper is proved to converge to global minima over all layers. The exploitation phase still allows us to significantly change the representations as training proceeds. This is because we optimize hidden layers instead of the last layer without any significant overparameterization.
The exploitation phase uses an arbitrary optimizer to compute the update vector for the designated hidden layer. During the two phases, we can use the same optimizer (e.g., SGD for both phases) or different optimizers (e.g., SGD for the exploration phase and L-BFGS for the exploitation phase).
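The following sketch shows the two-phase structure of the EE wrapper in PyTorch: an exploration phase that trains all parameters with a given algorithm, followed by an exploitation phase that updates only a chosen hidden layer while every other layer, including the last layer, stays fixed. It omits the model modification of section 3.2.2 and the exact layer selection of algorithm 1, so it illustrates only the mechanism; the learning rates and the arguments `model`, `hidden_layer`, `loader`, `tau`, and `total_epochs` are placeholders.

```python
import torch

def ee_wrapper_sketch(model, hidden_layer, loss_fn, loader, tau, total_epochs):
    """Two-phase training: explore with all parameters, then exploit by updating only
    the designated hidden layer while every other layer (including the last) stays fixed."""
    explore_opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    exploit_opt = torch.optim.SGD(hidden_layer.parameters(), lr=0.1, momentum=0.9)

    for epoch in range(total_epochs):
        exploring = epoch < tau          # tau controls the exploration-to-exploitation switch
        optimizer = explore_opt if exploring else exploit_opt
        for x, y in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model
```

Here, `hidden_layer` would be a module inside `model` (e.g., a hidden block of a ResNet); in algorithm 1, the choice of this layer and the accompanying model modification are specified precisely.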
3.2.2 Model Modification
3.3 Convergence Analysis
In this section, we establish global convergence of the EE wrapper without using assumptions from the previous section. Let be an arbitrary positive integer and be an arbitrary positive real number. Let be a sequence generated by the EE wrapper . We define and where for any .
3.3.1 Safe-Exploration Condition
The mathematical analysis in this section relies on the safe-exploration condition, which allows us to safely explore deep nonlinear representations in the exploration phase without getting stuck in states where the data-architecture alignment condition cannot be ensured. The safe-exploration condition is verifiable, time independent, data dependent, and architecture dependent. The verifiability and time independence make the assumption strong enough to provide prior guarantees before training. The data dependence and architecture dependence make the assumption weak enough to be applicable to a wide range of practical settings.
(Safe-Exploration Condition). There exist a and a such that .
The safe-exploration condition asks only for the existence of one parameter vector in the network architecture that satisfies the rank condition. It is not a condition on the training trajectory $(\theta^t)_t$. Moreover, the safe-exploration condition does not require any wide layer of width $n$ or larger; it requires only a layer whose width is far smaller (see Table 2). This is a significant improvement over the most closely related study (Kawaguchi & Sun, 2021), where a wide layer of width $n$ was required. Note that having such a layer does not imply the safe-exploration condition. Instead, it is a necessary condition to satisfy the safe-exploration condition, whereas a wide layer of width $n$ or a large input dimension was a necessary condition to satisfy the assumptions in previous papers, including the most closely related study (Kawaguchi & Sun, 2021). The safe-exploration condition is verified in experiments in section 3.4.
3.3.2 Additional Assumptions
We also use the following assumptions:
For any , the function is differentiable, and for all .
For each , the functions and are real analytic.
Assumption 5 is satisfied by using standard loss functions such as the squared loss and cross-entropy loss . The assumptions of the invexity and convexity of the function in sections 3.3.3 and 3.3.4 also hold for these standard loss functions. Using in assumption 5, we define , where is defined by with .
Assumption 6 is satisfied by using any analytic activation function, such as the sigmoid, hyperbolic tangent, and softplus activations with any hyperparameter $\beta > 0$. This is because a composition of real analytic functions is real analytic, and the following are all real analytic functions: the convolution, affine map, average pooling, skip connection, and batch normalization. Therefore, the assumptions can be satisfied by using a wide range of machine learning models, including deep neural networks with convolution, skip connection, and batch normalization. Moreover, the softplus activation can approximate the ReLU activation to any desired accuracy; that is, $\operatorname{softplus}_\beta(z) = \frac{1}{\beta}\log(1 + e^{\beta z}) \rightarrow \operatorname{relu}(z)$ as $\beta \rightarrow \infty$, where $\operatorname{relu}(z) = \max(z, 0)$ represents the ReLU activation.
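This approximation can be checked directly; the values of $\beta$ below are illustrative choices, and the maximum gap equals $\log(2)/\beta$ at $z = 0$.

```python
import torch

z = torch.linspace(-3.0, 3.0, steps=7)
relu = torch.relu(z)

for beta in [1.0, 10.0, 100.0]:
    softplus = torch.nn.functional.softplus(z, beta=beta)  # (1/beta) * log(1 + exp(beta * z))
    gap = (softplus - relu).abs().max().item()
    print(f"beta = {beta:6.1f}   max |softplus - relu| = {gap:.2e}")
```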
3.3.3 Global Optimality at the Limit Point
The following theorem proves the global optimality at limit points of the EE wrapper with a wide range of optimizers, including gradient descent and modified Newton methods:
Suppose assumptions 4 to 6 hold and that the function is invex for any . Assume that there exist such that and for any . Assume that the learning rate sequence satisfies either (i) for some or (ii) and . Then with probability one, every limit point of the sequence is a global minimum of as for all .
3.3.4 Global Optimality Gap at Each Iteration
We now present global convergence guarantees of the EE wrapper with gradient descent and SGD:
3.4 Experiments
This section presents empirical evidence to support our theory and what is predicted by a well-known hypothesis. We note that there is no related work or algorithm that can guarantee global convergence in the setting of our experiments, where the model has convolutions, skip connections, and batch normalization without any wide layer (of width larger than the number of data points). Moreover, unlike previous studies that propose new methods, our training framework works by modifying any given method.
3.4.1 Sine Wave Data Set
3.4.2 Image Data Sets
The standard convolutional ResNet with 18 layers (He, Zhang, Ren, & Sun, 2016) is used as the base model. We use ResNet-18 for the illustration of our theory because it is used in practice and it has convolution, skip connections, and batch normalization without any width larger than the number of data points. This setting is not covered by any of the previous theories for global convergence. We set the activation to be the softplus function with a large value of $\beta$ for all layers of the base ResNet. This approximates the ReLU activation well, as shown in appendix C in the supplementary information. We employ the cross-entropy loss. We use a standard algorithm, SGD, with its standard hyperparameter setting for the training algorithm: we let the minibatch size be 64, the momentum coefficient be 0.9, and the last epoch be 200 (with data augmentation) and 100 (without data augmentation), with standard values of the weight decay rate and the learning rate. The remaining hyperparameters were selected by a grid search using only training data. That is, we randomly divided each training data set (100%) into a smaller training data set (80%) and a validation data set (20%) for the grid search over the hyperparameters. See appendix B in the supplementary information for the results of the grid search and details of the experimental setting. This standard setting satisfies assumptions 5 and 6, leaving assumption 4 to be verified.
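The setup above corresponds to a configuration along the following lines; the learning rate and weight decay values shown here are placeholders (their exact values and the grid-searched hyperparameters are reported in appendix B), and the replacement of ReLU by softplus is omitted for brevity.

```python
import torch
import torchvision

# ResNet-18 base model (replacing ReLU by softplus with a large beta, as in the text, is omitted here).
model = torchvision.models.resnet18(num_classes=10)

loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # placeholder; the actual value is reported in appendix B
    momentum=0.9,       # momentum coefficient as stated in the text
    weight_decay=5e-4,  # placeholder for the weight decay rate
)
# Minibatch size 64; 200 epochs with data augmentation, 100 epochs without.
```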
Verification of Assumption 4.
Table 2 summarizes the verification results of the safe-exploration condition. Because the condition only requires the existence of a pair satisfying the condition, we verified it by using one vector randomly sampled from the standard normal distribution and one parameter vector returned by a common initialization scheme (He et al., 2015). As the width of the last hidden layer is 513 (512 + the constant neuron for the bias term) for the standard ResNet, we used this width throughout all the experiments with the ResNet. For each data set, the rank condition was verified twice by two standard methods: one from Press, Teukolsky, Vetterling, and Flannery (2007) and another from Golub and Van Loan (1996). A numerical sketch of such a rank check is given after Table 2.
Table 2. Verification of the safe-exploration condition (assumption 4).

Data Set | $n$ | $m$ | $n/m'$ | Assumption 4
---|---|---|---|---
MNIST | 60,000 | 513 | 234 | Verified
CIFAR-10 | 50,000 | 513 | 195 | Verified
CIFAR-100 | 50,000 | 513 | 195 | Verified
Semeion | 1,000 | 513 | 4 | Verified
KMNIST | 60,000 | 513 | 234 | Verified
SVHN | 73,257 | 513 | 286 | Verified
Note: $n$ is the number of training data, $m$ is the width of the last hidden layer, and $m'$ is the width of the penultimate hidden layer.
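The rank verification in Table 2 can be reproduced with standard numerical routines. The sketch below computes the rank of a given matrix in two ways, an SVD-based estimate and a QR-with-pivoting estimate, in the spirit of the two cited methods; the random matrix is a stand-in for the matrix appearing in the safe-exploration condition, and the pairing of routines to references is not asserted here.

```python
import numpy as np
from scipy.linalg import qr

def rank_svd(M, tol=1e-8):
    # Rank estimate from singular values.
    s = np.linalg.svd(M, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

def rank_qr(M, tol=1e-8):
    # Rank estimate from QR with column pivoting.
    _, R, _ = qr(M, pivoting=True)
    diag = np.abs(np.diag(R))
    return int(np.sum(diag > tol * diag[0]))

M = np.random.default_rng(0).standard_normal((513, 234))  # stand-in matrix for illustration
print(rank_svd(M), rank_qr(M))  # both report 234 (full column rank) for this random matrix
```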
Test Performance.
One well-known hypothesis is that the success of deep-learning methods partially comes from their ability to automatically learn deep nonlinear representations suitable for making accurate predictions from data (LeCun et al., 2015). As the EE wrapper keeps this ability of representation learning, the hypothesis suggests that the test performance of the EE wrapper of a standard method is approximately comparable to that of the standard method. Unlike typical experimental studies, our objective here is to confirm this prediction instead of showing improvements over a previous method. We empirically confirmed the prediction in Tables 3 and 4, where the numbers indicate the mean test errors (with standard deviations in parentheses) over five random trials. As expected, the change of the gradient representation from initialization was also large for both the standard method and the wrapper of the method with the Semeion data set.
Table 3. Mean test errors (%) over five random trials (standard deviations in parentheses).

Data Set | Standard | EE wrapper
---|---|---
MNIST | 0.40 (0.05) | 0.30 (0.05)
CIFAR-10 | 7.80 (0.50) | 7.14 (0.12)
CIFAR-100 | 32.26 (0.15) | 28.38 (0.42)
Semeion | 2.59 (0.57) | 2.56 (0.55)
KMNIST | 1.48 (0.07) | 1.36 (0.11)
SVHN | 4.67 (0.05) | 4.43 (0.11)
Table 4. Mean test errors (%) over five random trials (standard deviations in parentheses).

Data Set | Standard | EE wrapper
---|---|---
MNIST | 0.52 (0.16) | 0.49 (0.02)
CIFAR-10 | 15.15 (0.87) | 14.56 (0.38)
CIFAR-100 | 54.99 (2.29) | 46.13 (1.80)
Training Behavior.
Computational Time.
The EE wrapper runs the standard SGD in the exploration phase and SGD on only a subset of the weights in the exploitation phase. Thus, the computational time of the EE wrapper is similar to that of SGD in the exploration phase, and it tends to be faster than SGD in the exploitation phase. To confirm this, we measured computational time with the Semeion and CIFAR-10 data sets under the same computational resources (e.g., without running other jobs in parallel) on a local workstation for each method. The mean wall-clock time (in seconds) over five random trials is summarized in Table 5, where the numbers in parentheses are standard deviations. It shows that the EE wrapper is slightly faster than the standard method, as expected.
Table 5. Mean wall-clock time (seconds) over five random trials (standard deviations in parentheses).

Data Set | Standard | EE wrapper
---|---|---
Semeion | 364.60 (0.94) | 356.82 (0.67)
CIFAR-10 | 3616.92 (10.57) | 3604.5 (6.80)
Effect of Learning Rate and Optimizer.
We also conducted experiments on the effects of learning rates and optimizers using the MNIST data set with data augmentation. Using the best learning rate from {0.2, 0.1, 0.01, 0.001} for each method (with SGD), the mean test errors (%) over five random trials were 0.33 (0.03) for the standard base method and 0.27 (0.03) for the wrapper of the standard base method (the numbers in parentheses are standard deviations). Moreover, Table 6 reports the preliminary results on the effect of optimizers, with the exploitation-phase optimizer set to the limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithm (L-BFGS) and the exploration-phase optimizer set to the standard SGD. By comparing Tables 3 and 6, we can see that using a different optimizer in the exploitation phase can potentially lead to performance improvements. A comprehensive study of this phenomenon is left to future work.
Table 6. Test errors (%) with the exploitation-phase optimizer set to L-BFGS.

(a) With data augmentation

| | 0.4 | 0.6 | 0.8 |
|---|---|---|---|
| | 0.26 | 0.38 | 0.37 |
| | 0.37 | 0.32 | 0.37 |

(b) Without data augmentation

| | 0.4 | 0.6 | 0.8 |
|---|---|---|---|
| | 0.36 | 0.43 | 0.42 |
| | 0.42 | 0.35 | 0.35 |
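A switch to L-BFGS in the exploitation phase can be implemented with `torch.optim.LBFGS`, which requires a closure that re-evaluates the loss. The sketch below shows only this optimizer swap for the parameters of the designated hidden layer; the function name, learning rate, and `max_iter` are placeholder choices.

```python
import torch

def exploitation_step_lbfgs(model, hidden_layer, loss_fn, x, y):
    """One exploitation-phase update of only the designated hidden layer using L-BFGS."""
    optimizer = torch.optim.LBFGS(hidden_layer.parameters(), lr=0.1, max_iter=20)

    def closure():
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        return loss

    return optimizer.step(closure)
```

In practice, the L-BFGS optimizer would be constructed once and reused across steps so that its curvature history accumulates; it is created inside the function here only to keep the sketch self-contained.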
4 Conclusion
Despite the nonlinearity of the dynamics and the noninvexity of the objective, we have rigorously proved convergence of training dynamics to global minima for nonlinear representation learning. Our results apply to a wide range of machine learning models, allowing both underparameterization and overparameterization. For example, our results are applicable to the case where the minimum eigenvalue of the matrix $\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\big)^{\top}$ is zero for all $\theta$. Under the common model structure assumption, models that cannot achieve zero error for all data sets (except some "good" data sets) are shown to achieve global optimality with zero error exactly when the dynamics satisfy the data-architecture alignment condition. Our results provide guidance for choosing and designing model structures and algorithms via the common model structure assumption and the data-architecture alignment condition.
The key limitation in our analysis is the differentiability of the function $\theta \mapsto f(x, \theta)$. For multilayer neural networks, this is satisfied by using standard activation functions, such as softplus, sigmoid, and hyperbolic tangents. Whereas softplus can approximate ReLU arbitrarily well, the direct treatment of ReLU in nonlinear representation learning is left to future work.
Our theoretical results and numerical observations uncover novel mathematical properties and provide a basis for future work. For example, we have shown global convergence under the data-architecture alignment condition $\operatorname{vec}(Y^*) \in \operatorname{Col}\!\left(\frac{\partial \operatorname{vec}(f_X(\theta))}{\partial \theta^\top}\right)$. The EE wrapper is only one way to ensure this condition. There are many other ways to ensure the data-architecture alignment condition, and each way can result in a new algorithm with guarantees.