Understanding Dynamics of Nonlinear Representation Learning and Its Application

Abstract Representations of the world environment play a crucial role in artificial intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a deep neural network learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. This happens implicitly during training through minimizing a supervised or unsupervised loss. In this letter, we study the dynamics of such implicit nonlinear representation learning. We identify a new assumption and a novel condition, called the common model structure assumption and the data-architecture alignment condition. Under the common model structure assumption, the data-architecture alignment condition is shown to be sufficient for the global convergence and necessary for the global optimality. Moreover, our theory explains how and when increasing the network size does and does not improve the training behaviors in the practical regime. Our results provide practical guidance for designing a model structure; for example, the common model structure assumption can be used as a justification for using a particular model structure instead of others. As an application, we then derive a new training framework, which satisfies the data-architecture alignment condition without assuming it by automatically modifying any given training algorithm depending on the data and architecture. Given a standard training algorithm, the framework running its modified version is empirically shown to maintain competitive (practical) test performances while providing global convergence guarantees for deep residual neural networks with convolutions, skip connections, and batch normalization with standard benchmark datasets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN.

1 Introduction

LeCun, Bengio, and Hinton (LeCun et al., 2015) described deep learning as a family of hierarchical nonlinear representation learning approaches: "Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level." (p. 436) In applications such as computer vision and natural language processing, the success of deep learning can be attributed to its ability to learn hierarchical nonlinear representations by automatically changing nonlinear features and kernels during training based on the given data. This is in contrast to the neural tangent kernel (NTK) regime, extremely wide neural networks, and classical machine-learning methods, where representations, or equivalently nonlinear features and kernels, are (approximately) fixed during training.
To initiate a study of the dynamics of such nonlinear representation learning, we consider the following problem setup, which covers deep learning in the practical regime and other nonlinear representation learning methods in general. We are given a training dataset ((x_i, y_i))_{i=1}^n of n samples, where x_i ∈ X ⊆ R^{m_x} and y_i ∈ Y ⊆ R^{m_y} are the i-th input and the i-th target, respectively. We would like to learn a predictor (or hypothesis) from a parametric family H = {f(·, θ) : R^{m_x} → R^{m_y} | θ ∈ R^d} by minimizing the objective L(θ) = Σ_{i=1}^n ℓ(f(x_i, θ), y_i), where ℓ : R^{m_y} × Y → R_{≥0} is the loss function that measures the discrepancy between the prediction f(x_i, θ) and the target y_i for each sample. In this paper, it is allowed to let y_i = x_i (for i = 1, ..., n) to include the setting of unsupervised learning. The function f is also allowed to represent a wide range of machine learning models. For a (deep) neural network, f(x_i, θ) represents the pre-activation output of the last layer of the network, and the parameter vector θ ∈ R^d contains all the trainable parameters, including the weights and bias terms of all layers.
For example, one of the simplest models is the linear model in the form of f(x, θ) = φ(x)ᵀθ, where φ : X → R^d is a fixed function and φ(x) is a nonlinear representation of the input data x. This is a classical machine learning model where much of the effort goes into the design of the hand-crafted feature map φ via feature engineering (Turner et al., 1999; Zheng and Casari, 2018). In this linear model, we do not learn the representation φ(x) because the feature map φ is fixed without dependence on the model parameter θ that is optimized with the dataset ((x_i, y_i))_{i=1}^n. Similarly to many definitions in mathematics, where an intuitive notion in a special case is formalized into a definition for a more general case, we now abstract and generalize this intuitive notion of the representation φ(x) of the linear model to that of all differentiable models as follows:

Definition 1. Given any x ∈ X and differentiable function f, we define ∂f(x,θ)/∂θ to be the gradient representation of the data x under the model f at θ.
This definition recovers the standard representation φ(x) in the linear model as ∂f(x,θ)/∂θ = ∂(φ(x)ᵀθ)/∂θ = φ(x) and is applicable to all differentiable nonlinear models in representation learning. Moreover, this definition captures the key challenge of understanding the dynamics of nonlinear representation learning, as illustrated below. Using the notation dθ_t/dt = ∆_t, the dynamics of the model output f(x, θ_t) over the time t can be written as

df(x, θ_t)/dt = (∂f(x, θ_t)/∂θ_t) ∆_t.

Here, we can see that the dynamics are linear in ∆_t if there is no gradient representation learning, that is, if ∂f(x,θ_t)/∂θ_t ≈ ∂f(x,θ_0)/∂θ_0, which is indeed the case for the neural tangent kernel (NTK) regime and extremely wide neural networks. However, with representation learning in the practical regime, the gradient representation ∂f(x,θ_t)/∂θ_t changes depending on t (and ∆_t), resulting in dynamics that are nonlinear in ∆_t. Therefore, the definition of the gradient representation can distinguish fundamentally different dynamics in machine learning.
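To make Definition 1 concrete, the following minimal sketch (our illustration, not code from the paper) computes the gradient representation with PyTorch autograd for a small two-layer network and measures how far it moves from its value at initialization; the architecture, data, and step sizes here are illustrative assumptions.

```python
import torch

torch.manual_seed(0)

# Illustrative two-layer model f(x, theta) = w2 . tanh(W1 x), with m_y = 1.
W1 = torch.randn(16, 4, requires_grad=True)
w2 = torch.randn(16, requires_grad=True)
params = [W1, w2]

def f(x):
    return w2 @ torch.tanh(W1 @ x)

def grad_rep(x):
    """Gradient representation of Definition 1, flattened into one vector."""
    grads = torch.autograd.grad(f(x), params)
    return torch.cat([g.reshape(-1) for g in grads]).detach()

x, y = torch.randn(4), torch.tensor(1.0)
rep0 = grad_rep(x)                      # representation at theta_0

opt = torch.optim.SGD(params, lr=0.1)   # a few steps on an illustrative loss
for _ in range(200):
    opt.zero_grad()
    ((f(x) - y) ** 2).backward()
    opt.step()

# Outside the NTK regime, this relative change is typically far from zero.
print(((grad_rep(x) - rep0).norm() / rep0.norm()).item())
```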
In this paper, we initiate the study of the dynamics of learning gradient representations that are nonlinear in ∆_t. That is, we focus on the regime where the gradient representation ∂f(x,θ_t)/∂θ_t at the end of training time t differs greatly from the initial representation ∂f(x,θ_0)/∂θ_0, unlike the NTK regime where we have ∂f(x,θ_t)/∂θ_t ≈ ∂f(x,θ_0)/∂θ_0. This practical regime of representation learning was studied in the past for the case where the map φ(x) ↦ f(x, θ) is affine for some fixed feature map φ (Saxe et al., 2014; Kawaguchi, 2016; Laurent and Brecht, 2018; Bartlett et al., 2019; Zou et al., 2020; Kawaguchi, 2021; Xu et al., 2021). Unlike previous studies, we focus on the problem setting where the map φ(x) ↦ f(x, θ) is nonlinear and non-affine, with the effect of nonlinear (gradient) representation learning having ∂f(x,θ_t)/∂θ_t ≉ ∂f(x,θ_0)/∂θ_0. The results of this paper avoid the curse of dimensionality by studying the global convergence of the gradient-based dynamics instead of the dynamics of global optimization (Kawaguchi et al., 2016) and Bayesian optimization (Kawaguchi et al., 2015). Importantly, we do not require any wide layer or large input dimension throughout this paper. Our main contributions are summarized as follows:

1. In Section 2, we identify a novel assumption and a new condition, called the common model structure assumption and the data-architecture alignment condition. Under the common model structure assumption, the data-architecture alignment condition is shown to be a necessary condition for the globally optimal model and a sufficient condition for the global convergence. The condition depends on both the data and the architecture. Moreover, we empirically verify and deepen this new understanding. When we apply representation learning in practice, we often have overwhelmingly many options regarding which model structure to use. Our results provide practical guidance for choosing or designing a model structure via the common model structure assumption, which is indeed satisfied by many representation learning models used in practice.
2. In Section 3, we discard the assumption of the data-architecture alignment condition. Instead, we derive a novel training framework, called the exploration-exploitation wrapper (EE wrapper), which satisfies the data-architecture alignment condition time-independently a priori. The EE wrapper is then proved to have global convergence guarantees under the safe-exploration condition. The safe-exploration condition is what allows us to explore various gradient representations safely without getting stuck in the states where we cannot provably satisfy the data-architecture alignment condition. The safe-exploration condition is shown to hold true for ResNet-18 with standard benchmark datasets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN, time-independently.
3. In Subsection 3.4, the EE wrapper is shown not to degrade the practical performance of ResNet-18 on the standard datasets MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN. To our knowledge, the present paper provides the first practical algorithm with global convergence guarantees that does not degrade the practical performance of ResNet-18 on these standard datasets, using convolutions, skip connections, and batch normalization without any extremely wide layer of width larger than the number of data points; we are not aware of any similar algorithms with global convergence guarantees in the regime of learning nonlinear representations.
It is empirically known that increasing the network size tends to improve the training behaviors. Indeed, the size of networks correlates well with the training error in many cases in our experiments (e.g., Figure 1(b)). However, the size and the training error do not correlate well in some experiments (e.g., Figure 1(c)). Our new theoretical results explain that the training behaviors correlate more directly with the data-architecture alignment condition instead. The seeming correlation with the network size is caused by another correlation between the network size and the data-architecture alignment condition. This is explained in more detail in Subsection 2.3.

2 Understanding Dynamics via Common Model Structure and Data-Architecture Alignment
In this section, we identify the common model structure assumption and study the data-architecture alignment condition for the global convergence in nonlinear representation learning. We begin by presenting an overview of our results in Subsection 2.1, deepen our understanding via experiments in Subsection 2.2, discuss implications of our results in Subsection 2.3, and establish mathematical theories in Subsection 2.4.

2.1 Overview
We introduce the common model structure assumption in Subsection 2.1.1 and define the data-architecture alignment condition in Subsection 2.1.2. Using the assumption and the condition, we present the global convergence result in Subsection 2.1.3.

2.1.1 Common Model Structure Assumption
Through examinations of representation learning models used in applications, we identified and formalized one of their common properties as follows:

Assumption 1. (Common Model Structure Assumption) There exists a subset S ⊆ {1, ..., d} such that f(x, θ) = Σ_{k∈S} θ_k (∂f(x,θ)/∂θ_k) for all (x, θ) ∈ X × R^d.

Assumption 1 is satisfied by common machine learning models, such as kernel models and multilayer neural networks, with or without convolutions, batch normalization, pooling, and skip connections. For example, consider a multilayer neural network of the form f(x, θ) = W h(x, u) + b, where h(x, u) is the output of its last hidden layer and the parameter vector θ consists of the parameters (W, b) at the last layer and the parameters u in all other layers as θ = vec([W, b, u]). Here, for any matrix M ∈ R^{m×m′}, we let vec(M) ∈ R^{mm′} be the standard vectorization of the matrix M obtained by stacking its columns. Then, Assumption 1 holds with S being the set of the indices of the entries of (W, b) in θ, because W h(x, u) + b = Σ_{k∈S} θ_k (∂f(x,θ)/∂θ_k). Since h is arbitrary in this example, the common model structure assumption holds, for example, for any multilayer neural network with a fully-connected last layer. In general, because the nonlinearity at the output layer can be treated as a part of the loss function while preserving the convexity of q ↦ ℓ(q, y) (e.g., cross-entropy loss with softmax), this assumption is satisfied by many machine learning models, including ResNet-18 and all models used in the experiments in this paper (as well as all linear models). Moreover, Assumption 1 is automatically satisfied in the next section by using the EE wrapper.

2.1.2 Data-Architecture Alignment Condition
Given a target matrix Y = (y_1, y_2, ..., y_n)ᵀ ∈ R^{n×m_y} and a loss function ℓ, we define the modified target matrix Y* = (y*_1, y*_2, ..., y*_n)ᵀ ∈ R^{n×m_y} by Y* = Y for the squared loss ℓ, and by (Y*)_ij = 2Y_ij − 1 for the (binary and multi-class) cross-entropy losses with Y_ij ∈ {0, 1}. Given the input matrix X = (x_1, x_2, ..., x_n)ᵀ ∈ R^{n×m_x}, the output matrix f_X(θ) ∈ R^{n×m_y} is defined by f_X(θ)_ij = f(x_i, θ)_j ∈ R. For any matrix M ∈ R^{m×m′}, we let Col(M) ⊆ R^m be its column space. With these notations, we are now ready to introduce the data-architecture alignment condition:

Definition 2. (Data-Architecture Alignment Condition) Given any dataset (X, Y), differentiable function f, and loss function ℓ, the data-architecture alignment condition is said to be satisfied at θ if vec(Y*) ∈ Col(∂vec(f_X(θ))/∂θ).
The data-architecture alignment condition depends on both the data (through the target Y and the input X) and the architecture (through the model f). It is satisfied only when the data and architecture align well with each other. For example, in the case of the linear model f(x, θ) = φ(x)ᵀθ ∈ R, the condition can be written as vec(Y*) ∈ Col(Φ(X)), where Φ(X) ∈ R^{n×d} and Φ(X)_ij = φ(x_i)_j. In Definition 2, f_X(θ) is a matrix of the pre-activation outputs of the last layer. Thus, in the case of classification tasks with a nonlinear activation at the output layer, f_X(θ) and Y are not in the same space, which is the reason why we use Y* here instead of Y.
Importantly, the data-architecture alignment condition does not make any requirements on the rank of the Jacobian matrix ∂vec(f_X(θ))/∂θ ∈ R^{nm_y×d}: i.e., the rank of ∂vec(f_X(θ))/∂θ is allowed to be smaller than nm_y and d. Thus, for example, the data-architecture alignment condition can be satisfied, depending on the given data and architecture, even if the minimum eigenvalue of the matrix (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ is zero, in both cases of over-parameterization (e.g., d ≫ n) and under-parameterization (e.g., d ≪ n). This is further illustrated in Subsection 2.2 and discussed in Subsection 2.3. We note that we further discard the assumption of the data-architecture alignment condition in Section 3, as it is automatically satisfied by using the EE wrapper.
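Numerically, the condition of Definition 2 can be checked with a rank test: vec(Y*) lies in Col(J) exactly when appending vec(Y*) as an extra column to J = ∂vec(f_X(θ))/∂θ does not increase the rank. The sketch below is our illustration (not the paper's code); summing the indicator of its failure over training steps t gives the quantity Q_T used in Subsection 2.2.

```python
import numpy as np

def alignment_holds(J, y_star_vec, tol=1e-8):
    """J: (n*m_y, d) Jacobian d vec(f_X(theta)) / d theta; y_star_vec: (n*m_y,).
    Returns True iff vec(Y*) is in Col(J); equivalently, the least-squares
    residual of J w = vec(Y*) is (numerically) zero."""
    r = np.linalg.matrix_rank(J, tol=tol)
    r_aug = np.linalg.matrix_rank(np.column_stack([J, y_star_vec]), tol=tol)
    return r_aug == r

# Illustrative example: J is rank-deficient (min eigenvalue of J J^T is zero),
# yet the condition can still hold when vec(Y*) happens to lie in Col(J).
J = np.array([[1., 0., 1.],
              [0., 1., 1.],
              [1., 1., 2.]])                             # rank 2 < n = 3
print(alignment_holds(J, J @ np.array([1., 2., 0.])))    # True: inside Col(J)
print(alignment_holds(J, np.array([1., 0., 0.])))        # False: outside Col(J)
```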

2.1.3 Global Convergence
Under the common model structure assumption, the data-architecture alignment condition is shown to be what lets us avoid the failure of the global convergence and suboptimal local minima. More concretely, we prove a global convergence guarantee under the data-architecture alignment condition, as well as the necessity of the condition for the global optimality:

Theorem 1. (Informal Version) Let Assumption 1 hold. Then, the following two statements hold for gradient-based dynamics:

(i) The global optimality gap bound decreases per iteration towards zero at the rate of O(1/√|T|) for any T such that the data-architecture alignment condition is satisfied at θ_t for t ∈ T.
(ii) For any θ ∈ R^d, the data-architecture alignment condition at θ is necessary to have the globally optimal model f_X(θ) = ηY* at θ for any η ∈ R.
Theorem 1(i) guarantees the global convergence without the need to satisfy the data-architecture alignment condition at every iteration or at the limit point. Instead, it shows that the bound on the global optimality gap decreases towards zero per iteration whenever the data-architecture alignment condition holds. Theorem 1(ii) shows that the data-architecture alignment condition is necessary for the global optimality. Intuitively, this is because the expressivity of a model class satisfying the common model structure assumption is restricted in such a way that the architecture is required to align with the data in order for the class to contain the globally optimal model f_X(θ) = ηY* (for any η ∈ R).
To better understand the statement of Theorem 1(i), consider a counterexample with a dataset consisting of the single point (x, y) = (1, 0), the model f(x, θ) = θ⁴ − 10θ² + 6θ + 100, and the squared loss ℓ(q, y) = (q − y)². In this example, we have L(θ) = f(x, θ)², which has multiple suboptimal local minima of different values. Then, via gradient descent, the model converges to the closest local minimum and, in particular, does not necessarily converge to a global minimum. Indeed, this example violates the common model structure assumption (Assumption 1) (although it satisfies the data-architecture alignment condition), showing the importance of the common model structure assumption along with the data-architecture alignment. This also illustrates the non-triviality of Theorem 1(i) in that the data-architecture alignment is not sufficient by itself, and we needed to understand what types of model structures are commonly used in practice and to formalize this understanding as the common model structure assumption.
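This failure is easy to reproduce numerically; the following minimal sketch (our construction, with illustrative step size and initial points) runs gradient descent on L(θ) = f(x, θ)² from two starting points and converges to two different local minima, only one of which is global.

```python
# Gradient descent on the counterexample L(theta) = f(x, theta)^2 with
# f(x, theta) = theta^4 - 10 theta^2 + 6 theta + 100 and y = 0.
def f(theta):
    return theta**4 - 10 * theta**2 + 6 * theta + 100

def grad_L(theta):
    # dL/dtheta = 2 f(theta) f'(theta) for the squared loss with target 0.
    return 2 * f(theta) * (4 * theta**3 - 20 * theta + 6)

def gd(theta, lr=1e-4, steps=20000):
    for _ in range(steps):
        theta -= lr * grad_L(theta)
    return theta

t1, t2 = gd(3.0), gd(-3.0)
print(t1, f(t1) ** 2)   # ~ 2.06, L ~ 7.7e3 (suboptimal local minimum)
print(t2, f(t2) ** 2)   # ~ -2.37, L ~ 3.7e3 (global minimum)
```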
To further understand the importance of the common model structure assumption in Theorem 1, let us now consider the case where we do not require the assumption. Indeed, we can guarantee the global convergence without the common model structure assumption if we ensure that the minimum eigenvalue of the matrix (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ is nonzero. This can be proved by the following derivation. Let θ be an arbitrary stationary point of L. Then, we have 0 = ∇L(θ) = (∂vec(f_X(θ))/∂θ)ᵀv, where v ∈ R^{nm_y} is the vector stacking the loss gradients ∂ℓ(q, y_i)/∂q |_{q=f(x_i,θ)} over i = 1, ..., n. If the minimum eigenvalue of (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ is nonzero, then we have v = 0: i.e., ∂ℓ(q, y_i)/∂q |_{q=f(x_i,θ)} = 0 for all i ∈ {1, 2, ..., n}. Using the convexity of the map q ↦ ℓ(q, y) (which is satisfied by the squared loss and cross-entropy loss), this implies that, for any q_1, q_2, ..., q_n ∈ R^{m_y}, Σ_{i=1}^n ℓ(f(x_i, θ), y_i) ≤ Σ_{i=1}^n ℓ(q_i, y_i). Since f(x_1, θ′), f(x_2, θ′), ..., f(x_n, θ′) ∈ R^{m_y} for every θ′ ∈ R^d, this implies that any stationary point θ is a global minimum if the minimum eigenvalue of the matrix (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ is nonzero, without the common model structure assumption (Assumption 1). Indeed, in the above example with the model f(x, θ) = θ⁴ − 10θ² + 6θ + 100, the common model structure assumption is violated, but we still have the global convergence if the minimum eigenvalue is nonzero: e.g., f(x, θ) = y = 0 at any stationary point θ such that the minimum eigenvalue of the matrix (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ is nonzero. In contrast, Theorem 1 allows the global convergence even when the minimum eigenvalue of the matrix is zero, by utilizing the common model structure assumption. The formal version of Theorem 1 is presented in Subsection 2.4 and is proved in Appendix A in the Supplementary Information. Before proving the statement, we first examine the meaning and implications of our results through illustrative examples in the next two subsections.

2.2 Illustrative Examples in Experiments
Theorem 1 suggests that the data-architecture alignment condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) has the ability to distinguish the success and failure cases, even when the minimum eigenvalue of the matrix M(θ_t) = (∂vec(f_X(θ_t))/∂θ_t)(∂vec(f_X(θ_t))/∂θ_t)ᵀ is zero for all t ≥ 0. In this subsection, we conduct experiments to further verify and deepen this theoretical understanding.
We employ a fully-connected network having four layers with 300 neurons per hidden layer, and a convolutional network, LeNet (LeCun et al., 1998), with five layers. For the fully-connected network, we use the two-moons dataset (Pedregosa et al., 2011) and a sine wave dataset. To create the sine wave dataset, we randomly generated the input x_i from the uniform distribution on the interval [−1, 1] and set y_i = 1{sin(20x_i) < 0} ∈ R for all i ∈ [n] with n = 100. For the convolutional network, we use the Semeion dataset (Srl and Brescia, 1994) and a random dataset. The random dataset was created by randomly generating each pixel of the input image x_i ∈ R^{16×16×1} from the standard normal distribution and by sampling y_i uniformly from {0, 1} for all i ∈ [n] with n = 1000. We set the activation functions of all layers to be the softplus σ(z) = ln(1 + exp(ςz))/ς with ς = 100, which approximately behaves as the ReLU activation, as shown in Appendix C in the Supplementary Information. See Appendix B in the Supplementary Information for more details of the experimental settings.
The results of the experiments are presented in Figure 1. In each subfigure, the training errors and the values of Q_T are plotted over the time T. Here, Q_T counts the number of time steps t ∈ {0, 1, ..., T} at which the data-architecture alignment condition is not satisfied, and is defined by Q_T = Σ_{t=0}^{T} 1{vec(Y*) ∉ Col(∂vec(f_X(θ_t))/∂θ_t)}. Figure 1(a) shows the results for the fully-connected network. For the two-moons dataset, the network achieved the zero training error with Q_T = 0 for all T (i.e., vec(Y*) ∈ Col(∂vec(f_X(θ_T))/∂θ_T) for all T). For the sine wave dataset, it obtained high training errors with Q_T = T for all T (i.e., vec(Y*) ∉ Col(∂vec(f_X(θ_T))/∂θ_T) for all T). This is consistent with our theory, which explains that what makes a dataset easy to fit or not is whether the condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) is satisfied or not. Figure 1(b) and (c) show the results for the convolutional networks with two random initial points using two different random seeds. In the subfigures, we report the training behaviors with different network sizes m_c = 1, 2, and 4: i.e., the number of convolutional filters per convolutional layer is 8 × m_c, and the number of neurons per fully-connected hidden layer is 128 × m_c. As can be seen, with the Semeion dataset, the networks of all sizes achieved the zero error with Q_T = 0 for all T. With the random dataset, the deep networks yielded the zero training error whenever Q_T is not linearly increasing over the time, or equivalently, whenever the condition vec(Y*) ∈ Col(∂vec(f_X(θ_T))/∂θ_T) holds at sufficiently many steps T. This is consistent with our theory, and the figures validate our theoretical understanding that the bound on the global optimality gap decreases at any iteration at which Q_T does not increase, i.e., at which the Q_T curve is flat. The minimum eigenvalue of the matrix M(θ_t) is zero at all iterations in Figure 1(b) and (c) for all cases with m_c = 1.
Finally, we also confirmed that the gradient representation ∂f(x,θ_t)/∂θ_t changed significantly from the initial one ∂f(x,θ_0)/∂θ_0 in our experiments. That is, the values of ‖M(θ_T) − M(θ_0)‖²_F were significantly large and tended to increase as T increases, where the matrix M(θ) ∈ R^{nm_y×nm_y} is defined by M(θ) = (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ. Table 1 summarizes the values of ‖M(θ_T̄) − M(θ_0)‖²_F at the end of the training, where T̄ is the last time step.

Table 1: Values of ‖M(θ_T̄) − M(θ_0)‖²_F at the end of the training, where T̄ is the last time step.

Dataset | m_c = 4 (seed#1) | m_c = 4 (seed#2) | m_c = 2 (seed#1) | m_c = 2 (seed#2) | m_c = 1 (seed#1) | m_c = 1 (seed#2)
Semeion | 8.09×10^12 | 5.19×10^12 | 9.82×10^12 | 3.97×10^12 | 2.97×10^12 | 5.41×10^12
Random  | 3.73×10^12 | 1.64×10^12 | 3.43×10^7  | 4.86×10^12 | 1.40×10^7  | 8.57×10^11
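A sketch (our illustration) of how M(θ) and the change ‖M(θ_T) − M(θ_0)‖²_F can be computed for a small PyTorch model follows; for the row-by-row Jacobian loop, the output dimension n·m_y is assumed small enough to iterate over.

```python
import torch

def full_jacobian(model, X):
    """J = d vec(f_X(theta)) / d theta, of shape (n * m_y, d)."""
    params = [p for p in model.parameters() if p.requires_grad]
    out = model(X).reshape(-1)            # vec(f_X(theta)), length n * m_y
    rows = []
    for k in range(out.numel()):
        grads = torch.autograd.grad(out[k], params, retain_graph=True)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

def gram(model, X):
    """M(theta) = J J^T, the (n m_y) x (n m_y) Gram matrix of the rows of J."""
    J = full_jacobian(model, X)
    return (J @ J.T).detach()

# Usage sketch: M0 = gram(model, X) at initialization; train the model;
# MT = gram(model, X); change = ((MT - M0) ** 2).sum()   # ||M_T - M_0||_F^2
```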

2.3 Implications
In Subsection 2.1.3, we showed that an uncommon model structure f(x, θ) = θ⁴ − 10θ² + 6θ + 100 does not satisfy Assumption 1 and that Assumption 1 is not required for global convergence if the minimum eigenvalue is nonzero. However, in practice, we typically use machine learning models that satisfy Assumption 1 instead of models such as f(x, θ) = θ⁴ − 10θ² + 6θ + 100, and the minimum eigenvalue is zero in many cases. In this context, Theorem 1 provides the justification for common practice in nonlinear representation learning. Furthermore, Theorem 1(i) contributes to the literature by identifying the common model structure assumption (Assumption 1) and the data-architecture alignment condition (Definition 2) as novel and practical conditions that ensure the global convergence even when the minimum eigenvalue becomes zero. Moreover, Theorem 1(ii) shows that this condition is not arbitrary in the sense that it is also necessary to obtain the globally optimal models. Furthermore, the data-architecture alignment condition is strictly more general than the condition of the minimum eigenvalue being nonzero, in the sense that the latter implies the former but not vice versa.
Our new theoretical understanding based on the data-architecture alignment condition can explain and deepen the previously known empirical observation that increasing the network size tends to improve the training behaviors. Indeed, the size of networks seems to correlate well with the training error to a certain degree in Figure 1(b). However, the size and the training error do not correlate well in Figure 1(c). Our new theoretical understanding explains that the training behaviors correlate more directly with the data-architecture alignment condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) instead. The seeming correlation with the network size is indirect and caused by another correlation between the network size and the condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t). That is, the condition tends to be more likely to hold when the network size is larger, because the matrix ∂vec(f_X(θ_t))/∂θ_t is of size nm_y × d, where d is the number of parameters: i.e., by increasing d, we can enlarge the column space Col(∂vec(f_X(θ_t))/∂θ_t) and thereby increase the chance of satisfying the condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t).
Note that the minimum eigenvalue of the matrix M(θ_t) is zero at all iterations in Figure 1(b) and (c) for all cases of m_c = 1. Thus, Figure 1(b) and (c) also illustrates the fact that, while having the zero minimum eigenvalue of the matrix M(θ_t), the dynamics can achieve the global convergence under the data-architecture alignment condition. Moreover, because multilayer neural networks in the lazy training regime (Kawaguchi and Sun, 2021) achieve zero training errors for all datasets, Figure 1 additionally illustrates the fact that our theoretical and empirical results apply to models outside of the lazy training regime and can distinguish 'good' datasets from 'bad' datasets given a learning algorithm.
In sum, our new theoretical understanding has the ability to explain and distinguish the successful and failure cases based on the data-architecture alignment condition for common machine learning models. Because the data-architecture alignment condition depends on the data and architecture, Theorem 1, along with our experimental results, shows why and when the global convergence in nonlinear representation learning is achieved based on the relationship between the data (X, Y) and the architecture f. This new understanding is used in Section 3 to derive a practical algorithm and is expected to be a basis for many future algorithms.

2.4 Details and Formalization of Theorem 1
This subsection presents the precise mathematical statements that formalize the informal description of Theorem 1. In the following subsections, we divide the formal version of Theorem 1 into Theorem 2, Theorem 3, and Proposition 1.

2.4.1 Preliminaries
Let (θ_t)_{t=0}^∞ be the sequence defined by θ_{t+1} = θ_t − α_t ḡ_t with an initial parameter vector θ_0, a learning rate α_t, and an update vector ḡ_t. The analysis in this section relies on the following assumption on the update vector ḡ_t:

Assumption 2. There exist c, c̄ > 0 such that c‖∇L(θ_t)‖² ≤ ∇L(θ_t)ᵀḡ_t and ‖ḡ_t‖² ≤ c̄‖∇L(θ_t)‖² for any t ≥ 0.

For example, Assumption 2 is satisfied by setting ḡ_t = D_t∇L(θ_t), where D_t is a positive definite symmetric matrix with eigenvalues in the interval [c, √c̄]. If we set D_t = I, we have gradient descent, and Assumption 2 is satisfied with c = c̄ = 1. This section also uses the standard assumption of differentiability and Lipschitz continuity:

Assumption 3. For any i ∈ [n], the function ℓ_i : q ↦ ℓ(q, y_i) is differentiable and convex, the function θ ↦ f(x_i, θ) is differentiable, and ∇L is Lipschitz continuous with Lipschitz constant L̂ > 0.

The assumptions on the loss function in Assumption 3 are satisfied by using standard loss functions, including the squared loss, logistic loss, and cross-entropy loss. Although the objective function L is non-convex and non-invex, the function q ↦ ℓ(q, y_i) is typically convex.
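As an illustration of Assumption 2 (our sketch, not the paper's code), the preconditioned update below satisfies both inequalities whenever the symmetric matrix D has eigenvalues in [c, √c̄], since then ∇LᵀD∇L ≥ c‖∇L‖² and ‖D∇L‖² ≤ c̄‖∇L‖²; D = I recovers plain gradient descent with c = c̄ = 1.

```python
import numpy as np

def preconditioned_step(theta, grad, lr, D=None):
    """One step theta_{t+1} = theta_t - lr * g_t with g_t = D grad (D = I if None)."""
    g = grad if D is None else D @ grad
    return theta - lr * g

# Example: any symmetric positive definite D works, with c = lambda_min(D)
# and c_bar = lambda_max(D)^2 in Assumption 2.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
D = A @ A.T + 0.5 * np.eye(5)   # eigenvalues bounded away from zero
theta = preconditioned_step(np.ones(5), rng.standard_normal(5), lr=0.1, D=D)
```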
For any matrix Y* = (y*_1, y*_2, ..., y*_n)ᵀ ∈ R^{n×m_y}, we define L*(Y*) = Σ_{i=1}^n ℓ(y*_i, y_i). For example, for the squared loss ℓ, the value of L*(Y) is at most the global minimum value of L, as

L*(Y) = Σ_{i=1}^n ℓ(y_i, y_i) = 0 ≤ inf_{θ∈R^d} L(θ). (9)

This paper also uses the notation [k] = {1, 2, ..., k} for any k ∈ N₊ and ‖·‖ = ‖·‖₂ (Euclidean norm). Finally, we note that, for any η ∈ R, the condition vec(Y*) ∈ Col(∂vec(f_X(θ))/∂θ) is necessary to learn a near globally optimal model f_X(θ) = ηY*:

Proposition 1. Let Assumption 1 hold. Then, for any θ ∈ R^d, if vec(Y*) ∉ Col(∂vec(f_X(θ))/∂θ), then f_X(θ) ≠ ηY* for any η ∈ R with ηY* ≠ 0.

Proof. All proofs of this paper are presented in Appendix A in the Supplementary Information.

2.4.2 Global Optimality at the Limit Point
The following theorem shows that every limit point θ̄ of the sequence (θ_t)_t achieves a loss value L(θ̄) no worse than inf_{η∈R} L*(ηY*) for any Y* for which the data-architecture alignment condition eventually holds:

Theorem 2. Suppose Assumptions 1-3 hold. Assume that the learning rate sequence (α_t)_t satisfies either (i) ϵ ≤ α_t ≤ c(2 − ϵ)/(L̂c̄) for some ϵ > 0, or (ii) lim_{t→∞} α_t = 0 and Σ_{t=0}^∞ α_t = ∞. Then, for any τ ∈ N₀ and Y* ∈ R^{n×m_y} such that vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) for all t ∈ [τ, ∞), every limit point θ̄ of the sequence (θ_t)_t satisfies L(θ̄) ≤ inf_{η∈R} L*(ηY*).

For example, for the squared loss ℓ(q, y) = ‖q − y‖², Theorem 2 implies that every limit point θ̄ of the sequence (θ_t)_t is a global minimum, as L(θ̄) ≤ L*(Y) ≤ L(θ) for all θ ∈ R^d from Theorem 2 and equation (9).
In practice, one can easily satisfy all the assumptions in Theorem 2 except for the condition that vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) for all t ∈ [τ, ∞). Accordingly, we will now weaken this condition by analyzing optimality at each iteration, so that the condition is verifiable in experiments.

2.4.3 Global Optimality Gap at Each Iteration
The following theorem states that, under standard settings, the sequence (θ_t)_{t∈T} converges to a loss value no worse than inf_{η∈R} L*(ηY*) at the rate of O(1/√|T|) for any T at which the data-architecture alignment condition holds:

Theorem 3. Suppose Assumptions 1 and 3 hold. Let (α_t, ḡ_t) = (2α/L̂, ∇L(θ_t)) with an arbitrary α ∈ (0, 1). Then, for any T ⊆ N₀ and Y* ∈ R^{n×m_y} such that vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) for all t ∈ T, it holds that

min_{t∈T} L(θ_t) ≤ inf_{η∈R} [L*(ηY*) + C(η)/√|T|],

where C(η) ≥ 0 depends on L̂, α, L(θ_{t_0}) with t_0 = min{t : t ∈ T}, and the vectors ν(θ_t) − β̄(θ_t, η) with β̄(θ_t, η) satisfying (∂vec(f_X(θ_t))/∂θ_t)β̄(θ_t, η) = vec(ηY*), and where ν(θ)_k = θ_k for all k ∈ S and ν(θ)_k = 0 for all k ∉ S, with the set S being defined in Assumption 1.
For the squared loss ℓ, Theorem 3 implies that, for any T with |T| ≥ 1, min_{t∈T} L(θ_t) ≤ C(1)/√|T|, since L*(Y*) = L*(Y) = 0. Similarly, for the binary and multi-class cross-entropy losses, Theorem 3 implies that, for any T with |T| ≥ 1 and any η ∈ R,

min_{t∈T} L(θ_t) ≤ L*(ηY*) + C(η)/√|T|. (14)

Given any desired ϵ̄ > 0, since L*(ηY*) → 0 as η → ∞, setting η to be sufficiently large obtains the desired value as min_{t∈T} L(θ_t) ≤ ϵ̄ in equation (14) as |T| → ∞.

3 Application to the Design of Training Framework
The results in the previous section show that the bound on the global optimality gap decreases per iteration whenever the data-architecture alignment condition holds. Using this theoretical understanding, in this section, we propose a new training framework with prior guarantees that learns hierarchical nonlinear representations without assuming the data-architecture alignment condition. As a result, we make significant improvements over the most closely related study on global convergence guarantees (Kawaguchi and Sun, 2021). In particular, whereas the related study requires a wide layer with a width larger than n, our results reduce the requirement to a layer with a width larger than √n. For example, the MNIST dataset has n = 60000, and hence previous studies require 60000 neurons at a layer, whereas we only require √60000 ≈ 245 neurons at a layer. Our requirement is consistent with, and satisfied by, the models used in practice, which typically have from 256 to 1024 neurons at some layers. We begin in Subsection 3.1 with additional notations, and then present the training framework in Subsection 3.2 and the convergence analysis in Subsection 3.3. We conclude in Subsection 3.4 by providing empirical evidence to support our theory.

3.1 Additional Notations
We denote by θ^(l) ∈ R^{d_l} the vector of all the trainable parameters at the l-th layer for l = 1, ..., H, where H is the depth, or the number of trainable layers (i.e., one plus the number of hidden layers). That is, the H-th layer is the last layer, containing the trainable parameter vector θ^(H). For any pair (l, l′) such that 1 ≤ l ≤ l′ ≤ H, we define θ^(l:l′) = [θ^(l)ᵀ, ..., θ^(l′)ᵀ]ᵀ ∈ R^{d_{l:l′}}: for example, θ^(1:H) = θ and θ^(l:l′) = θ^(l) if l = l′. We consider a family of training algorithms that update the parameter vector θ as θ^(l)_{t+1} = θ^(l)_t − g^(l)_t with g^(l)_t ∼ G^(l)(θ_t, t) for each l = 1, ..., H, where the function G^(l) outputs a distribution over the vector g^(l)_t and differs for different training algorithms. For example, for (mini-batch) stochastic gradient descent (SGD), g^(l)_t represents a product of a learning rate and a stochastic gradient with respect to θ^(l) at the time t. We define G = (G^(1), ..., G^(H)) to represent a training algorithm.
For an arbitrary matrix M ∈ R^{m×m′}, we let M_{*j} be its j-th column vector in R^m, M_{i*} be its i-th row vector in R^{1×m′}, and rank(M) be its matrix rank. We define M • M′ to be the Hadamard product of any matrices M and M′ of the same size. For any vector v ∈ R^m, we let diag(v) ∈ R^{m×m} be the diagonal matrix with diag(v)_{ii} = v_i for i ∈ [m]. We denote by I_m the m × m identity matrix.

3.2 Exploration-Exploitation Wrapper
In this section, we introduce the exploration-exploitation (EE) wrapper A. The EE wrapper A is not a stand-alone training algorithm. Instead, it takes any training algorithm G as its input and runs the algorithm G in a particular way to guarantee global convergence. We note that the exploitation phase in the EE wrapper does not optimize the last layer; instead, it optimizes the hidden layers, whereas the exploration phase optimizes all layers. The EE wrapper allows us to learn a representation ∂f(x,θ_t)/∂θ_t that differs significantly from the initial representation ∂f(x,θ_0)/∂θ_0, without making assumptions on the minimum eigenvalue of the matrix (∂vec(f_X(θ))/∂θ)(∂vec(f_X(θ))/∂θ)ᵀ, by leveraging the data-architecture alignment condition. The data-architecture alignment condition is ensured by the safe-exploration condition (defined in Subsection 3.3.1), which is time-independent and holds for practical common architectures (as demonstrated in Subsection 3.4).

3.2.1 Main Mechanisms
Algorithm 1 outlines the EE wrapper A. During the exploration phase in lines 3-7 of Algorithm 1, the EE wrapper A freely explores hierarchical nonlinear representations to be learned without any restrictions. Then, during the exploitation phase in lines 8-12, it starts exploiting the current knowledge to ensure vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) for all t ≥ τ so as to guarantee global convergence. The value of τ is the hyper-parameter that controls the time when the wrapper transitions from the exploration phase to the exploitation phase.
In the exploitation phase, the wrapper A only optimizes the parameter vector θ^(H−1)_t at the (H−1)-th hidden layer, instead of the parameter vector θ^(H)_t at the last layer, or the H-th layer. Despite this, the EE wrapper A is proved to converge to global minima over all layers in R^d. The exploitation phase still allows us to significantly change the representations, as M(θ_t) ≉ M(θ_τ) for t > τ. This is because we optimize a hidden layer instead of the last layer, without any significant over-parameterization.
The exploitation phase uses an arbitrary optimizer G̃ with the update vector g̃_t ∼ G̃^(H−1)(θ_t, t), where g̃_t = α_t ĝ_t ∈ R^{d_{H−1}}. During the two phases, we can use the same optimizer (e.g., SGD for both G and G̃) or different optimizers (e.g., SGD for G and L-BFGS for G̃).
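The following is a minimal sketch of this schedule (our illustration, not the released implementation): `model.hidden_head`, standing for the parameters θ^(H−1) of the (H−1)-th layer, and the optimizer factory `make_opt` are hypothetical names, and the Gaussian perturbation at the phase transition mirrors the step θ_τ = θ_{τ−1} + εδ used in the analysis.

```python
import torch

def ee_wrapper_train(model, loss_fn, loader, make_opt, tau, epochs, eps):
    # Exploration phase (t < tau): run the base algorithm G on all layers.
    opt = make_opt(model.parameters())
    for _ in range(tau):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()

    # Phase transition: perturb the exploitation-phase weights once, which
    # corresponds to theta_tau = theta_{tau-1} + eps * delta in the analysis.
    with torch.no_grad():
        for p in model.hidden_head.parameters():
            p.add_(eps * torch.randn_like(p))

    # Exploitation phase (t >= tau): optimize only the (H-1)-th layer with an
    # optimizer G~; the last layer and all earlier layers stay frozen.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.hidden_head.parameters():
        p.requires_grad_(True)
    opt = make_opt(model.hidden_head.parameters())
    for _ in range(epochs - tau):
        for x, y in loader:
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```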

3.2.2 Model Modification
This subsection defines the details of the model modification at line 2 of Algorithm 1. Given any base network f̃, the wrapper A first checks whether or not the last two layers of the given network f̃ are fully connected. If not, one or two fully-connected last layers are added such that the output of the network f̃ can be written as f̃(x, θ̃) = W̃^(H) σ̃(W̃^(H−1) z(x, θ^(1:H−2))). Here, z(x, θ^(1:H−2)) is the output of the (H−2)-th layer, the function z is arbitrary and can represent various deep networks, σ̃ is a nonlinear activation function, and W̃^(H−1) and W̃^(H) are the weight matrices of the last two layers. The wrapper A then modifies these last two layers as follows. In the case of m_y = 1, the model f̃ is modified as in equation (16), where W^(H−1) ∈ R^{m_H×m_{H−1}} and W^(H) ∈ R^{m_y×m_H} are the weight matrices of the last two layers. The nonlinear activation σ is defined by σ(q, q′) = σ̄(q • q′) • (q • q′), where σ̄ is some nonlinear function. For example, we can set σ̄(q) = 1/(1 + e^{−ςq}) (sigmoid) with any hyper-parameter ς > 0, for which it holds that σ̄(q) → 1{q > 0} as ς → ∞. We generalize equation (16) to the case of m_y ≥ 2 as in equation (17). To consider the bias term, we include the constant neuron in the output of the (H−2)-th layer as z(x, θ^(1:H−2)) = [z̄(x, θ^(1:H−2))ᵀ, 1]ᵀ, where z̄(x, θ^(1:H−2)) is the output without the constant neuron.
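As a concrete sketch of one plausible instantiation (our reading of the m_y = 1 case; the exact displays (16)-(17) are in the original), the modified head below gates the hidden pre-activation q = W^(H−1)z by the sigmoid σ̄(ςq) and multiplies the gate back onto q, so each hidden unit computes σ̄(ςq_k)·q_k before the final linear layer; the module and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class ModifiedHead(nn.Module):
    """Sketch of the modified last two layers for m_y = 1, under our reading
    of the gated activation sigma(q) = sigmoid(varsigma * q) * q (elementwise),
    which tends to relu(q) as varsigma grows."""

    def __init__(self, in_dim, m_H, m_y=1, varsigma=100.0):
        super().__init__()
        self.W_Hm1 = nn.Linear(in_dim, m_H, bias=False)   # W^{(H-1)}
        self.W_H = nn.Linear(m_H, m_y, bias=False)        # W^{(H)}
        self.varsigma = varsigma

    def forward(self, z):
        # z: output of the (H-2)-th layer, with a constant neuron appended
        # upstream to emulate the bias term, as described in the text.
        q = self.W_Hm1(z)
        gated = torch.sigmoid(self.varsigma * q) * q
        return self.W_H(gated)
```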

3.3 Convergence Analysis
In this subsection, we establish the global convergence of the EE wrapper A without using the assumptions from the previous section. Let τ be an arbitrary positive integer and ε be an arbitrary positive real number. Let (θ_t)_{t=0}^∞ be a sequence generated by the EE wrapper A.

3.3.1 Safe-exploration Condition
The mathematical analysis in this section relies on the safe-exploration condition, which is what allows us to safely explore deep nonlinear representations in the exploration phase without getting stuck in the states where vec(Y*) ∉ Col(∂vec(f_X(θ_t))/∂θ_t). The safe-exploration condition is verifiable, time-independent, data-dependent, and architecture-dependent. The verifiability and time-independence make the assumption strong enough to provide guarantees prior to training. The data-dependence and architecture-dependence make the assumption weak enough to be applicable to a wide range of practical settings. For the statement, let φ be the matrix-valued function, built from the hidden-layer outputs z(x_i, θ^(1:H−2)) and the gate σ̄, that maps a pair (q, θ^(1:H−2)) to a matrix φ(q, θ^(1:H−2)) ∈ R^{n×m_H m_{H−1}}. Using this function, the safe-exploration condition is formally stated as:

Assumption 4. (Safe-exploration condition) There exist a q ∈ R^{m_{H−1}×m_H} and a θ^(1:H−2) ∈ R^{d_{1:H−2}} such that rank(φ(q, θ^(1:H−2))) = n.
The safe-exploration condition asks only for the existence of one parameter vector in the network architecture such that rank(φ(q, θ^(1:H−2))) = n. It is not about the training trajectory (θ_t)_t. Since the matrix φ(q, θ^(1:H−2)) is of size n × m_H m_{H−1}, the safe-exploration condition does not require any wide layer of size m_H ≥ n or m_{H−1} ≥ n. Instead, it requires a layer of size m_H m_{H−1} ≥ n. This is a significant improvement over the most closely related study (Kawaguchi and Sun, 2021), where a wide layer of size m_H ≥ n was required. Note that having m_H m_{H−1} ≥ n does not imply the safe-exploration condition. Instead, m_H m_{H−1} ≥ n is a necessary condition for the safe-exploration condition, whereas m_H ≥ n or m_{H−1} ≥ n was a necessary condition for the assumptions in previous papers, including the most closely related study (Kawaguchi and Sun, 2021). The safe-exploration condition is verified in experiments in Subsection 3.4.
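Because the condition asks only for existence, it can be certified with a single randomly drawn pair, as in the sketch below (our illustration); the helper `build_phi`, which assembles φ(q, θ^(1:H−2)) for a given architecture, is a hypothetical name whose construction follows the paper's definition.

```python
import numpy as np

def verify_safe_exploration(build_phi, n, m_H, m_Hm1, d_rest, trials=5, seed=0):
    """Try a few random (q, theta) pairs; one full-rank hit certifies
    Assumption 4, since only existence is required."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        q = rng.standard_normal((m_Hm1, m_H))
        theta = rng.standard_normal(d_rest)   # theta^{(1:H-2)}, e.g. He init
        phi = build_phi(q, theta)             # shape (n, m_H * m_Hm1)
        if np.linalg.matrix_rank(phi) == n:
            return True                       # certificate found
    return False  # inconclusive: the condition may still hold elsewhere
```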

3.3.2 Additional Assumptions
We also use the following assumptions:

Assumption 5. (Loss function) For any i ∈ [n], the function ℓ_i : q ↦ ℓ(q, y_i) is differentiable, and ∇L is Lipschitz continuous with Lipschitz constant L̂ > 0.

Assumption 6. (Architecture) For any i ∈ [n], the functions θ^(1:H−2) ↦ z(x_i, θ^(1:H−2)) and q ↦ σ̄(q) are real analytic.

Assumption 5 is satisfied by using standard loss functions such as the squared loss ℓ(q, y) = ‖q − y‖² and the cross-entropy loss ℓ(q, y) = −Σ_{k=1}^{m_y} y_k log(exp(q_k)/Σ_{k′} exp(q_{k′})). The assumptions of the invexity and convexity of the function q ↦ ℓ(q, y_i) in Subsections 3.3.3-3.3.4 also hold for these standard loss functions. The constant L̂ in Assumption 5 is used to set the learning rate in the exploitation phase. Assumption 6 is satisfied by using any analytic activation function, such as the sigmoid, hyperbolic tangent, and softplus activations q ↦ ln(1 + exp(ςq))/ς with any hyper-parameter ς > 0. This is because a composition of real analytic functions is real analytic, and the following are all real analytic functions in θ^(1:H−2): the convolution, affine map, average pooling, skip connection, and batch normalization. Therefore, the assumptions can be satisfied by using a wide range of machine learning models, including deep neural networks with convolution, skip connection, and batch normalization. Moreover, the softplus activation can approximate the ReLU activation to any desired accuracy: i.e., ln(1 + exp(ςq))/ς → relu(q) as ς → ∞, where relu represents the ReLU activation.
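The softplus-to-ReLU approximation is easy to check numerically; in the snippet below (our illustration), the maximal gap on a grid shrinks like log(2)/ς, which is about 6.9 × 10⁻³ at ς = 100.

```python
import numpy as np

q = np.linspace(-2.0, 2.0, 401)
for varsigma in (1.0, 10.0, 100.0):
    # logaddexp(0, x) = log(1 + exp(x)) evaluated stably for large x.
    softplus = np.logaddexp(0.0, varsigma * q) / varsigma
    print(varsigma, np.max(np.abs(softplus - np.maximum(q, 0.0))))
# The maximal gap is attained at q = 0 and equals log(2) / varsigma.
```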

3.3.3 Global Optimality at the Limit Point
The following theorem proves the global optimality at the limit points of the EE wrapper with a wide range of optimizers, including gradient descent and modified Newton methods:

Theorem 4. Suppose Assumptions 4-6 hold and that the function ℓ_i : q ↦ ℓ(q, y_i) is invex for any i ∈ [n]. Assume that there exist c̄, c > 0 such that c‖∇L(θ^(H−1)_t)‖² ≤ ∇L(θ^(H−1)_t)ᵀĝ_t and ‖ĝ_t‖² ≤ c̄‖∇L(θ^(H−1)_t)‖² for any t ≥ τ. Assume that the learning rate sequence (α_t)_{t≥τ} satisfies either (i) ϵ ≤ α_t ≤ c(2 − ϵ)/(L̂c̄) for some ϵ > 0, or (ii) lim_{t→∞} α_t = 0 and Σ_{t=τ}^∞ α_t = ∞. Then, with probability one, every limit point θ̄ of the sequence (θ_t)_t is a global minimum of L, as L(θ̄) ≤ L(θ) for all θ ∈ R^d.

3.3.4 Global Optimality Gap at Each Iteration
We now present global convergence guarantees of the EE wrapper A with gradient descent and SGD:

Theorem 5. Suppose Assumptions 4-6 hold and that the function ℓ_i : q ↦ ℓ(q, y_i) is convex for any i ∈ [n]. Then, with probability one, the following two statements hold: (i) (Gradient descent) if ĝ_t = ∇L(θ^(H−1)_t) and α_t = 1/L̂ for t ≥ τ, then, for any ϵ̄ ≥ 0 and t > τ, the loss L(θ_{t*}) is bounded by the global minimum value of L plus ϵ̄ plus a term that vanishes at the rate O(1/(t − τ)), where t* ∈ argmin_{k∈{τ,τ+1,...,t}} L(θ_k); (ii) (SGD) if ĝ_t is a stochastic gradient of L with respect to θ^(H−1)_t, an analogous bound holds in expectation.

3.4 Experiments
This section presents empirical evidence to support our theory and what is predicted by a well-known hypothesis. We note that there is no related work or algorithm that can guarantee global convergence in the setting of our experiments, where the model has convolutions, skip connections, and batch normalization without any wide layer (of width larger than n). Moreover, unlike previous studies that propose new methods, our training framework works by modifying any given method.

3.4.1 Sine Wave Dataset
We have seen in Subsection 2.2 that gradient descent gets stuck at suboptimal points for the sine wave dataset. Using the same setting as that in Subsection 2.2 with ε = 0.01, τ = 2000, and G̃ = G, we confirm in Figure 2 that the EE wrapper A can modify gradient descent to avoid suboptimal points and converge to global minima, as predicted by our theory. Figure 3 shows the value of the change of the gradient representation, ‖M(θ_T) − M(θ_0)‖²_F, for each time step T. As can be seen, the values of ‖M(θ_T) − M(θ_0)‖²_F are large for both methods. Notably, the EE wrapper A of the base method significantly increases the value of ‖M(θ_T) − M(θ_0)‖²_F even in the exploitation phase after τ = 2000, as we are optimizing the hidden layer. See Appendix B in the Supplementary Information for more details of the experiments for the sine wave dataset.

3.4.2 Image Datasets
The standard convolutional ResNet with 18 layers (He et al., 2016) is used as the base model f̃. We use ResNet-18 for the illustration of our theory because it is used in practice and it has convolution, skip connections, and batch normalization without any width larger than the number of data points. This setting is not covered by any of the previous theories for global convergence. We set the activation to be the softplus function q ↦ ln(1 + exp(ςq))/ς with ς = 100 for all layers of the base ResNet. This approximates the ReLU activation well, as shown in Appendix C in the Supplementary Information. We employ the cross-entropy loss and σ̄(q) = 1/(1 + e^{−q}). We use a standard algorithm, SGD, with its standard hyper-parameter setting for the training algorithm G, with G̃ = G: i.e., we let the mini-batch size be 64, the weight decay rate be 10^{−5}, the momentum coefficient be 0.9, the learning rate be α_t = 0.1, and the last epoch T̄ be 200 (with data augmentation) and 100 (without data augmentation). The hyper-parameters ε and τ = τ₀T̄ were selected from ε ∈ {10^{−3}, 10^{−5}} and τ₀ ∈ {0.4, 0.6, 0.8} by only using training data. That is, we randomly divided each training dataset (100%) into a smaller training set (80%) and a validation set (20%) for a grid search over the hyper-parameters. See Appendix B in the Supplementary Information for the results of the grid search and details of the experimental setting. This standard setting satisfies Assumptions 5-6, leaving Assumption 4 to be verified.
Verification of Assumption 4. Table 2 summarizes the verification results of the safe-exploration condition. Because the condition only requires the existence of a pair (θ, q) satisfying it, we verified it by using a randomly sampled q from the standard normal distribution and a θ returned by a common initialization scheme (He et al., 2015). For each dataset, the rank condition was verified twice by two standard methods: one from (Press et al., 2007) and another from (Golub and Van Loan, 1996).
Test Performance. One well-known hypothesis is that the success of deep-learning methods partially comes from their ability to automatically learn deep nonlinear representations suitable for making accurate predictions from data (e.g., LeCun et al., 2015). As the EE wrapper A keeps this ability of representation learning, the hypothesis suggests that the test performance of the EE wrapper A of a standard method is approximately comparable with that of the standard method. Unlike typical experimental studies, our objective here is to confirm this prediction, instead of showing improvements over a previous method. We empirically confirmed the prediction in Tables 3 and 4, where the numbers indicate the mean test errors (with standard deviations in parentheses) over five random trials. As expected, the values of ‖M(θ_T̄) − M(θ_0)‖²₂ were also large: e.g., 4.64 × 10^12 for the standard method and 3.43 × 10^12 for the wrapper A of the method with the Semeion dataset.
Training Behavior. Figure 4 shows that the EE wrapper A can improve the training loss values of the standard SGD algorithm in the exploitation phase without changing its behavior in the exploration phase.

Computational Time. The EE wrapper A runs the standard SGD G in the exploration phase and the SGD G̃ = G only on the subset of the weights θ^(H−1) in the exploitation phase. Thus, the computational time of the EE wrapper A is similar to that of the SGD in the exploration phase, and it tends to be faster than the SGD in the exploitation phase.
To confirm this, we measured the computational time with the Semeion and CIFAR-10 datasets under the same computational resources (e.g., without running other jobs in parallel) on a local workstation for each method. The mean wall-clock time (in seconds) over five random trials is summarized in Table 5, where the numbers in parentheses are standard deviations. It shows that the EE wrapper A is slightly faster than the standard method, as expected.
Effect of Learning Rate and Optimizer. We also conducted experiments on the effects of learning rates and optimizers using the MNIST dataset with data augmentation. Using the best learning rate from {0.2, 0.1, 0.01, 0.001} for each method (with G̃ = G = SGD), the mean test errors (%) over five random trials were 0.33 (0.03) for the standard base method and 0.27 (0.03) for the wrapper A of the standard base method (the numbers in parentheses are standard deviations). Moreover, Table 6 reports preliminary results on the effect of optimizers, with G̃ set to the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) (and G = the standard SGD). By comparing Tables 3 and 6, we can see that using a different optimizer in the exploitation phase can potentially lead to performance improvements. A comprehensive study of this phenomenon is left to future work.

4 Conclusion
Despite the nonlinearity of the dynamics and the non-invexity of the objective, we have rigorously proved the convergence of training dynamics to global minima for nonlinear representation learning. Our results apply to a wide range of machine learning models, allowing both under-parameterization and over-parameterization. For example, our results are applicable to the case where the minimum eigenvalue of the matrix M(θ_t) = (∂vec(f_X(θ_t))/∂θ_t)(∂vec(f_X(θ_t))/∂θ_t)ᵀ is zero for all t ≥ 0. Under the common model structure assumption, models that cannot achieve the zero error for all datasets (except for some 'good' datasets) are shown to achieve the global optimality with the zero error exactly when the dynamics satisfy the data-architecture alignment condition. Our results provide guidance for choosing and designing model structures and algorithms via the common model structure assumption and the data-architecture alignment condition.
The key limitation in our analysis is the differentiability of the function f . For multilayer neural networks, this is satisfied by using standard activation functions, such as softplus, sigmoid, and hyperbolic tangents. Whereas softplus can approximate ReLU arbitrarily well, the direct treatment of ReLU in nonlinear representation learning is left to future work.
Our theoretical results and numerical observations uncover novel mathematical properties and provide a basis for future work. For example, we have shown the global convergence under the data-architecture alignment condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t). The EE wrapper A is only one way to ensure this condition. There are many other ways to ensure the data-architecture alignment condition, and each way can result in a new algorithm with guarantees.

Supplementary Material

Appendix A Proofs
In Appendix A, we present the proofs of all the theoretical results. We show the proofs of Theorem 2 in Appendix A.2, Theorem 3 in Appendix A.3, Proposition 1 in Appendix A.4, Theorem 4 in Appendix A.5, and Theorem 5 in Appendix A.6. We also provide a proof idea of Theorem 2 at the beginning of Appendix A.2. Before starting our proofs, we introduce the additional notations used in the proofs. We define f_i(θ) = f(x_i, θ), ℓ_i(q) = ℓ(q, y_i), and [k′, k] = {k′, k′ + 1, ..., k} for any k, k′ ∈ N₊ with k > k′. Given a scalar-valued variable a ∈ R and a matrix M ∈ R^{m×m′}, we define ∂a/∂M ∈ R^{m×m′} entrywise by (∂a/∂M)_{ij} = ∂a/∂M_{ij}, where M_{ij} represents the (i, j)-th entry of the matrix M. Similarly, given a vector-valued variable a ∈ R^m and a column vector b ∈ R^{m′}, we let ∂a/∂b ∈ R^{m×m′} be the matrix with (∂a/∂b)_{ij} = ∂a_i/∂b_j, where b_i represents the i-th entry of the column vector b. Given a vector-valued variable a ∈ R^m and a row vector b ∈ R^{1×m′}, we write (∂a/∂b)_{ij} = ∂a_i/∂b_{1j}, where b_{1i} represents the i-th entry of the row vector b. Given a function a : v ↦ a(v), we define ∂a(v̄)/∂v = (∂a(v)/∂v)|_{v=v̄}. For completeness, we also recall the standard definition of the Kronecker product of two matrices: for matrices M ∈ R^{m_M×m′_M} and M̃ ∈ R^{m_{M̃}×m′_{M̃}}, the Kronecker product M ⊗ M̃ ∈ R^{m_M m_{M̃} × m′_M m′_{M̃}} is the block matrix whose (i, j)-th block is M_{ij}M̃.

A.1 Proof of Theorem 1

The formal version of Theorem 1 is the combination of Theorem 2, Theorem 3, and Proposition 1, which are proved in Appendix A.2, Appendix A.3, and Appendix A.4.

A.2 Proof of Theorem 2
We begin with a proof idea of Theorem 2. We first relate the value of L(θ) and that of ∇L(θ), in the sense that if ∇L(θ) = 0, then the value of L(θ) is bounded from above by L_θ(β) for any β ∈ R^d. This is a key lemma, Lemma 1. This lemma implies that, since every limit point θ̄ achieves ∇L(θ̄) = 0, the value of L(θ̄) has the upper bound L_θ̄(β) at every limit point θ̄ ∈ R^d. Then, another key observation is that the upper bound L_θ(β) on L(θ) can be replaced by inf_{η∈R} L*(ηY*) through a suitable choice of β; here, the data-architecture alignment condition vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) is crucial and is used to ensure the quality of the upper bound L_θ(β).
To prove Theorem 2, we first prove the following key lemma, which relates the value of L(θ) and that of ∇L(θ):

Lemma 1. Assume that the function ℓ_i : q ↦ ℓ(q, y_i) ∈ R_{≥0} is differentiable and convex for every i ∈ {1, ..., n}. Let θ be any differentiable point such that f_i : θ ↦ f(x_i, θ) is differentiable at θ for every i ∈ {1, ..., n}. Then, for any β ∈ R^d,

L(θ) ≤ L_θ(β) + ∇L(θ)ᵀ(ν(θ) − β),

where L_θ(β) = Σ_{i=1}^n ℓ_i((∂f_i(θ)/∂θ)β), and ν(θ)_k = θ_k for all k ∈ S in Assumption 1 and ν(θ)_k = 0 for all k ∉ S.
Proof of Lemma 1. Let θ be any differentiable point such that f_i : θ ↦ f(x_i, θ) is differentiable at θ for every i ∈ {1, ..., n}. Then, for any β ∈ R^d,

L(θ) = Σ_{i=1}^n ℓ_i(f_i(θ)) ≤ Σ_{i=1}^n [ℓ_i((∂f_i(θ)/∂θ)β) + (∂ℓ_i(q)/∂q |_{q=f_i(θ)})(f_i(θ) − (∂f_i(θ)/∂θ)β)],

where the inequality follows from the differentiability and convexity of ℓ_i. From Assumption 1, we have that f_i(θ) = (∂f_i(θ)/∂θ)ν(θ). Combining these,

L(θ) ≤ L_θ(β) + Σ_{i=1}^n (∂ℓ_i(q)/∂q |_{q=f_i(θ)})(∂f_i(θ)/∂θ)(ν(θ) − β) = L_θ(β) + ∇L(θ)ᵀ(ν(θ) − β).

We also utilize the following lemma, which is a slightly modified version of a well-known fact:

Lemma 2. If ∇L is Lipschitz continuous with Lipschitz constant L̂, then for any z, z′ ∈ R^d, L(z′) ≤ L(z) + ∇L(z)ᵀ(z′ − z) + (L̂/2)‖z′ − z‖².

Proof of Lemma 2. Given any z, z′ ∈ R^d, we can define a function φ̂ : [0, 1] → R by φ̂(τ′) = L(z + τ′(z′ − z)). Thus, ∇φ̂ : [0, 1] → R is Lipschitz continuous with Lipschitz constant L̂‖z′ − z‖², and hence ∇φ̂ is continuous. By using the fundamental theorem of calculus with the continuous function ∇φ̂, we obtain L(z′) = φ̂(1) = φ̂(0) + ∫₀¹ ∇φ̂(τ′)dτ′ ≤ L(z) + ∇L(z)ᵀ(z′ − z) + (L̂/2)‖z′ − z‖².

With Lemmas 1 and 2, we are now ready to complete the proof of Theorem 2:

Proof of Theorem 2. The function L is differentiable, since ℓ_i is differentiable, θ ↦ f(x_i, θ) is differentiable, and a composition of differentiable functions is differentiable. We will first show that, in both cases (i) and (ii) for the learning rates, we have lim_{t→∞} ‖∇L(θ_t)‖ = 0. If ∇L(θ_t) = 0 at some t ≥ 0, then Assumption 2 (‖ḡ_t‖² ≤ c̄‖∇L(θ_t)‖²) implies ḡ_t = 0, which implies θ_{t+1} = θ_t and ∇L(θ_{t+1}) = ∇L(θ_t) = 0.
This means that if ∇L(θ_t) = 0 at some t ≥ 0, we have that ḡ_{t′} = 0 and ∇L(θ_{t′}) = 0 for all t′ ≥ t, and hence lim_{t→∞} ‖∇L(θ_t)‖ = 0, as desired. Therefore, we now focus on the remaining scenario, where ∇L(θ_t) ≠ 0 for all t ≥ 0. By using Lemma 2,

L(θ_{t+1}) ≤ L(θ_t) − α_t∇L(θ_t)ᵀḡ_t + (L̂α_t²/2)‖ḡ_t‖².

By rearranging and using Assumption 2,

L(θ_{t+1}) ≤ L(θ_t) − α_t c‖∇L(θ_t)‖² + (L̂α_t²c̄/2)‖∇L(θ_t)‖².

By simplifying the right-hand side,

L(θ_{t+1}) ≤ L(θ_t) − α_t (c − (L̂α_t c̄/2))‖∇L(θ_t)‖². (52)

Let us now focus on case (i). Then, using α_t ≤ c(2 − ϵ)/(L̂c̄), we have c − (L̂α_t c̄/2) ≥ cϵ/2. Using this inequality and using ϵ ≤ α_t in equation (52),

L(θ_{t+1}) ≤ L(θ_t) − (cϵ²/2)‖∇L(θ_t)‖². (53)

Since ∇L(θ_t) ≠ 0 for any t ≥ 0 (see above) and ϵ > 0, this means that the sequence (L(θ_t))_t is monotonically decreasing. Since L(q) ≥ 0 for any q in its domain, this implies that the sequence (L(θ_t))_t converges. Therefore, L(θ_t) − L(θ_{t+1}) → 0 as t → ∞. Using equation (53), this implies that lim_{t→∞} ‖∇L(θ_t)‖ = 0, which proves the desired result for case (i). We now focus on case (ii). Then, we still have equation (52). Since lim_{t→∞} α_t = 0 in equation (52), the first-order term in α_t dominates after sufficiently large t: i.e., there exists t̄ ≥ 0 such that, for any t ≥ t̄,

L(θ_{t+1}) ≤ L(θ_t) − c′α_t‖∇L(θ_t)‖² (54)

for some constant c′ > 0. Since ∇L(θ_t) ≠ 0 for any t ≥ 0 (see above) and c′α_t > 0, this means that the sequence (L(θ_t))_t is monotonically decreasing. Since L(q) ≥ 0 for any q in its domain, this implies that the sequence (L(θ_t))_t converges to a finite value. Thus, by summing equation (54) over all t ≥ t̄, we obtain Σ_{t=t̄}^∞ α_t‖∇L(θ_t)‖² < ∞. Since Σ_{t=0}^∞ α_t = ∞, this implies that lim inf_{t→∞} ‖∇L(θ_t)‖ = 0. We now show, by contradiction, that lim sup_{t→∞} ‖∇L(θ_t)‖ = 0. Suppose that lim sup_{t→∞} ‖∇L(θ_t)‖ > 0. Then, there exists δ > 0 such that lim sup_{t→∞} ‖∇L(θ_t)‖ ≥ δ. Since lim inf_{t→∞} ‖∇L(θ_t)‖ = 0 and lim sup_{t→∞} ‖∇L(θ_t)‖ ≥ δ, let (ρ_j)_j and (ρ′_j)_j be sequences of indices such that ρ_j < ρ′_j < ρ_{j+1}, ‖∇L(θ_t)‖ > δ/3 for ρ_j ≤ t < ρ′_j, and ‖∇L(θ_t)‖ ≤ δ/3 for ρ′_j ≤ t < ρ_{j+1}. Since Σ_{t=t̄}^∞ α_t‖∇L(θ_t)‖² < ∞, let j̄ be sufficiently large such that Σ_{t=ρ_{j̄}}^∞ α_t‖∇L(θ_t)‖² < δ²/(9L̂√c̄). Then, for any j ≥ j̄ and any ρ such that ρ_j ≤ ρ ≤ ρ′_j − 1, we have that

‖∇L(θ_ρ)‖ ≤ ‖∇L(θ_{ρ′_j})‖ + Σ_{t=ρ}^{ρ′_j−1} ‖∇L(θ_{t+1}) − ∇L(θ_t)‖ ≤ ‖∇L(θ_{ρ′_j})‖ + L̂ Σ_{t=ρ}^{ρ′_j−1} α_t‖ḡ_t‖ ≤ ‖∇L(θ_{ρ′_j})‖ + L̂√c̄ Σ_{t=ρ}^{ρ′_j−1} α_t‖∇L(θ_t)‖,

where the first inequality uses the triangle inequality, the second uses the assumption that ‖∇L(θ) − ∇L(θ′)‖ ≤ L̂‖θ − θ′‖ together with the definition θ_{t+1} − θ_t = −α_tḡ_t, and the last follows from the assumption ‖ḡ_t‖² ≤ c̄‖∇L(θ_t)‖². Then, by using the definition of the sequences of indices, Σ_{t=ρ}^{ρ′_j−1} α_t‖∇L(θ_t)‖ ≤ (3/δ) Σ_{t=ρ}^{ρ′_j−1} α_t‖∇L(θ_t)‖² < (3/δ)(δ²/(9L̂√c̄)) = δ/(3L̂√c̄). Here, since ‖∇L(θ_{ρ′_j})‖ ≤ δ/3, by combining the inequalities, we have that, for any ρ ≥ ρ_{j̄}, ‖∇L(θ_ρ)‖ ≤ δ/3 + δ/3 = 2δ/3 < δ. This contradicts the inequality lim sup_{t→∞} ‖∇L(θ_t)‖ ≥ δ. Thus, we have lim sup_{t→∞} ‖∇L(θ_t)‖ = 0. This implies that lim_{t→∞} ‖∇L(θ_t)‖ = 0, which proves the desired result for case (ii). Therefore, in both cases (i) and (ii) for the learning rates, we have lim_{t→∞} ‖∇L(θ_t)‖ = 0. We now use the fact that lim_{t→∞} ‖∇L(θ_t)‖ = 0 in order to prove the statement of this theorem. From Lemma 1 and the differentiability assumption, for any θ ∈ R^d, it holds that, for any β ∈ R^d, L(θ) ≤ L_θ(β) + ∇L(θ)ᵀ(ν(θ) − β). For any θ satisfying vec(Y*) ∈ Col(∂vec(f_X(θ))/∂θ) and any η ∈ R, there exists a vector β̄(θ, η) such that (∂vec(f_X(θ))/∂θ)β̄(θ, η) = vec(ηY*), and thus L_θ(β̄(θ, η)) = Σ_{i=1}^n ℓ_i(ηy*_i) = L*(ηY*). Using this β̄(θ, η) in Lemma 1, we have that, for any θ satisfying vec(Y*) ∈ Col(∂vec(f_X(θ))/∂θ), L(θ) ≤ L*(ηY*) + ∇L(θ)ᵀ(ν(θ) − β̄(θ, η)). Note that the statement of this theorem vacuously holds if there is no limit point of the sequence (θ_t)_t. Thus, let θ̄ ∈ R^d be a limit point of the sequence (θ_t)_t. Then, lim_{t→∞} ‖∇L(θ_t)‖ = 0 implies ∇L(θ̄) = 0, and, since vec(Y*) ∈ Col(∂vec(f_X(θ_t))/∂θ_t) for all t ∈ [τ, ∞), we conclude that L(θ̄) ≤ inf_{η∈R} L*(ηY*), since θ̄ ∈ R^d and ∇L(θ̄) = 0.
A.3 Proof of Theorem 3

We first prove the following lemma:

Lemma 3. Suppose Assumption 3 holds. Let (α_t, ḡ_t) = (2α/L̂, ∇L(θ_t)) with an arbitrary α ∈ (0, 1). Then, for any T = {t_0, t_1, ..., t_{|T|−1}} ⊆ N₀ with t_0 = min{t : t ∈ T}, Σ_{t∈T} ‖∇L(θ_t)‖² ≤ L̂L(θ_{t_0})/(2α(1 − α)).

Proof of Lemma 3. Using Lemma 2 and θ_{t+1} − θ_t = −(2α/L̂)∇L(θ_t),

L(θ_{t+1}) ≤ L(θ_t) − (2α/L̂)‖∇L(θ_t)‖² + (L̂/2)(2α/L̂)²‖∇L(θ_t)‖² = L(θ_t) − (2α(1 − α)/L̂)‖∇L(θ_t)‖², (72)

which implies that the sequence (L(θ_t))_t is non-increasing and that ‖∇L(θ_t)‖² ≤ (L̂/(2α(1 − α)))(L(θ_t) − L(θ_{t+1})). By summing up both sides over the time, with T = {t_0, t_1, ..., t_{|T|−1}},

Σ_{k=0}^{|T|−1} ‖∇L(θ_{t_k})‖² ≤ (L̂/(2α(1 − α))) Σ_{k=0}^{|T|−1} (L(θ_{t_k}) − L(θ_{t_k+1})) ≤ (L̂/(2α(1 − α))) L(θ_{t_0}),

where the last line follows from the fact that L(θ_{t_k+1}) ≥ L(θ_{t_{k+1}}), and thus −L(θ_{t_k+1}) + L(θ_{t_{k+1}}) ≤ 0 for all k, since the sequence (L(θ_t))_t is non-increasing from equation (72). Using this inequality, min_{t∈T} ‖∇L(θ_t)‖ ≤ √(L̂L(θ_{t_0})/(2α(1 − α)|T|)).

With Lemmas 1 and 3, we are now ready to complete the proof of Theorem 3:

Proof of Theorem 3. From Lemma 1 and the differentiability assumption, for any θ ∈ R^d, it holds that, for any β ∈ R^d, L(θ) ≤ L_θ(β) + ∇L(θ)ᵀ(ν(θ) − β). For any θ satisfying vec(Y*) ∈ Col(∂vec(f_X(θ))/∂θ) and any η ∈ R, there exists β̄(θ, η) such that (∂vec(f_X(θ))/∂θ)β̄(θ, η) = vec(ηY*), and thus L_θ(β̄(θ, η)) = L*(ηY*). Using this β̄(θ, η) in Lemma 1, we have that, for any θ satisfying vec(Y*) ∈ Col(∂vec(f_X(θ))/∂θ),

L(θ) ≤ L*(ηY*) + ∇L(θ)ᵀ(ν(θ) − β̄(θ, η)) ≤ L*(ηY*) + ‖∇L(θ)‖ ‖ν(θ) − β̄(θ, η)‖.

Using Lemma 3, we have that min_{t∈T} L(θ_t) ≤ L*(ηY*) + √(L̂L(θ_{t_0})/(2α(1 − α)|T|)) max_{t∈T} ‖ν(θ_t) − β̄(θ_t, η)‖, where the last step follows from the fact that the minimum of ‖∇L(θ_t)‖ over t ∈ T is bounded by Lemma 3. By noticing that t_0 = min{t : t ∈ T}, we complete the proof.

A.4 Proof of Proposition 1
Proof. From Assumption 1, we have that vec(f_X(θ)) = (∂vec(f_X(θ))/∂θ)ν(θ) ∈ Col(∂vec(f_X(θ))/∂θ), where ν(θ)_k = θ_k for all k ∈ S in Assumption 1 and ν(θ)_k = 0 for all k ∉ S. This implies that, if vec(Y*) ∉ Col(∂vec(f_X(θ))/∂θ), then vec(f_X(θ)) ≠ vec(ηY*) for any η ∈ R with ηY* ≠ 0, which implies the statement of this proposition.

A.5 Proof of Theorem 4
The following two lemmas are the key lemmas in the proof of Theorem 4: the safe-exploration condition ensures that the Lebesgue measure of the set of parameters for which the dynamics get stuck in states with vec(Y*) ∉ Col(∂ vec(f_X(θ_t))/∂θ_t) is zero. Equation (91) implies the required identity; combining equations (90) and (92) then yields equation (93). Since the functions θ^{(1:H−2)} ↦ z(x_i, θ^{(1:H−2)}) and q ↦ σ̄(q) are real analytic for each i ∈ {1, . . . , n} (Assumption 6), the function φ is real analytic (because a composition of real analytic functions is real analytic and an affine map is real analytic). This implies that the function ϕ is real analytic, since φ is real analytic, the function det is real analytic, and a composition of real analytic functions is real analytic. Since ϕ is real analytic, if ϕ is not identically zero (ϕ ≢ 0), the Lebesgue measure of its zero set is zero (Mityagin, 2015). Equation (93) shows that ϕ is not identically zero, and thus the Lebesgue measure of its zero set is zero. Using equation (90), the statement of Lemma 4 follows.

Lemma 5. Suppose that Assumptions 4 and 6 hold. Then, with probability one, for all j ∈ {1, . . . , m_y}, rank(φ(R^{(j)}, θ^{(1:H−2)})[diag(θ^{(H,j)}) ⊗ I_{m_{H−1}}]) = n.

Proof of Lemma 5. From Lemma 4, the Lebesgue measure of the set {[vec(R^{(j)})ᵀ, θ^{(1:H−2)ᵀ}, θ^{(H,j)ᵀ}]ᵀ ∈ R^{m_H m_{H−1} + d_{1:H−2} + m_H} : rank(φ(R^{(j)}, θ^{(1:H−2)})[diag(θ^{(H,j)}) ⊗ I_{m_{H−1}}]) ≠ n} is zero. Moreover, θ_τ = θ_{τ−1} + εδ defines a non-degenerate Gaussian measure with the mean shifted by θ_{τ−1}, and a non-degenerate Gaussian measure with any mean and variance is absolutely continuous with respect to the Lebesgue measure. Therefore, for each j ∈ {1, . . . , m_y}, the rank condition holds with probability one. Since a product of ones is one, it holds with probability one for all j ∈ {1, . . . , m_y} simultaneously.
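For completeness, the zero-set fact from (Mityagin, 2015) invoked in the proof of Lemma 4 above can be stated as follows (our paraphrase of a standard result, not a display from the paper):

    \varphi : \mathbb{R}^k \to \mathbb{R} \ \text{real analytic, } \varphi \not\equiv 0
    \;\Longrightarrow\;
    \lambda\bigl( \{ x \in \mathbb{R}^k : \varphi(x) = 0 \} \bigr) = 0 ,

where λ denotes the Lebesgue measure on R^k.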
In the following, we will use additional notation introduced for this part of the proof. The following lemma relates the two functions f_X and φ:

Lemma 6. For θ^{(H−1)} ∈ R^{d_{H−1}} and j ∈ {1, . . . , m_y}, the output f(x, θ)_j can be written as a dot product that is linear in θ^{(H−1,j)}, with the coefficients expressed through φ.

Proof of Lemma 6. Using the definition of f and z = z(x, θ^{(1:H−2)}), we rewrite the inner product in the definition of f(x, θ)_j as a sum of scalar multiplications and then rewrite the sum as a dot product of two block vectors; this yields the claimed expression in terms of θ^{(H−1,j)}.
From Assumption 5, we also have that the corresponding bound holds for all q, q′ ∈ R.
Note that this is not an additional assumption; it is already assumed in Theorem 4. For example, Assumption 7 is satisfied by using any optimizer Ḡ such that ḡ_t = D_t ∇L(θ_t^{(H−1)}) for all t ≥ τ, where D_t is a positive definite symmetric matrix with eigenvalues in the interval [c, √c̄]. If we set D_t = I, then Assumption 7 is satisfied with c = c̄ = 1.
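The following Python sketch (our own illustration, not code from the paper; the constants c, c̄ and the random preconditioner are hypothetical) checks numerically that any symmetric positive definite D_t with eigenvalues in [c, √c̄] produces a direction ḡ_t = D_t ∇L(θ_t) satisfying the two bounds used in the convergence analysis above, namely ∇L(θ_t)ᵀḡ_t ≥ c‖∇L(θ_t)‖² and ‖ḡ_t‖² ≤ c̄‖∇L(θ_t)‖²:

    import numpy as np

    rng = np.random.default_rng(0)
    c, c_bar = 0.5, 4.0  # hypothetical constants with 0 < c <= sqrt(c_bar)
    d = 10

    # Symmetric positive definite D with eigenvalues in [c, sqrt(c_bar)].
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthogonal basis
    eigs = rng.uniform(c, np.sqrt(c_bar), size=d)
    D = Q @ np.diag(eigs) @ Q.T

    grad = rng.standard_normal(d)  # stand-in for the gradient of L at theta_t
    g_bar = D @ grad               # modified direction g_bar_t

    assert g_bar @ grad >= c * (grad @ grad) - 1e-12       # sufficient alignment
    assert g_bar @ g_bar <= c_bar * (grad @ grad) + 1e-12  # bounded magnitude

The alignment bound holds because ∇Lᵀ D ∇L ≥ λ_min(D)‖∇L‖² ≥ c‖∇L‖², and the magnitude bound because ‖D∇L‖ ≤ λ_max(D)‖∇L‖ ≤ √c̄ ‖∇L‖.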
To generate the sine wave dataset, we randomly generated the input x_i from the uniform distribution on the interval [−1, 1] and set y_i = 1{sin(20x_i) < 0} ∈ R for all i ∈ [n] with n = 100. For the convolutional deep neural network, we used the Semeion dataset (Srl and Brescia, 1994) and a random dataset. The Semeion dataset has 1593 data points, of which we randomly selected 1000 as training data points. The random dataset was created by randomly generating each pixel of the input image x_i ∈ R^{16×16×1} from the standard normal distribution and by sampling y_i uniformly from {0, 1} for all i ∈ [n] with n = 1000.
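As a concrete illustration, the two synthetic datasets above can be generated as follows (our own sketch, not code from the original experiments; the array shapes and the random seed are our choices):

    import numpy as np

    rng = np.random.default_rng(0)

    # Sine wave dataset: x_i ~ Uniform[-1, 1], y_i = 1{sin(20 x_i) < 0}, n = 100.
    n = 100
    x_sine = rng.uniform(-1.0, 1.0, size=(n, 1))
    y_sine = (np.sin(20.0 * x_sine) < 0).astype(np.float64)

    # Random dataset: each pixel of x_i in R^{16x16x1} is standard normal,
    # and y_i is sampled uniformly from {0, 1}, with n = 1000.
    n_rand = 1000
    x_rand = rng.standard_normal((n_rand, 16, 16, 1))
    y_rand = rng.integers(0, 2, size=n_rand)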
Training. For each dataset, we used (mini-batch) stochastic gradient descent (SGD) with a mini-batch size of 100. This effectively results in (full-batch) gradient descent for the two-moons and sine wave datasets (because they each have only 100 data points), whereas it behaves as SGD for the Semeion dataset and the random dataset (because they each have 1000 data points). We used the cross-entropy loss as the training loss. We fixed the momentum coefficient to 0.9 for the convolutional network and 0.98 for the fully-connected network. Under this setting, deep neural networks have been empirically observed to have implicit regularization effects (e.g., see Poggio et al., 2017). We set the learning rate to 0.1 for all the experiments. We fixed the end epoch to 1000 for the fully-connected networks and 400 for the convolutional network.
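A minimal sketch of this optimizer configuration, assuming PyTorch (the placeholder model and mini-batch are ours; only the hyperparameters come from the text):

    import torch

    model = torch.nn.Linear(256, 10)  # hypothetical stand-in for the networks above

    # SGD with learning rate 0.1; momentum 0.9 for the convolutional network
    # (0.98 would be used for the fully-connected network).
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
    loss_fn = torch.nn.CrossEntropyLoss()  # cross-entropy training loss

    # One illustrative step on a random mini-batch of size 100.
    x = torch.randn(100, 256)
    y = torch.randint(0, 10, (100,))
    optimizer.zero_grad()
    loss_fn(model(x), y).backward()
    optimizer.step()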

B.2 Experimental Setup and Additional Results for Section 3.4
We provide additional details and results for the experiments with the sine wave dataset in Appendix B.2.1 and image datasets in Appendix B.2.2.

B.2.1 Sine Wave Dataset
For the sine wave dataset, we also verified that the safe-exploration condition (Assumption 4) is satisfied in the setting of the experiment in Section 3.4. We set ε = 0.01, τ = 2000, and Ḡ = G for the EE wrapper A for the sine wave dataset. We fixed the last epoch to 10000. All other settings were the same as those in Section 2.2. That is, to generate the sine wave dataset, we randomly generated the input x_i from the uniform distribution on the interval [−1, 1] and set y_i = 1{sin(20x_i) < 0} ∈ R for all i ∈ [n] with n = 100. We used (mini-batch) stochastic gradient descent (SGD) with a mini-batch size of 100. We fixed the momentum coefficient to 0.98 and the learning rate to 0.1. We used a fully-connected deep neural network with four layers and 300 neurons per hidden layer. The input dimension and the output dimension were one. We set the activation functions of all layers to be the softplus σ(z) = ln(1 + exp(ςz))/ς with ς = 100. We set σ̄(q) = 1/(1 + e^{−ςq}) with ς = 1000 for the EE wrapper A. Each entry of the weight matrices was initialized independently from the normal distribution with standard deviation 1/√m with m = 300 for all layers. We repeated this random initialization for three independent random trials with different random seeds in Figure 2. In Figure 3, we reported the result of the first trial.
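A minimal sketch of this architecture and initialization, assuming PyTorch (the class and function names are ours, and biases are left at the PyTorch default since the text only specifies the weight initialization):

    import math
    import torch.nn as nn
    import torch.nn.functional as F

    class ScaledSoftplus(nn.Module):
        """sigma(z) = ln(1 + exp(s*z)) / s, approximating ReLU for large s."""
        def __init__(self, s: float = 100.0):
            super().__init__()
            self.s = s
        def forward(self, z):
            return F.softplus(z, beta=self.s)

    def make_mlp(width: int = 300) -> nn.Sequential:
        # Four layers with 300 neurons per hidden layer; 1-d input and output.
        dims = [1, width, width, width, 1]
        layers = []
        for i in range(len(dims) - 1):
            lin = nn.Linear(dims[i], dims[i + 1])
            # Each weight entry ~ N(0, std^2) with std = 1/sqrt(m), m = 300.
            nn.init.normal_(lin.weight, std=1.0 / math.sqrt(width))
            layers.append(lin)
            if i < len(dims) - 2:
                layers.append(ScaledSoftplus(100.0))
        return nn.Sequential(*layers)

    net = make_mlp()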

B.2.2 Image Datasets
In the following, we provide the details of the experiments for the image datasets. We used exactly the same settings across the experiments for test performance, training behavior, and computational time. For the experiments on the effect of learning rates and optimizers, we changed the learning rates and optimizers as explained in Section 3.4.
Model. We used the standard (convolutional) pre-activation ResNet with 18 layers (He et al., 2016). We set the activation to the softplus function q ↦ ln(1 + exp(ςq))/ς with ς = 100 for all layers of the base ResNet. This approximates the ReLU activation well, as shown in Appendix C. We set σ̄(q) = 1/(1 + e^{−q}) for the EE wrapper A.
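As a quick numerical check of this approximation (a sketch we add; the grid and tolerance are our choices), the pointwise gap between this softplus and ReLU never exceeds ln(2)/ς, which is about 0.007 for ς = 100:

    import numpy as np

    s = 100.0
    q = np.linspace(-2.0, 2.0, 100001)       # grid containing q = 0
    softplus = np.logaddexp(0.0, s * q) / s  # ln(1 + exp(s*q)) / s, overflow-safe
    relu = np.maximum(q, 0.0)

    gap = np.max(softplus - relu)
    print(gap)                               # ~0.00693
    assert gap <= np.log(2.0) / s + 1e-12    # maximum gap ln(2)/s, attained at q = 0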

Appendix C Additional Discussion
On softplus activation. Throughout the experiments in this paper, we use the softplus function q ↦ ln(1 + exp(ςq))/ς with ς = 100 as a real analytic (and differentiable) function that approximates the ReLU activation. This approximation is of high accuracy, as shown in Figure 6.

On Theorem 3. Theorem 3 predicts that the upper bound on the global optimality gap (min_{t∈[T]} L(θ_t) − inf_{θ∈R^d} L(θ)) decreases toward zero whenever vec(Y*) ∈ Col(∂ vec(f_X(θ_t))/∂θ_t) at each iteration. This follows from equations (13) and (14). Moreover, Theorem 3 implies that ε-near global minimum values are achieved for any ε > 0 if vec(Y*) ∈ Col(∂ vec(f_X(θ_t))/∂θ_t) at sufficiently many iterations t, for both regression and classification losses.
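As a worked illustration of the role of η for classification losses (a standard computation for the binary logistic loss with targets y_i ∈ {−1, +1}, added here for intuition rather than quoted from the paper), evaluating the loss at the scaled prediction f_i = ηy_i gives

    \ell(\eta y_i,\, y_i) \;=\; \log\bigl(1 + e^{-\eta}\bigr) \;\le\; e^{-\eta} ,

so for any desired accuracy ε > 0, choosing η ≥ log(1/ε) makes the loss at most ε on every sample, whereas the loss reaches zero only in the limit η → ∞.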
On cross-entropy losses with Theorem 3. Note that even with η = 1, the value of L*(ηY) in Theorem 3 is that of a correct classification with zero classification error. Moreover, given any desired accuracy ε > 0, there exists η ∈ R that yields ε-near global optimality for the binary and multi-class cross-entropy losses with Theorem 3. However, unlike the squared-loss case, we cannot set ε = 0, as this requires η = ∞, for which the right-hand side of equation (14) becomes infinite. This shows that our theory is consistent with the true behaviors of the logistic loss and multi-class cross-entropy loss, as desired.

On the EE wrapper A. The EE wrapper A is designed to provide a priori guarantees while not hurting practical performance, rather than improving it. We can use any