## Abstract

Representations of the world environment play a crucial role in artificial intelligence. It is often inefficient to conduct reasoning and inference directly in the space of raw sensory representations, such as pixel values of images. Representation learning allows us to automatically discover suitable representations from raw sensory data. For example, given raw sensory data, a deep neural network learns nonlinear representations at its hidden layers, which are subsequently used for classification (or regression) at its output layer. This happens implicitly during training through minimizing a supervised or unsupervised loss. In this letter, we study the dynamics of such implicit nonlinear representation learning. We identify a pair of a new assumption and a novel condition, called the *common model structure assumption* and the *data-architecture alignment condition*. Under the common model structure assumption, the data-architecture alignment condition is shown to be sufficient for global convergence and necessary for global optimality. Moreover, our theory explains how and when increasing network size does and does not improve training behavior in the practical regime. Our results provide practical guidance for designing a model structure; for example, the common model structure assumption can be used as a justification for using a particular model structure instead of others. As an application, we then derive a new training framework that satisfies the data-architecture alignment condition without assuming it, by automatically modifying any given training algorithm depending on the data and architecture.
Given a standard training algorithm, the framework running its modified version is empirically shown to maintain competitive (practical) test performances while providing global convergence guarantees for deep residual neural networks with convolutions, skip connections, and batch normalization with standard benchmark data sets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN.

## 1 Introduction

LeCun, Bengio, and Hinton (2015) described deep learning as a hierarchical nonlinear representation-learning approach:

> Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. (p. 436)

In applications such as computer vision and natural language processing, the success of deep learning can be attributed to its ability to learn hierarchical nonlinear representations by automatically changing nonlinear features and kernels during training based on the given data. This is in contrast to classical machine-learning methods where representations or equivalently nonlinear features and kernels are fixed during training.

Deep learning in practical regimes, which has the ability to learn nonlinear representation (Bengio, Courville, & Vincent, 2013), has had a profound impact in many areas, including object recognition in computer vision (Rifai et al., 2011; Hinton, Osindero, & Teh, 2006; Bengio, Lamblin, Popovici, & Larochelle, 2007; Ciregan, Meier, & Schmidhuber, 2012; Krizhevsky, Sutskever, & Hinton, 2012), style transfer (Gatys, Ecker, & Bethge, 2016; Luan, Paris, Shechtman, & Bala, 2017), image super-resolution (Dong, Loy, He, & Tang, 2014), speech recognition (Dahl, Ranzato, Mohamed, & Hinton, 2010; Deng et al., 2010; Seide, Li, & Yu, 2011; Mohamed, Dahl, & Hinton, 2011; Dahl, Yu, Deng, & Acero, 2011; Hinton et al., 2012), machine translation (Schwenk, Rousseau, & Attik, 2012; Le, Oparin, Allauzen, Gauvain, & Yvon, 2012), paraphrase detection (Socher, Huang, Pennington, Ng, & Manning, 2011), word sense disambiguation (Bordes, Glorot, Weston, & Bengio, 2012), and sentiment analysis (Glorot, Bordes, & Bengio, 2011; Socher, Pennington, Huang, Ng, & Manning, 2011). However, we do not yet know the precise condition that makes deep learning tractable in the practical regime of representation learning.

For example, one of the simplest models is the linear model of the form $f(x,\theta)=\varphi(x)^\top\theta$, where $\varphi:\mathcal{X}\to\mathbb{R}^d$ is a fixed function and $\varphi(x)$ is a nonlinear representation of input data $x$. This is a classical machine learning model where much of the effort goes into the design of the handcrafted feature map $\varphi$ via feature engineering (Turner, Fuggetta, Lavazza, & Wolf, 1999; Zheng & Casari, 2018). In this linear model, we do not learn the representation $\varphi(x)$ because the feature map $\varphi$ is fixed without dependence on the model parameter $\theta$ that is optimized with the data set $((x_i,y_i))_{i=1}^n$.
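As a concrete illustration, the fixedness of the representation in a linear model can be checked numerically: the gradient representation $\frac{\partial f(x,\theta)}{\partial\theta}$ equals $\varphi(x)$ regardless of $\theta$. The polynomial feature map below is a hypothetical choice for illustration.

```python
import numpy as np

# A linear model f(x, theta) = phi(x)^T theta with a fixed polynomial
# feature map phi (a hypothetical choice for illustration). The gradient
# representation d f / d theta equals phi(x) and never depends on theta,
# so no representation learning occurs during training.
def phi(x, degree=3):
    return np.array([x**k for k in range(degree + 1)])  # [1, x, x^2, x^3]

def f(x, theta):
    return phi(x).dot(theta)

# Finite-difference gradient of f with respect to theta.
def grad_theta(x, theta, eps=1e-6):
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta); e[k] = eps
        g[k] = (f(x, theta + e) - f(x, theta - e)) / (2 * eps)
    return g

x = 0.7
theta_a = np.zeros(4)
theta_b = np.random.randn(4)

# The representation is identical at both parameter values and equals phi(x).
assert np.allclose(grad_theta(x, theta_a), grad_theta(x, theta_b))
assert np.allclose(grad_theta(x, theta_a), phi(x))
```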

Similar to many definitions in mathematics, where an intuitive notion in a special case is formalized to a definition for a more general case, we now abstract and generalize this intuitive notion of the representation $\varphi (x)$ of the linear model to that of all differentiable models as follows:

Given any $x\in\mathcal{X}$ and differentiable function $f$, we define $\frac{\partial f(x,\theta)}{\partial\theta}$ to be the *gradient representation of the data $x$ under the model $f$ at $\theta$*.

The dynamics of the model output over a training step are *linear* in $\Delta t$ if there is no gradient representation learning, that is, if $\frac{\partial f(x,\theta_t)}{\partial\theta_t}\approx\frac{\partial f(x,\theta_0)}{\partial\theta_0}$. With representation learning, however, the gradient representation $\frac{\partial f(x,\theta_t)}{\partial\theta_t}$ changes depending on $t$ (and $\Delta t$), resulting in dynamics that are *nonlinear* in $\Delta t$. Therefore, the definition of the gradient representation can distinguish fundamentally different dynamics in machine learning.

In this letter, we initiate the study of the dynamics of learning gradient representations that are nonlinear in $\Delta t$. That is, we focus on the regime where the gradient representation $\frac{\partial f(x,\theta_t)}{\partial\theta_t}$ at the end of training time $t$ differs greatly from the initial representation $\frac{\partial f(x,\theta_0)}{\partial\theta_0}$. This regime was studied in the past for the case where the function $\varphi(x)\mapsto f(x,\theta)$ is affine for some fixed feature map $\varphi$ (Saxe, McClelland, & Ganguli, 2014; Kawaguchi, 2016, 2021; Laurent & Brecht, 2018; Bartlett, Helmbold, & Long, 2019; Zou, Long, & Gu, 2020; Xu et al., 2021). Unlike any previous studies, we focus on the problem setting where the function $\varphi(x)\mapsto f(x,\theta)$ is nonlinear and nonaffine, with the effect of nonlinear (gradient) representation learning. The results of this letter avoid the curse of dimensionality by studying the global convergence of the gradient-based dynamics instead of the dynamics of global optimization (Kawaguchi et al., 2016) and Bayesian optimization (Kawaguchi, Kaelbling, & Lozano-Pérez, 2015). Importantly, we do not require any wide layer or large input dimension throughout this letter. Our main contributions are summarized as follows:

- In section 2, we identify a pair of a novel assumption and a new condition, called the *common model structure assumption* and the *data-architecture alignment condition*. Under the common model structure assumption, the data-architecture alignment condition is shown to be a necessary condition for the globally optimal model and a sufficient condition for the global convergence. The condition depends on both data and architecture. Moreover, we empirically verify and deepen this new understanding. When we apply representation learning in practice, we often have overwhelming options regarding which model structure to use. Our results provide practical guidance for choosing or designing a model structure via the common model structure assumption, which is indeed satisfied by many representation learning models used in practice.
- In section 3, we discard the assumption of the data-architecture alignment condition. Instead, we derive a novel training framework, called the exploration-exploitation wrapper (EE wrapper), which satisfies the data-architecture alignment condition time-independently a priori. The EE wrapper is then proved to have global convergence guarantees under the *safe-exploration condition*. The safe-exploration condition is what allows us to explore various gradient representations safely without getting stuck in states where we cannot provably satisfy the data-architecture alignment condition. The safe-exploration condition is shown to hold true for ResNet-18 with standard benchmark data sets, including MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN, time-independently.
- In section 3.4, the EE wrapper is shown not to degrade the practical performance of ResNet-18 on the standard data sets MNIST, CIFAR-10, CIFAR-100, Semeion, KMNIST, and SVHN. To our knowledge, this letter provides the first practical algorithm with global convergence guarantees that does not degrade the practical performance of ResNet-18 on these standard data sets, using convolutions, skip connections, and batch normalization without any extremely wide layer of width larger than the number of data points. We are not aware of any similar algorithms with global convergence guarantees in the regime of learning nonlinear representations without degrading practical performance.

## 2 Understanding Dynamics via Common Model Structure and Data-Architecture Alignment

In this section, we identify the common model structure assumption and study the data-architecture alignment condition for the global convergence in nonlinear representation learning. We begin by presenting an overview of our results in section 2.1, deepen our understanding with experiments in section 2.2, discuss implications of our results in section 2.3, and establish mathematical theories in section 2.4.

### 2.1 Overview

#### 2.1.1 Common Model Structure Assumption

Through examinations of representation learning models used in applications, we identified and formalized one of their common properties as follows:

(Common Model Structure Assumption). There exists a subset $S\subseteq\{1,2,\ldots,d\}$ such that $f(x_i,\theta)=\sum_{k=1}^d \mathbb{1}\{k\in S\}\,\theta_k\,\frac{\partial f(x_i,\theta)}{\partial\theta_k}$ for any $i\in\{1,\ldots,n\}$ and $\theta\in\mathbb{R}^d$.

Assumption 1 is satisfied by common machine learning models, such as kernel models and multilayer neural networks, with or without convolutions, batch normalization, pooling, and skip connections. For example, consider a multilayer neural network of the form $f(x,\theta)=Wh(x,u)+b$, where $h(x,u)$ is an output of its last hidden layer and the parameter vector $\theta$ consists of the parameters $(W,b)$ at the last layer and the parameters $u$ in all other layers as $\theta=\mathrm{vec}([W,b,u])$. Here, for any matrix $M\in\mathbb{R}^{m\times\bar m}$, we let $\mathrm{vec}(M)\in\mathbb{R}^{m\bar m}$ be the standard vectorization of the matrix $M$ by stacking columns. Then assumption 1 holds because $f(x_i,\theta)=\sum_{k=1}^d \mathbb{1}\{k\in S\}\,\theta_k\,\frac{\partial f(x_i,\theta)}{\partial\theta_k}$, where $S$ is defined by $\{\theta_k:k\in S\}=\{\mathrm{vec}([W,b])_k:k=1,2,\ldots,\xi\}$ with $\mathrm{vec}([W,b])\in\mathbb{R}^{\xi}$. Since $h$ is arbitrary in this example, the common model structure assumption holds, for example, for any multilayer neural network with a fully connected last layer. In general, because the nonlinearity at the output layer can be treated as a part of the loss function $\ell$ while preserving convexity of $q\mapsto\ell(q,y)$ (e.g., cross-entropy loss with softmax), this assumption is satisfied by many machine learning models, including ResNet-18 and all models used in the experiments in this letter (as well as all linear models). Moreover, assumption 1 is automatically satisfied in the next section by using the EE wrapper.
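The last-layer decomposition above can be verified numerically on a toy network. The following sketch (with arbitrary layer sizes) checks assumption 1 for $f(x,\theta)=Wh(x,u)+b$ by comparing $f(x,\theta)$ with $\sum_{k\in S}\theta_k\,\frac{\partial f(x,\theta)}{\partial\theta_k}$ via finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer network f(x, theta) = W h(x, u) + b with h = tanh(U x),
# used to check assumption 1 numerically (sizes are arbitrary).
U = rng.standard_normal((5, 3))   # hidden-layer parameters u
W = rng.standard_normal((1, 5))   # last-layer weights
b = rng.standard_normal(1)        # last-layer bias
x = rng.standard_normal(3)

def f(theta):
    W_, b_, U_ = theta[:5].reshape(1, 5), theta[5:6], theta[6:].reshape(5, 3)
    return (W_ @ np.tanh(U_ @ x) + b_)[0]

theta = np.concatenate([W.ravel(), b, U.ravel()])
S = range(6)  # indices of vec([W, b]); the set S in assumption 1

# Central finite-difference partial derivatives of f at theta.
eps = 1e-6
partials = np.array([(f(theta + eps * np.eye(theta.size)[k])
                      - f(theta - eps * np.eye(theta.size)[k])) / (2 * eps)
                     for k in range(theta.size)])

lhs = f(theta)
rhs = sum(theta[k] * partials[k] for k in S)
assert abs(lhs - rhs) < 1e-5  # f(x, theta) = sum_{k in S} theta_k df/dtheta_k
```

The identity holds because $\partial f/\partial W_j = h_j$ and $\partial f/\partial b = 1$, so the weighted sum over $S$ reconstructs $Wh+b$ exactly.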

#### 2.1.2 Data-Architecture Alignment Condition

Given a target matrix $Y=(y_1,y_2,\ldots,y_n)^\top\in\mathbb{R}^{n\times m_y}$ and a loss function $\ell$, we define the modified target matrix $Y^\ell=(y_1^\ell,y_2^\ell,\ldots,y_n^\ell)^\top\in\mathbb{R}^{n\times m_y}$ by $Y^\ell=Y$ for the squared loss $\ell$ and by $(Y^\ell)_{ij}=2Y_{ij}-1$ for the (binary and multiclass) cross-entropy losses $\ell$ with $Y_{ij}\in\{0,1\}$. Given an input matrix $X=(x_1,x_2,\ldots,x_n)^\top\in\mathbb{R}^{n\times m_x}$, the output matrix $f_X(\theta)\in\mathbb{R}^{n\times m_y}$ is defined by $f_X(\theta)_{ij}=f(x_i,\theta)_j\in\mathbb{R}$. For any matrix $M\in\mathbb{R}^{m\times\bar m}$, we let $\mathrm{Col}(M)\subseteq\mathbb{R}^{m}$ be its column space. With these notations, we are now ready to introduce the data-architecture alignment condition:

(Data-Architecture Alignment Condition). Given any data set $(X,Y)$, differentiable function $f$, and loss function $\ell$, the *data-architecture alignment condition* is said to be satisfied at $\theta$ if $\mathrm{vec}(Y^\ell)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)$.

The data-architecture alignment condition depends on both data (through the target $Y$ and the input $X$) and architecture (through the model $f$). It is satisfied only when the data and architecture align well with each other. For example, in the case of the linear model $f(x,\theta)=\varphi(x)^\top\theta\in\mathbb{R}$, the condition can be written as $\mathrm{vec}(Y^\ell)\in\mathrm{Col}(\Phi(X))$, where $\Phi(X)\in\mathbb{R}^{n\times d}$ and $\Phi(X)_{ij}=\varphi(x_i)_j$. In definition 2, $f_X(\theta)$ is a matrix of the preactivation outputs of the last layer. Thus, in the case of classification tasks with a nonlinear activation at the output layer, $f_X(\theta)$ and $Y$ are not in the same space, which is the reason we use $Y^\ell$ here instead of $Y$.
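A minimal numerical check of the condition can be sketched via a least-squares residual, with a random matrix standing in for the Jacobian $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}$ (an illustration, not an actual network Jacobian):

```python
import numpy as np

rng = np.random.default_rng(1)

# Check vec(Y_l) in Col(J) by testing whether the least-squares residual
# of J c = y vanishes; J is a random stand-in for the Jacobian
# d vec(f_X(theta)) / d theta with n*m_y rows and d columns.
def is_aligned(J, y, tol=1e-8):
    coef = np.linalg.lstsq(J, y, rcond=None)[0]
    return np.linalg.norm(J @ coef - y) < tol

J = rng.standard_normal((6, 4))    # n*m_y = 6 rows, d = 4 parameters
y_in = J @ rng.standard_normal(4)  # constructed to lie in Col(J)
y_out = rng.standard_normal(6)     # almost surely outside Col(J)

assert is_aligned(J, y_in)
assert not is_aligned(J, y_out)
```

Note that no rank assumption on $J$ is needed: the test only asks whether the target vector lies in the column space, matching the discussion below.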

Importantly, the data-architecture alignment condition does not make any requirement on the rank of the Jacobian matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\in\mathbb{R}^{nm_y\times d}$: the rank of $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}$ is allowed to be smaller than $nm_y$ and $d$. Thus, for example, the data-architecture alignment condition can be satisfied, depending on the given data and architecture, even if the minimum eigenvalue of the matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)^\top$ is zero, in both cases of overparameterization (e.g., $d\gg n$) and underparameterization (e.g., $d\ll n$). This is further illustrated in section 2.2 and discussed in section 2.3. We note that we further discard the assumption of the data-architecture alignment condition in section 3, as it is automatically satisfied by using the EE wrapper.

#### 2.1.3 Global Convergence

Under the common model structure assumption, the data-architecture alignment condition is shown to be what lets us avoid the failure of the global convergence and suboptimal local minima. More concretely, we prove a global convergence guarantee under the data-architecture alignment condition as well as the necessity of the condition for the global optimality:

(Informal Version). Let assumption 1 hold. Then the following two statements hold for gradient-based dynamics:

i. The global optimality gap bound decreases per iteration toward zero at the rate of $O(1/|\mathcal{T}|)$ for any $\mathcal{T}$ such that the data-architecture alignment condition is satisfied at $\theta_t$ for $t\in\mathcal{T}$.

ii. For any $\theta\in\mathbb{R}^d$, the data-architecture alignment condition at $\theta$ is necessary to have the globally optimal model $f_X(\theta)=\eta Y^\ell$ at $\theta$ for any $\eta\in\mathbb{R}$.

Theorem 1i guarantees the global convergence without the need to satisfy the data-architecture alignment condition at every iteration or at the limit point. Instead, it shows that the bound on the global optimality gap decreases toward zero per iteration whenever the data-architecture alignment condition holds. Theorem 1ii shows that the data-architecture alignment condition is necessary for the global optimality. Intuitively, this is because the expressivity of a model class satisfying the common model structure assumption is restricted such that it is required to align the architecture to the data in order to contain the globally optimal model $f_X(\theta)=\eta Y^\ell$ (for any $\eta\in\mathbb{R}$).

To better understand the statement of theorem 1i, consider a counterexample with a data set consisting of the single point $(x,y)=(1,0)$, the model $f(x,\theta)=\theta^4-10\theta^2+6\theta+100$, and the squared loss $\ell(q,y)=(q-y)^2$. In this example, we have $L(\theta)=f(x,\theta)^2$, which has multiple suboptimal local minima of different values. Then, via gradient descent, the model converges to the closest local minimum and, in particular, does not necessarily converge to a global minimum. Indeed, this example violates the common model structure assumption (assumption 1), although it satisfies the data-architecture alignment condition, showing the importance of the common model structure assumption along with the data-architecture alignment. This also illustrates the nontriviality of theorem 1i in that the data-architecture alignment is not sufficient by itself; we needed to understand what types of model structures are commonly used in practice and formalize that understanding as the common model structure assumption.
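The counterexample can be reproduced directly. The sketch below runs plain gradient descent on $L(\theta)=f(x,\theta)^2$ from two initializations (the learning rate and step count are arbitrary choices); the two runs settle at local minima with different loss values, and neither reaches zero loss:

```python
# Counterexample: f(x, theta) = theta^4 - 10 theta^2 + 6 theta + 100 with
# the single data point (x, y) = (1, 0) and squared loss, so that
# L(theta) = f(theta)^2. Plain gradient descent converges to the nearest
# local minimum, which depends on the initialization.
def f(t):
    return t**4 - 10 * t**2 + 6 * t + 100

def grad_L(t):
    return 2 * f(t) * (4 * t**3 - 20 * t + 6)  # chain rule on f(t)^2

def gd(t0, lr=1e-5, steps=20000):
    t = t0
    for _ in range(steps):
        t -= lr * grad_L(t)
    return t

t_plus, t_minus = gd(3.0), gd(-3.0)

# Two different local minima of L with different values; the run started
# from theta_0 = 3 is stuck at a suboptimal one, and neither reaches L = 0.
assert f(t_plus) ** 2 > f(t_minus) ** 2 + 3000
assert min(f(t_plus) ** 2, f(t_minus) ** 2) > 0
```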

Since $f(x_1,\theta),f(x_2,\theta),\ldots,f(x_n,\theta)\in\mathbb{R}^{m_y}$, this implies that any stationary point $\theta$ is a global minimum if the minimum eigenvalue of the matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)^\top$ is nonzero, without the common model structure assumption (see assumption 1). Indeed, in the above example with the model $f(x,\theta)=\theta^4-10\theta^2+6\theta+100$, the common model structure assumption is violated, but we still have the global convergence if the minimum eigenvalue is nonzero; for example, $f(x,\theta)=y=0$ at any stationary point $\theta$ such that the minimum eigenvalue of the matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)^\top$ is nonzero. In contrast, theorem 1 allows the global convergence even when the minimum eigenvalue of this matrix is zero, by utilizing the common model structure assumption.

The formal version of theorem 1 is presented in section 2.4 and is proved in appendix A in the supplementary information that relies on the additional previous works of Clanuwat et al. (2019), Krizhevsky and Hinton (2009), Mityagin (2015), Netzer et al. (2011), Paszke et al. (2019a, 2019b), and Poggio et al. (2017). Before proving the statement, we first examine the meaning and implications of our results through illustrative examples in sections 2.2 and 2.3.

### 2.2 Illustrative Examples in Experiments

Theorem 1 suggests that the data-architecture alignment condition $\mathrm{vec}(Y^\ell)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ has the ability to distinguish the success and failure cases, even when the minimum eigenvalue of the matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)^\top$ is zero for all $t\ge 0$. In this section, we conduct experiments to further verify and deepen this theoretical understanding.

We employ a fully connected network having four layers with 300 neurons per hidden layer, and a convolutional network, LeNet (LeCun et al., 1998), with five layers. For the fully connected network, we use the two-moons data set (Pedregosa et al., 2011) and a sine wave data set. To create the sine wave data set, we randomly generated the input $x_i$ from the uniform distribution on the interval $[-1,1]$ and set $y_i=\mathbb{1}\{\sin(20x_i)<0\}\in\mathbb{R}$ for all $i\in[n]$ with $n=100$. For the convolutional network, we use the Semeion data set (Srl & Brescia, 1994) and a random data set. The random data set was created by randomly generating each pixel of the input image $x_i\in\mathbb{R}^{16\times16\times1}$ from the standard normal distribution and by sampling $y_i$ uniformly from $\{0,1\}$ for all $i\in[n]$ with $n=1000$. We set the activation functions of all layers to be the softplus $\bar\sigma(z)=\ln(1+\exp(\varsigma z))/\varsigma$ with $\varsigma=100$, which approximately behaves as the ReLU activation, as shown in appendix C in the supplementary information. See appendix B in the supplementary information for more details of the experimental settings.
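The softplus activation used in these experiments can be sketched as follows; with sharpness $\varsigma=100$, it differs from ReLU by at most $\ln(2)/\varsigma\approx 0.007$:

```python
import numpy as np

# Softplus with sharpness varsigma, as used in the experiments:
# sigma(z) = ln(1 + exp(varsigma * z)) / varsigma. Computed via logaddexp
# for numerical stability; for varsigma = 100 it is close to ReLU.
def softplus(z, varsigma=100.0):
    return np.logaddexp(0.0, varsigma * z) / varsigma

z = np.linspace(-2, 2, 401)
relu = np.maximum(z, 0.0)

# The maximum gap is ln(2)/varsigma, attained at z = 0.
assert np.max(np.abs(softplus(z) - relu)) < 1e-2
```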

Figure 1 reports the quantity $Q_T$, which measures the degree of *not* satisfying the condition $\mathrm{vec}(Y^\ell)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ during $t\in\{0,1,\ldots,T\}$.

Figures 1b and 1c show the results for the convolutional networks with two random initial points using two different random seeds. In the figure panels, we report the training behaviors with different network sizes $m_c=1,2$, and $4$; the number of convolutional filters per convolutional layer is $8\times m_c$, and the number of neurons per fully connected hidden layer is $128\times m_c$. As can be seen, with the Semeion data set, the networks of all sizes achieved zero error with $Q_T=0$ for all $T$. With the random data set, the deep networks yielded zero training error whenever $Q_T$ was not linearly increasing over time or, equivalently, whenever the condition $\mathrm{vec}(Y^\ell)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_T))}{\partial\theta_T}\big)$ held for sufficiently many steps $T$. This is consistent with our theory.

Finally, we also confirmed that the gradient representation $\frac{\partial f(x,\theta_t)}{\partial\theta_t}$ changed significantly from the initial one $\frac{\partial f(x,\theta_0)}{\partial\theta_0}$ in our experiments. That is, the values of $\|M(\theta_T)-M(\theta_0)\|_F^2$ were significantly large and tended to increase as $T$ increased, where the matrix $M(\theta)\in\mathbb{R}^{nm_y\times nm_y}$ is defined by $M(\theta)=\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)^\top$. Table 1 summarizes the values of $\|M(\theta_T)-M(\theta_0)\|_F^2$ at the end of training.

**Table 1:** Values of $\|M(\theta_T)-M(\theta_0)\|_F^2$ at the end of training.

a. Fully-Connected Network

| Data Set | $\|M(\theta_T)-M(\theta_0)\|_F^2$ |
|---|---|
| Two moons | $2.09\times10^{11}$ |
| Sine wave | $3.95\times10^{9}$ |

b. Convolutional Network

| Data Set | $m_c=4$, seed 1 | $m_c=4$, seed 2 | $m_c=2$, seed 1 | $m_c=2$, seed 2 | $m_c=1$, seed 1 | $m_c=1$, seed 2 |
|---|---|---|---|---|---|---|
| Semeion | $8.09\times10^{12}$ | $5.19\times10^{12}$ | $9.82\times10^{12}$ | $3.97\times10^{12}$ | $2.97\times10^{12}$ | $5.41\times10^{12}$ |
| Random | $3.73\times10^{12}$ | $1.64\times10^{12}$ | $3.43\times10^{7}$ | $4.86\times10^{12}$ | $1.40\times10^{7}$ | $8.57\times10^{11}$ |

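The quantity reported in Table 1 can be computed for a toy model as follows. This sketch (with an arbitrary small network and training setup, not the networks from the experiments) builds the Jacobian by finite differences, forms $M(\theta)$, and measures $\|M(\theta_T)-M(\theta_0)\|_F^2$ after some gradient steps:

```python
import numpy as np

rng = np.random.default_rng(2)

# Measure how much the gradient representation moves during training via
# ||M(theta_T) - M(theta_0)||_F^2 with M(theta) = J(theta) J(theta)^T and
# J = d vec(f_X(theta)) / d theta, on a toy one-hidden-layer network.
X = rng.standard_normal((8, 2))
y = rng.standard_normal(8)

def unpack(theta):
    return theta[:6].reshape(3, 2), theta[6:9], theta[9:]  # U, w, b

def f_X(theta):
    U, w, b = unpack(theta)
    return np.tanh(X @ U.T) @ w + b[0]

def jacobian(theta, eps=1e-6):
    J = np.zeros((8, theta.size))
    for k in range(theta.size):
        e = np.zeros_like(theta); e[k] = eps
        J[:, k] = (f_X(theta + e) - f_X(theta - e)) / (2 * eps)
    return J

theta = rng.standard_normal(10)
M0 = jacobian(theta) @ jacobian(theta).T

# Gradient descent on the mean squared loss: grad L = (2/n) J^T (f - y).
for _ in range(200):
    J = jacobian(theta)
    theta -= 0.05 * J.T @ (f_X(theta) - y) * 2 / 8

MT = jacobian(theta) @ jacobian(theta).T
change = np.linalg.norm(MT - M0, "fro") ** 2
assert change > 1e-3  # the gradient representation has moved
```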

### 2.3 Implications

In section 2.1.3, we showed that an uncommon model structure $f(x,\theta)=\theta^4-10\theta^2+6\theta+100$ does not satisfy assumption 1 and that assumption 1 is not required for global convergence if the minimum eigenvalue is nonzero. However, in practice, we typically use machine learning models that satisfy assumption 1 instead of the model $f(x,\theta)=\theta^4-10\theta^2+6\theta+100$, and the minimum eigenvalue is zero in many cases. In this context, theorem 1 provides the justification for common practice in nonlinear representation learning. Furthermore, theorem 1i contributes to the literature by identifying the common model structure assumption (assumption 1) and the data-architecture alignment condition (definition 2) as novel and practical conditions that ensure global convergence even when the minimum eigenvalue becomes zero. Moreover, theorem 1ii shows that this condition is not arbitrary in the sense that it is also necessary to obtain globally optimal models. Furthermore, the data-architecture alignment condition is strictly more general than the condition of the minimum eigenvalue being nonzero, in the sense that the latter implies the former but not vice versa.

Our new theoretical understanding based on the data-architecture alignment condition can explain and deepen the previously known empirical observation that increasing network size tends to improve training behavior. Indeed, the size of the networks seems to correlate well with the training error to a certain degree in Figure 1b. However, the size and the training error do not correlate well in Figure 1c. Our new theoretical understanding explains that the training behavior correlates more directly with the data-architecture alignment condition $\mathrm{vec}(Y^\ell)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ instead. The seeming correlation with the network size is indirect and caused by another correlation between the network size and this condition. That is, the condition tends to hold more often when the network size is larger because the matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}$ is of size $nm_y\times d$, where $d$ is the number of parameters: by increasing $d$, we can enlarge the column space $\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ and thereby increase the chance of satisfying the condition.
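The effect of increasing $d$ on the column space can be illustrated with random matrices standing in for the Jacobian (a sketch, not the actual networks from the experiments):

```python
import numpy as np

rng = np.random.default_rng(3)

# Why larger networks satisfy the alignment condition more often: a random
# "Jacobian" with more columns (parameters d) has a larger column space.
# With d >= n*m_y, a generic J has full row rank and spans everything.
def residual(J, y):
    coef = np.linalg.lstsq(J, y, rcond=None)[0]
    return np.linalg.norm(J @ coef - y)

n_my = 20                      # n * m_y rows of the Jacobian
y = rng.standard_normal(n_my)  # stand-in for vec(Y_l)

res_small = residual(rng.standard_normal((n_my, 5)), y)   # d = 5  << n*m_y
res_large = residual(rng.standard_normal((n_my, 40)), y)  # d = 40 >  n*m_y

assert res_small > 1e-3  # y is almost surely not in a 5-dim subspace
assert res_large < 1e-8  # a generic 20x40 J spans all of R^20
```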

Note that the minimum eigenvalue of the matrix $M(\theta_t)=\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)^\top$ is zero at all iterations in Figures 1b and 1c for all cases of $m_c=1$. Thus, Figures 1b and 1c also illustrate the fact that, while having a zero minimum eigenvalue of the matrix $M(\theta_t)$, the dynamics can achieve global convergence under the data-architecture alignment condition. Moreover, because the multilayer neural network in the lazy training regime (Kawaguchi & Sun, 2021) achieves zero training errors for *all* data sets, Figure 1 additionally illustrates that our theoretical and empirical results apply to models outside of the lazy training regime and can distinguish "good" data sets from "bad" data sets given a learning algorithm.

In sum, our new theoretical understanding has the ability to explain and distinguish the successful case and failure case based on the data-architecture alignment condition for the common machine learning models. Because the data-architecture alignment condition is dependent on data and architecture, theorem 1, along with our experimental results, shows why and when the global convergence in nonlinear representation learning is achieved based on the relationship between the data $(X,Y)$ and architecture $f$. This new understanding is used in section 3 to derive a practical algorithm and is expected to be a basis for many future algorithms.

### 2.4 Details and Formalization of Theorem 1

#### 2.4.1 Preliminaries

Let $(\theta_t)_{t=0}^{\infty}$ be the sequence defined by $\theta_{t+1}=\theta_t-\alpha_t\bar g_t$ with an initial parameter vector $\theta_0$, a learning rate $\alpha_t$, and an update vector $\bar g_t$. The analysis in this section relies on the following assumption on the update vector $\bar g_t$:

There exist $\bar c,\underline{c}>0$ such that $\underline{c}\,\|\nabla L(\theta_t)\|^2\le\nabla L(\theta_t)^\top\bar g_t$ and $\|\bar g_t\|\le\bar c\,\|\nabla L(\theta_t)\|$ for any $t\ge 0$.

Assumption 2 is satisfied by using $\bar g_t=D_t\nabla L(\theta_t)$, where $D_t$ is any positive-definite symmetric matrix with eigenvalues in the interval $[\underline{c},\bar c]$. If we set $D_t=I$, we have gradient descent, and assumption 2 is satisfied with $\underline{c}=\bar c=1$. This section also uses the standard assumption of differentiability and Lipschitz continuity:
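The eigenvalue-interval construction can be checked numerically. The sketch below builds a random symmetric positive-definite $D_t$ with eigenvalues in $[\underline{c},\bar c]$ and verifies both inequalities of assumption 2 for random gradient vectors:

```python
import numpy as np

rng = np.random.default_rng(4)

# Verify assumption 2 for bar_g = D grad L, where D is a random symmetric
# positive-definite matrix with eigenvalues in [c_lo, c_hi] (Rayleigh
# quotient bounds give both inequalities).
c_lo, c_hi = 0.5, 2.0
d = 6
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigs = rng.uniform(c_lo, c_hi, size=d)
D = Q @ np.diag(eigs) @ Q.T

for _ in range(100):
    g = rng.standard_normal(d)  # stand-in for grad L(theta_t)
    bar_g = D @ g
    assert c_lo * (g @ g) <= g @ bar_g + 1e-12           # lower bound
    assert np.linalg.norm(bar_g) <= c_hi * np.linalg.norm(g) + 1e-12
```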

For every $i\in[n]$, the function $\ell_i:q\mapsto\ell(q,y_i)$ is differentiable and convex, the map $f_i:\theta\mapsto f(x_i,\theta)$ is differentiable, and $\|\nabla L(\theta)-\nabla L(\theta')\|\le L\|\theta-\theta'\|$ for all $\theta,\theta'$ in the domain of $L$ for some $L\ge 0$.

The assumptions on the loss function in assumption 3 are satisfied by standard loss functions, including the squared loss, logistic loss, and cross-entropy loss. Although the objective function $L$ is nonconvex and non-invex, the function $q\mapsto\ell(q,y_i)$ is typically convex.

Suppose assumption 1 holds. If $\mathrm{vec}(Y^\ell)\notin\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)$, then $f_X(\theta)\neq\eta Y^\ell$ for any $\eta\in\mathbb{R}$.

All proofs of this letter are presented in appendix A in the supplementary information.

#### 2.4.2 Global Optimality at the Limit Point

The following theorem shows that every limit point $\hat\theta$ of the sequence $(\theta_t)_t$ achieves a loss value $L(\hat\theta)$ no worse than $\inf_{\eta\in\mathbb{R}}L^*(\eta Y^*)$ for any $Y^*$ such that $\mathrm{vec}(Y^*)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ for all $t\in[\tau,\infty)$ with some $\tau\ge 0$:

In practice, one can easily satisfy all the assumptions in theorem 2 except for the condition that $\mathrm{vec}(Y^*)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ for all $t\in[\tau,\infty)$. Accordingly, we now weaken this condition by analyzing optimality at each iteration so that the condition is verifiable in experiments.

#### 2.4.3 Global Optimality Gap at Each Iteration

The following theorem states that, under standard settings, the sequence $(\theta_t)_{t\in\mathcal{T}}$ converges to a loss value no worse than $\inf_{\eta\in\mathbb{R}}L^*(\eta Y^*)$ at the rate of $O(1/|\mathcal{T}|)$ for any $\mathcal{T}$ and $Y^*$ such that $\mathrm{vec}(Y^*)\in\mathrm{Col}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta_t))}{\partial\theta_t}\big)$ for $t\in\mathcal{T}$:

## 3 Application to the Design of Training Framework

The results in the previous section show that the bound on the global optimality gap decreases per iteration whenever the data-architecture alignment condition holds. Using this theoretical understanding, in this section we propose a new training framework with a priori guarantees while learning hierarchical nonlinear representations, without assuming the data-architecture alignment condition. As a result, we make significant improvements over the most closely related study on global convergence guarantees (Kawaguchi & Sun, 2021). In particular, whereas the related study requires a wide layer with a width larger than $n$, our results reduce the requirement to a layer with a width larger than $\sqrt{n}$. For example, the MNIST data set has $n=60{,}000$, and hence previous studies require 60,000 neurons at a layer, whereas we only require $\sqrt{60{,}000}\approx 245$ neurons at a layer. Our requirement is consistent with, and satisfied by, the models used in practice, which typically have from 256 to 1024 neurons in some layers.

### 3.1 Additional Notations

For an arbitrary matrix $M\in\mathbb{R}^{m\times m'}$, we let $M_{*j}$ be its $j$th column vector in $\mathbb{R}^m$, $M_{i*}$ be its $i$th row vector in $\mathbb{R}^{m'}$, and $\mathrm{rank}(M)$ be its matrix rank. We define $M\circ M'$ to be the Hadamard product of any matrices $M$ and $M'$. For any vector $v\in\mathbb{R}^m$, we let $\mathrm{diag}(v)\in\mathbb{R}^{m\times m}$ be the diagonal matrix with $\mathrm{diag}(v)_{ii}=v_i$ for $i\in[m]$. We denote by $I_m$ the $m\times m$ identity matrix.

### 3.2 Exploration-Exploitation Wrapper

In this section, we introduce the exploration-exploitation (EE) wrapper $A$. The EE wrapper $A$ is not a stand-alone training algorithm. Instead, it takes any training algorithm $G$ as its input and runs the algorithm $G$ in a particular way to guarantee global convergence. We note that the exploitation phase in the EE wrapper does not optimize the last layer; instead, it optimizes the hidden layers, whereas the exploration phase optimizes all layers. The EE wrapper allows us to learn a representation $\frac{\partial f(x,\theta_t)}{\partial\theta_t}$ that differs significantly from the initial representation $\frac{\partial f(x,\theta_0)}{\partial\theta_0}$, without making assumptions on the minimum eigenvalue of the matrix $\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big(\frac{\partial\,\mathrm{vec}(f_X(\theta))}{\partial\theta}\big)^\top$, by leveraging the data-architecture alignment condition. The data-architecture alignment condition is ensured by the safe-exploration condition (defined in section 3.3.1), which is time independent and holds for practical common architectures (as demonstrated in section 3.4).

#### 3.2.1 Main Mechanisms

Algorithm 1 outlines the EE wrapper $A$. During the exploration phase in lines 3 to 7 of algorithm 1, the EE wrapper $A$ freely explores hierarchical nonlinear representations to be learned, without any restrictions. Then, during the exploitation phase in lines 8 to 12, it starts exploiting the current knowledge to ensure $\mathrm{vec}(Y_\ell) \in \mathrm{Col}\big(\frac{\partial \mathrm{vec}(f_X(\theta_t))}{\partial \theta_t}\big)$ for all $t$, which guarantees global convergence. The hyperparameter $\tau$ controls the time at which the wrapper transitions from the exploration phase to the exploitation phase.

In the exploitation phase, the wrapper $A$ optimizes only the parameter vector $\theta^{(H-1)}_t$ at the $(H-1)$th hidden layer, instead of the parameter vector $\theta^{(H)}_t$ at the last (i.e., $H$th) layer. Despite this, the EE wrapper $A$ is proved to converge to global minima over all layers in $\mathbb{R}^d$. The exploitation phase still allows us to significantly change the representations, as $M(\theta_t) \not\approx M(\theta_\tau)$ for $t > \tau$. This is because we optimize hidden layers instead of the last layer, without any significant overparameterization.

The exploitation phase uses an arbitrary optimizer $\tilde{G}$ with the update vector $\tilde{g}_t \sim \tilde{G}^{(H-1)}(\theta_t, t)$, where $\tilde{g}_t = \alpha_t \hat{g}_t \in \mathbb{R}^{d_{H-1}}$. The two phases can use the same optimizer (e.g., SGD for both $G$ and $\tilde{G}$) or different optimizers (e.g., SGD for $G$ and L-BFGS for $\tilde{G}$).
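As a minimal numerical sketch of this two-phase structure (not the paper's algorithm 1: the tiny network, toy data, numerical gradients, and plain gradient descent for both $G$ and $\tilde G$ are all illustrative assumptions), all layers are updated before iteration $\tau$, and only the $(H-1)$th layer afterward:

```python
import numpy as np

# Illustrative sketch of the EE wrapper's two phases on a tiny network
# (assumed setup; not the authors' implementation of algorithm 1).
rng = np.random.default_rng(0)

X = rng.normal(size=(20, 4))                # toy inputs
Y = rng.normal(size=(20, 1))                # toy targets
softplus = lambda z: np.log1p(np.exp(z))    # analytic activation

W = [0.5 * rng.normal(size=(4, 8)),         # layers 1..H-2
     0.5 * rng.normal(size=(8, 8)),         # layer H-1 (exploited later)
     0.5 * rng.normal(size=(8, 1))]         # layer H (last layer)

def loss(W):
    out = softplus(softplus(X @ W[0]) @ W[1]) @ W[2]
    return float(np.mean((out - Y) ** 2))

def num_grad(W, i, eps=1e-5):
    """Central-difference gradient of the loss w.r.t. W[i] (for brevity)."""
    g = np.zeros_like(W[i])
    for idx in np.ndindex(W[i].shape):
        Wp = [w.copy() for w in W]; Wp[i][idx] += eps
        Wm = [w.copy() for w in W]; Wm[i][idx] -= eps
        g[idx] = (loss(Wp) - loss(Wm)) / (2 * eps)
    return g

tau, T, lr = 60, 120, 0.01
initial_loss = loss(W)
for t in range(T):
    layers = range(3) if t < tau else [1]   # exploration: all layers;
    for i in layers:                        # exploitation: layer H-1 only
        W[i] -= lr * num_grad(W, i)
final_loss = loss(W)
```

The same loop structure also accommodates different optimizers per phase by swapping the update rule used after iteration $\tau$.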

#### 3.2.2 Model Modification

### 3.3 Convergence Analysis

In this section, we establish global convergence of the EE wrapper $A$ without using the assumptions from the previous section. Let $\tau$ be an arbitrary positive integer and $\epsilon$ be an arbitrary positive real number. Let $(\theta_t)_{t=0}^{\infty}$ be a sequence generated by the EE wrapper $A$. We define $\hat{L}(\theta^{(H-1)}) = L(\theta^{(1:H-2)}_\tau, \theta^{(H-1)}, \theta^{(H)}_\tau)$ and $B_{\bar\epsilon} = \min_{\theta^{(H-1)} \in \Theta_{\bar\epsilon}} \|\theta^{(H-1)} - \theta^{(H-1)}_\tau\|$, where $\Theta_{\bar\epsilon} = \operatorname{argmin}_{\theta^{(H-1)}} \max(\hat{L}(\theta^{(H-1)}), \bar\epsilon)$ for any $\bar\epsilon \ge 0$.
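As a toy illustration of the quantity $B_{\bar\epsilon}$ (the distance from the starting point to the argmin set of $\max(\hat L(\cdot), \bar\epsilon)$), the sketch below computes it for a hypothetical one-dimensional quadratic; the function $\hat L(w) = (w-2)^2$ and starting point $w_\tau = 0$ are illustrative assumptions, not objects from the paper:

```python
import numpy as np

# Hypothetical 1-D example: L_hat(w) = (w - 2)^2, starting point w_tau = 0.
# For eps > 0, the argmin set of max(L_hat, eps) is the sublevel set
# {w : L_hat(w) <= eps}, so B_eps is the distance from w_tau to that set.
w = np.linspace(-5.0, 5.0, 100001)
L_hat = (w - 2.0) ** 2
w_tau = 0.0

def B(eps):
    vals = np.maximum(L_hat, eps)
    Theta = w[vals <= vals.min() + 1e-12]   # argmin set, up to grid tolerance
    return float(np.min(np.abs(Theta - w_tau)))

B0, B1 = B(0.0), B(1.0)
print(B0, B1)   # roughly 2.0 and 1.0 (up to grid resolution)
```

Raising $\bar\epsilon$ enlarges $\Theta_{\bar\epsilon}$ and hence shrinks $B_{\bar\epsilon}$, which loosens the target in the bounds of section 3.3.4 while tightening the distance term.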

#### 3.3.1 Safe-Exploration Condition

The mathematical analysis in this section relies on the safe-exploration condition, which allows us to safely explore deep nonlinear representations in the exploration phase without getting stuck in states where $\mathrm{vec}(Y_\ell) \notin \mathrm{Col}\big(\frac{\partial \mathrm{vec}(f_X(\theta_t))}{\partial \theta_t}\big)$. The safe-exploration condition is verifiable, time independent, data dependent, and architecture dependent. The verifiability and time independence make the assumption strong enough to provide a priori guarantees before training. The data dependence and architecture dependence make the assumption weak enough to be applicable to a wide range of practical settings.

(Safe-Exploration Condition). There exist a $q \in \mathbb{R}^{m_{H-1} \times m_H}$ and a $\theta^{(1:H-2)} \in \mathbb{R}^{d_{1:H-2}}$ such that $\mathrm{rank}(\varphi(q, \theta^{(1:H-2)})) = n$.

The safe-exploration condition asks only for the existence of one parameter vector in the network architecture such that $\mathrm{rank}(\varphi(q, \theta^{(1:H-2)})) = n$; it says nothing about the training trajectory $(\theta_t)_t$. Since the matrix $\varphi(q, \theta^{(1:H-2)})$ is of size $n \times m_H m_{H-1}$, the safe-exploration condition does not require any wide layer of size $m_H \ge n$ or $m_{H-1} \ge n$. Instead, it requires $m_H m_{H-1} \ge n$. This is a significant improvement over the most closely related study (Kawaguchi & Sun, 2021), where a wide layer of size $m_H \ge n$ was required. Note that having $m_H m_{H-1} \ge n$ does not imply the safe-exploration condition; rather, $m_H m_{H-1} \ge n$ is a necessary condition for the safe-exploration condition, whereas $m_H \ge n$ or $m_{H-1} \ge n$ was a necessary condition for the assumptions in previous papers, including the most closely related study (Kawaguchi & Sun, 2021). The safe-exploration condition is verified in the experiments in section 3.4.
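The scaling $m_H m_{H-1} \ge n$ versus $m_H \ge n$ can be illustrated numerically. In the sketch below, a random Gaussian matrix stands in for $\varphi(q, \theta^{(1:H-2)})$ (an illustrative assumption, not the paper's feature matrix); it shows that an $n \times m_H m_{H-1}$ matrix can have rank $n$ even when both $m_H < n$ and $m_{H-1} < n$:

```python
import numpy as np

# A random Gaussian matrix stands in for phi(q, theta^(1:H-2)); this is an
# illustrative assumption, not the paper's actual feature matrix.
rng = np.random.default_rng(0)

n, m_H, m_Hm1 = 100, 12, 13        # both widths far below n = 100
assert m_H * m_Hm1 >= n            # necessary condition: 12 * 13 = 156 >= 100

phi = rng.normal(size=(n, m_H * m_Hm1))
r = int(np.linalg.matrix_rank(phi))
print(r)                           # full row rank n holds with probability one
```

For a generic (e.g., Gaussian) matrix, full row rank holds almost surely whenever the column count reaches $n$, which is why the product $m_H m_{H-1}$, rather than either width alone, is the relevant quantity.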

#### 3.3.2 Additional Assumptions

We also use the following assumptions:

For any $i \in [n]$, the function $\ell_i : q \mapsto \ell(q, y_i)$ is differentiable, and $\|\nabla \ell_i(q) - \nabla \ell_i(q')\| \le L_\ell \|q - q'\|$ for all $q, q' \in \mathbb{R}^{d_y}$.

For each $i \in [n]$, the functions $\theta^{(1:H-2)} \mapsto z(x_i, \theta^{(1:H-2)})$ and $q \mapsto \tilde\sigma(q)$ are real analytic.

Assumption 5 is satisfied by standard loss functions such as the squared loss $\ell(q, y) = \|q - y\|^2$ and the cross-entropy loss $\ell(q, y) = -\sum_{k=1}^{d_y} y_k \log \frac{\exp(q_k)}{\sum_{k'} \exp(q_{k'})}$. The assumptions of invexity and convexity of the function $q \mapsto \ell(q, y_i)$ in sections 3.3.3 and 3.3.4 also hold for these standard loss functions. Using $L_\ell$ from assumption 5, we define $\hat{L} = \frac{L_\ell}{n}\|Z\|^2$, where $Z \in \mathbb{R}^n$ is defined by $Z_i = \max_{j \in [m_y]} \|[\mathrm{diag}(\theta^{(H,j)}_\tau) \otimes I_{m_{H-1}}](\varphi(\theta^{(H-1,j)}_\tau, \theta^{(1:H-2)}_\tau)_{i*})^\top\|$ with $\theta^{(H,j)} = (W^{(H)}_{j*})^\top$.

Assumption 6 is satisfied by any analytic activation function, such as sigmoid, hyperbolic tangent, and the softplus activation $q \mapsto \ln(1 + \exp(\varsigma q))/\varsigma$ with any hyperparameter $\varsigma > 0$. This is because a composition of real analytic functions is real analytic, and the following are all real analytic functions of $\theta^{(1:H-2)}$: convolution, affine maps, average pooling, skip connections, and batch normalization. Therefore, the assumptions can be satisfied by a wide range of machine learning models, including deep neural networks with convolution, skip connections, and batch normalization. Moreover, the softplus activation can approximate the ReLU activation to any desired accuracy, that is, $\ln(1 + \exp(\varsigma q))/\varsigma \to \mathrm{relu}(q)$ as $\varsigma \to \infty$, where $\mathrm{relu}$ denotes the ReLU activation.
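This approximation can be checked directly: the worst-case gap between softplus and ReLU is $\ln(2)/\varsigma$ (attained at $q = 0$), so it vanishes as $\varsigma$ grows. The sketch below includes $\varsigma = 100$, the value used in the experiments of section 3.4:

```python
import numpy as np

def softplus(q, varsigma):
    # ln(1 + exp(varsigma * q)) / varsigma, computed stably via logaddexp
    return np.logaddexp(0.0, varsigma * q) / varsigma

q = np.linspace(-3.0, 3.0, 601)
relu = np.maximum(q, 0.0)

# Maximum |softplus - relu| over the grid, for increasing varsigma; the
# worst case is ln(2)/varsigma at q = 0.
gaps = {v: float(np.max(np.abs(softplus(q, v) - relu)))
        for v in (1.0, 10.0, 100.0)}
print(gaps)   # gap at varsigma = 100 is ln(2)/100, about 0.0069
```

Writing softplus via `logaddexp` avoids the overflow of a naive `log(1 + exp(varsigma * q))` for large positive $q$.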

#### 3.3.3 Global Optimality at the Limit Point

The following theorem proves the global optimality at limit points of the EE wrapper with a wide range of optimizers, including gradient descent and modified Newton methods:

Suppose assumptions 4 to 6 hold and that the function $\ell_i : q \mapsto \ell(q, y_i)$ is invex for any $i \in [n]$. Assume that there exist $\bar{c}, \underline{c} > 0$ such that $\underline{c}\,\|\nabla \hat{L}(\theta^{(H-1)}_t)\|^2 \le \nabla \hat{L}(\theta^{(H-1)}_t)^\top \hat{g}_t$ and $\|\hat{g}_t\|^2 \le \bar{c}\,\|\nabla \hat{L}(\theta^{(H-1)}_t)\|^2$ for any $t \ge \tau$. Assume that the learning rate sequence $(\alpha_t)_{t \ge \tau}$ satisfies either (i) $\epsilon \le \alpha_t \le \frac{\underline{c}(2-\epsilon)}{\hat{L}\bar{c}}$ for some $\epsilon > 0$, or (ii) $\lim_{t\to\infty} \alpha_t = 0$ and $\sum_{t=\tau}^{\infty} \alpha_t = \infty$. Then, with probability one, every limit point $\hat\theta$ of the sequence $(\theta_t)_t$ is a global minimum of $L$; that is, $L(\hat\theta) \le L(\theta)$ for all $\theta \in \mathbb{R}^d$.

#### 3.3.4 Global Optimality Gap at Each Iteration

We now present global convergence guarantees of the EE wrapper $A$ with gradient descent and SGD:

Suppose assumptions 4 to 6 hold and that the function $\ell_i : q \mapsto \ell(q, y_i)$ is convex for any $i \in [n]$. Then, with probability one, the following two statements hold:

- (Gradient descent) If $\hat{g}_t = \nabla \hat{L}(\theta^{(H-1)}_t)$ and $\alpha_t = \frac{1}{\hat{L}}$ for $t \ge \tau$, then for any $\bar\epsilon \ge 0$ and $t > \tau$, $L(\theta_t) \le \inf_{\theta \in \mathbb{R}^d} \max(L(\theta), \bar\epsilon) + \frac{B_{\bar\epsilon}^2 \hat{L}}{2(t - \tau)}$.
- (SGD) If $\mathbb{E}[\hat{g}_t \mid \theta_t] = \nabla \hat{L}(\theta^{(H-1)}_t)$ (almost surely) with $\mathbb{E}[\|\hat{g}_t\|^2] \le G^2$, and if $\alpha_t \ge 0$, $\sum_{t=\tau}^{\infty} \alpha_t^2 < \infty$, and $\sum_{t=\tau}^{\infty} \alpha_t = \infty$, then for any $\bar\epsilon \ge 0$ and $t > \tau$, $\mathbb{E}[L(\theta_{t^*})] \le \inf_{\theta \in \mathbb{R}^d} \max(L(\theta), \bar\epsilon) + \frac{B_{\bar\epsilon}^2 + G^2 \sum_{k=\tau}^{t} \alpha_k^2}{2 \sum_{k=\tau}^{t} \alpha_k}$ (3.5), where $t^* \in \operatorname{argmin}_{k \in \{\tau, \tau+1, \ldots, t\}} L(\theta_k)$.
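As a hedged numerical check of the SGD bound (with illustrative constants $B_{\bar\epsilon} = G = 1$, $\tau = 1$, and the standard step size $\alpha_k = 1/k$, which satisfies $\sum_k \alpha_k^2 < \infty$ and $\sum_k \alpha_k = \infty$), the optimality-gap term of the bound shrinks as $t$ grows:

```python
import numpy as np

# Evaluate the gap term (B^2 + G^2 * sum a_k^2) / (2 * sum a_k) from the
# SGD bound, with illustrative constants B, G and step sizes a_k = 1/k.
B, G, tau = 1.0, 1.0, 1

def gap(t):
    k = np.arange(tau, t + 1, dtype=float)
    a = 1.0 / k
    return float((B**2 + G**2 * np.sum(a**2)) / (2.0 * np.sum(a)))

gap_vals = [gap(t) for t in (10, 100, 1000, 10000)]
print(gap_vals)   # decreasing: the numerator converges while the denominator grows
```

With $\alpha_k = 1/k$, the numerator converges (to $B^2 + G^2\pi^2/6$ here) while the denominator grows like $2\ln t$, so the gap term vanishes at a logarithmic rate.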

### 3.4 Experiments

This section presents empirical evidence to support our theory and the prediction of a well-known hypothesis. We note that no related work or algorithm can guarantee global convergence in the setting of our experiments, where the model has convolution, skip connections, and batch normalization without any wide layer (of width larger than $n$). Moreover, unlike previous studies that propose new methods, our training framework works by modifying any given method.

#### 3.4.1 Sine Wave Data Set

#### 3.4.2 Image Data Sets

The standard convolutional ResNet with 18 layers (He, Zhang, Ren, & Sun, 2016) is used as the base model $\bar{f}$. We use ResNet-18 to illustrate our theory because it is used in practice and has convolution, skip connections, and batch normalization without any width larger than the number of data points; this setting is not covered by any of the previous theories of global convergence. We set the activation to the softplus function $q \mapsto \ln(1 + \exp(\varsigma q))/\varsigma$ with $\varsigma = 100$ for all layers of the base ResNet. This approximates the ReLU activation well, as shown in appendix C in the supplementary information. We employ the cross-entropy loss and $\tilde\sigma(q) = \frac{1}{1 + e^{-q}}$. We use a standard algorithm, SGD, with its standard hyperparameter setting for the training algorithm $G$ with $\tilde{G} = G$; that is, we let the minibatch size be 64, the weight decay rate be $10^{-5}$, the momentum coefficient be 0.9, the learning rate be $\alpha_t = 0.1$, and the last epoch $\hat{T}$ be 200 (with data augmentation) or 100 (without data augmentation). The hyperparameters $\epsilon$ and $\tau = \tau_0 \hat{T}$ were selected from $\epsilon \in \{10^{-3}, 10^{-5}\}$ and $\tau_0 \in \{0.4, 0.6, 0.8\}$ using only training data: we randomly divided each training data set (100%) into a smaller training set (80%) and a validation set (20%) for a grid search over the hyperparameters. See appendix B in the supplementary information for the results of the grid search and details of the experimental setting. This standard setting satisfies assumptions 5 and 6, leaving assumption 4 to be verified.

##### Verification of Assumption 4.

Table 2 summarizes the verification results for the safe-exploration condition. Because the condition requires only the existence of one pair $(\theta, q)$ satisfying it, we verified it using a $q$ randomly sampled from the standard normal distribution and a $\theta$ returned by a common initialization scheme (He et al., 2015). As $m_{H-1} = 513$ (512 plus the constant neuron for the bias term) for the standard ResNet, we set $m_H = \lceil 2(n/m_{H-1}) \rceil$ throughout all the experiments with the ResNet. For each data set, the rank condition was verified twice by two standard methods: one from Press, Teukolsky, Vetterling, and Flannery (2007) and another from Golub and Van Loan (1996).

| Data Set | $n$ | $m_{H-1}$ | $m_H$ | Assumption 4 |
|---|---|---|---|---|
| MNIST | 60,000 | 513 | 234 | Verified |
| CIFAR-10 | 50,000 | 513 | 195 | Verified |
| CIFAR-100 | 50,000 | 513 | 195 | Verified |
| Semeion | 1,000 | 513 | 4 | Verified |
| KMNIST | 60,000 | 513 | 234 | Verified |
| SVHN | 73,257 | 513 | 286 | Verified |


Note: $m_H = \lceil 2(n/m_{H-1}) \rceil$, where $n$ is the number of training data points, $m_H$ is the width of the last hidden layer, and $m_{H-1}$ is the width of the penultimate hidden layer.
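The $m_H$ column of table 2 follows directly from this rule with $m_{H-1} = 513$ and can be reproduced as:

```python
import math

# Reproduce the m_H column of table 2 from m_H = ceil(2 * n / m_{H-1}).
m_Hm1 = 513
n_by_dataset = {"MNIST": 60000, "CIFAR-10": 50000, "CIFAR-100": 50000,
                "Semeion": 1000, "KMNIST": 60000, "SVHN": 73257}
m_H = {name: math.ceil(2 * n / m_Hm1) for name, n in n_by_dataset.items()}
print(m_H)   # e.g., MNIST: 234, CIFAR-10: 195, Semeion: 4, SVHN: 286
```

Each resulting width satisfies $m_H m_{H-1} \ge 2n > n$, the necessary condition for the safe-exploration condition.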

##### Test Performance.

One well-known hypothesis is that the success of deep-learning methods partially comes from their ability to automatically learn deep nonlinear representations suitable for making accurate predictions from data (LeCun et al., 2015). As the EE wrapper $A$ preserves this representation-learning ability, the hypothesis suggests that the test performance of the EE wrapper $A$ applied to a standard method should be approximately comparable to that of the standard method. Unlike typical experimental studies, our objective here is to confirm this prediction rather than to show improvements over a previous method. We empirically confirm the prediction in tables 3 and 4, where the numbers indicate mean test errors (with standard deviations in parentheses) over five random trials. As expected, the values of $\|M(\theta_{\hat{T}}) - M(\theta_0)\|_2^2$ were also large: for example, $4.64 \times 10^{12}$ for the standard method and $3.43 \times 10^{12}$ for the wrapper $A$ of the method with the Semeion data set.

| Data Set | Standard | $A(\mathrm{Standard})$ |
|---|---|---|
| MNIST | 0.40 (0.05) | 0.30 (0.05) |
| CIFAR-10 | 7.80 (0.50) | 7.14 (0.12) |
| CIFAR-100 | 32.26 (0.15) | 28.38 (0.42) |
| Semeion | 2.59 (0.57) | 2.56 (0.55) |
| KMNIST | 1.48 (0.07) | 1.36 (0.11) |
| SVHN | 4.67 (0.05) | 4.43 (0.11) |


| Data Set | Standard | $A(\mathrm{Standard})$ |
|---|---|---|
| MNIST | 0.52 (0.16) | 0.49 (0.02) |
| CIFAR-10 | 15.15 (0.87) | 14.56 (0.38) |
| CIFAR-100 | 54.99 (2.29) | 46.13 (1.80) |


##### Training Behavior.

##### Computational Time.

The EE wrapper $A$ runs the standard SGD $G$ in the exploration phase and the SGD $\tilde{G} = G$ only on the subset of the weights $\theta^{(H-1)}$ in the exploitation phase. Thus, the computational time of the EE wrapper $A$ is similar to that of SGD in the exploration phase and tends to be faster than SGD in the exploitation phase. To confirm this, we measured computational time on the Semeion and CIFAR-10 data sets under the same computational resources (e.g., without running other jobs in parallel) on a local workstation for each method. The mean wall-clock time (in seconds) over five random trials is summarized in table 5, where the numbers in parentheses are standard deviations. As expected, the EE wrapper $A$ is slightly faster than the standard method.

| Data Set | Standard | $A(\mathrm{Standard})$ |
|---|---|---|
| Semeion | 364.60 (0.94) | 356.82 (0.67) |
| CIFAR-10 | 3616.92 (10.57) | 3604.5 (6.80) |


##### Effect of Learning Rate and Optimizer.

We also conducted experiments on the effects of learning rates and optimizers using the MNIST data set with data augmentation. Using the best learning rate from {0.2, 0.1, 0.01, 0.001} for each method (with $\tilde{G} = G =$ SGD), the mean test errors (%) over five random trials were 0.33 (0.03) for the standard base method and 0.27 (0.03) for the $A$ wrapper of the standard base method (the numbers in parentheses are standard deviations). Moreover, table 6 reports preliminary results on the effect of the optimizer, with $\tilde{G}$ set to the limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS) algorithm (and $G$ = the standard SGD). Comparing tables 3 and 6 shows that using a different optimizer in the exploitation phase can potentially improve performance. A comprehensive study of this phenomenon is left to future work.

(a) With data augmentation:

| $\epsilon \backslash \tau_0$ | 0.4 | 0.6 | 0.8 |
|---|---|---|---|
| $10^{-3}$ | 0.26 | 0.38 | 0.37 |
| $10^{-5}$ | 0.37 | 0.32 | 0.37 |

(b) Without data augmentation:

| $\epsilon \backslash \tau_0$ | 0.4 | 0.6 | 0.8 |
|---|---|---|---|
| $10^{-3}$ | 0.36 | 0.43 | 0.42 |
| $10^{-5}$ | 0.42 | 0.35 | 0.35 |


## 4 Conclusion

Despite the nonlinearity of the dynamics and the noninvexity of the objective, we have rigorously proved convergence of training dynamics to global minima for nonlinear representation learning. Our results apply to a wide range of machine learning models, allowing both underparameterization and overparameterization. For example, our results are applicable to the case where the minimum eigenvalue of the matrix $\frac{\partial \mathrm{vec}(f_X(\theta_t))}{\partial \theta_t}\big(\frac{\partial \mathrm{vec}(f_X(\theta_t))}{\partial \theta_t}\big)^\top$ is zero for all $t \ge 0$. Under the common model structure assumption, models that cannot achieve zero error for all data sets (except some "good" data sets) are shown to achieve global optimality with zero error exactly when the dynamics satisfy the data-architecture alignment condition. Our results provide guidance for choosing and designing model structures and algorithms via the common model structure assumption and the data-architecture alignment condition.

The key limitation in our analysis is the differentiability of the function $f$. For multilayer neural networks, this is satisfied by using standard activation functions, such as softplus, sigmoid, and hyperbolic tangents. Whereas softplus can approximate ReLU arbitrarily well, the direct treatment of ReLU in nonlinear representation learning is left to future work.

Our theoretical results and numerical observations uncover novel mathematical properties and provide a basis for future work. For example, we have shown global convergence under the data-architecture alignment condition $\mathrm{vec}(Y_\ell) \in \mathrm{Col}\big(\frac{\partial \mathrm{vec}(f_X(\theta_t))}{\partial \theta_t}\big)$. The EE wrapper $A$ is only one way to ensure this condition. There are many other ways to ensure the data-architecture alignment condition, and each can result in a new algorithm with guarantees.

## References

*Neural Computation*

*IEEE Transactions on Pattern Analysis and Machine Intelligence*

*Advances in neural information processing systems*

*Proceedings of the 15th International Conference on Artificial Intelligence and Statistics*

*Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition*

*NeurIPS Creativity Workshop 2019*

*Advances in neural information processing systems*

*IEEE Transactions on Audio, Speech, and Language Processing*

*Proceedings of the Eleventh Annual Conference of the International Speech Communication Association*

*Proceedings of the European Conference on Computer Vision*

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*

*Proceedings of the International Conference on Machine Learning.*

*Matrix computations*

*Proceedings of the IEEE International Conference on Computer Vision*

*Proceedings of the European Conference on Computer Vision*

*IEEE Signal Processing Magazine*

*Neural Computation*

*Advances in neural information processing systems*

*Proceedings of the International Conference on Learning Representations*

*Advances in neural information processing systems*

*Journal of Artificial Intelligence Research*

*Proceedings of the AAAI Conference on Artificial Intelligence*

*Learning multiple layers of features from tiny images*

*Advances in neural information processing systems*

*Proceedings of the International Conference on Machine Learning*

*IEEE Transactions on Audio, Speech, and Language Processing*

*Proceedings of the IEEE*

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*

*The zero set of a real analytic function.*

*IEEE Transactions on Audio, Speech, and Language Processing*

*NIPS Workshop on Deep Learning and Unsupervised Feature Learning*

*Advances in neural information processing systems*

*Advances in neural information processing systems*

*Journal of Machine Learning Research*

*Theory of deep learning III: Explaining the non-overfitting puzzle*

*Numerical recipes: The art of scientific computing*

*Advances in neural information processing systems*


*Proceedings of the International Conference on Learning Representations*

*Proceedings of the NAACL-HLT 2012 Workshop: Will We Ever Really Replace the N-gram Model? On the Future of Language Modeling for HLT*

*Proceedings of the Twelfth Annual Conference of the International Speech Communication Association*

*Advances in neural information processing systems*

*Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing*

*Semeion handwritten digit data set*

*Journal of Systems and Software*

*Proceedings of the International Conference on Machine Learning*

*Feature engineering for machine learning: Principles and techniques for data scientists*

*Proceedings of the International Conference on Learning Representations*