## Abstract

A hierarchical neural network usually has many singular regions in its parameter space due to the degeneration of hidden units. Here we focus on a three-layer perceptron, which has one-dimensional singular regions comprising both attractive and repulsive parts. Such a singular region is often called a Milnor-like attractor. It is empirically known that in the vicinity of a Milnor-like attractor, several parameters converge much faster than the rest and that the dynamics can be reduced to smaller-dimensional ones. Here we give a rigorous proof of this phenomenon based on center manifold theory. As an application, we analyze the reduced dynamics near the Milnor-like attractor and study the stochastic effects of online learning.

## 1 Introduction

In this article, we treat supervised learning. The goal of supervised learning is to find an optimal parameter $\theta$ so that $f^{(d)}(x;\theta)$ approximates a given target function $T(x)$. Such a problem is grounded in the universal approximation property of the three-layer perceptron: for a suitable activation function $\varphi$ (e.g., sigmoidal or ReLU), the model, equation 1.2, can approximate quite a wide range of functions as the number $d$ of hidden units tends to infinity (Cybenko, 1989; Funahashi, 1989; Sonoda & Murata, 2017).
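As a concrete illustration, the single-output case ($m = 1$) of the model in equation 1.2 can be sketched as follows; the weight names, the choice $\varphi = \tanh$, and the sample values are our own for illustration, not the article's notation.

```python
import numpy as np

def perceptron(x, W, v, phi=np.tanh):
    """Three-layer perceptron f^(d)(x; theta) = sum_i v_i * phi(w_i . x).

    W is the (d, n) matrix of input-to-hidden weights w_i,
    v is the (d,) vector of hidden-to-output weights (m = 1, no biases).
    """
    return v @ phi(W @ x)

# A (2-3-1)-perceptron with d = 3 hidden units
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
v = rng.normal(size=3)
y = perceptron(np.array([0.5, -1.0]), W, v)
```

Increasing $d$ enlarges the class of representable functions, which is the content of the universal approximation results cited above.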

In practical learning, the parameter $\theta$ is updated iteratively, typically by the *stochastic gradient descent method*.

Fukumizu and Amari (2000) studied singular regions arising from the degeneration of hidden units of a three-layer perceptron. Here, the degeneration of hidden units means that several weight parameters $w_i$ take the same value, so that the effective number of hidden units becomes fewer than $d$. When $m = 1$, they found a novel type of singular region, often called a *Milnor-like attractor*. This region has both an attractive part consisting of local minima of $L^{(d)}$ and a repulsive part consisting of saddle points. In practical learning, there may be some stochastic effects. Therefore, once the parameter $\theta$ is trapped in the attractive part of this region, it fluctuates in the region by stochastic effects for a long time until it reaches the repulsive part. This may cause serious stagnation of learning, called *plateau phenomena*. Later, Amari, Ozeki, Karakida, Yoshida, and Okada (2018) pointed out the notable fact that a Milnor-like attractor may not cause serious stagnation of learning when $m \geq 2$, which is also treated in this article.

The objective of this article is to provide a solid foundation for Amari et al.'s (2018) point of view. We introduce a new coordinate system that admits a center manifold structure around a special point on the Milnor-like attractor. Using this coordinate system, we can analyze the Milnor-like attractor more rigorously and integrate the reduced dynamical system explicitly to obtain analytical trajectories. The obtained trajectories are comparable to those of the preceding work. Numerical simulations in several settings confirm that trajectories in actual learning agree with the analytical ones.

In addition to the averaged gradient descent method, we also address online learning, a stochastic gradient descent method. Around a Milnor-like attractor, the behavior of sample paths of online learning appears qualitatively different from that of trajectories of the averaged gradient descent. To investigate why they differ, we divide the dynamics of the parameters into fast and slow ones, as in the averaged gradient descent. In numerical simulations, we observed that the fast parameters fluctuate intensely around the center manifold of the averaged system. We show that such a deviation of the fast parameters from the center manifold can influence the trend of the slow parameters.

This article is organized as follows. In section 2, we give a quick review of Amari et al.'s (2018) work. In section 3, after a brief account of center manifold theory, we introduce a new coordinate system in the parameter space and prove that it admits the center manifold structure. In section 4, we carry out numerical simulations and observe the center manifold structure around a Milnor-like attractor. In section 5, we consider online learning from the viewpoint of center manifold theory. Section 6 offers concluding remarks.

## 2 Singular Region and Milnor-Like Attractor

In this section, we give a quick review of the Milnor-like attractor that Fukumizu and Amari (2000) found, which appears when the number $m$ of output units is equal to 1. We also mention an interesting insight by Amari et al. (2018) for the case $m \geq 2$.

On such a singular region, some properties of $L^{(1)}$ are inherited by $L^{(2)}$. The following lemma implies that criticality is a hereditary property.

Let $\theta^* = (w^*, v^*)$ be a critical point of $L^{(1)}$. Then the parameter $\theta_\lambda = (w_1, w_2, v_1, v_2) = (w^*, w^*, \lambda v^*, (1-\lambda)v^*)$ is a critical point of $L^{(2)}$ for any $\lambda \in \mathbb{R}$.
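The mechanism behind this lemma is that the network function itself does not change along the embedded line: when both hidden units share the weight $w^*$, the output weights $\lambda v^*$ and $(1-\lambda)v^*$ simply sum to $v^*$. A quick numerical check (with hypothetical values for $(w^*, v^*)$ and $\varphi = \tanh$, chosen only for illustration) confirms that $f^{(2)}(x; \theta_\lambda) = f^{(1)}(x; \theta^*)$ for every $\lambda$, so $L^{(2)}$ is constant on the region.

```python
import numpy as np

def f1(x, w, v, phi=np.tanh):
    # (n-1-1)-perceptron: a single hidden unit
    return v * phi(w @ x)

def f2(x, w1, w2, v1, v2, phi=np.tanh):
    # (n-2-1)-perceptron: two hidden units
    return v1 * phi(w1 @ x) + v2 * phi(w2 @ x)

rng = np.random.default_rng(0)
w_star = rng.normal(size=3)   # hypothetical critical point (w*, v*) of L^(1)
v_star = 1.5
x = rng.normal(size=3)

# Along theta_lambda = (w*, w*, lambda v*, (1 - lambda) v*), the two hidden
# units share the same weight w*, so their contributions merge and lambda
# cancels from the output.
vals = [f2(x, w_star, w_star, lam * v_star, (1 - lam) * v_star)
        for lam in (-1.0, 0.0, 0.3, 1.0, 2.0)]
```

Because the output is $\lambda$-independent, the loss is flat along the line, which is why criticality of $\theta^*$ for $L^{(1)}$ is inherited by every $\theta_\lambda$ for $L^{(2)}$.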

When $m = 1$, in particular, every point $\theta \in R(w^*, v^*)$ is a critical point of $L^{(2)}$, since the parameter $v$, as well as the output $f^{(d)}(x;\theta)$, is a scalar quantity. In this case, the second-order properties of $L^{(1)}$ are also inherited by $L^{(2)}$ to some extent, and the singular region $R(w^*, v^*)$ may have an interesting structure, which causes serious stagnation of learning.

This proposition implies that the one-dimensional region $R(w^*, v^*) = \{\theta_\lambda \mid \lambda \in \mathbb{R}\}$ may have both attractive parts and repulsive parts under the gradient descent method. Such a region is referred to as a *Milnor-like attractor* (Wei et al., 2008). A parameter $\theta$ near the attractive part flows into the Milnor-like attractor and fluctuates in the region for a long time until it reaches the repulsive part.

The original theorem (Fukumizu & Amari, 2000) is stated for an ($n$-$d$-1)-perceptron containing an ($n$-($d$-1)-1)-perceptron as a subnetwork, so the phenomenon itself is universal with respect to the number $d$ of hidden units. The proposition above, for an ($n$-2-1)-perceptron, is a minimal version.

We also remark that the point $\theta_\lambda$ cannot be a strict local minimizer, since $L^{(2)}$ takes the same value on the singular region $\{\theta_\lambda \mid \lambda \in \mathbb{R}\}$ and is flat along its direction. The proof of proposition 1 relies mainly on an analysis of the Hessian matrix of $L^{(2)}$; however, we need to treat derivatives of $L^{(2)}$ of order higher than two, since the Hessian matrix degenerates on the singular region (see appendix A).

We next treat the case $m \geq 2$. There also exists a one-dimensional region consisting of critical points, due to lemma 1. However, in this case, the region is simply repulsive and has no attractive part, as the following theorem asserts.

## 3 Center Manifold of Milnor-Like Attractor

In this section, we give a rigorous justification for their hypothesis. We first give a quick review of the center manifold theory and then introduce a new coordinate system under which the center manifold structures arise near certain points on the Milnor-like attractor.

### 3.1 Brief Review of Center Manifold

A local invariant manifold represented in the form $y = h(x)$ is called a local center manifold (or simply a center manifold) if $h$ is differentiable and satisfies $h(0) = 0$ and $\frac{\partial h}{\partial x}(0) = O$.

The following center manifold theorems give us a method of simplifying a dynamical system around an equilibrium point.
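For orientation, a standard textbook example (not taken from this article) shows how the reduction works. Consider the planar system

```latex
\dot{x} = xy, \qquad \dot{y} = -y + ax^2 ,
```

whose linearization at the origin has eigenvalues $0$ and $-1$. Substituting $y = h(x)$ into the invariance condition $h'(x)\,\dot{x} = \dot{y}$ yields $h(x) = ax^2 + O(x^4)$, and the flow on the center manifold reduces to the one-dimensional equation $\dot{x} = x\,h(x) = ax^3 + O(x^5)$: the $y$-direction collapses exponentially fast, while the long-time behavior is governed by the slow $x$-dynamics. The same structure, fast directions contracting onto a manifold parametrized by slow ones, is what we establish below for the learning dynamics.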

### 3.2 Main Results

We focus on the dynamics of the learning process around the two points $\theta = \theta_0, \theta_1$, which are the boundaries between the repulsive and attractive parts of a Milnor-like attractor $\{\theta_\lambda \mid \lambda \in \mathbb{R}\}$. Concretely, they are given by $\theta_0 = (w^*, w^*, 0, v^*)$ and $\theta_1 = (w^*, w^*, v^*, 0)$, where $(w^*, v^*)$ is a minimizer of the loss $L^{(1)}$ for the ($n$-1-1)-perceptron, as mentioned in proposition 1. This is because the rank of the Hessian matrix at $\theta_\lambda$ degenerates by one dimension for $\lambda \neq 0, 1$ and by $n+2$ dimensions for $\lambda = 0, 1$, as shown in appendix A.

In the coordinate system $\xi = (w, v, u, z)$, the dynamical system, equation 1.5, admits a center manifold structure around the critical points $\theta = \theta_0, \theta_1$, in which $(w, v)$ converges exponentially fast.

To prove the theorem, we make use of the following lemma.

If the matrix $X$ is positive definite and $Y$ is positive semidefinite, then all the eigenvalues of the matrix $XY$ are nonnegative.
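The lemma holds because $XY$, while generally not symmetric, is similar to the symmetric positive-semidefinite matrix $X^{1/2} Y X^{1/2}$. A quick numerical sanity check (our own, assuming numpy) with a randomly generated positive-definite $X$ and rank-deficient positive-semidefinite $Y$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n - 2))   # rank-deficient factor, so Y is only semidefinite
X = A @ A.T + np.eye(n)           # positive definite
Y = B @ B.T                       # positive semidefinite

# XY is similar to X^{1/2} Y X^{1/2}, hence its eigenvalues are real
# and nonnegative even though XY itself is not symmetric.
eig = np.linalg.eigvals(X @ Y)
```

Two of the eigenvalues are (numerically) zero here, reflecting the rank deficiency of $Y$.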

Due to proposition 2, there are center manifolds parametrized by $(u, z)$ around $\theta = \theta_0$ and $\theta = \theta_1$, respectively.

### 3.3 Reduced Dynamical System

We remark that theorem 2 is valid even when there exists a true parameter in the singular region $R(w^*, v^*)$; however, in this case, the simple form of the reduced dynamical system, equation 3.9, is not obtained. As mentioned above, this case implies that $H$ becomes the zero matrix. Then the second-order terms of the reduced dynamical system, equation 3.9, vanish, and the third-order terms become dominant. Thus, we have to take into account the cross terms between $(w - w^*, v - v^*)$ and $(u, z - v^*)$. This requires calculating the center manifold $(w, v) = h(u, z)$ up to second order, which makes the analysis complicated.

Finally, we remark on a difference between our analysis and previous work. Wei et al. (2008) have studied a reduced dynamical system in the vicinity of the whole part of a Milnor-like attractor. On the other hand, a center manifold is defined locally, and center manifolds around each of two points cannot be connected at a midpoint in general. Thus, one cannot discuss a center manifold defined around the entire region of a Milnor-like attractor.

### 3.4 More General Models

## 4 Numerical Simulations

In the previous section, we showed that the dynamics of $(w,v)$ are fast and those of $(u,z)$ are slow under the coordinate system, equation 3.5. In this section, we verify this fact by numerical simulations.

### 4.1 Example 1

In this simulation, we set the size $S$ of the data set to 1000, and the data $\{x_s\}_{s=1}^{S}$ are drawn independent and identically distributed (i.i.d.) from $N(0, 2^2)$. Here, $N(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

For a data set given as above, we obtained a local minimizer $\theta^* = (w^*, v^*) = (0.459, 1.15)$ of $L^{(1)}$. The shape of the function corresponding to this local minimizer is shown by the dashed blue line in Figure 2. The value of $H$ is approximately 0.0472. Since $H > 0$, the attractive region is $\{\theta_\lambda \mid \lambda \in (0, 1)\}$, due to proposition 1.
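A minimal sketch of this kind of experiment is as follows; the target function $\tanh(x)$, the initial point, and the learning rate are our own stand-ins for illustration, not the exact setting of example 1.

```python
import numpy as np

rng = np.random.default_rng(42)
S = 1000
xs = rng.normal(0.0, 2.0, size=S)   # data drawn i.i.d. from N(0, 2^2)

def sgm(t):
    return 1.0 / (1.0 + np.exp(-t))

T = np.tanh(xs)                      # hypothetical target function values

def loss_and_grad(theta):
    w1, w2, v1, v2 = theta
    h1, h2 = sgm(w1 * xs), sgm(w2 * xs)
    r = v1 * h1 + v2 * h2 - T        # residual of the (1-2-1)-perceptron
    # averaged squared loss L^(2) and its gradient over the data set
    L = 0.5 * np.mean(r ** 2)
    g = np.array([np.mean(r * v1 * h1 * (1 - h1) * xs),
                  np.mean(r * v2 * h2 * (1 - h2) * xs),
                  np.mean(r * h1),
                  np.mean(r * h2)])
    return L, g

theta = np.array([0.46, 0.45, 0.6, 0.55])   # initialized near w1 ≈ w2
losses = []
for _ in range(200):
    L, g = loss_and_grad(theta)
    losses.append(L)
    theta -= 0.1 * g                         # averaged gradient descent step
```

Starting near $w_1 \approx w_2$ puts the trajectory close to the singular region, where the fast/slow separation of the previous section can be observed by tracking the coordinates $\xi = (w, v, u, z)$ along `theta`.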

### 4.2 Example 2

## 5 Aspects of Online Learning

In this section, we consider *online learning*, a typical stochastic gradient descent method, in which the parameter is updated at each step using the gradient of the loss for a single freshly drawn sample.

We deduce that the fluctuation of the parameter around a center manifold gives rise to constants $C_1$ and $C_2$ that act as drift terms, and that this makes the dynamics of online learning qualitatively different from those of the averaged gradient descent. This example suggests that stochastic effects can influence the macroscopic flow of the learning process via a center manifold structure.
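The contrast with the averaged dynamics can be reproduced with a minimal online learning sketch (the target $\tanh(x)$, the initial point, and the step size are hypothetical stand-ins): because each update uses a single sample, the fast coordinates keep fluctuating around the center manifold instead of settling onto it.

```python
import numpy as np

rng = np.random.default_rng(7)

def sgm(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad_single(theta, x, t):
    # gradient of the per-sample loss 0.5 * (f(x; theta) - t)^2
    w1, w2, v1, v2 = theta
    h1, h2 = sgm(w1 * x), sgm(w2 * x)
    r = v1 * h1 + v2 * h2 - t
    return np.array([r * v1 * h1 * (1 - h1) * x,
                     r * v2 * h2 * (1 - h2) * x,
                     r * h1,
                     r * h2])

theta = np.array([0.46, 0.45, 0.6, 0.55])   # near the singular region
eps = 0.05
path = [theta.copy()]
for _ in range(1000):
    x = rng.normal(0.0, 2.0)    # one sample drawn per step
    t = np.tanh(x)              # hypothetical target value
    theta = theta - eps * grad_single(theta, x, t)
    path.append(theta.copy())
path = np.array(path)           # sample path of the online dynamics
```

Projecting `path` onto the fast coordinates $(w, v)$ and the slow coordinates $(u, z)$ exhibits the rapid fluctuation around the center manifold discussed above.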

## 6 Conclusion

In this article, we first gave a quick review of a mechanism that causes plateau phenomena in a three-layer perceptron—in particular, how degeneration of hidden units gives rise to a Milnor-like attractor consisting of both attractive and repulsive parts. We next investigated the dynamics of learning around special points on a Milnor-like attractor and proved the existence of the center manifold. We also succeeded in integrating the reduced dynamical system to obtain an analytical form of a trajectory. We performed several numerical simulations to demonstrate the accuracy of our results. As an application of the center manifold structure, we gave an explanation for a characteristic behavior of online learning.

Unfortunately, the two examples presented in section 4 were the only ones we could find in which the assumptions of proposition 1 are fulfilled. This might suggest that the appearance of a Milnor-like attractor is a rather rare situation in a perceptron that has bias terms. In fact, merely replacing the activation function $\mathrm{Sgm}$ with $\tanh$ in example 2 makes the matrix $H$ indefinite, violating the assumption of proposition 1. Finding more suggestive examples that shed light on the complex behavior of the dynamics of learning is an important subject for future study.

In section 5, we investigated stochastic effects of online learning from an intermediate viewpoint between fully stochastic and averaged dynamics. We made use of the center manifold of the averaged dynamics and discussed integrating the dynamics over the rapidly fluctuating parameters. There have been many reports of qualitative differences between stochastic and deterministic methods; however, there are few general theories for analyzing such dissimilarities. We expect that the intermediate viewpoint of this article can be a clue to clarifying stochastic effects in learning.

## Appendix: Proofs of Proposition 1 and Theorem 1

This appendix gives the proofs of proposition 1 and theorem 1. The proof of proposition 1 is based on an analysis of the Hessian matrix of $L^{(2)}$. However, the Hessian at the point $\theta_\lambda$ becomes singular, since $L^{(2)}(\theta_\lambda)$ is constant along $\lambda \in \mathbb{R}$. Thus, we need to take into account higher-order derivatives of $L^{(2)}$, which was overlooked in Fukumizu and Amari (2000). The prototype of theorem 1 was given by Amari et al. (2018); however, they proved it only for a special case. Here, we give a rigorous proof under an additional mild assumption.

**Proof of Proposition 1.** We introduce a new coordinate system $\xi = (w, v, u, z)$ by

**Proof of Theorem 1.** Since $v_1$ and $v_2$ are no longer scalars, we cannot use the coordinate system given by equation A.1. Therefore, in order to analyze the Hessian matrix, we introduce another coordinate system $\xi = (w, v, u, z)$ as

Fix $\lambda \in \mathbb{R}$ arbitrarily. We show that the Hessian matrix $\mathrm{Hess}(\xi_\lambda)$ of $L^{(2)}(\xi)$ at $\xi = \xi_\lambda$ has both positive and negative eigenvalues. It suffices to show that the $(w, v)$-part of the Hessian is positive definite and that the $(u, z)$-part is not positive semidefinite. These imply that the full Hessian matrix $\mathrm{Hess}(\xi_\lambda)$ is neither negative semidefinite nor positive semidefinite, and hence $\mathrm{Hess}(\xi_\lambda)$ is indefinite.

One can check by direct calculation that the $(w, v)$-part of $\mathrm{Hess}(\xi_\lambda)$ is equal to the Hessian matrix of $L^{(1)}$ at $\theta^*$. Since $\theta^*$ is a strict local minimizer of $L^{(1)}$, this part is positive definite.

## Acknowledgments

I express my gratitude to Akio Fujiwara for his helpful guidance, discussions, and advice. I also thank Yuzuru Sato for many discussions and helpful comments, and the anonymous referees for their insightful suggestions to improve this article.