A hierarchical neural network usually has many singular regions in the parameter space due to the degeneration of hidden units. Here, we focus on a three-layer perceptron, which has one-dimensional singular regions comprising both attractive and repulsive parts. Such a singular region is often called a Milnor-like attractor. It is empirically known that in the vicinity of a Milnor-like attractor, several parameters converge much faster than the rest and that the dynamics can be reduced to smaller-dimensional ones. Here we give a rigorous proof of this phenomenon based on center manifold theory. As an application, we analyze the reduced dynamics near the Milnor-like attractor and study the stochastic effects of online learning.

A three-layer perceptron is one of the simplest hierarchical learning machines. Mathematically, it is defined by the function
f^{(d)}(x;\theta) = \sum_{i=1}^{d} v_i\,\phi(w_i \cdot x + b_i), \qquad x \in \mathbb{R}^n, \quad \theta = (w_1, \dots, w_d, b_1, \dots, b_d, v_1, \dots, v_d),
(1.1)
where θ is the system parameter, with w_1, …, w_d ∈ ℝ^n the weight vectors of the second layer, b_1, …, b_d ∈ ℝ the bias terms of the second layer, v_1, …, v_d ∈ ℝ^m the weight vectors of the third layer, and ϕ an activation function. Throughout this article, we assume that the activation function is twice differentiable. Figure 1 is a schematic diagram of the three-layer perceptron. We shall call the function 1.1 an (n-d-m)-perceptron. The numbers n and m are fixed at the outset as the sizes of the input and output vectors, while the number d of hidden units can be varied in our analysis. For notational simplicity, we incorporate the bias b in the weight w as w = (b, w_1, …, w_n), and accordingly enlarge x as x = (1, x_1, …, x_n). With these conventions, we obtain the abridged presentation of the three-layer perceptron as
f^{(d)}(x;\theta) = \sum_{i=1}^{d} v_i\,\phi(w_i \cdot x).
(1.2)
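As a concrete reference for later sections, the following minimal NumPy sketch evaluates the abridged model, equation 1.2, with the bias absorbed into the weights. The function name, the tanh activation, and the numerical values are illustrative assumptions and not part of the original formulation.

import numpy as np

def perceptron_forward(x, W, V, phi=np.tanh):
    # Evaluate f^(d)(x; theta) = sum_i v_i * phi(w_i . x) with x enlarged to (1, x_1, ..., x_n).
    # W has shape (d, n + 1): row i is w_i = (b_i, w_i1, ..., w_in).
    # V has shape (d, m): row i is v_i.
    x_aug = np.concatenate(([1.0], np.asarray(x, dtype=float)))
    hidden = phi(W @ x_aug)          # phi(w_i . x), shape (d,)
    return hidden @ V                # sum_i v_i * phi(w_i . x), shape (m,)

# Example: a (1-2-1)-perceptron with zero biases.
W = np.array([[0.0, 1.0], [0.0, 4.0]])
V = np.array([[2.0], [-1.0]])
print(perceptron_forward([0.5], W, V))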
Figure 1: A schematic diagram of a three-layer perceptron presented in equation 1.1.

In this article, we treat supervised learning. The goal of supervised learning is to find an optimal parameter θ so that f(d)(x;θ) approximates a given target function T(x). Such an approach rests on the universal approximation property of the three-layer perceptron: for a suitable activation function ϕ (e.g., sigmoidal or ReLU), the model, equation 1.2, can approximate quite a wide range of functions as the number d of hidden units tends to infinity (Cybenko, 1989; Funahashi, 1989; Sonoda & Murata, 2017).

The (averaged) gradient descent method is a standard method for finding an optimal value of θ numerically. Suppose that a loss function ℓ(x, y) is nonnegative and is equal to zero if and only if y = T(x) (e.g., the squared error ‖y − T(x)‖²). In the gradient descent method, we aim at minimizing
L^{(d)}(\theta) := E_x\!\left[\ell\!\left(x, f^{(d)}(x;\theta)\right)\right]
(1.3)
by changing the parameter θ iteratively as
\theta_{t+1} = \theta_t - \varepsilon\,\frac{\partial L^{(d)}}{\partial \theta}(\theta_t),
(1.4)
where ε > 0 is a learning constant. Here, we assume that the input vector x is a random variable drawn according to an unknown probability distribution, and E_x denotes the expectation with respect to x. In order for the derivative ∂L(d)/∂θ to make sense, we also assume that ℓ(x, y) is differentiable with respect to y and that we can interchange the order of the differentiation ∂/∂θ and the expectation E_x. We study the dynamical system that represents the averaged gradient descent method with an infinitesimal learning constant:
\frac{d\theta}{dt} = -\frac{\partial L^{(d)}}{\partial \theta}(\theta).
(1.5)
The parameter θ descends along the gradient of L(d) into a local minimum. In practice, the expectation in equation 1.3 is replaced with the arithmetic mean over a given data set or, roughly, with a single realization of the random variable ℓ(x, f(d)(x;θ)) for each learning iteration. Such a learning method, which involves some stochastic effects, is called a stochastic gradient descent method.
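As a sketch of the two update rules discussed here (a hedged illustration: the squared-error loss, the finite data set standing in for the expectation E_x, and the helper names are all assumptions), one averaged step of equation 1.4 and one stochastic step could look as follows.

import numpy as np

def loss_grad(theta, x, target, model, model_grad):
    # Gradient of the squared-error loss l(x, y) = ||y - T(x)||^2 at y = f(x; theta),
    # by the chain rule: 2 * J_f(x; theta)^T (f(x; theta) - T(x)).
    residual = model(x, theta) - target(x)
    return 2.0 * model_grad(x, theta).T @ residual

def averaged_step(theta, xs, target, model, model_grad, eps=0.05):
    # One update of equation 1.4, with E_x replaced by the mean over the data set xs.
    g = np.mean([loss_grad(theta, x, target, model, model_grad) for x in xs], axis=0)
    return theta - eps * g

def stochastic_step(theta, x, target, model, model_grad, eps=0.05):
    # One stochastic (online) update using a single realization x.
    return theta - eps * loss_grad(theta, x, target, model, model_grad)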

Fukumizu and Amari (2000) studied singular regions arising from the degeneration of hidden units of a three-layer perceptron. Here, degeneration of hidden units means that several weight parameters wi take the same value, so that the effective number of hidden units becomes smaller than d. When m = 1, they found a novel type of singular region, often called a Milnor-like attractor. This region has both an attractive part consisting of local minima of L(d) and a repulsive part consisting of saddle points. In practical learning, there may be some stochastic effects. Therefore, once the parameter θ is trapped in the attractive part of this region, it fluctuates in the region under stochastic effects for a long time, until it reaches the repulsive part. This may cause serious stagnation of learning, called plateau phenomena. Later, Amari, Ozeki, Karakida, Yoshida, and Okada (2018) pointed out the notable fact that a Milnor-like attractor may not cause serious stagnation of learning when m ≥ 2, a case that is also treated in this article.

More quantitative analyses for m = 1 have also been carried out by Cousseau, Ozeki, and Amari (2008), Wei, Zhang, Cousseau, Ozeki, and Amari (2008), and Amari et al. (2018) in the simplest case d = 2. In particular, Wei et al. (2008) introduced a new coordinate system in the parameter space by
w = \frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}, \qquad v = v_1 + v_2, \qquad u = w_1 - w_2, \qquad z = \frac{v_1 - v_2}{v_1 + v_2},
(1.6)
and claimed, based on evidence observed in numerical simulations, that when the initial point is taken near a Milnor-like attractor, the parameters (w,v) quickly converge to equilibrium values (w*,v*). They hypothesized that this would always be the case and analyzed only the reduced dynamical system for the subparameters (u,z), setting the remaining parameters (w,v) to be (w*,v*). However, to the best of our knowledge, no mathematical justification for this hypothesis has been established.

The objective of this article is to provide a solid foundation for Amari et al.'s (2018) point of view. We introduce a new coordinate system that admits a center manifold structure around a special point on the Milnor-like attractor. Using this coordinate system, we can analyze the Milnor-like attractor more rigorously and integrate the reduced dynamical system explicitly to obtain analytical trajectories. The obtained trajectories are comparable to those of the preceding work. Numerical simulations in several settings confirm that trajectories in actual learning agree with the analytical ones.

In addition to the averaged gradient descent method, we also address online learning, a stochastic gradient descent method. Around a Milnor-like attractor, the behavior of sample paths of online learning seems qualitatively different from that of trajectories of the averaged gradient descent. To investigate why they differ, we divide the dynamics of the parameters into fast and slow ones, as in the averaged gradient descent. In this case, we observe in numerical simulations that the fast parameters fluctuate intensively around the center manifold of the averaged system. We show that such a deviation of the fast parameters from the center manifold can influence the trend of the slow parameters.

This article is organized as follows. In section 2, we give a quick review of Amari et al.'s (2018) work. In section 3, after a brief account of the center manifold theory, we introduce a new coordinate system in the parameter space and prove that it admits the center manifold structure. In section 4, we carry out numerical simulations and observe the center manifold structure around a Milnor-like attractor. In section 5, we consider the online learning from the viewpoint of the center manifold theory. Section 6 offers concluding remarks.

In this section, we give a quick review of the Milnor-like attractor found by Fukumizu and Amari (2000), which appears when the number m of output units is equal to 1. We also mention an interesting insight by Amari et al. (2018) for the case m ≥ 2.

The parameter space of a perceptron is sometimes called a perceptron manifold. However, in many cases, it is not really a manifold since it usually contains a subset whose points correspond to the same input-output relation. Such a subset is usually referred to as a singular region. In general, there are many singular regions due to the degeneration of hidden units. For example, let us consider an (n-2-m)-perceptron. Then the subset
R(w, v) := \{\theta = (w_1, w_2, v_1, v_2) \mid w_1 = w_2 = w,\ v_1 + v_2 = v\}
of the parameter space forms a typical singular region. In fact, on the subset R(w,v), an (n-2-m)-perceptron f(2)(x;θ) is reduced to the following (n-1-m)-perceptron:
f^{(1)}(x; w, v) := v\,\phi(w \cdot x).

On such a singular region, some properties of L(1) are inherited by L(2). The following lemma implies that criticality is such a hereditary property.

Lemma 1.

Let θ* = (w*, v*) be a critical point of L(1). Then the parameter θ = (w_1, w_2, v_1, v_2) = (w*, w*, λv*, (1 − λ)v*) is a critical point of L(2) for any λ ∈ ℝ.

Proof.
\frac{\partial L^{(2)}}{\partial w_i}(\theta) = E_x\!\left[\left(\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x;\theta^*)\big) \cdot v_i\right)\phi'(w^* \cdot x)\,x^T\right] = \lambda_i\,\frac{\partial L^{(1)}}{\partial w}(\theta^*), \qquad
\frac{\partial L^{(2)}}{\partial v_i}(\theta) = E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x;\theta^*)\big)\,\phi(w^* \cdot x)\right] = \frac{\partial L^{(1)}}{\partial v}(\theta^*), \qquad i = 1, 2,
where λ1:=λ and λ2:=1-λ. Since θ* is a critical point of L(1), these are all zero.

When m = 1, in particular, every point θ ∈ R(w*,v*) is a critical point of L(2), since the parameter v, as well as the output f(d)(x;θ), is a scalar quantity. In this case, the second-order properties of L(1) are also inherited by L(2) to some extent, and the singular region R(w*,v*) may have an interesting structure, which causes serious stagnation of learning.

Proposition 1
(Fukumizu & Amari, 2000). Let m = 1 and θ* = (w*, v*) be a strict local minimizer of L(1) with v* ≠ 0. Define an (n+1) × (n+1) matrix H by
H := E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x;\theta^*)\big)\,v^*\,\phi''(w^* \cdot x)\,x x^T\right],
(2.1)
and for λ ∈ ℝ, let
\theta_\lambda := (w^*, w^*, \lambda v^*, (1-\lambda)v^*).
If the matrix H is positive (resp. negative) definite, then the point θ = θλ is a local minimizer (resp. saddle point) of L(2) for any λ ∈ (0, 1) and is a saddle point (resp. local minimizer) for any λ ∈ ℝ∖[0, 1]. On the other hand, if the matrix H is indefinite, then the point θλ is a saddle point of L(2) for all λ ∈ ℝ∖{0, 1}.

This proposition implies that the one-dimensional region R(w*,v*) = {θλ | λ ∈ ℝ} may have both attractive parts and repulsive parts under the gradient descent method. Such a region is referred to as a Milnor-like attractor (Wei et al., 2008). A parameter θ near the attractive part flows into the Milnor-like attractor and fluctuates in the region for a long time, until it reaches the repulsive part.

The original theorem (Fukumizu & Amari, 2000) is stated for an (n-d-1)-perceptron that contains an (n-(d-1)-1)-perceptron as a subnetwork, so the phenomenon itself is universal with respect to the number d of hidden units. The proposition above, for an (n-2-1)-perceptron, is a minimal version.

We also remark that the point θλ cannot be a strict local minimizer, since L(2) takes the same value on the singular region {θλ | λ ∈ ℝ} and is flat along its direction. The proof of proposition 1 relies mainly on the Hessian matrix of L(2); however, we need to treat derivatives of L(2) of order higher than two, since the Hessian matrix degenerates on the singular region (see appendix A).

Let us now consider a situation in which a three-layer perceptron has some redundant hidden units for representing the target function T(x). Mathematically, we suppose that a true parameter θtrue exists (T(x) = f(2)(x;θtrue)) and that it lies in the singular region R(w*,v*). In this case, the function L(2) takes the same value L(1)(w*,v*) = 0 on R(w*,v*). Therefore, every point of R(w*,v*) becomes a global minimizer of L(2), and a Milnor-like attractor does not appear. In fact, one can check that the assumption of proposition 1 fails, as follows. For each x ∈ ℝ^n, we obtain
\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x; w^*, v^*)\big) = 0,
since the loss function ℓ(x, y) takes its minimum 0 at y = T(x) = f(1)(x;w*,v*). This implies that the matrix H becomes the zero matrix. Thus, H is neither positive nor negative definite.

We next treat the case m ≥ 2. By lemma 1, there also exists a one-dimensional region consisting of critical points. However, in this case, the region is simply repulsive and has no attractive part, as the following theorem asserts.

Theorem 1.
Let θ* = (w*,v*) be a local minimizer of L(1). If the m×(n+1) matrix
E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x;\theta^*)\big)\,\phi'(w^* \cdot x)\,x^T\right]
(2.2)
is nonzero, then θλ = (w*, w*, λv*, (1 − λ)v*) is a saddle point of L(2) for any λ ∈ ℝ, where we regard the derivative ∂ℓ/∂y as a column vector.
Amari et al. (2018) stated a prototype of theorem 1, although they did not give a full proof. In fact, we found that an additional assumption is necessary to prove their assertion. In theorem 1, we have therefore added the mild assumption that the matrix, equation 2.2, is nonzero. Note that since θ* = (w*, v*) is a local minimizer of L(1), it holds that
0 = \frac{\partial L^{(1)}}{\partial w}(\theta^*) = (v^*)^T E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x;\theta^*)\big)\,\phi'(w^* \cdot x)\,x^T\right].
Thus, v* belongs to the left kernel of the matrix 2.2, whose dimension is therefore at least one. Hence, the assumption automatically fails when m = 1, because then v* is a nonzero scalar and the matrix must be zero. This is the underlying mechanism behind proposition 1.
In their analysis of an (n-2-1)-perceptron, Wei et al. (2008) introduced a coordinate transformation,
w = \frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}, \qquad v = v_1 + v_2, \qquad u = w_1 - w_2, \qquad z = \frac{v_1 - v_2}{v_1 + v_2},
(3.1)
and claimed that the parameters (w,v) quickly converge to (w*,v*) when the initial point is taken near a Milnor-like attractor. Amari et al. (2018) mentioned that the dynamics in this coordinate system should be analyzed by using the center manifold theory, and they analyzed only the reduced dynamical system for the subparameters (u,z), setting the remaining parameters (w,v) to be (w*,v*). Strictly speaking, however, their coordinate system does not admit any center manifold structure, and their claim is at the stage of hypothesis.

In this section, we give a rigorous justification for their hypothesis. We first give a quick review of the center manifold theory and then introduce a new coordinate system under which the center manifold structures arise near certain points on the Milnor-like attractor.

3.1  Brief Review of Center Manifold

Suppose that we are given a dynamical system,
\dot{x}(t) = A x(t) + f(x(t), y(t)), \qquad \dot{y}(t) = B y(t) + g(x(t), y(t)),
(3.2)
for parameters (x, y) ∈ ℝ^{d_1} × ℝ^{d_2}, where A and B are constant matrices and f and g are C² functions that, along with their first derivatives, vanish at the origin. We assume that all the eigenvalues of A have zero real parts, while all the eigenvalues of B have negative real parts. Under this assumption, the parameter y converges to the origin exponentially fast, whereas the parameter x is driven only by the higher-order terms of f and evolves very slowly compared with y. Since f and g are of second order with respect to x and y, the assumptions for equation 3.2 imply that the coefficient matrix of the linearization of the system has the form
\begin{pmatrix} A & O \\ O & B \end{pmatrix}.
Definition 1.

A set S ⊂ ℝ^{d_1} × ℝ^{d_2} is said to be a local invariant manifold of equation 3.2 if, for (x_0, y_0) ∈ S, the solution (x(t), y(t)) of equation 3.2 with (x(0), y(0)) = (x_0, y_0) lies in S for |t| < T for some T > 0.

Definition 2.

A local invariant manifold represented in the form y = h(x) is called a local center manifold (or simply a center manifold) if h is differentiable and satisfies h(0) = 0 and (∂h/∂x)(0) = O.

The following center manifold theorems give us a method of simplifying a dynamical system around an equilibrium point.

Proposition 2

(Center manifold theorem 1: Carr, 1981). Equation 3.2 has a center manifold y=h(x) for ||x||<δ, for some δ>0 and C2 function h.

Proposition 3
(Center manifold theorem 2: Carr, 1981). Suppose that the origin u=0 is a stable equilibrium point of the reduced dynamical system
\dot{u}(t) = A u(t) + f(u(t), h(u(t))).
(3.3)
Let (x(t),y(t)) be a solution of equation 3.2 with the initial value (x0,y0). Then if ||(x0,y0)|| is sufficiently small, there exists a solution u(t) of equation 3.3 such that
x(t) = u(t) + O(e^{-\gamma t}), \qquad y(t) = h(u(t)) + O(e^{-\gamma t}),
as t → ∞, where γ is a positive constant.

Proposition 3 asserts that the parameter (x,y) approaches the center manifold y=h(x) quickly and then evolves along it. Thus, the dynamical system, equation 3.2, around the origin is essentially controlled by the slow parameter x and reduced to the lower-dimensional system.
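As a toy illustration of propositions 2 and 3 (unrelated to the perceptron itself), consider the planar system ẋ = −xy, ẏ = −y + x², whose center manifold is y = h(x) = x² + O(x⁴) and whose reduced dynamics is u̇ = −u³ + O(u⁵). The following sketch, with an arbitrarily chosen initial point and step size, integrates both systems by the Euler method and compares the slow variables.

def full_step(x, y, dt):
    # Euler step for the full system: x' = -x*y (center direction), y' = -y + x**2 (stable direction).
    return x + dt * (-x * y), y + dt * (-y + x ** 2)

def reduced_step(u, dt):
    # Euler step for the reduced flow on the center manifold y = h(x) ~ x**2: u' = -u**3.
    return u + dt * (-u ** 3)

dt, steps = 1e-3, 20000
x, y = 0.3, 0.5            # arbitrary initial point near the origin
u = x                      # reduced variable starts at the same slow coordinate
for _ in range(steps):
    x, y = full_step(x, y, dt)
    u = reduced_step(u, dt)

print(x, u)                # the slow variable stays close to the reduced solution
print(y, x ** 2)           # the fast variable has collapsed onto y ~ h(x) ~ x**2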

3.2  Main Results

Let us return to the analysis of an (n-2-1)-perceptron. In a column vector representation, the dynamical system, equation 1.5, for the (n-2-1)-perceptron is written as
\dot{\theta} = -\left(\frac{\partial L^{(2)}}{\partial \theta}(\theta)\right)^T.
We consider another coordinate system ξ=ξ(θ) and investigate the dynamical system in it. By the coordinate transformation, the dynamical system above is transformed to
\dot{\xi} = -\frac{\partial \xi}{\partial \theta}\left(\frac{\partial \xi}{\partial \theta}\right)^T \left(\frac{\partial L^{(2)}}{\partial \xi}(\xi)\right)^T.
(3.4)
Thus, the coefficient matrix of its linearization at a critical point ξ=ξ* is
\frac{\partial \dot{\xi}}{\partial \xi}(\xi^*) = -\left.\frac{\partial}{\partial \xi}\left[\frac{\partial \xi}{\partial \theta}\left(\frac{\partial \xi}{\partial \theta}\right)^{T}\left(\frac{\partial L^{(2)}}{\partial \xi}(\xi)\right)^{T}\right]\right|_{\xi=\xi^*} = -\frac{\partial \xi}{\partial \theta}\left(\frac{\partial \xi}{\partial \theta}\right)^{T}\frac{\partial^2 L^{(2)}}{\partial \xi\,\partial \xi}(\xi^*),
where we used (∂L(2)/∂ξ)(ξ*) = 0. Since the Jacobian ∂ξ/∂θ of a coordinate transformation is invertible, this relation implies that the coefficient matrix has the same rank as the Hessian matrix (∂²L(2)/∂ξ∂ξ)(ξ*). In particular, the rank of the coefficient matrix of the linearization does not depend on the choice of coordinate system.

We focus on the dynamics of the learning process around the two points θ = θ0, θ1, which are the boundaries between the repulsive and attractive parts of a Milnor-like attractor {θλ | λ ∈ ℝ}. Concretely, they are θ0 = (w*, w*, 0, v*) and θ1 = (w*, w*, v*, 0), where (w*, v*) is a minimizer of the loss L(1) for the (n-1-1)-perceptron, as in proposition 1. The reason for this choice is that the rank of the Hessian matrix at θλ degenerates by one dimension for λ ≠ 0, 1 and by n + 2 dimensions for λ = 0, 1, as shown in appendix A.

We introduce a new coordinate system ξ=(w,v,u,z) by
w = \frac{v_1(w_1 - w^*) + v_2(w_2 - w^*)}{v^*} + w^*, \qquad v = v_1 + v_2, \qquad u = \frac{v_2(w_1 - w^*) - v_1(w_2 - w^*)}{v^*}, \qquad z = v_1 - v_2.
(3.5)
This formula defines a coordinate system on the region {v_1² + v_2² ≠ 0}. Under the coordinate system, equation 3.5, the critical points θλ are denoted as
\xi_\lambda = (w^*, v^*, 0, (2\lambda - 1)v^*).
In particular, ξ0=(w*,v*,0,-v*) and ξ1=(w*,v*,0,v*). Now we arrive at the main theorem of this article:
Theorem 2.

In the coordinate system ξ=(w,v,u,z), the dynamical system, equation 1.5, admits a center manifold structure around the critical points θ=θ0,θ1 in which (w,v) converge exponentially fast.

To prove the theorem, we make use of the following lemma.

Lemma 2.

If the matrix X is positive definite and Y is positive semidefinite, all the eigenvalues of the matrix XY are nonnegative.

Proof.
The matrix XY is rewritten as
XY = X^{1/2}\left(X^{1/2} Y X^{1/2}\right) X^{-1/2},
where X^{1/2} is the unique positive-definite matrix such that (X^{1/2})² = X. Here, the matrix Z := X^{1/2} Y X^{1/2} is positive semidefinite. Hence, for each eigenvector a of Z, the vector X^{1/2} a is an eigenvector of the matrix XY, and the corresponding eigenvalue is nonnegative.
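A quick numerical sanity check of lemma 2 (an illustrative aside; the matrix sizes and random seed are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
n = 6
A = rng.standard_normal((n, n))
X = A @ A.T + n * np.eye(n)       # positive definite
B = rng.standard_normal((n, 3))
Y = B @ B.T                       # positive semidefinite, rank at most 3

eigvals = np.linalg.eigvals(X @ Y)
print(eigvals.real.min())         # nonnegative up to round-off, as lemma 2 asserts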
Proof of Theorem 2.
The proof is essentially based on a straightforward calculation. The coefficient matrix of the linearization of the dynamical system, equation 3.4, under the coordinate system, equation 3.5, splits into a (w, v) part and a (u, z) part for λ = 0, 1. In fact, for λ ∈ ℝ, the negative of the coefficient matrix is written as
\tilde{A}_\lambda = \begin{pmatrix}
(1+2k_\lambda)\,Q + \dfrac{1+3k_\lambda}{1+2k_\lambda}\,H & (1+2k_\lambda)\,P & -\dfrac{(2\lambda-1)\,k_\lambda}{1+2k_\lambda}\,H & 0 \\
2P^T & 2R & 0 & 0 \\
-\dfrac{(2\lambda-1)\,k_\lambda}{1+2k_\lambda}\,H & 0 & -\dfrac{k_\lambda}{1+2k_\lambda}\,H & 0 \\
0 & 0 & 0 & 0
\end{pmatrix},
where the block rows and columns are ordered as (w, v, u, z),
and the system is written as
\dot{\xi} = -\tilde{A}_\lambda(\xi - \xi_\lambda) + \tilde{g}_\lambda(\xi),
where \tilde{g}_\lambda denotes the higher-order terms, which vanish at ξ = ξλ together with their first derivatives. Here,
k_\lambda := (1-\lambda)\lambda, \qquad
P := E_x\!\left[\partial^2\ell\; v^*\,\phi(w^*\cdot x)\,\phi'(w^*\cdot x)\,x\right], \qquad
Q := E_x\!\left[\partial^2\ell\; (v^*)^2\,\phi'(w^*\cdot x)^2\,x x^T\right], \qquad
R := E_x\!\left[\partial^2\ell\; \phi(w^*\cdot x)^2\right], \qquad
\partial\ell := \frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x;\theta^*)\big), \qquad
\partial^2\ell := \frac{\partial^2 \ell}{\partial y^2}\big(x, f^{(1)}(x;\theta^*)\big),
and H is the matrix defined by equation 2.1. H and Q are matrices, P is a column vector, and R is a scalar. For λ=0 or 1, the negative of the coefficient matrix is of the form
\tilde{A}_0 = \tilde{A}_1 = \begin{pmatrix}
Q + H & P & 0 & 0 \\
2P^T & 2R & 0 & 0 \\
0 & 0 & 0 & 0 \\
0 & 0 & 0 & 0
\end{pmatrix}.
(3.6)
We show that all the eigenvalues of the (w, v)-block of the coefficient matrix −Ã_0 are strictly negative. Recall that the coefficient matrix at a critical point ξ* of the dynamical system, equation 1.5, is given by
-\frac{\partial \xi}{\partial \theta}\left(\frac{\partial \xi}{\partial \theta}\right)^{T}\frac{\partial^2 L^{(2)}}{\partial \xi\,\partial \xi}(\xi^*).
Applying lemma 2 to X = (∂ξ/∂θ)(∂ξ/∂θ)^T and Y = (∂²L(2)/∂ξ∂ξ)(ξ0), we find that all the eigenvalues of the coefficient matrix −Ã_0 are nonpositive. As shown in appendix A, the Hessian matrix is positive semidefinite and degenerates by n + 2 dimensions at θ = θ0, θ1. Since a coordinate transformation preserves the rank of the coefficient matrix of the linearization, Ã_0 degenerates by n + 2 dimensions, which is exactly the size of the (u, z)-block. This implies that the (w, v)-block is of full rank, and thus all of its eigenvalues are strictly negative. The case ξ = ξ1 is proved similarly.

Due to proposition 2, there exist center manifolds parametrized by (u, z) around θ = θ0 and θ = θ1, respectively.

3.3  Reduced Dynamical System

By virtue of proposition 3 and theorem 2, we can assume that the dynamics of the gradient descent is on the center manifold near the points θ=θ0,θ1. Thus, we can reduce the dynamical system into that of (u,z). Recalling the coefficient matrix, equation 3.6, we can see that u˙ and z˙ have no first-order terms. In more detail, calculating the Taylor expansion of (u˙,z˙) up to the second order around ξ=ξ1, we obtain
\dot{u} = \frac{1}{v^*}\Big\{-\big(P\cdot(w-w^*)\big)\big(u+(w-w^*)\big) - (v-v^*)\Big(R\,I+\tfrac{1}{2}H\Big)\big(u+(w-w^*)\big) + \tfrac{1}{2}(z-v^*)\,H\big(u+(w-w^*)\big)\Big\} + O\!\left(\|\xi-\xi_1\|^3\right),
(3.7)
\dot{z} = \frac{1}{v^*}\Big\{-(w-w^*)^T Q\big(u+(w-w^*)\big) - (v-v^*)\big(P\cdot(u+(w-w^*))\big) - \tfrac{1}{2}(w-w^*)^T H(w-w^*) + \tfrac{1}{2}u^T H u\Big\} + O\!\left(\|\xi-\xi_1\|^3\right),
(3.8)
where I denotes the (n+1)×(n+1) identity matrix. Now we consider the reduced dynamical system on the center manifold. Here, the center manifold (w,v)=h(u,z) satisfies that
h(u, z) = \begin{pmatrix} w(u, z) \\ v(u, z) \end{pmatrix} = \begin{pmatrix} w^* \\ v^* \end{pmatrix} + O\!\left(\|(u, z-v^*)\|^2\right),
by definition. This gives an approximation of the dynamics on the center manifold near ξ = ξ1 as
\dot{u} = \frac{1}{2v^*}(z - v^*)\,H u + O\!\left(\|(u, z-v^*)\|^3\right), \qquad \dot{z} = \frac{1}{2v^*}\,u^T H u + O\!\left(\|(u, z-v^*)\|^3\right).
(3.9)
Neglecting the higher-order terms, we can integrate this equation to obtain
\|u\|^2 = (z - v^*)^2 + C,
(3.10)
where C is a constant of integration.
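The integration step is immediate: neglecting the higher-order terms in equation 3.9 and using the symmetry of H,
\frac{d}{dt}\|u\|^2 = 2u^T\dot{u} = \frac{1}{v^*}(z - v^*)\,u^T H u, \qquad \frac{d}{dt}(z - v^*)^2 = 2(z - v^*)\dot{z} = \frac{1}{v^*}(z - v^*)\,u^T H u,
so ‖u‖² − (z − v*)² is conserved along the reduced flow, which is exactly equation 3.10.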
Around the point ξ=ξ0, we obtain the similar dynamics,
\dot{u} = -\frac{1}{2v^*}(z + v^*)\,H u + O\!\left(\|(u, z+v^*)\|^3\right), \qquad \dot{z} = -\frac{1}{2v^*}\,u^T H u + O\!\left(\|(u, z+v^*)\|^3\right),
and the relation
\|u\|^2 = (z + v^*)^2 + C.

We remark that theorem 2 remains valid even when a true parameter exists in the singular region R(w*,v*); in this case, however, the reduced dynamical system no longer takes the simple form of equation 3.9. As mentioned above, the existence of a true parameter implies that H becomes the zero matrix. Then the second-order terms of the reduced dynamical system, equation 3.9, vanish, and the third-order terms become dominant. Thus, we have to take into account the cross terms between (w − w*, v − v*) and (u, z − v*), which requires calculating the center manifold (w, v) = h(u, z) up to the second order and makes the analysis more complicated.

Finally, we remark on a difference between our analysis and previous work. Wei et al. (2008) studied a reduced dynamical system in the vicinity of the whole of a Milnor-like attractor. A center manifold, on the other hand, is defined only locally, and center manifolds around two distinct points cannot in general be connected at a midpoint. Thus, one cannot discuss a center manifold defined around the entire region of a Milnor-like attractor.

3.4  More General Models

Our results can be extended to a more general model including multilayer perceptrons whose output is one-dimensional. In this section, we consider a parameterized family of functions that can be written as
f(x;\theta) := g(x;\tau) + v_1\,\varphi(x; w_1, \tau) + v_2\,\varphi(x; w_2, \tau), \qquad \theta = (w_1, w_2, v_1, v_2, \tau),
(3.11)
where we assume that g(x;τ) and φ(x;w,τ) are twice differentiable with respect to τ and (w,τ), respectively.
A multilayer perceptron α̃^(L)(x;θ) with L layers is defined recursively as
\alpha^{(0)}(x;\theta) := (1, x_1, \dots, x_{n_0})^T, \qquad
\tilde{\alpha}^{(\ell)}(x;\theta) := W^{(\ell)}\,\alpha^{(\ell-1)}(x;\theta), \qquad
\alpha^{(\ell)}(x;\theta) := \left(1, \phi\big(\tilde{\alpha}^{(\ell)}_1(x;\theta)\big), \dots, \phi\big(\tilde{\alpha}^{(\ell)}_{n_\ell}(x;\theta)\big)\right)^T, \quad 1 \le \ell \le L, \qquad \theta = (W^{(1)}, \dots, W^{(L)}),
for each x = (x_1, …, x_{n_0})^T ∈ ℝ^{n_0}, where W^(ℓ) is an n_ℓ × (n_{ℓ−1} + 1) matrix, n_0, n_1, …, n_L ∈ ℕ, and ϕ is a twice-differentiable activation function. Assuming that the dimension n_L of the output is equal to 1 and denoting
W^{(L-1)} = \begin{pmatrix} w_1 & w_2 & \cdots & w_{n_{L-1}} \end{pmatrix}^T, \qquad
W^{(L)} = \begin{pmatrix} v_0 & v_1 & v_2 & \cdots & v_{n_{L-1}} \end{pmatrix},
a multilayer perceptron is represented as model 3.11 by letting
\varphi(x; w, \tau) := \phi\big(w \cdot \alpha^{(L-2)}(x;\tau)\big), \qquad
g(x;\tau) := v_0 + \sum_{i=3}^{n_{L-1}} v_i\,\phi\big(w_i \cdot \alpha^{(L-2)}(x;\tau)\big), \qquad
\tau := \big(W^{(1)}, \dots, W^{(L-2)}, w_3, \dots, w_{n_{L-1}}, v_0, v_3, \dots, v_{n_{L-1}}\big).
Our main result is extended to the model, equation 3.11 as follows. Let θ*=(w*,v*,τ*) be a strict local minimizer of an averaged loss function for the degenerate model:
f^{(1)}(x; \tilde{w}, \tilde{v}, \tau) = g(x;\tau) + \tilde{v}\,\varphi(x; \tilde{w}, \tau).
Then the coordinate system ξ = (w, v, τ, u, z) given by formula 3.5 admits a center manifold structure around the two points θ = (w*, w*, 0, v*, τ*) and (w*, w*, v*, 0, τ*), and the dynamical system is reduced to that of (u, z). This is confirmed by an argument on the coefficient matrix of the linearization similar to that in theorem 2.
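To make the correspondence concrete, the following sketch (a toy, fully hypothetical architecture with arbitrarily chosen layer widths) evaluates an L-layer perceptron both directly and through the decomposition of equation 3.11, splitting the last hidden layer into the two distinguished units (w_1, v_1), (w_2, v_2) and collecting the remainder in g(x;τ).

import numpy as np

phi = np.tanh
rng = np.random.default_rng(1)
n0, n1, n2 = 3, 5, 4                       # layer widths; scalar output (n_L = 1)
W1 = rng.standard_normal((n1, n0 + 1))     # W^(1)
W2 = rng.standard_normal((n2, n1 + 1))     # W^(L-1), rows w_1, ..., w_{n2}
W3 = rng.standard_normal((1, n2 + 1))      # W^(L) = (v_0, v_1, ..., v_{n2})

def augment(v):
    # Prepend the constant unit, as in alpha^(l).
    return np.concatenate(([1.0], v))

def mlp(x):
    # Direct recursive evaluation of the L-layer perceptron.
    a1 = augment(phi(W1 @ augment(np.asarray(x, float))))   # alpha^(1)
    a2 = augment(phi(W2 @ a1))                               # alpha^(L-1)
    return (W3 @ a2)[0]

def decomposed(x):
    # The same network written as f(x; theta) = g(x; tau) + v_1 phi(...) + v_2 phi(...).
    a = augment(phi(W1 @ augment(np.asarray(x, float))))     # alpha^(L-2)(x; tau)
    v0, v = W3[0, 0], W3[0, 1:]
    hidden = phi(W2 @ a)                                     # phi(w_i . alpha^(L-2))
    g = v0 + v[2:] @ hidden[2:]                              # remainder g(x; tau)
    return g + v[0] * hidden[0] + v[1] * hidden[1]

x = rng.standard_normal(n0)
print(mlp(x), decomposed(x))                                 # identical up to round-off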

In the previous section, we showed that the dynamics of (w,v) are fast and those of (u,z) are slow under the coordinate system, equation 3.5. In this section, we verify this fact by numerical simulations.

4.1  Example 1

As the first example, we set the input dimension to n = 1 and choose the teacher function T : ℝ → ℝ defined by
T(x):=2tanh(x)-tanh(4x),
where tanh is the hyperbolic tangent function. The shape of T is shown in Figure 2 by the solid black line. We set the activation function ϕ as tanh. Thus, the target function T can be represented by the (1-2-1)-perceptron with no bias terms:
f(2)(x;θ)=v1ϕ(w1x)+v2ϕ(w2x),
and the true parameter is (w1,w2,v1,v2)=(1,4,2,-1). We also discard the bias terms w10 and w20 of the student perceptron. This makes the matrix H scalar valued, and hence the assumption of proposition 1 holds trivially.
Figure 2: The target function T(x) and the (1-1-1)-perceptron f(1)(x;θ*), which corresponds to the local minimizer θ*.
We assume that the data set {x_s}_{s=1}^{S} is given and that the probability distribution of x is the empirical distribution on that data set. Then the transition formula, equation 1.4, of the parameter θ is rewritten as
\theta(t+1) = \theta(t) - \varepsilon\,\frac{1}{S}\sum_{s=1}^{S}\left.\frac{\partial \ell\big(x_s, f^{(2)}(x_s;\theta)\big)}{\partial \theta}\right|_{\theta=\theta(t)}.

In this simulation, we set the size S of the data set to 1000, and the data {x_s}_{s=1}^{S} are drawn independently and identically distributed (i.i.d.) according to N(0, 2²). Here, N(μ, σ²) denotes the gaussian distribution with mean μ and variance σ².

For a data set given as above, we obtained a local minimizer θ* = (w*, v*) = (0.459, 1.15) of L(1). The shape of the function corresponding to this local minimizer is shown by the dashed blue line in Figure 2. The value of H is approximately 0.0472. Since H > 0, the attractive region is {θλ | λ ∈ (0, 1)}, due to proposition 1.

Figure 3 displays time evolutions of each parameter in the first 1500 iterations from 50 different initial points. We chose an initial parameter θ(0)=(w1(0),w2(0),v1(0),v2(0)) by
w_1(0) = w^* + \zeta_1, \qquad w_2(0) = w^* + \zeta_2, \qquad v_1(0) = v^* + \tfrac{1}{2}(\zeta_3 + \zeta_4), \qquad v_2(0) = \tfrac{1}{2}(\zeta_3 - \zeta_4),
so that v = v* + ζ_3 and z = v* + ζ_4, where ζ_1, ζ_2 ∼ U(−0.2, 0.2) and ζ_3, ζ_4 ∼ U(−0.2, 0.2). Here, U(a, b) denotes the uniform distribution on the interval [a, b] ⊂ ℝ. We set the learning rate ε to 0.05 and the number of iterations to 20,000. We can see that the parameters w and v converge to their equilibria exponentially fast (see Figures 3a and 3b), while u and z evolve slowly (see Figures 3c and 3d).
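A minimal NumPy sketch of this experiment is given below; the local-minimizer values are those quoted in the text, while the seed, the helper names, and the use of the squared-error loss are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
S, eps, iters = 1000, 0.05, 20000
xs = rng.normal(0.0, 2.0, S)                        # data drawn i.i.d. from N(0, 2^2)
T = lambda x: 2 * np.tanh(x) - np.tanh(4 * x)       # teacher function
w_star, v_star = 0.459, 1.15                        # local minimizer of L^(1) (from the text)

def grad(theta):
    # Averaged gradient of the squared-error loss for the (1-2-1)-perceptron without biases.
    w1, w2, v1, v2 = theta
    r = v1 * np.tanh(w1 * xs) + v2 * np.tanh(w2 * xs) - T(xs)
    dw1 = np.mean(2 * r * v1 * (1 - np.tanh(w1 * xs) ** 2) * xs)
    dw2 = np.mean(2 * r * v2 * (1 - np.tanh(w2 * xs) ** 2) * xs)
    dv1 = np.mean(2 * r * np.tanh(w1 * xs))
    dv2 = np.mean(2 * r * np.tanh(w2 * xs))
    return np.array([dw1, dw2, dv1, dv2])

# Initial point near the Milnor-like attractor, as described above.
z1, z2, z3, z4 = rng.uniform(-0.2, 0.2, 4)
theta = np.array([w_star + z1, w_star + z2,
                  v_star + 0.5 * (z3 + z4), 0.5 * (z3 - z4)])

for _ in range(iters):
    theta = theta - eps * grad(theta)

w1, w2, v1, v2 = theta
u = (v2 * (w1 - w_star) - v1 * (w2 - w_star)) / v_star   # coordinates of equation 3.5
z = v1 - v2
print(z, u ** 2)                                          # one endpoint on the (z, ||u||^2)-plane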
Figure 3: Time evolutions of each parameter for the first 1500 iterations. Each trajectory of w or v quickly converges to the equilibrium point w* = 0.459, v* = 1.15, respectively. Trajectories of u and z evolve very slowly compared with w and v.
Figure 4 shows the evolutions on the (z, ‖u‖²)-plane. The red circles in the figure represent initial points. When w = w* and v = v*, the z-axis is a Milnor-like attractor, and the region |z| < v* is its attractive part. We can check that parameters near the attractive region are trapped and those near the repulsive region escape. The intersection point of the line z = v* with the z-axis corresponds to the point θ = θ1, the boundary between the attractive and repulsive parts of the Milnor-like attractor. The analytical trajectories, equation 3.10, are plotted as dashed blue curves. The numerical evolutions of the parameter follow the analytical trajectories considerably well around θ = θ1. We see in the figure that some of the time evolutions change direction sharply. This is because the fast dynamics of w and v dominate at the beginning of learning, while the slow dynamics of u and z become dominant after w and v have converged to the center manifold.
Figure 4: Trajectories on the (z, ‖u‖²)-plane obtained by learning for 20,000 iterations (solid black curves) and analytical trajectories (dashed blue curves) near θ = θ1 = (w*, w*, v*, 0). Red circles represent initial points.

4.2  Example 2

As the second example, we consider a three-layer perceptron whose input dimension is n = 2. Let the teacher function T : ℝ² → ℝ be given by
T(x_1, x_2) := 0.75\,\mathrm{Sgm}(2.5x_1 - 2.5x_2) + \mathrm{Sgm}(2.5x_1 + 2.5x_2) + 0.5,
where Sgm is the logistic sigmoidal function, which is defined by
\mathrm{Sgm}(x) := \frac{1}{1 + e^{-2x}} = \frac{1}{2}\big(1 + \tanh(x)\big).
Figure 5 shows the shape of the teacher function T(x). We use a perceptron with no bias terms also in this simulation and choose Sgm as the activation function. Note that a (2-2-1)-perceptron is unable to represent the target function in this case.
Figure 5: The target teacher function T(x).
Also in this simulation, we assume that the probability distribution of x is the empirical distribution on a fixed data set {x_s}_{s=1}^{S}. We set the size S of the data set to 1000 and draw the data {x_s}_{s=1}^{S} i.i.d. according to N(0, I_2), where I_2 denotes the 2 × 2 identity matrix. For a realization {x_s}_{s=1}^{S} chosen as above, we obtained a local minimizer θ* = (w*, v*) of L(1), where w* = (0.399, 0.0652) and v* = 2.76. Figure 6 shows the shape of the (2-1-1)-perceptron corresponding to this local minimizer. The matrix H is numerically computed as
\begin{pmatrix} -0.044 & -0.026 \\ -0.026 & -0.20 \end{pmatrix}.
Since this H is negative definite, the attractive region is {θλ | λ ∈ ℝ∖[0, 1]}, due to proposition 1.
Figure 6: The (2-1-1)-perceptron f(1)(x;θ*) that corresponds to the local minimizer θ*.
Figures 7a to 7d show time evolutions of each parameter in the first 500 iterations from 50 different initial points. We chose initial parameters of the (2-2-1)-perceptron as
w_1(0) = w^* + \zeta_1, \qquad w_2(0) = w^* + \zeta_2, \qquad v_1(0) = v^* + \tfrac{1}{2}(\zeta_3 + \zeta_4), \qquad v_2(0) = \tfrac{1}{2}(\zeta_3 - \zeta_4),
where ζ_1, ζ_2 ∼ U(−0.2, 0.2)² and ζ_3, ζ_4 ∼ U(−0.2, 0.2). We set the learning rate ε to 0.05 and the number of iterations to 20,000. In this simulation, since w and u are two-dimensional, their evolutions are displayed not as time series but as trajectories on the corresponding planes. The red circles in Figures 7a and 7c represent the initial values of w and u, respectively. Figures 7a and 7b show that the parameters w and v converge to their equilibria w* and v* very quickly. To display how quick the convergence is, the first 30 iterations are dashed in green in Figure 7a. In contrast, Figures 7c and 7d show that the parameters u and z evolve very slowly.
Figure 7: Time evolutions of each parameter for the first 500 iterations. Each trajectory of w or v quickly converges to the equilibrium point w* = (0.399, 0.0652), v* = 2.76, respectively. However, trajectories of u and z evolve very slowly compared with w and v. In panel a, the first 30 iterations are dashed in green to display the speed of convergence, and the red point at the center represents w = w*. Red circles in panels a and c represent initial points.
Figure 8 plots the time evolutions of the parameter θ on the (z, ‖u‖²)-plane, that is, the plane whose axes indicate the values of z and ‖u‖². We can check that parameters near the attractive part {|z| > v*, ‖u‖² = 0} of the Milnor-like attractor are trapped and that those near the repulsive part {|z| < v*, ‖u‖² = 0} escape. The numerical evolutions follow the analytical flows (dashed blue curves) well in this case as well.
Figure 8: Trajectories on the (z, ‖u‖²)-plane obtained by learning for 20,000 iterations (solid black curves) and analytical trajectories near θ = θ1 (dashed blue curves). Red circles represent the initial points.
In this section, we discuss stochastic effects in the learning process. Thus far, we have analyzed the dynamical system, equation 1.5, driven by the averaged gradient. In practice, the averaged gradient is estimated by the arithmetic mean of the instantaneous gradient ∂ℓ(x, f(d)(x;θ))/∂θ over a large number of input data. However, taking the arithmetic mean for each update of the parameter demands a high computational cost. In order to reduce the cost, the expectation is often replaced by a single realization of the instantaneous gradient. Such a method is called online learning, a typical stochastic gradient descent method. Mathematically, it is given by
\theta(t+1) = \theta(t) - \varepsilon\left.\frac{\partial \ell\big(x_t, f^{(d)}(x_t;\theta)\big)}{\partial \theta}\right|_{\theta=\theta(t)},
(5.1)
where {x_t}_t are i.i.d. realizations of the input data x. Unlike the deterministic dynamical system, equation 1.5, the system 5.1 is a randomized dynamical system.
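A corresponding sketch of the online update, equation 5.1, in the setting of example 1 is shown below; θ* ≈ (0.472, 1.13) is the minimizer quoted in this section, and the remaining names and values are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
eps, iters = 0.05, 20000
T = lambda x: 2 * np.tanh(x) - np.tanh(4 * x)     # teacher function of example 1
w_star, v_star = 0.472, 1.13                      # approximate minimizer of L^(1) (from the text)

def instant_grad(theta, x):
    # Instantaneous gradient of the squared-error loss at a single input x.
    w1, w2, v1, v2 = theta
    r = v1 * np.tanh(w1 * x) + v2 * np.tanh(w2 * x) - T(x)
    return 2 * r * np.array([v1 * (1 - np.tanh(w1 * x) ** 2) * x,
                             v2 * (1 - np.tanh(w2 * x) ** 2) * x,
                             np.tanh(w1 * x),
                             np.tanh(w2 * x)])

z1, z2, z3, z4 = rng.uniform(-0.2, 0.2, 4)
theta = np.array([w_star + z1, w_star + z2,
                  v_star + 0.5 * (z3 + z4), 0.5 * (z3 - z4)])

for t in range(iters):
    x_t = rng.normal(0.0, 2.0)                    # one fresh sample per update (online learning)
    theta = theta - eps * instant_grad(theta, x_t)

w1, w2, v1, v2 = theta
u = (v2 * (w1 - w_star) - v1 * (w2 - w_star)) / v_star   # coordinates of equation 3.5
print(v1 - v2, u)                                        # one endpoint on the (z, u)-plane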
In numerical simulations, we found that the sample paths of online learning look quite different from the trajectories obtained by the averaged gradient descent method. We set the distribution over which the loss function L(1) is averaged to N(0, 2²) and obtained a local minimizer of L(1), θ* = (w*, v*) ≈ (0.472, 1.13). We carried out numerical simulations of online learning in the same setting as example 1 in section 4. Figure 9a shows numerical trajectories of the averaged gradient descent on the (z, u)-plane around θ = θ1. In order to approximate the averaged gradient descent sufficiently well, we used the empirical distribution on a data set of 10,000 points drawn i.i.d. according to N(0, 2²). Figure 9b shows sample paths of online learning for a common input data sequence {x_t}_t. In contrast to the averaged gradient descent, in online learning some sample paths move from the region {|z| > v*} to {|z| < v*}. Such sample paths are observed even when we use another realization of the input data sequence, and the dynamics seems qualitatively different from the averaged one.
Figure 9: Trajectories on the (z, u)-plane obtained by the averaged gradient descent method, equation 1.5, and by online learning, equation 5.1, for 20,000 iterations. Red circles represent initial values.
In order to investigate this phenomenon, we observe the evolution of the parameters, again in the coordinate system, equation 3.5. Figures 10a to 10d show the time evolutions of each parameter in the first 1500 iterations of online learning. The parameters (w, v) evolve very fast compared with (u, z) in this case as well. However, (w, v) does not converge to its equilibrium point (w*, v*) ≈ (0.472, 1.13) but fluctuates stochastically around (w*, v*).
Figure 10: Time evolutions of each parameter for the first 1500 iterations of online learning. Each trajectory of w or v fluctuates intensively around the equilibrium point w* ≈ 0.472, v* ≈ 1.134, respectively. Trajectories of u and z evolve very slowly compared with w and v in this case.
Based on these observations, we suppose that w and v range over a sufficiently wide set of values to be averaged out, while u and z move within a small range. We therefore assume that the dynamics of (u, z) is obtained by integrating out (w, v) with respect to some stationary distribution. We further assume that (w, v) is distributed around (w*, v*) with finite variance. Integrating the Taylor expansions, equations 3.7 and 3.8, over (w, v), we obtain the following dynamical system near θ = θ1:
\dot{u} = \frac{1}{2}(z - v^*)\,H u + C_1, \qquad \dot{z} = \frac{1}{2}\,u^T H u + C_2.
(5.2)
Here, C1 and C2 are constants resulting from the variance and covariance of (w, v). Figure 11 shows the analytical trajectories of the dynamical system, equation 5.2, where C1 = 1.71 × 10⁻⁴ and C2 = −3.06 × 10⁻⁴ are determined heuristically. The deterministic dynamical system, equation 5.2, gives trajectories similar to the sample paths of online learning presented in Figure 9b.
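For reference, a short Euler integration of the drifted reduced system, equation 5.2, is sketched below; the scalar value H ≈ 0.047 is borrowed from example 1, C1 and C2 are the heuristic values quoted above, and the initial point and step size are arbitrary, so the resulting curve is only indicative.

H, v_star = 0.047, 1.13            # scalar H from example 1; v* from this section
C1, C2 = 1.71e-4, -3.06e-4         # heuristic drift constants quoted in the text
dt, steps = 0.05, 20000

u, z = 0.05, 1.25                  # arbitrary initial point near theta_1 (z slightly above v*)
for _ in range(steps):
    du = 0.5 * (z - v_star) * H * u + C1    # equation 5.2, scalar case
    dz = 0.5 * H * u * u + C2
    u, z = u + dt * du, z + dt * dz

print(z, u)                        # endpoint of one analytical trajectory as in figure 11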
Figure 11: Analytical trajectories on the (z, u)-plane given by the dynamical system 5.2 with C1 = 1.71 × 10⁻⁴ and C2 = −3.06 × 10⁻⁴.

We deduce that the fluctuation of the parameters around a center manifold gives rise to the constants C1 and C2, which act as drift terms, and that this makes the dynamics of online learning qualitatively different from those of the averaged gradient descent. This example suggests that stochastic effects can influence the macroscopic flow of the learning process via a center manifold structure.

In this article, we first gave a quick review of a mechanism that causes plateau phenomena in a three-layer perceptron—in particular, how degeneration of hidden units gives rise to a Milnor-like attractor consisting of both attractive and repulsive parts. We next investigated the dynamics of learning around special points on a Milnor-like attractor and proved the existence of the center manifold. We also succeeded in integrating the reduced dynamical system to obtain an analytical form of a trajectory. We performed several numerical simulations to demonstrate the accuracy of our results. As an application of the center manifold structure, we gave an explanation for a characteristic behavior of the online learning.

Unfortunately, the two examples presented in section 4 were the only ones that we could find in which the assumptions of proposition 1 are fulfilled. This might suggest that the appearance of a Milnor-like attractor would be a rather rare situation in a perceptron that has bias terms. In fact, just by replacing the activation function Sgm with tanh in example 2, the matrix H becomes indefinite and the assumption of proposition 1 is violated. Finding more suggestive examples that shed light on the complex behavior of the dynamics of learning is an important subject for future study.

In section 5, we investigated stochastic effects of the online learning from an intermediate viewpoint between fully stochastic and averaged dynamics. We made use of the center manifold of the averaged dynamics and discussed an integration with quickly fluctuating parameters. There have been many reports of qualitative differences between stochastic and deterministic methods; however, there are few general theories for analyzing such dissimilarities. We expect that the intermediate viewpoint in this article can be a clue to clarify stochastic effects in learning.

This appendix gives proofs of proposition 1 and theorem 1. The proof of proposition 1 is based on an analysis of the Hessian matrix of L(2). However, the Hessian at the point θλ is singular, since L(2)(θλ) is constant along λ ∈ ℝ. Thus, we need to take higher-order derivatives of L(2) into account, a point that is overlooked in Fukumizu and Amari (2000). The prototype of theorem 1 was given by Amari et al. (2018); however, they proved it only for a special case. Here, we give a rigorous proof under an additional mild assumption.

Proof of Proposition 1. We introduce a new coordinate system ξ=(w,v,u,z) by
w = \frac{v_1 w_1 + v_2 w_2}{v_1 + v_2}, \qquad v = v_1 + v_2, \qquad u = w_1 - w_2, \qquad z = v_1 - v_2,
(A.1)
where v_1 + v_2 ≠ 0. Under this coordinate system, the point θλ is denoted as ξλ = (w*, v*, 0, (2λ − 1)v*). Note that each point ξ = ξλ is a critical point of L(2)(ξ) by lemma 1. The inverse transformation is given by
w_1 = w + \frac{v - z}{2v}\,u, \qquad w_2 = w - \frac{v + z}{2v}\,u, \qquad v_1 = \frac{v + z}{2}, \qquad v_2 = \frac{v - z}{2}.
We examine the Hessian matrix Hess(ξλ) of L(2)(ξ) at ξλ for each λ ∈ ℝ. For all ξ such that u = 0 (or, equivalently, w_1 = w_2), using the inverse transformation formula above,
\frac{\partial L^{(2)}}{\partial u}(\xi) = E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi)\big)\,\frac{v^2 - z^2}{4v}\big(\phi'(w_1\cdot x) - \phi'(w_2\cdot x)\big)\,x\right] = 0, \qquad
\frac{\partial L^{(2)}}{\partial z}(\xi) = -E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi)\big)\left(\frac{v+z}{4v}\phi'(w_1\cdot x) + \frac{v-z}{4v}\phi'(w_2\cdot x)\right)(u\cdot x)\right] + E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi)\big)\,\frac{1}{2}\big(\phi(w_1\cdot x) - \phi(w_2\cdot x)\big)\right] = 0.
Here, we left w1 and w2 for notational simplicity. We then have
\frac{\partial^2 L^{(2)}}{\partial \gamma\,\partial z}(\xi_\lambda) = 0, \qquad \frac{\partial^2 L^{(2)}}{\partial \gamma\,\partial u}(\xi_\lambda) = 0,
where γ=w,v,z. Hence, the matrix Hess(ξλ) has the form
\mathrm{Hess}(\xi_\lambda) = \begin{pmatrix} * & * & 0 & 0 \\ * & * & 0 & 0 \\ 0 & 0 & * & 0 \\ 0 & 0 & 0 & 0 \end{pmatrix},
where the rows and columns are ordered as (w, v, u, z) and * denotes a possibly nonzero block.
The (w, v)-block is equal to the Hessian matrix of L(1) at θ* and is positive definite. In fact, for any ξ such that u = 0, noting that f(2)(x;ξ) = f(1)(x; w, v),
\frac{\partial L^{(2)}}{\partial w}(\xi) = E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x; w, v)\big)\,v\,\phi'(w\cdot x)\,x\right] = \frac{\partial L^{(1)}}{\partial w}(w, v), \qquad
\frac{\partial L^{(2)}}{\partial v}(\xi) = E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(1)}(x; w, v)\big)\,\phi(w\cdot x)\right] = \frac{\partial L^{(1)}}{\partial v}(w, v).
The (w, v)-block of Hess(ξλ) is obtained by differentiating the equations above with respect to (w, v) and is thus equal to the Hessian matrix (∂²L(1)/∂θ(1)∂θ(1))(θ*), which is positive definite since θ(1) = θ* is a strict local minimizer of L(1)(θ(1)).
On the other hand, we have
\frac{\partial^2 L^{(2)}}{\partial u\,\partial u}(\xi_\lambda) = E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi_\lambda)\big)\,\lambda(1-\lambda)\,v^*\,\phi''(w^*\cdot x)\,x x^T\right] = \lambda(1-\lambda)\,H,
where H is the matrix defined in equation 2.1. Thus, when H is positive definite, the Hessian matrix Hess(ξλ) is positive semidefinite if and only if λ ∈ [0, 1]. This proves that ξλ is a saddle point of L(2)(ξ) for λ ∈ ℝ∖[0, 1]. Similarly, when H is negative definite, ξλ is a saddle point of L(2)(ξ) for λ ∈ (0, 1). When H is indefinite, so is Hess(ξλ) for every λ ∈ ℝ∖{0, 1}.
We now show that the point ξ = ξλ is a local minimizer of L(2)(ξ) for λ ∈ (0, 1) when the matrix H is positive definite. For λ ∈ (0, 1), since the matrix Hess(ξλ) is positive semidefinite, the Taylor series of L(2)(ξ) − L(2)(ξλ) up to the second order is nonnegative in a neighborhood of ξ = ξλ. However, higher-order terms may influence the sign of the Taylor series, since the coefficients of the terms z² and γz are zero for γ = w, v, u. Since the terms γ^k z^ℓ with k ≥ 2 are dominated by the term γ² near ξ = ξλ, we check whether the coefficients of the terms z^ℓ and γz^ℓ are equal to zero for all ℓ ≥ 1. For all ξ such that u = 0, f(2)(x;ξ) = vϕ(w·x) is constant as a function of z, and so is L(2)(ξ). Hence, we obtain
\frac{\partial^\ell L^{(2)}}{\partial z^\ell}(\xi_\lambda) = 0, \qquad \frac{\partial^{\ell+1} L^{(2)}}{\partial \gamma\,\partial z^\ell}(\xi_\lambda) = 0, \qquad \ell \ge 1,
for γ=w,v. Therefore, it suffices to check if
\frac{\partial^{\ell+1} L^{(2)}}{\partial u\,\partial z^\ell}(\xi_\lambda) = 0, \qquad \ell \ge 1.
Since we have already seen that (∂L(2)/∂u)(ξ) = 0 for all ξ such that u = 0, this equality is confirmed. We have thus shown that the coefficients of the higher-order terms in question are all zero, so that they do not influence the sign of the Taylor series. This shows that L(2)(ξ) − L(2)(ξλ) is nonnegative near ξ = ξλ, and hence the proof is complete. The case in which the matrix H is negative definite is proved similarly.
Proof of Theorem 1. Since v_1 and v_2 are no longer scalars, we cannot use the coordinate system given by equation A.1. Therefore, in order to analyze the Hessian matrix, we introduce another coordinate system ξ = (w, v, u, z) as
w = \frac{w_1 + w_2}{2}, \qquad v = v_1 + v_2, \qquad u = \frac{w_1 - w_2}{2}, \qquad z = v_1 - v_2.
In this coordinate system, the point θλ is denoted as ξλ = (w*, v*, 0, (2λ − 1)v*).

Fix λ ∈ ℝ arbitrarily. We show that the Hessian matrix Hess(ξλ) of L(2)(ξ) at ξ = ξλ has both positive and negative eigenvalues. It suffices to show that the (w, v)-part of the Hessian is positive definite and that the (u, z)-part is not positive semidefinite. Together, these imply that the full Hessian matrix Hess(ξλ) is neither negative semidefinite nor positive semidefinite, and hence that Hess(ξλ) is indefinite.

One can check that the (w,v)-part of Hess(ξλ) is equal to the Hessian matrix of L(1) at θ* by direct calculation. Since θ* is a strict local minimizer of L(1), this is positive definite.

On the other hand, we have
\frac{\partial^2 L^{(2)}}{\partial z\,\partial u}(\xi_\lambda) = E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi_\lambda)\big)\,\phi'(w^*\cdot x)\,x^T\right],
(A.2)
\frac{\partial^2 L^{(2)}}{\partial z\,\partial z}(\xi_\lambda) = O,
(A.3)
and thus the (u,z)-part B of Hess(ξλ) is written as
B = \begin{pmatrix}
\dfrac{\partial^2 L^{(2)}}{\partial u\,\partial u}(\xi_\lambda) & \left(E_x\!\left[\dfrac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi_\lambda)\big)\,\phi'(w^*\cdot x)\,x^T\right]\right)^{T} \\
E_x\!\left[\dfrac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi_\lambda)\big)\,\phi'(w^*\cdot x)\,x^T\right] & O
\end{pmatrix},
where we treat ∂ℓ/∂y as a column vector. Here, since f(2)(x;ξλ) = f(1)(x;θ*), the off-diagonal block, equation A.2, is equal to the matrix, equation 2.2, and is thus nonzero by assumption. Hence, the (u, z)-part above is not positive semidefinite. In fact, choosing vectors a ∈ ℝ^m and b ∈ ℝ^{n+1} such that
a^T E_x\!\left[\frac{\partial \ell}{\partial y}\big(x, f^{(2)}(x;\xi_\lambda)\big)\,\phi'(w^*\cdot x)\,x^T\right] b < 0,
the vector c_ε := (εb^T, a^T)^T satisfies c_ε^T B c_ε < 0 for sufficiently small ε > 0. Such a pair of vectors (a, b) always exists; for instance, let a_i = −ρ, b_j = 1, and all other entries be 0, where (i, j) is an index such that the (i, j)-entry ρ of the matrix A.2 is nonzero.

I express my gratitude to Akio Fujiwara for his helpful guidance, discussions, and advice. I thank as well Yuzuru Sato for many discussions and helpful comments. And I thank the anonymous referees for their insightful suggestions to improve this article.

Amari, S.-I., Ozeki, T., Karakida, R., Yoshida, Y., & Okada, M. (2018). Dynamics of learning in MLP: Natural gradient and singularity revisited. Neural Computation, 30(1), 1–33.
Carr, J. (1981). Applications of centre manifold theory. New York: Springer.
Cousseau, F., Ozeki, T., & Amari, S.-I. (2008). Dynamics of learning in multilayer perceptrons near singularities. IEEE Transactions on Neural Networks, 19(8), 1313–1328.
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.
Fukumizu, K., & Amari, S.-I. (2000). Local minima and plateaus in hierarchical structures of multilayer perceptrons. Neural Networks, 13(3), 317–327.
Funahashi, K. (1989). On the approximate realization of continuous mappings by neural networks. Neural Networks, 2(3), 183–192.
Sonoda, S., & Murata, N. (2017). Neural network with unbounded activation functions is universal approximator. Applied and Computational Harmonic Analysis, 43(2), 233–268.
Wei, H., Zhang, J., Cousseau, F., Ozeki, T., & Amari, S.-I. (2008). Dynamics of learning near singularities in layered networks. Neural Computation, 20(3), 813–843.