The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth, and sample size when the network has random weights and is sufficiently wide. This study covers two widely used FIMs for regression with linear output and for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size, while the others are much smaller. It implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of FIMs: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to more quantitative understanding of learning in large-scale DNNs.

1  Introduction

Deep neural networks (DNNs) have outperformed many standard machine-learning methods in practical applications (LeCun, Bengio, & Hinton, 2015). Despite their practical success, many theoretical aspects of DNNs remain to be uncovered, and there are still many heuristics used in deep learning. We need a solid theoretical foundation for elucidating how and under what conditions DNNs and their learning algorithms work well.

The Fisher information matrix (FIM) is a fundamental metric tensor that appears in statistics and machine learning. An empirical FIM is equivalent to the Hessian of the loss function around a certain global minimum, and it affects the performance of optimization in machine learning. In information geometry, the FIM defines the Riemannian metric tensor of the parameter manifold of a statistical model (Amari, 2016). The natural gradient method is a first-order gradient method in the Riemannian space where the FIM works as its Riemannian metric (Amari, 1998; Park, Amari, & Fukumizu, 2000; Pascanu & Bengio, 2014; Ollivier, 2015; Martens & Grosse, 2015). The FIM also acts as a regularizer to prevent catastrophic forgetting (Kirkpatrick et al., 2017); a DNN trained on one data set can learn another data set without forgetting information if the parameter change is regularized with a diagonal FIM.

However, our understanding of the FIM for neural networks has so far been limited to empirical studies and theoretical analyses of simple networks. Numerical experiments empirically confirmed that the eigenvalue spectra of the FIM and those of the Hessian are highly distorted; that is, most eigenvalues are close to zero, while others take on large values (LeCun, Bottou, Orr, & Müller, 1998; Sagun, Evci, Guney, Dauphin, & Bottou, 2017; Papyan, 2019; Ghorbani, Krishnan, & Xiao, 2019). Focusing on shallow neural networks, Pennington and Worah (2018) theoretically analyzed the FIM's eigenvalue spectra by using random matrix theory, and Fukumizu (1996) derived a condition under which the FIM becomes singular. Liang, Poggio, Rakhlin, and Stokes (2019) have connected FIMs to the generalization ability of DNNs by using model complexity, but their results are restricted to linear networks. Thus, theoretical evaluations of deeply nonlinear cases seem to be difficult mainly because of iterated nonlinear transformations. To go one step further, it would be helpful if a framework that is widely applicable to various DNNs could be constructed.

Investigating DNNs with random weights has given promising results. When such DNNs are sufficiently wide, we can formulate their behavior by using simpler analytical equations through coarse-graining of the model parameters, as discussed in mean field theory (Amari, 1974; Kadmon & Sompolinsky, 2016; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein, 2017; Yang & Schoenholz, 2017; Xiao, Bahri, Sohl-Dickstein, Schoenholz, & Pennington, 2018; Yang, 2019) and random matrix theory (Pennington, Schoenholz, & Ganguli, 2018; Pennington & Bahri, 2017; Pennington & Worah, 2017). For example, Schoenholz et al. (2017) proposed a mean field theory for backpropagation in fully connected DNNs. This theory characterizes the amplitudes of gradients by using specific quantities, that is, order parameters in statistical physics, and enables us to quantitatively predict parameter regions that can avoid vanishing or explosive gradients. This theory is applicable to a wide class of DNNs with various nonlinear activation functions and depths. Such DNNs with random weights are substantially connected to gaussian process and kernel methods (Daniely, Frostig, & Singer, 2016; Lee et al., 2018; Matthews, Rowland, Hron, Turner, & Ghahramani, 2018; Jacot, Gabriel, & Hongler, 2018). Furthermore, the theory of the neural tangent kernel (NTK) explains that even trained parameters are close enough to the random initialization in sufficiently wide DNNs, and the performance of trained DNNs is determined by the NTK on the initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019).

Karakida, Akaho, and Amari (2019b) focused on the FIM corresponding to the mean square error (MSE) loss and proposed a framework to express certain eigenvalue statistics by using order parameters. They revealed that when fully connected networks with random initialization are sufficiently wide, the FIM's eigenvalue spectrum asymptotically becomes pathologically distorted. As the network width increases, a small number of the eigenvalues asymptotically take on huge values and become outliers, while the others are much smaller. The distorted shape of the eigenvalue spectrum is consistent with empirical reports (LeCun et al., 1998; Sagun et al., 2017; Papyan, 2019; Ghorbani et al., 2019). While LeCun, Kanter, and Solla (1991) implied that such pathologically large eigenvalues might appear in multilayered networks and affect the training dynamics, its theoretical elucidation has been limited to a data covariance matrix in a linear regression model. The results of Karakida et al. (2019b) can be regarded as a theoretical verification of this large eigenvalue suggested by LeCun et al. (1991). The obtained eigenvalue statistics have given insight into the convergence of gradient dynamics (LeCun et al., 1998; Karakida et al., 2019b), mechanism of batch normalization to decrease the sharpness of the loss function (Karakida, Akaho, & Amari, 2019a), and the generalization measure of DNNs based on the minimum description length (Sun & Nielsen, 2019).

In this letter, we extend the framework of the previous work (Karakida et al., 2019b) and reveal that various types of FIMs and variants show pathological spectra. Our main contribution is the following:

  • FIM for classification tasks with softmax output: While the previous works (Karakida et al., 2019a, 2019b) analyzed the FIM for regression based on the MSE loss, we typically use the cross-entropy loss with softmax output in classification tasks. We analyze this FIM for classification tasks and reveal that its spectrum is pathologically distorted as well. While the FIM for regression tasks has unique and degenerated outliers in the infinite-width limit, the softmax output can make these outliers disperse and remove the degeneracy. Our theory shows that there are number-of-classes outlier eigenvalues, which is consistent with experimental reports (Sagun et al., 2017; Papyan, 2019). Experimental results demonstrate that the eigenvalue density has a tail of outliers spreading from the bulk.

Furthermore, we also give a unified perspective on the variants:

  • Diagonal blocks of FIM: We give a detailed analysis of the diagonal block parts of the FIM for regression tasks. Natural gradient algorithms often use a block diagonal approximation of the FIM (Amari, Karakida, & Oizumi, 2019). We show that the diagonal blocks also suffer from pathological spectra.

  • Connection to NTK: The NTK and FIM inherently share the same nonzero eigenvalues. Paying attention to a specific rescaling of the parameters assumed in studies of NTK, we clarify that NTK's eigenvalue statistics become independent of the width scale. Instead, the gap between the average and maximum eigenvalues increases with the sample size. This suggests that as the sample size increases, the training dynamics converge nonuniformly and that calculations with the NTK become ill conditioned. We also demonstrate a simple normalization method to make eigenvalue statistics that are independent of the large width and sample size.

  • Metric tensors for input and feature spaces: We consider metric tensors for input and feature spaces spanned by neurons in input and hidden layers. These metric tensors potentially enable us to evaluate the robustness of DNNs against perturbations in the input and feedforward propagated signals. We show that the spectrum is pathologically distorted, similar to FIMs, in the sense that the outlier of the spectrum is much farther from most of the eigenvalues. The softmax output makes the outliers disperse as well.

In summary, this study sheds light on the asymptotical eigenvalue statistics common to various wide networks.

2  Preliminaries

2.1  Model

We investigated the fully connected feedforward neural network shown in Figure 1. The network consists of one input layer, $L-1$ hidden layers ($l=1,\dots,L-1$), and one output layer. It includes shallow nets ($L=2$) and arbitrarily deep nets ($L \geq 3$). The width of the $l$th layer is denoted by $M_l$. The preactivations $u_i^l$ and activations $h_i^l$ of the units in the $l$th layer are defined recursively by
$$u_i^l = \sum_{j=1}^{M_{l-1}} W_{ij}^l h_j^{l-1} + b_i^l, \qquad h_i^l = \phi(u_i^l),$$
(2.1)
where $\phi(\cdot)$ denotes the activation function. The input signals are $h_i^0 = x_i$, and they propagate layer by layer through equation 2.1. The weight matrices are $W^l \in \mathbb{R}^{M_l \times M_{l-1}}$ and the bias terms are $b^l \in \mathbb{R}^{M_l}$. Regarding the network width, we set
$$M_l = \alpha_l M \quad (l \leq L-1), \qquad M_L = C,$$
(2.2)
and consider the limiting case of a sufficiently large $M$ with constant coefficients $\alpha_l > 0$. The number of output units is taken to be a constant $C$, as is usually done in practice. We denote the linear output of the last layer by
$$f_i = u_i^L.$$
(2.3)
We also investigate DNNs with softmax outputs in section 3.3. The $C$-dimensional softmax function is given by
$$g_i := \frac{\exp(f_i)}{\sum_{k=1}^{C} \exp(f_k)},$$
(2.4)
for $i = 1, \dots, C$.
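As a concrete reference for equations 2.1 to 2.4 and the random initialization of equation 2.6, the following NumPy sketch implements the feedforward pass of such a network. It is only an illustration: the widths, depth, activation, and variance parameters are example choices, not values fixed by the theory.

```python
import numpy as np

def init_params(widths, sigma_w2=3.0, sigma_b2=0.64, seed=0):
    """Draw random weights and biases as in equation 2.6:
    W^l_ij ~ N(0, sigma_w^2 / M_{l-1}) and b^l_i ~ N(0, sigma_b^2)."""
    rng = np.random.default_rng(seed)
    params = []
    for m_in, m_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(sigma_w2 / m_in), size=(m_out, m_in))
        b = rng.normal(0.0, np.sqrt(sigma_b2), size=m_out)
        params.append((W, b))
    return params

def forward(x, params, phi=np.tanh):
    """Feedforward propagation of equation 2.1; returns the linear output f = u^L
    (equation 2.3) and the activations h^0, ..., h^{L-1}."""
    h, activations = x, [x]
    for l, (W, b) in enumerate(params):
        u = W @ h + b
        if l < len(params) - 1:          # hidden layers: apply the activation
            h = phi(u)
            activations.append(h)
        else:                             # last layer: linear output
            f = u
    return f, activations

def softmax(f):
    """Softmax output of equation 2.4."""
    e = np.exp(f - f.max())
    return e / e.sum()

# Example: L = 3 (two hidden layers), M = 200, C = 10, gaussian input (equation 2.7).
widths = [200, 200, 200, 10]
params = init_params(widths)
x = np.random.default_rng(1).normal(size=widths[0])
f, hs = forward(x, params)
g = softmax(f)
```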
Figure 1:

Notation of deep neural networks (DNNs). The mathematical definitions are given in section 2.1.


FIM computations require the chain rule for the backpropagated signals $\delta_k^l \in \mathbb{R}^{M_l}$. The backpropagated signals are defined by $\delta_{k,i}^l := \partial f_k / \partial u_i^l$ and naturally appear in the derivatives of $f_k$ with respect to the parameters:
$$\frac{\partial f_k}{\partial W_{ij}^l} = \delta_{k,i}^l\, h_j^{l-1}, \qquad \frac{\partial f_k}{\partial b_i^l} = \delta_{k,i}^l, \qquad \delta_{k,i}^l = \phi'(u_i^l) \sum_j \delta_{k,j}^{l+1} W_{ji}^{l+1}.$$
(2.5)
To avoid complicating the notation, we will omit the index $k$ of the output unit and write $\delta_i^l = \delta_{k,i}^l$. To evaluate the above feedforward and backward signals, we assume the following conditions.
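The backpropagated signals of equation 2.5 can be computed with the same ingredients. The sketch below reuses the forward-pass conventions of the previous snippet (and assumes a tanh activation so that its derivative is available in closed form); it returns the per-layer gradients of a single output $f_k$.

```python
import numpy as np

def backward_signals(x, params, k, phi=np.tanh,
                     dphi=lambda u: 1.0 - np.tanh(u) ** 2):
    """Compute delta^l_i = df_k/du^l_i by the recursion of equation 2.5,
    together with the gradients df_k/dW^l and df_k/db^l."""
    # Forward pass, storing activations h^{l-1} and preactivations u^l.
    h, hs, us = x, [x], []
    for l, (W, b) in enumerate(params):
        u = W @ h + b
        us.append(u)
        if l < len(params) - 1:
            h = phi(u)
            hs.append(h)
    L = len(params)
    delta = np.zeros(params[-1][0].shape[0])
    delta[k] = 1.0                      # linear output: delta^L is the one-hot vector e_k
    grads = [None] * L
    for l in reversed(range(L)):
        grads[l] = (np.outer(delta, hs[l]), delta.copy())    # (df_k/dW, df_k/db) of layer l+1
        if l > 0:
            W, _ = params[l]
            delta = dphi(us[l - 1]) * (W.T @ delta)          # recursion of equation 2.5
    return grads

# Usage with x, params from the previous sketch:
# grads = backward_signals(x, params, k=0)
```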

2.1.1  Random Weights and Biases

Suppose that the parameter set is an ensemble generated by
$$W_{ij}^l \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_w^2 / M_{l-1}), \qquad b_i^l \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma_b^2),$$
(2.6)
and then fixed, where $\mathcal{N}(0,\sigma^2)$ denotes a gaussian distribution with zero mean and variance $\sigma^2$. Treating the case in which different layers have different variances is straightforward. Note that the variances of the weights are scaled in the order of $1/M$. In practice, the learning of DNNs usually starts from random initialization with this scaling (Glorot & Bengio, 2010; He, Zhang, Ren, & Sun, 2015).

2.1.2  Input Samples

We assume that there are $N$ input samples $x(n) \in \mathbb{R}^{M_0}$ $(n=1,\dots,N)$ generated identically and independently from the input distribution. We generate the samples by using a standard normal distribution,
$$x_j(n) \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1).$$
(2.7)

2.1.3  Activation Functions

Suppose the following two conditions: (1) the activation $\phi(x)$ has a polynomially bounded weak derivative, and (2) the network is noncentered, which means a DNN with bias terms ($\sigma_b \neq 0$) or activation functions satisfying a nonzero gaussian mean. The definition of the nonzero gaussian mean is $\int Dz\, \phi(z) \neq 0$. The notation $\int Du = \int du\, \exp(-u^2/2)/\sqrt{2\pi}$ means integration over the standard gaussian density.

Condition 1 is used to obtain recurrence relations of backward-order parameters (Yang, 2019). Condition 2 plays an essential role in our evaluation of the FIM (Karakida et al., 2019a, 2019b). The two conditions are valid in various realistic settings, because conventional networks include bias terms, and widely used activation functions, such as the sigmoid function and (leaky-) ReLUs, have bounded weak derivatives and nonzero gaussian means. Different layers may have different activation functions.
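As a quick numerical check of condition 2, the sketch below estimates the gaussian mean $\int Dz\, \phi(z)$ for a few common activations by Gauss-Hermite quadrature. The tanh activation is centered (its gaussian mean vanishes because it is odd), so tanh networks rely on the bias terms ($\sigma_b \neq 0$) to be noncentered, whereas ReLU and the sigmoid already have nonzero gaussian means.

```python
import numpy as np

def gaussian_mean(phi, n_points=80):
    """Estimate int Dz phi(z), with Dz the standard gaussian measure,
    by Gauss-Hermite quadrature (change of variables z = sqrt(2) * t)."""
    t, w = np.polynomial.hermite.hermgauss(n_points)
    return np.sum(w * phi(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

activations = {
    "tanh": np.tanh,
    "relu": lambda z: np.maximum(z, 0.0),
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),
}
for name, phi in activations.items():
    print(f"{name:8s} int Dz phi(z) = {gaussian_mean(phi): .4f}")
# tanh gives ~0 (centered); relu gives ~0.3989 and sigmoid ~0.5 (noncentered).
```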

2.2  Overview of Metric Tensors

We will analyze two types of metric tensors (metric matrices) that determine the responses of network outputs: the response to a local change in parameters and the response to a local change in the input and hidden neurons. They are summarized in Figure 2. One can systematically understand these tensors from the perspective of perturbations of variables.
Figure 2:

Matrix representations of metric tensors. (a) Metric for parameter space, also known as the empirical Fisher information matrix (FIM). In particular, Q=I corresponds to the FIM for MSE loss with linear output. (b) Dual of FIM. Under a specific parameter transformation, this is equivalent to the neural tangent kernel. (c) Metric for input and feature spaces. Note that the figures omit the scalar factors of the metrics.


We denote the set of network parameters as $\theta \in \mathbb{R}^P$, where $P$ is the number of parameters. We measure the response of the network output to an infinitesimal change $d\theta$ by
$$\mathbb{E}\left[\left\| f(x;\theta + d\theta) - f(x;\theta) \right\|^2\right] \approx d\theta^\top F\, d\theta,$$
(2.8)
where we took the first-order Taylor expansion of the deterministic function $f$ and defined
$$F := \sum_{k=1}^{C} \mathbb{E}\left[\nabla_\theta f_k(x)\, \nabla_\theta f_k(x)^\top\right].$$
(2.9)
$\mathbb{E}[\cdot]$ denotes the expectation over an input distribution, $\|\cdot\|$ denotes the Euclidean norm, and $\nabla_\theta$ is the derivative with respect to $\theta$. The matrix $F$ acts as a metric tensor for the parameter space, and its eigenvalues determine the robustness of the network output against the perturbation. While we introduced $F$ through the response of the function $f$, we can also introduce it as a metric tensor of information geometry, that is, the Fisher information matrix (FIM). As we will explain in section 3.1, we can introduce a probabilistic model corresponding to the MSE loss, and $F$ naturally appears in the Kullback-Leibler divergence of that model.
When $N$ input samples $x(n)$ $(n=1,\dots,N)$ are available, we can replace the expectation $\mathbb{E}[\cdot]$ of the FIM with the empirical mean:
$$F = \sum_{k=1}^{C} \frac{1}{N} \sum_{n=1}^{N} \nabla_\theta f_k(n)\, \nabla_\theta f_k(n)^\top,$$
(2.10)
where we have abbreviated the network outputs as $f_k(n) = f_k(x(n);\theta)$ to avoid complicating the notation. This is an empirical FIM in the sense that the average is computed over the empirical input. We can express it in the matrix form shown in Figure 2a. The Jacobian $\nabla_\theta f$ is a $P \times CN$ matrix whose columns are $\nabla_\theta f_k(n)$ $(k=1,\dots,C,\ n=1,\dots,N)$. We investigate this type of empirical metric tensor for arbitrary $N$. One can set $N$ as a constant or let it increase with $M$. The empirical FIM (see equation 2.10) converges to the expected FIM as $N \to \infty$. In addition, the FIM can be partitioned into $L^2$ layer-wise blocks. We denote the $(l,l')$th block as $F_{ll'}$ $(l,l'=1,\dots,L)$. We take a closer look at the eigenvalue statistics of the diagonal blocks in section 3.2.
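The matrix forms of Figure 2 can be made concrete with a few lines of linear algebra. In the sketch below, a random array merely stands in for the Jacobian $\nabla_\theta f$; with it, the empirical FIM of equation 2.10 is $\nabla_\theta f\, \nabla_\theta f^\top / N$ and its dual (the NTK of section 4 up to a factor) is $\nabla_\theta f^\top \nabla_\theta f / N$, and the two share the same nonzero eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, N = 500, 10, 20                  # illustrative sizes only
jac = rng.normal(size=(P, C * N))      # placeholder for the P x CN Jacobian grad_theta f

F = jac @ jac.T / N                    # empirical FIM, equation 2.10 (P x P)
F_dual = jac.T @ jac / N               # dual Gram matrix of Figure 2b (CN x CN)

# The FIM and its dual share the same nonzero eigenvalues.
eig_F = np.sort(np.linalg.eigvalsh(F))[::-1]
eig_dual = np.sort(np.linalg.eigvalsh(F_dual))[::-1]
print(np.allclose(eig_F[: C * N], eig_dual, atol=1e-6))   # True
```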

In this letter, we also investigate another FIM denoted by $F_{\mathrm{cross}}$, which corresponds to classification tasks with the cross-entropy loss. As shown in Figure 2a, one can represent $F_{\mathrm{cross}}$ as a modification of $F$: a specific coefficient matrix $Q$ is inserted between the Jacobian $\nabla_\theta f$ and its transpose. $Q$ is composed of nothing but softmax functions; its blocks are defined by equation 3.8, and its mathematical definition is given in section 3.3. One more interesting quantity is a left-to-right reversed product of $\nabla_\theta f$ described in Figure 2b. This matrix is known as the neural tangent kernel (NTK). The FIM and NTK share the same nonzero eigenvalues by definition, although we need to be careful about the change of parameterization used in the studies of the NTK. The details are shown in section 4.

In analogy with the FIM, one can introduce a metric tensor that measures the response to a change in the neural activities. Collect all the activations in the input and hidden layers into a vector $h := \{h^0, h^1, \dots, h^{L-1}\} \in \mathbb{R}^{M_h}$ with $M_h = \sum_{i=0}^{L-1} M_i$. Next, define an infinitesimal perturbation of $h$, $dh \in \mathbb{R}^{M_h}$, that is independent of $x$. Then the response can be written as
$$\mathbb{E}\left[\left\| f(h + dh;\theta) - f(h;\theta) \right\|^2\right] \approx dh^\top A\, dh,$$
(2.11)
$$A := \mathbb{E}\left[\sum_{k=1}^{C} \nabla_h f_k\, \nabla_h f_k^\top\right].$$
(2.12)
We refer to $A$ as the metric tensor for the input and feature spaces because each $h^l$ acts as the input to the next layer and corresponds to the features realized in the network. We can also deal with the layer-wise blocks of $A$; let us denote the $(l,l')$th block by $A_{ll'}$ $(l,l'=0,\dots,L-1)$. In particular, the first diagonal block $A_{00}$ indicates the robustness of the network output against perturbation of the input:
$$\mathbb{E}\left[\left\| f(x + dx;\theta) - f(x;\theta) \right\|^2\right] \approx dx^\top A_{00}\, dx.$$
(2.13)
Robustness against input noise has been investigated by similar (but different) quantities, such as sensitivity (Novak, Bahri, Abolafia, Pennington, & Sohl-Dickstein, 2018), and robustness against adversarial examples (Goodfellow, Shlens, & Szegedy, 2014).

2.3  Order Parameters for Wide Neural Networks

We use the following four types of order parameters, $(\hat{q}_1^l, \hat{q}_2^l, \tilde{q}_1^l, \tilde{q}_2^l)$, which were used in various studies on wide DNNs (Amari, 1974; Poole et al., 2016; Schoenholz et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018; Lee et al., 2018). First, we define the following variables for feedforward signal propagation:
$$\hat{q}_1^l := \frac{1}{M_l}\sum_{i=1}^{M_l} h_i^l(n)^2, \qquad \hat{q}_2^l := \frac{1}{M_l}\sum_{i=1}^{M_l} h_i^l(n)\, h_i^l(m),$$
(2.14)
where $h_i^l(n)$ is the output of the $l$th layer generated by the $n$th input sample $x(n)$ $(n=1,\dots,N)$. The variable $\hat{q}_1^l$ describes the total activity in the $l$th layer, and the variable $\hat{q}_2^l$ describes the overlap between the activities for different input samples $x(n)$ and $x(m)$. These variables have been utilized to describe the depth to which signals can propagate from the perspective of order-to-chaos phase transitions (Poole et al., 2016). In the large $M$ limit, these variables can be recursively computed by integration over gaussian distributions (Poole et al., 2016; Amari, 1974):
$$\hat{q}_1^{l+1} = \int Du\, \phi\!\left(\sqrt{q_1^{l+1}}\, u\right)^2, \qquad \hat{q}_2^{l+1} = I_\phi\!\left[q_1^{l+1}, q_2^{l+1}\right], \qquad q_1^{l+1} := \sigma_w^2\, \hat{q}_1^l + \sigma_b^2, \qquad q_2^{l+1} := \sigma_w^2\, \hat{q}_2^l + \sigma_b^2,$$
(2.15)
for $l=0,\dots,L-1$. Because the input samples generated by equation 2.7 yield $\hat{q}_1^0 = 1$ and $\hat{q}_2^0 = 0$ for all $n$ and $m$, $\hat{q}_2^l$ in each layer takes the same value for all $n \neq m$; so does $\hat{q}_1^l$ for all $n$. The two-dimensional gaussian integral is given by
$$I_\phi[a,b] := \int Dy \int Dx\, \phi\!\left(\sqrt{a}\, x\right)\phi\!\left(\sqrt{a}\left(c x + \sqrt{1-c^2}\, y\right)\right)$$
(2.16)
with $c = b/a$. One can also write this integral in a slightly simpler form, $I_\phi[a,b] = \int Dy\left(\int Dx\, \phi\!\left(\sqrt{a-b}\, x + \sqrt{b}\, y\right)\right)^2$.
Next, let us define the following variables for the backpropagated signals:
$$\tilde{q}_1^l := \sum_{i=1}^{M_l} \delta_i^l(n)^2, \qquad \tilde{q}_2^l := \sum_{i=1}^{M_l} \delta_i^l(n)\, \delta_i^l(m).$$
(2.17)
Above, we omitted $k$, the index of the output $f_k$, because the symmetry within a layer makes the above variables independent of $k$ in the large $M$ limit. Note that each $\delta_i^l$ is of $O(1/\sqrt{M})$, so the sums in equation 2.17 are of $O(1)$ in terms of the order notation $O(\cdot)$. The variable $\tilde{q}_1^l$ is the magnitude of the backward signals, and $\tilde{q}_2^l$ is their overlap. Previous studies found that these order parameters in the large $M$ limit are easily computed using the following recurrence relations (Schoenholz et al., 2017; Yang, 2019),
$$\tilde{q}_1^l = \sigma_w^2\, \tilde{q}_1^{l+1} \int Du\, \phi'\!\left(\sqrt{q_1^l}\, u\right)^2, \qquad \tilde{q}_2^l = \sigma_w^2\, \tilde{q}_2^{l+1}\, I_{\phi'}\!\left[q_1^l, q_2^l\right],$$
(2.18)
for $l=1,\dots,L-1$. The linear network output, equation 2.3, leads to the following initialization of the recurrences: $\tilde{q}_1^L = \tilde{q}_2^L = 1$. The previous studies showed excellent agreement between these backward-order parameters and experimental results (Schoenholz et al., 2017; Yang & Schoenholz, 2017; Xiao et al., 2018). Although those studies required the so-called gradient independence assumption to derive these recurrences, Yang (2019) recently proved that such an assumption is unnecessary when condition 1 on the activation function is satisfied.

These order parameters depend only on the type of activation function, the depth, and the variance parameters $\sigma_w^2$ and $\sigma_b^2$. The recurrence relations for the order parameters require $L$ iterations of one- and two-dimensional numerical integrals. Moreover, we can obtain explicit forms of the recurrence relations for some of the activation functions (Karakida et al., 2019b).
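For reference, the recurrences 2.15 and 2.18 can be evaluated numerically with one- and two-dimensional Gauss-Hermite quadrature. The sketch below does this for the tanh setting used in Figure 3 ($L=3$, $\sigma_w^2 = 3$, $\sigma_b^2 = 0.64$); these choices are only an example, and the gaussian-input initialization $\hat{q}_1^0 = 1$, $\hat{q}_2^0 = 0$ follows equation 2.7.

```python
import numpy as np

def gh_1d(fn, n=80):
    """Gaussian expectation E[fn(u)], u ~ N(0,1), by Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n)
    return np.sum(w * fn(np.sqrt(2.0) * t)) / np.sqrt(np.pi)

def I_phi(phi, a, b, n=80):
    """Two-dimensional gaussian integral of equation 2.16 (variance a, covariance b),
    computed via the simpler one-dimensional form given below equation 2.16."""
    t, w = np.polynomial.hermite.hermgauss(n)
    z, wn = np.sqrt(2.0) * t, w / np.sqrt(np.pi)
    X, Y = np.meshgrid(z, z, indexing="ij")
    vals = phi(np.sqrt(max(a - b, 0.0)) * X + np.sqrt(b) * Y)
    inner = wn @ vals                   # integrate over x for each y node
    return np.sum(wn * inner ** 2)

def order_params(phi, dphi, L, sigma_w2, sigma_b2):
    """Forward (2.15) and backward (2.18) order parameters for a depth-L network."""
    qh1, qh2 = [1.0], [0.0]             # hat q at l = 0 for gaussian inputs
    q1, q2 = [None], [None]
    for l in range(1, L):
        q1.append(sigma_w2 * qh1[-1] + sigma_b2)
        q2.append(sigma_w2 * qh2[-1] + sigma_b2)
        qh1.append(gh_1d(lambda u: phi(np.sqrt(q1[-1]) * u) ** 2))
        qh2.append(I_phi(phi, q1[-1], q2[-1]))
    qt1, qt2 = {L: 1.0}, {L: 1.0}       # linear output: tilde q^L = 1
    for l in range(L - 1, 0, -1):
        qt1[l] = sigma_w2 * qt1[l + 1] * gh_1d(lambda u: dphi(np.sqrt(q1[l]) * u) ** 2)
        qt2[l] = sigma_w2 * qt2[l + 1] * I_phi(dphi, q1[l], q2[l])
    return qh1, qh2, qt1, qt2

phi, dphi = np.tanh, lambda u: 1.0 - np.tanh(u) ** 2
qh1, qh2, qt1, qt2 = order_params(phi, dphi, L=3, sigma_w2=3.0, sigma_b2=0.64)
```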

3  Eigenvalue Statistics of FIMs

This section shows the asymptotic eigenvalue statistics of the FIMs. When we have a $P \times P$ metric tensor whose eigenvalues are $\lambda_i$ $(i=1,\dots,P)$, we compute the following quantities:
$$m_\lambda := \frac{1}{P}\sum_{i=1}^{P} \lambda_i, \qquad s_\lambda := \frac{1}{P}\sum_{i=1}^{P} \lambda_i^2, \qquad \lambda_{\max} := \max_i \lambda_i.$$
The obtained results are universal for any sample size $N$, which may depend on $M$.
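These statistics are straightforward to read off from a spectrum. The small helper below, assuming a symmetric metric tensor stored as a NumPy array, can be used to compute them for any of the matrices considered in this letter.

```python
import numpy as np

def eig_stats(G):
    """Mean, second moment, and maximum of the eigenvalues of a symmetric matrix G."""
    lam = np.linalg.eigvalsh(G)
    return lam.mean(), np.mean(lam ** 2), lam.max()

# Example with a random Gram matrix standing in for a metric tensor.
rng = np.random.default_rng(0)
R = rng.normal(size=(300, 50))
m_lam, s_lam, lam_max = eig_stats(R @ R.T)
```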

3.1  FIM for Regression Tasks

This section overviews the results obtained in the previous studies (Karakida et al., 2019a, 2019b). The metric tensor $F$ is equivalent to the Fisher information matrix (FIM) (Amari, 1998; Pascanu & Bengio, 2014; Ollivier, 2015; Park et al., 2000; Martens & Grosse, 2015), originally defined by
$$F := \mathbb{E}\left[\nabla_\theta \log p(x,y;\theta)\, \nabla_\theta \log p(x,y;\theta)^\top\right].$$
(3.1)
The statistical model is given by $p(x,y;\theta) = p(y|x;\theta)\, q(x)$, where $p(y|x;\theta)$ is the conditional probability distribution of the DNN output $y$ given the input $x$, and $q(x)$ is an input distribution. The expectation $\mathbb{E}[\cdot]$ is taken over the input-output pairs $(x,y)$ of the joint distribution $p(x,y;\theta)$. This FIM appears in the Kullback-Leibler divergence between a statistical model and an infinitesimal change to it: $\mathrm{KL}[p(x,y;\theta) : p(x,y;\theta + d\theta)] \approx d\theta^\top F\, d\theta$. The parameter space of $\theta$ forms a Riemannian manifold, and the FIM acts as its Riemannian metric tensor (Amari, 2016).
Basically, there are two types of FIM for supervised learning, depending on the definition of the statistical model. One type corresponds to the mean squared error (MSE) loss for regression tasks; the other corresponds to the cross-entropy loss for classification tasks. The latter is discussed in section 3.3. Let us consider the following statistical model for regression:
$$p(y|x;\theta) = \frac{1}{(2\pi)^{C/2}} \exp\left(-\frac{1}{2}\left\|y - f(x;\theta)\right\|^2\right).$$
(3.2)
Substituting $p(y|x;\theta)$ into the original definition of the FIM, equation 3.1, and taking the integral over $y$, one can easily confirm that it is equivalent to the metric tensor, equation 2.9, introduced by the perturbation. Note that the loss function $\mathrm{Loss}(\theta)$ is given by the negative log likelihood of this model, $\mathrm{Loss}(\theta) = -\mathbb{E}[\ln p(y|x;\theta)]$, where we take the expectation over empirical input-output samples. This becomes the MSE loss.

The previous studies (Karakida et al., 2019a, 2019b) uncovered the following eigenvalue statistics of the FIM, equation 2.10.

Theorem 1 (Karakida et al., 2019a, 2019b).
When $M$ is sufficiently large, the eigenvalue statistics of $F$ can be asymptotically evaluated as
$$m_\lambda \simeq \kappa_1 \frac{C}{M}, \qquad s_\lambda \simeq \alpha\left(\frac{N-1}{N}\kappa_2^2 + \frac{\kappa_1^2}{N}\right)C, \qquad \lambda_{\max} \simeq \alpha\left(\frac{N-1}{N}\kappa_2 + \frac{\kappa_1}{N}\right)M,$$
where $\alpha := \sum_{l=1}^{L-1} \alpha_l \alpha_{l-1}$, and the positive constants $\kappa_1$ and $\kappa_2$ are obtained from the order parameters,
$$\kappa_1 := \sum_{l=1}^{L} \frac{\alpha_{l-1}}{\alpha}\, \tilde{q}_1^l\, \hat{q}_1^{l-1}, \qquad \kappa_2 := \sum_{l=1}^{L} \frac{\alpha_{l-1}}{\alpha}\, \tilde{q}_2^l\, \hat{q}_2^{l-1}.$$

The average of the eigenvalue spectrum asymptotically decreases on the order of 1/M, while the variance takes a value of O(1) and the largest eigenvalue takes a huge value of O(M). It implies that most of the eigenvalues are asymptotically close to zero, while the number of large eigenvalues is limited. Thus, when the network is sufficiently wide, one can see that the shape of the spectrum asymptotically becomes pathologically distorted. This suggests that the parameter space of the DNNs is locally almost flat in most directions but highly distorted in a few specific directions.
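The statistics of theorem 1 can be evaluated numerically from the order parameters. The sketch below assumes the order_params helper (and its tanh outputs) from the sketch in section 2.3 and uses the illustrative setting of Figure 3 ($L=3$, $\alpha_l = 1$, $M=200$, $N=100$, $C=10$), with $M_0 = M$ so that $\alpha_0 = 1$.

```python
def theorem1_stats(qh1, qh2, qt1, qt2, alphas, M, N, C):
    """Asymptotic m_lambda, s_lambda, lambda_max of F from theorem 1.
    alphas[l] = M_l / M for l = 0, ..., L-1 (the output layer is excluded)."""
    L = len(qt1)                        # qt1 has keys 1, ..., L
    alpha = sum(alphas[l] * alphas[l - 1] for l in range(1, L))
    kappa1 = sum(alphas[l - 1] / alpha * qt1[l] * qh1[l - 1] for l in range(1, L + 1))
    kappa2 = sum(alphas[l - 1] / alpha * qt2[l] * qh2[l - 1] for l in range(1, L + 1))
    m_lam = kappa1 * C / M
    s_lam = alpha * ((N - 1) / N * kappa2 ** 2 + kappa1 ** 2 / N) * C
    lam_max = alpha * ((N - 1) / N * kappa2 + kappa1 / N) * M
    return m_lam, s_lam, lam_max

# qh1, qh2, qt1, qt2 come from the order_params sketch in section 2.3 (tanh, L = 3).
m_lam, s_lam, lam_max = theorem1_stats(qh1, qh2, qt1, qt2,
                                       alphas=[1.0, 1.0, 1.0], M=200, N=100, C=10)
```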

In particular, regarding the large eigenvalues, we have:

Theorem 2 (theorem 4 of Karakida et al., 2019a).
When $M$ is sufficiently large, the eigenspace corresponding to $\lambda_{\max}$ is spanned by the $C$ eigenvectors
$$\mathbb{E}[\nabla_\theta f_k] \quad (k=1,\dots,C).$$
When $N = \rho^{-1} M$ with a constant $\rho > 0$, under the gradient independence assumption, the second largest eigenvalue $\lambda'_{\max}$ is bounded by
$$\rho\alpha(\kappa_1 - \kappa_2) + c_1 \;\leq\; \lambda'_{\max} \;\leq\; \sqrt{\left(C\alpha^2\rho(\kappa_1 - \kappa_2)^2 + c_2\right)M},$$
(3.3)
for nonnegative constants c1 and c2.
Since $\lambda'_{\max}$ is of order $\sqrt{M}$ at most, the largest eigenvalues $\lambda_{\max}$ act as outliers. It is noteworthy that batch normalization in the last layer can eliminate them. Such normalization includes mean subtraction, $\bar{f}_k := f_k - \mathbb{E}[f_k]$. Karakida et al. (2019a) analyzed the corresponding FIM:
$$\bar{F} := \sum_k \mathbb{E}\left[\nabla_\theta \bar{f}_k\, \nabla_\theta \bar{f}_k^\top\right] = \sum_k \left(\mathbb{E}\left[\nabla_\theta f_k\, \nabla_\theta f_k^\top\right] - \mathbb{E}[\nabla_\theta f_k]\, \mathbb{E}[\nabla_\theta f_k]^\top\right).$$
(3.4)
The subtraction, equation 3.4, means eliminating the $C$ largest eigenvalues from $F$. Numerical experiments confirmed that the largest eigenvalue of $\bar{F}$ is of order 1. Note that when $N \propto M$, the sample size is sufficiently large but the network satisfies $P \gg N$ and remains overparameterized.
Figure 3a (left) shows a typical spectrum of the FIM. We computed the eigenvalues by using random gaussian weights, biases, and inputs. We used deep tanh networks with $L=3$, $M=200$, $C=10$, $\alpha_l=1$, and $(\sigma_w^2, \sigma_b^2) = (3, 0.64)$. The sample size was $N=100$. The histograms were made from eigenvalues over 100 different networks with different random seeds. The histogram had two populations. The red dashed histogram was made by eliminating the largest $C$ eigenvalues. It coincides with the smaller population. Thus, one can see that the larger population corresponds to the $C$ largest eigenvalues. The larger population in experiments can be distributed around $\lambda_{\max}$ because $M$ is large but finite.
Figure 3:

Eigenvalue spectra of FIMs in experiments show that top C eigenvalues act as outliers. (a) Left: Case of F. Right: Case of diagonal block F22. The vertical axis is the cumulative number of eigenvalues over 100 different networks. The black histograms show the original spectra, while the red dashed ones show the spectra without the C largest eigenvalues. The blue lines represent the theoretical values of the largest eigenvalues. (b) Left: Case of Fcross. Middle: Case of Fcross22. Right: Case of Q described in section 3.4. In contrast to the linear output, softmax output makes outliers disperse.


Remark 1: Loss Landscape and Gradient Methods.
The empirical FIM, equation 2.10, is equivalent to the Hessian of the loss, $\nabla_\theta^2 \mathrm{Loss}(\theta)$, around the global minimum with zero training loss. Karakida et al. (2019a) referred to the steep shape of the local loss landscape caused by $\lambda_{\max}$ as pathological sharpness. The sharpness of the loss landscape is connected to an appropriate learning rate of gradient methods for convergence. The previous work empirically confirmed that a learning rate $\eta$ satisfying
$$\eta < 2/\lambda_{\max}$$
(3.5)
is necessary for the steepest gradient method to converge (Karakida et al., 2019b). In fact, this $2/\lambda_{\max}$ acts as a boundary of the neural tangent kernel regime (Lee et al., 2019; Lewkowycz, Bahri, Dyer, Sohl-Dickstein, & Gur-Ari, 2020). Because $\lambda_{\max}$ increases with the width and depth, we need to carefully choose an appropriately scaled learning rate to train the DNNs.

3.2  Diagonal Blocks of FIM

One can easily obtain insight into the diagonal blocks, that is, $F_{ll}$, in the same way as theorem 1. Let us denote the set of parameters in the $l$th layer by $\theta_l$. We have $F_{ll} = \sum_k \mathbb{E}\left[\nabla_{\theta_l} f_k\, \nabla_{\theta_l} f_k^\top\right]$. When $M$ is sufficiently large, the eigenvalue statistics of $F_{ll}$ are asymptotically evaluated as
$$m_\lambda^l \simeq \frac{\tilde{q}_1^l\, \hat{q}_1^{l-1}}{\alpha_l}\frac{C}{M}, \qquad s_\lambda^l \simeq \frac{\alpha_{l-1}}{\alpha_l}\left(\frac{N-1}{N}\left(\tilde{q}_2^l\, \hat{q}_2^{l-1}\right)^2 + \frac{\left(\tilde{q}_1^l\, \hat{q}_1^{l-1}\right)^2}{N}\right)C, \qquad \lambda_{\max}^l \simeq \alpha_{l-1}\left(\frac{N-1}{N}\tilde{q}_2^l\, \hat{q}_2^{l-1} + \frac{\tilde{q}_1^l\, \hat{q}_1^{l-1}}{N}\right)M,$$
for $l=1,\dots,L$. The eigenspace corresponding to the largest eigenvalues is spanned by the $C$ eigenvectors $\mathbb{E}[\nabla_{\theta_l} f_k]$ $(k=1,\dots,C)$. The derivation is shown in appendix A. The order of the eigenvalue statistics is the same as for the full-sized FIM. Figure 3a (right) empirically confirms that $F_{22}$ has a pathological spectrum similar to that of $F$. Its experimental setting was the same as in the case of $F$.

It is helpful to investigate the relation between $F$ and its diagonal blocks when one considers the diagonal block approximation of $F$. For example, use of a diagonal block approximation can decrease the computational cost of natural gradient algorithms (Martens & Grosse, 2015; Amari, Karakida, & Oizumi, 2019; Karakida & Osawa, 2020). When a matrix is composed only of diagonal blocks, its eigenvalues are given by those of each diagonal block. $F$ approximated in this fashion has the same mean of the eigenvalues as the original $F$ and the largest eigenvalue $\max_l \lambda_{\max}^l$, which is of $O(M)$. Thus, the diagonal block approximation also suffers from a pathological spectrum. Eigenvalues that are close to zero can make the inversion of the FIM in natural gradient unstable, whereas using a damping term seems to be an effective way of dealing with this instability (Martens & Grosse, 2015).
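As a minimal illustration of the damped, layer-wise block-diagonal natural gradient discussed above, the sketch below inverts each diagonal block with a damping term. The inputs (per-layer gradients and Jacobians) and the damping value are placeholders, not quantities prescribed by the theory.

```python
import numpy as np

def block_diag_natural_gradient(grads, jacobians, N, damping=1e-3, lr=0.1):
    """Layer-wise update -lr * (F_ll + damping * I)^{-1} grad_l,
    where F_ll = J_l J_l^T / N is the l-th diagonal block of the empirical FIM.
    grads[l]: flattened loss gradient of layer l, shape (P_l,).
    jacobians[l]: flattened output Jacobian of layer l, shape (P_l, C*N)."""
    updates = []
    for g, J in zip(grads, jacobians):
        F_ll = J @ J.T / N
        F_ll[np.diag_indices_from(F_ll)] += damping    # damping stabilizes the inversion
        updates.append(-lr * np.linalg.solve(F_ll, g))
    return updates
```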

3.3  FIM for Multilabel Classification Tasks

The cross-entropy loss is typically used in multilabel classification tasks. It comes from the log likelihood of the following statistical model:
$$p(y|x;\theta) = \prod_{k=1}^{C} g_k(x)^{y_k},$$
(3.6)
where $g$ is the softmax output, equation 2.4, and $y$ is a $C$-dimensional one-hot vector. The cross-entropy loss is given by $-\mathbb{E}\left[\sum_k y_k \log g_k\right]$. Substituting the statistical model into the definition of the FIM, equation 3.1, and taking the summation over $y$, we find that the empirical FIM for the cross-entropy loss is given by
$$F_{\mathrm{cross}} = \frac{1}{N}\sum_n^N \sum_{k,k'}^C \nabla_\theta f_k(n)\, Q_n(k,k')\, \nabla_\theta f_{k'}(n)^\top,$$
(3.7)
$$Q_n(k,k') := g_k(n)\,\delta_{kk'} - g_k(n)\, g_{k'}(n).$$
(3.8)
The contribution of the softmax output appears only in $Q_n$. $F_{\mathrm{cross}}$ is linked to $F$ through the matrix representation shown in Figure 2a. One can view $F$ as a matrix representation with $Q = I$, that is, the identity matrix. In contrast, $F_{\mathrm{cross}}$ corresponds to a block-diagonal $Q$ whose $n$th block is given by the $C \times C$ matrix $Q_n$. In a similar way to equation 2.8, we can see $F_{\mathrm{cross}}$ as the metric tensor for the parameter space. Using the softmax output $g$, we have
$$4\,\mathbb{E}\left[\left\|\sqrt{g(x;\theta + d\theta)} - \sqrt{g(x;\theta)}\right\|^2\right] \approx d\theta^\top F_{\mathrm{cross}}\, d\theta,$$
(3.9)
where the square root is taken entry-wise.

We obtain the following result of Fcross's eigenvalue statistics:

Theorem 3.
When M is sufficiently large, the eigenvalue statistics of Fcross are asymptotically evaluated as
$$m_\lambda \simeq \beta_1 \frac{\kappa_1}{M}, \qquad s_\lambda \simeq \alpha\, \frac{\beta_2 \kappa_2^2 + \beta_3 \kappa_1^2}{N^2}, \qquad \frac{\beta_4}{N}\, \alpha\left(\frac{N-1}{N}\kappa_2 + \frac{\kappa_1}{N}\right)M \;\leq\; \lambda_{\max} \;\leq\; \sqrt{\alpha\, s_\lambda}\, M,$$
where the constant coefficients are given by
$$\begin{aligned} \beta_1 &:= 1 - \frac{1}{N}\sum_n^N \sum_k^C g_k(n)^2, \qquad \beta_2 := \sum_{n \neq m}^N \left\{ \sum_k^C g_k(m)\, g_k(n) - 2\sum_k^C g_k(m)^2\, g_k(n) + \Big(\sum_k^C g_k(m)\, g_k(n)\Big)^2 \right\}, \\ \beta_3 &:= \sum_n^N \left\{ \sum_k^C \left(1 - 2 g_k(n)\right) g_k(n)^2 + \Big(\sum_k^C g_k(n)^2\Big)^2 \right\}, \qquad \beta_4 := \max_{1 \leq k \leq C} \sum_n^N g_k(n)\left(1 - g_k(n)\right). \end{aligned}$$

The derivation is shown in section B.1. We find that the eigenvalue spectrum shows the same width dependence as the FIM for regression tasks. Although the evaluation of $\lambda_{\max}$ in theorem 3 is based on inequalities, one can see that $\lambda_{\max}$ increases linearly as the width $M$ or the depth $L$ increases. The softmax functions appear in the coefficients $\beta_k$. It should be noted that the values of the $\beta_k$ generally depend on the realized softmax outputs, which differ across the output indices $k$. This is because the values of the softmax functions depend on the specific configuration of $W^L$ and $b^L$.

If a relatively loose bound is acceptable, we can use the following simpler evaluation. Let us denote the eigenvalue statistics of $F$ shown in theorem 1 by $(m_\lambda^{\mathrm{lin}}, s_\lambda^{\mathrm{lin}}, \lambda_{\max}^{\mathrm{lin}})$. They correspond to the contribution of the linear output $f$ before it is put into the softmax function. Taking into account the contribution of the softmax function, we have
$$m_\lambda \leq m_\lambda^{\mathrm{lin}}, \qquad s_\lambda \leq s_\lambda^{\mathrm{lin}}, \qquad \lambda_{\max} \leq \lambda_{\max}^{\mathrm{lin}}.$$
(3.10)
Note that $F_{\mathrm{cross}}$'s eigenvalues satisfy $\lambda_i(F_{\mathrm{cross}}) \leq \lambda_{\max}(Q)\,\lambda_i(F)$ $(i=1,\dots,P;\ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_P)$. The inequality 3.10 comes from $\lambda_{\max}(Q) \leq 1$.
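The inequality $\lambda_{\max}(Q) \leq 1$ can be checked directly: each block $Q_n = \mathrm{diag}(g(n)) - g(n)g(n)^\top$ of equation 3.8 has largest eigenvalue at most $\max_k g_k(n) \leq 1$. A small numerical sketch with random softmax outputs:

```python
import numpy as np

def softmax(f):
    e = np.exp(f - f.max())
    return e / e.sum()

def Q_block(g):
    """C x C block Q_n of equation 3.8 for one sample's softmax output g."""
    return np.diag(g) - np.outer(g, g)

rng = np.random.default_rng(0)
C, N = 10, 100
G = np.array([softmax(rng.normal(size=C)) for _ in range(N)])     # softmax outputs g(n)

lam_max_Q = max(np.linalg.eigvalsh(Q_block(g)).max() for g in G)  # lambda_max of block-diagonal Q
print(lam_max_Q <= G.max() + 1e-12, lam_max_Q <= 1.0)             # both True
```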
Figure 4 shows that our theory predicts the experimental results rather well for artificial data. We computed the eigenvalues of $F_{\mathrm{cross}}$ with random gaussian weights, biases, and inputs. We set $L=3$, $M=1000$, $C=10$, $\alpha_l=1$, and $(\sigma_w^2, \sigma_b^2) = (3, 0.64)$ in the tanh case, $(2, 0.1)$ in the ReLU case, and $(1, 0.1)$ in the linear case. The sample size was set to $N=100$. The predictions of theorem 3 coincided with the experimental results for sufficiently large widths.
Figure 4:

Fcross's eigenvalue statistics in experiments are well described by the theory: means (left), second moments (center), and maximum (right). Black points and error bars show means and standard deviations of the experimental results over 100 different networks with different random seeds. The blue lines represent the theoretical results obtained in the large M limit. For λmax, the dashed lines show the theoretical upper bound, while the solid ones show the lower bound.


3.4  Fcross's Large Eigenvalues

In this section, we take a closer look at $F_{\mathrm{cross}}$'s large eigenvalues. Figure 3b (left) shows a typical spectrum of $F_{\mathrm{cross}}$. We set $M=1000$, and the other settings were the same as in the case of $F$. We found that compared to $F$, $F_{\mathrm{cross}}$ had the largest $C\,(=10)$ eigenvalues widely spread from the bulk of the spectrum. Naively speaking, this is because the coefficient matrix $Q$ has distributed eigenvalues, as is shown in Figure 3b (right). Compared to the FIM for regression, which corresponds to $Q=I$, the distributed eigenvalues of $Q$ can make $F_{\mathrm{cross}}$'s eigenvalues disperse. The diagonal block of $F_{\mathrm{cross}}$ shown in Figure 3b (middle) also has the same characteristics of the spectrum as the original $F_{\mathrm{cross}}$. It is noteworthy that exhaustive experiments on the cross-entropy loss have recently confirmed that there are $C$ dominant large eigenvalues (so-called outliers) even after training (Papyan, 2019).

Consistent with the empirical results, we found that there are C large eigenvalues:

Theorem 4.

The first $C$ largest eigenvalues of $F_{\mathrm{cross}}$ are of $O(M)$.

The theorem is proved in section B.2. These $C$ large eigenvalues are reminiscent of the $C$ largest eigenvalues of $F$ shown in theorem 1 and can act as outliers. Note that because $\lambda_i(F_{\mathrm{cross}}) \leq \lambda_i(F)$, we have $\lambda_{C+1} \leq \lambda'_{\max}$, where $\lambda_{C+1}$ denotes the $(C+1)$th largest eigenvalue of $F_{\mathrm{cross}}$. Let us suppose that $N \propto M$ and the assumptions of theorem 2 hold. In this case, $\lambda'_{\max}$ is upper-bounded by equation 3.3, and then $\lambda_{C+1}$ is of $O(\sqrt{M})$ at most. This means that the first $C$ largest eigenvalues of $F_{\mathrm{cross}}$ can become outliers. It would be interesting to extend the above results and theoretically quantify more precise values of these outliers. One promising direction will be to analyze the hierarchical structure of $F_{\mathrm{cross}}$ empirically investigated by Papyan (2019). It is also noteworthy that our outliers disappear under the mean subtraction of $f$ in the last layer mentioned in equation 3.4. This is because we have $\lambda_i(\bar{F}_{\mathrm{cross}}) \leq \lambda_i(\bar{F})$ under the mean subtraction.

Although this work mainly focuses on DNNs with random weights, it also gives some insight into the training of sufficiently wide neural networks. It is known that the whole training dynamics of gradient descent can be explained by the NTK in sufficiently wide neural networks; see section 4 for more details. In the case of the cross-entropy loss, the functional gradient is given by $Q\Theta(y - g)$ (Lee et al., 2019), where $\Theta$ is the NTK at random initialization described in equation 4.1, and $Q$ depends on the softmax function $g$ at each time step. One can numerically solve the training dynamics of $g$ and obtain the theoretical value of the training loss (Lee et al., 2019). In Figure 5 (left), we confirmed that the theoretical line coincided well with the experimental results of gradient descent training. We set random initialization by $(\sigma_w^2, \sigma_b^2) = (2, 0)$. We used artificial data with gaussian inputs ($N=100$) and generated their labels by a teacher network whose architecture was the same as the trained network. In addition, Figure 5 (right) shows the largest eigenvalues during training. As is expected from NTK theory, $\lambda_{\max}^{\mathrm{lin}}$ remained unchanged during training. In contrast, $F_{\mathrm{cross}}$'s largest eigenvalue dynamically changed because $Q$ depends on the softmax output $g$, which changes on the scale of $O(1)$. We calculated the theoretical bounds of $\lambda_{\max}$ (blue lines) by substituting $g$ at each time step into theorem 3. Note that the bounds obtained in theorem 3 are available even in the NTK regime because the coefficients $\beta_k$ admit any value of $g$ and are not limited to random initialization. These theoretical bounds explained well the experimental results of $\lambda_{\max}$ during training.
Figure 5:

Training with the cross-entropy loss in the NTK regime. Left: Training dynamics of the cross-entropy loss. Right: Training dynamics of the largest eigenvalue. The largest eigenvalue of F (gray crosses) remains unchanged, while that of Fcross (black points) dynamically changes and approaches zero. We trained a three-layered deep neural network (L=3, M=2000, C=2, αl=1) by gradient descent on artificial data. We used five different networks with different random seeds and show the average and deviation. The colored lines represent the theoretical results obtained in the large M limit.


Because the global minimum is given by g=y, all entries of Q and λmax approach zero after a large enough number of steps. In the MSE case, we can explain the critical learning rate for convergence as is remarked in equation 3.5. In contrast, it is challenging to estimate such a critical learning rate in the cross-entropy case since Q and Fcross dynamically change.

4  Connection to Neural Tangent Kernel

4.1  Scale-Dependent Eigenvalue Statistics

The empirical FIM, equation 2.10, is essentially connected to a recently proposed Gram matrix, the neural tangent kernel (NTK). Jacot et al. (2018) defined the NTK by
$$\Theta := \nabla_\theta f^\top\, \nabla_\theta f.$$
(4.1)
Note that the Jacobian $\nabla_\theta f$ is a $P \times CN$ matrix with each column corresponding to $\nabla_\theta f_k(n)$ $(k=1,\dots,C,\ n=1,\dots,N)$. Under certain conditions with sufficiently large $M$, the NTK at random initialization governs the whole training process in the function space by
$$\frac{df}{dt} = \frac{\eta}{N}\,\Theta\,(y - f),$$
(4.2)
where $t$ corresponds to the time step of the parameter update and $\eta$ represents the learning rate. Specifically, the NTK's eigenvalues determine the speed of convergence of the training dynamics. Moreover, one can predict the network output on test samples by using the gaussian process with the NTK (Jacot et al., 2018; Lee et al., 2019).
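Equation 4.2 is a linear differential equation in function space, so for the MSE loss it can be integrated in closed form. The sketch below uses a random positive semidefinite matrix as a stand-in for the NTK and shows that each eigenmode of the residual $y - f$ decays at a rate set by the corresponding NTK eigenvalue, which is the sense in which the eigenvalues determine the speed of convergence.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50                                   # dimension CN of the function-space vector f
J = rng.normal(size=(200, n))
Theta = J.T @ J                          # placeholder NTK (symmetric PSD), equation 4.1

y = rng.normal(size=n)                   # targets
f0 = rng.normal(size=n)                  # network output at initialization
eta, N = 0.1, 25.0

# Solution of df/dt = (eta / N) * Theta * (y - f):
#   f(t) = y + exp(-eta * t * Theta / N) (f0 - y).
lam, V = np.linalg.eigh(Theta)
def f_at(t):
    decay = np.exp(-eta * t * lam / N)   # each eigenmode decays at its own rate
    return y + V @ (decay * (V.T @ (f0 - y)))

print(np.linalg.norm(y - f_at(0.0)), np.linalg.norm(y - f_at(1000.0)))
```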

The NTK and the empirical FIM share essentially the same nonzero eigenvalues. It is easy to see that one can represent the empirical FIM, equation 2.10, by $F = \nabla_\theta f\, \nabla_\theta f^\top / N$. This means that the NTK, equation 4.1, is the left-to-right reversal of $F$ up to the constant factor $1/N$. Karakida et al. (2019b) introduced $F^* = \nabla_\theta f^\top\, \nabla_\theta f / N$, which is essentially the same as the NTK, and referred to $F^*$ as the dual of $F$. They used $F^*$ to derive theorem 1.

It should be noted that the studies of the NTK typically suppose a special parameterization different from the usual setting. They consider DNNs with a parameter set $\theta = \{\omega_{ij}^l, \beta_i^l\}$, which determines the weights and biases by
$$W_{ij}^l = \frac{\sigma_w}{\sqrt{M_{l-1}}}\,\omega_{ij}^l, \qquad b_i^l = \sigma_b\,\beta_i^l, \qquad \omega_{ij}^l,\, \beta_i^l \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, 1).$$
(4.3)
This NTK parameterization changes the scaling of the Jacobian. For instance, we have $\nabla_{\omega_{ij}^l} f = \frac{\sigma_w}{\sqrt{M_{l-1}}}\, \nabla_{W_{ij}^l} f$. This makes the eigenvalue statistics slightly change from those of theorem 1. When $M$ is sufficiently large, the eigenvalue statistics of $\Theta$ under the NTK parameterization are asymptotically evaluated as
$$m_\lambda \simeq \alpha\,\kappa_1'\, C, \qquad s_\lambda \simeq \alpha^2\left((N-1)\kappa_2'^2 + \kappa_1'^2\right)C, \qquad \lambda_{\max} \simeq \alpha\left((N-1)\kappa_2' + \kappa_1'\right).$$
The positive constants $\kappa_1'$ and $\kappa_2'$ are obtained from the order parameters,
$$\kappa_1' := \frac{1}{\alpha}\sum_{l=1}^{L}\left(\sigma_w^2\, \tilde{q}_1^l\, \hat{q}_1^{l-1} + \sigma_b^2\, \tilde{q}_1^l\right), \qquad \kappa_2' := \frac{1}{\alpha}\sum_{l=1}^{L}\left(\sigma_w^2\, \tilde{q}_2^l\, \hat{q}_2^{l-1} + \sigma_b^2\, \tilde{q}_2^l\right).$$
The derivation is given in appendix C. The NTK parameterization makes the eigenvalue statistics independent of the width scale. This is because the NTK parameterization maintains the scale of the weights but changes the scale of the gradients with respect to the weights. It also makes $(\kappa_1, \kappa_2)$ shift to $(\kappa_1', \kappa_2')$. This shift occurs because the NTK parameterization makes the order of the weight gradients $\nabla_\omega f$ comparable to that of the bias gradients $\nabla_\beta f$. The second terms in $\kappa_1'$ and $\kappa_2'$ correspond to a nonnegligible contribution from $\nabla_\beta f$.

While mλ is independent of the sample size N, λmax depends on it. This means that the NTK dynamics converge nonuniformly. Most of the eigenvalues are relatively small, and the NTK dynamics converge more slowly in the corresponding eigenspace. In addition, a prediction made with the NTK requires the inverse of the NTK to be computed (Jacot et al., 2018; Lee et al., 2019). When the sample size is large, the condition number of the NTK, λmax/λmin, is also large and the computation with the inverse NTK is expected to be numerically inaccurate.

4.2  Scale-Independent NTK

A natural question is under what condition the NTK's eigenvalue statistics become independent of both the width and the sample size. As indicated in equation 3.4, the mean subtraction in the last layer with $N \propto M$ is a simple way to make the FIM's largest eigenvalue independent of the width. Similarly, one can expect that the mean subtraction makes the NTK's largest eigenvalue of $O(N)$ disappear and the eigenvalue spectrum take a range of $O(1)$ independent of the large width and sample size.

Figure 6 empirically confirms this speculation. We set $L=3$, $C=2$, and $\alpha_l=1$ and used the ReLU activation function, gaussian inputs, and weights with $(\sigma_w^2, \sigma_b^2) = (2, 0)$. As shown in Figure 6 (left), the NTK's eigenvalue spectrum becomes pathologically distorted as the sample size increases. To make an easier comparison of the spectra, the eigenvalues in this figure are normalized by $1/N$. As the sample size increases, most of the eigenvalues concentrate close to zero, while the largest eigenvalues become outliers. In contrast, Figure 6 (right) shows that the mean subtraction keeps the NTK's whole spectrum in a range of $O(1)$ under the condition $N=M$. The spectrum empirically converged to a fixed distribution in the large $M$ limit.
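The mean subtraction used here amounts to centering the columns of the Jacobian per output unit before forming the NTK, the empirical analog of equation 3.4. A sketch, again with a random placeholder Jacobian whose columns are assumed to be ordered by output unit and then by sample:

```python
import numpy as np

def centered_ntk(jac, C, N):
    """NTK of the mean-subtracted outputs: each column grad f_k(n) is replaced by
    grad f_k(n) - (1/N) sum_m grad f_k(m), as in equation 3.4."""
    J = jac.reshape(jac.shape[0], C, N)          # columns assumed grouped by output k
    Jc = (J - J.mean(axis=2, keepdims=True)).reshape(jac.shape[0], C * N)
    return Jc.T @ Jc

rng = np.random.default_rng(0)
P, C, N = 2000, 2, 200
jac = rng.normal(size=(P, C * N))                # placeholder for grad_theta f
theta_raw = jac.T @ jac
theta_centered = centered_ntk(jac, C, N)
print(np.linalg.eigvalsh(theta_raw).max(), np.linalg.eigvalsh(theta_centered).max())
```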
Figure 6:

Spectra of the NTK (Θ) become independent of the scale by a normalization method (mean subtraction). Left: Spectra with M=1000 and various N. The eigenvalues are normalized by 1/N for comparison. Right: Spectra under mean subtraction in the last layer and the condition N=M. The vertical axes represent the probability density obtained from the cumulative number of eigenvalues over 400 different networks.


5  Metric Tensor for Input and Feature Spaces

The above framework for evaluating FIMs is also applicable to the metric tensors for the input and feature spaces, which are expressed in the matrix form in Figure 2c. Let us denote $A_k := \mathbb{E}\left[\nabla_h f_k\, \nabla_h f_k^\top\right]$. It is easy to obtain the eigenvalue statistics of $A$ from those of $A_k$. We can prove the following theorem:

Theorem 5.
When $M$ is sufficiently large, the eigenvalue statistics of $A_k$ are asymptotically evaluated as
$$m_\lambda \simeq \frac{\tilde{\kappa}_1}{M}, \qquad s_\lambda \simeq \tilde{\alpha}\left(\frac{N-1}{N}\tilde{\kappa}_2^2 + \frac{\tilde{\kappa}_1^2}{N}\right), \qquad \lambda_{\max} \simeq \tilde{\alpha}\left(\frac{N-1}{N}\tilde{\kappa}_2 + \frac{\tilde{\kappa}_1}{N}\right),$$
where $\tilde{\alpha} := \sum_{l=0}^{L-1} \alpha_l$, and the positive constants $\tilde{\kappa}_1$ and $\tilde{\kappa}_2$ are obtained from the order parameters,
$$\tilde{\kappa}_1 := \frac{\sigma_w^2}{\tilde{\alpha}} \sum_{l=1}^{L} \tilde{q}_1^l, \qquad \tilde{\kappa}_2 := \frac{\sigma_w^2}{\tilde{\alpha}} \sum_{l=1}^{L} \tilde{q}_2^l.$$
The eigenvector of $A_k$ corresponding to $\lambda_{\max}$ is $\mathbb{E}[\nabla_h f_k]$.

The theorem is derived in appendix D. Since $A$ is the summation of $A_k$ over the $C$ output units, $m_\lambda$ and $s_\lambda$ of $A$ are $C$ times as large as those of $A_k$. The mean of the eigenvalues asymptotically decreases on the order of $O(1/M)$. Note that when $M_h \geq N$, $A_k$ has $M_h - N$ trivial zero eigenvalues. Even if we neglect these trivial zero eigenvalues, the mean becomes $\tilde{\alpha}\tilde{\kappa}_1/N$ and decreases on the order of $O(1/N)$. In contrast, the largest eigenvalue is of $O(1)$ for any $M$ and $N$. Thus, the spectrum of $A_k$ is pathologically distorted in the sense that the mean is separated from the edge of the spectrum by an order-of-magnitude gap. The local geometry of $h$ is strongly distorted in the direction of $\mathbb{E}[\nabla_h f_k]$. Similarly, it is easy to derive the eigenvalue statistics of the diagonal blocks $A_k^{ll}$. The details are shown in the appendix.

Figure 7a (left) shows typical spectra of A and Figure 7a (right) those of A00. We used deep tanh networks with M=500 and N=1000. The other experimental settings are the same as those in Figure 3a. The pathological spectra appear as the theory predicts. Similarly, we show spectra of softmax output in Figure 7b. The softmax output made the outliers widely spread in the same manner as in Fcross.
Figure 7:

Spectra of metric tensor for input and feature spaces show that top C eigenvalues act as outliers. Left: The spectra of A. Right: The spectra of A00. The vertical axis shows the cumulative number of eigenvalues over 100 different networks. The black histograms show the original spectra, while the red dashed ones show the spectra without the C largest eigenvalues. The blue lines represent the theoretical values of the largest eigenvalues.


Let us remark on some related work in the deep learning literature. First, Pennington et al. (2018) investigated similar but different matrices. Briefly, they used random matrix theory and obtained the eigenvalue spectrum of $\nabla_{u^l} f$ with $N=1$ and $M_l = C = M$. They found that the isometry of the spectrum is helpful for solving the vanishing gradient problem. Second, DNNs are known to be vulnerable to a specific noise perturbation, that is, the adversarial example (Goodfellow et al., 2014). One can speculate that the eigenvector corresponding to $\lambda_{\max}$ may be related to adversarial attacks, although such a conclusion will require careful consideration.

6  Conclusion and Discussion

We evaluated the asymptotic eigenvalue statistics of the FIM and its variants in sufficiently wide DNNs. We found that they have pathological spectra, that is, number-of-classes eigenvalues act as outliers in the conventional setting of random initialization and activation functions. In particular, we empirically demonstrated that softmax output disperses the outliers and makes a tail of the eigenvalue spectrum spread from the bulk. Since the FIM shares the same nonzero eigenvalues as NTK, the convergence property of the training dynamics can depend on the outliers. This suggests that we need to be careful about the eigenvalue statistics and their influence on the learning when we use large-scale deep networks in naive settings. These outliers can disappear under specific normalization of the last layer.

This work focused on fully connected neural networks, and it will be interesting to explore the spectra of other architectures such as ResNets and CNNs. It will also be fundamental to explore the eigenvalue statistics that this study cannot capture. While our study captured some of the basic eigenvalue statistics, it remains to derive the whole spectrum analytically. In particular, after the normalization excludes large outliers, the bulk of the spectrum becomes dominant. In such cases, random matrix theory seems to be a prerequisite for further progress. It enables us to analyze the FIM's eigenvalue spectrum of a shallow centered network in the large width limit under a fixed ratio between the width and sample size (Pennington & Worah, 2018). In deep networks, the Stieltjes transformation of the NTK's eigenvalue spectrum is obtained in an iterative formulation (Fan & Wang, 2020). Extending these analyses will lead to a better understanding of the spectrum. Furthermore, we assumed a finite number of network output units. In order to deal with multilabel classifications with high dimensionality, it would be helpful to investigate eigenvalue statistics in the wide limit of both hidden and network output layers. Finally, although we focused on finite depth and regarded the order parameters as constants, they can explode exponentially in extremely deep networks in the chaotic regime (Schoenholz et al., 2017; Yang, Pennington, Rao, Sohl-Dickstein, & Schoenholz, 2019). The NTK in such a regime has been investigated in Jacot et al. (2019).

It would also be interesting to explore further connections between the eigenvalue statistics and learning. Recent studies have yielded insights into the connection between the generalization performance of DNNs and the eigenvalue statistics of certain Gram matrices including FIM and NTK (Suzuki, 2018; Sun & Nielsen, 2019; Yang & Salman, 2019). We expect that the theoretical foundation of the metric tensors given in this letter will lead to a more sophisticated understanding and development of deep learning in the future.

Appendix A: Eigenvalue Statistics of F

A.1  Overviewing the Derivation of Theorem 1

The FIM is composed of feedforward and backpropagated signals. For example, its weight part is given by $(F_k^{ll'})_{(ij)(i'j')} := \mathbb{E}\left[\nabla_{W_{ij}^l} f_k\, \nabla_{W_{i'j'}^{l'}} f_k\right]$. One can represent it in matrix form,
$$F_k^{ll'} = \mathbb{E}\left[\delta_k^l\, (\delta_k^{l'})^\top \otimes h^{l-1}\, (h^{l'-1})^\top\right],$$
(A.1)
where $\otimes$ represents the Kronecker product. The variables $h$ and $\delta$ are functions of $x$, and the expectation is taken over $x$. This expression is not easy to treat in the analysis, so we introduce a dual expression of the FIM, which is essentially the same as the NTK, as follows.
First, we briefly overview the derivation of theorem 1 shown in Karakida et al. (2019a, 2019b). The essential point is that a Gram matrix has the same nonzero eigenvalues as its dual. One can represent the empirical FIM, equation 2.10, as
$$F = R R^\top, \qquad R := \frac{1}{\sqrt{N}}\left[\nabla_\theta f_1\ \ \nabla_\theta f_2\ \ \cdots\ \ \nabla_\theta f_C\right].$$
(A.2)
Its columns are the gradients on each input: $\nabla_\theta f_k(n)$ $(n=1,\dots,N)$. Let us refer to the $CN \times CN$ matrix $F^* := R^\top R$ as the dual of the FIM. The matrices $F$ and $F^*$ have the same nonzero eigenvalues by definition. This $F^*$ can be partitioned into $N \times N$ block matrices. The $(k,k')$th block is given by
$$F^*(k,k') = \nabla_\theta f_k^\top\, \nabla_\theta f_{k'} / N,$$
(A.3)
for $k,k'=1,\dots,C$. In the large $M$ limit, the previous study (Karakida et al., 2019b) showed that $F^*$ asymptotically satisfies
$$F^*(k,k') = \frac{\alpha M}{N} K\, \delta_{kk'} + \frac{1}{N}\, o(M),$$
(A.4)
where $\delta_{kk'}$ is the Kronecker delta. As is summarized in lemma A.1 of Karakida et al. (2019a), the second term of equation A.4 is negligible in the large $M$ limit. In particular, it is reduced to $O(\sqrt{M})/N$ under a certain condition. The matrix $K$ has entries given by
$$K_{nm} = \begin{cases} \kappa_1 & (n = m), \\ \kappa_2 & (n \neq m). \end{cases}$$
(A.5)
Using this $K$, the previous studies derived the basic eigenvalue statistics (Karakida et al., 2019b) and the eigenvectors corresponding to $\lambda_{\max}$ (Karakida et al., 2019a). The matrix $K$ has the largest eigenvalue $(N-1)\kappa_2 + \kappa_1$, and the corresponding eigenvectors $\nu_k \in \mathbb{R}^{CN}$ $(k=1,\dots,C)$ have entries given by
$$(\nu_k)_i := \begin{cases} \frac{1}{\sqrt{N}} & ((k-1)N + 1 \leq i \leq kN), \\ 0 & (\text{otherwise}). \end{cases}$$
(A.6)
Note that $\kappa_1$ is positive by definition and $\kappa_2$ is positive under condition 2 on the activation functions. The other eigenvalues of $K$ are given by $\kappa_1 - \kappa_2$.

We can obtain $m_\lambda$ from $\mathrm{Trace}(F^*(k,k))\, C/P$, $s_\lambda$ from $\|F^*(k,k)\|_F^2\, C/P$, where $\|\cdot\|_F$ is the Frobenius norm, and $\lambda_{\max}$ from $\nu_k^\top F^* \nu_k$. The eigenvector of $F$ corresponding to $\lambda_{\max}$ is asymptotically given by $\mathbb{E}[\nabla_\theta f_k] = R\nu_k$. When $N$ is of $O(1)$, it is obvious that $K$'s eigenvalues determine $F^*$'s eigenvalues in the large $M$ limit. Even if $N$ increases depending on $M$, our eigenvalue statistics hold in the large $M$ and $N$ limits. That is, we asymptotically have $m_\lambda \simeq \kappa_1 C/M$, $s_\lambda \simeq \alpha \kappa_2^2 C$, and $\lambda_{\max} \simeq \alpha \kappa_2 M$. As one can see here, the condition $\kappa_2 > 0$ is crucial for our eigenvalue statistics. Noncentered networks guarantee $\hat{q}_2^l > 0$ and $\tilde{q}_2^l > 0$, which leads to $\kappa_2 > 0$. In centered networks, $\kappa_2$ can become zero, and we then need to carefully evaluate the second term of equation A.4.

A.2  Diagonal Blocks

We can immediately derive the eigenvalue statistics of the diagonal blocks in the same way as theorem 1. We can represent the diagonal blocks as $F_{ll} := R_l R_l^\top$ with
$$R_l := \frac{1}{\sqrt{N}}\left[\nabla_{\theta_l} f_1\ \ \nabla_{\theta_l} f_2\ \ \cdots\ \ \nabla_{\theta_l} f_C\right]$$
(A.7)
and the dual of this Gram matrix as
$$F_{ll}^* := R_l^\top R_l,$$
(A.8)
where the parameter set $\theta_l$ means all parameters in the $l$th layer. The $CN \times CN$ matrix $F_{ll}^*$ can be partitioned into $N \times N$ block matrices whose $(k,k')$th block is given by
$$F_{ll}^*(k,k') = \nabla_{\theta_l} f_k^\top\, \nabla_{\theta_l} f_{k'} / N,$$
(A.9)
for $k,k'=1,\dots,C$. As one can see from the additivity $\sum_{l=1}^{L} F_{ll}^*(k,k') = F^*(k,k')$, the following evaluation is part of equation A.4:
$$F_{ll}^*(k,k') = \frac{\alpha_{l-1} M}{N} K^l\, \delta_{kk'} + \frac{1}{N}\, o(M),$$
(A.10)
where
$$K_{nm}^l := \begin{cases} \tilde{q}_1^l\, \hat{q}_1^{l-1} & (n = m), \\ \tilde{q}_2^l\, \hat{q}_2^{l-1} & (n \neq m). \end{cases}$$
(A.11)
Thus, we have $m_\lambda \simeq \mathrm{Trace}\!\left(\frac{\alpha_{l-1}M}{N}K^l\right)C/P_l$ and $s_\lambda \simeq \left\|\frac{\alpha_{l-1}M}{N}K^l\right\|_F^2 C/P_l$, where the dimension of $\theta_l$ is given by $P_l = \alpha_l \alpha_{l-1} M^2$. We set $\alpha_L = C/M$ in the last layer. The matrices $K$ and $K^l$ have the same eigenvectors corresponding to the largest eigenvalues, $\nu_k$. The largest eigenvalue is given by $\lambda_{\max} \simeq \frac{\alpha_{l-1}M}{N}\, \nu_k^\top K^l \nu_k$. The eigenvectors of $F_{ll}$ corresponding to $\lambda_{\max}$ are $R_l \nu_k = \mathbb{E}[\nabla_{\theta_l} f_k]$.

Appendix B: Eigenvalue Statistics of Fcross

B.1  Derivation of Theorem 3

$F_{\mathrm{cross}}$ is expressed by
$$F_{\mathrm{cross}} := R\, Q\, R^\top,$$
(B.1)
where $Q$ is a $CN \times CN$ matrix. One can rearrange the columns and rows of $Q$ and partition it into $N \times N$ block matrices $Q(k,k')$ whose entries are given by
$$Q(k,k')_{nm} = \left\{ g_k(n)\,\delta_{kk'} - g_k(n)\, g_{k'}(n) \right\}\delta_{nm},$$
(B.2)
for $k,k'=1,\dots,C$. Each block is a diagonal matrix. Note that the nonzero eigenvalues of $R Q R^\top$ are equivalent to those of $Q R^\top R$. Since we have $F^* = R^\top R$, we should investigate the eigenvalues of the following matrix:
$$F_{\mathrm{cross}}^* := Q\, F^*.$$
(B.3)
The mean of the eigenvalues is given by
$$\begin{aligned} m_\lambda &= \mathrm{Trace}\left(F_{\mathrm{cross}}^*\right)/P \\ &= \sum_{i,k} \mathrm{Trace}\left(Q(k,i)\, F^*(i,k)\right)/P \\ &\simeq \sum_k \mathrm{Trace}\left(Q(k,k)\, F^*(k,k)\right)/P \\ &\simeq \beta_1\, \kappa_1 / M. \end{aligned}$$
(B.4)
The third line holds asymptotically, since the order of $F^*(k,k)$ in equation A.4 is higher than that of $F^*(k,k')$ $(k \neq k')$. The fourth line comes from $\sum_k g_k(n) = 1$.
The second moment is evaluated as
$$\begin{aligned} s_\lambda &= \mathrm{Trace}\left(F_{\mathrm{cross}}^{*2}\right)/P \\ &= \sum_k \mathrm{Trace}\Big(\sum_{a,b,c} Q(k,a)\, F^*(a,b)\, Q(b,c)\, F^*(c,k)\Big)/P \\ &\simeq \sum_{k,k'} \mathrm{Trace}\left(Q(k,k')\, F^*(k',k')\, Q(k',k)\, F^*(k,k)\right)/P. \end{aligned}$$
(B.5)
Substituting $K$ into $F^*(k,k)$ gives
$$s_\lambda \simeq \frac{\alpha}{N^2}\sum_{k,k'}\sum_n Q_n(k,k')\left(\kappa_2^2 \sum_{m \neq n} Q_m(k,k') + \kappa_1^2\, Q_n(k,k')\right) = \alpha\, \frac{\beta_2\kappa_2^2 + \beta_3\kappa_1^2}{N^2},$$
(B.6)
where $\sum_{m \neq n}$ means a summation over $m$ excluding the $n$th sample.
Finally, we derive the largest eigenvalue. Let us denote the eigenvectors of $F$ as
$$v_k := \frac{\mathbb{E}[\nabla_\theta f_k]}{\left\|\mathbb{E}[\nabla_\theta f_k]\right\|}.$$
(B.7)
It is easy to confirm that we asymptotically have $\left\|\mathbb{E}[\nabla_\theta f_k]\right\|^2 \simeq \lambda_{\max}(F)$ (Karakida et al., 2019a), where the largest eigenvalue of $F$ is denoted by $\lambda_{\max}(F) = \alpha\left(\frac{N-1}{N}\kappa_2 + \frac{\kappa_1}{N}\right)M$. By definition, $F_{\mathrm{cross}}$'s largest eigenvalue satisfies $\lambda_{\max} \geq x^\top F_{\mathrm{cross}}\, x$ for any unit vector $x$. By taking $x = v_k$, we obtain
$$\lambda_{\max} \geq \lambda_{\max}(F)^{-1} \cdot (R\nu_k)^\top F_{\mathrm{cross}}\, (R\nu_k) = \lambda_{\max}(F)^{-1} \cdot (F^*\nu_k)^\top Q\, (F^*\nu_k).$$
(B.8)
Because we asymptotically have $F^*\nu_k = \lambda_{\max}(F)\,\nu_k$, the lower bound is given by
$$\lambda_{\max} \gtrsim \lambda_{\max}(F)\cdot\left(\nu_k^\top Q\, \nu_k\right) = \lambda_{\max}(F)\cdot\frac{1}{N}\sum_n g_k(n)\left(1 - g_k(n)\right).$$
(B.9)
Taking the index $k$ that maximizes the right-hand side, we obtain the lower bound of $\lambda_{\max}$. The upper bound of $\lambda_{\max}$ immediately comes from a simple inequality for nonnegative variables, $\lambda_{\max} \leq \sqrt{\sum_i \lambda_i^2} = \sqrt{P s_\lambda}$. Thus, we obtain theorem 3.

Note that we immediately have $\lambda_i(F_{\mathrm{cross}}^*) \leq \lambda_{\max}(Q)\,\lambda_i(F^*)$ from equation B.3. This means that $F_{\mathrm{cross}}$'s eigenvalues satisfy $\lambda_i(F_{\mathrm{cross}}) \leq \lambda_{\max}(Q)\,\lambda_i(F)$ $(i=1,\dots,P)$. In addition, we have $\lambda_{\max}(Q) \leq 1$ because $\lambda_{\max}(Q) = \max_n \lambda_{\max}(Q_n)$ and $\lambda_{\max}(Q_n) \leq \max_k g_k(n) \leq 1$. Therefore, $\lambda_i(F_{\mathrm{cross}}) \leq \lambda_i(F)$ holds and we obtain the inequalities 3.10.

B.2  Derivation of Theorem 4

Define $u_i$ to be the eigenvector of $F_{\mathrm{cross}}$ corresponding to the eigenvalue $\lambda_i$ $(\lambda_1 \geq \cdots \geq \lambda_i \geq \cdots \geq \lambda_P)$. Moreover, let us denote the linear subspace spanned by $\{u_1,\dots,u_k\}$ as $U_k$ and the orthogonal complement of $U_k$ as $U_k^\perp$. When $k=0$, we have $U_0^\perp = \mathbb{R}^P$. The dimension of $U_k^\perp$ is $P-k$, and we denote it as $\dim(U_k^\perp) = P-k$. Thus, we have
$$\lambda_r = \max_{\|x\|=1;\ x \in U_{r-1}^\perp} x^\top F_{\mathrm{cross}}\, x,$$
(B.10)
for $r=1,\dots,P$. Define $V_k$ to be a linear subspace spanned by $k$ eigenvectors of $F$ corresponding to $\lambda_{\max}(F)$, that is, $\{v_{i_1},\dots,v_{i_k}\}$. The indices $\{i_1,\dots,i_k\}$ are chosen from $\{1,\dots,C\}$ without duplication.
It is trivial to show from the dimensionality of the linear spaces that the intersection $S_r := U_{r-1}^\perp \cap V_C$ is a linear subspace satisfying $C - r + 1 \leq \dim(S_r) \leq C$ when $1 \leq r \leq C$. Let us take a unit vector $x$ in $S_r$ as $x = \sum_{s=1}^{r^*} a_s v_{i_s}$, where we have defined $r^* := \dim(S_r)$ and the coefficients $a_s$ satisfy $\sum_{s=1}^{r^*} a_s^2 = 1$. In the large $M$ limit, we asymptotically have
$$\lambda_r \geq \max_{\|x\|=1;\ x \in S_r} x^\top F_{\mathrm{cross}}\, x = \max_{(a_1,\dots,a_{r^*});\ \sum_s a_s^2 = 1} \sum_{s,s'} a_s a_{s'} \left(\nu_{i_s}^\top Q\, \nu_{i_{s'}}\right)\cdot \lambda_{\max}(F) \geq \nu_{i_1}^\top Q\, \nu_{i_1}\cdot \lambda_{\max}(F) = \frac{1}{N}\sum_n g_{i_1}(n)\left(1 - g_{i_1}(n)\right)\cdot \lambda_{\max}(F),$$
(B.11)
where $\lambda_{\max}(F)$ is of $O(M)$ from theorem 1. This holds for all of $r=1,\dots,C$, and we can say that there exist $C$ large eigenvalues of $O(M)$.

Appendix C: Derivation of NTK's Eigenvalue Statistics

The NTK is defined as $\Theta = N F^*$ under the NTK parameterization. In the same way as equation A.4, the $(k,k')$th block of the NTK is asymptotically given by
\[
\Theta(k,k') = \alpha K' \delta_{kk'} + o(1),
\tag{C.1}
\]
for $k,k' = 1,\dots,C$. In contrast to equation A.4, the NTK parameterization multiplies $F^*$ by an extra factor of $1/M$. The negligible $o(1)$ term is reduced to $O(1/M)$ under the conditions summarized in Karakida et al. (2019a). The entries of $K'$ are given by
\[
K'_{nm} :=
\begin{cases}
\kappa_1' & (n = m),\\
\kappa_2' & (n \neq m),
\end{cases}
\tag{C.2}
\]
where $\kappa_1'$ is composed of two parts. The first part is $\|\nabla_\omega f_k(n)\|^2 = \sigma_w^2 \sum_{l,i,j} \delta_i^l(n)^2\, h_j^{l-1}(n)^2 / M_{l-1} \to \sigma_w^2 \sum_{l=1}^{L} \tilde{q}_1^{\,l}\, \hat{q}_1^{\,l-1}$. The second part is $\|\nabla_b f_k(n)\|^2 = \sigma_b^2 \sum_{l,i} \delta_i^l(n)^2 \to \sigma_b^2 \sum_{l=1}^{L} \tilde{q}_1^{\,l}$. Although the number of weights is much larger than the number of biases, the NTK parameterization makes the contribution of $\nabla_\omega f_k$ comparable to that of $\nabla_b f_k$. This is in contrast to the evaluation of $F^*$ in theorem 1, where the contribution of $\nabla_b f_k$ is negligible (Karakida et al., 2019b). We can evaluate $\kappa_2'$ in the same way.
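To make this comparison concrete, here is a small numerical sketch (a one-hidden-layer network of my own construction, not the paper's general setting) contrasting the squared gradient norms with respect to weights and biases under the NTK parameterization and under the standard parameterization.

```python
import numpy as np

# Sketch (illustrative one-hidden-layer setup): under the NTK parameterization the
# bias contribution to the squared gradient norm is of the same order as the weight
# contribution, whereas under the standard parameterization the weight part is
# larger by a factor of order the layer's fan-in (here M0; of order M for hidden layers).
rng = np.random.default_rng(0)
M, M0 = 5000, 50
sw, sb = 1.0, 0.5
x = rng.standard_normal(M0)
omega = rng.standard_normal((M, M0))    # trainable N(0,1) weights (NTK parameterization)
beta = rng.standard_normal(M)           # trainable N(0,1) biases
a = rng.standard_normal(M)              # trainable N(0,1) readout weights

u = (sw / np.sqrt(M0)) * omega @ x + sb * beta
# scalar output under NTK parameterization: f = (sw/sqrt(M)) * a . tanh(u)
delta = (sw / np.sqrt(M)) * a * (1.0 - np.tanh(u) ** 2)   # df/du_j

# NTK parameterization: df/d omega_ji = delta_j*(sw/sqrt(M0))*x_i, df/d beta_j = delta_j*sb
w_ntk = np.sum(delta ** 2) * (sw ** 2 / M0) * np.sum(x ** 2)
b_ntk = np.sum(delta ** 2) * sb ** 2
print("NTK param:      weight part", w_ntk, " bias part", b_ntk)   # same order

# standard parameterization (W ~ N(0, sw^2/M0), b ~ N(0, sb^2), readout ~ N(0, sw^2/M)):
# df/dW_ji = (df/du_j)*x_i and df/db_j = df/du_j; df/du has the same statistics as
# `delta` above, so we reuse it. The weight part carries sum_i x_i^2 ~ M0.
w_std = np.sum(delta ** 2) * np.sum(x ** 2)
b_std = np.sum(delta ** 2)
print("standard param: weight part", w_std, " bias part", b_std)   # bias part relatively small
```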

In the same way as with the FIM, the trace of $K'$ leads to $m_\lambda$, the Frobenius norm of $K'$ leads to $s_\lambda$, and $K'$ has the largest eigenvalue $(N-1)\kappa_2' + \kappa_1'$ for arbitrary $N$. The eigenspace of $\Theta$ corresponding to $\lambda_{\max}$ is also the same as that of $F^*$: it is spanned by the eigenvectors $\nu_k$ $(k = 1,\dots,C)$.
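The block structure in equation C.1 is also easy to see empirically. Below is a minimal sketch (a one-hidden-layer network under a standard NTK parameterization; sizes and constants are my own illustrative choices) that builds the empirical NTK from the exact Jacobian and checks that the off-diagonal blocks over the output index $k$ are much smaller than the diagonal ones.

```python
import numpy as np

# Empirical NTK of a one-hidden-layer NTK-parameterized network: the CN x CN kernel
# should be nearly block diagonal over the output index k, as in equation C.1.
rng = np.random.default_rng(0)
M, M0, C, N = 4000, 30, 3, 8            # width, input dim, outputs, samples
sw, sb = 1.2, 0.5

X = rng.standard_normal((N, M0))
W = rng.standard_normal((M, M0))        # trainable N(0,1) input-to-hidden weights
b = rng.standard_normal(M)              # trainable N(0,1) biases
A = rng.standard_normal((C, M))         # trainable N(0,1) hidden-to-output weights

U = X @ W.T * (sw / np.sqrt(M0)) + sb * b        # pre-activations, (N, M)
H, Hd = np.tanh(U), 1.0 - np.tanh(U) ** 2        # phi(u) and phi'(u)

P = C * M + M * M0 + M                           # total number of parameters
J = np.zeros((C, N, P))                          # Jacobian rows indexed by (k, n)
for k in range(C):
    dA = np.zeros((N, C, M))
    dA[:, k, :] = (sw / np.sqrt(M)) * H                          # df_k/dA[k', j]
    g = (sw / np.sqrt(M)) * A[k] * Hd                            # df_k/du_j, (N, M)
    dW = g[:, :, None] * (sw / np.sqrt(M0)) * X[:, None, :]      # df_k/dW[j, i]
    db = g * sb                                                  # df_k/db[j]
    J[k] = np.concatenate([dA.reshape(N, -1), dW.reshape(N, -1), db], axis=1)

Theta = J.reshape(C * N, P) @ J.reshape(C * N, P).T              # empirical NTK
blocks = Theta.reshape(C, N, C, N)
diag = np.mean([np.abs(blocks[k, :, k, :]).mean() for k in range(C)])
off = np.mean([np.abs(blocks[k, :, kk, :]).mean()
               for k in range(C) for kk in range(C) if kk != k])
print("mean |diagonal block| :", diag)
print("mean |off-diag block| :", off)    # much smaller, of order 1/sqrt(M)
```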

Appendix D: Eigenvalue Statistics of A

D.1  Derivation of Theorem 6

The metric tensor $A_k$ can be represented by $A_k = \nabla_h f_k \nabla_h f_k^\top / N$, where $\nabla_h f_k$ is an $M_h \times N$ matrix whose columns are the gradients on each input: $\nabla_h f_k(n)$ $(n = 1,\dots,N)$. Let us introduce the $N \times N$ dual matrix of $A_k$: $A_k^* := \nabla_h f_k^\top \nabla_h f_k / N$. It has the same nonzero eigenvalues as $A_k$ by definition. Its $(n,m)$th entry is given by
\[
(A_k^*)_{nm} = \nabla_h f_k(n)^\top \nabla_h f_k(m) / N
= \sum_{l=0}^{L-1} \sum_{i,j,j'} W_{ji}^{l+1} W_{j'i}^{l+1}\, \delta_j^{l+1}(n)\, \delta_{j'}^{l+1}(m) / N
\to \sigma_w^2 \sum_{l=1}^{L} \sum_{j} \delta_j^{l}(n)\, \delta_j^{l}(m) / N,
\tag{D.1}
\]
in the large $M$ limit. Accordingly, we have
\[
A_k^* = \frac{\tilde{\alpha}}{N}\bar{A}^* + \frac{1}{N}\, o(1),
\qquad
(\bar{A}^*)_{nm} :=
\begin{cases}
\tilde{\kappa}_1 & (n = m),\\
\tilde{\kappa}_2 & (n \neq m).
\end{cases}
\tag{D.2}
\]
$\tilde{\kappa}_1$ is positive by definition, and $\tilde{\kappa}_2$ is positive under condition 2 on the activation functions.

The eigenvalue statistics are easily derived from the leading term $\bar{A}^*$. We can derive the mean of the eigenvalues as $m_\lambda \approx \mathrm{Trace}\bigl(\frac{\tilde{\alpha}}{N}\bar{A}^*\bigr)/M_h$ and the second moment as $s_\lambda \approx \bigl\|\frac{\tilde{\alpha}}{N}\bar{A}^*\bigr\|_F^2 / M_h$, where $M_h = \tilde{\alpha} M$. We can determine the largest eigenvalue because we explicitly obtain the eigenvalues of $\bar{A}^*$: $\lambda_1 = (N-1)\tilde{\kappa}_2 + \tilde{\kappa}_1$ with the eigenvector $\tilde{\nu} := (1,\dots,1)^\top$, and $\lambda_i = \tilde{\kappa}_1 - \tilde{\kappa}_2$ with eigenvectors $e_1 - e_i$ $(i = 2,\dots,N)$. The vector $e_i$ denotes a unit vector whose entries are 1 for the $i$th entry and 0 otherwise. The largest eigenvalue is given by $\lambda_1$.
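The explicit spectrum of $\bar{A}^*$ stated here is easy to confirm numerically; the snippet below (with arbitrary illustrative values for $\tilde{\kappa}_1$, $\tilde{\kappa}_2$, and $N$) builds the matrix and checks the eigenvalues and the leading eigenvector.

```python
import numpy as np

# Spectrum of the leading matrix A_bar*: kappa1~ on the diagonal, kappa2~ elsewhere.
N, k1, k2 = 6, 2.0, 0.5                          # illustrative values
A_bar = k2 * np.ones((N, N)) + (k1 - k2) * np.eye(N)

eigval, eigvec = np.linalg.eigh(A_bar)           # ascending order
print(eigval)            # N-1 copies of k1 - k2, then the outlier (N-1)*k2 + k1
print(eigvec[:, -1])     # leading eigenvector is proportional to (1, ..., 1)
```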

The eigenvector of $A_k$ corresponding to $\lambda_{\max}$ is constructed from $\tilde{\nu}$. Let us denote by $v$ an eigenvector of $A_k$ satisfying $A_k v = \lambda_{\max} v$. Multiplying both sides by $\nabla_h f_k^\top$, we get
\[
A_k^* \bigl(\nabla_h f_k^\top v\bigr) = \lambda_{\max} \cdot \bigl(\nabla_h f_k^\top v\bigr).
\tag{D.3}
\]
This means that $\nabla_h f_k^\top v$ is the eigenvector of $A_k^*$ and equals $\tilde{\nu}$. Multiplying both sides of $\tilde{\nu} = \nabla_h f_k^\top v$ by $\frac{1}{N}\nabla_h f_k$, we get
\[
\mathrm{E}[\nabla_h f_k] = A_k v,
\tag{D.4}
\]
which equals $\lambda_{\max} v$ by definition of $v$. As a result, we obtain $v = \mathrm{E}[\nabla_h f_k]$ up to a scale factor.
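As a quick numerical illustration of this correspondence, the sketch below (my own setup: a random tanh network with equal widths, reading $\nabla_h$ as the gradient with respect to the concatenated hidden activities $h^0,\dots,h^{L-1}$, as in equation D.1) forms $A_k$ from backpropagated gradients and compares its top eigenvector with the sample mean $\mathrm{E}[\nabla_h f_k]$.

```python
import numpy as np

# Sketch: top eigenvector of A_k versus the mean gradient E[grad_h f_k].
rng = np.random.default_rng(1)
M, L, N = 400, 3, 20                 # width, depth, samples (illustrative)
sw, sb = 1.0, 0.1

X = rng.standard_normal((N, M))
Ws = [rng.standard_normal((M, M)) * sw / np.sqrt(M) for _ in range(L)]
bs = [rng.standard_normal(M) * sb for _ in range(L)]
w_out = rng.standard_normal(M) * sw / np.sqrt(M)     # a single output unit f_k

h, us = X, []
for W, b in zip(Ws, bs):
    us.append(h @ W.T + b)
    h = np.tanh(us[-1])

# backpropagate and collect gradients w.r.t. h^{L-1}, ..., h^0
delta = w_out[None, :] * (1.0 - np.tanh(us[-1]) ** 2)    # df_k/du at the top layer
grads = []
for l in range(L - 1, -1, -1):
    grads.append(delta @ Ws[l])                          # df_k/dh^l, (N, M)
    if l > 0:
        delta = (delta @ Ws[l]) * (1.0 - np.tanh(us[l - 1]) ** 2)
G = np.concatenate(grads, axis=1)                        # (N, M_h) with M_h = L*M

A_k = G.T @ G / N
eigval, eigvec = np.linalg.eigh(A_k)
top = eigvec[:, -1]
mean_grad = G.mean(axis=0)
cos = abs(top @ mean_grad) / np.linalg.norm(mean_grad)
print("cosine(top eigenvector, E[grad_h f_k]) =", cos)   # approaches 1 for large M and N
```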
We can also evaluate the eigenvalue statistics of $A = \sum_{k=1}^{C} A_k$ in analogy with $F$. Note that we can represent $A$ as $A = \tilde{R}\tilde{R}^\top$ with
\[
\tilde{R} := \frac{1}{\sqrt{N}}\bigl[\nabla_h f_1 \;\; \nabla_h f_2 \;\cdots\; \nabla_h f_C\bigr],
\tag{D.5}
\]
whose columns are the gradients $\nabla_h f_k(n)$ for each output $k$ and sample $n$. We can introduce a dual matrix $\tilde{R}^\top \tilde{R}$ and obtain the following eigenvalue statistics of $A$:
\[
m_\lambda \approx \tilde{\kappa}_1 \frac{C}{M},
\qquad
s_\lambda \approx \tilde{\alpha}\Bigl(\frac{N-1}{N}\tilde{\kappa}_2^2 + \frac{\tilde{\kappa}_1^2}{N}\Bigr) C,
\qquad
\lambda_{\max} \approx \tilde{\alpha}\Bigl(\frac{N-1}{N}\tilde{\kappa}_2 + \frac{\tilde{\kappa}_1}{N}\Bigr).
\tag{D.6}
\]

D.2  Diagonal Blocks

In the same way as for the FIM, we can also evaluate the eigenvalue statistics of the diagonal blocks $A_k^{ll}$ $(l = 0,\dots,L-1)$. The metric tensor $A_k^{ll}$ can be represented by $A_k^{ll} = \nabla_{h^l} f_k \nabla_{h^l} f_k^\top / N$. Consider its dual: $A_k^{ll*} = \nabla_{h^l} f_k^\top \nabla_{h^l} f_k / N$. Note that we partitioned $A_k$ into $L^2$ block matrices whose $(l,l')$th block is expressed by an $M_l \times M_{l'}$ matrix:
\[
A_k^{ll'} := (W^{l+1})^\top\, \mathrm{E}\bigl[\delta_k^{l+1} (\delta_k^{l'+1})^\top\bigr]\, W^{l'+1}.
\tag{D.7}
\]
In the large $M$ limit, we asymptotically have
\[
A_k^{ll*} = \frac{1}{N}\bar{A}^{ll*} + \frac{1}{N}\, o(1),
\qquad
(\bar{A}^{ll*})_{nm} :=
\begin{cases}
\sigma_w^2\, \tilde{q}_1^{\,l+1} & (n = m),\\
\sigma_w^2\, \tilde{q}_2^{\,l+1} & (n \neq m).
\end{cases}
\tag{D.8}
\]
The eigenvalue statistics of $A_k^{ll}$ are asymptotically evaluated as
\[
m_\lambda \approx \frac{\sigma_w^2 \tilde{q}_1^{\,l+1}}{M_l},
\qquad
s_\lambda \approx \sigma_w^4 \Bigl(\frac{N-1}{N}(\tilde{q}_2^{\,l+1})^2 + \frac{(\tilde{q}_1^{\,l+1})^2}{N}\Bigr),
\qquad
\lambda_{\max} \approx \sigma_w^2 \Bigl(\frac{N-1}{N}\tilde{q}_2^{\,l+1} + \frac{\tilde{q}_1^{\,l+1}}{N}\Bigr).
\]
The eigenvector of $A_k^{ll}$ corresponding to $\lambda_{\max}$ is $\mathrm{E}[\nabla_{h^l} f_k]$. We can also derive the eigenvalue statistics of the summation $A^{ll} = \sum_{k=1}^{C} A_k^{ll}$; the mean and second moment are multiplied by $C$.

Acknowledgments

R.K. acknowledges funding support from JST ACT-X (grant JPMJAX190A) and a Grant-in-Aid for Young Scientists (grant 19K20366).

References

Amari, S. (1974). A method of statistical neurodynamics. Kybernetik, 14(4), 201–215.
Amari, S. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2), 251–276.
Amari, S. (2016). Information geometry and its applications. Berlin: Springer.
Amari, S., Karakida, R., & Oizumi, M. (2019). Fisher information and natural gradient learning of random deep networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 694–702).
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., & Wang, R. (2019). On exact computation with an infinitely wide neural net. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32. Red Hook, NY: Curran.
Daniely, A., Frostig, R., & Singer, Y. (2016). Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 2253–2261). Red Hook, NY: Curran.
Fan, Z., & Wang, Z. (2020). Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 7710–7721). Red Hook, NY: Curran.
Fukumizu, K. (1996). A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5), 871–879.
Ghorbani, B., Krishnan, S., & Xiao, Y. (2019). An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the International Conference on Machine Learning (pp. 2232–2241).
Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 249–256).
Goodfellow, I. J., Shlens, J., & Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv:1412.6572.
He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision. Piscataway, NJ: IEEE.
Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 8580–8589). Red Hook, NY: Curran.
Jacot, A., Gabriel, F., & Hongler, C. (2019). Freeze and chaos for DNNs: An NTK view of batch normalization, checkerboard and boundary effects. arXiv:1907.05715.
Kadmon, J., & Sompolinsky, H. (2016). Optimal architectures in a solvable model of deep networks. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 4781–4789). Red Hook, NY: Curran.
Karakida, R., Akaho, S., & Amari, S. (2019a). The normalization method for alleviating pathological sharpness in wide neural networks. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32. Red Hook, NY: Curran.
Karakida, R., Akaho, S., & Amari, S. (2019b). Universal statistics of Fisher information in deep neural networks: Mean field approach. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 1032–1041).
Karakida, R., & Osawa, K. (2020). Understanding approximate Fisher information for fast convergence of natural gradient descent in wide neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 10891–10901). Red Hook, NY: Curran.
Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., … Hadsell, R. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521–3526.
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
LeCun, Y., Bottou, L., Orr, G. B., & Müller, K.-R. (1998). Efficient backprop. In G. Montavon, G. Orr, & K.-R. Müller (Eds.), Neural networks: Tricks of the trade (pp. 9–50). Berlin: Springer.
LeCun, Y., Kanter, I., & Solla, S. A. (1991). Eigenvalues of covariance matrices: Application to neural-network learning. Physical Review Letters, 66(18), 2396.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., & Sohl-Dickstein, J. (2018). Deep neural networks as Gaussian processes. In Proceedings of the International Conference on Learning Representations. arXiv:1711.00165.
Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Sohl-Dickstein, J., & Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32. Red Hook, NY: Curran.
Lewkowycz, A., Bahri, Y., Dyer, E., Sohl-Dickstein, J., & Gur-Ari, G. (2020). The large learning rate phase of deep learning: The catapult mechanism. arXiv:2003.02218.
Liang, T., Poggio, T., Rakhlin, A., & Stokes, J. (2019). Fisher-Rao metric, geometry, and complexity of neural networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 888–896).
Martens, J., & Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the International Conference on Machine Learning (pp. 2408–2417). New York: ACM.
Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. In Proceedings of the International Conference on Learning Representations. arXiv:1804.11271.
Novak, R., Bahri, Y., Abolafia, D. A., Pennington, J., & Sohl-Dickstein, J. (2018). Sensitivity and generalization in neural networks: An empirical study. In Proceedings of the International Conference on Learning Representations. arXiv:1802.08760.
Ollivier, Y. (2015). Riemannian metrics for neural networks I: Feedforward networks. Information and Inference: A Journal of the IMA, 4(2), 108–153.
Papyan, V. (2019). Measurements of three-level hierarchical structure in the outliers in the spectrum of deepnet Hessians. In Proceedings of the International Conference on Machine Learning.
Park, H., Amari, S., & Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7), 755–764.
Pascanu, R., & Bengio, Y. (2014). Revisiting natural gradient for deep networks. In Proceedings of the International Conference on Learning Representations. arXiv:1301.3584.
Pennington, J., & Bahri, Y. (2017). Geometry of neural network loss surfaces via random matrix theory. In Proceedings of the International Conference on Machine Learning (pp. 2798–2806).
Pennington, J., Schoenholz, S. S., & Ganguli, S. (2018). The emergence of spectral universality in deep networks. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 1924–1932).
Pennington, J., & Worah, P. (2017). Nonlinear random matrix theory for deep learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 2634–2643). Red Hook, NY: Curran.
Pennington, J., & Worah, P. (2018). The spectrum of the Fisher information matrix of a single-hidden-layer neural network. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 5410–5419). Red Hook, NY: Curran.
Poole, B., Lahiri, S., Raghu, M., Sohl-Dickstein, J., & Ganguli, S. (2016). Exponential expressivity in deep neural networks through transient chaos. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in neural information processing systems, 29 (pp. 3360–3368). Red Hook, NY: Curran.
Sagun, L., Evci, U., Guney, V. U., Dauphin, Y., & Bottou, L. (2017). Empirical analysis of the Hessian of over-parameterized neural networks. arXiv:1706.04454.
Schoenholz, S. S., Gilmer, J., Ganguli, S., & Sohl-Dickstein, J. (2017). Deep information propagation. In Proceedings of the International Conference on Learning Representations. arXiv:1611.01232.
Sun, K., & Nielsen, F. (2019). Lightlike neuromanifolds, Occam's razor and deep learning. arXiv:1905.11027.
Suzuki, T. (2018). Fast generalization error bound of deep learning from a kernel perspective. In Proceedings of the International Conference on Artificial Intelligence and Statistics (pp. 1397–1406).
Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. In Proceedings of the International Conference on Machine Learning (pp. 5393–5402).
Yang, G. (2019). Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv:1902.04760.
Yang, G., Pennington, J., Rao, V., Sohl-Dickstein, J., & Schoenholz, S. S. (2019). A mean field theory of batch normalization. In Proceedings of the International Conference on Learning Representations. arXiv:1902.08129.
Yang, G., & Salman, H. (2019). A fine-grained spectral perspective on neural networks. arXiv:1907.10599.
Yang, G., & Schoenholz, S. S. (2017). Mean field residual networks: On the edge of chaos. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), Advances in neural information processing systems, 30 (pp. 2865–2873). Red Hook, NY: Curran.