The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth, and sample size when the network has random weights and is sufficiently wide. This study covers two widely used FIMs for regression with linear output and for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size, while the others are much smaller. It implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of FIMs: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to more quantitative understanding of learning in large-scale DNNs.
Deep neural networks (DNNs) have outperformed many standard machine-learning methods in practical applications (LeCun, Bengio, & Hinton, 2015). Despite their practical success, many theoretical aspects of DNNs remain to be uncovered, and there are still many heuristics used in deep learning. We need a solid theoretical foundation for elucidating how and under what conditions DNNs and their learning algorithms work well.
The Fisher information matrix (FIM) is a fundamental metric tensor that appears in statistics and machine learning. An empirical FIM is equivalent to the Hessian of the loss function around a certain global minimum, and it affects the performance of optimization in machine learning. In information geometry, the FIM defines the Riemannian metric tensor of the parameter manifold of a statistical model (Amari, 2016). The natural gradient method is a first-order gradient method in the Riemannian space where the FIM works as its Riemannian metric (Amari, 1998; Park, Amari, & Fukumizu, 2000; Pascanu & Bengio, 2014; Ollivier, 2015; Martens & Grosse, 2015). The FIM also acts as a regularizer to prevent catastrophic forgetting (Kirkpatrick et al., 2017); a DNN trained on one data set can learn another data set without forgetting information if the parameter change is regularized with a diagonal FIM.
However, our understanding of the FIM for neural networks has so far been limited to empirical studies and theoretical analyses of simple networks. Numerical experiments empirically confirmed that the eigenvalue spectra of the FIM and those of the Hessian are highly distorted; that is, most eigenvalues are close to zero, while others take on large values (LeCun, Bottou, Orr, & Müller, 1998; Sagun, Evci, Guney, Dauphin, & Bottou, 2017; Papyan, 2019; Ghorbani, Krishnan, & Xiao, 2019). Focusing on shallow neural networks, Pennington and Worah (2018) theoretically analyzed the FIM's eigenvalue spectra by using random matrix theory, and Fukumizu (1996) derived a condition under which the FIM becomes singular. Liang, Poggio, Rakhlin, and Stokes (2019) have connected FIMs to the generalization ability of DNNs by using model complexity, but their results are restricted to linear networks. Thus, theoretical evaluations of deeply nonlinear cases seem to be difficult mainly because of iterated nonlinear transformations. To go one step further, it would be helpful if a framework that is widely applicable to various DNNs could be constructed.
Investigating DNNs with random weights has given promising results. When such DNNs are sufficiently wide, we can formulate their behavior by using simpler analytical equations through coarse-graining of the model parameters, as discussed in mean field theory (Amari, 1974; Kadmon & Sompolinsky, 2016; Poole, Lahiri, Raghu, Sohl-Dickstein, & Ganguli, 2016; Schoenholz, Gilmer, Ganguli, & Sohl-Dickstein, 2017; Yang & Schoenholz, 2017; Xiao, Bahri, Sohl-Dickstein, Schoenholz, & Pennington, 2018; Yang, 2019) and random matrix theory (Pennington, Schoenholz, & Ganguli, 2018; Pennington & Bahri, 2017; Pennington & Worah, 2017). For example, Schoenholz et al. (2017) proposed a mean field theory for backpropagation in fully connected DNNs. This theory characterizes the amplitudes of gradients by using specific quantities, that is, order parameters in statistical physics, and enables us to quantitatively predict parameter regions that can avoid vanishing or exploding gradients. This theory is applicable to a wide class of DNNs with various nonlinear activation functions and depths. Such DNNs with random weights are closely connected to gaussian processes and kernel methods (Daniely, Frostig, & Singer, 2016; Lee et al., 2018; Matthews, Rowland, Hron, Turner, & Ghahramani, 2018; Jacot, Gabriel, & Hongler, 2018). Furthermore, the theory of the neural tangent kernel (NTK) explains that even trained parameters are close enough to the random initialization in sufficiently wide DNNs, and the performance of trained DNNs is determined by the NTK at initialization (Jacot et al., 2018; Lee et al., 2019; Arora et al., 2019).
Karakida, Akaho, and Amari (2019b) focused on the FIM corresponding to the mean square error (MSE) loss and proposed a framework to express certain eigenvalue statistics by using order parameters. They revealed that when fully connected networks with random initialization are sufficiently wide, the FIM's eigenvalue spectrum asymptotically becomes pathologically distorted. As the network width increases, a small number of the eigenvalues asymptotically take on huge values and become outliers, while the others are much smaller. The distorted shape of the eigenvalue spectrum is consistent with empirical reports (LeCun et al., 1998; Sagun et al., 2017; Papyan, 2019; Ghorbani et al., 2019). While LeCun, Kanter, and Solla (1991) implied that such pathologically large eigenvalues might appear in multilayered networks and affect the training dynamics, its theoretical elucidation has been limited to a data covariance matrix in a linear regression model. The results of Karakida et al. (2019b) can be regarded as a theoretical verification of this large eigenvalue suggested by LeCun et al. (1991). The obtained eigenvalue statistics have given insight into the convergence of gradient dynamics (LeCun et al., 1998; Karakida et al., 2019b), mechanism of batch normalization to decrease the sharpness of the loss function (Karakida, Akaho, & Amari, 2019a), and the generalization measure of DNNs based on the minimum description length (Sun & Nielsen, 2019).
In this letter, we extend the framework of the previous work (Karakida et al., 2019b) and reveal that various types of FIMs and variants show pathological spectra. Our main contribution is the following:
FIM for classification tasks with softmax output: While the previous works (Karakida et al., 2019a, 2019b) analyzed the FIM for regression based on the MSE loss, we typically use the cross-entropy loss with softmax output in classification tasks. We analyze this FIM for classification tasks and reveal that its spectrum is pathologically distorted as well. While the FIM for regression tasks has unique and degenerate outliers in the infinite-width limit, the softmax output can make these outliers disperse and remove the degeneracy. Our theory shows that there are number-of-classes outlier eigenvalues, which is consistent with experimental reports (Sagun et al., 2017; Papyan, 2019). Experimental results demonstrate that the eigenvalue density has a tail of outliers spreading from the bulk.
Furthermore, we also give a unified perspective on the variants:
Diagonal blocks of FIM: We give a detailed analysis of the diagonal block parts of the FIM for regression tasks. Natural gradient algorithms often use a block diagonal approximation of the FIM (Amari, Karakida, & Oizumi, 2019). We show that the diagonal blocks also suffer from pathological spectra.
Connection to NTK: The NTK and FIM inherently share the same nonzero eigenvalues. Paying attention to a specific rescaling of the parameters assumed in studies of NTK, we clarify that NTK's eigenvalue statistics become independent of the width scale. Instead, the gap between the average and maximum eigenvalues increases with the sample size. This suggests that as the sample size increases, the training dynamics converge nonuniformly and that calculations with the NTK become ill conditioned. We also demonstrate a simple normalization method that makes the eigenvalue statistics independent of both the large width and the sample size.
Metric tensors for input and feature spaces: We consider metric tensors for input and feature spaces spanned by neurons in input and hidden layers. These metric tensors potentially enable us to evaluate the robustness of DNNs against perturbations in the input and feedforward propagated signals. We show that the spectrum is pathologically distorted, similar to FIMs, in the sense that the outlier of the spectrum lies far from most of the eigenvalues. The softmax output makes the outliers disperse as well.
In summary, this study sheds light on the asymptotic eigenvalue statistics common to various wide networks.
2.1.1 Random Weights and Biases
2.1.2 Input Samples
2.1.3 Activation Functions
Suppose the following two conditions: (1) the activation function $\phi$ has a polynomially bounded weak derivative, and (2) the network is noncentered, which means a DNN with bias terms ($\sigma_b \neq 0$) or with activation functions that have a nonzero gaussian mean. The nonzero gaussian mean is defined by $\int Dz\, \phi(z) \neq 0$, where the notation $\int Dz = \int dz\, \exp(-z^2/2)/\sqrt{2\pi}$ means integration over the standard gaussian density.
Condition 1 is used to obtain recurrence relations of backward-order parameters (Yang, 2019). Condition 2 plays an essential role in our evaluation of the FIM (Karakida et al., 2019a, 2019b). The two conditions are valid in various realistic settings, because conventional networks include bias terms, and widely used activation functions, such as the sigmoid function and (leaky-) ReLUs, have bounded weak derivatives and nonzero gaussian means. Different layers may have different activation functions.
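As a quick numerical check of condition 2, the following sketch approximates the gaussian mean of a few common activation functions with a plain Riemann sum. The helper name `gaussian_mean` and the grid settings are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def gaussian_mean(phi, half_width=10.0, num_points=200001):
    """Approximate the integral of phi(z) against the standard gaussian density."""
    z = np.linspace(-half_width, half_width, num_points)
    dz = z[1] - z[0]
    density = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    # Riemann sum; the integrand is negligible beyond |z| = 10.
    return float(np.sum(phi(z) * density) * dz)

relu_mean = gaussian_mean(lambda z: np.maximum(z, 0.0))            # nonzero
sigmoid_mean = gaussian_mean(lambda z: 1.0 / (1.0 + np.exp(-z)))   # nonzero
tanh_mean = gaussian_mean(np.tanh)                                  # zero (odd function)

print(relu_mean, sigmoid_mean, tanh_mean)
```

ReLU and the sigmoid have a nonzero gaussian mean, so networks using them are noncentered even without bias terms. The hyperbolic tangent is odd, so its gaussian mean vanishes, and a tanh network is noncentered only when bias terms are present.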
2.2 Overview of Metric Tensors
In this letter, we also investigate another FIM denoted by , which corresponds to classification tasks with cross-entropy loss. As shown in Figure 2a one can represent as a modification of . A specific coefficient matrix is inserted between the Jacobian and its transpose. is defined by equation 3.7 and composed of nothing but softmax functions. Its mathematical definition is given in section 3.3. One more interesting quantity is a left-to-right reversed product of described in Figure 2b. This matrix is known as the neural tangent kernel (NTK). The FIM and NTK share the same nonzero eigenvalues by definition, although we need to be careful in the change of parameterization used in the studies of NTK. The details are shown in section 4.
2.3 Order Parameters for Wide Neural Networks
These order parameters depend only on the type of activation function, depth, and the variance parameters $\sigma_w^2$ and $\sigma_b^2$ of the random weights and biases. The recurrence relations for the order parameters require iterations of one- and two-dimensional numerical integrals. Moreover, we can obtain explicit forms of the recurrence relations for some of the activation functions (Karakida et al., 2019b).
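To make the iteration of one-dimensional integrals concrete, here is a minimal sketch of the standard forward recursion from the mean field literature cited above (for example, Poole et al., 2016; Schoenholz et al., 2017). The names `q`, `q_hat`, `sigma_w2`, and `sigma_b2` are assumptions introduced for illustration, and the backward and two-sample order parameters, which need two-dimensional integrals, are omitted.

```python
import numpy as np

# Grid for numerical integration against the standard gaussian density.
Z = np.linspace(-10.0, 10.0, 200001)
DZ = Z[1] - Z[0]
DENSITY = np.exp(-0.5 * Z**2) / np.sqrt(2.0 * np.pi)

def gauss_expect(values):
    """Riemann-sum approximation of a gaussian expectation on the grid Z."""
    return float(np.sum(values * DENSITY) * DZ)

def forward_order_parameters(phi, q0, sigma_w2, sigma_b2, depth):
    """Iterate q <- sigma_w2 * E[phi(sqrt(q) z)^2] + sigma_b2 over the layers."""
    q = q0
    history = [q]
    for _ in range(depth):
        q_hat = gauss_expect(phi(np.sqrt(q) * Z) ** 2)  # second moment of activations
        q = sigma_w2 * q_hat + sigma_b2                  # variance of next pre-activations
        history.append(q)
    return history

relu = lambda u: np.maximum(u, 0.0)
# For ReLU, E[phi(sqrt(q) z)^2] = q / 2, so sigma_w2 = 2 and sigma_b2 = 0 keep q fixed.
q_relu = forward_order_parameters(relu, q0=1.0, sigma_w2=2.0, sigma_b2=0.0, depth=5)
q_tanh = forward_order_parameters(np.tanh, q0=1.0, sigma_w2=1.0, sigma_b2=0.1, depth=20)
print(q_relu, q_tanh[-1])
```

The ReLU case is one of the activation functions for which the recursion has an explicit form, which is why the numerical values stay at the initial value.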
3 Eigenvalue Statistics of FIMs
3.1 FIM for Regression Tasks
The average of the eigenvalue spectrum asymptotically decreases on the order of $1/M$, where $M$ denotes the network width, while the variance takes a value of $O(1)$ and the largest eigenvalue takes a huge value of $O(M)$. This implies that most of the eigenvalues are asymptotically close to zero, while the number of large eigenvalues is limited. Thus, when the network is sufficiently wide, one can see that the shape of the spectrum asymptotically becomes pathologically distorted. This suggests that the parameter space of the DNNs is locally almost flat in most directions but highly distorted in a few specific directions.
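The width dependence described above can be observed in a small experiment. The sketch below builds a random fully connected tanh network with bias terms, forms the Jacobian of the outputs by manual backpropagation, and reads the eigenvalue statistics of the empirical FIM for regression from the small dual Gram matrix. All settings (input dimension equal to the width, gaussian inputs, depth, sample size, variance parameters) are illustrative assumptions and not the experimental setup of the original work.

```python
import numpy as np

def fim_eigen_stats(width, depth=3, n_samples=10, n_out=2,
                    sigma_w=1.0, sigma_b=0.5, seed=0):
    """Return (mean eigenvalue, largest eigenvalue) of the empirical FIM."""
    rng = np.random.default_rng(seed)
    sizes = [width] * depth + [n_out]  # assumption: input dimension equals the width
    Ws = [rng.normal(0.0, sigma_w / np.sqrt(sizes[l]), (sizes[l + 1], sizes[l]))
          for l in range(depth)]
    bs = [rng.normal(0.0, sigma_b, sizes[l + 1]) for l in range(depth)]
    X = rng.normal(size=(n_samples, width))

    rows = []
    for x in X:
        hs, us = [x], []
        for l in range(depth):                     # forward pass
            u = Ws[l] @ hs[-1] + bs[l]
            us.append(u)
            hs.append(np.tanh(u) if l < depth - 1 else u)  # linear output layer
        for k in range(n_out):                     # one gradient per output unit
            delta = np.zeros(n_out)
            delta[k] = 1.0
            grads = []
            for l in reversed(range(depth)):       # backward pass
                grads.append(np.outer(delta, hs[l]).ravel())  # d f_k / d W_l
                grads.append(delta.copy())                     # d f_k / d b_l
                if l > 0:
                    delta = (Ws[l].T @ delta) * (1.0 - np.tanh(us[l - 1]) ** 2)
            rows.append(np.concatenate(grads))
    J = np.array(rows)                              # (n_samples * n_out) x n_params
    gram = J @ J.T / n_samples                      # shares nonzero eigenvalues with the FIM
    eigs = np.linalg.eigvalsh(gram)
    return eigs.sum() / J.shape[1], eigs[-1]        # trace / n_params, largest eigenvalue

mean_narrow, max_narrow = fim_eigen_stats(width=50)
mean_wide, max_wide = fim_eigen_stats(width=200)
print(mean_narrow, max_narrow)
print(mean_wide, max_wide)
```

In this toy setting, widening the network shrinks the mean eigenvalue and enlarges the largest one, so the gap between them grows quickly with the width.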
In particular, regarding the large eigenvalues, we have:
3.2 Diagonal Blocks of FIM
It is helpful to investigate the relation between and its diagonal blocks when one considers the diagonal block approximation of . For example, use of a diagonal block approximation can decrease the computational cost of natural gradient algorithms (Martens & Grosse, 2015; Amari, Karakida, & Oizumi, 2019; Karakida & Osawa, 2020). When a matrix is composed only of diagonal blocks, its eigenvalues are given by those of each diagonal block. approximated in this fashion has the same mean of the eigenvalues as the original and the largest eigenvalue , which is of . Thus, the diagonal block approximation also suffers from a pathological spectrum. Eigenvalues that are close to zero can make the inversion of the FIM in natural gradient unstable, whereas using a damping term seems to be an effective way of dealing with this instability (Martens & Grosse, 2015).
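The linear-algebra facts used in this paragraph can be checked on a toy matrix. The sketch below uses a hypothetical random Jacobian and arbitrary block sizes, both assumptions made for illustration, and confirms that the eigenvalues of a block diagonal matrix are the union of the eigenvalues of its blocks and that the trace, and hence the mean eigenvalue, is preserved by the approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
block_sizes = [3, 4, 2]                      # hypothetical layer-wise parameter counts
n_rows = 6
J = rng.normal(size=(n_rows, sum(block_sizes)))  # hypothetical Jacobian
F = J.T @ J / n_rows                         # toy FIM, positive semidefinite

# Keep only the diagonal blocks of F.
F_block = np.zeros_like(F)
blocks = []
start = 0
for size in block_sizes:
    sl = slice(start, start + size)
    F_block[sl, sl] = F[sl, sl]
    blocks.append(F[sl, sl])
    start += size

eig_block_matrix = np.sort(np.linalg.eigvalsh(F_block))
eig_from_blocks = np.sort(np.concatenate([np.linalg.eigvalsh(B) for B in blocks]))
print(np.allclose(eig_block_matrix, eig_from_blocks))
print(np.trace(F), np.trace(F_block))
```

Each diagonal block is a principal submatrix of a positive semidefinite matrix, so its largest eigenvalue cannot exceed that of the full matrix.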
3.3 FIM for Multilabel Classification Tasks
We obtain the following result of 's eigenvalue statistics:
The derivation is shown in section B.1. We find that the eigenvalue spectrum shows the same width dependence as the FIM for regression tasks. Although the evaluation of in theorem 4 is based on inequalities, one can see that linearly increases as the width or the depth increases. The softmax functions appear in the coefficients . It should be noted that the values of generally depend on the index of each softmax output. This is because the values of the softmax functions depend on the specific configuration of and .
3.4 's Large Eigenvalues
In this section, we take a closer look at 's large eigenvalues. Figure 3b (left) shows a typical spectrum of . We set , and other settings were the same as in the case of . We found that compared to , had the largest eigenvalues, which were widely spread from the bulk of the spectrum. Naively speaking, this is because the coefficient matrix has distributed eigenvalues, as is shown in Figure 3b (right). Compared to the FIM for regression, which corresponds to , the distributed 's eigenvalues can make 's eigenvalues disperse. The diagonal block also has the same characteristics of the spectrum as the original in Figure 3b (middle). It is noteworthy that exhaustive experiments on the cross-entropy loss have recently confirmed that there are dominant large eigenvalues (so-called outliers) even after the training (Papyan, 2019).
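To illustrate why a softmax coefficient matrix spreads the eigenvalues, the following sketch builds the per-sample matrix diag(g) - g g^T from a softmax output g, which is the Fisher information of a categorical distribution with respect to its logits. The name `Q`, the random logits, and the random per-sample Jacobian `J` are assumptions introduced here; the regression case corresponds to replacing `Q` by the identity.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - np.max(u))   # shift for numerical stability
    return e / e.sum()

def softmax_coefficient(logits):
    """Per-sample coefficient matrix diag(g) - g g^T built from softmax outputs g."""
    g = softmax(logits)
    return np.diag(g) - np.outer(g, g)

rng = np.random.default_rng(0)
n_classes, n_params = 5, 40
logits = rng.normal(size=n_classes)
J = rng.normal(size=(n_classes, n_params))   # hypothetical per-sample Jacobian

Q = softmax_coefficient(logits)
q_eigs = np.linalg.eigvalsh(Q)               # distributed between 0 and max(g)
F_reg = J.T @ J                              # coefficient matrix = identity
F_cross = J.T @ Q @ J                        # softmax coefficient inserted
print(q_eigs)
print(np.linalg.eigvalsh(F_reg)[-1], np.linalg.eigvalsh(F_cross)[-1])
```

The eigenvalues of `Q` are unequal and bounded by the largest softmax output, with a zero eigenvalue along the all-ones direction, whereas the identity matrix has all eigenvalues equal to one.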
Consistent with the empirical results, we found that there are large eigenvalues:
has the first largest eigenvalues of .
The theorem is proved in section B.2. These large eigenvalues are reminiscent of the largest eigenvalues of shown in theorem 1 and can act as outliers. Note that because , we have . denotes the -th largest eigenvalue of . Let us suppose that and the assumptions of theorem 2 hold. In this case, is upper-bounded by equation 3.3, and then is of at most. This means that the first largest eigenvalues of can become outliers. It would be interesting to extend the above results and theoretically quantify more precise values of these outliers. One promising direction will be to analyze a hierarchical structure of empirically investigated by Papyan (2019). It is also noteworthy that our outliers disappear under the mean subtraction of in the last layer mentioned in equation 3.4. This is because we have under the mean subtraction.
Because the global minimum is given by , all entries of and approach zero after a large enough number of steps. In the MSE case, we can explain the critical learning rate for convergence as is remarked in equation 3.5. In contrast, it is challenging to estimate such a critical learning rate in the cross-entropy case since and dynamically change.
4 Connection to Neural Tangent Kernel
4.1 Scale-Dependent Eigenvalue Statistics
The NTK and empirical FIM share essentially the same nonzero eigenvalues. It is easy to see that one can represent the empirical FIM, equation 2.10, by . This means that the NTK, equation 4.1, is the left-to-right reversal of up to the constant factor . Karakida et al. (2019b) introduced , which is essentially the same as the NTK, and referred to as the dual of . They used to derive theorem 1.
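The shared nonzero eigenvalues follow from the fact that a matrix product and its left-to-right reversal have the same nonzero spectrum. A minimal numerical check, with a hypothetical random Jacobian `J` and an assumed common normalization by the sample size, is:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_out, n_params = 4, 2, 50
n_rows = n_samples * n_out
J = rng.normal(size=(n_rows, n_params))   # hypothetical Jacobian of outputs w.r.t. parameters

F = J.T @ J / n_samples                   # FIM-like matrix, n_params x n_params
K = J @ J.T / n_samples                   # reversed product, n_rows x n_rows

f_eigs = np.sort(np.linalg.eigvalsh(F))[::-1]
k_eigs = np.sort(np.linalg.eigvalsh(K))[::-1]
print(np.allclose(f_eigs[:n_rows], k_eigs))
```

The larger matrix only adds zero eigenvalues, so statistics of the nonzero spectrum can be computed from whichever matrix is smaller.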
While is independent of the sample size , depends on it. This means that the NTK dynamics converge nonuniformly. Most of the eigenvalues are relatively small, and the NTK dynamics converge more slowly in the corresponding eigenspace. In addition, a prediction made with the NTK requires the inverse of the NTK to be computed (Jacot et al., 2018; Lee et al., 2019). When the sample size is large, the condition number of the NTK, , is also large and the computation with the inverse NTK is expected to be numerically inaccurate.
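The link between the eigenvalues and the convergence speed can be written out explicitly. The following is a sketch assuming gradient flow on the MSE loss with a kernel that stays fixed during training, as in the linearized regime discussed by Lee et al. (2019). The symbols $f_t$, $y$, $\eta$, $\Theta$, $\lambda_i$, and $v_i$ are notation introduced here for illustration.

```latex
\frac{d}{dt}\bigl(f_t - y\bigr) = -\eta\,\Theta\,\bigl(f_t - y\bigr),
\qquad
\Theta = \sum_i \lambda_i\, v_i v_i^{\top}
\;\Longrightarrow\;
v_i^{\top}\bigl(f_t - y\bigr) = e^{-\eta \lambda_i t}\, v_i^{\top}\bigl(f_0 - y\bigr),
\qquad
\mathrm{cond}(\Theta) = \frac{\lambda_{\max}}{\lambda_{\min}} .
```

Each error component decays at a rate set by its own eigenvalue, so a large gap between the largest eigenvalue and the bulk means that most components converge much more slowly than the outlier direction.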
4.2 Scale-Independent NTK
A natural question is under what condition NTK's eigenvalue statistics become independent of both the width and the sample size. As indicated in equation 3.4, the mean subtraction in the last layer with is a simple way to make the FIM's largest eigenvalue independent of the width. Similarly, one can expect that the mean subtraction makes the NTK's largest eigenvalue of disappear and the eigenvalue spectrum take a range of independent of the large width and sample size.
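The effect of mean subtraction on an outlier can be illustrated with a toy Gram matrix. The sketch below uses random ReLU features as a stand-in for last-layer activations and subtracts the mean over samples. Both the feature model and the choice of averaging over samples are assumptions made for illustration and are not taken from equation 3.4.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, width = 200, 2000
# Toy stand-in for last-layer activations: ReLU features with nonzero mean.
H = np.maximum(rng.normal(size=(n_samples, width)), 0.0)

def gram_eigs(features):
    """Eigenvalues of the width-normalized Gram matrix over samples."""
    gram = features @ features.T / features.shape[1]
    return np.linalg.eigvalsh(gram)

raw_eigs = gram_eigs(H)
H_centered = H - H.mean(axis=0, keepdims=True)   # subtract the mean over samples
centered_eigs = gram_eigs(H_centered)

raw_top, raw_mean = raw_eigs[-1], raw_eigs.mean()
cen_top, cen_mean = centered_eigs[-1], centered_eigs.mean()
print(raw_top, raw_mean)
print(cen_top, cen_mean)
```

Before centering, the nonzero feature mean produces one eigenvalue that grows with the sample size and sits far above the average. After centering, the all-ones direction over samples is projected out and the largest eigenvalue stays on the same order as the mean.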
5 Metric Tensor for Input and Feature Spaces
The above framework for evaluating FIMs is also applicable to metric tensors for input and feature spaces, which are expressed in the matrix form in Figure 2c. Let us denote . It is easy to see the eigenvalue statistics of from those of . We can prove the following theorem:
The theorem is derived in appendix D. Since is the summation of over output units, and of are times as large as those of . The mean of the eigenvalues asymptotically decreases on the order of . Note that when , has trivial zero eigenvalues. Even if we neglect these trivial zero eigenvalues, the mean becomes and decreases on the order of . In contrast, the largest eigenvalue is of for any and . Thus, the spectrum of is pathologically distorted in the sense that the mean is far from the edge beyond the order difference. The local geometry of is strongly distorted in the direction of Similarly, it is easy to derive the eigenvalue statistics of diagonal blocks . The details are shown in the appendix.
Let us remark on some related work in the literature of deep learning. First, Pennington et al. (2018) investigated similar but different matrices. Briefly, they used random matrix theory and obtained the eigenvalue spectrum of with , . They found that the isometry of the spectrum is helpful to solve the vanishing gradient problem. Second, DNNs are known to be vulnerable to a specific noise perturbation, that is, the adversarial example (Goodfellow et al., 2014). One can speculate that the eigenvector corresponding to may be related to adversarial attacks, although such a conclusion will require careful consideration.
6 Conclusion and Discussion
We evaluated the asymptotic eigenvalue statistics of the FIM and its variants in sufficiently wide DNNs. We found that they have pathological spectra, that is, number-of-classes eigenvalues act as outliers in the conventional setting of random initialization and activation functions. In particular, we empirically demonstrated that softmax output disperses the outliers and makes a tail of the eigenvalue spectrum spread from the bulk. Since the FIM shares the same nonzero eigenvalues as NTK, the convergence property of the training dynamics can depend on the outliers. This suggests that we need to be careful about the eigenvalue statistics and their influence on the learning when we use large-scale deep networks in naive settings. These outliers can disappear under specific normalization of the last layer.
This work focused on fully connected neural networks, and it will be interesting to explore the spectra of other architectures such as ResNets and CNNs. It will also be fundamental to explore the eigenvalue statistics that this study cannot capture. While our study captured some of the basic eigenvalue statistics, it remains to derive the whole spectrum analytically. In particular, after the normalization excludes large outliers, the bulk of the spectrum becomes dominant. In such cases, random matrix theory seems to be a prerequisite for further progress. It enables us to analyze the FIM's eigenvalue spectrum of a shallow centered network in the large width limit under the fixed ratio between the width and sample size (Pennington & Worah, 2018). In deep networks, the Stieltjes transform of the NTK's eigenvalue spectrum is obtained in an iterative formulation (Fan & Wang, 2020). Extending these analyses will lead to a better understanding of the spectrum. Furthermore, we assumed a finite number of network output units. In order to deal with multilabel classifications with high dimensionality, it would be helpful to investigate eigenvalue statistics in the wide limit of both hidden and network output layers. Finally, although we focused on the finite depth and regarded order parameters as constants, they can exponentially explode in extremely deep networks in the chaotic regime (Schoenholz et al., 2017; Yang, Pennington, Rao, Sohl-Dickstein, & Schoenholz, 2019). The NTK in such a regime has been investigated in Jacot et al. (2019).
It would also be interesting to explore further connections between the eigenvalue statistics and learning. Recent studies have yielded insights into the connection between the generalization performance of DNNs and the eigenvalue statistics of certain Gram matrices including FIM and NTK (Suzuki, 2018; Sun & Nielsen, 2019; Yang & Salman, 2019). We expect that the theoretical foundation of the metric tensors given in this letter will lead to a more sophisticated understanding and development of deep learning in the future.
Appendix A: Eigenvalue Statistics of
A.1 Overviewing the Derivation of Theorem 1
We can obtain from , from where is the Frobenius norm, and from . The eigenvector of corresponding to is asymptotically given by . When is of , it is obvious that 's eigenvalues determine 's eigenvalues in the large limit. Even if increases depending on , our eigenvalue statistics hold in the large and limits. That is, we have asymptotically , , and . As one can see here, the condition of is crucial for our eigenvalue statistics. Noncentered networks guarantee and , which leads to . In centered networks, can become zero, and we need to carefully evaluate the second term of equation A.4.
A.2 Diagonal Blocks
Appendix B: Eigenvalue Statistics of
B.1 Derivation of Theorem 3
B.2 Derivation of Theorem 5
Appendix C: Derivation of NTK's Eigenvalue Statistics
In the same way as with the FIM, the trace of leads to , the Frobenius norm of leads to , and has the largest eigenvalue for arbitrary . The eigenspace of corresponding to is also the same as . It is spanned by eigenvectors ().
Appendix D: Eigenvalue Statistics of
D.1 Derivation of Theorem 6
The eigenvalue statistics are easily derived from the leading term . We can derive the mean of the eigenvalues as and the second moment as , where . We can determine the largest eigenvalue because we explicitly obtain the eigenvalues of ; with an eigenvector and with eigenvectors (). The vector denotes a unit vector whose entries are 1 for the th entry and 0 otherwise. The largest eigenvalue is given by .
D.2 Diagonal Blocks
R.K. acknowledges the funding support from JST ACT-X (grant JPMJAX190A) and Grant-in-Aid for Young Scientists (grant 19K20366).