Shun-ichi Amari: search results 1-20 of 46
Neural Computation (2021) 33 (8): 2274–2307. Published: 26 July 2021
Abstract
The Fisher information matrix (FIM) plays an essential role in statistics and machine learning as a Riemannian metric tensor or a component of the Hessian matrix of loss functions. Focusing on the FIM and its variants in deep neural networks (DNNs), we reveal their characteristic scale dependence on the network width, depth, and sample size when the network has random weights and is sufficiently wide. This study covers two widely used FIMs for regression with linear output and for classification with softmax output. Both FIMs asymptotically show pathological eigenvalue spectra in the sense that a small number of eigenvalues become large outliers depending on the width or sample size, while the others are much smaller. It implies that the local shape of the parameter space or loss landscape is very sharp in a few specific directions while almost flat in the other directions. In particular, the softmax output disperses the outliers and makes a tail of the eigenvalue density spread from the bulk. We also show that pathological spectra appear in other variants of FIMs: one is the neural tangent kernel; another is a metric for the input signal and feature space that arises from feedforward signal propagation. Thus, we provide a unified perspective on the FIM and its variants that will lead to more quantitative understanding of learning in large-scale DNNs.
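The width and sample-size dependence of the spectrum is easy to probe numerically. The sketch below is only a rough illustration (the width, sample size, activation, and 1/sqrt scalings are my own choices, not the paper's exact setting): it assembles the empirical FIM of a random two-layer network with a single linear output and compares its largest eigenvalues with the median, which should expose a few large outliers over a nearly flat bulk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, N = 50, 400, 200                      # input dim, width, sample size (illustrative)

W1 = rng.standard_normal((m, d))            # random hidden-layer weights
w2 = rng.standard_normal(m)                 # random output weights

X = rng.standard_normal((N, d))
pre = X @ W1.T / np.sqrt(d)                 # preactivations, shape (N, m)
h = np.tanh(pre)
dh = 1.0 - h ** 2                           # tanh derivative

# Per-sample gradient of the scalar output w.r.t. (W1, w2), stacked into a Jacobian J.
grad_w2 = h / np.sqrt(m)                                                        # (N, m)
grad_W1 = (dh * (w2 / np.sqrt(m)))[:, :, None] * (X / np.sqrt(d))[:, None, :]   # (N, m, d)
J = np.concatenate([grad_W1.reshape(N, -1), grad_w2], axis=1)                   # (N, P)

# Empirical FIM for regression with linear output: F = J^T J / N.
# Its nonzero eigenvalues equal those of the N x N Gram matrix J J^T / N.
eigs = np.linalg.eigvalsh(J @ J.T / N)[::-1]
print("largest eigenvalues:", eigs[:5])
print("median eigenvalue  :", np.median(eigs))
```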
Neural Computation (2020) 32 (8): 1431–1447. Published: 01 August 2020
Abstract
It is known that any target function is realized in a sufficiently small neighborhood of any randomly connected deep network, provided the width (the number of neurons in a layer) is sufficiently large. There are sophisticated analytical theories and discussions concerning this striking fact, but rigorous theories are very complicated. We give an elementary geometrical proof by using a simple model for the purpose of elucidating its structure. We show that high-dimensional geometry plays a magical role. When we project a high-dimensional sphere of radius 1 to a low-dimensional subspace, the uniform distribution over the sphere shrinks to a gaussian distribution with negligibly small variances and covariances.
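The projection fact at the heart of the argument can be checked directly. A minimal numerical sketch, with arbitrary dimensions of my own choosing: sample points uniformly on the unit sphere in R^n by normalizing gaussian vectors, project onto a low-dimensional coordinate subspace, and observe per-coordinate variances of about 1/n and negligible covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, samples = 1_000, 2, 20_000   # sphere dimension, projection dimension, number of draws

# Uniform points on the unit sphere S^{n-1}: normalize standard gaussian vectors.
u = rng.standard_normal((samples, n))
u /= np.linalg.norm(u, axis=1, keepdims=True)

# Orthogonal projection onto a k-dimensional coordinate subspace.
proj = u[:, :k]
print("variance of each projected coordinate:", proj.var(axis=0))      # close to 1/n = 0.001
print("covariance between the coordinates   :", np.cov(proj.T)[0, 1])  # close to 0
```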
Neural Computation (2019) 31 (5): 827–848. Published: 01 May 2019
Abstract
We propose a new divergence on the manifold of probability distributions, building on the entropic regularization of optimal transportation problems. As Cuturi (2013) showed, regularizing the optimal transport problem with an entropic term brings several computational benefits. However, because of that regularization, the resulting approximation of the optimal transport cost does not define a proper distance or divergence between probability distributions. We recently tried to introduce a family of divergences connecting the Wasserstein distance and the Kullback-Leibler divergence from an information geometry point of view (see Amari, Karakida, & Oizumi, 2018). However, that proposal was not able to retain key intuitive aspects of the Wasserstein geometry, such as translation invariance, which plays a key role when used in the more general problem of computing optimal transport barycenters. The divergence we propose in this work is able to retain such properties and admits an intuitive interpretation.
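For context, the entropic regularization referred to here is usually computed with Sinkhorn matrix scaling (Cuturi, 2013). The sketch below implements that standard regularized transport cost on a toy 1D problem, not the new divergence proposed in this letter; the grid, cost, and regularization strength eps are illustrative.

```python
import numpy as np

def sinkhorn(p, q, C, eps=0.05, iters=500):
    """Entropic-regularized OT between histograms p, q with cost matrix C (Sinkhorn scaling)."""
    K = np.exp(-C / eps)
    u = np.ones_like(p)
    for _ in range(iters):
        v = q / (K.T @ u)
        u = p / (K @ v)
    P = u[:, None] * K * v[None, :]          # approximate optimal coupling
    return np.sum(P * C)                     # regularized transport cost

# Toy example: two gaussian-like histograms on a 1D grid with squared-distance cost.
x = np.linspace(0.0, 1.0, 50)
C = (x[:, None] - x[None, :]) ** 2
p = np.exp(-((x - 0.3) ** 2) / 0.01); p /= p.sum()
q = np.exp(-((x - 0.7) ** 2) / 0.02); q /= q.sum()
print("entropic OT cost:", sinkhorn(p, q, C))
```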
Neural Computation (2018) 30 (1): 1–33. Published: 01 January 2018
Abstract
The dynamics of supervised learning play a main role in deep learning, which takes place in the parameter space of a multilayer perceptron (MLP). We review the history of supervised stochastic gradient learning, focusing on its singular structure and natural gradient. The parameter space includes singular regions in which parameters are not identifiable. One of our results is a full exploration of the dynamical behaviors of stochastic gradient learning in an elementary singular network. The bad news is its pathological nature, in which part of the singular region becomes an attractor and another part a repulser at the same time, forming a Milnor attractor. A learning trajectory is attracted by the attractor region, staying in it for a long time, before it escapes the singular region through the repulser region. This is typical of plateau phenomena in learning. We demonstrate the strange topology of a singular region by introducing blow-down coordinates, which are useful for analyzing the natural gradient dynamics. We confirm that the natural gradient dynamics are free of critical slowdown. The second main result is the good news: the interactions of elementary singular networks eliminate the attractor part and the Milnor-type attractors disappear. This explains why large-scale networks do not suffer from serious critical slowdowns due to singularities. We finally show that the unit-wise natural gradient is effective for learning in spite of its low computational cost.
Neural Computation (2012) 24 (12): 3191–3212. Published: 01 December 2012
Abstract
We study the Bayesian process to estimate the features of the environment. We focus on two aspects of the Bayesian process: how estimation error depends on the prior distribution of features and how the prior distribution can be learned from experience. The accuracy of the perception is underestimated when each feature of the environment is considered independently because many different features of the environment are usually highly correlated and the estimation error greatly depends on the correlations. The self-consistent learning process renews the prior distribution of correlated features jointly with the estimation of the environment. Here, maximum a posteriori probability (MAP) estimation decreases the effective dimensions of the feature vector. There are critical noise levels in self-consistent learning with MAP estimation that cause hysteresis behaviors in learning. The self-consistent learning process with stochastic Bayesian estimation (SBE) makes the presumed distribution of environmental features converge to the true distribution for any level of channel noise. However, SBE is less accurate than MAP estimation. We also discuss another stochastic method of estimation, SBE2, which has a smaller estimation error than SBE without hysteresis.
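A linear-gaussian toy model (my own construction, not the letter's setup) illustrates why the estimation error depends on the feature correlations: with gaussian features and channel noise, the MAP estimate coincides with the posterior mean, and a prior that ignores the correlations gives a visibly larger mean squared error.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, rho, sigma2 = 20, 50_000, 0.8, 1.0     # feature dim, trials, correlation, channel noise

# Correlated environmental features and noisy observations x = z + noise.
Sigma = (1 - rho) * np.eye(d) + rho * np.ones((d, d))
z = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
x = z + np.sqrt(sigma2) * rng.standard_normal((n, d))

def posterior_mean(prior_cov):
    """Gaussian posterior mean (= MAP estimate) of z given x under the assumed prior."""
    W = prior_cov @ np.linalg.inv(prior_cov + sigma2 * np.eye(d))
    return x @ W.T

err_joint = np.mean((posterior_mean(Sigma) - z) ** 2)       # prior matches the correlations
err_indep = np.mean((posterior_mean(np.eye(d)) - z) ** 2)   # features treated as independent
print("error with correlated prior :", err_joint)
print("error with independent prior:", err_indep)
```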
Identification of Directed Influence: Granger Causality, Kullback-Leibler Divergence, and Complexity
Neural Computation (2012) 24 (7): 1722–1739. Published: 01 July 2012
Abstract
Detecting and characterizing causal interdependencies and couplings between different activated brain areas from functional neuroimage time series measurements of their activity constitutes a significant step toward understanding the process of brain functions. In this letter, we make the simple point that all current statistics used to make inferences about directed influences in functional neuroimage time series are variants of the same underlying quantity. This includes directed transfer entropy, transinformation, Kullback-Leibler formulations, conditional mutual information, and Granger causality. Crucially, in the case of autoregressive modeling, the underlying quantity is the likelihood ratio that compares models with and without directed influences from the past when modeling the influence of one time series on another. This framework is also used to derive the relation between these measures of directed influence and the complexity or the order of directed influence. These results provide a framework for unifying the Kullback-Leibler divergence, Granger causality, and the complexity of directed influence.
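The likelihood-ratio view is straightforward to reproduce for gaussian autoregressive models: compare the residual variance of predicting one series from its own past against predicting it from both pasts. The function name, lag order, and toy coupling below are illustrative assumptions.

```python
import numpy as np

def granger_llr(x, y, p=2):
    """Gaussian log-likelihood ratio: does the past of x improve the prediction of y?"""
    T = len(y)
    Y = y[p:]
    own = np.array([y[t - p:t][::-1] for t in range(p, T)])                            # y's own past
    both = np.array([np.r_[y[t - p:t][::-1], x[t - p:t][::-1]] for t in range(p, T)])  # plus x's past

    def residual_var(Z):
        Z = np.column_stack([np.ones(len(Z)), Z])
        beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        return np.mean((Y - Z @ beta) ** 2)

    return 0.5 * (T - p) * np.log(residual_var(own) / residual_var(both))

rng = np.random.default_rng(0)
x = rng.standard_normal(2_000)
y = np.zeros_like(x)
for t in range(1, len(x)):                    # y is driven by the past of x
    y[t] = 0.5 * y[t - 1] + 0.8 * x[t - 1] + 0.1 * rng.standard_normal()

print("x -> y:", granger_llr(x, y))           # large: past x helps predict y
print("y -> x:", granger_llr(y, x))           # near zero: past y does not help predict x
```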
Neural Computation (2011) 23 (5): 1248–1260. Published: 01 May 2011
Abstract
A neural field is a continuous version of a neural network model, accounting for dynamical pattern formation from population firing activities in neural tissues. These patterns include standing bumps, moving bumps, traveling waves, target waves, breathers, and spiral waves, many of them observed in various brain areas. They can be categorized into two types: a wave-like activity spreading over the field and a particle-like localized activity. We show through numerical experiments that localized traveling excitation patterns (traveling bumps), which behave like particles, exist in a two-dimensional neural field with excitation and inhibition mechanisms. The traveling bumps do not require any geometric restriction (boundary) to prevent them from propagating away, a fact that might shed light on how neurons in the brain are functionally organized. Collisions of traveling bumps exhibit rich phenomena; they might reveal the manner of information processing in the cortex and be useful in various applications. The trajectories of traveling bumps can be controlled by external inputs.
Neural Computation (2011) 23 (2): 477–516. Published: 01 February 2011
Abstract
We present a computational model that highlights the role of basal ganglia (BG) in generating simple reaching movements. The model is cast within the reinforcement learning (RL) framework with correspondence between RL components and neuroanatomy as follows: dopamine signal of substantia nigra pars compacta as the temporal difference error, striatum as the substrate for the critic, and the motor cortex as the actor. A key feature of this neurobiological interpretation is our hypothesis that the indirect pathway is the explorer. Chaotic activity, originating from the indirect pathway part of the model, drives the wandering, exploratory movements of the arm. Thus, the direct pathway subserves exploitation, while the indirect pathway subserves exploration. The motor cortex becomes more and more independent of the corrective influence of BG as training progresses. Reaching trajectories show diminishing variability with training. Reaching movements associated with Parkinson's disease (PD) are simulated by reducing dopamine and degrading the complexity of indirect pathway dynamics by switching it from chaotic to periodic behavior. Under the simulated PD conditions, the arm exhibits PD motor symptoms like tremor, bradykinesia and undershooting. The model echoes the notion that PD is a dynamical disease.
Neural Computation (2010) 22 (7): 1718–1736. Published: 01 July 2010
Abstract
Analysis of correlated spike trains is a hot topic of research in computational neuroscience. A general model of probability distributions for spikes includes too many parameters to be of use in analyzing real data. Instead, we need a simple but powerful generative model for correlated spikes. We developed a class of conditional mixture models that includes a number of existing models and analyzed its capabilities and limitations. We apply the model to dynamical aspects of neuron pools. When Hebbian cell assemblies coexist in a pool of neurons, the condition is specified by these assemblies such that the probability distribution of spikes is a mixture of those of the component assemblies. The probabilities of activation of the Hebbian assemblies change dynamically. We used this model as a basis for a competitive model governing the states of assemblies.
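A toy generative sketch (a deliberately simplified stand-in, not the letter's full conditional mixture class) shows how mixing over assemblies produces correlated spikes even though spikes are conditionally independent given the active assembly; the assembly sizes and probabilities are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_bins = 40, 100_000

# Two hypothetical assemblies; membership in the active assembly raises the firing probability.
assemblies = [np.arange(0, 20), np.arange(20, 40)]
p_assembly, p_base, p_high = np.array([0.5, 0.5]), 0.02, 0.30

# Conditional mixture: pick the active assembly per time bin, then draw spikes independently.
active = rng.choice(len(assemblies), size=n_bins, p=p_assembly)
rates = np.full((n_bins, n_neurons), p_base)
for k, members in enumerate(assemblies):
    rates[np.ix_(active == k, members)] = p_high
spikes = (rng.random((n_bins, n_neurons)) < rates).astype(int)

# Mixing induces correlations even though spikes are independent given the active assembly.
print("within-assembly correlation :", np.corrcoef(spikes[:, 0], spikes[:, 1])[0, 1])
print("between-assembly correlation:", np.corrcoef(spikes[:, 0], spikes[:, 25])[0, 1])
```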
Neural Computation (2009) 21 (8): 2309–2335. Published: 01 August 2009
Abstract
Information geometry has been suggested to provide a powerful tool for analyzing multineuronal spike trains. Among several advantages of this approach, a significant property is the close link between information-geometric measures and neural network architectures. Previous modeling studies established that the first- and second-order information-geometric measures corresponded to the number of external inputs and the connection strengths of the network, respectively. This relationship was, however, limited to a symmetrically connected network, and the number of neurons used in the parameter estimation of the log-linear model needed to be known. Recently, simulation studies of biophysical model neurons have suggested that information geometry can estimate the relative change of connection strengths and external inputs even with asymmetric connections. Inspired by these studies, we analytically investigated the link between the information-geometric measures and the neural network structure with asymmetrically connected networks of N neurons. We focused on the information-geometric measures of orders one and two, which can be derived from the two-neuron log-linear model, because unlike higher-order measures, they can be easily estimated experimentally. Considering the equilibrium state of a network of binary model neurons that obey stochastic dynamics, we analytically showed that the corrected first- and second-order information-geometric measures provided robust and consistent approximation of the external inputs and connection strengths, respectively. These results suggest that information-geometric measures provide useful insights into the neural network architecture and that they will contribute to the study of system-level neuroscience.
Neural Computation (2009) 21 (4): 960–972. Published: 01 April 2009
Abstract
There are a number of measures of correlation for spikes of two neurons and for spikes at two successive time bins in one neuron: covariance, correlation coefficient, mutual information, and information-geometric measure in the log-linear model. It is desirable to have a measure that is not affected by change in the firing rates of neurons. We explain the superiority of the information-geometric measure from the point of view of geometry, by which the correlation and firing rates are separated orthogonally, that is, without correlation. We then analyze characteristics of other measures and show analytically how they are related to firing rates.
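A small worked example, with toy parameters of my own, makes the orthogonality point concrete: fix the interaction term of the two-neuron log-linear model and change only the bias terms (and hence the firing rates); the correlation coefficient changes while the information-geometric measure stays put.

```python
import numpy as np

def joint_from_loglinear(theta1, theta2, theta12):
    """Joint pmf of two binary neurons: log p(x1, x2) = th1*x1 + th2*x2 + th12*x1*x2 - psi."""
    logits = np.array([0.0, theta2, theta1, theta1 + theta2 + theta12])  # (0,0),(0,1),(1,0),(1,1)
    p = np.exp(logits)
    return p / p.sum()

def measures(p):
    p00, p01, p10, p11 = p
    theta12 = np.log(p11 * p00 / (p10 * p01))                        # recovered log-linear interaction
    r1, r2 = p10 + p11, p01 + p11                                    # firing rates
    rho = (p11 - r1 * r2) / np.sqrt(r1 * (1 - r1) * r2 * (1 - r2))   # correlation coefficient
    return theta12, rho

for theta in (-3.0, -1.0):    # change the bias terms (firing rates), keep the interaction fixed
    p = joint_from_loglinear(theta, theta, 1.0)
    print("rate:", round(p[2] + p[3], 3), "theta12, rho:", measures(p))
```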
Neural Computation (2008) 20 (6): 1411–1426. Published: 01 June 2008
Abstract
We study the discrimination capability of spike time sequences using the Chernoff distance as a metric. We assume that spike sequences are generated by renewal processes and study how the Chernoff distance depends on the shape of the interspike interval (ISI) distribution. First, we consider a lower bound on the Chernoff distance because it has a simple closed form. Then we consider specific models of ISI distributions such as the gamma, inverse gaussian (IG), exponential with refractory period (ER), and that of the leaky integrate-and-fire (LIF) neuron. We found that the discrimination capability of spike times strongly depends on high-order moments of ISI and that it is higher when the spike time sequence has a larger skewness and a smaller kurtosis. High variability in terms of coefficient of variation (CV) does not necessarily mean that the spike times have less discrimination capability. Spike sequences generated by the gamma distribution have the minimum discrimination capability for a given mean and variance of ISI. We used series expansions to calculate the mean and variance of ISIs for LIF neurons as a function of the mean input level and the input noise variance. Spike sequences from an LIF neuron are more capable of discrimination than those of IG and gamma distributions when the stationary voltage level is close to the neuron's threshold value.
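An illustrative numerical sketch (not the paper's specific comparisons): compute the Chernoff distance between two ISI densities with the same mean but different shapes by optimizing over the exponent s. The choice of gamma shapes, the integration limit, and the helper name are assumptions for illustration.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar
from scipy.stats import gamma

def chernoff_distance(pdf_p, pdf_q, upper=50.0):
    """Chernoff distance: max over s of -log(integral of p(t)^s q(t)^(1-s) dt)."""
    def log_affinity(s):
        val, _ = quad(lambda t: pdf_p(t) ** s * pdf_q(t) ** (1 - s), 0.0, upper)
        return np.log(val)
    res = minimize_scalar(log_affinity, bounds=(1e-3, 1.0 - 1e-3), method="bounded")
    return -res.fun

# Two ISI models with the same mean interval (1.0) but different shapes.
poisson_like = lambda t: gamma.pdf(t, a=1.0, scale=1.0)    # exponential ISIs, CV = 1
more_regular = lambda t: gamma.pdf(t, a=4.0, scale=0.25)   # more regular gamma ISIs, CV = 0.5
print("Chernoff distance:", chernoff_distance(poisson_like, more_regular))
```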
Neural Computation (2008) 20 (4): 994–1025. Published: 01 April 2008
Abstract
The continuous attractor is a promising model for describing the encoding of continuous stimuli in neural systems. In a continuous attractor, the stationary states of the neural system form a continuous parameter space, on which the system is neutrally stable. This property enables the neural system to track time-varying stimuli smoothly, but it also degrades the accuracy of information retrieval, since these stationary states are easily disturbed by external noise. In this work, based on a simple model, we systematically investigate the dynamics and the computational properties of continuous attractors. In order to analyze the dynamics of a large-size network, which is otherwise extremely complicated, we develop a strategy to reduce its dimensionality by utilizing the fact that a continuous attractor can eliminate the noise components perpendicular to the attractor space very quickly. We therefore project the network dynamics onto the tangent of the attractor space and simplify it successfully as a one-dimensional Ornstein-Uhlenbeck process. Based on this simplified model, we investigate (1) the decoding error of a continuous attractor under the driving of external noisy inputs, (2) the tracking speed of a continuous attractor when the external stimulus experiences abrupt changes, (3) the neural correlation structure associated with the specific dynamics of a continuous attractor, and (4) the consequence of asymmetric neural correlation on statistical population decoding. The potential implications of these results on our understanding of neural information processing are also discussed.
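A minimal sketch of the reduced one-dimensional picture, with illustrative constants (the mapping from network parameters to the restoring strength k and noise level sigma is not reproduced here): simulating the Ornstein-Uhlenbeck projection recovers the stationary decoding-error variance sigma^2 / (2k).

```python
import numpy as np

rng = np.random.default_rng(0)
k, sigma, dt, steps = 2.0, 0.5, 1e-3, 500_000   # illustrative constants

# Reduced dynamics of the bump position along the attractor (Ornstein-Uhlenbeck):
#   ds = -k * s * dt + sigma * dW,  with s the decoded position minus the true stimulus.
s = 0.0
trace = np.empty(steps)
dW = rng.standard_normal(steps) * np.sqrt(dt)
for t in range(steps):
    s += -k * s * dt + sigma * dW[t]
    trace[t] = s

print("empirical decoding-error variance   :", trace[steps // 10:].var())
print("OU stationary variance sigma^2/(2k) :", sigma ** 2 / (2 * k))
```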
Neural Computation (2008) 20 (3): 813–843. Published: 01 March 2008
Abstract
We explicitly analyze the trajectories of learning near singularities in hierarchical networks, such as multilayer perceptrons and radial basis function networks, which include permutation symmetry of hidden nodes, and show their general properties. Such symmetry induces singularities in their parameter space, where the Fisher information matrix degenerates and odd learning behaviors, especially the existence of plateaus in gradient descent learning, arise due to the geometric structure of singularity. We plot dynamic vector fields to demonstrate the universal trajectories of learning near singularities. The singularity induces two types of plateaus, the on-singularity plateau and the near-singularity plateau, depending on the stability of the singularity and the initial parameters of learning. The results presented in this letter are universally applicable to a wide class of hierarchical models. Detailed stability analysis of the dynamics of learning in radial basis function networks and multilayer perceptrons will be presented in separate work.
Neural Computation (2007) 19 (10): 2780–2796. Published: 01 October 2007
Abstract
When there are a number of stochastic models in the form of probability distributions, one needs to integrate them. Mixtures of distributions are frequently used, but exponential mixtures also provide a good means of integration. This letter proposes a one-parameter family of integration, called α-integration, which includes all of these well-known integrations. These are generalizations of various averages of numbers such as arithmetic, geometric, and harmonic averages. There are psychophysical experiments that suggest that α-integrations are used in the brain. The α-divergence between two distributions is defined, which is a natural generalization of Kullback-Leibler divergence and Hellinger distance, and it is proved that α-integration is optimal in the sense of minimizing α-divergence. The theory is applied to generalize the mixture of experts and the product of experts to the α-mixture of experts. The α-predictive distribution is also stated in the Bayesian framework.
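A compact sketch of the α-mean of positive numbers, using the α-representation f(u) = u^((1-α)/2) (with log u at α = 1); this is one common sign convention and may differ from the letter's, so the specific α values mapped to the classical means below should be read as an assumption of this sketch.

```python
import numpy as np

def alpha_mean(x, alpha, w=None):
    """α-integration (α-mean) of positive values x with weights w.

    Uses the α-representation f(u) = u**((1 - alpha) / 2), with f(u) = log(u) at alpha = 1,
    so the α-mean is f^{-1}(sum_i w_i f(x_i)).  Under this convention alpha = -1, 1, 3
    recover the arithmetic, geometric, and harmonic means, respectively.
    """
    x = np.asarray(x, dtype=float)
    w = np.full(len(x), 1.0 / len(x)) if w is None else np.asarray(w, dtype=float)
    if alpha == 1:
        return np.exp(np.sum(w * np.log(x)))
    p = (1.0 - alpha) / 2.0
    return np.sum(w * x ** p) ** (1.0 / p)

vals = [1.0, 4.0, 16.0]
for a in (-1, 1, 3):
    print(a, alpha_mean(vals, a))   # arithmetic 7.0, geometric 4.0, harmonic ~2.286
```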
Neural Computation (2006) 18 (10): 2359–2386. Published: 01 October 2006
Abstract
We considered a gamma distribution of interspike intervals as a statistical model for neuronal spike generation. A gamma distribution is a natural extension of the Poisson process taking the effect of a refractory period into account. The model is specified by two parameters: a time-dependent firing rate and a shape parameter that characterizes spiking irregularities of individual neurons. Because the environment changes over time, observed data are generated from a model with a time-dependent firing rate, which is an unknown function. A statistical model with an unknown function is called a semiparametric model and is generally very difficult to solve. We used a novel method of estimating functions in information geometry to estimate the shape parameter without estimating the unknown function. We obtained an optimal estimating function analytically for the shape parameter independent of the functional form of the firing rate. This estimation is efficient without Fisher information loss and better than maximum likelihood estimation. We suggest a measure of spiking irregularity based on the estimating function, which may be useful for characterizing individual neurons in changing environments.
Neural Computation (2006) 18 (6): 1259–1267. Published: 01 June 2006
Abstract
The decoding scheme of a stimulus can be different from the stochastic encoding scheme in the neural population coding. The stochastic fluctuations are not independent in general, but an independent version could be used for the ease of decoding. How much information is lost by using this unfaithful model for decoding? There are discussions concerning loss of information (Nirenberg & Latham, 2003; Schneidman, Bialek, & Berry, 2003). We elucidate the Nirenberg-Latham loss from the point of view of information geometry.
Neural Computation (2006) 18 (5): 1007–1065. Published: 01 May 2006
Abstract
The parameter spaces of hierarchical systems such as multilayer perceptrons include singularities due to the symmetry and degeneration of hidden units. A parameter space forms a geometrical manifold, called the neuromanifold in the case of neural networks. Such a model is identified with a statistical model, and a Riemannian metric is given by the Fisher information matrix. However, the matrix degenerates at singularities. Such a singular structure is ubiquitous not only in multilayer perceptrons but also in the gaussian mixture probability densities, ARMA time-series model, and many other cases. The standard statistical paradigm of the Cramér-Rao theorem does not hold, and the singularity gives rise to strange behaviors in parameter estimation, hypothesis testing, Bayesian inference, model selection, and in particular, the dynamics of learning from examples. Prevailing theories so far have not paid much attention to the problem caused by singularity, relying only on ordinary statistical theories developed for regular (nonsingular) models. Only recently have researchers remarked on the effects of singularity, and theories are now being developed. This article gives an overview of the phenomena caused by the singularities of statistical manifolds related to multilayer perceptrons and gaussian mixtures. We demonstrate our recent results on these problems. Simple toy models are also used to show explicit solutions. We explain that the maximum likelihood estimator is no longer subject to the gaussian distribution even asymptotically, because the Fisher information matrix degenerates, that the model selection criteria such as AIC, BIC, and MDL fail to hold in these models, that a smooth Bayesian prior becomes singular in such models, and that the trajectories of dynamics of learning are strongly affected by the singularity, causing plateaus or slow manifolds in the parameter space. The natural gradient method is shown to perform well because it takes the singular geometrical structure into account. The generalization error and the training error are studied in some examples.
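The degeneration of the Fisher information matrix at a singular point is easy to see in a toy model (the two-hidden-unit regression model, unit gaussian noise, and parameter values below are my own illustration, not an example from the article): when the two hidden units coincide, the Monte Carlo FIM loses rank.

```python
import numpy as np

rng = np.random.default_rng(0)

def fim(params, n_samples=200_000):
    """Monte Carlo FIM for y = w1*tanh(a1*x) + w2*tanh(a2*x) + unit gaussian noise, x ~ N(0,1)."""
    a1, a2, w1, w2 = params
    x = rng.standard_normal(n_samples)
    g = np.stack([w1 * (1 - np.tanh(a1 * x) ** 2) * x,    # d(mean output)/d a1
                  w2 * (1 - np.tanh(a2 * x) ** 2) * x,    # d/d a2
                  np.tanh(a1 * x),                        # d/d w1
                  np.tanh(a2 * x)])                       # d/d w2
    return g @ g.T / n_samples

print(np.linalg.eigvalsh(fim([1.0, 0.3, 0.5, -0.4])))   # regular point: all eigenvalues positive
print(np.linalg.eigvalsh(fim([0.7, 0.7, 0.5, -0.4])))   # singular point a1 = a2: two zero eigenvalues
```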
Neural Computation (2006) 18 (3): 545–568. Published: 01 March 2006
Abstract
In examining spike trains, different models are used to describe their structure. The different models often seem quite similar, but because they are cast in different formalisms, it is often difficult to compare their predictions. Here we use the information-geometric measure, an orthogonal coordinate representation of point processes, to express different models of stochastic point processes in a common coordinate system. Within such a framework, it becomes straightforward to visualize higher-order correlations of different models and thereby assess the differences between models. We apply the information-geometric measure to compare two similar but not identical models of neuronal spike trains: the inhomogeneous Markov and the mixture of Poisson models. It is shown that they differ in the second- and higher-order interaction terms. In the mixture of Poisson model, the second- and higher-order interactions are of comparable magnitude within each order, whereas in the inhomogeneous Markov model, they have alternating signs over different orders. This provides guidance about what measurements would effectively separate the two models. As newer models are proposed, they also can be compared to these models using information geometry.
Neural Computation (2005) 17 (10): 2215–2239. Published: 01 October 2005
Abstract
Two issues concerning the application of continuous attractors in neural systems are investigated: the computational robustness of continuous attractors with respect to input noises and the implementation of Bayesian online decoding. In a perfect mathematical model for continuous attractors, decoding results for stimuli are highly sensitive to input noises, and this sensitivity is the inevitable consequence of the system's neutral stability. To overcome this shortcoming, we modify the conventional network model by including extra dynamical interactions between neurons. These interactions vary according to the biologically plausible Hebbian learning rule and have the computational role of memorizing and propagating stimulus information accumulated with time. As a result, the new network model responds to the history of external inputs over a period of time, and hence becomes insensitive to short-term fluctuations. Also, since dynamical interactions provide a mechanism to convey the prior knowledge of stimulus, that is, the information of the stimulus presented previously, the network effectively implements online Bayesian inference. This study also reveals some interesting behavior in neural population coding, such as the trade-off between decoding stability and the speed of tracking time-varying stimuli, and the relationship between neural tuning width and the tracking speed.