A simple expression for a lower bound on Fisher information is derived for a network of recurrently connected spiking neurons that have been driven to a noise-perturbed steady state. We call this lower bound linear Fisher information, as it corresponds to the Fisher information that can be recovered by a locally optimal linear estimator. Unlike recent similar calculations, the approach used here includes the effects of nonlinear gain functions and correlated input noise and yields a surprisingly simple and intuitive expression that offers substantial insight into the sources of information degradation across successive layers of a neural network. Here, this expression is used to (1) compute the optimal (i.e., information-maximizing) firing rate of a neuron, (2) demonstrate why sharpening tuning curves by either thresholding or the action of recurrent connectivity is generally a bad idea, (3) show how a single cortical expansion is sufficient to instantiate a redundant population code that can propagate across multiple cortical layers with minimal information loss, and (4) show that optimal recurrent connectivity strongly depends on the covariance structure of the inputs to the network.
The brain encodes many variables, such as the color of objects and the direction of arm movements, through the concerted activity of populations of noisy spiking neurons, a type of code known as a population code. Understanding these population codes is a key step toward developing neural theories of computation, learning, and information transmission. A natural measure for characterizing the information content of a code when dealing with continuous variables is Fisher information (Abbott & Dayan, 1999). Fisher information is inversely proportional to the smallest change in the encoded stimulus that can be discriminated from the neuronal responses. This measure can be used to explore how to optimize population codes, that is, how to wire neural circuits to maximize Fisher information.
In many population codes, the tuning curves of the neurons, that is, their average responses as a function of a real-valued stimulus (denoted s), follow gaussian functions of s. Several studies have investigated how to optimize the parameters of these tuning curves, such as the height and width, when s is a scalar variable (Seung & Sompolinsky, 1993) as well as when s is a multidimensional vector (Zhang & Sejnowski, 1999). These studies have argued that for scalar s, the brain should use high-amplitude, narrow tuning curves to optimize information transmission and that learning should seek to reduce the width of the tuning curve as a way to improve behavioral performance (Somers, Nelson, & Sur, 1995; Spitzer, Desimone, & Moran, 1988; Murray & Wojciulik, 2004; Schoups, Vogels, Qian, & Orban, 2001; Teich & Qian, 2003).
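As a quick numerical illustration of this scalar-s result (an addition here, not part of the original analysis): for independent Poisson neurons with gaussian tuning curves, Fisher information takes the classic form I(s) = Σᵢ f′ᵢ(s)²/fᵢ(s), and for a dense population it indeed grows as the tuning width shrinks. All parameter values below (peak rate, tuning-curve spacing) are illustrative.

```python
import numpy as np

def fisher_info_indep_poisson(s, centers, r_max, sigma):
    """Fisher information at stimulus s for a population of independent
    Poisson neurons with gaussian tuning curves:
    I(s) = sum_i f_i'(s)^2 / f_i(s)."""
    f = r_max * np.exp(-(s - centers) ** 2 / (2 * sigma ** 2))
    df = f * (centers - s) / sigma ** 2   # derivative of the gaussian tuning curve
    return np.sum(df ** 2 / f)

# Dense population of tuning curves tiling the stimulus range.
centers = np.linspace(-10, 10, 201)
for sigma in (0.5, 1.0, 2.0):
    print(sigma, fisher_info_indep_poisson(0.0, centers, r_max=20.0, sigma=sigma))
```

For a dense 1D population the sum approaches an integral and I scales as 1/σ, so halving the width roughly doubles the information, consistent with the narrow-is-better conclusion for scalar s.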
This conclusion, however, was derived under the assumption that neurons generate independent Poisson spike counts. This is a problem because neurons in vivo are correlated (Zohary, Shadlen, & Newsome, 1994), and correlations can have a significant impact on Fisher information (Abbott & Dayan, 1999; Yoon & Sompolinsky, 1998; Sompolinsky, Yoon, Kang, & Shamir, 2001; Wilke & Eurich, 2002; Wu, Nakahara, & Amari, 2001). These researchers investigated the effects of correlations by considering a variety of physiologically inspired parameterizations of covariance matrices, but they did not consider how a network of spiking neurons might generate these covariance structures. To address this issue, we need an expression for Fisher information in a recurrently connected network of spiking neurons. For scalar variables, such an expression has been recently derived by Toyoizumi, Aihara, and Amari (2006), but only for a single layer of noisy neurons driven by a noiseless or deterministic function of the stimuli of interest. This is a serious limitation for two reasons. First, their approach cannot be applied to a situation in which there is a layered architecture and the quantity of interest is the information content of the final layer. This is because although the first layer of noisy neurons might be driven by a signal that is a deterministic function of the stimulus, the subsequent layers are necessarily driven by a noise-corrupted version of that signal. Second, stimulus-dependent, noiseless inputs convey an infinite amount of Fisher information (assuming invertible transformations), while Fisher information is necessarily finite in the nervous system. For instance, given the image of a contour, it is not possible to know its orientation with infinite precision if only because of noise in the physical world and the noise in the photoreceptors. Indeed, one could argue that the quantity of interest here is information loss between two layers of cortex. 
When the input layer is a deterministic function of the stimulus, as it is in the case worked out by Toyoizumi et al. (2006), information loss is infinite. Thus, to model the effects of finite input information and information loss between layers of cortex, we require an expression for Fisher information in a network of noisy neurons that is driven by a noisy input layer.
Here, we expand on previous work that estimated the second-order statistics of spiking neural networks and derive a simple and intuitive expression for a lower bound on Fisher information in a network of spiking neurons (more specifically, linear, nonlinear, Poisson neurons; see below) that fire in response to noisy input spike trains with finite information content. This lower bound corresponds to what we call linear Fisher information, which is the fraction of Fisher information about a stimulus s that can be recovered by a locally optimal linear estimator (i.e., the linear operation on neural activity that can best discriminate between s and s + δs, where δs is small). In practice, linear Fisher information has been found to provide a tight bound on total Fisher information, in simulations (Seriès, Latham, & Pouget, 2004) and in vivo (Averbeck, Latham, & Pouget, 2006). Consequently, this expression provides a relevant and valuable tool for investigating fundamental questions regarding the computational properties of rate-based population codes.
2. Linear, Nonlinear, Poisson Neurons
The spike response model (SRM) or linear-nonlinear-Poisson (LNP) model of neural activity has become a popular model of spiking neural activity (Gerstner & Kistler, 2002), due in part to its computational simplicity, the ease with which it can be unambiguously fit to neural data (Paninski, 2004), and its ability to approximate more complicated integrate-and-fire neurons (Plesser & Gerstner, 2000). Here we consider an output layer of such LNP neurons with lateral connections, receiving spike trains from an input layer of spiking neurons. Each neuron j in the input layer generates a spike train x_j(t) according to a stationary stochastic process with stimulus-dependent mean ⟨x(t)⟩_s and covariance Σ_x(s). Here ⟨·⟩_s indicates an average conditioned on the value of the stimulus of interest, s.
Spikes are then generated from an inhomogeneous Poisson process with rate ν_i(t) = g(u_i(t)) R(t − t_i*), where the membrane potential proxy u_i(t) is built from the feedforward drive (Mx)_i and the recurrent drive (Wy)_i. Here, the gain function, g(u), is a monotonically increasing, nonnegative function; t_i* is the time of the last spike of neuron i, so that R(t − t_i*) models the refractoriness of the neuron (in this work, R is either a constant or a function that suppresses firing immediately after a spike). We use the notation y_i(t) to denote the spike train for output neuron i and y(t) to refer to the vector of spike trains from all output neurons.
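A minimal discrete-time simulation of a single such neuron might look as follows. The specific choices here (exponential gain, an absolute-refractory rule for R, the time step, and the 20 Hz target rate) are illustrative assumptions, not the parameters of the original model:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_lnp(drive, dt=1e-3, g=np.exp, tau_ref=2e-3):
    """Sketch of one LNP neuron: Poisson spiking with rate g(u) * R,
    where R is a hypothetical absolute-refractory term (0 within
    tau_ref of the last spike, 1 otherwise)."""
    spikes = np.zeros_like(drive)
    t_last = -np.inf
    for k, u in enumerate(drive):
        t = k * dt
        R = 0.0 if (t - t_last) < tau_ref else 1.0
        rate = g(u) * R                    # instantaneous Poisson rate (Hz)
        if rng.random() < rate * dt:       # thin-interval Bernoulli approximation
            spikes[k] = 1.0
            t_last = t
    return spikes

# Constant drive chosen so that g(u) = exp(u) is about 20 Hz.
u0 = np.log(20.0)
spikes = simulate_lnp(np.full(10_000, u0))
print("mean rate (Hz):", spikes.sum() / (10_000 * 1e-3))
```

The measured rate falls slightly below g(u0) because the refractory term discards spikes that would otherwise occur shortly after a previous one.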
3. Linear Fisher Information
We define linear Fisher information as I_lin(s) = f′(s) · Σ(s)⁻¹ · f′(s), where f(s) and Σ(s) are the stimulus-dependent mean and covariance matrix of y(t), and the notation (·) is meant to indicate a dot or inner product. This corresponds to the part of Fisher information that can be inferred from the variance of the locally optimal linear estimator of the stimulus (Seriès et al., 2004) under the condition that the Cramér-Rao bound is attainable (Wu, Amari, & Nakahara, 2002).
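As a sketch of how this quantity can be estimated from data (the finite-difference derivative and pooled-covariance choices below are ours, not prescribed by the text), one can compare response samples collected at s and at s + δs. The toy check uses independent Poisson counts, for which the exact answer I = Σᵢ f′ᵢ²/fᵢ is known:

```python
import numpy as np

rng = np.random.default_rng(1)

def linear_fisher_info(samples_a, samples_b, ds):
    """Plug-in estimate of linear Fisher information,
    I_lin = f'(s) . Sigma(s)^{-1} . f'(s), from response samples at
    s and s + ds (finite-difference derivative, pooled covariance)."""
    df = (samples_b.mean(0) - samples_a.mean(0)) / ds
    cov = 0.5 * (np.cov(samples_a.T) + np.cov(samples_b.T))
    return df @ np.linalg.solve(cov, df)

# Toy check with independent Poisson counts and linear tuning curves.
slopes = np.array([3.0, -1.0, 2.0])
f = lambda s: 10.0 + slopes * s
s, ds, n = 1.0, 0.1, 500_000
a = rng.poisson(f(s), size=(n, 3))
b = rng.poisson(f(s + ds), size=(n, 3))
exact = np.sum(slopes ** 2 / f(s + ds / 2))
print(linear_fisher_info(a, b, ds), exact)
```

With enough samples the plug-in estimate converges to the exact independent-Poisson value.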
In general, Fisher information contains other terms in addition to the linear term. For instance, when p(y|s) is a multivariate gaussian distribution, there is a second term, the so-called trace term, that reflects the information content that results from a stimulus-dependent covariance matrix under the gaussian assumption. In theory, this term can contain a large fraction of the information, particularly when the covariance matrix depends on the stimulus (Shamir & Sompolinsky, 2004). Nonetheless, we chose to focus on the linear term because it provides a tight bound on total Fisher information in both simulations (Seriès et al., 2004) and in vivo (Averbeck et al., 2006). Moreover, the trace term is applicable only when the stimulus-conditioned population response is, in fact, gaussian distributed. In general, this assumption of gaussianity may not hold and should be tested. Such a test requires knowledge of the third, fourth, and possibly higher moments. Unfortunately, a theory of correlations of neural networks is agnostic as to moments higher than the second. Thus, such a theory can be used only to estimate linear Fisher information.
When the refractory term is present, the procedure for estimating the linearized (effective) gain is essentially the same. Indeed, because we seek a linear transfer function, it is helpful to consider the estimation in the Fourier domain, henceforth indicated by a tilde. In this domain, the estimation is nothing more than the estimation of the frequency response function of a neuron driven by some noise-perturbed membrane potential proxy. This procedure can be performed, numerically if necessary, for any nonlinearity in the gain function or any refractory term (Chacron et al., 2005; Gerstner & Kistler, 2002). The procedure is straightforward: one can simply drive a neuron with some mean input u₀ perturbed by white noise with variance σ² to obtain the statistics of y0(t). The frequency response function can then be obtained by adding to the drive a small oscillatory component with frequency ω.
If the feedforward connectivity matrix, M, is invertible and we set the effective-noise term to zero, equation 4.12 reduces to the linear Fisher information in the input layer. Therefore, these two quantities, M and the effective noise, control the amount of information lost between the input and output layers. The second term can be interpreted as noise, with variance given by the ratio of each neuron's mean firing rate to the square of its linearized gain, that is added to the feedforward afferences or inputs of the network, (Mx).
At first sight, it would appear that the recurrent connectivity, W, has no impact on information loss since it does not appear in equation 4.12. However, recurrent connectivity does affect information loss implicitly by modulating the shape of the steady-state tuning curve. This in turn affects the noise added to the feedforward afferences by modulating the matrices of gain values and gain slopes (g and g′), both of which store quantities evaluated at the steady-state mean activity.
Another important point to emphasize is that equation 4.12 is asymptotically valid for any pattern of feedforward and recurrent connectivity that scales as O(1/N), that is, when the variance of the membrane potential proxy is small. Moreover, this expression can be used regardless of the function that is being computed between the input and output layers. For instance, with a proper choice of connectivity (i.e., a proper choice of M and W), it is easy to build a network in which the input layer contains neurons with gaussian tuning curves to s and in which the output layer contains optimal tuning curves for some other variable of interest, z = h(s), where h(s) is a nonlinear function of s. The Fisher information about z in the output layer can still be obtained from an equation of the form of equation 4.12 but with the prime now indicating derivatives with respect to z.
As a result, this expression can be used to explore how tuning curve parameters like width influence information content regardless of the dimensionality of s. This issue has been studied in the past but only for independent noise (Zhang & Sejnowski, 1999; Brown & Backer, 2006).
We then computed the percentage of information preserved in the output layer (obtained from the ratio of linear Fisher information in the output layer to the same quantity in the input layer). This was done by computing the variance of the unbiased locally optimal linear estimator applied to output spikes (Seriès et al., 2004). This empirically observed quantity was then compared to the prediction obtained from equation 4.12. Figure 1 confirms that this expression does indeed provide a very tight bound on linear Fisher information over a wide range of network parameters and activation functions. Curiously, though neglected in the derivation, this expression seems to hold even when refractory effects are included, provided g and g′ are simply modulated by the expected value of the refractory term R, that is, g → g⟨R⟩ and g′ → g′⟨R⟩. This may be due to the stationary statistics of the input process x(t), but further investigation is needed to explain this coincidence. We also tested whether there is a significant fraction of information beyond the linear term by estimating information with a nonlinear decoder, namely, a support vector machine with radial basis function kernels (SVM-RBF). We found that these discrete classification algorithms extract less than 3.4% more information than the locally optimal linear estimator does. The comparison between these discrete classification algorithms and Fisher information was performed by mapping the percentage of correct classification onto the equivalent value of d-prime, which is related to the square root of Fisher information (Dayan & Abbott, 2001).
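For concreteness, the mapping from percent correct to d′, and from d′ to Fisher information, can be written as follows. The conventions assumed here (unbiased binary discrimination, pc = Φ(d′/2), and the local gaussian relation d′ = δs·√I) are common choices, not ones specified in the text:

```python
from statistics import NormalDist

def dprime_from_percent_correct(pc):
    """Map percent correct in an unbiased binary discrimination onto d',
    using the signal-detection convention pc = Phi(d'/2)."""
    return 2 * NormalDist().inv_cdf(pc)

def fisher_from_dprime(dprime, ds):
    """For a locally gaussian code, d' = ds * sqrt(I), so I = (d'/ds)^2."""
    return (dprime / ds) ** 2

d = dprime_from_percent_correct(0.76)
print(d, fisher_from_dprime(d, ds=0.1))
```

Different task conventions (e.g., 2AFC, where pc = Φ(d′/√2)) change the first mapping by a constant factor, so the convention must match the classifier's task.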
Next, we describe a few applications of the expression for Fisher information derived above.
4.1. Optimal Firing Rate and Recurrent Sharpening.
We saw that the information loss in equation 4.12 is controlled by two terms. The second term, the effective added noise, is the ratio of the mean firing rate of a given output neuron to the square of the sensitivity of the neuron as described by its linear transfer function. In the absence of a refractory term, this takes the form g(u)/g′(u)². For an exponential gain function, g′ = g, in which case g/g′² = 1/g. From the perspective of a single neuron, this implies that a higher firing rate always preserves more information, with information loss becoming exponentially large as output firing rates go to zero. The opposite result holds for a rectified linear gain function. Indeed, in this case, g′ is equal to a constant and g/g′² ∝ g. Therefore, somewhat counterintuitively, with a linear activation function, the more a given neuron fires, the less information it transmits.
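The two cases can be checked in a couple of lines (the range of drives is an illustrative choice):

```python
import numpy as np

# Effective added noise per neuron: g(u) / g'(u)^2 (no refractory term).
u = np.linspace(0.1, 5.0, 50)

# Exponential gain: g = exp(u), g' = g, so g/g'^2 = 1/g, which
# decreases with firing rate (more spikes, less added noise).
noise_exp = np.exp(u) / np.exp(u) ** 2

# Rectified-linear gain (on its linear branch): g = u, g' = 1, so
# g/g'^2 = g, which grows with firing rate.
noise_lin = u / 1.0 ** 2

print(noise_exp[0] > noise_exp[-1], noise_lin[0] < noise_lin[-1])
```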
However, if one uses a linear gain function with LNP neurons, the effective noise-perturbed gain function, ⟨g⟩, is exponential for a weak drive and linear for a strong drive (Gerstner & Kistler, 2002; see Figure 2a). In this case, there is a firing rate that minimizes the effective added noise, g/g′², because this quantity is large at both low rates (the exponential regime) and high rates (the linear regime) and so reaches a minimum in between. A neuron that fires at the rate corresponding to this minimum can be said to be firing at its optimal firing rate. For the particular gain function that was used to create Figure 2b, this optimal firing rate is around 10 Hz. More generally, we can conclude that optimal (information-maximizing) firing rates occur where the gain function has positive curvature and satisfies g′² = 2gg″.
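A numerical sketch with a softplus gain, g(u) = log(1 + eᵘ) (our illustrative choice; the particular gain used for Figure 2 is not reproduced here), locates this optimum and verifies the stationarity condition g′² = 2gg″, which follows from setting the derivative of g/g′² to zero:

```python
import numpy as np

# Effective added noise g/g'^2 for a softplus gain g(u) = log(1 + e^u):
# it grows without bound for both weak drive (exponential branch) and
# strong drive (linear branch), so an information-maximizing drive
# exists in between.
u = np.linspace(-5, 10, 100_001)
g = np.log1p(np.exp(u))
g1 = 1.0 / (1.0 + np.exp(-u))      # g'
g2 = g1 * (1.0 - g1)               # g''
noise = g / g1 ** 2

k = np.argmin(noise)
print("optimal drive u*:", u[k])
print("optimal rate g(u*):", g[k])
# At the minimum, the stationarity condition g'^2 = 2 g g'' should hold.
print("g'^2 / (2 g g''):", g1[k] ** 2 / (2 * g[k] * g2[k]))
```

For this gain the minimum sits on the curved part of g, where the curvature g″ is still appreciable, in line with the positive-curvature condition in the text.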
This single neuron result can also be used at the network level to show why severe sharpening of input tuning curves may be a particularly bad idea. Consider, for example, the simple case depicted in Figure 2. Here the feedforward afferences, Mx(t), are assumed to be independent and Poisson with broad tuning curves (the dash-dot lines in Figures 2c and 2d). Since the afferences are independent, the contribution of each afference to Fisher information in the inputs is given by the ratio of the square of the derivative (with respect to the stimulus) of the mean drive to the mean drive of the afference. These contributions are represented by the dash-dot curves in Figures 2e and 2f. Note that, as usual, the most informative inputs are those that correspond to the largest slope of the input tuning curves. Now consider two networks: one that sharpens the tuning curves in the output layer (dashed lines in Figure 2c) and one that does not (Figure 2d). The solid line in Figures 2e and 2f shows the effective noise added to each input afference by the Poisson step in the output layer of both networks (corresponding to the term g/g′²). This effective noise determines the fraction of input information (dash-dot curve in Figures 2e and 2f) that will be conveyed in the output spike trains (dashed curve in Figures 2e and 2f). The effective noise is minimal for neurons with an output firing rate close to the optimal value of 10 Hz, as shown in Figure 2b.
To optimize information transmission for the entire population of neurons, the effective added noise should be smallest for neurons receiving the most informative inputs. In the no-sharpening network, this is indeed the case. The effective noise is small (solid curve in Figure 2f) when the input information is high (dash-dot curve in Figure 2f). For the sharpening network, this is no longer the case. A large amount of noise is added to neurons that receive highly informative inputs, resulting in large information loss. This is due to the fact that as firing rates go toward zero, the effective noise scales like one over the firing rate. Indeed, by computing the ratio of the information in the input population to that in the output population, we found that the sharpening network transmits only 27% of the information it receives compared to 49% for the nonsharpening network.
Note also that for a given gain function g(u), this result holds regardless of the specific mechanism by which the sharpening occurs and regardless of the specific spatiotemporal covariance structure induced in the output layer. Note, however, that we are not saying that sharpening is always inefficient. As we will see next, a small amount of sharpening can in fact be helpful; it is severe sharpening that generally destroys information.
4.2. Cortical Expansion, Redundant Codes, and Balanced Excitation and Inhibition.
Adding more neurons to a given layer is a well-known way to decrease information loss, but this expression allows us to quantify precisely the impact of the number of neurons on information loss. For example, suppose that each layer is divided into subpopulations of neurons with identical tuning curves and gain functions and that each subpopulation has K neurons in the input layer and N neurons in the output layer. In this case, averaging over the identically tuned neurons results in an effective noise-added term that scales like 1/N. This indicates that increasing the number of neurons in the output layer, N, while keeping the number of input neurons, K, fixed has the effect of decreasing information loss by an amount proportional to 1/N. Equation 4.12 also indicates that even when K = N, near-perfect information preservation can be achieved as long as the code in the input layer is redundant (i.e., the information is small compared to the number of neurons) and each of the output neurons fires sufficiently close to its optimal firing rate. This is because neurons firing at their optimal rate effectively place a bound on the added-noise term, g/g′². When linear Fisher information is small compared to the number of stimulus-tuned neurons in a large network, the eigenvalues of the associated covariance matrix must be large. As a result, the eigenvalues of the covariance of the feedforward afferences must also be large. When they are sufficiently large compared to the eigenvalues of the matrix that describes the effective added noise, then this term can be neglected and the linear Fisher information in the output layer will be very close to the linear Fisher information in the input afferences.
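The 1/N behavior can be illustrated with a toy subpopulation: a shared gaussian input, a linear gain, and N identically tuned Poisson output neurons (deliberate simplifications, not the full recurrent network of the text). The linear Fisher information of the pooled output approaches that of the noisy input itself as N grows, because the Poisson added noise is averaged away while the shared input noise is not:

```python
import numpy as np

rng = np.random.default_rng(3)

def pooled_info(N, s=2.0, ds=0.05, sigma=0.2, a=10.0, n=50_000):
    """Linear Fisher information in the pooled response of N identically
    tuned Poisson neurons driven by a shared noisy input v ~ N(s, sigma^2)
    through a linear gain g(v) = a*v (a toy stand-in for one subpopulation)."""
    def sample(stim):
        v = stim + sigma * rng.standard_normal(n)
        counts = rng.poisson(np.clip(a * v, 0, None), size=(N, n))
        return counts.mean(0)   # pooled (averaged) response
    ya, yb = sample(s), sample(s + ds)
    df = (yb.mean() - ya.mean()) / ds
    var = 0.5 * (ya.var() + yb.var())
    return df ** 2 / var

info_in = 1.0 / 0.2 ** 2  # Fisher information in the noisy input itself
for N in (1, 10, 100):
    print(N, pooled_info(N), "input:", info_in)
```

The residual gap at large N reflects the shared input noise, which no amount of output pooling can remove: information preservation, not information creation.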
This is very convenient as it implies that a single layer of cortical expansion (i.e., a large increase in the number of neurons in the primary sensory cortical areas) is sufficient to instantiate a redundant code, which can then be propagated with small information loss across multiple layers, each of which has the same number of noisy neurons. To see why, consider a single cortical expansion for which the first layer consists of M independent Poisson neurons tuned to s. As a result, the information in the network scales like M. Suppose these neurons project to an output layer that has many more neurons, N ≫ M. As previously indicated, reasonable constraints on the activity of the output neurons are sufficient to ensure that linear Fisher information is nearly perfectly preserved. But we have actually accomplished more than simple information preservation. We have also instantiated a redundant code in the output layer. This is because the amount of information contained in the network is small (order M) compared to the information capacity of the network, which is order N. This means that the eigenvalues of the covariance matrix of the output layer must be large. Thus, if these output neurons are then used to drive another layer of N, similarly tuned neurons, the covariance of the input afferences to this third layer will satisfy the large eigenvalue condition necessary to ensure that information, once again, is nearly perfectly preserved.
This result indicates that greater information preservation can be accomplished by simply increasing the magnitude of the feedforward connection strengths, M (and thus increasing the magnitude of the eigenvalues of the covariance of the feedforward afferences), while manipulating recurrent connections to keep the neurons driven by the most informative inputs near the optimal rates. Since connections between cortical areas are excitatory, local recurrent inhibition would be needed to accomplish this. This, then, provides a simple explanation for the information benefit of recurrent networks that balance large excitatory inputs with local recurrent inhibition, a widely observed property of cortical circuits (Marino et al., 2005).
4.3. Optimal Connectivity and Correlations.
Finally, the framework described here may be used to demonstrate that optimal connectivity and tuning curve shape depend strongly on the correlations in the input layer. This is illustrated in Figure 3, which shows optimal recurrent connectivity and tuning curve shape in an orientation hypercolumn for two cases: one in which the input population consists of independent neurons and one in which the input neurons are locally positively correlated. In both cases, optimal connectivity was computed by gradient ascent applied to the recurrent weight matrix to maximize Fisher information in response to noisy images of oriented gratings across multiple contrast and image noise levels. Feedforward connectivity was held fixed, as were the parameters of the gain function (once again, certain partial derivatives were small and so were ignored). Here, recurrent connections act to ensure that the most informative inputs drive neurons that fire close to the optimal rate, that is, the rate at which the effective added noise, g/g′², is minimized for this particular choice of nonlinear gain function. In the independent case, this means that output neurons that lie near the inflection point of the tuning curve fire near the optimal rate of 10 Hz, while the relatively uninformative inputs at the peak of the tuning curve fire at a much higher rate. In the positively correlated case, neurons near the peak of the tuning curve are now relatively more informative than in the independent case, while neurons near the tail are now relatively less informative. This can be observed by noting the change in the spectra of the covariance matrix of the input afferences. As a result, optimization now penalizes local excitation and has the effect of decreasing the amplitude (and, to a lesser extent, the sharpness) of the tuning curves so that the neurons close to the peak fire near the optimal rate, while the now less informative neurons in the tail have been driven below the optimal rate.
We have derived a simple expression for linear Fisher information in a network of LNP neurons with arbitrary connectivity. This expression can be used to explore the efficiency of information transmission in networks of spiking neurons computing nonlinear functions, thereby representing an important step toward elucidating the neural basis of processes such as attention and perceptual learning, which allow the nervous system to access more information regarding behaviorally relevant sensory stimuli.
This analysis is limited to linear Fisher information: the fraction of Fisher information that is recoverable by a locally optimal linear estimator. Whether this is a severe limitation remains to be seen. We have found that empirically, it is exceedingly difficult to find any information beyond the linear term in networks of spiking neurons. Moreover, the amount of data required to estimate the nonlinear contributions to Fisher information is typically prohibitively large, because one needs to estimate the third-order and higher-order statistics of spike trains. Similar issues arise with spike-timing codes, which convey information through the presence (or absence) of coincident or time-delayed coincident spikes. Such a code is present when the sufficient statistic, T(y), is influenced by these coincident spikes, as is the case when T(y) spans the space of quadratic functions of y. Since estimation of Fisher information requires an estimate of the covariance of the sufficient statistic, the analysis of such a code would require estimates of the third and fourth moments of y.
Finally, we have also assumed that the stimulus is constant over time and that the network has reached a noise-perturbed steady state. While this is sufficient to model a wide variety of behavioral experiments, there is no question that the extension of this work to time-varying stimuli would be of use. However, it is not yet clear that linearizing the Poisson spiking nonlinearity around a nontrivial dynamic state will yield an approximation comparable to that observed in the stationary case.