## Abstract

While Shannon's mutual information has widespread applications in many disciplines, for practical applications it is often difficult to calculate its value accurately for high-dimensional variables because of the curse of dimensionality. This article focuses on effective approximation methods for evaluating mutual information in the context of neural population coding. For large but finite neural populations, we derive several information-theoretic asymptotic bounds and approximation formulas that remain valid in high-dimensional spaces. We prove that optimizing the population density distribution based on these approximation formulas is a convex optimization problem that allows efficient numerical solutions. Numerical simulation results confirmed that our asymptotic formulas were highly accurate for approximating mutual information for large neural populations. In special cases, the approximation formulas are exactly equal to the true mutual information. We also discuss techniques of variable transformation and dimensionality reduction to facilitate computation of the approximations.

## 1 Introduction

Shannon's mutual information (MI) provides a quantitative characterization of the association between two random variables by measuring how much knowing one of the variables reduces uncertainty about the other (Shannon, 1948). Information theory has become a useful tool for neuroscience research (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Borst & Theunissen, 1999; Pouget, Dayan, & Zemel, 2000; Laughlin & Sejnowski, 2003; Brown, Kass, & Mitra, 2004; Quiroga & Panzeri, 2009), with applications to various problems such as sensory coding in the visual system (Eckhorn & Pöpel, 1975; Optican & Richmond, 1987; Atick & Redlich, 1990; McClurkin, Gawne, Optican, & Richmond, 1991; Atick, Li, & Redlich, 1992; Becker & Hinton, 1992; Van Hateren, 1992; Gawne & Richmond, 1993; Tovee, Rolls, Treves, & Bellis, 1993; Bell & Sejnowski, 1997; Lewis & Zhaoping, 2006) and the auditory system (Chechik et al., 2006; Gourévitch & Eggermont, 2007; Chase & Young, 2005).

One major problem encountered in practical applications of information theory is that the exact value of mutual information is often hard to compute in high-dimensional spaces. For example, suppose we want to calculate the mutual information between a random stimulus variable that requires many parameters to specify and the elicited noisy responses of a large population of neurons. In order to accurately evaluate the mutual information between the stimuli and the responses, one has to average over all possible stimulus patterns and over all possible response patterns of the whole population. This averaging quickly leads to a combinatorial explosion as either the stimulus dimension or the population size increases. This problem occurs not only when one computes MI numerically for a given theoretical model but also when one estimates MI empirically from experimental data.

Even when the input and output dimensions are not that high, an MI estimate from experimental data tends to have a positive bias due to limited sample size (Miller, 1955; Treves & Panzeri, 1995). For example, a perfectly flat joint probability distribution implies zero MI, but an empirical joint distribution with fluctuations due to finite data size appears to suggest a positive MI. The error may get much worse as the input and output dimensions increase because a reliable estimate of MI may require exponentially more data points to fill the space of the joint distribution. Various asymptotic expansion methods have been proposed to reduce the bias in an MI estimate (Miller, 1955; Carlton, 1969; Treves & Panzeri, 1995; Victor, 2000; Paninski, 2003). Other estimators of MI have also been studied, such as those based on *k*-nearest neighbor (Kraskov, Stögbauer, & Grassberger, 2004) and minimal spanning trees (Khan et al., 2007). However, it is not easy for these methods to handle the general situation with high-dimensional inputs and high-dimensional outputs.
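The positive bias described above can be reproduced with a minimal sketch. The function name `plugin_mutual_information`, the alphabet sizes, and the sample sizes are illustrative assumptions and are not taken from the cited studies. The quantity printed for comparison is the standard first-order (Miller-type) bias term for the plug-in estimator, which is only a rough guide when many bins are empty.

```python
import numpy as np

def plugin_mutual_information(x, y, n_x, n_y):
    """Plug-in MI estimate (bits) from paired samples of two discrete variables."""
    joint = np.zeros((n_x, n_y))
    np.add.at(joint, (x, y), 1.0)          # histogram of joint counts
    joint /= joint.sum()                   # empirical joint distribution
    px = joint.sum(axis=1, keepdims=True)  # empirical marginal of x
    py = joint.sum(axis=0, keepdims=True)  # empirical marginal of y
    mask = joint > 0                       # skip empty bins (0 * log 0 = 0)
    return float(np.sum(joint[mask] * np.log2(joint[mask] / (px @ py)[mask])))

rng = np.random.default_rng(0)
n_x, n_y = 8, 8                            # hypothetical alphabet sizes
mean_estimates = {}
for n_samples in (50, 500, 5000):
    estimates = []
    for _ in range(200):
        # x and y are drawn independently, so the true MI is exactly zero
        x = rng.integers(0, n_x, size=n_samples)
        y = rng.integers(0, n_y, size=n_samples)
        estimates.append(plugin_mutual_information(x, y, n_x, n_y))
    mean_estimates[n_samples] = float(np.mean(estimates))
    first_order_bias = (n_x - 1) * (n_y - 1) / (2 * n_samples * np.log(2))
    print(n_samples, mean_estimates[n_samples], first_order_bias)
```

Because the true MI is zero here, every nonzero average is pure estimation bias, and it shrinks only as the sample size grows relative to the number of bins in the joint distribution.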

For numerical computation of MI for a given theoretical model, one useful approach is Monte Carlo sampling, a convergent method that may potentially reach arbitrary accuracy (Yarrow, Challis, & Series, 2012). However, its stochastic and inefficient computational scheme makes it unsuitable for many applications. For instance, to optimize the distribution of a neural population for a given set of stimuli, one may want to slightly alter the population parameters and see how the perturbation affects the MI, but a tiny change of MI can be easily drowned out by the inherent noise in the Monte Carlo method.
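To illustrate how sampling noise can mask a small change in MI, the sketch below uses a toy scalar Gaussian channel whose MI is known in closed form. This is an assumption chosen for illustration, not the neural population model studied in this article; the function names, the sample size, and the 0.1% perturbation are likewise hypothetical.

```python
import numpy as np

def mc_mutual_information(signal_std, noise_std, n_samples, rng):
    """Monte Carlo MI estimate (nats) for r = x + noise with x ~ N(0, signal_std^2)."""
    x = rng.normal(0.0, signal_std, n_samples)
    r = x + rng.normal(0.0, noise_std, n_samples)
    total_var = signal_std**2 + noise_std**2
    # log p(r|x) and log p(r) are both Gaussian log densities
    log_p_r_given_x = -0.5 * np.log(2 * np.pi * noise_std**2) - (r - x)**2 / (2 * noise_std**2)
    log_p_r = -0.5 * np.log(2 * np.pi * total_var) - r**2 / (2 * total_var)
    return float(np.mean(log_p_r_given_x - log_p_r))

def exact_mutual_information(signal_std, noise_std):
    """Closed-form MI (nats) of the scalar Gaussian channel."""
    return 0.5 * np.log(1.0 + signal_std**2 / noise_std**2)

rng = np.random.default_rng(1)
signal_std, noise_std, n_samples = 1.0, 1.0, 10_000
perturbed_noise_std = noise_std * 1.001   # a 0.1% change in one model parameter

exact_value = exact_mutual_information(signal_std, noise_std)
true_change = exact_mutual_information(signal_std, perturbed_noise_std) - exact_value

# Spread of repeated Monte Carlo estimates at the unperturbed parameter value
estimates = np.array([mc_mutual_information(signal_std, noise_std, n_samples, rng)
                      for _ in range(100)])
mc_std = float(estimates.std(ddof=1))

print("true change in MI:", true_change)
print("Monte Carlo standard deviation:", mc_std)
```

In this toy setting the run-to-run spread of the estimate is much larger than the change in MI produced by the perturbation, so comparing two independent Monte Carlo runs would not reliably reveal even the sign of the change.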

An alternative approach is to use information-theoretic bounds and approximations to simplify calculations. For example, the Cramér-Rao lower bound (Rao, 1945) tells us that the inverse of Fisher information (FI) is a lower bound to the mean square decoding error of any unbiased decoder. Fisher information is useful for many applications partly because it is often much easier to calculate than MI (see e.g., Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Zhang & Sejnowski, 1999; Abbott & Dayan, 1999; Bethge, Rotermund, & Pawelzik, 2002; Harper & McAlpine, 2004; Toyoizumi, Aihara, & Amari, 2006).
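As a rough illustration of why FI is easy to compute, the sketch below evaluates it in closed form for a population of independent Poisson neurons and checks the result against the variance of the score function. The Poisson noise model, the Gaussian tuning curves, and all parameter values are assumptions made for this example only.

```python
import numpy as np

def tuning_curves(x, centers, width=0.5, peak_rate=20.0, baseline=1.0):
    """Mean spike counts f_n(x): Gaussian bumps on a constant baseline."""
    return baseline + peak_rate * np.exp(-(x - centers)**2 / (2 * width**2))

def tuning_derivatives(x, centers, width=0.5, peak_rate=20.0):
    """Derivatives df_n/dx of the tuning curves above."""
    return -peak_rate * (x - centers) / width**2 * np.exp(-(x - centers)**2 / (2 * width**2))

def fisher_information(x, centers):
    """FI of independent Poisson neurons: J(x) = sum_n f_n'(x)^2 / f_n(x)."""
    f = tuning_curves(x, centers)
    df = tuning_derivatives(x, centers)
    return float(np.sum(df**2 / f))

rng = np.random.default_rng(2)
centers = np.linspace(-2.0, 2.0, 20)   # hypothetical preferred stimuli
x_true = 0.3

fi = fisher_information(x_true, centers)
cramer_rao_bound = 1.0 / fi            # lower bound on the variance of an unbiased decoder

# Sanity check: FI equals the variance of the score d/dx log p(r|x)
f = tuning_curves(x_true, centers)
df = tuning_derivatives(x_true, centers)
counts = rng.poisson(f, size=(100_000, centers.size))
scores = ((counts / f - 1.0) * df).sum(axis=1)
empirical_fi = float(scores.var())

print("closed-form FI:", fi)
print("score-variance FI:", empirical_fi)
print("Cramér-Rao bound:", cramer_rao_bound)
```

The closed-form value needs one pass over the neurons, whereas an MI calculation for the same population would require averaging over all stimuli and all joint response patterns.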

A link between MI and FI has been studied by several researchers (Clarke & Barron, 1990; Rissanen, 1996; Brunel & Nadal, 1998; Sompolinsky, Yoon, Kang, & Shamir, 2001). Clarke and Barron (1990) first derived an asymptotic formula between the relative entropy and FI for parameter estimation from independent and identically distributed (i.i.d.) observations with suitable smoothness conditions. Rissanen (1996) generalized it in the framework of stochastic complexity for model selection. Brunel and Nadal (1998) presented an asymptotic relationship between the MI and FI in the limit of a large number of neurons. The method was extended to discrete inputs by Kang and Sompolinsky (2001). More general discussions about this also appeared in other papers (e.g., Ganguli & Simoncelli, 2014; Wei & Stocker, 2015). However, for finite population size, the asymptotic formula may lead to large errors, especially for high-dimensional inputs, as detailed in sections 2.2 and 4.1.

In this article, our main goal is to improve FI approximations to MI for finite neural populations, especially for high-dimensional inputs. Another goal is to discuss how to use these approximations to optimize neural population coding. In section 2, we present several information-theoretic bounds and approximation formulas and discuss the conditions under which they hold, with detailed proofs given in the appendix. We also discuss how our approximation formulas are related to other statistical estimators and information-theoretic bounds, such as the Cramér-Rao bound and van Trees' Bayesian Cramér-Rao bound (see section 3). To better apply the approximation formulas in high-dimensional input spaces, we propose some useful techniques in section 4, including variable transformation and dimensionality reduction, which may greatly reduce the computational complexity for practical applications. Finally, in section 5, we discuss how to use the approximation formulas for optimizing information transfer for neural population coding.

## 2 Bounds and Approximations for Mutual Information in Neural Population Coding

### 2.1 Mutual Information and Notations

### 2.2 Information-Theoretic Asymptotic Bounds and Approximations

#### 2.2.1 Regularity Conditions

First, we consider the following regularity conditions for and :

The regularity conditions C1 and C2 are needed to prove theorems in later sections. They are expressed in mathematical forms that are convenient for our proofs, although their meanings may seem opaque at first glance. In the following, we will examine these conditions more closely. We will use specific examples to make interpretations of these conditions more transparent.

In this article, we assume that the probability distributions and are piecewise twice continuously differentiable. This is because we need to use Fisher information to approximate mutual information, and Fisher information requires derivatives that make sense only for continuous variables. Therefore, the methods developed in this article apply only to continuous input variables or stimulus variables. For discrete input variables, we need alternative methods for approximating MI, which we will address in a separate publication.

Conditions 2.25a and 2.25b state that the first and the second derivatives of have finite values for any given . These two conditions are easily satisfied by commonly encountered probability distributions because they only require finite derivatives within , the set of allowable inputs, and derivatives do not need to be finitely bounded.

Conditions 2.25c to 2.26a constrain how the first and the second derivatives of scale with , the number of neurons. These conditions are easily met when is conditionally independent or when the noises of different neurons are independent, that is, .

We emphasize that it is possible to satisfy these conditions even when is not independent or when the noises are correlated, as we show later. Here we first examine these conditions closely, assuming independence. For simplicity, our demonstration that follows is based on a one-dimensional input variable (). The conclusions are readily generalizable to higher-dimensional inputs () because is fixed and does not affect the scaling with .

Condition 2.25e is easily satisfied under the assumption of independence. It is easy to show that this condition holds when is bounded.

In summary, conditions 2.25c to 2.26a are easy to meet when is independent. It is sufficient to satisfy these conditions when the averages of the first and second derivatives of , as well as the averages of their powers, are bounded by finite numbers for all the neurons.
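The role of independence in this scaling argument can be sketched as follows. The notation is an assumption made here for illustration: $\mathbf{r} = (r_1, \dots, r_N)$ denotes the responses of $N$ neurons, $x$ a one-dimensional stimulus, and $J_n(x)$ the Fisher information of the $n$th neuron alone. If the likelihood factorizes, the log likelihood and its derivatives are sums of single-neuron terms:

$$
p(\mathbf{r} \mid x) = \prod_{n=1}^{N} p(r_n \mid x)
\quad\Longrightarrow\quad
\frac{\partial^{k} \ln p(\mathbf{r} \mid x)}{\partial x^{k}}
= \sum_{n=1}^{N} \frac{\partial^{k} \ln p(r_n \mid x)}{\partial x^{k}},
\qquad k = 1, 2,
$$

$$
J(x) = \sum_{n=1}^{N} J_n(x) \le N \max_{n} J_n(x),
\qquad
J_n(x) = \left\langle \left( \frac{\partial \ln p(r_n \mid x)}{\partial x} \right)^{2} \right\rangle_{r_n \mid x}.
$$

Uniform bounds on the single-neuron averages therefore translate directly into growth that is at most linear in $N$ for the population-level quantities.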

^{2}. This situation is best illustrated by the familiar example of a population of neurons with correlated noises that obey a multivariate gaussian distribution: where is an invertible covariance matrix, and describes the mean responses with being the parameter vector. Using the following transformation, we obtain the independent distribution: In the special case when the correlation coefficient between any pair of neurons is a constant , , the noise covariance can be written as where is a constant, is the identity matrix, and . The desired transformation in equations 2.32 and 2.33 is given explicitly by where The new response variables defined in equations 2.32 and 2.33 now read: Now we have the derivatives: where and are finite as long as and are finite. Conditions C1 and C2 are satisfied when the derivatives and their powers are finitely bounded as shown before.
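The decorrelation step can be sketched numerically. The sketch assumes a covariance of the form variance × [(1 − c) I + c 11ᵀ], that is, equal variances and a common correlation coefficient c, and uses the symmetric inverse square root of the covariance as one possible whitening transformation. The parametrization, the function names, and the parameter values are assumptions for illustration and need not match the article's equations.

```python
import numpy as np

def uniform_correlation_covariance(n_neurons, variance, corr):
    """Covariance with equal variances and one common pairwise correlation."""
    return variance * ((1.0 - corr) * np.eye(n_neurons)
                       + corr * np.ones((n_neurons, n_neurons)))

def whitening_matrix_closed_form(n_neurons, variance, corr):
    """Inverse square root of the covariance, built from its two distinct eigenvalues."""
    alpha = 1.0 / np.sqrt(variance * (1.0 - corr))
    ratio = np.sqrt((1.0 - corr) / (1.0 - corr + n_neurons * corr))
    projector = np.ones((n_neurons, n_neurons)) / n_neurons  # projects onto (1, ..., 1)
    return alpha * (np.eye(n_neurons) - (1.0 - ratio) * projector)

def whitening_matrix_numeric(cov):
    """Inverse square root of a symmetric positive-definite matrix via eigendecomposition."""
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs @ np.diag(eigvals**-0.5) @ eigvecs.T

rng = np.random.default_rng(3)
n_neurons, variance, corr = 10, 2.0, 0.3   # hypothetical values
cov = uniform_correlation_covariance(n_neurons, variance, corr)
w_closed = whitening_matrix_closed_form(n_neurons, variance, corr)
w_numeric = whitening_matrix_numeric(cov)

# Draw correlated noise around a hypothetical mean response and decorrelate it
mean_response = np.linspace(1.0, 5.0, n_neurons)
responses = rng.multivariate_normal(mean_response, cov, size=50_000)
whitened = (responses - mean_response) @ w_closed.T
empirical_cov = np.cov(whitened, rowvar=False)

print("closed form matches eigendecomposition:", np.allclose(w_closed, w_numeric))
print("largest deviation from identity:", np.abs(empirical_cov - np.eye(n_neurons)).max())
```

For Gaussian noise, uncorrelated components are also independent, so the whitened responses form an independent population whose mean responses are the transformed tuning functions.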

The example above shows explicitly that it is possible to meet conditions C1 and C2 even when the noises of different neurons are correlated. More generally, if a nonlinear transformation exists that maps correlated random variables into independent variables, then by similar argument, conditions C1 and C2 are satisfied when the derivatives of the log likelihood functions and their powers in the new variables are finitely bounded. Even when the desired transformation does not exist or is unknown, it does not necessarily imply that conditions C1 and C2 must be violated.

While the exact mathematical conditions for the existence of the desired transformation are unclear, let us consider a specific example. If a joint probability density function can be morphed smoothly and reversibly into a flat or constant density in a cube (hypercube), which is a special case of an independent distribution, then this morphing is the desired transformation. Here we may replace the flat distribution by any known independent distribution and the argument above should still work. So the desired transformation may exist under rather general conditions.

For correlated random variables, one may use algorithms such as independent component analysis to find an invertible linear mapping that makes the new random variables as independent as possible (Bell & Sejnowski, 1997) or use neural networks to find related nonlinear mappings (Huang & Zhang, 2017). These methods do not directly apply to the problem of testing conditions C1 and C2 because they work for a given network size and further development is needed to address the scaling behavior in the large network limit .

Finally, we note that the value of the MI of the transformed independent variables is the same as the MI of the original correlated variables because of the invariance of MI under invertible transformation of marginal variables. A related discussion appears in theorem 10, which involves a transformation of the input variables rather than a transformation of the output variables as needed here.
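The invariance can be checked numerically in a case where MI has a closed form. The sketch below assumes jointly Gaussian stimulus and response variables and a linear invertible map acting on the responses only; both are simplifying assumptions for illustration, since the invariance itself also holds for nonlinear invertible maps. The function and variable names are hypothetical.

```python
import numpy as np

def gaussian_mutual_information(cov, dim_x):
    """MI (nats) between the first dim_x coordinates and the rest of a joint Gaussian."""
    _, logdet_x = np.linalg.slogdet(cov[:dim_x, :dim_x])
    _, logdet_r = np.linalg.slogdet(cov[dim_x:, dim_x:])
    _, logdet_joint = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_r - logdet_joint)

rng = np.random.default_rng(4)
dim_x, dim_r = 2, 5
dim_total = dim_x + dim_r

# Random symmetric positive-definite joint covariance of (x, r)
mixing = rng.normal(size=(dim_total, dim_total))
cov = mixing @ mixing.T + np.eye(dim_total)

# Invertible linear map on the responses: orthogonal matrix times positive scalings
orthogonal, _ = np.linalg.qr(rng.normal(size=(dim_r, dim_r)))
response_map = orthogonal @ np.diag(rng.uniform(0.5, 2.0, size=dim_r))

# Apply the map to r only, leaving x untouched
block_map = np.eye(dim_total)
block_map[dim_x:, dim_x:] = response_map
cov_transformed = block_map @ cov @ block_map.T

mi_original = gaussian_mutual_information(cov, dim_x)
mi_transformed = gaussian_mutual_information(cov_transformed, dim_x)
print(mi_original, mi_transformed)
```

The determinant of the response map enters the response covariance and the joint covariance identically, so it cancels in the MI.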

#### 2.2.2 Asymptotic Bounds and Approximations for Mutual Information

In the following we state several conclusions about the MI; their proofs are given in the appendix.

In general, we need only to assume that and are piecewise twice continuously differentiable for . In this case, lemmas 5 and 6 and theorem 7 can still be established. For more general cases, such as discrete or continuous inputs, we have also derived a general approximation formula for MI from which we can easily derive the formula for (this will be discussed in a separate paper).

### 2.3 Approximations of Mutual Information in Neural Populations with Finite Size

In the preceding section, we provided several bounds, including both lower and upper bounds, and asymptotic relationships for the true MI in the large (network size) limit. Now, we discuss effective approximations to the true MI in the case of finite . Here we consider only the case of continuous inputs (we will discuss the case of discrete inputs in another paper).

... theorem 7, then we can prove equations 2.58 and 2.59 in a manner similar to the proof of that theorem. In the special case where , (e.g., ) and , we can no longer use the asymptotic formulas in theorem 7. However, if we substitute for by choosing an appropriate such that is positive-definite and , then we can use equation 2.58 or 2.59 as the asymptotic formula.

We find that is often a good approximation of MI even for relatively small . However, we cannot guarantee that is always positive-semidefinite in equation 2.14, and as a consequence, it may happen that is very small for small , is not positive-definite, and is not a real number. In this case, is not a good approximation to but is still a good approximation. Generally, if is always positive-semidefinite, then or is a better approximation than , especially when is close to a normal distribution.

In the following, we give an example of 1D inputs. High-dimensional inputs are discussed in section 4.1.

#### 2.3.1 A Numerical Comparison for 1D Stimuli

*n*th neuron, , takes the form of circular normal or von Mises distribution where , , , with , , , and , and the centers , , , of the neurons are uniformly distributed on interval , that is, , with and . Suppose the distribution of 1D continuous input () has the form where is a constant set to and is the normalization constant. Figure 1A shows graphs of the input distribution and the tuning curves with different centers , 0, .
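A setup of this kind can be sketched as follows. Several ingredients are assumptions made for illustration because their values are not recoverable here: Poisson spike-count noise, the peak rate and concentration of the von Mises tuning curves, and a truncated gaussian-shaped input density on [−π, π) with a hypothetical width. The approximation evaluated is the commonly stated form of the Brunel and Nadal (1998) asymptotic formula for a 1D input, the input entropy plus half the average of ln(J(x)/(2πe)); it is not necessarily identical to the numbered equations of this article.

```python
import numpy as np

def von_mises_tuning(x, centers, peak_rate=10.0, concentration=2.0):
    """Mean responses f(x; theta_n) = peak_rate * exp(concentration * (cos(x - theta_n) - 1))."""
    return peak_rate * np.exp(concentration * (np.cos(x[:, None] - centers[None, :]) - 1.0))

def poisson_fisher_information(x, centers, peak_rate=10.0, concentration=2.0):
    """J(x) = sum_n f_n'(x)^2 / f_n(x) for independent Poisson neurons."""
    f = von_mises_tuning(x, centers, peak_rate, concentration)
    df = -concentration * np.sin(x[:, None] - centers[None, :]) * f
    return np.sum(df**2 / f, axis=1)

def fisher_mi_approximation(n_neurons, prior_width=np.pi / 2, n_grid=2000):
    """FI-based MI approximation (nats) for a 1D input on [-pi, pi)."""
    grid = np.linspace(-np.pi, np.pi, n_grid, endpoint=False)
    dx = 2 * np.pi / n_grid
    prior = np.exp(-grid**2 / (2 * prior_width**2))
    prior /= prior.sum() * dx                        # normalize on the grid
    centers = np.linspace(-np.pi, np.pi, n_neurons, endpoint=False)
    fisher = poisson_fisher_information(grid, centers)
    entropy = -np.sum(prior * np.log(prior)) * dx    # differential entropy of the input
    return float(entropy + 0.5 * np.sum(prior * np.log(fisher / (2 * np.pi * np.e))) * dx)

results = {n: fisher_mi_approximation(n) for n in (10, 20, 40)}
for n_neurons, value in results.items():
    print(n_neurons, value)
```

With evenly spaced centers the FI is nearly independent of the stimulus and proportional to the number of neurons, so doubling the population raises this approximation by about half of ln 2 nats.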

## 3 Statistical Estimators and Neural Population Decoding

Here we cannot directly obtain as in Brunel and Nadal (1998) when and . The simulation results in Figure 1 also show that is not a lower bound of .

By the Cramér-Rao lower bound, we know that the inverse of the FI matrix reflects the accuracy of decoding (see equation 3.2). provides some knowledge about the prior distribution ; for example, is the covariance matrix of input when is a normal distribution. is small for a flat prior (poor prior) and large for a sharp prior (good prior). Hence, if the prior is flat or poor and the knowledge about the model is rich, then the MI is governed by the knowledge of the model, which results in a small (see equation 2.64) and . Otherwise, the prior knowledge has a great influence on MI , which results in a large and .

## 4 Variable Transformation and Dimensionality Reduction in Neural Population Coding

For low-dimensional input and large , both are good approximations of MI , but for high-dimensional input , a large value of may lead to a large error of , in which case (or ) is a better approximation. It is difficult to directly apply the approximation formula when we do not have an explicit expression of or . For many applications, we do not need to know the exact value of and care only about the value of (see section 5). From equations 2.12, 2.22, and 2.78, we know that if is close to a normal distribution, we can easily approximate and to obtain and . When is not a normal distribution, we can employ a technique of variable transformation to make it closer to a normal distribution, as discussed below.

### 4.1 Variable Transformation

Figure 2A shows how the values of and vary with the input dimension and the number of neurons (with , 4, 6, , 30 and , , , ). The relative error is shown in Figure 2B. The absolute value of the relative error tends to decrease with but may grow quite large as increases. In Figure 2B, the largest absolute value of relative error is greater than , which occurs when and . Even the smallest is still greater than , which occurs when and . In this example, is a bad approximation of MI , whereas and are strictly equal to the true MI across all parameters.