Abstract

While Shannon's mutual information has widespread applications in many disciplines, for practical applications it is often difficult to calculate its value accurately for high-dimensional variables because of the curse of dimensionality. This article focuses on effective approximation methods for evaluating mutual information in the context of neural population coding. For large but finite neural populations, we derive several information-theoretic asymptotic bounds and approximation formulas that remain valid in high-dimensional spaces. We prove that optimizing the population density distribution based on these approximation formulas is a convex optimization problem that allows efficient numerical solutions. Numerical simulation results confirmed that our asymptotic formulas were highly accurate for approximating mutual information for large neural populations. In special cases, the approximation formulas are exactly equal to the true mutual information. We also discuss techniques of variable transformation and dimensionality reduction to facilitate computation of the approximations.

1  Introduction

Shannon's mutual information (MI) provides a quantitative characterization of the association between two random variables by measuring how much knowing one of the variables reduces uncertainty about the other (Shannon, 1948). Information theory has become a useful tool for neuroscience research (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997; Borst & Theunissen, 1999; Pouget, Dayan, & Zemel, 2000; Laughlin & Sejnowski, 2003; Brown, Kass, & Mitra, 2004; Quiroga & Panzeri, 2009), with applications to various problems such as sensory coding in the visual system (Eckhorn & Pöpel, 1975; Optican & Richmond, 1987; Atick & Redlich, 1990; McClurkin, Gawne, Optican, & Richmond, 1991; Atick, Li, & Redlich, 1992; Becker & Hinton, 1992; Van Hateren, 1992; Gawne & Richmond, 1993; Tovee, Rolls, Treves, & Bellis, 1993; Bell & Sejnowski, 1997; Lewis & Zhaoping, 2006) and the auditory system (Chechik et al., 2006; Gourévitch & Eggermont, 2007; Chase & Young, 2005).

One major problem encountered in practical applications of information theory is that the exact value of mutual information is often hard to compute in high-dimensional spaces. For example, suppose we want to calculate the mutual information between a random stimulus variable that requires many parameters to specify and the elicited noisy responses of a large population of neurons. In order to accurately evaluate the mutual information between the stimuli and the responses, one has to average over all possible stimulus patterns and over all possible response patterns of the whole population. This averaging quickly leads to a combinatorial explosion as either the stimulus dimension or the population size increases. This problem occurs not only when one computes MI numerically for a given theoretical model but also when one estimates MI empirically from experimental data.

Even when the input and output dimensions are not that high, an MI estimate from experimental data tends to have a positive bias due to limited sample size (Miller, 1955; Treves & Panzeri, 1995). For example, a perfectly flat joint probability distribution implies zero MI, but an empirical joint distribution with fluctuations due to finite data size appears to suggest a positive MI. The error may get much worse as the input and output dimensions increase because a reliable estimate of MI may require exponentially more data points to fill the space of the joint distribution. Various asymptotic expansion methods have been proposed to reduce the bias in an MI estimate (Miller, 1955; Carlton, 1969; Treves & Panzeri, 1995; Victor, 2000; Paninski, 2003). Other estimators of MI have also been studied, such as those based on k-nearest neighbor (Kraskov, Stögbauer, & Grassberger, 2004) and minimal spanning trees (Khan et al., 2007). However, it is not easy for these methods to handle the general situation with high-dimensional inputs and high-dimensional outputs.
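This upward bias is easy to reproduce numerically. In the minimal Python sketch below (the sample size and bin counts are illustrative), two independent uniform variables have zero MI, yet the plug-in estimate computed from an empirical joint histogram is systematically positive, and the bias grows with the number of bins, roughly following the first-order correction term of Miller (1955):

```python
import numpy as np

def plugin_mi(x, y, bins):
    """Plug-in (maximum likelihood) MI estimate from an empirical 2D histogram, in nats."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz]))

rng = np.random.default_rng(0)
n_samples = 1000
for bins in (4, 8, 16):
    estimates = [plugin_mi(rng.random(n_samples), rng.random(n_samples), bins)
                 for _ in range(20)]
    # The true MI is 0 (x and y are independent), yet the estimate is biased upward,
    # approximately by (bins - 1)^2 / (2 * n_samples) nats to first order (Miller, 1955).
    print(bins, np.mean(estimates), (bins - 1) ** 2 / (2 * n_samples))
```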

For numerical computation of MI for a given theoretical model, one useful approach is Monte Carlo sampling, a convergent method that may potentially reach arbitrary accuracy (Yarrow, Challis, & Series, 2012). However, its stochastic and inefficient computational scheme makes it unsuitable for many applications. For instance, to optimize the distribution of a neural population for a given set of stimuli, one may want to slightly alter the population parameters and see how the perturbation affects the MI, but a tiny change of MI can be easily drowned out by the inherent noise in the Monte Carlo method.
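The sketch below illustrates this limitation for a toy linear gaussian model whose exact MI is available in closed form; the model, the perturbation size, and the sample counts are illustrative assumptions. The exact change in MI caused by a 0.1% gain perturbation is typically smaller than the standard deviation of a Monte Carlo estimate based on several thousand samples:

```python
import numpy as np

rng = np.random.default_rng(1)
K, N, sigma = 3, 8, 1.0
A = rng.standard_normal((N, K)) / np.sqrt(K)

def exact_mi(A):
    # Exact MI of r = A x + noise, with x ~ N(0, I_K) and noise ~ N(0, sigma^2 I_N), in nats.
    return 0.5 * np.linalg.slogdet(np.eye(K) + A.T @ A / sigma**2)[1]

def mc_mi(A, n_samples=5000):
    # Monte Carlo estimate of E[ln p(r|x) - ln p(r)], using the known gaussian p(r).
    x = rng.standard_normal((n_samples, K))
    r = x @ A.T + sigma * rng.standard_normal((n_samples, N))
    Sr = sigma**2 * np.eye(N) + A @ A.T                    # marginal covariance of r
    Sr_inv = np.linalg.inv(Sr)
    log_p_r_given_x = (-0.5 * np.sum((r - x @ A.T)**2, axis=1) / sigma**2
                       - 0.5 * N * np.log(2 * np.pi * sigma**2))
    log_p_r = (-0.5 * np.einsum('ij,jk,ik->i', r, Sr_inv, r)
               - 0.5 * (N * np.log(2 * np.pi) + np.linalg.slogdet(Sr)[1]))
    return np.mean(log_p_r_given_x - log_p_r)

A_perturbed = 1.001 * A                                    # a 0.1% gain change
print("exact change in MI:", exact_mi(A_perturbed) - exact_mi(A))
print("MC estimate std   :", np.std([mc_mi(A) for _ in range(10)]))
```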

An alternative approach is to use information-theoretic bounds and approximations to simplify calculations. For example, the Cramér-Rao lower bound (Rao, 1945) tells us that the inverse of Fisher information (FI) is a lower bound on the mean squared decoding error of any unbiased decoder. Fisher information is useful for many applications partly because it is often much easier to calculate than MI (see, e.g., Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Zhang & Sejnowski, 1999; Abbott & Dayan, 1999; Bethge, Rotermund, & Pawelzik, 2002; Harper & McAlpine, 2004; Toyoizumi, Aihara, & Amari, 2006).
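For instance, for a population of independent Poisson neurons, the FI about a scalar stimulus reduces to a simple sum over neurons, J(x) = sum_n f_n'(x)^2 / f_n(x) per unit time window, where f_n is the tuning curve of the nth neuron. The short sketch below evaluates this sum for von Mises tuning curves; the tuning parameters are illustrative and are not taken from any specific study:

```python
import numpy as np

def fisher_info_poisson(x, centers, amp=10.0, kappa=2.0, base=0.5):
    """Fisher information J(x) = sum_n f_n'(x)^2 / f_n(x) for independent Poisson
    neurons with von Mises tuning curves f_n(x) = base + amp * exp(kappa*(cos(x - c_n) - 1))."""
    f = base + amp * np.exp(kappa * (np.cos(x - centers) - 1.0))
    df = -amp * kappa * np.sin(x - centers) * np.exp(kappa * (np.cos(x - centers) - 1.0))
    return np.sum(df**2 / f)

centers = np.linspace(-np.pi, np.pi, 50, endpoint=False)   # 50 evenly spaced preferred stimuli
print(fisher_info_poisson(0.3, centers))                    # J(x) at x = 0.3 rad
```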

A link between MI and FI has been studied by several researchers (Clarke & Barron, 1990; Rissanen, 1996; Brunel & Nadal, 1998; Sompolinsky, Yoon, Kang, & Shamir, 2001). Clarke and Barron (1990) first derived an asymptotic formula relating the relative entropy to the FI for parameter estimation from independent and identically distributed (i.i.d.) observations with suitable smoothness conditions. Rissanen (1996) generalized it within the framework of stochastic complexity for model selection. Brunel and Nadal (1998) presented an asymptotic relationship between the MI and FI in the limit of a large number of neurons. The method was extended to discrete inputs by Kang and Sompolinsky (2001). More general discussions have also appeared in other papers (e.g., Ganguli & Simoncelli, 2014; Wei & Stocker, 2015). However, for finite population size, the asymptotic formula may lead to large errors, especially for high-dimensional inputs, as detailed in sections 2.2 and 4.1.
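For reference, the asymptotic relationship of Brunel and Nadal (1998), written here for a K-dimensional stimulus with prior density p(x) and FI matrix J(x) in notation that may differ slightly from later sections, reads
\[
I(X;R) \;\simeq\; H(X) + \frac{1}{2}\int p(\mathbf{x}) \,\ln \det\!\left(\frac{J(\mathbf{x})}{2\pi e}\right) d\mathbf{x}, \qquad N \to \infty .
\]
This is the Fisher-information approximation whose accuracy for finite population size and high-dimensional inputs is examined below.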

In this article, our main goal is to improve FI approximations to MI for finite neural populations, especially for high-dimensional inputs. Another goal is to discuss how to use these approximations to optimize neural population coding. We will present several information-theoretic bounds and approximation formulas and discuss the conditions under which they hold in section 2, with detailed proofs given in the appendix. We also discuss how our approximation formulas are related to other statistical estimators and information-theoretic bounds, such as the Cramér-Rao bound and van Trees' Bayesian Cramér-Rao bound (see section 3). In order to better apply the approximation formulas in high-dimensional input spaces, we propose some useful techniques in section 4, including variable transformation and dimensionality reduction, which may greatly reduce the computational complexity for practical applications. Finally, in section 5, we discuss how to use the approximation formulas for optimizing information transfer for neural population coding.

2  Bounds and Approximations for Mutual Information in Neural Population Coding

2.1  Mutual Information and Notations

Suppose the input is a $K$-dimensional vector, $\mathbf{x} = (x_1, \ldots, x_K)^T$, and the outputs of $N$ neurons are denoted by a vector, $\mathbf{r} = (r_1, \ldots, r_N)^T$. In this article, we denote random variables by uppercase letters (e.g., random variables $X$ and $R$) in contrast to their vector values $\mathbf{x}$ and $\mathbf{r}$. The MI (denoted as $I$ below) between $X$ and $R$ is defined by (Cover & Thomas, 2006)
\[
I = \iint p(\mathbf{r}, \mathbf{x}) \ln \frac{p(\mathbf{r} \,|\, \mathbf{x})}{p(\mathbf{r})} \, d\mathbf{r} \, d\mathbf{x},
\tag{2.1}
\]
where $p(\mathbf{r}, \mathbf{x}) = p(\mathbf{r} \,|\, \mathbf{x})\, p(\mathbf{x})$, and the integration symbol is for the continuous variables and can be replaced by the summation symbol for discrete variables. The probability density function (p.d.f.) of $\mathbf{r}$, $p(\mathbf{r})$, satisfies
\[
p(\mathbf{r}) = \int p(\mathbf{r} \,|\, \mathbf{x})\, p(\mathbf{x}) \, d\mathbf{x}.
\tag{2.2}
\]
The MI in equation 2.1 may also be expressed equivalently as
\[
I = H(X) - H(X \,|\, R),
\tag{2.3}
\]
where $H(X)$ is the entropy of random variable $X$,
\[
H(X) = -\int p(\mathbf{x}) \ln p(\mathbf{x}) \, d\mathbf{x},
\tag{2.4}
\]
and denotes expectation:
formula
2.5
formula
2.6
formula
2.7
Next, we introduce the following notations,
formula
2.8
formula
2.9
formula
2.10
and
formula
2.11
formula
2.12
where denotes the matrix determinant, and
formula
2.13
formula
2.14
formula
2.15
Here is the FI matrix, which is symmetric and positive-semidefinite, and and denote the first and second derivative for , respectively; that is, and . If is twice differentiable for , then
formula
2.16
We denote the Kullback-Leibler (KL) divergence as
formula
2.17
and define
formula
2.18
as the neighborhoods of and its complementary set as
formula
2.19
where is a positive number.

2.2  Information-Theoretic Asymptotic Bounds and Approximations

In the limit of a large neural population, Brunel and Nadal (1998) proposed an asymptotic relationship between MI and FI and gave a proof for the case of one-dimensional input. Another proof was given by Sompolinsky et al. (2001), although there appears to be an error in their proof when a replica trick is used (see equation B1 in their paper; their equation B5 does not follow directly from the replica trick). For large but finite , is usually a good approximation as long as the inputs are low dimensional. For high-dimensional inputs, the approximation may no longer be valid. For example, suppose is a normal distribution with mean and covariance matrix and is a normal distribution with mean and covariance matrix ,
formula
2.20
where is a deterministic matrix and is the identity matrix. The MI is given by (see Verdu, 1986; Guo, Shamai, & Verdu, 2005, for details)
formula
2.21
If , then . Notice that here, . When and , then by equation 2.21 and a matrix determinant lemma, we have
formula
2.22
and by equation 2.11,
formula
2.23
which is obviously incorrect as an approximation to . For high-dimensional inputs, the determinant may become close to zero in practical applications. When the FI matrix becomes degenerate, the regularity condition ensuring the Cramér-Rao paradigm of statistics is violated (Amari & Nakahara, 2005), in which case using as a proxy for incurs large errors.
In the following, we will show that is a better approximation of for high-dimensional inputs. For instance, for the above example, we can verify that
formula
2.24
which is exactly equal to the MI given in equation 2.21.
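This failure mode can be checked numerically. In the sketch below, the exact gaussian MI of equation 2.21 is compared with the Fisher-only approximation and with a corrected approximation that, as suggested by the discussion around equation 2.78, adds the negative Hessian of the log prior (here the inverse prior covariance) to the FI matrix; the diagonal prior covariance and the unit noise variance are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
K, sigma = 10, 1.0
Sigma_x = np.diag(rng.uniform(0.5, 2.0, K))       # prior covariance of the input x

for N in (5, 10, 50):                             # N < K makes the FI matrix degenerate
    A = rng.standard_normal((N, K)) / np.sqrt(K)
    J = A.T @ A / sigma**2                        # FI matrix (constant in x for this model)
    exact = 0.5 * np.linalg.slogdet(np.eye(K) + Sigma_x @ J)[1]
    # Fisher-only approximation: for N < K, det(Sigma_x J) = 0, so this term collapses
    # toward -infinity (numerically, a very large negative number).
    fisher_only = 0.5 * np.linalg.slogdet(Sigma_x @ J)[1]
    # Corrected approximation: add the negative Hessian of the log prior (here Sigma_x^{-1});
    # under this assumed definition it coincides with the exact value for every N.
    corrected = 0.5 * np.linalg.slogdet(Sigma_x @ (J + np.linalg.inv(Sigma_x)))[1]
    print(f"N={N:3d}  exact={exact:8.3f}  Fisher-only={fisher_only:10.3f}  corrected={corrected:8.3f}")
```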

2.2.1  Regularity Conditions

First, we consider the following regularity conditions for and :

  • C1: and are twice continuously differentiable for almost every , where is a convex set; is positive definite, and , where denotes the Frobenius norm of a matrix. The following conditions hold:
    formula
    2.25a
    formula
    2.25b
    formula
    2.25c
    formula
    2.25d
    and there exists an for such that
    formula
    2.25e
    where indicates the big-O notation.
  • C2: The following condition is satisfied,
    formula
    2.26a
    for , and there exists such that
    formula
    2.26b
    for all , and with , where denotes the probability of given .

The regularity conditions C1 and C2 are needed to prove the theorems in later sections. They are expressed in mathematical forms that are convenient for our proofs, although their meanings may seem opaque at first glance. In the following, we examine these conditions more closely and use specific examples to make their interpretations more transparent.

Remark 1.

In this article, we assume that the probability distributions and are piecewise twice continuously differentiable. This is because we need to use Fisher information to approximate mutual information, and Fisher information requires derivatives that make sense only for continuous variables. Therefore, the methods developed in this article apply only to continuous input variables or stimulus variables. For discrete input variables, we need alternative methods for approximating MI, which we will address in a separate publication.

Conditions 2.25a and 2.25b state that the first and the second derivatives of have finite values for any given . These two conditions are easily satisfied by commonly encountered probability distributions because they only require the derivatives to be finite within , the set of allowable inputs; the derivatives need not be uniformly bounded.

Remark 2.

Conditions 2.25c to 2.26a constrain how the first and the second derivatives of scale with , the number of neurons. These conditions are easily met when is conditionally independent or when the noises of different neurons are independent, that is, .

We emphasize that it is possible to satisfy these conditions even when is not independent or when the noises are correlated, as we show later. Here we first examine these conditions closely, assuming independence. For simplicity, our demonstration that follows is based on a one-dimensional input variable (). The conclusions are readily generalizable to higher-dimensional inputs () because is fixed and does not affect the scaling with .

Assuming independence, we have with , and the left-hand side of equation 2.25c becomes
formula
2.27
where the final result contains only two terms with even numbers of duplicated indices, while all other terms in the expansion vanish because any unmatched or lone index (from ) should yield a vanishing average:
formula
2.28
Thus, condition 2.25c is satisfied as long as and are bounded by some finite numbers, say, and , respectively, because now equation 2.27 should scale as . For instance, a gaussian distribution always meets this requirement because the averages of the second and fourth powers are proportional to the second and fourth moments, which are both finite. Note that the argument above works even if is not finitely bounded but scales as .
Similarly, under the assumption of independence, the left-hand side of equation 2.25d becomes
formula
2.29
where, in the second step, the only remaining terms are the squares, while all other terms in the expansion with have vanished because . Thus, condition 2.25d is satisfied as long as and are bounded so that equation 2.29 scales as .

Condition 2.25e is easily satisfied under the assumption of independence. It is easy to show that this condition holds when is bounded.

Condition 2.26a can be examined with similar arguments used for equations 2.27 and 2.29. Assuming independence, we rewrite the left-hand side of equation 2.26a as
formula
2.30
where is an even number. Any term in the expansion with an unmatched index should vanish, as in the cases of equations 2.27 and 2.29. When and are bounded, the leading term with respect to scaling with is the product of squares, as shown at the end of equation 2.30, because all the other nonvanishing terms increase more slowly with . Thus equation 2.30 should scale as , which trivially satisfies condition 2.26a.

In summary, conditions 2.25c to 2.26a are easy to meet when is independent. It is sufficient to satisfy these conditions when the averages of the first and second derivatives of , as well as the averages of their powers, are bounded by finite numbers for all the neurons.

Remark 3.
For neurons with correlated noises, if there exists an invertible transformation that maps to such that becomes conditionally independent, then conditions C1 and C2 are easily met in the space of the new variables by the discussion in remark 2. This situation is best illustrated by the familiar example of a population of neurons with correlated noises that obey a multivariate gaussian distribution:
formula
2.31
where is an invertible covariance matrix, and describes the mean responses with being the parameter vector. Using the following transformation,
formula
2.32
formula
2.33
we obtain the independent distribution:
formula
2.34
In the special case when the correlation coefficient between any pair of neurons is a constant , , the noise covariance can be written as
formula
2.35
where is a constant, is the identity matrix, and . The desired transformation in equations 2.32 and 2.33 is given explicitly by
formula
2.36
where
formula
2.37
The new response variables defined in equations 2.32 and 2.33 now read:
formula
2.38
formula
2.39
Now we have the derivatives:
formula
2.40
formula
2.41
where and are finite as long as and are finite. Conditions C1 and C2 are satisfied when the derivatives and their powers are finitely bounded as shown before.
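In practice, the decorrelating transformation of equations 2.32 and 2.33 can be implemented by multiplying the responses by an inverse square-root (or inverse Cholesky) factor of the noise covariance. A minimal numerical sketch, assuming the uniform-correlation covariance of equation 2.35 with illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(3)
N, rho, var = 6, 0.3, 1.0
C = var * ((1 - rho) * np.eye(N) + rho * np.ones((N, N)))   # uniform-correlation noise covariance
f = rng.uniform(1.0, 5.0, N)                                 # mean responses f_n(x) at some fixed x

# Whitening: r' = L^{-1} r with C = L L^T (Cholesky), so the noise of r' has identity covariance.
L = np.linalg.cholesky(C)
r = rng.multivariate_normal(f, C, size=20000)                # correlated gaussian responses
r_white = np.linalg.solve(L, r.T).T

print(np.round(np.cov(r_white.T), 2))                        # approximately the identity matrix
```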

The example above shows explicitly that it is possible to meet conditions C1 and C2 even when the noises of different neurons are correlated. More generally, if a nonlinear transformation exists that maps correlated random variables into independent variables, then by a similar argument, conditions C1 and C2 are satisfied when the derivatives of the log likelihood functions and their powers in the new variables are finitely bounded. Even if the desired transformation does not exist or is unknown, this does not necessarily imply that conditions C1 and C2 are violated.

While the exact mathematical conditions for the existence of the desired transformation are unclear, let us consider a specific example. If a joint probability density function can be morphed smoothly and reversibly into a flat or constant density in a cube (hypercube), which is a special case of an independent distribution, then this morphing is the desired transformation. Here we may replace the flat distribution by any known independent distribution and the argument above should still work. So the desired transformation may exist under rather general conditions.

For correlated random variables, one may use algorithms such as independent component analysis to find an invertible linear mapping that makes the new random variables as independent as possible (Bell & Sejnowski, 1997) or use neural networks to find related nonlinear mappings (Huang & Zhang, 2017). These methods do not directly apply to the problem of testing conditions C1 and C2 because they work for a given network size and further development is needed to address the scaling behavior in the large network limit .

Finally, we note that the value of the MI of the transformed independent variables is the same as the MI of the original correlated variables because of the invariance of MI under invertible transformations of the marginal variables. A related discussion appears in theorem 3, which involves a transformation of the input variables rather than a transformation of the output variables as needed here.

Remark 4.
Condition 2.26b is satisfied if a positive number and a positive integer exist such that
formula
2.42
for all , where
formula
2.43
and means that the matrix is negative definite. A proof is as follows.
First note that in equation 2.43, if or , then . Following Markov's inequality, condition C2 and equation A.19 in the appendix, for the complementary set of , , we have
formula
2.44
where
formula
2.45
Define the set
formula
2.46
Then it follows from Markov's inequality and equation 2.42 that
formula
2.47
Hence, we get
formula
which yields condition 2.26b.
Condition 2.42 is satisfied if there exists a positive number such that
formula
2.48
for all and . This is because
formula
2.49
Here notice that (see equation A.23).
Inequality 2.48 holds if is conditionally independent, namely, , with
formula
2.50
for all and . Consider the inequality where the equality holds when . If there is only one extreme point at for , then generally it is easy to find a set that satisfies equation 2.50, so that equation 2.26b holds.

2.2.2  Asymptotic Bounds and Approximations for Mutual Information

Let
formula
2.51
and it follows from conditions C1 and C2 that
formula
2.52
Moreover, if is conditionally independent, then by an argument similar to the discussion in remark 2, we can verify that the condition is easily met.

In the following we state several conclusions about the MI; their proofs are given in the appendix.

Lemma 1.
If condition C1 holds, then the MI has an asymptotic upper bound for integer ,
formula
2.53
Moreover, if equations 2.25c and 2.25d are replaced by
formula
2.54a
formula
2.54b
for some , where indicates the Little-O notation, then the MI has the following asymptotic upper bound for integer :
formula
2.55
Lemma 2.
If conditions C1 and C2 hold, , then the MI has an asymptotic lower bound for integer ,
formula
2.56
Moreover, if condition C1 holds but equations 2.25c and 2.25d are replaced by 2.54a and 2.54b, and inequality 2.26b in C2 also holds for , then the MI has the following asymptotic lower bound for integer :
formula
2.57
Theorem 1.
If conditions C1 and C2 hold, , then the MI has the following asymptotic equality for integer :
formula
2.58
For more relaxed conditions, suppose condition C1 holds but equations 2.25c and 2.25d are replaced by 2.54a and 2.54b, and inequality 2.26b in C2 also holds for , then the MI has an asymptotic equality for integer :
formula
2.59
Theorem 2.
Suppose and are symmetric and positive-definite. Let
formula
2.60
formula
2.61
Then
formula
2.62
where denotes the matrix trace; moreover, if is positive-semidefinite, then
formula
2.63
But if
formula
2.64
for some , then
formula
2.65
Remark 5.

In general, we need only assume that and are piecewise twice continuously differentiable for . In this case, lemmas 1 and 2 and theorem 1 can still be established. For more general cases, such as discrete or continuous inputs, we have also derived a general approximation formula for MI from which we can easily derive the formula for (this will be discussed in a separate paper).

2.3  Approximations of Mutual Information in Neural Populations with Finite Size

In the preceding section, we provided several bounds, including both lower and upper bounds, and asymptotic relationships for the true MI in the large (network size) limit. Now, we discuss effective approximations to the true MI in the case of finite . Here we consider only the case of continuous inputs (we will discuss the case of discrete inputs in another paper).

Theorem 1 tells us that under suitable conditions, we can use to approximate for a large but finite (e.g., ), that is,
formula
2.66
Moreover, by theorem 2, we know that if with positive-semidefinite or holds (see equations 2.60 and 2.64), then by equations 2.63, 2.65, and 2.66, we have
formula
2.67
Define
formula
2.68
formula
2.69
where is positive-definite and is a symmetric matrix depending on and . Suppose . If we replace by in theorem 1, then we can prove equations 2.58 and 2.59 in a manner similar to the proof of that theorem. Consider a special case where , (e.g., ) and ; then we can no longer use the asymptotic formulas in theorem 1. However, if we substitute for by choosing an appropriate such that is positive-definite and , then we can use equation 2.58 or 2.59 as the asymptotic formula.
If we assume and are positive-definite and
formula
2.70
then similar to the proof of theorem 2, we have
formula
2.71
and
formula
For large , we usually have .
It is more convenient to redefine the following quantities:
formula
2.72
formula
2.73
formula
2.74
and
formula
2.75
Notice that if is twice differentiable for and
formula
2.76
then
formula
2.77
For example, if is a normal distribution, , then
formula
2.78
Similar to the proof of theorem 2, we can prove that
formula
2.79
where
formula
2.80

We find that is often a good approximation of MI even for relatively small . However, we cannot guarantee that is always positive-semidefinite in equation 2.14, and as a consequence, it may happen that is very small for small , is not positive-definite, and is not a real number. In this case, is not a good approximation to but is still a good approximation. Generally, if is always positive-semidefinite, then or is a better approximation than , especially when is close to a normal distribution.

In the following, we give an example of 1D inputs. High-dimensional inputs are discussed in section 4.1.

2.3.1  A Numerical Comparison for 1D Stimuli

Consider the Poisson neuron model (see equation 5.7 in section 5.1 for details). The tuning curve of the nth neuron, , takes the form of a circular normal or von Mises distribution
formula
2.81
where , , , with , , , and , and the centers , of the neurons are uniformly distributed on interval , that is, , with and . Suppose the distribution of 1D continuous input () has the form
formula
2.82
where is a constant set to and is the normalization constant. Figure 1A shows graphs of the input distribution and the tuning curves with different centers , 0, .
Figure 1:

A comparison of approximations , , , and for one-dimensional input stimuli. All of them were almost equally good, even for small population size . (A) The stimulus distribution and tuning curves with different centers , 0, . (B) The values of , , , and all increase with neuron number . (C) The relative errors , , and for the results in panel B. (D) The absolute values of the relative errors , , and , with error bars showing standard deviations of repeated trials.

To evaluate the precision of the approximation formulas, we use Monte Carlo (MC) simulation to approximate MI . For MC simulation, we first sample an input from the distribution , then generate the neural response from the conditional distribution , where , 2, . The value of MI by MC simulation is calculated by
formula
2.83
where is given by
formula
2.84
and for .
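A compact implementation of this Monte Carlo procedure for a population of independent Poisson neurons with von Mises tuning curves is sketched below; the tuning parameters, the circular prior, and the sample sizes are illustrative assumptions rather than the exact settings used for Figure 1. As in equation 2.84, the marginal p(r) is approximated by averaging p(r|x) over the sampled stimuli:

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(4)
N, T = 50, 2000                                    # number of neurons, number of MC samples
centers = np.linspace(-np.pi, np.pi, N, endpoint=False)

def rates(x):
    # von Mises tuning curves (illustrative parameters); x: (S,) -> rates: (S, N)
    return 0.5 + 10.0 * np.exp(2.0 * (np.cos(x[:, None] - centers[None, :]) - 1.0))

x = rng.vonmises(0.0, 1.0, size=T)                 # stimulus samples from a circular prior p(x)
r = rng.poisson(rates(x))                          # Poisson population responses, shape (T, N)

def log_p_r_given_x(r, x):
    """log p(r|x) for independent Poisson neurons; returns a (T, S) matrix of log likelihoods."""
    lam = rates(x)                                 # (S, N)
    return (r @ np.log(lam).T - lam.sum(axis=1)[None, :]
            - gammaln(r + 1).sum(axis=1, keepdims=True))

ll = log_p_r_given_x(r, x)                         # ll[t, s] = log p(r_t | x_s)
log_p_joint = np.diag(ll)                          # log p(r_t | x_t)
log_p_marginal = np.logaddexp.reduce(ll, axis=1) - np.log(T)   # log of (1/T) sum_s p(r_t | x_s)
print("Monte Carlo MI estimate (nats):", np.mean(log_p_joint - log_p_marginal))
```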
To evaluate the accuracy of MC simulation, we compute the standard deviation,
formula
2.85
where
formula
2.86
formula
2.87
and is the th entry of the matrix with samples taken randomly from the integer set , 2, , by a uniform distribution. Here we set , and .
For different , we compare with , , and , which are illustrated in Figures 1B to 1D. Here we define the relative error of approximation, for example, for , as
formula
2.88
and the relative standard deviation
formula
2.89
Figure 1B shows how the values of , , , and change with neuron number , and Figures 1C and 1D show their relative errors and the absolute values of the relative errors with respect to . From Figures 1B to 1D, we can see that the values of , , and are all very close to one another and the absolute values of their relative errors are all very small. The absolute values are less than when and less than when . However, for high-dimensional inputs, there can be large differences among , , and in many cases (see section 4.1 for more details).

3  Statistical Estimators and Neural Population Decoding

Given the neural response elicited by the input , we may infer or estimate the input from the response. This procedure is sometimes referred to as decoding from the response. We need to choose an efficient estimator or a function that maps the response to an estimate of the true stimulus . The maximum likelihood (ML) estimator defined by
formula
3.1
is known to be asymptotically efficient in the limit of a large neural population. According to the Cramér-Rao lower bound (Rao, 1945), we have the following relationship between the covariance matrix of any unbiased estimator and the FI matrix ,
formula
3.2
where is an unbiased estimate of from the response , and means that the matrix is positive-semidefinite. Thus,
formula
3.3
The MI between and is given by
formula
3.4
where is the entropy of random variable and is the conditional entropy of random variable given . Since, for a given covariance, the maximum entropy probability distribution is gaussian, satisfies
formula
3.5
Therefore, from equations 3.4 and 3.5, we get
formula
3.6
The data processing inequality (Cover & Thomas, 2006) states that postprocessing cannot increase information, so that we have
formula
3.7

Here we cannot directly obtain as in Brunel and Nadal (1998) when and . The simulation results in Figure 1 also show that is not a lower bound of .

For biased estimators, van Trees' Bayesian Cramér-Rao bound (Van Trees & Bell, 2007) provides a lower bound:
formula
3.8
It follows from equations 2.75, 3.6, and 3.8 that
formula
3.9
formula
3.10
formula
3.11
We may also regard decoding as Bayesian inference. By Bayes' rule,
formula
3.12
According to Bayesian decision theory, if we know the response , then from the prior and the likelihood we can infer an estimate of the true stimulus , —for example,
formula
3.13
which is also called maximum a posteriori (MAP) estimation.
Consider a loss function for estimation,
formula
3.14
which is minimized when reaches its maximum. Now the conditional risk is
formula
3.15
and the overall risk is
formula
3.16
Then it follows from equations 2.3 and 3.16 that
formula
3.17
Comparing equations 2.12, 2.66, and 3.17, we find
formula
3.18
Hence, maximizing the MI (or ) means minimizing the overall risk for a given . Therefore, we can achieve optimal Bayesian inference by optimizing the MI (or ).
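As an illustration of MAP decoding (equation 3.13) for a one-dimensional population of the kind used in section 2.3.1, the sketch below maximizes the log posterior over a grid of candidate stimuli; the tuning curves and the prior are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 50
centers = np.linspace(-np.pi, np.pi, N, endpoint=False)
grid = np.linspace(-np.pi, np.pi, 721)             # candidate stimulus values

def rates(x):
    # von Mises tuning curves (illustrative parameters); returns shape (len(x), N)
    return 0.5 + 10.0 * np.exp(2.0 * (np.cos(np.atleast_1d(x)[:, None] - centers) - 1.0))

def log_prior(x):
    return np.cos(x)                               # von Mises prior with kappa = 1, up to a constant

x_true = 0.7
r = rng.poisson(rates(x_true))[0]                  # one observed response vector

lam = rates(grid)                                  # (721, N)
# Log posterior up to terms that do not depend on x: log p(r|x) + log p(x).
log_post = r @ np.log(lam).T - lam.sum(axis=1) + log_prior(grid)
x_map = grid[np.argmax(log_post)]
print("true stimulus:", x_true, "  MAP estimate:", round(float(x_map), 3))
```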

By the Cramér-Rao lower bound, we know that the inverse of the FI matrix reflects the accuracy of decoding (see equation 3.2). provides some knowledge about the prior distribution ; for example, is the covariance matrix of the input when is a normal distribution. is small for a flat prior (poor prior) and large for a sharp prior (good prior). Hence, if the prior is flat or poor and the knowledge provided by the model is rich, then the MI is governed by the model knowledge, which results in a small (see equation 2.64) and . Otherwise, the prior knowledge has a great influence on the MI , which results in a large and .

4  Variable Transformation and Dimensionality Reduction in Neural Population Coding

For low-dimensional input and large , both are good approximations of MI , but for high-dimensional input , a large value of may lead to a large error of , in which case (or ) is a better approximation. It is difficult to directly apply the approximation formula when we do not have an explicit expression of or . For many applications, we do not need to know the exact value of and care only about the value of (see section 5). From equations 2.12, 2.22, and 2.78, we know that if is close to a normal distribution, we can easily approximate and to obtain and . When is not a normal distribution, we can employ a technique of variable transformation to make it closer to a normal distribution, as discussed below.
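In one dimension, a simple transformation of this kind is the probit map y = Phi^{-1}(F(x)), which renders the transformed input exactly standard normal whenever the cumulative distribution function F of the input is available (or can be estimated). A minimal sketch, with an exponential input distribution chosen purely for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
x = rng.exponential(scale=1.0, size=100000)        # a markedly non-gaussian input

# y = Phi^{-1}(F(x)), with F the exponential CDF; y is standard normal by construction.
y = stats.norm.ppf(stats.expon.cdf(x, scale=1.0))

print("skewness before/after:       ", stats.skew(x), stats.skew(y))
print("excess kurtosis before/after:", stats.kurtosis(x), stats.kurtosis(y))
```

Because the map is invertible and differentiable, the MI between the transformed input and the responses is unchanged (see theorem 3 in section 4.1), while the transformed prior is closer to normal, which is the situation in which the approximation formulas are most accurate.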

4.1  Variable Transformation

Suppose is an invertible and differentiable mapping:
formula
4.1
, and . Let denote the p.d.f. of random variable and
formula
4.2
Then we have the following conclusions, the proofs of which are given in the appendix.
Theorem 3.
The MI is invariant under invertible transformations. More specifically, for the above invertible transformation , the MI in equation 2.1 is equal to
formula
4.3
Furthermore, suppose and fulfill the conditions C1, C2 and . Then we have
formula
4.4
formula
4.5
where is the entropy of random variable and satisfies
formula
4.6
and denotes the Jacobian matrix of ,
formula
4.7
Corollary 1.
Suppose is a normal distribution,
formula
4.8
where , for , 2, , , is a deterministic matrix, is a deterministic invertible matrix, and is an invertible and differentiable function. If also has a normal distribution, , then
formula
4.9
where
formula
4.10
formula
4.11
formula
4.12
Remark 6.

From corollary 1 and equation 2.78, we know that the approximation accuracy for is improved when we employ an invertible transformation on the input random variable to make the new random variable closer to a normal distribution (see section 4.3).

Consider the eigendecompositions of and as given by
formula
4.13
formula
4.14
where and are orthogonal matrices; and are eigenvalue matrices; and and . Then by equations 2.11 and 4.9, we have
formula
4.15
formula
4.16
and
formula
4.17
Now consider two special cases. If , then by equation 4.17, we get
formula
4.18
If , then
formula
4.19
Here , . The FI matrices and become degenerate when and .
From equations 4.18 and 4.19, we see that if either or becomes degenerate, then . This may happen for high-dimensional stimuli. For a specific example, consider a random matrix defined as follows. Here we first generate elements , (, 2, , ; , 2, , ) from a normal distribution . Then each column of matrix is normalized by . We randomly sample (set to ) image patches with size from Olshausen's natural image data set (Olshausen & Field, 1996) as the inputs. Each input image patch is centered by subtracting its mean: . Then let for , 2, , . Define matrix , and compute the eigendecomposition
formula
4.20
where is a orthogonal matrix and is a eigenvalue matrix with . Define
formula
4.21
Then
formula
4.22
The distribution of random variable can be approximated by a normal distribution (see section 4.3 for more details). When , we have
formula
4.23
formula
4.24
formula
4.25
The error of approximation is given by
formula
4.26
and the relative error for is
formula
4.27
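A sketch of the whitening construction in equations 4.20 to 4.22 is given below; random gaussian data stand in for the mean-subtracted image patches, and the patch size and sample count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(7)
K, n_patches = 16 * 16, 10000

# Stand-in for the image patches (rows); an actual experiment would use patches sampled
# from Olshausen's natural image data set.  Each patch is centered by its own mean.
X = rng.standard_normal((n_patches, K))
X -= X.mean(axis=1, keepdims=True)

# Eigendecomposition of the input covariance, as in equation 4.20.
Sigma_x = X.T @ X / n_patches
evals, U = np.linalg.eigh(Sigma_x)                 # Sigma_x = U diag(evals) U^T

# Change of variables y = diag(evals)^{-1/2} U^T x (equations 4.21 and 4.22).
# Per-patch centering makes one eigenvalue numerically zero, so that degenerate
# direction is dropped before whitening.
keep = evals > 1e-10 * evals.max()
Y = (X @ U[:, keep]) / np.sqrt(evals[keep])
print(np.round(np.cov(Y.T)[:3, :3], 2))            # approximately the identity matrix
```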

Figure 2A shows how the values of and vary with the input dimension and the number of neurons (with , 4, 6, , 30 and , , , ). The relative error is shown in Figure 2B. The absolute value of the relative error tends to decrease with but may grow quite large as increases. In Figure 2B, the largest absolute value of the relative error is greater than , which occurs when and . Even the smallest is still greater than , which occurs when and . In this example, is a poor approximation of the MI , whereas and are strictly equal to the true MI across all parameters.

Figure 2:

A comparison of approximations and for different input dimensions. Here is always equal to the true MI with , whereas always has nonzero errors. (A) The values and vary with input dimension with , 4, 6, , 30, and the number of neurons with , , , . (B) The relative error changes with input dimension for different .

4.2  Dimensionality Reduction for Asymptotic Approximations

Suppose is partitioned into two sets of components, with
formula
4.28
formula
4.29
where , , , , , and . Then by Fubini's theorem, the MI in equation 2.1 can be written as