Throughout the nervous system, information is commonly coded in activity distributed over populations of neurons. In idealized situations where a single, continuous stimulus is encoded in a homogeneous population code, the value of the encoded stimulus can be read out without bias. However, in many situations, multiple stimuli are simultaneously present; for example, multiple motion patterns might overlap. Here we find that when multiple stimuli that overlap in their neural representation are simultaneously encoded in the population, biases in the read-out emerge. Although the bias disappears in the absence of noise, the bias is remarkably persistent at low noise levels. The bias can be reduced by competitive encoding schemes or by employing complex decoders. To study the origin of the bias, we develop a novel general framework based on gaussian processes that allows an accurate calculation of the estimate distributions of maximum likelihood decoders, and reveals that the distribution of estimates is bimodal for overlapping stimuli. The results have implications for neural coding and behavioral experiments on, for instance, overlapping motion patterns.
In many brain areas, information is distributed across neurons using population codes in which many neurons respond collectively to a single stimulus. By pooling across neurons, population codes allow accurate estimation of a stimulus from the population response even when neural noise is present. Given its ubiquity, understanding population coding is believed to be crucial to understand coding of information in the brain. Numerous studies have quantified, among other issues, the role of tuning curves (Zhang & Sejnowski, 1999), noise correlations (Sompolinsky, Yoon, Kang, & Shamir, 2002; Moreno-Bote et al., 2014), heterogeneity (Shamir & Sompolinsky, 2006; Ecker, Berens, Tolias, & Bethge, 2011; Shamir, 2014), and stimulus multiplicity (Orhan & Ma, 2015) on coding accuracy.
However, coding accuracy as measured by the variance in the estimates is not the only performance metric. When the same stimulus is repeatedly estimated from a population response and these estimates are averaged over many trials, a systematic difference between the mean estimated value and its true value might remain; this is called bias.
In many idealized cases, biases are absent from population coding estimation schemes. First, in the limit of low noise, estimators such as the maximum likelihood decoder can be shown to be unbiased under quite general conditions (Kay, 1993). Second, the coding problem might have an intrinsic symmetry that abolishes bias, that is, over- and underestimation of the stimulus are equally likely—for example, the estimation of the orientation of a visual grating from a homogeneous population using a homogeneous decoder. Either condition by itself is sufficient to warrant unbiased estimation. For instance, while for one-dimensional direction estimates the maximum likelihood decoder is suboptimal at high noise, it remains unbiased (Xie, 2002).
Yet in perception, biases are common. To explain these, theoretical studies typically rely on mechanisms that modulate the neural response to break the homogeneity of the population without adjusting the decoder, such as occurs with adaptation (Stocker & Simoncelli, 2006; Seriès, Stocker, & Simoncelli, 2009; Cortes et al., 2012; Wei & Stocker, 2015) or with contextual changes in the neural tuning (Schwartz, Hsu, & Dayan, 2007; Keemink & van Rossum, 2016).
In contrast to those studies, we show that biases can occur even in homogeneous population codes. We consider the case where multiple variables are simultaneously coded in a population, such as occurs in visual cortical area MT when two overlapping transparent random dot motion patterns are presented. We find that in these situations, biases in estimation emerge, even though the decoder has full knowledge of the encoding process. Furthermore, when multiple overlapping stimuli are presented, the number of perceived stimuli can be fewer than the number presented, resembling psychophysical findings (Treue, Hol, & Rauber, 2000; Edwards & Greenwood, 2005).
To explain these findings we develop a mathematical framework based on gaussian processes—an extension of multivariate gaussian distributions—which is generally applicable to maximum likelihood decoders for systems with gaussian noise. We use this framework to calculate and understand the implications of the bias for neural computation and perceptual biases.
2.1 Decoding of the Neural Response
We first consider the case when the opening angle is large, so that the two peaks in the tuning curve are far apart (, where is the tuning width). In this case, the stimulus estimates are centered around the true stimulus value, approximately according to a two-dimensional gaussian distribution (see Figure 1B). The true stimulus value (cross) and the mean estimate (marked by the ) coincide.
However, when the motion directions are instead almost the same so that the peaks in the population response partly overlap, the distribution radically changes shape (see Figure 1C). Now the estimates fall essentially in two categories: either the estimates are strongly positively correlated and cluster on the diagonal where (in this case, the most likely explanation for the neural response is that the two motion directions were the same); on other trials the estimates are negatively correlated, and the angular difference in the motion direction is overestimated. The mean of neither component of the distribution coincides individually with the true stimulus vector, nor does the mean of the full distribution; the estimate is biased.
To more easily understand these results, we transform the coordinates and describe the system in the sum and difference of the angles. The sum of the angles, , follows a gaussian distribution and is unbiased, as dictated by the rotational invariance of the setup. More interesting, however, is the opening angle . Estimator bias is defined as the difference between mean estimate and true stimulus value, , where the angular brackets denote the average over trials of the estimates . The estimator bias is shown as a function of true stimulus value in Figure 2A. When the opening angle is small, the bias is repulsive (the apparent angle is larger than the true value). As the opening angle increases, the bias changes sign and becomes attractive, before reducing to zero for even larger angles (see Figure 2A).
One can wonder whether the repulsive bias is simply caused by imposing . But this would not explain the biphasic nature of the bias or the bimodal decoding distribution. Furthermore, if the ordering of and were randomly assigned, the estimate distribution would become trimodal, with some estimates lying on the diagonal and others clustering in clouds on the antidiagonal on either side of the origin. On average, one would find , that is, , irrespective of the true angle, so it would be a strongly biased estimator.
In summary, in this relatively simple coding problem, biphasic biases emerge. Next, we attempt to understand why this occurs.
2.2 Emergence of Bias
Of particular interest is the limiting case of . While somewhat contrived as the presented motion directions are identical in that case, exact results can be obtained in this limit that approximately hold for any small . In this limit, is lowest at , as expected (see Figure 2B, thick black curve). Because of symmetry in the combined tuning curves, equation 4.1, not only all odd derivatives but also the second derivative of is zero. Thus, the dependence is quartic, (i.e., very flat), and for small opening angles, this error term hardly changes as the estimate is altered.
As the noise term combines the signals from overlapping tuning curves, it is smooth. It is also symmetric in ; however, its second derivative is nonzero. Depending on the noise, it is in leading order either an upward or a downward curved parabola centered around the origin. For small , this parabola will dominate over . Therefore, if the parabola is U-shaped and thus with a minimum at , the total MSE also has a global minimum there (see Figure 2B, purple curves). If, on the other hand, the noise term has a maximum at , the global minimum will be repulsed away from the true solution (see Figure 2B, green curves). As a result the distribution shows a sharp peak at 0 and a smeared peak farther away (see Figure 2C, left). Furthermore, when the encoded angle , exactly half of the estimates will be at (i.e., the fall on the diagonal in Figure 1C) and the other half will not. As increases, the probability of finding estimates will decrease, and the second peak will gain more mass (see Figure 2C, middle and right). The net effect is that this will first decrease the repulsive bias, then turn into an attractive bias, and finally the bias disappears.
The gaussian process approach can be used to calculate the probability of estimates in a numerically exact way without relying on simulations. Briefly, for a given true stimulus , we run over all candidate stimulus estimates and find the probability that it minimizes (see section 4). This gives an accurate and noise-free estimation of the decoding distribution, and thus of the decoding bias. This method was used to create Figures 2A and 2C and compares well to explicit stochastic simulations over many trials (see the dots in Figure 2A).
In an elegant but little-known paper Amari and Burnashev (2003) calculated the bias analytically in the case of gaussian white noise (see section 4). While our approach does not yield the analytical form of the distribution, it has as an advantage that it allows for more general encoding and noise models, as we examine below.
Interestingly, as the noise is reduced, the distribution of estimates remains bimodal. While in the limit of zero noise the bias disappears (as the theory of maximum likelihood estimation states), the transition in the limit of small angles is not due to a collapse of into a single gaussian distribution; rather, it is due to the two peaks in the distribution of estimates moving closer and closer together.
Intuitively, the bias emerges due to the interaction two effects: (1) for small opening angles, the two stimuli are interpreted as being just a single stimulus leading to attractive bias, and (2) when the stimulus is correctly perceived as being two directions, the angle estimate is broad and tends to overestimate, leading to a repulsive bias.
2.3 Effect of Noise Correlation and Heterogeneity
Next we examine how the bias depends on the neural noise, tuning curve heterogeneity, and encoding model. All of these effects can be included in the gaussian process approach without additional computational cost or complexity.
First, we consider the effect of correlations in the neural noise. In studying the coding of single stimuli, it was found that correlations in the neurons' noise (so-called noise correlations) limit the ability to average across neurons. This effect was particularly important when the spatial scale of the correlations, , matched the tuning curve width (Sompolinsky et al., 2002; Wu, Amari, & Nakahara, 2002). A similar effect is found here (see Figure 3A). For uncorrelated noise (small ), the bias can be reduced by using larger neural populations as in equation 2.4, but for correlated noise it cannot. The bias is maximal for intermediate correlation length. The bias diminishes (but remains finite) for large correlation lengths, in which case all neurons covary across trials.
Next, we consider heterogeneity among the tuning curves, which, again, from univariate coding studies is known to improve population coding accuracy (Shamir & Sompolinsky, 2006; Ecker et al., 2011). Similarly, heterogeneity resolves some of the degeneracy at that underlies the bias. To examine this, we drew the widths of the individual tuning curves from a log-normal distribution. Indeed, increasing the heterogeneity among the tuning curves decreased the bias (see Figure 3B). The bias reduction could be simply the result of the inclusion of a few neurons with very narrow tuning curves in the population that allow a precise estimate. To check against this explanation, we calculated the average bias from a set of homogeneous populations as , with the distribution of widths and from equation 2.4. This has only a weak effect on the bias (see Figure 3B, dashed line); instead it is the heterogeneity that underlies the bias reduction.
2.4 Bias Reduction Strategies
2.4.1 Bias Reduction by the Encoder
2.4.2 Bias Reduction by the Decoder
We wondered whether the bias is unavoidable or is perhaps particular to the ML decoder. First, we use a Bayesian decoder, which calculates the full distribution of possible stimulus estimates given the response and the noise model, . For a flat prior for , this distribution is proportional to . Whereas the maximum likelihood decoder takes the maximum of this distribution, using a square loss function the Bayesian estimate equals the expected value of this distribution, (Kay, 1993; Salinas & Abbott, 1994). This estimator minimizes the mean squared error in the estimate. We find that with a Bayesian decoder, the bias is slightly more pronounced (see Figure 4A).
Can one design a bias-free decoder? If there is a smooth, monotonic, potentially noisy relation between an estimate and the true stimulus, one can hope to compensate the bias. While here the bimodal distribution of estimates makes this more challenging, we wondered if nevertheless one can compensate for the ML decoder bias. Is there a nonlinear mapping replacing each estimate of the ML decoder with a transformed estimate that reduces the bias across the range of possible encoded angles?
After discretizing both and , the bias after correction () becomes , where . The goal now is to find the vector so that for all . This is a linear algebra problem of the classic form , where (here, ) has to be found for known (here, ) and (here, ). Because the problem is ill conditioned, we use singular value decomposition to find . Furthermore, as the entries of diverge, we apply Tikhonov regularization. As a result of the regularization, the bias will not be exactly zero, but the norm of is limited (see section 4).
With this procedure, the bias is substantially reduced compared to the uncorrected estimator, (see Figure 4B; green versus black curve). The nonlinearity found to achieve this fluctuates smoothly around zero (inset). The intuition is that the oscillations make the nonlinearity highly sensitive to small changes in while maintaining the correct mean. Reassuringly, for larger estimates, where bias was small anyway, no nonlinearity is needed and (inset, dashed line). However, the bias reduction comes at a cost: because of the oscillations in , the estimates vary wildly from trial to trial. For the case illustrated, the variance in the estimate is some 300 times larger than for the ML decoder for a fourfold bias reduction. As the regularization is relaxed, the bias can be made smaller, but the amplitude (and frequency) of increases, and hence the variance grows.
2.5 Bias and Estimator Efficiency
The quality of population code readout is not quantified by the bias alone, but also by the number of trial-to-trial variations in the estimates—the variance in the distributions in Figures 1B and 1C. The variance of the estimates from the ML decoder follows directly from the distribution of estimates that our approach yields. The variance in the estimator is plotted in Figure 5A.
For large opening angles , the estimate distributions are gaussian with a width proportional to the neural noise (see Figure 2C, right), and the bias plays a minor role. As expected from the Fisher information, the variance of the estimator is proportional to the square of the neural noise (see Figure 5B, red curve). However, for small opening angles, the bias has a profound effect on the estimator, causing a linear dependence on the noise (see Figure 5B, black curve). This can be understood as follows: The estimator variance at can be approximately found by describing the estimate distribution as a peak at zero and a gaussian (see Figure 2C). The variance is similar to the bias squared, and thus its parameter dependence follows from squaring equation 2.4. Therefore, in contrast to the behavior at large angles, the variance in the estimates is only linear in the neural noise.
While the Fisher information is proportional to the neural noise squared (see equation 4.3), the Cramèr-Rao bound at small opening angles is only linear in the neural noise (see Figure 5B, dashed curves). The reason is that for small angles, the Fisher information goes to zero (see equation 4.3), but the bias' derivative at the origin equals . Hence, at small angles, both numerator and denominator in the Cramèr-Rao bound (see equation 2.6) go to zero. The bound therefore does not diverge, and a linear dependence on the neural noise remains. For the parameters used, the ML always achieves an efficiency of or more.
As an aside, here another advantage of the gaussian process approach shows. With simulations, the bias and, in particular, its derivative, are hard to calculate accurately, even using a large number of realizations (see Figure 2A, dots). However, the numerically exact method to calculate the bias allows for a precise calculation of the bias and its derivative.
Traditionally, theoretical studies of the accuracy limits of population codes have focused on estimator variance. Whenever biases have been studied theoretically, they have typically been explained from inhomogeneities in the neural encoding. Here we find that when multiple stimuli are encoded simultaneously in a relatively simple coding problem, substantial biases arise with standard decoders. That biases occur should not be surprising. Apart from cases where symmetry rules out biases, the absence of biases can be proven in the limit of low noise, but in general, an ML decoder will not be unbiased or efficient (Kay, 1993; Seriès et al., 2009; Pilarski & Pokora, 2015). Yet the rich structure of the bias in these simple models, including its biphasic character and its relative persistence at low noise, is surprising. The reason for the bias is the bimodal distribution of decoding estimates. The bias will disappear in the limit of zero noise, but it diminishes only slowly as noise is reduced (proportional to the square root of the standard deviation of the neural noise; see equation 2.4). The persistence of the bias at low noise stands in contrast to other studies of bias and efficiency where below a critical noise level, the decoders abruptly become optimal (Kay, 1993; Xie, 2002). Simulations show that the shape of the bias curve and the distribution of estimates are very similar when, instead of gaussian noise, Poisson or multiplicative gaussian noise is considered (not shown), but the gaussian process approach cannot be used in the Poisson case.
This particular coding problem has been studied twice before. Orhan and Ma (2015) calculated the Fisher information but did not consider decoder biases. Amari and Burnashev (2003) showed that for uncorrelated noise, a singularity in the Fisher information leads to a bound on for any decoder. The gaussian process approach numerically matches their analytical results but also allowed us to study correlated noise, nonlinear encoding, and heterogeneity.
While the ML decoder and the Bayesian decoder have a strong bias, it can be arbitrarily reduced by a nonlinear mapping. However, this comes at the cost of a much increased variance (in line with Amari & Burnashev, 2003) and the compensating decoder is probably too complicated and fragile for biological implementation; and, in addition, it would need to be made dependent on noise level. The neural implementation of decoding mechanisms is currently not clear, although it has been argued that it is straightforward to implement ML decoders neurally (Deneve, Latham, & Pouget, 1999; Jazayeri & Movshon, 2006). One could wonder for which coding problems ML decoders are biased. While we do not currently have a general answer to this, one can employ the gaussian process approach to explore potential cases (under a gaussian noise assumption).
The question of which decoder the brain implements and whether bias is present is ultimately an empirical one and can be tested in psychophysical experiments. For instance, two overlapping random dot motion patterns with different directions can be presented and subjects are asked to guess the angle between the two directions. In such experiments, repulsive biases have commonly been observed (Marshak & Sekuler, 1979), but attractive effects have also been observed (Braddick, Wishart, & Curran, 2002). Several effects have been hypothesized to underlie these biases, including adaptation (Rauber & Treue, 1999), cortical interactions (Carandini & Ringach, 1997), and repulsion from the cardinal directions (Rauber & Treue, 1998). The bias described here is not at odds with those explanations but presents a novel contribution to the total bias. It should be most prominent at small angles and when presentation times are short so that the signal-to-noise ratio is small.
The estimated decoding distribution reflects an ambiguity between the presence of one or two stimuli. Apart from predicting a bias, the theory predicts a bimodal distribution of direction difference estimates, and for small angles, about half the time, the two motions should be perceived as one. In experiments, the number of stimuli that can simultaneously be perceived using overlapping motions is limited (Edwards & Greenwood, 2005), and when three or five overlapping motions are presented, they can sometimes be perceived as two (so-called metamers; Treue et al., 2000), an effect that previously has been explained using the probabilistic population code framework (Zemel, Dayan, & Pouget, 1998; Zemel & Dayan, 1999). The results here suggest that differences in the numerosity between presented and perceived stimuli already emerge with maximum likelihood decoders. Quantitative verification of the predictions of our study regarding bias and numerosity should be possible, but attention, participants' expectations, and natural priors for perceiving a single motion direction instead of two directions should be taken into account.
More generally, these results might also be relevant for other brain areas, such as higher visual areas. Here, our findings pose limits on the number of objects that can be represented simultaneously in a neural population. The competitive coding studied here, or, alternatively, complex temporal dynamics during simultaneous stimulus presentation (Gawne, 2008; Li et al., 2016), might help to alleviate such limitations (Amari & Nakahara, 2005).
4.1 Neural Population Response
We use a population of neurons. The tuning of neuron to a single stimulus is given by . Here, is the response amplitude (arbitrarily set to 1), and is the width of the tuning curve (set to ). The preferred directions of the neurons are equally spaced between 0 and . As is common, we assume that the angles involved are relatively small, so that we do not have to worry about their circularity, which would add complication through the need for circular statistics but does not change the results qualitatively.
4.2 Algorithm to Calculate of Maximum Likelihood Estimate
Given a noisy response , we run over all candidate stimulus estimates and find the probability that it minimizes . Because is a smooth gaussian process and nearby 's are correlated, we finely discretize the possible estimates and define a set of candidate estimates .
To calculate the probability that a certain estimate yields the lowest MSE, it is compared to the MSE that all other estimates yield. We define the -dimensional set of MSE differences . We write this in shorthand as the vector , where .
The distribution of differences is an -dimensional multivariate normal distribution,
While the orthant integral, equation 4.2, is not analytically tractable, efficient algorithms exist that calculate it to a high precision for values of up to in the hundreds. We used the quasi–Monte Carlo integration function mvnun from Scipy (Genz, 1992, 1998) with and (using a larger had negligible effects). (The code is available at https://github.com/swkeemink/DeDist.)
We note that the algorithm also extends to higher-dimensional stimuli, but is in practice limited by the dimensionality of the integral (which is equal to the number of bins used for the stimulus space discretization). However, algorithms for even higher dimensions exist (Azzimonti & Ginsbourger, 2016).
4.3 Scaling of the Bias
The approximate location of the repulsed minimum is given by , or and thus . The bias in the other half of the trials is zero (see the purple traces in Figure 2B); hence, the average bias is . This yields the relation in the main text, equation 2.4. The dependency of this equation on its parameters was confirmed numerically.
4.4 Calculation of the Cramèr-Rao Bound
4.4.1 Fisher Information in Competitive Coding Model
For the competitive coding, the Fisher information is identical for both sum and difference angles, and again depends only on , , where is the identity matrix. This is a monotonically increasing function in . When there are two separate peaks in the population response (), the information is twice that when , where there is a single peak in the tuning.
4.5 Debiasing Nonlinearity
To solve equation 4.4 for , we use the standard SVD decomposition , where and are orthogonal matrices and is a diagonal matrix. Now . To regularize this solution we replace the elements of the diagonal matrix , , with .
As a reminder to the reader, denotes the true stimulus value, the possible candidate estimates, and the estimate (the most likely ).
This work was supported by the Engineering and Physical Sciences Research Council (grant EP/F500386/1) and the Biotechnology and Biological Sciences Research Council (grant BB/F529254/1) to the University of Edinburgh School of Informatics Doctoral Training Centre in Neuroinformatics and Computational Neuroscience S.K. was supported by the EuroSpin Erasmus Mundus program and the EPSRC NeuroInformatics DTC. The work has made use of resources provided by the Edinburgh Compute and Data Facility (ECDF; www.ecdf.ed.ac.uk), which has support from the eDIKT initiative (www.edikt.org.uk). We thank Ad Aertsen, Udo Ernst, Richard van Wezel, Lawrence York, Peggy Series, and Chris Williams for discussions.