Neural decoding methods provide a powerful tool for quantifying the information content of neural population codes and the limits imposed by correlations in neural activity. However, standard decoding methods are prone to overfitting and scale poorly to high-dimensional settings. Here, we introduce a novel decoding method to overcome these limitations. Our approach, the gaussian process multiclass decoder (GPMD), is well suited to decoding a continuous low-dimensional variable from high-dimensional population activity and provides a platform for assessing the importance of correlations in neural population codes. The GPMD is a multinomial logistic regression model with a gaussian process prior over the decoding weights. The prior includes hyperparameters that govern the smoothness of each neuron’s decoding weights, allowing automatic pruning of uninformative neurons during inference. We provide a variational inference method for fitting the GPMD to data, which scales to hundreds or thousands of neurons and performs well even in data sets with more neurons than trials. We apply the GPMD to recordings from primary visual cortex in three species: monkey, ferret, and mouse. Our decoder achieves state-of-the-art accuracy on all three data sets and substantially outperforms independent Bayesian decoding, showing that knowledge of the correlation structure is essential for optimal decoding in all three species.

Since Zohary, Shadlen, and Newsome’s landmark demonstration of correlated activity in a population of MT neurons (Zohary et al., 1994), computational neuroscience has been seeking to elucidate the role that noise correlations play in the population code (Adibi et al., 2013; Bartolo et al., 2020; Cafaro & Rieke, 2010; Ecker et al., 2011, 2014; Kanitscheider et al., 2015; Kohn et al., 2016; Moreno-Bote et al., 2014; Nirenberg & Latham, 2003; Nogueira et al., 2020; Panzeri et al., 2022; Schneidman et al., 2003; Sokoloski et al., 2021). Noise correlations refer to statistical dependencies in the trial-to-trial fluctuations in population activity elicited in response to a fixed stimulus, and they may either increase or decrease information relative to a population with conditionally independent neurons (Averbeck et al., 2006; da Silveira & Rieke, 2020). A common strategy for evaluating the role that noise correlations play in a population code is to compare the accuracy of two decoders trained to predict stimuli from population activity: a “correlation-blind” decoder, which has no access to noise correlations present in population activity, and a “correlation-aware” decoder, which does (Berens et al., 2012; Bialek et al., 1991; Graf et al., 2011; Nirenberg & Latham, 2003; Pillow et al., 2008; Stringer et al., 2021; Yates et al., 2020). If the correlation-aware decoder performs better, then we may conclude that downstream regions must take noise correlations into account to optimally read out information from the upstream population.

This strategy is an effective way to investigate the scientific question, but existing work is plagued by a number of statistical issues, which we aim to address in this article. First, as neural data sets have increased in dimensionality, regularization has become a prerequisite for good decoding performance, making it difficult to compare correlation-blind decoders—which are often fit without regularization—and correlation-aware decoders, which are almost always regularized. Second, conventional correlation-aware decoders struggle to scale statistically and computationally to modern data sets containing tens or hundreds of thousands of neurons.

To address these shortcomings, we develop a suite of three new decoders with a common regularization strategy based on gaussian processes (GPs). First, we introduce two correlation-blind decoders that apply Bayesian decoding to an independent encoding model: the GP Poisson independent decoder (GPPID), which assumes independent Poisson encoding noise; and the GP gaussian independent decoder (GPGID), which assumes independent gaussian encoding noise. Both of these decoders place a gaussian process prior over the neural tuning curves. (For each neuron, its tuning curve is its mean response as a function of the stimulus variable.) The GPPID model can be used when the neural responses are nonnegative integers (e.g., spike counts), whereas the GPGID model can be used when the neural responses are real numbers (e.g., calcium imaging). We emphasize that both of these decoders are insensitive to correlations because they assume neural responses are independent given the stimulus.

We then introduce a novel correlation-aware decoding model, the gaussian process multiclass decoder (GPMD), which is a multinomial logistic regression model that uses a GP prior to regularize the decoding weights for each neuron. This decoder, which learns a direct linear mapping from high-dimensional neural activity patterns to the log probability of the stimulus, is the only one of the three that takes noise correlations into account. However, the three decoders have a similar number of parameters—equal to the number of neurons times the number of stimulus categories—and rely on a common regularization method, making it straightforward to compare them.

We compared our decoders to a variety of previously proposed decoding methods: first, multinomial logistic regression regularized using an elastic-net penalty (GLMNET; see Zou & Hastie, 2005); second, the empirical linear decoder (ELD), a decoder trained using support vector machines (Graf et al., 2011); and third, the “super-neuron” decoder (SND), a recently proposed decoder trained using least squares regression and a bank of nonlinear target functions (Stringer et al., 2021). All three of these decoders are correlation-aware linear classifiers. For completeness, we also compared our decoders to unregularized, correlation-blind Poisson and gaussian independent decoders (PID/GID).

We benchmarked all of these decoders on three real-world data sets from primary visual cortex (V1), recorded from monkey (Graf et al., 2011), ferret (this article), and mouse (Stringer et al., 2021). We found that our regularized correlation-blind decoders (GPPID and GPGID) could match and even exceed the performance of some of the correlation-aware decoders. However, none of these decoders performed as well as our proposed correlation-aware decoder, the GPMD, which achieved state-of-the-art accuracy on all data sets, with no preprocessing or manual elimination of noisy neurons. The GPMD’s performance advantage derived directly from its expressive regularization strategy, which successfully adapted to every data set’s unique structure while remaining as scalable as other regularized, correlation-aware decoders.

These results indicate that knowledge of the correlation structure is crucial for optimal readout of stimulus information from V1 populations in all three species. For ease of use, our decoders conform to the scikit-learn interface and are released as a Python package at https://github.com/cdgreenidge/gdec.

In this article, we consider the problem of decoding a low-dimensional stimulus variable (i.e., the orientation of a sinusoidal grating) from a high-dimensional neural activity pattern (i.e., a vector of spike counts). We assume the stimulus belongs to one of K discrete bins or classes, formally making this a classification problem. However, the regression problem can be approximated by making K large, so that the grid of stimulus values becomes arbitrarily fine.

Figure 1 illustrates the problem setup for the V1 data sets we examined. The visual stimulus for each individual trial is a drifting sinusoidal grating with an orientation θ_k selected from a set of discrete orientations {θ_1,...,θ_K} that evenly divide the interval [0,2π]. The stimulus variable to be decoded is thus a categorical variable y ∈ {1,...,K}.

Figure 1:

(A) Decoding task diagram. The animal is presented a grating drifting at an angle θ ∈ [0,2π]. Responses are recorded from primary visual cortex and summed over time into a feature vector, x. The stimulus is binned to an integer class label, y. We use linear decoders to predict y from x. (B) Tuning curves from three randomly selected neurons from each species. The two calcium fluorescence data sets (ferret and mouse) are noisier and have many more neurons, increasing the importance of regularization.


We consider the neural population response to be a vector x ∈ ℝ^D, where D indicates the number of neurons in the data set. We obtained this response vector by summing each neuron’s spikes (monkey) or two-photon calcium fluorescence (ferret and mouse) over some time window after stimulus presentation. Figure 1B shows orientation tuning curves from three example neurons from each data set. The monkey data sets (left) contained between D=68 and D=147 neurons, with K=72 discrete stimulus orientations (spaced every 5 degrees) and 50 trials per orientation, for T=3600 trials (Graf et al., 2011). The ferret data set (middle) contained D=784 neurons, with K=180 discrete stimuli (spaced every 2 degrees) and 11 trials per orientation for a total of T=1991 trials, with the 0°/360° orientation sampled twice. Finally, the mouse data sets (right) contained between D=11,311 and D=20,616 neurons. The stimuli for this experiment were sampled uniformly in [0,2π], and we subsequently discretized them into K=180 bins (Stringer et al., 2021). Each bin contained between 12 and 42 trials for a total of between T=4282 and T=4469 trials, depending on the data set.

In each case, we collected the population response vectors x and the discretized stimuli y into a classification data set D = {(x_t, y_t)}_{t=1}^T. Full details on these data sets and their preprocessing procedures can be found in appendix C in the supplement.

2.1  Linear Classification

The decoders we consider are all linear classifiers, meaning that they are defined by a set of linear decoding weights and an intercept term. Their common form is
$\hat{y} = \arg\max_{k \in \{1,\ldots,K\}} \left( w_k^\top x + b_k \right),$  (2.1)
where w_k ∈ ℝ^D is a set of decoding weights, and b_k is an intercept term for stimulus class k. Note that an explicit intercept term is not strictly necessary, since it can be included in the weights if a one-valued entry is appended to x. To obtain the stimulus estimate ŷ, we compute the dot product between the neural response vector x and the weights for each class, and select the class in {1,...,K} for which this dot product is maximal.

The full set of parameters for a decoding model is thus the set of decoding weights for each class, which can be written as a D×K matrix W = (w_1,...,w_K) and, optionally, a K-dimensional intercept vector b = (b_1,...,b_K). In the following, we let W_dk denote the decoding weight for neuron d for stimulus class k. The only difference between the decoding methods we consider is the procedure for training these weights from data.

The probabilistic formulation of these classifiers is that the log probability of the stimulus class is an affine (i.e., linear plus constant) function of the neural response,
$\log P(y = k \mid x) = w_k^\top x + b_k - \log Z,$  (2.2)
or, equivalently,
$P(y = k \mid x) = \frac{1}{Z} \exp\left( w_k^\top x + b_k \right),$  (2.3)
where $Z = \sum_{j=1}^{K} \exp\left( w_j^\top x + b_j \right)$ is the normalizing constant ensuring that the probabilities over the K classes sum to 1.
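As a concrete illustration of this decoding rule, the following minimal sketch (assuming numpy, and using randomly generated weights purely as placeholders) computes the class scores of equation 2.1 and the softmax probabilities of equation 2.3:

```python
import numpy as np

def decode_linear(x, W, b):
    """Linear decoding rule of equation 2.1.

    x : (D,) response vector; W : (D, K) weight matrix; b : (K,) intercepts.
    Returns the predicted class index and the class probabilities (eq. 2.3).
    """
    scores = W.T @ x + b                             # unnormalized log probabilities, one per class
    scores -= scores.max()                           # subtract the max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()    # softmax over the K classes
    return int(np.argmax(scores)), probs

# Example call with random, untrained weights (placeholders only).
rng = np.random.default_rng(0)
D, K = 100, 72
x = rng.poisson(2.0, size=D).astype(float)
W, b = rng.normal(size=(D, K)), np.zeros(K)
y_hat, probs = decode_linear(x, W, b)
```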

Decoding with linear classifiers is optimal for so-called exponential-family probabilistic population codes with linear sufficient statistics (Beck et al., 2007; Ma et al., 2006). Although we could certainly expand our study to consider decoding with nonlinear classifiers, previous analyses of two of the data sets we used showed no benefit from adding nonlinear classification (Graf et al., 2011; Stringer et al., 2021).

2.2  Noise and Signal Correlations

It is helpful to distinguish between two types of correlations present in neural population responses: noise correlations and signal correlations (Averbeck et al., 2006; Panzeri et al., 2022). Noise correlations correspond to statistical dependencies in P(x | y), the distribution of population responses x given a stimulus y. If neurons are conditionally independent (as assumed by the Poisson and independent gaussian encoding models), then noise correlations are zero. Noise correlations are typically characterized by Cov[x | y], the covariance of the population responses to a fixed stimulus. This covariance clearly depends on the stimulus, meaning that noise correlations may differ for different stimuli.

Signal correlations are correlations between the tuning curves of different neurons. Two neurons with similar tuning curves will have high signal correlations, while neurons with dissimilar tuning curves may have zero or even negative signal correlations (i.e., if their tuning curves are anticorrelated). Signal correlations are commonly characterized by the covariance matrix Cov[x̄(y)], which is the covariance of the tuning curves x̄(y) = E[x | y] over the stimulus distribution P(y).
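As an illustration of these two quantities, the sketch below estimates both correlation types from a response matrix and label vector. It is a simplification of ours: the noise covariance is pooled across stimuli rather than characterized separately for each stimulus, and the array shapes are assumptions.

```python
import numpy as np

def noise_and_signal_correlations(X, y):
    """Rough empirical estimates of noise and signal correlations.

    X : (T, D) array of population responses; y : (T,) integer stimulus labels.
    Noise correlations: correlations of residuals after subtracting each
    class-conditional mean (pooled over stimuli here, a simplification).
    Signal correlations: correlations of the tuning curves across stimuli.
    """
    classes = np.unique(y)
    tuning = np.stack([X[y == k].mean(axis=0) for k in classes])   # (K, D) tuning curves
    residuals = X - tuning[np.searchsorted(classes, y)]            # trial-by-trial "noise"
    noise_corr = np.corrcoef(residuals, rowvar=False)              # (D, D)
    signal_corr = np.corrcoef(tuning, rowvar=False)                # (D, D)
    return noise_corr, signal_corr
```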

The effects of noise correlations on decoding performance depend critically on the alignment between noise and signal correlations (Averbeck & Lee, 2006; da Silveira & Rieke, 2020; Panzeri et al., 2022). However, it is important to note that so-called correlation-blind decoders are blind only to noise correlations. Thus, comparing correlation-blind and correlation-aware decoders is a method for determining whether a downstream population needs to know the structure of the noise correlations in order to decode information optimally. Figure S1 shows a characterization of both noise and signal correlations for the data sets considered in this article.

Here we describe previously proposed neural decoding methods, against which we will compare the gaussian process-based decoders introduced in section 4.

3.1  Correlation-Blind Decoders

First, we introduce two independent or correlation-blind decoders, the first assuming Poisson noise and the second assuming gaussian noise. Both decoders use Bayes’ rule to obtain a posterior distribution over the stimulus under an independent encoding model, an approach commonly known as “naive Bayes.” The encoding models underlying these decoders assume that neural responses are conditionally independent given the stimulus, preventing them by construction from using noise correlations for decoding.

3.1.1  The Poisson Independent Decoder

The Poisson independent decoder (PID) relies on an independent Poisson encoding model of neural responses, which assumes that each neuron’s spike count is drawn from an independent Poisson distribution, the mean of which is determined by the stimulus (Abbott, 1994; Földiák, 1993). The encoding model describes xd, the response of neuron d, as
$x_d \mid y = k \sim \mathrm{Poiss}(\lambda_{dk}),$  (3.1)
where λdk is the mean response of neuron d to stimulus k. Under the conditional independence assumption, the joint distribution of the population response is the product of the single-neuron encoding distributions,
$P(x \mid y = k) = \prod_{d=1}^{D} \mathrm{Poiss}(x_d; \lambda_{dk}),$  (3.2)
where D is the total number of neurons.
Using Bayes’s theorem, it is possible to derive the probability over stimuli given an observed response vector x:
$P(y = k \mid x) = \frac{P(x \mid y = k)\, P(y = k)}{\sum_{j=1}^{K} P(x \mid y = j)\, P(y = j)}.$  (3.3)
This quantity can be used for decoding. If there are equal numbers of trials per class in the training data set, the prior probabilities P(y=k) are equal for all k, and cancel, leaving the prediction rule,
$\hat{y} = \arg\max_{k \in \{1,\ldots,K\}} P(x \mid y = k).$  (3.4)
To fit the model, one must estimate the parameters {λdk} for every stimulus and neuron. The vector of parameters for a single neuron λd=(λd1,...,λdK)T is known as the tuning curve. It contains the neuron’s expected response as a function of the stimulus. The maximum likelihood estimate for λdk is given by the mean spike count for each neuron-stimulus combination,
$\hat{\lambda}_{dk} = \frac{1}{|A_k|} \sum_{(x_t, y_t) \in A_k} x_{td},$  (3.5)
where A_k = {(x_t, y_t) ∈ D : y_t = k} is the set of all elements of the data set D = {(x_t, y_t)}_{t=1}^T associated with a particular stimulus y = k, and |A_k| is the number of elements in A_k.
Assuming that the prior class probabilities P(y=k) are equal, the log of the class-conditional probability (see equation 3.4), also known as the log posterior over classes, can be written as
$\log P(y = k \mid x) = \sum_{d=1}^{D} \left( x_d \log \lambda_{dk} - \lambda_{dk} \right) + c,$  (3.6)
where c is a constant we ignore because it does not depend on the class. This shows that the PID decoder is a linear classifier (see equation 2.1), with weights W^ and intercepts b^ given by
$\hat{W}_{dk} = \log \hat{\lambda}_{dk},$  (3.7)
$\hat{b}_k = -\sum_{d=1}^{D} \hat{\lambda}_{dk}.$  (3.8)
See appendix A for a detailed derivation.
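The following minimal sketch implements this fitting procedure (equations 3.5, 3.7, and 3.8) and the resulting linear decoding rule; the small constant added to the mean counts is a numerical convenience of ours, not part of the model:

```python
import numpy as np

def fit_pid(X, y, K, eps=1e-3):
    """Maximum likelihood fit of the Poisson independent decoder (eqs. 3.5, 3.7, 3.8).

    X : (T, D) spike counts; y : (T,) class labels in {0, ..., K-1}.
    A small constant eps keeps log(lambda) finite for silent neuron/stimulus pairs
    (our addition, not part of the model).
    """
    lam = np.stack([X[y == k].mean(axis=0) for k in range(K)]) + eps  # (K, D) tuning curves
    W = np.log(lam).T          # (D, K) decoding weights, W_dk = log(lambda_dk)
    b = -lam.sum(axis=1)       # (K,) intercepts, b_k = -sum_d lambda_dk
    return W, b

def decode_pid(X, W, b):
    """Predict classes by maximizing the linear score of equation 2.1."""
    return np.argmax(X @ W + b, axis=1)
```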

3.1.2  The Gaussian Independent Decoder

The Poisson independent decoder described above can be applied only to nonnegative integer data, such as spike counts. For real-valued data such as calcium fluorescence, intracellularly recorded membrane potential, local field potential, or fMRI BOLD signals, it is common to use a gaussian encoding model. This model describes xd, the response of neuron d, as
$x_d \mid y = k \sim \mathcal{N}(\mu_{dk}, \sigma_d^2),$  (3.9)
where μ_dk is the mean response of neuron d to stimulus k, and σ_d² is the noise variance for neuron d. Unlike a typical gaussian naive Bayes decoder, we restrict the noise variance to be constant across stimulus classes, though, as usual, it can vary across neurons. With this restriction, the decoder becomes a linear classifier, like the other decoders we consider. If the noise variance were allowed to vary across stimulus classes, the decoder would be a quadratic classifier (see appendix A).
To fit the model, we compute maximum likelihood estimates of the encoding distribution parameters for each neuron, which are given by the class-conditional empirical means $\hat{\mu}_{dk}$ and the empirical variances $\hat{\sigma}_d^2$, for each neuron d:
$\hat{\mu}_{dk} = \frac{1}{|A_k|} \sum_{x_t \in A_k} x_{td},$  (3.10)
$\hat{\sigma}_d^2 = \frac{1}{T} \sum_{t=1}^{T} \left( x_{td} - \hat{\mu}_{d y_t} \right)^2.$  (3.11)
As before, D = {(x_t, y_t)}_{t=1}^T is the data set, and A_k = {x_t ∈ D : y_t = k} is the set of all neural response vectors for a particular stimulus y = k.
Decoding stimuli under this encoding model follows from Bayes’s rule in the same manner as in the Poisson independent decoder (see equation 3.4), but using the gaussian encoding distribution instead of the Poisson. After some algebra, we can see that the gaussian independent decoder (GID) is a linear classifier (see equation 2.1) with weights W and intercepts b given by
$\hat{W}_{dk} = \hat{\mu}_{dk} / \hat{\sigma}_d^2,$  (3.12)
$\hat{b}_k = -\frac{1}{2} \sum_{d=1}^{D} \hat{\mu}_{dk}^2 / \hat{\sigma}_d^2.$  (3.13)
See appendix A for a detailed derivation.
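A corresponding sketch for the gaussian independent decoder, using the class-pooled variance estimate described above (the small jitter term is our addition for numerical stability):

```python
import numpy as np

def fit_gid(X, y, K):
    """Maximum likelihood fit of the gaussian independent decoder (eqs. 3.10-3.13).

    X : (T, D) real-valued responses; y : (T,) class labels in {0, ..., K-1}.
    The noise variance is pooled across classes so the decoder stays linear.
    """
    mu = np.stack([X[y == k].mean(axis=0) for k in range(K)])   # (K, D) class-conditional means
    resid = X - mu[y]                                           # residuals about the class means
    var = resid.var(axis=0) + 1e-12                             # (D,) pooled noise variances (jitter added)
    W = (mu / var).T                                            # (D, K): W_dk = mu_dk / sigma_d^2
    b = -0.5 * (mu ** 2 / var).sum(axis=1)                      # (K,): b_k = -(1/2) sum_d mu_dk^2 / sigma_d^2
    return W, b
```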

3.2  Correlation-Aware Decoders

Here we review three previously proposed decoders that take into account the structure of neural correlations when determining a classification boundary and are therefore “correlation aware.” Unlike the two naive Bayes decoders described above, which resulted from applying Bayes’s rule to an encoding model, these directly model the posterior probability over stimuli given a vector of neural activity. All three decoders are multiclass linear classifiers, but they are trained with different loss functions and regularization methods.

3.2.1  Multinomial Logistic Regression with Elastic-Net Penalty (GLMNET)

The multinomial logistic regression model is a generalization of binary logistic regression to the multiple-class setting. It assumes that the log probability of the stimulus given the response is an affine function—that is, a linear transform plus a constant—of the neural response vector (Bishop, 2006). The conditional probability of the stimulus belonging to class k given the neural response vector x can be written as
$P(y = k \mid x) = \frac{1}{Z} \exp\left( w_k^\top x + b_k \right),$  (3.14)
where wk is a vector of the decoding weights for class k, bk is the constant offset for class k, and Z is the normalizing constant.
The model parameters consist of the weights W=(w1,...,wK) and offsets b=(b1,...,bK) and can be fit by maximum likelihood. The log-likelihood function given the data set D={(xt,yt)}t=1T can be written as
$\mathcal{L}(W, b) = \sum_{t=1}^{T} \left[ \mathbf{y}_t^\top \left( W^\top x_t + b \right) - \log\left( \mathbf{1}^\top \exp\left( W^\top x_t + b \right) \right) \right],$  (3.15)
where $\mathbf{y}_t$ is the one-hot vector representation of the stimulus class y_t ∈ {1,...,K} on trial t—that is, a vector of all zeros except for a one in the entry corresponding to the stimulus class—and $\mathbf{1}$ is a length-K vector of ones that sums the exponentiated scores over classes.

The maximum-likelihood estimator (MLE) tends to perform poorly in settings with limited amounts of data and may not exist for small data sets. In fact, the MLE is not defined when the number of trials T is smaller than the number of identifiable parameters in the weight matrix W, that is, when T<D(K-1). Even in settings where the MLE does exist, it may overfit, yielding poor generalization performance.

A popular solution to this problem is to regularize the MLE with the elastic-net penalty, which combines ℓ1 (“lasso”) and ℓ2 (“ridge”) penalties to induce parameter sparsity and shrinkage (Friedman et al., 2010). The elastic-net estimator is obtained by maximizing the log-likelihood minus the regularization penalty:
$\hat{W}, \hat{b} = \arg\max_{W, b}\ \mathcal{L}(W, b) - \gamma_1 \left[ \frac{\gamma_2}{2} \left\| W \right\|_2^2 + (1 - \gamma_2) \left\| W \right\|_1 \right].$  (3.16)
Here, γ1 is a hyperparameter determining the strength of the regularization, and γ2 is a hyperparameter that controls the balance between the ℓ1 penalty, which encourages W to be sparse, and the ℓ2 penalty, which encourages W to have a small magnitude. For our decoding tasks, we found that including the ℓ2 penalty always diminished cross-validated performance, so we fixed γ2 = 0. We then set γ1 using a five-step logarithmic grid search from 10^-4 to 10, evaluated with three-fold cross-validation.
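In practice, an ℓ1-penalized multinomial logistic regression of this kind can be fit with off-the-shelf tools. The sketch below uses scikit-learn as a stand-in for the GLMNET package; scikit-learn's C parameter is an inverse penalty strength, so the grid only loosely mirrors the γ1 grid described above, and an exact correspondence is not claimed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# L1-penalized multinomial logistic regression, standing in for GLMNET with the
# l2 term removed (gamma_2 = 0). C is an inverse penalty strength, so this grid
# plays roughly the role of the gamma_1 grid search described in the text.
param_grid = {"C": 1.0 / np.logspace(-4, 1, 5)}
model = GridSearchCV(
    LogisticRegression(penalty="l1", solver="saga", max_iter=5000),
    param_grid,
    cv=3,                      # three-fold cross-validation, as in the text
    scoring="accuracy",
)
# model.fit(X_train, y_train)  # X_train: (T, D) responses, y_train: (T,) labels (hypothetical names)
```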

3.2.2  The Empirical Linear Decoder (ELD)

The empirical linear decoder (ELD), introduced by Graf et al. (2011), is similar to multinomial logistic regression in that it models the log probability of the stimulus class as an affine function of the neural response vector (see equation 3.14). However, instead of using standard likelihood-based methods to fit the model, the authors constructed an inference method based on support vector machines (SVMs).

Their key observation was that the log-likelihood ratio for adjacent stimulus classes is an affine function of the response vector, with weights given by the difference of the two classes’ decoding weight vectors. For example, for stimulus classes 1 and 2 we have
$\log \frac{P(y = 2 \mid x)}{P(y = 1 \mid x)} = (w_2 - w_1)^\top x + (b_2 - b_1) = v_2^\top x + c_2.$  (3.17)
Here, we have defined v2 to be the difference vector (w2-w1) and c2 to be the difference scalar b2-b1.

We see that discriminating class 2 from class 1 under the multinomial logistic regression model is equivalent to solving a linear binary classification task with weights v2 and offset c2. Graf et al. (2011) proposed estimating v2 and c2 using an SVM trained on the data from classes 1 and 2. They then used the same approach to estimate the weights for all subsequent pairs of adjacent classes. That is, they estimated the difference weights vk using an SVM trained on data from classes k-1 and k, for k=2,...,K.

To recover the weights of the multinomial logistic regression model from the SVM weights, Graf et al. (2011) used the recursions
$w_k = w_{k-1} + \alpha_k v_k$  (3.18)
and
$b_k = b_{k-1} + \alpha_k c_k,$  (3.19)
for k=2,...,K. Here the weights for class 1 can be set to zero without loss of generality. The constant αk, which scales the contribution of the SVM weights vk and ck, is necessary because SVMs recover vk and ck only up to a multiplicative constant. We were unable to determine how the authors set these scaling constants (see Figure S6), so we fit them by maximizing the log-likelihood of the data under the multinomial logistic regression model (see equation 3.15).
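A rough sketch of this construction is shown below. It trains one linear SVM per pair of adjacent classes and accumulates the difference weights via the recursion of equations 3.18 and 3.19. For simplicity, the scaling constants α_k are fixed to 1 here rather than fit by maximum likelihood as described above, so this illustrates the structure of the method rather than reproducing it faithfully.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_eld(X, y, K):
    """Sketch of the empirical linear decoder construction.

    X : (T, D) responses; y : (T,) class labels in {0, ..., K-1}.
    Trains a linear SVM on each pair of adjacent classes (k-1, k) and
    accumulates the difference weights via eqs. 3.18-3.19. The per-class
    scaling constants alpha_k are fixed to 1 here; the text instead fits
    them by maximizing the multinomial logistic likelihood.
    """
    D = X.shape[1]
    W = np.zeros((D, K))                     # weights for class 0 set to zero (WLOG)
    b = np.zeros(K)
    for k in range(1, K):
        mask = np.isin(y, [k - 1, k])
        svm = LinearSVC().fit(X[mask], (y[mask] == k).astype(int))
        v_k, c_k = svm.coef_.ravel(), svm.intercept_[0]
        W[:, k] = W[:, k - 1] + v_k          # w_k = w_{k-1} + alpha_k v_k, alpha_k = 1
        b[k] = b[k - 1] + c_k                # b_k = b_{k-1} + alpha_k c_k
    return W, b
```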

3.2.3  The Super-Neuron Decoder (SND)

The super-neuron decoder (SND), introduced by Stringer et al. (2021), is a third approach for training a linear classifier on multiclass data. It optimizes a set of decoding weights using penalized least-squares regression and a set of nonlinear super-neuron response functions. The super-neuron response functions encode the tuning curves of a population of narrowly selective downstream “super-neurons,” containing one super-neuron for each stimulus orientation. Each super-neuron responds maximally to a single orientation, making the population response on each trial a narrow bump of activity centered on the correct stimulus.

Formally, the SND seeks a matrix of weights W that maps the response vector x to a target vector h ∈ ℝ^K on each trial. The target vector h contains the responses of the super-neuron population. The super-neurons have tuning curves parameterized in the same manner as the von Mises probability density function, which is appropriate since the stimulus variable is periodic.

The ith super-neuron has a preferred orientation of θ_i = 2π(i−1)/K, so its tuning curve is given by
$f_i(\theta) = \exp\left( \frac{\cos(\theta - \theta_i) - 1}{\sigma^2} \right).$  (3.20)
The target vector for the kth stimulus class is therefore
$h_k = \left( f_1(\theta_k), \ldots, f_K(\theta_k) \right)^\top,$  (3.21)
where θ_k = 2π(k−1)/K is the stimulus angle associated with stimulus class k ∈ {1,...,K}.
Stringer et al. (2021) trained the model weights W by linearly regressing the observed neural responses onto the target vectors. To penalize large weight values, they included an ℓ2 (“ridge”) regularization:
$\hat{W} = \arg\min_{W} \sum_{t=1}^{T} \left\| h_{y_t} - W x_t \right\|_2^2 + \gamma \left\| W \right\|_2^2.$  (3.22)
The term ‖h_{y_t} − Wx_t‖_2^2 is the squared error between the correct target vector h_{y_t} and the output of the linear decoding weights Wx_t on trial t. The term γ‖W‖_2^2 is the squared ℓ2 penalty on the decoding weights with regularization strength γ, which the authors fixed at γ = 1.0. Intuitively, this training procedure seeks weights W that make the linearly transformed population response Wx match the super-neuron population response h as closely as possible in a least-squares sense.
The decoding rule chooses the class label corresponding to the maximum of the linearly weighted responses (the same decoding rule as in the other decoders we have considered):
$\hat{y} = \arg\max_{k \in \{1,\ldots,K\}} (W x)_k.$  (3.23)
Here (Wx)k is the kth element of the transformed response vector Wx. In other words, the predicted stimulus value is the preferred orientation of the maximally responding super-neuron.
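The sketch below illustrates the SND training and decoding steps. The exact super-neuron tuning function and its width are assumptions on our part (a von Mises-style bump with width parameter sigma), and the ridge problem is solved in closed form; weights are stored as a D×K matrix so that class scores are the columns of X @ W.

```python
import numpy as np

def fit_snd(X, y, K, sigma=0.1, gamma=1.0):
    """Sketch of the super-neuron decoder (eqs. 3.20-3.23).

    X : (T, D) responses; y : (T,) class labels in {0, ..., K-1}.
    Builds von Mises-style super-neuron targets (width sigma is assumed here),
    then solves the ridge regression for W in closed form.
    """
    D = X.shape[1]
    theta = 2 * np.pi * np.arange(K) / K                 # preferred orientations theta_i
    # H[k, i]: response of super-neuron i to stimulus class k.
    H = np.exp((np.cos(theta[:, None] - theta[None, :]) - 1.0) / sigma**2)
    targets = H[y]                                       # (T, K) target vectors h_{y_t}
    # Closed-form ridge solution: W = (X^T X + gamma I)^{-1} X^T targets.
    A = X.T @ X + gamma * np.eye(D)                      # (D, D)
    W = np.linalg.solve(A, X.T @ targets)                # (D, K)
    return W

def decode_snd(X, W):
    """Predicted class is the index of the maximally responding super-neuron."""
    return np.argmax(X @ W, axis=1)
```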

In this section, we first introduce two correlation-blind decoders regularized with gaussian process (GP) priors: the GP-regularized Poisson independent decoder (GPPID) and the GP-regularized gaussian independent decoder (GPGID). After that, we introduce a correlation-aware decoder, the gaussian process multiclass decoder (GPMD), which adds a GP prior to multinomial logistic regression for the same purpose.

Gaussian processes are often used for nonlinear classification problems, a setting known as gaussian process classification (Bishop, 2006; Liu et al., 2021). In neuroscience settings, GPs have also been used to construct priors over smoothly varying firing rates or latent factors (Cunningham et al., 2007; Duncker & Sahani, 2018; Keeley et al., 2020; Yu et al., 2009). Here, we use GPs to construct a prior over the weights in a general linear model (in the case of the GPGID) or generalized linear model (GPPID and GPMD). Previous work has used GPs to provide a smoothing prior over weights of a linear or generalized linear model (Macke et al., 2011; Park & Pillow, 2011; Park et al., 2014; Rad & Paninski, 2010; Sahani & Linden, 2002). However, to our knowledge, ours is the first to use GP priors to impose a smoothness assumption over the different classes in a multinomial logistic regression model.

In the following sections, we describe the relationship between the linear weights in our model and neural tuning curves, which makes GPs a natural tool for regularization in our problem.

4.1  The GP-Regularized Poisson Independent Decoder (GPPID)

When doing inference in the PID decoder (see section 3.1.1), it is necessary to estimate λdk, the expected spike count of neuron d in response to the kth stimulus, across all neurons and stimuli. The expected spike counts for a single neuron form a tuning curve across stimuli, λd=(λd1,...,λdK)T. The maximum likelihood estimator for each entry in the tuning curve is simply the empirical mean of the spike counts for the dth neuron under the kth stimulus. However, the empirical mean estimates are noisy, especially when the number of trials for each stimulus is small, which can limit the PID decoder’s performance.

In principle, we could compensate for the noise by recording more trials for each stimulus, but this is expensive, particularly if the stimulus grid has a fine resolution. Instead, we propose to reduce error in the tuning curve estimates by exploiting our prior knowledge that tuning curves tend to be smooth with respect to orientation. We incorporate this knowledge into the model by placing an independent gaussian process prior over the log tuning curve of each neuron (Park et al., 2014; Rad & Paninski, 2010).

The resulting GP-regularized PID model is given by
$\log \lambda_d \sim \mathcal{GP}\left( 0, \kappa_{\theta_d} \right),$  (4.1)
$x_d \mid y = k \sim \mathrm{Poiss}\left( \lambda_d[k] \right),$  (4.2)
where xd is the spike response of neuron d, and λd[k] is the kth element of the tuning curve λd. Here, the log tuning curve has a gaussian process prior with zero mean and a covariance function κ(·,·) with hyperparameters θd.

4.1.1  Periodic Gaussian Covariance Function

Because the stimulus orientations live on a circular domain, we selected a periodic covariance function for the GP prior. In particular, we used a covariance based on the gaussian covariance function, also known as the radial basis function (RBF) or squared-exponential covariance function. To make it periodic, the covariance is wrapped around the circle as follows (Schölkopf & Smola, 2002):
$\kappa_{\theta_d}(j, k) = \rho_d \sum_{n = -\infty}^{\infty} \exp\left( -\frac{\left( \frac{2\pi}{K}(j - k) + 2\pi n \right)^2}{2 \ell_d^2} \right),$  (4.3)
where indices j, k ∈ {1,...,K} denote stimulus classes, and the hyperparameters θ_d = {ρ_d, ℓ_d} are the marginal variance ρ_d and length scale ℓ_d of the log tuning curve. The sum over integer multiples of 2π accomplishes wrapping, making the function periodic. When the length scale ℓ_d is much smaller than 2π, this function can often be evaluated accurately using a single term from the infinite sum. However, our implementation relies on a sparse Fourier-domain representation, meaning that we do not need to evaluate it in the orientation domain at all. (See appendix B for details.)
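For reference, a direct (non-Fourier) evaluation of this wrapped covariance might look like the sketch below, with the infinite sum truncated to a few wraps; the placement of the variance and length-scale parameters follows the standard squared-exponential convention and is our assumption.

```python
import numpy as np

def periodic_gaussian_cov(K, rho, ell, n_wraps=3):
    """Periodic (wrapped) gaussian covariance over K stimulus classes (eq. 4.3).

    rho is the marginal variance and ell the length scale (in radians).
    The infinite wrapping sum is truncated to n_wraps terms on each side,
    a numerical convenience assumed here; the paper instead works in a
    sparse Fourier-domain representation.
    """
    angles = 2 * np.pi * np.arange(K) / K
    diff = angles[:, None] - angles[None, :]              # pairwise angular differences
    cov = np.zeros((K, K))
    for n in range(-n_wraps, n_wraps + 1):
        cov += np.exp(-0.5 * (diff + 2 * np.pi * n) ** 2 / ell**2)
    return rho * cov
```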

The neuron-specific hyperparameters θ_d permit tuning curves to differ in amplitude and smoothness, so different neurons can be regularized differently. For neurons with nonexistent tuning or exceptionally noisy responses, the inference procedure will set the amplitude ρ_d to zero or the length scale ℓ_d to infinity, making the tuning curve flat (see Figure 5). Such neurons are effectively pruned from the data set, since flat tuning curves make no contribution to decoding. This effect, known as automatic relevance determination (MacKay, 1992; Neal, 1996), eliminates the need to manually filter out noisy or untuned neurons, which can be critical when using other decoding methods (Graf et al., 2011).

4.1.2  Fitting the GPPID to Spike Count Data

To fit the GPPID model to spike count data, we employ a two-step procedure known as empirical Bayes (Bishop, 2006). For each neuron, we first compute a point estimate of the neuron-specific tuning curve hyperparameters θ_d by maximizing the model evidence, also known as the marginal likelihood. Then, we find the maximum a posteriori (MAP) estimate of the tuning curve using the estimated hyperparameters θ̂_d. The model evidence for neuron d is the marginal probability of X_{*d} = (x_{1d},...,x_{Td}), the dth column of the spike count matrix X, given the hyperparameters θ_d and stimuli Y:
$P(X_{*d} \mid \theta_d, Y) = \int P(X_{*d} \mid w_d, Y)\, P(w_d \mid \theta_d)\, dw_d$  (4.4)
$= \int \left[ \prod_{t=1}^{T} \mathrm{Poiss}\left( x_{td}; \lambda_d[y_t] \right) \right] \mathcal{N}\left( w_d; 0, \kappa_{\theta_d} \right) dw_d,$  (4.5)

where wd=logλd is the log-tuning curve for neuron d, λd[yt] denotes the element of λd=exp(wd) corresponding to yt, the stimulus class for the tth time bin, and κθd denotes the K×K covariance matrix arising from evaluating the periodic gaussian covariance function (see equation 4.3) at all pairs of points in (1,...,K).

Because the integral in equation 4.5 is intractable, we use the standard approach of approximating it using Laplace’s method (Bishop, 2006). In the following, let $h(w_d) = \sum_{t=1}^{T} \log \mathrm{Poiss}\left( x_{td}; \lambda_d[y_t] \right) + \log \mathcal{N}\left( w_d; 0, \kappa_{\theta_d} \right)$ denote the sum of the log-likelihood and log-prior, and let H be the Hessian (second-derivative matrix) of h(w_d), evaluated at its maximizer, the MAP estimate ŵ_d. The approximate evidence is then given by
$P(X_{*d} \mid \theta_d, Y) \approx \exp\left( h(\hat{w}_d) \right) (2\pi)^{K/2} \left| -H \right|^{-1/2}.$  (4.6)

We compute point estimates of the model hyperparameters θ^d by optimizing this approximation with respect to θd using the Nelder-Mead algorithm (Nelder & Mead, 1965).

Once we have obtained the maximum-evidence estimate of the hyperparameters θ̂_d for neuron d, we compute the empirical Bayes estimate of the neuron’s log-tuning curve by computing the MAP estimate of w_d under the model given by equations 4.1 and 4.2,
$\hat{w}_d = \arg\max_{w_d} \sum_{t=1}^{T} \log \mathrm{Poiss}\left( X_{td}; \lambda_d[y_t] \right) + \log \mathcal{N}\left( w_d; 0, \kappa_{\theta_d} \right),$  (4.7)
where Poiss(Xtd;λd[yt]) is the probability of spike count Xtd given the firing rate λd[yt]=exp(wd[yt]) under a Poisson distribution (see equation 3.1).

To accelerate the optimization procedure, we use a Fourier-domain representation of the covariance matrix κ_{θ_d} based on the Karhunen-Loève expansion (Paciorek, 2007; Royle & Wikle, 2005; Wikle, 2002; see appendix B for details). Since the procedure described above can be performed independently for each neuron, fitting the GPPID model is fully parallelizable across neurons.
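The sketch below illustrates this empirical Bayes procedure for a single neuron: a Laplace approximation of the evidence, Nelder-Mead optimization of the hyperparameters, and a final MAP estimate of the log tuning curve. It works directly in the orientation domain using the periodic_gaussian_cov helper sketched earlier, rather than the Fourier-domain implementation the paper actually uses, and it drops constants that do not depend on the hyperparameters.

```python
import numpy as np
from scipy.optimize import minimize

def fit_gppid_neuron(x, y, K, jitter=1e-6):
    """Empirical Bayes fit of one neuron's log tuning curve in the GPPID (section 4.1.2).

    x : (T,) spike counts for one neuron; y : (T,) class labels in {0, ..., K-1}.
    Maximizes a Laplace approximation to the evidence over (log rho, log ell)
    with Nelder-Mead, then returns the MAP log tuning curve. A minimal sketch,
    assuming every class appears at least once in y.
    """
    n_k = np.bincount(y, minlength=K).astype(float)   # trials per class
    s_k = np.bincount(y, weights=x, minlength=K)      # summed counts per class

    def map_estimate(K_inv):
        def neg_post(w):   # negative log posterior (up to constants)
            return np.sum(n_k * np.exp(w) - s_k * w) + 0.5 * w @ K_inv @ w
        def grad(w):
            return n_k * np.exp(w) - s_k + K_inv @ w
        w0 = np.log(s_k / np.maximum(n_k, 1) + 0.1)   # crude init near empirical log rates
        return minimize(neg_post, w0, jac=grad, method="L-BFGS-B").x

    def neg_log_evidence(log_params):
        rho, ell = np.exp(log_params)
        Kmat = periodic_gaussian_cov(K, rho, ell) + jitter * np.eye(K)
        K_inv = np.linalg.inv(Kmat)
        w = map_estimate(K_inv)
        A = np.diag(n_k * np.exp(w)) + K_inv          # negative Hessian of the log posterior
        log_ev = (np.sum(s_k * w - n_k * np.exp(w))   # Laplace approximation (eq. 4.6)
                  - 0.5 * w @ K_inv @ w
                  - 0.5 * np.linalg.slogdet(Kmat)[1]
                  - 0.5 * np.linalg.slogdet(A)[1])
        return -log_ev

    res = minimize(neg_log_evidence, x0=np.log([1.0, 0.5]), method="Nelder-Mead")
    rho, ell = np.exp(res.x)
    Kmat = periodic_gaussian_cov(K, rho, ell) + jitter * np.eye(K)
    return map_estimate(np.linalg.inv(Kmat))          # MAP log tuning curve, shape (K,)
```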

4.2  The GP-Regularized Gaussian Independent Decoder (GPGID)

In the same way that we added a gaussian process prior to the Poisson independent decoder to obtain the GPPID decoder, we can add a GP prior to the gaussian independent decoder to obtain a GP-regularized gaussian independent decoder (GPGID). However, three salient differences set the gaussian model apart from the Poisson case. First, the parameters of interest in the gaussian model are the tuning curves, not the log-tuning curves that appeared in the Poisson case above. Second, there is an additional hyperparameter to learn in the gaussian case, corresponding to each neuron’s observation noise σ_d². Third, and most important, the gaussian model evidence can be computed in closed form, obviating the need for approximations.

The GPGID model for the dth neuron is defined by a GP prior over the tuning curve μd=(μ1d,...,μKd)T and a gaussian observation model,
$\mu_d \sim \mathcal{GP}\left( 0, \kappa_{\theta_d} \right),$  (4.8)
$x_d \mid y = k \sim \mathcal{N}\left( \mu_d[k], \sigma_d^2 \right),$  (4.9)
where x_d is the response of neuron d, μ_d[k] is the kth element of the tuning curve μ_d, and κ_{θ_d} is the periodic gaussian covariance function defined in equation 4.3. The model contains three hyperparameters for each neuron, θ_d = (ρ_d, ℓ_d, σ_d²), consisting of the marginal variance ρ_d and length scale ℓ_d of the GP prior, and the observation noise variance σ_d², which appears in the likelihood. As in the GID model (see section 3.1.2), we ensure that the naive Bayes classifier is linear by assuming that the observation noise variance is constant over classes (see appendix A).
To fit the model, we again use empirical Bayes. The first step is thus to compute a maximum-evidence estimate of the hyperparameters θd for each neuron,
$\hat{\theta}_d = \arg\max_{\theta_d} P\left( X_{*d} \mid \theta_d, Y \right)$  (4.10)
$= \arg\max_{\theta_d} \int \left[ \prod_{t=1}^{T} \mathcal{N}\left( x_{td}; \mu_d[y_t], \sigma_d^2 \right) \right] \mathcal{N}\left( \mu_d; 0, \kappa_{\theta_d} \right) d\mu_d$  (4.11)
$= \arg\max_{\theta_d} \mathcal{N}\left( X_{*d}; 0, \Sigma_{\theta_d} \right),$  (4.12)
where μd[yt] represents the tuning curve value for yt, the stimulus class on trial t, κθd is the K×K covariance matrix induced by the GP prior, and Σθd=κ(Y,Y)+σd2I denotes a T×T covariance matrix whose tth element along the main diagonal is κθd(yt,yt)+σd2 and off-diagonal element (s,t) is equal to κθd(ys,yt). Because this objective function can be expressed analytically, we perform the maximization using a trust-region Newton method.
Once we have optimized equation 4.12 to find θ^d, we compute the MAP estimate of the tuning curve given these hyperparameters, which has an analytic solution given by
$\hat{\mu}_d = \arg\max_{\mu_d} P\left( \mu_d \mid X_{*d}, Y, \hat{\theta}_d \right)$  (4.13)
$= \tilde{\Sigma}_{\theta_d} \Sigma_{\theta_d}^{-1} X_{*d},$  (4.14)
where Σ̃_{θ_d} is a K×T matrix whose (k,t)th element is given by κ_{θ_d}(k, y_t), for k ∈ {1,...,K} (Rasmussen & Williams, 2006). However, for scalability, we use an equivalent formula leveraging the same spectral weight representation as in the GPPID. (See appendix B for details.)
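Because the gaussian evidence is available in closed form, the per-neuron fit can be sketched compactly, as below. We optimize the hyperparameters with a generic quasi-Newton routine rather than the trust-region Newton method mentioned above, and we work with the T×T covariance directly instead of the spectral representation; both substitutions are ours.

```python
import numpy as np
from scipy.optimize import minimize

def fit_gpgid_neuron(x, y, K, jitter=1e-6):
    """Empirical Bayes fit of one neuron's tuning curve in the GPGID (section 4.2).

    x : (T,) real-valued responses; y : (T,) class labels in {0, ..., K-1}.
    The evidence is a zero-mean gaussian with covariance kappa(Y, Y) + sigma^2 I
    (eq. 4.12), so it is maximized directly; the posterior mean of eq. 4.14
    then follows in closed form. Relies on the periodic_gaussian_cov helper
    sketched earlier.
    """
    T = len(x)

    def neg_log_evidence(log_params):
        rho, ell, noise = np.exp(log_params)
        Kmat = periodic_gaussian_cov(K, rho, ell) + jitter * np.eye(K)
        Sigma = Kmat[np.ix_(y, y)] + noise * np.eye(T)       # T x T marginal covariance
        _, logdet = np.linalg.slogdet(Sigma)
        return 0.5 * (logdet + x @ np.linalg.solve(Sigma, x) + T * np.log(2 * np.pi))

    res = minimize(neg_log_evidence, x0=np.log([1.0, 0.5, 1.0]), method="L-BFGS-B")
    rho, ell, noise = np.exp(res.x)
    Kmat = periodic_gaussian_cov(K, rho, ell) + jitter * np.eye(K)
    Sigma = Kmat[np.ix_(y, y)] + noise * np.eye(T)
    cross = Kmat[:, y]                                       # K x T matrix of kappa(k, y_t)
    return cross @ np.linalg.solve(Sigma, x)                 # posterior-mean tuning curve, shape (K,)
```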

4.3  The Gaussian Process Multiclass Decoder (GPMD)

In this section, we introduce the gaussian process multiclass decoder (GPMD), which is multinomial logistic regression with a GP prior placed over the weights for each neuron (see Figure 2). As in section 3.2.1, the multinomial logistic regression model can be written as
$P(y = k \mid x) = \frac{1}{Z} \exp\left( w_k^\top x + b_k \right),$  (4.15)
where wk is a vector containing the decoding weights for class k, bk is the offset for class k, and Z is the normalizing constant.
We regularize the weight matrix by placing an independent zero-mean gaussian process prior on each of its rows:
$W_{d*} \sim \mathcal{N}\left( 0, \kappa_{\theta_d} \right).$  (4.16)
Here Wd* is the dth row of W, which contains the decoding weights associated with neuron d across all stimuli, and κθd is the periodic gaussian covariance function defined in equation 4.3.

In the GPPID and GPGID, we used a GP prior to formalize our prior knowledge that the neuron’s log-tuning curves, and thus its tuning curves, tend to be smooth. In the GPMD, the model weights have no direct interpretation in terms of tuning curves. However, we can still motivate the use of a GP prior by observing that decoding weights also ought to vary smoothly with orientation, since tuning varies smoothly as a function of orientation.

Figure 2:

The GPMD model. A smoothing gaussian process prior is applied to each row of a logistic regression weight matrix W to shrink the weights and ensure that they vary smoothly across stimuli, mimicking the smoothness of neural tuning curves. The weight matrix is then linearly combined with the neuron’s response vector x to produce unnormalized scores for each decoding class. A softmax function is used to transform the unnormalized scores into probabilities.


Like previous decoders, the GPMD has neuron-specific hyperparameters, which allow different neurons to have decoding weights with different amplitudes and different amounts of smoothness. This flexibility has two benefits: first, it allows each neuron’s weights to adapt to the neuron’s response properties, and second, it automatically discards untuned or noisy neurons as described in section 4.1, eliminating the need for manual data set preprocessing.

To fit the GPMD, we use variational inference to simultaneously learn both a posterior estimate for the weights {W,b} and point estimates for the prior hyperparameters. Specifically, given an approximate posterior family q—which we choose to be mean-field gaussian—indexed by parameters φ and prior hyperparameters θ, we maximize the evidence lower bound (ELBO; see Blei et al., 2017; Hoffman et al., 2013) jointly with respect to φ and θ:
$\mathcal{L}(\varphi, \theta) = \mathbb{E}_{q_\varphi}\left[ \log P(\mathcal{D} \mid W, b) \right] - \mathrm{KL}\left( q_\varphi(W, b)\, \|\, P(W, b \mid \theta) \right).$  (4.17)
To calculate the likelihood term, we draw M = 3 samples $\{(W^{(m)}, b^{(m)})\}_{m=1}^{M}$ from the variational posterior q and use these to compute a Monte Carlo approximation of the expectation:
$\mathbb{E}_{q}\left[ \log P(\mathcal{D} \mid W, b) \right] \approx \frac{1}{M} \sum_{m=1}^{M} \sum_{t=1}^{T} \left[ w_{y_t}^{(m)\top} x_t + b_{y_t}^{(m)} - \log Z_t^{(m)} \right],$  (4.18)
where Z is the normalizing constant defined in equation 4.15. In principle, we could also calculate this approximation using a subset of the data, or “minibatch” (Hoffman et al., 2013), but our data sets are small enough that this is not necessary, and doing so increases the approximation variance.
The KL-divergence term expands to a sum of KL divergences for each row of the weight matrix, since each row is independent of the others. Assuming q1,...,qD are the mean-field variational distributions for each row of the weight matrix and p1,...,pD are the associated GP priors, the KL-divergence term reduces to
$\mathrm{KL}\left( q(W)\, \|\, p(W) \right) = \sum_{d=1}^{D} \mathrm{KL}\left( q_d\, \|\, p_d \right).$  (4.19)
Each of the summands is easy to calculate analytically, since q_d and p_d are both multivariate normal distributions. The approximate variational posterior q_d is a diagonal normal distribution, and the GP prior p_d for each row is the multivariate normal distribution N(0, κ_{θ_d}), where, as in the GPPID model, κ_{θ_d} is the K×K prior covariance matrix induced by the GP prior.

To make predictions, we approximate MAP inference by using the mode of the posterior approximation q as a point estimate of the weights W and b in the multinomial logistic regression model (see equation 4.15). Since we are interested only in the effects of the prior, and not in using the posterior to assess prediction uncertainty, this approach suffices. In practice, we found that setting the offset b to 0 made no significant difference in performance (see Figure S7), likely because all our data sets were reasonably high-dimensional.
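To make the objective concrete, the sketch below evaluates the ELBO of equations 4.17 to 4.19 once, for a mean-field gaussian posterior over the weights (intercepts set to zero, as discussed above). It is an illustration of the objective only; the actual fitting procedure optimizes this quantity with stochastic gradients on a GPU.

```python
import numpy as np
from scipy.special import logsumexp

def gpmd_elbo(X, y, mean_W, log_std_W, K_priors, rng, n_samples=3):
    """One Monte Carlo evaluation of the GPMD evidence lower bound (eqs. 4.17-4.19).

    X : (T, D) responses; y : (T,) class labels; mean_W, log_std_W : (D, K)
    variational parameters of the mean-field posterior over the weights;
    K_priors : list of D prior covariance matrices (one K x K matrix per neuron).
    The intercept is omitted here (set to zero). A sketch of the objective only.
    """
    T, D = X.shape
    K = mean_W.shape[1]
    std_W = np.exp(log_std_W)

    # Monte Carlo estimate of the expected log likelihood (eq. 4.18).
    loglik = 0.0
    for _ in range(n_samples):
        W = mean_W + std_W * rng.standard_normal((D, K))   # sample weights from q
        scores = X @ W                                     # (T, K) unnormalized log probs
        loglik += np.sum(scores[np.arange(T), y] - logsumexp(scores, axis=1))
    loglik /= n_samples

    # Analytic KL divergence, one term per neuron (eq. 4.19).
    kl = 0.0
    for d in range(D):
        Kd = K_priors[d]
        Kd_inv = np.linalg.inv(Kd)
        m, v = mean_W[d], std_W[d] ** 2
        kl += 0.5 * (np.trace(Kd_inv * v)                  # tr(K^-1 diag(v)) via elementwise product
                     + m @ Kd_inv @ m
                     - K
                     + np.linalg.slogdet(Kd)[1]
                     - np.sum(np.log(v)))
    return loglik - kl
```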

4.3.1  Scaling GPMD Inference

When maximizing the ELBO, there are two scaling concerns: the number of examples (T) and the number of neurons (D). It is well known that the ELBO can be scaled to huge numbers of examples by estimating the likelihood term using a minibatch approximation (Hoffman et al., 2013). However, even when using a minibatch approximation, the KL-divergence term must be calculated at each gradient step, and it costs O(DK³) to evaluate (see equation 4.19). For large values of D, which are expected in high-dimensional neural data sets, evaluating the KL-divergence term makes stochastic gradient descent far too slow.

We solve this problem by representing W using a basis Ψ ∈ ℂ^{K×M}, that is, W = ΨU, where U is an M×D matrix of basis coefficients. Then we place an independent normal prior on each entry of U. This allows the KL-divergence term to be evaluated with O(DM) complexity, since it becomes a KL divergence between two independent normal distributions. The only difficulty is choosing Ψ such that W turns out to have the desired gaussian process distribution when U has independent normal entries.

It can be shown that the appropriate choice of Ψ is the unitary Fourier basis (see appendix B). With this basis, the entry U_{id}, the element in the dth column of U corresponding to Fourier frequency i, must satisfy two conditions. The first condition is conjugacy, U_{id} = U*_{-i,d}, which ensures that W is real. The second is on the distribution, which must be a zero-mean complex normal with variance
(4.20)
This spectral formulation assumes the stimuli lie on [0,2π], but it can be trivially extended to any domain.

5.1  Evaluation and Performance

We benchmarked each decoder by calculating its mean absolute test error on the monkey (Graf et al., 2011), ferret (this article), and mouse (Stringer et al., 2021) data sets (see Figure 3). We chose to estimate the error using five-fold cross-validation repeated 10 times, since 10-fold cross-validation produced data sets that had too few examples per class to train some models. Error bars were computed using twice the standard error of the mean of the cross-validation scores (±2σ/√n). We examined five monkey data sets, one ferret data set, and three mouse data sets. Figure 3A reports the average scores for each animal; separate scores are reported in Figure S4. Note that the GID decoder has two variants: the standard formulation, which has a quadratic decision boundary, and the formulation described in section 3.1.2, which has a linear decision boundary.
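A sketch of this evaluation protocol is shown below. The wraparound error convention (absolute angular difference on a 360° circle, so direction confusions map to 180°) and the use of stratified folds are our reading of the setup, and the model object is any hypothetical decoder exposing scikit-learn-style fit and predict methods.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

def circular_abs_error_deg(y_true, y_pred, K):
    """Absolute angular error in degrees between class labels on a 360-degree circle.

    Errors wrap around, so the maximum possible error is 180 degrees
    (correct orientation decoded with the wrong drift direction).
    """
    delta = np.abs(y_true - y_pred) * (360.0 / K)
    return np.minimum(delta, 360.0 - delta)

def cross_validated_error(model, X, y, K, seed=0):
    """Mean absolute test error from five-fold cross-validation repeated 10 times."""
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=seed)
    scores = []
    for train, test in cv.split(X, y):
        model.fit(X[train], y[train])
        err = circular_abs_error_deg(y[test], model.predict(X[test]), K)
        scores.append(err.mean())
    # Return the mean score and roughly twice its standard error.
    return np.mean(scores), 2 * np.std(scores) / np.sqrt(len(scores))
```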

The rank ordering of the models remained largely consistent across data sets. In general, the correlation-blind decoders (the PID and GID) performed worse than the correlation-aware decoders, which is consistent with previous decoding studies (Graf et al., 2011; Stringer et al., 2021). Their regularized variants (the GPPID and GPGID) performed better but still did not match the performance of the best correlation-aware decoders. The GPMD matched state-of-the-art performance on all data sets. An important advantage of its Bayesian approach is that hyperparameters are learned automatically, which allowed the GPMD to adapt to the conditions present in different data sets. By contrast, models that set hyperparameters manually exhibited occasional poor performance—for example, the SND performance on the monkey data sets.

Figure 3:

(A) Test error on monkey (Graf et al., 2011), ferret, and mouse (Stringer et al., 2021) data sets, estimated using five-fold cross-validation repeated 10 times. Since there are five monkey and three mouse data sets, we trained the models on each data set separately and averaged the scores. We did not train the PID and GPPID decoders on the ferret and mouse data sets, which contain real-valued calcium data, because they assume integer features. Error bars are ±2σ, where σ is the standard error of the mean. (B) The distribution of errors for each decoder, using the third monkey data set. The GPMD had more mass concentrated around 0° error (correct classification) and a narrower error distribution than the other decoders. One noteworthy feature of all decoders is a preponderance of errors at 180°, indicating an estimate with the correct orientation but incorrect drift direction. The GPMD nevertheless exhibited this error less frequently than the other decoders. (C) Empirical CDF of errors on the third monkey data set. The GPMD classified higher fractions of the data set at lower errors than the other decoders, demonstrating its superior performance.


These results could have been obscured by the choice of error metric. For example, repeating the same benchmark using “proportion correct” instead of mean absolute error improved the performance of the ELD substantially (see Figure S3), qualitatively replicating the results of Graf et al. (2011). To ensure that our results were not artifacts of the error metric, we used the empirical error cumulative distribution function to characterize each decoder’s errors in more detail (see Figure 3C). Good decoders should classify higher fractions of the data set at lower errors, producing curves that are shifted upward from those of poorer decoders. We found that the GPMD outperformed or matched all the other decoders on all the data sets (see Figure S5).

Our results show that both regularization and exploiting correlations improved decoding performance substantially. The regularized correlation-blind decoders, the GPPID and GPGID, outperformed their unregularized analogues, the GID and PID. The GLMNET decoder, which is correlation aware, outperformed the correlation-blind GPPID and GPGID. Finally, the SND and GPMD, both regularized and correlation aware, outperformed all other decoders.

The relative impact of these strategies depended on data set dimensionality. All of the decoders performed better on the higher-dimensional mouse data sets than on the mid- and low-dimensional ferret and monkey data sets. This is because as dimensionality increased, the class distributions became farther apart, making the problem easier to solve (see Figure S2). Increased neural selectivity was correlated with decoding performance (see Figure S11). Within each data set, decoders performed differently depending on how they handled regularization and correlations. For small data sets, such as the monkey data set with approximately 150 neurons, both regularization and exploiting correlations had a substantial effect. For example, adding regularization to the GID (using the GPGID) decreased its mean absolute error by 16.3 degrees, and exploiting correlations (using the GPMD) decreased error by another 9.4 degrees. For high-dimensional data sets where it was easy to overfit, such as the mouse data set, which was recorded from approximately 20,000 neurons, regularization became the most important strategy. In fact, on the mouse data set, the regularized correlation-blind GPGID did just as well as some of the correlation-aware decoders.

5.2  Scaling with Data Set Size and Dimensionality

To characterize the GPMD’s performance and training times with respect to data set size and dimensionality, we performed ablation studies on both the number of training examples and the number of neural features (see Figure 4).

The GPMD performed well at all training set sizes (see Figure 4A), implying that its prior assumptions encouraging smooth and small decoding weights were well calibrated—that is, strong enough to permit good performance with few training examples but flexible enough to allow continued learning with many training examples. We believe that the nonparametric nature of the gaussian process prior is the key component that allows our model to accommodate the complex structure present in larger neural data sets. Models with stronger assumptions, such as the GPGID, which assumes independence, or the SND, which has many hard-coded parameters, had difficulty learning from increasing numbers of training examples.

Figure 4:

(A) Cross-validated model performance with varying amounts of training data. We chose random subsets of the third monkey data set. The GPMD continues to benefit from increasing training data and does not exhibit asymptotic behavior like most of the other models. (B) Training times for the ablation study in panel A. Like all the other models, the GPMD shows essentially constant training times. This is because the monkey data set is small enough that training cost is dominated by constant factors. (C) Cross-validated model performance with varying numbers of neurons (features). We chose random subsets of the first mouse data set. The GPMD’s careful regularization avoids undesirable double-descent characteristics while achieving state-of-the-art performance for all feature sizes. (D) Training times for the ablation study in panel C. The GPMD is nearly one order of magnitude faster than logistic regression, a less sophisticated model, and trains in a few minutes on the largest data sets.


The GPMD also performed well with any number of neural features (see Figure 4C). Linear decoders with no or poor regularization, such as the GID and ELD, did not exhibit this property; in fact, their performance became worse as the number of neural features increased from the “classical” to the “interpolating” regime, producing a phenomenon known as the “double descent” error curve (Belkin et al., 2019). Properly regularized models such as the GPGID and GPMD did not display this phenomenon and gave accurate performance estimates for all numbers of neural features.

Thanks to the GPMD’s approximate inference, GPU acceleration, and spectral weight representation, it trained quickly, producing fast cross-validated error estimates that exhibited favorable scaling with respect to both observations and neurons (see Figures 4B and 4D). For the largest data set with 20,000 neurons, it took 131 ± 0.82 seconds to train (roughly 20 minutes of wall-clock time) for a 10-fold cross-validation estimate. By comparison, a performance-tuned GLMNET model took 618 ± 6.40 seconds to train (roughly 1 hour and 45 minutes of wall-clock time) for the same estimate. Given the training time trends shown in the training-set size ablation (see Figure 4B) and neural feature ablation (see Figure 4D) studies, we expect the GPMD to handle even larger data sets without difficulty.

Scaling to large data sets was further enhanced by the GPMD’s automatic data set preprocessing. Decoding studies, such as Graf et al. (2011), often select only strongly tuned neurons for decoding, since noisy neurons make it easier for models to overfit. Manual selection rules have two disadvantages: first, they may ignore neurons that look noisy but actually carry information, and second, they can require prohibitive amounts of time if human input is needed (e.g., for choosing initialization points for nonlinear curve fitting).

The GPMD’s Bayesian formulation automatically discarded noise neurons by setting their prior amplitudes to zero (see Figure 5), a phenomenon known as automatic relevance determination (MacKay, 1992; Neal, 1996). Examples of tuning curves (estimated using the sample average conditional on the stimulus) from automatically discarded and automatically retained neurons are shown in Figure 5. Some of the automatically retained neurons displayed the bimodal “gaussian bump” structure commonly sought by manual selection rules. Others displayed more complicated tuning patterns that would likely be ignored by a manual selection rule.

Our implementation of the empirical linear decoder (ELD; see Graf et al., 2011) replicated the original paper’s results only qualitatively, not quantitatively. Our implementation of the ELD did outperform the Poisson independent decoder (PID) when using the “proportion correct” error criterion, as in the original paper (see Figure S3). However, it did not achieve the performance reported in the original paper. Because our implementation of the PID, a very simple decoder, also did not match the performance of the PID in Graf et al. (2011), we believe the discrepancy was caused by data preprocessing. We were not able to replicate the data preprocessing steps described in Graf et al. (2011) precisely since the original code has been lost.

Figure 5:

(A) GPMD hyperparameter distributions on the third monkey data set. The distribution of amplitudes has a peak near zero, and the distribution of length scales has a second mode near 45, indicating an automatic relevance determination (ARD) effect. (B) Tuning curves for the ARD-eliminated neurons, estimated using the sample average of each neuron’s spike counts conditioned on the stimulus. These neurons have amplitudes near zero and, often, long length scales. (C) Tuning curves for the neurons that were not eliminated. These tuning curves were also estimated using the sample average. These neurons have positive amplitudes and much shorter length scales. Some have simple “gaussian bump” tuning curves, whereas others have more complex tuning characteristics.


5.3  Visualizing the Impact of Correlations on Decoding

As described in section 4, each row of the weight matrix W for each decoder forms a vector of decoding weights over orientation for a single neuron. Example decoding weights from a correlation-blind (GPPID for monkey, GPGID for mouse and ferret) and correlation-aware (GPMD) decoder pair are shown in Figure 6A. In general, the GPMD produces smoother weights, and unlike the GPPID/GPGID, it occasionally places large weights on stimuli with low expected response (e.g., ferret neuron 82 and mouse neuron 87). To verify that the GPMD’s decoding weights were usually smoother, we compared the inferred prior length scales for each decoder pair (see Figure 6B) and found that the GPMD learned longer length scales (sign test, p < 0.001 for each animal). We discarded decoding weights with ℓ2 norm less than 0.001, since nearly zero weights indicated a neuron discarded by the learning process, which would have uninformative length scales.

Figure 6:

(A) Inferred decoding weights from naive Bayes (GPPID/GPGID) and GPMD model decoders for five example neurons from monkey, ferret, and mouse data sets. (B) Comparison of inferred prior length scales for each decoder. The GPMD learns longer length scales (sign test, p<0.001 for each animal). (C) Comparison of decoding weight norms for each decoder. The GPMD learns decoding weights with smaller norms (sign test, p<0.001 for each animal). (D) Comparison of test-set predictive distribution entropies for each decoder. The GPMD produces distributions with larger entropies (sign test, p<0.001 for each animal), indicating that it reports more uncertainty.


In nearly all cases, the GPPID/GPGID decoders learned decoding weights with larger ℓ2 norm than the GPMD (see Figure 6C; sign test, p < 0.001 for each animal). Since the norm of the decoding weights is directly related to the predicted probability for each class (see equation 2.1), this indicates that the GPPID/GPGID are more confident in their predictions than the GPMD. To measure this directly, we chose a test set for each data set and compared the predictive distribution entropies of the GPPID/GPGID and GPMD decoders (see Figure 6D). We found that the GPMD produced predictive distributions with higher entropy (sign test, p < 0.001 for all animals), indicating that its predictions are less confident. This confirms the pattern seen in Figure 6C.
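As a concrete illustration of this comparison, the following sketch computes the per-trial entropy of each decoder’s predictive distribution and applies a sign test to the paired values. The arrays probs_blind and probs_gpmd (one row of class probabilities per test trial) are hypothetical inputs; this is a minimal illustration of the analysis described above, not the code used for the figures.

import numpy as np
from scipy.stats import binomtest

def predictive_entropy(probs, eps=1e-12):
    """Entropy (in nats) of each row of an (n_trials, n_classes) array of class probabilities."""
    p = np.clip(probs, eps, 1.0)
    return -(p * np.log(p)).sum(axis=1)

def sign_test(x, y):
    """Two-sided sign test on paired samples, discarding ties."""
    diff = np.asarray(x) - np.asarray(y)
    diff = diff[diff != 0]
    return binomtest(int((diff > 0).sum()), n=len(diff), p=0.5).pvalue

# probs_blind, probs_gpmd: hypothetical (n_trials, 72) arrays of predicted class probabilities.
# p_value = sign_test(predictive_entropy(probs_gpmd), predictive_entropy(probs_blind))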

To further characterize the effects of correlations on decoding performance, we visualized population responses projected onto 2D subspaces defined by the decoding weights of correlation-blind (GPPID) and correlation-aware (GPMD) decoding models (see Figure 7A). To reduce the 147-dimensional response vectors obtained from a population of 147 monkey V1 neurons, we first selected a model to visualize and two stimulus classes i and j. We then formed a two-dimensional basis by orthogonalizing the decoding weight vectors Wi* and Wj*, producing a basis matrix S whose columns are orthogonal unit vectors. We plotted the data for classes i and j, along with the model separatrix, by projecting onto this basis. The model separatrix is given by the solution of xᵀu + c = 0, where u = Sᵀ(Wj* − Wi*) and the constant offset c is bj − bi. We estimated the approximate separatrix for the model that did not produce the basis by projecting its normal vector into the subspace defined by S and setting the constant offset to −(mean(Xi u) + mean(Xj u))/2, where Xi and Xj are the data for classes i and j.
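The projection just described reduces to a few lines of linear algebra. The sketch below, with hypothetical inputs W (weight matrix), b (intercepts), and class-conditional response matrices X_i and X_j, builds the orthonormal basis S, projects the data, and computes both the exact separatrix for the model that produced the basis and the approximate separatrix for the other model; it illustrates the procedure above rather than reproducing the original analysis code.

import numpy as np

def projection_basis(W, i, j):
    """Orthonormal basis (D x 2) spanning the decoding rows for classes i and j (Gram-Schmidt)."""
    u1 = W[i] / np.linalg.norm(W[i])
    r = W[j] - (W[j] @ u1) * u1
    return np.stack([u1, r / np.linalg.norm(r)], axis=1)

def exact_separatrix(W, b, i, j, S):
    """Normal vector u and offset c of the separatrix x^T u + c = 0 in the projected space."""
    return S.T @ (W[j] - W[i]), b[j] - b[i]

def approx_separatrix(W_other, i, j, S, Zi, Zj):
    """Approximate separatrix for the model that did not produce the basis S."""
    u = S.T @ (W_other[j] - W_other[i])           # project its normal vector into the subspace
    c = -(np.mean(Zi @ u) + np.mean(Zj @ u)) / 2  # midpoint offset between the projected classes
    return u, c

# Example usage with hypothetical inputs (i, j index the 25- and 30-degree classes):
# S = projection_basis(W_gpmd, i, j)
# Zi, Zj = X_i @ S, X_j @ S                       # project the class-conditional responses
# u_gpmd, c_gpmd = exact_separatrix(W_gpmd, b_gpmd, i, j, S)
# u_gppid, c_gppid = approx_separatrix(W_gppid, i, j, S, Zi, Zj)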

Figure 7:

(A) To investigate the impact of correlations on decoding performance, we projected the 25- and 30-degree data classes from the third macaque data set to two dimensions using orthogonalized bases derived from the 25- and 30-degree decoding vectors of the correlation-blind GPPID and correlation-aware GPMD. The data displayed significant correlations in the GPPID basis, making the GPPID separatrix a poor decision boundary. The GPPID separatrix performed well only when correlations were removed from the data set using a Poisson surrogate model. In the GPMD basis, the difference between the correlated and Poisson surrogate data sets was much less pronounced, indicating that the GPMD’s projection decorrelated the data somewhat. (B) Class correlation ellipses plotted using the GPPID and GPMD bases derived from the zero- and five-degree decoding vectors. The GPMD basis produced far superior class separation.


We carried out this procedure for both the GPPID and GPMD models, since each two-dimensional basis could exactly represent only the separatrix from its source weight matrix. Figure 7A shows our visualization for the 25/30-degree class pair; four additional class pairs are shown in Figure S9, and all exhibited a similar pattern.

We first sought to determine whether the data deviated significantly from the independent Poisson model assumed by the GPPID. To do this, we generated a simulated data set from a conditionally independent Poisson model with means given by the tuning curves of the GPPID model (referred to in the figure as the “Poisson surrogate” data set). Compared to the real data, the Poisson surrogate data exhibit much smaller variance and tilt relative to the basis vectors, showing that the correlations in the real data can significantly affect decoding. However, in the GPMD basis, the differences between the real and surrogate data sets are much less pronounced, implying that the GPMD’s weight matrix identifies a linear projection that decorrelates and reduces the variance of the projected data. We found the same results using a decorrelated data set created by shuffling neural responses across trials (see Figure S8).
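Both surrogate data sets described here are simple to construct. The sketch below assumes a hypothetical array tuning of shape (n_classes, n_neurons) holding class-conditional mean responses (e.g., the GPPID tuning curves), a response matrix X, and an integer label vector labels; it is illustrative only.

import numpy as np

rng = np.random.default_rng(0)

def poisson_surrogate(tuning, labels):
    """Conditionally independent Poisson responses with class-conditional means (no noise correlations)."""
    return rng.poisson(tuning[labels])            # (n_trials, n_neurons)

def shuffle_within_class(X, labels):
    """Remove noise correlations by permuting each neuron's responses independently within each class."""
    X_shuffled = X.copy()
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        for n in range(X.shape[1]):
            X_shuffled[idx, n] = X[rng.permutation(idx), n]
    return X_shuffled

# surrogate_poisson = poisson_surrogate(tuning, labels)
# surrogate_shuffled = shuffle_within_class(X, labels)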

Next, we plotted the class separatrices along with the data. The separatrix given by the GPPID successfully separated the Poisson surrogate data but failed to separate the real data set because of correlation-induced distortions. However, as expected, the GPMD separatrix successfully took the data’s correlations into account.

To visualize how the entire set of 72 classes related to each other, we plotted each class’s correlation ellipse in the basis given by the zero- and five-degree basis vectors from each model (see Figure 7B). The choice of basis vectors was arbitrary (see Figure S10 for other choices). The GPPID’s basis did a poor job of separating the classes, but the GPMD’s basis separated them fairly well. In the GPMD’s basis, the ellipses from classes 180 degrees apart appear in nearly identical locations, confirming that the GPMD identified grating angles more precisely than grating drift direction, a phenomenon previously observed in our performance benchmarks (see Figure 3C).

Linear decoders provide a powerful tool for characterizing the information content of neural population responses. While all linear decoders share a common mathematical form, differences in fitting methods—namely, in regularization and the incorporation of noise correlations—can lead to large differences in performance.

Previous studies, such as Graf et al. (2011) and Stringer et al. (2021), have used correlation-blind and correlation-aware linear decoders to investigate the effect of correlations on decoding performance. In all cases, they found that correlation-aware decoders outperform correlation-blind decoders. However, the performance differences could have been due to the lack of regularization in the correlation-blind decoders, since maximum likelihood weight estimates in the sample-poor data regimes typically studied are often subject to overfitting.

In this article, we have presented a suite of new decoders that use gaussian processes for regularization. Two of these methods correspond to naive Bayes decoders under Poisson and gaussian noise models, respectively, which do not incorporate correlations. The third method, the GPMD, corresponds to a decoding model with a GP prior over its weights that does take account of noise correlations. We found that the GPMD matched or outperformed all other decoders on data sets from monkey, ferret, and mouse visual cortex. Furthermore, we showed that the GPMD scales to the very largest data sets using a combination of approximate Bayesian inference, spectral methods, and GPU acceleration.

Our results show that the performance gap between correlation-blind and correlation-aware decoders is fundamental and not merely an artifact of regularization. The correlation-aware decoders consistently outperformed even the regularized correlation-blind decoders, though regularization did narrow the gap significantly (as shown in Figure 3). This confirms the results of previous studies, and we conclude that exploiting neural correlations can significantly improve decoding performance (though this result may depend on whether the animal is anesthetized; see Yates et al., 2020). Visualizations of the decoding separatrices produced by each decoder indicate that the real data sets differ significantly from the assumptions made by correlation-blind decoders. The correlation-aware decoders discovered low-dimensional subspaces that decorrelated the responses, making the transformed data match independence assumptions more closely.

One major caveat to our results is that they cannot tell us whether the brain itself uses decoders that are correlation-blind or correlation-aware. Animal behavior may rely on different populations of neurons than the ones recorded in our experimental data sets. Animals might achieve similar performance to the GPMD by reading out a similar neural population optimally or a larger neural population suboptimally (Chen et al., 2006; Jacobs et al., 2009; Morais et al., 2022). As in previous studies (Stringer et al., 2021), we found that in the largest recordings from mice, the decoding performance (1.3 degrees) substantially exceeded current estimates of the discrimination ability of the animal (roughly 9 degrees; Lyamzin et al., 2021). This suggests that the animal’s behavioral performance may be primarily limited by the decoding strategies implemented by downstream regions, and not by the information present in V1 itself (Beck et al., 2012).

The prior assumption of smoothness governs the applicability of the GPMD and other GP-regularized decoders we developed to other brain regions and tasks. If the neural population of interest has smoothly varying tuning curves and the task can be solved using a linear decision boundary, the GPMD will perform well. Although our decoding task used angular stimuli and a periodic distance measure, the GPMD could be trivially extended to stimuli that vary smoothly on an interval of the real line by using the ordinary Euclidean distance. Beyond smoothness, we have made no assumptions about the population of neurons to be decoded, and the GPMD naturally “prunes” uninformative neurons (e.g., neurons untuned to the stimulus) by shrinking their decoding weights to zero.
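To make the distinction concrete, the sketch below contrasts a generic periodic covariance for angular stimuli with a squared-exponential covariance using the ordinary Euclidean distance for stimuli on an interval. The kernel forms, grids, and hyperparameter values are illustrative; the GPMD’s actual prior is specified earlier in the article.

import numpy as np

def periodic_kernel(theta_a, theta_b, amplitude, length_scale, period=360.0):
    """Exp-sine-squared covariance: smooth weights that wrap around the stimulus circle."""
    d = np.pi * np.abs(theta_a[:, None] - theta_b[None, :]) / period
    return amplitude**2 * np.exp(-2.0 * np.sin(d) ** 2 / length_scale**2)

def rbf_kernel(x_a, x_b, amplitude, length_scale):
    """Squared-exponential covariance with ordinary Euclidean distance for stimuli on an interval."""
    d = x_a[:, None] - x_b[None, :]
    return amplitude**2 * np.exp(-0.5 * (d / length_scale) ** 2)

theta = np.arange(0.0, 360.0, 5.0)          # angular stimuli on a 5-degree grid
K_angular = periodic_kernel(theta, theta, amplitude=1.0, length_scale=0.5)

x = np.linspace(0.0, 1.0, 72)               # stimuli varying smoothly on an interval
K_interval = rbf_kernel(x, x, amplitude=1.0, length_scale=0.1)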

In the future, similar decoders could be used to empirically characterize how the peak and high-slope regions of tuning curves contribute to optimal decoding strategies in real neural populations (for theoretical studies, see Butts & Goldman, 2006, and Yarrow et al., 2012). We also believe that the performance of the decoders could be further improved. In particular, the performance of the correlation-aware decoder could likely be improved by modeling additional structure in the neural data set—for example, the differences in response characteristics between excitatory and inhibitory neurons, or temporal structure (which would likely improve the classification of direction). The correlation-blind decoder could also be improved by using a generative model that can handle non-Poisson characteristics of neural data such as overdispersion (Charles et al., 2018; Gao et al., 2015; Goris et al., 2014; Gur et al., 1997; Stevenson, 2016). We are optimistic that future work will use and extend the decoding methods we have introduced to obtain new insights into the structure of neural codes.

Under appropriate assumptions, both the GID and the PID decoders have linear decision boundaries. To derive both decision boundaries at the same time, we consider the more general case of a naive Bayes decoder as described in section 3.1.1, but with an exponential family likelihood (Bishop, 2006). That is, the likelihood of the dth element of the feature vector xd can be written as
(A.1)

where η is the natural parameter and the sufficient statistic u: ℝ → ℝ is a function of xd.
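Equation A.1 did not survive the conversion of this document. A standard single-parameter exponential-family form consistent with these definitions (cf. Bishop, 2006), shown here as a reconstruction rather than a verbatim copy of the original equation, is

P(x_d \mid \eta) \;=\; h(x_d)\, g(\eta)\, \exp\!\bigl(\eta\, u(x_d)\bigr).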

With this likelihood in mind, we can begin solving for the decision boundary. Our class prediction ŷ for a given example x is
(A.2)
(A.3)
If we assume that the prior probabilities are constant, that is, P(y=k) = P(y=k′) for every k, k′ ∈ {1, ..., K}, then this simplifies to
(A.4)
Introducing a log under the argmax, dropping terms that don’t depend on k, and substituting P(x | y=k) = ∏_{d=1}^D P(xd | y=k), we simplify further:
(A.5)
(A.6)
Writing out the exponential family form, we have
(A.7)
(A.8)
(A.9)
This will simplify to the form of a linear decoder as long as the sufficient statistic of the exponential distribution is of the form u(xd) = αxd, where α is a scalar. In that case, the entries of the weight matrix are given by the natural parameters,
(A.10)
and the entries of the intercept vector b are given by
(A.11)
Using these definitions, we can write
(A.12)
which is the form of a linear decoder.
In the case of the Poisson independent decoder, the sufficient statistic is the identity function, the natural parameter of the Poisson distribution is ηkd = log λkd, and g(ηkd) = exp(−λkd). Thus, for the PID,
(A.13)
(A.14)
The case of the gaussian independent decoder is slightly more complicated, since the gaussian sufficient statistic takes the proper form only if the variance σ² can be incorporated into h(xd), a term we dropped from the argmax. For this to be valid, we must constrain σkd = σk′d for all k, k′ ∈ {1, ..., K}. If this holds, then the sufficient statistic is u(xd) = xd/σd, the natural parameter is ηkd = μkd/σd, and g(ηkd) = exp(−μkd²/(2σkd²)). Thus, for the GID,
(A.15)
(A.16)
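Equations A.13 through A.16 are likewise missing from this version of the text. Under the definitions above (with the scalar α absorbed into the weights), the entries they describe take the following form; this is a reconstruction from the surrounding prose, not a verbatim copy of the original equations:

% PID (Poisson): natural parameter eta_kd = log(lambda_kd), g(eta_kd) = exp(-lambda_kd)
W_{kd} = \log \lambda_{kd}, \qquad b_k = -\sum_{d=1}^{D} \lambda_{kd}

% GID (gaussian): u(x_d) = x_d / sigma_d, eta_kd = mu_kd / sigma_d, alpha = 1 / sigma_d
W_{kd} = \frac{\mu_{kd}}{\sigma_d^{2}}, \qquad b_k = -\sum_{d=1}^{D} \frac{\mu_{kd}^{2}}{2\sigma_d^{2}}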

B.1  With Gaussian Noise

In this section, we demonstrate how to solve a 1D GP regression problem in the spectral domain, following the general approach of Paciorek (2007), Royle and Wikle (2005), and Wikle (2002). Consider a regression data set {xt, yt}, t = 1, ..., T, with y ∈ ℝ and, without loss of generality, x ∈ [−π, π]. Note that we do not require the x values to lie on a grid. We can concatenate the training examples and labels into vectors as follows: x = (x1, ..., xT) and y = (y1, ..., yT).

We assume that the values of y are noisy observations of a zero-mean gaussian process z, where zi = z(xi); that is, yi = zi + ε, where ε ∼ N(0, σ²). Our probability model can be written as
(B.1)
(B.2)
Here κθ is the kernel matrix between x and x. If κθ(·) is the stationary kernel function with hyperparameters θ, then (κθ)ij=κθ(xi-xj). To infer the GP hyperparameters using type 2 maximum likelihood, we wish to maximize the log evidence given by
(B.3)
(B.4)
However, each evaluation of this expression has O(n³) cost, which is intractable for large n. Our goal is to decorrelate z so that the prior covariance becomes diagonal, dropping the cost to O(n).
To achieve this, we will represent the GP z(x) using a Fourier series:
(B.5)
where Ω = 2π/N and the Fk are the Fourier series coefficients. Our goal is to find the distribution of the Fk such that z is a real, zero-mean gaussian process with Cov[z(x), z(x′)] = κθ(x, x′).

To ensure that z(x) is real, we require Fk=F-k*. This requirement can be verified by expanding equation B.5 in terms of sines and cosines. To ensure that z(x) is gaussian, we require that the Fk are independent gaussian random variables, which implies that they are jointly gaussian. Since z(x) is a linear combination of the Fks, this implies it is also gaussian. To ensure that z(x) is zero mean, we require that each Fk is zero mean. Because expectation is a linear operator, this ensures that z(x) is also zero mean.

The trickiest task is finding the variance of the Fks that induces the proper GP distribution on z. We can construct an equation to solve for it as follows: given a lag τ=s-t between two x values, we have the covariance
(B.6)
(B.7)
(B.8)
Because the Fourier coefficients are independent, we have E[Fk Fk′*] = 0 for all k ≠ k′.
(B.9)
This is just a Fourier series. Thus, we can use the Fourier coefficient formula to invert the equation and solve for E[FkFk*]:
(B.10)
At this point, we are done. Note that the resulting kernel is positive definite, since equation B.9 evaluates to a linear combination of cosine kernels, which are positive definite (Murphy, 2012). Many implementations use the Fourier transform of κ rather than the Fourier coefficient expression given above. The equivalence can be derived by extending the bounds of integration to (−∞, ∞). This is a reasonable approximation as long as κ(τ) is close to zero outside [−π, π], which is true for the RBF and Matérn kernels as long as the length scale is short. Extending the bounds of integration, we have
(B.11)
For notational simplicity, let w be the vector of frequency-domain coefficients and s be the vector of associated covariances. Then the log evidence we wish to maximize is
(B.12)
(B.13)
(B.14)
where A = ΨᵀΨ/σ² + diag(s)⁻¹ and b = (1/σ²)Ψᵀy. If Ψ is unitary, which is true if x lies on a grid, then A will be diagonal, which simplifies the gradient and Hessian calculations somewhat.
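A minimal numerical sketch of this spectral-domain construction is shown below. It builds a real Fourier design matrix Ψ on [−π, π], sets the prior variances s from the RBF kernel’s spectral density, and evaluates the gaussian log evidence using the A and b defined above; the function names, basis normalization, and number of frequencies are assumptions for illustration and will differ in detail from the original implementation.

import numpy as np

def fourier_design(x, n_freq):
    """Real Fourier design matrix on [-pi, pi]: a constant column plus cosine/sine pairs."""
    cols = [np.ones_like(x)]
    for m in range(1, n_freq + 1):
        cols.extend([np.cos(m * x), np.sin(m * x)])
    return np.column_stack(cols)

def rbf_spectral_variances(n_freq, amplitude, length_scale):
    """Prior variance of each Fourier coefficient under an RBF kernel (up to basis normalization)."""
    freqs = np.concatenate([[0.0], np.repeat(np.arange(1, n_freq + 1), 2)])
    return amplitude**2 * length_scale * np.sqrt(2 * np.pi) * np.exp(-0.5 * (length_scale * freqs) ** 2)

def log_evidence(x, y, amplitude, length_scale, noise_var, n_freq=25):
    """Gaussian log evidence log p(y | theta), evaluated in the spectral basis."""
    Psi = fourier_design(x, n_freq)
    s = rbf_spectral_variances(n_freq, amplitude, length_scale)
    A = Psi.T @ Psi / noise_var + np.diag(1.0 / s)        # posterior precision over coefficients
    b = Psi.T @ y / noise_var
    n = len(y)
    quad = y @ y / noise_var - b @ np.linalg.solve(A, b)  # y^T C^{-1} y via the Woodbury identity
    logdet_C = np.linalg.slogdet(A)[1] + np.sum(np.log(s)) + n * np.log(noise_var)
    return -0.5 * (quad + logdet_C + n * np.log(2 * np.pi))

Maximizing this quantity over the hyperparameters recovers the type 2 maximum likelihood estimates, and the dominant cost is a solve with the small spectral matrix A rather than an n × n kernel matrix.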

B.2  With Poisson Noise

Consider the same regression problem as in section B.1, but with Poisson observation noise. We wish to fit the hyperparameters by maximizing the log evidence:
(B.15)
Since p(y | z, θ) = ∏_{t=1}^T Poiss(yt; θt), the integral is not analytically tractable.
Define h(z) = log p(y | z, θ) + log p(z | θ) and denote its argmax by z*. To use the Laplace approximation (Azevedo-Filho & Shachter, 1994) to approximate the integral, we start by solving for the evidence using Bayes’ rule:
(B.16)
(B.17)
Now, define H to be the Hessian matrix of h(z) and approximate the posterior p(z | y, θ) as a gaussian using the Laplace approximation:
(B.18)
(B.19)
Because log p(y | θ) does not depend on z, we will evaluate the right-hand side at z = z* to simplify the normal distribution:
(B.20)
(B.21)
(B.22)
Let N be the dimension of z:
(B.23)
(B.24)
Using the identity log |A⁻¹| = log(1/|A|) = −log |A|, we have
(B.25)
Since evaluating this quantity requires finding z* via an optimization procedure, it is difficult to maximize it using a derivative-based optimization algorithm, an issue pointed out by Rasmussen and Williams (2006). We use a derivative-free technique, the Nelder-Mead algorithm (Nelder & Mead, 1965).
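The sketch below implements this recipe for a 1D GP with Poisson observations, assuming an exponential inverse link (yt ∼ Poiss(exp(zt))) and a kernel function supplied by the caller; both assumptions, and the optimizer settings, are for illustration and are not the paper’s exact implementation. The inner optimization finds z*, the Laplace formula above gives the approximate log evidence, and an outer Nelder-Mead search fits the hyperparameters.

import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def laplace_log_evidence(theta, x, y, kernel):
    """Laplace approximation to log p(y | theta) for a zero-mean GP with Poisson observations."""
    n = len(x)
    K = kernel(x, theta) + 1e-6 * np.eye(n)                 # jitter for numerical stability
    K_inv = np.linalg.inv(K)
    logdet_K = np.linalg.slogdet(K)[1]

    def neg_h(z):
        # h(z) = log p(y | z) + log p(z | theta), with y_t ~ Poiss(exp(z_t))
        log_lik = np.sum(y * z - np.exp(z) - gammaln(y + 1))
        log_prior = -0.5 * (z @ K_inv @ z + logdet_K + n * np.log(2 * np.pi))
        return -(log_lik + log_prior)

    res = minimize(neg_h, np.zeros(n), method="L-BFGS-B")   # inner optimization for z*
    H = -np.diag(np.exp(res.x)) - K_inv                     # Hessian of h at z*
    return -res.fun + 0.5 * n * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(-H)[1]

# Outer, derivative-free hyperparameter search, as described in the text:
# fit = minimize(lambda th: -laplace_log_evidence(th, x, y, some_kernel),
#                x0=np.zeros(2), method="Nelder-Mead")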

For each of the five monkey data sets provided by Graf et al. (2011), we chose the feature vectors {xt} exactly as Graf et al. did. The stimulus grating angles were selected from a 5-degree grid, so to obtain yt, we simply mapped the angles {0, 5, 10, ..., 360} to the integers {1, 2, ..., 72}.

Unlike Graf et al., we did not drop noisy neurons from the data set, since we found it made little to no difference in decoding accuracy (see Figure S6).

For the three mouse data sets, we chose the feature vectors {xt}t=1T exactly as in Stringer et al. (2021). For the class values {yt}t=1T, we binned the stimulus angles using 2-degree bins and used the bin index as the class label.
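For concreteness, the snippet below reproduces both label constructions on hypothetical angle arrays: the direct 5-degree mapping used for the monkey data and the 2-degree binning used for the mouse data.

import numpy as np

angles_monkey = np.array([0.0, 5.0, 90.0, 355.0])   # hypothetical angles on the 5-degree grid
angles_mouse = np.array([12.7, 180.3, 359.1])       # hypothetical continuous stimulus angles

# Monkey data: each grid angle maps directly to one of the integer class labels {1, ..., 72}.
y_monkey = (angles_monkey // 5).astype(int) + 1

# Mouse data: angles are placed in 2-degree bins, and the bin index is used as the class label.
y_mouse = (angles_mouse // 2).astype(int)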

For the ferret data set, all procedures were performed according to NIH guidelines and approved by the Institutional Animal Care and Use Committee at the Max Planck Florida Institute for Neuroscience. Surgical procedures and acute preparations were performed as described in Scholl et al. (2017). To perform calcium imaging of cellular populations, AAV1.Syn.GCaMP6s (UPenn) was injected at multiple depths (total volume 500 nL). Visual stimuli were generated using Psychopy (Peirce, 2007). The monitor was placed 25 cm from the animal, centered on the receptive field locations of the cells of interest. Square-wave drifting gratings (0.10 cycles per degree spatial frequency, 4 Hz temporal frequency) were presented at 2-degree increments across the full range of directions (1 second duration, 1 second ISI, 11 trials). Two-photon imaging was performed on a Bergamo II microscope (Thorlabs) running Scanimage (Vidrio Technologies; Pologruto et al., 2003), with 940 nm dispersion-compensated excitation provided by an Insight DS+ (Spectraphysics). Power after the objective was 40 mW. Images were collected at 30 Hz using bidirectional scanning with 512 × 512 pixel resolution; the full field of view was 1 × 1 mm. Raw images were corrected for in-plane motion via a nonrigid motion correction algorithm (Pnevmatikakis & Giovannucci, 2017). Regions of interest were drawn in ImageJ, and mean pixel values for ROIs were computed over the imaging time series and imported into Matlab (Hiner et al., 2017). ΔF/F0 was computed by estimating F0 with a time-averaged median or percentile filter. ΔF/F0 traces were synchronized to stimulus triggers sent from Psychopy and collected by Spike2. Response amplitudes for each stimulus on each trial were calculated as the sum of the Fourier mean and modulation (F0 + F1); these values for each neuron were used to construct the feature vectors {xt}. Class labels (yt) were assigned from the presented stimulus angles (at 2-degree increments), using the bin index as the class label.

Figure S1:

Correlations present in the data. (A) Noise correlations calculated by taking the correlation coefficient of the neural responses, minus the mean response conditional on the stimulus. The mean responses were estimated using the GPPID or GPGID models, as appropriate, since these were less noisy than the tuning curves estimated using the sample mean. We restricted the ferret and mouse data sets to the first 512 neurons so that the patterns would be large enough to be visible. (B) Signal correlations, calculated by computing the correlation coefficients of the neural tuning curves. In this case, we estimated the neural tuning curves using the GPPID/GPGID models. As before, we restricted the ferret and mouse data sets to the first 512 neurons.

Figure S2:

A visualization of the distance between class distributions in each data set. (A) To prevent numerical instability in our calculations, we first reduced the data set dimension using PCA. The number of components chosen is marked in red. (B) Distance between class distributions, estimated using the analytic Hellinger distance between gaussian distributions fit empirically to each class distribution. The Hellinger distance is a symmetric distance measure that takes the value zero when the distributions are maximally similar and one when they are maximally dissimilar. Nearby class distributions are similar to each other and to distributions 180 degrees away. As dimensionality increases, the class distributions become more and more separated.

Figure S3:

The same benchmark as in Figure 3 but calculated using proportion correct (i.e., proportion with 0 error) instead of mean absolute error. This is the same criterion as in Graf et al. (2011). Using it, we qualitatively replicate the results of Graf et al.

Figure S4:

The same benchmark as in Figure 3 but without averaging across animals.

Figure S5:

The same benchmark as in Figure 3B but without averaging across animals.

Figure S6:

A comparison of model performance on the monkey data sets from Graf et al. (2011), both with and without noisy neurons included. Noisy neurons were dropped using the procedure detailed in the supplementary information of Graf et al. We thought that this filtering step would make the models perform better, matching the performance reported by Graf et al., but it did not. We suspect that the issue lies in the initialization of our nonlinear curve fitting code, but we were not able to compare our implementation with the original, since the original code has been lost.

Figure S7:

A comparison of model performance using linear (y=wx) and affine (y=wx+b) model formulations. One would expect the affine models to fit the data better, but the difference in performance is negligible. This is likely due to the relatively high dimensionality of the data sets, since separating hyperplanes are quite easy to find in high dimensions whether or not they are restricted to pass through the origin. We expect a larger performance difference on data sets with smaller feature dimensions (<10).

Figure S8:

The same visualization as in Figure 7A, but with a surrogate correlation-free data set created by shuffling each neuron’s responses conditional on the class label. The conclusions remain the same: the data display significant correlations in the GPPID basis, making the GPPID separatrix a poor decision boundary. The GPPID separatrix performs well only when correlations are removed from the data set (in this case, using shuffling). In the GPMD basis, the difference between the correlated and shuffled surrogate datasets is much less pronounced, indicating that the GPMD’s projection decorrelates the data somewhat.

Figure S9:

The same visualization as in Figure 7A, for an additional four neighboring stimulus pairs. The GPMD separatrices are colored red, and the GPPID separatrices are colored black. In each case, the general patterns hold: the GPPID fails to separate the correlated data but performs well on the uncorrelated surrogate data set, the GPMD performs reasonably well in all cases, and the GPMD basis appears to decorrelate the data somewhat.

Figure S10:

The same visualization as in Figure 7B for an additional four neighboring stimulus pairs. In each case, the general pattern holds: the GPMD basis produces far superior class separation. In general, the GPMD maps the classes to lie on an ellipse with a 180-degree period.

Figure S11:

A visualization of how neural selectivity influences decoder performance. We quantified neural selectivity using the total Fisher information, estimated by summing the Fisher information across neurons, using the third monkey data set. Assuming a Poisson response distribution, the Fisher information for each neuron as a function of the stimulus is the squared derivative of its tuning curve divided by the mean, f′(θ)²/f(θ) (Seung & Sompolinsky, 1993). The inverse of the total Fisher information bounds the minimum achievable mean squared error (Abbott & Dayan, 1999). We wished to compare the inverse total Fisher information of the neural population to the mean squared error (MSE) achieved by the GPMD decoder. Panels A and B show two views of this information: panel A compares MSE and inverse Fisher information for each stimulus orientation, and panel B is a scatter plot comparison that demonstrates that MSE and inverse Fisher information are positively correlated (linear regression line and 95% confidence interval). The inverse total Fisher information does not lower-bound the MSE, since our Fisher information estimate assumes uncorrelated noise. Nevertheless, lower inverse Fisher information is associated with lower MSE, showing that increased neural selectivity contributes to lower decoding error.


This work was supported by grants from the Simons Collaboration on the Global Brain (SCGB AWD543027), the NIH BRAIN initiative (NS104899 and R01EB026946), and a U19 NIH-NINDS BRAIN Initiative Award (5U19NS104648). J.Y. is supported by the NIH (K99EY032179). B.S. is supported by the NIH (K99EY031137) and thanks the Max Planck Society and Max Planck Florida Institute for their generous support. We thank A. B. A. Graf, Al Kohn, M. Jazayeri, and J. A. Movshon for providing the primate data sets; C. Stringer, M. Michaelos, and M. Pachitariu for providing the publicly available mouse data sets; and Kenneth Latimer for helpful comments on the manuscript.

References

Abbott, L. F. (1994). Decoding neuronal firing and modelling neural networks. Quarterly Reviews of Biophysics, 27(3), 291–331.
Abbott, L. F., & Dayan, P. (1999). The effect of correlated variability on the accuracy of a population code. Neural Computation, 11(1), 91–101.
Adibi, M., McDonald, J. S., Clifford, C. W. G., & Arabzadeh, E. (2013). Adaptation improves neural coding efficiency despite increasing correlations in variability. Journal of Neuroscience, 33(5), 2108–2120. http://www.jneurosci.org/content/33/5/2108.abstract
Averbeck, B. B., Latham, P. E., & Pouget, A. (2006). Neural correlations, population coding and computation. Nature Reviews Neuroscience, 7(5), 358–366.
Averbeck, B. B., & Lee, D. (2006). Effects of noise correlations on information encoding and decoding. Journal of Neurophysiology, 95(6), 3633–3644.
Azevedo-Filho, A., & Shachter, R. D. (1994, January). Laplace’s method approximations for probabilistic inference in belief networks with continuous variables. In R. L. de Mantaras & D. Poole (Eds.), Uncertainty Proceedings 1994 (pp. 28–36). Morgan Kaufmann.
Bartolo, R., Saunders, R. C., Mitz, A. R., & Averbeck, B. B. (2020). Information-limiting correlations in large neural populations. Journal of Neuroscience, 40(8), 1668–1678.
Beck, J., Ma, W., Latham, P., & Pouget, A. (2007). Probabilistic population codes and the exponential family of distributions. Progress in Brain Research, 165, 509–519.
Beck, J., Ma, W. J., Pitkow, X., Latham, P. E., & Pouget, A. (2012). Not noisy, just wrong: The role of suboptimal inference in behavioral variability. Neuron, 74(1), 30–39.
Belkin, M., Hsu, D., Ma, S., & Mandal, S. (2019). Reconciling modern machine-learning practice and the classical bias-variance trade-off. Proceedings of the National Academy of Sciences of the United States of America, 116(32), 15849–15854.
Berens, P., Ecker, A. S., Cotton, R. J., Ma, W. J., Bethge, M., & Tolias, A. S. (2012). A fast and simple population code for orientation in primate V1. Journal of Neuroscience, 32(31), 10618–10626.
Bialek, W., Rieke, F., de Ruyter van Steveninck, R. R., & Warland, D. (1991). Reading a neural code. Science, 252, 1854–1857.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859–877.
Butts, D. A., & Goldman, M. S. (2006). Tuning curves, neuronal variability, and sensory coding. PLOS Biology, 4(4).
Cafaro, J., & Rieke, F. (2010). Noise correlations improve response fidelity and stimulus encoding. Nature, 468(7326), 964–967.
Charles, A. S., Park, M., Weller, J. P., Horwitz, G. D., & Pillow, J. W. (2018). Dethroning the Fano factor: A flexible, model-based approach to partitioning neural variability. Neural Computation, 30(4), 1012–1045.
Chen, Y., Geisler, W. S., & Seidemann, E. (2006). Optimal decoding of correlated neural population responses in the primate visual cortex. Nature Neuroscience, 9(11), 1412–1420.
Cunningham, J. P., Yu, B. M., Shenoy, K. V., & Sahani, M. (2007). Inferring neural firing rates from spike trains using gaussian processes. In J. Platt, D. Koller, Y. Singer, & S. Roweis (Eds.), Advances in neural information processing systems, 20 (pp. 329–336). Curran.
da Silveira, R. A., & Rieke, F. (2020). The geometry of information coding in correlated neural populations. Annual Review of Neuroscience, 44.
Duncker, L., & Sahani, M. (2018). Temporal alignment and latent gaussian process factor inference in population spike trains. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31 (pp. 10445–10455). Curran.
Ecker, A. S., Berens, P., Cotton, R. J., Subramaniyan, M., Denfield, G. H., Cadwell, C. R., ... Tolias, A. S. (2014). State dependence of noise correlations in macaque primary visual cortex. Neuron, 82(1), 235–248.
Ecker, A. S., Berens, P., Tolias, A. S., & Bethge, M. (2011). The effect of noise correlations in populations of diversely tuned neurons. Journal of Neuroscience, 31(40), 14272–14283.
Földiák, P. (1993). The “ideal homunculus”: Statistical inference from neural population responses. In F. H. Eeckman & J. M. Bower (Eds.), Computation and neural systems (pp. 55–60). Springer.
Friedman, J., Hastie, T., & Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.
Gao, Y., Busing, L., Shenoy, K. V., & Cunningham, J. P. (2015). High-dimensional neural spike train analysis with generalized count linear dynamical systems. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 2044–2052). Curran.
Goris, R. L. T., Movshon, J. A., & Simoncelli, E. P. (2014). Partitioning neuronal variability. Nature Neuroscience, 17(6), 858–865.
Graf, A. B. A., Kohn, A., Jazayeri, M., & Movshon, J. A. (2011). Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience, 14(2), 239–245.
Gur, M., Beylin, A., & Snodderly, D. M. (1997). Response variability of neurons in primary visual cortex (V1) of alert monkeys. Journal of Neuroscience, 17(8), 2914–2920.
Hiner, M. C., Rueden, C. T., & Eliceiri, K. W. (2017). ImageJ-MATLAB: A bidirectional framework for scientific image analysis interoperability. Bioinformatics, 33(4), 629–630.
Hoffman, M. D., Blei, D. M., Wang, C., & Paisley, J. (2013). Stochastic variational inference. Journal of Machine Learning Research, 14, 1303–1347.
Jacobs, A. L., Fridman, G., Douglas, R. M., Alam, N. M., Latham, P. E., Prusky, G. T., & Nirenberg, S. (2009). Ruling out and ruling in neural codes. Proceedings of the National Academy of Sciences of the USA, 106(14), 5936–5941.
Kanitscheider, I., Coen-Cagli, R., & Pouget, A. (2015). Origin of information-limiting noise correlations. Proceedings of the National Academy of Sciences, 112(50), E6973–E6982.
Keeley, S. L., Aoi, M., Yu, Y., Smith, S. L., & Pillow, J. W. (2020). Identifying signal and noise structure in neural population activity with gaussian process factor models. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 13795–13805). Curran.
Kohn, A., Coen-Cagli, R., Kanitscheider, I., & Pouget, A. (2016). Correlations and neuronal population information. Annual Review of Neuroscience, 39(1), 237–256.
Liu, H., Ong, Y.-S., Yu, Z., Cai, J., & Shen, X. (2021). Scalable gaussian process classification with additive noise for nongaussian likelihoods. IEEE Transactions on Cybernetics, 52(7), 5842–5854.
Lyamzin, D. R., Aoki, R., Abdolrahmani, M., & Benucci, A. (2021). Probabilistic discrimination of relative stimulus features in mice. Proceedings of the National Academy of Sciences of the USA, 118(30).
Ma, W. J., Beck, J. M., Latham, P. E., & Pouget, A. (2006). Bayesian inference with probabilistic population codes. Nature Neuroscience, 9, 1432–1438.
MacKay, D. J. C. (1992). Bayesian interpolation. Neural Computation, 4(3), 415–447.
Macke, J. H., Gerwinn, S., White, L. E., Kaschube, M., & Bethge, M. (2011). Gaussian process methods for estimating cortical maps. NeuroImage, 56(2), 570–581.
Morais, M. J., Michelson, C. D., Chen, Y., Pillow, J. W., & Seidemann, E. (2022). Majority of choice-related variability in perceptual decisions is present in early sensory cortex.
Moreno-Bote, R., Beck, J., Kanitscheider, I., Pitkow, X., Latham, P., & Pouget, A. (2014). Information-limiting correlations. Nature Neuroscience, 17(10), 1410–1417.
Murphy, K. P. (2012). Machine learning: A probabilistic perspective. MIT Press.
Neal, R. M. (1996). Bayesian learning for neural networks. Springer. https://doi.org/10.1007/978-1-4612-0745-0
Nelder, J. A., & Mead, R. (1965). A simplex method for function minimization. Computer Journal, 7(4), 308–313.
Nirenberg, S., & Latham, P. E. (2003). Decoding neuronal spike trains: How important are correlations? PNAS, 100, 7348–7353.
Nogueira, R., Peltier, N. E., Anzai, A., DeAngelis, G. C., Martínez-Trujillo, J., & Moreno-Bote, R. (2020). The effects of population tuning and trial-by-trial variability on information encoding and behavior. Journal of Neuroscience, 40(5), 1066–1083.
Paciorek, C. J. (2007). Bayesian smoothing with gaussian processes using Fourier basis functions in the spectralGP package. Journal of Statistical Software, 19(2).
Panzeri, S., Moroni, M., Safaai, H., & Harvey, C. D. (2022). The structures and functions of correlations in neural population codes. Nature Reviews Neuroscience, 23(9).
Park, M., & Pillow, J. W. (2011). Receptive field inference with localized priors. PLOS Computational Biology, 7(10).
Park, M., Weller, J. P., Horwitz, G. D., & Pillow, J. W. (2014). Bayesian active learning of neural firing rate maps with transformed gaussian process priors. Neural Computation, 26(8), 1519–1541. http://www.mitpressjournals.org/doi/abs/10.1162/NECO_a_00615
Peirce, J. W. (2007). Psychopy—psychophysics software in Python. Journal of Neuroscience Methods, 162(1–2), 8–13.
Pillow, J. W., Shlens, J., Paninski, L., Sher, A., Litke, A. M., Chichilnisky, E. J., & Simoncelli, E. P. (2008). Spatio-temporal correlations and visual signaling in a complete neuronal population. Nature, 454, 995–999.
Pnevmatikakis, E. A., & Giovannucci, A. (2017). NoRMCorre: An online algorithm for piecewise rigid motion correction of calcium imaging data. Journal of Neuroscience Methods, 291, 83–94.
Pologruto, T. A., Sabatini, B. L., & Svoboda, K. (2003). Scanimage: Flexible software for operating laser scanning microscopes. Biomedical Engineering Online, 2(1), 1–9.
Rad, K. R., & Paninski, L. (2010). Efficient, adaptive estimation of two-dimensional firing rate surfaces via gaussian process methods. Network: Computation in Neural Systems, 21(3–4), 142–168.
Rasmussen, C. E., & Williams, C. K. I. (2006). Gaussian processes for machine learning. MIT Press.
Royle, J. A., & Wikle, C. K. (2005). Efficient statistical mapping of avian count data. Environmental and Ecological Statistics, 12(2), 225–243.
Sahani, M., & Linden, J. (2002). Evidence optimization techniques for estimating stimulus-response functions. In T. Dietterich, S. Becker, & Z. Ghahramani (Eds.), Advances in neural information processing systems, 14. MIT Press.
Schneidman, E., Bialek, W., & Berry, M. J. (2003). Synergy, redundancy, and independence in population codes. Journal of Neuroscience, 23(37), 11539–11553.
Schölkopf, B., & Smola, A. (2002). Learning with kernels: Support vector machines, regularization, optimization and beyond. MIT Press.
Scholl, B., Wilson, D. E., & Fitzpatrick, D. (2017). Local order within global disorder: Synaptic architecture of visual space. Neuron, 96(5), 1127–1138.
Seung, H. S., & Sompolinsky, H. (1993). Simple models for reading neuronal population codes. Proceedings of the National Academy of Sciences of the USA, 90(22), 10749–10753.
Sokoloski, S., Aschner, A., & Coen-Cagli, R. (2021). Modelling the neural code in large populations of correlated neurons. eLife, 10.
Stevenson, I. H. (2016). Flexible models for spike count data with both over- and under-dispersion. Journal of Computational Neuroscience, 41(1), 29–43.
Stringer, C., Michaelos, M., Tsyboulski, D., Lindo, S. E., & Pachitariu, M. (2021). High-precision coding in visual cortex. Cell, 184(10), 2767–2778.e15.
Wikle, C. K. (2002). Spatial modelling of count data: A case study in modelling breeding bird survey data on large spatial domains. In A. B. Lawson & D. G. T. Denison (Eds.), Spatial cluster modelling (pp. 199–209). Chapman & Hall/CRC.
Yarrow, S., Challis, E., & Seriès, P. (2012). Fisher and Shannon information in finite neural populations. Neural Computation, 24(7), 1740–1780.
Yates, J. L., Katz, L. N., Levi, A. J., Pillow, J. W., & Huk, A. C. (2020). A simple linear readout of MT supports motion direction-discrimination performance. Journal of Neurophysiology, 123(2), 682–694.
Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S. I., Shenoy, K. V., & Sahani, M. (2009). Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102(1), 614.
Zohary, E., Shadlen, M. N., & Newsome, W. T. (1994). Correlated neuronal discharge rate and its implications for psychophysical performance. Nature, 370(6485), 140–143.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2), 301–320.