## Abstract

Neural decoding methods provide a powerful tool for quantifying the information content of neural population codes and the limits imposed by correlations in neural activity. However, standard decoding methods are prone to overfitting and scale poorly to high-dimensional settings. Here, we introduce a novel decoding method to overcome these limitations. Our approach, the gaussian process multiclass decoder (GPMD), is well suited to decoding a continuous low-dimensional variable from high-dimensional population activity and provides a platform for assessing the importance of correlations in neural population codes. The GPMD is a multinomial logistic regression model with a gaussian process prior over the decoding weights. The prior includes hyperparameters that govern the smoothness of each neuron’s decoding weights, allowing automatic pruning of uninformative neurons during inference. We provide a variational inference method for fitting the GPMD to data, which scales to hundreds or thousands of neurons and performs well even in data sets with more neurons than trials. We apply the GPMD to recordings from primary visual cortex in three species: monkey, ferret, and mouse. Our decoder achieves state-of-the-art accuracy on all three data sets and substantially outperforms independent Bayesian decoding, showing that knowledge of the correlation structure is essential for optimal decoding in all three species.

## 1 Introduction

Since Zohary, Shadlen, and Newsome’s landmark demonstration of correlated activity in a population of MT neurons (Zohary et al., 1994), computational neuroscience has been seeking to elucidate the role that noise correlations play in the population code (Adibi et al., 2013; Bartolo et al., 2020; Cafaro & Rieke, 2010; Ecker et al., 2011, 2014; Kanitscheider et al., 2015; Kohn et al., 2016; Moreno-Bote et al., 2014; Nirenberg & Latham, 2003; Nogueira et al., 2020; Panzeri et al., 2022; Schneidman et al., 2003; Sokoloski et al., 2021). Noise correlations refer to statistical dependencies in the trial-to-trial fluctuations in population activity elicited in response to a fixed stimulus, and they may either increase or decrease information relative to a population with conditionally independent neurons (Averbeck et al., 2006; da Silveira & Rieke, 2020). A common strategy for evaluating the role that noise correlations play in a population code is to compare the accuracy of two decoders trained to predict stimuli from population activity: a “correlation-blind” decoder, which has no access to noise correlations present in population activity, and a “correlation-aware” decoder, which does (Berens et al., 2012; Bialek et al., 1991; Graf et al., 2011; Nirenberg & Latham, 2003; Pillow et al., 2008; Stringer et al., 2021; Yates et al., 2020). If the correlation-aware decoder performs better, then we may conclude that downstream regions must take noise correlations into account to optimally read out information from the upstream population.

This strategy is an effective way to investigate the scientific question, but existing work is plagued by a number of statistical issues, which we aim to address in this article. First, as neural data sets have increased in dimensionality, regularization has become a prerequisite for good decoding performance, making it difficult to compare correlation-blind decoders—which are often fit without regularization—and correlation-aware decoders, which are almost always regularized. Second, conventional correlation-aware decoders struggle to scale statistically and computationally to modern data sets containing tens or hundreds of thousands of neurons.

To address these shortcomings, we develop a suite of three new decoders with a common regularization strategy based on gaussian processes (GPs). First, we introduce two correlation-blind decoders that apply Bayesian decoding to an independent encoding model: the GP Poisson independent decoder (GPPID), which assumes independent Poisson encoding noise; and the GP gaussian independent decoder (GPGID), which assumes independent gaussian encoding noise. Both of these decoders place a gaussian process prior over the neural tuning curves. (For each neuron, its tuning curve is its mean response as a function of the stimulus variable.) The GPPID model can be used when the neural responses are nonnegative integers (e.g., spike counts), whereas the GPGID model can be used when the neural responses are real numbers (e.g., calcium imaging). We emphasize that both of these decoders are insensitive to correlations because they assume neural responses are independent given the stimulus.

We then introduce a novel correlation-aware decoding model, the gaussian process multiclass decoder (GPMD), which is a multinomial logistic regression model that uses a GP prior to regularize the decoding weights for each neuron. This decoder, which learns a direct linear mapping from high-dimensional neural activity patterns to the log probability of the stimulus, is the only one of the three that takes noise correlations into account. However, the three decoders have a similar number of parameters—equal to the number of neurons times the number of stimulus categories—and rely on a common regularization method, making it straightforward to compare them.

We compared our decoders to a variety of previously proposed decoding methods: first, multinomial logistic regression regularized using an elastic-net penalty (GLMNET; see Zou & Hastie, 2005); second, the empirical linear decoder (ELD), a decoder trained using support vector machines (Graf et al., 2011); and third, the “super-neuron” decoder (SND), a recently proposed decoder trained using least squares regression and a bank of nonlinear target functions (Stringer et al., 2021). All three of these decoders are correlation-aware linear classifiers. For completeness, we also compared our decoders to unregularized, correlation-blind Poisson and gaussian independent decoders (PID/GID).

We benchmarked all of these decoders on three real-world data sets from primary visual cortex (V1), recorded from monkey (Graf et al., 2011), ferret (this article), and mouse (Stringer et al., 2021). We found that our regularized correlation-blind decoders (GPPID and GPGID) could match and even exceed the performance of some of the correlation-aware decoders. However, none of these decoders performed as well as our proposed correlation-aware decoder, the GPMD, which achieved state-of-the-art accuracy on all data sets, with no preprocessing or manual elimination of noisy neurons. The GPMD’s performance advantage derived directly from its expressive regularization strategy, which successfully adapted to every data set’s unique structure while remaining as scalable as other regularized, correlation-aware decoders.

These results indicate that knowledge of the correlation structure is crucial for optimal readout of stimulus information from V1 populations in all three species. For ease of use, our decoders conform to the scikit-learn interface and are released as a Python package at https://github.com/cdgreenidge/gdec.

## 2 The Neural Decoding Problem

In this article, we consider the problem of a decoding a low-dimensional stimulus variable (i.e., the orientation of a sinusoidal grating) from a high-dimensional neural activity pattern (i.e., a vector of spike counts). We assume the stimulus belongs to one of $K$ discrete bins or classes, formally making this a classification problem. However, the regression problem can be approximated by making $K$ large, so that the grid of stimulus values becomes arbitrarily fine.

Figure 1 illustrates the problem setup for the V1 data sets we examined. The visual stimulus for each individual trial is a drifting sinusoidal grating with an orientation $\theta k$ selected from a set of discrete orientations ${\theta 1,...,\theta K}$ that evenly divide the interval $[0,2\pi ]$. The stimulus variable to be decoded is thus a categorical variable $y\u2208{1,...,K}$.

We consider the neural population response to be a vector $x\u2208RD$, where $D$ indicates the number of neurons in the data set. We obtained this response vector by summing each neuron’s spikes (monkey) or two-photon calcium fluorescence (ferret and mouse) over some time window after stimulus presentation. Figure 1B shows orientation tuning curves from three example neurons from each data set. The monkey data sets (left) contained between $D=68$ and $D=147$ neurons, with $K=72$ discrete stimulus orientations (spaced every 5 degrees), and 50 trials per orientation for $T=3600$ trials (Graf et al., 2011). The ferret data set (middle) contained $D=784$ neurons, with $K=180$ discrete stimuli (spaced every 2 degrees) and 11 trials per orientation for a total of $T=1991$ trials, with the $0\u2218/360\u2218$ orientation sampled twice. Finally, the mouse data sets (right) contained between $D=11,311$ and $D=20,616$ neurons. The stimuli for this experiment were sampled uniformly in $[0,2\pi ]$, and we subsequently discretized them into $K=180$ bins (Stringer et al., 2021). Each bin contained between 12 and 42 trials for a total of between $T=4282$ and $T=4469$ trials, depending on the data set.

In each case, we collected the population response vectors $x$ and the discretized stimuli $y$ into a classification data set $D={(xt,yt)}t=1T$. Full details on these data sets and their preprocessing procedures can be found in appendix C in the supplement.

### 2.1 Linear Classification

The full set of parameters for a decoding model is thus the set of decoding weights for each class, which can be written as a $D\xd7K$ matrix $W=(w1,...,wK)$ and, optionally, a $k$-dimensional intercept vector $b=(b1,...,bk)\u22a4$. In the following, we let $Wdk$ denote the decoding weight for neuron $d$ for stimulus class $k$. The only difference between the decoding methods we consider is the procedure for training these weights from data.

Decoding with linear classifiers is optimal for so-called exponential-family probabilistic population codes with linear sufficient statistics (Beck et al., 2007; Ma et al., 2006). Although we could certainly expand our study to consider decoding with nonlinear classifiers, previous analyses of two of the data sets we used showed no benefit from adding nonlinear classification (Graf et al., 2011; Stringer et al., 2021).

### 2.2 Noise and Signal Correlations

It is helpful to distinguish between two types of correlations present in neural population responses: noise correlations and signal correlations (Averbeck et al., 2006; Panzeri et al., 2022). Noise correlations correspond to statistical dependencies in $P(x\u2223y)$, the distribution of population responses $x$ given a stimulus $y$. If neurons are conditionally independent (as assumed by the Poisson and independent gaussian encoding models), then noise correlations are zero. Noise correlations are typically characterized by $ Cov [x\u2223y]$, the covariance of the population responses to a fixed stimulus. This covariance clearly depends on the stimulus, meaning that noise correlations may differ for different stimuli.

Signal correlations are correlations between the tuning curves of different neurons. Two neurons with similar tuning curves will have high signal correlations, while neurons with dissimilar tuning curves may have zero or even negative signal correlations (i.e., if their tuning curves are anticorrelated). Signal correlations are commonly characterized by the covariance matrix $ Cov [x\xaf]$, which is the covariance of the tuning curves $x\xaf(y)=E[x\u2223y]$ over the stimulus distribution $P(y)$.

The effects of noise correlations on decoding performance depend critically on the alignment between noise and stimulus correlations (Averbeck & Lee, 2006; da Silveira & Rieke, 2020; Panzeri et al., 2022). However, it is important to note that so-called correlation-blind decoders are blind only to noise correlations. Thus, comparing correlation-blind and correlation-aware decoders is a method for determining whether a downstream population needs to know the structure of the noise correlations in order to decode information optimally. Figure S1 shows a characterization of both noise and signal correlations for the data sets considered in this article.

## 3 Review of Existing Decoders

Here we describe previously proposed neural decoding methods, to which we will compare the gaussian process-based decoders, which we introduce in section 4.

### 3.1 Correlation-Blind Decoders

First, we introduce two independent or correlation-blind decoders, the first assuming Poisson noise and the second assuming gaussian noise. Both decoders use Bayes’ rule to obtain a posterior distribution over the stimulus under an independent encoding model, an approach commonly known as “naive Bayes.” The encoding models underlying these decoders assume that neural responses are conditionally independent given the stimulus, preventing them by construction from using noise correlations for decoding.

#### 3.1.1 The Poisson Independent Decoder

#### 3.1.2 The Gaussian Independent Decoder

### 3.2 Correlation-Aware Decoders

Here we review three previously proposed decoders that take into account the structure of neural correlations when determining a classification boundary and are therefore “correlation aware.” Unlike the two naive Bayes decoders described above, which resulted from applying Bayes’s rule to an encoding model, these directly model the posterior probability over stimuli given a vector of neural activity. All three decoders are multiclass linear classifiers, but they are trained with different loss functions and regularization methods.

#### 3.2.1 Multinomial Logistic Regression with Elastic-Net Penalty (GLMNET)

The maximum-likelihood estimator (MLE) tends to perform poorly in settings with limited amounts of data and may not exist for small data sets. In fact, the MLE is not defined when the number of trials $T$ is smaller than the number of identifiable parameters in the weight matrix $W$, that is, when $T<D(K-1)$. Even in settings where the MLE does exist, it may overfit, yielding poor generalization performance.

#### 3.2.2 The Empirical Linear Decoder (ELD)

The empirical linear decoder (ELD), introduced by Graf et al. (2011), is similar to multinomial logistic regression in that it models the log probability of the stimulus class as an affine function of the neural response vector (see equation 3.14). However, instead of using standard likelihood-based methods to fit the model, the authors constructed an inference method based on support vector machines (SVMs).

We see that discriminating class 2 from class 1 under the multinomial logistic regression model is equivalent to solving a linear binary classification task with weights $v2$ and offset $c2$. Graf et al. (2011) proposed estimating $v2$ and $c2$ using an SVM trained on the data from classes 1 and 2. They then used the same approach to estimate the weights for all subsequent pairs of adjacent classes. That is, they estimated the difference weights $vk$ using an SVM trained on data from classes $k-1$ and $k$, for $k=2,...,K$.

#### 3.2.3 The Super-Neuron Decoder (SND)

The super-neuron decoder (SND), introduced by Stringer et al. (2021), is a third approach for training a linear classifier on multiclass data. It optimizes a set of decoding weights using penalized least-squares regression and a set of nonlinear super-neuron response functions. The super-neuron response functions encode the tuning curves of a population of narrowly selective downstream ``super-neurons,'' containing one super-neuron for each stimulus orientation. Each super-neuron responds maximally to a single orientation, making the population response on each trial a narrow bump of activity centered on the correct stimulus.

Formally, the SND seeks a matrix of weights $W$ that maps the response vector $x$ to a target vector $h\u2208RK$ on each trial. The target vector $h$ contains the responses of the super neuron population. The super-neurons have tuning curves parameterized in the same manner as the von Mises probability density function, which is appropriate since the stimulus variable is periodic.

## 4 Proposed Methods: GP-Regularized Decoders

In this section, we first introduce two correlation-blind decoders regularized with gaussian process (GP) priors: the GP-regularized Poisson independent decoder (GPPID) and the GP-regularized gaussian independent decoder (GPGID). After that, we introduce a correlation-aware decoder, the gaussian process multiclass decoder (GPMD), which adds a GP prior to multinomial logistic regression for the same purpose.

Gaussian processes are often used for nonlinear classification problems, a method setting known as gaussian process classification (Bishop, 2006; Liu et al., 2021). In neuroscience settings, GPs have also been used to construct priors over smoothly varying firing rates or latent factors (Cunningham et al., 2007; Duncker & Sahani, 2018; Keeley et al., 2020; Yu et al., 2009). Here, we use GPs to provide a construct a prior over the weights in a general linear model (in the case of GPGID) or generalized linear model (GPPID and GPMD). Previous work has used GPs to provide a smoothing prior over weights of a linear or generalized linear model (Macke et al., 2011; Park & Pillow, 2011; Park et al., 2014; Rad & Paninski, 2010; Sahani & Linden, 2002). However, to our knowledge, ours is the first to use GP priors to impose a smoothness assumption over the different classes in a multinomial logistic regression model.

In the following sections, we describe the relationship between the linear weights in our model and neural tuning curves, which makes GPs a natural tool for regularization in our problem.

### 4.1 The GP-Regularized Poisson Independent Decoder (GPPID)

When doing inference in the PID decoder (see section 3.1.1), it is necessary to estimate $\lambda dk$, the expected spike count of neuron $d$ in response to the $k$th stimulus, across all neurons and stimuli. The expected spike counts for a single neuron form a tuning curve across stimuli, $\lambda d=(\lambda d1,...,\lambda dK)T$. The maximum likelihood estimator for each entry in the tuning curve is simply the empirical mean of the spike counts for the $d$th neuron under the $k$th stimulus. However, the empirical mean estimates are noisy, especially when the number of trials for each stimulus is small, which can limit the PID decoder’s performance.

In principle, we could compensate for the noise by recording more trials for each stimulus, but this is expensive, particularly if the stimulus grid has a fine resolution. Instead, we propose to reduce error in the tuning curve estimates by exploiting our prior knowledge that tuning curves tend to be smooth with respect to orientation. We incorporate this knowledge into the model by placing an independent gaussian process prior over the log tuning curve of each neuron (Park et al., 2014; Rad & Paninski, 2010).

#### 4.1.1 Periodic Gaussian Covariance Function

The neuron-specific hyperparameters $\theta d$ permit tuning curves to differ in amplitude and smoothness, so different neurons can be regularized differently. For neurons with nonexistent tuning or exceptionally noisy responses, the inference procedure will set the amplitude $\rho d$ to zero or the length scale $\u2113d$ to infinity, making the tuning curve flat (see Figure 5). Such neurons are effectively pruned from the data set, since flat tuning curves make no contribution to decoding. This effect, known as automatic relevance determination (MacKay, 1992; Neal, 1996), eliminates the need to manually filter out noisy or untuned neurons, which can be critical when using other decoding methods (Graf et al., 2011).

#### 4.1.2 Fitting the GPPID to Spike Count Data

where $wd=log\lambda d$ is the log-tuning curve for neuron $d$, $\lambda d[yt]$ denotes the element of $\lambda d=exp(wd)$ corresponding to $yt$, the stimulus class for the $t$th time bin, and $\kappa \theta d$ denotes the $K\xd7K$ covariance matrix arising from evaluating the periodic gaussian covariance function (see equation 4.3) at all pairs of points in $(1,...,K)$.

We compute point estimates of the model hyperparameters $\theta ^d$ by optimizing this approximation with respect to $\theta d$ using the Nelder-Mead algorithm (Nelder & Mead, 1965).

To accelerate the optimization procedure, we use a Fourier-domain representation of the covariance matrix $\kappa \theta d$ based on the Karhunen-Loéve expansion (Paciorek, 2007; Royle & Wikle, 2005; Wikle, 2002; see appendix B for details). Since the procedure described above can be performed independently for each neuron, fitting the GPPID model is fully parallelizable across neurons.

### 4.2 The GP-Regularized Gaussian Independent Decoder (GPGID)

In the same way that we added a gaussian process prior to the Poisson independent decoder to obtain the GPPID decoder, we can add a GP prior to the gaussian independent decoder to obtain a GP-regularized gaussian independent decoder (GPPID). However, three salient differences set the gaussian model apart from the Poisson case. First, the parameters of interest in the gaussian model are the tuning curves, not the log-tuning curves that appeared in the Poisson case above. Second, there is an additional hyperparameter to learn in the gaussian case, corresponding to each neuron’s observation noise $\sigma d2$. Third, and most important, the gaussian model evidence can be computed in closed form, obviating the need for approximations.

### 4.3 The Gaussian Process Multiclass Decoder (GPMD)

In the GPPID and GPGID, we used a GP prior to formalize our prior knowledge that the neuron’s log-tuning curves, and thus its tuning curves, tend to be smooth. In the GPMD, the model weights have no direct interpretation in terms of tuning curves. However, we can still motivate the use of a GP prior by observing that decoding weights also ought to vary smoothly with orientation, since tuning varies smoothly as a function of orientation.

Like previous decoders, the GPMD has neuron-specific hyperparameters, which allow different neurons to have decoding weights with different amplitudes and different amounts of smoothness. This flexibility has two benefits: first, it allows each neuron’s weights to adapt to the neuron’s response properties, and second, it automatically discards untuned or noisy neurons as described in section 4.1, eliminating the need for manual data set preprocessing.

To make predictions, we approximate MAP inference by using the mode of the posterior approximation $q$ as a point estimate of the weights $W$ and $b$ in the multinomial logistic regression model (see equation 4.15). Since we are interested only in the effects of the prior, and not in using the posterior to assess prediction uncertainty, this approach suffices. In practice, we found that setting the offset $b$ to 0 made no significant difference in performance (see Figure S7), likely because all our data sets were reasonably high-dimensional.

#### 4.3.1 Scaling GPMD Inference

When maximizing the ELBO, there are two scaling concerns: the number of examples ($T$) and the number of neurons ($D$). It is well known that the ELBO can be scaled to huge numbers of examples by estimating the likelihood term using a minibatch approximation (Hoffman et al., 2013). However, even when using a minibatch approximation, the KL-divergence term must be calculated at each gradient step, and from it costs $\u223cDK3$ to evaluate (see equation 4.19). For large values of $D$, which are expected in high-dimensional neural data sets, the KL-divergence term evaluation makes stochastic gradient descent far too slow.

We solve this problem by representing $W\u22a4$ using a basis $\Psi \u2208CK\xd7M$, that is, $W\u22a4=\Psi U$, where $U\u2208RM\xd7D$. Then we place an independent normal prior on each entry of $U$. This allows the KL-divergence term to be evaluated with $\u223cDM$ complexity, since it becomes a KL divergence between two independent normal distributions. The only difficulty is choosing $\Psi $ such that $W$ turns out to have the desired gaussian process distribution when $U$ is an independent normal.

## 5 Results

### 5.1 Evaluation and Performance

We benchmarked each decoder by calculating its mean absolute test error on the monkey (Graf et al., 2011), ferret (this article), and mouse (Stringer et al., 2021) data sets (see Figure 3). We chose to estimate the error using five-fold cross-validation repeated 10 times, since 10-fold cross-validation produced data sets that had too few examples per class to train some models. Error bars were computed using twice the standard error of the mean of the cross-validation scores ($\xb12\sigma /n$). We examined five monkey data sets, one ferret data set, and three mouse data sets. Figure 3A reports the average scores for each animal; separate scores are reported in Figure S4. Note that the GID decoder has two variants: the standard formulation, which has a quadratic decision boundary, and the formulation described in section 3.1.2, which has a linear decision boundary.

The rank ordering of the models remained largely consistent across data sets. In general, the correlation-blind decoders (the PID and GID) performed worse than the correlation-aware decoders, which is consistent with previous decoding studies (Graf et al., 2011; Stringer et al., 2021). Their regularized variants (the GPPID and GPGID) performed better but still did not match the performance of the best correlation-aware decoders. The GPMD set matched state-of-the-art performance on all data sets. An important advantage of its Bayesian approach is that hyperparameters are learned automatically, which allowed the GPMD to adapt to the conditions present in different data set. By contrast, models that set hyperparameters manually exhibited occasional poor performance—for example, the SND performance on the monkey data sets.

These results could have been obscured by the choice of error metric. For example, repeating the same benchmark using “proportion correct” instead of mean absolute error improved the performance of the ELD substantially (see Figure S3), qualitatively replicating the results of Graf et al. (2011). To ensure that our results were not artifacts of the error metric, we used the empirical error cumulative distribution function to characterize each decoder’s errors in more detail (see Figure 3B). Good decoders should classify higher fractions of the data set at lower errors, producing curves that are shifted upward from those of poorer decoders. We found that the GPMD outperformed or matched all the other decoders on all the data sets (see Figure S5).

Our results show that both regularization and exploiting correlations improved decoding performance substantially. The regularized correlation-blind decoders, the GPPID and GPGID, outperformed their unregularized analogues, the GID and PID. The GLMNET decoder, which is correlation aware, outperformed the correlation-blind GPPID and GPGID. Finally, the SND and GPMD, both regularized and correlation aware, outperformed all other decoders.

The relative impact of these strategies depended on data set dimensionality. All of the decoders performed better on the higher-dimensional mouse data set than the mid- and low-dimensional ferret and mouse data sets. This is because as dimensionality increased, the class distributions became farther apart, making the problem easier to solve (see Figure S2). Increased neural selectivity was correlated with decoding performance (see Figure S11). Within each data set, decoders performed differently depending on how they handled regularization and correlations. For small data sets, such as the monkey data set with approximately 150 neurons, both regularization and exploiting correlations had a substantial effect. For example, adding regularization to the GID (using the GPGID) decreased its mean absolute error by 16.3 degrees, and exploiting correlations (using the GPMD) decreased error by another 9.4 degrees. For high-dimensional data sets where it was easy to overfit, such as the mouse data set, which was recorded from approximately 20,000 neurons, regularization became the most important strategy. In fact, on the mouse data set, the regularized correlation-blind GPGID did just as well as some of the correlation-aware decoders.

### 5.2 Scaling with Data Set Size and Dimensionality

To characterize the GPMD’s performance and training times with respect to data set size and dimensionality, we performed ablation studies on both the number of training examples and the number of neural features (see Figure 4).

The GPMD performed well at all training set sizes (see Figure 4A), implying that its prior assumptions encouraging smooth and small decoding weights were well calibrated—that is, strong enough to permit good performance with few training examples but flexible enough to allow continued learning with many training examples. We believe that the nonparametric nature of the gaussian process prior is the key component that allows our model to accommodate the complex structure present in larger neural data sets. Models with stronger assumptions, such as the GPGID, which assumes independence, or the SND, which has many hard-coded parameters, had difficulty learning from increasing numbers of training examples.

The GPMD also performed well with any number of neural features (see Figure 4C). Linear decoders with no or poor regularization, such as the GID and ELD, did not exhibit this property; in fact, their performance became worse as the number of neural features increased from the “classical” to the “interpolating” regime, producing a phenomenon known as the “double descent” error curve (Belkin et al., 2019). Properly regularized models such as the GPGID and GPMD did not display this phenomenon and gave accurate performance estimates for all numbers of neural features.

Thanks to the GPMD’s approximate inference, GPU acceleration, and spectral weight representation, it trained quickly, producing fast cross-validated error estimates that exhibited favorable scaling with respect to both observations and neurons (see Figures 4B and 4D). For the largest data set with 20,000 neurons, it took 131 $+$/$-$ 0.82 seconds to train (roughly 20 minutes of wall-clock time) for a 10-fold cross-validation estimate. By comparison, a performance-tuned GLMNET model took 618 $+$/$-$ 6.40 seconds to train (roughly 1 hour and 45 minutes of wall-clock time) for the same estimate. Given the training time trends shown in the training-set size ablation (see Figure 4 BB) and neural feature ablation (see Figure 4D) studies, we expect the GPMD to handle even larger data sets without difficulty.

Scaling to large data sets was further enhanced by the GPMD’s automatic data set preprocessing. Decoding studies, such as Graf et al. (2011), often select only strongly tuned neurons for decoding, since noisy neurons make it easier for models to overfit. Manual selection rules have two disadvantages: first, they may ignore neurons that look noisy but actually carry information, and second, they can require prohibitive amounts of time if human input is needed (e.g., for choosing initialization points for nonlinear curve fitting).

The GPMD’s Bayesian formulation automatically discarded noise neurons by setting their prior amplitudes to zero (see Figure 5), a phenomenon known as automatic relevance determination (MacKay, 1992; Neal, 1996). Examples of tuning curves (estimated using the sample average conditional on the stimulus) from automatically discarded and automatically retained neurons are shown in Figure 5. Some of the automatically retained neurons displayed the bimodal “gaussian bump” structure commonly sought by manual selection rules. Others displayed more complicated tuning patterns that would likely be ignored by a manual selection rule.

Our implementation of the empirical linear decoder (ELD; see Graf et al., 2011) replicated the original paper’s results only qualitatively, not quantitatively. Our implementation of the ELD did outperform the Poisson independent decoder (PID) when using the “proportion correct” error criterion, as in the original paper (see Figure S3). However, it did not achieve the performance reported in the original paper. Because our implementation of the PID, a very simple decoder, also did not match the performance of the PID in Graf et al. (2011), we believe the discrepancy was caused by data preprocessing. We were not able to replicate the data preprocessing steps described in Graf et al. (2011) precisely since the original code has been lost.

### 5.3 Visualizing the Impact of Correlations on Decoding

As described in section 4, each row of the weight matrix $W$ for each decoder forms a vector of decoding weights over orientation for a single neuron. Example decoding weights from a correlation-blind (GPPID for monkey, GPGID for mouse and ferret) and correlation-aware (GPMD) decoder pair are shown in Figure 6A. In general, the GPMD produces smoother weights, and unlike the GPPID/GPGID, it occasionally places large weights on stimuli with low expected response (e.g., ferret neuron 82 and mouse neuron 87). To verify that the GPMD’s decoding weights were usually smoother, we compared the inferred prior lengthscales for each decoder pair (see Figure 6B) and found that the GPMD learned longer length scales (sign test, $p<0.001$ for each animal.) We discarded decoding weights with $\u21132$ norm less than 0.001, since nearly zero weights indicated a neuron discarded by the learning process, which would have uninformative length scales.

In nearly all cases, the GPPID/GPGID decoders learned decoding weights with larger $\u21132$ norm than the GPMD (see Figure 6C, sign test $p<0.001$ for each animal). Since the norm of the tuning curve is directly related to the predicted probability for each class (see equation 2.1), this indicates that the GPPID/GPGID are more confident in their predictions than the GPMD. To directly measure this, we chose a test set for each data set and compared the predictive distribution entropy for both the GPPID/GPGID and GPMD decoders (see Figure 6D). We found that the GPMD produced predictive distributions with higher entropy (sign test, $p<0.001$ for all animals), indicating that it produces more uncertain predictive distributions. This confirms the patterns seen in Figure 6C.

To further characterize the effects of correlations on decoding performance, we visualized population responses projected onto 2D subspaces defined by the decoding weights of correlation-blind (GPPID) and correlation-aware (GPMD) decoding models (see Figure 7A). To reduce the 147-dimensional response vectors obtained from a population of 147 monkey V1 neurons, we first selected a model to visualize and two stimulus classes $i$ and $j$. Then we formed a two-dimensional basis by orthogonalizing the decoding weight vectors $Wi*$ and $Wj*$, producing a basis matrix of orthogonal unit vectors, $S$. We plotted the data for classes $i$ and $j$ and for the model separatrix by projecting onto this basis. The model separatrix is given by the solution to the equation $x\u22a4u+c=0$, where $u=S(Wj*-Wi*)$, and the constant offset $c$ is $bj-bi$. We estimated the approximate separatrix from the model that did not produce the basis by projecting its normal vector into the subspace defined by $B$ and setting the constant offset to be $-( mean (Xiu)+ mean (Xju)/2$, where $Xi$ and $Xj$ are the data for classes $i$ and $j$.

We carried out this procedure for both the GPPID and GPMD models, since each two-dimensional basis could exactly represent only the separatrix from its source weight matrix. Figure 7A shows our visualization for the $25\u2218/30\u2218$ class pair, and we included four additional class pairs in Figure S9 (all class pairs exhibited a similar pattern).

We first sought to determine whether the data deviated significantly from the independent Poisson model assumed by the GPPID. To do this, we generated a simulated data set from a conditionally independent Poisson model with means given by the tuning curves of the GPPID model (referred to in the figure as the “Poisson surrogate” data set). Compared to the real data, the Poisson surrogate data exhibit much smaller variance and tilt relative to the basis vectors, showing that the correlations in the real data can significantly affect decoding. However, in the GPMD basis, the differences between the real and surrogate data sets are much less pronounced, implying that the GPMD’s weight matrix identifies a linear projection that decorrelates and reduces the variance of the projected data. We found the same results using a decorrelated data set created by shuffling neural responses across trials (see Figure S8).

Next, we plotted the class separatrices along with the data. The separatrix given by the GPPID successfully separated the Poisson surrogate data but failed to separate the real data set because of correlation-induced distortions. However, as expected, the GPMD separatrix successfully took the data’s correlations into account.

To visualize how the entire set of 72 classes related to each other, we plotted each class’s correlation ellipse on the basis given by the zero- and five-degree basis vectors from each model (see Figure 7B). The selection of the basis vectors was arbitrary (see Figure S10 for other choices). The GPPID’s basis did a poor job of separating the classes, but the GPMD’s basis separated them fairly well. In the GPMD’s basis, the ellipses from classes 180 degrees apart appear in nearly identical locations, confirming that that the GPMD identified grating angles more precisely than grating drift direction, a phenomenon previously observed in our performance benchmarks (see Figure 3C).

## 6 Discussion

Linear decoders provide a powerful tool for characterizing the information content of neural population responses. While all linear decoders share a common mathematical form, differences in fitting methods—namely, in regularization and the incorporation of noise correlations—can lead to large differences in performance.

Previous studies, such as Graf et al. (2011) and Stringer et al. (2021), have used correlation-blind and correlation-aware linear decoders to investigate the significance of correlations on decoding performance. In all cases, they found that correlation-aware decoders outperform correlation-blind decoders. However, the performance differences could have been due to the lack of regularization in the correlation-blind decoders, as maximum likelihood weight estimates in the sample-poor data regimes typically studied are often subject to overfitting.

In this article, we have presented a suite of new decoders that use gaussian processes for regularization. Two of these methods correspond to naive Bayes decoders under Poisson and gaussian noise models, respectively, which do not incorporate correlations. The third method, the GPMD, corresponds to a decoding model with a GP prior over its weights that does take account of noise correlations. We found that the GPMD matched or outperformed all other decoders on data sets from monkey, ferret, and mouse visual cortex. Furthermore, we showed that the GPMD scales to the very largest data sets using a combination of approximate Bayesian inference, spectral methods, and GPU acceleration.

Our results show that the performance gap between correlation-blind and correlation-aware decoders is fundamental, and not just a result of regularization. The correlation-aware decoders consistently outperformed even the regularized correlation-blind decoders, though regularization did narrow the gap significantly (as shown in Figure 3). This confirmed the results of previous studies, and we conclude that exploiting neural correlations can significantly improve decoding performance (though this result may have some dependence on whether the animal is anesthetized (see Yates et al., 2020). Visualizations of the decoding separatrices produced by each decoder indicate that the real data sets differ significantly from the assumptions made by correlation-blind decoders. The correlation-aware decoders discovered low-dimensional subspaces that decorrelated the responses, making the transformed data match independence assumptions more closely.

One major caveat to our results is that they cannot tell us whether the brain itself uses decoders that are correlation-blind or correlation-aware. Animal behavior may rely on a different populations of neurons than the ones recorded in our experimental data sets. Animals might achieve similar performance to the GPMD by reading out a similar neural population optimally or a larger neural population suboptimally (Chen et al., 2006; Jacobs et al., 2009; Morais et al., 2022). As in previous studies (Stringer et al., 2021), we found that in the largest recordings from mice, the decoding performance ($1.3\u2218$) substantially exceeded current estimates of discrimination ability of the animal ($9\u2218$; Lyamzin et al., 2021). This suggests that the animal’s behavioral performance may be primarily limited by the decoding strategies implemented by downstream regions, and not by the information present in V1 itself (Beck et al., 2012).

The prior assumption of smoothness governs the applicability of the GPMD and other GP-regularized decoders we developed to other brain regions and tasks. If the neural population of interest has smoothly varying tuning curves and the task can be solved using a linear decision boundary, the GPMD will perform well. Although our decoding task used angular stimuli and a periodic distance measure, the GPMD could be trivially extended to stimuli that vary smoothly on an interval of the real line by using the ordinary Euclidean distance. Beyond smoothness, we have made no assumptions about the population of neurons to be decoded, and the GPMD naturally “prunes” uninformative neurons (e.g., neurons untuned to the stimulus) by shrinking their decoding weights to zero.

In the future, similar decoders could be used to empirically characterize how the peak and high-slope regions of tuning curves contribute to optimal decoding strategies in real neural populations (for theoretical studies, see Butts & Goldman, 2006, and Yarrow et al., 2012). We also believe that the performance of the decoders could be further improved. In particular, the performance of the correlation-aware decoder could likely be improved by modeling additional structure in the neural data set—for example, the differences in response characteristics between excitatory and inhibitory neurons, or temporal structure (which would likely improve the classification of direction). The correlation-blind decoder could also be improved by using a generative model that can handle non-Poisson characteristics of neural data such as overdispersion (Charles et al., 2018; Gao et al., 2015; Goris et al., 2014; Gur et al., 1997; Stevenson, 2016). We are optimistic that future work will use and extend the decoding methods we have introduced to obtain new insights into the structure of neural codes.

## Appendix A: Linearity of the PID and GID Decoders

where $\eta $ is the natural parameter and the sufficient statistic $u:R\u2192R$ is a function of $xd$.

## Appendix B: Spectral GP Regression

### B.1 With Gaussian Noise

In this section, we demonstrate how to solve a 1D GP regression problem in the spectral domain, following the general approach of Paciorek (2007), Royle and Wikle (2005), and Wikle (2002). Consider a regression data set ${xt,yt}t=1T$ with $y\u2208R$ and, without loss of generality, $x\u2208[-\pi ,\pi ]$. Note that we do not require the $x$ values to lie on a grid. We can concatenate the training examples and labels into vectors as follows: $x=(x1,...,xt)\u22a4$ and $y=(y1,...,yt)\u22a4$.

To ensure that $z(x)$ is real, we require $Fk=F-k*$. This requirement can be verified by expanding equation B.5 in terms of sines and cosines. To ensure that $z(x)$ is gaussian, we require that the $Fk$ are independent gaussian random variables, which implies that they are jointly gaussian. Since $z(x)$ is a linear combination of the $Fk$s, this implies it is also gaussian. To ensure that $z(x)$ is zero mean, we require that each $Fk$ is zero mean. Because expectation is a linear operator, this ensures that $z(x)$ is also zero mean.

### B.2 With Poisson Noise

## Appendix C: Data Set and Preprocessing Details

For each of the five monkey data sets provided by Graf et al. (2011), we chose the feature (${x}$) exactly as in Graf et al. did. The stimuli grating angles were selected from a 5-degree grid, so to get $yt$, we simply mapped the angles ${0,5,10,...,360}$ to the integers ${1,2,...,72}$.

Unlike Graf et al., we did not drop noisy neurons from the data set, since we found it made little to no difference in decoding accuracy (see Figure S6).

For the three mouse data sets, we chose the feature vectors ${xt}t=1T$ exactly as in Stringer et al. (2021). For the class values ${yt}t=1T$, we binned the stimulus angles using 2-degree bins and used the bin index as the class label.

For the ferret data set, all procedures were performed according to NIH guidelines and approved by the Institutional Animal Care and Use Committee at Max Planck Florida Institute for Neuroscience. Surgical procedures and acute preparations were performed as described in Scholl et al. (2017). To perform calcium imaging of cellular populations, AAV1.Syn.GCaMP6s (UPenn) was injected at multiple depths (total volume 500 nL). Visual stimuli were generated using Psychopy (Peirce, 2007). The monitor was placed 25 cm from the animal, centered in the receptive field locations for the cells of interest. Square-wave drifting gratings (0.10 cycles per degree spatial frequency, 4 Hz temporal frequency) were presented at 2-degree increments across the full range of directions (1 second duration, 1 second ISI, 11 trials). Two-photon imaging was performed on a Bergamo II microscope (Thorlabs) running Scanimage (Pologruto et al., 2003) (Vidrio Technologies) with 940 nm dispersion-compensated excitation provided by an Insight DS+ (Spectraphysics). Power after the objective was 40 mW. Images were collected at 30 Hz using bidirectional scanning with 512 $\xd7$ 512 pixel resolution. The full field of view was 1 $\xd7$ 1 mm. Raw images were corrected for in-plane motion via a nonrigid motion correction algorithm (Pnevmatikakis & Giovannucci, 2017). Regions of interest were drawn in ImageJ. Mean pixel values for ROIs were computed over the imaging time series and imported into Matlab (Hiner et al., 2017). $\Delta F/Fo$ was computed by computing $Fo$ with time-averaged median or percentile filter. $\Delta F/Fo$ traces were synchronized to stimulus triggers sent from Psychopy and collected by Spike2. Response amplitudes for each stimulus on each trial were calculated as the sum of the Fourier mean and modulation ($F0+F1$). These values for each neuron were used to generate the feature (${x}$) vectors. Class values ($y$) were the stimulus angles presented (at 2-degree increments) using the bin index as the class label.

## Appendix D: Supplementary Figures

## Acknowledgments

This work was supported by grants from the Simons Collaboration on the Global Brain (SCGB AWD543027), the NIH BRAIN initiative (NS104899 and R01EB026946), and a U19 NIH-NINDS BRAIN Initiative Award (5U19NS104648). J.Y. is supported by the NIH (K99EY032179). B.S. is supported by the NIH (K99EY031137) and thanks the Max Planck Society and Max Planck Florida Institute for their generous support. We thank A. B. A. Graf, Al Kohn, M. Jazayeri, and J. A. Movshon for providing the primate data sets; C. Stringer, M. Michaelos, and M. Pachitariu for providing the publicly available mouse data sets; and Kenneth Latimer for helpful comments on the manuscript.

## References

*Majority of choice-related variability in perceptual decisions is present in early sensory cortex*

*Bayesian learning for neural networks*