## Abstract

An important problem in systems neuroscience is to characterize how a neuron integrates sensory inputs across space and time. The linear receptive field provides a mathematical characterization of this weighting function and is commonly used to quantify neural response properties and classify cell types. However, estimating receptive fields is difficult in settings with limited data and correlated or high-dimensional stimuli. To overcome these difficulties, we propose a hierarchical model designed to flexibly parameterize low-rank receptive fields. The model includes gaussian process priors over spatial and temporal components of the receptive field, encouraging smoothness in space and time. We also propose a new temporal prior, temporal relevance determination, which imposes a variable degree of smoothness as a function of time lag. We derive a scalable algorithm for variational Bayesian inference for both spatial and temporal receptive field components and hyperparameters. The resulting estimator scales to high-dimensional settings in which full-rank maximum likelihood or maximum a posteriori estimates are intractable. We evaluate our approach on neural data from rat retina and primate cortex and show that it substantially outperforms a variety of existing estimators. Our modeling approach will have useful extensions to a variety of other high-dimensional inference problems with smooth or low-rank structure.

## 1 Introduction

A key problem in computational and systems neuroscience is to understand the information carried by neurons in the sensory pathways (Jones & Palmer, 1987; Ringach et al., 1997). A common approach to this problem is to estimate the linear receptive field (RF), which provides a simple characterization of a neuron's stimulus-response properties. The RF consists of a set of linear weights that describe how a neuron integrates a sensory stimulus over space and time (Chichilnisky, 2001; deBoer & Kuyper, 1968; Jones & Palmer, 1987; Lee & Schetzen, 1965; Meyer et al., 2017; Ringach, 2004; Schwartz et al., 2006; Sharpee et al., 2004; Simoncelli et al., 2004; Theunissen et al., 2001). Estimating the RF from imaging or electrophysiological recordings can thus be seen as a straightforward regression problem. However, characterizing RFs in realistic settings poses a number of important challenges.

One major challenge for RF estimation is the high dimensionality of the inference problem. The number of RF coefficients is equal to the product of the number of spatial stimulus elements (e.g., pixels in an image) and the number of time lags in the temporal filter. In realistic settings, this can easily surpass thousands of coefficients. Classic RF estimators such as least-squares regression suffer from high computational cost and low statistical power in high-dimensional settings. The computational cost of the regression estimate scales cubically with the number of RF coefficients, while memory cost scales quadratically. Moreover, the standard regression estimate does not exist unless there are more samples than dimensions and typically requires large amounts of data to achieve a high level of accuracy. High dimensionality is thus a limiting factor in terms of both computational resources and statistical power.
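To make this scaling concrete, the following short Python sketch computes the coefficient count and memory footprint for a stimulus of typical size (the dimensions here are illustrative, not taken from any particular experiment):

```python
# Coefficient count for an illustrative spatiotemporal RF:
# a 40 x 40 pixel stimulus with 25 time lags.
nx1, nx2, nt = 40, 40, 25
N = nx1 * nx2 * nt          # total number of RF coefficients

# Least-squares requires forming and inverting the N x N
# stimulus covariance: O(N^2) memory and O(N^3) time.
bytes_needed = N**2 * 8      # double precision storage
print(N, bytes_needed / 1e9)  # → 40000 12.8
```

Even this modest stimulus yields a covariance matrix of roughly 12.8 GB, before any inversion cost is paid.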

Past efforts to improve statistical power, and thereby reduce data requirements, have relied on various forms of regularization. Regularization reduces the number of degrees of freedom in the RF by biasing the estimator toward solutions that are more likely a priori and can thus provide for better estimates from less data. Common forms of regularization involve smoothness or sparsity assumptions and have been shown to outperform maximum likelihood (ML) estimation in settings of limited data or correlated stimuli (Aoi & Pillow, 2017; Calabrese et al., 2011; David & Gallant, 2005; Gerwinn et al., 2010; Linden et al., 2003; Park & Pillow, 2011; Smyth et al., 2003). However, the computational cost of such estimators is generally no better than that of classical estimators, and may be worse due to the need to optimize hyperparameters governing the strength of regularization. These poor scaling properties make it difficult to apply such estimators to settings involving high-dimensional stimulus ensembles.

In this article, we propose to overcome these difficulties using a model that parameterizes the RF as smooth and low rank. A low-rank receptive field can be described as a sum of a small number of space-time separable filters (Linden et al., 2003; Park & Pillow, 2013; Pillow et al., 2008; Pillow & Simoncelli, 2006; Qiu et al., 2003). This choice of parameterization significantly reduces the number of parameters in the model and is the key to the scalability of our method. To achieve smoothness, we use gaussian process (GP) priors over both the spatial and temporal filters composing the low-rank RF. For the temporal filters, we introduce the temporal relevance determination (TRD) prior, which uses a novel covariance function that allows for increasing smoothness as a function of time lag. To fit the model, we develop a scalable algorithm for variational Bayesian inference for both receptive field components and hyperparameters governing shrinkage and smoothness. The resulting estimator achieves excellent performance in settings with correlated stimuli and scales to high-dimensional settings in which full-rank estimators are intractable.

The article is organized as follows. Sections 2 and 3 review relevant background and the literature: section 2 introduces the linear encoding model and low-rank linear receptive fields; section 3 reviews previously proposed priors for receptive fields. In section 4, we introduce TRD, a new prior for temporal receptive fields, while in section 5 we introduce our variational low-rank (VLR) receptive field estimator. Section 6 shows applications to simulated as well as real neural data sets, and section 7 provides discussion of our results and suggests directions for future work.

## 2 The Linear-Gaussian Encoding Model

### 2.1 The Receptive Field Tensor

This model description neglects the fact that the stimulus and receptive field are often more naturally described as multidimensional tensors. In vision experiments, for example, the stimulus and receptive field are typically third-order tensors, with two dimensions of space and one dimension of time. In this case, the full stimulus movie shown during an entire experiment can be represented by a tensor of size $n_{x_1} \times n_{x_2} \times T$, consisting of $n_{x_1} \times n_{x_2}$ pixel images over $T$ total time bins or frames. For example, a given element $X_{i,j,t}$ in the stimulus tensor could represent the light intensity in a gray-scale image at time $t$ at spatial location $(i,j)$ in the image.

### 2.2 Low-Rank Receptive Fields

A key feature of neural receptive fields is that they can typically be described by a relatively small number of spatial and temporal components (Adelson & Bergen, 1985; Linden et al., 2003; Park & Pillow, 2013; Pillow et al., 2008; Pillow & Simoncelli, 2006; Qiu et al., 2003). This means that we do not need to use a full set of $n_{x_1} \times n_{x_2} \times n_t$ independent parameters to represent the coefficients in $K$. Instead, we can accurately describe $K$ with a small number of spatial components, each corresponding to an image with $n_{x_1} \times n_{x_2}$ coefficients, and a corresponding number of temporal components, each with $n_t$ coefficients. The number of paired spatial and temporal components needed to represent $K$ is known as the rank of the tensor, which we denote $r$.

Figure 1B illustrates a scenario with $r=2$, in which the tensor $K$ is the sum of two rank 1 components, each defined as a single spatial and temporal weighting function. These rank 1 components are commonly referred to as "space-time separable" filters. Note that a rank $r$ tensor has only $r(n_{x_1} n_{x_2} + n_t)$ parameters, which generally represents a significant savings over the $n_{x_1} n_{x_2} n_t$ parameters needed to parameterize the full-rank tensor. Furthermore, having an explicit description of the temporal and spatial filters composing $K$ increases the interpretability of the RF.
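As a concrete sketch of this parameterization, the following Python snippet (with illustrative dimensions) builds a rank-2 filter as a sum of two space-time separable components and compares the parameter counts:

```python
import numpy as np

rng = np.random.default_rng(0)
nx1, nx2, nt, r = 10, 10, 20, 2   # illustrative dimensions

# One spatial map (nx1*nx2 coefficients) and one temporal
# filter (nt coefficients) per component.
Kx = rng.standard_normal((nx1 * nx2, r))   # spatial components
Kt = rng.standard_normal((nt, r))          # temporal components

# Full STRF (spatial dimensions flattened) as a sum of r
# space-time separable, i.e. rank-1, filters.
K = sum(np.outer(Kt[:, i], Kx[:, i]) for i in range(r))

# Parameter savings: r*(nx1*nx2 + nt) versus nx1*nx2*nt.
print(r * (nx1 * nx2 + nt), nx1 * nx2 * nt)   # → 240 2000
```

The matrix `K` has $n_t \cdot n_{x_1} n_{x_2} = 2000$ entries but only rank 2, so 240 numbers suffice to describe it.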

## 3 Existing Receptive Field Priors

### 3.1 Ridge Prior

### 3.2 Automatic Smoothness Determination

### 3.3 Automatic Locality Determination

The automatic locality determination (ALD) prior (Park & Pillow, 2011) goes beyond the smoothness of the ASD prior by encouraging RF coefficients to be localized in space-time and spatiotemporal frequency.

It relies on a covariance function that encourages both the space-time coefficients and the spatiotemporal frequency components of the RF to be zero outside of some localized region, resulting in a form of structured sparsity (Wu et al., 2014). This prior includes the ASD smoothness prior as a special case, namely, when the region of nonzero spatiotemporal frequency components is centered at zero, and there is no locality in space or time. However, the ALD prior also allows for bandpass filters in which the RF is composed primarily of intermediate frequencies.

The full ALD covariance matrix (see equation 3.4), which multiplies together the diagonal matrices $C_{\theta_s}$ and $C_{\theta_f}$ with a discrete Fourier domain operator in between, has the effect of simultaneously imposing locality in both space-time and frequency. It is identical to ASD under the setting that $C_{\theta_s} = I$, $\tilde{m} = 0$, and $\tilde{\Psi}$ is diagonal (Aoi & Pillow, 2017). Empirically, ALD has been shown to outperform both the ridge and ASD priors, as well as other sparsity-inducing priors, in applications to neural data in the early visual system (Park & Pillow, 2011).

## 4 A New Prior for Temporal Receptive Fields

Our first contribution is to propose a new prior for temporal receptive fields. The priors in ASD and ALD both assume a constant degree of smoothness across the receptive field. However, an assumption of stationary smoothness is less appropriate for temporal receptive fields, which are typically sharper at short time lags and smoother at longer time lags. In order to incorporate this variable form of smoothness, we introduce the temporal relevance determination (TRD) prior. This prior uses a smoothing covariance function that is stationary in log-scaled time and is therefore nonstationary in linear time.
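The key idea can be sketched with a squared-exponential kernel evaluated on log-scaled time lags; note this is only an illustration of the log-time construction, not the exact TRD covariance function, and the hyperparameter names (`rho`, `ell`, `t0`) are ours:

```python
import numpy as np

def log_time_cov(n_lags, rho=1.0, ell=0.5, t0=1.0):
    """Illustrative nonstationary temporal covariance: a squared-
    exponential kernel applied to log-scaled time lags, so that
    effective smoothness increases with lag. Hyperparameters are
    illustrative placeholders, not the TRD parameterization."""
    t = np.log(np.arange(1, n_lags + 1) + t0)   # log-scaled lags
    D = t[:, None] - t[None, :]                  # pairwise distances
    return rho * np.exp(-0.5 * D**2 / ell**2)

C = log_time_cov(30)
# Neighboring short lags (e.g., 1 and 2) are less correlated than
# neighboring long lags (e.g., 28 and 29): smoothness grows with lag.
```

Because distances between adjacent lags shrink in log time, the prior permits rapid variation at short lags while enforcing strong smoothness at long lags.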

## 5 A Probabilistic Model for Low-Rank Receptive Fields

Our second contribution is a model and inference method for smooth, low-rank receptive fields. As noted in section 2.2, a low-rank parameterization for spatiotemporal receptive fields can offer a dramatic reduction in dimensionality without loss of accuracy. In particular, a rank $r$ spatiotemporal filter requires only $r \cdot (n_{x_1} n_{x_2} + n_t)$ coefficients versus $N = n_{x_1} n_{x_2} n_t$ for a full-rank filter.

Under this modeling choice, the full receptive field $K = K_t K_x^\top = k_{t_1} k_{x_1}^\top + \cdots + k_{t_r} k_{x_r}^\top$ represents the product of two gaussian random variables. The effective prior over $K$ is thus non-gaussian. As a result, posterior inference is not analytically tractable, because the prior is not conjugate to the gaussian likelihood of the encoding model (see equation 2.1). We therefore develop a scalable variational inference algorithm for setting the hyperparameters $\theta$ governing the prior covariance matrices and inferring the receptive field coefficients $K_x$ and $K_t$, which we describe in sections 5.1 and 5.2. We introduce additional modeling choices that facilitate scalable estimation through a further reduction in the total number of receptive field coefficients in section 5.1 and describe the resulting algorithm in detail in section 5.2.

### 5.1 Scalability via Sparse Spectral Representations of Covariance Matrices

Performing inference in the model we have already introduced still requires building and inverting large covariance matrices. To reduce the dimensionality of the spatial and temporal receptive field inference further and to achieve scalability to high-dimensional settings, we make use of sparse spectral representations of the prior covariance matrices.

This sparse spectral representation involves expressing the prior covariance of $k$ as $C_\theta \approx B_\theta B_\theta^\top$. When $B_\theta$ contains the eigenvectors of $C_\theta$, scaled by their square-rooted associated eigenvalues, this approximation is exact. However, we can often incur only a small approximation error while retaining only the leading eigenvectors of $C_\theta$ in $B_\theta$. For example, for larger length-scale parameters (high degree of smoothness), the covariance matrix associated with the ASD covariance function has eigenvalues that quickly decay toward zero (Rasmussen & Williams, 2006). By truncating its eigenvalues near zero, it is possible to find a low-rank approximation to the covariance matrix $C_\theta$ that is exact to machine precision. For all of the prior covariance matrices discussed in section 3, it is possible to obtain an analytic expression for $B_\theta$ (Aoi & Pillow, 2017).
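The following sketch illustrates this truncation numerically for a squared-exponential (ASD-style) covariance on a one-dimensional grid; the grid size, length scale, and truncation threshold are illustrative:

```python
import numpy as np

# ASD-style squared-exponential covariance over a 1D coordinate grid.
n, ell = 100, 10.0
x = np.arange(n, dtype=float)
C = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / ell**2)

# Keep only eigenvectors whose eigenvalues exceed a small threshold;
# eigh can return tiny negative eigenvalues, which are discarded too.
evals, evecs = np.linalg.eigh(C)
keep = evals > 1e-10 * evals.max()
B = evecs[:, keep] * np.sqrt(evals[keep])   # C ≈ B B^T

# B has far fewer than n columns, yet reconstructs C to high accuracy.
```

For smooth kernels the number of retained columns grows slowly with $n$, which is what makes the whitened low-dimensional representation in the next section effective.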

Given $B_\theta$, we can express the prior distribution over the receptive field in terms of a linear transformation of a multivariate standard normal, or "whitened," random variable: $k = B_\theta w$, with $w \sim \mathcal{N}(0, I)$. The resulting prior covariance of $k$ is equal to $C_\theta = B_\theta B_\theta^\top$ and thus unchanged from before. However, performing inference over $w$ instead of $k$ allows us to circumvent costly inversions of $C_\theta$. Furthermore, if $C_\theta$ is represented to sufficient accuracy using fewer dimensions in $B_\theta$, then $w$ will contain fewer dimensions than $k$, leading to additional improvements in scalability.
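A minimal sketch of inference in this whitened parameterization, for a single gaussian filter under a linear-gaussian likelihood (ignoring the low-rank product structure), follows; the function name and toy dimensions are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def whitened_posterior_mean(X, y, B, sigma2):
    """Posterior mean of k = B w, with prior w ~ N(0, I) and
    likelihood y ~ N(X k, sigma2 I). Only a p x p system is solved
    (p = B.shape[1]); the full prior covariance C = B B^T is never
    built or inverted."""
    A = X @ B                                    # design projected onto basis
    P = np.eye(B.shape[1]) + A.T @ A / sigma2    # posterior precision over w
    mu_w = np.linalg.solve(P, A.T @ y / sigma2)  # posterior mean over w
    return B @ mu_w                              # back to receptive field space

# Toy check: recover a filter drawn from the prior.
n, d, p = 200, 50, 10
B = rng.standard_normal((d, p)) / np.sqrt(p)
X = rng.standard_normal((n, d))
k_true = B @ rng.standard_normal(p)
y = X @ k_true + 0.1 * rng.standard_normal(n)
k_hat = whitened_posterior_mean(X, y, B, 0.01)
```

The posterior computation costs $O(p^3)$ rather than $O(d^3)$, which is the source of the scalability gains described above.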

Given this representation of the STRF, the rest of the model remains unchanged. This model description is illustrated as a graphical model in Figure 3B.

### 5.2 Variational Inference and Hyperparameter Learning for Low-Rank STRFs (VLR)

Fitting our low-rank model to data involves performing posterior inference over the receptive field components $w_x$ and $w_t$ and learning the hyperparameters $\theta$ that determine the shape of the basis functions in $B_\theta$. To do this, we derive a variational optimization framework, similar to the classic empirical Bayes approach reviewed in appendix A.2. We rely on a variational approach here, since the effective non-gaussian prior over $k$ makes obtaining an exact expression for the marginal log-likelihood intractable.

Instead of computing and optimizing the model evidence directly, we can perform coordinate ascent and alternately maximize $F$ with respect to the distribution $q(w_t, w_x)$ and with respect to the parameters $\theta$, as in the expectation maximization (EM) algorithm (Dempster et al., 1977). The optimization over $q(w_t, w_x)$ can either be performed exactly (which amounts to setting $q$ equal to the joint posterior distribution over $w_t$ and $w_x$) or under additional constraints on $q$, such as restricting it to a particular family of variational distributions.
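The structure of this alternation can be sketched in the simpler, fully conjugate setting of a full-rank gaussian linear model, where both steps have closed forms. This is the classic empirical Bayes setting reviewed in appendix A.2, not the VLR algorithm itself, and the function name and update details are illustrative:

```python
import numpy as np

def evidence_opt(X, y, n_iter=50):
    """EM-style coordinate ascent for a conjugate gaussian linear
    model: the E-step computes the gaussian posterior over weights;
    the M-step re-estimates the prior precision alpha and noise
    variance sigma2 in closed form."""
    n, d = X.shape
    alpha, sigma2 = 1.0, 1.0
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_iter):
        # E-step: posterior mean m and covariance S of the weights.
        P = alpha * np.eye(d) + XtX / sigma2
        S = np.linalg.inv(P)
        m = S @ Xty / sigma2
        # M-step: closed-form hyperparameter updates using E[w w^T].
        alpha = d / (m @ m + np.trace(S))
        resid = y - X @ m
        sigma2 = (resid @ resid + np.trace(XtX @ S)) / n
    return m, alpha, sigma2
```

In VLR the exact E-step over the joint $(w_t, w_x)$ is unavailable, so a variational distribution takes its place, but the alternating structure is the same.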

## 6 Results

### 6.1 Data Efficient Estimation under Correlated Stimulus Distributions

We first tested the VLR estimator using synthetic data. We simulated data from a linear-gaussian encoding model with a rank-2 receptive field that was stimulated with correlated stimuli.
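A simulation of this kind can be sketched as follows; the dimensions, correlation strength, and noise level are illustrative and are not the settings used in our experiments:

```python
import numpy as np

rng = np.random.default_rng(2)
nx, nt, T = 16, 12, 5000   # 1D space for simplicity

# Spatially correlated stimuli: AR(1)-style smoothing across pixels.
raw = rng.standard_normal((T, nx))
stim = np.empty_like(raw)
stim[:, 0] = raw[:, 0]
for j in range(1, nx):
    stim[:, j] = 0.8 * stim[:, j - 1] + raw[:, j]

# Rank-2 ground-truth STRF from two space-time separable components.
kt = rng.standard_normal((nt, 2))
kx = rng.standard_normal((nx, 2))
K = kt @ kx.T                     # nt x nx filter, rank 2

# Linear-gaussian responses: project each stimulus history segment
# onto the flattened STRF and add gaussian noise.
X = np.stack([stim[t - nt:t].ravel() for t in range(nt, T)])
y = X @ K.ravel() + 0.5 * rng.standard_normal(T - nt)
```

Fitting any of the estimators discussed above to `(X, y)` and comparing the result against `K` gives the kind of recovery benchmark used in this section.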

### 6.2 Application to Macaque V1 Simple Cells

### 6.3 Application to Rat Retinal Ganglion Cells

Previous work has identified multiple cell types of retinal ganglion cells (RGCs; Baden et al., 2016). Functional cell-type distinctions are usually based on receptive field properties such as ON/OFF regions, temporal filtering properties, or other structures that can be extracted from STRFs (Chichilnisky & Kalmar, 2002; Field et al., 2007; Frechette et al., 2005). Other differences in response properties, such as STRF rank, may prove valuable for further distinguishing functional cell types.

To further illustrate the power of the VLR estimator, we examined a data set consisting of electrophysiological recordings from rat RGCs in response to binary white noise stimuli. We selected a region of $25 \times 25$ pixels in space based on the center of the RF (as estimated by the STA) from the $80 \times 40$ overall stimulus, and we considered 30 time lags (0.5 seconds) of stimulus history. This yielded a total of $N = 18{,}750$ STRF coefficients. Standard regression estimators like the MLE require large amounts of training data for STRFs of this size. Bayesian estimators like ASD and ALD are computationally intractable in high-dimensional settings due to the need to store and invert matrices of size $N \times N$. Thus, we compare the VLR estimator with the STA, which can readily be computed even for large STRF sizes such as those considered here.
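The STA itself is a simple spike-weighted average and remains cheap to compute at this scale, as the following sketch illustrates (the helper name and toy data are illustrative; our released code is in Matlab, but the sketch is in Python):

```python
import numpy as np

def compute_sta(X, y):
    """Spike-triggered average: the spike-weighted mean stimulus.
    Costs O(T N) time and O(N) memory in the number of STRF
    coefficients N, with no matrix inversion."""
    return X.T @ y / y.sum()

# Toy illustration at the coefficient count used for the RGC data.
rng = np.random.default_rng(3)
T, N = 200, 25 * 25 * 30            # N = 18,750 STRF coefficients
X = rng.integers(0, 2, size=(T, N)).astype(float)   # binary white noise
y = rng.poisson(2.0, size=T).astype(float)          # spike counts per bin
sta = compute_sta(X, y)
```

By contrast, an ASD- or ALD-style estimator at this size would need to manipulate an $18{,}750 \times 18{,}750$ covariance matrix.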

## 7 Discussion

In this article, we have introduced a novel method for inferring low-rank STRFs from neural recordings and have shown a substantial improvement over previous methods for STRF estimation. Previous approaches like the STA have low computational complexity and hence scale to large STRFs, but they provide biased estimates under correlated stimulus distributions and classically require large amounts of training data to arrive at accurate estimates. Previous Bayesian approaches have been shown to be more data efficient and can provide more accurate STRF estimates, even under correlated stimulus distributions. However, they require matrix inversions that are computationally costly, so applying them to large stimulus ensembles remains challenging. Furthermore, all of these approaches parameterize the full STRF and do not explicitly exploit potential low-rank structure in the STRF. These methods are therefore general but require estimating large numbers of parameters, which prohibits their application to large-scale problems.

The VLR estimator we have introduced in this article addresses limitations of previous approaches to STRF estimation in linear encoding models. We have shown that our method provides accurate estimates in settings when stimuli are correlated, the receptive field dimensionality is high, or training data are limited. We developed a new prior covariance function that captures the nonstationary smoothness that is typical of temporal receptive fields. In combination with previous priors capturing localized smooth structure for spatial receptive fields, our probabilistic model provided a powerful framework for low-rank STRF estimation. While we have focused on these modeling choices in this article, our modeling framework is general and can easily accommodate other choices of prior covariance functions.

The ability to accurately estimate STRFs from limited amounts of data will be important for quantifying changes of receptive field properties over time or during learning. Furthermore, it will open up possibilities for closed-loop experiments, where stimuli are selected actively on each frame to reduce uncertainty about the STRF. In addition, being able to use high-dimensional stimuli and correlated stimulus distributions will be important for studying population responses. In this setting, VLR will be useful for estimating the receptive fields of many cells in parallel. Finally, VLR allows for cross-validated estimates of the STRF rank, which may prove important for quantifying single-cell response properties and categorizing cells into specific functional classes.

While we have presented our method in the context of gaussian linear encoding models, it is also possible to extend the variational inference algorithm presented here to non-gaussian observation models in a generalized linear encoding model (GLM). In this setting, closed-form updates for the posterior means and covariances are no longer available, and the variational free energy must instead be optimized directly. Previous approaches have used basis functions to improve the scalability of RF estimation in Poisson GLMs (Pillow et al., 2008). However, this typically requires setting parameters such as the number, location, and shape of basis functions by hand. Recent work has begun to circumvent this cumbersome by-hand selection of basis functions by using sparse variational gaussian process approximations instead (Dowling et al., 2020). This approach relies on inducing-point methods that have recently gained popularity for improving the scalability of gaussian process methods more generally (Hensman et al., 2013; Titsias, 2009). The framework we presented here is closely related to the basis function approach for GLMs and can be used to determine parameters such as the number, location, and shape of basis functions automatically. The improvements in scalability we achieve through whitened representations of the receptive field also apply in the context of GLMs. Ultimately, algorithms that determine such hyperparameters automatically, without user-dependent input, will contribute to the robustness and reproducibility of modeling results.

## Appendix A: Bayesian Receptive Field Estimation

### A.1 Maximum A Posteriori (MAP) Estimation

### A.2 Evidence Optimization and Empirical Bayes

A critical question we have not yet discussed is how to set the noise variance $\sigma 2$ and the hyperparameters $\theta $ governing the prior. A common approach for selecting hyperparameter values is cross-validation over a grid of candidate values. However, this approach is unwieldy for models with more than one or two hyperparameters.

A more scalable alternative is *empirical Bayes*:

## Appendix B: Variational Updates for Low-Rank STRF Inference

## Appendix C: M-Step Updates for Low-Rank STRF Inference

The M-step updates for the hyperparameters $\theta = \{\theta_x, \theta_t\}$ can be carried out most efficiently by performing conditional M-steps for either the spatial or temporal parameters while holding the other set of parameters fixed.

It is apparent from the above expression that the conditional M-steps rely on statistics of the stimulus projected onto the basis we are not optimizing over. To evaluate each M-step update as efficiently as possible, these statistics can be precomputed once and reused throughout the conditional M-step updates. Below, we provide further details for each conditional M-step.

### C.1 Spatial Parameter Updates

### C.2 Temporal Parameter Updates

Analogous to the spatial parameter updates, this involves precomputing and storing a set of matrices of size $p_t \times p_t$.

## Code Availability

We provide Matlab code for fitting low-rank STRFs under flexible choices of prior covariance functions at https://github.com/pillowlab/VLR-STRF.

## Author Contributions

L.D. and J.W.P. conceived of this project. L.D. developed the presented methods and performed analyses under the supervision of J.W.P. K.M.R. and G.D.F. collected the retinal ganglion cell experimental data and assisted in the retinal ganglion cell analyses. L.D. and J.W.P. wrote the manuscript with input from K.M.R. and G.D.F.

## Acknowledgments

We thank Sneha Ravi for sharing the RGC data, and Nicole Rust and Tony Movshon for providing access to V1 simple cell data. L.D. was supported by the Gatsby Charitable Foundation. J.W.P. was supported by grants from the McKnight Foundation, the Simons Collaboration on the Global Brain (SCGB AWD543027), the NIH BRAIN initiative (NS104899 and R01EB026946), and a U19 NIH-NINDS BRAIN Initiative Award (5U19NS104648).

## References

*J. Opt. Soc. Am. A*

*Scalable Bayesian inference for high-dimensional neural receptive fields*

*Nature*

*Biological Cybernetics*

*Pattern recognition and machine learning*

*PLOS One*

*Network: Computation in Neural Systems*

*Journal of Neuroscience*

*Network: Computation in Neural Systems*

*IEEE Transact. Biomed. Eng.*

*Journal of the Royal Statistical Society Series B*

*Non-parametric generalized linear model*

*Journal of the American Statistical Association*

*Journal of Neuroscience*

*Journal of Neurophysiology*

*Frontiers in Computational Neuroscience*

*Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence*

*Technometrics*

*Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability*

*Journal of Neurophysiology*

*International Journal of Control*

*Journal of Neurophysiology*

*Neural Computation*

*Annals of Statistics*

*Frontiers in Systems Neuroscience*

*PLOS Comput Biol.*

*Advances in neural information processing systems*

*Nature*

*Journal of Vision*

*Journal of Neurophysiology*

*Gaussian processes for machine learning*

*Journal of Neuroscience*

*J. Physiol.*

*Vision Research*

*Neuron*

*Advances in neural information processing systems*

*Journal of Vision*

*Neural Computation*

*The cognitive neurosciences*

*Journal of Neuroscience*

*Network: Computation in Neural Systems*

*Proceedings of the 12th International Conference on Artificial Intelligence and Statistics*

*Advances in neural information processing systems*