## Abstract

When governed by underlying low-dimensional dynamics, the interdependence of simultaneously recorded populations of neurons can be explained by a small number of shared factors, or a low-dimensional trajectory. Recovering these latent trajectories, particularly from single-trial population recordings, may help us understand the dynamics that drive neural computation. However, due to biophysical constraints and noise in spike trains, inferring trajectories from data is a challenging statistical problem in general. Here, we propose a practical and efficient inference method, the variational latent gaussian process (vLGP). vLGP combines a generative model with a history-dependent point-process observation and a smoothness prior on the latent trajectories. It improves on earlier methods for recovering latent trajectories, which assume either observation models inappropriate for point processes or linear dynamics. We compare and validate vLGP on both simulated data sets and population recordings from the primary visual cortex. On the V1 data set, vLGP achieves substantially higher performance than previous methods at predicting omitted spike trains, and it captures both the toroidal topology of the visual stimulus space and the noise correlations. These results show that vLGP is a robust method with the potential to reveal hidden neural dynamics from large-scale neural recordings.

## 1 Introduction

Neural populations implement dynamics that produce robust behavior; however, our current experimental observations of these dynamics are invariably indirect and partial. In classical analyses of neural spike trains, noisy responses are averaged over repeated trials that are presumably time-locked to a stereotypical computation process. However, neural dynamics are not necessarily time-locked or precisely repeated from trial to trial; rather, many cognitive processes generate observable variations in the internal processes that sometimes manifest in behavior such as error trials, broad reaction time distributions, and change of mind (Latimer, Yates, Meister, Huk, & Pillow, 2015; Jazayeri & Shadlen, 2010; Resulaj, Kiani, Wolpert, & Shadlen, 2009). In addition, it is difficult to disambiguate different possible neural implementations of computation from the average trajectory since they may differ only in their trial-to-trial variability (Latimer et al., 2015; Churchland et al., 2011). Therefore, if we wish to understand how neural computation is implemented in neural populations, it is imperative that we recover these hidden dynamics from individual trials (Paninski et al., 2010; Kao et al., 2015).

Advances in techniques for recording from larger subpopulations facilitate single-trial analysis, especially the inference of single-trial latent dynamical trajectories. Several statistical approaches have been developed for extracting latent trajectories that describe the activity of observed populations (Yu et al., 2009; Koyama, Pérez-Bolde, Shalizi, & Kass, 2010; Macke et al., 2011; Pfau, Pnevmatikakis, & Paninski, 2013; Archer, Koster, Pillow, & Macke, 2014; Frigola, Chen, & Rasmussen, 2014). For example, latent trajectories recovered from motor cortex suggest that these methods can provide insight into the coding and preparation of planned reaching behavior (Sadtler et al., 2014; Churchland et al., 2012; Churchland, Cunningham, Kaufman, Ryu, & Shenoy, 2010). Latent trajectories also elucidate the low-dimensional noise structure of neural codes and computations (Moreno-Bote et al., 2014; Haefner, Gerwinn, Macke, & Bethge, 2013; Sadtler et al., 2014; Ecker et al., 2014).

Inference of latent dynamical trajectories is a dimensionality-reduction method for multivariate time series, akin to Kalman smoothing or factor analysis (Kao et al., 2015). Given a high-dimensional observation sequence, we aim to infer a shared, low-dimensional latent process that explains much of the variation in the high-dimensional observations. A large class of methods assumes an autoregressive linear dynamics model in the latent process due to its computational tractability (Paninski et al., 2010; Macke et al., 2011; Buesing, Macke, & Sahani, 2012; Kao et al., 2015); we refer to these as Poisson linear dynamical systems (PLDS). Although the assumption of linear dynamics can help in smoothing, it can also be overly simplistic: interesting neural computations are naturally implemented as nonlinear dynamics, and evidence points to nonlinear dynamics in the brain in general. Therefore, we relax this modeling assumption and impose a general gaussian process prior, inferring the latent dynamics nonparametrically, similar to the gaussian process factor analysis (GPFA) method (Yu et al., 2009; Lakshmanan et al., 2015). However, we differ from GPFA in that we use a point-process observation model with self-history dependence rather than an instantaneous gaussian observation model. A gaussian observation model is inappropriate for inference at millisecond timescales since it cannot generate spike counts. The price we pay is a nonconjugate prior and, consequently, approximate posterior inference (Paninski et al., 2010). We use a variational approximation (Blei, Kucukelbir, & McAuliffe, in press), where we assume a gaussian process posterior over the latents and optimize a lower bound of the marginal likelihood for the inference. Our algorithm, which we call the variational latent gaussian process (vLGP), is fast and achieves better predictive performance than both GPFA and PLDS at a fine timescale (1 ms bins).
We compare these algorithms on simulated systems with known latent processes, and we apply vLGP to high-dimensional V1 data from an anesthetized monkey to recover both the noise correlation structure and the topological structure of the population encoding of drifting oriented grating stimuli.

## 2 Generative Model

Conditioned on the latent process, neurons are independent: all trial-to-trial variability is attributed to either the latent process or individual point-process noise (Goris, Movshon, & Simoncelli, 2014; Ecker et al., 2014; Lin, Okun, Carandini, & Harris, 2015).

The vector x_t denotes the q-dimensional latent process at time t. We assume q ≪ N, the number of simultaneously observed neurons, since we are looking for a small number of latent processes that explain the structure of a large observed population. For each neuron, the weight vector consists of the spike-history filter coefficients and a time-independent bias term of the log firing rate, and the corresponding regressor is a vector of length p + 1 containing the dummy value 1 for the bias and the p most recent time steps of the neuron's own spike history. This parameterization assumes that at most p bins in the past influence the current intensity.

Note that this model is not identifiable: the latent process can be transformed by an arbitrary invertible matrix, with the loading weights transformed by its inverse accordingly, without changing the likelihood (see later sections for further discussion). Also, the mean of the latent process can be traded off against the bias term in the weights.
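To make the generative model concrete, the following sketch simulates spike trains from it. All specific values (timescale, loading scale, bias, and history filter) are hypothetical choices for illustration, not settings from this letter:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N, Q, P = 1000, 50, 2, 10      # time bins, neurons, latent dims, history bins

# Latent trajectories: each dimension drawn from a gaussian process
# with a squared-exponential kernel (timescale tau, unit variance).
tau = 100.0
t = np.arange(T)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / tau ** 2) + 1e-6 * np.eye(T)
x = np.linalg.cholesky(K) @ rng.standard_normal((T, Q))      # (T, Q) latents

# Per-neuron parameters: loading onto the latents, bias (log base rate),
# and an inhibitory self-history filter (hypothetical values).
a = 0.5 * rng.standard_normal((Q, N))
bias = np.full(N, -4.0)
hist = np.full((P, N), -1.0)

# Spikes are generated bin by bin: the log intensity is linear in the
# current latent state, the bias, and the neuron's own recent spikes.
y = np.zeros((T, N))
for ti in range(T):
    past = y[max(0, ti - P):ti][::-1]           # most recent bins first
    h = (past * hist[:len(past)]).sum(axis=0)   # self-history drive
    lam = np.exp(x[ti] @ a + bias + h)
    y[ti] = rng.poisson(np.clip(lam, None, 1.0))  # cap mean rate at 1 per bin
```

The self-history term makes the observation a point-process regression rather than an instantaneous Poisson model, which is what distinguishes this observation model from GPFA's.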

## 3 Variational Inference

Variational inference for the entire posterior over latents, parameters, and hyperparameters can all be formulated in terms of maximizing equation 3.5. We sequentially update all parameters coordinate-wise; each conditional update turns out to be a convex-optimization problem except for the hyperparameters as explained below. We derive the inference algorithm (vLGP) in the following sections, and it is summarized in algorithm 1.

Our algorithm scales linearly in space and time per iteration (for fixed hyperparameters) in the length of the time series, thanks to a rank-r incomplete Cholesky factorization of the prior covariance matrix. For comparison, GPFA requires time cubic in the number of time bins because it works with the complete prior covariance, while PLDS is linear in the number of time bins via forward-backward (Kalman) smoothing.
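A minimal sketch of the pivoted incomplete Cholesky factorization that enables this scaling, applied to a squared-exponential prior covariance (the rank and timescale below are illustrative):

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-8):
    """Pivoted incomplete Cholesky: returns G with K ~= G @ G.T,
    where G has shape (T, at most `rank`)."""
    T = K.shape[0]
    G = np.zeros((T, rank))
    d = np.diag(K).copy()              # residual diagonal
    for j in range(rank):
        i = int(np.argmax(d))          # pivot on the largest residual
        if d[i] < tol:                 # remaining residual is negligible
            return G[:, :j]
        G[:, j] = (K[:, i] - G @ G[i]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

T, tau = 500, 50.0
t = np.arange(T)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / tau ** 2)
G = incomplete_cholesky(K, rank=30)
rel_err = np.linalg.norm(K - G @ G.T) / np.linalg.norm(K)  # tiny for smooth kernels
```

Because a smooth kernel's eigenvalues decay rapidly, a rank far smaller than T captures the prior covariance almost exactly, and downstream operations involve only the thin factor G.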

### 3.1 Posterior over the Latent Process

There is a redundancy between the bias term in the weights and the latent mean. During optimization, we resolve it by zero-centering the posterior mean of each latent and normalizing the corresponding loading weights by their max-norm, latent-wise.

### 3.2 Weights

### 3.3 Hyperparameters

One way to choose hyperparameters is to maximize the marginal likelihood with respect to them. Since the marginal likelihood is intractable in the vLGP model, we instead maximize equation 3.5 once again given the parameters and posterior. Interestingly, this objective function takes the same form as the one maximized in GPFA's hyperparameter updates.

The above derivation of the hyperparameter optimization technique assumes a fixed posterior and parameters. Thus, it requires complete prior covariance matrices and explicit posterior covariance matrices rather than low-rank decompositions. In order to avoid numerical singularity, we add a small quantity to the diagonal of prior covariance matrices. It would be extremely costly to use these complete covariance matrices for long, consecutive time series. Therefore, we randomly take many shorter temporal subsamples of the posterior for fast computation (Yu et al., 2009). One hyperparameter iteration is performed every fixed number of iterations of posterior and parameter optimization.
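The jitter and temporal subsampling described above can be sketched as follows (window sizes and jitter magnitude are illustrative choices, not the letter's settings):

```python
import numpy as np

def prior_cov(t, tau, sigma2=1.0, jitter=1e-7):
    """Complete squared-exponential prior covariance on a time window,
    with a small quantity (jitter) added to the diagonal to avoid
    numerical singularity."""
    D = t[:, None] - t[None, :]
    return sigma2 * np.exp(-0.5 * D ** 2 / tau ** 2) + jitter * np.eye(len(t))

def subsample_windows(T, width, n_windows, rng):
    """Random short temporal windows on which the hyperparameter
    objective is evaluated instead of the full series."""
    starts = rng.integers(0, T - width, size=n_windows)
    return [slice(int(s), int(s) + width) for s in starts]
```

Evaluating the objective on many short random windows keeps the cost of forming and inverting complete covariance matrices manageable, at the price of a stochastic estimate of the hyperparameter gradient.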

## 4 Results

We first verify that our inference algorithm recovers the true parameters and latent variables when there is no model mismatch, and then apply it to two simulated systems and one real data set. We compare our method, vLGP, against GPFA and PLDS.

### 4.1 Convergence

First, we demonstrate that vLGP converges to the correct parameters and latent variables under the assumed generative model. We applied our method to simulated spike trains driven by a two-dimensional gaussian process. We fixed the number of time bins per trial, the GP variance, and the timescale. The parameter values were randomly drawn from a standard normal distribution. There are two limits to consider for convergence: increasing the duration of observations (more trials) and increasing the number of observed neurons. The parameters and latent variables are initialized by factor analysis (FA). To probe the properties of the global optimum, we also initialized the parameters and latent variables near the true values (by adding zero-mean gaussian noise with standard deviation 0.1).

We calculated the mean squared error (MSE) of the posterior mean and the weights on a grid of different numbers of trials and neurons. Figure 2a shows the convergence trend in MSE. The posterior mean of the latent distribution converges to the true latent as the number of neurons grows, and the parameters converge to the true weights as the number of time bins grows. The difference between FA and near-truth initialization reflects the relative error of FA initialization combined with the nonconvex vLGP inference, which is small for latent process estimation.

We also fit the vLGP model to simulated spike trains driven by a one-dimensional gaussian process latent at four different timescales, with the rest of the setting the same as in the above simulation. For each timescale, we simulated 10 data sets. We initialized the timescale at a very smooth value and set the initial values of the parameters and latent variables by factor analysis. Figure 2b shows the learned values scattered around the ground truth for a wide range of true timescales. We note that learning the timescale is challenging in general, especially in high-dimensional latent spaces (data not shown).

### 4.2 Evaluation

We use a leave-one-neuron-out prediction likelihood to compare models. For each data set comprising several trials, we choose one trial as the test trial and the others as training trials. First, the weights and posterior are inferred from the training trials. Next, we leave one neuron out of the test trial and infer the posterior using the remaining neurons with the weights estimated from the training trials. Then the spike train of the left-out neuron is predicted by the model, given the weights estimated from the training trials and the posterior inferred from the test trial. We repeat this procedure for each neuron of the chosen test trial and, in turn, for each trial of the data set as the test trial. Finally, we obtain predictions of all spike trains in the data set.
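The cross-validation loop can be illustrated with a simple linear-gaussian stand-in model (PCA loading plus least-squares latent inference); this demonstrates the leave-one-neuron-out procedure itself, not the vLGP model:

```python
import numpy as np

def lono_predictions(train, test, n_latents=3):
    """Leave-one-neuron-out prediction with a linear-gaussian stand-in:
    fit a loading on the training trials, then for each neuron of the
    test trial, infer latents from the other neurons and predict it.
    train: (T_train, N) stacked training trials; test: (T_test, N)."""
    m = train.mean(axis=0)
    _, _, Vt = np.linalg.svd(train - m, full_matrices=False)
    C = Vt[:n_latents].T                      # (N, n_latents) loading
    pred = np.empty_like(test, dtype=float)
    for n in range(test.shape[1]):
        keep = np.arange(test.shape[1]) != n
        # infer latents from the remaining neurons only ...
        z, *_ = np.linalg.lstsq(C[keep], (test[:, keep] - m[keep]).T, rcond=None)
        # ... then predict the left-out neuron from those latents
        pred[:, n] = m[n] + C[n] @ z
    return pred
```

Because the left-out neuron never informs its own latent inference, the prediction quality reflects how well the shared latent structure generalizes across neurons.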

For simulated data sets, we know the true latent process that generates the observations. Since the latent space is identifiable only up to an affine transformation, we can quantify recovery using the angle between subspaces (Buesing et al., 2012; Pfau et al., 2013). However, due to a possible mismatch in the point nonlinearity, the subspace can be distorted. To account for this mismatch, we use the mean Spearman rank correlation, which allows an arbitrary invertible monotonic mapping in each dimension. The Spearman rank correlation between the posterior and the true latent trajectory measures the goodness of the posterior: the larger the correlation, the more information the posterior recovers about the underlying trajectory.
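For example, with scipy's `spearmanr` on illustrative synthetic latents, the mean rank correlation is insensitive to a monotone distortion of each dimension:

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
true = rng.standard_normal((500, 2)).cumsum(axis=0)      # smooth-ish ground truth
# an invertible monotone distortion (tanh) plus a little noise
inferred = np.tanh(0.1 * true) + 1e-3 * rng.standard_normal(true.shape)

# mean Spearman rank correlation across latent dimensions
rho = np.mean([spearmanr(true[:, q], inferred[:, q])[0]
               for q in range(true.shape[1])])
```

Here `rho` stays close to 1 despite the nonlinearity, whereas a Pearson correlation would be attenuated by the distortion.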

### 4.3 Simulation

We simulate two data sets: one with deterministic nonlinear dynamics and one with linear dynamics and a model-mismatched nonlinear observation. Each data set consists of five samples (simulated data sets), and each sample contains 10 trials of 50 neurons lasting 1 s. We choose a bin size of 1 ms.
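The deterministic nonlinear dynamics here are the Lorenz system (see Figure 3); such a latent trajectory can be generated by simple Euler integration. The step size and initial state below are illustrative, not this letter's settings:

```python
import numpy as np

def lorenz(T, dt=0.01, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Euler-integrated Lorenz system used as a deterministic
    nonlinear latent trajectory."""
    x = np.empty((T, 3))
    x[0] = (1.0, 1.0, 1.0)                    # illustrative initial state
    for ti in range(T - 1):
        X, Y, Z = x[ti]
        x[ti + 1] = x[ti] + dt * np.array([sigma * (Y - X),
                                           X * (rho - Z) - Y,
                                           X * Y - beta * Z])
    return (x - x.mean(axis=0)) / x.std(axis=0)   # standardize before use

latent = lorenz(1000)
```

The standardized trajectory can then drive spike generation through the loading and bias of the generative model.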

Figure 3 shows one trial from each data set and the corresponding inferred posterior mean latent process. The posterior means are rotated toward the true latent subspace. The PLDS inference (blue) is farthest from the true Lorenz latent but much closer to the LDS latent, because the latter meets its linear-dynamics assumption; however, the PLDS-inferred latents lack smoothness. The GPFA inference (green) is better than PLDS for the Lorenz latent but deviates from the true LDS latent, while remaining smooth. The inference of our method (red) stays close to the true latent over time in both cases while remaining smooth.

Figure 4 shows the Spearman rank correlation between the posterior mean and the true latent versus running time (log scale). The figures show that our method, vLGP, resulted in an overall larger correlation than PLDS and GPFA after the algorithms terminated. PLDS uses nuclear-norm-penalized rate estimation as initialization (Pfau et al., 2013); the rank correlation from PLDS inference improved only slowly from this initial value over the course of optimization. Both GPFA and vLGP use factor analysis as initialization (Yu et al., 2009). Note that GPFA divides each trial into small time segments for estimating the loading matrix and bias, which breaks the continuity within each trial; only the final iteration infers each trial as a whole, so the correlations jump up at the final iteration in the figures. vLGP greatly improves on the factor analysis initialization in terms of the rank correlation.

We also compare the R² of the predictions, instead of the PLL, across the three models (see Figure 5b), where the raw linear prediction without the rectifier link was used.^{2} PLL and R² largely agree, even though R² assumes the squared-error measure that GPFA optimizes for and is inappropriate for PLDS and vLGP.

### 4.4 V1 Population Recording

We apply our method to a large-scale recording to validate that vLGP picks up meaningful known signals and to investigate the population-wide trial-to-trial variability structure. We use the data set of Graf, Kohn, Jazayeri, and Movshon (2011), in which 72 different equally spaced directional drifting gratings were each presented to an anesthetized monkey for 50 trials (array 5, 148 simultaneously recorded single units). We use the 63 V1 neurons whose tuning curves could be well approximated by bimodal circular gaussian functions (the sum of two von Mises functions with different preferred orientations, amplitudes, and bandwidths), following Graf et al. (2011). We do not include the stimulus drive in the model, in the hope that the inferred latent processes will encode the stimulus. We use a bin size of 1 ms.

We use four-fold cross-validation to determine the number of latents. First, a 15-dimensional model is fitted to a subsample composed of the first trial of each direction. In each fold, we use the estimated parameters to infer the latent process from another subsample composed of the second trial of each direction. The inference is made leaving a quarter of the neurons out, and we predict the spike trains of the left-out neurons given the leading orthogonalized latent dimensions, varying how many dimensions are kept. This procedure led us to choose five dimensions, where the predictive log likelihood reached its maximum.

Figure 5 shows the predictive performance based on two subsets. The first is four trials (0°, 90°, 180°, 270°) of the subset of 63 neurons with a five-dimensional latent process. The second is 10 trials (five trials of 0° and five of 90°) of all 148 neurons with a four-dimensional latent process.

To evaluate the predictive performance at a longer timescale, we also cross-validated GPFA and vLGP on the first subset with 20 ms time bins. GPFA models were fitted to both the raw spike count and its square root. The mean of the square-root spike count under vLGP was obtained by simulation using the predicted firing rate. We report the normalized MSEs (MSE divided by the variance of the observation) of both methods for the spike count and its square root: GPFA, 0.713 (spike count) and 0.699 (square root); vLGP, 0.709 (spike count) and 0.708 (square root). We use the *F*-test to compare the MSEs and to see whether either method results in a significantly larger error. The corresponding *p*-values are 0.588 (spike count) and 0.140 (square root), showing that the MSEs are not significantly different between GPFA and vLGP at a 20 ms time bin.
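One way to carry out such an F-test on two sets of prediction errors, treating per-bin errors as independent samples (a simplifying assumption; the exact construction used in the analysis above is not specified here):

```python
import numpy as np
from scipy.stats import f as f_dist

def f_test_mse(err1, err2):
    """Two-sided F-test for equality of two mean squared errors.
    err1, err2: arrays of per-bin prediction errors."""
    F = np.mean(err1 ** 2) / np.mean(err2 ** 2)   # ratio of MSEs
    d1, d2 = err1.size - 1, err2.size - 1         # degrees of freedom
    # two-sided p-value from the F distribution
    p = 2.0 * min(f_dist.cdf(F, d1, d2), f_dist.sf(F, d1, d2))
    return F, min(p, 1.0)
```

A large p-value, as in the comparison above, means the ratio of the two MSEs is consistent with sampling variability alone.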

Although the parameters are estimated from a subsample, we can use them to infer the latent process of all trials of all 72 directions. Figure 6 shows inferred latent processes for two trials for two directions. We rotate the inferred latent process by the singular value decomposition (SVD; details are given later). Variational posterior distributions over the latents are shown for each trial. During the second half of the trial when the stimulus was off and the firing rate was lower, the uncertainty in the latent processes increased. There is visible trial-to-trial variability in the spike trains, which is reflected in the variations of latents.

First, we investigate how the “signal,” defined as the component driven by the visual stimuli, is captured by the latent processes. We average the inferred latent processes over the 50 trials with identical spatiotemporal stimuli (see Figure 7). Since the stimuli are time-locked, the corresponding average latent trajectory should reveal the time-locked population fluctuations driven by the visual input. We concatenate the average latent processes along the time dimension and then orthogonalize them by SVD. The dimensions of the orthogonalized process are ordered by their respective singular values. The latent process of a single trial is also rotated into the same subspace.
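The SVD-based orthogonalization can be sketched as follows (array shapes are a hypothetical layout):

```python
import numpy as np

def orthogonalize(mu_avg, mu_trial):
    """Rotate latents to an orthogonal basis ordered by singular value:
    SVD of the trial-averaged latents concatenated over time, then
    project single-trial latents into the same subspace.
    mu_avg: (T_total, Q) concatenated averages; mu_trial: (T, Q)."""
    _, _, Vt = np.linalg.svd(mu_avg, full_matrices=False)
    return mu_avg @ Vt.T, mu_trial @ Vt.T   # columns ordered by singular value
```

Rotating single trials by the same basis makes trial-to-trial deviations directly comparable to the average trajectory in each orthogonalized dimension.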

Furthermore, we visualized the trajectories in 3D (see the supplementary online video)^{3} that show how signal and noise are dynamically encoded in the state space. Figure 8 shows the projection of the average latent process corresponding to each orientation to the first three principal components. The projection topologically preserves the orientation tuning in the V1 population. There are two continuous circular variables in the stimuli space to be encoded: orientation and temporal phase of oscillation. The simplest topological structure of neural encoding is a torus, and we observe a toroidal topology (highlighted as rings of cycle averages).

To see if our method captures the noise correlation structure through the latents, as one would predict from recent studies of cortical population activity (Goris et al., 2014; Ecker et al., 2014; Lin et al., 2015), we calculated pairwise correlations between all neurons. We simulated spike trains from the model-predicted firing rates for all trials. The model-predicted firing rates are the fitted values for each time bin, computed according to the generative model using the estimated parameters and latents. For each bin, we simulated the spike count using a Poisson random number generator with the corresponding firing rate. To remove the signal, we subtracted the mean over the 50 trials of each stimulus direction. Figure 9 shows the correlation matrices during the stimulus period (150–1150 ms) and the off-stimulus period (1400–2400 ms). The neurons are sorted by their total correlations during the stimulus period. The power of model-explained noise correlation is defined as 1 − ‖C_data − C_model‖_F / ‖C_data‖_F, where C denotes the zero-diagonal noise correlation matrix of the corresponding source and ‖·‖_F is the Frobenius norm. The proposed model explains more of the noise correlation than GPFA and PLDS for both periods.
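A plausible implementation of this power measure (the exact normalization here is an assumption reconstructed from the text):

```python
import numpy as np

def noise_corr_power(y, y_model):
    """Fraction of noise correlation explained by the model,
    1 - ||C_data - C_model||_F / ||C_data||_F, where C is the
    zero-diagonal noise correlation matrix. Inputs have shape
    (n_trials, T, N) for one stimulus condition."""
    def noise_corr(z):
        resid = z - z.mean(axis=0, keepdims=True)        # subtract signal (PSTH)
        C = np.corrcoef(resid.reshape(-1, z.shape[-1]).T)  # (N, N) correlations
        np.fill_diagonal(C, 0.0)                         # ignore self-correlation
        return C
    Cd, Cm = noise_corr(y), noise_corr(y_model)
    return 1.0 - np.linalg.norm(Cd - Cm) / np.linalg.norm(Cd)
```

The measure equals 1 when the model reproduces the data's noise correlation matrix exactly and decreases as the two matrices diverge.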

These results show that vLGP is capable of capturing both the signal—repeated over multiple trials—and noise—population fluctuation not time locked to other task variables present in the cortical spike trains.

## 5 Discussion

We propose vLGP, a method that recovers low-dimensional latent dynamics from high-dimensional time series. Latent state-space inference methods are different from methods that only recover the average neural response time-locked to an external observation (Churchland et al., 2012; Brendel, Romo, & Machens, 2011). By inferring latent trajectories on each trial, they provide a flexible framework for studying the internal neural processes that are not time-locked. Higher-order processes such as decision making, attention, and memory recall are well suited for latent trajectory analysis due to their intrinsic low dimensionality of computation. Our method performs dimensionality reduction on a single trial basis and allows decomposition of neural signals into a small number of smooth temporal signals and their relative contribution to the population signal.

We compare our method to two widely used latent state-space modeling tools in neuroscience: GPFA (Yu et al., 2009) and PLDS (Macke et al., 2011). Unlike GPFA, vLGP allows a generalized linear model observation, which is more suitable for a broad range of spike train observations. Moreover, vLGP is significantly faster than PLDS, and it captures the spatiotemporal structure in the neural data better than both PLDS and GPFA at a fine timescale (1 ms bins).

To test its validity in real electrophysiological recordings, we used a V1 population recording driven by a fixed stimulus as a litmus test. We showed that our inferred latents contain meaningful information about the external stimuli, encoding both orientation and temporal modulation on a continuous manifold.

We considered only smoothness encoded in the GP prior in this letter, but a plethora of GP kernels are available (Rasmussen & Williams, 2005; Schölkopf & Smola, 2002). For example, to capture prior assumptions about periodicity in some of the latent processes, we can use spectral kernels (Ulrich, Carlson, Dzirasa, & Carin, 2015). This can be particularly useful for capturing internal neural oscillations (Fries, Reynolds, Rorie, & Desimone, 2001). In addition, it is straightforward to incorporate additional covariates such as external stimuli (Park et al., 2014) or local field potential (Kelly, Smith, Kass, & Lee, 2010) to vLGP.

We have not found systematic issues with vLGP, but potential weaknesses could stem from the variational approximation, inappropriate assumptions on the latent processes, and particular numerical shortcuts used in the implementation. It could be challenging for our method to recover very fast-changing latent trajectories, especially when the overall firing rate is very low. Subsampling used while optimizing the hyperparameters may miss a rare but key spike when the spike trains are sparse.

The proposed method has potential application in many areas, and it will be particularly useful in discovering how specific neural computations are implemented as neural dynamics. We are working on applying this method and its extensions to the sensorimotor decision-making process where the normative model guides what is being computed, but it is unclear as to how the neural system implements it.

An open-source python implementation of vLGP is available online (https://github.com/catniplab/vLGP) under MIT license.

## Acknowledgments

We thank the reviewers for their constructive feedback. We are grateful to Arnulf Graf, Adam Kohn, Tony Movshon, and Mehrdad Jazayeri for providing the V1 data set. We also thank Evan Archer, Jakob Macke, Yuanjun Gao, Chethan Pandarinath, and David Sussillo for helpful feedback. This work was partially supported by the Thomas Hartman Foundation for Parkinson’s Research.

## References

## Notes

^{1}It is straightforward to add external covariates similar to the self-history in this point-process regression (see Park, Meister, Huk, & Pillow, 2014).

^{2}We initially tried a square link function for GPFA. However, it often produces a detrimental PLL due to large rate predictions arising from large negative raw predictions. Also note that a straightforward linear rectifier can result in an undefined PLL by predicting a zero rate in a bin with a nonzero observation.