Abstract

When governed by underlying low-dimensional dynamics, the interdependence of simultaneously recorded populations of neurons can be explained by a small number of shared factors, or a low-dimensional trajectory. Recovering these latent trajectories, particularly from single-trial population recordings, may help us understand the dynamics that drive neural computation. However, due to the biophysical constraints and noise in the spike trains, inferring trajectories from data is a challenging statistical problem in general. Here, we propose a practical and efficient inference method, the variational latent gaussian process (vLGP). The vLGP combines a generative model with a history-dependent point process observation, together with a smoothness prior on the latent trajectories. The vLGP improves on earlier methods for recovering latent trajectories, which assume either observation models inappropriate for point processes or linear dynamics. We compare and validate vLGP on both simulated data sets and population recordings from the primary visual cortex. In the V1 data set, we find that vLGP achieves substantially higher performance than previous methods for predicting omitted spike trains, as well as capturing both the toroidal topology of the visual stimulus space and the noise correlations. These results show that vLGP is a robust method with the potential to reveal hidden neural dynamics from large-scale neural recordings.

1  Introduction

Neural populations implement dynamics that produce robust behavior; however, our current experimental observations of these dynamics are invariably indirect and partial. In classical analyses of neural spike trains, noisy responses are averaged over repeated trials that are presumably time-locked to a stereotypical computation process. However, neural dynamics are not necessarily time-locked or precisely repeated from trial to trial; rather, many cognitive processes generate observable variations in the internal processes that sometimes manifest in behavior such as error trials, broad reaction time distributions, and change of mind (Latimer, Yates, Meister, Huk, & Pillow, 2015; Jazayeri & Shadlen, 2010; Resulaj, Kiani, Wolpert, & Shadlen, 2009). In addition, it is difficult to disambiguate different possible neural implementations of computation from the average trajectory since they may differ only in their trial-to-trial variability (Latimer et al., 2015; Churchland et al., 2011). Therefore, if we wish to understand how neural computation is implemented in neural populations, it is imperative that we recover these hidden dynamics from individual trials (Paninski et al., 2010; Kao et al., 2015).

Advances in techniques for recording from larger subpopulations facilitate single-trial analysis, especially the inference of single-trial latent dynamical trajectories. Several statistical approaches have been developed for extracting latent trajectories that describe the activity of observed populations (Yu et al., 2009; Koyama, Pérez-Bolde, Shalizi, & Kass, 2010; Macke et al., 2011; Pfau, Pnevmatikakis, & Paninski, 2013; Archer, Koster, Pillow, & Macke, 2014; Frigola, Chen, & Rasmussen, 2014). For example, latent trajectories recovered from motor cortex suggest that these methods can provide insight into the coding and preparation of planned reaching behavior (Sadtler et al., 2014; Churchland et al., 2012; Churchland, Cunningham, Kaufman, Ryu, & Shenoy, 2010). Latent trajectories also elucidate the low-dimensional noise structure of neural codes and computations (Moreno-Bote et al., 2014; Haefner, Gerwinn, Macke, & Bethge, 2013; Sadtler et al., 2014; Ecker et al., 2014).

Inference of latent dynamical trajectories is a dimensionality-reduction method for multivariate time series, akin to Kalman smoothing or factor analysis (Kao et al., 2015). Given a high-dimensional observation sequence, we aim to infer a shared, low-dimensional latent process that explains much of the variation in high-dimensional observations. A large class of methods assumes an autoregressive linear dynamics model in the latent process due to its computational tractability (Paninski et al., 2010; Macke et al., 2011; Buesing, Macke, & Sahani, 2012; Kao et al., 2015); we refer to these as Poisson linear dynamical systems (PLDS). Although the assumption of linear dynamics can help in smoothing, it can also be overly simplistic: interesting neural computations are naturally implemented as nonlinear dynamics, and evidence points to nonlinear dynamics in the brain in general. Therefore, we propose to relax this modeling assumption and impose a general gaussian process prior to nonparametrically infer the latent dynamics, similar to the gaussian process factor analysis (GPFA) method (Yu et al., 2009; Lakshmanan et al., 2015). However, we differ from GPFA in that we use a point-process observation model with self-history dependence rather than an instantaneous gaussian observation model. A gaussian observation model is inappropriate for inference at the millisecond timescale since it cannot generate spike counts. The price we pay is a nonconjugate prior and, consequently, an approximate posterior inference (Paninski et al., 2010). We use a variational approximation (Blei, Kucukelbir, & McAuliffe, in press), where we assume a gaussian process posterior over the latents and optimize a lower bound of the marginal likelihood for the inference. Our algorithm, which we call a variational latent gaussian process (vLGP), is fast and achieves better predictive performance than both GPFA and PLDS at a fine timescale (1 ms bins). We compare these algorithms on simulated systems with known latent processes. We apply it to high-dimensional V1 data from an anesthetized monkey to recover both the noise correlation structure and the topological structure of the population encoding of drifting orientation grating stimuli.

2  Generative Model

Suppose we simultaneously observe spike trains from $N$ neurons. Let $y_{n,t}$ denote the spike count time series of the $n$th neuron in small time bins indexed by $t$. We model noisy neural spike trains mathematically as a simple point process that is fully described by its conditional intensity function (Daley & Vere-Jones, 1988). We assume the following parametric form of the conditional intensity function for the point-process log likelihood (Daley & Vere-Jones, 1988; Macke et al., 2011):
$$\lambda_{n,t} = \exp\!\big(\boldsymbol{\alpha}_n^\top \mathbf{x}_t + \boldsymbol{\beta}_n^\top \mathbf{h}_{n,t}\big), \tag{2.1}$$
where $\mathbf{x}_t$ is a latent process and $\mathbf{h}_{n,t}$ denotes the spike history vector (Truccolo, Eden, Fellows, Donoghue, & Brown, 2005; Pillow et al., 2008). Each neuron is directly influenced by its observed self-history with weight $\boldsymbol{\beta}_n$ and is also driven by the common latent process with weight $\boldsymbol{\alpha}_n$ (see Figure 1).1 Neurons are conditionally independent otherwise: all trial-to-trial variability is attributed to either the latent process or individual point-process noise (Goris, Movshon, & Simoncelli, 2014; Ecker et al., 2014; Lin, Okun, Carandini, & Harris, 2015).
Figure 1:

Generative model schematic representing equation 2.1 for one neuron driven by two latent processes. Every neuron in the observed population is driven by the same set of latent processes. The inferred latent processes are more likely to be smooth, as assumed by the smooth gaussian process prior. Given $\mathbf{x}_t$ and $\mathbf{h}_{n,t}$, the spike train is generated as a generalized linear model (GLM) (Truccolo et al., 2005; Pillow et al., 2008). The point nonlinearity is fixed to be the exponential function.

The vector $\mathbf{x}_t$ denotes the $L$-dimensional latent process at time $t$. We assume that $L \ll N$, since we are looking for a small number of latent processes that explain the structure of a large number of observed neurons. The vector $\boldsymbol{\beta}_n$ consists of the weights of the spike history and a time-independent bias term of the log firing rate for each neuron, and $\mathbf{h}_{n,t}$ is a vector of length $p+1$ containing the dummy value 1 for the bias and the previous $p$ time steps of spike self-history. This parameterization assumes that at most $p$ bins in the past influence the current intensity.

Under conditional independence, the joint distribution (data likelihood) of spike trains is given by
$$p\big(\{y_{n,t}\} \mid \{\mathbf{x}_t\}, \{\boldsymbol{\alpha}_n, \boldsymbol{\beta}_n\}\big) = \prod_{n=1}^{N}\prod_{t=1}^{T} \frac{\lambda_{n,t}^{\,y_{n,t}}\, e^{-\lambda_{n,t}}}{y_{n,t}!}. \tag{2.2}$$

Note that this model is not identifiable, since $\boldsymbol{\alpha}_n^\top \mathbf{x}_t = (\mathbf{R}^{-\top}\boldsymbol{\alpha}_n)^\top(\mathbf{R}\,\mathbf{x}_t)$ for an arbitrary invertible matrix $\mathbf{R}$ (see later sections for further discussion). Also, the mean of the latent process can be traded off with the bias term in $\boldsymbol{\beta}_n$.
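As a concrete illustration of the generative model, the following Python sketch draws spike counts from equation 2.1 for a toy population. It is not the simulation code used in this letter; all sizes, weight values, and the stand-in latent trajectory are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, L, p = 50, 1000, 2, 10         # neurons, 1 ms bins, latent dimensions, history lags

x = np.cumsum(0.1 * rng.standard_normal((T, L)), axis=0)  # slow latent (stand-in for a GP draw)
alpha = 0.3 * rng.standard_normal((L, N))                 # loading weights per neuron
bias = np.full(N, -3.5)                                   # log base rate (roughly 30 spikes/s)
hist_w = -0.5 * np.ones((p, N))                           # suppressive self-history weights

y = np.zeros((T, N))
for t in range(T):
    # self-history h_{n,t}: the previous p spike counts, most recent first, zero-padded
    h = np.zeros((p, N))
    k = min(p, t)
    if k:
        h[:k] = y[t - k:t][::-1]
    log_rate = x[t] @ alpha + bias + np.sum(hist_w * h, axis=0)  # exponent of equation 2.1
    y[t] = rng.poisson(np.exp(log_rate))                         # conditionally Poisson counts
```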

Our assumptions about the latent process—namely, the smoothness over time in this letter—are encoded in the prior distribution over the latent process. We use the gaussian process (GP) framework (Rasmussen & Williams, 2005) for flexible prior design of each dimension independently:
$$x_l(\cdot) \sim \mathcal{GP}\big(m_l(\cdot),\, k_l(\cdot, \cdot)\big), \quad l = 1, \dots, L, \tag{2.3}$$
where $m_l(\cdot)$ and $k_l(\cdot, \cdot)$ are the mean and covariance functions, respectively. When time is discretized, the GP prior reduces to a multivariate gaussian distribution over the latent time series. We use the following form:
$$\mathbf{x}_l \sim \mathcal{N}\big(\mathbf{0}, \mathbf{K}_l\big), \qquad [\mathbf{K}_l]_{t,t'} = k_l(t, t'). \tag{2.4}$$
For the analyses in this letter, we choose the squared exponential covariance function (Rasmussen & Williams, 2005) for general smoothness over time:
$$k_l(t, t') = \sigma_l^2 \exp\!\big(-\omega_l\,(t - t')^2\big), \tag{2.5}$$
where $\sigma_l^2$ and $\omega_l$ are hyperparameters corresponding to the magnitude and inverse timescale of the latent process, respectively.
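The following sketch builds the squared exponential covariance of equation 2.5 on a discrete time grid and draws one smooth latent trajectory from the resulting gaussian prior; the variance, inverse timescale, and jitter values are illustrative.

```python
import numpy as np

def se_kernel(T, variance=1.0, inv_timescale=1e-4):
    """Squared exponential covariance (equation 2.5) evaluated on T time bins."""
    t = np.arange(T)[:, None]
    return variance * np.exp(-inv_timescale * (t - t.T) ** 2)

T = 1000
K = se_kernel(T) + 1e-6 * np.eye(T)   # small diagonal jitter for numerical stability
x_l = np.linalg.cholesky(K) @ np.random.default_rng(1).standard_normal(T)  # one smooth latent draw
```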

3  Variational Inference

Our goal is to infer the posterior distribution over the latent process and fit the model parameters given the observed data. By Bayes’ theorem, the posterior distribution of the latent process is
$$p(\mathbf{x} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{x})\, p(\mathbf{x})}{p(\mathbf{y})}. \tag{3.1}$$
However, unlike in GPFA, the posterior under a point-process likelihood and gaussian process prior does not have an analytical form (Paninski et al., 2010). Consequently, we must turn to an approximate inference technique. We employ variational inference, which aims to find a tractable approximate distribution $q(\mathbf{x})$ of the intractable true posterior $p(\mathbf{x} \mid \mathbf{y})$. We can introduce this approximate posterior into the marginal likelihood by rewriting it as
$$\log p(\mathbf{y}) = \mathcal{L}(q) + D_{\mathrm{KL}}\big(q(\mathbf{x})\,\|\,p(\mathbf{x}\mid\mathbf{y})\big), \tag{3.2}$$
$$\mathcal{L}(q) = \mathbb{E}_q\big[\log p(\mathbf{y}, \mathbf{x})\big] - \mathbb{E}_q\big[\log q(\mathbf{x})\big], \tag{3.3}$$
where $\mathbb{E}_q$ denotes an expectation over $q(\mathbf{x})$, and $D_{\mathrm{KL}}$ is the Kullback-Leibler divergence, which measures the difference between the true posterior and its variational approximation. Since $D_{\mathrm{KL}}$ is nonnegative, $\mathcal{L}(q)$ is a lower bound for the marginal likelihood. Finding an approximate posterior close to the true posterior by minimizing the Kullback-Leibler divergence is equivalent to maximizing the lower bound $\mathcal{L}(q)$, also known as the evidence lower bound (ELBO).
We further assume that the distribution factorizes into gaussian distributions with respect to each dimension of the latent process, such that
$$q(\mathbf{x}) = \prod_{l=1}^{L} \mathcal{N}\big(\mathbf{x}_l \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l\big). \tag{3.4}$$
We then obtain the lower bound:
$$\mathcal{L} = \sum_{n=1}^{N}\sum_{t=1}^{T}\Big[y_{n,t}\big(\boldsymbol{\alpha}_n^\top\boldsymbol{\mu}_t + \boldsymbol{\beta}_n^\top\mathbf{h}_{n,t}\big) - \exp\!\big(\boldsymbol{\alpha}_n^\top\boldsymbol{\mu}_t + \boldsymbol{\beta}_n^\top\mathbf{h}_{n,t} + \tfrac{1}{2}\boldsymbol{\alpha}_n^\top\mathbf{V}_t\boldsymbol{\alpha}_n\big)\Big] - \frac{1}{2}\sum_{l=1}^{L}\Big[\boldsymbol{\mu}_l^\top\mathbf{K}_l^{-1}\boldsymbol{\mu}_l + \operatorname{tr}\big(\mathbf{K}_l^{-1}\boldsymbol{\Sigma}_l\big) + \log\det\mathbf{K}_l - \log\det\boldsymbol{\Sigma}_l\Big] + \text{const}, \tag{3.5}$$
where $T$ is the total number of time steps, each temporal slice $\boldsymbol{\mu}_t$ is a vector of the posterior means of the latent variables at time $t$, and each temporal slice $\mathbf{V}_t$ is a diagonal matrix whose diagonal contains the variances of the latent variables at time $t$.

Variational inference for the entire posterior over latents, parameters, and hyperparameters can all be formulated in terms of maximizing equation 3.5. We sequentially update all parameters coordinate-wise; each conditional update turns out to be a convex-optimization problem except for the hyperparameters as explained below. We derive the inference algorithm (vLGP) in the following sections, and it is summarized in algorithm 1.

[Algorithm 1: the vLGP inference algorithm.]

Our algorithm scales linearly in space and time per iteration (for a fixed hyperparameter) in the number of time steps $T$, thanks to the rank-$r$ incomplete Cholesky factorization of the prior covariance matrix, where $r \ll T$. For comparison, GPFA operates on full $T \times T$ covariance matrices and thus scales cubically with $T$, whereas PLDS smoothing also scales linearly with $T$.

3.1  Posterior over the Latent Process

The variational distribution $q(\mathbf{x}_l)$ is assumed to be gaussian and thus determined only by its mean $\boldsymbol{\mu}_l$ and covariance $\boldsymbol{\Sigma}_l$. The optimal solution is therefore obtained by
$$\{\boldsymbol{\mu}_l^\ast, \boldsymbol{\Sigma}_l^\ast\}_{l=1}^{L} = \operatorname*{arg\,max}_{\{\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l\}}\; \mathcal{L}, \tag{3.6}$$
while holding other parameters and hyperparameters fixed.
Denote the expected firing rate of neuron $n$ at time $t$ under $q$ by $\bar{\lambda}_{n,t}$:
$$\bar{\lambda}_{n,t} = \mathbb{E}_q\big[\lambda_{n,t}\big] = \exp\!\Big(\boldsymbol{\alpha}_n^\top\boldsymbol{\mu}_t + \boldsymbol{\beta}_n^\top\mathbf{h}_{n,t} + \tfrac{1}{2}\boldsymbol{\alpha}_n^\top\mathbf{V}_t\boldsymbol{\alpha}_n\Big). \tag{3.7}$$
The optimal $\boldsymbol{\mu}_l$ can be obtained by the Newton-Raphson method. The gradient and Hessian are given as
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}_l} = \sum_{n,t}\big(y_{n,t} - \bar{\lambda}_{n,t}\big)\,\alpha_{nl}\,\mathbf{e}_t - \mathbf{K}_l^{-1}\boldsymbol{\mu}_l, \tag{3.8}$$
$$\frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\mu}_l\,\partial \boldsymbol{\mu}_l^\top} = -\sum_{n,t}\bar{\lambda}_{n,t}\,\alpha_{nl}^2\,\mathbf{e}_t\mathbf{e}_t^\top - \mathbf{K}_l^{-1}, \tag{3.9}$$
where $\mathbf{e}_t$ is a vector of length $T$ with value 1 at the $t$th entry and zero elsewhere. Note that the Hessian is negative definite, and hence this is a convex optimization given the other arguments $\{\boldsymbol{\alpha}_n, \boldsymbol{\beta}_n\}$ and $\{\boldsymbol{\Sigma}_l\}$. In each iteration, the update is
$$\boldsymbol{\mu}_l \leftarrow \boldsymbol{\mu}_l - \left(\frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\mu}_l\,\partial \boldsymbol{\mu}_l^\top}\right)^{-1}\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}_l}. \tag{3.10}$$
If we set the derivative with regard to $\boldsymbol{\Sigma}_l$ to 0,
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\Sigma}_l} = \tfrac{1}{2}\boldsymbol{\Sigma}_l^{-1} - \tfrac{1}{2}\mathbf{K}_l^{-1} - \tfrac{1}{2}\mathbf{W}_l = 0, \tag{3.11}$$
we obtain the optimal covariance,
$$\boldsymbol{\Sigma}_l = \big(\mathbf{K}_l^{-1} + \mathbf{W}_l\big)^{-1}, \tag{3.12}$$
$$\mathbf{W}_l = \operatorname{diag}\!\Big(\sum_{n} \alpha_{nl}^2\,\bar{\lambda}_{n,1},\; \dots,\; \sum_{n} \alpha_{nl}^2\,\bar{\lambda}_{n,T}\Big), \tag{3.13}$$
where $\mathbf{W}_l$ is a diagonal matrix. Therefore, there is no need for a separate optimization of the covariance. This simple form of variational posterior covariance has been noted before (Opper & Archambeau, 2008). Also note that $\boldsymbol{\Sigma}_l^{-1}$ coincides with the negative Hessian in equation 3.9.
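The following sketch performs one coordinate update for a single latent dimension, assuming the expected rate, gradient, Hessian, and covariance expressions reconstructed in equations 3.7 to 3.13; it works with dense covariance matrices and ignores the low-rank machinery described below, so it is only suitable for short trials.

```python
import numpy as np

def update_latent(y, hb, alpha, mu, V, K, l):
    """One Newton step for the posterior mean of latent l plus the closed-form covariance.
    y: (T, N) spike counts; hb: (T, N) bias-plus-history drive beta_n' h_{n,t};
    alpha: (L, N) loadings; mu: (T, L) posterior means; V: (T, L) posterior variances;
    K: (T, T) prior covariance of latent l."""
    lam = np.exp(mu @ alpha + hb + 0.5 * (V @ alpha ** 2))   # expected rate (equation 3.7)
    Kinv = np.linalg.inv(K)
    grad = (y - lam) @ alpha[l] - Kinv @ mu[:, l]            # gradient (equation 3.8)
    hess = -np.diag(lam @ alpha[l] ** 2) - Kinv              # Hessian (equation 3.9)
    mu = mu.copy()
    mu[:, l] -= np.linalg.solve(hess, grad)                  # Newton step (equation 3.10)
    W = np.diag(lam @ alpha[l] ** 2)                         # diagonal W (equation 3.13)
    Sigma_l = np.linalg.inv(Kinv + W)                        # covariance (equation 3.12)
    return mu, Sigma_l
```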

There is a redundancy between the bias term in $\boldsymbol{\beta}_n$ and the posterior mean $\boldsymbol{\mu}_l$. During optimization, we constrain each latent mean by zero-centering it and normalize the loading $\boldsymbol{\alpha}$ by its max-norm latent-wise.

The prior covariance matrix $\mathbf{K}_l$ is large ($T \times T$) and is often severely ill-conditioned. We keep only a truncated incomplete Cholesky factor $\mathbf{G}_l$ (Bach & Jordan, 2002) of size $T \times r$, where $r$ is the rank of the resulting approximation,
$$\mathbf{K}_l \approx \mathbf{G}_l \mathbf{G}_l^\top, \tag{3.14}$$
which provides both a compact representation and numerical stability. Now, we derive key quantities that are necessary for a memory-efficient and numerically stable implementation. For convenience and without ambiguity, we omit the subscript of all vectors and matrices below. By the matrix inversion lemma (Rasmussen & Williams, 2005), we have
formula
3.15
and applying the lemma again,
formula
3.16
where . We obtain two useful identities as a result:
formula
3.17
formula
3.18
With equations 3.17 and 3.18, we can avoid large matrices in the above equations such as,
formula
3.19
formula
3.20
formula
3.21
formula
3.22
where is the all-ones vector and . In addition, by the one-to-one correspondence between and , we use the diagonal of as a representation of in the algorithm.
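A simple pivoted (incomplete) Cholesky routine in the spirit of equation 3.14 and Bach and Jordan (2002) is sketched below. It is not the implementation used here; the rank and tolerance are illustrative.

```python
import numpy as np

def incomplete_cholesky(K, rank, tol=1e-10):
    """Pivoted incomplete Cholesky: returns G of shape (T, r) with K approximately G @ G.T."""
    T = K.shape[0]
    G = np.zeros((T, rank))
    d = np.diag(K).astype(float).copy()          # residual diagonal of K - G G'
    for j in range(rank):
        i = int(np.argmax(d))                    # pivot on the largest residual
        if d[i] <= tol:
            return G[:, :j]
        G[:, j] = (K[:, i] - G[:, :j] @ G[i, :j]) / np.sqrt(d[i])
        d -= G[:, j] ** 2
    return G

t = np.arange(1000)[:, None]
K = np.exp(-1e-4 * (t - t.T) ** 2)               # squared exponential prior covariance
G = incomplete_cholesky(K, rank=50)
print(np.linalg.norm(K - G @ G.T) / np.linalg.norm(K))   # small relative approximation error
```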

3.2  Weights

Denote the temporal slices of the $\mathbf{h}_{n,t}$'s by the matrix $\mathbf{H}_n$. The optimal weights $\boldsymbol{\alpha}_n$ and $\boldsymbol{\beta}_n$ given the posterior over the latents can be obtained by the Newton-Raphson method with the following derivatives and Hessians,
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\alpha}_n} = \sum_{t}\Big[y_{n,t}\,\boldsymbol{\mu}_t - \bar{\lambda}_{n,t}\big(\boldsymbol{\mu}_t + \mathbf{V}_t\boldsymbol{\alpha}_n\big)\Big], \tag{3.23}$$
$$\frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\alpha}_n\,\partial \boldsymbol{\alpha}_n^\top} = -\sum_{t}\bar{\lambda}_{n,t}\Big[\big(\boldsymbol{\mu}_t + \mathbf{V}_t\boldsymbol{\alpha}_n\big)\big(\boldsymbol{\mu}_t + \mathbf{V}_t\boldsymbol{\alpha}_n\big)^\top + \mathbf{V}_t\Big], \tag{3.24}$$
and
$$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}_n} = \sum_{t}\big(y_{n,t} - \bar{\lambda}_{n,t}\big)\,\mathbf{h}_{n,t}, \tag{3.25}$$
$$\frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\beta}_n\,\partial \boldsymbol{\beta}_n^\top} = -\sum_{t}\bar{\lambda}_{n,t}\,\mathbf{h}_{n,t}\mathbf{h}_{n,t}^\top. \tag{3.26}$$
The updating rules are
$$\boldsymbol{\alpha}_n \leftarrow \boldsymbol{\alpha}_n - \left(\frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\alpha}_n\,\partial \boldsymbol{\alpha}_n^\top}\right)^{-1}\frac{\partial \mathcal{L}}{\partial \boldsymbol{\alpha}_n}, \tag{3.27}$$
$$\boldsymbol{\beta}_n \leftarrow \boldsymbol{\beta}_n - \left(\frac{\partial^2 \mathcal{L}}{\partial \boldsymbol{\beta}_n\,\partial \boldsymbol{\beta}_n^\top}\right)^{-1}\frac{\partial \mathcal{L}}{\partial \boldsymbol{\beta}_n}. \tag{3.28}$$
Once again, both Hessians are negative definite and, hence, in the territory of convex optimization.
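The sketch below performs one Newton-Raphson update of a single neuron's weights while holding the posterior fixed, assuming the gradient and Hessian forms written in equations 3.23 to 3.28; the small ridge term is our own addition for numerical safety and is not part of the model.

```python
import numpy as np

def update_weights(y_n, H_n, mu, V, a_n, b_n, ridge=1e-6):
    """One Newton step for (alpha_n, beta_n). y_n: (T,) counts; H_n: (T, p+1) bias-and-history
    regressors; mu, V: (T, L) posterior means and variances of the latents."""
    lam = np.exp(mu @ a_n + H_n @ b_n + 0.5 * (V @ a_n ** 2))     # expected rate (equation 3.7)
    m_eff = mu + V * a_n                                          # mu_t + V_t alpha_n per bin
    g_a = y_n @ mu - lam @ m_eff                                  # gradient w.r.t. alpha_n
    H_a = -(lam[:, None] * m_eff).T @ m_eff - np.diag(lam @ V) - ridge * np.eye(len(a_n))
    a_n = a_n - np.linalg.solve(H_a, g_a)                         # Newton step for alpha_n
    lam = np.exp(mu @ a_n + H_n @ b_n + 0.5 * (V @ a_n ** 2))     # refresh with the new alpha_n
    g_b = (y_n - lam) @ H_n                                       # gradient w.r.t. beta_n
    H_b = -(lam[:, None] * H_n).T @ H_n - ridge * np.eye(H_n.shape[1])
    b_n = b_n - np.linalg.solve(H_b, g_b)                         # Newton step for beta_n
    return a_n, b_n
```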

3.3  Hyperparameters

One way to choose hyperparameters is to maximize the marginal likelihood with regard to the hyperparameters. Since the marginal likelihood is intractable in the vLGP model, we instead maximize equation 3.5 once again, given the parameters and posterior. Interestingly, this objective function takes the same form as the one that is maximized in GPFA's hyperparameter updates.

We write the squared-exponential covariance kernel as
$$\mathbf{K}_l = \sigma_l^2 \exp\!\big(-\omega_l \mathbf{D}\big), \tag{3.29}$$
where $\mathbf{D}$ is the matrix of squared distances between each pair of time points and the exponential is applied elementwise. The hyperparameters $\sigma_l^2$ and $\omega_l$ correspond to the prior variance and inverse (squared) timescale. We optimize the log-transformed hyperparameters $\theta_{l1} = \log \sigma_l^2$ and $\theta_{l2} = \log \omega_l$ since both are positive. With respect to the $j$th transformed hyperparameter of the $l$th latent dimension, $\theta_{lj}$, the derivative is given as
$$\frac{\partial \mathcal{L}}{\partial \theta_{lj}} = \frac{1}{2}\Big[\boldsymbol{\mu}_l^\top\mathbf{K}_l^{-1}\frac{\partial \mathbf{K}_l}{\partial \theta_{lj}}\mathbf{K}_l^{-1}\boldsymbol{\mu}_l + \operatorname{tr}\Big(\mathbf{K}_l^{-1}\frac{\partial \mathbf{K}_l}{\partial \theta_{lj}}\mathbf{K}_l^{-1}\boldsymbol{\Sigma}_l\Big) - \operatorname{tr}\Big(\mathbf{K}_l^{-1}\frac{\partial \mathbf{K}_l}{\partial \theta_{lj}}\Big)\Big], \tag{3.30}$$
$$\frac{\partial \mathbf{K}_l}{\partial \log \sigma_l^2} = \mathbf{K}_l, \qquad \frac{\partial \mathbf{K}_l}{\partial \log \omega_l} = -\omega_l\,\mathbf{D}\odot\mathbf{K}_l. \tag{3.31}$$
The optimal value can be found by common gradient algorithms for each latent dimension independently.

The above derivation of the hyperparameter optimization technique assumes a fixed posterior and parameters. Thus, it requires complete prior covariance matrices and explicit posterior covariance matrices rather than low-rank decompositions. In order to avoid numerical singularity, we add a small quantity to the diagonal of prior covariance matrices. It would be extremely costly to use these complete covariance matrices for long, consecutive time series. Therefore, we randomly take many shorter temporal subsamples of the posterior for fast computation (Yu et al., 2009). One hyperparameter iteration is performed every fixed number of iterations of posterior and parameter optimization.
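A sketch of this subsampled hyperparameter search is given below. It numerically minimizes the negative of the covariance-dependent terms of equation 3.5 with a derivative-free optimizer rather than the analytic gradients of equations 3.30 and 3.31, approximates the posterior covariance by its diagonal, and uses illustrative window sizes and initial values.

```python
import numpy as np
from scipy.optimize import minimize

def neg_gp_terms(log_theta, mu_win, var_win, jitter=1e-6):
    """0.5 * [mu' K^-1 mu + tr(K^-1 Sigma) + log det K] on one window, with Sigma
    approximated by the diagonal of posterior variances."""
    sigma2, omega = np.exp(log_theta)
    t = np.arange(len(mu_win))[:, None]
    K = sigma2 * np.exp(-omega * (t - t.T) ** 2) + jitter * np.eye(len(mu_win))
    C = np.linalg.cholesky(K)
    Kinv_mu = np.linalg.solve(C.T, np.linalg.solve(C, mu_win))
    Kinv_diag = np.diag(np.linalg.inv(K))
    logdet = 2.0 * np.sum(np.log(np.diag(C)))
    return 0.5 * (mu_win @ Kinv_mu + np.sum(Kinv_diag * var_win) + logdet)

def fit_hyperparameters(mu_l, var_l, n_windows=20, win=100, seed=0):
    """mu_l, var_l: (T,) posterior mean and variance of one latent dimension."""
    starts = np.random.default_rng(seed).integers(0, len(mu_l) - win, size=n_windows)
    objective = lambda lt: sum(neg_gp_terms(lt, mu_l[s:s + win], var_l[s:s + win]) for s in starts)
    res = minimize(objective, x0=np.log([1.0, 1e-4]), method="Nelder-Mead")
    return np.exp(res.x)   # learned (variance, inverse squared timescale)
```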

4  Results

We first verify that our inference algorithm recovers the true parameters and latent variables when there is no model mismatch and then apply it to two simulated systems and one real data set. We compare our method, vLGP, against GPFA and PLDS.

4.1  Convergence

First, we demonstrate that vLGP converges to the correct parameters and latent variables under the assumed generative model. We applied our method to simulated spike trains driven by a two-dimensional gaussian process. We fixed the number of time bins per trial, the GP variance, and the timescale. The values of the parameters were randomly drawn from a standard normal distribution. There are two limits that we need to consider for convergence: increasing the duration of observations (more trials) and increasing the number of observed neurons. The parameters and latent variables are initialized by factor analysis (FA). To probe the properties of the global optima, we also initialized the parameters and latent variables at values near the true ones (by adding zero-mean gaussian noise with 0.1 standard deviation).

We calculated the mean squared error (MSE) of the posterior mean and weights on a grid of different numbers of trials and neurons. Figure 2a shows the trend of convergence in MSE. The posterior mean of the latent distribution converges to the true latent as the number of neurons grows, and the parameters converge to the true weights as the number of time bins grows. The difference between FA and near-truth initialization shows the relative error contributed by FA initialization combined with the nonconvex vLGP inference, which is small for latent process estimation.

Figure 2:

vLGP learning under the assumed model. (a) Convergence of the latent variables and parameters of vLGP. Notice that we use the base firing rate $\exp(b)$ instead of the raw bias $b$ because a tiny deviation in the base firing rate results in a huge difference in the bias term through the exp/log transform. MSE was computed over a grid of numbers of neurons and trials. We plot the median MSE. (b) Learned timescales. Green diamonds denote the initial values, red diamonds denote the true values, and blue dots denote learned values. For every true value, we simulated 10 data sets. All initial values were fixed at the same value.

We also fit the vLGP model to simulated spike trains driven by a one-dimensional gaussian process latent at four different timescales while the rest of the setting was the same as the above simulation. For each timescale, we simulated 10 data sets. We initialized the timescale at a very smooth value () and set the initial values of the parameters and latent variables by factor analysis. Figure 2b shows the learned values scattered around the ground truth for a wide range of true timescales. We note that learning the timescale in general is challenging, especially in high-dimensional latent spaces (data not shown).

4.2  Evaluation

We use a leave-one-neuron-out prediction likelihood to compare models. For each data set comprising several trials, we choose one of the trials as a test trial and the others as training trials. First, the weights and posterior are inferred from the training trials. Next, we leave one neuron out of the test trial and infer the posterior using the remaining neurons with the weights estimated from the training trials. Then the spike train of the left-out neuron is predicted by the model given the weights estimated from the training trials and the posterior inferred from the test trial. We repeat this procedure for each neuron of the chosen test trial and let each trial of a data set serve as the test trial. Finally, we obtain predictions for all spike trains in the data set.

For simulated data sets, we know the true latent process that generates the observations. Since the latent space is identifiable only up to an affine transformation, we could quantify recovery using the angle between subspaces (Buesing et al., 2012; Pfau et al., 2013). However, due to a possible mismatch in the point nonlinearity, the subspace can be distorted. To account for this mismatch, we use the mean Spearman's rank correlation, which allows an invertible monotonic mapping in each direction. The Spearman's rank correlation between the posterior and the true latent trajectory gives a measure of the goodness of the posterior: if the correlation is large, the posterior recovers more information about the underlying trajectory.
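One way to compute this score is sketched below: the inferred latents are first aligned to the true ones with a least-squares affine map (our choice of alignment, not necessarily the exact procedure used here), and the absolute Spearman correlations are then averaged across dimensions.

```python
import numpy as np
from scipy.stats import spearmanr

def mean_rank_correlation(x_true, x_hat):
    """x_true, x_hat: (T, L) true and inferred latent trajectories."""
    A = np.column_stack([x_hat, np.ones(len(x_hat))])    # affine design matrix
    W, *_ = np.linalg.lstsq(A, x_true, rcond=None)       # least-squares alignment
    aligned = A @ W
    rhos = []
    for l in range(x_true.shape[1]):
        rho, _ = spearmanr(x_true[:, l], aligned[:, l])  # invariant to monotonic mappings
        rhos.append(abs(rho))
    return float(np.mean(rhos))
```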

4.3  Simulation

We simulate two data sets: one with deterministic nonlinear dynamics and one with linear dynamics and model-mismatched nonlinear observation. Each data set consists of five samples (simulated data sets), and each sample contains 10 trials from 50 neurons that last 1 sec. We choose a bin size of 1 ms.

In the first data set, the latent trajectories are sampled from the Lorenz dynamical system with the time step of 0.0015. This three-dimensional system is defined by the following set of equations:
$$\dot{x}_1 = \sigma\,(x_2 - x_1), \qquad \dot{x}_2 = x_1(\rho - x_3) - x_2, \qquad \dot{x}_3 = x_1 x_2 - \beta x_3. \tag{4.1}$$
Spike trains are simulated by equation 2.1 with a 10-step suppressive history filter (from most recent: ) given the latent trajectory.
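An Euler-integration sketch of the Lorenz latent is shown below; the time step follows the text, while the parameter values (10, 28, 8/3) and the initial condition are the classic choices and are assumptions, since the letter does not list them.

```python
import numpy as np

def lorenz_trajectory(T, dt=0.0015, s=10.0, r=28.0, b=8.0 / 3.0, x0=(1.0, 1.0, 1.0)):
    """Euler integration of the Lorenz system in equation 4.1."""
    x = np.empty((T, 3))
    x[0] = x0
    for t in range(1, T):
        x1, x2, x3 = x[t - 1]
        dx = np.array([s * (x2 - x1), x1 * (r - x3) - x2, x1 * x2 - b * x3])
        x[t] = x[t - 1] + dt * dx
    return (x - x.mean(0)) / x.std(0)   # standardize before use as a latent trajectory

z = lorenz_trajectory(1000)             # 1 s of 1 ms bins, as in the simulation study
```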
In the second data set, Poisson spike trains are simulated from a three-dimensional linear dynamical system (LDS) defined as
formula
4.2

Figure 3 shows one trial from each data set and the corresponding inferred posterior mean latent process. The posterior means are rotated toward the true latent subspace. The PLDS inference (blue) is relatively the farthest from the true Lorenz latent but much closer to the LDS latent, because the latter meets its linearity assumption; however, the PLDS-inferred latents lack smoothness. The GPFA inference (green) is better than PLDS for the Lorenz latent but shows deviations from the true LDS latent, while smoothness is preserved in its inference. The inferences of our method (red) are very close to the true latents in both cases over time while remaining smooth.

Figure 3:

Spike trains from 50 simultaneously observed neurons and corresponding three-dimensional latent dynamics. (Left) Simulated spike trains from each corresponding system. See equations 4.1 and 4.2 for the exact generative model. (Right) True and inferred three-dimensional latent processes. vLGP and GPFA infer smooth posterior, while noticeable high-frequency noise is present in the PLDS inference.

Figure 4 shows the Spearman's rank correlation between the posterior mean and the true latent versus running time (log scale). The figures show that our method, vLGP, resulted in an overall larger correlation than PLDS and GPFA after the algorithms terminated. PLDS uses nuclear-norm-penalized rate estimation as initialization (Pfau et al., 2013). The rank correlation of the PLDS inference improved only slowly from the initial value through the optimization. Both GPFA and vLGP use factor analysis as initialization (Yu et al., 2009). Note that GPFA divides each trial into small time segments for estimating the loading matrix and bias, which breaks the continuity within each trial. Only the final iteration infers each trial as a whole. Thus, the correlations of the final iterations jump up in the figures. It is obvious that vLGP greatly improves on the result of factor analysis in terms of the rank correlation.

Figure 4:

Performance comparison on simulated data sets. (a, b) Convergence speed of each algorithm in terms of the rank correlation between the true generative latent time series and the inferred mean posterior. GPFA is the fastest, and PLDS converges very slowly. vLGP achieves the largest correlation, yet is an order of magnitude faster than PLDS. The origin of time is shifted to 1 for convenience. Computer specification: Intel(R) Xeon(R) CPU E5-2680 v3 2.50 GHz, 24 cores, 126 GB RAM.

To quantify predictive performance on the spike trains, we use the log likelihood of the leave-one-neuron-out prediction as described in section 4.2. We normalize the test point-process likelihood with respect to that of a baseline model that assumes a homogeneous Poisson process to obtain the predictive log likelihood (PLL), given as
$$\mathrm{PLL} = \frac{1}{N_{\mathrm{sp}}\log 2}\sum_{n,t}\Big[y_{n,t}\log\frac{\hat{\lambda}_{n,t}}{\bar{y}} - \hat{\lambda}_{n,t} + \bar{y}\Big], \tag{4.3}$$
where $\hat{\lambda}_{n,t}$ is the leave-one-neuron-out prediction of the firing rate of neuron $n$ at time $t$, $\bar{y}$ is the population mean firing rate, and $N_{\mathrm{sp}}$ is the total number of spikes. A positive PLL implies the model predicts better than the mean firing rate, and a higher PLL implies better prediction. PLL has units of bits per spike and is widely used to quantify spike train prediction (Pillow et al., 2008).
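A sketch of this score is given below, assuming the Poisson form reconstructed in equation 4.3 and a homogeneous baseline at the mean firing rate; the small constant guards against taking the log of a zero rate.

```python
import numpy as np

def predictive_log_likelihood(y, rate_hat, eps=1e-12):
    """PLL in bits per spike. y, rate_hat: (T, N) observed counts and
    leave-one-neuron-out rate predictions per bin."""
    baseline = y.mean()                                        # homogeneous Poisson baseline
    ll_model = np.sum(y * np.log(rate_hat + eps) - rate_hat)
    ll_base = np.sum(y * np.log(baseline + eps) - baseline)
    return (ll_model - ll_base) / (y.sum() * np.log(2))
```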
In Figure 5, we compare the three models on each data set. Since GPFA assumes a gaussian likelihood, it cannot be compared directly using a point-process likelihood. We use a linear rectifier to convert the GPFA predictions to nonnegative rates and then compute the PLL (see Figure 5a). We denote the linear predictor by $\eta$, omitting the neuron, time, and model indices. Specifically, the rate prediction is given by
formula
4.4
where the constant was chosen to prevent the prediction from producing an invalid PLL while preserving linearity as much as possible.2 We also compare the predictive $R^2$ of the predictions instead of the PLL with respect to the three models (see Figure 5b), where the raw linear prediction without the rectifier link was used. PLL and $R^2$ largely agree, even though $R^2$ assumes the squared error measure that GPFA optimizes for and is inappropriate for PLDS and vLGP.
Figure 5:

Predictive performance (the larger the better). (a) Predictive log likelihood (PLL). (b) Predictive $R^2$ (coefficient of determination). Note that GPFA can predict a negative mean; thus, the PLL cannot make a fair comparison even with a soft rectifier link. We visually added a gradient in the GPFA PLL results to caution readers. $R^2$, which is GPFA's natural measure of performance, can also be negative due to overfitting. Note that GPFA performs better than PLDS on LDS when compared in predictive $R^2$, but vLGP is consistently the best in both measures.

4.4  V1 Population Recording

We apply our method to a large-scale recording to validate that vLGP picks up meaningful known signals and investigate the population-wide trial-to-trial variability structure. We use the data set (Graf, Kohn, Jazayeri, & Movshon, 2011) where 72 different equally spaced directional drifting gratings were presented to an anesthetized monkey for 50 trials each (array 5, 148 simultaneously recorded single units). We use 63 V1 neurons by considering only neurons with tuning curves that could be well approximated () by bimodal circular gaussian functions (the sum of two von Mises functions with different preferred orientations, amplitudes, and bandwidths) according to Graf et al. (2011). We do not include the stimulus drive in the model, in hopes that the inferred latent processes would encode the stimulus. We used a bin size of 1 ms.

We use four-fold cross-validation to determine the number of latents. First, a 15-dimensional model is fitted to a subsample composed of the first trial of each direction. In each fold, we use its estimated parameters to infer the latent process from another subsample composed of the second trial of each direction. The inference is made by leaving a quarter of the neurons out, and we predict the spike trains of the left-out neurons given the first $k$ orthogonalized latent processes, for each candidate dimensionality $k$. This procedure led us to choose 5 as the dimensionality, since the predictive log likelihood reached its maximum.

We refit a five-dimensional vLGP model using the subsample of the first trials. To quantify how much the model explains, we report the pseudo-$R^2$, defined as
$$\text{pseudo-}R^2 = \frac{\ell_{\text{model}} - \ell_{\text{null}}}{\ell_{\text{sat}} - \ell_{\text{null}}}, \tag{4.5}$$
where $\ell_{\text{null}}$ refers to the log likelihood of the population mean firing rate model (a single parameter), $\ell_{\text{sat}}$ is the log likelihood of the saturated model in which the rate of each bin is estimated by the empirical mean, and $\ell_{\text{model}}$ is the log likelihood of the fitted model. The pseudo-$R^2$ of our model (vLGP with 5D latents) is 20.88%. This model explains the data with shared variability through the latents and the heterogeneity of the baseline firing of individual neurons. For a baseline model with only a per neuron noise component (and no shared latent), the pseudo-$R^2$ is 6.77%.
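The sketch below computes this quantity for Poisson likelihoods, assuming the form reconstructed in equation 4.5; the saturated model uses the empirical mean of each bin across repeated trials, as described above.

```python
import numpy as np

def poisson_ll(y, rate, eps=1e-12):
    return np.sum(y * np.log(rate + eps) - rate)

def pseudo_r2(y, rate_model):
    """y, rate_model: (trials, T, N) spike counts and model-fitted rates."""
    ll_null = poisson_ll(y, np.full_like(y, y.mean(), dtype=float))     # single-parameter null
    ll_sat = poisson_ll(y, np.broadcast_to(y.mean(axis=0), y.shape))    # per-bin empirical mean
    ll_model = poisson_ll(y, rate_model)
    return (ll_model - ll_null) / (ll_sat - ll_null)
```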

Figure 5 shows the predictive performance based on two subsets. The first one is four trials (0, 90, 180, 270) of the subset of 63 neurons with a five-dimensional latent process. The second one is 10 trials (five trials of 0 and five trials of 90) of all 148 neurons with a four-dimensional latent process.

To evaluate the predictive performance at a longer timescale, we also cross-validated on the first subset with 20 ms time bins using GPFA and vLGP. GPFA models were fitted to both the raw spike count and its square root. The mean of the square root spike count under vLGP was obtained by simulation using the predicted firing rate. We report the normalized MSEs (MSE/variance of observation) of both methods for the spike count and its square root, respectively: GPFA, 0.713 (spike count) and 0.699 (square root); vLGP, 0.709 (spike count) and 0.708 (square root). We use the F-test to compare the MSEs and see if either of the two methods results in a significantly larger error. The corresponding $p$-values are 0.588 (spike count) and 0.140 (square root). This shows that the MSEs are not significantly different between GPFA and vLGP with a 20 ms time bin.

Although the parameters are estimated from a subsample, we can use them to infer the latent process of all trials of all 72 directions. Figure 6 shows inferred latent processes for two trials for two directions. We rotate the inferred latent process by the singular value decomposition (SVD; details are given later). Variational posterior distributions over the latents are shown for each trial. During the second half of the trial when the stimulus was off and the firing rate was lower, the uncertainty in the latent processes increased. There is visible trial-to-trial variability in the spike trains, which is reflected in the variations of latents.

Figure 6:

Single-trial spike trains and inferred latents. The visual stimulus was only on for the first half of the trial. The left two columns show the spike trains and the respective inferred latents for two trials of one stimulus direction; the right two columns show two trials of another direction. The colors indicate the latent dimensions, which are rotated to maximize the power captured by each latent in decreasing order (blue, green, red, purple, and yellow). The solid lines are the posterior means, and the light colors show the corresponding uncertainty.

First, we investigate how the "signal," defined as the visual stimuli, is captured by the latent processes. We average the inferred latent processes over 50 trials with identical spatiotemporal stimuli (see Figure 7). Since the stimuli are time-locked, the corresponding average latent trajectory should reveal the time-locked population fluctuations driven by the visual input. We concatenate the average latent processes along the time dimension and then orthogonalize them by SVD. The dimensions of the orthogonalized latent process are ordered by their respective singular values. The latent process of a single trial is also rotated into the same subspace.
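One way to carry out this rotation is sketched below; the exact rotation used for the figures may differ in detail.

```python
import numpy as np

def orthogonalize_latents(mean_latents, single_trial):
    """mean_latents: list of (T, L) trial-averaged latents, one per stimulus direction;
    single_trial: (T, L) latent from one trial. Returns SVD-orthogonalized versions
    with dimensions ordered by singular value."""
    X = np.concatenate(mean_latents, axis=0)            # concatenate averages along time
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt.T, single_trial @ Vt.T, s             # rotate averages and the single trial
```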

Figure 7:

Inferred latent processes averaged for two stimulus directions (0 and 90). Latents are rotated to maximize the power captured by each latent in decreasing order.

Furthermore, we visualized the trajectories in 3D (see the supplementary online video)3 to show how signal and noise are dynamically encoded in the state space. Figure 8 shows the projection of the average latent process corresponding to each orientation onto the first three principal components. The projection topologically preserves the orientation tuning in the V1 population. There are two continuous circular variables in the stimulus space to be encoded: the orientation and the temporal phase of the oscillation. The simplest topological structure that encodes two circular variables is a torus, and we indeed observe a toroidal topology (highlighted as rings of cycle averages).

Figure 8:

3D projection of mean latent trajectories given each orientation. We plot the first three singular vectors of the inferred latent corresponding to the signal interval (0–1400 ms) colored by orientation. The colored circles are cycle averages that visualize the temporal phase of oscillation per direction and form an approximate torus. The black circle visualizes the circular orientation that goes through the center of the torus. The left side shows 0–180 and the right side shows 180–360.

To see if our method captures the noise correlation structure through the latents as one would predict from recent studies of cortical population activity (Goris et al., 2014; Ecker et al., 2014; Lin et al., 2015), we calculated pairwise correlations between all neurons. We simulated spike trains using the model-predicted firing rates of all trials for the two stimulus directions. The model-predicted firing rates are the fitted values for each time bin, computed from the generative model using the estimated parameters and latents. For each bin, we simulated the spike count using a Poisson random number generator with the corresponding firing rate. To remove the signal, we subtracted the mean over 50 trials for each direction of the stimulus. Figure 9 shows the correlation matrices during the stimulus period (150–1150 ms) and off-stimulus period (1400–2400 ms). The neurons are sorted by the total correlations during the stimulus period. The power of the model-explained noise correlation is measured by comparing the model-predicted and observed zero-diagonal correlation matrices under the Frobenius norm. The proposed model explains more of the noise correlation than GPFA and PLDS for both periods.
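A sketch of this noise-correlation comparison follows; the explained-power summary at the end is one reasonable Frobenius-norm choice and is not necessarily the exact definition used for Figure 9.

```python
import numpy as np

def noise_correlation(counts):
    """counts: (trials, bins, neurons) re-binned spike counts for one stimulus direction.
    Subtract the trial average (the signal), pool residuals, and correlate across neurons."""
    residual = counts - counts.mean(axis=0, keepdims=True)
    C = np.corrcoef(residual.reshape(-1, counts.shape[-1]), rowvar=False)
    np.fill_diagonal(C, 0.0)                    # zero-diagonal correlation matrix
    return C

def explained_power(C_data, C_model):
    """Fraction of the Frobenius power of the observed noise correlations captured by the model."""
    return 1.0 - np.linalg.norm(C_data - C_model) ** 2 / np.linalg.norm(C_data) ** 2
```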

Figure 9:

Noise-correlation analysis. The pairwise noise correlations between all neurons were calculated during the stimulus period (top, 150–1150 ms) and off-stimulus period (bottom, 1400–2400 ms). The time bin size is 50 ms. Neurons are sorted by the total noise correlation defined as the row sum of the stimulus-driven noise correlation matrix (top left). The model-explained power percentages are shown at the bottom of each matrix.

These results show that vLGP is capable of capturing both the signal—repeated over multiple trials—and noise—population fluctuation not time locked to other task variables present in the cortical spike trains.

5  Discussion

We propose vLGP, a method that recovers low-dimensional latent dynamics from high-dimensional time series. Latent state-space inference methods are different from methods that only recover the average neural response time-locked to an external observation (Churchland et al., 2012; Brendel, Romo, & Machens, 2011). By inferring latent trajectories on each trial, they provide a flexible framework for studying the internal neural processes that are not time-locked. Higher-order processes such as decision making, attention, and memory recall are well suited for latent trajectory analysis due to their intrinsic low dimensionality of computation. Our method performs dimensionality reduction on a single trial basis and allows decomposition of neural signals into a small number of smooth temporal signals and their relative contribution to the population signal.

We compare our method to two widely used latent state-space modeling tools in neuroscience: GPFA (Yu et al., 2009) and PLDS (Macke et al., 2011). Unlike GPFA, vLGP allows a generalized linear model observation, which is more suitable for a broad range of spike train observations. Moreover, vLGP is significantly faster than PLDS, yet it shows superior performance in capturing the spatiotemporal structures in the neural data to both PLDS and GPFA at a fine timescale (1 ms bins).

To test its validity in real electrophysiological recordings, we used a V1 population recording driven by a fixed stimulus as a litmus test. We showed that our inferred latents contain meaningful information about the external stimuli, encoding both orientation and temporal modulation on a continuous manifold.

We considered only smoothness encoded in the GP prior in this letter, but a plethora of GP kernels are available (Rasmussen & Williams, 2005; Schölkopf & Smola, 2002). For example, to capture prior assumptions about periodicity in some of the latent processes, we can use spectral kernels (Ulrich, Carlson, Dzirasa, & Carin, 2015). This can be particularly useful for capturing internal neural oscillations (Fries, Reynolds, Rorie, & Desimone, 2001). In addition, it is straightforward to incorporate additional covariates such as external stimuli (Park et al., 2014) or local field potential (Kelly, Smith, Kass, & Lee, 2010) to vLGP.

We have not found systematic issues with vLGP, but potential weaknesses could stem from the variational approximation, inappropriate assumptions on the latent processes, and particular numerical shortcuts used in the implementation. It could be challenging for our method to recover very fast-changing latent trajectories, especially when the overall firing rate is very low. Subsampling used while optimizing the hyperparameters may miss a rare but key spike when the spike trains are sparse.

The proposed method has potential application in many areas, and it will be particularly useful in discovering how specific neural computations are implemented as neural dynamics. We are working on applying this method and its extensions to the sensorimotor decision-making process where the normative model guides what is being computed, but it is unclear as to how the neural system implements it.

An open-source python implementation of vLGP is available online (https://github.com/catniplab/vLGP) under MIT license.

Acknowledgments

We thank the reviewers for their constructive feedback. We are grateful to Arnulf Graf, Adam Kohn, Tony Movshon, and Mehrdad Jazayeri for providing the V1 data set. We also thank Evan Archer, Jakob Macke, Yuanjun Gao, Chethan Pandarinath, and David Sussillo for helpful feedback. This work was partially supported by the Thomas Hartman Foundation for Parkinson’s Research.

References

Archer, E. W., Koster, U., Pillow, J. W., & Macke, J. H. (2014). Low-dimensional models of neural population activity in sensory cortical circuits. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 343–351). Red Hook, NY: Curran.
Bach, F. R., & Jordan, M. I. (2002). Kernel independent component analysis. Journal of Machine Learning Research, 3(1), 1–48.
Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (in press). Variational inference: A review for statisticians. Journal of the American Statistical Association.
Brendel, W., Romo, R., & Machens, C. K. (2011). Demixed principal component analysis. In J. Shawe-Taylor, R. S. Zemel, P. Bartlett, F. C. N. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 2654–2662). Red Hook, NY: Curran.
Buesing, L., Macke, J., & Sahani, M. (2012). Spectral learning of linear dynamics from generalised-linear observations with application to neural population data. In F. Pereira, C. J. C. Burges, L. Bottou, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 25 (pp. 1691–1699). Red Hook, NY: Curran.
Churchland, A. K., Kiani, R., Chaudhuri, R., Wang, X.-J. J., Pouget, A., & Shadlen, M. N. (2011). Variance as a signature of neural computations during decision making. Neuron, 69(4), 818–831.
Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Foster, J. D., Nuyujukian, P., Ryu, S. I., & Shenoy, K. V. (2012). Neural population dynamics during reaching. Nature, 487(7405), 51–56.
Churchland, M. M., Cunningham, J. P., Kaufman, M. T., Ryu, S. I., & Shenoy, K. V. (2010). Cortical preparatory activity: Representation of movement or first cog in a dynamical machine? Neuron, 68(3), 387–400.
Daley, D. J., & Vere-Jones, D. (1988). An introduction to the theory of point processes. New York: Springer.
Ecker, A. S., Berens, P., Cotton, R. J., Subramaniyan, M., Denfield, G. H., Cadwell, C. R., … Tolias, A. S. (2014). State dependence of noise correlations in macaque primary visual cortex. Neuron, 82(1), 235–248.
Fries, P., Reynolds, J. H., Rorie, A. E., & Desimone, R. (2001). Modulation of oscillatory neuronal synchronization by selective visual attention. Science, 291(5508), 1560–1563.
Frigola, R., Chen, Y., & Rasmussen, C. (2014). Variational gaussian process state-space models. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 27 (pp. 3680–3688). Red Hook, NY: Curran.
Goris, R. L. T., Movshon, J. A., & Simoncelli, E. P. (2014). Partitioning neuronal variability. Nature Neuroscience, 17(6), 858–865.
Graf, A. B., Kohn, A., Jazayeri, M., & Movshon, J. A. (2011). Decoding the activity of neuronal populations in macaque primary visual cortex. Nature Neuroscience, 14(2), 239–245.
Haefner, R. M., Gerwinn, S., Macke, J. H., & Bethge, M. (2013). Inferring decoding strategies from choice probabilities in the presence of correlated variability. Nature Neuroscience, 16(2), 235–242.
Jazayeri, M., & Shadlen, M. N. (2010). Temporal context calibrates interval timing. Nature Neuroscience, 13(8), 1020–1026.
Kao, J. C., Nuyujukian, P., Ryu, S. I., Churchland, M. M., Cunningham, J. P., & Shenoy, K. V. (2015). Single-trial dynamics of motor cortex and their applications to brain-machine interfaces. Nature Communications, 6, 7759.
Kelly, R. C., Smith, M. A., Kass, R. E., & Lee, T. S. (2010). Local field potentials indicate network state and account for neuronal response variability. Journal of Computational Neuroscience, 29(3), 567–579.
Koyama, S., Pérez-Bolde, L. C. C., Shalizi, C. R. R., & Kass, R. E. (2010). Approximate methods for state-space models. Journal of the American Statistical Association, 105(489), 170–180.
Lakshmanan, K. C., Sadtler, P. T., Tyler-Kabara, E. C., Batista, A. P., & Yu, B. M. (2015). Extracting low-dimensional latent structure from time series in the presence of delays. Neural Computation, 27(9), 1825–1856.
Latimer, K. W., Yates, J. L., Meister, M. L. R., Huk, A. C., & Pillow, J. W. (2015). Single-trial spike trains in parietal cortex reveal discrete steps during decision-making. Science, 349(6244), 184–187.
Lin, I.-C., Okun, M., Carandini, M., & Harris, K. D. (2015). The nature of shared cortical variability. Neuron, 87(3), 644–656.
Macke, J. H., Buesing, L., Cunningham, J. P., Yu, B. M., Shenoy, K. V., & Sahani, M. (2011). Empirical models of spiking in neural populations. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24 (pp. 1350–1358). Red Hook, NY: Curran.
Moreno-Bote, R., Beck, J., Kanitscheider, I., Pitkow, X., Latham, P., & Pouget, A. (2014). Information-limiting correlations. Nature Neuroscience, 17(10), 1410–1417.
Opper, M., & Archambeau, C. (2008). The variational gaussian approximation revisited. Neural Computation, 21(3), 786–792.
Paninski, L., Ahmadian, Y., Ferreira, D. G. G., Koyama, S., Rahnama Rad, K., Vidne, M., … Wu, W. (2010). A new look at state-space models for neural data. Journal of Computational Neuroscience, 29(1–2), 107–126.
Park, I. M., Meister, M. L. R., Huk, A. C., & Pillow, J. W. (2014). Encoding and decoding in parietal cortex during sensorimotor decision-making. Nature Neuroscience, 17(10), 1395–1403.
Pfau, D., Pnevmatikakis, E. A., & Paninski, L. (2013). Robust learning of low-dimensional dynamics from large neural ensembles. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 26 (pp. 2391–2399). Red Hook, NY: Curran.
Pillow, J. W., Shlens, J., Paninski, L., Sher, A., Litke, A. M., Chichilnisky, E. J., & Simoncelli, E. P. (2008). Spatio-temporal correlations and visual signaling in a complete neuronal population. Nature, 454, 995–999.
Rasmussen, C. E., & Williams, C. K. I. (2005). Gaussian processes for machine learning. Cambridge, MA: MIT Press.
Resulaj, A., Kiani, R., Wolpert, D. M., & Shadlen, M. N. (2009). Changes of mind in decision-making. Nature, 461(7261), 263–266.
Sadtler, P. T., Quick, K. M., Golub, M. D., Chase, S. M., Ryu, S. I., Tyler-Kabara, E. C., … Batista, A. P. (2014). Neural constraints on learning. Nature, 512(7515), 423–426.
Schölkopf, B., & Smola, A. J. (2002). Learning with kernels: Support vector machines, regularization, optimization, and beyond. Cambridge, MA: MIT Press.
Truccolo, W., Eden, U. T., Fellows, M. R., Donoghue, J. P., & Brown, E. N. (2005). A point process framework for relating neural spiking activity to spiking history, neural ensemble and extrinsic covariate effects. Journal of Neurophysiology, 93(2), 1074–1089.
Ulrich, K. R., Carlson, D. E., Dzirasa, K., & Carin, L. (2015). GP kernels for cross-spectrum analysis. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, & R. Garnett (Eds.), Advances in neural information processing systems, 28 (pp. 1990–1998). Red Hook, NY: Curran.
Yu, B. M., Cunningham, J. P., Santhanam, G., Ryu, S. I., Shenoy, K. V., & Sahani, M. (2009). Gaussian-process factor analysis for low-dimensional single-trial analysis of neural population activity. Journal of Neurophysiology, 102(1), 614–635.

Notes

1

It is straightforward to add external covariates similar to the self-history in this point-process regression (see Park, Meister, Huk, & Pillow, 2014).

2

We initially tried a square link function for GPFA. However, it often produces a detrimental PLL due to large rate predictions arising from large negative raw (linear) predictions. Also note that a straightforward linear rectifier can result in an undefined PLL due to predicting a zero rate in a bin with a nonzero observation.