Stimulus reconstruction or decoding methods provide an important tool for understanding how sensory and motor information is represented in neural activity. We discuss Bayesian decoding methods based on an encoding generalized linear model (GLM) that accurately describes how stimuli are transformed into the spike trains of a group of neurons. The form of the GLM likelihood ensures that the posterior distribution over the stimuli that caused an observed set of spike trains is log concave so long as the prior is. This allows the maximum a posteriori (MAP) stimulus estimate to be obtained using efficient optimization algorithms. Unfortunately, the MAP estimate can have a relatively large average error when the posterior is highly nongaussian. Here we compare several Markov chain Monte Carlo (MCMC) algorithms that allow for the calculation of general Bayesian estimators involving posterior expectations (conditional on model parameters). An efficient version of the hybrid Monte Carlo (HMC) algorithm was significantly superior to other MCMC methods for gaussian priors. When the prior distribution has sharp edges and corners, on the other hand, the “hit-and-run” algorithm performed better than other MCMC methods. Using these algorithms, we show that for this latter class of priors, the posterior mean estimate can have a considerably lower average error than MAP, whereas for gaussian priors, the two estimators have roughly equal efficiency. We also address the application of MCMC methods for extracting nonmarginal properties of the posterior distribution. 
For example, by using MCMC to calculate the mutual information between the stimulus and response, we verify the validity of a computationally efficient Laplace approximation to this quantity for gaussian priors in a wide range of model parameters; this makes direct model-based computation of the mutual information tractable even in the case of large observed neural populations, where methods based on binning the spike train fail. Finally, we consider the effect of uncertainty in the GLM parameters on the posterior estimators.
Understanding the exact nature of the neural code is a central goal of theoretical neuroscience. Neural decoding provides an important method for comparing the fidelity and robustness of different codes (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). The decoding problem, in its general form, is the problem of estimating the relevant stimulus, x, that elicited the observed spike trains, r, of a population of neurons over a course of time. Neural decoding is also of crucial importance in the design of neural prosthetic devices (Donoghue, 2002).
A large literature exists on developing and applying different decoding methods to spike train data, in both single-cell and population decoding. Bayesian methods lie at the basis of a major group of these decoding algorithms (Sanger, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998; Maynard et al., 1999; Stanley & Boloori, 2001; Shoham et al., 2005; Barbieri et al., 2004; Wu et al., 2004; Brockwell, Rojas, & Kass, 2004; Kelly & Lee, 2004; Karmeier, Krapp, & Egelhaaf, 2005; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005; Jacobs, Grzywacz, & Nirenberg, 2006; Yu et al., 2009; Gerwinn, Macke, & Bethge, 2009; see also the companion article in this issue: J. Pillow, Y. Ahmadian, & L. Paninski, “Model-based decoding, information estimation, and change-point detection in multineuron spike trains,” referred to hereafter as the companion article, Pillow et al., 2011). In such methods, the a priori distribution of the sensory signal, p(x), is combined, via Bayes' rule, with an encoding model describing the probability, p(r|x), of different spike trains given the signal, to yield the posterior distribution, p(x|r), that carries all the information contained in the observed spike train responses about the stimulus. A Bayesian estimate is one that, given a definite cost function on the amount of error, minimizes the expected error cost under the posterior distribution. Assuming the prior distribution and the encoding model are appropriately chosen, the Bayes estimate is thus optimal by construction. Furthermore, since the Bayesian approach yields a distribution over the possible stimuli that could lead to the observed response, Bayes estimates naturally come equipped with measures of their reliability or posterior uncertainty.
In a fully Bayesian approach, one has to be able to evaluate any desired functional of the high-dimensional posterior distribution. Unfortunately, calculating these can be computationally very expensive. For example, most Bayesian estimates involve integrations over the (often very high-dimensional) space of possible signals. Accordingly, most work on Bayesian decoding of spike trains has focused on cases where the signal is low dimensional (Sanger, 1994; Maynard et al., 1999; Abbott & Dayan, 1999; Karmeier et al., 2005) or on situations where the joint distribution, p(x, r), has a certain Markov tree decomposition, so that computationally efficient recursive techniques may be applied (Zhang et al., 1998; Brown et al., 1998; Barbieri et al., 2004; Wu et al., 2004; Brockwell et al., 2004; Kelly & Lee, 2004; Shoham et al., 2005; Eden, Frank, Barbieri, Solo, & Brown, 2004; Truccolo et al., 2005; Ergun, Barbieri, Eden, Wilson, & Brown, 2007; Yu et al., 2009; Paninski et al., 2010). The Markov setting is extremely useful; it lends itself naturally to many problems of interest in neuroscience and has thus been fruitfully exploited. In particular, this setting is very useful in an important class of decoding problems where stimulus estimation is performed online: the stimulus at some time, t, is estimated conditioned on the observation of the spike trains only up to that time, as opposed to the entire spike train.
Some decoding problems cannot be formulated in the online estimation framework. In such cases, quantities of interest should naturally be conditioned on the entire history of the spike train. In this article, we focus on this latter class of problems (although many of the methods we discuss can potentially be adapted to the online case as well). Furthermore, it is awkward to cast many decoding problems of interest in the Markov setting. A more general method that does not require such tree decomposition properties is to calculate the maximum a posteriori (MAP) estimate (Stanley & Boloori, 2001; Jacobs et al., 2006; Gerwinn et al., 2009; see the companion paper, Pillow et al., 2011, for further review and discussion). The MAP estimate requires no integration, only maximization of the posterior distribution, and can remain computationally tractable even when the stimulus space is very high dimensional. This is the case for general log-concave posterior distributions; many problems in sensory and motor coding fall in this class (it should be noted, however, that in many cases of interest where this condition is not satisfied, for example, when the distributions are inherently multimodal, posterior maximization can become highly intractable). The MAP is a good estimator when the posterior is well approximated by a gaussian distribution centered at the MAP itself (Tierney & Kadane, 1986; Kass, Tierney, & Raftery, 1991). Because the mode and the mean of a gaussian distribution are identical, the MAP in this case is approximately equal to the posterior mean as well. This gaussian approximation is expected to be sufficiently accurate, for example, when the prior distribution and the likelihood function (i.e., as a function of x) are not very far from gaussian or when the likelihood is sharply concentrated around the MAP.
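To make the tractability of MAP decoding for log-concave posteriors concrete, the following minimal sketch (all variable names, dimensions, and the toy filter are hypothetical illustrations, not the fits used later in this article) simulates spike counts from a discretized Poisson GLM with exponential nonlinearity and a gaussian prior, then finds the MAP by maximizing the concave log posterior with a generic quasi-Newton optimizer:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical toy setup: T time bins, d-dimensional stimulus x, Poisson GLM
# with rate f(u) = exp(u), u = K x + b, and a gaussian prior N(0, C).
rng = np.random.default_rng(0)
d, T, dt = 20, 40, 0.01
K = rng.normal(size=(T, d)) / np.sqrt(d)   # toy stimulus filter matrix
b = np.log(7.0) * np.ones(T)               # DC bias giving a ~7 Hz baseline
Cinv = np.eye(d)                           # inverse prior covariance

x_true = rng.normal(size=d)
r = rng.poisson(np.exp(K @ x_true + b) * dt)   # simulated spike counts

# Negative log posterior (up to constants): the Poisson log likelihood is
# concave in x and the gaussian log prior is quadratic, so this objective is
# convex with a unique minimum, the MAP.
def neg_log_post(x):
    u = K @ x + b
    return -(r @ u - dt * np.exp(u).sum()) + 0.5 * x @ Cinv @ x

def neg_log_post_grad(x):
    u = K @ x + b
    return -K.T @ (r - dt * np.exp(u)) + Cinv @ x

res = minimize(neg_log_post, np.zeros(d), jac=neg_log_post_grad,
               method="L-BFGS-B")
x_map = res.x
```

Because the objective is convex, any ascent method converges to the same MAP; for long stimuli, banded structure in the log-likelihood Hessian can be exploited for further speedups.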
However, when the prior distribution has sharp boundaries and corners and the likelihood function does not constrain the estimate away from such nongaussian regions, the gaussian approximation can fail, resulting in a large average error in the MAP estimate. In such cases, one expects the MAP to be inferior to the posterior mean, E(x|r), which is the optimal estimate under squared error loss.
Accordingly, in section 3 of this article we develop efficient Markov chain Monte Carlo (MCMC) techniques for sampling from general log-concave posterior distributions and compare their performance in situations relevant to our neural decoding setting (for comprehensive introductions to MCMC methods, including their application in Bayesian problems, see, e.g., Robert & Casella, 2005 and Gelman, 2004). By providing a tool for approximating averages (integrals) over the exact posterior distribution, p(x|r, θ) (where θ denotes the parameters of the encoding forward model, in principle obtained by fitting to experimental data), these techniques allow us to calculate general Bayesian estimates such as the posterior mean, E(x|r, θ), and provide estimates of their uncertainty. Although in principle many of the MCMC methods we discuss are applicable even to posterior distributions that are not log concave, they may lose their efficiency in such cases, and estimates based on them may not even converge to true posterior averages. In section 4 we compare the MAP and the posterior mean stimulus estimates based on the simulated response of a population of retinal ganglion cells (RGC). In section 5 we discuss the applications of MCMC for calculating more complicated properties of p(x|r, θ) beyond marginal statistics, such as the statistics of first-passage times. We also discuss an MCMC-based method known as bridge sampling (Bennett, 1976; Meng & Wong, 1996) that provides a tool for a direct calculation of the mutual information. Using this technique, we show that for gaussian priors, the estimates of Pillow et al. (2011) in the companion article for this quantity based on the Laplace approximation are robust and accurate. Finally, in section 6 we discuss the effect of uncertainty in the parameters of the forward model, θ, on the MAP and posterior mean estimates. We proceed by first introducing the forward model used to calculate the likelihood p(r|x, θ).
2. The Encoding Model, the MAP, and the Stimulus Ensembles
In this section, we give an overview of neural encoding models based on generalized linear models (GLM) (Brillinger, 1988; McCullagh & Nelder, 1989; Paninski, 2004; Truccolo et al., 2005), and briefly review the treatment of Pillow et al. (2011) in the companion article for MAP-based decoding. (Note that much of the material in this section is covered in this companion article, but we include a brief review here to make this article self-contained.) A neural encoding model assigns a conditional probability to the neural response given the stimulus. We take the stimulus to be an artificially discretized, possibly multicomponent, function of time, x(t,n), which will be represented as a d-dimensional vector x.1
Normally in the statistical setting, the Laplace approximation is formally justified in the limit of large samples due to the central limit theorem, leading to a likelihood function with a very sharp peak (in neural decoding, the meaning of “large samples” depends in general on the nature of the stimulus, a point we discuss further in section 4). However, this approximation often proves adequate even for moderately strong likelihoods as long as the posterior is not grossly nonnormal. An obvious case where the approximation fails is for strongly multimodal distributions where no particular mode dominates. Here, we restrict our attention to the class of log-concave posteriors, which are unimodal. For this class and for a smooth enough GLM nonlinearity, f(·), we expect equation 2.8 to hold for prior distributions that are close to normal, even when the likelihood is not extremely sharp. However, for flatter priors with sharp boundaries or “corners,” we expect it to fail unless the likelihood is narrowly concentrated away from such nongaussian regions.
3. Monte Carlo Techniques for Bayesian Estimates
The major drawback of the MCMC method is that the generated samples are dependent; it is thus harder to estimate how long we need to run the chain to get an accurate estimate, and in general we may need to run the chain much longer than in the i.i.d. case. Thus, we would like to choose a proposal density, q(x′|x), that gives rise to a chain that explores the support of p(x|r) (i.e., mixes) quickly and has a small correlation time (roughly, the number of steps of separation needed to yield i.i.d. samples), to reduce the number of steps the chain has to be iterated and hence the computational time (see section 3.5 and Gelman, 2004, and Robert & Casella, 2005, for further details). In general, a good proposal density should allow for large jumps with a high probability of falling in regions of larger p(x|r) (so as to avoid a high MH rejection rate). A good rule of thumb is for the proposals to resemble the true density as much as possible. We review a few useful, well-known proposals and explore different ways of boosting their efficiency in the GLM-based neural decoding setting. We note here that these algorithms can be applied to general distributions and do not require the log-concavity condition for p(x|r). However, some of the enhancements that we consider can only be implemented, or are only expected to boost the performance of the chain, when the distribution is log concave (see the discussion of nonisotropic proposals in sections 3.1 and 3.2 and that of adaptive rejection sampling in section 3.4).
3.1. Nonisotropic Random Walk Metropolis (RWM).
As we will see in the next section, in the flat prior case and for weak stimulus filters or a small number of identical cells, the Laplace approximation can be poor. In particular, the Hessian, equation 2.9, does not contain any information about the prior in the flat case, and therefore the approximate distribution, equation 2.8, can be significantly broader than the extent of the prior support in some directions. To take advantage of the Laplace approximation in this case, we regularized the Hessian by adding to it the inverse covariance matrix of the flat prior, obtaining a matrix that would be the Hessian if the flat prior was replaced by a gaussian with the same mean and covariance. Although the gaussian with this regularized Hessian is still not a very good approximation for the posterior, we saw that in many cases, it improved the mixing rate of the chain.
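The effect of such nonisotropic proposals can be illustrated with a short sketch (a hypothetical toy example: a fixed anisotropic gaussian target stands in for the posterior, and J stands in for the Hessian of its negative log density). The RWM jumps are shaped by the Cholesky factor of the inverse Hessian so that the proposal covariance matches the target's local curvature:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy anisotropic log-concave target: log p(x) = -x' J x / 2 (up to a constant).
J = np.array([[50.0, 0.0], [0.0, 0.5]])
def log_p(x):
    return -0.5 * x @ J @ x

# Shape proposals by A = chol(J^{-1}); s is the remaining overall scale to tune.
A = np.linalg.cholesky(np.linalg.inv(J))
s = 1.5

x = np.zeros(2)
samples, n_accept = [], 0
for _ in range(20000):
    prop = x + s * (A @ rng.normal(size=2))              # nonisotropic jump
    if np.log(rng.uniform()) < log_p(prop) - log_p(x):   # MH accept/reject
        x, n_accept = prop, n_accept + 1
    samples.append(x)
samples = np.array(samples)
accept_rate = n_accept / len(samples)
```

With isotropic proposals, the scale would have to be small enough for the stiff direction (curvature 50), making mixing along the soft direction (curvature 0.5) very slow; shaping by A removes this mismatch.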
3.2. Hybrid Monte Carlo and MALA.
A more powerful method for constructing rapidly mixing chains is the so-called hybrid or Hamiltonian Monte Carlo (HMC) method. In a sense, HMC is at the opposite end of the spectrum with respect to RWM in that it is designed to suppress the random walk nature of the chain by exploiting information about the local shape of , via its gradient, to encourage steps toward regions of higher probability. This method was originally inspired by the equations of Hamiltonian dynamics for the molecules in a gas (Duane, Kennedy, Pendleton, & Roweth, 1987) but has since been used extensively in Bayesian settings (for its use in sampling from posteriors based on GLM, see Ishwaran, 1999; see also Neal, 1996, for further applications and extensions).
This method starts by augmenting the vector x with an auxiliary vector z of the same dimension. Let us define the potential energy as U(x) = -log p(x|r), up to a constant, and a “Hamiltonian function” by H(x, z) = U(x) + ||z||^2/2. Instead of sampling points, x, from p(x|r), the HMC method constructs an MH chain that samples points, (x, z), from the joint distribution p(x, z) ∝ exp(-H(x, z)). But since this distribution factorizes into the product of its marginals for x and z, the x-part of the obtained samples yields samples from p(x|r). On the other hand, sampling from the marginal over z is trivial, since z is normally distributed. In a generic step of the Markov chain, starting from (xt, zt), the HMC algorithm performs the following steps to generate (xt+1, zt+1). First, to construct the MH proposal:
Set x = xt, and sample z from the isotropic gaussian distribution N(0, I).
Starting from this (x, z), evolve the pair according to the equations of Hamiltonian dynamics,8 discretized based on the leapfrog method, by repeating the following steps L times:
Finally, to implement the MH acceptance step, equation 3.4,
With probability min(1, exp(-ΔH)), where ΔH = H(x′, z′) - H(xt, zt) is the change in the Hamiltonian over the leapfrog trajectory, accept the proposal x′ as xt+1. Otherwise reject it, and set xt+1 = xt. (It can be shown that this is a bona fide MH rejection rule, ensuring that the resulting MCMC chain indeed has the desired equilibrium density; Duane et al., 1987.)
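The steps above can be sketched as follows (a hypothetical toy example: a fixed gaussian target stands in for the posterior, and the step size and number of leapfrog steps are arbitrary rather than tuned to the acceptance-rate rule discussed below):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy target: p(x) ∝ exp(-x' J x / 2), so the potential is U(x) = x' J x / 2.
J = np.diag([4.0, 1.0])
def U(x):
    return 0.5 * x @ J @ x
def grad_U(x):
    return J @ x

def hmc_step(x, eps=0.2, L=10):
    z = rng.normal(size=x.size)                  # sample momentum z ~ N(0, I)
    x_new = x
    z_new = z - 0.5 * eps * grad_U(x)            # initial leapfrog half step
    for _ in range(L):                           # L leapfrog steps
        x_new = x_new + eps * z_new
        z_new = z_new - eps * grad_U(x_new)
    z_new = z_new + 0.5 * eps * grad_U(x_new)    # undo the extra half step
    # MH accept with probability min(1, exp(-dH))
    dH = (U(x_new) + 0.5 * z_new @ z_new) - (U(x) + 0.5 * z @ z)
    return x_new if np.log(rng.uniform()) < -dH else x

x = np.zeros(2)
samples = []
for _ in range(5000):
    x = hmc_step(x)
    samples.append(x)
samples = np.array(samples)
```

Because the leapfrog integrator nearly conserves H, the rejection rate stays low even for long trajectories, which is what suppresses the random walk behavior.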
The scale parameter ε, which also needs to be adjusted for the RWM chain, sets the average size of the proposal jumps: we must typically choose this scale to be small enough to avoid jumping wildly into a region of low p(x|r), and therefore wasting the proposal, since it will be rejected with high probability. At the same time, we want to make the jumps as large as possible, on average, in order to improve the mixing time of the algorithm. (See Roberts and Rosenthal, 2001, and Gelman, 2004, for some tips on how to find a good balance between these two competing desiderata for the RWM and MALA chains.) For the HMC chains with L>1, we chose ε, by trial and error, to obtain an MH acceptance rate of 60% to 70%. We adopted this rule of thumb based on a qualitative extrapolation of the results of Roberts and Rosenthal (1998) for the special cases of L=0 and 1 (corresponding to the RWM and MALA chains, respectively) and their suggestion to tune the acceptance rate in those cases to 25% and 55%, respectively, for optimal mixing (for further discussion, see section 3.5; for a study on tuning this parameter for HMC with general L, see, e.g., Kennedy, Edwards, Mino, & Pendleton, 1996).
For highly nonisotropic distributions, the HMC chains can also be enhanced by exploiting the Laplace approximation (or its regularized version in the uniform prior case, as explained for RWM) to modify the HMC proposals. Equivalently, as noted after equation 3.9, we can sample from the auxiliary distribution of the transformed variable (where A is given in equation 3.9) using the unmodified HMC chain, described above, and subsequently transform the samples by A. As explained in the final paragraph of section 3.1, we can perform this transformation efficiently, in O(T) computational time, where T is the stimulus duration. Another practical advantage of this transformation by A is that the process of finding the appropriate scale parameter simplifies considerably, since the transformed distribution may be approximated as a gaussian distribution with identity covariance regardless of the scaling of different dimensions in the original distribution p(x|r). To our knowledge, this enhancement of the HMC chain using the Laplace approximation is novel. This chain turned out to be the most efficient in most of the decoding examples we explored (we discuss this in more detail in section 3.5).
It is worth noting that when sampling from high-dimensional distributions with sharp gradients, the MALA, HMC, and RWM chains have a tendency to be trapped in “corners” where the log posterior changes suddenly. This is because when the chain eventually ventures close to the corner, a jump proposal will very likely fall on the exterior side of the sharp high-dimensional corner (the probability of jumping to the interior side from the tip of a cone decreases exponentially with increasing dimensionality). Thus, most proposals will be rejected, and the chain will effectively stop. As we will see below, the hit-and-run chain is known to have an advantage in escaping from such sharp corners (Lovasz & Vempala, 2004). We discuss this point further in section 3.4.
3.3. The Gibbs Sampler.
It is important to note that the Gibbs update rule can sometimes fail to lead to an ergodic chain; the chain can get “stuck” and not sample from p(x|r) properly (Robert & Casella, 2005). An extreme case of this is when the conditional distributions are deterministic: then the Gibbs algorithm will never move, clearly breaking the ergodicity of the chain. More generally, in cases where strong correlations between the components of x lead to nearly deterministic conditionals, the mixing rate of the Gibbs method can be extremely low (Figure 3a shows this phenomenon for a two-dimensional distribution with strong correlation between the two components). Thus, it is a good idea to choose the parameterization of the model carefully before blindly applying the Gibbs algorithm. For example, we can change the basis or, more systematically, exploit the Laplace approximation, as described above, to sample from the transformed auxiliary distribution instead.
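The slow mixing caused by strongly correlated components can be seen in a few lines (a hypothetical toy example; for a bivariate gaussian, both Gibbs conditionals are available in closed form, and the lag-1 autocorrelation of the chain is rho² in theory):

```python
import numpy as np

def gibbs_lag1_autocorr(rho, n=20000, seed=3):
    """Gibbs-sample N(0, [[1, rho], [rho, 1]]) and return the lag-1
    autocorrelation of the x1 component of the chain."""
    rng = np.random.default_rng(seed)
    x1 = x2 = 0.0
    xs = np.empty(n)
    for i in range(n):
        # exact gaussian conditionals: x1 | x2 ~ N(rho*x2, 1 - rho^2), etc.
        x1 = rho * x2 + np.sqrt(1.0 - rho**2) * rng.normal()
        x2 = rho * x1 + np.sqrt(1.0 - rho**2) * rng.normal()
        xs[i] = x1
    xs -= xs.mean()
    return (xs[:-1] @ xs[1:]) / (xs @ xs)

rho_weak = gibbs_lag1_autocorr(0.3)     # weak correlation: fast mixing
rho_strong = gibbs_lag1_autocorr(0.99)  # near-deterministic conditionals:
                                        # the chain is nearly stuck
```

As rho approaches 1, each conditional update moves the chain only slightly, so the autocorrelation approaches 1 and the effective sample size collapses.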
3.4. The Hit-and-Run Algorithm.
The main gain over RWM or HMC is that instead of taking small local steps (of size proportional to the scale parameter ε in equation 3.7 or 3.11), we may take very large jumps in the n direction; the jump size is set by the underlying distribution itself, not by an arbitrary scale, ε, which has to be tuned by the user to achieve optimal efficiency. The jump size in a given direction, n, is set by the scale of the “slice” distribution, equation 3.13.
This, together with the fact that all hit-and-run proposals are accepted, makes the chain better at escaping from sharp high-dimensional corners (see Lovasz & Vempala, 2004, and the discussion at the end of section 3.2 above). The advantage over Gibbs is in situations such as that depicted in Figure 2, where jumps parallel to the coordinates lead to small steps, but there are directions along which hit-and-run can make long jumps. The price to pay for these possibly long nonlocal jumps, however, is that now (as in the Gibbs case) we need to sample from the one-dimensional slice density of equation 3.13, which is in general nontrivial. Fortunately, as we mentioned above (see the discussion leading to equations 2.5 to 2.7 and following it), in the case of neurons modeled by the GLM, the posterior distribution and thus all its “slices” are log concave, and efficient methods such as adaptive rejection sampling (ARS) (Gilks, 1992; Gilks & Wild, 1992) can be used to sample from the one-dimensional slice in the hit-and-run step. Let us emphasize, however, that the hit-and-run algorithm by itself does not require the distribution to be log concave. Given a method other than ARS for sampling from the one-dimensional conditional slice distributions, hit-and-run can be applied to general distributions that are not log concave as well.
Regarding the direction density, q(n), the easiest choice is the isotropic (uniform) density on the unit sphere. More generally, it is easy to sample directions from ellipses by sampling from the appropriate gaussian distribution and normalizing. Thus, again, a reasonable approach is to exploit the Laplace approximation: we sample n by sampling an auxiliary point y from N(0, J^(-1)), where J is the Hessian, equation 2.9, and setting n = y/||y|| (see Figure 2). This prescription is equivalent to sampling n from the density proportional to (n·Jn)^(-d/2), which is referred to as the angular central gaussian distribution in the statistical literature (see, e.g., Tyler, 1987). This further advantages hit-and-run over Gibbs by giving more weight to directions that allow larger jumps to be made.
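For the special case of a flat density on a box (so each one-dimensional slice is uniform on a chord), hit-and-run reduces to the following sketch (a hypothetical toy example; here the direction density is isotropic, and swapping in the angular central gaussian draw described above would change only the two lines that generate n):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
x = np.zeros(d)                      # start inside the box [-1, 1]^d
samples = []
for _ in range(20000):
    y = rng.normal(size=d)           # isotropic direction density q(n)
    n = y / np.linalg.norm(y)
    # intersect the line {x + s*n} with the box to get the chord [s_min, s_max]
    with np.errstate(divide="ignore"):
        a = (-1.0 - x) / n
        b = (1.0 - x) / n
    s_min = np.minimum(a, b).max()
    s_max = np.maximum(a, b).min()
    # the 1D slice of a flat density is uniform on the chord; every proposal
    # is accepted, with no scale parameter to tune
    x = x + rng.uniform(s_min, s_max) * n
    samples.append(x)
samples = np.array(samples)
```

For a nonuniform log-concave target, the uniform draw on the chord would be replaced by, for example, an ARS draw from the one-dimensional slice.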
3.5. Comparison of Different MCMC Chains.
Above, we pointed out some qualitative reasons behind the strengths and weaknesses of the different MCMC algorithms in terms of their mixing rates and computational costs. Here we give a more quantitative account and also compare the different methods based on their performance in the neural decoding setting.
The situation can be very different, however, for highly nongaussian (but still log-concave) distributions, such as those with sharp boundaries. In our GLM setting, this can be the case with flat priors on convex sets, equation 2.11, when the likelihood is broad and does not restrict the posterior support away from the boundaries and corners of the prior support. In this case, HMC and MALA lose their advantage because they do not take advantage of the information in the prior distribution, which has zero gradient within its support. Furthermore, as mentioned in sections 3.2 and 3.4, when the convex body has sharp corners, hit-and-run will have an advantage over both RWM and HMC in avoiding getting trapped in those corners, which can otherwise considerably slow the chain in large dimensionality (see the arguments in Lovasz & Vempala, 2004). Finally, the MALA or HMC proposals can in principle be inefficient in regions of sharp gradient changes; for example, in the GLM setting, if the nonlinearity f(·) is very sharp, then the log likelihood might vary much more quickly than quadratically. In such cases, the HMC proposal jumps can be too large, falling in regions where p(x|r) is very low and leading to high rejection rates. This can potentially reduce HMC's advantage significantly even when the prior is gaussian. However, in our experience, with f(·)=exp(·), this did not occur.
Figure 4b shows the estimated autocorrelation function for different chains in sampling from the posterior distribution in GLM-based decoding with a flat prior stimulus distribution, equation 2.11, with cubic support.11 For this prior, the correlation time of the hit-and-run chain was consistently lower than those of the RWM, MALA, and Gibbs (not shown in the figure) chains, unless the likelihood was sharp and concentrated away from the boundaries of the prior cube. As we mentioned above (also see the next section), the Laplace approximation is adequate in this latter case. Thus, we see that hit-and-run is the faster chain when this approximation fails, which is also the case where MCMC is more indispensable. We thus used the hit-and-run algorithm in our decoding examples for the flat prior case presented in the next section.
Finally, we note that other methods of diagnosing mixing and convergence, such as the so-called r-hat statistic (Brooks & Gelman, 1998), gave results consistent with those based on the autocorrelation time, τ, presented here.
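The integrated autocorrelation time used in these comparisons can be estimated from a scalar chain with a simple truncated-sum estimator. The sketch below is a hypothetical helper with a crude truncation rule (more careful windowing rules exist); it is checked on an AR(1) chain, for which the exact value is (1 + φ)/(1 − φ):

```python
import numpy as np

def autocorr_time(chain, max_lag=1000):
    """Integrated autocorrelation time tau = 1 + 2*sum_k rho_k, truncating
    the sum at the first negative autocorrelation estimate (a crude rule)."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    denom = x @ x
    tau = 1.0
    for k in range(1, min(max_lag, len(x) - 1)):
        rho = (x[:-k] @ x[k:]) / denom     # empirical lag-k autocorrelation
        if rho < 0.0:
            break
        tau += 2.0 * rho
    return tau

# AR(1) test chain x_t = phi*x_{t-1} + noise, with tau = (1+0.9)/(1-0.9) = 19.
rng = np.random.default_rng(5)
phi, n = 0.9, 200000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
tau_hat = autocorr_time(x)
```

Dividing the chain length by tau gives a rough effective number of independent samples, which is the quantity that actually controls the accuracy of MCMC averages.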
4. Comparison of MAP and Monte Carlo Decoding
In this section we compare Bayesian stimulus decoding using the MAP and the posterior mean estimates, equations 2.7 and 3.1, based on the response of a population of neurons modeled via the GLM introduced in section 2. We will show that in the flat prior case, equation 2.11, the MAP estimate, in terms of its mean squared error, is much less efficient than the posterior mean estimate. We contrast this with the gaussian prior case, where the Laplace approximation is accurate in a large range of model parameters, and thus the two estimates are close. Furthermore, for both kinds of priors, in the limit of strong likelihoods (e.g., due to a strong stimulus filter or a large number of neurons), the posterior distribution will be sharply concentrated, the Laplace approximation becomes asymptotically more and more accurate, and both estimates will eventually converge to the true stimulus (more precisely, the part of the stimulus that is not outside the receptive field of all the neurons; see footnote 13).
In the first two examples (Figures 5 and 6), the stimulus estimates were computed given the simulated spike trains of a population of pairs of ON and OFF retinal ganglion cells (RGC) in response to a spatially uniform, full-field fluctuating light intensity signal. The stimuli were discretized white noise with gaussian and flat distributions (see the paragraph after equation 2.11). Spike responses were generated by simulating the GLM point process encoding model, described by equations 2.3 and 2.4, with exponential nonlinearity, f(u)=exp(u). The coupling between different cells (the cross-coupling kernels of equation 2.3) was set equal to zero, but the diagonal kernels, representing the spike history feedback of each cell to itself, were closely matched to those found with fits to macaque ON and OFF RGCs reported in Pillow et al. (2008), and so were the DC biases, bi; the values of the DC biases were such that the baseline firing rate, exp(bi), in the absence of stimulus was approximately 7 Hz (see the appendix of Pillow et al., 2011, the companion article, for a more detailed description of the fits for stimulus and spike history filters). However, for demonstration purposes, the stimulus filters, Ki, were set to positive and negative delta functions (for ON and OFF cells, respectively), resulting in the filtered stimulus input being proportional to the light stimulus, x(t), so that bandpass filtering of the stimulus did not result in information loss and convergence of the estimates to the true stimulus could be observed more easily. For a fixed number of cells, the parameter of relevance here, which determines the signal-to-noise ratio of the RGCs' spike trains, is the strength of the filtered stimulus input to the GLM nonlinearity. The magnitude of this input is proportional to c||K||, where c is the stimulus contrast and ||K|| is the norm of the receptive field filter (which we have taken to be the same for all cells in this example).
Figure 5 shows the stimulus, the spike trains, and the two estimates for three different magnitudes of the filtered stimulus input, based on the response of one pair of ON and OFF cells. Figure 6 shows the same based on the response of 10 identical pairs of RGCs.
Because the prior distribution here is flat on the 50-dimensional cube centered at the origin, the Laplace approximation, equation 2.8, will be justified only when the likelihood is sharply concentrated and supported away from the edges of the cube.12 Moreover, since the flat prior is only “felt” on the boundaries of the cube (the horizontal dashed lines in Figures 5 and 6), the MAP will lie in the interior of the cube only if the likelihood has a maximum there. For filtered stimulus inputs with small magnitude, the log likelihood, equations 2.3 and 2.4, becomes approximately linear in the components of x. With a flat prior, the almost linear log posterior will very likely be maximized only on the boundaries of the cube (since linear functions on convex domains attain their maxima at the “corners” of the domain). Thus, in the absence of a strong, confining likelihood, the MAP has a tendency to stick to the boundaries, as seen in the first two columns of Figure 5; in other words, the MAP falls on a corner of the cube, where the Laplace approximation is worst and where MALA and RWM are least efficient. We note that the likelihood will be further weakened, in fact, if we replace the delta function stimulus filters with more realistic filters, as the bandpass filtering will remove the dependence of the likelihood on the features of the stimulus that were filtered out (see a similar discussion in our companion article, Pillow et al., 2011, on MAP decoding).
A sharp likelihood confines the posterior away from the boundaries of the prior support and solely determines the position of both the MAP and the posterior mean. In this case, the gaussian approximation for the posterior distribution is valid, and the two estimates will in fact be very close (as the mean and the mode of a gaussian are one and the same). This can be seen in the right column of Figure 5, where the large value of the stimulus filter has sharpened the likelihood. Also, as is generally true in statistical parameter estimation, when the number of data points becomes large, the likelihood term gets very sharp, leading to accurate estimates.13 In our case, this corresponds to increasing the number of cells with similar receptive fields, leading to the smaller error bars in Figure 6 and the more accurate and closer MAP and mean estimates.
To compare the performance of the two estimates more quantitatively, in Figure 7 we have plotted the average squared errors of the two estimates under the full stimulus-response distribution, p(x, r) (for the same type of stimulus and cell pair as in the Figure 5 simulations), as a function of the magnitude of the filtered stimulus input. This was done by generating five samples of the stimulus in each case and then simulating the GLM to generate the spike train response of the pair of ON and OFF cells to each stimulus, leading to sample pairs (xi, ri) for i = 1, ..., 5. For each of the responses, ri, the MAP and MCMC mean were computed based on the posteriors p(x|ri). The average (over p(x, r)) squared error of each estimator was then approximated by the corresponding sample mean over the pairs (xi, ri). The left and right panels in Figure 7 show plots of the squared error per dimension, for the MAP and mean estimates, as a function of the stimulus filter strength for the flat and gaussian white noise stimulus ensembles, respectively. As is evident from the plots, in the former ensemble the MAP is inferior to the mean, due to its higher mean squared error, unless the filter strength is large. For the gaussian ensemble, the plot shows that the errors of the MAP and posterior mean estimates are very close throughout the range of stimulus filter strengths. Thus, due to its much lower computational cost, the MAP-based decoding method of Pillow et al. (2011) in the companion article is superior for this prior. Let us mention that the magnitude of the filtered stimulus in the experimental data reported in Pillow et al. (2008) (which is also the basis of the final example in this section; see Figure 8) varied over a limited range, depending on the cell; smaller values can be achieved experimentally by lowering the contrast of the visual stimulus as needed.
Thus, the values of this parameter used in Figure 7, as well as in Figures 5 and 6, are on the same order of magnitude as those used in that experiment and cover a range of values that is experimentally and biologically relevant.
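The Monte Carlo procedure described above, drawing stimulus-response pairs, decoding each response, and averaging the squared errors, can be sketched as follows. The decoder outputs below are hypothetical stand-ins (noisy perturbations of the truth, with the "MAP" given an extra bias purely for illustration); they are not the GLM decoders themselves, and only the error computation is the point.

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_squared_error_per_dim(x_true, x_est):
    """Monte Carlo estimate of the average squared error per stimulus dimension,
    given paired arrays of shape (n_samples, d)."""
    diff = x_true - x_est
    return np.mean(np.sum(diff**2, axis=1)) / x_true.shape[1]

# Hypothetical decoder outputs: both estimates carry the same noise level,
# but the stand-in "MAP" decode carries an additional constant bias.
d, n = 50, 200
x = rng.standard_normal((n, d))                       # true stimuli
x_map = x + 0.5 * rng.standard_normal((n, d)) + 0.2   # stand-in MAP decodes
x_mean = x + 0.5 * rng.standard_normal((n, d))        # stand-in mean decodes

err_map = avg_squared_error_per_dim(x, x_map)
err_mean = avg_squared_error_per_dim(x, x_mean)
```

With enough sample pairs, the per-dimension error estimates stabilize, which is what makes the kind of comparison plotted in Figure 7 feasible.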
Figure 8 shows the spike trains, as well as the corresponding true inputs and MAP and posterior mean estimates. The closeness of the posterior mean to the MAP (the L2 norm of their difference is only about 9% of the L2 norm of the MAP) is an indication of the accuracy of the Laplace approximation in this case.
5. Other Applications: Estimation of Nonmarginal Quantities
So far we have focused on using the MCMC samples to estimate the posterior mean or the posterior covariance. Both quantities involve separate averaging over the marginal distribution of single components or pairs of components of x. However, since MCMC provides samples from the joint posterior distribution, we can also calculate quantities that cannot be reduced to averages over one- or two-dimensional marginal distributions and instead involve the whole joint distribution. We consider two examples below.
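As a toy sketch of this distinction, the snippet below treats an array of MCMC draws (here replaced by synthetic random-walk trajectories, purely for illustration) and computes both a marginal quantity (the posterior mean of each component) and a nonmarginal one (the posterior probability that the trajectory ever exceeds a threshold), which is a function of the joint distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for MCMC draws from a joint posterior p(x | r): one trajectory
# per row (synthetic gaussian random walks, for illustration only).
samples = rng.standard_normal((5000, 100)).cumsum(axis=1) * 0.1

# Marginal quantity: the posterior mean of each component x_t separately.
post_mean = samples.mean(axis=0)

# Nonmarginal quantity: the posterior probability that the trajectory ever
# exceeds a threshold; this depends on the whole path jointly and cannot be
# recovered from the one-dimensional marginals alone.
threshold = 1.0
p_cross = np.mean(samples.max(axis=1) > threshold)
```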
5.1. Posterior Statistics of Crossing Times.
One important example of these nonmarginal computations involves the statistics (e.g., mean and variance) of some crossing time for the time series x (e.g., the time that xt first crosses some threshold value). (First-passage time computations are especially important, for example, in the context of integrate-and-fire-based neural encoding models; see Paninski, Iyengar, Kass, & Brown, 2008.) Smith et al. (2004) proposed a hidden state-space model that provides a dynamical description of the learning process of an animal in a task-learning experiment (with binary responses) and yields suitable statistical indicators for establishing the occurrence of learning or determining the “learning trial.” In the proposed model, the state variable, xt, evolves according to a gaussian random walk from trial to trial (labeled by t), and the probability of a correct response on every trial, qt, is given by a logistic function of the corresponding state variable, xt. Given the observation of the responses in all trials, the hidden state variable trajectory can be inferred. Smith et al. (2007) carried out this inference in Bayesian fashion by using Gibbs sampling from the posterior distribution over the state variable time series and the model parameters conditioned on the observed responses. There, the learning trial was defined as the first trial after which the ideal (Bayesian) observer can state with 95% confidence that the animal will perform better than chance. More mathematically, using MCMC samples (obtained with the winBUGS package), they computed the sequence of lower 95% confidence bounds for qt for all t (for each t, this bound depends on only the one-dimensional marginal distribution of qt). The authors defined the learning trial as the t at which the value of this lower confidence bound crosses the probability value corresponding to chance performance and stays above it in all the following trials.
However, it is reasonable to consider several alternative definitions of the learning trial in this setting. One plausible approach is to define the learning trial, tL, in terms of certain passage times of qt, for example, the trial in which qt first exceeds the chance level and does not become smaller than this value in later trials. In this definition, tL is a random variable whose value is not known by the ideal observer with certainty, and its statistics are determined by the full joint posterior distribution and cannot be obtained from its marginals. The posterior mean of tL provides an estimate of this quantity, and its posterior variance an estimate of its uncertainty. These quantities involve nonlinear expectations over the full joint posterior distribution and can be estimated using the MCMC samples from that distribution.
Figure 9 shows a simulated example in which we used our MCMC methods to decode the crossing times of the input to a Poisson neuron based on the observation of its spike train. The neuron's rate was an increasing function of the input, and the threshold corresponded to the value xt = x0. The hidden process xt was assumed to evolve according to a gaussian AR(1) process, as in Smith et al. (2007). After a spike train was observed, samples from the posterior distribution of xt were obtained by an HMC chain. To estimate the first and the last times that xt crosses x0 from below, we calculated these times for each MCMC sample, obtaining samples from the posterior distribution of these times, and then calculated their sample means to estimate the crossing times. Figure 9 shows the full histograms of these passage times, emphasizing that these statistics are not fully determined by a single observation of the spike train.
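A minimal sketch of this crossing-time computation is given below, with synthetic gaussian random-walk trajectories standing in for the HMC posterior samples of xt; the threshold x0 and the sample sizes are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

def first_crossing_time(x, x0):
    """Index of the first time the trajectory reaches level x0;
    len(x) serves as a 'never crossed' sentinel."""
    above = np.nonzero(x >= x0)[0]
    return above[0] if above.size else len(x)

# Stand-in for HMC samples from the posterior over the state trajectory.
T, S = 200, 2000
samples = 0.2 * rng.standard_normal((S, T)).cumsum(axis=1)

# Compute the crossing time for each posterior sample, yielding samples
# from the posterior distribution of the crossing time itself.
x0 = 0.5
t_cross = np.array([first_crossing_time(x, x0) for x in samples])

# Posterior mean and variance of the crossing time (in time-step units);
# np.histogram(t_cross) would give the full posterior histogram.
t_mean, t_var = t_cross.mean(), t_cross.var()
```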
As a side note, to obtain a comparison between the performance of the Gibbs-based winBUGS package employed in Smith et al. (2007) and the HMC chain used here, we simulated a Gibbs chain for xt on the same posterior distribution. The estimated correlation time of the Gibbs chain was roughly a hundred times larger—that is, Gibbs mixed a hundred times more slowly than the HMC chain here, due to the nonnegligible temporal correlations in xt (see Figure 9; recall Figure 3). In addition, due to the state-space nature of the prior on xt here, the Hessian of the log posterior on x is tridiagonal, and therefore the HMC update requires just O(T) time, just like a full Gibbs sweep.
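The linear scaling of these updates rests on the banded structure of the Hessian: a tridiagonal system can be factorized and solved in time proportional to T rather than cubically. A sketch using SciPy's banded Cholesky routines follows; the matrix entries are arbitrary illustrative values chosen to be positive definite, not an actual log-posterior Hessian.

```python
import numpy as np
from scipy.linalg import cholesky_banded, cho_solve_banded

# A tridiagonal, positive-definite stand-in for the log-posterior Hessian J.
T = 1000
main = 2.0 * np.ones(T)        # diagonal entries
off = -0.9 * np.ones(T - 1)    # off-diagonal entries

# Upper banded storage: row 0 holds the superdiagonal, row 1 the diagonal.
J_banded = np.zeros((2, T))
J_banded[0, 1:] = off
J_banded[1, :] = main

rhs = np.random.default_rng(3).standard_normal(T)
chol = cholesky_banded(J_banded)          # O(T) banded Cholesky factorization
x = cho_solve_banded((chol, False), rhs)  # O(T) solve of J x = rhs
```

The same banded factorization serves both the Newton steps used to find the MAP and the preconditioning of the MCMC proposals.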
5.2. Mutual Information.
Our second example is the calculation of the mutual information. Estimates of the information transfer rates of neural systems, and of the mutual information between the stimulus and the response of some neural population, are essential in the study of neural encoding and decoding problems (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Warland, Reinagel, & Meister, 1997; Barbieri et al., 2004). Estimating this quantity is often computationally quite difficult, particularly for high-dimensional stimuli and responses (Paninski, 2003). Pillow et al. (2011), in the companion article, present an easy and efficient method for calculating the mutual information for neurons modeled by the GLM, equations 2.3 and 2.4, based on the Laplace approximation, equation 2.8. As discussed above, this approximation is expected to hold in the case of gaussian priors, in a broad region of the GLM parameter space. Our goal here is to verify this intuition by comparing the Laplace approximation for the mutual information with an exact direct estimation using MCMC integration. As we will see, the main difficulty in using MCMC to estimate the mutual information lies in the fact that we can calculate the posterior density only up to an unknown normalization constant. Estimating this unknown constant turns out to be tricky, in that naive methods for calculating it lead to large sampling errors. Below, we use an efficient low-error method, known as bridge sampling, for estimating this constant.
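To make the bridge sampling idea concrete, here is a sketch of the iterative estimator of Meng and Wong (1996) on a one-dimensional toy problem where the true normalization constant is known. The unnormalized "posterior" (a standard gaussian kernel with Z = sqrt(2*pi)) and the gaussian proposal are stand-ins chosen for illustration, not the actual GLM posterior.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)

# Unnormalized target density p_tilde and a normalized proposal q with good
# overlap; the true normalizing constant of p_tilde is sqrt(2*pi).
log_p_tilde = lambda x: -0.5 * x**2
q = norm(loc=0.5, scale=1.5)

n = 20_000
x_p = rng.standard_normal(n)                 # stand-in "MCMC" samples from p
x_q = q.rvs(size=n, random_state=rng)        # independent samples from q

l1 = np.exp(log_p_tilde(x_p)) / q.pdf(x_p)   # density ratios at p-samples
l2 = np.exp(log_p_tilde(x_q)) / q.pdf(x_q)   # density ratios at q-samples

# Meng-Wong fixed-point iteration for Z with equal sample sizes.
s1 = s2 = 0.5
Z = 1.0
for _ in range(200):
    num = np.mean(l2 / (s1 * l2 + s2 * Z))
    den = np.mean(1.0 / (s1 * l1 + s2 * Z))
    Z = num / den
```

The iteration converges quickly when p and q overlap well, which is why shaping q by a gaussian (Laplace) approximation to the posterior is a natural choice.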
Figure 10 shows a plot of the Laplace-based estimate of the mutual information and of its error relative to the MCMC-based estimate, per stimulus dimension, calculated as described above, as a function of the standard deviation of the filtered stimulus input (which, for the white noise stimulus, is the contrast, c, times the magnitude of ki(t)). The mutual information grows with this input strength, but the error of the Laplace approximation does not change significantly and remains small. Thus, the Laplace approximation for the mutual information is accurate for moderately large input strengths. Furthermore, for vanishing input strength, the posterior is equal to the gaussian prior in this case, and the approximation is exact. Therefore, the error has a maximum at a finite value of the stimulus filter strength, away from which the Laplace approximation is accurate. For comparison with our real spike data example presented in section 4 and Figure 8, we note that in that case, the standard deviation of the filtered stimulus to different cells took values in a comparable range, depending on the cell, and the Laplace approximation did indeed provide an accurate approximation for the mutual information.
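As a sanity check on the entropy-difference form that underlies such Laplace estimates of mutual information, consider a hypothetical linear-gaussian model (ours, not the GLM of the text), for which the Laplace approximation is exact and the mutual information is known in closed form.

```python
import numpy as np

# Linear-gaussian toy model: x ~ N(0, s2), r = x + noise with noise ~ N(0, n2).
# The posterior is gaussian with a constant variance, so the Laplace estimate
# I = 0.5*log(prior var) - 0.5*log(posterior var) is exact here.
s2, n2 = 4.0, 1.0
post_var = 1.0 / (1.0 / s2 + 1.0 / n2)       # exact (and Laplace) posterior var

I_laplace = 0.5 * np.log(s2) - 0.5 * np.log(post_var)
I_exact = 0.5 * np.log(1.0 + s2 / n2)        # standard gaussian-channel formula
```

In the GLM setting, the posterior variance is replaced by the inverse log-posterior Hessian at the MAP and must be averaged over responses, which is where the Laplace approximation stops being exact and the MCMC check above becomes informative.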
6. Effect of Uncertainty in the Model Parameters
where the inequality followed from the fact that the operators involved are positive definite. Thus, we see that, at least in the special regime that we considered, parameter uncertainty will shrink the norm of the posterior mean estimate, sending it toward the prior mean at the origin. This result is in agreement with the intuition stated above and was corroborated by our numerical results in more general parameter regimes.
Markov chain Monte Carlo allows the calculation of general, fully Bayesian posterior estimates. The main goal of this article was to survey the performance of a number of efficient MCMC algorithms in the context of model-based neural decoding of spike trains. Using these methods, we also verified and extended the results of Pillow et al. (2011) in the companion article on MAP-based decoding and information estimation, via Laplace approximation, in GLM-based neural decoding problems. Although MCMC integration is more general in this sense, it is at the same time significantly more computationally expensive than the optimization algorithms used to find the MAP. As we explained in section 2, the MAP is in general a good estimator when the Laplace approximation is accurate. The MAP also comes with natural error bars estimated through the Hessian matrix of the log posterior at MAP, equation 2.9. Furthermore, when it is valid, this approximation provides a very efficient way of estimating the mutual information through equation 5.3. Thus, it is important to have a clear knowledge of when this approximation holds, since when it does, it can be exploited to dramatically reduce the computational cost of stimulus decoding or information estimation.
In section 3, we introduced the RWM, HMC, Gibbs, and hit-and-run Markov chains, all special cases of the Metropolis-Hastings algorithm. Although these methods allow sampling from general posterior distributions, regardless of the forward model, we also took advantage of the specific properties of the distributions involved in our GLM-based decoding to increase the efficiency of these chains. The ARS algorithm, which exploits the log-concavity of the GLM likelihood and the prior distribution, was used to significantly reduce the computational cost of the one-dimensional sampling in each hit-and-run step. We took advantage of the Laplace approximation (or a regularized version of it in the flat prior case) to shape the proposal distributions to roughly match the covariance structure of the underlying distribution. Furthermore, we were able to carry this out in O(T) computational time (i.e., scaling only linearly with the stimulus duration, T) by exploiting the bandedness of the log-posterior Hessian in these settings. To the best of our knowledge, the use of O(T), Laplace-enhanced HMC in neural applications is novel. Similarly, although the hit-and-run algorithm is well known in the statistics literature, we are unaware of any previous application of it in the context of high-dimensional neural decoding.
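To illustrate what Laplace-based preconditioning buys, the sketch below runs a minimal HMC sampler on an anisotropic gaussian target, using the Hessian of the negative log density at the mode as the mass matrix. The target, step sizes, and chain length are arbitrary illustrative choices, not a GLM posterior or the tuned samplers of the text.

```python
import numpy as np

rng = np.random.default_rng(5)

d = 10
scales = np.logspace(0, 2, d)        # target std devs span 1 to 100 (badly conditioned)
C_inv = np.diag(1.0 / scales**2)     # Hessian of -log p at the mode
M = C_inv                            # Laplace preconditioning: mass matrix = Hessian
M_inv = np.diag(scales**2)

def grad_U(x):
    return C_inv @ x                 # gradient of -log p for the gaussian target

def hmc_step(x, eps=0.15, n_leap=10):
    p = rng.multivariate_normal(np.zeros(d), M)   # momentum ~ N(0, M)
    x_new = x.copy()
    p_new = p - 0.5 * eps * grad_U(x_new)         # initial half step
    for _ in range(n_leap):                       # leapfrog integration
        x_new = x_new + eps * (M_inv @ p_new)
        p_new = p_new - eps * grad_U(x_new)
    p_new = p_new + 0.5 * eps * grad_U(x_new)     # undo the surplus half step
    H_old = 0.5 * x @ grad_U(x) + 0.5 * p @ (M_inv @ p)
    H_new = 0.5 * x_new @ grad_U(x_new) + 0.5 * p_new @ (M_inv @ p_new)
    return x_new if np.log(rng.random()) < H_old - H_new else x

x = np.zeros(d)
samples = np.empty((3000, d))
for s in range(3000):                # first few hundred draws serve as burn-in
    x = hmc_step(x)
    samples[s] = x
```

With the mass matrix set to the Hessian, every direction of the target oscillates at the same frequency under the Hamiltonian dynamics, so a single step size serves all dimensions; with the identity mass matrix, the step size would be dictated by the stiffest direction and mixing along the broad directions would be orders of magnitude slower.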
We mention that these chains with O(T), Laplace-based enhancement can also be implemented in decoding posterior distributions based on state-space models with Markovian structure; an example of such an application was presented in Figure 9, based on the state-space model used in Smith et al. (2007). However, in cases where the log posterior turns out to be nonconcave, obtaining the Laplace approximation may be infeasible, or it may not improve the chain's mixing. Although MCMC without this enhancement is still applicable in such cases, other methods, such as sequential Monte Carlo ("particle filtering"; Doucet, de Freitas, & Gordon, 2001; Brockwell et al., 2004; Kelly & Lee, 2004; Godsill, Doucet, & West, 2004; Shoham et al., 2005; Ergun et al., 2007; Vogelstein et al., in press; Huys & Paninski, 2009), which are applicable only in models with Markovian structure, may prove to be more efficient.
We compared the mixing rates of these MCMC chains in sampling from the posterior stimulus distributions for GLM-modeled neurons. In this setting, when the posterior is smooth throughout its support, the HMC algorithm outperformed the other chains by an order of magnitude. On the other hand, when sampling from posteriors based on flat priors with sharp corners, the hit-and-run chain mixed consistently faster than the others.
In section 4, we compared the performance of the MAP and the posterior mean, calculated using MCMC, in different settings. In one example, we decoded simulated spike trains (generated in response to gaussian and flat white noise stimuli) in a range of stimulus input strengths and for different numbers of identical cells. We also decoded the filtered stimulus input to six retinal ganglion cells, based on their experimentally recorded spike trains. The average squared errors of the MAP and mean estimates were in general quite close in the case of gaussian stimuli, justifying MAP decoding in this case. In the flat prior case, however, the posterior mean can often have a much smaller average squared error than the MAP.
In section 5, we applied MCMC to the problem of estimating properties of the joint distribution that cannot be obtained from its low-dimensional marginals. In particular, we investigated the reliability of the Laplace approximation for the mutual information between the stimulus and spike trains (model-based calculations of the mutual information with gaussian priors have been previously presented in Barbieri et al., 2004). We found that the Laplace approximation for the mutual information was adequate in the case of gaussian priors, except in a small range of moderate stimulus input strengths.
In the previous section, we dealt with the effect of uncertainty in the GLM parameters (e.g., based on fits to experimental data) on decoding. Intuitively, it is expected that when the forward model parameters become uncertain, information coming from the spike train, and hence the likelihood, becomes less reliable, and therefore the estimate will rely more heavily on the prior information. Thus, the posterior mean is expected to revert toward the prior mean as parameter uncertainty increases. We verified this intuition analytically in the special case of localized stimulus filters (with no bandpass filtering) and small baseline firing rates. Our numerics showed that the main systematic effect of increasing parameter uncertainty on the mean estimate is to shrink its magnitude (thus sending it toward the origin, the prior mean in our case) in a wide range of parameter values.
The methods developed in this article and in Pillow et al. (2011), the companion article, can be used for a variety of applications. In future work we plan to further apply these techniques to other experimental data and to compare different “codebooks” (as mentioned in section 1) based on different reductions of the full spike trains according to their robustness and fidelity.
Thanks to M. Kennel, E. Simoncelli, E.J. Chichilnisky, T. Teravainen, and K. Rahnama Rad for helpful conversations. We are grateful to Y. Ding for her implementation of the bridge sampling method. Y.A. is supported by the Robert Leet and Clara Guthrie Patterson Trust Postdoctoral Fellowship, Bank of America, Trustee. L.P. is supported by NEI R01 EY018003, an Alfred P. Sloan Research Fellowship, and the McKnight Scholar award. J.P. is supported by a Royal Society USA/Canada Research Fellowship.
The dimension of x is thus d=NT, where T is the number of time steps and N is the total number of components at each time step.
That is, the kernels ki(t, n) and hij(t) vanish for t<0.
We note that even though the nonlinearity has to be an increasing function, with appropriately chosen negative postspike feedback filters, hii, the mean firing rate of the GLM-modeled neurons can still exhibit saturation as a function of the input strength, x.
That is, for fixed model parameters, it is a concave function of x, and vice versa, but in general it is not a concave function of the stimulus and parameters jointly.
Let us mention, however, that no first principle dictates that the posterior distribution over a biologically or behaviorally relevant variable (e.g., an external variable that a part of the brain seeks to estimate) should be log concave. In fact, distributions that are not log concave, such as multimodal or very fat-tailed distributions, can be highly relevant in biological settings.
White or correlated gaussian priors are often used in neural applications (e.g., gaussian stimuli are widely used in neurophysiological experiments). Flat priors with infinitely sharp boundaries are less biologically motivated. However, flat priors are the best log-concave approximation to binary priors, which are also quite common in sensory physiology, in the form of both binary white noise and M-sequences (Pillow et al., 2008; Reid, Victor, & Shapley, 1997). In this article, we consider the flat prior mainly as a limiting case of concave log priors with sharp derivatives when we check the accuracy of the Laplace approximation and compare the efficiency of various MCMC chains (see section 3.5) in different regimes.
We are, of course, interested in calculating posterior expectations conditioned on the observed response, but as the discussion is general, we use a generic notation in the rest of this section for ease of exposition.
As with the Gibbs case, it can be shown again that this proposal leads to an MH acceptance probability of one. Hence, hit-and-run is also a special case of MH.
Although this prior belongs to the class of product densities considered in Roberts and Rosenthal (1998), it does not satisfy the stringent smoothness conditions crucial for the part of their theorem regarding the (fast) mixing of MALA.
More precisely, “sharply concentrated” here means that the curvature of the log posterior is large enough that the Taylor expansion of the log posterior involved in the Laplace approximation, equation 2.8, is accurate for deviations from the MAP on a scale determined by the inverse square root of the smallest eigenvalue of the Hessian matrix, equation 2.9.
This is obviously not the case, however, for parameter directions along which the data are noninformative and the likelihood function does not vary much. In the RGC case, these correspond to stimulus features (directions in the stimulus space) that are orthogonal to the cells' spatiotemporal filters, Ki, and are hence filtered out.
For a similar reason, a “brute force” method for computing Z, such as a simple Monte Carlo integration of the high-dimensional distribution , gives rise to an estimate with slow convergence and a large error (Meng & Wong, 1996).
In principle, according to equations 5.1 to 5.3, one has to average the Laplace and MCMC information estimates over the marginal response distribution p(r). But due to the intensive nature of these quantities and the large d, they depended only weakly on the specific realization of r, and they were to a good degree self-averaging, so that averaging over p(r) would not considerably alter the plot in Figure 10. A similar argument appeared in Strong, Koberle, de Ruyter van Steveninck, and Bialek (1998).
This is true when the relevant quantity is well defined.