## Abstract

Stimulus reconstruction or decoding methods provide an important tool for understanding how sensory and motor information is represented in neural activity. We discuss Bayesian decoding methods based on an encoding generalized linear model (GLM) that accurately describes how stimuli are transformed into the spike trains of a group of neurons. The form of the GLM likelihood ensures that the posterior distribution over the stimuli that caused an observed set of spike trains is log concave so long as the prior is. This allows the maximum a posteriori (MAP) stimulus estimate to be obtained using efficient optimization algorithms. Unfortunately, the MAP estimate can have a relatively large average error when the posterior is highly nongaussian. Here we compare several Markov chain Monte Carlo (MCMC) algorithms that allow for the calculation of general Bayesian estimators involving posterior expectations (conditional on model parameters). An efficient version of the hybrid Monte Carlo (HMC) algorithm was significantly superior to other MCMC methods for gaussian priors. When the prior distribution has sharp edges and corners, on the other hand, the “hit-and-run” algorithm performed better than other MCMC methods. Using these algorithms, we show that for this latter class of priors, the posterior mean estimate can have a considerably lower average error than MAP, whereas for gaussian priors, the two estimators have roughly equal efficiency. We also address the application of MCMC methods for extracting nonmarginal properties of the posterior distribution. 
For example, by using MCMC to calculate the mutual information between the stimulus and response, we verify the validity of a computationally efficient Laplace approximation to this quantity for gaussian priors in a wide range of model parameters; this makes direct model-based computation of the mutual information tractable even in the case of large observed neural populations, where methods based on binning the spike train fail. Finally, we consider the effect of uncertainty in the GLM parameters on the posterior estimators.

## 1. Introduction

Understanding the exact nature of the neural code is a central goal of theoretical neuroscience. Neural decoding provides an important method for comparing the fidelity and robustness of different codes (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). The decoding problem, in its general form, is the problem of estimating the relevant stimulus, **x**, that elicited the observed spike trains, **r**, of a population of neurons over the course of time. Neural decoding is also of crucial importance in the design of neural prosthetic devices (Donoghue, 2002).

A large literature exists on developing and applying different decoding methods to spike train data, in both single-cell and population decoding. Bayesian methods lie at the basis of a major group of these decoding algorithms (Sanger, 1994; Zhang, Ginzburg, McNaughton, & Sejnowski, 1998; Brown, Frank, Tang, Quirk, & Wilson, 1998; Maynard et al., 1999; Stanley & Boloori, 2001; Shoham et al., 2005; Barbieri et al., 2004; Wu et al., 2004; Brockwell, Rojas, & Kass, 2004; Kelly & Lee, 2004; Karmeier, Krapp, & Egelhaaf, 2005; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005; Jacobs, Grzywacz, & Nirenberg, 2006; Yu et al., 2009; Gerwinn, Macke, & Bethge, 2009; see also the companion article in this issue: J. Pillow, Y. Ahmadian, & L. Paninski, “Model-based decoding, information estimation, and change-point detection in multineuron spike trains,” referred to hereafter as the companion article, Pillow et al., 2011). In such methods, the a priori distribution of the sensory signal, *p*(**x**), is combined, via Bayes' rule, with an encoding model describing the probability, *p*(**r**|**x**), of different spike trains given the signal, to yield the posterior distribution, *p*(**x**|**r**), which carries all the information contained in the observed spike train responses about the stimulus. A Bayesian estimate is one that, given a definite cost function on the amount of error, minimizes the expected error cost under the posterior distribution. Assuming the prior distribution and the encoding model are appropriately chosen, the Bayes estimate is thus optimal by construction. Furthermore, since the Bayesian approach yields a distribution over the possible stimuli that could lead to the observed response, Bayes estimates naturally come equipped with measures of their reliability or posterior uncertainty.

In a fully Bayesian approach, one has to be able to evaluate any desired functional of the high-dimensional posterior distribution. Unfortunately, calculating these can be computationally very expensive. For example, most Bayesian estimates involve integrations over the (often very high-dimensional) space of possible signals. Accordingly, most work on Bayesian decoding of spike trains has focused on cases where the signal is low dimensional (Sanger, 1994; Maynard et al., 1999; Abbott & Dayan, 1999; Karmeier et al., 2005) or on situations where the joint distribution, *p*(**x**, **r**), has a certain Markov tree decomposition, so that computationally efficient recursive techniques may be applied (Zhang et al., 1998; Brown et al., 1998; Barbieri et al., 2004; Wu et al., 2004; Brockwell et al., 2004; Kelly & Lee, 2004; Shoham et al., 2005; Eden, Frank, Barbieri, Solo, & Brown, 2004; Truccolo et al., 2005; Ergun, Barbieri, Eden, Wilson, & Brown, 2007; Yu et al., 2009; Paninski et al., 2010). The Markov setting is extremely useful; it lends itself naturally to many problems of interest in neuroscience and has thus been fruitfully exploited. In particular, this setting is very useful in an important class of decoding problems where stimulus estimation is performed online: the stimulus at some time, *t*, is estimated conditioned on the observation of the spike trains only up to that time, as opposed to the entire spike train.

Some decoding problems cannot be formulated in the online estimation framework. In such cases, quantities of interest should naturally be conditioned on the entire history of the spike train. In this article, we focus on this latter class of problems (although many of the methods we discuss can potentially be adapted to the online case as well). Furthermore, many decoding problems of interest are awkward to cast in the Markov setting. A more general method that does not require such tree decomposition properties is to calculate the maximum a posteriori (MAP) estimate (Stanley & Boloori, 2001; Jacobs et al., 2006; Gerwinn et al., 2009; see the companion article, Pillow et al., 2011, for further review and discussion). The MAP estimate requires no integration, only maximization of the posterior distribution, and can remain computationally tractable even when the stimulus space is very high dimensional. This is the case for general log-concave posterior distributions; many problems in sensory and motor coding fall in this class (it should be noted, however, that in many cases of interest where this condition is not satisfied, for example, when the distributions are inherently multimodal, posterior maximization can become highly intractable). The MAP is a good estimator when the posterior is well approximated by a gaussian distribution centered at the MAP itself (Tierney & Kadane, 1986; Kass, Tierney, & Raftery, 1991). Because the mode and the mean of a gaussian distribution are identical, the MAP in this case is approximately equal to the posterior mean as well. This gaussian approximation is expected to be sufficiently accurate, for example, when the prior distribution and the likelihood function (i.e., as a function of **x**) are not very far from gaussian or when the likelihood is sharply concentrated around the posterior mode. However, when the prior distribution has sharp boundaries and corners and the likelihood function does not constrain the estimate away from such nongaussian regions, the gaussian approximation can fail, resulting in a large average error in the MAP estimate. In such cases, one expects the MAP to be inferior to the posterior mean, *E*[**x**|**r**], which is the optimal estimate under squared error loss.

Accordingly, in section 3 of this article, we develop efficient Markov chain Monte Carlo (MCMC) techniques for sampling from general log-concave posterior distributions and compare their performance in situations relevant to our neural decoding setting (for comprehensive introductions to MCMC methods, including their application in Bayesian problems, see, e.g., Robert & Casella, 2005, and Gelman, 2004). By providing a tool for approximating averages (integrals) over the exact posterior distribution, *p*(**x**|**r**, θ) (where θ are the parameters of the encoding forward model, in principle obtained by fitting to experimental data), these techniques allow us to calculate general Bayesian estimates, such as the posterior mean, and provide estimates of their uncertainty. Although in principle many of the MCMC methods we discuss are applicable even to posterior distributions that are not log concave, they may lose their efficiency in such cases, and estimates based on them may not even converge to true posterior averages. In section 4 we compare the MAP and the posterior mean stimulus estimates based on the simulated response of a population of retinal ganglion cells (RGCs). In section 5 we discuss the applications of MCMC for calculating more complicated properties of *p*(**x**|**r**) beyond marginal statistics, such as the statistics of first-passage times. We also discuss an MCMC-based method known as bridge sampling (Bennett, 1976; Meng & Wong, 1996) that provides a tool for a direct calculation of the mutual information. Using this technique, we show that for gaussian priors, the estimates of Pillow et al. (2011) in the companion article for this quantity, based on the Laplace approximation, are robust and accurate. Finally, in section 6, we discuss the effect of uncertainty in the parameters of the forward model, θ, on the MAP and posterior mean estimates. We proceed by first introducing the forward model used to calculate the likelihood *p*(**r**|**x**, θ).

## 2. The Encoding Model, the MAP, and the Stimulus Ensembles

In this section, we give an overview of neural encoding models based on generalized linear models (GLM) (Brillinger, 1988; McCullagh & Nelder, 1989; Paninski, 2004; Truccolo et al., 2005), and briefly review the treatment of Pillow et al. (2011) in the companion article for MAP-based decoding. (Note that much of the material in this section is covered in this companion article, but we include a brief review here to make this article self-contained.) A neural encoding model assigns a conditional probability to the neural response given the stimulus. We take the stimulus to be an artificially discretized, possibly multicomponent, function of time, *x*(*t*,*n*), which will be represented as a *d*-dimensional vector **x**.^{1}

Given the stimulus **x**, the *i*th neuron emits a spike train response, $\mathbf{r}_i(t) = \sum_{\ell} \delta(t - t_{i,\ell})$, where $t_{i,\ell}$ is the time of the $\ell$th spike of the *i*th neuron. We represent this function by $\mathbf{r}_i$ (we use boldface symbols for both continuous time and discretized, finite-dimensional vectors) and the collection of response data of all cells by **r**. The response, **r**, is not fully determined by **x** and is subject to trial-to-trial variations. We model **r** as a point process whose instantaneous firing rate is the output of a GLM (Brillinger, 1988; McCullagh & Nelder, 1989; Paninski, 2004). This class of models has been extensively discussed in the literature. Briefly, it is a generalization of the popular linear-nonlinear-Poisson model that includes feedback and interaction between neurons, with parameters that have natural neurophysiological interpretations (Simoncelli, Paninski, Pillow, & Schwartz, 2004) and has been applied in a wide variety of experimental settings (Brillinger, 1992; Dayan & Abbott, 2001; Chichilnisky, 2001; Theunissen et al., 2001; Brown, Barbieri, Eden, & Frank, 2003; Paninski, Fellows, Shoham, Hatsopoulos, & Donoghue, 2004; Truccolo et al., 2005; Pillow et al., 2008). The model gives the conditional (on the stimulus, as well as the history of the observed spike train) instantaneous firing rate of the *i*th observed cell as

$$\lambda_i(t) = f\Big(b_i + \sum_{n} \int k_i(t - t', n)\, x(t', n)\, dt' + \sum_j \int_0^{\infty} h_{ij}(t')\, r_j(t - t')\, dt'\Big), \tag{2.1}$$

which we write more concisely as

$$\lambda_i = f\Big(b_i + K_i \mathbf{x} + \sum_j H_{ij}\mathbf{r}_j\Big). \tag{2.2}$$

Here, the linear operators (filters) $K_i$ and $H_{ij}$ have causal,^{2} time-translation-invariant kernels $k_i(t, n)$ and $h_{ij}(t)$ (we note that the causality condition for $k_i(t, n)$ is true only for sensory neurons). The kernel $k_i(t, n)$ represents the *i*th cell's linear receptive field, and the $h_{ij}(t)$ describe possible excitatory or inhibitory postspike effects of the *j*th observed neuron on the *i*th. The diagonal components $h_{ii}$ describe the postspike feedback of the neuron to itself and can account for refractoriness, adaptation, and burstiness, depending on their shape (see Paninski, 2004, for details). The constant $b_i$ is the DC bias of the *i*th cell, such that $f(b_i)$ may be considered the *i*th cell's constant baseline firing rate. Finally, $f(\cdot)$ is a nonlinear, nonnegative, increasing function.^{3} Figure 1B shows a schematic diagram of the GLM encoding model for two cells.

Given the rate functions $\lambda_i$, the point-process log-likelihood of the observed spike trains is

$$\log p(\mathbf{r}\,|\,\mathbf{x}, \theta) = \sum_i \Big( \sum_{\ell} \log \lambda_i(t_{i,\ell}) - \int \lambda_i(t)\, dt \Big) + \mathrm{const.}, \tag{2.4}$$

where $\theta = \{b_i, k_i, h_{ij}\}$ denotes the set of GLM parameters; the log-likelihood may be viewed as a function of either **x** or θ. We will restrict ourselves to *f*(*u*) that are convex and log concave (e.g., this is the case for *f*(*u*) = exp(*u*)). Then the log-likelihood function is guaranteed to be a separately concave function of either the stimulus **x** or the model parameters, θ,^{4} regardless of the observed spike data **r**. The log concavity with respect to the model parameters makes maximum likelihood fitting of this model very easy, as concave functions on convex parameter spaces have no nonglobal local maxima. Therefore, simple gradient ascent algorithms can be used to find the maximum likelihood estimate.

Given a prior distribution over stimuli, *p*(**x**), and having observed the spike trains, **r**, the posterior probability distribution over the stimulus is given by Bayes' rule,

$$p(\mathbf{x}\,|\,\mathbf{r}, \theta) = \frac{p(\mathbf{r}\,|\,\mathbf{x}, \theta)\, p(\mathbf{x})}{p(\mathbf{r}\,|\,\theta)}, \tag{2.5}$$

where

$$p(\mathbf{r}\,|\,\theta) = \int p(\mathbf{r}\,|\,\mathbf{x}, \theta)\, p(\mathbf{x})\, d\mathbf{x}. \tag{2.6}$$

The MAP estimate is, by definition,

$$\hat{\mathbf{x}}_{\mathrm{MAP}} = \arg\max_{\mathbf{x}}\, p(\mathbf{x}\,|\,\mathbf{r}, \theta) = \arg\max_{\mathbf{x}} \big[ \log p(\mathbf{r}\,|\,\mathbf{x}, \theta) + \log p(\mathbf{x}) \big]. \tag{2.7}$$

(Except in section 6, we will drop θ from the arguments of the distributions, it being understood that they are conditioned on the specific θ obtained from the experimental fit.) As discussed above, for the GLM nonlinearities that we consider, the likelihood, *p*(**r**|**x**), is log concave in **x**. If the prior, *p*(**x**), is also log concave, then the posterior distribution is log concave as a function of **x**, and its maximization (see equation 2.7) can also be achieved using simple gradient ascent techniques. The class of log-concave prior distributions is quite large, and it includes exponential, triangular, and general gaussian distributions, as well as uniform distributions with convex support.^{5}
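As a toy illustration of such gradient ascent (a hypothetical single-neuron, discrete-time example with *f*(*u*) = exp(*u*), a memoryless scalar filter, and a standard gaussian white-noise prior; all parameter values below are assumptions for this sketch, not taken from the article):

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical one-neuron GLM in discrete time with a memoryless filter:
# the rate in bin t is lam_t = exp(k * x_t + b), so each bin decouples.
d = 50                        # number of stimulus bins
k, b, dt = 1.2, -1.0, 0.01    # filter gain, DC bias, bin width (assumed)
x_true = rng.standard_normal(d)
counts = rng.poisson(np.exp(k * x_true + b) * dt)  # simulated spike counts

def grad_log_post(x):
    """Gradient of log p(x|r) = sum_t [n_t log lam_t - lam_t dt] - x.x/2."""
    lam = np.exp(k * x + b)
    return k * (counts - lam * dt) - x  # point-process likelihood + prior

# The log posterior is concave, so plain gradient ascent converges to the
# unique global maximum, i.e., the MAP estimate of equation 2.7.
x_map = np.zeros(d)
for _ in range(2000):
    x_map += 0.1 * grad_log_post(x_map)
```

At convergence the gradient vanishes, so `x_map` satisfies the first-order optimality condition of equation 2.7.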

When the posterior is sufficiently peaked, it can be approximated by a gaussian centered at the MAP estimate (the Laplace approximation),

$$p(\mathbf{x}\,|\,\mathbf{r}) \approx \sqrt{\frac{\det J}{(2\pi)^d}}\, \exp\!\Big( -\tfrac{1}{2}\, (\mathbf{x} - \hat{\mathbf{x}}_{\mathrm{MAP}})^\top J\, (\mathbf{x} - \hat{\mathbf{x}}_{\mathrm{MAP}}) \Big), \tag{2.8}$$

where *J* is the Hessian of the negative log posterior at $\hat{\mathbf{x}}_{\mathrm{MAP}}$,

$$J = -\nabla\nabla_{\mathbf{x}} \log p(\mathbf{x}\,|\,\mathbf{r})\, \big|_{\mathbf{x} = \hat{\mathbf{x}}_{\mathrm{MAP}}}. \tag{2.9}$$

Normally in the statistical setting, the Laplace approximation is formally justified in the limit of large samples due to the central limit theorem, leading to a likelihood function with a very sharp peak (in neural decoding, the meaning of “large samples” depends in general on the nature of the stimulus, a point we discuss further in section 4). However, this approximation often proves adequate even for moderately strong likelihoods as long as the posterior is not grossly nonnormal. An obvious case where the approximation fails is for strongly multimodal distributions where no particular mode dominates. Here, we restrict our attention to the class of log-concave posteriors, which are unimodal. For this class and for a smooth enough GLM nonlinearity, *f*(·), we expect equation 2.8 to hold for prior distributions that are close to normal, even when the likelihood is not extremely sharp. However, for flatter priors with sharp boundaries or “corners,” we expect it to fail unless the likelihood is narrowly concentrated away from such nongaussian regions.

In our decoding examples, we consider two classes of prior stimulus ensembles: gaussian,

$$p(\mathbf{x}) \propto \exp\!\big( -\tfrac{1}{2}\, \mathbf{x}^\top C^{-1} \mathbf{x} \big), \tag{2.10}$$

and flat,

$$p(\mathbf{x}) \propto \chi_{\mathcal{S}}(\mathbf{x}). \tag{2.11}$$

^{6} Here, *C* is the covariance matrix, and $\chi_{\mathcal{S}}(\mathbf{x})$ is the indicator function of a convex region, $\mathcal{S}$, of **R**^{d}. In particular, for the white-noise stimuli we consider in section 4, $C = c^2 I$ in the gaussian case, and $\mathcal{S}$ is the *d*-dimensional cube centered at the origin in the flat case (this choice of $\mathcal{S}$ corresponds to a uniformly distributed white noise stimulus). Here, *c* is the standard deviation of the stimulus on a subinterval, and in the case where *x*(*t*) is the normalized light intensity (with the average luminosity removed), it is referred to as the contrast. We compare the performance of the MAP and posterior mean estimates in each case in section 4. In section 5.2 we verify the adequacy of this approximation for the estimation of the mutual information in the case of gaussian priors.

## 3. Monte Carlo Techniques for Bayesian Estimates

The full posterior distribution, *p*(**x**|**r**), carries all the information about the stimulus contained in the observed spike trains, **r**. However, a much simpler (and therefore less complete) representation of this information can be provided by a so-called Bayesian estimate for the stimulus, possibly accompanied by a corresponding estimate of its error. A commonly used Bayesian estimate is the posterior mean,

$$\hat{\mathbf{x}}_{\mathrm{mean}} = E[\mathbf{x}\,|\,\mathbf{r}] = \int \mathbf{x}\, p(\mathbf{x}\,|\,\mathbf{r})\, d\mathbf{x}, \tag{3.1}$$

which is the optimal estimator with respect to average square error. The uncertainty of this estimator is in turn provided by the posterior covariance matrix. When the posterior distribution can be reasonably approximated as gaussian, the posterior mean can be approximated by its mode, that is, the MAP estimate, equation 2.7, and the inverse of the log-posterior Hessian, equation 2.9, can represent its uncertainty. We adopt here the posterior mean, $\hat{\mathbf{x}}_{\mathrm{mean}}$, as a benchmark for comparing the performance of the two estimates, and we take the deviation of the MAP from the latter as a measure of the validity of the gaussian approximation for the posterior distribution.

Evaluating the posterior mean, equation 3.1, requires integration over the high-dimensional space of stimuli, **x**. Computationally this is quite costly. The Monte Carlo method is based on the idea that if one could generate *N* independent and identically distributed (i.i.d.) samples, $\mathbf{x}_t$ ($t = 1, \ldots, N$), from a probability distribution, π(**x**),^{7} then one could approximate integrals involved in expectations such as equation 3.1 by sample averages. This is because, by the law of large numbers, for any *g*(**x**) (such that $E_\pi[|g(\mathbf{x})|] < \infty$),

$$\frac{1}{N} \sum_{t=1}^{N} g(\mathbf{x}_t) \;\longrightarrow\; E_\pi[g(\mathbf{x})], \qquad N \to \infty. \tag{3.2}$$

Also, to decide how many samples are sufficient, we may estimate the variance of the sample average itself; when *N* is large enough that this variance is sufficiently small, we may stop sampling. However, it is often quite challenging to sample directly from a complex multidimensional distribution, and the efficiency of methods yielding i.i.d. samples often decreases exponentially with the number of dimensions.
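This stopping rule can be sketched as follows (a minimal illustration; the helper `mc_expectation` and its parameter values are ours, not from the article):

```python
import numpy as np

def mc_expectation(sample, g, batch=1000, tol=0.01, max_n=10**6):
    """Estimate E[g(x)] from i.i.d. samples, growing the sample until the
    estimated standard error of the running mean drops below tol."""
    vals = np.empty(0)
    while vals.size < max_n:
        vals = np.concatenate([vals, g(sample(batch))])
        mean = vals.mean()
        stderr = vals.std(ddof=1) / np.sqrt(vals.size)  # SE of the average
        if stderr < tol:
            break
    return mean, stderr

# Example: E[x^2] = 1 exactly for x ~ N(0, 1).
rng = np.random.default_rng(0)
mean, stderr = mc_expectation(lambda n: rng.standard_normal(n),
                              lambda x: x**2)
```

The returned `stderr` is the estimate's own uncertainty, which is what the stopping criterion monitors.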

It is therefore usually far more efficient to construct a Markov chain of dependent samples, $\mathbf{x}_t$, whose equilibrium distribution is π(**x**). This is the idea behind the MCMC method based on the Metropolis-Hastings (MH) algorithm (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953; Hastings, 1970). In the general form of this algorithm, the Markov transitions are constructed as follows. Starting at point **x**, we first sample a point **y** from some “proposal” density $q(\mathbf{y}\,|\,\mathbf{x})$ and then accept this point as the next point in the chain, with probability

$$a(\mathbf{y}\,|\,\mathbf{x}) = \min\!\left\{ 1,\; \frac{\pi(\mathbf{y})\, q(\mathbf{x}\,|\,\mathbf{y})}{\pi(\mathbf{x})\, q(\mathbf{y}\,|\,\mathbf{x})} \right\}. \tag{3.4}$$

If **y** is rejected, the chain stays at point **x**, so that the conditional Markov transition probability, $T(\mathbf{y}\,|\,\mathbf{x})$, is given by

$$T(\mathbf{y}\,|\,\mathbf{x}) = a(\mathbf{y}\,|\,\mathbf{x})\, q(\mathbf{y}\,|\,\mathbf{x}) + r(\mathbf{x})\, \delta(\mathbf{y} - \mathbf{x}), \tag{3.5}$$

where $r(\mathbf{x}) = 1 - \int a(\mathbf{y}\,|\,\mathbf{x})\, q(\mathbf{y}\,|\,\mathbf{x})\, d\mathbf{y}$ is the rejection probability of proposals from **x**. The reason for accepting the proposals according to equation 3.4 is that doing so guarantees that π(**x**) is invariant under the Markov evolution (see, e.g., Robert & Casella, 2005, for details). It is important to note that from equation 3.4, to execute this algorithm we need to know π(**x**) only up to a constant, which is an advantage because often, particularly in Bayesian settings, normalizing the distribution itself requires the difficult integration for which we are using MCMC (we discuss a method of calculating the normalization constant in section 5.2).

The major drawback of the MCMC method is that the generated samples are dependent; it is thus harder to estimate how long we need to run the chain to get an accurate estimate, and in general we may need to run the chain much longer than in the i.i.d. case. Thus, we would like to choose a proposal density, *q*(·), that gives rise to a chain that explores the support of π(**x**) (i.e., mixes) quickly and has a small correlation time (roughly, the number of steps separating effectively i.i.d. samples), to reduce the number of steps the chain has to be iterated and hence the computational time (see section 3.5 and Gelman, 2004, and Robert & Casella, 2005, for further details). In general, a good proposal density should allow for large jumps with a high probability of falling in regions of larger π(**x**) (so as to avoid a high MH rejection rate). A good rule of thumb is for the proposals to resemble the true density as much as possible. We review a few useful well-known proposals and explore different ways of boosting their efficiency in the GLM-based neural decoding setting. We note here that these algorithms can be applied to general distributions and do not require the log-concavity condition for π(**x**). However, some of the enhancements that we consider can be implemented, or can be expected to boost the performance of the chain, only when the distribution is log concave (see the discussion of nonisotropic proposals in sections 3.1 and 3.2 and that of adaptive rejection sampling in section 3.4).
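A minimal sketch of the MH step (using a symmetric gaussian proposal, so that the *q* factors in the acceptance ratio of equation 3.4 cancel; the function name and parameter values are ours):

```python
import numpy as np

def metropolis_hastings(log_pi, x0, n_steps, scale=0.5, seed=0):
    """Random-walk Metropolis-Hastings chain targeting a density pi(x);
    only log pi(x) up to an additive constant is required."""
    rng = np.random.default_rng(seed)
    x = np.atleast_1d(np.asarray(x0, dtype=float))
    lp = log_pi(x)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        y = x + scale * rng.standard_normal(x.size)  # symmetric proposal
        lp_y = log_pi(y)
        # MH acceptance (equation 3.4); q(y|x) = q(x|y) cancels here.
        if np.log(rng.uniform()) < lp_y - lp:
            x, lp = y, lp_y                          # accept
        samples[t] = x                               # on rejection, stay put
    return samples

# Example: sample a standard gaussian from its unnormalized log density.
chain = metropolis_hastings(lambda x: -0.5 * np.sum(x**2), [3.0], 20000)
```

Note that the chain is started away from the mode (at 3.0), so early samples should be discarded as burn-in before computing averages.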

### 3.1. Nonisotropic Random Walk Metropolis (RWM).

In the random walk Metropolis chain, the proposal is obtained by perturbing the current point with a random jump drawn from a fixed symmetric density, *q*(·). Centered isotropic gaussian distributions are a simple choice, leading to proposals

$$\mathbf{y} = \mathbf{x} + \sigma \mathbf{z}, \tag{3.7}$$

where **z** is gaussian with zero mean and identity covariance, and σ determines the proposal jump scale. (In this simple form, the RWM chain was used in a recent study to fit a hierarchical model of tuning curves of neurons in the primary visual cortex to experimental data: Cronin, Stevenson, Sur, & Kording, 2009.) Of course, different choices of the proposal distribution will affect the mixing rate of the chain. To increase this rate, it is generally a good idea to align the axes of *q*(·) with those of the target density, if possible, so that the proposal jump scales in different directions are roughly proportional to the width of π(**x**) along those directions. Such proposals will reduce the rejection probability and increase the average jump size by biasing the chain to jump in more favorable directions. For gaussian proposals, we can thus choose the covariance matrix of *q*(·) to be proportional to the covariance of π(**x**). Of course, calculating the latter covariance is often a difficult problem (which the MCMC method is intended to solve), but we can exploit the Laplace approximation, equation 2.8, and take the inverse of the Hessian of the log posterior at the MAP, equation 2.9, as a first approximation to the covariance. This is equivalent to modifying the proposal rule 3.7 into

$$\mathbf{y} = \mathbf{x} + \sigma A \mathbf{z}, \tag{3.8}$$

where *A* is the Cholesky decomposition of $J^{-1}$,

$$A A^\top = J^{-1}, \tag{3.9}$$

and *J* was defined in equation 2.9. We refer to chains with such jump proposals as nonisotropic gaussian RWM. Figure 2 compares the isotropic and nonisotropic proposals. The modification, equation 3.8, is equivalent to running a chain with isotropic proposals, equation 3.7, but for the auxiliary distribution $\tilde{\pi}(\tilde{\mathbf{x}}) \propto \pi(A\tilde{\mathbf{x}})$ (whose corresponding Laplace approximation is a standard gaussian with identity covariance), and subsequently transforming the samples, $\tilde{\mathbf{x}}_t$, by the matrix *A* to obtain samples $\mathbf{x}_t = A\tilde{\mathbf{x}}_t$ from π(**x**). Implementing nonisotropic sampling using the transformed distribution $\tilde{\pi}$, instead of modifying the proposals as in equation 3.8, is more readily extended to chains more complicated than RWM (see below), and therefore we used this latter method in our simulations using different chains.

As we will see in the next section, in the flat prior case and for weak stimulus filters or a small number of identical cells, the Laplace approximation can be poor. In particular, the Hessian, equation 2.9, does not contain any information about the prior in the flat case, and therefore the approximate distribution, equation 2.8, can be significantly broader than the extent of the prior support in some directions. To take advantage of the Laplace approximation in this case, we regularized the Hessian by adding to it the inverse covariance matrix of the flat prior, obtaining a matrix that would be the Hessian if the flat prior was replaced by a gaussian with the same mean and covariance. Although the gaussian with this regularized Hessian is still not a very good approximation for the posterior, we saw that in many cases, it improved the mixing rate of the chain.

Note that multiplying a vector of dimension *d* by a *d* × *d* matrix involves *O*(*d*^{2}) basic operations, and the inversion of a *d* × *d* matrix involves *O*(*d*^{3}) basic operations. In the decoding examples we consider, the dimension of **x** is most often proportional to the temporal duration of the stimulus. Thus, naively, the one-time inversion of *J* and calculation of *A* takes *O*(*T*^{3}) basic operations, where *T* is the duration of the stimulus, while the multiplication of $\tilde{\mathbf{x}}$ by *A* in each step of the MCMC algorithm takes *O*(*T*^{2}) operations. This would make the decoding of stimuli with even moderate duration forbidding. Fortunately, the quasi-locality of the GLM model allows us to overcome this limitation. Since the filters $K_i$ in the GLM have a finite temporal duration, $T_k$, the Hessian of the GLM log-likelihood, equation 2.4, is banded in time: the matrix element coupling stimulus components at times *t* and *t*′ vanishes when $|t - t'| > T_k$. The Hessian of the log posterior, equation 2.9, is the sum of the Hessians of the negative log prior and the negative log likelihood, which in the gaussian case is

$$J = C^{-1} + J_L,$$

where *C* is the prior covariance (see equation 2.10) and $J_L$ is the Hessian of the negative log likelihood. Thus, if $C^{-1}$ is also banded, *J* will be banded in time as well. As an example, gaussian autoregressive processes of any finite order form a large class of priors that have banded $C^{-1}$. In particular, for white noise stimuli, $C^{-1}$ is diagonal, and therefore *J* will have the same bandwidth as $J_L$. Efficient algorithms can find the Cholesky decomposition of a banded matrix with bandwidth $n_b$ in $O(n_b^2 d)$ computations, instead of *O*(*d*^{3}) (e.g., the command chol in Matlab uses this method automatically if *J* is banded and is encoded as a sparse matrix). Likewise, if *B* is a banded matrix with bandwidth $n_b$, the linear equation *B***x** = **y** can be solved for **x** in $O(n_b d)$ computations. Therefore, to calculate $\mathbf{x}_t$ from $\tilde{\mathbf{x}}_t$ in each step of the Markov chain, we proceed as follows. Before starting the chain, we first calculate the banded Cholesky decomposition of *J*, $J = L L^\top$, so that $A = L^{-\top}$ satisfies equation 3.9. Then at each step of the MCMC, given $\tilde{\mathbf{x}}_t$, we find $\mathbf{x}_t$ by solving the banded triangular equation $L^\top \mathbf{x}_t = \tilde{\mathbf{x}}_t$. Since both of these procedures involve a number of computations that scale only linearly with *d* (and thus with *T*), we can perform the whole MCMC decoding in *O*(*T*) computational time. This allows us to decode stimuli with durations on the order of many seconds. Similar methods with *O*(*T*) computational cost have been used previously in applications of MCMC to inference and estimation problems involving state-space models (Shephard & Pitt, 1997; Davis & Rodriguez-Yam, 2005; Jungbacker & Koopman, 2007), but these had not been generalized to non-state-space models (such as the GLM model we consider here) where the Hessian nevertheless has a banded structure. For a review of applications of state-space methods to neural data analysis, see Paninski et al., 2010. That review also elucidates the close relationship between methods based on state-space models and methods exploiting the bandedness of the Hessian matrix, as described here. Exploiting the bandedness of the Hessian matrix in the optimization problem of finding the MAP is discussed in the companion article.
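These banded computations can be sketched in a few lines (a toy tridiagonal *J* with bandwidth 1 stands in for the log-posterior Hessian; the helper functions are ours):

```python
import numpy as np

def banded_cholesky(J, nb):
    """Cholesky factorization J = L L^T touching only the band of width nb:
    O(nb^2 d) operations instead of O(d^3)."""
    d = J.shape[0]
    L = np.zeros_like(J)
    for j in range(d):
        lo = max(0, j - nb)
        L[j, j] = np.sqrt(J[j, j] - np.sum(L[j, lo:j] ** 2))
        for i in range(j + 1, min(d, j + nb + 1)):
            L[i, j] = (J[i, j] - np.sum(L[i, lo:j] * L[j, lo:j])) / L[j, j]
    return L

def banded_backsolve(L, z, nb):
    """Solve L^T x = z by banded back-substitution in O(nb d). Since
    A = L^{-T} satisfies A A^T = J^{-1} (equation 3.9), this maps a
    whitened sample z to x = A z."""
    d = z.size
    x = np.zeros(d)
    for i in range(d - 1, -1, -1):
        hi = min(d, i + nb + 1)
        x[i] = (z[i] - np.dot(L[i + 1:hi, i], x[i + 1:hi])) / L[i, i]
    return x

# Toy banded Hessian: tridiagonal, diagonally dominant (positive definite).
d = 6
J = 2.0 * np.eye(d) + 0.5 * (np.eye(d, k=1) + np.eye(d, k=-1))
L = banded_cholesky(J, nb=1)
z = np.array([1.0, -1.0, 0.5, 0.0, 2.0, -0.5])
x = banded_backsolve(L, z, nb=1)
```

In practice one would call an optimized sparse or banded solver rather than these loops; the point is that neither step ever touches entries outside the band.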

### 3.2. Hybrid Monte Carlo and MALA.

A more powerful method for constructing rapidly mixing chains is the so-called hybrid or Hamiltonian Monte Carlo (HMC) method. In a sense, HMC is at the opposite end of the spectrum with respect to RWM in that it is designed to suppress the random walk nature of the chain by exploiting information about the local shape of , via its gradient, to encourage steps toward regions of higher probability. This method was originally inspired by the equations of Hamiltonian dynamics for the molecules in a gas (Duane, Kennedy, Pendleton, & Roweth, 1987) but has since been used extensively in Bayesian settings (for its use in sampling from posteriors based on GLM, see Ishwaran, 1999; see also Neal, 1996, for further applications and extensions).

This method starts with augmenting the vector **x** with an auxiliary vector, **z**, of the same dimension. Let us define the potential energy as $E(\mathbf{x}) = -\log \pi(\mathbf{x})$, up to a constant, and a “Hamiltonian function” by $H(\mathbf{x}, \mathbf{z}) = E(\mathbf{x}) + \|\mathbf{z}\|^2/2$. Instead of sampling points, $\mathbf{x}_t$, from π(**x**), the HMC method constructs an MH chain that samples points, $(\mathbf{x}_t, \mathbf{z}_t)$, from the joint distribution $\pi(\mathbf{x}, \mathbf{z}) \propto e^{-H(\mathbf{x}, \mathbf{z})} = \pi(\mathbf{x})\, e^{-\|\mathbf{z}\|^2/2}$. But since this distribution is factorized into the product of its marginals for **x** and **z**, the **x**-part of the obtained samples yields samples from π(**x**). On the other hand, sampling from the marginal over **z** is trivial, since **z** is normally distributed. In a generic step of the Markov chain, starting from (**x**_{t}, **z**_{t}), the HMC algorithm performs the following steps to generate (**x**_{t+1}, **z**_{t+1}). First, to construct the MH proposal:

Set $\mathbf{x} = \mathbf{x}_t$, and sample **z** from the isotropic gaussian distribution $p(\mathbf{z}) \propto e^{-\|\mathbf{z}\|^2/2}$.

Evolve (**x**, **z**) according to the equations of Hamiltonian dynamics^{8} discretized based on the leapfrog method, by repeating the following steps *L* times:

$$\mathbf{z} \leftarrow \mathbf{z} - \tfrac{\epsilon}{2} \nabla E(\mathbf{x}), \qquad \mathbf{x} \leftarrow \mathbf{x} + \epsilon\, \mathbf{z}, \qquad \mathbf{z} \leftarrow \mathbf{z} - \tfrac{\epsilon}{2} \nabla E(\mathbf{x}).$$

Finally, to implement the MH acceptance step, equation 3.4,

With probability $\min\{1, e^{-\Delta H}\}$, where $\Delta H = H(\mathbf{x}, \mathbf{z}) - H(\mathbf{x}_t, \mathbf{z}_t)$, accept the proposal **x** as **x**_{t+1}. Otherwise reject it, and set **x**_{t+1} = **x**_{t}. (It can be shown that this is a bona fide MH rejection rule, ensuring that the resulting MCMC chain indeed has the desired equilibrium density; Duane et al., 1987.)

The algorithm thus has two adjustable parameters, the number of leapfrog steps, *L*, and the step size, ε, which can be chosen to maximize the mixing rate of the chain while minimizing the number of evaluations of $E(\mathbf{x})$ and its gradient. In practice, even a small *L*, requiring fewer gradient evaluations, often yields a rapidly mixing chain, and therefore in our simulations we used small values of *L*. The special case of *L* = 1 corresponds to a chain that has proposals of the form

$$\mathbf{y} = \mathbf{x} + \frac{\epsilon^2}{2} \nabla \log \pi(\mathbf{x}) + \epsilon\, \mathbf{z},$$

where **z** is normal with zero mean and identity covariance, and the proposal **y** is accepted according to the MH rule, equation 3.4. In the limit $\epsilon \to 0$, this chain becomes a continuous Langevin process with the potential function $E(\mathbf{x})$, whose stationary distribution is the Gibbs measure, $\pi(\mathbf{x}) \propto e^{-E(\mathbf{x})}$, without the MH rejection step. For a finite ε, however, the Metropolis-Hastings acceptance step is necessary to guarantee that π(**x**) is the invariant distribution. The chain is thus referred to as the Metropolis-adjusted Langevin algorithm (MALA) (Roberts & Tweedie, 1996).
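The leapfrog-based HMC update (with MALA recovered at *L* = 1) can be sketched as follows (the function name and parameter values are ours, not from the article):

```python
import numpy as np

def hmc_sample(log_pi, grad_log_pi, x0, n_steps, eps=0.1, n_leap=10, seed=1):
    """Hybrid Monte Carlo with potential E(x) = -log pi(x) and Hamiltonian
    H(x, z) = E(x) + |z|^2 / 2; n_leap = 1 gives the MALA chain."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    samples = np.empty((n_steps, x.size))
    for t in range(n_steps):
        z = rng.standard_normal(x.size)        # refresh the momentum
        xp, zp = x.copy(), z.copy()
        zp += 0.5 * eps * grad_log_pi(xp)      # leapfrog: half step in z
        for step in range(n_leap):
            xp += eps * zp                     # full step in x
            g = grad_log_pi(xp)
            if step < n_leap - 1:
                zp += eps * g                  # full step in z
        zp += 0.5 * eps * g                    # final half step in z
        # MH acceptance with probability min(1, exp(-dH)):
        dH = (-log_pi(xp) + 0.5 * zp @ zp) - (-log_pi(x) + 0.5 * z @ z)
        if np.log(rng.uniform()) < -dH:
            x = xp
        samples[t] = x
    return samples

# Example: sampling a standard 2D gaussian.
chain = hmc_sample(lambda x: -0.5 * x @ x, lambda x: -x, np.zeros(2), 5000)
```

Because the leapfrog integrator nearly conserves the Hamiltonian, the acceptance rate stays high even for fairly large jumps, which is the source of HMC's efficiency.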

The scale parameter ε, which also needs to be adjusted for the RWM chain, sets the average size of the proposal jumps: we must typically choose this scale to be small enough to avoid jumping wildly into a region of low π(**x**), and therefore wasting the proposal, since it will be rejected with high probability. At the same time, we want to make the jumps as large as possible, on average, in order to improve the mixing time of the algorithm. (See Roberts and Rosenthal, 2001, and Gelman, 2004, for some tips on how to find a good balance between these two competing desiderata for the RWM and MALA chains.) For the HMC chains with *L*>1, we chose ε, by trial and error, to obtain an MH acceptance rate of 60% to 70%. We adopted this rule of thumb based on a qualitative extrapolation of the results of Roberts and Rosenthal (1998) for the special cases of *L*=0 and 1 (corresponding to the RWM and MALA chains, respectively), and their suggestion to tune the acceptance rate in those cases to 25% and 55%, respectively, for optimal mixing (for further discussion, see section 3.5; for a study on tuning the parameters for HMC with general *L*, see, e.g., Kennedy, Edwards, Mino, & Pendleton, 1996).

For highly nonisotropic distributions, the HMC chains can also be enhanced by exploiting the Laplace approximation (or its regularized version in the uniform prior case, as explained in the RWM case) by modifying the HMC proposals. Equivalently, as noted after equation 3.9, we can sample from the auxiliary distribution $\tilde{\pi}(\tilde{\mathbf{x}}) \propto \pi(A\tilde{\mathbf{x}})$ (where *A* is given in equation 3.9) using the unmodified HMC chain, described above, and subsequently transform the samples by *A*. As explained in the final paragraph of section 3.1, we can perform this transformation efficiently, in *O*(*T*) computational time, where *T* is the stimulus duration. Another practical advantage of this transformation by *A* is that the process of finding the appropriate scale parameter ε simplifies considerably, since $\tilde{\pi}$ may be approximated as a gaussian distribution with identity covariance regardless of the scaling of different dimensions in the original distribution π(**x**). To our knowledge, this enhancement of the HMC chain using the Laplace approximation is novel. This chain turned out to be the most efficient in most of the decoding examples we explored (we discuss this in more detail in section 3.5).

It is worth noting that when sampling from high-dimensional distributions with sharp gradients, the MALA, HMC, and RWM chains have a tendency to be trapped in “corners” where the log posterior changes suddenly. This is because when the chain eventually ventures close to the corner, a jump proposal will very likely fall on the exterior side of the sharp high-dimensional corner (the probability of jumping to the interior side from the tip of a cone decreases exponentially with increasing dimensionality). Thus, most proposals will be rejected, and the chain will effectively stop. As we will see below, the hit-and-run chain is known to have an advantage in escaping from such sharp corners (Lovasz & Vempala, 2004). We discuss this point further in section 3.4.

### 3.3. The Gibbs Sampler.

Let *x*_{m} denote the *m*th component of **x**, and let **x**_{−m} denote the other components, that is, the projection of **x** on the subspace orthogonal to the *m*th axis. The Gibbs update is defined as follows. First, choose the dimension *m* randomly or in order. Then update **x** along this dimension, that is, sample *x*_{m} from its conditional distribution given **x**_{−m} (while leaving the other components fixed). This is equivalent to sampling a one-dimensional auxiliary variable, *s*, from the conditional density along the *m*th axis and setting **y**=**x**+*s***e**_{m}, where **e**_{m} is the unit vector along the *m*th axis (we discuss how to sample from this one-dimensional distribution in section 3.4). It is well known that the Gibbs rule is indeed a special case of the MH algorithm where the proposal, equation 3.12, is always accepted. (For applications of the Gibbs algorithm for sampling from posterior distributions involving GLM-like likelihoods, see Chan & Ledolter, 1995, and Gamerman, 1997, 1998; see also Smith, Wirth, Suzuki, & Brown, 2007, for some related applications in neural data analysis, discussed in section 5.1.)
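A minimal sketch of the systematic-scan Gibbs update, using a multivariate gaussian target so that the one-dimensional conditionals are available in closed form (in the GLM setting they would instead be sampled by, e.g., adaptive rejection sampling; the code and names are illustrative):

```python
import numpy as np

def gibbs_gaussian(mu, Sigma, n_sweeps, rng):
    """Systematic-scan Gibbs sampler for a multivariate gaussian target.
    Each update draws x_m from p(x_m | x_{-m}), which is a one-dimensional
    gaussian whose mean and variance are read off the precision matrix."""
    d = len(mu)
    P = np.linalg.inv(Sigma)                      # precision matrix
    x = mu.copy()
    samples = np.empty((n_sweeps, d))
    for i in range(n_sweeps):
        for m in range(d):                        # update one component, others fixed
            cond_var = 1.0 / P[m, m]
            # contribution of the fixed components to the conditional mean
            r = P[m, :] @ (x - mu) - P[m, m] * (x[m] - mu[m])
            x[m] = (mu[m] - cond_var * r) + np.sqrt(cond_var) * rng.standard_normal()
        samples[i] = x
    return samples

rng = np.random.default_rng(2)
mu = np.array([1.0, -1.0])
Sigma = np.array([[1.0, 0.5], [0.5, 1.0]])
samples = gibbs_gaussian(mu, Sigma, 20000, rng)
```

With a correlation of 0.5 the chain mixes quickly; pushing the correlation toward 1 reproduces the slow mixing discussed below.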

It is important to note that the Gibbs update rule can sometimes fail to lead to an ergodic chain; the chain can get “stuck” and fail to sample from the target distribution properly (Robert & Casella, 2005). An extreme case of this is when the conditional distributions are deterministic: then the Gibbs algorithm will never move, clearly breaking the ergodicity of the chain. More generally, in cases where strong correlations between the components of **x** lead to nearly deterministic conditionals, the mixing rate of the Gibbs method can be extremely low (Figure 3a shows this phenomenon for a two-dimensional distribution with strong correlation between the two components). Thus, it is a good idea to choose the parameterization of the model carefully before blindly applying the Gibbs algorithm. For example, we can change the basis or, more systematically, exploit the Laplace approximation, as described above, and sample from the auxiliary distribution instead.

### 3.4. The Hit-and-Run Algorithm.

The hit-and-run algorithm generalizes the Gibbs sampler: instead of updating **x** along one of the coordinate axes, we update it along a random general direction not necessarily parallel to any coordinate axis. More precisely, the sampler is defined in two steps. First, choose a direction **n** from some positive density (with respect to the normalized Lebesgue measure) on the unit sphere **n**^{T}**n**=1. Then, similar to Gibbs, sample the new point on the line defined by **n** and **x**, with a density proportional to the underlying distribution: that is, sample *s* from the one-dimensional slice density, equation 3.13, and set **y**=**x**+*s***n**.^{9} Although the hit-and-run chain is well known in the statistics literature, it has not been used in neural decoding.

The main gain over RWM or HMC is that instead of taking small local steps (of size set by the scale parameter in equations 3.7 and 3.11), we may take very large jumps in the **n** direction; the jump size is set by the underlying distribution itself, not by an arbitrary scale that has to be tuned by the user to achieve optimal efficiency. The jump size in a given direction, **n**, is set by the scale of the “slice” distribution, equation 3.13.

This, together with the fact that all hit-and-run proposals are accepted, makes the chain better at escaping from sharp high-dimensional corners (see Lovasz & Vempala, 2004, and the discussion at the end of section 3.2 above). The advantage over Gibbs arises in situations such as that depicted in Figure 2, where jumps parallel to the coordinate axes lead to small steps, but there are directions along which hit-and-run can make long jumps. The price to pay for these possibly long nonlocal jumps, however, is that now (as in the Gibbs case) we need to sample from the one-dimensional slice density, equation 3.13, which is in general nontrivial. Fortunately, as we mentioned above (see the discussion around equations 2.5 to 2.7), in the case of neurons modeled by the GLM, the posterior distribution and thus all its “slices” are log concave, and efficient methods such as adaptive rejection sampling (ARS) (Gilks, 1992; Gilks & Wild, 1992) can be used to sample from the one-dimensional slice in the hit-and-run step. Let us emphasize, however, that the hit-and-run algorithm by itself does not require the distribution to be log concave: given a method other than ARS for sampling from the one-dimensional conditional distributions, hit-and-run can be applied to general distributions that are not log concave as well.

Regarding the direction density, the easiest choice is the isotropic (uniform) density on the unit sphere. More generally, it is easy to sample directions distributed on ellipses by sampling from the appropriate gaussian distribution and normalizing. Thus, again, a reasonable approach is to exploit the Laplace approximation: we sample **n** by drawing an auxiliary point from a zero-mean gaussian with covariance *J*^{−1}, where *J* is the Hessian, equation 2.9, and normalizing it to unit length (see Figure 2). This prescription is equivalent to sampling **n** from what is referred to as the angular central gaussian distribution in the statistics literature (see, e.g., Tyler, 1987). This further advantages hit-and-run over Gibbs by giving more weight to directions that allow larger jumps to be made.
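The two-step update can be sketched as below. For simplicity, the log-concave one-dimensional slice is sampled here by a grid-based inverse CDF rather than ARS, and the direction density defaults to isotropic (passing a Hessian `J` switches to the angular central gaussian); all names and grid choices are illustrative.

```python
import numpy as np

def hit_and_run(logp, x0, n_steps, rng, J=None, s_grid=np.linspace(-8, 8, 801)):
    """Hit-and-run sampler. Directions are drawn from the angular central
    gaussian with parameter J^{-1} (isotropic when J is None), and the 1D
    "slice" density p(x + s n) is sampled by a grid-based inverse CDF --
    a crude stand-in for adaptive rejection sampling on log-concave slices.
    logp must accept an (m, d) array of points and return m log densities."""
    A = None if J is None else np.linalg.cholesky(np.linalg.inv(J))
    x = x0.copy()
    samples = np.empty((n_steps, x0.size))
    for i in range(n_steps):
        u = rng.standard_normal(x.size) if A is None else A @ rng.standard_normal(x.size)
        n = u / np.linalg.norm(u)                   # random direction on the sphere
        logw = logp(x[None, :] + s_grid[:, None] * n[None, :])
        w = np.exp(logw - logw.max())               # unnormalized slice density
        cdf = np.cumsum(w)
        s = np.interp(rng.uniform() * cdf[-1], cdf, s_grid)  # inverse-CDF draw
        x = x + s * n                               # every proposal is accepted
        samples[i] = x
    return samples

rng = np.random.default_rng(3)
logp = lambda X: -0.5 * np.sum(X ** 2, axis=-1)     # standard gaussian stand-in
samples = hit_and_run(logp, np.zeros(2), 4000, rng)
```

Note that no proposal scale appears anywhere: the slice distribution itself sets the jump size along each sampled direction.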

### 3.5. Comparison of Different MCMC Chains.

Above, we pointed out some qualitative reasons behind the strengths and weaknesses of the different MCMC algorithms in terms of their mixing rates and computational costs. Here we give a more quantitative account and also compare the different methods based on their performance in the neural decoding setting.

Consider the MCMC estimate of the posterior expectation of a function *g*(**x**). As one always has access to finitely many samples, *N*, the estimate has a finite random error, equation 3.3, even in the optimal case of i.i.d. samples from the posterior. For the correlated samples of the MCMC chain and for large *N*, the error is larger, and equation 3.3 generalizes so that the effective number of independent samples is reduced by a factor given by the autocorrelation time (see Kipnis & Varadhan, 1986), independent of the starting point.^{10} Here, the autocorrelation time is the equilibrium autocorrelation time of the scalar process *g*(**x**_{i}), based on the chain **x**_{i}; it is defined as the sum over all lags *t* of what we refer to as the lag-*t* autocorrelation for *g*(**x**). Thus, the smaller the autocorrelation time, the more efficient the MCMC algorithm, as one can run a shorter chain to achieve a desired estimated error.

A related summary measure is the first-order efficiency (FOE), defined in terms of the lag-1 autocorrelation of the *m*th component of **x**, *x*_{m}. From equation 3.16, it follows that the weighted average of the FOE over all components (with weights Var[*x*_{m}]) is determined by the lag-1 autocorrelations; thus, maximizing the FOE is roughly equivalent to minimizing correlations. One analytical result concerning the mixing performance of different MCMC chains was obtained in Roberts and Rosenthal (1998) for the FOE of RWM and MALA when sampling from the restricted class of product distributions, in the limit of asymptotically large dimension *d* (often a relevant limit in neural decoding). Based on their results, the authors also argue that in general, the jump scales of RWM and MALA proposals may be chosen such that their acceptance rates are roughly 0.25 and 0.55, respectively. For the special case of sampling from a *d*-dimensional standard gaussian distribution, and for optimally chosen proposal jump scales, they show that the FOE of gaussian MALA and RWM are asymptotically equal to 1.6*d*^{2/3} and 1.33, respectively.

A similar calculation can be carried out for hit-and-run: writing the update as **y**=**x**+*s***n**, with *s* sampled as in equation 3.13, the lag-1 autocorrelation follows from the moments of the one-dimensional slice distribution, which can be evaluated in closed form for the standard gaussian. The resulting FOE of hit-and-run is higher than that of RWM in this case; nevertheless, for unimodal, nearly gaussian distributions, MALA will mix much faster (by a factor of order *d*^{2/3}) than both RWM and hit-and-run in large dimensions. Although we know of no such result for general HMC chains with higher-order leapfrog steps than the one-step MALA algorithm, we expect their mixing speed to increase further with the number of leapfrog steps. The superiority of HMC over the other chains is clearly visible in Figure 4a, which shows a plot of the estimated autocorrelation functions of the three chains when sampling from the GLM posterior with a standard gaussian prior and a weak stimulus filter, leading to a weak likelihood. More generally, in our simulations with gaussian priors and smooth GLM nonlinearities, HMC (including MALA) had an order-of-magnitude advantage over the other chains for most of the relevant parameter ranges. Thus, we used this chain in section 5.2 for evaluating the mutual information with gaussian priors.

The situation can be very different, however, for highly nongaussian (but still log-concave) distributions, such as those with sharp boundaries. In our GLM setting, this can be the case with flat priors on convex sets, equation 2.11, when the likelihood is broad and does not restrict the posterior support away from the boundaries and corners of the prior support. In this case, HMC and MALA lose their advantage because they cannot exploit the information in the prior distribution, which has zero gradient within its support. Furthermore, as mentioned in sections 3.2 and 3.4, when the convex body has sharp corners, hit-and-run will have an advantage over both RWM and HMC in avoiding getting trapped in those corners, which can otherwise considerably slow the chain in large dimensions (see the arguments in Lovasz & Vempala, 2004). Finally, the MALA or HMC proposals can in principle be inefficient in regions of sharp gradient changes; for example, in the GLM setting, if the nonlinearity *f*(.) is very sharp, then the log likelihood might vary much more quickly than quadratically. In such cases, the HMC proposal jumps can be too large, falling in regions where the posterior density is very low and leading to high rejection rates. This can significantly reduce HMC's advantage even when the prior is gaussian. However, in our experience with *f*(.)=exp(.), this did not occur.

Figure 4b shows the estimated autocorrelation function for different chains in sampling from the posterior distribution in GLM-based decoding with a flat prior stimulus distribution, equation 2.11, with cubic support.^{11} For this prior, the correlation time of the hit-and-run chain was consistently lower than those of the RWM, MALA, and Gibbs (not shown in the figure) chains, unless the likelihood was sharp and concentrated away from the boundaries of the prior cube. As we mentioned above (also see the next section), the Laplace approximation is adequate in this latter case. Thus, we see that hit-and-run is the faster chain when this approximation fails, which is also the case where MCMC is more indispensable. We thus used the hit-and-run algorithm in our decoding examples for the flat prior case presented in the next section.

Finally, we note that other methods of diagnosing mixing and convergence, such as the so-called r-hat (R̂) statistic (Brooks & Gelman, 1998), gave results consistent with those based on the autocorrelation time presented here.
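For completeness, a minimal sketch of the r-hat diagnostic (the multi-chain potential scale reduction factor; the chains and thresholds below are illustrative):

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor (r-hat) for m chains of length n,
    comparing the between-chain variance B with the mean within-chain
    variance W (Brooks & Gelman, 1998)."""
    chains = np.asarray(chains, dtype=float)        # shape (m, n)
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()           # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)         # between-chain variance
    V_hat = (n - 1) / n * W + B / n                 # pooled variance estimate
    return np.sqrt(V_hat / W)

rng = np.random.default_rng(5)
mixed = rng.standard_normal((4, 5000))              # four well-mixed "chains"
stuck = mixed + 5.0 * np.arange(4)[:, None]         # chains stuck in different regions
```

Well-mixed chains give r-hat near 1; chains exploring different regions give values well above 1.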

## 4. Comparison of MAP and Monte Carlo Decoding

In this section we compare Bayesian stimulus decoding using the MAP and the posterior mean estimates, equations 2.7 and 3.1, based on the response of a population of neurons modeled via the GLM introduced in section 2. We will show that in the flat prior case, equation 2.11, the MAP estimate, in terms of its mean squared error, is much less efficient than the posterior mean estimate. We contrast this with the gaussian prior case, where the Laplace approximation is accurate in a large range of model parameters, and thus the two estimates are close. Furthermore, for both kinds of priors, in the limit of strong likelihoods (e.g., due to a strong stimulus filter or a large number of neurons), the posterior distribution will be sharply concentrated, the Laplace approximation becomes asymptotically more and more accurate, and both estimates will eventually converge to the true stimulus (more precisely, the part of the stimulus that is not outside the receptive field of all the neurons; see footnote 13).

In the first two examples (Figures 5 and 6), the stimulus estimates were computed given the simulated spike trains of a population of pairs of ON and OFF retinal ganglion cells (RGCs) in response to a spatially uniform, full-field fluctuating light intensity signal. The stimuli were discretized white noise with gaussian and flat distributions (see the paragraph after equation 2.11). Spike responses were generated by simulating the GLM point process encoding model, described by equations 2.3 and 2.4, with exponential nonlinearity, *f*(*u*)=exp(*u*). The couplings between different cells (*h*_{ij} of equation 2.3 for *i*≠*j*) were set equal to zero, but the diagonal kernels, *h*_{ii}, representing the spike history feedback of each cell to itself, were closely matched to those found with fits to macaque ON and OFF RGCs reported in Pillow et al. (2008), and so were the DC biases, *b*_{i}; the values of the DC biases were such that the baseline firing rate, exp(*b*_{i}), in the absence of stimulus was approximately 7 Hz (see the appendix of Pillow et al., 2011, the companion article, for a more detailed description of the fits for stimulus and spike history filters). However, for demonstration purposes, the stimulus filters, *k*_{i}, were set to positive and negative delta functions (for ON and OFF cells, respectively), resulting in the filtered input being proportional to the light stimulus, *x*(*t*), so that bandpass filtering of the stimulus did not result in information loss and convergence of the estimates to the true stimulus could be observed more easily. For a fixed number of cells, the parameter of relevance here, which determines the signal-to-noise ratio of the RGCs' spike trains, is the strength of the filtered stimulus input to the GLM nonlinearity. The magnitude of this input is proportional to the product of *c*, the stimulus contrast, and the norm of the receptive field filter (which we have taken to be the same for all cells in this example). Figure 5 shows the stimulus, the spike trains, and the two estimates for three different magnitudes of this input, based on the response of one pair of ON and OFF cells. Figure 6 shows the same based on the response of 10 identical pairs of RGCs.

Because the prior distribution here is flat on the 50-dimensional cube centered at the origin, the Laplace approximation, equation 2.8, will be justified only when the likelihood is sharply concentrated and supported away from the edges of the cube.^{12} Moreover, since the flat prior is only “felt” on the boundaries of the cube (the horizontal dashed lines in Figures 5 and 6), the MAP will lie in the interior of the cube only if the likelihood has a maximum there. For filtered stimulus inputs with small magnitude, the log likelihood, equations 2.3 and 2.4, becomes approximately linear in the components of **x**. With a flat prior, the almost linear log posterior will very likely be maximized only on the boundaries of the cube (since linear functions on convex domains attain their maxima at the “corners” of the domain). Thus, in the absence of a strong, confining likelihood, the MAP has a tendency to stick to the boundaries, as seen in the first two columns of Figure 5; in other words, the MAP falls on a corner of the cube, where the Laplace approximation is worst and where MALA and RWM are least efficient. We note that the likelihood would be further weakened, in fact, if we replaced the delta function stimulus filters with more realistic filters, as the bandpass filtering would remove the dependence of the likelihood on the features of the stimulus that were filtered out (see a similar discussion in our companion article, Pillow et al., 2011, on MAP decoding).

A sharp likelihood confines the posterior away from the boundaries of the prior support and solely determines the position of both the MAP and the posterior mean. In this case, the gaussian approximation for the posterior distribution is valid, and the two estimates will in fact be very close (as the mean and the mode of a gaussian are one and the same). This can be seen in the right column of Figure 5, where the large value of the stimulus filter has sharpened the likelihood. Also, as is generally true in statistical parameter estimation, when the number of data points becomes large, the likelihood term gets very sharp, leading to accurate estimates.^{13} In our case, this corresponds to increasing the number of cells with similar receptive fields, leading to the smaller error bars in Figure 6 and the more accurate and closer MAP and mean estimates.

To compare the performance of the two estimates more quantitatively, in Figure 7 we have plotted the average squared errors of the two estimates under the full stimulus-response distribution, *p*(**x**, **r**) (for the same type of stimulus and cell pair as in the Figure 5 simulations), as a function of the magnitude of the filtered stimulus input. This was done by generating five samples of the stimulus in each case and then simulating the GLM to generate the spike train response of the pair of ON and OFF cells to each stimulus, leading to sample pairs (**x**_{i}, **r**_{i}). For each of the responses, **r**_{i}, the MAP and MCMC mean were computed based on the corresponding posteriors. The average (over *p*(**x**, **r**)) square error was then approximated by its sample mean. The left and right panels in Figure 7 show plots of the squared error per dimension, for the MAP and mean estimates, as a function of the stimulus filter strength for the flat and gaussian white noise stimulus ensembles, respectively. As is evident from the plots, in the former ensemble, the MAP is inferior to the mean, due to its higher mean squared error, unless the filter strength is large. For the gaussian ensemble, the plot shows that the errors of the MAP and posterior mean estimates are very close throughout the range of stimulus filter strengths. Thus, due to its much lower computational cost, the MAP-based decoding method of Pillow et al. (2011) in the companion article is superior for this prior. Let us mention that the magnitude of the filtered stimulus input in the experimental data reported in Pillow et al. (2008) (which is also the basis of the final example in this section—see Figure 8) varied depending on the cell; smaller values can be achieved experimentally by lowering the contrast of the visual stimulus as needed. The values of this parameter used in Figure 7, as well as in Figures 5 and 6, are on the same order of magnitude as those in that experiment and cover a range of values that is experimentally and biologically relevant.

In our final example, we decoded the filtered stimulus inputs to a population of real RGCs from their experimentally recorded spike trains, using the data of Pillow et al. (2008). The stimulus, **x**, in this case is a spatiotemporally fluctuating binary white noise, with *x*(*t*, *n*) representing the contrast of pixel *n* at time *t*. In Pillow et al. (2008), 20 minutes of these data were used to fit the GLM parameters, including the cross-couplings, *h*_{ij}, to these cells (see that reference for details about the recording and the fitting method and a full description of the fit GLM parameters). Here, we took a 500 ms portion of the recorded spike trains of six neighboring RGCs (three ON and three OFF) and, using the fit GLM parameters for them, decoded the filtered inputs, **y**_{i}, to these cells using the MAP and the posterior mean (calculated using an HMC chain). The inputs are a priori correlated due to the overlaps between the cells' receptive fields, and the covariance matrix of the **y**_{i} is determined, via equation 4.2, by the covariance of the white noise visual stimulus and the receptive field filters. Notice that with the experimentally fit filters, *k*_{i}, which have a finite temporal duration, the covariance matrix is banded: it vanishes for time lags longer than the filter duration. Since **x** is binary, **y**_{i} is not a gaussian vector. However, because the filters *k*_{i}(*t*, *n*) have a relatively large spatiotemporal dimension, the *y*_{i}(*t*) are weighted sums of many i.i.d. binary random variables, and their prior marginal distributions can be well approximated by gaussians. For this reason, and because the likelihood was relatively strong for these data (and hence the dependence on the prior relatively weak), we replaced the true (highly nongaussian) joint prior distribution of the **y**_{i} with a gaussian distribution with zero mean and the covariance of equation 4.2. This allowed us to implement the efficient nonisotropic HMC chain, described above, so that its computational cost scales only linearly with the stimulus duration *T*, allowing us to decode very long stimuli. However, in this case, the details of the procedure explained in the final paragraph of section 3.1 have to be modified as follows. The Hessian for **y** is the sum of the inverse prior covariance and the Hessian of the negative log-likelihood term, *J*^{LL}_{y}, where the latter is now diagonal, because *y*_{i}(*t*) affects the conditional firing rate instantaneously (see equation 2.3). We define the matrix *A* similarly to equation 3.9; the nonisotropic chain requires the multiplication of *A* with some vector at each step of the MCMC. In order to carry this out in O(*T*) computational time, we proceed as follows. First, we calculate the Cholesky decomposition, *L*, of the prior covariance matrix. As mentioned in section 3.1, since this matrix is banded, this can be performed in O(*T*) operations. Then we can rewrite equation 4.3 in terms of a matrix *Q*, equation 4.4. Since *L* is banded (due to the bandedness of the prior covariance) and *J*^{LL}_{y} is diagonal, it follows that *Q* is also banded. Therefore, its Cholesky decomposition, *B*, can be calculated in O(*T*) time and is also banded. Using this definition and inverting equation 4.4, we deduce *A*=*LB*^{−1}, equation 4.5. The calculation of *L* and *B* can be performed before running the HMC chain; then at each step we need only apply equation 4.5. As described in the final paragraph of section 3.1, applying *B*^{−1} (by back-substitution) and multiplying the resulting vector by *L* require only O(*T*) elementary operations due to the bandedness of *B* and *L*.
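The matrix identities behind this construction can be checked with a small dense sketch (a toy tridiagonal prior covariance and dense linear algebra; in practice banded Cholesky routines would replace the dense calls to obtain the O(*T*) cost, and numpy's lower-triangular Cholesky convention puts the inverse on *B*^{T}):

```python
import numpy as np

rng = np.random.default_rng(7)
T = 40

# Tridiagonal (banded) prior covariance, as induced by short temporal filters
C = 2.0 * np.eye(T) + 0.7 * np.eye(T, k=1) + 0.7 * np.eye(T, k=-1)
J_LL = np.diag(rng.uniform(0.5, 2.0, T))    # diagonal log-likelihood Hessian
J = np.linalg.inv(C) + J_LL                 # Hessian of the negative log posterior

L = np.linalg.cholesky(C)                   # C = L L^T, precomputed once
Q = np.eye(T) + L.T @ J_LL @ L              # banded: L is banded, J_LL diagonal
B = np.linalg.cholesky(Q)                   # Q = B B^T, precomputed once

# A = L B^{-T} is a square root of J^{-1}: A A^T = L Q^{-1} L^T = J^{-1}.
# (With an upper-triangular Cholesky convention, Q = B^T B, this reads A = L B^{-1}.)
A = L @ np.linalg.inv(B.T)

# At each HMC step, A v costs one triangular solve plus one banded multiply;
# with banded Cholesky routines both steps are O(T).
v = rng.standard_normal(T)
Av = L @ np.linalg.solve(B.T, v)
```

Only `L` and `B` need to be stored; the per-step work is a triangular solve followed by a banded matrix-vector product.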

Figure 8 shows the spike trains, as well as the corresponding true inputs and MAP and posterior mean estimates. The closeness of the posterior mean to the MAP (the *L*_{2} norm of their difference is only about 9% of the *L*_{2} norm of the MAP) is an indication of the accuracy of the Laplace approximation in this case.

## 5. Other Applications: Estimation of Nonmarginal Quantities

So far we have focused on using the MCMC samples to estimate the posterior mean, equation 3.1, or the posterior covariance. Both quantities involve separate averaging over the marginal distributions of single components or pairs of components of **x**. However, since MCMC provides samples from the joint posterior distribution, we can also calculate quantities that cannot be reduced to averages over one- or two-dimensional marginals and instead involve the whole joint distribution. We consider two examples below.

### 5.1. Posterior Statistics of Crossing Times.

One important example of these nonmarginal computations involves the statistics (e.g., mean and variance) of some crossing time for the time series **x** (e.g., the time that *x*_{t} first crosses some threshold value). (First-passage time computations are especially important, for example, in the context of integrate-and-fire-based neural encoding models in Paninski, Iyengar, Kass, & Brown, 2008.) Smith et al. (2004) proposed a hidden state-space model that provides a dynamical description of the learning process of an animal in a task learning experiment (with binary responses) and yields suitable statistical indicators for establishing the occurrence of learning or determining the “learning trial.” In the proposed model, the state variable, *x*_{t}, evolves according to a gaussian random walk from trial to trial (labeled by *t*), and the probability of a correct response on every trial, *q*_{t}, is given by a logistic function of the corresponding state variable, *x*_{t}. Given the observation of the responses in all trials, the hidden state variable trajectory can be inferred. Smith et al. (2007) carried out this inference in a Bayesian fashion by using Gibbs sampling from the posterior distribution over the state variable time series and the model parameters conditioned on the observed responses. There, the learning trial was defined as the first trial after which the ideal (Bayesian) observer can state with 95% confidence that the animal will perform better than chance. More mathematically, using the MCMC samples (obtained using the winBUGS package), they obtained the sequence of the lower 95% confidence bounds for *q*_{t} for all *t*'s (for each *t*, this bound depends only on the one-dimensional marginal distribution of *q*_{t}). The authors defined the learning trial as the *t* for which the value of this lower confidence bound crosses the probability value corresponding to chance performance and stays above it in all the following trials.

However, it is reasonable to consider several alternative definitions of the learning trial in this setting. One plausible approach is to define the learning trial, *t*_{L}, in terms of certain passage times of *q*_{t}: for example, the trial in which *q*_{t} first exceeds the chance level and does not become smaller than this value at later trials. In this definition, *t*_{L} is a random variable whose value is not known by the ideal observer with certainty; its statistics are determined by the full joint posterior distribution and cannot be obtained from its marginals. The posterior mean of *t*_{L} provides an estimate for this quantity, and its posterior variance an estimate of its uncertainty. These quantities involve nonlinear expectations over the full joint posterior distribution and can be estimated using the MCMC samples from that distribution.

Figure 9 shows a simulated example in which we used our MCMC methods to decode the crossing times of the input to a Poisson neuron based on the observation of its spike train. The neuron's rate was a fixed function of the input, and the threshold corresponded to a value *x*_{t}=*x*_{0}. The hidden process *x*_{t} was assumed to evolve according to a gaussian AR(1) process, as in Smith et al. (2007). After a spike train was observed, samples from the posterior distribution of *x*_{t} were obtained by an HMC chain. To estimate the first and the last times that *x*_{t} crosses *x*_{0} from below, we calculate these times for each MCMC sample, obtaining samples from the posterior distribution of these times; we then calculate their sample means to estimate these crossing times. Figure 9 shows the full histograms of these passage times, emphasizing that these statistics are not fully determined by a single observation of the spike train.
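The per-sample computation is simple; the sketch below extracts the first up-crossing time from each sampled trajectory and summarizes its posterior by the sample mean and standard deviation (synthetic random walks stand in for the actual MCMC samples from the posterior over *x*_{t}; all names are illustrative):

```python
import numpy as np

def first_upcrossing(x, x0):
    """Index of the first time the series x crosses x0 from below (-1 if never)."""
    hits = np.flatnonzero((x[:-1] < x0) & (x[1:] >= x0))
    return int(hits[0]) + 1 if hits.size else -1

# Stand-in "posterior samples" of the trajectory; in the decoding application
# these would be MCMC (e.g., HMC) samples from p(x | observed spikes).
rng = np.random.default_rng(8)
n_samples, T, x0 = 2000, 200, 1.0
trajectories = np.cumsum(0.1 * rng.standard_normal((n_samples, T)), axis=1)

times = np.array([first_upcrossing(tr, x0) for tr in trajectories])
crossed = times[times >= 0]                 # samples in which the threshold is reached
mean_time = crossed.mean()                  # posterior-mean crossing time
sd_time = crossed.std()                     # posterior uncertainty about it
```

The full array `times` gives the posterior histogram of the crossing time, as in Figure 9, rather than only a point estimate.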

As a side note, to compare the performance of the Gibbs-based winBUGS package employed in Smith et al. (2007) with the HMC chain used here, we also simulated a Gibbs chain on the same posterior distribution. The estimated correlation time of the Gibbs chain was roughly a hundred times larger; that is, Gibbs mixes a hundred times more slowly than the HMC chain here, due to the nonnegligible temporal correlations in *x*_{t} (see Figure 9; recall Figure 3). In addition, due to the state-space nature of the prior on *x*_{t} here, the Hessian of the log posterior over **x** is tridiagonal, and therefore the HMC update requires just O(*T*) time, just like a full Gibbs sweep.

### 5.2. Mutual Information.

Our second example is the calculation of the mutual information. Estimates of the information transfer rates of neural systems, and of the mutual information between the stimulus and the response of some neural population, are essential in the study of neural encoding and decoding (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Warland, Reinagel, & Meister, 1997; Barbieri et al., 2004). Estimating this quantity is often computationally difficult, particularly for high-dimensional stimuli and responses (Paninski, 2003). Pillow et al. (2011), in the companion article, present an easy and efficient method for calculating the mutual information for neurons modeled by the GLM, equations 2.3 and 2.4, based on the Laplace approximation, equation 2.8. As discussed above, this approximation is expected to hold in the case of gaussian priors over a broad region of the GLM parameter space. Our goal here is to verify this intuition by comparing the Laplace approximation for the mutual information with an exact direct estimation using MCMC integration. As we will see, the main difficulty in using MCMC to estimate the mutual information lies in the fact that we can calculate the posterior density only up to an unknown normalization constant. Estimating this unknown constant turns out to be tricky, in that naive methods for calculating it lead to large sampling errors. Below, we use an efficient low-error method, known as bridge sampling, for estimating this constant.

The mutual information can be written, equation 5.1, as the average reduction in the uncertainty (as measured by the entropy, *H*, of the distribution over the stimulus) of an ideal observer having access to the spike trains of the RGCs, compared to its prior state of knowledge about the stimulus. Here, *p*(**r**) is given by equation 2.6, and the posterior probability is given by Bayes' rule, equation 2.5. The logarithms are assumed to be in base 2, so that information is measured in bits. We consider gaussian priors given by equation 2.10, for which we can compute the prior entropy, *H*[**x**], explicitly. Thus, the real problem is to evaluate the second term in equation 5.1, the average posterior entropy, whose defining integral is in general hard to evaluate. One approach, which is computationally very fast, is to use the Laplace approximation, equation 2.8, if it is justified; we took this approach in Pillow et al. (2011), the companion article. In that case, from equation 2.8, we obtain the approximation in equation 5.3, where *J*(**r**) is the Hessian, equation 2.9.

The posterior entropy (before averaging over *p*(**r**)) has the form of equation 5.4, that is, one representing the posterior expectation of a function *g*(**x**). If we could evaluate *g*(**x**) for arbitrary **x**, we could evaluate this expectation by the MCMC method, via equation 3.2. As we mentioned above, the difficulty lies in that in general, we can evaluate only an unnormalized version of the posterior distribution, and thus *g*(**x**), only up to an additive constant. Suppose we can evaluate the unnormalized posterior, differing from the true posterior by the normalization constant *Z*(**r**), at any arbitrary **x**. Then the posterior entropy can be rewritten as in equation 5.6, where, from the normalization condition for the posterior, *Z*(**r**) is given by equation 5.7. The main difficulty in calculating the mutual information lies in estimating *Z*(**r**) (for a discussion of the difficulties involved in estimating normalization constants and marginal probabilities, see Meng & Wong, 1996; the discussion of the paper by Newton & Raftery, 1994; and Neal, 2008). By contrast, the second term in equation 5.6 already has the form of equation 5.4 (with the unnormalized log posterior replacing *g*(**x**)) and can be estimated using equation 3.2. In the following, we introduce an efficient method for estimating *Z*(**r**).

*r**p*(

**), , calculated using equation 5.13, gives the correction to the Laplace approximation for the mutual information, equation 5.3. When the Laplace approximation is justified, this correction will be small (even before averaging over**

*r**p*(

**)). Also, note that in that case , and the last term in equation 5.13 is small on its own.**

A method of evaluating this normalizer ratio using equation 5.15 by employing an optimal bridge function was originally developed by Bennett (1976) and further refined by Meng and Wong (1996); it is referred to as bridge sampling for the above reason. These authors showed that the asymptotically optimal choice (for a large number of samples from each distribution) is given by equation 5.16, where *s _{i}* = *N _{i}*/(*N _{1}* + *N _{2}*) (*i* = 1, 2) and *N*_{1,2} are the numbers of samples drawn from the two distributions, respectively. As this optimal choice itself depends on the unknown ratio, it suggests an iterative solution, in which we substitute equation 5.16 with the current estimate into equation 5.15 to obtain the next iterate; here **x**_{ij} (*i* = 1, 2) are the samples drawn from the two distributions, respectively. Since we expect the true value to lie close to its Laplace approximation, we take the latter as the starting point of the iteration. In our calculations, we stopped the bridge sampling iterations when the change per iteration dropped below a small tolerance scaled by *d*, the stimulus dimension.
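To make the iteration concrete, here is a toy sketch of iterated optimal-bridge sampling in the spirit of equations 5.15 and 5.16 (our own illustrative code, not the article's implementation; `bridge_ratio` and the one-dimensional gaussian test densities are hypothetical choices). The ratio of the normalizers of N(0, 1) and N(0, σ²), namely 1/σ, is known exactly, so the estimate can be checked:

```python
import math
import random

def bridge_ratio(logq1, logq2, x1, x2, iters=20):
    """Iterated optimal-bridge estimate of r = Z1/Z2, given samples
    x1 ~ q1/Z1 and x2 ~ q2/Z2 and the unnormalized log densities."""
    n1, n2 = len(x1), len(x2)
    s1, s2 = n1 / (n1 + n2), n2 / (n1 + n2)
    r = 1.0  # crude starting point for the fixed-point iteration
    for _ in range(iters):
        num = sum(math.exp(logq1(x)) /
                  (s1 * math.exp(logq1(x)) + s2 * r * math.exp(logq2(x)))
                  for x in x2) / n2
        den = sum(math.exp(logq2(x)) /
                  (s1 * math.exp(logq1(x)) + s2 * r * math.exp(logq2(x)))
                  for x in x1) / n1
        r = num / den
    return r

rng = random.Random(0)
sigma = 2.0
x1 = [rng.gauss(0.0, 1.0) for _ in range(5000)]    # samples from N(0, 1)
x2 = [rng.gauss(0.0, sigma) for _ in range(5000)]  # samples from N(0, sigma^2)
# unnormalized densities: the 1/sqrt(2*pi*var) factors are dropped entirely
r_hat = bridge_ratio(lambda x: -0.5 * x * x,
                     lambda x: -0.5 * x * x / sigma ** 2, x1, x2)
# true ratio of normalizers: sqrt(2*pi) / (sigma * sqrt(2*pi)) = 1/sigma = 0.5
```

Starting from a crude initial guess, the fixed-point iteration converges in a handful of steps, and with 5,000 samples from each density the estimate lands close to the true ratio of 0.5.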

Figure 10 shows a plot of the Laplace approximation to the information and its correction per stimulus dimension, calculated as described above,^{15} as a function of the standard deviation of the filtered stimulus input (which, for the white noise stimulus, is the contrast, *c*, times the magnitude of *k _{i}*(*t*)). The information grows with this standard deviation, but the correction does not change significantly and remains small. Thus, the Laplace approximation for the mutual information is accurate for moderately large filtered inputs. Furthermore, for vanishing input, the posterior is equal to the gaussian prior, and in that case the approximation is exact. Therefore, the error has a maximum at a finite value of the stimulus filter strength, away from which the Laplace approximation is accurate. For comparison with our real spike data example presented in section 4 and Figure 8, we note that there the standard deviation of the filtered stimulus to different cells fell, depending on the cell, within the moderate range in which the Laplace approximation did indeed provide an accurate approximation for the mutual information.

## 6. Effect of Uncertainty in the Model Parameters

In the presence of uncertainty in the GLM parameters, the posterior distribution over the stimulus, **x**, is modified to the marginalized form of equation 6.1. (In this section, unlike in sections 3 to 5, we write the model parameters explicitly when a distribution is conditioned on them; when the parameters do not appear in the argument of a distribution, they have been marginalized over.) We assume the uncertainty in the parameters is small enough that a Laplace approximation for their distribution applies; that is, it can be taken to be gaussian, with mean given by the maximum likelihood fit to the data and a small covariance given by the inverse Hessian of the negative log likelihood (as a function of the GLM parameters, given the experimental data) at that fit. For simplicity, we assume the GLM nonlinearity (see equation 2.3) is exponential: *f*(*u*) = exp(*u*). We also assume that the prior stimulus ensemble is gaussian, with probability distribution described by equation 2.10.

Parameter uncertainty broadens the likelihood (as a function of **x**), and therefore we expect that as it increases, the posterior estimate will move toward the prior mean (in our case, zero). Intuitively, this is because as the Bayesian decoder's knowledge of the encoding mechanism (represented by the GLM parameters) decreases, it discounts the information that the observed spike train, **r**, carries about the stimulus and instead relies more strongly on its prior information. To verify this intuition analytically, we consider the case where the Laplace approximation for the posterior stimulus distribution holds (as we saw in the previous section, this approximation is often quite adequate in the case of gaussian priors). Assuming this, we can replace the conditional posterior with its gaussian approximation in equation 6.1 and obtain equation 6.2. In the following, we drop **r** from the arguments of distributions when it is understood, and we write the average over the parameter uncertainty in equation 6.2 as an expectation over the gaussian parameter distribution.

We expand around the prior mean, **x** = 0. Assuming the relevant matrix is invertible,^{16} we then obtain equation 6.11. Writing the parameters as their mean plus a zero-mean random perturbation, we expand equation 6.11 in this perturbation up to second order, collecting the terms that are homogeneously of order *n* in it; the first- (second-) order random variations of the quantities involved are defined accordingly. The first-order variations are gaussian with zero mean and a covariance determined by the covariance of the perturbation. After averaging over the perturbation, the first-order term vanishes, leaving the second-order contribution.

We now specialize to the case in which only the stimulus filter *K _{i}* is uncertain. Using equation 6.13 for this case, we obtain the change in the estimate to the first nonvanishing order; in particular, the change in the *L*^{2} norm of the estimate is negative, where the inequality follows from the fact that the operators involved are positive definite. Thus, we see that, at least in the special regime we considered, parameter uncertainty shrinks the norm of the posterior mean estimate, sending it toward the prior mean at the origin. This result agrees with the intuition stated above and was corroborated by our numerical results in more general parameter regimes.
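The shrinkage effect is easy to reproduce in a toy analogue (our own sketch, using a scalar linear-gaussian model rather than the GLM, with all names hypothetical): for an observation y = g·x + noise with an uncertain gain g ~ N(1, s²) marginalized numerically, the posterior mean of x moves toward the prior mean of zero as s grows.

```python
import math

def toy_posterior_mean(y, obs_var, prior_var, gain_sd, ngrid=801):
    """E[x | y] for y = g*x + eps, x ~ N(0, prior_var), eps ~ N(0, obs_var),
    with an uncertain gain g ~ N(1, gain_sd^2) marginalized on a grid."""
    if gain_sd == 0.0:
        # known gain g = 1: standard gaussian posterior mean
        return prior_var * y / (prior_var + obs_var)
    num = den = 0.0
    for k in range(ngrid):
        g = 1.0 + gain_sd * (-4.0 + 8.0 * k / (ngrid - 1))  # +/- 4 sd grid
        v = g * g * prior_var + obs_var                     # var of y given g
        w = (math.exp(-0.5 * (g - 1.0) ** 2 / gain_sd ** 2)
             * math.exp(-0.5 * y * y / v) / math.sqrt(v))   # p(g) * p(y | g)
        num += w * g * prior_var * y / v                    # w * E[x | y, g]
        den += w
    return num / den

certain = toy_posterior_mean(1.0, 0.5, 1.0, 0.0)      # known gain
uncertain = toy_posterior_mean(1.0, 0.5, 1.0, 0.75)   # uncertain gain
```

With y = 1, observation variance 0.5, and prior variance 1, the certain-gain posterior mean is 2/3; setting s = 0.75 pulls the marginal posterior mean below it while preserving its sign, mirroring the shrinkage argued for above.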

We also studied numerically the case of general uncertainty in the stimulus filters *K _{i}*; here, the perturbation was not constrained to be proportional to the filter itself. However, again, as uncertainty in the model parameters increases, leading to a broadening of the likelihood, the posterior mean moves toward the prior mean.

## 7. Discussion

Markov chain Monte Carlo allows the calculation of general, fully Bayesian posterior estimates. The main goal of this article was to survey the performance of a number of efficient MCMC algorithms in the context of model-based neural decoding of spike trains. Using these methods, we also verified and extended the results of Pillow et al. (2011) in the companion article on MAP-based decoding and information estimation, via the Laplace approximation, in GLM-based neural decoding problems. Although MCMC integration is more general in this sense, it is at the same time significantly more computationally expensive than the optimization algorithms used to find the MAP. As we explained in section 2, the MAP is in general a good estimator when the Laplace approximation is accurate. The MAP also comes with natural error bars, estimated through the Hessian matrix of the log posterior at the MAP, equation 2.9. Furthermore, when it is valid, this approximation provides a very efficient way of estimating the mutual information through equation 5.3. Thus, it is important to know clearly when this approximation holds, since when it does, it can be exploited to dramatically reduce the computational cost of stimulus decoding or information estimation.
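As a one-dimensional illustration of MAP estimation with Hessian-based error bars (a toy sketch of ours, not code from the article): for a single Poisson count *y* with exponential nonlinearity and a standard gaussian prior, the log posterior is y·x − eˣ − x²/2. Newton's method finds the MAP, and the inverse curvature there supplies a Laplace error bar, which can be checked against the numerically computed posterior moments:

```python
import math

# toy decoding problem: one Poisson count y, rate exp(x), prior x ~ N(0, 1)
y = 3.0
def logpost(x):
    return y * x - math.exp(x) - 0.5 * x * x  # concave in x

# Newton's method for the MAP
x_map = 0.0
for _ in range(50):
    grad = y - math.exp(x_map) - x_map
    hess = -math.exp(x_map) - 1.0
    x_map -= grad / hess
sd_laplace = 1.0 / math.sqrt(math.exp(x_map) + 1.0)  # Hessian-based error bar

# numeric posterior moments on a dense grid, for comparison
xs = [-8.0 + 0.004 * i for i in range(4001)]
ws = [math.exp(logpost(t)) for t in xs]
z = sum(ws)
mean = sum(t * w for t, w in zip(xs, ws)) / z
sd = math.sqrt(sum((t - mean) ** 2 * w for t, w in zip(xs, ws)) / z)
```

Here the Laplace error bar is close to the true posterior standard deviation, while the slight negative skew of the posterior (the third derivative of the log posterior is −eˣ) pulls the true mean a little below the MAP, an instance of the MAP/posterior-mean discrepancies discussed in section 4.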

In section 3, we introduced the RWM, HMC, Gibbs, and hit-and-run Markov chains, all special cases of the Metropolis-Hastings algorithm. Although these methods allow sampling from general posterior distributions, regardless of the forward model, we also took advantage of the specific properties of the distributions involved in our GLM-based decoding to increase the efficiency of these chains. The ARS algorithm, which exploits the log concavity of the GLM likelihood and the prior distribution, was used to significantly reduce the computational cost of the one-dimensional sampling in each hit-and-run step. We took advantage of the Laplace approximation (or a regularized version of it in the flat prior case) to shape the proposal distributions to roughly match the covariance structure of the underlying distribution. Furthermore, we were able to carry this out in computational time scaling only linearly with the stimulus duration, *T*, by exploiting the bandedness of the log-posterior Hessian in these settings. To the best of our knowledge, the use of Laplace-enhanced HMC in neural applications is novel. Similarly, although the hit-and-run algorithm is well known in the statistics literature, we are unaware of any previous application of it in the context of high-dimensional neural decoding.
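To illustrate why a banded Hessian yields linear-time proposals, here is a minimal sketch (hypothetical toy code of ours, not the article's implementation) for the simplest banded case, a tridiagonal precision matrix *J*: the Cholesky factor is computed in O(*T*), and a draw from N(0, *J*⁻¹), the basic ingredient of a Laplace-shaped proposal, is obtained by one back-substitution.

```python
import math
import random

def chol_tridiag(diag, off):
    """O(T) Cholesky factorization J = L L^T of a tridiagonal symmetric
    positive-definite matrix, given its diagonal and off-diagonal."""
    n = len(diag)
    ld, lo = [0.0] * n, [0.0] * (n - 1)
    ld[0] = math.sqrt(diag[0])
    for i in range(1, n):
        lo[i - 1] = off[i - 1] / ld[i - 1]           # subdiagonal of L
        ld[i] = math.sqrt(diag[i] - lo[i - 1] ** 2)  # diagonal of L
    return ld, lo

def sample_zero_mean(ld, lo, rng):
    """Draw x ~ N(0, J^{-1}) by solving L^T x = z with z ~ N(0, I):
    cov(x) = L^{-T} L^{-1} = (L L^T)^{-1} = J^{-1}."""
    n = len(ld)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [0.0] * n
    x[-1] = z[-1] / ld[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (z[i] - lo[i] * x[i + 1]) / ld[i]  # back-substitution
    return x

# a chain-structured precision matrix of size T = 1000, factored in O(T)
T = 1000
ld, lo = chol_tridiag([2.0] * T, [-1.0] * (T - 1))
proposal = sample_zero_mean(ld, lo, random.Random(0))  # one correlated draw
```

The same idea extends to any fixed bandwidth, which is what makes Laplace-based proposal shaping affordable for long stimuli.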

We mention that these chains with Laplace-based enhancement can also be implemented in decoding posterior distributions based on state-space models with Markovian structure; an example of such an application was presented in Figure 9, based on the state-space model used in Smith et al. (2007). However, in cases where the log posterior turns out to be nonconcave, obtaining the Laplace approximation may be infeasible, or it may not improve the chain's mixing. Although MCMC without this enhancement is still applicable in such cases, other methods, such as sequential Monte Carlo (“particle filtering”; Doucet, de Freitas, & Gordon, 2001; Brockwell et al., 2004; Kelly & Lee, 2004; Godsill, Doucet, & West, 2004; Shoham et al., 2005; Ergun et al., 2007; Vogelstein et al., in press; Huys & Paninski, 2009), which are applicable only in models with Markovian structure, may prove to be more efficient.
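For readers unfamiliar with sequential Monte Carlo, here is a bootstrap particle filter sketch on a toy linear-gaussian state-space model (hypothetical code of ours; in the linear-gaussian case the exact answer is available from the Kalman filter, which makes the sketch checkable, whereas in practice one uses particle methods precisely when no such closed form exists):

```python
import math
import random

def bootstrap_pf(ys, a, q, r, n_particles, seed=0):
    """Bootstrap particle filter for x_t = a*x_{t-1} + N(0, q),
    y_t = x_t + N(0, r). Returns the filtered means E[x_t | y_1..t]."""
    rng = random.Random(seed)
    parts = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]  # prior draws
    means = []
    for y in ys:
        parts = [a * p + rng.gauss(0.0, math.sqrt(q)) for p in parts]  # propagate
        ws = [math.exp(-0.5 * (y - p) ** 2 / r) for p in parts]        # weight
        tot = sum(ws)
        means.append(sum(w * p for w, p in zip(ws, parts)) / tot)
        parts = rng.choices(parts, weights=ws, k=n_particles)          # resample
    return means

def kalman_means(ys, a, q, r):
    """Exact filtered means for the same linear-gaussian model."""
    m, v = 0.0, 1.0
    out = []
    for y in ys:
        m, v = a * m, a * a * v + q          # predict
        k = v / (v + r)                      # gain
        m, v = m + k * (y - m), (1 - k) * v  # update
        out.append(m)
    return out

ys = [0.5, 1.0, -0.3, 0.8, 1.2, 0.0, -0.5, 0.3]  # a fixed toy observation sequence
pf = bootstrap_pf(ys, a=0.9, q=0.3, r=0.5, n_particles=3000)
kf = kalman_means(ys, a=0.9, q=0.3, r=0.5)
```

With a few thousand particles, the particle-filter means track the exact Kalman means closely; the appeal of the method is that the propagate/weight/resample loop carries over unchanged to nonlinear, nongaussian observation models such as point-process likelihoods.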

As a further example, consider the BARS model (Wallstrom et al., 2007), in which the observed spike trains, *r*(*t*), are produced by an inhomogeneous Poisson process whose rate is expanded in a cubic B-spline basis, *B _{i}*(*t*), with corresponding spline coefficients. Here, *i* runs from 1 to *k* + 2, where *k* is the number of spline knots; the spline basis functions, *B _{i}*(*t*), implicitly depend on *k* and the knot positions. Conditioned on fixed knot positions and *k*, the prior distribution of the spline coefficients is taken to be gaussian with zero mean and a fixed inverse covariance (a unit information prior). Thus, conditioned on fixed spline knots, the BARS model involves Poisson observations from a gaussian latent variable; this is directly analogous to our GLM model with gaussian stimuli, **x**, with the spline coefficients and **B**(*t*) replacing **x** and *K* in the analogy, respectively. In particular, sampling from the posterior distribution of the a priori gaussian spline coefficients (given the knot positions and *k*) is very similar to sampling from the posterior over the gaussian stimulus, **x**, in the examples in this article. Furthermore, to obtain conditional samples of the coefficients, the BARS code uses an RWM chain (Wallstrom et al., 2007), which, as in equations 3.8 and 3.9, employs the Hessian at the MAP point for the spline coefficients to produce nonisotropic proposals. Using the form of the prior covariance mentioned above and the standard likelihood expression for a Poisson process, the Hessian of the negative log posterior for the coefficients (given *r*(*t*), the knot positions, and *k*) has as its first term the inverse prior covariance, with the remaining term weighted by some positive constant, *a*. Since, by definition, *B _{i}*(*t*) is nonzero only on the interval spanned by a few neighboring knots, we see that *H _{ij}* vanishes when |*i* − *j*| > 3, and hence the Hessian is banded. Again, the bandedness of the Hessian is exploited (Wallstrom et al., 2007) to obtain the RWM proposals efficiently by the method described after equation 3.10. We note that the BARS package could potentially be improved by using a faster-mixing chain such as HMC, which can outperform RWM by orders of magnitude (see Figure 4).
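The bandedness claim is easy to verify numerically. In the sketch below (our own toy code; `bspline` implements the standard Cox-de Boor recursion), cubic B-splines on a uniform knot grid have overlapping support only for |*i* − *j*| ≤ 3, so a Gram-type matrix of basis-function products is banded exactly as stated:

```python
def bspline(i, deg, t, knots):
    """Cox-de Boor recursion for the i-th B-spline basis of a given degree."""
    if deg == 0:
        return 1.0 if knots[i] <= t < knots[i + 1] else 0.0
    out = 0.0
    left = knots[i + deg] - knots[i]
    if left > 0:
        out += (t - knots[i]) / left * bspline(i, deg - 1, t, knots)
    right = knots[i + deg + 1] - knots[i + 1]
    if right > 0:
        out += (knots[i + deg + 1] - t) / right * bspline(i + 1, deg - 1, t, knots)
    return out

knots = list(range(13))   # uniform integer knots 0, 1, ..., 12
nb = len(knots) - 4       # number of cubic (degree-3) basis functions
dt = 0.01
grid = [dt * k for k in range(1200)]
B = [[bspline(i, 3, t, knots) for t in grid] for i in range(nb)]
# overlap ("Gram") matrix H_ij ~ integral of B_i(t) B_j(t) via a Riemann sum;
# it vanishes exactly when the supports [i, i+4) and [j, j+4) are disjoint
H = [[sum(bi * bj for bi, bj in zip(B[i], B[j])) * dt for j in range(nb)]
     for i in range(nb)]
```

The likelihood term of the BARS Hessian involves exactly such products of basis functions, which is why it inherits the same bandwidth.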

We compared the mixing rates of these MCMC chains in sampling from the posterior stimulus distributions of GLM-modeled neurons. In this setting, when the posterior is smooth throughout its support, the HMC algorithm outperformed the other chains by an order of magnitude. When sampling from posteriors based on flat priors with sharp corners, on the other hand, the hit-and-run chain mixed consistently faster than the others.
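Mixing-rate comparisons like these can be quantified by the integrated autocorrelation time, roughly the number of chain steps per effectively independent sample. A self-contained sketch (our own toy code on a one-dimensional gaussian target, not the decoding chains themselves; all names hypothetical):

```python
import math
import random

def rwm_chain(step, n, seed=1):
    """Random-walk Metropolis chain targeting a standard gaussian."""
    rng = random.Random(seed)
    x, out = 0.0, []
    for _ in range(n):
        prop = x + rng.gauss(0.0, step)
        log_alpha = 0.5 * (x * x - prop * prop)  # log acceptance ratio
        if log_alpha >= 0.0 or rng.random() < math.exp(log_alpha):
            x = prop
        out.append(x)
    return out

def iact(chain, max_lag=300):
    """Integrated autocorrelation time from the truncated sample ACF."""
    n = len(chain)
    mu = sum(chain) / n
    c0 = sum((v - mu) ** 2 for v in chain) / n
    tau = 1.0
    for lag in range(1, max_lag):
        c = sum((chain[i] - mu) * (chain[i + lag] - mu)
                for i in range(n - lag)) / (n - lag)
        if c / c0 < 0.05:  # truncate once correlations die out
            break
        tau += 2.0 * c / c0
    return tau

tau_small = iact(rwm_chain(0.2, 10000))  # tiny steps: slow diffusive mixing
tau_tuned = iact(rwm_chain(2.4, 10000))  # well-scaled steps: fast mixing
```

On this target, a poorly scaled random-walk proposal inflates the autocorrelation time by a large factor relative to a well-scaled one, mirroring the qualitative gaps between the chains compared in this article.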

In section 4, we compared the performance of the MAP and the posterior mean, calculated using MCMC, in different settings. In one example, we decoded simulated spike trains (generated in response to gaussian and flat white noise stimuli) over a range of stimulus input strengths and for different numbers of identical cells. We also decoded the filtered stimulus input to six retinal ganglion cells, based on their experimentally recorded spike trains. The average squared errors of the MAP and mean estimates were in general quite close in the case of gaussian stimuli, justifying MAP decoding in that case. In the flat prior case, however, the posterior mean can often have a much smaller average squared error than the MAP.

In section 5, we applied MCMC to the problem of estimating properties of the joint distribution that cannot be obtained from its low-dimensional marginals. In particular, we investigated the reliability of the Laplace approximation for the mutual information between the stimulus and spike trains (model-based calculations of the mutual information with gaussian priors have been previously presented in Barbieri et al., 2004). We found that the Laplace approximation for the mutual information was adequate in the case of gaussian priors, except in a small range of moderate stimulus input strengths.

In section 6, we dealt with the effect of uncertainty in the GLM parameters (e.g., based on fits to experimental data) on decoding. Intuitively, when the forward model parameters become uncertain, information coming from the spike train, and hence the likelihood, becomes less reliable, and the estimate therefore relies more heavily on the prior information. Thus, the posterior mean is expected to revert toward the prior mean as parameter uncertainty increases. We verified this intuition analytically in the special case of localized stimulus filters (with no bandpass filtering) and small baseline firing rates. Our numerics showed that the main systematic effect of increasing parameter uncertainty on the mean estimate is to shrink its magnitude (sending it toward the origin, the prior mean in our case) over a wide range of parameter values.

The methods developed in this article and in Pillow et al. (2011), the companion article, can be used for a variety of applications. In future work we plan to further apply these techniques to other experimental data and to compare different “codebooks” (as mentioned in section 1) based on different reductions of the full spike trains according to their robustness and fidelity.

## Acknowledgments

Thanks to M. Kennel, E. Simoncelli, E.J. Chichilnisky, T. Teravainen, and K. Rahnama Rad for helpful conversations. We are grateful to Y. Ding for her implementation of the bridge sampling method. Y.A. is supported by the Robert Leet and Clara Guthrie Patterson Trust Postdoctoral Fellowship, Bank of America, Trustee. L.P. is supported by NEI R01 EY018003, an Alfred P. Sloan Research Fellowship, and the McKnight Scholar award. J.P. is supported by a Royal Society USA/Canada Research Fellowship.

## Notes

^{1}

The dimension of **x** is thus *d*=*NT*, where *T* is the number of time steps and *N* is the total number of components at each time step.

^{2}

That is, the kernels *k _{i}*(*t*, *n*) and *h _{ij}*(*t*) vanish for *t* < 0.

^{3}

We note that even though the nonlinearity, *f*, has to be an increasing function, with appropriately chosen negative postspike feedback filters, *h _{ii}*, the mean firing rate of the GLM-modeled neurons can still exhibit saturation as a function of the input strength, **x**.

^{4}

That is, for fixed model parameters, it is a concave function of **x**, and for fixed **x**, a concave function of the parameters, but in general it is not jointly concave in both.

^{5}

Let us mention, however, that no first principle dictates that the posterior distribution over a biologically or behaviorally relevant variable (e.g., an external variable that a part of the brain seeks to estimate) should be log concave. In fact, distributions that are not log concave, such as multimodal or very fat-tailed distributions, can be highly relevant in biological settings.

^{6}

White or correlated gaussian priors are often used in neural applications (e.g., gaussian stimuli are widely used in neurophysiological experiments). Flat priors with infinitely sharp boundaries are less biologically motivated. However, flat priors are the best log-concave approximation to binary priors, such as binary white noise and M-sequences, which are also quite common in sensory physiology (Pillow et al., 2008; Reid, Victor, & Shapley, 1997). In this article, we consider the flat prior mainly as a limiting case of concave log priors with sharp derivatives when we check the accuracy of the Laplace approximation and compare the efficiency of various MCMC chains (see section 3.5) in different regimes.

^{7}

We are, of course, interested in calculating posterior expectations corresponding to the decoding setting above, but as the discussion is general, we use generic notation in the rest of this section for ease of exposition.

^{9}

As with the Gibbs case, it can be shown again that this proposal leads to an MH acceptance probability of one. Hence, hit-and-run is also a special case of MH.

^{11}

Although this prior belongs to the class of product densities considered in Roberts and Rosenthal (1998), it does not satisfy the stringent smoothness conditions crucial for the part of their theorem regarding the (fast) mixing of MALA.

^{12}

More precisely, “sharply concentrated” here means that the curvature of the log posterior is large enough that the Taylor expansion involved in the Laplace approximation, equation 2.8, is accurate for deviations from the MAP on a scale determined by the inverse square root of the smallest eigenvalue of the Hessian matrix, equation 2.9.

^{13}

This is obviously not the case, however, for parameter directions along which the data are noninformative and the likelihood function does not vary much. In the RGC case, these correspond to stimulus features (directions in the stimulus space) that fall orthogonal to the cells' spatiotemporal filters *K _{i}*, and are hence filtered out.

^{14}

For a similar reason, a “brute force” method for computing *Z*, such as a simple Monte Carlo integration of the high-dimensional distribution, gives rise to an estimate with slow convergence and a large error (Meng & Wong, 1996).

^{15}

In principle, according to equations 5.1 to 5.3, one has to average *I _{L}*(**r**) and the correction term over the marginal response distribution *p*(**r**). But due to the intensive nature of these quantities and the large *d*, they depended on the specific realization of **r** only weakly, and they were to a good degree self-averaging, so that averaging over *p*(**r**) would not considerably alter the plot of Figure 10. A similar argument appeared in Strong, Koberle, de Ruyter van Steveninck, and Bialek (1998).

*r*^{16}

This is true when the inverse in question is well defined.