## Abstract

Given recent experimental results suggesting that neural circuits may evolve through multiple firing states, we develop a framework for estimating state-dependent neural response properties from spike train data. We modify the traditional hidden Markov model (HMM) framework to incorporate stimulus-driven, non-Poisson point-process observations. For maximal flexibility, we allow external, time-varying stimuli and the neurons’ own spike histories to drive both the spiking behavior in each state and the transitioning behavior between states. We employ an appropriately modified expectation-maximization algorithm to estimate the model parameters. The expectation step is solved by the standard forward-backward algorithm for HMMs. The maximization step reduces to a set of separable concave optimization problems if the model is restricted slightly. We first test our algorithm on simulated data and are able to fully recover the parameters used to generate the data and accurately recapitulate the sequence of hidden states. We then apply our algorithm to a recently published data set in which the observed neuronal ensembles displayed multistate behavior and show that inclusion of spike history information significantly improves the fit of the model. Additionally, we show that a simple reformulation of the state space of the underlying Markov chain allows us to implement a hybrid half-multistate, half-histogram model that may be more appropriate for capturing the complexity of certain data sets than either a simple HMM or a simple peristimulus time histogram model alone.

## 1. Introduction

A standard approach to modeling the stimulus-response relationship of a single neuron is the generalized linear model (GLM), in which the firing rate $\lambda_t$ at time $t$ is given by

$$\lambda_t = f\left(\mathbf{k}^{\top}\mathbf{s}_t\right),$$

where $f$ is a positive, nonlinear function (e.g., the exponential), $\mathbf{s}_t$ is the stimulus input at time $t$ (which can also include spike history and interneuronal effects), and $\mathbf{k}$ is the direction in stimulus space that causes maximal firing (i.e., the preferred stimulus or receptive field of the neuron). Since $\mathbf{k}$ does not change with time, this model assumes that the response function of the neuron is constant throughout the presentation of the stimulus (i.e., the standard GLM is a single-state model that would be unable to capture the experimental results discussed above).

In this article, we propose a generalization of the GLM appropriate for capturing the time-varying stimulus-response properties of neurons in multistate systems. We base our model on the hidden Markov model (HMM) framework (Rabiner, 1989). Specifically, we model the behavior of each cell in each state *n* as a GLM with a state-dependent stimulus filter , where transitions from state to state are governed by a Markov chain whose transition probabilities may also be stimulus dependent. Our model is an extension of previous HMMs applied to neural data (Abeles et al., 1995; Seidemann, Meilijson, Abeles, Bergman, & Vaadia, 1996; Jones, Fontanini, Sadacca, Miller, & Katz, 2007; Chen, Vijayan, Barbieri, Wilson, & Brown, 2009; Tokdar, Xi, Kelly, & Kass, 2009), and is thus an alternative to several of the recently developed linear state-space models (Brown, Nguyen, Frank, Wilson, & Solo, 2001; Smith & Brown, 2003; Eden, Frank, Barbieri, Solo, & Brown, 2004; Kulkarni & Paninski, 2007), which also attempt to capture more of the complexity in the stimulus-response relationship than is possible with a simple GLM.

To infer the most likely parameters of our HMM given an observed spike train, we adapt the standard Baum-Welch expectation-maximization (EM) algorithm (Baum, Petrie, Soules, & Weiss, 1970; Dempster, Laird, & Rubin, 1977) to point-process data with stimulus-dependent transition and observation densities. The E-step here proceeds via a standard forward-backward recursion, while the M-step turns out to consist of a separable set of concave optimization problems if a few reasonable restrictions are placed on the model (Paninski, 2004). The development of EM algorithms for the analysis of point-process data with continuous state-space models has been previously described (Chan & Ledolter, 1995; Smith & Brown, 2003; Kulkarni & Paninski, 2007; Czanner et al., 2008), as has the development of EM algorithms for the analysis of point-process data with discrete state-space models, albeit using Markov chain Monte Carlo techniques to estimate the E-step of the algorithm (Sansom & Thomson, 2001; Chen et al., 2009; Tokdar et al., 2009). Our algorithm, on the other hand, uses a discrete state-space model with inhomogeneous transition and observation densities and allows the posterior probabilities in the E-step to be computed exactly.

This article is organized as follows: Section 2 briefly reviews the basic HMM framework and associated parameter learning algorithm, and then develops our extension of these methods for stimulus-driven multistate neurons. We also introduce an extension that may be appropriate for data sets with spike trains that are triggered by an event (e.g., the beginning of a behavioral trial) but are not driven by a known time-varying stimulus. This extension results in a hybrid half-multistate, half-histogram model. Section 3 presents the results of applying our model and learning procedure to two simulated data sets meant to represent a thalamic relay cell with different tonic and burst firing modes and a cell in sensory cortex that switches between stimulus-attentive and stimulus-ignoring states. In section 4, we analyze a data set from rat gustatory cortex in which multistate effects have previously been noted (Jones et al., 2007), expanding the prior analysis to permit spike-history-dependent effects. Our results show that accounting for history dependence significantly improves the cross-validated performance of the HMM. In section 5 we conclude with a brief discussion of the models and results presented in this article in comparison to other approaches for capturing multistate neuronal behavior.

## 2. Methods

### 2.1. Hidden Markov Model Review.

Before we present our modification of the HMM framework for modeling the stimulus-response relationship of neurons in multistate systems, we briefly review the traditional framework as described in Rabiner (1989). While sections 2.1.1 through 2.1.3 are not specific to neural data, we will note features of the model that we modify in later sections and introduce notation that we use throughout the article.

#### 2.1.1. Model Introduction and Background.

An HMM is composed of two random variables at each time step *t*: the state $q_t$ and the emission $y_t$. Assuming that the state variable $q_t$ can take on one of *N* discrete states {1, …, *N*} and makes a transition at every time step according to fixed transition probabilities (as shown in Figure 1 for *N* = 3), then the states form a homogeneous, discrete-time Markov chain defined by the following two properties. First,

$$p(q_{t+1} \mid q_t, q_{t-1}, \ldots, q_0, y_t, y_{t-1}, \ldots, y_0) = p(q_{t+1} \mid q_t), \tag{2.1}$$

or the future state is independent of past states and emissions given the present state (i.e., the Markov assumption). Thus, the sequence of states, $\{q_0, \ldots, q_T\}$, evolves only with reference to itself, without reference to the sequence of emissions, $\{y_0, \ldots, y_T\}$. Second,

$$p(q_{t+1} = m \mid q_t = n) \equiv \alpha_{nm}, \quad \forall t, \tag{2.2}$$

or the probability of transitioning from state *n* to state *m* is constant (homogeneous) for all time points. All homogeneous, discrete-time Markov chains can then be completely described by $N \times N$ transition matrices $\boldsymbol{\alpha}$ with the constraints that $0 \leq \alpha_{nm} \leq 1$ and $\sum_{m=1}^{N} \alpha_{nm} = 1$. We will relax both the independence of state transition probabilities on past emissions (see equation 2.1) and the homogeneity assumption (see equation 2.2) in our adaptation of the model to allow for spike history dependence and dynamic state transition probabilities, respectively.
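These two properties are easy to make concrete in code. The following sketch (our own illustration, with made-up numbers) samples a state sequence from a homogeneous two-state chain; note that each row of the transition matrix is a probability distribution over next states:

```python
import numpy as np

def sample_chain(alpha, pi, T, rng):
    """Sample q_0, ..., q_{T-1} from a homogeneous Markov chain."""
    N = alpha.shape[0]
    q = np.empty(T, dtype=int)
    q[0] = rng.choice(N, p=pi)                   # q_0 ~ pi
    for t in range(1, T):
        # Markov property: the next state depends on q_{t-1} only
        q[t] = rng.choice(N, p=alpha[q[t - 1]])
    return q

alpha = np.array([[0.9, 0.1],
                  [0.2, 0.8]])                   # rows sum to one
pi = np.array([0.5, 0.5])
q = sample_chain(alpha, pi, 1000, np.random.default_rng(0))
```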

The emissions are modeled analogously: $y_t$ depends only on the current state, with emission probabilities satisfying $0 \leq \eta_{nk} \leq 1$ and $\sum_{k=1}^{K} \eta_{nk} = 1$, where $\eta_{nk} \equiv p(y_t = k \mid q_t = n)$ for a system with *K* discrete emission classes {1, …, *K*}.

The complete parameter set of an HMM is thus $\theta = (\boldsymbol{\alpha}, \boldsymbol{\eta}, \boldsymbol{\pi})$, where the $N \times N$ matrix $\boldsymbol{\alpha}$ and the $N \times K$ matrix $\boldsymbol{\eta}$ are as defined above, and the *N*-element vector $\boldsymbol{\pi}$ is the initial state distribution ($\pi_n \equiv p(q_0 = n)$).

Learning the parameter setting that maximizes the likelihood of an observed emission sequence of length *T* is the goal of model fitting and is the topic of the next section.

#### 2.1.2. The Forward-Backward Algorithm.

In order to find the parameters that maximize the marginal log likelihood, we first need to be able to evaluate this likelihood efficiently. This is solved by the forward-backward algorithm (Baum et al., 1970), which also comprises the E-step of the Baum-Welch algorithm (EM for HMMs).

We define the forward probabilities as $a_n(t) \equiv p(y_0, \ldots, y_t, q_t = n \mid \theta)$, the joint probability of the emission sequence up to time *t* and the probability that at time *t*, the system is in state *n*. The forward probabilities can be calculated recursively by

$$a_n(0) = \pi_n\, \eta_{n, y_0}$$

and

$$a_m(t+1) = \left( \sum_{n=1}^{N} a_n(t)\, \alpha_{nm} \right) \eta_{m, y_{t+1}},$$

which involves $O(N^2 T)$ computation. Marginalizing over the hidden state in the final forward probabilities yields the likelihood

$$p(y_0, \ldots, y_T \mid \theta) = \sum_{n=1}^{N} a_n(T),$$

the log of which is equivalent to equation 2.6.

#### 2.1.3. HMM Expectation-Maximization.

EM alternates between two steps. During the E-step, the posterior distribution over the hidden state sequence, $p(q \mid y, \theta_i)$, is computed using the parameter setting $\theta_i$ from iteration *i*. During the M-step, the next setting of the parameters is found by maximizing the expected complete log likelihood (ECLL) with respect to the parameters, where the expectation is taken over the posterior distribution resulting from the E-step:

$$\theta_{i+1} = \operatorname*{argmax}_{\theta} \; \mathbb{E}_{p(q \mid y, \theta_i)}\!\left[ \log p(q, y \mid \theta) \right].$$

While EM is guaranteed to increase the likelihood with each iteration of the procedure, it is susceptible to being trapped in local optima and may not converge as rapidly as other procedures (Salakhutdinov, Roweis, & Ghahramani, 2003).

### 2.2. HMMs Modified for Stimulus-Driven Neural Response Data.

We develop an HMM to model spike train data produced by neurons that transition between several hidden neuronal states. In the most general case, we assume that an external stimulus is driving the neurons’ firing patterns within each state, as well as the transitions between states. We further extend the model to allow spike history effects such as refractory periods and burst activity. Although, for notational simplicity, we initially develop the model assuming that the data consist of a single spike train recorded from a single neuron, in section 2.2.5 we show that this framework can be easily extended to the multicell and multitrial setting.

#### 2.2.1. Incorporating Point-Process Observations.

To adapt the HMM framework to spike train data, we replace the *K* discrete emission classes with point-process observations by assuming that, conditioned on the state, the spike count in each time step is Poisson distributed with a state-dependent firing rate λ_{n}. Thus, each row of $\boldsymbol{\eta}$ becomes the Poisson distribution corresponding to each state,

$$\eta_{ni} = \frac{e^{-\lambda_n\,dt}\,(\lambda_n\,dt)^i}{i!}, \tag{2.20}$$

where λ_{n} is the *n*th state's firing rate, η_{ni} is the probability of observing *i* spikes during some time step given that the neuron is in state *n*, and *dt* is the time step duration (Abeles et al., 1995).

We would similarly like to parameterize the transition matrix in terms of the rates of transitioning from state *n* to state *m* during a single time step. Therefore, we use the following model:

$$\alpha_{nm} = \frac{\lambda'_{nm}\,dt}{1 + \sum_{l \neq n} \lambda'_{nl}\,dt} \;\; (m \neq n), \qquad \alpha_{nn} = \frac{1}{1 + \sum_{l \neq n} \lambda'_{nl}\,dt},$$

where λ′_{nm} is the “pseudo-rate” of transitioning from state *n* to state *m*. (Throughout the article, the ′ notation is used to denote rates and parameters associated with transitioning as opposed to spiking. Here, for example, λ′ is a transition rate, while λ is a firing rate.) This definition of $\boldsymbol{\alpha}$ is convenient because it restricts transitions to at most one per time step (i.e., $\alpha_{nm}$ is $O(dt)$ if *m* ≠ *n*) and guarantees that the rows of $\boldsymbol{\alpha}$ sum to one. Furthermore, in the limit of small *dt*, the pseudo-rates become true rates (i.e., the probabilities of transitioning become proportional to the rates):

$$\lim_{dt \to 0} \frac{\alpha_{nm}}{dt} = \lambda'_{nm}, \quad m \neq n.$$
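The pseudo-rate construction can be sketched as follows (our own illustration with made-up rates; the 3 Hz and 7 Hz values echo the two-state example of section 3.2):

```python
import numpy as np

def transition_matrix(pseudo_rates, dt):
    """Build alpha from an N x N array of pseudo-rates lambda'_nm
    (diagonal entries are ignored). Each row sums to one, and off-diagonal
    entries are O(dt), so at most one transition occurs per time step."""
    N = pseudo_rates.shape[0]
    A = np.zeros((N, N))
    for n in range(N):
        denom = 1.0 + sum(pseudo_rates[n, m] * dt for m in range(N) if m != n)
        for m in range(N):
            A[n, m] = pseudo_rates[n, m] * dt / denom if m != n else 1.0 / denom
    return A

rates = np.array([[0.0, 3.0],
                  [7.0, 0.0]])   # pseudo-rates in Hz (made-up values)
A = transition_matrix(rates, dt=0.002)
```

In the small-`dt` limit, `A[n, m] / dt` recovers the pseudo-rate, as in the equation above.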

#### 2.2.2. Incorporating Stimulus and Spike History Dependence.

In our model we permit the spike trains to be dependent on an external, time-varying stimulus , where is the stimulus at time *t*. The vector has a length equal to the dimensionality of the stimulus. For example, if the stimulus is a 10 × 10 pixel image patch, then would be a 100-element vector corresponding to the pixels of the patch. In the general case, can also include past stimulus information.

To make the model stimulus dependent, we allow the pseudo-rates and firing rates to vary with the stimulus through linear-nonlinear cascades:

$$\lambda'_{nm,t} = g\left(\mathbf{k}'^{\top}_{nm}\mathbf{s}_t + b'_{nm}\right) \tag{2.23}$$

and

$$\lambda_{n,t} = f\left(\mathbf{k}^{\top}_{n}\mathbf{s}_t + b_n\right), \tag{2.24}$$

where *g* and *f* are nonlinear rate functions mapping real scalar inputs to nonnegative scalar outputs. In the absence of a stimulus (i.e., when $\mathbf{s}_t = \mathbf{0}$), the bias terms *b*′_{nm} and *b*_{n} determine the background transitioning and firing rates as *g*(*b*′_{nm}) and *f*(*b*_{n}), respectively. It is possible to simplify the notation by augmenting the filter and stimulus vectors according to

$$\mathbf{k} \leftarrow \begin{pmatrix} b \\ \mathbf{k} \end{pmatrix} \tag{2.25}$$

and

$$\mathbf{s}_t \leftarrow \begin{pmatrix} 1 \\ \mathbf{s}_t \end{pmatrix}. \tag{2.26}$$

Then equations 2.23 and 2.24 reduce to

$$\lambda'_{nm,t} = g\left(\mathbf{k}'^{\top}_{nm}\mathbf{s}_t\right) \tag{2.27}$$

and

$$\lambda_{n,t} = f\left(\mathbf{k}^{\top}_{n}\mathbf{s}_t\right). \tag{2.28}$$

The stimulus filters for firing $\{\mathbf{k}_n\}$ are the *N* preferred stimuli or receptive fields associated with each of the *N* states of the neuron. In the degenerate case where *N* = 1, the model reduces to a standard GLM, and $\mathbf{k}$ becomes the canonical receptive field. The stimulus filters for transitioning $\{\mathbf{k}'_{nm}\}$ are, by analogy, “receptive fields” for transitioning, and since there are *N*(*N* − 1) of these, there are *N*^{2} total transition and firing stimulus filters describing the full model. This stimulus-dependent HMM is represented graphically in Figure 2b.

To introduce spike history dependence, we define $\boldsymbol{\gamma}_t$, the vector of the spike counts in the τ time steps preceding *t*:

$$\boldsymbol{\gamma}_t \equiv \left(y_{t-1}, \ldots, y_{t-\tau}\right)^{\top}. \tag{2.29}$$

Then the transition and firing rate equations are modified by additional linear terms as

$$\lambda'_{nm,t} = g\left(\mathbf{k}'^{\top}_{nm}\mathbf{s}_t + \mathbf{h}'^{\top}_{nm}\boldsymbol{\gamma}_t\right) \tag{2.30}$$

and

$$\lambda_{n,t} = f\left(\mathbf{k}^{\top}_{n}\mathbf{s}_t + \mathbf{h}^{\top}_{n}\boldsymbol{\gamma}_t\right), \tag{2.31}$$

where $\{\mathbf{h}'_{nm}\}$ and $\{\mathbf{h}_n\}$ are weight vectors or linear filters that describe the neuron's preferred spike history patterns for transitioning and firing, respectively. The effect of adding history dependence to the rate equations is captured in Figure 2c.

As with the stimulus filters, there are *N* history filters for spiking and *N*(*N* − 1) history filters for transitioning, or *N*^{2} history filters in total. Thus, adding history dependence introduces τ*N*^{2} additional parameters to the model, and if *dt* is much smaller than the maximal duration of history effects, τ can be large, which can lead to a significant increase in the number of parameters. One way to reduce the number of parameters associated with history dependence is to assume that the history filters are linear combinations of *H* fixed basis filters $\{\mathbf{e}_1, \ldots, \mathbf{e}_H\}$, where *H* < τ. These basis filters could, for example, be exponentials with appropriately chosen time constants. We can then redefine $\mathbf{h}$ to be the *H*-element vector of coefficients corresponding to the linear combination composing the history filter rather than the history filter itself. In this formulation, the spike history data vector is redefined as

$$\boldsymbol{\gamma}_t \equiv \left( \sum_{j=1}^{\tau} e_{1j}\, y_{t-j}, \; \ldots, \; \sum_{j=1}^{\tau} e_{Hj}\, y_{t-j} \right)^{\top}, \tag{2.32}$$

while the transition and firing rate equations remain unchanged (equations 2.30 and 2.31, respectively).

Since either choice of spike history dependence simply adds linear terms to the rate equations, and since either formulation of $\boldsymbol{\gamma}_t$ can be precomputed directly from the spike train with equations 2.29 and 2.32, we can safely augment $\mathbf{s}_t$ and $\mathbf{k}$ with $\boldsymbol{\gamma}_t$ and $\mathbf{h}$, as in equations 2.25 and 2.26. Thus, for the remainder of this article, without loss of generality, we will consider only equations 2.27 and 2.28 for both history-dependent and history-independent models.
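A sketch of the basis-filter parameterization (our own illustration; the soft-rectifier nonlinearity is an example of a convex, log-concave choice, and the 2, 4, and 8 ms time constants are borrowed from section 3.2):

```python
import numpy as np

tau, dt = 50, 0.002
time_constants = np.array([0.002, 0.004, 0.008])          # H = 3 basis filters
lags = np.arange(1, tau + 1) * dt                         # lag of y_{t-j}
basis = np.exp(-lags[None, :] / time_constants[:, None])  # H x tau

def history_vector(spikes, t):
    """gamma_t: the H basis-filtered sums of the tau spike counts before t."""
    past = spikes[max(0, t - tau):t][::-1]                # y_{t-1} first
    past = np.pad(past, (0, tau - len(past)))
    return basis @ past

def firing_rate(k, s, h, gamma):
    """f(k^T s + h^T gamma), with the soft-rectifier as one example of a
    convex, log-concave nonlinearity (see section 2.2.6)."""
    return np.log1p(np.exp(k @ s + h @ gamma))

spikes = np.zeros(200, dtype=int)
spikes[100] = 1
gamma_recent = history_vector(spikes, 101)   # one step after the spike
gamma_late = history_vector(spikes, 150)     # 49 steps after the spike
```

The basis-filtered history vector decays with time since the last spike, so only `H` coefficients per filter need to be learned rather than `tau`.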

#### 2.2.3. Summary of Model.

In summary, our model is an HMM with inhomogeneous transition and emission matrices determined by the time-varying rates λ′_{t} and λ_{t}, which in turn are calculated from linear-nonlinear filterings of the stimulus $\mathbf{s}_t$ and the spike history $\boldsymbol{\gamma}_t$. Specifically, the transition matrix in the final model is

$$\alpha_{nm,t} = \frac{\lambda'_{nm,t}\,dt}{1 + \sum_{l \neq n} \lambda'_{nl,t}\,dt} \;\; (m \neq n), \qquad \alpha_{nn,t} = \frac{1}{1 + \sum_{l \neq n} \lambda'_{nl,t}\,dt},$$

and the emission matrix is

$$\eta_{ni,t} = \frac{e^{-\lambda_{n,t}\,dt}\left(\lambda_{n,t}\,dt\right)^{i}}{i!}.$$

Therefore, with *N* hidden states, the parameters of the model are the *N*(*N* − 1) transition filters $\{\mathbf{k}'_{nm}\}$, the *N* spiking filters $\{\mathbf{k}_n\}$, and the initial state distribution $\boldsymbol{\pi}$. Since the number of parameters grows quadratically with *N*, it may be desirable to consider reduced-parameter models in some contexts (see appendix A for discussion). The filters $\{\mathbf{k}_n\}$ are the state-specific receptive fields (and possible history filters) of the model neuron, while the filters $\{\mathbf{k}'_{nm}\}$ are the “receptive fields” describing how the stimulus (and possibly spike history) influences the state transition dynamics.

#### 2.2.4. Parameter Learning with Baum-Welch EM.

The E-step proceeds essentially as in the standard setting: the forward-backward recursions of section 2.1.2 place no restrictions on the homogeneity of the transition and emission matrices, so they can be applied directly with the time-varying matrices $\boldsymbol{\alpha}_t$ and $\boldsymbol{\eta}_t$. For the M-step, if the nonlinearities *g* and *f* are chosen from a restricted set of functions, it is possible to ensure that the ECLL is concave and smooth with respect to the parameters of the model $\{\mathbf{k}'_{nm}\}$ and $\{\mathbf{k}_n\}$, and therefore each M-step has a global maximizer that can be easily found with a gradient ascent technique. The appropriate choices of *g* and *f* are discussed in section 2.2.6.

#### 2.2.5. Modeling Multicell and Multitrial Data.

One major motivation for the application of the HMM framework to neural data is that the hidden variable can be thought of as representing the overall state of the neural network from which the data are recorded. Thus, if multiple spike trains are simultaneously observed (e.g., with tetrodes or multielectrode arrays), an HMM can be used to model the correlated activity between the single units (under the assumption that each of the behaviors of the single units depends on the hidden state of the entire network as in Abeles et al., 1995; Seidemann et al., 1996; Gat, Tishby, & Abeles, 1997; Yu et al., 2006; Jones et al., 2007; Kulkarni & Paninski, 2007). Additionally, if the same stimulus is repeated to the same experimental preparation, the data collected on each trial can be combined to improve parameter estimation. In this section, we provide the extension of our stimulus- and history-dependent framework to the regime of data sets with *C* simultaneously recorded spike trains and *R* independent trials.

In the multicell setting, we redefine the emission at time *t* to be the vector $\mathbf{y}_t = (y^{1}_{t}, \ldots, y^{C}_{t})^{\top}$, with each *y*^{c}_{t} being the number of spikes fired by cell *c* in time-step *t*. We now consider the joint probability of the spiking behavior of all *C* cells at time *t* conditioned on state *q*_{t}, or $p(\mathbf{y}_t \mid q_t = n)$. Assuming that the cells are conditionally independent given the state, we factorize

$$p(\mathbf{y}_t \mid q_t = n) = \prod_{c=1}^{C} \eta^{c}_{n, y^{c}_{t}, t},$$

where each cell-specific emission matrix is defined according to the Poisson distribution as before (see equation 2.20):

$$\eta^{c}_{ni,t} = \frac{e^{-\lambda^{c}_{n,t}\,dt}\left(\lambda^{c}_{n,t}\,dt\right)^{i}}{i!},$$

with λ^{c}_{n,t} as the state- and cell-specific firing rate for cell *c* in state *n*, given the observed stimulus and past spike history. The time-varying rates also retain their definitions from the single-cell setting (see equation 2.28):

$$\lambda^{c}_{n,t} = f\left(\mathbf{k}^{c\top}_{n}\mathbf{s}_t\right).$$

Note that the number of transition filters for the multicell model is unchanged (*N*^{2} − *N*), but that the number of spiking filters is increased from *N* to *NC* (i.e., there is one spiking filter per state per cell).
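The conditional-independence factorization can be sketched directly (an illustration with made-up counts and rates, not the authors' code):

```python
import math

def poisson_pmf(i, rate, dt):
    """p(i spikes in a bin of width dt) for firing rate `rate`."""
    mu = rate * dt
    return math.exp(-mu) * mu**i / math.factorial(i)

def joint_emission_prob(counts, rates, dt):
    """p(y_t^1, ..., y_t^C | q_t = n): product of per-cell Poisson terms,
    one factor per cell, each with its own state-specific rate."""
    p = 1.0
    for y_c, lam_c in zip(counts, rates):
        p *= poisson_pmf(y_c, lam_c, dt)
    return p

# Example: three cells with state-specific rates (made-up numbers).
p = joint_emission_prob([1, 0, 2], [40.0, 10.0, 60.0], dt=0.002)
```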

If we assume that each of the *R* trials is independent of the rest, then the total log likelihood for all of the trials is simply the sum of the log likelihoods for each of the individual trials. Thus, to get the total log likelihood, the forward-backward algorithm is run on each trial *r* separately, and the resultant trial-specific log likelihoods are summed. The M-step is similarly modified, as the total ECLL is again just the sum of the trial-specific ECLLs:

$$\sum_{r=1}^{R} \left[ \sum_{n} p(q^{r}_{0} = n) \log \pi_n + \sum_{n} \sum_{t} \sum_{m} p(q^{r}_{t} = n,\, q^{r}_{t+1} = m) \log \alpha_{nm,t} + \sum_{c} \sum_{n} \sum_{t} p(q^{r}_{t} = n) \log \eta^{c}_{n, y^{c,r}_{t}, t} \right], \tag{2.44}$$

where $p(q^{r}_{t} = n)$ and $p(q^{r}_{t} = n,\, q^{r}_{t+1} = m)$ are the trial-specific single- and consecutive-pairwise marginals of the posterior distribution over the state sequence given by the forward-backward algorithm applied to each trial during the E-step, and *y*^{c,r}_{t} is the number of spikes by cell *c* in the *t*th time step of the *r*th trial. The M-step update for the start-state distribution (see equation 2.19) is modified as

$$\pi_n = \frac{1}{R} \sum_{r=1}^{R} p(q^{r}_{0} = n).$$

To update the transition and spiking filters, gradient ascent is performed as in the single trial setting, except that the trial-specific gradients and Hessians for each filter are simply summed to give the complete gradients and Hessians. Note that the order of the sums in equation 2.44 represents the fact that the parameters that determine the transitioning behavior away from each state *n* are independent of each other, as are the parameters that determine the spiking behavior for each cell *c*, and so these sets of parameters can be updated independently during each M-step.

#### 2.2.6. Guaranteeing the Concavity of the M-step.

As Paninski (2004) noted for the standard GLM model, the ECLL in equation 2.38 depends on the model parameters through the spiking nonlinearity *f* via a sum of terms involving log *f*(*u*) and −*f*(*u*). Since the sum of concave functions is concave, the concavity of the ECLL will be ensured if we constrain log *f*(*u*) to be a concave function and *f*(*u*) to be a convex function of its argument *u*. Example nonlinearities that satisfy these log concavity and convexity constraints include the exponential, rectified linear, and rectified power law functions.
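These two conditions are easy to verify numerically for candidate nonlinearities. The sketch below (our own check, not from the paper) confirms via second differences that the exponential and the soft-rectifier log(1 + *e*^{u}) are convex with concave logarithms on a test grid:

```python
import numpy as np

u = np.linspace(-5.0, 5.0, 2001)
du = u[1] - u[0]

def second_diff(vals):
    """Finite-difference estimate of the second derivative."""
    return (vals[2:] - 2.0 * vals[1:-1] + vals[:-2]) / du**2

for f in (np.exp, lambda x: np.log1p(np.exp(x))):
    fu = f(u)
    assert np.all(second_diff(fu) >= -1e-8)            # f is convex
    assert np.all(second_diff(np.log(fu)) <= 1e-8)     # log f is concave
```

The exponential sits exactly at the boundary of the log-concavity constraint (log *e*^{u} = *u* is linear), which is why it also satisfies the log-convexity condition required of the transition nonlinearity below.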

The concavity constraints for the transition rate nonlinearity are significantly more stringent. From equation 2.37, we see that *g* enters into the ECLL in two logarithmic forms: $\log g(u)$ and $-\log\left(1 + \sum_i g(u_i)\,dt\right)$, where each $u_i$ is a function of the parameters corresponding to a single transition (e.g., $u_i = \mathbf{k}'^{\top}_{nm}\mathbf{s}_t$). The first term gives rise to the familiar constraint that *g* must be log concave. To analyze the second term, we consider the limiting case where $g(u_j)\,dt \gg 1$ and $g(u_j) \gg g(u_i),\ \forall i \neq j$ (which will be true for some setting of the parameters that compose the $u_i$). Then the second logarithmic term reduces to $-\log g(u_j)$, which introduces the necessary condition that *g* must be log convex. Explicit derivation of the second-derivative matrix of $-\log\left(1 + \sum_i g(u_i)\,dt\right)$ confirms that the log convexity of the nonlinearity is sufficient to guarantee that this matrix is negative definite for all values of the $u_i$ (i.e., not just in the limiting case). The only functions that are both log concave and log convex are those that grow exponentially, and thus, if the transition nonlinearity is the exponential function ($g(u) = e^u$), the concavity of the M-step will be guaranteed.^{1}

#### 2.2.7. Using a Continuous-Time Model.

It is possible to adapt our model to a continuous-time rather than a discrete-time framework (see appendix C for details). This has several potential advantages. First, as the derivation of the continuous-time M-step reveals, the stringent requirement that the transition rate nonlinearity *g* must be the exponential can be relaxed while maintaining the concavity of the ECLL. This significantly increases the flexibility of the model. More important, however, the continuous-time implementation may require less computation and memory storage. During the E-step in the discrete-time case, the forward and backward probabilities are calculated for every time step *t*. When one considers the fact that the vast majority of time steps for a reasonable choice of *dt* (⩽10 ms) are associated with the trivial “no-spike” emission even for neurons with relatively high firing rates, it becomes obvious why a continuous-time framework might potentially be advantageous since it is possible to numerically integrate the forward and backward probabilities from spike time to spike time. Using an ordinary differential equation (ODE) solver to perform this integration is effectively the same as using adaptive time stepping where *dt* is modified to reflect the rate at which the probabilities are changing. This can result in significantly fewer computations per iteration than in the discrete-time case. Additionally, it is necessary to store the marginal probabilities of the posterior distribution (for eventual use during the M-step) only at the time points where the ODE solver chooses to evaluate them, which is likely to be many fewer total points.
Although the results we present below were all obtained using a discrete-time algorithm, for the reasons just mentioned, implementation of a continuous-time model may be more appropriate for certain large data sets, specifically those with highly varying firing rates where a single time step would be either too computationally expensive or would result in a loss of the finely grained structure in the data.

### 2.3. Hybrid Peristimulus Time Histogram and Hidden Markov Models.

As a demonstration of how the framework introduced in this article can be extended to more appropriately suit certain data sets, in this section we introduce modifications that allow the modeling of state-dependent firing rates when the rates are not being driven by an explicit time-varying stimulus, but simultaneously are not time homogeneous. Many experimental data sets consist of multiple trials that are triggered by an event and exhibit interesting event-triggered dynamics (see, e.g., the data discussed in section 4). Assuming these dynamics evolve in a state-dependent manner, the ability to model such inhomogeneous systems with HMMs is potentially useful. It is important to note, however, that the models that we introduce in sections 2.3.1 and 2.3.2, while mathematically very similar to those already presented, differ philosophically in that they are not specifically motivated by a simple neural mechanism. In the previous models, firing rate changes are driven by the time-varying stimulus. In these models, though they allow the capture of firing rate changes, the genesis of these inhomogeneities is not explicitly defined (although these models are designed to provide a snapshot of the dynamics of an underlying neural network from which a data set is recorded).

#### 2.3.1. The Trial-Triggered Model.

As our first example, we consider a model in which the transition and firing rates depend on the time *t* since the beginning of the trial (in addition to the hidden state *q _{t}* and, possibly, spike history effects). For simplicity, we assume that no time-varying stimulus is present, although incorporating additional stimulus terms is straightforward. Explicitly, we can model the transition and firing rates as

$$\lambda'_{nm,t} = g\left(k'_{nm,t} + \mathbf{h}'^{\top}_{nm}\boldsymbol{\gamma}_t\right) \quad \text{and} \quad \lambda_{n,t} = f\left(k_{n,t} + \mathbf{h}^{\top}_{n}\boldsymbol{\gamma}_t\right),$$

respectively, where the spike history effects are as defined in section 2.2.2, $k_{n,t}$ is the *t*th element of $\mathbf{k}_n$, and the filters $\mathbf{k}'_{nm}$ and $\mathbf{k}_n$ are now time-varying functions of length *T* (see Kass & Ventura, 2001; Frank, Eden, Solo, Wilson, & Brown, 2002; Kass, Ventura, & Cai, 2003; Czanner et al., 2008; Paninski et al., 2009, for discussion of related models). In principle, the filter elements can take arbitrary values at each time *t*, but clearly estimating such arbitrary functions given limited data would lead to overfitting. Thus, we may either represent the filters in a lower-dimensional basis set (as we discussed with $\boldsymbol{\gamma}_t$ in section 2.2.2), such as a set of splines (Wahba, 1990), or take a penalized maximum likelihood approach to obtain smoothly varying filters (where the difference between adjacent filter elements $k_{n,t}$ and $k_{n,t+1}$ must be small), or potentially combine these two approaches. (For a full treatment of a penalized maximum likelihood approach to this “trial-triggered” model, see appendix D.)

A convenient feature of using the smoothness penalty formulation is that the model then completely encapsulates the homogeneous HMM with static firing rates in each state. If the smoothness penalties are set such that the difference between adjacent filter elements is constrained to be essentially zero, then the spiking and transition filters will be flat, or, equivalently, the model will become time homogeneous. In the opposite limit, as discussed above, if the penalties are set such that the differences between adjacent elements can be very large, then we revert to the standard maximum likelihood setting, where overfitting is ensured. Thus, by using model selection approaches for choosing the setting of the penalty parameter (e.g., with cross-validation or empirical Bayes as in Rahnama Rad and Paninski, 2010), it is possible to determine the optimal level of smoothness required of the spiking and transitioning filters.
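The following toy sketch (our own illustration, not the paper's fitting code) shows the effect of the penalty on a single time-varying log firing rate fit to simulated Poisson counts: a moderate penalty smooths the filter, while no penalty lets it chase the individual spike counts:

```python
import numpy as np

def fit_smooth_lograte(counts, dt, penalty, n_iter=3000, lr=0.02):
    """Maximize sum_t [y_t k_t - exp(k_t) dt] - penalty * sum_t (k_{t+1} - k_t)^2
    over the per-time-step log rates k_t by gradient ascent."""
    k = np.zeros(len(counts))
    for _ in range(n_iter):
        grad = counts - np.exp(k) * dt       # Poisson log-likelihood gradient
        d = np.diff(k)                       # penalty gradient: 2 * penalty * (d_t - d_{t-1})
        grad[:-1] += 2.0 * penalty * d
        grad[1:] -= 2.0 * penalty * d
        k += lr * grad
    return k

rng = np.random.default_rng(2)
counts = rng.poisson(0.1, size=100)          # ~50 Hz background at dt = 2 ms
k_smooth = fit_smooth_lograte(counts, 0.002, penalty=5.0)
k_rough = fit_smooth_lograte(counts, 0.002, penalty=0.0)
```

With the penalty, the fitted log rate varies far less from bin to bin; in the infinite-penalty limit it would be flat, recovering the time-homogeneous model.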

It is clear that this proposed trial-triggered model is a hybrid between a standard peristimulus time histogram-based (PSTH) model and a time-homogeneous HMM. Although we have been able to reformulate this model to fit exactly into our framework, it is illustrative to consider the model not as an *N*-state time-inhomogeneous model with an unrestricted transition matrix (as in Figure 1), but as an *NT*-state time-homogeneous model with a restricted state-space connectivity. In this interpretation, state *n _{t}* is associated with the *N* − 1 transition rates $\lambda'_{nm,t}$ for *m* ≠ *n* leading to the states available at time *t* + 1.^{2} In other words, the transition matrix is sparse, with nonzero entries only between states corresponding to adjacent time steps. Note that this model is exactly the same as before, merely represented in a different way. Conceptually, a small state space with dynamic firing and transition rates has been replaced by a large state space with static rates. A schema of the Markov chain underlying the trial-triggered model is given in Figure 3a. Each row of states in the figure corresponds to what was a single state in the previous representation of the model, and each column corresponds to a time step *t*. Due to the restricted nature of the state-space connectivity (i.e., the few nonzero entries in the transition matrix), the system will always be in a state of the *t*th column at time *t*.
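The sparse block structure of the *NT* × *NT* transition matrix can be sketched as follows (our own illustration; in practice, one would exploit the sparsity rather than build the dense matrix):

```python
import numpy as np

def build_trial_triggered_matrix(blocks):
    """blocks: list of T-1 small N x N transition matrices, one per time
    step. Returns the NT x NT matrix whose only nonzero entries link
    states in column t to states in column t + 1."""
    N = blocks[0].shape[0]
    T = len(blocks) + 1
    A = np.zeros((N * T, N * T))
    for t, a in enumerate(blocks):
        A[t * N:(t + 1) * N, (t + 1) * N:(t + 2) * N] = a
    return A

N, T = 2, 4
blocks = [np.array([[0.9, 0.1],
                    [0.2, 0.8]]) for _ in range(T - 1)]
A = build_trial_triggered_matrix(blocks)
```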

#### 2.3.2. The Transition-Triggered Model.

Another possible extension of this framework is illustrated in Figure 3b. The idea is to couple the dynamics of the system to the times at which the state transitions rather than the start of the trial. This model structure is closely connected to the “semi-Markov” models and other related models described previously (Sansom & Thomson, 2001; Guédon, 2003; Fox, Sudderth, Jordan, & Willsky, 2008; Chen et al., 2009; Tokdar et al., 2009), as we will discuss further below. In this model, transitions that result in a change of row reset the system to the first column of the state space, as opposed to the trial-triggered model, where all transitions move the system to the next column. In this transition-triggered model, we label the current state as *n*_{τ}, which is the τth state in the *n*th row of states. Note that the index τ is only equal to the time-step *t* prior to the first transition back to the first column of states. Subsequently, τ, which can be thought of as the depth of the current state in the state-space cascade, will reflect the time since the last transition, not the time since the onset of the trial, exactly as desired. The model parameters and can now be thought of as state-dependent peritransition time histograms (PTTHs) for spiking and transitioning (rather than PSTHs) due to the decoupling of τ from *t*. Note that each state *n*_{τ} is associated with *N* transition rates where *m* may equal *n* (unlike in the trial-triggered case, where each state was associated with *N* − 1 transition rates) because we permit transitions back to the start of the current row. Additionally, recall that when the trial-triggered model was reformulated as having *NT*-states rather than *N*-states, the model became time homogeneous. For the transition-triggered model, however, since τ and *t* are decoupled, the firing rates for each state *n*_{τ} are no longer time homogeneous. 
A consequence is that the time complexity of the associated Baum-Welch learning algorithm becomes $O(N^2T^2)$ rather than $O(N^2T)$. For a full treatment of the transition-triggered model, see appendix E. Results from the analysis of real data using this model appear elsewhere (Escola, 2009).

## 3. Results with Simulated Data

In this section, we apply our algorithm to simulated data sets to test its ability to appropriately learn the parameters of the model.

### 3.1. Data Simulation.

In our trials with simulated data, the stimuli used to drive the spiking of the model neurons were temporally correlated gaussian noise stimuli with spatially independent and identically distributed (i.i.d.) pixels. More specifically, the intensity of each pixel was given by an independent autoregressive process of order 1 with a mean of 0, a variance of 1, an autocorrelation time constant of 200 ms, and a time step of 2 ms.
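This stimulus can be sketched as follows (our own illustration of the process described above; the AR(1) coefficient is set so the autocorrelation time constant is 200 ms at a 2 ms time step):

```python
import numpy as np

dt, tau_corr = 0.002, 0.2                    # 2 ms step, 200 ms correlation time
a = np.exp(-dt / tau_corr)                   # one-step AR(1) coefficient
sigma = np.sqrt(1.0 - a**2)                  # innovation s.d. for unit variance

def ar1_stimulus(n_pixels, T, rng):
    """Each pixel is an independent AR(1) process with zero mean and unit variance."""
    s = np.empty((T, n_pixels))
    s[0] = rng.standard_normal(n_pixels)
    for t in range(1, T):
        s[t] = a * s[t - 1] + sigma * rng.standard_normal(n_pixels)
    return s

s = ar1_stimulus(100, 50000, np.random.default_rng(1))
```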

The nonlinearity *f* governing the firing rates is chosen such that *f* is continuous and has continuous first and second derivatives, thus facilitating learning in gradient algorithms. Furthermore, the properties of convexity and log concavity are also maintained, guaranteeing that the ECLL has a unique maximum (recall section 2.2.6). The nonlinearity *g* governing the transitioning behavior is selected to be the exponential function (also per section 2.2.6).

### 3.2. A Tonic and Burst Two-State Model.

We tested our learning algorithm on spike trains generated from a model representing tonic and burst thalamic relay cells. Experimental studies such as those reviewed in Sherman (2001) have shown that relay cells exhibit two distinct modes of firing. In the tonic mode (hereafter referred to as the tonic state), interspike intervals (ISIs) are approximately distributed according to an exponential distribution, suggesting that spikes are more or less independent and that a Poisson firing model is reasonable. In the burst state, neighboring spikes are highly correlated (they tend to occur in bursts), as indicated by a vastly different ISI distribution (Ramcharan, Gnadt, & Sherman, 2000), and thus any reasonable model must capture these correlations. To do so, we employed different spike history filters for the two states.

If the tonic state history filter $\mathbf{h}_{\mathrm{t}}$ were the zero vector (where the subscripts *t* and *b* refer to the tonic and burst states, respectively), then tonic state spikes during a constant stimulus would be independent, leading to an exactly exponential ISI distribution. Instead we chose the history filter shown in Figure 4a, which has a large, negative value for the most recent time step, followed by small, near-zero values for earlier time steps. This negative value models the intrinsic refractoriness of neurons by strongly reducing the probability of a subsequent spike one time step (2 ms) after a preceding spike (recall how the spike history affects the firing rate according to equation 2.31). The resulting ISI distribution (in light gray in Figure 5) has low probability density for short intervals due to the imposed refractoriness, but it is otherwise essentially exponential.

The burst-state history filter (see Figure 4b) has a similar negative value for the most recent time step and thus also models refractoriness, but it has strong, positive values for the previous two time steps. This has the effect of raising the probability of a spike following an earlier spike, and thus encourages bursting. Furthermore, the filter returns to negative values for more distant time steps, which tends to cause gaps between bursts, another known neurophysiological feature. The resulting ISI distribution (in dark gray in Figure 5) has the signature bimodal shape of bursting (Ramcharan et al., 2000).
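To make the role of the two history filters concrete, the following sketch simulates Bernoulli-GLM spike trains under a refractory "tonic" filter and a refractory-then-excitatory "burst" filter, and compares the fraction of short interspike intervals. The filter values, bin counts, and the 6 ms cutoff are illustrative choices for this sketch, not the filters of Figure 4.

```python
import numpy as np

rng = np.random.default_rng(0)
dt = 0.002                                # 2 ms bins, as in the text
b = np.log(45 * dt)                       # bias giving a ~45 Hz background rate (f = exp)
h_tonic = np.array([-5.0, -0.5, -0.1])    # hypothetical: strong refractoriness at lag 1
h_burst = np.array([-5.0, 2.0, -1.0])     # hypothetical: refractory at lag 1, excitatory at lag 2

def simulate(h, n_bins=100_000):
    """Bernoulli-GLM spike train: p(spike) = min(1, exp(b + h . recent spikes))."""
    spikes = np.zeros(n_bins, dtype=int)
    L = len(h)
    for t in range(n_bins):
        hist = spikes[max(0, t - L):t][::-1]          # most recent bin first
        drive = b + float(np.dot(h[:len(hist)], hist))
        spikes[t] = rng.random() < min(1.0, np.exp(drive))
    return spikes

fracs = {}
for name, h in [("tonic", h_tonic), ("burst", h_burst)]:
    s = np.flatnonzero(simulate(h))
    isis = np.diff(s) * dt * 1000.0                   # ISIs in ms
    fracs[name] = float(np.mean(isis <= 6.0))         # fraction of short ISIs
    print(name, len(isis), round(fracs[name], 2))
```

The burst filter's positive lag-2 weight produces runs of closely spaced spikes, so its short-ISI fraction is much larger, mirroring the bimodal ISI shape described in the text.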

A reasonable choice for the history filter governing the transition from the tonic state to the burst state consists of negative values for the several time steps preceding the transition. This is because bursts tend to follow periods of relative quiescence (Wang et al., 2007), and, with this choice of filter (see Figure 4c),^{3} the model neuron will prefer to transition to the burst state when there has not been a recent spike. We chose the history filter for the reverse transition to be the zero vector (see Figure 4d), and thus spike history does not affect the return to the tonic state from the burst state. To reduce the model dimensionality, the history filters were defined by the coefficients of three exponential basis functions with time constants 2, 4, and 8 ms (recall the discussion in section 2.2.2).

The stimulus filters for spiking in the two states (see Figures 4e and 4f, respectively) were chosen to be identical, following experimental evidence that the spatial component of the preferred stimulus does not change regardless of whether a relay cell is firing in the tonic or burst regime (Bezdudnaya et al., 2006).^{4} The spiking bias terms were set such that the background firing rates were 45 Hz in both states.

To choose the stimulus filter for the transition from the tonic state to the burst state (see Figure 4g), we used a similar line of reasoning as in the choice of the corresponding history filter. Since bursts tend to follow periods of quiescence, we selected as this transition filter the negative of the spiking filter. Thus, the antipreferred stimulus would drive the cell into the burst state, where the preferred stimulus could then trigger bursting. This is reasonable from a neurophysiological point of view given that voltage recordings from patched cells have shown hyperpolarized membrane potentials immediately prior to bursts (Sherman, 2001; Wang et al., 2007) and that an antipreferred stimulus would be expected to hyperpolarize a neuron through push-pull inhibition. The stimulus filter for the reverse transition, like the corresponding history filter, was chosen to be the zero vector (see Figure 4h). Thus, the return to the tonic state in this model is governed solely by the background transition rate. The bias terms *b*′_{tb} and *b*′_{bt} were set such that the background transition rates were 3 Hz and 7 Hz, respectively, for the tonic→burst and the burst→tonic transitions. When the neuron is presented with a stimulus, however, due to the variance of the filtered stimulus and the effects of the nonlinearity *g*, the average resultant rates are roughly equal for both transitions (approximately 7 Hz), and thus the model neuron spends about the same amount of time in each state.

In discrete time, spiking is modeled as a Bernoulli process: in each time step, a spike occurs with a probability determined by the state-dependent firing rate and the bin width *dt*. Conveniently, the nonlinearity *f* has the same concavity constraints under this Bernoulli model as in the original Poisson model (see appendix F for proof).

Using this spiking model and the parameter settings described above, we generated 2000 s spike trains as test data. Before iterating our learning algorithm, the filters and biases were initialized to random values drawn from zero mean, unit variance gaussians, while the initial state distribution was initialized from an *N*-dimensional uniform distribution and then normalized to sum to 1. Learning proceeded according to the Baum-Welch EM algorithm described in sections 2.1.2, 2.1.3, and 2.2.4, with Newton-Raphson optimization used to perform the update of the parameters during the M-step (see section F.1 for the gradient and Hessian of the Bernoulli model). Considerable experimentation with the learning procedure suggested that, except perhaps for the first one or two iterations of EM when the parameters are far from their correct values, a single Newton-Raphson step was sufficient to reach the maximum for each M-step (i.e., the ECLL was very well approximated by a quadratic function). For these parameters and this amount of data, learning generally converged in about 200 to 300 iterations, which required about 30 minutes of CPU time on an Apple 2.5 GHz dual-core Power Mac G5 with 3 GB of RAM running MATLAB.
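The Newton-Raphson M-step can be sketched on a simplified problem: maximizing a posterior-weighted Poisson-GLM expected complete log likelihood for one state's spiking filter. The stimulus, weights, and filter below are stand-ins invented for this sketch (the paper's actual M-step uses the gradients and Hessians of its appendix F), but the update rule is the same concave Newton iteration.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 5000, 4
S = rng.standard_normal((T, D))          # stimulus rows, one per time step
k_true = np.array([1.0, -0.5, 0.3, 0.1]) # hypothetical "true" spiking filter
y = rng.poisson(np.exp(S @ k_true))      # per-bin spike counts, f = exp
w = rng.uniform(0.5, 1.0, size=T)        # stand-in posterior weights from the E-step

def ecll(k):
    """Posterior-weighted Poisson log likelihood (the ECLL for one state)."""
    lam = np.exp(S @ k)
    return float(np.sum(w * (y * (S @ k) - lam)))

k = np.zeros(D)
for _ in range(10):                      # Newton-Raphson on a concave objective
    lam = np.exp(S @ k)
    grad = S.T @ (w * (y - lam))                  # gradient of the ECLL
    H = -(S * (w * lam)[:, None]).T @ S           # Hessian (negative definite)
    k = k - np.linalg.solve(H, grad)
print(np.round(k, 2))
```

Because the weighted objective is strictly concave for *f* = exp, the iteration converges quickly, consistent with the observation above that one step usually suffices once the parameters are near their optimum.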

Learning was repeated for 100 trials, each with a unique stimulus/spike train pair and a unique random initialization. By visual inspection, all trials appeared to avoid local minima and converged to reasonable solutions. The results for the history and stimulus filters (without bias terms) are shown in Figure 4. The ±1 σ error ranges for the bias terms (expressed in rate space) are 44.5 to 45.5 Hz, 44.6 to 45.4 Hz, 2.5 to 3.3 Hz, and 6.5 to 7.4 Hz for *b*_{t}, *b*_{b}, *b*′_{tb}, and *b*′_{bt}, respectively. All true filter and bias parameters fall within the ±1 σ error ranges, suggesting that parameter learning was successful. The larger-than-average error bars for the weights of the transition history filters at 2 ms in the past (see Figures 4c and 4d) reflect the fact that spike trains contain little information about the dependence of the state transitioning on the spike history at very short timescales. The estimation of the consecutive-pairwise marginal probabilities of the posterior distribution of the state sequence (see equation 2.15) calculated by the forward-backward algorithm (see section 2.1.2) is not able to temporally localize the transitions to within 2 ms precision even if the true parameters are used for the estimation. Therefore, one would need to average over a great deal more data to infer the dependence at this timescale than at slightly longer timescales. If more data were used to estimate the parameters, these error bars would be expected to decrease accordingly.

Although the parameter values appear to be learned appropriately, they are not learned perfectly. To understand the implication of these deviations, data generated using the true parameters can be compared to those generated using a sample learned parameter set. Rather than compare spike trains directly, it is sufficient to compare instantaneous firing rates, since the rate is a complete descriptor of Bernoulli (or Poisson) firing statistics. Figure 6a shows the instantaneous firing rates of two sample simulations of the model using the same stimulus but two different parameter sets.^{5} The most striking feature is how different the two traces seem from each other. This is because spikes in the two traces are very rarely coincident, and the spike history dependence dramatically alters the firing rates during the several milliseconds following a spike. This is apparent from the many dips in the firing rates to near-zero values (immediately after spikes), followed by relatively quick rebounds to the purely stimulus-evoked firing rate (the rate given by a spike history independent model). Also noticeable are the huge jumps in the firing rate corresponding to times when the neuron has been simulated to be in the burst state and is actively bursting.

The distributions of the instantaneous firing rates calculated over 1000 model simulations for both the true parameters and the set of learned parameters are shown in Figure 6b. Despite the fact that individual trials such as those shown in Figure 6a can differ significantly, the means and ±1 σ deviations are almost identical between the two distributions, confirming that the two parameter sets (true and learned) produce identical behavior in the model neuron. In other words, the interparameter set firing rate variability is no more than the intraparameter set firing rate variability.

To additionally evaluate the estimation performance, in Figure 7, we compare the posterior probability of the hidden state variable at each time step with the true state sequence. The trace corresponding to the posterior probability calculated using the learned parameters is essentially the same as that calculated using the true parameters, suggesting that both sets of parameters are equally able to extract all the information about the sequence of states that exists in the spike train. The difference between the true state sequence and the posterior probability calculated using the true parameters represents the intrinsic uncertainty in the system, which we cannot hope to remove. However, over 2000 s of stimulus/spike train data, the percentage of time steps when the true state was predicted with a posterior probability greater than 0.5 was 92%. These results support the fidelity of the learning procedure and suggest that it may be possible to use this method to recapitulate an unknown state sequence.
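The smoothed posterior used for this comparison comes from the forward-backward algorithm. A minimal sketch for a toy two-state chain with Bernoulli emissions (all parameter values here are illustrative, not the model of section 3.2) shows the computation of p(*q*_{t} | data) and the fraction of bins where the true state gets posterior mass above 0.5:

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.array([[0.99, 0.01], [0.02, 0.98]])   # sticky transition matrix (rows sum to 1)
p_spike = np.array([0.05, 0.30])             # per-bin spike probability in each state
pi = np.array([0.5, 0.5])                    # initial state distribution
T = 2000

# simulate a hidden state sequence and spikes
q = np.zeros(T, dtype=int)
for t in range(1, T):
    q[t] = rng.choice(2, p=A[q[t - 1]])
y = (rng.random(T) < p_spike[q]).astype(int)

# forward-backward with per-step normalization (numerically stable)
lik = np.where(y[:, None] == 1, p_spike, 1 - p_spike)   # T x 2 emission likelihoods
alpha = np.zeros((T, 2)); beta = np.ones((T, 2))
alpha[0] = pi * lik[0]; alpha[0] /= alpha[0].sum()
for t in range(1, T):
    alpha[t] = (alpha[t - 1] @ A) * lik[t]
    alpha[t] /= alpha[t].sum()
for t in range(T - 2, -1, -1):
    beta[t] = A @ (lik[t + 1] * beta[t + 1])
    beta[t] /= beta[t].sum()
post = alpha * beta
post /= post.sum(axis=1, keepdims=True)      # p(q_t = n | y_1..T)

accuracy = float(np.mean((post[:, 1] > 0.5) == (q == 1)))
print(f"fraction of bins with true state at posterior > 0.5: {accuracy:.2f}")
```

The per-step rescaling of alpha and beta cancels in the final normalization, so `post` equals the exact smoothed marginals.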

### 3.3. An Attentive/Ignoring Two-State Model.

We also tested our learning algorithm on spike trains generated from a simulated neuron with two hidden states corresponding to stimulus-driven spiking and stimulus-ignoring spiking, respectively. This model could be interpreted to represent a neuron in primary sensory cortex. The attentive state would correspond to times when the synaptic current into the neuron is predominantly derived from thalamic afferents, and thus when the neuron's spiking behavior would be highly correlated with the sensory stimulus. The ignoring state would be associated with times when recurrent activity in the local cortical column or feedback activity from higher cortical areas overwhelms the inputs and drives the neuron's spiking in a stimulus-independent manner.

The ignoring state can be represented by setting the stimulus filter for spiking in that state to zero for all elements except the bias term, where the subscript *i* indicates the ignoring state (and *a* the attentive state). The four stimulus filters of the model (two for spiking and two for transitioning) are shown in Figure 8 (history effects are ignored for this simple model). The forms of these filters are arbitrary choices (with the exception of the ignoring-state spiking filter), and the magnitudes of the filter values were chosen to be of the same order of magnitude as the zero mean, unit variance stimulus. The bias terms were set such that the background firing and transition rates in both states were 45 Hz and 0.1 Hz, respectively, which resulted in mean firing and transition rates in the presence of a stimulus of about 50 Hz and 9 Hz, respectively, due to the effects of the nonlinearities. Note that the original Poisson spiking model was used to generate the data for this example.

Learning proceeded as in the previous example and was repeated for 100 trials, each with a unique 2000 s stimulus/spike train pair and a unique random parameter initialization. By visual inspection, all trials appeared to avoid local minima and converged to reasonable solutions. The results for the filter parameters (without biases) are summarized in Figure 8. The ±1 σ error ranges for the bias terms (expressed in rate space) are 44.6 Hz to 45.4 Hz, 44.8 Hz to 45.2 Hz, 0.04 Hz to 0.13 Hz, and 0.07 Hz to 0.12 Hz for *b*_{a}, *b*_{i}, *b*′_{ai}, and *b*′_{ia}, respectively. All true filter and bias parameters fall within the ±1 σ error ranges; thus, parameter learning was successful. For comparison purposes, the linear filter of a standard GLM (i.e., one-state) model was also learned. The resulting filter (shown with the dotted line in Figure 8a) differs significantly from the underlying stimulus filter for spiking and seems to represent some combination of the two spiking filters, as well as the transition filter that drives the neuron into the attentive state, where it can subsequently be driven to fire by the stimulus.

As is shown in Figure 6b for the previous example, the distributions of the instantaneous firing rates calculated over many simulations of the attentive/ignoring model for both the true parameters and a set of learned parameters can be compared; again, the means and ±1 σ deviations are almost identical between the two distributions, confirming that the two parameter sets (true and learned) produce identical behavior in the model neuron (data not shown). Analysis of the inferred posterior probability of the hidden state variable at each time step compared with the true state sequence further confirms the success of learning. The posterior probabilities resulting from the true parameters and a set of learned parameters are nearly identical, suggesting that the learning procedure was as successful as possible (data not shown). Over 2000 s of data, the correlation coefficient between the true state sequence and the inferred posterior probability was 0.91, while the percentage of time steps when the true state was predicted with a posterior probability greater than 0.5 was 95%.

## 4. Multistate Data in Rat Gustatory Cortex

Jones et al. (2007) have presented an analysis of multielectrode data collected from gustatory cortex during the delivery of tastants—solutions of sucrose (sweet), sodium chloride (salty), citric acid (sour), and quinine (bitter)—to the tongues of awake, restrained rats. Each tastant was applied 7 to 10 times during each recording session, with water washouts of the mouth between trials. Across all recording sessions and all four tastants, the data consist of 424 trials, where each trial comprises the 2.5 s of multielectrode spiking data immediately following the application of a tastant. Note that different sets of cells (varying in number from 6 to 10) were isolated during each recording session, and so only the trials corresponding to the same session and tastant pair can be considered to be samples from the same neural process.^{6} In the initial analysis of the data, direct inspection of the spike raster plots over multiple trials revealed that when multiple cells changed their firing rates during the course of a trial, they tended to do so simultaneously on a given trial, but that this transition time often differed between trials. Thus, the choice was made to perform a simple HMM analysis of these data. Specifically, a four-state model with constant state-dependent firing rates for each cell and constant transition rates between all pairs of states was fit to the data. Couching this previous model in our current notation, the stimulus filters reduce to the background firing and transition rates *b*^{c}_{n} and *b*′_{nm}, respectively, with all history filters equal to zero. Note that these data conform to the multicell multitrial paradigm introduced in section 2.2.5.

### 4.1. Results from Spike History Dependent Models.

While this data set does not include a known external time-varying stimulus, and thus determining preferred stimuli for firing and transitioning is not possible, we provide a brief analysis extending the model presented in Jones et al. (2007) to include one aspect of the framework developed in section 2.2: the modeling of spike history effects. Note that Chen et al. (2009) also include spike history dependence in their model of UP and DOWN states during slow-wave sleep.

Unlike the case of simulated data, we do not know the true state-dependent firing rates, transition rates, or history filters, and so rather than comparing the learned model parameters to the true model parameters as is done in the previous section, we instead evaluate the cross-validated log likelihood of the data, an unbiased measure of the goodness of fit, over several model classes. The model class with the highest cross-validated log likelihood provides the best fit to the data. We compute the cross-validated log likelihood using leave-one-out cross-validation as follows. For every trial, we fit an HMM to the remaining trials of the same session and tastant pair and then evaluate the log likelihood of the left-out trial on the trained model. The sum of these log likelihoods for all 424 trials equals the total cross-validated log likelihood for the particular model class in question.
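The leave-one-out bookkeeping can be sketched as follows. A placeholder homogeneous-Bernoulli "model" stands in for the full HMM fit (the trials, fitting routine, and smoothing constant are invented for this sketch); only the cross-validation structure mirrors the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(3)
# 8 fake trials from one session/tastant pair, 500 bins each
trials = [rng.random(500) < 0.1 for _ in range(8)]

def fit(train_trials):
    """Placeholder 'training': a smoothed per-bin spike probability."""
    spikes = sum(t.sum() for t in train_trials)
    bins = sum(t.size for t in train_trials)
    return (spikes + 1) / (bins + 2)          # add-one smoothing keeps p in (0, 1)

def loglik(trial, p):
    n1 = trial.sum(); n0 = trial.size - n1
    return n1 * np.log(p) + n0 * np.log(1 - p)

cv_ll = 0.0
for r in range(len(trials)):
    train = trials[:r] + trials[r + 1:]       # all trials of the pair except trial r
    cv_ll += loglik(trials[r], fit(train))    # score the left-out trial
print(f"total cross-validated log likelihood: {cv_ll:.1f}")
```

Summing such per-trial scores over all session/tastant pairs yields the total cross-validated log likelihood used to rank model classes.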

In the PSTH models, each cell's firing rate is a time-varying function λ^{c}_{t}, with spike counts *y*^{c}_{t} ∼ Poisson(λ^{c}_{t}) as before. The length-*T* rate filter is fit by penalized maximum likelihood, exactly as discussed in section 2.3.1.^{7}

Figure 9 shows the comparison of the cross-validated log likelihoods for the HMM and PSTH models with and without the inclusion of history effects. Since the number of spike trains in the data set varies across recording sessions and since the firing rates vary significantly across session and tastant pairs, we normalize the cross-validated log likelihoods by dividing by the number of cells and subtracting off the log likelihoods derived from a homogeneous Poisson model (i.e., to capture differences in firing rates). This allows a comparison of trials across session and tastant pairs and, thus, the calculation of meaningful error bars. Note that the relative heights of the normalized data in the figure are unchanged when compared to the raw data (see the figure inset).

In this normalization, *C*_{r} is the number of cells composing the data of left-out trial *r*, the data term consists of the spike trains of trial *r*, and the model parameters are the maximum likelihood solution learned using the remaining trials from the same session and tastant pair as trial *r*. The Poisson models are simple fits of homogeneous firing rates to each cell: λ^{c} = *f*(*b*^{c}). Then, if *N*_{cells} refers to the total number of spike trains across every combination of trial and session and tastant pair (3872 spike trains in total) and *N*_{trials} refers to the total number of trials over every session and tastant pair (424 trials), the sample means and sample standard errors shown in the figure are calculated from the set of values {LL^{r}_{norm}}, where each value in the set is included in the sample *C*_{r} times.

As in section 3.2, the history filters for each cell were parameterized by the coefficients of exponential basis functions with time constants of 2, 4, and 8 ms. For simplicity, no interneuronal cross-coupling spike history terms were included in this analysis. Since the true value of *N*, the number of hidden states, is unknown when using experimental data, we calculate the cross-validated log likelihoods for both *N* = 3 and *N* = 4 across the entire data set and then calculate the "optimal" cross-validated log likelihood by summing the higher of the two values for each session and tastant pair. The choices of three and four states are empirically driven: for *N* < 3, the cross-validated log likelihood falls off significantly, while for *N* > 4, we found that the system is typically not inferred to spend at least some period of time in every state. Additionally, *N* = 4 was the choice that Jones et al. (2007) made.
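The per-trial normalization and cell-count weighting can be sketched as follows. The per-trial log-likelihood values here are random placeholders, not the actual data; only the arithmetic (divide by cell count, subtract a homogeneous-Poisson baseline, include each trial *C*_{r} times in the sample) follows the description above.

```python
import numpy as np

rng = np.random.default_rng(4)
n_trials = 20
C = rng.integers(6, 11, size=n_trials)            # cells per (fake) trial
ll_hmm = rng.normal(-100.0, 5.0, size=n_trials) * C    # placeholder per-trial log likelihoods
ll_poiss = rng.normal(-110.0, 5.0, size=n_trials) * C  # homogeneous-Poisson baseline

ll_norm = (ll_hmm - ll_poiss) / C                 # one normalized value per trial
expanded = np.repeat(ll_norm, C)                  # each trial's value included C_r times
mean = float(expanded.mean())
sem = float(expanded.std(ddof=1) / np.sqrt(len(expanded)))
print(f"normalized LL: {mean:.2f} +/- {sem:.2f}")
```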

The figure confirms the result in Jones et al. (2007) that HMMs fit the data better than PSTHs, but more strikingly draws attention to the importance of correctly modeling spike history effects. In terms of maximizing the cross-validated log likelihood, the inclusion of history dependence in the model far outstrips the choice of PSTH versus HMM. To understand why history dependence seems to outweigh the importance of using an HMM, it is helpful to consider the behavior of the log likelihood under a simple Bernoulli model with parameter *p*; in our context, of course, *p* would model the probability of a spike in a single time step. The Fisher information for this model is 1/(*p*(1 − *p*)), which implies that the log likelihood is much more sensitive to perturbations around *p* ≈ 0 than for larger values of *p*. The spike history filters serve mainly to capture intrinsic neuronal refractoriness and thus allow *p* to be correctly set to near-zero values in time bins immediately following a spike. This evidently has a larger impact on the total log likelihood than does the choice of PSTH versus HMM (which can be thought of as modulating *p* in less sensitive regions of parameter space, where the true probability of spiking is far from *p* = 0). Thus, the magnitudes in the figure are somewhat artifactual. A history-ignoring model is clearly wrong, but not necessarily more wrong than the choice of PSTH over HMM.
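The sensitivity argument can be checked numerically: under a Bernoulli model, the expected log-likelihood cost of a fixed absolute error in *p* is much larger near *p* ≈ 0 than near *p* = 0.5, consistent with the Fisher information 1/(*p*(1 − *p*)). The sample size and perturbation below are arbitrary illustrative values.

```python
import numpy as np

def fisher(p):
    """Fisher information of a single Bernoulli observation."""
    return 1.0 / (p * (1.0 - p))

def ll_drop(p_true, p_model, n=1000):
    """Expected log-likelihood loss over n bins from using p_model instead of p_true."""
    def ell(p):
        return n * (p_true * np.log(p) + (1 - p_true) * np.log(1 - p))
    return ell(p_true) - ell(p_model)

eps = 0.005                                  # same absolute perturbation in both regimes
drop_near_zero = ll_drop(0.01, 0.01 + eps)
drop_mid = ll_drop(0.5, 0.5 + eps)
print(fisher(0.01), fisher(0.5))             # ~101 vs 4
print(drop_near_zero, drop_mid)              # much larger loss near p ~ 0
```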

Sample history filters as determined by both the history-dependent PSTH and HMM for one of the cells in the data set are shown in Figure 10a. The two model classes determine nearly identical history filters, with a large negative component at short post-spike times accounting for neuronal refractoriness. Figures 10b–10d show instantaneous firing rates for PSTHs and HMMs with and without history dependence. Note that when history effects are modeled, the baseline firing rates of the cell are higher than in the history-ignoring models. This suggests that the history-ignoring models must reduce their baseline firing rates to account for the net reduction in firing rate introduced by neuronal refractoriness (thus keeping the area under the firing rate curves constant). However, by doing so, they fail to capture the correct background rate.

As we noted, Jones et al. (2007) argued that HMMs offer improved performance over PSTHs due to their ability to identify transition times. This avoids the need to average across trials and model the data with intermediate firing rates that may never be realized on any actual trial. Comparing the PSTH-determined firing rates in Figure 10b with the HMM-determined rates in Figures 10c and 10d reveals this phenomenon. In the PSTH models, the firing rates increase smoothly during the length of the trial, whereas they make discrete jumps when determined by HMMs.

### 4.2. Results from the Trial-Triggered Model.

In a preliminary analysis of these data with the trial-triggered model presented in section 2.3.1, we found that the best setting of the smoothness penalty, as determined by the cross-validated log likelihood, is to require that the difference between adjacent firing rate values be essentially zero. This causes the trial-triggered model to degenerate to a time-homogeneous HMM (albeit with spike history effects) or, in other words, almost exactly the model just presented.^{8} While this degeneration is initially surprising, it does agree with the interpretation of Jones et al. (2007) as to why a multistate model more accurately reflects the data from trial to trial than does a standard PSTH model. Recall the argument that if a cortical computation (in this case, tastant recognition) requires a neural network to progress through a sequence of states with homogeneous state-dependent firing rates, and if the time needed to complete each step of the sequence varies from trial to trial, then a PSTH will smear out these state-dependent rates and result in average firing rates that do not reflect the true rates observed on actual trials. Now imagine if the true structure of the data is that each state *n* is defined not by a set of homogeneous firing rates, but, rather, by a set of time-varying firing rate trajectories that the cells of the network follow each time the system visits state *n*. Then if the time at which the system transitions to state *n* varies from trial to trial, the trial-triggered model will smear out these state-dependent trajectories (in the same way that a standard PSTH model will smear out state-dependent homogeneous firing rates) due to the fact that the parameters of the trial-triggered model, the state-dependent PSTHs, are locked to the onset time of the trial, not the onset time of the current state.
Since the trial-triggered framework is unable to account for firing rate trajectories that evolve from the state onset times, this may explain why the cross-validated log likelihood is maximal for the setting of the smoothness penalty that forces the model to have unchanging firing rates.

## 5. Discussion

### 5.1. Previous HMMs Applied to Spike Trains.

Our model can be considered to be a direct extension of previous applications of HMMs to spike-train data discussed in Abeles et al. (1995), Seidemann et al. (1996), and Gat et al. (1997). In these earlier models, the goal was to predict which task a monkey was performing during the recording of a spike train by comparing the likelihoods of the spike train tested on two different HMMs—one trained on only trials corresponding to task 1 and the other trained on trials corresponding to task 2. These are similar to the model recently described in Jones et al. (2007) and discussed in detail in section 4 for predicting tastants from recordings of neurons in gustatory cortex. The major difference between these models and ours, and thus the reason that ours may be considered an extension, is their lack of an external time-varying stimulus and of spike history dependence. Thus, both the transition rates and the firing rates in their models were time homogeneous, and the size of the parameter set was significantly smaller (because rates are defined by single parameters rather than by stimulus and history filters). In the degenerate case where no stimulus is present and history effects are ignored, the state-dependent firing and transition rates are constant, and our model becomes identical to these previous models.

Although we specifically consider stimulus-driven sensory neurons in this article, the model can also represent the relationship between spike trains and time-varying behavioral variables (e.g., hand position in a motor task). With no change to the model as presented in section 2, we could infer the hidden state sequence and optimal "motor filters" from paired spike train and behavior data (Kemere et al., 2008). A related approach is developed in Wu et al. (2004), where the authors assume that hand position data (location, velocity, and acceleration in two dimensions) evolve according to an autoregressive model. The graphical representation of their model is essentially identical to Figure 2b, except that they have arrows between adjacent stimuli (or, in their case, position vectors) to reflect the fact that one stimulus affects the next.^{9} Thus, while the focus of our work can be seen to be the identification of the filters of the model and the recapitulation of the hidden state sequence, these goals are closely related to those of models that seek to identify either the behavior or the stimulus (the two are mathematically identical) encoded by spike trains.

### 5.2. Alternative Current Techniques.

There has been a good deal of recent work representing the stimulus-response relationship of neurons with linear state-space models, which, since they also employ latent variables that alter the stimulus-dependent spiking behavior, are similar in spirit to our model. In Smith and Brown (2003), for example, there is a one-dimensional state variable *q*_{t} that evolves under stimulus-driven linear-gaussian dynamics, *q*_{t} = ρ*q*_{t−1} + **k** ⋅ **s**_{t} + βε(*t*), where ρ is the correlation between adjacent time steps, **k** is a stimulus filter, and βε(*t*) represents gaussian noise with variance β^{2}. (We have otherwise modified their notation to match our own.) Given the state variable *q*_{t}, the firing rate is λ_{t} = *f*(*q*_{t} + *b*) with bias *b*. This model is similar to ours in that the state dynamics are stimulus dependent (i.e., through the filter **k**), but, conditioned on the state, it employs homogeneous firing rates.

Similarly, in Frank et al. (2002), Eden et al. (2004), Czanner et al. (2008), and Kulkarni and Paninski (2007), the state-variable vectors evolve according to homogeneous gaussian dynamics, but the state-conditioned firing rates are stimulus dependent. For example, the rates can be given by λ_{t} = *f*(**k**_{t} ⋅ **s**_{t}), where the stimulus filter **k**_{t} itself is the state variable, or by λ_{t} = *f*(**k** ⋅ **s**_{t} + *q*_{t}), where *q*_{t} is meant to represent some unmeasured input current (i.e., not stimulus derived). See also Wu, Kulkarni, Hatsopoulos, and Paninski (2009) for a related approach and Paninski et al. (2009) for further examples.

The continuous state-space formulation has a number of potential advantages and disadvantages compared to the discrete HMM approach. On the one hand, the state dynamics in the discrete setting are nonlinear, and the state dynamics and the spiking behavior are driven by the stimulus in a flexible, nonlinear manner. On the other hand, the number of parameters in the discrete HMM approach depends quadratically on the number of states (although techniques exist to limit the dimensionality of the model, as discussed in appendix A), and the transitioning is assumed to occur on a fast timescale (within a single time step of length *dt*), while in many cases, the true transitioning behavior may be slower.

An interesting continuous state-space model that employs nonlinear dynamics and may serve as a middle ground between the linear continuous models and HMMs is given in Yu et al. (2006). The authors propose a model with a stochastic recurrent nonlinear neural network as the hidden variable. Such networks, as is known from the classic network literature, may have multiple stable attractor states, and these could correspond to the discrete states of an HMM. In addition, recent work in the network literature indicates that it may be possible to design networks that implement a desired Markov chain (Jin, 2009).

While we expect even richer models to be developed as computational power increases and experimental evidence for multistate neurons grows, we believe that the model presented in this article nicely complements these other available models and offers an alternative for the analysis of the stimulus-response relationship. Ultimately, however, the appropriate model for a particular data set is dictated by the data themselves and can be determined by evaluating the different approaches mentioned in this section under the rules of model selection. By contributing to the library of point-process models, we hope to provide an additional tool that may be the most appropriate for certain data sets.

### 5.3. Additional Applications of the HMM Framework.

Another interesting application for which the framework developed in this article may be appropriate involves the modeling of neural systems that have been shown to have stimulus-specific adaptation (Blake & Merzenich, 2002; Borst, Flanagin, & Sompolinsky, 2005; Maravall, Petersen, Fairhall, Arabzadeh, & Diamond, 2007), in which optimal linear filters of neural encoding models change with certain stimulus attributes. Such stimulus dependencies have previously been discussed in theoretical contexts in, for example, Ahrens, Linden, and Sahani (2008) and Hong, Lundstrom, and Fairhall (2008), but the stimulus-dependent HMM might provide an alternative model for similar phenomena. In such a model, the stimulus-response relationship in each state could be thought to define, in a piecewise fashion, a portion of the overall relationship, and the state dynamics would thus reflect the dynamics of adaptation.

### 5.4. Hidden Semi-Markov Models and the Transition-Triggered Model.

The transition-triggered model introduced in section 2.3.2 falls into the category of hidden semi-Markov models (see Sansom & Thomson, 2001, for an introduction to HSMMs). These models replace exponentially distributed state dwell times with dwell times drawn from other distributions with support on the nonnegative real axis (e.g., the gamma distribution). Thus, the models are no longer purely Markovian: knowing the current state of the system at time *t* does not provide the maximum possible information about future states of the system; the elapsed time in the current state is also needed, which is analogous to keeping track of both *t* and τ (where τ, as introduced in section 2.3.2, is the current depth in the state-space cascade of Figure 3b). Although two previous papers have presented the use of HSMMs for the analysis of spike train data (Chen et al., 2009; Tokdar et al., 2009), the transition-triggered model differs from these in two major respects. First, by introducing an expanded state-space with *NT* instead of *N* states, we are able to adapt the Baum-Welch algorithm to perform exact inference of the hidden state distribution (as described in appendix E), while previous approaches employed Markov chain Monte Carlo (MCMC) techniques for inference; both methods have the same per-iteration time complexity. Second, and more important, the transition-triggered model can capture state-dependent firing-rate dynamics via the peritransition time histograms (PTTHs) that parameterize the model. These PTTHs allow the spiking behavior of the model to depend on the time since the most recent transition in addition to stimulus and spike history–driven effects.
Previous authors have focused on accounting for nonexponentially distributed state dwell times but have used traditional spiking models in each state (which, for Chen et al., 2009, does include the modeling of history effects but is otherwise determined only by the state-dependent background firing rates). Again, further work incorporating the strengths of each approach should be fruitful.
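To make the dwell-time distinction concrete, the following sketch (helper names hypothetical) draws state dwell times under the memoryless exponential assumption of a standard HMM and under a gamma assumption typical of an HSMM; the gamma case is exactly the situation in which elapsed time in the current state carries extra information.

```python
import random

random.seed(0)

def sample_dwells(n, dist):
    """Draw n state dwell times from the given sampler (hypothetical helper)."""
    return [dist() for _ in range(n)]

# Standard HMM: memoryless, exponentially distributed dwell times (rate 10 Hz).
exp_dwells = sample_dwells(20000, lambda: random.expovariate(10.0))

# Semi-Markov alternative: gamma-distributed dwell times with shape 3 and
# scale 0.1 (mean 0.3 s) -- same nonnegative support, but no longer memoryless.
gam_dwells = sample_dwells(20000, lambda: random.gammavariate(3.0, 0.1))

mean_exp = sum(exp_dwells) / len(exp_dwells)   # close to 0.1 s
mean_gam = sum(gam_dwells) / len(gam_dwells)   # close to 0.3 s
```

A gamma shape parameter greater than 1 makes very short dwells unlikely, which is one way reliably timed state sequences can arise.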

## Appendix A: Reduced Parameter Models

If the dimensionality of the stimulus (including the augmentation for the bias term and any history dependence) is *D*, then the total number of parameters corresponding to the transition and spiking filters of our model is *DN*^{2} (with an additional *N* − 1 for the initial state distribution). Since this number grows quadratically with *N*, it becomes infeasibly expensive, with respect to both computational time and data demands, to fit the parameters of a model with even a modest number of states and a modest stimulus dimensionality. Thus, it is reasonable to consider techniques for reducing the number of parameters and thereby controlling this quadratic dependence on *N*.

The obvious target for such a reduction is the set of parameters that defines the transition matrix, since these are the ones that grow quadratically. The total collection of transition parameters can be thought to constitute a rank 3 tensor **K**′ of dimensionality *D* × *N* × (*N* − 1). Thus **K**′_{dnm} corresponds to the *d*th element of the filter for the transition from state *n* to state *m*. Ahrens, Paninski, and Sahani (2008) developed a formalism for the approximation of a full-rank tensor such as **K**′ by the sum of lower-rank tensors. In general, there are many ways to decompose a full-rank tensor into low-rank components, which can result in reduced parameter sets as small as *D* + 2*N* − 1 for the most restricted case (i.e., if each dimension of **K**′ is considered to vary independently of the others).

For example, consider **K**′_{n}, the *D* × (*N* − 1) matrix composed of all filters corresponding to transitions away from state *n*. This matrix can be approximated as

**K**′_{n} ≈ ∑_{r=1}^{R} **u**^{n}_{r}(**w**^{n}_{r})^{⊤},

where *R* is the desired rank of the approximation to **K**′_{n}, which is at most the lesser of *D* and *N* − 1. Each individual transition filter is thus a linear combination of the *R* filters **u**^{n}_{r}, with the mixing coefficients given by the weight vectors **w**^{n}_{r}. The directions in stimulus space given by the filters **u**^{n}_{r} can be thought of as the set of stimulus triggers (or single trigger, in the case that *R* = 1) for transitioning away from state *n*, while the weight vectors dictate which transition is most likely given the trigger.^{10} While this reduced model certainly has more restricted state dynamics, it may reasonably capture the behavior of some multistate neural systems. The total number of parameters in this reduced model is *NR*(*D* + *N* − 1).
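The savings can be checked with simple arithmetic. The helper functions below (hypothetical names) count parameters for the full transition model, *D*·*N*·(*N* − 1), and for the rank-*R* reduction, *N*·*R*·(*D* + *N* − 1):

```python
def full_param_count(D, N):
    # One D-dimensional filter per ordered state pair (n, m) with m != n.
    return D * N * (N - 1)

def reduced_param_count(D, N, R):
    # Per state n: R trigger filters of length D plus R weight vectors
    # of length N - 1.
    return N * R * (D + N - 1)

# Example: 100-dimensional stimulus, 10 states.
full = full_param_count(100, 10)          # 9000 parameters
rank1 = reduced_param_count(100, 10, 1)   # 1090 parameters
```

For this example the rank-1 model needs roughly one-eighth as many transition parameters as the full model.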

Another example is essentially the converse of the last. In this case, we consider , the matrix consisting of all filters corresponding to transitions into state *m*. If the parameter reduction is performed as before, the interpretation is that one or more triggers (the number of which depends on *R*) drive the system into state *m*, and the responsiveness to a given trigger when the system is in state *n* depends on the weight vectors .

A caveat when using these techniques is that the ECLL maximized during each M-step is no longer concave in the full set of model parameters as described in section 2.2.6. However, the ECLL is concave in the filters while the weights are held constant and concave in the weights while the filters are held constant. Thus, the M-step may be solved by alternately maximizing the ECLL with respect to each parameter subset until the procedure converges. Unfortunately, there is no way to guarantee that the globally optimal solution for each M-step is found when using these low-rank models (i.e., local optima may exist). However, empirical results in Ahrens, Paninski, et al. (2008) suggest that this is not a serious problem in practice.
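The alternating scheme can be illustrated on a toy rank-1 least-squares problem (a stand-in for the ECLL; all names hypothetical): the objective is not jointly concave in the filter and the weights, but holding either subset fixed makes the other subproblem exactly solvable, so alternating updates decrease the error monotonically.

```python
def als_rank1(M, iters=20):
    """Fit M ~ outer(u, w) by alternating exact coordinate updates."""
    D, K = len(M), len(M[0])
    u = [1.0] * D
    w = [1.0] * K
    errs = []
    for _ in range(iters):
        # Optimal w given u (separable least squares).
        uu = sum(x * x for x in u)
        w = [sum(u[d] * M[d][k] for d in range(D)) / uu for k in range(K)]
        # Optimal u given w.
        ww = sum(x * x for x in w)
        u = [sum(M[d][k] * w[k] for k in range(K)) / ww for d in range(D)]
        errs.append(sum((M[d][k] - u[d] * w[k]) ** 2
                        for d in range(D) for k in range(K)))
    return u, w, errs

# A matrix that is exactly rank 1 (M[d][k] = d*k), so the alternating fit
# can drive the squared error to zero.
M = [[float(d * k) for k in range(1, 5)] for d in range(1, 4)]
u, w, errs = als_rank1(M)
```

As in the low-rank M-step, each half-update is a global optimum of its subproblem, so the error sequence can never increase.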

Besides low-rank models, other parameter-reduction techniques are possible. The simplest is to assume that some of the transitions are stimulus independent, for which one might have some a priori biological evidence. In this case, these filters can be removed from the model and the corresponding transitions fit with homogeneous rates. Ultimately, one may even wish to eliminate certain transitions altogether and define those rates to be zero.

A more robust and general approach with many desirable properties is to add a prior distribution over the parameters that can control overfitting of the data even in the case of very high-dimensional parameter spaces. Essentially such priors reduce the number of parameters in the model from the allowable maximum to the effective subset needed to represent the data appropriately. Parameters that are not in this effective subset have their values dictated by their prior distributions. In this case, the learned maximum a posteriori parameter setting is that which maximizes the product of the prior and the likelihood rather than the likelihood alone. It is convenient to choose priors that do not affect the concavity of the ECLL (i.e., those that are log concave with respect to the parameters).

## Appendix B: Gradient Ascent of the Expected Complete Log Likelihood

The solution for the M-step for each iteration of EM—the parameter setting that maximizes the ECLL—is not analytically tractable. However, given the concavity constraints for the nonlinearities *g* and *f*, we know that a unique solution exists, and thus that it may be found by ascending the gradient of this likelihood. Although numerical gradient techniques are possible, maximization is much more rapid if the gradient and the Hessian (second-derivative matrix) have analytic solutions, which permits the application of Newton-Raphson optimization.

The ECLL decomposes into a sum of independent terms, one for each state *n* (i.e., the parameters of each row of the transition matrix can be optimized independently), and the gradient follows directly (equation B.2). The Hessian can be further broken down into two types of matrices, depending on whether the second derivative is taken with respect to the same filter as that from which the first derivative was calculated or with respect to another filter corresponding to a different transition in the same row (equations B.3 and B.4, respectively). Since the concavity constraints on *g* require that it be the exponential function (see section 2.2.6), the complexity of equations B.2, B.3, and B.4 is considerably reduced. For example, *g* and *g*′ cancel in the gradient, and the first term of the Hessian is equal to zero. Depending on the choice of the spiking nonlinearity *f*, the corresponding formulas for the spiking parameters may simplify considerably as well.

Thus, since analytic formulations for the gradient and Hessian can be found for all parameters, Newton-Raphson optimization can be used to solve the M-step.
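As a minimal one-parameter illustration (not the paper's actual filters), the sketch below runs Newton-Raphson on a Poisson log likelihood with exponential nonlinearity, L(k) = Σ_t [y_t·k·x_t − e^{k·x_t}·dt]; with f = exp the Hessian is strictly negative everywhere, so the iteration converges to the unique maximum.

```python
import math

def newton_poisson_1d(xs, ys, dt, k0=0.0, tol=1e-10, max_iter=50):
    """Maximize sum_t [y_t*k*x_t - exp(k*x_t)*dt] by Newton-Raphson."""
    k = k0
    for _ in range(max_iter):
        rates = [math.exp(k * x) for x in xs]
        grad = sum(y * x - x * r * dt for x, y, r in zip(xs, ys, rates))
        hess = -sum(x * x * r * dt for x, r in zip(xs, rates))  # always < 0
        step = grad / hess
        k -= step
        if abs(step) < tol:
            break
    return k

# Noise-free check: if the observed counts equal their expected values
# under k_true, the exact maximizer of the likelihood is k_true.
dt = 0.01
k_true = 1.5
xs = [-1.0, -0.5, 0.0, 0.5, 1.0] * 20
ys = [math.exp(k_true * x) * dt for x in xs]
k_hat = newton_poisson_1d(xs, ys, dt)
```

The same structure (analytic gradient and Hessian, Newton steps on a concave objective) carries over to the full multi-parameter M-step.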

## Appendix C: Continuous-Time HMMs

The continuous-time model is obtained by taking the limit *dt* → 0, where *t* is now a real number rather than the time-step index. For the off-diagonal terms of the transition matrix, the limit gives α_{nm,t} = λ′_{nm,t} *dt* + *o*(*dt*), where we use the notation lim_{x→0} *f*(*x*) = *g*(*x*) to indicate that *f*(*x*) = *g*(*x*) + *o*(*x*) for values of *x* near zero. To take the limit of the diagonal terms, we use a first-order Taylor expansion, which for small values of *x* yields an approximation of the form 1 − *x*, giving α_{nn,t} = 1 − ∑_{m≠n} λ′_{nm,t} *dt* + *o*(*dt*). The resulting probability distribution is consistent, as expected (i.e., ∑_{m} α_{nm,t} = 1). If we define a rate matrix with off-diagonal entries λ′_{nm,t} and diagonal entries −∑_{m≠n} λ′_{nm,t}, then the transition matrix can be written as the identity matrix plus *dt* times this rate matrix.

For small *dt*, there will never be more than one spike per time step, and so the emission matrix reduces from the description in equation 2.20 to a simple two-column matrix (i.e., the Poisson distribution becomes a binary distribution): η_{n0,t} ≈ 1 − λ_{n,t} *dt* and η_{n1,t} ≈ λ_{n,t} *dt*, where we use the linear terms of the Taylor expansion to make the approximation that *e*^{−x} ≈ 1 − *x* for values of *x* near 0. Consistency is again assured (i.e., η_{n0,t} + η_{n1,t} = 1).

### C.1. The E-Step.

When the emission *y*_{t} is the “no-spike” emission, the forward update involves only the no-spike emission probabilities, which can be written in matrix form. For small *dt*, this yields a linear differential equation for the forward probability vector (equation C.10). This equation holds from spike to spike (i.e., for all the *dt*'s when the emission is in fact “no spike”). If *t*_{i−1} and *t*_{i} are consecutive spike times, a numerical ODE solver can be used to determine the forward probabilities at *t*_{i} from those at *t*_{i−1} by using equation C.10. Determining the update at the spike times is similar: by taking the limit as *dt* → 0 and dropping the *dt* multiplier,^{11} we get the spike time update, in which the forward probability vector before the spike (i.e., the result of the integration from *t*_{i−1} to *t*_{i} using equation C.10) is rescaled by the state-dependent firing rates to give the vector after the spike. Finally, the forward probability vector is initialized with the initial state distribution.

The backward recursion follows analogously, and combining the forward and backward probabilities yields the marginal probabilities at each time *t*. It is clear from equation C.18 that the marginal probabilities are continuous between spike times (since the forward and backward probabilities are), while at the spike times they jump.

For the M-step, the expected transition rates *r*_{n→m,t} given the forward probabilities, the backward probabilities, and the current parameters are needed, where we substitute the single and pairwise marginals with equations 2.14 and 2.15. Notice that the forward probability *a*_{n,t−dt} cancels out of equation C.20. This reflects the Markovian nature of the model. The expected rate *r*_{n→m,t} is the transition rate to *m* assuming that the current state is *n*. The Markov assumption states that, given the present, the past and future are independent. In other words, assuming some assignment of the current state, the forward probabilities (which reflect the past) do not affect the expected transition rates (which reflect predictions about the future). At the nonspike times, equation C.20 reduces to equation C.21, and at the spike times to equation C.22. Equations C.21 and C.22 have an intuitive explanation. Between spikes, *r*_{n→m,t} (the expected rate of transition from state *n* to state *m*) is equal to λ′_{nm,t} (the rate given by the stimulus and the current parameter settings of the model) scaled by the ratio of the probabilities of the future given that the current state is *m* versus *n*. In other words, if the remainder of the spike train can be better explained by having the neuron in state *m* than in state *n* at time *t*, the rate should be increased beyond λ′_{nm,t}; otherwise, it should be reduced. At the spike times, the additional information of knowing that a spike occurred further scales the expected transition rate by the ratio of the firing rates between the two states, which is equal to the ratio of the probabilities of firing in each state. As is obvious from equations C.21 and C.22, the expected transition rates *r*_{n→m,t} are discontinuous at the spike times, where they jump by the firing-rate ratio, but are continuous between spikes.

### C.2. The M-Step.

As with the discrete-time case, the ECLL is maximized by gradient ascent. To guarantee the concavity of the ECLL, the constraints on the spiking nonlinearity remain the same as before (*f* must be convex and log concave), since each λ_{n,t} enters into equation C.24 as a log and a negative. A desirable feature of the continuous-time model is that these constraints are also enough to guarantee concavity with respect to *g*, as opposed to the more stringent requirement from the discrete-time case that *g* be exponential: equation C.23 depends on *g* only through log λ′_{nm,t} and −λ′_{nm,t}, where λ′_{nm,t} is given by *g* applied to the filtered stimulus.

## Appendix D: The Trial-Triggered Model

The trial-triggered model is a hybrid HMM–PSTH for modeling the evolution of the firing rates of cells recorded from a multistate neural network in the absence of an external time-varying stimulus. This hybrid model can be cast in the framework developed in this article by defining the “stimulus” at time step *t* to be the standard basis vector **e**_{t} in ℝ^{T}, where *T* is the number of time steps in the trial. The spiking and transitioning filters are thus length *T*, and each of the filter elements is used only once (i.e., the dot products yield the *t*th filter elements, thus giving the model as defined in section 2.3.1). The filters are thus PSTHs for spiking in state *n* and for transitioning from state *n* to state *m*, respectively.
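The construction can be seen in two lines: with a one-hot “stimulus,” the dot product with a length-*T* filter simply reads out the filter element for the current time step (names hypothetical).

```python
T = 5
k = [0.2, -1.0, 0.7, 0.0, 3.0]   # a length-T spiking filter (PSTH values)

def one_hot(t, T):
    """Standard basis vector e_t in R^T."""
    return [1.0 if i == t else 0.0 for i in range(T)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# The "stimulus" at time step t is e_t, so the filtered input reduces to
# the t-th filter element: the filter acts as a PSTH.
readouts = [dot(k, one_hot(t, T)) for t in range(T)]
```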

Since the filters now have length *T*, and given that the complexity of inverting a *T* × *T* matrix is cubic in *T*, it would appear that the convenient linear dependence on *T* is no longer preserved for the trial-triggered model. However, due to the decomposition of the ECLL into independent state- and cell-specific terms (see equation 2.44) and due to the nature of the smoothness priors (see equations D.1 and D.2), the Hessian of the ECLP is a sparse, banded matrix,^{12} for which efficient algorithms exist to perform the update given in equation D.5 in time linear in *T* (Paninski et al., 2009).
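The key computational point, that a banded Hessian makes the Newton update linear rather than cubic in *T*, can be sketched with the classic Thomas algorithm for the tridiagonal case (the band structure noted in footnote 12 for the spiking-filter sub-Hessians); names are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, b):
    """Solve A x = b for tridiagonal A in O(T) time (Thomas algorithm).
    lower[i] = A[i][i-1], diag[i] = A[i][i], upper[i] = A[i][i+1]."""
    T = len(diag)
    c = [0.0] * T
    d = [0.0] * T
    c[0] = upper[0] / diag[0]
    d[0] = b[0] / diag[0]
    for i in range(1, T):
        denom = diag[i] - lower[i] * c[i - 1]
        c[i] = (upper[i] / denom) if i < T - 1 else 0.0
        d[i] = (b[i] - lower[i] * d[i - 1]) / denom
    x = [0.0] * T
    x[-1] = d[-1]
    for i in range(T - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x

# A small system shaped like a smoothness-penalized Hessian:
# identity-plus-difference-penalty structure, diagonally dominant.
T = 6
diag = [3.0] * T
lower = [0.0] + [-1.0] * (T - 1)
upper = [-1.0] * (T - 1) + [0.0]
b = [1.0] * T
x = solve_tridiagonal(lower, diag, upper, b)
```

For the wider *N*-band case described in footnote 12, a banded Cholesky or LU factorization plays the same role, keeping the per-M-step solve linear in *T*.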

## Appendix E: The Transition-Triggered Model

From state *n*_{τ} (the τth state in the *n*th row of states), the restricted state-space connectivity permits transitions only to the next state in the current row (*n*_{τ+1}) or to one of the start states *m*_{0}, where *m* may equal *n*. This topology decouples the time step *t* from the depth in the state-space cascade τ and thus permits the state-dependent filters to evolve from the time of the last “state” transition (i.e., a transition to the beginning of a row of states). These filters can be thought of as peritransition time histograms (PTTHs) rather than PSTHs, as in the trial-triggered case. The spiking and transitioning rates are defined accordingly; unlike in the trial-triggered model, the state-specific firing rates are no longer time homogeneous.

To perform the E-step, the full *NT*-state system must be considered, albeit with certain simplifications due to the restricted structure of the transition matrix. First, we note that the forward recursion (see equation 2.9) has different expressions for states with τ = 0 and τ > 0. In the latter case, any state *n*_{τ} with τ > 0 can have been reached at time step *t* only if the system was in state *n*_{τ−1} at time *t* − 1 (as is clear from the state-space connectivity given in Figure 3b); thus, for these states, the forward recursion simplifies to a single-predecessor update (in the multicell setting, the single-cell emission probability is replaced by the product over cells). The states with τ = 0, on the other hand, can have been reached at time step *t* from any state with τ < *t*, which follows from the fact that the farthest depth into the state-space cascade achievable *t* − 1 time steps after trial onset is τ = *t* − 1; for these states, the forward recursion sums over all such predecessors. As always, the likelihood is computed from the forward probabilities at time *T*. Note that the computational complexity of the forward recursion as well as the storage requirement for the forward probabilities are now both quadratic in *T*, whereas they had previously been linear in *T*. This significant loss in efficiency is the cost of a transition-triggered model: in order to have the behavior of the system triggered on the transition times rather than on the elapsed time since trial onset, the forward recursion must keep track of both the transition index τ and the trial index *t*, thus squaring the complexity (although it would be possible to reduce this quadratic dependence if the maximum value of τ were restricted to be less than *T*).

The backward recursion is similar, with one term for the transitions to the start states (where all states *m*_{0} are reachable from any state *n*_{τ}) and a final term for the within-row transitions (where only state *n*_{τ+1} is reachable from *n*_{τ}). Again, the time and storage complexities of this recursion are quadratic in *T*.

In the transition term of the ECLL, the sum over *t* runs over the time steps of the trial, and the sum over *m* reflects the fact that only transitions to the start states are parameterized (e.g., a transition filter parameterizes the transition from *n*_{τ} to *m*_{0}). Recall that the transitions along the rows of the state-space (e.g., from *n*_{τ} to *n*_{τ+1}) are not parameterized and are determined by the residual probability (see equation 2.33). By rearranging the sums over *t* and τ, the expression takes a more convenient form. It is clear from equation E.10 that the only consecutive-pairwise marginals that are needed are those involving transitions to the first column of states (given by equation E.8). Furthermore, by moving the sum over time step *t* to be the inner sum, it is clear that the consecutive-pairwise marginals themselves are not needed, but rather the sum over *t* of these marginals for the transitions between every state *n*_{τ} and each of the states in the first column *m*_{0}. Although computation of these summed pairwise marginals still requires quadratic time, the storage requirement is substantially reduced. However, the emission-dependent term shows that the full set of single marginals still needs to be stored for the update of the spiking parameters (equation E.11). Note that this expression is unchanged from equation 2.38 except for the additional sums.

Newton-Raphson optimization can again be employed to update the parameters of the transition-triggered model, as the gradient and Hessian of the ECLL are essentially unchanged from the standard setting (albeit with the additional sums present in equations E.10 and E.11). As with the trial-triggered model, the inversion of the Hessian mandated by the optimization procedure can be computed efficiently by exploiting the banded structure of the Hessian. Since constructing the Hessian during each M-step is itself the more expensive operation, the Newton-Raphson update is not a computational bottleneck in this setting. Note that, as with the trial-triggered model, smoothness priors (see equations D.1 and D.2) are employed to prevent overfitting.

## Appendix F: Concavity Constraints for the Bernoulli Spiking Model

In the Poisson spiking model, the nonlinearity *f* entered into the ECLL as both its logarithm and its negative. Therefore, the sufficient and necessary conditions for guaranteeing concavity were that log *f* be concave and *f* be convex. In a Bernoulli spiking model, the spiking term of the ECLL in equation 2.36 changes (equation F.1). The actual dependence of equation F.1 on the nonlinearity *f* is determined by the definitions of *p*(*y*_{t} = 0) and *p*(*y*_{t} = 1). One reasonable choice is to have the probability of not spiking be the same as in the Poisson spiking model; the probability of having a spike is then equal to the probability of one or more spikes in the Poisson model. This model has the desirable property that, in the limit of small *dt*, the Bernoulli formulation converges to the Poisson formulation. Thus, we have *p*(*y*_{t} = 0) = *e*^{−λ_{n,t}dt} (equation F.2) and *p*(*y*_{t} = 1) = 1 − *e*^{−λ_{n,t}dt} (equation F.3), where we have used equation 2.34 and the fact that the probabilities must sum to 1. Substituting these definitions into equation F.1 gives equation F.4. The second term of equation F.4 involves *f* as before in equation 2.38, and so *f* must be convex. If we assume that the other original constraint also holds, that log *f* is concave, then we can show that the first term is also concave. From the concavity of log *f*, we have equation F.5. The second derivative of the first term in equation F.4 can then be expanded and bounded using the result from equation F.5, the convexity of *f* (i.e., *f*″ ⩾ 0), and the fact that 1 − *e*^{u} + *u* ⩽ 0 for all values of *u*. Thus, we have shown that for the Bernoulli spiking model as defined by equations F.2 and F.3, convexity and log concavity of the nonlinearity *f* are sufficient conditions to guarantee the concavity of the M-step.
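These relations are easy to verify numerically. The snippet below checks that the Bernoulli spike probability 1 − *e*^{−λ·dt} approaches the first-order Poisson rate λ·*dt* as *dt* shrinks, and that the elementary inequality 1 − *e*^{u} + *u* ⩽ 0 used in the concavity argument holds on a grid of test points.

```python
import math

lam = 20.0  # firing rate (1/s)

def p_spike(lam, dt):
    # Bernoulli spike probability: one or more Poisson events in a bin of width dt.
    return 1.0 - math.exp(-lam * dt)

# Relative error against the Poisson first-order term lam*dt shrinks with dt.
rel_errs = [abs(p_spike(lam, dt) - lam * dt) / (lam * dt)
            for dt in (1e-2, 1e-3, 1e-4)]

# The elementary bound used in the concavity proof: 1 - e^u + u <= 0 for all u.
bound_vals = [1.0 - math.exp(u) + u for u in (-5.0, -0.1, 0.0, 0.1, 5.0)]
```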

### F.1. Gradient and Hessian of the Bernoulli Spiking Model.

## Acknowledgments

S.E. acknowledges the NIH Medical Scientist Training Program, the Columbia University M.D.-Ph.D. Program, and the Leon Levy Foundation for supporting this research. L.P. is supported by an NSF CAREER award, an Alfred P. Sloan Research Fellowship, and a McKnight Scholar award. We also thank K. Miller, B. Bialek, L. Abbott, M. Eisele, T. Toyoizumi, X. Pitkow, and J. Pillow for many helpful discussions. We additionally thank our reviewer for pointing out the potential application of our work outlined in section 5.3.

## Notes

^{1}

In general, any function of the form *e*^{cu+d} satisfies log concavity and log convexity in *u*. But for our model, where the argument of the nonlinearity is the filtered stimulus plus the bias term *b*, the parameters *c* and *d* can be eliminated by scaling the filter and changing the bias term. Thus, we can restrict ourselves to consider only the nonlinearity *e*^{u} without loss of generality.

^{2}

Note that no parameter is needed for the transition from state *n*_{t} to *n*_{t+1}, as this is the default behavior of the system in the absence of any other transition (see equation 2.21).

^{3}

Comparing Figure 4c to Figures 4a and 4b, one might conclude that the transition filter is relatively insignificant, since the magnitudes of its weights are much less than those of the spiking filters. Recall, however, that the nonlinearity for transitioning *g* grows exponentially, while the nonlinearity for spiking *f* grows only quadratically, so small-magnitude filter weights can still have pronounced effects on the transition rate.

^{4}

These experiments also show that the temporal component of the preferred stimulus differs between the two states, which we could model by including multiple time slices in the stimulus filters. For simplicity and reduction of parameters, we ignore the temporal differences in our model.

^{5}

To remove any potential bias, the learned parameter set was not trained on the stimulus used to create Figure 6.

^{6}

In Jones et al. (2007) and in the analysis presented here, the trials from each session and tastant pair are assumed to be i.i.d. samples, which could be a false assumption due to, for example, changing motivational and arousal factors. However, a previous analysis investigated this issue and found that to a first approximation, the neural spiking behavior remains stationary over the course of a recording session (Fontanini & Katz, 2006).

^{7}

In fact, this Poisson-spiking PSTH model is exactly a one-state trial-triggered model whose parameters (the *C* spiking filter vectors, one per cell) can be estimated using the same algorithm developed for the trial-triggered model (see appendix D). Specifically, estimation is accomplished by a single M-step (since there are no latent variables in the model).

^{8}

Although cross-validation chooses a smoothness penalty for the spiking filters that yields time homogeneous spiking, the optimal degree of smoothness for the transitioning filters does result in time-inhomogeneous transitioning. This result indicates that the true state dwell times are not exponentially distributed (i.e., that there is some reliability in the transitioning behavior of the system from trial to trial). Thus, the trial-triggered model marginally outperforms the model presented in section 4.1 by capturing this inhomogeneity. Further analysis with the semi-Markov transition-triggered model described in section 2.3.2 will appear elsewhere (Escola, 2009).

^{9}

They also remove the arrows between the stimuli and the hidden state variables, meaning that they have homogeneous transition rates.

^{10}

Since the weights can be either positive or negative, each filter actually corresponds to two potential stimulus triggers: the filter direction and its negation.

^{11}

In a continuous-time model, the probability of any given spike train with specific spike times is zero. Thus, we discard these zero multipliers since they are independent of the model parameters and merely add a constant (albeit infinite) component to the log likelihood.

^{12}

Specifically, the sub-Hessians corresponding to the transition filter updates can be shown to have *N* unique bands (the main diagonal, *N* − 1 upper subdiagonals, and *N* − 1 symmetric lower subdiagonals) where *N* is the number of states, while the sub-Hessians corresponding to the spiking filters prove to be tridiagonal.