## Abstract

Neurons communicate nonlinearly through spike activities. Generalized linear models (GLMs) describe spike activities with a cascade of a linear combination across inputs, a static nonlinear function, and an inhomogeneous Bernoulli or Poisson process, or Cox process if a self-history term is considered. This structure considers the output nonlinearity in spike generation but excludes the nonlinear interaction among input neurons. Recent studies extend GLMs by modeling the interaction among input neurons with a quadratic function, which considers the interaction between every pair of input spikes. However, quadratic effects may not fully capture the nonlinear nature of input interaction. We therefore propose a staged point-process model to describe the nonlinear interaction among inputs using a few hidden units, which follows the idea of artificial neural networks. The output firing probability conditioned on inputs is formed as a cascade of two linear-nonlinear (a linear combination plus a static nonlinear function) stages and an inhomogeneous Bernoulli process. Parameters of this model are estimated by maximizing the log likelihood on output spike trains. Unlike the iterative reweighted least squares algorithm used in GLMs, where the performance is guaranteed by the concave condition, we propose a modified Levenberg-Marquardt (L-M) algorithm, which directly calculates the Hessian matrix of the log likelihood, for the nonlinear optimization in our model. The proposed model is tested on both synthetic data and real spike train data recorded from the dorsal premotor cortex and primary motor cortex of a monkey performing a center-out task. Performances are evaluated by discrete-time rescaled Kolmogorov-Smirnov tests, where our model statistically outperforms a GLM and its quadratic extension, with a higher goodness-of-fit in the prediction results. 
In addition, the staged point-process model describes nonlinear interaction among input neurons with fewer parameters than quadratic models, and the modified L-M algorithm also demonstrates fast convergence.

## 1 Introduction

Neurons communicate through spike activities, where the temporal pattern of spike timing varies to pass information. The invasive electrophysiology recording technique allows the simultaneous acquisition of neuronal population activities. Neuroscientists are now able to study and describe how the information is transferred through spike activities with respect to inputs such as visual stimuli (Chichilnisky, 2001; Schwartz, Chichilnisky, & Simoncelli, 2002; Pillow & Simoncelli, 2003; Rust, Schwartz, Movshon, & Simoncelli, 2004), and outputs such as movement generation (Wessberg et al., 2000; Sanchez et al., 2002; Ergun, Barbieri, Eden, Wilson, & Brown, 2007; Wang, Paiva, Príncipe, & Sanchez, 2009), or among different areas in the brain (Okatan, Wilson, & Brown, 2005; Zanos et al., 2008; Eldawlatly, Jin, & Oweiss, 2009; Stevenson et al., 2009; Quinn, Coleman, Kiyavash, & Hatsopoulos, 2011). Wessberg et al. (2000) designed a tunable prosthetic retina encoder to partially restore visual function for humans with retinal degenerative dysfunctions. Serruya, Hatsopoulos, Paninski, Fellows, and Donoghue (2002) developed a brain-machine interface that allows robot arm control using spike activities recorded from a monkey's primary motor cortex. Song et al. (2007) proposed to design cognitive prostheses that can compensate for the loss of transregional communication caused by brain injury (Berger et al., 2005, 2012; Guggenmos et al., 2013).

Many computational methods have been proposed to analyze the functional relationship that encodes the stimulus into spike trains or decodes the states from the spike trains and predicts spikes from spike inputs. The encoding model describes how the neurons respond to the stimulus (or receptive fields) and uses two groups of methods to estimate parameters. The first set of methods, the classic moment-based methods, characterizes individual neural gain using the first two moments of spike-triggered stimulus distribution, namely, the spike-triggered average (STA) (De Boer & Kuyper, 1968; Marmarelis & Naka, 1972; Bryant & Segundo, 1976; Eggermont, Johannesma, & Aertsen, 1983; de Ruyter van Steveninck & Bialek, 1988; Ringach, Sapiro, & Shapley, 1997; Chichilnisky, 2001) and the spike-triggered covariance matrix (STC) (Schwartz et al., 2002; Sharpee, Rust, & Bialek, 2004; Horwitz, Chichilnisky, & Albright, 2005; Slee, Higgs, Fairhall, & Spain, 2005; Park & Pillow, 2011; Sharpee, Atencio, & Schreiner, 2011). Both the STA and STC rely on the assumption that inputs follow a gaussian distribution (Chichilnisky, 2001; Paninski, 2003; Schwartz, Pillow, Rust, & Simoncelli, 2006; Samengo & Gollisch, 2013) and cannot be directly used to describe the spike-in and spike-out relationship. The second set of methods, generalized linear models (GLMs), investigates the functional relationship between the stimulus and the spike output by maximizing the log likelihood on output spike trains (Brillinger, 1988; Brown, Kass, & Mitra, 2004), which has also been successfully used in characterizing receptive fields in sensory areas (Calabrese, Schumacher, Schneider, Paninski, & Woolley, 2011), sensorimotor areas (Park, Meister, Huk, & Pillow, 2014) and the hippocampus (Brillinger, 1988; Brown, Kass, & Mitra, 2004). 
GLMs are a powerful technique for fitting output spike trains: they generalize linear regression over each input's history, followed by a canonical link function (Brown, Frank, Tang, Quirk, & Wilson, 1998; Paninski, 2004; Truccolo, Eden, Fellows, Donoghue, & Brown, 2005). However, they assume that the output spike probability depends on a linear combination of inputs, which ignores the nonlinear interaction across the input neurons.

Many efforts have been made in modeling input nonlinearity. Most of them introduce nonlinear preprocessing before the GLM. For example, Ahrens, Paninski, and Sahani (2008) proposed a generalized bilinear model, which performs a pointwise nonlinear transformation before the linear filtering of the GLM, while McFarland, Cui, and Butts (2013) added a rectified linear unit to extract visual stimulus patterns as a neuronal input before the GLM. Artificial neural networks (ANNs) have also been developed to encode neural activities from stimulus inputs. Prenger, Wu, David, and Gallant (2004) designed a three-layer feedforward network to predict the smoothed spike rate of V1 neurons from natural image stimuli, and Wu, Park, and Pillow (2015) developed a three-layered neural subunit model to encode spikes of the primary visual cortex from white noise stimulus. In that model, the first layer is formed by identical, spatially shifted convolutional kernels followed by a static nonlinearity. Finally, Benjamin et al. (2017) implemented a recurrent neural network (long short-term memory, LSTM) to predict M1 spikes from macaque hand movement trajectories.

The above studies apply to encoding external stimulus into neural activities, but they are not appropriate models to investigate spike prediction from spiking input neurons. Examples of spike-in/spike-out prediction models include building the relation between spike trains recorded from hippocampal CA1 and CA3 (Song et al., 2007, 2009), the relation between spike trains from neurons in different prefrontal cortical columnar layers (Hampson et al., 2012; Marmarelis et al., 2014), and the relation between spike trains recorded from PMd (dorsal premotor cortex) and M1 (primary motor cortex) (Stevenson et al., 2009; Xing et al., 2017). A few of these have also considered the interaction among input neurons. In Song et al. (2007), a modified Volterra kernel approach, in which the Volterra kernel is used to approximate the correlations between each pair of neurons, was proposed to formulate the nonlinear interaction among hippocampus CA3 neurons. Results of experiments showed that the prediction accuracies increased with kernel complexity, and the second- and third-order Volterra kernel models were found to successfully predict the CA1 output spike distribution from CA3 input spike trains. The introduction of the Volterra kernel into the GLM enables nonlinear interaction among input neurons but increases the number of parameters due to the correlation calculated in a pairwise manner.

To mitigate the overfitting problem introduced by Volterra kernels, Marmarelis and his colleagues introduced and developed the Laguerre expansion techniques (Marmarelis, 1993), which greatly reduce the number of free parameters in a given-ordered Volterra system. Marmarelis et al. (2013) treated the Laguerre filters and polynomial nonlinearity as several “principal dynamic modes,” which were estimated using matrix decomposition techniques (eigenvalue decomposition and singular value decomposition). Alataris, Berger, and Marmarelis (2000) reframed the Laguerre expanded Volterra model as a Laguerre-Volterra network (LVN) with a polynomial activation function in its hidden layer. Geng and Marmarelis (2017) extended the LVN into a recurrent LVN (or autoregressive sparse LVN) by adding an L1 regularization term in its cost function and using one-lag-delayed output as the autoregressive input. Geng et al. (2018) extended the LVN into a neuronal mode network by replacing the polynomial activation function with a nonparametric multidimensional histogram.

The above Laguerre-Volterra modeling studies are based on dimensionality reduction only up to third-order Volterra systems. Although any continuous function can be approximated by a finite-order Volterra kernel, the approximation of the nonlinear function always requires a considerably high kernel order (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1999), as lower-order Volterra kernels may have limited power in expressing nonlinearity (Fitzgerald, Rowekamp, Sincich, & Sharpee, 2011; Wu et al., 2015). The brain is a highly correlated dynamic nonlinear system, where neurons with dendrites and synapses communicate through intraregional or interregional neuronal connections (Shepherd, 2004; Andersen, Morris, Amaral, Bliss, & O'Keefe, 2007). The neural interaction among input neurons could be arbitrary. In addition, the recording of neural activity is only a sampling of the whole dynamic system. The information transmission between input and output neurons may be affected by spike information generated by other unrecorded neurons. These unrecorded hidden neurons interconnect and form a neural pathway, where the feedforward activity of downstream output neurons may directly or indirectly feed back to upstream input neurons (Gerstner & Kistler, 2002; Trappenberg, 2002). Therefore, the functional relationship between the input and output neurons could be arbitrarily nonlinear. In a feedforward computational model of output spike prediction, the interaction among the recorded input neurons can likewise be regarded as arbitrarily nonlinear with respect to the generation of the outputs.

In this letter, we follow the idea of ANNs (Cybenko, 1989; Kůrková, 1991) to use hidden units that allow the interaction among input neurons, as well as the spike generation of output neurons, to be arbitrarily nonlinear. Importantly, our model describes the nonlinear interaction among input neurons with two cascaded linear-nonlinear (LN) stages, followed by an inhomogeneous Bernoulli process for the output spike generation. By maximizing the log likelihood on output spike trains, we theoretically demonstrate that this staged point-process model is equivalent to building a two-layered ANN that deals probabilistically with spike observations. A new training algorithm is developed to estimate model parameters using a modified L-M method. The model is validated on synthetic spike data and real spike train data recorded from PMd and M1 in a monkey performing a four-direction center-out task (Wang et al., 2015). The performance of the proposed method is evaluated by goodness-of-fit using a discrete-time rescaling Kolmogorov-Smirnov (DTR-KS) test (Haslinger, Pipa, & Brown, 2010).

The staged point-process model and the modified L-M algorithm are presented in section 2, together with a short review of GLMs and GLMs with second-order polynomial series (second-order GLMs; Song et al., 2007). Performance of our model, GLMs, and second-order GLMs is evaluated and compared in section 3. The characteristics, limitations, and possible extensions are discussed in section 4, and a conclusion is given.

## 2 Method

In this section, we start with a brief description of GLMs and second-order GLMs and then propose the staged point-process model based on the point-process framework. This model is designed to predict spike trains of an output neuron (Y) from the spike trains of input neurons (X). The novelty of the proposed model is to include hidden units (Z) that allow nonlinear interaction among input neurons. Parameters of the staged point-process model are estimated by maximum-likelihood estimation (MLE) using a modified L-M optimization algorithm. In addition, the connection between our model and ANNs is theoretically revealed in the appendix.

### 2.1 GLMs and Second-Order GLMs

Given the observation interval $(0,T]$, let discrete-time intervals be indexed by $k$. $X_i^k$ denotes the spike observation of the $i$th input neuron during the discrete-time interval $(t_{k-1},t_k]$. $Y^k$ denotes the spike observation of the output neuron Y during the time interval $(t_{k-1},t_k]$, $\lambda_Y^k$ denotes the firing probability of the output neuron Y during $(t_{k-1},t_k]$, and $\hat{Y}^k$ denotes the spike prediction of the neuron Y during $(t_{k-1},t_k]$. Let $N_X$ be the number of input neurons in X. Hence, $X_{1:N_X}^{1:k}$ represents the input spike trains over all neurons in X during the time interval $(0,t_k]$, $Y^{1:k}$ represents the spike train of the neuron Y during $(0,t_k]$, $\hat{Y}^{1:k}$ represents the predicted spike train of the neuron Y during $(0,t_k]$, and $\lambda_Y^{1:k}$ represents the firing probability time series of the neuron Y during $(0,t_k]$.
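As a reference point for the staged model below, the single-stage GLM cascade (a linear combination across input histories, a static nonlinearity, and a Bernoulli spike observation) can be sketched in a few lines. This is a minimal illustration with our own hypothetical function names, not the authors' implementation; a logistic link is assumed as the static nonlinearity so that each per-bin firing probability stays in (0, 1).

```python
import numpy as np

def glm_firing_prob(X_hist, w, b):
    """Conditional per-bin firing probability of a Bernoulli GLM.

    X_hist : (K, N_X * H) matrix; row k stacks the H most recent bins
             of every input neuron's spike train up to bin k.
    w, b   : weights and bias of the linear stage.
    A logistic link keeps the probability in (0, 1).
    """
    u = X_hist @ w + b                 # linear combination across inputs
    return 1.0 / (1.0 + np.exp(-u))    # static nonlinearity (sigmoid)

def bernoulli_log_likelihood(y, lam, eps=1e-12):
    """Point-process log likelihood of the 0/1 output spike train y."""
    lam = np.clip(lam, eps, 1.0 - eps)
    return np.sum(y * np.log(lam) + (1.0 - y) * np.log(1.0 - lam))
```

With a log link and Poisson observations the same structure would give a Poisson GLM; the Bernoulli form matches the 10-msec binned spike trains used in this letter.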

### 2.2 The Staged Point-Process Model

Here, we add an extra layer of hidden neurons that allows the arbitrary nonlinear interaction among input neurons. Let $Z_j$ be the $j$th hidden neuron, $\lambda_{Z_j}^k$ be the $j$th hidden unit representing the firing probability of $Z_j$ during the time interval $(t_{k-1},t_k]$, and $N_Z$ be the number of hidden neurons in Z. Thus, $\lambda_{Z_{1:N_Z}}^{1:k}$ denotes the time series of all hidden units' values during the time interval $(0,t_k]$.

The staged point-process model can be described as a cascade structure containing two LN stages and an inhomogeneous Bernoulli or Poisson process. The cascaded LN stages approximate an ANN structure, which allows arbitrary nonlinear mapping between input and output spikes (Cybenko, 1989; Kůrková, 1991). Here, specifically, it allows the nonlinear interaction between input spikes. Unlike an ANN, however, our model uses an inhomogeneous Bernoulli or Poisson process to generate spikes for output neuron predictions.
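The cascade just described, two LN stages followed by a Bernoulli draw, can be sketched as follows. This is a schematic under assumed names (`W1`, `b1` for the first stage; `w2`, `b2` for the second), with a sigmoid standing in for the static nonlinearity of both stages; the model's actual nonlinearities are given by the equations of this section.

```python
import numpy as np

def staged_forward(X_hist, W1, b1, w2, b2, rng=None):
    """Two cascaded linear-nonlinear stages + inhomogeneous Bernoulli draw.

    Stage 1: hidden units lambda_Z = sigma(X_hist @ W1 + b1) model the
             nonlinear interaction among input neurons.
    Stage 2: output probability lambda_Y = sigma(lambda_Z @ w2 + b2).
    If rng is given, also sample a predicted output spike train.
    """
    sigma = lambda u: 1.0 / (1.0 + np.exp(-u))
    lam_z = sigma(X_hist @ W1 + b1)    # (K, N_Z) hidden-unit values
    lam_y = sigma(lam_z @ w2 + b2)     # (K,) output firing probability
    if rng is not None:                # inhomogeneous Bernoulli process
        return (rng.random(lam_y.shape) < lam_y).astype(int), lam_y
    return lam_y
```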

### 2.3 MLE by Modified L-M Algorithm

The modified L-M algorithm updates weight parameters not only by calculating the gradient vector and Hessian matrix but also by adjusting the combination coefficient $\mu$. When an update fails to increase the log likelihood, $\mu$ is increased, moving the algorithm closer to gradient ascent; otherwise, $\mu$ is decreased and the weight update is accepted. The algorithm stops if the change of log likelihood stays below a threshold $\delta$ for several (seven in this letter) successive steps. The initial value of $\mu$ is set to 0.01 and $\delta = 0.001$ (Hagan & Menhaj, 1994). In practice, it is important to try several different initial schemes of $\mu$ and $\delta$ to obtain a better optimization. The optimal result is acquired by maximizing the log likelihood on the training data; in this letter, we calculate the goodness-of-fit on the validation data to select the optimal values of $\mu$ and $\delta$.
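The update rule above can be sketched as follows, assuming helper callables `loglik` and `grad_hess` that evaluate the log likelihood and its exact gradient and Hessian; the damping cap and the exact accept/reject bookkeeping are our own simplifications of the scheme described in the text.

```python
import numpy as np

def lm_step(theta, grad, hess, mu):
    """One damped update: solve -(H - mu*I) d = g, ascend the log likelihood."""
    n = len(theta)
    return theta + np.linalg.solve(-(hess - mu * np.eye(n)), grad)

def modified_lm(theta, loglik, grad_hess, mu=0.01, delta=0.001,
                patience=7, max_iter=1000):
    """Maximize loglik(theta) with a damped Newton (L-M style) iteration.

    mu grows (toward gradient ascent) on a rejected step and shrinks
    (toward Newton) on an accepted one; stop after `patience` successive
    steps whose log-likelihood change is below delta.
    """
    ll, stall = loglik(theta), 0
    for _ in range(max_iter):
        g, H = grad_hess(theta)
        cand = lm_step(theta, g, H, mu)
        ll_new = loglik(cand)
        if ll_new > ll:
            stall = stall + 1 if ll_new - ll < delta else 0
            theta, ll, mu = cand, ll_new, mu / 10.0   # accept, toward Newton
        else:
            stall += 1                                 # no progress this step
            mu = min(mu * 10.0, 1e12)                  # toward gradient ascent
        if stall >= patience:
            break
    return theta, ll
```

On a concave surface with large $\mu$ the step reduces to $g/\mu$, a small gradient-ascent move; with small $\mu$ it approaches the full Newton step, which is what gives the fast convergence reported in section 3.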

The estimation of the staged point-process model is equivalent to the backpropagation training of a two-layered ANN, but the cost function is switched from the traditional least squares error to the cross-entropy error (the point-process log likelihood), which handles 0-1 point-process observations. Due to this difference, the Hessian matrix here is calculated directly from the log likelihood of Bernoulli spike events, without the Jacobian-matrix approximation.
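For a single logistic LN stage, these Bernoulli-likelihood derivatives are standard; writing $\lambda^k = \sigma(\theta^\top x^k)$ for the per-bin firing probability given the stacked input history $x^k$, the log likelihood and its exact gradient and Hessian are

```latex
\ell(\theta) = \sum_{k} \Big[ Y^k \log \lambda^k + (1 - Y^k) \log (1 - \lambda^k) \Big],
\qquad
\nabla_{\theta} \ell = \sum_{k} \big( Y^k - \lambda^k \big)\, x^k,
\qquad
\nabla_{\theta}^{2} \ell = - \sum_{k} \lambda^k \big( 1 - \lambda^k \big)\, x^k (x^k)^{\top}.
```

In the two-stage cascade, the chain rule through the hidden units contributes additional second-derivative terms; these are exactly the terms a Gauss-Newton (Jacobian-product) approximation would drop and that the direct Hessian calculation retains.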

### 2.4 Assessing Goodness-of-Fit

### 2.5 Data Collection and Preprocessing

To obtain real data, we simultaneously recorded the neural spike trains from the PMd and M1 cortical areas of a rhesus macaque monkey in a behavioral experiment. The monkey was trained to perform a four-direction center-out task by controlling a joystick. In each trial, a directional target randomly appeared at one location out of four on a monitor, and the macaque pulled the joystick to move the cursor to match the target displayed on the screen 100 cm in front. Raw neural data were recorded by a Cerebus Data Acquisition System (Blackrock Microsystems, Salt Lake City, UT, U.S.A.) with a sampling rate of 30 kHz. Spike trains from a total of 127 neurons in the macaque's PMd cortex and 102 neurons in the M1 cortex were recorded for 636.4 seconds. The raw data were then filtered by a bandpass filter with the band ranging from 250 Hz to 7.5 kHz. Offline spike sorting was performed on the filtered neural data using the software Offline Sorter (Plexon, Dallas, TX, U.S.A.; Nicolelis et al., 2003) to restore the spike timings for further processing. The bin size of the spike train was set to be 10 msec, such that the time intervals having more than one spike constitute 1.29% (PMd) and 1.05% (M1) of all the time slots containing spikes (Wang et al., 2009). The animal-handling procedures were approved by the Animal Care Committee at Zhejiang University, China, abiding strictly by the Guide for Care and Use of Laboratory Animals (China Ministry of Health). Further details of this experiment can be found in Xu et al. (2013) and Wang et al. (2015).
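The 10-msec binning step can be reproduced with a short sketch; the function names are ours, not the authors' pipeline, and the spike timestamps are assumed to be the restored spike timings in seconds.

```python
import numpy as np

def bin_spikes(spike_times_s, duration_s, bin_size_s=0.01):
    """Bin spike timestamps (seconds) into counts per 10-msec bin."""
    n_bins = int(np.ceil(duration_s / bin_size_s))
    counts, _ = np.histogram(spike_times_s,
                             bins=n_bins, range=(0.0, duration_s))
    return counts

def multi_spike_fraction(counts):
    """Fraction of occupied bins that contain more than one spike."""
    occupied = counts > 0
    return np.sum(counts > 1) / max(np.sum(occupied), 1)
```

`multi_spike_fraction` is the statistic quoted above (1.29% for PMd, 1.05% for M1), used to check that a 10-msec bin rarely holds more than one spike.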

Since not all recorded neural activities are related to the movement task (Xu et al., 2013), we preprocessed the spike trains to identify neurons that are more correlated with the behavior and with each other across the two cortical areas. For each M1 spike train, the top 10 PMd neurons having maximum mutual information (Nirenberg & Latham, 2003) with that M1 spike train were selected as inputs for the prediction model. The selection helps reduce the parameter space for a second-order GLM, where the number of parameters grows quadratically with the input dimensionality. After preprocessing, spike trains from 54 M1 neurons and their corresponding top 10 PMd neurons were used for our study.
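One simple realization of this selection step is a plug-in mutual information estimate on the binned (binarized) spike trains, keeping the top-k inputs; this is an illustrative sketch under our own naming, not the estimator of Nirenberg and Latham (2003).

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in MI (bits) between two binary spike trains with equal binning."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a, p_b = np.mean(x == a), np.mean(y == b)
            if p_ab > 0:
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

def top_k_inputs(inputs, output, k=10):
    """Indices of the k input spike trains with maximal MI with the output."""
    scores = [mutual_information(x, output) for x in inputs]
    return np.argsort(scores)[::-1][:k]
```

Plug-in MI estimates are biased upward for short recordings, so in practice a bias correction or shuffling control would accompany this ranking.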

## 3 Results

The staged point-process model is evaluated first on synthetic spike data and then applied to the real neural data recorded from the PMd and M1 cortex of a monkey performing center-out tasks. The overall performance is compared with GLMs and second-order GLMs. Both the simulated and real data are divided into three parts: 60% for training, 10% for validation, and 30% for testing. The training stops when the log likelihood is maximized on the validation set. In both cases, the model performances on the test data are evaluated by the DBR defined in equation 2.13 and the correlation between the prediction and real spike train. Besides this, the performance using the modified L-M algorithm is evaluated for both the synthetic data and real data and compared with the gradient ascent algorithm and Newton-Raphson algorithm.

### 3.1 Simulation Study on Synthetic Spike Data

We start from synthesizing input spike trains for two interactive neurons, and we transform the inputs into one output spike sequence using a linear, quadratic, and nonlinear function, respectively. This implies that the generation of the output depends on the different levels of the nonlinear interaction between input neurons.

The spike probability of the input spike trains is designed to be a sinusoidal function $\lambda_X(t) = \alpha_X \sin(\beta_X t + \gamma_X) + \delta_X$, where $\alpha_X$ is the fluctuating amplitude determining the distance between the maximum and minimum firing probability, $\beta_X$ is the fluctuating frequency of the firing probability, $\gamma_X$ determines the initial phase of the fluctuation, and $\delta_X$ determines the background firing probability. For each time bin, a spike is generated as a Bernoulli random variable with the probability determined by $\lambda_X(t)$, where the size of the time bins is set to be 10 msec (Wang et al., 2009). Here, the background firing probabilities for both input neurons are set to be 0.30, the parameter $\alpha_X$ is set to be 0.30, and the parameter $\beta_X$ is set to be 0.01. Specifically, $\gamma_X$ is set to be 1.05 for input neuron 1 ($\gamma_{X_1}$) and $-1.05$ for input neuron 2 ($\gamma_{X_2}$). Thus, the spike probabilities of the two input neurons ($\lambda_{X_1}$ and $\lambda_{X_2}$) share the same fluctuating frequency $\beta_X$ but different initial phases $\gamma_X$. This could happen when both neurons are driven by the same external stimulus or the same behavior. An example of synthetic input spike trains is shown in Figure 2.
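The synthetic input generation can be sketched as follows. Whether $\beta_X$ is a per-bin or per-second frequency is our assumption (here it is applied per 10-msec bin), and the clipping to $[0, 1]$ keeps the Bernoulli probability valid.

```python
import numpy as np

def sinusoidal_rate(t, alpha=0.30, beta=0.01, gamma=1.05, delta=0.30):
    """lambda_X(t) = alpha*sin(beta*t + gamma) + delta, clipped to [0, 1]."""
    return np.clip(alpha * np.sin(beta * t + gamma) + delta, 0.0, 1.0)

def generate_spikes(duration_s=200.0, bin_s=0.01, gamma=1.05, seed=0):
    """Bernoulli spikes per 10-msec bin with sinusoidal firing probability."""
    rng = np.random.default_rng(seed)
    k = np.arange(int(duration_s / bin_s))     # bin index, beta applied per bin
    lam = sinusoidal_rate(k, gamma=gamma)      # firing probability per bin
    return (rng.random(lam.size) < lam).astype(int), lam
```

Calling `generate_spikes(gamma=1.05)` and `generate_spikes(gamma=-1.05)` gives the two phase-shifted input neurons described above.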

We design three different bivariate nonlinear functions $\lambda_Y(t) = g(\lambda_{X_1}(t), \lambda_{X_2}(t))$ to generate the output spike trains from the inputs and test the expressive power of different models in approximating different levels of nonlinearity. Note that the firing probability is a nonnegative value, so $\lambda_Y(t)$ is set to 0 if the transformed value is negative. The first transfer function is a linear function written as $\lambda_{Y_1}(t) = g_1(\lambda_{X_1}, \lambda_{X_2}) = \alpha_1 \lambda_{X_1} + \beta_1 \lambda_{X_2} + \gamma_1$ (illustrated in Figure 3A.1), where $\alpha_1$ and $\beta_1$ modulate the linear dependence between the input and output spikes and $\gamma_1$ is the background level of this transformation. Here, in order to restrict the maximum and minimum firing probability of the output spike, we set $\alpha_1$ and $\beta_1$ to be 0.83 and $\gamma_1$ to be $-0.25$. The linearly transformed firing probability is illustrated in Figure 3A.3, and one realization of the spike train generated according to the firing probability is illustrated in Figure 3A.2. We can see that both the raster and the firing probability of the output spike train are similar to those of the input spike trains due to the linear dependence between input and output spikes.

The second transfer function is a quadratic function written as $\lambda_{Y_2}(t) = g_2(\lambda_{X_1}, \lambda_{X_2}) = \alpha_2 \lambda_{X_1}^2 + \beta_2 \lambda_{X_2}^2 + \gamma_2 \lambda_{X_1} \lambda_{X_2} + \delta_2 \lambda_{X_1} + \epsilon_2 \lambda_{X_2} + \zeta_2$ (illustrated in Figure 3B.1), where $\alpha_2$ and $\beta_2$ represent the nonlinearity within each input neuron, $\gamma_2$ represents the nonlinear interaction between input neurons, $\delta_2$ and $\epsilon_2$ represent the linear dependence between the input and output spikes, and $\zeta_2$ is the background level of this quadratic transformation. Here we set $\alpha_2$ to be $-5.56$, $\beta_2$ to be $-5.56$, $\gamma_2$ to be $-11.11$, $\delta_2$ to be 6.67, $\epsilon_2$ to be 6.67, and $\zeta_2$ to be $-1.50$. The quadratically transformed spike probability and one realization of the corresponding spike trains are illustrated in Figures 3B.3 and 3B.2, respectively.

The third transfer function is a sinusoidal function written as $\lambda_{Y_3}(t) = g_3(\lambda_{X_1}, \lambda_{X_2}) = \alpha_3 \sin[\beta_3 \lambda_{X_1} + \gamma_3 \lambda_{X_2} + \delta_3] + \epsilon_3$ (illustrated in Figure 3C.1). Here the output spike train is generated using a nonlinear transformation function of the two input neurons. If we expand the transfer function using a Taylor series, this nonlinear transfer function can be equivalently expressed by combining the different orders of interaction between the two input neurons. For this sinusoidal case, apparently, the nonlinearity cannot be fully described using the information up to the third-order Volterra kernels. The parameter $\alpha_3$ determines the amplitude of the fluctuation, $\beta_3$ and $\gamma_3$ determine the frequency of interactions that arise from different input signals, $\delta_3$ determines the initial phase, and $\epsilon_3$ is the background level of the transformation. We set $\alpha_3$ to be 0.25, $\beta_3$ to be 15.71, $\gamma_3$ to be 15.71, $\delta_3$ to be 3.14, and $\epsilon_3$ to be 0.25. The nonlinearly transformed firing probability and one realization of the corresponding spike train are illustrated in Figures 3C.3 and 3C.2. Compared to the input firing probabilities (see Figures 2A.2 and 2B.2), we can see that the output firing probability fluctuates with more complex patterns due to the higher degree of nonlinearity.
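The three transfer functions, with the parameter values above and with negative values rectified to 0, can be written directly (function names are ours):

```python
import numpy as np

def g_linear(l1, l2, a=0.83, b=0.83, c=-0.25):
    """Linear dependence: alpha1*l1 + beta1*l2 + gamma1, rectified at 0."""
    return np.maximum(a * l1 + b * l2 + c, 0.0)

def g_quadratic(l1, l2):
    """Quadratic dependence with alpha2..zeta2 from the simulation setup."""
    val = (-5.56 * l1**2 - 5.56 * l2**2 - 11.11 * l1 * l2
           + 6.67 * l1 + 6.67 * l2 - 1.50)
    return np.maximum(val, 0.0)

def g_sinusoidal(l1, l2, a=0.25, b=15.71, c=15.71, d=3.14, e=0.25):
    """Sinusoidal dependence: alpha3*sin(beta3*l1 + gamma3*l2 + delta3) + eps3."""
    return np.maximum(a * np.sin(b * l1 + c * l2 + d) + e, 0.0)
```

With input probabilities in $[0, 0.6]$, all three outputs appear to remain valid probabilities under these parameter settings, which is what allows them to drive the Bernoulli output spike generation directly.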

We generate output spike trains (see Figures 3A.2, 3B.2, and 3C.2) for three types of dependence (see Figures 3A.1, 3B.1, and 3C.1). The temporal length of each spike train is 200 sec. The temporal history ($H$) of the input spike train is explored ranging from 10 msec to 1 sec for all models, and the optimal input spike train history $H$ is set to be 500 msec. The number of hidden units ($N_Z$) is explored ranging from 2 to 100 for the staged point-process model, and the optimal number of hidden units $N_Z$ is set to be 15. The maximum number of iterative steps is set to be 1000 for the modified L-M algorithm in the staged point-process model. For the model, the first-stage weight parameters $\theta$ and the second-stage weight parameters $\omega$ are initialized as uniformly distributed random variables in $[-\xi_1/N_Z, \xi_1/N_Z]$ and $[-\xi_2/(HN_X), \xi_2/(HN_X)]$, respectively, where $\xi_1$ and $\xi_2$ denote the initialized weight variances of $\theta$ and $\omega$. Similarly, weights of the GLM and second-order GLM are initialized as uniformly distributed random variables in $[-\xi/N_{in}, \xi/N_{in}]$, where $\xi$ denotes the initialized weight variance. For the GLM, $N_{in}$ denotes the dimension of inputs, and for the second-order GLM, $N_{in}$ denotes the number of input spike multiplication pairs. Here, $\xi_1$, $\xi_2$, and $\xi$ are explored within $[0.01,10]$ during the training.

To reduce the randomness of the spike generation, the synthetic input and output spike trains for both training and validation are generated according to the firing probability 10 times. The DBRs of different models in approximating the three output transformations are shown in Table 1, where the number of hidden units ($N_Z$) for the staged point-process model is set to be 15. We can see that the averaged DBR of the staged point-process model is below 1 in predicting all types of output spike trains. These results indicate that the staged point-process model is able to predict output spike trains with different levels of nonlinear dependence on the interactive input spikes. One-tailed paired Student's $t$-tests are performed to compare the model performances on spike prediction. When predicting linearly transformed output spike trains, no significant difference in performance is found. When predicting quadratically transformed output spike trains, the performance of the GLM is significantly lower than those of the staged point-process model ($p < 10^{-6}$) and the second-order GLM ($p < 10^{-6}$). And when predicting nonlinearly transformed output spike trains, the performance of the staged point-process model is significantly higher than those of the GLM ($p < 10^{-8}$) and the second-order GLM ($p < 10^{-5}$).
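The paired comparison over the 10 matched realizations is what `scipy.stats.ttest_rel` computes; a dependency-free sketch of the paired $t$ statistic (DBRs from matched realizations form the pairs, and lower DBR means better fit) is given below, with the one-tailed p-value left to a $t$-distribution table or library.

```python
import math

def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b.

    For a one-tailed test that mean(a) < mean(b), a strongly negative t
    (at the returned df) indicates significance.
    """
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # sample variance
    t = mean / math.sqrt(var / n)
    return t, n - 1
```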

| Model | GLM | Second-Order GLM | Staged Point-Process Model |
| --- | --- | --- | --- |
| Linear dependence | $0.76 \pm 0.09$ | $0.76 \pm 0.17$ | $0.68 \pm 0.07$ |
| Quadratic dependence | $1.78 \pm 0.05$ | $1.03 \pm 0.08$ | $0.96 \pm 0.09$ |
| Nonlinear dependence | $1.90 \pm 0.06$ | $1.59 \pm 0.11$ | $0.92 \pm 0.14$ |


To give a straightforward illustration of the learning capability of different models, we present the learned transformation across different initial phase differences ($\Delta\gamma_X = \gamma_{X_1} - \gamma_{X_2}$) between two input neurons. By training the three models in all cases of phase differences and interpolating the output predictions with respect to the two input firing probabilities $(\lambda_{X_1}, \lambda_{X_2})$, we obtain the learned transformation patterns. The DTR-KS plot and prediction results using one phase difference as an example (columns 1 to 3), and the learned transformations across all phase differences (column 4), are shown in Figures 4, 5, and 6. In addition, for the convenience of comparing the predicted and synthetic firing probabilities, in column 3 of Figures 4 to 6, the prediction results are smoothed using a 500 msec normalized gaussian kernel, with the smoothed curves illustrated in the same plot.

Figure 4 illustrates the prediction results of the linearly transformed output spike train. Here, the optimal regularization coefficient $\alpha_R$ for the second-order GLM is 0.003 and that for the GLM is 0. The first row shows the ground truth of the output spike train and the input-output dependence. Rows 2, 3, and 4 represent outputs of the GLM, second-order GLM, and staged point-process model, respectively. We can see that the raster plots (see Figures 4B.2, 4C.2, and 4D.2), the predicted firing probabilities (see Figures 4B.3, 4C.3, and 4D.3), and the learned transformations (see Figures 4B.4, 4C.4, and 4D.4) of the different models are similar to one another and match well with the synthetic output spike train (see Figure 4A.2), the transformed output firing probabilities (see Figure 4A.3), and the original transformation (see Figure 4A.4). All the points in the KS plots of these three models (see Figures 4B.1, 4C.1, and 4D.1) lie within the 95% confidence bounds. These results indicate that all three models successfully predict the linearly transformed output spike train with performances that are close to one another.

Figure 5 illustrates the prediction results of the quadratically transformed output spike train. Here, the optimal regularization coefficient $\alpha_R$ is 0 for both the second-order GLM and the GLM. The GLM fails to match the quadratically transformed output spike train (see Figures 5B.2, 5B.3, and 5B.4), while the second-order GLM and the staged point-process model match it well in the spike train predictions (see Figures 5C.2 and 5D.2), the KS plots (see Figures 5C.1 and 5D.1), and the learned transformations (see Figures 5C.4 and 5D.4). This indicates that the second-order GLM is capable of capturing the quadratic transformation, as is our staged point-process model.

Figure 6 illustrates the prediction results of the nonlinearly transformed output spike train. Here, the optimal regularization coefficient $\alpha_R$ is 0 for both the second-order GLM and the GLM. Only our staged point-process model is able to predict the output firing activity (see Figures 6D.2 and 6D.3) and capture the nonlinear transformation (see Figure 6D.4), while both the GLM and second-order GLM fail, with their KS plots outside the 95% confidence bounds (see Figures 6B.1 and 6C.1) and their learned transformation patterns (see Figures 6B.4 and 6C.4) very different from the ground truth (see Figure 6A.4). These results indicate that our staged point-process model is capable of modeling neural interactions when the output neurons depend on the interactive input neurons with high-order nonlinearity.

Here we validate the modified L-M algorithm during training. We apply the staged point-process model together with the modified L-M algorithm, the gradient ascent algorithm, and the Newton-Raphson algorithm, respectively, to maximize the log likelihood. We demonstrate the case illustrated in Figure 3D.2 because of its high degree of nonlinearity. The maximum number of iteration steps is 1000. The convergence curves averaged over 10 trials are shown in Figure 7, from which we can see that the modified L-M algorithm converges faster than the gradient ascent algorithm. The Newton-Raphson algorithm converges fast initially but stops at less optimal results; this is because the optimization surface is not concave, and the existence of local optima limits the performance of the algorithm. The modified L-M algorithm performs best, with fast convergence and shifts toward better optima, because it balances between the other two algorithms. Overall, the modified L-M algorithm needs 428 steps on average to converge, while the gradient ascent algorithm does not converge within its 1000 iteration steps. The final DBR on the test set of the modified L-M algorithm is 0.92 on average, which is 39.4% lower than that of the gradient ascent algorithm and 51.3% lower than that of the Newton-Raphson algorithm. Besides this, the convergence curves of the modified L-M algorithm differ noticeably among the 10 trials; this is because the performance shifts controlled by the combination coefficient $\mu$ differ across trials.
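The update described above can be sketched as a damped Newton step on the log likelihood, where the combination coefficient $\mu$ shifts the step between Newton-Raphson (small $\mu$) and gradient ascent (large $\mu$). The function names and the $\mu$-adaptation factors below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def modified_lm_step(theta, grad_fn, hess_fn, loglik_fn, mu):
    """One step of a modified Levenberg-Marquardt ascent on the log
    likelihood. grad_fn/hess_fn return the gradient and Hessian of the
    log likelihood at theta (hypothetical callables for this sketch)."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    # Damped Newton direction: for large mu this approaches a gradient
    # ascent step g / mu; for small mu it approaches the Newton step.
    step = np.linalg.solve(-H + mu * np.eye(len(theta)), g)
    theta_new = theta + step
    if loglik_fn(theta_new) > loglik_fn(theta):
        # Step improved the log likelihood: accept and move toward Newton.
        return theta_new, mu * 0.5
    # Step failed: reject and move toward gradient ascent.
    return theta, mu * 2.0
```

Iterating this step until the log likelihood stops improving reproduces the qualitative behavior discussed above: Newton-like speed near a good optimum, gradient-like robustness elsewhere.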

From the simulation study, we can see that the staged point-process model demonstrates the best capability to capture the linear, quadratic, and nonlinear dependences between the input and output spike trains. The model works well with the modified L-M optimization algorithm and outperforms the GLM and second-order GLM. Besides this, compared with the Newton-Raphson and gradient ascent algorithms, the modified L-M algorithm learns faster and converges to better performance.

### 3.2 Application on the Prediction from PMd to M1

In this section, we test the performance of the staged point-process model trained with the modified L-M algorithm on real neural data recorded from the PMd and M1 cortices of a macaque. Performance is evaluated on the test data over several segments, where each segment contains 3000 samples with 50% temporal overlap. The performance averaged over segments is compared among the GLM, second-order GLM, and staged point-process model.
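The segmentation described above can be sketched as follows; the helper name and the dropping of a trailing partial segment are assumptions:

```python
def segment_starts(n_samples, seg_len=3000, overlap=0.5):
    """Start indices of evaluation segments of length seg_len with the
    given temporal overlap; trailing samples that do not fill a whole
    segment are dropped (an assumption of this sketch)."""
    hop = int(seg_len * (1 - overlap))  # 50% overlap -> hop of 1500
    return list(range(0, n_samples - seg_len + 1, hop))
```

Per-segment metrics are then computed on each window and averaged for the model comparison.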

#### 3.2.1 Predicting Spike Trains From PMd to M1 with Different Models

In this section, we test our proposed model by predicting M1 spike trains from the selected PMd neurons and compare its performance with the GLM and second-order GLM. The temporal history ($H$) of the PMd spike train is explored from 10 msec to 100 msec, as is the number of hidden units, from 2 to 40 (see Figure 8). Note that both the GLM and the second-order GLM are L1-regularized. From Figure 8, we can see that the optimal length of the input spike train history $H$ is 50 msec, where the performance of the second-order GLM is the best. The number of hidden units $NZ$ is set to 20, considering the trade-off between performance and computational burden. The maximum number of iteration steps is set to 2000 for the modified L-M algorithm. Initialization of the weight parameters follows section 3.1, where the initialized weight variances $\xi_1$, $\xi_2$, and $\xi_3$ are explored within [0.01, 10] during training.

The predicted M1 firing probabilities of the GLM, second-order GLM, and staged point-process model on the 19th and 28th M1 neurons are illustrated in Figure 9. The predicted firing probability, as well as the actual spike observation of the output, is smoothed by convolving with a normalized gaussian kernel (the mean is 0, and the standard deviation is 1). The optimal size of this gaussian kernel is explored within the range from 0.1 sec to 10 sec, and the correlation coefficients (CCs) between the smoothed M1 spike train and the smoothed prediction results under different kernel sizes are illustrated in Figure 9 (see panels 9A.3 and 9B.3). From these panels, we can see that the CCs of the three models are relatively better at a kernel size of 1.5 sec: too large a kernel smooths out the detail of the prediction, while too small a kernel leads to poor performance of the GLM and second-order GLM. The predicted firing probabilities smoothed with the 1.5-sec kernel are illustrated in Figures 9A.2 and 9B.2, where we can see that the predicted firing probabilities of the GLM and second-order GLM fail to match the smoothed M1 spike train at both the 19th and 28th M1 neurons, while that of the staged point-process model matches well. For the spike prediction of neuron 19, the staged point-process model achieves the highest CC, 0.94, while the GLM achieves a CC of 0.55 and the second-order GLM a CC of 0.59. For the spike prediction of neuron 28, the staged point-process model achieves the highest CC, 0.97, while the GLM achieves a CC of 0.80 and the second-order GLM a CC of 0.82.
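The smoothing-and-correlation evaluation above can be sketched as follows. The bin-level details (kernel truncated at four standard deviations, `np.convolve` with `mode="same"`) are assumptions of this sketch; the text specifies only a normalized gaussian kernel and the kernel size in seconds:

```python
import numpy as np

def gaussian_smooth(x, sd_bins, half_width=4):
    """Convolve a signal with a normalized gaussian kernel whose
    standard deviation is sd_bins time bins (kernel truncated at
    half_width standard deviations on each side)."""
    t = np.arange(-half_width * sd_bins, half_width * sd_bins + 1)
    k = np.exp(-t ** 2 / (2.0 * sd_bins ** 2))
    k /= k.sum()  # normalize so the kernel integrates to 1
    return np.convolve(np.asarray(x, dtype=float), k, mode="same")

def smoothed_cc(spikes, predicted_prob, sd_bins):
    """CC between the smoothed spike train and the smoothed predicted
    firing probability, as used to compare the models."""
    s = gaussian_smooth(spikes, sd_bins)
    p = gaussian_smooth(predicted_prob, sd_bins)
    return float(np.corrcoef(s, p)[0, 1])
```

For a 1.5-sec kernel, `sd_bins` would be the kernel size divided by the bin width (e.g. 1500 bins at 1-msec binning, an assumed binning).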

DTR-KS plots of several individual M1 neurons are illustrated in Figure 10. The two $45^\circ$ black dashed lines form the 95% confidence bounds of the DTR-KS test. For better goodness-of-fit, we expect a smaller maximal distance between the DTR-KS curve and the $45^\circ$ black solid line. From Figure 10, we can see that the DTR-KS curve of the staged point-process model is always the closest to the $45^\circ$ black solid line. This further validates that the staged point-process model statistically encodes the spike input the best.
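A minimal sketch of this comparison: given the rescaled intervals, the KS statistic is the maximal deviation of their empirical quantiles from the $45^\circ$ line, and the distance-bound ratio (DBR) divides that distance by the half-width of the 95% confidence bounds, taken here as the large-sample approximation $1.36/\sqrt{n}$ (an assumption of this sketch):

```python
import numpy as np

def ks_distance_and_dbr(rescaled):
    """KS statistic of rescaled intervals against the uniform
    distribution, and the distance-bound ratio (DBR). DBR < 1 means
    the curve stays within the 95% confidence bounds."""
    u = np.sort(np.asarray(rescaled, dtype=float))
    n = len(u)
    # Model quantiles b_k = (k - 0.5)/n along the 45-degree line.
    b = (np.arange(1, n + 1) - 0.5) / n
    d = float(np.max(np.abs(u - b)))
    bound = 1.36 / np.sqrt(n)  # large-sample 95% KS bound
    return d, d / bound
```

A model whose rescaled intervals are far from uniform (a poor fit) yields a DBR well above 1.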

We calculate the DBR of the GLM, second-order GLM, and staged point-process model for each neuron. The statistics of the DBR across all M1 neurons using the three models are compared in the box plot of Figure 11, from which we can see that the staged point-process model achieves the smallest neuron-averaged DBR. The one-tailed paired Student's $t$-test shows that the DBR of the staged point-process model across the M1 neurons is significantly lower than those of the GLM ($p<$ 0.05) and second-order GLM ($p<$ 0.05). We also count the neurons whose prediction results pass the DTR-KS test. For the GLM and second-order GLM, the numbers are 18 and 16, respectively. Although more M1 neurons pass the DTR-KS test with the GLM than with the second-order GLM, the passed M1 neurons in the second-order GLM achieve significantly higher goodness-of-fit than those of the GLM. For the staged point-process model, the number is 21, higher than for both the GLM and the second-order GLM.

#### 3.2.2 Comparison of the Modified L-M Algorithm with the Newton-Raphson and Gradient Ascent Algorithms

To evaluate the efficiency and robustness of the modified L-M algorithm on real data, we show the prediction on two M1 neurons (neuron 1 and neuron 19) as examples using the staged point-process model. The performance is then compared with that of the gradient ascent algorithm and the Newton-Raphson algorithm on each M1 neuron prediction. Initialization and other settings of the staged point-process model are the same as in sections 3.1 and 3.2.1. The convergence curves are averaged over 10 trials and illustrated with a mean curve and a standard deviation ribbon in Figure 12. As expected, the modified L-M algorithm converges in 93.8 iteration steps on average, while the gradient ascent algorithm needs 646.8; the convergences of the modified L-M and Newton-Raphson algorithms are faster than that of the gradient ascent algorithm. When the learning curves converge, the final DBR of the modified L-M algorithm, averaged over 54 M1 neurons, is 1.22, which is 8.3% lower than that of the gradient ascent algorithm (1.33) and 38.4% lower than that of the Newton-Raphson algorithm (1.98). These results indicate that the modified L-M algorithm is more efficient than the gradient ascent algorithm on real data, just as on simulated data.

We further statistically validate the convergence of the different algorithms on all M1 channels. The convergence steps and final performance of the different algorithms on all M1 neurons are illustrated in Figure 13. We can see that the Newton-Raphson algorithm converges very fast in the early steps (67.8 on average) but at a worse (higher) final DBR than the other algorithms. This supports the point that the Newton-Raphson algorithm is affected by the existence of local optima, while our algorithm is less so. From Figure 13, we can also see that both the convergence steps and the final DBR of the modified L-M algorithm are distributed in the lower area of the plot, consistent with the simulation results.

## 4 Discussion and Conclusion

Modeling the dependence of spike trains among brain regions enables us to understand how information is communicated among neurons, which makes it possible to design prosthetic devices that compensate for the loss of neuronal communication. The widely used generalized linear model assumes that the output spike depends on the linearly weighted firing history of the inputs, and it considers neither the nonlinear interaction among input neurons nor the nonlinearity across the input history. Volterra kernels were introduced into GLMs to consider the nonlinear interaction among pairs of input spikes, which improved the capability of modeling nonlinear dynamics and in turn improved the prediction performance. However, a low-order Volterra kernel may limit the modeling of neural interaction, which can be arbitrarily nonlinear.

To address the nonlinear interaction among input neurons and the nonlinearity across the input history, we propose a staged point-process model, which can be described as a cascade of two L-N stages and an inhomogeneous Bernoulli process. These two L-N stages theoretically allow arbitrary nonlinear interaction among input neurons as well as nonlinearity across the input firing history. The final stage of the model is the output spike generation using the inhomogeneous Bernoulli process. The model is estimated by maximizing the log likelihood on output spike trains. We show that it is theoretically equivalent to a two-layer neural network (see the appendix) but deals with spike observations probabilistically.
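The cascade described above can be sketched as a forward pass; the choice of sigmoid nonlinearities for both L-N stages and the variable names are illustrative assumptions:

```python
import numpy as np

def staged_forward(x, W1, b1, w2, b2):
    """Conditional firing probability of the staged point-process model:
    two linear-nonlinear stages followed by an inhomogeneous Bernoulli
    spike generator. x is the vectorized input spike history (length
    n = H * NX); W1 is NZ x n; w2 has length NZ."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    z = sigmoid(W1 @ x + b1)   # stage 1: hidden-unit activations
    p = sigmoid(w2 @ z + b2)   # stage 2: firing probability in (0, 1)
    return p

def sample_spike(p, rng):
    # Inhomogeneous Bernoulli spike generation for one time bin.
    return int(rng.random() < p)
```

Fitting the model then amounts to maximizing the Bernoulli log likelihood $\sum_t [y_t \log p_t + (1 - y_t)\log(1 - p_t)]$ over the weights of both stages.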

Compared to GLMs and GLMs with a Volterra kernel, the staged point-process model allows arbitrary nonlinearity in modeling the interactions among input neurons. According to Cybenko (1989), any continuous nonlinear function can be approximated by a linear combination of finitely many sigmoidal hidden units, which in our model corresponds to arbitrary nonlinear interaction among the input neurons. We therefore anticipated much more powerful modeling of nonlinear interaction among input neurons than with GLMs and GLMs with a second-order Volterra kernel. This is clearly supported by our synthetic data with different levels of nonlinear interaction between input neurons. When input spikes are linearly projected to generate output spikes, the performance of our model is similar to that of a GLM and a GLM with a Volterra kernel. But as the level of nonlinear interaction increases, our model demonstrates its advantage, achieving a much lower distance-bound ratio in the DTR-KS results. This is further validated on real spike train data recorded from PMd and M1 in a monkey performing a four-direction center-out task. The distance-bound ratios of our model are lower than those of the GLM and second-order GLM on average, with statistical significance. Besides this, our model has the most M1 neurons whose spike prediction results on the test set pass the DTR-KS test. This indicates that the staged point-process model describes the nonlinearity among input spike trains, which leads to better prediction performance on real data.

In addition, the staged point-process model uses fewer parameters to approximate the nonlinear relationship between input and output neurons. If we consider the correlation in a pairwise manner within the input space, the number of parameters in the second-order GLM increases by $n(n+1)/2+1$, where $n$ is the input dimension (the product of the input history length $H$ and the number of input neurons $NX$). Laguerre expansion methods in the time domain can help reduce the number of parameters under the Volterra kernel framework when a longer history is considered. We solve this problem by introducing hidden units, the same idea as in ANNs, but we add the stage of spike generation for each output neuron. In our model, the number of parameters increases by $NZ(n+2)+1$, where $NZ$ is the number of hidden units. From our results, we see that the number of parameters, as well as the computational burden, grows much more slowly with the input dimension in the staged point-process model than in the second-order GLM; $NZ$ is generally much smaller than $n/2$ in both cases in section 3. Thus, our proposed model uses fewer parameters to achieve better performance. This is validated on both the simulated and real data, where our model attains better performance with relatively fewer free parameters than the second-order GLM.
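The two parameter counts can be checked directly. The decomposition of $NZ(n+2)+1$ into per-unit weights and biases below is our reading of the count in the text, and the example values of $n$ and $NZ$ are illustrative:

```python
def extra_params_second_order_glm(n):
    """Parameters added by the second-order (pairwise) terms:
    n(n+1)/2 + 1, as stated in the text."""
    return n * (n + 1) // 2 + 1

def extra_params_staged(n, nz):
    """Parameters added by the staged model: NZ*(n+2) + 1, read as
    NZ*n hidden weights + NZ hidden biases + NZ output weights
    + 1 output bias."""
    return nz * (n + 2) + 1
```

For example, with an illustrative input dimension of $n = 1000$ and $NZ = 20$, the quadratic terms add 500,501 parameters, while the staged model adds only 20,041, so the staged count grows linearly rather than quadratically with $n$.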

Here, the maximization of the log likelihood is nonconcave due to the structure of a cascade of two LN stages, which may lead to the existence of local maxima. We develop a modified L-M optimization algorithm that enables adaptive learning by directly calculating the Hessian matrix from the log-likelihood function. It inherits the same properties as the L-M algorithm, that is, it is efficient in searching and not easily trapped in local optima (Levenberg, 1944; Marquardt, 1963). In both simulation and real data, the modified L-M algorithm demonstrates faster convergence and better prediction performance than both the gradient ascent algorithm and Newton-Raphson algorithm. All of these results indicate that the staged point-process model, together with the modified L-M algorithm, is a powerful and efficient tool that allows nonlinear interaction among input spikes as well as nonlinearity in spike generation. Thus, the application of this model could be extended to improve the performance of nonlinear modeling on neuronal spike activities recorded from other brain regions, such as hippocampal CA1 and CA3.

The staged point-process model is based on the assumption of hidden neurons that may not be recorded during neural information processing between input and output neurons. On real data, we show that only some of the selected input neurons successfully predict output spikes. We consider that hidden neurons are interconnected to form neural pathways, which may not be picked up individually by an electrode array but have a net effect. We computationally model the effects of these hidden neurons as several hidden units in a two-stage LN-LN Bernoulli model. Our proposed model with hidden units better predicts the output spike trains, but the actual anatomical structure between input and output neurons is not necessarily two-staged. Thus, the interpretation of our model in terms of the interaction among input neurons is not as straightforward as that of the second-order GLM. A model that approximates the real anatomical structure (Qi et al., 2018; Xu et al., 2018) and is easy to interpret should be investigated in future work.

## Appendix: Derivation of the Staged Point-Process Model

## Acknowledgments

The corresponding authors are G.P. and Y.W. G.P.'s work is supported by the National Key Research and Development Program of China (2017YFB1002503) and the Zhejiang Provincial Natural Science Foundation of China (LR15F020001). Y.W.'s work is supported by the National Natural Science Foundation of China (grant 61836003) and partially supported by the HKUST Initiation Grant for New Faculty (IGN16EG10) and the HKUST startup fund (R9395).