## Abstract

Decoding in the context of brain-machine interface is a prediction problem, with the aim of retrieving the most accurate kinematic predictions attainable from the available neural signals. While selecting models that reduce the prediction error is done to various degrees, decoding has not received the attention that the fields of statistics and machine learning have lavished on the prediction problem in the past two decades. Here, we take a more systematic approach to the decoding prediction problem and search for risk-optimized reverse regression, optimal linear estimation (OLE), and Kalman filter models within a large model space composed of several nonlinear transformations of neural spike counts at multiple temporal lags. The reverse regression decoding framework is a standard prediction problem, where penalized methods such as ridge regression or Lasso are routinely used to find minimum risk models. We argue that minimum risk reverse regression is always more efficient than OLE and also happens to be 44% more efficient than a standard Kalman filter in a particular application of offline reconstruction of arm reaches of a rhesus macaque monkey. Yet model selection for tuning curves–based decoding models such as OLE and Kalman filtering is not a standard statistical prediction problem, and no efficient method exists to identify minimum risk models. We apply several methods to build low-risk models and show that in our application, a Kalman filter that includes multiple carefully chosen observation equations per neural unit is 67% more efficient than a standard Kalman filter, but with the drawback that finding such a model is computationally very costly.

## 1 Introduction

Encoding models linking neural activity to biological and behavioral variables have long been used to understand how neural systems represent information or demonstrate that information about a signal is present and can be extracted from models of neural activity (Perel et al., 2013; Nishimoto & Gallant, 2011; Keat, Reinagel, Reid, & Meister, 2001; Pillow, Paninski, Uzzell, Simoncelli, & Chichilnisky, 2005; Stanley, 2013; Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1997). In this letter, we focus specifically on decoding kinematic variables from spike-based encoding models in the context of brain-computer interfaces (BCIs).

Building efficient decoding models is crucial for BCIs to obtain the most accurate kinematic predictions possible. We briefly describe some approaches to build and optimize decoding models. Decoding initially relied mostly on modeling the kinematics as linear or additive nonparametric functions of lagged spike counts (Rieke et al., 1997; Warland, Reinagel, & Meister, 1997), with the lag often chosen to maximize of the model. Mulliken, Musallam, & Andersen (2008) included spike counts at three consecutive lags to induce smoothness in the predictions and fitted the model by ridge regression to minimize its prediction error. Another class of decoders relies on physiological models of neurons' firing rates as functions of kinematics (the tuning curves) and includes the optimal linear decoder (Salinas & Abbott, 1994) and the Kalman filter (Brown, Frank, Tang, Quirk, & Wilson, 1998). The latter incorporates a prior kinematic model that can improve predictions substantially. In this framework, spike counts are typically assumed to be gaussian or Poisson distributed (Truccolo, Eden, Fellows, Donoghue, & Brown, 2005) with means assumed to be additive functions of the kinematics, lagged by a time chosen to maximize the model or minimize the prediction mean squared error (Wu, Gao, Bienenstock, Donoghue, & Black, 2006). Response transformations can improve the fit of such models to data; for example, square roots of spike counts may be more gaussian with more constant variance (Moran & Schwartz, 1999; Wu et al., 2006). Selecting which neurons to use, for example, by rating the amount of kinematic information they encode (Vargas-Irwin et al., 2010; Liao, Wang, Zheng, & Principe, 2012), can also improve decoding (Kahn, Sheiber, Thakor, & Sarma, 2011). Rather than select neurons, Tankus, Fried, and Shoham (2012) suggest identifying a sparse decomposition of neural features and decoding from a low-dimensional projection.

These examples illustrate the wide range of model families and model selection procedures that have been successful in improving decoding performance. But they also show that the search for good models is not done systematically or with the exclusive objective of minimizing the prediction error. In this letter, we focus specifically on this objective. We search for risk-optimized models within a large model space composed of parametric and nonparametric transformations of neural spike counts at multiple temporal lags, applying model selection techniques that let the data determine which variables should enter the models to minimize the prediction error. While the reverse regression decoding framework is a standard statistical prediction problem, where penalized methods such as ridge regression or Lasso are routinely used to find minimum risk models, model selection for tuning curves–based decoding models such as OLE and Kalman filtering is not standard, and no efficient method exists to identify minimum risk models. We thus investigate the effectiveness of several existing and new ad hoc methods to build low-risk models.

## 2 Methods

Assume that we record the signal of electrodes to decode the three-dimensional velocity of an arm at time ; the methods described here apply to other kinematic variables. We use *neural units* to mean neurons or unsorted electrodes treated as putative neurons and *spikes* to mean extracellular voltage threshold crossings. For simplicity and brevity, we did not spike-sort the electrode signals in this letter; we observed comparable results with spike-sorted data. We let denote a vector of neural covariates, which we can initially think of as , the vector of the neural units' spike counts in the time bins centered at , where is an appropriate lag between neural activity and kinematics; later we include several lags and transformations of spike counts (see equation 2.7).

### 2.1 Linear Decoding Models

Nonlinear nongaussian observation equations have also been used (Truccolo et al., 2005), but we do not consider them here because they do not yield closed-form predictions and may thus be too computationally expensive for real-time decoding. However, in section 2.4, we introduce nonparametric response transformation observation equations, which can capture nonlinear relationship between velocity and spike counts yet remain linear in and thus yield closed-form predictions.

^{1}We also consider combining the RR model with the kinematic model in equation 2.3 to form an autoregressive reverse regression (AR-RR) model: , where , , and are regression coefficients fitted to training data and is gaussian noise, from which predictions are recursively obtained as

Equations 2.2 and 2.4 to 2.6 define four classes of velocity predictions that we summarized as graphs in Figure 3 in appendix A. They are all linear combinations of the same neural covariates , so we should be able to compare their efficiencies (see section 2.5). But first we specify the model space that we will search for efficient models.

### 2.2 Decoding Model Space

The common practice is to use at most one velocity encoding quantity per neural unit in a decoding model—for example, , the spike counts lagged by , or their square roots, if judged appropriate. Here we explore a much larger space of models, allowing to be any subset of a large set composed of transformed and untransformed spike counts at several lags, and we apply model selection techniques that let the data determine which variables should enter the model to minimize the prediction error. The resulting model may therefore include several transformed and untransformed spike counts per neural unit, possibly at several lags, while some units may not be represented at all.

### 2.3 Model Selection for Reverse Regression

Reverse regression predictions that have been considered include equation 2.5 with , the spike counts lagged by , and extensions that replace by nonparametric transformations of such as splines (Rieke et al., 1997; Warland et al., 1997; Wagenaar, Ventura, & Weber, 2009). Mulliken et al. (2008) used three consecutive lagged spike counts, that is, , and fit the model using ridge regression to minimize the prediction risk. Here we consider basic RR with , and minimum risk RR, the prediction that minimizes the cross-validated risk estimate (see equation 2.8) within the space defined by equation 2.7, where we take npm() to be a smoothing spline of with four degrees of freedom. The other models mentioned above belong to our model space and are therefore necessarily less efficient than minimum-risk RR.

An exhaustive search through the model space to identify the minimum-risk model is prohibitively time consuming with the large number of predictors we allow, as is stepwise regression. Regularized regression is faster. It consists of including all card available predictors in equation 2.5, first standardized to control for differences in scale and penalizing the norm of their regression coefficients by a factor chosen to minimize the prediction risk; Lasso (Tibshirani, 1996) and ridge regression use the and norms, respectively. The ridge regression coefficients for the th component of velocity in the th row of (see equation 2.5) are estimated by where is the matrix with rows , ; is the size of the training data; is the vector of the th velocity component values; is the average of over ; is the identity matrix; and are chosen to minimize the cross-validated risk estimate (see equation 2.8) of the predicted velocity. The Lasso estimates do not have closed forms. In section 3, we report only the performances of ridge estimates for brevity; Lasso results were practically identical.

Finally, we also consider minimum-risk AR-RR models obtained by fitting equation 2.6 using ridge regression. We apply either (1) the same penalty on and for simplicity, (2) two different penalties on and since they measure different types of quantities, or (3) a penalty on and none on to preserve the kinematic model, since it captures a physical process. Option 3 effectively is a minimum-risk kinematic model with penalized neural control variables, and option 2 with an infinite penalty on yields the minimum-risk RR model.

### 2.4 Model Selection for OLE and Kalman Filtering

Decoding in the OLE and KF framework is a parameter estimation rather than a prediction problem, since we want to predict the covariate in a regression model like equation 2.1; the typical prediction framework consists instead of predicting the response variable of a regression.

The prediction risk, estimated in equation 2.8, can be expressed as the sum of the variance and squared bias of the estimate of . Given a decoding/regression model like equation 2.1, OLE predictions are maximum likelihood estimates (MLEs), so they are unbiased if the observation equations represent true relationships and they have minimum variance; moreover, the variance decreases with the number of observations, which in our context is the number of observation equations. We could therefore reduce the MLE variance by including in the decoding model several observation equations per neural units, for example, if spike counts encode kinematics at several lags or if quantities other than spike counts also encode kinematics such as waveform moments (Ventura & Todorova, 2015). However, it is unlikely that the observation equations are true models—indeed, models are often useful but seldom correct so they may induce bias in the MLE; they also contain noise since they are estimated from training data,^{2} so using too many will contribute more noise than signal and increase the MLE variance. Model selection for OLE decoding therefore consists of choosing which observation equations to include in the model (see equation 2.1) to balance the bias and variance of the predictions.

Model selection for KF also amounts to choosing observation equations that balance the prediction bias and variance, although they will likely be different from the equations chosen for OLE. Indeed, the KF prediction combines the MLE (OLE) and prior predictions, so their biases and variances combine too. For example, KF and OLE predictions are equal when the prior model, equation 2.3, has infinite variance so the same equations would be chosen, while a small prior variance reduces the variance and increases the bias of the KF prediction compared to the OLE prediction, so different equations would be chosen.

Exhaustive and stepwise searches through the model space defined by equation 2.9 to identify the minimum cross-validated risk OLE and KF models are prohibitively time-consuming with the many observation equations considered here, and we know of no computationally efficient alternative search method like penalized regression, so we will investigate ad hoc model building strategies to identify low-risk OLE and low-risk KF models in section 3. We also consider the commonly used basic OLE and basic KF models, based on the observation equations in equation 2.1 with , the spike counts lagged by .

### 2.5 Comparing the Four Classes of Predictors

The classes of predictors we consider—OLE in equation 2.2, KF in equation 2.4, RR in equation 2.5, and AR-RR in equation 2.6—are all linear in (see equation 2.7). There are different subsets that either include or exclude each of the variables in . To identify the minimum-risk prediction in that model space, we should fit all models to data by maximum likelihood, evaluate their prediction risks, and choose the model with the smallest risk. If the true prediction is indeed linear in , so that it belongs to the model space defined by equation 2.7, this procedure ensures that we identify it, up to the error of fitting it to data (this error decreases as the sample size increases). If the true prediction is not linear, this procedure identifies the linear prediction that is closest to the true prediction.

The exhaustive search described is precisely how one identifies the minimum-risk RR prediction; hence, minimum-risk RR is the best linear prediction within the model space defined by equation 2.7.^{3} The OLE prediction, equation 2.2, is also linear in and therefore cannot be more efficient than minimum-risk RR. This makes sense: the OLE model coefficients in equation 2.2 are functions of the observation equations' coefficients in equation 2.1 according to , so they are not fitted to data with the specific objective of minimizing the prediction error. In contrast, in equation 2.5 is unconstrained and fitted to data to minimize the prediction risk.

For AR-RR predictions, we replace on the right-hand side of equation 2.6 by the left-hand side of equation 2.6 with replaced by , and so on recursively, so that AR-RR predictions can be written as linear combinations of , , , and so on. Depending on which elements of compose , the set , , may contain spike counts that are older than 12 lags, in which case AR-RR predictions do not belong to the model space defined by . However, neural activity this old should not have meaningful predictive power, so AR-RR models are unlikely to outperform minimum-risk RR when the sample space includes many lags. It makes sense: if we could use the true velocity on the right-hand side of equation 2.6, and if, as is likely the case, provides information about that does not already contain, then the best AR-RR prediction would certainly outperform the best RR prediction since it uses an informative additional predictor (compare equations 2.5 and 2.6). However, we use and not in equation 2.6, which is a function of that is unlikely to contain much, if any, additional information about that is not already provided by , provided includes many lags. We actually expect AR-RR predictions to underperform more or less depending on how we penalize and in equation 2.6. Using no penalty on (option 3 at the end of section 2.3) forces to be close to , which could accumulate the prediction error as increases if is a poor prediction of . The other two options penalize a potentially bad prediction so they might fare better, especially option 2, since the penalty on is independent of the penalty on ; setting the penalty on to infinity removes the influence of entirely and yields the minimum-risk RR prediction.

Finally, KF predictions (see equation 2.4) can also be recursively expressed as linear combinations of , , , so they might not outperform minimum-risk RR predictions, as we concluded already for AR-RR predictions. Additionally, like OLE predictions, KF predictions are constrained: in equation 2.4 is a function of the coefficients of the observation equations in equation 2.1 and kinematic model in equation 2.3, so that unlike the minimum-risk RR prediction parameters in equation 2.5, is not fitted to data with the specific objective of minimizing the prediction error. However, KF predictions are different in two ways. First, the model is not fixed within the model space, since its coefficients are functions of time ; hence its prediction risk is an average over of the risks at each , which is not comparable to the risk of a nondynamic decoder: it could be larger or smaller than the risk of minimum-risk RR, depending on the application. Second, for each reach we decode, we initialize the decoders at the observed initial velocity; then the kinematic prior model constrains the next few KF predictions to be close to the initial velocity, which happens to match the speed profile of a typical reach. Therefore, KF models need only decode well the last portion of reaches, and because the reaches in our data set are short, KF predictions may be more efficient than minimum-risk RR predictions. They may not be as efficient to decode continuous motion.

## 3 Results

We illustrate the benefit of using models that minimize the prediction error to decode the arm velocity of a rhesus macaque in an experiment performed in Andrew Schwartz's MotorLab (Fraser & Schwartz), 2012; Todorova, Sadtler, Batista, Chase, & Ventura, 2014). The data were recorded during four days, with five sessions per day and one session consisting of 52 reaches to and from 27 targets arranged evenly on a virtual 3D sphere (see Figure 1A). We analyze the portion of the reaches between movement onset and target acquisition, which amounts to 8 minutes of data. The neural activity was recorded in ventral premotor cortex on the 71 active channels of a 96-electrode Utah array. We decode from threshold crossings without spike sorting, using 16 ms spike count bins initially lagged bins (80 ms) compared to arm movement for all electrodes, where is the best uniform lag, that is, the integer value in that maximizes the average of the electrodes' tuning curve models in equation 2.1.^{4} We use all combinations of four sessions recorded the same day ( trials) to train the decoding models and evaluate their performances to decode the 52 reaches of the remaining session of that day, totaling training sets and test trials. We initialize the decoders at the observed initial velocity for each trial. We measure the relative efficiency of two decoders to reconstruct a test trial by the ratio of their respective mean squared errors (MSE, equation 2.8) and summarize the distribution of MSE ratios across the 1040 test trials using box plots.

We start with the decoding methods that are computationally easy to implement: the basic OLE, basic KF, minimum-risk RR, and the three minimum-risk AR-RR models. Figure 1B shows the distribution over the 1040 test trials of the decoding efficiencies of basic OLE (equation 2.2 with ), basic RR (equation 2.5 with ), and minimum-risk RR (see section 2.3), relative to basic KF (equation 2.4 with ), where relative efficiencies greater than one mean that KF is less efficient than the competing method. Basic OLE is the least efficient. Basic OLE and basic RR use the same information, the lagged spike counts as observation equations and predictors, respectively, yet the latter is about twice as efficient in median across the 1040 test trials. Both methods are less efficient than KF on most trials because KF uses the extra kinematic information provided by the state equation (see equation 2.3), which is akin to using information from previous spike counts (see section 2.4). For example, RR with improves the decoding efficiency of basic RR by over 30% in median (not shown). Finally, minimum-risk RR uses all the covariates in equation 2.7 to capture linear and nonlinear associations between velocity and spike counts at lags up to and is fitted by ridge regression to minimize its prediction risk, using 10-fold cross-validation to prevent overfitting; it is five times more efficient than OLE (not shown) and 44% more efficient than KF in median across the 1040 test trials (see Figure 1B, third box plot). The spike counts and their square roots have the largest ridge regression coefficients (see appendix B, Figure 4A), but it is the inclusion of multiple lags that matters most, certainly because they help smooth the predictions, but perhaps also because the spike counts encode the kinematics across several time bins. Finally, as conjectured in section 2.5, none of the minimum-risk autoregressive RR models (see equation 2.6) outperformed minimum-risk RR; the models that do not penalize at all or not independent of the neuronal covariates (options 1 and 3 at the end of section 2.3) actually underperformed by 53% in median (not shown).

To evaluate the robustness of these results to the number and quality of available electrode recordings, we compared basic KF to minimum-risk RR in 100 random subsets of neural units, capping at 40 to ensure sufficient variability across subsamples. For each subset of neural units, we recorded the median relative efficiency of KF versus minimum-risk RR for the 1040 test trials and calculated the average and standard deviation (SD) of these medians over the 100 subsamples. Figure 1C shows the average median efficiency SD as a function of : minimum-risk RR is superior to KF in median when , it is about 15% and 25% more efficient when is 20 and 40, respectively, and it outperforms KF on all test trials when (not shown). As increases, the information provided by the spike counts outweighs the prior kinematic information (see equation 2.3) and KF loses some edge against the non-Bayesian RR decoder.

### 3.1 Model Selection for OLE and Kalman Filtering

Minimum-risk RR outperforms minimum-risk autoregressive RR, basic OLE, and basic KF decoders for the data analyzed here. We now investigate how better OLE and KF models within the space defined by equation 2.9 compare to minimum-risk RR. Exhaustively evaluating the prediction risks of the models in that space to identify the minimum-risk models would take roughly years per training set on a 2 GHz Inter Core i7 processor. We are not aware of computationally efficient searches in this framework, so we perform greedy searches to identify low-risk models.

#### 3.1.1 Low-Prediction Risk Model Search

We first build an optimal model with a single observation equation per unit, then we consider adding other observation equations. Specifically, we start with the basic OLE and KF models, equations 2.2 and 2.4 with and , and optimize the lags in a procedure similar to the “nonuniform lag” selection of Wu et al. (2006): for each unit in turn, we determine which of the 13 lagged equations among equation 2.9 with (untransformed spike counts) reduces the prediction risk the most and replace the current with the optimal equation. (We evaluate 71 units 13 models rather than the full combinatorial models.) We repeat this process thrice to allow the lags to stabilize. Next, we hold the lags fixed and apply a similar procedure to search for the best response transformations among equation 2.9: about half the units retain untransformed spike counts, and a quarter each select the square root and ACE transforms. The resulting OLE and KF models are 4% and 11% more efficient in median than the basic OLE and KF models, respectively (see Figures 2Aa and 2Ba, first box plots).

Decoding models typically include at most one observation equation per unit; here we allow several because together, they may contribute more information about the kinematics, compensate for lack of fit of the observation and state equations by balancing bias and variance (see section 2.4), and smooth the predictions. We start with the previous models, and for each unit in turn, we add the equation for that unit that reduces the prediction risk the most, if one exists. The resulting OLE and KF models include 135 and 115 equations on average over the 20 training data sets and are 90% and 35% more efficient in the testing data sets than the basic OLE and KF models (see Figures 2Aa and 2Ba, second box plots). We perform a second and third equation selection sweep across all units (see Figures 2Aa and 2Ba, third and fourth box plots) and prune the resulting models, removing sequentially the equation that reduces the risk the most, until the reduction is negligible. The final OLE and KF models contain 218 and 131 equations—that is, an average of about three and two equations per electrode—and are, respectively, 190% and 67% more efficient in median than the basic OLE and KF models (Figures 2Aa and 2Ba, fifth box plots). The final KF model is also 19% more efficient than the minimum-risk RR model (compare Figures 2Bf and 2Ba, fifth box plot), although it took 23 hours to fit the former versus 3.6 hours for the latter, using a 2 GHz Inter Core i7 processor.

#### 3.1.2 Searches That Do Not Explicitly Lower the Prediction Risk

Risk-reducing searches are computationally demanding, so we investigate alternatives. The added variable test (AVT) search (Ventura & Todorova, 2015) starts with an initial model—two options are described below—and for each unit in turn, augments the model with the equation for that unit that has the most variability explained by the kinematics, after accounting for the spike counts that are already in the model. For example, if the model contains the observation equations for (see equation 2.1) and we consider adding another equation for unit 1, we regress each transformed lagged spike count of unit 1 on and record how much these regressions are improved by adding the kinematics as regressors. If the largest improvement is statistically significant at the Bonferroni-corrected level , where is the number of tests performed, we add the corresponding equation to the decoding model.

We considered two computationally simple initial models composed of one observation equation per unit: the truncated input space model of Mulliken et al. (2008) includes the equation corresponding to the predictor with the largest absolute RR ridge coefficient (see Figures 2Ad and 2Bd) and the unit-specific best model includes the equation with the largest ; we used the latter because it proved more efficient. Figures 2Ab and 2Bb show the efficiencies of this initial model before and after one, two, and three sweeps across the neural units of the AVT procedure and after pruning the last model: for our data set, the AVT and low-risk selection procedures are comparable when building OLE models, but the former is less efficient in the KF framework.

We considered other fast model selection options: the largest model contains the observation equations with the largest among all equations in equation 2.9, and the largest RR ridge coefficients model includes the equations corresponding to the predictors with the largest coefficients, in absolute value, of the minimum-risk RR model; these models do not necessarily include equations for each unit. Figures 2Ace and 2Bce show the efficiencies of these models relative to the basic OLE and KF models for several values of : larger models have lower risk but perform comparatively poorly even with equations, so we cannot recommend them.

## 4 Discussion

Building good decoding models is crucial to obtain the most accurate kinematic predictions possible. While many models have been successful in improving decoding performance, the search for optimal models is not done systematically or with the exclusive objective of minimizing the prediction error. In this letter, we focused on this objective. We considered a large model space, larger than typically considered, that we searched to identify low-risk models and showed that we can improve the accuracy of offline reconstructions of arm reaches of a rhesus macaque monkey.

With computationally efficient decoding in mind, we considered models based on linear gaussian observation and state equations, specifically OLE, Kalman filtering (KF), and reverse regression (RR), because they provide closed-form predictions. We also introduced autoregressive RR (AR-RR) models by combining RR with the KF state equation. All four classes of models yield predictions that are additive functions of covariates, which were chosen by cross-validation from a large set of parametric and nonparametric transformations of spike counts at several lags, to minimize or reduce the prediction error. For the data analyzed here, the minimum-risk RR and low-risk KF models outperformed the others and were 44% and 67% more efficient than a standard KF decoder, respectively. Any other quantity thought to encode kinematic information should be added to the set of candidate covariates—for example, spike counts binned at multiple timescales (Kim, Carmena, Nicolelis, & Principe, 2005), counts derived from different spike-sorting schemes (Todorova et al., 2014), and waveform moments of unsorted electrodes (Ventura & Todorova, 2015). For non-Poisson neural units, statistical summaries of spike trains other than spike counts may be modulated by kinematics—for example, the variance of interspike intervals or repeated temporal spike patterns such as triplets or bursts of spikes might encode kinematic information that a rate code cannot capture (Brown et al., 1998; Bansal, Truccolo, Vargas-Irwin, & Donoghue, 2012).

We emphasize that our low-risk OLE and KF models can include several observation equations per neural unit, which deviates from the standard practice of using at most one. Multiple equations may contribute more information about the kinematics, for example, if neural units encode kinematics across several lags; they also certainly compensate for the likely lack of fit of the observation and state equations, and they smooth the predictions. Ventura and Todorova (2015) also used multiequation per unit models to include waveform moment equations in addition to spike counts equations from unsorted electrodes and were thus able to decode kinematics as efficiently as from sorted spike trains, without having to spike sort the electrode signals.

From a computational viewpoint, exhaustive searches through the model space to identify minimum-risk models is prohibitively time-consuming with the large number of candidate covariates we allowed. The Lasso and ridge regression provide faster alternatives in the RR and AR-RR frameworks, but we know of no computationally efficient method to find the minimum-risk OLE and KF models; we are currently developing such methods. Nevertheless, we learned from the unequal performances of the model selection methods we implemented that to build low-risk OLE and KF models, one could start with a single unit-specific lag and transformation observation equation per neural unit, chosen to reduce the risk the most (optimal but tedious) or maximize (suboptimal but fast); and include additional equations if they reduce the prediction risk (optimal but tedious) or explain significant kinematic variability not already explained by the current model (AVT procedure, suboptimal but fast). The decoding model could be alternatively grown and pruned to control its size when the number of equations must be limited due to computational constraints (Bansal et al., 2012).

In addition to seeking optimal models within the four model classes we considered, we also compared them across classes. We argued that minimum-risk RR is always more efficient than all other RR and OLE models, but because KF predictions are dynamic, their risk cannot be compared to the risk of nondynamic predictions; KF predictions may thus be more, or less, efficient than minimum-risk RR predictions depending on the applications. Minimum-risk RR also outperformed AR-RR predictions, likely because of the many lagged spike counts we included in the sample space. Using a smaller sample space would give KF predictions, and perhaps AR-RR predictions, an edge over minimum-risk RR. These comparisons are theoretically motivated and therefore apply generally to data collected online, offline, and in closed loop, although the benefit of the better methods may vary across experiments and data. Indeed, Koyama et al. (2010) contrasted the performance of various models in the offline and closed-loop contexts and found that improvements could be marginal. It is also possible that other classes of decoders, for example nonlinear, nongaussian, or other decoders, could outperform those we considered here; for example, Li et al. (2009) showed that an unscented KF outperformed the basic KF in both offline and closed-loop decoding, although these results are likely dependent on the number of units, task type, single units versus multiunit, and how the models were optimized. Whichever decoders are considered, the same conceptual approach should apply: we should specify a large space of models and search through that space to retain the model with the lowest prediction risk (Bansal et al., 2012), provided methods exist or can be developed to do so efficiently. Model searching could also be extended to the state equation in equation 2.3; for example, Wagenaar et al. (2011) reduced the prediction error two-fold to decode the joint angles of a cat's limb by modeling their evolution within a lower-dimensional manifold rather than using the more common random walk or AR(1) evolution model we used in this letter.

Finally, we offer a word of caution: low-risk models are designed to decode optimally data that are similar to the training data. Hence, what an optimized model gains in efficiency could be lost to lack of robustness to changes in behavioral tasks. It is also known that neurons can adapt to a decoder in closed loop (Koyama et al., 2010). It would therefore be useful to determine which classes of decoders are more robust to task changes and which facilitate neuroplasticity.

## Appendix A: Graphs of the Four Model Classes

## Appendix B: Postmortem on the Minimum-Risk RR Model

We now look more closely at the minimum-risk RR model within the model space . Models selected with the objective of minimizing the prediction risk are not interpretable: regression coefficients no longer quantify the effects of predictors on the response variable after accounting for all other predictors. Large coefficients may nevertheless suggest that the corresponding predictors have large effects on the response. To measure the overall influence of particular predictors in the minimum-risk RR model, we summed the coefficients' absolute values across units for each combination of power and lag , standardized them by the overall sum, and plotted them in Figure 4A (the plotted points sum to one). The spike counts and their square roots contribute most to the fit compared to the other power transformations. Older lags contribute substantially, most likely because they help smooth the predictions for the type of movement in our data set, that is, short and straight reaches. The optimal model makeup is data dependent and would likely be different given a different task (e.g., circular movements). An advantage of using a predictive approach is that we do not have to carefully screen which predictors matter most; they are likely to be different in different data sets. We can consider all predictors that encode the kinematics and let the data determine which contribute to lowering the prediction risk.

## Notes

^{1}

is typically assumed to have independent components. In that case, the models for each velocity component , , are estimated independently of one another.

^{2}

Decoding relies on a regression model, but unlike in typical regression problems, the covariates are not observed; rather, they are estimated from training data in the encoding stage.

^{3}

There is a caveat: we fit the minimum risk of a convex relaxation of the model, which is not equal to the corresponding unrelaxed minimum-risk model; no result exists to evaluate the distance from one model to the other.

^{4}

We analyzed these data in Todorova et al. (2014) and Ventura and Todarova (2015) using (128 ms), the optimal lag for spike counts in a different bin size than we eventually used. Using in these two papers would not change the results. In this letter, using instead of degrades the performances of the basic models relative to the optimized models but does not affect the latter, since their lags are chosen to minimize the prediction risk. It would be a good idea to also let the model selection procedure choose the spike count bin size to lower the risk.

## Acknowledgments

We thank Professor Andrew Schwartz, Professor Steven Chase, and the MotorLab for providing experimental data for this analysis. Support was provided by NIH grant 2R01MH064537 (V.V.).

## References

## Author notes

^{*}Deceased.