## Abstract

Although many real-time neural decoding algorithms have been proposed for
brain-machine interface (BMI) applications over the years, an optimal,
consensual approach remains elusive. Recent advances in deep learning algorithms
provide new opportunities for improving the design of BMI decoders, including
the use of recurrent artificial neural networks to decode neuronal ensemble
activity in real time. Here, we developed a long-short term memory (LSTM)
decoder for extracting movement kinematics from the activity of large
(*N* = 134–402) populations of neurons, sampled
simultaneously from multiple cortical areas, in rhesus monkeys performing motor
tasks. Recorded regions included primary motor, dorsal premotor, supplementary
motor, and primary somatosensory cortical areas. The LSTM's capacity to retain
information for extended periods of time enabled accurate decoding for tasks
that required both movements and periods of immobility. Our LSTM algorithm
significantly outperformed the state-of-the-art unscented Kalman filter when
applied to three tasks: center-out arm reaching, bimanual reaching, and bipedal
walking on a treadmill. Notably, LSTM units exhibited a variety of well-known
physiological features of cortical neuronal activity, such as directional tuning
and neuronal dynamics across task epochs. LSTM modeled several key physiological
attributes of cortical circuits involved in motor tasks. These findings suggest
that LSTM-based approaches could yield a better algorithm strategy for
neuroprostheses that employ BMIs to restore movement in severely disabled
patients.

## 1 Introduction

In the brain-machine interface (BMI) field, neural decoders are computational algorithms that aim at producing the best possible mapping of large-scale neural activity to complex motor behaviors such as limb movements (Carmena et al., 2003; Fitzsimmons, Lebedev, Peikon, & Nicolelis, 2009; Ifft, Shokur, Li, Lebedev, & Nicolelis, 2013; Taylor, Tillery, & Schwartz, 2002; Wu et al., 2003) and whole-body navigation (Rajangam et al., 2016). Here we report a decoder that exceeds the high performance of previously reported decoding methods based on the traditional Kalman filter and newer variations of this classic method. To this date, different versions of Kalman filters (Kalman, 1960; Kalman & Bucy, 1961) have been successfully utilized in various BMI applications, including experiments in rhesus monkeys (Aggarwal, Mollazadeh, Davidson, Schieber, & Thakor, 2013; Chen & Takahashi, 2013; Dangi, Gowda, & Carmena, 2013; Ifft et al., 2013; Li, Li, & Li, 2016; Li et al., 2009; Makin, O'Doherty, Cardoso, & Sabes, 2017; Wu et al., 2003) and clinical BMIs tested in human patients (Hochberg et al., 2012; Homer et al., 2013; Jarosiewicz et al., 2013). Kalman filters explicitly model neuronal tuning properties (i.e., the relationship between neuronal activity and behavioral parameters of interest) using a series of assumptions, such as the statistical distribution of noise (e.g., gaussian or Poisson), characteristics of the tuning curve (e.g., linear or quadratic), and a predetermined number of “taps” (time bins of spike counts) used to capture neuronal response dynamics. However, the performance of a Kalman filter can be considered suboptimal in many cases because it does not account for the temporal characteristics of spiking activity and limb kinematics over sufficiently long periods of time. This is because the number of taps should not be too high to avoid overfitting (Li et al., 2016).

Conversely, recurrent artificial neural networks offer a solution to this problem, since they are capable of keeping information for a long time without increasing the number of free parameters. Recurrent artificial neural networks used to be hard to train if the time lag between the input and the output exceeded 5 to 10 steps. However, the development of long-short term memory (LSTM) architecture (Gers, Schmidhuber, & Cummins, 2000; Hochreiter & Schmidhuber, 1997) allowed this major obstacle to be overcome by letting learning of input-output relationships happen with an arbitrary and large number of time lags. This property is very useful for neural decoding of motor parameters because a typical motor task lasts several seconds—a time interval difficult to model with a Kalman filter. Therefore, using LSTM to retain information for long intervals could significantly improve the decoding performance of neuroprosthetic devices that are based on brain-machine interfaces.

In this study, we tested two standard versions of LSTM against three alternative decoders: unscented Kalman filter (Li et al., 2016), Kalman filter (Wu et al., 2003), and Wiener filter. We used each algorithm to reconstruct the kinematic parameters (position, velocity, and acceleration) of both upper and lower limb movements, based on the combined electrical activity of cortical neuronal ensembles recorded in four rhesus monkeys chronically implanted with multielectrode arrays. These chronic implants were placed in the primary motor (M1), dorsal premotor (PMd), supplementary motor (SMA), and primary somatosensory (S1) cortical areas. Data from three experimental tasks were used: (1) center-out arm reaching (Carmena et al., 2003; Li et al., 2016; Li et al., 2009), (2) bimanual reaching (Ifft et al., 2013), and (3) bipedal walking on a treadmill (Fitzsimmons et al., 2009; Peikon, Fitzsimmons, Lebedev, & Nicolelis, 2009). Overall, LSTM-based algorithms outperformed the other decoders for all three tasks. Moreover, LSTM applications needed fewer training data to achieve the same decoding performance as the other decoders. LSTM decoding performance also remained robust to perturbations of the network hyperparameters, such as structure, length of training snippet, and the ratio of neurons randomly lost during both training and testing.

## 2 Materials and Methods

All animal procedures were performed in accordance with the National Research Council's Guide for the Care and Use of Laboratory Animals and were approved by the Duke University Institutional Animal Care and Use Committee.

### 2.1 Experimental Tasks

Data were taken from three experimental tasks: (1) center-out arm reaching (Carmena et al., 2003; Li et al., 2016; Li et al., 2009), (2) bimanual reaching (Ifft et al., 2013), and (3) bipedal walking on a treadmill (Fitzsimmons et al., 2009; Peikon et al., 2009).

In the center-out task, monkeys continuously controlled a hand-held joystick (CH-400R-P3, Hangzhou ChuangHong Electric Co.) to place a screen cursor over screen targets (see Figure 1B). The computer monitor was placed 55 cm in front of the monkey. The targets were circular, with a diameter of 7 cm. Each trial started with a target appearing at the center of the screen. Then the monkey had to hold the cursor at that central location for 500 ms. After this period, a peripheral target appeared on the screen, 9 cm from the center and at a random angle. The monkey had 5 s to move the cursor to the target. Next, the monkey held the cursor at the target location for 200 to 1000 ms, and finally received a juice reward (Hawaiian Punch). We decoded six kinematics variables: $x$ and $y$ components of joystick position, velocity, and acceleration.

In the bimanual task, a monkey held two joysticks, with the right and left hands, to control the movements of two avatar hands shown on the screen. The task required placing the avatar hands over their initial targets close to the center of the screen, holding these positions for 750 to 1500 ms (drawn from a uniform distribution), and then reaching toward the peripheral targets designated for each hand (see Figure 1D). The initial targets were 8 cm by 8 cm squares located 12 cm to the left and right from the screen center. The peripheral targets were circles, 8 cm in diameter, shifted 8 cm leftward, rightward, up, or down with respect to the initial targets. The four potential target locations for each hand resulted in a configuration with 16 potential target locations, each appearing with equal probability. After the monkeys acquired the peripheral targets, within a 5 s timeout limit, they had to hold the avatar hands over these targets for 50 ms to receive a juice reward. The behavioral task was identical to the one described in our previous study (Ifft et al., 2013), but here we used data from a different monkey. We decoded 12 kinematics variables: $x$ and $y$ coordinates of both joysticks and the corresponding velocity and acceleration.

In the bipedal walking task, a monkey bipedally walked on a treadmill while holding a horizontal bar with its hand. An adjustable neckplate restrained the monkey. Three different treadmill speeds were tested: 12.5, 25, and 50 cm/s (see Figure 1E). The monkeys walked in both forward and backward directions. Each walking session lasted 10 min, after which the monkey rested for 5 min; then the treadmill parameters were changed and the next walking session conducted. The monkey was rewarded with raisins every 5 to 10 s. Fluorescent markers were attached to the hip, knee, and ankle of both legs, and two video cameras captured the marker locations at 30 Hz as described in Peikon et al. (2009). We decoded 36 kinematics variables: the position, velocity, and acceleration of the ankle, the knee, and the hip of both legs in the sagittal plane.

### 2.2 Neuronal Ensemble Recordings and Data Sets

Monkeys were chronically implanted with microwire arrays in multiple cortical areas in both cerebral hemispheres. The implant locations were different for different monkeys; in all of them, we recorded from M1 and in some from S1, PMd, and SMA (see Figure 1A). The extracellular activity of neuronal populations was amplified, digitized, and processed using a multichannel recording system (Plexon, TX, USA).

Neuronal spikes were sorted using a template matching algorithm. While our
neuronal recordings mostly represent single neurons (Schwarz et al., 2014), we chose to stay conservative and
use the term *unit* to allow the eventual inclusion of residual
multiunit activity. As repeatedly reported in the literature, multi-units can be
used for decoding, with decoding performance compared to that of single units
(Carmena et al., 2003; Chestek et al., 2011; Stark & Abeles, 2007) (see Supplementary Figure S1 for the
decoding performance for emulated multi-units). The data set for the center-out
task included six recording sessions (each 14–30 min long) from monkey N
and three sessions (40–120 min) from monkey C. Both monkeys performed the
task with their right arms. In monkey C, trials during which the monkey did not
perform were removed from the analysis, which resulted in 30 to 45 min per
session being analyzed. In monkey N, 97 to 107 units were sampled in M1 and 59
to 61 in S1, all in the left hemisphere. In one session from monkey C, 92 M1 and
103 SMA units were recorded bilaterally. In the other two sessions, we recorded
97 to 103 M1 units bilaterally and 52 to 55 SMA units and 60 to 72 PMd units in
the left hemisphere (see Figure 1A).

The bimanual data set included three sessions from monkey R (50–75 min, with 40–65 min analyzed after the no-performance trials were removed). In this monkey, 164 to 171 units were recorded from the left M1 while the monkey manipulated two joysticks with its hands.

The bipedal data set consisted of five days of recordings in monkey P. A range of 225 to 250 units was sampled in M1 bilaterally and 150 to 159 in the right and left S1. For two recording sessions, a pair of treadmill speeds was tested (12.5 and 25 cm/s, forward and backward). The duration of each of these sessions was approximately 40 min, but up to 4 min of data per session were excluded because of the occasional inability of the video system to track the markers correctly. For the other three sessions, three speeds were tested (12.5, 25, and 50 cm/s, forward and backward walking). Each session lasted approximately 60 minutes, with 3 to 16 minutes per session excluded from data analysis.

### 2.3 LSTM Architecture

We used the basic form of LSTM (Gers et al., 2000; Hochreiter & Schmidhuber, 1997) with full gradient training. The LSTM recurrent network structure was specially designed to store information for an extended period of time. The information was read, written, and forgotten by three different gates: input, output, and forget. The LSTM schematics is shown in Figure 2.

For the forward pass, at any given time $t$, an LSTM block took three input vectors: the instantaneous firing rates $xt$ of the neuronal ensemble, the LSTM output $mt-1$ (i.e., decoded kinematics), and the state of the memory cells $ct-1$ from the previous time point. The forget gate $ft$ first determined the information to be forgotten. Then the input gate $it$ determined the information to be added to the memory cells, and finally the output gate $ot$ determined the information from the memory cells to project onto the kinematics of interests $mt$. In the equations below, $W$s are learned weight matrices, and $\sigma $ and $g$ are point-wise sigmoid ($\sigma (\alpha )=11+e-\alpha )$ and hyperbolic tangent ($tanh(\alpha )=sinh(\alpha )cosh(\alpha ))$ functions, respectively:

$Concatenatedinputzt=[xtT\u2225mt-1T\u22251]TForgetgateft=\sigma (Wfzt)Inputgateit=\sigma (Wizt)Blockinputjt=g(Wjzt)Outputgateot=\sigma (Wozt)Memorycellstatect=ct-1\xb7ft+it\xb7jtBlockoutputmt=Wpg(ct)\xb7ot,lastLSTMblockg(ct)\xb7ot,else$

Because these equations are differentiable, gradient-based learning algorithms can be deployed to learn the weight matrices, using an error signal derived from the differences between $mt$ of the last LSTM block and the recorded kinematics. We propagated this error signal back in time through the memory cell whose activation function was linear with the derivative of one. When the value for the forget gate remained equal to one, the error signal stayed constant in time, and LSTM learned the relationship between the target values and the input values at arbitrary time lags.

We implemented two LSTMs in TensorFlow 1.3.0: one was the basic LSTM with 128 hidden units, as shown Figure 2A (LSTM1), and the other was a two-layer LSTM with 128 hidden units in each layer, as shown Figure 2B (LSTM2). The settings that produced the main results were as follows: 1000 iterations of training with a batch size of 64 and unrolling of 30 steps (equivalent to 1.5 s as each time step was 50 ms). We used an Adam optimizer (Kingma & Ba, 2014) with the learning rate $\alpha =0.001,\beta 1=0.9,\beta 2=0.999$, and $\epsilon =1e-08$ (default values in Kingma & Ba, 2014) and optimized a Huber loss function (delta = 16) with L2-regularization (alpha = 0.2). To further regularize the LSTM, a dropout technique was applied with the probability of 0.2. (20% of the input and output units were randomly set to 0 during training.)

### 2.4 Comparison of Different Decoders

LSTM decoding performance was compared to three alternative algorithms: the Wiener filter, Kalman filter, and unscented Kalman filter. The Wiener filter, WIENER10 (10 past taps were used), performed a regression of the $z$-scored spike counts for 10 50 ms taps, representing time in the past with L2-regularization (ridge regression; best alpha was searched using cross-validation).

The linear Kalman filter, KF10 (i.e., 10 taps), decoded the kinematics (position, velocity, and acceleration) with the iterations that consisted of two steps. During the first step, the filter's movement model predicted the current kinematics from the same parameter calculated during the previous iteration. During the second step, this kinematic prediction was corrected in a Bayesian way based on the difference between the measured neural firing rates and the estimated firing rates derived using the observation model. The observation model (also called neuronal tuning model) represented firing rate as a linear function of the kinematic parameters for the 10 50 ms taps. The observation model and the movement model were fitted with ridge regression, where the best alpha was found using cross-validation.

The unscented Kalman filter, UKF5 (i.e., 5 taps), utilized the observation model that represented firing rate as a quadratic function of kinematics ($x$ and $y$ position and the corresponding values of velocity and acceleration; Li et al., 2016; Li et al., 2009). Five taps, instead of 10, were used for better filter stability as suggested in S. Li et al. (2016). Both the observation model and the movement models were fitted with ridge regression.

### 2.5 Evaluation of Decoding Performance

## 3 Results

The performance of the decoders is presented below for different tasks.

### 3.1 Center-Out Task

For the center-out task, the performance of each decoder was first assessed using the correlation coefficient ($r)$ as a metric (see Figure 3A). An ANOVA revealed main effects for the following categories: decoders $(F(4,192)=374.8,p<0.05)$, kinematics $(F(2,48)=211.0,p<0.05)$, and subjects $(F(1,48)=135.5,p<0.05)$. Post hoc multiple comparison of different decoders showed that the performance of LSTM2 (mean $r=0.66$) was better (paired permutation test, FDR corrected, $p<0.005$) compared to the other four decoders (LSTM1, mean $r=0.58$; UKF5, $r=0.48$; KF10, $r=0.51$; WIENER10, $r=0.57$). Additionally, LSTM1 outperformed UKF5, KF10, and WIENER10 (paired permutation test, FDR corrected, $p<0.023$). A post hoc multiple comparison conducted for kinematics showed that all decoders performed best for predicting position (mean $r=0.77$), followed by velocity ($r=0.59$) and acceleration ($r=0.30$). (These values of $r$ are different from each other; permutation test, FDR corrected, $p<0.005$.) Finally, all decoders performed better (permutation test, $p<0.005$) for monkey N (mean $r=0.63$) than for monkey C ($r=0.41$).

Next, the decoding performance was assessed using the coefficient of determination ($R2$; see Figure 3B). This metric is stricter than the correlation coefficient because it is more affected by the drifts and changes in amplitude of the decoded variable compared to the correlation coefficient. The ANOVA showed main effects for the decoders $(F(4,192)=227.2,p<0.05$), kinematics $(F(2,48)=93.0,p<0.05$), and subjects $(F(1,48)=81.5,p<0.05$). The post hoc multiple comparison showed that LSTM2 (mean $R2=0.45$) outperformed (paired permutation test, FDR corrected, $p<0.005$) the other four decoders (LSTM1, $R2=0.37$; UKF5, $R2=0.23$; KF10, $R2=0.23$; WIENER10, $R2=0.34$), and LSTM1 outperformed UKF5, KF10, and WIENER10 (paired permutation test, FDR corrected, $p<0.018$). UKF5 and KF10 had similar $R2$ (paired permutation test, FDR corrected, $p=0.48$). The decoding performance was best (permutation test, FDR corrected, $p<0.005$) for position (mean $R2=0.52$), followed by velocity ($R2=0.36$) and acceleration ($R2=0.09$. All decoders performed better (permutation test, FDR corrected, $p<0.005$) for monkey N (mean $R2=0.40$) than for monkey C ($R2=0.18$). The data for the $SNRdb$ metric are presented in the supplementary materials (see supplementary Figure S2).

After showing that both types of LSTM outperformed UKF5, KF10, and WIENER10, we examined the decoded traces to assess the contribution of different task epochs to the performance improvement. Figure 4A shows a 10 s snippet showing the $x$-coordinates for the true and decoded joystick position. Two features of LSTM performance can be seen. First, when the monkeys did not move the joystick (i.e., initial central hold), both LSTMs generated more stable output compared to the other decoders. Second, the joystick trajectories obtained with LSTM2 were visibly smoother (i.e., contained less high-frequency noise) compared to those generated by LSTM1. These observations were confirmed by the analysis of mean squared error (MSE) for the difference between the decoded and true joystick positions during the initial central hold period of the behavioral task (see Figure 4B). An ANOVA showed the main effect of the decoder (repeated measure, $F(4,68)=97.0,p<0.05$), and post hoc multiple comparisons showed that the LSTM2 output was the most stable for the analyzed epoch (mean MSE = 1.31), followed by LSTM1 (MSE = 1.54), WIENER10 (MSE = 2.56), UKF5 (MSE = 4.75), and KF10 (MSE = 6.27). These values were statistically different, as confirmed by the paired permutation test (FDR corrected, $p<0.005$ for every pair). Trajectory smoothness was quantified using normalized mean absolute jerk (Rohrer et al., 2002) ($NMAJ=\Delta -1vpeak(t2-t1)\u222bt1t2|d2vdt2|dt$, where $v$ is the cursor speed and $vpeak=\Delta maxt\u2208[t1,t2]v(t)$); (see Figure 4C), where an ANOVA showed the main effect of the decoder (repeated measure, $F(4,32)=19.4,p<0.05$). The post hoc analysis showed that LSTM2 produced the smoothest trajectories (mean NMAJ = $-$0.27, paired permutation test FDR corrected, $p<0.01$), followed by UKF5 (NMAJ = $-$0.42), KF10 (NMAJ = $-$0.45), LSTM1 (NMAJ = $-$0.54), and WIENER10 (NMAJ = $-$0.56) (see Figure 4C).

#### 3.1.1 LSTM Hyperparameters

Next, we examined the effect of LSTM hyperparameters on the decoding performance. We first explored how the LSTM performance was affected by the network width (i.e., the number of cells: 16, 32, 64, 128, 256, or 512) and height (i.e., number of layers: 1 to 3; see Figure 5). The top section of Figure 5 shows the effect of the number of layers on the decoding performance. In the center-out data set, six kinematic parameters were decoded from 134 to 223 units. For these data, we found that the performance of the two-layer LSTMs (mean $R2=0.41$) was significantly better (paired permutation test, $p<0.05$ with all kinematics combined) than the performance of the one-layer LSTMs (mean $R2=0.34$). However, adding the third LSTM layer did not improve the performance (mean $R2=0.41$; paired permutation test, $p=0.36$). The middle section of Figure 5 compares the decoding performance of LSTMs with the widening and narrowing architectures; the number of cells is the same in different LSTMs. In the widening LSTM, the second layer has more cells than the first layer, and in a narrowing LSTM, the second layer has fewer cells. We observed that the widening LSTMs performed better than the narrowing LSTMs (widening LSTM, mean $R2=0.42$; narrowing LSTM, mean $R2=0.39$; paired permutation test, $p<0.005$ with all kinematic variables combined).

Next, we examined decoding performance as a function of the training snippet length. We tested the following length values: 0.05 s, 0.25 s, 0.5 s, 0.75 s, 1 s, 1.25 s, 1.5 s, 2 s, 2.5 s, 3 s, and 10 s (see Figure 6A). To estimate the point where the decoding performance stopped improving, we pooled the data from the two monkeys together; the difference in $R2$ between the neighboring points (e.g., 0.05 s versus 0.25 s, but not 0.05 s versus 0.5 s) was used as a measure of improvement. For decoding of position, LSTM2 performance stopped improving when the snippet length reached 2 s, and LSTM1 stopped improving at 2.5 s (paired permutation test, FDR corrected, $p>0.05$). For decoding of velocity, LSTM2 stopped improving at 0.8 s for LSTM2 and LSTM1 at 2.5 s. For acceleration, LSTM2 stopped improving at 1.2 s, while LSTM1 continued to improve for a snippet length of up to 3 s (paired permutation test, FDR corrected, $p<0.05$ for 2.5 s versus 3 s). Thus, for all three kinematic parameters, a shorter snippet was needed for LSTM2 to reach the maximum performance as compared to LSTM1.

In addition to these tests, we assessed how a very long history (i.e., 10 s) affected the decoding performance. The difference between the $R2$ values for a snippet length of 3 s or 10 s was small but statistically significant. For LSTM1, the decoding for 10 s (position, $R2=0.66$; velocity, $R2=0.41$; acceleration, $R2=0.11$) was better than for 3 s (position, $R2=0.65$; velocity, $R2=0.41$; acceleration, $R2=0.10$) (paired permutation test, FDR corrected, $p<0.021$), while for LSTM2, it was the opposite (10 s versus 3 s: position, $R2=0.71$ versus 0.71, $p=0.13$; velocity, $R2=0.46$ versus 0.47, $p<0.01$; acceleration, $R2=0.18$ versus 0.20, $p<0.01$; paired permutation test, FDR corrected).

We also assessed the decoding performance as a function of the total number of training samples. In this analysis, we increased the length of the training segment from 30 s to 5 min at 30 s steps (see Figure 6B). We used the same cross-validation paradigm as in the other analyses; the training fold was randomly sampled to reduce the number training data. Because sampling different parts of the training fold created variance in the testing results, we repeated the procedure 10 times and took the average of the 10 testing results as the final result. Three observations emerged. First, for all three types of kinematic parameters, the decoding performance continued to improve as the training data set increased from 0.5 to 3.5 min (paired permutation test, FDR corrected, $p<0.05$). Second, LSTMs trained with the full training fold outperformed those trained on 5 minutes of data for decoding of the joystick position (LSTM2, $R2=0.69$ versus 0.56; LSTM1, $R2=0.63$ versus 0.54; paired permutation test, FDR corrected, $p<0.005$) and velocities (LSTM2, $R2=0.47$ versus 0.38; LSTM1, $R2=0.39$ versus 0.36; $p<0.005$), but not acceleration (LSTM2, $R2=0.19$ versus 0.12; LSTM1, $R2=0.08$ versus 0.09; $p<0.005$). Third, LSTMs trained with a small number of data outperformed UKF5, KF10, and WIENER10 trained on full data. For the decoding of position, LSTM2 needed at most 2.5 min of training data to perform comparably to its three competitors (paired permutation test, FDR corrected, $p>0.069$) and at most 4 min of data to outperform them ($p<0.05$). For the decoding of velocity, LSTM2 needed at most 3.5 min of training data to perform comparably to its competitors ($p>0.069$). Although LSTM2 trained on 5 min did not outperform WIENER10 ($R2=0.38$ versus 0.37, $p=0.25$), the LSTM2 trained with 2.5 min of data outperformed UKF5 ($R2=0.25,p<0.005$) and KF10 ($R2=0.30,p<0.005$). For acceleration, LSTM2 trained with 2 min of data ($R2=0.06$) surpassed the performance of UKF5 ($R2=0.01,p<0.005$) and KF10 ($R2=0.04,p<0.008$), but not WIENER10 ($R2=0.14$), even when the LSTM2 was trained on 5 min of data ($R2=0.12,p=0.01$). LSTM1 needed additional 30 s to 1 min of data to reach the performance level of LSTM2; otherwise, it performed about the same to LSTM2 in comparison to the other decoders.

Next, we looked at the decoding performance as a function of the percentage of hidden units randomly dropped out during training (see Figure 6C). This dropout method has been proposed as a regularization technique to reduce overfitting (Hinton, Srivastava, Krizhevsky, Sutskever, & Salakhutdinov, 2012). We varied the percentage of dropped units, from 0% to 90%, and observed that LSTM2 had the highest $R2$ for all three kinematic parameters when 20% of units were dropped (paired permutation test, FDR corrected, $p<0.031$). For LSTM1, the best decoding performance was achieved for all three kinematics when no units were dropped out (paired permutation test, FDR corrected, $p<0.031$).

Finally, we tested the decoding performance as a function of the percentage of neurons randomly dropped out during testing (see Figure 6D). This test assessed the decoder robustness to unstable recordings (e.g., due to a failure of some electrodes). We randomly selected the neurons to drop in 10 different tests and calculated the average $R2$ for these tests as a measure of decoder performance. We found that both LSTMs performed the best for all three kinematic parameters when no neurons were dropped (paired permutation test, $p<0.05$ after FDR correction). The deterioration of decoding was evaluated using a dropping curve. The $x$-axis of the curve was the percentage of neurons dropped, and the $y$-axis represented normalized $R2$ ($R2$/max($R2$)). To quantify the rate of performance decrease with the neurons being dropped, we computed the area under the dropping curve (AUDC). AUDC ranged between 0 and 1; the higher the AUDC, the less pronounced was the decrease in performance. We found that the AUDC of LSTM1 (mean = 0.65) was slightly but significantly better than that of LSTM2 (mean = 0.64; paired permutation test, $p=0.017$).

#### 3.1.2 LSTM Internal Dynamics

Figures 7A to 7H illustrate the activities of different LSTM2 components, for a short 10 s snippet, which depicts offline reconstruction of kinematics from the normalized firing rate of input units. We observed that the same type of cells had a higher intragroup correlation in the second LSTM layer than in the first layer (e.g., input gate cells in the second layer versus in the first layer; diagonal of Figure 7I). To compute the intragroup correlation, we first randomly split the cells of the group into two halves and randomly sampled 2500 data points. Then we quantified the correlation between the two halves using a bias-corrected distance correlation (Székely & Rizzo, 2013), which measured the dependency (possibly nonlinear) between two random vectors. This procedure was repeated 10 times, and the mean was taken to represent the intragroup correlation. High correlation values were observed in the second layer for the input gates ($0.96\xb10.01$ versus $0.43\xb10.03,p<0.01$, paired permutation test), the output gate ($0.99\xb10.00$ versus $0.68\xb10.02,p<0.01$), the forget gate ($0.92\xb10.01$ versus $0.27\xb10.02,p<0.01$), the block input ($0.78\xb10.01$ versus $0.37\xb10.01,p<0.01$), and the memory c state ($0.84\xb10.01$ versus $0.60\xb10.01,p<0.01$). Intergroup correlation between the three gates was higher in the second layer than in the first layer ($0.77\xb10.01$ versus $0.49\xb10.01,p<0.01$, permutation test).

We also observed that the filtering performed by the gates in the second layer had a higher contrast compared to the gates in the first layer. Namely, the second-layer gates alternated between being fully open and fully closed, whereas the first-layer gates exhibited such operations less frequently. To describe this effect quantitatively, we defined gate open state as the one with gate value greater than 0.95 (while 1.0 was the maximum) and a closed state as the one with gate value below 0.05 (0.0 was the minimum). An analysis of these states showed that the gates in the second layer were more likely to be closed than the gates in the first layer (see Figure 7J; paired permutation test, FDR corrected, $p<0.01$). This effect was especially clear for the input gates ($29.4%\xb10.7%$ versus $3.8%\xb10.7%$) and the output gates ($35.8%\xb18.0%$ versus $7.7%\xb11.0%$). For the open state, there was a slight but significant trend for the gates in the first layer to enter this state more often compared to the second layer (paired permutation test, FDR corrected, $p<0.01$).

Another interesting finding was the observation that the input units and the memory cells from both layers exhibited directional tuning; their activity was different depending on the decoded movement direction (see Figure 7K). The tuning properties were evaluated by regressing the joystick position to the unit/cell firing rates for different lags: $firingrate(t-lag)=b+bxx(t)+byy(t)$. The lag that resulted in the best fit (according to the correlation coefficient value) was taken to represent the delay between unit/cell activity and $x,y$ coordinates. Modulation depth and preferred direction were defined as the correlation coefficient and $arctan(b2,b1)$, respectively, for the same lag. We found that memory cells in the second LSTM layer exhibited higher modulation depths than those in the first layer ($0.28\xb10.01$ versus $0.18\xb10.01$, paired permutation test, $p<0.01$); memory cells in the first layer were more strongly modulated than the input units ($0.18\xb10.01$ versus $0.05\xb10.00,p<0.01$).

### 3.2 Bimanual Target Acquisition Task

Like the analysis of the center-out data, the performance in the bimanual task was first assessed using the correlation coefficient $r$ (see Figure 8A). The ANOVA showed main effects for both the decoder $(F(4,132)=115.3,p<0.005$) and kinematics $(F(2,33)=47.2,p<0.005$). Post hoc multiple comparison showed that LSTM2 outperformed the other four decoders (mean $r=0.58$; paired permutation test, FDR corrected, $p<0.005$), while LSTM1 ($r=0.42$) did not outperform WIENER10 ($r=0.42$), KF10 ($r=0.38$) and UKF5 ($r=0.36$) (paired permutation test, FDR corrected, $p>0.16$) in terms of $r$. The position was decoded more accurately (mean $r=0.64$) compared the other kinematic parameters (paired permutation test, FDR corrected, $p<0.005$), followed by the velocities ($r=0.44$) and accelerations ($r=0.21$).

For the assessment of performance with the coefficient of determination, $R2$ (see Figure 8B), the ANOVA showed main effects for both decoders $(F(4,132)=59.4,p<0.005$) and kinematics $(F(2,33)=9.6,p<0.005$). Post hoc multiple comparisons showed that LSTM2 (mean $R2=0.34$) surpassed the performance of LSTM1 ($R2=0.19$), UKF5 ($R2=0.10$), KF10 ($R2=0.03$), and WIENER10 ($R2=0.20$) (paired permutation test, FDR corrected, $p<0.005$). LSTM1 outperformed UKF5 and KF10 ($p<0.05$) but not WIENER10 ($p=0.40$). Multiple comparisons showed that decoders most accurately predicted joystick positions ($R2=0.30$), followed by velocity ($R2=0.18$) and acceleration ($R2=0.04$) (paired permutation test, FDR corrected, $p<0.05$ for each pair of kinematics). The data for $SNRdb$ metric are presented in the supplementary materials (supplementary Figure S3).

### 3.3 Bipedal Walking Task

To accommodate the larger number of kinematics variables ($N=36$), we increased the width of the second layer of LSTM2 from 128 to 1024 cells and named it LSTM2x. This was done because our analysis of the center-out data showed that the widening LSTM performed better than the narrowing LSTM.

We first assessed the decoding performance of the six decoders using the correlation coefficient as the metric (see Figure 9A). An ANOVA showed main effects for both decoders $(F(5,4665)=466.5,p<0.005$) and kinematics $(F(2,933)=445.7,p<0.005$). Post hoc multiple comparisons revealed that LSTM2x (mean $r=0.70$) outperformed the other decoders (LSTM2, $r=0.65$; LSTM1, $r=0.60$; UKF5, $r=0.62$; KF10, $r=0.64$; WIENER10, $r=0.65$), and the performance differed significantly between every pair of the decoders (paired permutation test, FDR corrected, $p<0.005$). For kinematics, the ankle, knee, and hip positions were the best decoded parameters (mean $r=0.84$), followed by the corresponding velocities ($r=0.66$) and accelerations ($r=0.42$) (paired permutation test, FDR corrected, $p<0.005$).

We assessed the performance of different decoders using the coefficient of determination, $R2$ (see Figure 9B). The ANOVA showed main effects for decoders $(F(5,4665)=18.5,p<0.005$) and kinematics $(F(2,933)=28.8,p<0.005$). Post hoc multiple comparisons showed that LSTM2x was the best-performing decoder (mean $R2=0.51$), followed by LSTM2 ($R2=0.39$), WIENER10 ($R2=0.35$), LSTM1 ($R2=0.32$), UKF5 ($R2=0.14$), and KF10 ($R2=-0.21$) (paired permutation test, FDR corrected, $p<0.005$). Pairwise comparison of different decoders revealed significant differences in all cases ($p<0.05$), except for the comparison of LSTM1 with WIENER10 ($p=0.23$). For decoding of kinematics, the position was decoded the best ($R2=0.61$), followed by velocity ($R2=0.25$) and acceleration ($R2=-0.11$) (paired permutation test, FDR corrected, $p<0.005$). (See supplementary Figures S4– S6 for the decoding performance of each kinematic variable.)

To assess how the decoding performance was affected by leaving out units from either M1 or S1, we analyzed data from the bipedal walking task, where substantial numbers of units were recorded from both areas for all 26 sessions (on average 388.3 units; M1, 60.5%, S1, 39.5%). For a fair comparison between conditions, we ensured that the number of units remained the same after leaving out data from different cortical areas. When more M1 units were recorded than S1 units in a session, we randomly selected and discarded M1 units so that the number of remaining units was the same for both, leaving out M1 and S1 data. This analysis showed that the decoding performance of LSTM2, evaluated as $R2$, decreased slightly (but significantly) when leaving out both M1 and S1 data for all the decoding kinematics (paired permutation test, $p<0.01$ after FDR correction; see Figure 10). Additionally, $R2$ was slightly lower when leaving out M1 compared to leaving out S1 (paired permutation test, $p<0.01$ after FDR correction).

## 4 Discussion

We developed an LSTM-based decoder for translating extracellular discharges of an ensemble of cortical neurons into predictions of up to 36 kinematic parameters. The decoder was applied to two arm movement tasks (unimanual and bimanual reaching) and the bipedal locomotion task. LSTM proved to be suitable for decoding motor parameters because it captured well several features of the behavioral tasks: the time course with a characteristic interval of 1 to 5 s (trial duration in the arm movement tasks or step cycle duration in the bipedal walking task) and the state transitions (from resting to moving and from leg stance to leg swing). These features were captured by gating the neuronal activity in and out of the LSTM memory cells. Motor information could be kept in the memory cells for long periods of time, which allowed the neuronal decoder to capture the temporal characteristics of both upper and lower limb movements.

We demonstrated that given the same number of training data, LSTM outperformed the conventional neural decoders: the Wiener filter, and Kalman filters, including the unscented Kalman filter. Moreover, with many fewer training samples, LSTM could reach the same level of decoding performance as the other decoders. We suggest that LSTM outperformed the Wiener and Kalman filters because the latter two decoders did not capture the nonlinearities between the spiking activities and the limb kinematics. By contrast, LSTM captured nonlinearities by its nonlinear gates and activation functions. Because of such representation of nonlinearities, LSTM outperformed even the unscented Kalman filter that was also designed to incorporate nonlinearities. We suggest that LSTM had higher accuracy than the Kalman filter because its memory performed better compared to the unscented Kalman filter tap structure.

While the formation of the weight structure in LSTM networks is not well understood, some believe that it may mimic physiological properties of biological neuronal circuits (Banino et al., 2018). We conducted a systematic analysis of the effect of hyperparameters on LSTM performance. As a result, we found that when LSTM2 did not perform well, the decoder's performance could be improved by using a widening LSTM (i.e., LSTM2x). We also found that LSTM was robust to various perturbations in hyperparameter settings. For example, the decoding performance was comparable when using training snippets longer than 1 s, dropout ratio in training from 0 to 0.3, and even with a different number of memory cells and LSTM layer structures.

While LSTM was robust to our hyperparameter perturbations, some LSTM structures appeared to fit certain tasks particularly well. For example, the two-layer LSTM outperformed the single-layer LSTM, but the three-layer LSTM did not improve the decoding performance for the center-out task. Also, the widening LSTM outperformed the narrowing LSTM for the same number of neurons. The superior performance of the widening LSTM should be examined in future studies. At present, we suggest that the first LSTM layer could be lower in size than the neuronal sample because kinematic information is represented redundantly in the activity of different neurons. Additionally, the second LSTM layer should be larger than the number of variables being decoded because multiple memory cells are needed to fully represent the properties of each variable.

The most striking advantage of our LSTM-based decoder was its ability to stabilize the cursor during the delay periods of the arm-reaching tasks, that is, the periods corresponding to monkeys holding the cursor in the center of the screen, a task none of the other algorithms was able to adequately match. For example, the unscented Kalman filter generated noisy kinematic patterns during the delay intervals even though it incorporated nonlinear neuronal tuning to behavioral variables. Several studies previously attempted to solve this problem with decoder add-ons. Homer and his colleagues (2013) proposed that the cursor position should not be updated if the decoded velocity was close to zero. Golub, Yu, Schwartz, and Chase (2014) developed a speed-dampening Kalman filter that assumed that the cursor speed was inversely correlated with its angular velocity, that is, the cursor moved slower when movement direction changed. Gilja and his colleagues (2012) used the cursor position as feedback in their ReFIT Kalman filter to reduce the jitter during online control. Here, we showed that LSTM does not require such add-ons to stabilize the cursor. Instead, the hold state was detected by the gating mechanism that regulated the transformation of neuronal spikes to limb kinematics. According to this mechanism, the decoded position, velocity, and acceleration from the previous time point provided recurrent input to the LSTM for the next time point.

In the case of previously proposed BMI decoders, the length of spiking history was usually set arbitrarily or adjusted empirically. In nonrecurrent decoders such as the Wiener filter and Kalman filters, the trial history is defined as the number of discrete steps, called taps. Although many taps could reproduce the neuronal dynamics better, excessive taps cause an exponential increase in computational cost, overfitting, and numerical instability. In previous studies, the history was typically set to 0.5 to 1.0 s and the number of taps to 5 to 10. Recurrent LSTM is immune to these problems because no taps are utilized. Instead, LSTM keeps the entire trial history rather than limiting its memory to a short time span. In our case, we obtained very good results with a spiking history as long as 4 s, which corresponded to 80 taps of nonrecurrent decoders. We found that when the LSTMs were trained with a longer history than the typical trial duration, these decoders' performance neither deteriorated nor improved, probably because the memory span was exceeded by the timing required for the neurophysiological mechanism involved in executing the motor task to unfold properly.

We also found that LSTM2 (i.e., two-layer LSTM) reached a performance plateau with a shorter spiking history compared to LSTM1 (i.e., single-layer LSTM). One possible explanation of this observation is that each of these LSTMs captured a different input-output relationship in different ways. LSTM2 has twice as many free parameters to fit as LSTM1, which probably explains why a short temporal lag was sufficient for training.

Previously, an RNN-based neural decoder has been used by for decoding arm-reaching kinematics from cortical activities for the center-out task (Sussillo et al., 2012; Sussillo, Stavisky, Kao, Ryu, & Shenoy, 2016). The decoder was based on echo state networks (ESN; Jaeger & Haas, 2004) that consisted of two layers: a recurrent layer and an output layer. The recurrent layer had a large number of nonlinear units that were sparsely and randomly connected with random initial weights. The output layer was a linear readout of the recurrent layer. Because the learning took place only in the output layer, not the recurrent layer, the ESN did not have to propagate error signals back through time, which made training more efficient. However, since the recurrent layer was not trained, the decoder performance was highly sensitive to several hyperparameters of the recurrent layer (Bianchi, Maiorino, Kampffmeyer, Rizzi, & Jenssen, 2017), such as the number of neurons, connectivity between neurons, reservoir spectral radius, and noise terms. While good results have been reported for ESN in the literature (see Bianchi et al., 2017, for examples), this algorithm was compared with only a few other methods. For instance, Sussillo et al. (2012) compared ESN only to a velocity-KF instead of ReFIT-KF, presumably a better algorithm developed by the same group (Gilja et al., 2012).

As is well known, neural recording conditions can change in a matter of hours or days. When they change, a decoder that has established a fixed mapping between neural activities and limb movement would perform poorly. To this end, Sussillo et al. (2016) applied multiplicative recurrent neural networks (MRNNs; Martens & Sutskever, 2011) to improve the robustness of neural decoders across days. MRNN addresses this issue by learning a library of the mapping and selecting the right mapping based on the input. The authors showed that MRNN not only outperformed FIT-KF (similar to ReFIT-KF; Fan et al., 2014) when tested on many daily recording sessions, but it also outperformed ReFIT-KF. While MRNN is a powerful framework and can be integrated with LSTM (Krause, Lu, Murray, & Renals, 2016), it is data thirsty and time-consuming to train (Sussillo et al., 2016).

Despite outperforming other classic decoders, some limitations and future directions should be noted for our LSTM approach. While LSTM accurately predicted behaviors from neural data, unlike Kalman filters, it lacked explicit neural tuning models. Nevertheless, because the LSTMs we used in this study had a relatively simple architecture, many limitations could be addressed in the future. For example, LSTM could learn the motion and measurement models of the Kalman filter (Coskun, Achilles, Dipietro, Navab, & Tombari, 2017). It could use target position as a training signal to improve decoding performance (Li et al., 2016). For the sake of a fair comparison of different decoders, the use of target position was not assessed in this study because the Wiener filter, unlike the other decoders, could not use target information. In another direction for future research, LSTM could incorporate adaptive features based on reinforcement learning (Bakker, 2002). This capacity could be used to improve decoding from nonstationary neural signals. For practical BMI applications, it is essential that decoders detect the instances when a user wants to halt BMI control and switch attention to a different task. LSTM can incorporate such a detection (Bahdanau, Cho, & Bengio, 2014; Chorowski, Bahdanau, Serdyuk, Cho, & Bengio, 2015).

Overall, our study showed that LSTM architectures provide a robust neural decoding framework with many features that can be adjusted to improve BMI performance. Since LSTM is a neural network, it can be viewed as an artificial circuit attached to the brain in order to recover and augment physiological functions. In this respect, it is interesting that we observed the emergence of directional tuning and neuronal dynamics in our LSTM units—the patterns resembling activity of real neurons. Given the accurate performance of LSTM as a neural decoder, we propose that this method should be considered for the design of future neuroprosthetic devices for restoring mobility in severely disabled patients.