## Abstract

The Kalman filter provides a simple and efficient algorithm to compute the posterior distribution for state-space models where both the latent state and measurement models are linear and gaussian. Extensions to the Kalman filter, including the extended and unscented Kalman filters, incorporate linearizations for models where the observation model $p(observation|state)$ is nonlinear. We argue that in many cases, a model for $p(state|observation)$ proves both easier to learn and more accurate for latent state estimation.

Approximating $p(state|observation)$ as gaussian leads to a new filtering algorithm, the discriminative Kalman filter (DKF), which can perform well even when $p(observation|state)$ is highly nonlinear and/or nongaussian. The approximation, motivated by the Bernstein–von Mises theorem, improves as the dimensionality of the observations increases. The DKF has computational complexity similar to the Kalman filter, allowing it in some cases to perform much faster than particle filters with similar precision, while better accounting for nonlinear and nongaussian observation models than Kalman-based extensions.

When the observation model must be learned from training data prior to filtering, off-the-shelf nonlinear and nonparametric regression techniques can provide a gaussian model for $p(observation|state)$ that cleanly integrates with the DKF. As part of the BrainGate2 clinical trial, we successfully implemented gaussian process regression with the DKF framework in a brain-computer interface to provide real-time, closed-loop cursor control to a person with a complete spinal cord injury. In this letter, we explore the theory underlying the DKF, exhibit some illustrative examples, and outline potential extensions.

## 1  Introduction

Consider a state-space model for $Z1:T:=Z1,…,ZT$ (latent states) and $X1:T:=X1,…,XT$ (observations) represented as a Bayesian network:
$Z1⟶⋯⟶Zt-1⟶Zt⟶⋯⟶ZT↓↓↓↓X1Xt-1XtXT.$
(1.1)
The conditional density of $Zt$ given $X1:t$ can be expressed recursively using the Chapman-Kolmogorov equation and Bayes' rule (see Chen, 2003, for further details):
$p(zt|x1:t-1)=∫p(zt|zt-1)p(zt-1|x1:t-1)dzt-1,$
(1.2a)
$p(zt|x1:t)=p(xt|zt)p(zt|x1:t-1)∫p(xt|zt)p(zt|x1:t-1)dzt=p(xt|zt)p(zt|x1:t-1)p(xt|x1:t-1),$
(1.2b)
where $p(z0|x1:0)=p(z0)$ and the conditional densities $p(zt|zt-1)$ and $p(xt|zt)$ are either specified a priori or learned from training data prior to filtering. Computing or approximating equation 1.2 is often called Bayesian filtering. Bayesian filtering arises in a large number of applications, including global positioning systems, target tracking, aircraft and spacecraft guidance, weather forecasting, computer vision, digital communications, and brain-computer interfaces (Chen, 2003; Hall, 1966; Battin & Levine, 1970; Grewal & Andrews, 2010; Buehner, McTaggart-Cowan, & Heilliette, 2017; Brown & Hwang, 2012; Schmidt, Weinberg, & Lukesh, 1970; Brandman, Cash, & Hochberg, 2017).

Exact solutions to equation 1.2 are available only in special cases, such as the Kalman filter (Kalman, 1960; Kalman & Bucy, 1961). The Kalman filter models the conditional densities $p(zt|zt-1)$ and $p(xt|zt)$ as linear and gaussian so that the posterior distribution $p(zt|x1:t)$ is also gaussian and quickly computable. Beneš (1981) and Daum (1984, 1986) broadened the class of models for which the integrals in equation 1.2 are analytically tractable, but many model specifications still fall outside this class. When the latent state space is finite, the integrals in equation 1.2 become sums that can be calculated exactly using a grid-based filter (Elliott, 1994; Arulampalam, Maskell, Gordon, & Clapp, 2002). For more general models, there are many techniques for approximate Bayesian filtering see (Chen, 2003 for a review).

In some applications, parts of the underlying model are first learned from supervised training data consisting of $(Zt,Xt)$ pairs, and then the learned model is used for filtering on new $(Xt)$ data. For instance, $(Zt,Xt)$ pairs might be used to learn $p(xt|zt)$ with nonparametric conditional density estimation, and then the learned $p(xt|zt)$, say, $p^(xt|zt)$, is substituted into whatever algorithm is used to approximate Bayes' rule in equation 1.2b. This motivates the search for combinations of approximation algorithms and learning methods that work well together. It also opens the door to novel approximation algorithms that would not traditionally be considered for a known model but become practical when the model can be learned. For instance, from $(Zt,Xt)$ pairs, we can choose to learn $p(xt|zt)$ or $p(zt|xt)$ and incorporate either into the approximation algorithm, whereas traditional approximation algorithms assume that only $p(xt|zt)$ is available.

In this letter, we explore the idea of using a novel approximation algorithm that pairs well with learning and demonstrate its use in an intracortical brain-computer interface (iBCI) for a human volunteer with tetraplegia as part of the ongoing BrainGate2 clinical trial. Our approach focuses on the approximation of Bayes' rule in equation 1.2b, making use of the fact that $p(xt|zt)$ can be replaced with $p(zt|xt)/p(zt)$ throughout. (The $p(xt)$ term cancels.) This strategy combines well with various gaussian assumptions that are often employed in approximate Bayesian filtering, resulting in what we call the discriminative Kalman filter (DKF). The DKF retains much of the computational simplicity of the classical Kalman filter but allows for arbitrary observation models. Some of our clinical research using the DKF has already been published (Brandman, Burkhart et al., 2018; Brandman, Hosman et al., 2018), and theoretical aspects of the DKF are further explored in the first author's dissertation (Burkhart, 2019).

## 2  The Discriminative Kalman Filter

In section 2.1, we derive the DKF approximation for a class of models that generalizes the Kalman filter by allowing for arbitrary observation models. We discuss approximation accuracy in section 2.2 and introduce a modified algorithm that can be more robust to model misspecification in section 2.3. In section 2.4, we compare the DKF formalism to a variety of existing approaches that generalize the Kalman filter, and in section 2.5, we discuss using the DKF approximation in models with nonlinear or nongaussian state dynamics.

We now introduce some notation and conventions. We let the latent states $Zt$ take values in $Rd×1$ and the observations $Xt$ take values in an abstract space $X$. In all of our examples, $X⊆Rn×1$, but this is not necessary. We use $ηd(z;μ,Σ)$ to denote the $d$-dimensional multivariate gaussian distribution with mean vector $μ∈Rd×1$ and covariance matrix $Σ∈Sd$ evaluated at $z∈Rd×1$, where $Sd$ denotes the set of $d×d$ positive-definite (symmetric) matrices. We let $A⊺$ refer to the transpose of a matrix $A$ and use $E$ and $V$ for expected value and variance/covariance, respectively.

### 2.1  Filter Derivation

For the basic derivation, we assume that the latent states form a stationary, mean zero, gaussian, vector autoregressive model of order 1. Namely, for $A∈Rd×d$ and $S,Γ∈Sd$,
$p(z0)=ηd(z0;0,S),$
(2.1a)
$p(zt|zt-1)=ηd(zt;Azt-1,Γ),$
(2.1b)
for $t=1,2,…$, where $S=ASA⊺+Γ$ so that the process is stationary. Note that equation 2.1 matches the latent state model for the stationary Kalman filter. (The assumption of zero mean is easily generalized, but it is usually more convenient to center the $Zt$ process by subtracting the common mean.)

The observation model $p(xt|zt)$ is assumed to not vary with $t$, so that the joint $(Zt,Xt)$ process is stationary but otherwise arbitrary. The observation model can be nongaussian, multimodal, or discrete, for example. For instance, in neural decoding for BCI applications, the observations are often vectors of counts of neural spiking events (binned action potentials), which might be restricted to small integers or even be binary-valued.

The DKF is based on a gaussian approximation for $p(zt|xt)$, namely,
$p(zt|xt)≈ηd(zt;f(xt),Q(xt)),$
(2.2)
where $f:X→Rd$ and $Q:X→Sd$. Note that equation 2.2 is not an approximation of the observation model, but rather of the conditional density of the latent state given the observation at a single time step. In section 2.4, we compare this to other approaches that use gaussian approximations for Bayesian filtering. When the dimensionality of the observation space ($X$) is large relative to the dimensionality of the state space ($Rd$), the Bernstein–von Mises theorem states that $f$ and $Q$ exist such that this approximation will be accurate, requiring only mild regularity conditions on the observation model $p(xt|zt)$ (see section 2.2 in van der Vaart, 1998). Furthermore, we can take $f$ and $Q$ to be the conditional mean and covariance of $Zt$ given $Xt$, namely,
$f(x)=E(Zt|Xt=x),Q(x)=V(Zt|Xt=x),$
(2.3)
which is the approach taken in this letter, although other choices are certainly possible, such as $f(xt)=argmaxztp(zt|xt)$ or $f(xt)=argmaxztp(xt|zt)$, the latter of which is most commonly used in statements of the Bernstein–von mises Theorem.
To make use of equation 2.2 for approximating equation 1.2, we first rewrite equation 1.2b in terms of $p(zt|xt)$ as
$p(zt|x1:t)=p(xt)p(xt|x1:t-1)p(zt|xt)p(zt)p(zt|x1:t-1),=p(xt)p(xt|x1:t-1)p(zt|xt)p(zt)∫p(zt|zt-1)p(zt-1|x1:t-1)dzt-1,$
(2.4)
where the second line follows from the Chapman–Kolmogorov equation (see equation 1.2a). We then substitute the latent state model, equation 2.1, and the DKF approximation, equation 2.2, into equation 2.4. We absorb terms not depending on $zt$ into a normalizing constant $κ$ to obtain
$p(zt|x1:t)≈κ(x1:t)ηd(zt;f(xt),Q(xt))ηd(zt;0,S)∫ηd(zt;Azt-1,Γ)p(zt-1|x1:t-1)dzt-1.$
(2.5)
If $p(zt-1|x1:t-1)$ is approximately gaussian, which it is for the base case of $t=1$ from equation 2.1a (defining $p(z0|x1:0)=p(z0)$), then all of the terms on the right side of equation 2.5 are approximately gaussian. If these approximations are exact and the analytic expression for covariance is valid (specifically if $Σt$ in equation 2.7 is positive definite), we find that the right side of equation 2.5 is again gaussian, giving a gaussian approximation for $p(zt|x1:t)$. We rely on the fact that dividing two gaussian pdfs yields an exponentiated quadratic form that will itself be gaussian if the associated covariance matrix is positive definite (and that the product of two gaussian pdfs is gaussian, without any additional assumptions). See the proof of 7 in appendix B for a full derivation and further details.
Let
$p(zt|x1:t)≈ηd(zt;μt(x1:t),Σt(x1:t))$
(2.6)
be the gaussian approximation of $p(zt|x1:t)$ obtained from successively applying the approximation in equation 2.5. Defining $μ0=0$ and $Σ0=S$, we can sequentially compute $μt=μt(x1:t)∈Rd×1$ and $Σt=Σt(x1:t)∈Sd$ via
$νt=Aμt-1,Mt=AΣt-1A⊺+Γ,Σt=Mt-1+Q(xt)-1-S-1-1,μt=ΣtMt-1νt+Q(xt)-1f(xt).$
(2.7)
The first two steps incorporate the exact state dynamics in equation 2.1b and the final two steps incorporate the observation information using the DKF approximation in equation 2.2. The function $Q$ needs to be defined so that $Σt$ exists and is a proper covariance matrix. A sufficient condition that is easy to enforce in practice is $Q(·)-1-S-1∈Sd$ (see section A.3 in appendix A).

Equation 2.7 encapsulates the DKF. (For pseudocode, see algorithm 1.) Once $f(xt)$ and $Q(xt)$ have been evaluated, there is no remaining dependence on $n$ and a single iteration of the algorithm takes $O(d3)$ operations, which is at least as fast as the Kalman filter (when $d). The power of the DKF, along with potential computational difficulties, comes from evaluating $f$ and $Q$. If $f$ is linear and $Q$ is constant, the DKF and the Kalman filter are equivalent (see section 4.1). More general $f$ and $Q$ allow the filter to depend nonlinearly on the observations, improving performance in many cases. If $f$ and $Q$ can be quickly evaluated and the dimension $d$ of $Zt$ is not too large, then the DKF is fast enough for use in real-time applications, such as the BCI decoding example below.

### 2.2  Approximation Accuracy

Let the observation space be $X=Bn$ for some set $B$. As $n$ grows, the Bernstein–von Mises (BvM) theorem guarantees under mild assumptions that the conditional distribution of $Zt|Xt$ is asymptotically normal in total variation distance and concentrates at $Zt$ (van der Vaart, 1998). This asymptotic normality result provides the main rationale for our key approximation expressed in equation 2.2. The BvM theorem is usually stated in the context of Bayesian estimation. To apply it in our context, we equate $Zt$ with the parameter and $Xt$ with the data, so that $p(zt|xt)$ becomes the posterior distribution of the parameter at a fixed time $t$. We then let the dimension $n$ of $xt$ grow, meaning that we are observing growing amounts of data at a fixed time $t$ associated with the parameter $Zt$. Very loosely speaking, the BvM theorem tends to be applicable in situations where $Xt$ uniquely determines $Zt$ in the limit as $n→∞$ but does not uniquely determine $Zt$ for any finite $n$.

One concern is that equation 2.4 will amplify approximation errors. Along these lines, we prove the following result that holds whenever the BvM theorem is applicable for equation 2.2:

Theorem 1.

Under mild assumptions, the total variation distance between our approximation $ηd(zt;μt(x1:t),Σt(x1:t))$ and the exact filtering distribution $p(zt|x1:t)$ converges in probability to zero for each $t$ as $n→∞$.

This result is stated formally and proven in appendix B. We interpret the theorem to mean that under most conditions, as the dimensionality of the observations increases, the approximation error of the DKF tends to zero.

The proof is elementary but involves several subtleties that arise because of the $p(zt)$ term in the denominator of equation 2.4 corresponding to $ηd(zt;0,S)$. This term can amplify approximation errors in the tails of $p(zt|xt)$, which are not uniformly controlled by the asymptotic normality results in the BvM theorem. To remedy this, our proof also uses the concentration results in the BvM theorem to control pathological behaviors in the tails. As an intermediate step, we prove that 1 still holds when the $p(zt)$ term is omitted from the denominator of equation 2.4 (see remark 5 in appendix B).

### 2.3  Robust DKF

Omitting the $p(zt)$ from the denominator of equation 2.4 is also helpful for making the DKF robust to violations of the modeling assumptions and errors introduced when $f$ and $Q$ are learned from training data. Repeating the original derivation, but without $ηd(zt;0,S)$ in the denominator, gives the following filtering algorithm that we call the robust DKF. One can think of the robust DKF as a special case of the standard DKF where all eigenvalues of $S-1$ are so small that the effect of subtracting $S-1$ is negligible. This has the effect of placing an improper prior on $Z0$. Defining $μ1(x1)=f(x1)$ and $Σ1(x1)=Q(x1)$, we sequentially compute $μt$ and $Σt$ for $t≥2$ via
$νt=Aμt-1,Mt=AΣt-1A⊺+Γ,Σt=Mt-1+Q(xt)-1-1,μt=ΣtMt-1νt+Q(xt)-1f(xt).$
(2.8)
(Note that we initialize at $t=1$ and not $t=0$ in the robust DKF.) Justification for the robust DKF comes from remark 5 in appendix B showing that the robust DKF accurately approximates the true $p(zt|x1:t)$ in total variation distance for each $t$ as $n$ increases. We sometimes find that the robust DKF outperforms the DKF on real-data examples, but not on simulated examples that closely match the DKF assumptions. (For pseudocode, see algorithm 2.)

### 2.4  Other Gaussian Approximations

The DKF enforces a gaussian form for the filtering distribution $p(zt|x1:t)$, a common strategy for approximate Bayesian filtering owing to the analytic and representational tractability of gaussians. In this section, we describe several other methods that use gaussian approximations, focusing on the case of linear, gaussian state dynamics. For this type of state dynamics, the transition from time $t-1$ to time $t$ is usually separated into two distinct steps when using gaussian approximations. Beginning with
$p(zt-1|x1:t-1)≈ηd(zt-1;μt-1,Σt-1),$
the first step uses the exact state dynamics, equation 2.1b, to create a gaussian approximation for $p(zt|x1:t-1)$, namely,
$p(zt|x1:t-1)≈ηd(zt;νt,Mt),$
(2.9)
where $νt=Aμt-1$ and $Mt=AΣt-1A⊺+Γ$, as in equations 2.7 and 2.8. Most gaussian methods would proceed similarly for the first step under these state dynamics. Differences between methods appear for nonlinear or nongaussian state dynamics (see section 2.5).
The second step attempts to incorporate the observation information $xt$ via Bayes' rule:
$p(zt|x1:t)=p(xt|zt)p(zt|x1:t-1)∫p(xt|zt)p(zt|x1:t-1)dzt.$
Beginning with the gaussian approximation from step 1 (equation 2.9) and enforcing the final approximation,
$p(zt|x1:t)≈ηd(zt;μt,Σt),$
the problem reduces to finding $μt$ and $Σt$ so that
$ηd(zt;μt,Σt)≈p(xt|zt)ηd(zt;νt,Mt)∫p(xt|zt)ηd(zt;νt,Mt)dzt=qt(zt),$
(2.10)
where $qt$ is defined by this equation.

There are many strategies in the literature for choosing $μt$ and $Σt$ in equation 2.10. The terminology is not standardized, but we will attempt to describe some prominent classes of strategies.

#### 2.4.1  Gaussian Assumed Density Filter

The gaussian assumed density filter (G-ADF) usually refers to choosing $μt$ and $Σt$ to be the mean vector and covariance matrix of the density $qt$ in equation 2.10 (Kushner, 1967; Ito, 2000; Ito & Xiong, 2000; Minka, 2001a). Moment matching, in this case, minimizes the relative entropy $D(qt∥ηd(·;μt,Σt))$. The G-ADF directly seeks a gaussian approximation to the full posterior $p(zt|x1:t)$, whereas the DKF derives a gaussian approximation to the full posterior from a gaussian approximation of $p(zt|xt)$. While the G-ADF approach tends to prove quite accurate, it is practical only if the mean and covariance of $qt$ are available. In particular, we must be able to efficiently compute or easily approximate the integrals,
$a=∫p(xt|zt)ηd(zt;νt,Mt)dzt,b=∫ztp(xt|zt)ηd(zt;νt,Mt)dzt,c=∫ztzt⊺p(xt|zt)ηd(zt;νt,Mt)dzt,$
(2.11)
to obtain $μt=b/a$ and $Σt=c/a-μtμt⊺$. There also exist extensions of the G-ADF. For instance, expectation propagation uses iterative refinement of estimates to improve on assumed density filtering (Minka, 2001a, 2001b). It may be possible to similarly improve the DKF, but iterating over the history of observations is typically not practical in an online setting, and we do not explore that approach here.

In cases where the DKF is derived from a known model, as opposed to being learned from training data, computing $f(xt)$ and $Q(xt)$ requires the computation of very similar integrals to those needed for the G-ADF, the difference being that $vt$ and $Mt$ are replaced by 0 and $S$, respectively, throughout equation 2.11 (and then $f(xt)=b/a$ and $Q(xt)=c/a-f(xt)f(xt)⊺$). For this reason, in models where the G-ADF can be easily used, there would seem to be no reason to use the DKF. The main difference is that the DKF can be easily learned from training data, whereas the G-ADF cannot, since the latter is based on the conditional mean and variance of $Zt|Xt$ derived under a different marginal distribution for $Zt$ at each time step, namely, $ηd(zt;νt,Mt)$. The example in section 4.2 illustrates a model where both the DKF and G-ADF can be analytically computed; there is little difference in performance. The example in section 4.3 illustrates a somewhat contrived model where the DKF can be easily computed, but it seems the G-ADF cannot.

#### 2.4.2  Laplace Approximation

The Laplace approximation uses a Taylor approximation at the maximum to coerce the numerator in equation 2.10 into a gaussian form as a function of $zt$ (Butler, 2007; Koyama, Pérez-Bolde, Shalizi, & Kass, 2010; Quang, Musso, & Le Gland, 2015). Defining
$gt(zt)=logp(xt|zt)ηd(zt;νt,Mt)andzt*=argmaxztgt(zt),$
a second-order Taylor approximation of $gt$ at $zt*$ is
$gt(zt)≈gt(zt*)+g˙t(zt*)(zt-zt*)+(zt-zt*)⊺g¨t(zt*)(zt-zt*)/2,$
where $g˙t(z)$ and $g¨t(z)$ denote, respectively, the $d×1$ gradient vector and the $d×d$ Hessian matrix of $gt$ evaluated at $z$. The second term vanishes since $g˙t$ is zero at the maximum, giving
$qt(zt)∝exp(gt(zt))≈expgt(zt*)+(zt-zt*)⊺g¨t(zt*)(zt-zt*)/2)∝ηd(zt;zt*,-g¨t(zt*)-1).$
This motivates the choice of $μt=zt*$ and $Σt=-g¨t(zt*)$. Similar to the DKF, the Laplace approximation can be justified in the limit of increasing observation dimensionality using the BvM theorem. If $zt*$ or the derivatives of $gt$ are not available in closed form, then the Laplace approximation can be slow owing to the need to solve an optimization problem at each time step. Laplace approximations are also criticized for being too local, in that the local curvature in the density at $zt*$ dictates the variance chosen for a global approximation to the density.

#### 2.4.3  Linearization Methods

Several methods, often called linearization methods, can be motivated by attempting to approximate the numerator of equation 2.10 as jointly gaussian in $(zt,xt)$, namely,
$p(xt|zt)ηd(zt;νt,Mt)≈ηd+nztxt|νtht,MtCtCt⊺Nt,$
(2.12)
where the history of observations $x1:t$ is allowed to influence the choice of $ht∈Rn×1$, $Nt∈Sn$, and $Ct∈Rd×n$. Using this approximation allows equation 2.10 to be exactly integrated to obtain
$μt=νt+CtNt-1(xt-ht)andΣt=Mt-CtNt-1Ct⊺.$
(2.13)
Methods differ in how they choose $ht$, $Nt$, and $Ct$.
Using $ηd(zt;νt,Mt)$ as the marginal density for $Zt$, equation 2.12 can be rewritten as
$p(xt|zt)≈ηn(xt;bt+Htzt,Λt).$
(2.14)
The implicit linearization in equation 2.12 is now explicit: $E(Xt|Zt=zt)$ is approximated as the linear function $bt+Htzt$. The relationship between the different parameters in equations 2.12 and 2.14 is $bt=ht-Ct⊺Mt-1vt$, $Ht=Ct⊺Mt-1$, and $Λt=Nt-Ct⊺M-1Ct$. Upon reparameterization,1 equation 2.13 can be used for filtering with
$Σt=(Mt-1+Ht⊺Λt-1Ht)-1,μt=ΣtMt-1νt+Ht⊺Λt-1(xt-bt),$
which has a similar appearance to the corresponding DKF updates in equation 2.7.
Equation 2.14 underlies several gaussian approximations to Bayes' rule, including the approximations used in the extended Kalman filter (EKF), the unscented Kalman filter (UKF: Julier & Uhlmann, 1997; Wan & van der Merwe, 2000; van der Merwe, 2004), and the statistically linearized filter (SLF: Gelb, 1974; Särkkä, 2013). The EKF, for instance, begins with the functions
$h(z)=E(Xt|Zt=z)andΛ(z)=V(Xt|Zt=z),$
which are assumed known, and takes $Ht=h˙(νt)$, $bt=h(νt)-Htνt$, and $Λt=Λ(νt)$, where $h˙(z)$ is the $n×d$ matrix of partial derivatives of $h$ evaluated at $z$. These choices of $bt$ and $Ht$ correspond to a first-order Taylor approximation of $h$ at the point $νt$. Like the Laplace approximation, the EKF is often criticized for being too local because the gradient of $h$ at a single point drives the approximation.

The unscented Kalman filter (UKF) employs the eponymous transform to propagate weighted, deterministically chosen points through a nonlinear transformation and recover estimates for $ht$, $Nt$, and $Ct$ from equation 2.13. The estimates for all three parameters prove exact for linear transformations of gaussians but inexact for general higher-order polynomials (Särkkä, 2013), so we consider this a linearization method. Variations on this approach, collectively called sigma-point filters (van der Merwe, 2004), include the central difference Kalman filter (CDKF: Ito & Xiong, 2000; Nørgaard, Poulsen, & Ravn, 2000), the Gauss-Hermite Kalman filter, the quadrature Kalman filter (Ito, 2000; Ito & Xiong, 2000), and the cubature Kalman filter (Arasaratnam, Haykin, & Elliott, 2007; Arasaratnam & Haykin, 2009).

The SLF is a related but more global approximation for the same observation model. It selects $bt$ and $Ht$ to minimize the difference between the true observation model $Xt=h(Zt)+εt$ and the linear approximation $Xt≈at+BtZt+εt$, where $Zt$ is chosen from the current, approximate, predicted distribution. For instance, $at$ and $Bt$ can be chosen to minimize
$∫∥h(zt)-(at+Btzt)∥2ηd(zt;νt,Mt)dzt,$
where $∥·∥$ is the usual Euclidean norm in $Rn$. Defining $h¯t=∫h(zt)ηd(zt;νt,Mt)dzt$ and $H¯t=∫(h(zt)-h¯t)(zt-νt)⊺ηd(zt;νt,Mt)dzt$, the solution is $Bt=H¯tMt-1$ and $at=h¯t-Btνt$, again with $Λt=Λ$. Like the EKF, this version of the SLF is best suited for additive, gaussian noise models, but it further requires that the integrals defining $h¯t$ and $H¯t$ can be efficiently computed or easily approximated.

The UKF, the SLF, and many related techniques improve on some of the deficiencies of the EKF. Nevertheless, these methods tend to perform poorly when the conditional distribution of $Xt$ given $Zt$ cannot be well approximated as gaussian. The examples in sections 4.2 and 4.3 illustrate models where linearization proves completely ineffectual, as $h(z)=E(Xt|Zt=z)=0$ for all $z$ in these examples, even though the G-ADF and the DKF work well.

### 2.5  Nonlinear State Dynamics

Filtering can be conceptually separated into two steps. The first step uses the state dynamics to transition from $Zt-1|X1:t-1$ to $Zt|X1:t-1$ via equation 1.2a, and the second step uses Bayes' rule to update $Zt|X1:t-1$ into $Zt|X1:t$ via equation 1.2b. In this letter, difficulties with the first step are removed by assuming linear, gaussian, state dynamics (see equation 2.1). There are, however, a variety of approximation methods for more complicated state dynamics, including methods that approximate $p(zt|x1:t-1)$ as a gaussian. Any such gaussian method could be easily combined with the DKF approximation, which relates to Bayes' rule in the second step. In particular, given the approximation
$p(zt|x1:t-1)≈ηd(zt;νt,Mt),$
we simply use these values of $νt$ and $Mt$ in the DKF algorithm (see equation 2.7) or the robust DKF algorithm see equation 2.8, instead of computing them in the first two lines of these algorithms. In this letter, we do not explore in depth this generalization to nonlinear state dynamics, although we do provide a proof of concept example in section 4.4.

There is a vast literature on more general approximation algorithms for Bayesian filtering (Särkkä, 2013; Chen, 2003). Monte Carlo integration (Metropolis & Ulam, 1949) can almost always be used. Such approaches are called sequential Monte Carlo or particle filtering and include sequential importance sampling and sequential importance resampling (Handschin & Mayne, 1969; Handschin, 1970; Gordon, Salmond, & Smith, 1993; Kitagawa, 1996; del Moral, 1996; Doucet, Godsill, & Andrieu, 2000; Cappé, Moulines, & Ryden, 2005; Cappé, Godsill, & Moulines, 2007). These methods apply to all classes of models but tend to be the most expensive to compute online and suffer from the curse of dimensionality (Daum & Huang, 2003). Alternate sampling strategies (see, e.g., Chen, 2003; Liu, 2008) can be used to improve filter performance, including acceptance-rejection sampling (Handschin & Mayne, 1969), stratified sampling (Douc & Cappé, 2005), hybrid MC (Choo & Fleet, 2001), and quasi-MC (Gerber & Chopin, 2015). There are also ensemble versions of the Kalman filter that are used to propagate the covariance matrix in high dimensions, including the ensemble Kalman filter (enKF: Evensen, 1994) and the ensemble transform Kalman filter (ETKF: Bishop, Etherton, & Majumdar, 2001; Majumdar, Bishop, Etherton, & Toth, 2002), along with versions that produce local, parallelizable approximations for covariance (Ott et al., 2004; Hunt, Kostelich, & Szunyogh, 2007).

It may be possible to usefully combine the DKF approximation with some of these more advanced filtering techniques. The key approximation in the DKF is
$p(xt|zt)=p(xt)p(zt|xt)p(zt)≈κ(xt)ηd(zt;f(xt),Q(xt))ηd(zt;0,S).$
(2.15)
This approximation could, in principle, be substituted for the likelihood $p(xt|zt)$ in any filtering algorithm, including particle filters, which incorporate the likelihood into the particle weights. The normalizing term $κ(xt)$ from equation 2.15 will generally cancel, since the final posterior distribution $p(zt|x1:t)$ is invariant to terms depending only on $x1:t$. The advantage of equation 2.15 is that $f(·)$, $Q(·)$, and $S$ might be easier to learn from data than the full conditional density $p(xt|zt)$. For complex state dynamics, it is worth noting that the denominator $ηd(zt;0,S)$ will no longer precisely correspond to $p(zt)$ but will also be an approximation. If the gaussian approximations for $p(zt|xt)$ and $p(zt)$ are learned separately, some care may need to be taken to ensure the resulting approximation to $p(xt|zt)$ remains a good one. One strategy might be to learn a gaussian-shaped approximation to the density ratio $p(zt|xt)/p(zt)$ as a function of $zt$ (Sugiyama, Suzuki, & Kanamori, 2012). Another strategy might be to use the robust DKF approximation as in section 2.3, which simply drops the denominator in equation 2.15. In future work, we plan to explore these and other approaches that might allow a DKF-style approximation to be incorporated into more general filtering models.

## 3  Learning the DKF

The parameters in the DKF are $A$, $Γ$, $f(·)$, and $Q(·)$. ($S$ is specified from $A$ and $Γ$ using the stationarity assumption.) In many problems, some or all of these parameters might be unknown or not easily computable. In this section, we discuss some strategies for learning or approximating the parameters in the situation where fully supervised training data are available, meaning that we have a sequence of $(Zt,Xt)$ pairs assumed to be sampled from the underlying Bayesian network in equation 1.1 and denoted $(z1',x1'),…,(zm',xm')$. This training data might be real data or might be simulated from a known generative model for which the parameters, particularly $f$ and $Q$, are not easily computable.

We use $A^$, $Γ^$, $f^$, and $Q^$ to denote the respective learned parameters. We consider only the situation where the parameters are learned from training data and then fixed for subsequent filtering on a different sequence of observations. In particular, for filtering, we simply replace each parameter with its corresponding estimate in the DKF algorithm in equation 2.7. We do not consider a more fully Bayesian approach, where parameter uncertainty is propagated through the filtering equations.

$A$ and $Γ$ are the parameters of a well-specified statistical model given by equations 2.1a and 2.1b. In the learning experiments below, we learn them from $(zt-1',zt')$ pairs using only equation 2.1b, which reduces to multiple linear regression and is a common approach when learning the parameters of a Kalman filter from fully observed training data (see, e.g., Wu et al., 2002).

The parameters $f$ and $Q$ are more unusual, since they are not uniquely defined by the model, but are introduced via a gaussian approximation in equation 2.2. One possibility, and the one we focus on here, is to define $f$ and $Q$ via equation 2.3 and then learn them directly from training data as
$f^(x)≈f(x)=E(Zt|Xt=x)andQ^(x)≈Q(x)=V(Zt|Xt=x).$
(3.1)
Using equation 3.1, we learn $f$ and $Q$ from $(zt',xt')$ pairs ignoring the overall temporal structure of the data, which reduces to a standard nonlinear regression problem with heteroskedastic variance. The conditional mean $f$ can be learned using any number of off-the-shelf regression tools, and then $Q$ can be learned from the residuals, ideally using a held-out portion of the training data. We think that the ability to easily incorporate off-the-shelf discriminative learning tools into a closed-form filtering equation is one of the most exciting and useful aspects of this approach.

In the experiments that follow, we compare three standard nonlinear regression methods for learning $f$: Nadaraya-Watson (NW) kernel regression, neural network (NN) regression, and gaussian process (GP) regression. (Details are in sections A.4 to A.6.) While we have found that these methods work well with the DKF framework, one could readily use any arbitrary regression model.

For learning $Q$, we first define $Rt=Zt-f(Xt)$ and $R^t=Zt-f^(Xt)$, so that
$Q(x)=V(Zt|Xt=x)=E(RtRt⊺|Xt=x)≈E(R^tR^t⊺|Xt=x).$
(3.2)
The final expression in equation 3.2 is a conditional expectation and can in principle be learned with regression on $(R^tR^t⊺,Xt)$ pairs. Learning $Q$ in this way using off-the-shelf regression tools is more challenging because of the additional requirement that $Q(x)$ be a valid covariance matrix. Since $R^tR^t⊺$ is positive semidefinite, any regression estimator that is a weighted average of the training data with only nonnegative weights will also be positive semidefinite and, in most cases, positive definite. NW kernel regression constitutes one such method, and we use it for learning $Q$ in all of our examples. Given a subset of the training set ${(zi'',xi'')}i=1k$, distinct from the subset used to learn the function $f$, we define the residuals $r^i=zi''-f^(xi'')$, and then learn $Q$ using NW kernel regression via
$Q^(x)=∑i=1kr^ir^i⊺κ(x,xi'')∑i=1kκ(x,xi''),$
(3.3)
for a kernel $κ:X×X→[0,∞)$. (Complete details are in section A.4.)

## 4  Examples

In this section, we compare filter performance on both artificial models and real neural data. Corresponding Matlab code, and Python code for the long short-term memory (LSTM) comparison, is freely available online at https://github.com/burkh4rt/Discriminative-Kalman-Filter under the GNU General Public License v3.0 to encourage code use and adaptation. For timing comparisons, the code was run on a Mid-2018 MacBook Pro laptop with a 2.6 GHz Intel Core i7 processor using Matlab v. 2019a and Python v. 3.6.8.

### 4.1  Kalman Observation Model

The stationary Kalman filter observation model is
$p(xt|zt)=ηn(xt;b+Hzt,Λ)$
for observations in $X=Rn×1$ and for fixed $b∈Rn×1$, $H∈Rn×d$, and $Λ∈Sn$. Defining $f$ and $Q$ via equation 2.3 gives
$Q(x)≡Q=(S-1+H⊤Λ-1H)-1andf(x)=QH⊤Λ-1(x-b).$
It is straightforward to verify that the DKF in equation 2.7 is exactly the well-known Kalman filter recursion. Hence, the DKF computes the exact posterior $p(zt|x1:t)$ in this special case.

### 4.2  Kalman Observation Mixtures

This example and the next are designed to illustrate how the gaussian approximation underlying the DKF is more similar in spirit to the G-ADF than to linearization approximations such as the Kalman filter, the EKF, and the UKF (see section 2.4). In particular, the specific observation model used in the simulation is engineered so that the state $Zt$ and the observation $Xt$ are uncorrelated (but not independent). Linearization methods are useless in this case, whereas the DKF is able to take advantage of the higher-order dependence, much like the G-ADF.

The observation model is a probabilistic mixture of Kalman observation models (see section 4.1), namely,
$p(xt|zt)=∑ℓ=1Lπℓηn(xt;bℓ+Hℓzt,Λℓ),$
for a probability vector $π=π1:L$, where each $bℓ∈Rn×1$, $Hℓ∈Rn×d$, and $Λℓ∈Sn$. At each time step, one of $L$ possible Kalman observation models is randomly and independently selected according to $π$ and then used to generate the observation. This model can be viewed as a special case of a switching state-space model with independent switching (see Shumway & Stoffer, 1991; Ghahramani & Hinton, 2000). The integrals in equation 2.11 can be efficiently computed for any choice of $νt$ and $Mt$, including $νt=0$ and $Mt=S$, so the G-ADF and the DKF can be computed exactly for this model (see section A.1 for details), although the DKF is much faster for large $n$, because it allows for more precomputation. Figure 1 illustrates that the DKF is comparable to the G-ADF in terms of root mean squared error (RMSE) for a particular instance of this model, and it also shows that the computational savings of the DKF over a particle filter with similar accuracy can be dramatic, especially as $n$ gets large.
Figure 1:

Kalman observation mixtures. This figure shows filtering performance on an instance of the model in section 4.2 for various approximation algorithms as the observation dimension $n$ increases. The hidden state dimension is $d=10$, and the state model parameters are $S=Id$, $A=0.95Id-0.05$, and $Γ=S-ASA⊺$. The number of categories is $L=2$, the category probabilities are $π=(0.5,0.5)$, and the Kalman parameters are $b1=b2=b¯=0$, $Λ1=In$, $Λ2=5In$, and $H2=-H1$, so that $H¯=0$ (see equation 4.1). The entries of $H1$ were generated as independent $N(0,d-1)$ using the Matlab 9.6 code rng(42,twister'); H = randn(1000,10)/sqrt(10). The data were generated for an observation dimension of 1000, and the plot shows filter performance using only the first $n$ dimensions of $Xt$ for selected $n$ between 1 and 1000. Filter performance was measured using RMSE (left panel) and computation time (s, right panel) on a single test sequence of length $T=104$. Because $Xt$ and $Zt$ are uncorrelated, linearization methods (e.g., KF, EKF, and UKF) ignore $Xt$ and always predict $Zt≈E(Zt)=0$, giving an RMSE of approximately 1 (black line) in this case. The accuracy of particle filtering increases with the number of particles at the expense of increased computation, and we show performance for different numbers of particles: $101,102,103,104,105$ (blue lines, ordered as expected). We also show RMSE for the optimal prediction using only $Xt$ (as opposed to the entire history $X1:t$), namely, $Zt≈E(Zt|Xt)=f(Xt)$ (dotted red line). (This serves to demonstrate the performance gain that filtering provides.) Finally, we caution that the model parameters have much more influence on the relative performance of the different gaussian approximation methods when $n$ is small than when $n$ is large. The parameters in this model were chosen so that the DKF also performs well for small $n$, even though we only have guarantees about its performance in the large $n$ setting.

Figure 1:

Kalman observation mixtures. This figure shows filtering performance on an instance of the model in section 4.2 for various approximation algorithms as the observation dimension $n$ increases. The hidden state dimension is $d=10$, and the state model parameters are $S=Id$, $A=0.95Id-0.05$, and $Γ=S-ASA⊺$. The number of categories is $L=2$, the category probabilities are $π=(0.5,0.5)$, and the Kalman parameters are $b1=b2=b¯=0$, $Λ1=In$, $Λ2=5In$, and $H2=-H1$, so that $H¯=0$ (see equation 4.1). The entries of $H1$ were generated as independent $N(0,d-1)$ using the Matlab 9.6 code rng(42,twister'); H = randn(1000,10)/sqrt(10). The data were generated for an observation dimension of 1000, and the plot shows filter performance using only the first $n$ dimensions of $Xt$ for selected $n$ between 1 and 1000. Filter performance was measured using RMSE (left panel) and computation time (s, right panel) on a single test sequence of length $T=104$. Because $Xt$ and $Zt$ are uncorrelated, linearization methods (e.g., KF, EKF, and UKF) ignore $Xt$ and always predict $Zt≈E(Zt)=0$, giving an RMSE of approximately 1 (black line) in this case. The accuracy of particle filtering increases with the number of particles at the expense of increased computation, and we show performance for different numbers of particles: $101,102,103,104,105$ (blue lines, ordered as expected). We also show RMSE for the optimal prediction using only $Xt$ (as opposed to the entire history $X1:t$), namely, $Zt≈E(Zt|Xt)=f(Xt)$ (dotted red line). (This serves to demonstrate the performance gain that filtering provides.) Finally, we caution that the model parameters have much more influence on the relative performance of the different gaussian approximation methods when $n$ is small than when $n$ is large. The parameters in this model were chosen so that the DKF also performs well for small $n$, even though we only have guarantees about its performance in the large $n$ setting.

Define $b¯=∑ℓπℓbℓ$ and $H¯=∑ℓπℓHℓ$ so that
$E(Xt|Zt)=b¯+H¯Zt.$
(4.1)
An interesting special case of this model is when $H¯=0$, so that the mean of $Xt$ given $Zt$ does not depend on $Zt$, and, consequently, $Xt$ and $Zt$ are uncorrelated. Information about the states is found only in higher-order moments of the observations. Algorithms that are designed around $E(Xt|Zt)$, such as the Kalman filter, EKF, and UKF, are not useful when $H¯=0$, illustrating the important difference between a gaussian approximation for the observation model and the DKF approximation in equation 2.2. The simulation in Figure 1 used $H¯=0$, and the ineffectiveness of linearization techniques is easily seen.

### 4.3  Independent Bernoulli Mixtures

Here we describe a model where observations take values in ${0,1}n$ to further emphasize that our gaussian approximation is in the state space, not in the observation space. Like the example in section 4.2, this example is also engineered so that the states and observations are uncorrelated, rendering linearization-based methods ineffective (see section 2.4). Finally, the specific parameters of this example are chosen to have the peculiar property that the DKF is efficiently computable, whereas the G-ADF is not (insofar as we can tell).

The observation model is a probabilistic mixture of conditionally independent Bernoulli random variables, namely,
$p(xt|zt)=∑ℓ=1Lπℓ∏i=1ngℓi(zt)xti(1-gℓi(zt))1-xti,$
for a probability vector $π=π1:L$. For each $ℓ=1,…,L$ and $i=1,…,n$, the functions $gℓi:Rd×1→(0,1)$ are defined by
$gℓi(zt)=αℓi1{ztdi<γi}+βℓi1{ztdi≥γi},$
where each $γi∈R$, $αℓi,βℓi∈(0,1)$, $di∈{1,…,d}$, and where $ztk$ indicates the $k$th coordinate of $zt$. The $i$th coordinate of $Xt$ depends on $Zt$ only through the $di$th coordinate of $Zt$, and the probability distribution of $Xti$ is different depending on whether $Ztdi<γi$. Each of the $L$ components of the mixture changes the probability distribution of $Xti$, via $αℓi$ and $βℓi$, but it does not change the corresponding coordinate $di$ or the change point $γi$.

For the state dynamics, we use $S=Id$, which makes it possible to compute $f(zt)$ and $Q(zt)$ exactly (see section A.2). In general, however, the integrals in equation 2.11 are not easily evaluated, so the G-ADF is not a practical approximation technique in this example. Figure 2 suggests that the DKF approximation performs well for a particular instance of this model, in the sense that the DKF's RMSE is near or better than that of a particle filter with a large number of particles. The figure also shows that the computational savings over a particle filter with similar accuracy can be dramatic, especially as $n$ gets large.

Figure 2:

Independent Bernoulli mixtures. This figure shows filtering performance on an instance of the model in section 4.3 for various approximation algorithms as the observation dimension $n$ increases. The state model $(Zt)$ and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. (Using this many particles with higher $n$ was too time-consuming.) The number of categories is $L=2$, the category probabilities are $π=(0.5,0.5)$, and for each $i$, $α1i=β2i=0.01$ and $α2i=β1i=0.99$, so that each $g¯i≡0.5$ (see equation 4.2). The $d1:n$ were chosen as independent uniform${1,…,d}$, and the $γ1:n$ were chosen as independent $N(0,1)$.

Figure 2:

Independent Bernoulli mixtures. This figure shows filtering performance on an instance of the model in section 4.3 for various approximation algorithms as the observation dimension $n$ increases. The state model $(Zt)$ and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. (Using this many particles with higher $n$ was too time-consuming.) The number of categories is $L=2$, the category probabilities are $π=(0.5,0.5)$, and for each $i$, $α1i=β2i=0.01$ and $α2i=β1i=0.99$, so that each $g¯i≡0.5$ (see equation 4.2). The $d1:n$ were chosen as independent uniform${1,…,d}$, and the $γ1:n$ were chosen as independent $N(0,1)$.

Define $g¯i=∑ℓπℓgℓi$, so that
$E(Xti|Zt)=P(Xti=1|Zt)=g¯i(Zt).$
(4.2)
An interesting special case of this model is when $g¯i$ is constant for each $i$, so that the mean of $Xt$ given $Zt$ does not depend on $Zt$, and, consequently, $Xt$ and $Zt$ are uncorrelated. As in the previous section, linearization approximations like the Kalman filter, EKF, and UKF are not useful when $g¯i$ is constant. Furthermore, when $g¯i$ is constant, $Xti$ and $Zt$ are independent, that is, individual coordinates of the observations carry no information about the states. Only the vector of observations $Xt$ can be used for meaningful predictions of $Zt$. The simulation in Figure 2 used $g¯i≡0.5$ for all $i$, so that each coordinate of the observations is independent of the state.

### 4.4  Kalman Observation Mixtures with Nonlinear State Dynamics

This example illustrates how the DKF approximation can be combined with other filtering approximations for use with nonlinear state dynamics (see section 2.5). We include it here as a proof of concept and leave for future work a more thorough exploration of when the DKF approximation is useful for filtering with nonlinear state dynamics. We use the same mixture of Kalman observation models from section 4.2 but modify the state dynamics in equation 2.1 as follows. Define the $2×2$ rotation matrix $R(θ)=sinθ-cosθcosθsinθ$, and for even $d$, define the $d×d$ rotation matrix $Rd(θ)$ to be the block-diagonal matrix with $R(θ)$ repeated along the diagonal:
$Rd(θ)=R(θ)0⋯00R(θ)⋯0⋮⋮⋱⋮00⋯R(θ).$
Define the function $a:Rd×1→Rd×1$ via $a(z)=ARd(|z|)z$, where $|·|$ denotes the Euclidean norm. The new state dynamics are
$p(z0)=ηd(z0;0,S),p(zt|zt-1)=ηd(zt;a(zt-1),Γ),$
for $t=1,2,…$, where $S=ASA⊺+Γ$. These are the same dynamics as before except that the conditional mean of $Zt$ given $Zt-1$ has changed from the linear function $AZt-1$ to the nonlinear function $a(Zt-1)$. In particular, before being multiplied by $A$, the state vector is rotated by an amount that depends on its length. This type of nonlinearity was chosen because when $S=Id$ (as in our examples), $Zt$ remains marginally gaussian, an important part of the DKF approximation.

We use an unscented Kalman filter (UKF) approximation for the state dynamics; that is, we replaced $νt$ and $Mt$ in equations 2.7 and 2.8 with the mean and covariance obtained from performing the unscented transform (Julier & Uhlmann, 1997). We used Matlab's unscentedKalmanFilter implementation with alpha=1,beta=kappa=0. The UKF approximations of $νt$ and $Mt$ can also be substituted directly into the G-ADF used in section 4.2.

Figure 3 shows filtering performance for a specific instance of this model and illustrates that at least in this case, a DKF approximation for nonlinear, nongaussian observation models can be usefully combined with other approximations for nonlinear state dynamics and that there is little loss of performance compared to the G-ADF.

Figure 3:

Nonlinear state dynamics. This figure shows filtering performance on an instance of the model in section 4.4 for various approximation algorithms as the observation dimension $n$ increases. The observation model $(Xt|Zt)$ and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. The state model is now nonlinear, and $μt$ and $Mt$ in the DKF, robust DKF, and G-ADF are approximated using a UKF.

Figure 3:

Nonlinear state dynamics. This figure shows filtering performance on an instance of the model in section 4.4 for various approximation algorithms as the observation dimension $n$ increases. The observation model $(Xt|Zt)$ and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. The state model is now nonlinear, and $μt$ and $Mt$ in the DKF, robust DKF, and G-ADF are approximated using a UKF.

### 4.5  Unknown Observation Model: Macaque Reaching-Task Data

This example illustrates Bayesian filtering in a case where the observation model is unknown and must be learned from data. Flint, Lindberg, Jordan, Miller, and Slutzky (2012) implanted a rhesus monkey with a 96-channel microelectrode array (Blackrock Microsystems LLC) over the arm area of its primary motor cortex (M1). The monkey was trained to move a manipulandum to acquire illuminated targets for a juice reward. While performing this task, the monkey's neural spikes were recorded with a 128-channel acquisition system (Cerebus, Blackrock Microsystems LLC). The signal was sampled at 30 kHz, high-pass-filtered at 300 Hz, and then thresholded and manually sorted into spikes offline. Walker and Kording (2013) continue to make these data publicly available as part of the Database for Reaching Experiments and Models (DREAM). We used data from Flint et al. (2012) and aggregated spike counts over 100 ms bins. The first $n=10$ principal component analysis (PCA) components of neural data became the observed variable $Xt$, and we used the $d=2$-dimensional (horizontal and vertical) cursor velocity (lagged 50 ms after the end of the spike count bin) as the latent variable $Zt$.

Tables 1 and 2 compare filtering performance using various learning algorithms and filtering methods. For learning the function $f:R10→R2$ for the DKF, we experimented with Nadaraya-Watson (NW) kernel regression, neural network (NN) regression, and gaussian process (GP) regression. In each case, we learned the function $Q:R10→S2$ using NW kernel regression from the approximate residuals as in equation 3.3. For the Kalman filter, parameters are learned in the usual manner via multivariate (linear) regression. For the EKF and UKF (see section 2.4), we learned the conditional mean $h:R2→R10$ defined by
$h(z)=E(Xt|Zt=z)$
(4.3)
via neural network regression and took the conditional covariance to be constant, namely, $Λ(z)=V(Xt|Zt=z)≡Λ∈S10$, which we learned from the approximate residuals. Finally, we also experimented with an LSTM recurrent neural network for predicting $Zt$ given $X1:t$. In all cases, we used 5000 training points and a different 1000 testing points. (More details about all of these methods are in sections A.4 to A.7.)
Table 1:
Normalized RMSE (nRMSE) for Various Filtering Methods on the Flint Data Set.
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.7650.9420.7880.7930.7800.7650.805
DKF-NW −21% −18% −17% −23% −20% −23% −20%
DKF-GP −21% −19% −15% −20% −18% −20% −19%
DKF-NN −19% −15% −13% −13% −13% −17% −15%
LSTM −15% −19% −16% −13% −16% −11% −15%
EKF 2% 24% 12% 18% 12% 3% 12%
UKF 2% 31% 18% 18% 15% 6% 15%
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.7650.9420.7880.7930.7800.7650.805
DKF-NW −21% −18% −17% −23% −20% −23% −20%
DKF-GP −21% −19% −15% −20% −18% −20% −19%
DKF-NN −19% −15% −13% −13% −13% −17% −15%
LSTM −15% −19% −16% −13% −16% −11% −15%
EKF 2% 24% 12% 18% 12% 3% 12%
UKF 2% 31% 18% 18% 15% 6% 15%

Notes: The nRMSE is computed by dividing the RMSE by the root mean square of the observation vector, so that predicting identically zero would yield an nRMSE of 1. The top row shows the nRMSE of the Kalman filter. Each remaining row shows the percentage change in nRMSE relative to the Kalman filter, with methods ordered from best (top) to worst (bottom) average performance. Columns 1 to 6 refer to completely separate trials using new training and testing data. The final column gives the average performance across the six trials.

Table 2:
Mean Absolute Angular Error (Radians) for Various Filtering Methods on the Flint Data Set.
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.8890.9551.0250.9330.9640.9260.949
DKF-NW −15% −1% −20% −17% −25% −28% −18%
DKF-GP −11% 7% −22% −16% −24% −25% −15%
DKF-NN −7% −2% −17% −16% −21% −23% −14%
LSTM −2% −2% −12% −6% −10% −8% −7%
UKF 0% 3% −3% −3% −8% −6% −3%
EKF 4% 3% −2% −4% −8% −7% −2%
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.8890.9551.0250.9330.9640.9260.949
DKF-NW −15% −1% −20% −17% −25% −28% −18%
DKF-GP −11% 7% −22% −16% −24% −25% −15%
DKF-NN −7% −2% −17% −16% −21% −23% −14%
LSTM −2% −2% −12% −6% −10% −8% −7%
UKF 0% 3% −3% −3% −8% −6% −3%
EKF 4% 3% −2% −4% −8% −7% −2%

Notes: Because cursor speed is often adjustable in BCIs (Willett et al., 2019), this may provide a more informative measure of performance. See the caption for Table 1 for more details about the table arrangement. Note that $45∘=π/4≈0.79$ radians, so all of these methods have fairly substantial angular error over 100 ms prediction intervals. Chance performance would be $π/2≈1.57$ radians.

The DKF using NW kernel regression was the best method among the ones that we tried, and all versions of the DKF were near the top in performance. Under the mean absolute angular error (MAAE) metric (Simeral, Kim, Black, Donoghue, & Hochberg, 2011), each version of the DKF outperformed prediction using the corresponding $f^$, illustrating the benefit of filtering to combine information from both past and present observations. The EKF and UKF performed poorly. We do not know the degree to which poor performance is a result of errors introduced by the EKF and UKF approximations or a result of errors introduced from learning the function $h$ in equation 4.3. All versions of the DKF outperformed the LSTM that we used. The LSTM and its variants require manually selecting a neural network architecture and several tuning parameters. This is often done by experts through trial and error. While we suspect that there is some combination of architecture and tuning parameters that would allow the LSTM to meet or exceed the DKF performance, automating this process of searching through network architecture remains an area of active research requiring extensive computational resources (Zoph & Le, 2017; Real et al., 2017).

### 4.6  Closed-Loop Decoding in a Person with Paralysis

Neural decoding for closed-loop brain-computer interfaces (BCIs) provided the motivating application for the development of the DKF. BCIs use neural measurements from the brain to enable voluntary control of external devices (Wolpaw, Birbaumer, McFarland, Pfurtscheller, & Vaughan, 2002; Hochberg & Donoghue, 2006; Brandman, Cash, & Hochberg, 2017). Intracortical BCI systems (iBCIs) have been shown to provide users with paralysis the ability to control computer cursors (Pandarinath et al., 2015; Jarosiewicz et al., 2015; Nuyujukian et al., 2018), robotic arms (Hochberg et al., 2012; Collinger et al., 2013), and functional electrical stimulation systems (Bouton et al., 2016; Ajiboye et al., 2017) with the real-time decoded neural activity generated during attempted movement. State-of-the-art decoding approaches have been based on the Kalman filter (Pandarinath et al., 2017; Jarosiewicz et al., 2015; Gilja et al., 2015), with observed neural features and latent motor intention used to move external devices. To construct a supervised training set, motor intentions are inferred as vectors from the instantaneous cursor position to the target position $Zt$ (Brandman, Hosman et al., 2018).

The DKF is a natural choice for closed-loop neural decoding using iBCIs for a few reasons. First, evidence suggests that neurons have very complex behavior. Neurons in the motor cortex have been shown to encode direction of movement (Georgopoulos, Kettner, & Schwartz, 1988), velocity (Schwartz, 1994), acceleration (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004), muscle activation (Lemon, 2008; Pohlmeyer, Solla, Perreault, & Miller, 2007), proprioception (Bensmaia & Miller, 2014), visual information related to the task (Rao & Donoghue, 2014), and preparatory activity (Churchland et al., 2012). Hence, iBCI-related recordings are highly complex and nonlinear (Vargas-Irwin, Brandman, Zimmermann, Donoghue, & Black, 2015). Moving away from the linear constraints of the Kalman filter could potentially capture more of the inherent complexity of the signals, resulting in higher end-effector control for the user.

Second, evidence suggests that the quality of control directly relates to the rate at which the decoding systems perform real-time decoding. Modern iBCI sytems update velocity estimates on the order of 20 ms (Jarosiewicz et al., 2015) or even 1 ms (Pandarinath et al., 2015). Thus, any potential filtering technique must be computationally feasible to implement for real-time use.

Third, new technologies have allowed neuroscientists to simultaneously record from increasingly large numbers of neurons. In fact, the number of observed brain signals has been growing exponentially (Stevenson & Kording, 2011). By contrast, the dimensionality of the underlying device being controlled remains small, generally not exceeding 10 dimensions (Wodlinger et al., 2015; Vargas-Irwin et al., 2010).

We previously reported how three people with spinal cord injuries used the DKF with GP regression to rapidly gain closed-loop neural control (Brandman, Burkhart et al., 2018; Brandman, Hosman et al., 2018). Here, as an additional proof of concept, we present data from a person with amyotrophic lateral sclerosis (participant T9) using the DKF. In these research sessions, the observations constitute neural data collected from an electrode array surgically implanted in the participant's brain, and the hidden states represent the intended cursor velocity. The DKF prediction of intended cursor velocity is used at each time step to move the cursor. For learning the DKF parameters, training data are collected during an initial calibration phase in which the participant is instructed to attempt to move the cursor to various target locations, and the intended velocity at each time step is assumed to be pointing from the current cursor position to the instructed target. GP regression was used to learn $f$, and for computational efficiency, $Q$ was assumed to be constant and set as the covariance of the residuals. The participant's performance using an out-of-the-box DKF was comparable to state-of-the-art decoders based on modifications of the Kalman filter designed specifically for the BrainGate2 clinical trials.

#### 4.6.1  Participant

The participant in this study was T9, a 52-year-old right-handed male with paralysis from late-stage amyotrophic lateral sclerosis (ALSFRS-R score $=$ 7; see Cedarbaum et al., 1999, for a detailed explanation of this metric). T9 underwent surgical placement of two 96-channel intracortical silicon microelectrode arrays (Maynard et al., 1997) (1.5 mm electrode length, Blackrock Microsystems, Salt Lake City, UT) in the motor cortex as previously described (Kim, Simeral, Hochberg, Donoghue, & Black, 2008; Simeral et al., 2011). Data were used from trial (postimplant) days 292 and 293.

#### 4.6.2  Signal Acquisition

Raw neural signals for each channel (electrode) were sampled at 30 kHz using the NeuroPort System (Blackrock Microsystems, Salt Lake City, UT). Further signal processing and neural decoding were performed using the xPC target real-time operating system (Mathworks, Natick, MA). Raw signals were downsampled to 15 kHz for decoding and denoised by subtracting an instantaneous common average reference (Gilja et al., 2015; Jarosiewicz et al., 2015) using 40 of the 96 channels on each array with the lowest root mean square value (selected based on their baseline activity during a 1 minute reference block run at the start of each session). The denoised signal was bandpass-filtered between 250 Hz and 5000 Hz using an 8th-order noncausal Butterworth filter (Masse et al., 2015). Spike events were triggered by crossing a threshold set at 3.5 times the root mean square amplitude of each channel, as determined by data from the reference block. The neural feature used was total power in the bandpass-filtered signal (Jarosiewicz et al., 2015; Brandman, Hosman et al., 2018). Neural features were binned in 20 ms nonoverlapping increments for decoding. We used the top 40 features ranked by signal-to-noise-ratio (Malik et al., 2015).

#### 4.6.3  Decoder Calibration

Decoder calibration was performed using the standard radial-8 task (Simeral et al., 2011; Gilja et al., 2015) using custom-built software running Matlab. An LCD monitor was placed 55 to 60 cm at a comfortable angle and orientation to T9. Targets (size $=$ 2.4 cm, visual angle $=$$2.5∘$) were presented sequentially in a pseudorandom order, alternating between one of eight radially distributed targets and a center target (radial target distance from center $=$ 12.1 cm, visual angle $=$$12.6∘$). Successful target acquisition required the user to place the cursor (size $=$ 1.5 cm, visual angle $=$ 1.6$∘$) within the target's diameter for 300 ms, before a predetermined timeout of 15 seconds. Target timeouts resulted in the cursor moving directly to the intended target, with immediate presentation of the next target.

Calibration began with 2 minutes of open-loop presentation of a cursor; that is, the cursor moved automatically to pseudorandomly presented targets in a straight path. During this time, T9 was instructed to “imagine” or “attempt” to move the computer cursor as if he had control of it. After 2 minutes, initial hyperparameters for the GP were learned. Next, T9 acquired targets for 3 minutes with 80% of the component of the decoded vector perpendicular to the vector between the cursor and the target (Jarosiewicz et al., 2013; Velliste, Perel, Spalding, Whitford, & Schwartz, 2008), in order to assist with target acquisition. GP hyperparameters were then recomputed with all of the available data. The radial-8 task was repeated two more times with the attenuated components at 50% and 20%, for a total of 11 minutes of calibration data collected. We collected 3000 data points randomly subsampled from the 11 minutes of collected data, using all 192 neural features (96 features per array, two arrays).

#### 4.6.4  Performance Measurement

We quantified the performance of the DKF decoder with the mFitts1 task (Gilja et al., 2015; Simeral et al., 2011). Under the Fitts model (Fitts, 1954), movement time (MT) varies linearly with the index of difficulty (ID) as
$MT=a·ID+b,$
(4.4)
where the parameters $a$ and $b$ depend on the input device. Parameters are estimated using linear regression on observed (ID, MT) pairs for each input method. These estimates are then used to evaluate filter performance.
A single target was presented on the screen in a pseudorandom location, with one of three pseudorandomly fixed diameters (size $=$ 1.6 cm, 3.5 cm, and 5.6 cm; visual angles 1.7$∘$, 3.7$∘$, and 5.8$∘$). Targets were acquired by having the cursor contact the target for 500 ms, within a timeout of 10 seconds. For the mFitts1 task, the index of difficulty for each trial was calculated as follows,
$ID=log2DW+1,$
where $D$ is the distance from the cursor's start position to the goal and $W$ is the sum of the target's diameter and cursor's radius. Hence, $DW$ reflects a measure of difficulty for acquiring targets.

#### 4.6.5  Results

T9 acquired 98% of targets presented over two research sessions ($N=299$) with the mFitts1 task. The Fitts regression parameters were comparable to the previously described performance by different participants (T6 and T7) using the ReFIT decoder (Gilja et al., 2015, Fig. 4.6.5, slope = $1.08±0.06,p<1.2×10-30$, intercept $=$$1.6±1.3,p<2.2×10-41$). (See Figure 4 for details.)

Figure 4:

On the left, we plot movement time versus index of difficulty for T9 during the radial-8 task. On the right, we compare Fitts metrics for the DKF to those for Kalman ReFit. In particular, the slope and intercept from the line of best fit on the left correspond to the yellow bars for slope and intercept on the right. Error bars correspond to a 95% confidence interval for each estimated parameter. Following the discussion in section 4.6.4, lower values for the slope parameter ($a$ in equation 4.4) correspond to less of an increase in movement time for more difficult targets. Estimates for the intercept parameter correspond to $b$ in equation 4.4.

Figure 4:

On the left, we plot movement time versus index of difficulty for T9 during the radial-8 task. On the right, we compare Fitts metrics for the DKF to those for Kalman ReFit. In particular, the slope and intercept from the line of best fit on the left correspond to the yellow bars for slope and intercept on the right. Error bars correspond to a 95% confidence interval for each estimated parameter. Following the discussion in section 4.6.4, lower values for the slope parameter ($a$ in equation 4.4) correspond to less of an increase in movement time for more difficult targets. Estimates for the intercept parameter correspond to $b$ in equation 4.4.

## 5  Discussion

The DKF is a novel filtering method that should prove to be a helpful addition to the filtering toolbox. It provides a fast, analytic approximation for models with linear, gaussian dynamics but nonlinear, nongaussian observations. The approximations underlying the DKF tend to improve as the dimensionality of the observation space increases relative to the dimensionality of the state space. For known models, the DKF is quite similar in nature to the G-ADF; however, when models must be learned from training data, as is the case for many practical applications, the G-ADF entails integrals that require approximation and does not provide a closed-form update. In comparison to Laplace or saddle-point approximations, the DKF provides a more global approximation to the true filtering distribution. As we demonstrate in our examples, there are many families of state-space models that render the EKF and UKF ineffective but for which the DKF performs well.

In applications where the model must be learned from supervised training data prior to filtering, off-the-shelf nonlinear or nonparametric regression tools can be used to learn the conditional mean and variance for the DKF directly, avoiding the more complicated task of learning the complete observation model $p(xt|zt)$. Using the DKF in this way appears to be novel within the large literature on state-space models. Most approaches either learn a fully generative model and invert it for filtering (this includes the of use discriminative methods for training filters derived from generative models—Abbeel, Coates, Montemerlo, Ng, & Thrun, 2005; Hess & Fern, 2009) or learn a fully discriminative model that directly predicts states from the sequence of observations. The DKF allows a generative model for the state dynamics to be combined in a principled way with a discriminative model for predicting the states from the observations at individual time steps. We think that the ability to easily incorporate off-the-shelf discriminative learning tools into a closed-form filtering equation is one of the most exciting and useful aspects of this methodology.

Many promising opportunities exist to apply and extend the DKF. For example, using a gaussian approximation for $p(zt|xt)$ can permit a more principled approach to mitigating nonstationarities that occur in the measurement model. In neural decoding, a large change in the behavior of a single neuron that occurs between model training and filter use can result in significant performance degradation for the decoder. In the DKF framework with a GP regression model for $p(zt|xt)$, one can select a kernel function that ignores large differences along any single dimension. Clinical results demonstrate that this modification allows the filter to be more robust to erratic firing patterns in an arbitrary single neuron. (See Brandman, Burkhart et al., 2018, for further details.) It seems that this approach could be readily applied more generally to increase filter resilience to nonstationarities.

While the DKF assumes an approximately gaussian posterior, for general filtering models there may also be ways to incorporate the underlying gaussian approximation for $p(zt|xt)$ to improve performance. Methods that preserve the full form of the filtering distribution, such as particle filters, could be combined with alternatively specified measurement models, as in equation 2.15, to create general-purpose filters that are both more convenient to learn from data and use in filtering applications. The DKF marks a first step in this direction.

## Appendix A:  Technical Details

This appendix provides the derivations used in sections 4.2 and 4.3, along with some information on numerical stability and details for the discriminative learning methods employed in section 4.5.

### A.1  Kalman Observation Mixtures

For the model in section 4.2 we provide analytic expressions for the integrals in equation 2.11, which are needed for the G-ADF and the DKF (using $νt=0$ and $Mt=S$ for the DKF). Define
$Utℓ=(Mt-1+Hℓ⊺Λℓ-1Hℓ)-1,ytℓ=Utℓ(Mt-1νt+Hℓ⊺Λℓ-1(xt-bℓ)),κtℓ=ηd(νt;0,Mt)ηn(xt;bℓ,Λℓ)/ηd(ytℓ;0,Utℓ).$
Then
$a=∫p(xt|zt)ηd(zt;νt,Mt)dzt=∑ℓ=1Lπℓ∫ηn(xt;bℓ+Hℓzt,Λℓ)ηd(zt;νt,Mt)dzt=∑ℓ=1Lπℓηn(xt;bℓ+Hℓνt,Λℓ+HℓMtHℓ⊺),b=∫ztp(xt|zt)ηd(zt;νt,Mt)dzt=∑ℓ=1Lπℓ∫ztηn(xt;bℓ+Hℓzt,Λℓ)ηd(zt;νt,Mt)dzt=∑ℓ=1Lπℓκtℓ∫ztηd(zt;ytℓ,Utℓ)dzt=∑ℓ=1Lπℓκtℓytℓ,c=∫ztzt⊺p(xt|zt)ηd(zt;νt,Mt)dzt=∑ℓ=1Lπℓκtℓ∫ztzt⊺ηd(zt;ytℓ,Utℓ)dzt=∑ℓ=1Lπℓκtℓ(Utℓ+ytℓytℓ⊺).$

### A.2  Independent Bernoulli Mixtures

For the model in section 4.3, we provide analytic expressions for the integrals in equation 2.11 for the special case of $νt=0$ and $Mt=S=Id$, which are needed for the DKF. For each $k=1,…,d$, define $Nk={i:di=k}$, $Γk={γi:i∈Nk}$, $nk=|Γk|$, and let $γk,1<⋯<γk,nk$ denote the sorted (distinct) values in $Γk$, using $γk,0=-∞$ and $γk,nk+1=+∞$. Using $η(u)=η1(u;0,1)$ to denote the standard normal pdf and $φ(v)=∫-∞vη(u)du$ to denote the corresponding distribution function, define
$Φkj=∫γk,j-1γk,jη(u)du=φ(γk,j)-φ(γk,j-1),Φkj'=∫γk,j-1γk,juη(u)du-Φkj=η(γk,j-1)-η(γk,j)-Φkj,Φkj''=∫γk,j-1γk,ju2η(u)du-Φkj-2Φkj'=γk,j-1η(γk,j-1)-γk,jη(γk,j)-2Φkj',ρℓij=αℓi1{γk,j≤γi}+βℓi1{γi<γk,j},(i∈Nk),$
for $k=1,…,d$ and $j=1,…,nk+1$ and $ℓ=1,…,L$.
Let $xtNk=(xti:i∈Nk)$ and define
$Dℓkj(xtNk)=∏i∈Nkρℓijxti(1-ρℓij)1-xti,pℓ(xtNk|ztk)=∏i∈Nkαℓixti(1-αℓi)1-xti1{ztk<γi}+βℓixti(1-βℓi)1-xti1{ztk≥γi}=∑j=1nk+11{γk,j-1≤ztk<γk,j}Dℓkj(xtNk),$
so that $p(xt|zt)=∑ℓ=1Lπℓ∏k=1dpℓ(xtNk|ztk)$ and (with $S=Id$),
$p(xt|zt)ηd(zt;0,S)=p(xt|zt)∏k=1dη(ztk)=∑ℓ=1Lπℓ∏k=1dpℓ(xtNk|ztk)η(ztk).$
Hence, using $δkr=1{k=r}$,
$a=∫p(xt|zt)ηd(zt;0,S)dzt=∑ℓ=1Lπℓ∏k=1d∫pℓ(xtNk|ztk)η(ztk)dztk=∑ℓ=1Lπℓ∏k=1d∑j=1nk+1Dℓkj(xtNk)∫γk,j-1γk,jη(ztk)dztk=∑ℓ=1Lπℓ∏k=1d∑j=1nk+1Dℓkj(xtNk)Φkj,br=∫ztrp(xt|zt)ηd(zt;0,S)dzt=∑ℓ=1Lπℓ∏k=1d∑j=1nk+1Dℓkj(xtNk)(Φkj+Φkj'δkr),crs=∫ztrztsp(xt|zt)ηd(zt;0,S)dzt=∑ℓ=1Lπℓ∏k=1d∑j=1nk+1Dℓkj(xtNk)(Φkj+Φkj'δkr+Φkj'δks+Φkj''δkrδks),$
where in equation 2.11, the vector $b=(br:r=1,…,d)$ and the matrix $c=(crs:r,s=1,…,d)$. We have $f(x)=b/a$ and $Q(x)=c/a-f(x)f(x)⊺$.

### A.3  Measures to Prevent Numerical Instabilities

The covariance matrix $Σt$ must be positive definite for the DKF algorithm to make sense. As $n$ gets large, using $Q(xt)=V(Zt|Xt=xt)$, the probability that $Σt$ is positive definite goes to 1 (see 7 below). However, when $n$ is small or when $Q$ is learned, $Σt$ will often not be positive definite. An easy remedy is to force $Q-1(x)-S-1$ to be positive semidefinite for every $x$ by shrinking the (generalized) eigenvalues of $Q(x)$ for any $x$ where this constraint is not satisfied. In particular, beginning with a target $Q=Q(x)$ for a given fixed $x$, consider the generalized eigenvalue decomposition $QV=SVD$, where $V∈Rd×d$ is invertible and $D∈Rd×d$ is diagonal. (This decomposition can be computed in Matlab using [V,D]=eig(Q,S).) Let $D∧1$ denote the element-wise minimum of $D$ and 1, and define $Q'=SV(D∧1)V-1$. By redefining $Q(x)$ as $Q'$, we will ensure that $Q-1(x)-S-1$ is positive semidefinite, as required. Moreover, $Q'$ will be the same as the original $Q$ if this condition was already satisfied by the original $Q$, showing that this modification to the DKF algorithm does not affect our asymptotic analysis. We used this modification for all of the experiments with the DKF. The robust DKF does not require this modification. Here is a proof of the claims about this method: $Q-1-S-1$ is positive semidefinite if and only if $S-Q$ is positive semidefinite if and only if $S-1/2(S-Q)S-1/2$ is positive semidefinite. We have $S-1/2(S-Q)S-1/2=S-1/2(S-SVDV-1)S-1/2=I-S1/2VD(S1/2V)-1=(S1/2V)(I-D)(S1/2V)-1$, which is positive semidefinite if and only if all entries of $D$ (which is diagonal) are $≤1$. Replacing $D$ with $D∧1$ exactly enforces this constraint.

For our DKF experiments with nonlinear state dynamics using an extended Kalman filter (EKF) approximation (not described here), we found that the DKF-EKF became unstable for small $n$ because the EKF approximation to the nonlinearity was quite poor. To remedy this, we modified the DKF algorithm to prevent $μt$ from diverging too far from $νt$ and $f(xt)$ (the posterior means of $Zt$ given $X1:t-1$ and given $Xt$, respectively). In particular, we forced $|μt|2≤|νt|2+|f(xt)|2$ (by scaling $μt$ whenever its norm exceeded our bound). For larger $n$, once the DKF approximation becomes more accurate, this constraint was always satisfied in our experiments without intervention, but for smaller $n$, enforcing it was important for preventing numerical instabilities. The robust DKF did not require this modification. Although not used in this letter, we report this modification in case others find it useful in their application.

We can learn $f:Rn→Rd$ with a variety of regression methods. The well-known Nadaraya-Watson kernel regression estimator (Nadaraya, 1964; Watson, 1964) is
$f^(x)=∑i=1mzi'κX(x,xi')∑i=1mκX(x,xi'),$
where the $κX(x,x')$ is a nonnegative kernel and $m$ is the size of the training set. Bandwidth can be chosen using rule-of-thumb or with leave-one-out cross-validation, the latter scaling as $O(m2)$. Evaluation of $f^$ scales like $O(m)$. In the examples, we use a gaussian kernel with a bandwidth chosen by minimizing leave-one-out mean squared error (MSE) on the training set.

### A.5  Neural Network Regression

We can also learn $f$ as a neural network (NN). NNs are attractive for online filtering, because evaluation of $f^$ scales $O(1)$ with the size of the training set. With MSE as an objective function, we optimize parameters over the training set. Typically optimization continues until performance stops improving on a validation subset (to prevent overfitting), but instead we use Bayesian regularization to ensure network generalizability (MacKay, 1992; Foresee & Hagan, 1997). Training costs depend on the training algorithm chosen. Traditional optimizers include stochastic gradient descent, scaling with $O(m)$; scaled conjugate gradient, with $O(m2)$; and Levenberg-Marquardt, with $O(m3)$ (Castillo, Guijarro-Berdiñas, Fontenla-Romero, & Alonso-Betanzos, 2010), where $m$ is the size of the training set. More recently, Hessian-free approaches have been developed to train NNs on larger data sets (Schmidhuber, 2015). Training costs also grow with $d$, depending on the choice of architecture.

We implemented all feedforward neural networks with Matlab's Neural Network Toolbox R2019a. Our implementation consisted of a single hidden layer of tansig neurons trained via Levenberg-Marquardt optimization (Levenberg, 1944; Marquardt, 1963; Hagan & Menhaj, 1994) with Bayesian regularization.

### A.6  Gaussian Process Regression

Gaussian process (GP) regression is another popular method for nonlinear regression (Rasmussen & Williams, 2006). The idea is to put a prior distribution on the function $f$ and approximate $f$ with its posterior mean given training data. We first briefly describe the case $d=1$. We form an $m×n$-dimensional matrix $X'$ by concatenating the $1×n$-dimensional vectors $Xi'$ and an $m×d$-dimensional matrix $Z'$ by concatenating the vectors $Zi'$. We assume that $p(zi'|xi',f)=η(zi';f(xi'),σ2)$, where $f$ is sampled from a mean-zero GP with covariance kernel $K(·,·)$. Under this model,
$f^(x)=E(f(x)|Z',X')=K(x,X')(K(X',X')+σ2Im)-1Z',$
where $K(x,X')$ denotes the $1×m$ vector with $i$th entry $K(x,Xi')$, $K(X',X')$ denotes the $m×m$ matrix with $ij$th entry $K(Xi',Xj')$, $Z'$ is a column vector, and $Im$ is the $m×m$ identity matrix. The noise variance $σ2$ and any parameters controlling the kernel shape are hyperparameters. For our examples, we used the radial basis function kernel with two parameters: length scale and maximum covariance. These hyperparameters were selected via maximum likelihood. For $d>1$, we repeated this process for each dimension to separately learn the coordinates of $f$. Training costs for a single dimension scale as $O(m3)$. Sparse approximations to GPs can reduce training requirements to $O(m·NS2)$ where $NS$ is the size of the sparse GP (Quiñonero Candela & Rasmussen, 2005). Evaluation of $f^$ scales $O(m)$ for each dimension, or $O(NS)$ for sparse approximations.

All GP training was performed using the publicly available GPML package (Rasmussen & Nickisch, 2010).

### A.7  Comparison with a Long Short-Term Memory Neural Network

An LSTM is a stateful recurrent neural network designed to overcome error backflow problems (Hochreiter & Schmidhuber, 1997). Such recurrent neural networks have previously been shown to outperform state-of-the-art Kalman-based filters on this primate neural decoding task and so provide a good point of comparison (Sussillo, Stavisky, Kao, Ryu, & Shenoy, 2016; Sussillo et al., 2012; Pandarinath et al., 2018; Hosman et al., 2019). While there are many variants on the LSTM architecture, none seem to universally improve on the basic design (Jozefowicz, Zaremba, & Sutskever, 2015; Greff, Srivastava, Koutník, Steunebrink, & Schmidhuber, 2016). LSTM optimization uses many of the same methods that work for feedforward NN's (Schmidhuber, 2015). Training and evaluation requirements are similar.

All LSTM trials were conducted with TensorFlow r1.4.0 in a Python 3.6.8 environment. The LSTM cell used in these trials was built from scratch in TensorFlow following Gers, Schmidhuber, and Cummins (2000). Dropout was used to prevent overfitting (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014), but it was only applied to feedforward connections, not recurrent connections (Pham, Bluche, Kermorvant, & Louradour, 2014; Zaremba, Sutskever, & Vinyals, 2014). The recurrent states and outputs at each intermediate time step were batch-normalized to accommodate internal covariate shift (Ioffe & Szegedy, 2015). Model parameters were initialized via a Xavier-type method (Glorot & Bengio, 2010) designed to stabilize variance from layer to layer. Optimization was then performed with Adadelta (Zeiler, 2012), an algorithm designed to improve upon Adagrad (Duchi, Hazan, & Singer, 2011) with the explicit goals of decreasing sensitivity to hyperparameters and permitting the learning rate to sometimes increase.

## Appendix B:  Mathematical Results

Our main technical result is 2. After stating the theorem, we translate it into the setting of the letter. Probability density functions (pdfs) are with respect to Lebesgue measure over $Rd$. $∥·∥1$ and $∥·∥∞$ denote the $L1$ and $L∞$ norms, respectively, $→w$ denotes weak convergence of probability measures (equivalent, for instance, to convergence of the expected values of bounded continuous functions), and $δc$ denotes the unit point mass at $c∈Rd$. Define the Markov transition density $τ(y,z)=ηd(z;Ay,Γ)$, and let $τh$ denote the function
$(τh)(z)=∫τ(y,z)h(y)dy$
for an arbitrary, integrable $h$. Define $p(z)=ηd(z;0,S)$, where $S$ satisfies $S=ASA⊺+Γ$.
Theorem 2.
Fix pdfs $sn$ and $un$$(n≥1)$ so that the pdfs
$pn=unτsn/p∥unτsn/p∥1$
(B.1)
are well defined for each $n$. Suppose that for some $b∈Rd$ and some probability measure $P$ over $Rd$:
1. $sn→wP$ as $n→∞$;

2. There exists a sequence of gaussian pdfs $(sn')$ such that $∥sn-sn'∥1→0$ as $n→∞$;

3. $un→wδb$ as $n→∞$;

4. There exists a sequence of gaussian pdfs $(un')$ such that $∥un-un'∥1→0$ as $n→∞$;

5. $pn→wδb$ as $n→∞$;

Then:

1. $sn'→wP$ as $n→∞$;

2. $un'→wδb$ as $n→∞$;

3. The pdf
$pn'=un'τsn'/p∥un'τsn'/p∥1$
is well defined and gaussian for $n$ sufficiently large;
4. $pn'→wδb$ as $n→∞$;

5. $∥pn-pn'∥1→0$ as $n→∞$.

Remark 1.

The $L1$ distance between pdfs is equivalent to the total variation distance between the respective probability measures.

Remark 2.

We are not content to show the existence of a sequence of gaussian pdfs $(pn')$ that satisfy C4 and C5. Rather, we are trying to show that the specific $pn'$ defined in C3 satisfies C4 and C5 regardless of the choice of $un'$ and $sn'$.

Remark 3.
An inspection of the proof shows that the pdf
$rn'=pn'p/∥pn'p∥1=un'τsn'/∥un'τsn'∥1$
is well defined and gaussian, with $rn'→wδb$ and
$∥pn-rn'∥1≤An+Bn+Cn,$
where the terms $An,Bn,Cn$ are those defined in equation B.2, each of which tends to zero in the limit. Thus $∥pn-rn'∥1→0$. These $rn'$ are precisely the estimates formed using the robust DKF.
Remark 4.

Suppose the pdfs $sn,sn',un,un'$$(n≥1)$, the constant $b$, and the probability measure $P$ are themselves random, defined on a common probability space, so that $pn$ is well defined with probability one, and suppose that the limits in A1 to A5 hold in probability. Then the probability that $pn'$ is a well-defined, gaussian pdf converges to one, and the limits in C1 to C5 hold in probability.

For the setting of this letter, first fix $t≥1$, and note that $p$ is the common pdf of each $Zt$ and $τ$ is the common conditional pdf of $Zt$ given $Zt-1$. The limit of interest is for increasing dimension ($n$) of a single observation. To formalize this, we let each $Xt$ be infinite dimensional and consider observing only the first $n$ dimensions, denoted $Xt1:n∈Rn$. Similarly, $X1:t1:n=(X11:n,…,Xt1:n)$. We will abuse notation and use $P(Zt=·|W)$ to denote the conditional pdf of $Zt$ given another random variable $W$. These conditional pdfs (formally defined via disintegrations) exist under very mild regularity assumptions (Chang & Pollard, 1997). Note that we are in the setting of 6, where the randomness comes from $X1:t,Z1:t$. With this in mind, define
$un(·)=un(·;Xt1:n)=P(Zt=·|Xt1:n)un'(·)=un'(·;Xt1:n)=ηd(·;fn(Xt1:n),Qn(Xt1:n))sn(·)=sn(·;X1:t-11:n)=P(Zt-1=·|X1:t-11:n)(t>1)sn'(·)=sn'(·;X1:t-11:n)=ηd(·;μt-1,n(X1:t-11:n),Σt-1,n(X1:t-11:n))(t>1)pn(·)=pn(·;X1:t1:n)=P(Zt=·|X1:t1:n)pn'(·)=pn'(·;X1:t1:n)=ηd(·;μt,n(X1:t1:n),Σt,n(X1:t1:n))b=ZtP(·)=P(·;Zt-1)=δZt-1(t>1),$
and define $sn≡sn'≡P≡p$ when $t=0$. The pdf $un'$ is our gaussian approximation of the conditional pdf of $Zt$ for a given $Xt1:n$. We have added the subscript $n$ to $f$ and $Q$ from the main text to emphasize the dependence on the dimensionality of the observations. The pdfs $sn'$ and $pn'$ are our gaussian approximations of $Zt-1$ and $Zt$ given $X1:t-11:n$ and $X1:t1:n$, respectively. Again, we added the subscript $n$ to $μt$ and $Σt$ from the text. Note that equation B.1 is simply a condensed version of equation 2.4 in the main text, and, for the same reason, the $pn'$ defined in C3 is the same $pn'$ defined above.

The Bernstein–von Mises (BvM) theorem gives conditions for the existence of functions $fn$ and $Qn$ so that A3 to A4 hold in probability. We refer readers to van der Vaart (1998) for details. Very loosely speaking, the BvM theorem requires $Zt$ to be completely determined in the limit of increasing amounts of data, but not completely determined after observing only a finite amount of data. The simplest case is when $Xt1:n$ are conditionally independent and identically distributed given $Zt$, and distinct values of $Zt$ give rise to distinct conditional distributions for $Xt1:n$, but the result holds in much more general settings. A separate application of the BvM theorem gives A5 (in probability). In applying the BvM theorem to obtain A5, we also obtain the existence of a sequence of (random) gaussian pdfs $(pn'')$ such that $∥pn-pn''∥1→0$ (in probability), but we do not make use of this result, and, as explained in 4, we care about the specific sequence $(pn')$ defined in C3.

As long as the BvM theorem is applicable, the only remaining thing to show is A1 and A2 (in probability). For the case $t=1$, we have $sn≡sn'≡P≡p$, so A1 and A2 are trivially true, and the theorem holds. For any case $t>1$, we note that $sn$ and $sn'$ are simply $pn$ and $pn'$, respectively, for the case $t-1$. So the conclusions C4 and C5 in the case $t-1$ become the assumptions A1 and A2 for the subsequent case $t$. The theorem then holds for all $t≥1$ by induction. The key conclusion is C5, which says that our gaussian filter approximation $pn'$ will be close in total variation distance (see 3) to the true Bayesian filter distribution $pn$ with high probability when $n$ is large.

Proof.
C1 follows immediately from A1 and A2. C2 follows immediately from A3 and A4. C3 and C4 are proved in 7 below. To show C5, we first bound
$∥pn-pn'∥1≤∥pn-pnpp(b)∥1︸An+∥pnpp(b)-pnp∥pnp∥1∥1︸Bn+∥pnp∥pnp∥1-pn'p∥pn'p∥1∥1︸Cn+∥pn'p∥pn'p∥1-pn'pp(b)∥1︸Bn'+∥pn'pp(b)-pn'∥1︸An'.$
(B.2)
Since $pn→wδb$ and $p(z)$ is bounded and continuous,
$An=∫pn|1-pp(b)|=EZn∼pn|1-p(Zn)p(b)|→|1-p(b)p(b)|=0$
and
$Bn=∫pnp∥pnp∥1|∥pnp∥1p(b)-1|=|∥pnp∥1p(b)-1|=|EZn∼pn|p(Zn)|p(b)-1|→|p(b)p(b)-1|=0.$
Similarly, since $pn'→wδb$,
$An'=∫pn'|1-pp(b)|=EZn∼pn'|1-p(Zn)p(b)|→|1-p(b)p(b)|=0$
and
$Bn'=∫pn'p∥pn'p∥1|∥pn'p∥1p(b)-1|=|∥pn'p∥1p(b)-1|=|EZn∼pn'|p(Zn)|p(b)-1|→|p(b)p(b)-1|=0.$
All that remains is to show that $Cn→0$.
We first observe that
$pnp∥pnp∥1=unτsn∥unτsn∥1andpn'p∥pn'p∥1=un'τsn'∥un'τsn'∥1.$
Define
$α=E(Y,Z)∼P×δbηd(Z;AY,Γ)=EY∼Pηd(b;AY,Γ)∈(0,∞).$
Since $sn→wP$, $un→wδb$, and $(z,y)↦τ(y,z)=ηd(z;Ay,Γ)$ is bounded and continuous, we have
$∥unτsn∥1=∫∫ηd(z;Ay,Γ)sn(y)un(z)dydz=E(Yn,Zn)∼sn×unηd(Zn;AYn,Γ)→α.$
Similarly, since $sn'→wP$ and $un'→wδb$,
$∥un'τsn'∥1=∫∫ηd(z;Ay,Γ)sn'(y)un'(z)dydz=E(Yn,Zn)∼sn'×un'ηd(Zn;AYn,Γ)→α.$
Defining $β=ηd(0;0,Γ)∈(0,∞)$, gives
$∥τh∥∞≤supz|(τh)(z)|≤supz,yηd(z;Ay,Γ)∫|h(t)|dt≤ηd(0;0,Γ)∥h∥1=β∥h∥1$
for any integrable $h$. With these facts in mind, we obtain
$Cn=∥unτsn∥unτsn∥1-un'τsn'∥un'τsn'∥1∥1≤∥unτsn∥unτsn∥1-un'τsn∥unτsn∥1∥1+∥un'τsn∥unτsn∥1-un'τsn'∥un'τsn'∥1∥1≤∥τsn∥∞∥unτsn∥1∥un-un'∥1+∥τsn∥unτsn∥1-τsn'∥un'τsn'∥1∥∞∥un'∥1≤β∥unτsn∥1∥un-un'∥1+∥τsn∥unτsn∥1-τsn∥un'τsn'∥1∥∞+∥τsn∥un'τsn'∥1-τsn'∥un'τsn'∥1∥∞≤β∥unτsn∥1∥un-un'∥1+∥τsn∥∞∥unτsn∥1|1-∥unτsn∥1∥un'τsn'∥1|+∥τsn-τsn'∥∞∥un'τsn'∥1≤β∥unτsn∥1︸→β/α∥un-un'∥1︸→0+β∥unτsn∥1︸→β/α|1-∥unτsn∥1∥un'τsn'∥1|︸→|1-α/α|=0+β∥un'τsn'∥1︸→β/α∥sn-sn'∥1︸→0.$
Since $α>0$, we see that $Cn→0$, and the proof of the theorem is complete.

6 follows from standard arguments by making use of the equivalence between convergence in probability and the existence of a strongly convergent subsequence within each subsequence. The theorem can be applied to each strongly convergent subsequence.

Lemma 1
(DKF equation). If $sn'(z)=ηd(z;an,Vn)$ and $un'(z)=ηd(z;bn,Un)$, then defining
$pn'=un'τsn'/p∥un'τsn'/p∥1$
gives
$pn'(z)=ηd(z;cn,Tn),$
where $Gn=AVnA⊤+Γ$, $Tn=(Un-1+Gn-1-S-1)-1$, and $cn=Tn(Un-1bn+Gn-1Aan)$, as long as $Tn$ is well defined and positive definite. Furthermore, if $sn'→wP$ and $un'→wδb$, then $pn'$ is eventually well defined, and $pn'→wδb$.
Proof.
See above for the definition of $τ$, $p$, $A$, $Γ$, $S$. Assuming $un'τsn'/p$ is integrable, we have
$pn'(z)∝ηd(z;bn,Un)ηd(z;0,S)∫ηd(z;Ay,Γ)ηd(y;an,Vn)dy.$
Since
$∫ηd(z;Ay,Γ)ηd(y;an,Vn)dy=ηd(z;Aan,AVnA⊤+Γ)=ηd(z;Aan,Gn)$
and
$ηd(z;bn,Un)ηd(z;0,S)∝exp(-12(z-bn)⊤Un-1(z-bn))exp(-12z⊤S-1z)∝exp-12(z⊤(Un-1-S-1)z-2z⊤Un-1bn)∝exp-12(z-bn')⊤(Un')-1(z-bn')∝ηd(z,bn',Un')$
for $Un'=(Un-1-S-1)-1$ and $bn'=Un'Un-1bn$, we have
$pn'(z)∝ηd(z;bn',Un')ηd(z;Aan,Gn)∝ηd(z;Tn((Un')-1bn'+Gn-1Aan),Tn)=ηd(z;cn,Tn).$
As the normal density integrates to 1, the proportionality constant drops out.
Now suppose in addition that $sn'→wP$ and $un'→wδb$. Consider the characteristic functions
$φsn'(t)=EX∼sn'[eitX]=eit⊤an-12t⊤Vntandφun'(t)=eit⊤bn-12t⊤Unt$
for these random variables. Lévy's continuity theorem (theorem 2.13 in van der Vaart, 1998) implies that $φsn'(t)→φP(t)$ and $φun'(t)→φδb(t)$ for all $t∈Rd$ where
$φP(t)=eit⊤a-12t⊤Vtandφδb(t)=eit⊤b$
denote the characteristic functions for $P$ and $δb$, respectively. Here, $a$ and $V$ are the mean vector and covariance matrix, respectively, of the distribution $P$, which must itself be gaussian, although possibly degenerate. It follows that
$(it⊤an-12t⊤Vnt)→(it⊤a-12t⊤Vt)$
and as $φsn'(-t)→φP(-t)$,
$(-it⊤an-12t⊤Vnt)→(-it⊤a-12t⊤Vt),$
so $t⊤an→t⊤a$ and $t⊤Vnt→t⊤Vt$ for all $t∈Rd$. Choosing $t$ to be coordinate vectors, we see that this implies $an→a$ and $Vn→V$ coordinate-wise. An analogous argument allows us to conclude that $bn→b$ and $Un→0d×d$. Thus, $Gn→G=AVA⊤+Γ$, which is invertible, since $Γ$ is positive definite, and so $Gn-1→G-1$.
The Woodbury matrix identity gives
$Tn=(Un-1+Gn-1-S-1)-1=Un-Un((Gn-1-S-1)-1+Un)-1Un.$
(B.3)
Since $Un→0d×d$ and $((Gn-1-S-1)-1+Un)-1→G-1-S-1$, we see that $Tn→0d×d$.
To show $Tn$ is eventually well defined and strictly positive definite, it suffices to show the same for
$Tn-1=Un-1+Dn,$
where we set $Dn=Gn-1-S-1$. For a symmetric matrix $M∈Rd×d$, let $λ1(M)≥⋯≥λd(M)$ denote its ordered eigenvalues. As a corollary to Hoffman and Wielandt's result (see corollary 6.3.8 in Horn & Johnson, 2013), it follows that
$maxj|λj(Tn-1)-λj(Un-1)|≤∥Dn∥2,$
where $∥·∥2$ denotes the Frobenius norm. Since $∥Dn∥2→∥G-1-S-1∥2<∞$, the difference between the $j$th ordered eigenvalues for $Tn-1$ and $Un-1$ is upper-bounded independent of $n$ for $1≤j≤d$. Since $Un$ is positive definite and since $Un→0d×d$, it follows that $λj(Un-1)≥λd(Un-1)=1/λ1(Un)→∞$. Hence, all eigenvalues of $Tn-1$ must eventually become positive, so that $Tn-1$ becomes positive definite, hence also $Tn$.
For the means, we have
$cn=TnUn-1bn+TnGn-1Aan.$
Because $Tn→0d×d$, $Gn-1→G-1$, and $an→a$, we have $TnGn-1Aan→0→$. Using equation B.3 for $Tn$ gives
$TnUn-1bn=bn-Un((Gn-1-S-1)-1+Un)-1bn,$
where the eventual boundedness of $(Gn-1-S-1)-1+Un)-1$ implies
$Un((Gn-1-S-1)-1+Un)-1bn→0→.$
As $bn→b$, we conclude $cn→b$. Hence, $pn'→wδb$.

## Notes

1

With $ht=bt+Htνt$, $Ct=MtHt⊺$, and $Nt=Λt+HtMtHt⊺$.

## Acknowledgments

We thank participant T9 and T9's family, the anonymous reviewers, and E. Crites for their thoughtful feedback on the manuscript; B. Travers and D. Rosler for administrative support; and C. Grant for clinical assistance. This work was supported by the National Institutes of Health: National Institute on Deafness and Other Communication Disorders, NIDCD (R01DC009899), Rehabilitation Research and Development Service, Department of Veterans Affairs (B6453R and N9228C); National Science Foundation (DMS1309004); National Institute of Health (IDeA P20GM103645, R01MH102840); Massachusetts General Hospital (MGH)–Deane Institute for Integrated Research on Atrial Fibrillation and Stroke; Joseph Martin Prize for Basic Research; the Executive Committee on Research of Massachusetts General Hospital; Canadian Institute of Health Research (336092); Killam Trust Award Foundation; and the Brown Institute of Brain Science. The content of this letter is solely our responsibility and does not necessarily represent the official views of the National Institutes of Health, the Department of Veterans Affairs, or the U.S. government.

## References

Abbeel
,
P.
,
Coates
,
A.
,
Montemerlo
,
M.
,
Ng
,
A. Y.
, &
Thrun
,
S.
(
2005
).
Discriminative training of Kalman filters
. In
Proceedings of Robotics: Science and Systems
.
Cambridge, MA
:
MIT Press
.
Ajiboye
,
A. B.
,
Willett
,
F. R.
,
Young
,
D. R.
,
Memberg
,
W. D.
,
Murphy
,
B. A.
,
Miller
,
J. P.
, …
Kirsch
,
R. F.
(
2017
).
Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: A proof-of-concept demonstration
.
Lancet
,
389
,
1821
1830
.
Arasaratnam
,
I.
, &
Haykin
,
S.
(
2009
).
Cubature Kalman filters
.
IEEE Trans. Autom. Control
,
54
(
6
),
1254
1269
.
Arasaratnam
,
I.
,
Haykin
,
S.
, &
Elliott
,
R. J.
(
2007
).
Discrete-time nonlinear filtering algorithms using Gauss–Hermite quadrature
.
Proc. IEEE
,
95
(
5
),
953
977
.
Arulampalam
,
M. S.
,
,
S.
,
Gordon
,
N.
, &
Clapp
,
T.
(
2002
).
A tutorial on particle filters for online nonlinear/non-gaussian Bayesian tracking
.
IEEE Trans. Signal Process.
,
50
(
2
),
174
188
.
Battin
,
R. H.
, &
Levine
,
G. M.
(
1970
).
Application of Kalman filtering techniques to the Apollo program
. In
C. T.
Leondes
(Ed.),
Theory and applications of Kalman filtering
.
Neuilly sur Seine
:
NATO, Advisory Group for Aerospace Research and Development
.
Beneš
,
V. E.
(
1981
).
Exact finite-dimensional filters for certain diffusions with nonlinear drift
.
Stochastics
,
5
(
1–2
),
65
92
.
Bensmaia
,
S. J.
, &
Miller
,
L. E.
(
2014
).
Restoring sensorimotor function through intracortical interfaces: Progress and looming challenges
.
Nat. Rev. Neurosci.
,
15
(
5
),
313
325
.
Bishop
,
C. H.
,
Etherton
,
B. J.
, &
Majumdar
,
S. J.
(
2001
).
Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects
.
Mon. Weather Rev.
,
129
(
3
),
420
436
.
Bouton
,
C. E.
,
Shaikhouni
,
A.
,
Annetta
,
N. V.
,
,
M. A.
,
Friedenberg
,
D. A.
,
Nielson
,
D. M.
, …
Rezai
,
A. R.
(
2016
).
Restoring cortical control of functional movement in a human with quadriplegia
.
Nature
,
533
,
247
250
.
Brandman
,
D. M.
,
Burkhart
,
M. C.
,
Kelemen
,
J.
,
Franco
,
B.
,
Harrison
,
M. T.
, &
Hochberg
,
L. R.
(
2018
).
Robust closed-loop control of a cursor in a person with tetraplegia using gaussian process regression
.
Neural Comput.
,
30
(
11
),
2986
3008
.
Brandman
,
D. M.
,
Cash
,
S. S.
, &
Hochberg
,
L. R.
(
2017
).
Review: Human intracortical recording and neural decoding for brain-computer interfaces
.
IEEE Trans. Neural Syst. Rehabil. Eng.
,
25
,
1687
1696
.
Brandman
,
D. M.
,
Hosman
,
T.
,
Saab
,
J.
,
Burkhart
,
M. C.
,
Shanahan
,
B. E.
,
Ciancibello
,
J. G.
, …
Hochberg
,
L. R.
(
2018
).
Rapid calibration of an intracortical brain–computer interface for people with tetraplegia
.
J. Neural Eng.
,
15
(
2
),
1
14
.
Brown
,
R. G.
, &
Hwang
,
P. Y. C.
(
2012
).
Introduction to random signals and applied Kalman filtering
, 4th ed.
Hoboken, NJ
:
Wiley
.
Buehner
,
M.
,
McTaggart-Cowan
,
R.
, &
Heilliette
,
S.
(
2017
).
An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: VarEnKF
.
Mon. Weather Rev.
,
145
(
2
),
617
635
.
Burkhart
,
M. C.
(
2019
).
A discriminative approach to Bayesian filtering with applications to human neural decoding
.
PhD diss., Brown University
.
Butler
,
R. W.
(
2007
).
.
Cambridge
:
Cambridge University Press
.
Cappé
,
O.
,
Godsill
,
S. J.
, &
Moulines
,
E.
(
2007
).
An overview of existing methods and recent advances in sequential Monte Carlo
.
Proc. IEEE
,
95
(
5
),
899
924
.
Cappé
,
O.
,
Moulines
,
E.
, &
Ryden
,
T.
(
2005
).
Inference in hidden Markov models
.
Berlin
:
Springer-Verlag
.
Castillo
,
E.
,
Guijarro-Berdiñas
,
B.
,
Fontenla-Romero
,
O.
, &
Alonso-Betanzos
,
A.
(
2010
).
A very fast learning method for neural networks based on sensitivity analysis
.
J. Mach. Learn. Res.
,
7
,
1159
1182
.
Cedarbaum
,
J. M.
,
Stambler
,
N.
,
Malta
,
E.
,
Fuller
,
C.
,
Hilt
,
D.
,
Thurmond
,
B.
, &
Nakanishi
,
A.
(
1999
).
The ALSFRS-R: A revised ALS functional rating scale that incorporates assessments of respiratory function
.
J. Neurol. Sci.
,
169
(
1
),
13
21
.
Chang
,
J. T.
, &
Pollard
,
D.
(
1997
).
Conditioning as disintegration
.
Stat. Neerl.
,
51
(
3
),
287
317
.
Chen
,
Z.
(
2003
).
Bayesian filtering: From Kalman filters to particle filters, and beyond
.
Statistics
,
182
(
1
),
1
69
.
Choo
,
K.
, &
Fleet
,
D. J.
(
2001
).
People tracking using hybrid Monte Carlo filtering
. In
Proc. Int. Conf. Comput. Vis.
(
vol. 2
, pp.
321
328
).
Piscataway, NJ
:
IEEE
.
Churchland
,
M. M.
,
Cunningham
,
J. P.
,
Kaufman
,
M. T.
,
Foster
,
J. D.
,
Nuyujukian
,
P.
,
Ryu
,
S. I.
, &
Shenoy
,
K. V.
(
2012
).
Neural population dynamics during reaching
.
Nature
,
487
(
7405
),
1
20
.
Collinger
,
J. L.
,
Wodlinger
,
B.
,
Downey
,
J. E.
,
Wang
,
W.
,
Tyler-Kabara
,
E. C.
,
Weber
,
D. J.
, …
Schwartz
,
A. B.
(
2013
).
High-performance neuroprosthetic control by an individual with tetraplegia
.
Lancet
,
381
(
9866
),
557
564
.
Daum
,
F. E.
(
1984
).
Exact finite dimensional nonlinear filters for continuous time processes with discrete time measurements
. In
Proceedings of the IEEE Conf. Decis. Control
(pp.
16
22
).
Piscataway, NJ
:
IEEE
.
Daum
,
F. E.
(
1986
).
Exact finite-dimensional nonlinear filters
.
IEEE Trans. Autom. Control
,
31
(
7
),
616
622
.
Daum
,
F. E.
, &
Huang
,
J.
(
2003
).
Curse of dimensionality and particle filters
. In
Proceedings of the 2003 IEEE Aerosp. Conf. Proc.
(
vol. 4
).
Piscataway, NJ
:
IEEE
.
sdel Moral
,
P.
(
1996
).
Nonlinear filtering using random particles
.
Theory Probab. Appl.
,
40
(
4
),
690
701
.
Douc
,
R.
, &
Cappé
,
O.
(
2005
).
Comparison of resampling schemes for particle filtering
. In
Proc. Int. Symp. Image and Signal Process. Anal.
(pp.
64
69
).
Piscataway, NJ
:
IEEE
.
Doucet
,
A.
,
Godsill
,
S.
, &
Andrieu
,
C.
(
2000
).
On sequential Monte Carlo sampling methods for Bayesian filtering
.
Stat. Comput.
,
10
(
3
),
197
208
.
Duchi
,
J.
,
Hazan
,
E.
, &
Singer
,
Y.
(
2011
).
.
J. Mach. Learn. Res.
,
12
,
2121
2159
.
Elliott
,
R.
(
1994
).
Exact adaptive filters for Markov chains observed in gaussian noise
.
Automatica
,
30
(
9
),
1399
1408
.
Evensen
,
G.
(
1994
).
Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics
.
J. Geophys. Res: Oceans
,
99
,
10143
10162
.
Fitts
,
P. M.
(
1954
).
The information capacity of the human motor system in controlling the amplitude of movement
.
J. Exp. Pyschol.
,
47
(
6
),
381
391
.
Flint
,
R. D.
,
Lindberg
,
E. W.
,
Jordan
,
L. R.
,
Miller
,
L. E.
, &
Slutzky
,
M. W.
(
2012
).
Accurate decoding of reaching movements from field potentials in the absence of spikes
.
J. Neural Eng.
,
9
(
4
),
1
13
.
Foresee
,
F. D.
, &
Hagan
,
M. T.
(
1997
).
Gauss-Newton approximation to Bayesian learning
. In
Proceedings of the Int. Conf. Neural Netw.
(
3:1930
1935
).
Piscataway, NJ
:
IEEE
.
Gelb
,
A.
(
1974
).
Applied optimal estimation
.
Cambridge, MA
:
MIT Press
.
Georgopoulos
,
A. P.
,
Kettner
,
R. E.
, &
Schwartz
,
A. B.
(
1988
).
Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population
.
J. Neurosci.
,
8
(
8
),
2928
2937
.
Gerber
,
M.
, &
Chopin
,
N.
(
2015
).
Sequential quasi Monte Carlo
.
J. Roy. Stat. Soc. Ser. B (Stat. Methodol.)
,
77
(
3
),
509
579
.
Gers
,
F. A.
,
Schmidhuber
,
J.
, &
Cummins
,
F.
(
2000
).
Learning to forget: Continual prediction with LSTM
.
Neural Comput.
,
12
(
10
),
2451
2471
.
Ghahramani
,
Z.
, &
Hinton
,
G. E.
(
2000
).
Variational learning for switching state-space models
.
Neural Comput.
,
12
(
4
),
831
864
.
Gilja
,
V.
,
Pandarinath
,
C.
,
Blabe
,
C. H.
,
Nuyujukian
,
P.
,
Simeral
,
J. D.
,
Sarma
,
A. A.
, …
Henderson
,
J. M.
(
2015
).
Clinical translation of a high-performance neural prosthesis
.
Nat. Med.
,
21
(
10
),
1142
1145
.
Glorot
,
X.
, &
Bengio
,
Y.
(
2010
).
Understanding the difficulty of training deep feedforward neural networks
. In
Proceedings of the Int. Conf. Artif. Intell. Stats.
(
9:249
256
).
PMLR
.
Gordon
,
N. J.
,
Salmond
,
D. J.
, &
Smith
,
A. F. M.
(
1993
).
Novel approach to nonlinear/non-gaussian Bayesian state estimation
.
IEE Proc. F—Radar and Signal Process.
,
140
(
2
),
107
113
.
Greff
,
K.
,
Srivastava
,
R. K.
,
Koutník
,
J.
,
Steunebrink
,
B. R.
, &
Schmidhuber
,
J.
(
2016
).
LSTM: A search space odyssey
.
IEEE Trans. Neural Netw. Learn. Syst.
,
28
(
10
),
1
11
.
Grewal
,
M. S.
, &
Andrews
,
A. P.
(
2010
).
Applications of Kalman filtering in aerospace 1960 to the present
.
IEEE Control Syst. Mag.
,
30
(
3
),
69
78
.
Hagan
,
M. T.
, &
Menhaj
,
M. B.
(
1994
).
Training feedforward networks with the Marquardt algorithm
.
IEEE Trans. Neural Netw.
,
5
(
6
),
989
993
.
Hall
,
E. C.
(
1966
).
Case history of the Apollo guidance computer
.
Cambridge, MA
:
MIT Press
.
Handschin
,
J.
(
1970
).
Monte Carlo techniques for prediction and filtering of non-linear stochastic processes
.
Automatica
,
6
(
4
),
555
563
.
Handschin
,
J. E.
, &
Mayne
,
D. Q.
(
1969
).
Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering
.
Int. J. Control
,
9
(
5
),
547
559
.
Hess
,
R.
, &
Fern
,
A.
(
2009
).
Discriminatively trained particle filters for complex multi-object tracking
. In
Proceedings of Comput. Vis. Pattern Recognit.
(pp.
240
247
).
Piscataways, NJ
:
IEEE
.
Hochberg
,
L. R.
,
Bacher
,
D.
,
Jarosiewicz
,
B.
,
Masse
,
N. Y.
,
Simeral
,
J. D.
,
Vogel
,
J.
, …
Donoghue
,
J. P.
(
2012
).
Reach and grasp by people with tetraplegia using a neurally controlled robotic arm
.
Nature
,
485
(
7398
),
372
375
.
Hochberg
,
L. R.
, &
Donoghue
,
J. P.
(
2006
).
Sensors for brain-computer interfaces
.
IEEE Eng. Med. Biol. Mag.
,
25
(
5
),
32
38
.
Hochreiter
,
S.
, &
Schmidhuber
,
J.
(
1997
).
Long short-term memory
.
Neural Comput.
,
9
(
8
),
1735
1780
.
Horn
,
R. A.
, &
Johnson
,
C. R.
(
2013
).
Matrix analysis
, 2nd ed.
Cambridge
:
Cambridge University Press
.
Hosman
,
T.
,
Vilela
,
M.
,
Milstein
,
D.
,
Kelemen
,
J. N.
,
Brandman
,
D. M.
,
Hochberg
,
L. R.
, &
Simeral
,
J. D.
(
2019
).
BCI decoder performance comparison of an LSTM recurrent neural network and a Kalman filter in retrospective simulation
. In
Proceedings of the Int. IEEE EMBS Conf. Neural Eng.
Piscataway, NJ
:
IEEE
.
Hunt
,
B. R.
,
Kostelich
,
E. J.
, &
Szunyogh
,
I.
(
2007
).
Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter
.
Physica D: Nonlinear Phenom.
,
230
(
1
),
112
126
.
Ioffe
,
S.
, &
Szegedy
,
C.
(
2015
).
Batch normalization: Accelerating deep network training by reducing internal covariate shift
. In
F.
Bach
, &
D.
Blei
(Eds.),
Proceedings of the Int. Conf. Mach. Learn.
,
vol. 37
(pp.
448
456
).
PMLR
.
Ito
,
K.
(
2000
).
Gaussian filter for nonlinear filtering problems
. In
Proceedings of the IEEE Conf. Decis. Control
,
vol. 2.
Piscataway, NJ
:
IEEE
.
Ito
,
K.
, &
Xiong
,
K.
(
2000
).
Gaussian filters for nonlinear filtering problems
.
IEEE Trans. Autom. Control
,
45
,
910
927
.
Jarosiewicz
,
B.
,
Masse
,
N. Y.
,
Bacher
,
D.
,
Cash
,
S. S.
,
Eskandar
,
E.
,
Friehs
,
G.
, …
Hochberg
,
L. R.
(
2013
).
Advantages of closed-loop calibration in intracortical brain-computer interfaces for people with tetraplegia
.
J. Neural Eng.
,
10
(
4
),
1
17
.
Jarosiewicz
,
B.
,
Sarma
,
A. A.
,
Bacher
,
D.
,
Masse
,
N. Y.
,
Simeral
,
J. D.
,
Sorice
,
B.
, …
Hochberg
,
L. R.
(
2015
).
Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface
.
Sci. Transl. Med.
,
7
(
313
),
1
11
.
Jozefowicz
,
R.
,
Zaremba
,
W.
, &
Sutskever
,
I.
(
2015
).
An empirical exploration of recurrent network architectures
. In
F.
Bach
&
D.
Blei
(Eds.),
Proceedings of the Int. Conf. Mach. Learn.
,
vol. 37
(pp.
2342
2350
).
Julier
,
S. J.
, &
Uhlmann
,
J. K.
(
1997
).
New extension of the Kalman filter to nonlinear systems
.
Proc. SPIE
,
3068
,
182
193
.
Kalman
,
R. E.
(
1960
).
A new approach to linear filtering and prediction problems
.
J. Basic Eng.
,
82
(
1
),
35
45
.
Kalman
,
R. E.
, &
Bucy
,
R. S.
(
1961
).
New results in linear filtering and prediction theory
.
J. Basic Eng.
,
83
(
1
),
95
108
.
Kim
,
S.-P.
,
Simeral
,
J. D.
,
Hochberg
,
L. R.
,
Donoghue
,
J. P.
, &
Black
,
M. J.
(
2008
).
Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia
.
J. Neural Eng.
,
5
(
4
),
455
476
.
Kitagawa
,
G.
(
1996
).
Monte Carlo filter and smoother for non-Gaussian nonlinear state space models
.
J. Comput. Graph. Stat.
,
5
(
1
).
Koyama
,
S.
,
Pérez-Bolde
,
L. C.
,
Shalizi
,
C. R.
, &
Kass
,
R. E.
(
2010
).
Approximate methods for state-space models
.
J. Am. Stat. Assoc.
,
105
(
489
),
170
180
.
Kushner
,
H.
(
1967
).
Approximations to optimal nonlinear filters
.
IEEE Trans. Autom. Control
,
12
(
5
),
546
556
.
Lemon
,
R. N.
(
2008
).
Descending pathways in motor control
.
Annu. Rev. Neurosci.
,
31
,
195
218
.
Levenberg
,
K.
(
1944
).
A method for the solution of certain non-linear problems in least squares
.
Quart. Appl. Math.
,
2
,
164
168
.
Liu
,
J. S.
(
2008
).
Monte Carlo strategies in scientific computing
.
Berlin
:
Springer
.
MacKay
,
D. J. C.
(
1992
).
Bayesian interpolation
.
Neural Comput.
,
4
(
3
),
415
447
.
Majumdar
,
S. J.
,
Bishop
,
C. H.
,
Etherton
,
B. J.
, &
Toth
,
Z.
(
2002
).
Adaptive sampling with the ensemble transform Kalman filter. Part II: Field program implementation
.
Mon. Weather Rev.
,
130
(
5
),
1356
1369
.
Malik
,
W. Q.
,
Hochberg
,
L. R.
,
Donoghue
,
J. P.
,
Hochberg
,
L. R.
,
Donoghue
,
J. P.
, &
Brown
,
E. N.
(
2015
).
Modulation depth estimation and variable selection in state-space models for neural interfaces
.
IEEE Trans. Biomed. Eng.
,
62
(
2
),
570
581
.
Marquardt
,
D. W.
(
1963
).
An algorithm for least-squares estimation of nonlinear parameters
.
J. Soc. Indust. Appl. Math.
,
11
,
431
441
.
Masse
,
N. Y.
,
Jarosiewicz
,
B.
,
Simeral
,
J. D.
,
Bacher
,
D.
,
Stavisky
,
S. D.
,
Cash
,
S. S.
, …
Donoghue
,
J. P.
(
2015
).
Non-causal spike filtering improves decoding of movement intention for intracortical BCIs
.
J. Neurosci. Methods
,
244
,
94
103
.
Maynard
,
E. M.
,
Nordhausen
,
C. T.
, &
Normann
,
R. A.
(
1997
).
The Utah intracortical electrode array: A recording structure for potential brain-computer interfaces
.
Electroencephalogr. Clin. Neurophysiol.
,
102
(
3
),
228
239
.
Metropolis
,
N.
, &
Ulam
,
S.
(
1949
).
The Monte Carlo method
.
J. Am. Stat. Assoc.
,
44
(
247
),
335
341
.
Minka
,
T. P.
(
2001a
).
Expectation propagation for approximate Bayesian inference
.
Proceedings of the Conf. Uncertain. Artif. Intell.
San Mateo, CA
:
Morgan Kaufmann
.
Minka
,
T. P.
(
2001b
).
A family of algorithms for approximate Bayesian inference
.
PhD diss., MIT
.
,
E. A.
(
1964
).
On a regression estimate
.
Teor. Verojatnost. i Primenen.
,
9
,
157
159
.
Nørgaard
,
M.
,
Poulsen
,
N. K.
, &
Ravn
,
O.
(
2000
).
New developments in state estimation for nonlinear systems
.
Automatica
,
36
(
11
),
1627
1638
.
Nuyujukian
,
P.
,
Albites Sanabria
,
J.
,
Saab
,
J.
,
Pandarinath
,
C.
,
Jarosiewicz
,
B.
,
Blabe
,
C. H.
, …
Henderson
,
J. M.
(
2018
).
Cortical control of a tablet computer by people with paralysis
.
PLOS One
,
13
(
11
).
Ott
,
E.
,
Hunt
,
B. R.
,
Szunyogh
,
I.
,
Zimin
,
A. V.
,
Kostelich
,
E. J.
,
Corazza
,
M.
, …
Yorke
,
J. A.
(
2004
).
A local ensemble Kalman filter for atmospheric data assimilation
.
Tellus A
,
56
(
5
),
415
428
.
Pandarinath
,
C.
,
Gilja
,
V.
,
Blabe
,
C. H.
,
Nuyujukian
,
P.
,
Sarma
,
A. A.
,
Sorice
,
B. L.
, …
Shenoy
,
K. V.
(
2015
).
Neural population dynamics in human motor cortex during movements in people with ALS
.
eLife
,
4
.
Pandarinath
,
C.
,
Nuyujukian
,
P.
,
Blabe
,
C. H.
,
Sorice
,
B. L.
,
Saab
,
J.
,
Willett
,
F.
, …
Henderson
,
J. M.
(
2017
).
High performance communication by people with paralysis using an intracortical brain-computer interface
.
eLife
, pp.
1
27
.
Pandarinath
,
C.
,
O'Shea
,
D. J.
,
Collins
,
J.
,
Jozefowicz
,
R.
,
Stavisky
,
S. D.
,
Kao
,
J. C.
, …
Sussillo
,
D.
(
2018
).
Inferring single-trial neural population dynamics using sequential auto-encoders
.
Nat. Methods
,
15
(
10
),
805
815
.
Paninski
,
L.
,
Fellows
,
M. R.
,
Hatsopoulos
,
N. G.
, &
Donoghue
,
J. P.
(
2004
).
Spatiotemporal tuning of motor cortical neurons for hand position and velocity spatiotemporal tuning of motor cortical neurons for hand position and velocity
.
J. Clin. Neurophysiol.
,
91
,
515
532
.
Pham
,
V.
,
Bluche
,
T.
,
Kermorvant
,
C.
, &
,
J.
(
2014
).
Dropout improves recurrent neural networks for handwriting recognition
. In
Proceedings of the Int. Conf. Front. Handwriting Recognit.
(pp.
285
290
).
Piscataway, NJ
:
IEEE
.
Pohlmeyer
,
E.
,
Solla
,
S.
,
Perreault
,
E. J.
, &
Miller
,
L. E.
(
2007
).
Prediction of upper limb muscle activity from motor cortical discharge during reaching
.
J. Neural Eng.
,
4
,
369
379
.
Quang
,
P. B.
,
Musso
,
C.
, &
Le Gland
,
F.
(
2015
).
The Kalman Laplace filter: A new deterministic algorithm for nonlinear Bayesian filtering
. In
Proceedings of the Intern. Conf. Inf. Fusion
(pp.
1566
1573
).
Piscataway, NJ
:
IEEE
.
Quiñonero Candela
,
J.
, &
Rasmussen
,
C. E.
(
2005
).
A unifying view of sparse approximate gaussian process regression
.
J. Mach. Learn. Res.
,
6
,
1939
1959
.
Rao
,
N. G.
, &
Donoghue
,
J. P.
(
2014
).
Cue to action processing in motor cortex populations
.
J. Neurophysiol.
,
111
(
2
),
441
453
.
Rasmussen
,
C. E.
, &
Nickisch
,
H.
(
2010
).
Gaussian processes for machine learning (GPML) toolbox
.
J. Mach. Learn. Res.
,
11
,
3011
3015
.
Rasmussen
,
C. E.
, &
Williams
,
C. K. I.
(
2006
).
Gaussian processes for machine learning
.
Cambridge, MA
:
MIT Press
.
Real
,
E.
,
Moore
,
S.
,
Selle
,
A.
,
Saxena
,
S.
,
Suematsu
,
Y. L.
,
Le
,
Q.
, &
Kurakin
,
A.
(
2017
).
Large-scale evolution of image classifiers
. In
Proceedings of the Int. Conf. Mach. Learn.
PMLR
.
Särkkä
,
S.
(
2013
).
Bayesian filtering and smoothing
.
Cambridge
:
Cambridge University Press
.
Schmidhuber
,
J.
(
2015
).
Deep learning in neural networks: An overview
.
Neural Netw.
,
61
,
85
117
.
Schmidt
,
S. F.
,
Weinberg
,
J. D.
, &
Lukesh
,
J. S.
(
1970
).
Application of Kalman filtering to the C-5 guidance and control system
. In
C. T.
Leondes
(Ed.),
Theory and applications of Kalman filtering
.
Neuilly sur Seine, NATO, Advisory Group for Aerospace Research and Development
.
Schwartz
,
A. B.
(
1994
).
Direct cortical representation of drawing
.
Science
,
265
(
5171
),
540
542
.
Shumway
,
R. H.
, &
Stoffer
,
D. S.
(
1991
).
Dynamic linear models with switching
.
J. Am. Stat. Assoc.
,
86
(
415
),
763
769
.
Simeral
,
J. D.
,
Kim
,
S.-P.
,
Black
,
M. J.
,
Donoghue
,
J. P.
, &
Hochberg
,
L. R.
(
2011
).
Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array
.
J. Neural Eng.
,
8
(
2
),
1
21
.
Srivastava
,
N.
,
Hinton
,
G.
,
Krizhevsky
,
A.
,
Sutskever
,
I.
, &
Salakhutdinov
,
R.
(
2014
).
Dropout: A simple way to prevent neural networks from overfitting
.
J. Mach. Learn. Res.
,
15
,
1929
1958
.
Stevenson
,
I. H.
, &
Kording
,
K. P.
(
2011
).
How advances in neural recording affect data analysis
.
Nat. Neurosci.
,
14
(
2
),
139
142
.
Sugiyama
,
M.
,
Suzuki
,
T.
, &
Kanamori
,
T.
(
2012
).
Density ratio estimation in machine learning
.
Cambridge
:
Cambridge University Press
.
Sussillo
,
D.
,
Nuyujukian
,
P.
,
Fan
,
J. M.
,
Kao
,
J. C.
,
Stavisky
,
S. D.
,
Ryu
,
S.
, &
Shenoy
,
K.
(
2012
).
A recurrent neural network for closed-loop intracortical brain–machine interface decoders
.
J. Neural Eng.
,
9
(
2
),
1
21
.
Sussillo
,
D.
,
Stavisky
,
S. D.
,
Kao
,
J. C.
,
Ryu
,
S. I.
, &
Shenoy
,
K. V.
(
2016
).
Making brain–machine interfaces robust to future neural variability
.
Nat. Commun.
,
7
,
1
12
.
van der Merwe
,
R.
(
2004
).
Sigma-point Kalman filters for probabilistic inference in dynamic state-space models
.
PhD diss., Oregon Health and Science University
.
van der Vaart
,
A. W.
(
1998
).
Asymptotic statistics
.
Cambridge
:
Cambridge University Press
.
Vargas-Irwin
,
C. E.
,
Brandman
,
D. M.
,
Zimmermann
,
J. B.
,
Donoghue
,
J. P.
, &
Black
,
M. J.
(
2015
).
Spike train SIMilarity space (SSIMS): A framework for single neuron and ensemble data analysis
.
Neural Comput.
,
27
(
1
),
1
31
.
Vargas-Irwin
,
C. E.
,
Shakhnarovich
,
G.
,
,
P.
,
Mislow
,
J. M. K.
,
Black
,
M. J.
, &
Donoghue
,
J. P.
(
2010
).
Decoding complete reach and grasp actions from local primary motor cortex populations
.
J. Neurosci.
,
30
(
29
),
9659
9669
.
Velliste
,
M.
,
Perel
,
S.
,
Spalding
,
M. C.
,
Whitford
,
A. S.
, &
Schwartz
,
A. B.
(
2008
).
Cortical control of a prosthetic arm for self-feeding
.
Nature
,
453
(
7198
),
1098
101
.
Walker
,
B.
, &
Kording
,
K.
(
2013
).
The database for reaching experiments and models
.
PLOS One
,
8
(
11
).
Wan
,
E. A.
, &
van der Merwe
,
R.
(
2000
).
The unscented Kalman filter for nonlinear estimation
. In
Proceedings of the Adaptive Syst. for Signal Process., Commun., and Control Symp.
(pp.
153
158
).
Washington, DC
:
Society for Neuroscience
.
Watson
,
G. S.
(
1964
).
Smooth regression analysis
.
Sankhyā Ser. A
,
26
,
359
372
.
Willett
,
F. R.
,
Young
,
D. R.
,
Murphy
,
B. A.
,
Memberg
,
W. D.
,
Blabe
,
C. H.
,
Pandarinath
,
C.
, …
Bolu Ajiboye
,
A.
(
2019
).
Principled BCI decoder design and parameter selection using a feedback control model
.
Sci. Rep.
,
9
(
8881
).
Wodlinger
,
B.
,
Downey
,
J. E.
,
Tyler-Kabara
,
E. C.
,
Schwartz
,
A. B.
,
Boninger
,
M. L.
, &
Collinger
,
J. L.
(
2015
).
Ten-dimensional anthropomorphic arm control in a human brain machine interface: Difficulties, solutions, and limitations
.
J. Neural Eng.
,
12
(
1
),
1
17
.
Wolpaw
,
J. R.
,
Birbaumer
,
N.
,
McFarland
,
D. J.
,
Pfurtscheller
,
G.
, &
Vaughan
,
T. M.
(
2002
).
Brain-computer interfaces for communication and control
.
Clin. Neurophysiol.
,
113
(
6
),
767
791
.
Wu
,
W.
,
Black
,
M. J.
,
Gao
,
Y.
,
Bienenstock
,
E.
,
Serruya
,
M.
, &
Donoghue
,
J. P.
(
2002
).
Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter
. In
SAB'02-Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices
(pp.
66
73
).
Washington, DC
:
Society for Neuroscience
.
Zaremba
,
W.
,
Sutskever
,
I.
, &
Vinyals
,
O.
(
2014
).
Recurrent neural network regularization
.
arXiv:1409.2329
.
Zeiler
,
M. D.
(
2012
).
.
arXiv:1212.5701
.
Zoph
,
B.
, &
Le
,
Q. V.
(
2017
).
Neural architecture search with reinforcement learning
. In
Proceedings of the Int. Conf. Learn. Represent.
ICLR
.

## Author notes

B. Franco is now at NeuroPace, Mountain View, CA, U.S.A.