Abstract

The Kalman filter provides a simple and efficient algorithm to compute the posterior distribution for state-space models where both the latent state and measurement models are linear and gaussian. Extensions to the Kalman filter, including the extended and unscented Kalman filters, incorporate linearizations for models where the observation model p(observation|state) is nonlinear. We argue that in many cases, a model for p(state|observation) proves both easier to learn and more accurate for latent state estimation.

Approximating p(state|observation) as gaussian leads to a new filtering algorithm, the discriminative Kalman filter (DKF), which can perform well even when p(observation|state) is highly nonlinear and/or nongaussian. The approximation, motivated by the Bernstein–von Mises theorem, improves as the dimensionality of the observations increases. The DKF has computational complexity similar to the Kalman filter, allowing it in some cases to perform much faster than particle filters with similar precision, while better accounting for nonlinear and nongaussian observation models than Kalman-based extensions.

When the observation model must be learned from training data prior to filtering, off-the-shelf nonlinear and nonparametric regression techniques can provide a gaussian model for p(observation|state) that cleanly integrates with the DKF. As part of the BrainGate2 clinical trial, we successfully implemented gaussian process regression with the DKF framework in a brain-computer interface to provide real-time, closed-loop cursor control to a person with a complete spinal cord injury. In this letter, we explore the theory underlying the DKF, exhibit some illustrative examples, and outline potential extensions.

1  Introduction

Consider a state-space model for Z1:T:=Z1,,ZT (latent states) and X1:T:=X1,,XT (observations) represented as a Bayesian network:
Z1Zt-1ZtZTX1Xt-1XtXT.
(1.1)
The conditional density of Zt given X1:t can be expressed recursively using the Chapman-Kolmogorov equation and Bayes' rule (see Chen, 2003, for further details):
p(zt|x1:t-1)=p(zt|zt-1)p(zt-1|x1:t-1)dzt-1,
(1.2a)
p(zt|x1:t)=p(xt|zt)p(zt|x1:t-1)p(xt|zt)p(zt|x1:t-1)dzt=p(xt|zt)p(zt|x1:t-1)p(xt|x1:t-1),
(1.2b)
where p(z0|x1:0)=p(z0) and the conditional densities p(zt|zt-1) and p(xt|zt) are either specified a priori or learned from training data prior to filtering. Computing or approximating equation 1.2 is often called Bayesian filtering. Bayesian filtering arises in a large number of applications, including global positioning systems, target tracking, aircraft and spacecraft guidance, weather forecasting, computer vision, digital communications, and brain-computer interfaces (Chen, 2003; Hall, 1966; Battin & Levine, 1970; Grewal & Andrews, 2010; Buehner, McTaggart-Cowan, & Heilliette, 2017; Brown & Hwang, 2012; Schmidt, Weinberg, & Lukesh, 1970; Brandman, Cash, & Hochberg, 2017).

Exact solutions to equation 1.2 are available only in special cases, such as the Kalman filter (Kalman, 1960; Kalman & Bucy, 1961). The Kalman filter models the conditional densities p(zt|zt-1) and p(xt|zt) as linear and gaussian so that the posterior distribution p(zt|x1:t) is also gaussian and quickly computable. Beneš (1981) and Daum (1984, 1986) broadened the class of models for which the integrals in equation 1.2 are analytically tractable, but many model specifications still fall outside this class. When the latent state space is finite, the integrals in equation 1.2 become sums that can be calculated exactly using a grid-based filter (Elliott, 1994; Arulampalam, Maskell, Gordon, & Clapp, 2002). For more general models, there are many techniques for approximate Bayesian filtering see (Chen, 2003 for a review).

In some applications, parts of the underlying model are first learned from supervised training data consisting of (Zt,Xt) pairs, and then the learned model is used for filtering on new (Xt) data. For instance, (Zt,Xt) pairs might be used to learn p(xt|zt) with nonparametric conditional density estimation, and then the learned p(xt|zt), say, p^(xt|zt), is substituted into whatever algorithm is used to approximate Bayes' rule in equation 1.2b. This motivates the search for combinations of approximation algorithms and learning methods that work well together. It also opens the door to novel approximation algorithms that would not traditionally be considered for a known model but become practical when the model can be learned. For instance, from (Zt,Xt) pairs, we can choose to learn p(xt|zt) or p(zt|xt) and incorporate either into the approximation algorithm, whereas traditional approximation algorithms assume that only p(xt|zt) is available.

In this letter, we explore the idea of using a novel approximation algorithm that pairs well with learning and demonstrate its use in an intracortical brain-computer interface (iBCI) for a human volunteer with tetraplegia as part of the ongoing BrainGate2 clinical trial. Our approach focuses on the approximation of Bayes' rule in equation 1.2b, making use of the fact that p(xt|zt) can be replaced with p(zt|xt)/p(zt) throughout. (The p(xt) term cancels.) This strategy combines well with various gaussian assumptions that are often employed in approximate Bayesian filtering, resulting in what we call the discriminative Kalman filter (DKF). The DKF retains much of the computational simplicity of the classical Kalman filter but allows for arbitrary observation models. Some of our clinical research using the DKF has already been published (Brandman, Burkhart et al., 2018; Brandman, Hosman et al., 2018), and theoretical aspects of the DKF are further explored in the first author's dissertation (Burkhart, 2019).

2  The Discriminative Kalman Filter

In section 2.1, we derive the DKF approximation for a class of models that generalizes the Kalman filter by allowing for arbitrary observation models. We discuss approximation accuracy in section 2.2 and introduce a modified algorithm that can be more robust to model misspecification in section 2.3. In section 2.4, we compare the DKF formalism to a variety of existing approaches that generalize the Kalman filter, and in section 2.5, we discuss using the DKF approximation in models with nonlinear or nongaussian state dynamics.

We now introduce some notation and conventions. We let the latent states Zt take values in Rd×1 and the observations Xt take values in an abstract space X. In all of our examples, XRn×1, but this is not necessary. We use ηd(z;μ,Σ) to denote the d-dimensional multivariate gaussian distribution with mean vector μRd×1 and covariance matrix ΣSd evaluated at zRd×1, where Sd denotes the set of d×d positive-definite (symmetric) matrices. We let A refer to the transpose of a matrix A and use E and V for expected value and variance/covariance, respectively.

2.1  Filter Derivation

For the basic derivation, we assume that the latent states form a stationary, mean zero, gaussian, vector autoregressive model of order 1. Namely, for ARd×d and S,ΓSd,
p(z0)=ηd(z0;0,S),
(2.1a)
p(zt|zt-1)=ηd(zt;Azt-1,Γ),
(2.1b)
for t=1,2,, where S=ASA+Γ so that the process is stationary. Note that equation 2.1 matches the latent state model for the stationary Kalman filter. (The assumption of zero mean is easily generalized, but it is usually more convenient to center the Zt process by subtracting the common mean.)

The observation model p(xt|zt) is assumed to not vary with t, so that the joint (Zt,Xt) process is stationary but otherwise arbitrary. The observation model can be nongaussian, multimodal, or discrete, for example. For instance, in neural decoding for BCI applications, the observations are often vectors of counts of neural spiking events (binned action potentials), which might be restricted to small integers or even be binary-valued.

The DKF is based on a gaussian approximation for p(zt|xt), namely,
p(zt|xt)ηd(zt;f(xt),Q(xt)),
(2.2)
where f:XRd and Q:XSd. Note that equation 2.2 is not an approximation of the observation model, but rather of the conditional density of the latent state given the observation at a single time step. In section 2.4, we compare this to other approaches that use gaussian approximations for Bayesian filtering. When the dimensionality of the observation space (X) is large relative to the dimensionality of the state space (Rd), the Bernstein–von Mises theorem states that f and Q exist such that this approximation will be accurate, requiring only mild regularity conditions on the observation model p(xt|zt) (see section 2.2 in van der Vaart, 1998). Furthermore, we can take f and Q to be the conditional mean and covariance of Zt given Xt, namely,
f(x)=E(Zt|Xt=x),Q(x)=V(Zt|Xt=x),
(2.3)
which is the approach taken in this letter, although other choices are certainly possible, such as f(xt)=argmaxztp(zt|xt) or f(xt)=argmaxztp(xt|zt), the latter of which is most commonly used in statements of the Bernstein–von mises Theorem.
To make use of equation 2.2 for approximating equation 1.2, we first rewrite equation 1.2b in terms of p(zt|xt) as
p(zt|x1:t)=p(xt)p(xt|x1:t-1)p(zt|xt)p(zt)p(zt|x1:t-1),=p(xt)p(xt|x1:t-1)p(zt|xt)p(zt)p(zt|zt-1)p(zt-1|x1:t-1)dzt-1,
(2.4)
where the second line follows from the Chapman–Kolmogorov equation (see equation 1.2a). We then substitute the latent state model, equation 2.1, and the DKF approximation, equation 2.2, into equation 2.4. We absorb terms not depending on zt into a normalizing constant κ to obtain
p(zt|x1:t)κ(x1:t)ηd(zt;f(xt),Q(xt))ηd(zt;0,S)ηd(zt;Azt-1,Γ)p(zt-1|x1:t-1)dzt-1.
(2.5)
If p(zt-1|x1:t-1) is approximately gaussian, which it is for the base case of t=1 from equation 2.1a (defining p(z0|x1:0)=p(z0)), then all of the terms on the right side of equation 2.5 are approximately gaussian. If these approximations are exact and the analytic expression for covariance is valid (specifically if Σt in equation 2.7 is positive definite), we find that the right side of equation 2.5 is again gaussian, giving a gaussian approximation for p(zt|x1:t). We rely on the fact that dividing two gaussian pdfs yields an exponentiated quadratic form that will itself be gaussian if the associated covariance matrix is positive definite (and that the product of two gaussian pdfs is gaussian, without any additional assumptions). See the proof of 7 in appendix B for a full derivation and further details.
Let
p(zt|x1:t)ηd(zt;μt(x1:t),Σt(x1:t))
(2.6)
be the gaussian approximation of p(zt|x1:t) obtained from successively applying the approximation in equation 2.5. Defining μ0=0 and Σ0=S, we can sequentially compute μt=μt(x1:t)Rd×1 and Σt=Σt(x1:t)Sd via
νt=Aμt-1,Mt=AΣt-1A+Γ,Σt=Mt-1+Q(xt)-1-S-1-1,μt=ΣtMt-1νt+Q(xt)-1f(xt).
(2.7)
The first two steps incorporate the exact state dynamics in equation 2.1b and the final two steps incorporate the observation information using the DKF approximation in equation 2.2. The function Q needs to be defined so that Σt exists and is a proper covariance matrix. A sufficient condition that is easy to enforce in practice is Q(·)-1-S-1Sd (see section A.3 in appendix A).

Equation 2.7 encapsulates the DKF. (For pseudocode, see algorithm 1.) Once f(xt) and Q(xt) have been evaluated, there is no remaining dependence on n and a single iteration of the algorithm takes O(d3) operations, which is at least as fast as the Kalman filter (when d<n). The power of the DKF, along with potential computational difficulties, comes from evaluating f and Q. If f is linear and Q is constant, the DKF and the Kalman filter are equivalent (see section 4.1). More general f and Q allow the filter to depend nonlinearly on the observations, improving performance in many cases. If f and Q can be quickly evaluated and the dimension d of Zt is not too large, then the DKF is fast enough for use in real-time applications, such as the BCI decoding example below.

2.2  Approximation Accuracy

Let the observation space be X=Bn for some set B. As n grows, the Bernstein–von Mises (BvM) theorem guarantees under mild assumptions that the conditional distribution of Zt|Xt is asymptotically normal in total variation distance and concentrates at Zt (van der Vaart, 1998). This asymptotic normality result provides the main rationale for our key approximation expressed in equation 2.2. The BvM theorem is usually stated in the context of Bayesian estimation. To apply it in our context, we equate Zt with the parameter and Xt with the data, so that p(zt|xt) becomes the posterior distribution of the parameter at a fixed time t. We then let the dimension n of xt grow, meaning that we are observing growing amounts of data at a fixed time t associated with the parameter Zt. Very loosely speaking, the BvM theorem tends to be applicable in situations where Xt uniquely determines Zt in the limit as n but does not uniquely determine Zt for any finite n.

formula

One concern is that equation 2.4 will amplify approximation errors. Along these lines, we prove the following result that holds whenever the BvM theorem is applicable for equation 2.2:

Theorem 1.

Under mild assumptions, the total variation distance between our approximation ηd(zt;μt(x1:t),Σt(x1:t)) and the exact filtering distribution p(zt|x1:t) converges in probability to zero for each t as n.

This result is stated formally and proven in appendix B. We interpret the theorem to mean that under most conditions, as the dimensionality of the observations increases, the approximation error of the DKF tends to zero.

The proof is elementary but involves several subtleties that arise because of the p(zt) term in the denominator of equation 2.4 corresponding to ηd(zt;0,S). This term can amplify approximation errors in the tails of p(zt|xt), which are not uniformly controlled by the asymptotic normality results in the BvM theorem. To remedy this, our proof also uses the concentration results in the BvM theorem to control pathological behaviors in the tails. As an intermediate step, we prove that 1 still holds when the p(zt) term is omitted from the denominator of equation 2.4 (see remark 5 in appendix B).

2.3  Robust DKF

Omitting the p(zt) from the denominator of equation 2.4 is also helpful for making the DKF robust to violations of the modeling assumptions and errors introduced when f and Q are learned from training data. Repeating the original derivation, but without ηd(zt;0,S) in the denominator, gives the following filtering algorithm that we call the robust DKF. One can think of the robust DKF as a special case of the standard DKF where all eigenvalues of S-1 are so small that the effect of subtracting S-1 is negligible. This has the effect of placing an improper prior on Z0. Defining μ1(x1)=f(x1) and Σ1(x1)=Q(x1), we sequentially compute μt and Σt for t2 via
νt=Aμt-1,Mt=AΣt-1A+Γ,Σt=Mt-1+Q(xt)-1-1,μt=ΣtMt-1νt+Q(xt)-1f(xt).
(2.8)
(Note that we initialize at t=1 and not t=0 in the robust DKF.) Justification for the robust DKF comes from remark 5 in appendix B showing that the robust DKF accurately approximates the true p(zt|x1:t) in total variation distance for each t as n increases. We sometimes find that the robust DKF outperforms the DKF on real-data examples, but not on simulated examples that closely match the DKF assumptions. (For pseudocode, see algorithm 2.)

2.4  Other Gaussian Approximations

The DKF enforces a gaussian form for the filtering distribution p(zt|x1:t), a common strategy for approximate Bayesian filtering owing to the analytic and representational tractability of gaussians. In this section, we describe several other methods that use gaussian approximations, focusing on the case of linear, gaussian state dynamics. For this type of state dynamics, the transition from time t-1 to time t is usually separated into two distinct steps when using gaussian approximations. Beginning with
p(zt-1|x1:t-1)ηd(zt-1;μt-1,Σt-1),
the first step uses the exact state dynamics, equation 2.1b, to create a gaussian approximation for p(zt|x1:t-1), namely,
p(zt|x1:t-1)ηd(zt;νt,Mt),
(2.9)
where νt=Aμt-1 and Mt=AΣt-1A+Γ, as in equations 2.7 and 2.8. Most gaussian methods would proceed similarly for the first step under these state dynamics. Differences between methods appear for nonlinear or nongaussian state dynamics (see section 2.5).
formula
The second step attempts to incorporate the observation information xt via Bayes' rule:
p(zt|x1:t)=p(xt|zt)p(zt|x1:t-1)p(xt|zt)p(zt|x1:t-1)dzt.
Beginning with the gaussian approximation from step 1 (equation 2.9) and enforcing the final approximation,
p(zt|x1:t)ηd(zt;μt,Σt),
the problem reduces to finding μt and Σt so that
ηd(zt;μt,Σt)p(xt|zt)ηd(zt;νt,Mt)p(xt|zt)ηd(zt;νt,Mt)dzt=qt(zt),
(2.10)
where qt is defined by this equation.

There are many strategies in the literature for choosing μt and Σt in equation 2.10. The terminology is not standardized, but we will attempt to describe some prominent classes of strategies.

2.4.1  Gaussian Assumed Density Filter

The gaussian assumed density filter (G-ADF) usually refers to choosing μt and Σt to be the mean vector and covariance matrix of the density qt in equation 2.10 (Kushner, 1967; Ito, 2000; Ito & Xiong, 2000; Minka, 2001a). Moment matching, in this case, minimizes the relative entropy D(qtηd(·;μt,Σt)). The G-ADF directly seeks a gaussian approximation to the full posterior p(zt|x1:t), whereas the DKF derives a gaussian approximation to the full posterior from a gaussian approximation of p(zt|xt). While the G-ADF approach tends to prove quite accurate, it is practical only if the mean and covariance of qt are available. In particular, we must be able to efficiently compute or easily approximate the integrals,
a=p(xt|zt)ηd(zt;νt,Mt)dzt,b=ztp(xt|zt)ηd(zt;νt,Mt)dzt,c=ztztp(xt|zt)ηd(zt;νt,Mt)dzt,
(2.11)
to obtain μt=b/a and Σt=c/a-μtμt. There also exist extensions of the G-ADF. For instance, expectation propagation uses iterative refinement of estimates to improve on assumed density filtering (Minka, 2001a, 2001b). It may be possible to similarly improve the DKF, but iterating over the history of observations is typically not practical in an online setting, and we do not explore that approach here.

In cases where the DKF is derived from a known model, as opposed to being learned from training data, computing f(xt) and Q(xt) requires the computation of very similar integrals to those needed for the G-ADF, the difference being that vt and Mt are replaced by 0 and S, respectively, throughout equation 2.11 (and then f(xt)=b/a and Q(xt)=c/a-f(xt)f(xt)). For this reason, in models where the G-ADF can be easily used, there would seem to be no reason to use the DKF. The main difference is that the DKF can be easily learned from training data, whereas the G-ADF cannot, since the latter is based on the conditional mean and variance of Zt|Xt derived under a different marginal distribution for Zt at each time step, namely, ηd(zt;νt,Mt). The example in section 4.2 illustrates a model where both the DKF and G-ADF can be analytically computed; there is little difference in performance. The example in section 4.3 illustrates a somewhat contrived model where the DKF can be easily computed, but it seems the G-ADF cannot.

2.4.2  Laplace Approximation

The Laplace approximation uses a Taylor approximation at the maximum to coerce the numerator in equation 2.10 into a gaussian form as a function of zt (Butler, 2007; Koyama, Pérez-Bolde, Shalizi, & Kass, 2010; Quang, Musso, & Le Gland, 2015). Defining
gt(zt)=logp(xt|zt)ηd(zt;νt,Mt)andzt*=argmaxztgt(zt),
a second-order Taylor approximation of gt at zt* is
gt(zt)gt(zt*)+g˙t(zt*)(zt-zt*)+(zt-zt*)g¨t(zt*)(zt-zt*)/2,
where g˙t(z) and g¨t(z) denote, respectively, the d×1 gradient vector and the d×d Hessian matrix of gt evaluated at z. The second term vanishes since g˙t is zero at the maximum, giving
qt(zt)exp(gt(zt))expgt(zt*)+(zt-zt*)g¨t(zt*)(zt-zt*)/2)ηd(zt;zt*,-g¨t(zt*)-1).
This motivates the choice of μt=zt* and Σt=-g¨t(zt*). Similar to the DKF, the Laplace approximation can be justified in the limit of increasing observation dimensionality using the BvM theorem. If zt* or the derivatives of gt are not available in closed form, then the Laplace approximation can be slow owing to the need to solve an optimization problem at each time step. Laplace approximations are also criticized for being too local, in that the local curvature in the density at zt* dictates the variance chosen for a global approximation to the density.

2.4.3  Linearization Methods

Several methods, often called linearization methods, can be motivated by attempting to approximate the numerator of equation 2.10 as jointly gaussian in (zt,xt), namely,
p(xt|zt)ηd(zt;νt,Mt)ηd+nztxt|νtht,MtCtCtNt,
(2.12)
where the history of observations x1:t is allowed to influence the choice of htRn×1, NtSn, and CtRd×n. Using this approximation allows equation 2.10 to be exactly integrated to obtain
μt=νt+CtNt-1(xt-ht)andΣt=Mt-CtNt-1Ct.
(2.13)
Methods differ in how they choose ht, Nt, and Ct.
Using ηd(zt;νt,Mt) as the marginal density for Zt, equation 2.12 can be rewritten as
p(xt|zt)ηn(xt;bt+Htzt,Λt).
(2.14)
The implicit linearization in equation 2.12 is now explicit: E(Xt|Zt=zt) is approximated as the linear function bt+Htzt. The relationship between the different parameters in equations 2.12 and 2.14 is bt=ht-CtMt-1vt, Ht=CtMt-1, and Λt=Nt-CtM-1Ct. Upon reparameterization,1 equation 2.13 can be used for filtering with
Σt=(Mt-1+HtΛt-1Ht)-1,μt=ΣtMt-1νt+HtΛt-1(xt-bt),
which has a similar appearance to the corresponding DKF updates in equation 2.7.
Equation 2.14 underlies several gaussian approximations to Bayes' rule, including the approximations used in the extended Kalman filter (EKF), the unscented Kalman filter (UKF: Julier & Uhlmann, 1997; Wan & van der Merwe, 2000; van der Merwe, 2004), and the statistically linearized filter (SLF: Gelb, 1974; Särkkä, 2013). The EKF, for instance, begins with the functions
h(z)=E(Xt|Zt=z)andΛ(z)=V(Xt|Zt=z),
which are assumed known, and takes Ht=h˙(νt), bt=h(νt)-Htνt, and Λt=Λ(νt), where h˙(z) is the n×d matrix of partial derivatives of h evaluated at z. These choices of bt and Ht correspond to a first-order Taylor approximation of h at the point νt. Like the Laplace approximation, the EKF is often criticized for being too local because the gradient of h at a single point drives the approximation.

The unscented Kalman filter (UKF) employs the eponymous transform to propagate weighted, deterministically chosen points through a nonlinear transformation and recover estimates for ht, Nt, and Ct from equation 2.13. The estimates for all three parameters prove exact for linear transformations of gaussians but inexact for general higher-order polynomials (Särkkä, 2013), so we consider this a linearization method. Variations on this approach, collectively called sigma-point filters (van der Merwe, 2004), include the central difference Kalman filter (CDKF: Ito & Xiong, 2000; Nørgaard, Poulsen, & Ravn, 2000), the Gauss-Hermite Kalman filter, the quadrature Kalman filter (Ito, 2000; Ito & Xiong, 2000), and the cubature Kalman filter (Arasaratnam, Haykin, & Elliott, 2007; Arasaratnam & Haykin, 2009).

The SLF is a related but more global approximation for the same observation model. It selects bt and Ht to minimize the difference between the true observation model Xt=h(Zt)+εt and the linear approximation Xtat+BtZt+εt, where Zt is chosen from the current, approximate, predicted distribution. For instance, at and Bt can be chosen to minimize
h(zt)-(at+Btzt)2ηd(zt;νt,Mt)dzt,
where · is the usual Euclidean norm in Rn. Defining h¯t=h(zt)ηd(zt;νt,Mt)dzt and H¯t=(h(zt)-h¯t)(zt-νt)ηd(zt;νt,Mt)dzt, the solution is Bt=H¯tMt-1 and at=h¯t-Btνt, again with Λt=Λ. Like the EKF, this version of the SLF is best suited for additive, gaussian noise models, but it further requires that the integrals defining h¯t and H¯t can be efficiently computed or easily approximated.

The UKF, the SLF, and many related techniques improve on some of the deficiencies of the EKF. Nevertheless, these methods tend to perform poorly when the conditional distribution of Xt given Zt cannot be well approximated as gaussian. The examples in sections 4.2 and 4.3 illustrate models where linearization proves completely ineffectual, as h(z)=E(Xt|Zt=z)=0 for all z in these examples, even though the G-ADF and the DKF work well.

2.5  Nonlinear State Dynamics

Filtering can be conceptually separated into two steps. The first step uses the state dynamics to transition from Zt-1|X1:t-1 to Zt|X1:t-1 via equation 1.2a, and the second step uses Bayes' rule to update Zt|X1:t-1 into Zt|X1:t via equation 1.2b. In this letter, difficulties with the first step are removed by assuming linear, gaussian, state dynamics (see equation 2.1). There are, however, a variety of approximation methods for more complicated state dynamics, including methods that approximate p(zt|x1:t-1) as a gaussian. Any such gaussian method could be easily combined with the DKF approximation, which relates to Bayes' rule in the second step. In particular, given the approximation
p(zt|x1:t-1)ηd(zt;νt,Mt),
we simply use these values of νt and Mt in the DKF algorithm (see equation 2.7) or the robust DKF algorithm see equation 2.8, instead of computing them in the first two lines of these algorithms. In this letter, we do not explore in depth this generalization to nonlinear state dynamics, although we do provide a proof of concept example in section 4.4.

There is a vast literature on more general approximation algorithms for Bayesian filtering (Särkkä, 2013; Chen, 2003). Monte Carlo integration (Metropolis & Ulam, 1949) can almost always be used. Such approaches are called sequential Monte Carlo or particle filtering and include sequential importance sampling and sequential importance resampling (Handschin & Mayne, 1969; Handschin, 1970; Gordon, Salmond, & Smith, 1993; Kitagawa, 1996; del Moral, 1996; Doucet, Godsill, & Andrieu, 2000; Cappé, Moulines, & Ryden, 2005; Cappé, Godsill, & Moulines, 2007). These methods apply to all classes of models but tend to be the most expensive to compute online and suffer from the curse of dimensionality (Daum & Huang, 2003). Alternate sampling strategies (see, e.g., Chen, 2003; Liu, 2008) can be used to improve filter performance, including acceptance-rejection sampling (Handschin & Mayne, 1969), stratified sampling (Douc & Cappé, 2005), hybrid MC (Choo & Fleet, 2001), and quasi-MC (Gerber & Chopin, 2015). There are also ensemble versions of the Kalman filter that are used to propagate the covariance matrix in high dimensions, including the ensemble Kalman filter (enKF: Evensen, 1994) and the ensemble transform Kalman filter (ETKF: Bishop, Etherton, & Majumdar, 2001; Majumdar, Bishop, Etherton, & Toth, 2002), along with versions that produce local, parallelizable approximations for covariance (Ott et al., 2004; Hunt, Kostelich, & Szunyogh, 2007).

It may be possible to usefully combine the DKF approximation with some of these more advanced filtering techniques. The key approximation in the DKF is
p(xt|zt)=p(xt)p(zt|xt)p(zt)κ(xt)ηd(zt;f(xt),Q(xt))ηd(zt;0,S).
(2.15)
This approximation could, in principle, be substituted for the likelihood p(xt|zt) in any filtering algorithm, including particle filters, which incorporate the likelihood into the particle weights. The normalizing term κ(xt) from equation 2.15 will generally cancel, since the final posterior distribution p(zt|x1:t) is invariant to terms depending only on x1:t. The advantage of equation 2.15 is that f(·), Q(·), and S might be easier to learn from data than the full conditional density p(xt|zt). For complex state dynamics, it is worth noting that the denominator ηd(zt;0,S) will no longer precisely correspond to p(zt) but will also be an approximation. If the gaussian approximations for p(zt|xt) and p(zt) are learned separately, some care may need to be taken to ensure the resulting approximation to p(xt|zt) remains a good one. One strategy might be to learn a gaussian-shaped approximation to the density ratio p(zt|xt)/p(zt) as a function of zt (Sugiyama, Suzuki, & Kanamori, 2012). Another strategy might be to use the robust DKF approximation as in section 2.3, which simply drops the denominator in equation 2.15. In future work, we plan to explore these and other approaches that might allow a DKF-style approximation to be incorporated into more general filtering models.

3  Learning the DKF

The parameters in the DKF are A, Γ, f(·), and Q(·). (S is specified from A and Γ using the stationarity assumption.) In many problems, some or all of these parameters might be unknown or not easily computable. In this section, we discuss some strategies for learning or approximating the parameters in the situation where fully supervised training data are available, meaning that we have a sequence of (Zt,Xt) pairs assumed to be sampled from the underlying Bayesian network in equation 1.1 and denoted (z1',x1'),,(zm',xm'). This training data might be real data or might be simulated from a known generative model for which the parameters, particularly f and Q, are not easily computable.

We use A^, Γ^, f^, and Q^ to denote the respective learned parameters. We consider only the situation where the parameters are learned from training data and then fixed for subsequent filtering on a different sequence of observations. In particular, for filtering, we simply replace each parameter with its corresponding estimate in the DKF algorithm in equation 2.7. We do not consider a more fully Bayesian approach, where parameter uncertainty is propagated through the filtering equations.

A and Γ are the parameters of a well-specified statistical model given by equations 2.1a and 2.1b. In the learning experiments below, we learn them from (zt-1',zt') pairs using only equation 2.1b, which reduces to multiple linear regression and is a common approach when learning the parameters of a Kalman filter from fully observed training data (see, e.g., Wu et al., 2002).

The parameters f and Q are more unusual, since they are not uniquely defined by the model, but are introduced via a gaussian approximation in equation 2.2. One possibility, and the one we focus on here, is to define f and Q via equation 2.3 and then learn them directly from training data as
f^(x)f(x)=E(Zt|Xt=x)andQ^(x)Q(x)=V(Zt|Xt=x).
(3.1)
Using equation 3.1, we learn f and Q from (zt',xt') pairs ignoring the overall temporal structure of the data, which reduces to a standard nonlinear regression problem with heteroskedastic variance. The conditional mean f can be learned using any number of off-the-shelf regression tools, and then Q can be learned from the residuals, ideally using a held-out portion of the training data. We think that the ability to easily incorporate off-the-shelf discriminative learning tools into a closed-form filtering equation is one of the most exciting and useful aspects of this approach.

In the experiments that follow, we compare three standard nonlinear regression methods for learning f: Nadaraya-Watson (NW) kernel regression, neural network (NN) regression, and gaussian process (GP) regression. (Details are in sections A.4 to A.6.) While we have found that these methods work well with the DKF framework, one could readily use any arbitrary regression model.

For learning Q, we first define Rt=Zt-f(Xt) and R^t=Zt-f^(Xt), so that
Q(x)=V(Zt|Xt=x)=E(RtRt|Xt=x)E(R^tR^t|Xt=x).
(3.2)
The final expression in equation 3.2 is a conditional expectation and can in principle be learned with regression on (R^tR^t,Xt) pairs. Learning Q in this way using off-the-shelf regression tools is more challenging because of the additional requirement that Q(x) be a valid covariance matrix. Since R^tR^t is positive semidefinite, any regression estimator that is a weighted average of the training data with only nonnegative weights will also be positive semidefinite and, in most cases, positive definite. NW kernel regression constitutes one such method, and we use it for learning Q in all of our examples. Given a subset of the training set {(zi'',xi'')}i=1k, distinct from the subset used to learn the function f, we define the residuals r^i=zi''-f^(xi''), and then learn Q using NW kernel regression via
Q^(x)=i=1kr^ir^iκ(x,xi'')i=1kκ(x,xi''),
(3.3)
for a kernel κ:X×X[0,). (Complete details are in section A.4.)

4  Examples

In this section, we compare filter performance on both artificial models and real neural data. Corresponding Matlab code, and Python code for the long short-term memory (LSTM) comparison, is freely available online at https://github.com/burkh4rt/Discriminative-Kalman-Filter under the GNU General Public License v3.0 to encourage code use and adaptation. For timing comparisons, the code was run on a Mid-2018 MacBook Pro laptop with a 2.6 GHz Intel Core i7 processor using Matlab v. 2019a and Python v. 3.6.8.

4.1  Kalman Observation Model

The stationary Kalman filter observation model is
p(xt|zt)=ηn(xt;b+Hzt,Λ)
for observations in X=Rn×1 and for fixed bRn×1, HRn×d, and ΛSn. Defining f and Q via equation 2.3 gives
Q(x)Q=(S-1+HΛ-1H)-1andf(x)=QHΛ-1(x-b).
It is straightforward to verify that the DKF in equation 2.7 is exactly the well-known Kalman filter recursion. Hence, the DKF computes the exact posterior p(zt|x1:t) in this special case.

4.2  Kalman Observation Mixtures

This example and the next are designed to illustrate how the gaussian approximation underlying the DKF is more similar in spirit to the G-ADF than to linearization approximations such as the Kalman filter, the EKF, and the UKF (see section 2.4). In particular, the specific observation model used in the simulation is engineered so that the state Zt and the observation Xt are uncorrelated (but not independent). Linearization methods are useless in this case, whereas the DKF is able to take advantage of the higher-order dependence, much like the G-ADF.

The observation model is a probabilistic mixture of Kalman observation models (see section 4.1), namely,
p(xt|zt)==1Lπηn(xt;b+Hzt,Λ),
for a probability vector π=π1:L, where each bRn×1, HRn×d, and ΛSn. At each time step, one of L possible Kalman observation models is randomly and independently selected according to π and then used to generate the observation. This model can be viewed as a special case of a switching state-space model with independent switching (see Shumway & Stoffer, 1991; Ghahramani & Hinton, 2000). The integrals in equation 2.11 can be efficiently computed for any choice of νt and Mt, including νt=0 and Mt=S, so the G-ADF and the DKF can be computed exactly for this model (see section A.1 for details), although the DKF is much faster for large n, because it allows for more precomputation. Figure 1 illustrates that the DKF is comparable to the G-ADF in terms of root mean squared error (RMSE) for a particular instance of this model, and it also shows that the computational savings of the DKF over a particle filter with similar accuracy can be dramatic, especially as n gets large.
Figure 1:

Kalman observation mixtures. This figure shows filtering performance on an instance of the model in section 4.2 for various approximation algorithms as the observation dimension n increases. The hidden state dimension is d=10, and the state model parameters are S=Id, A=0.95Id-0.05, and Γ=S-ASA. The number of categories is L=2, the category probabilities are π=(0.5,0.5), and the Kalman parameters are b1=b2=b¯=0, Λ1=In, Λ2=5In, and H2=-H1, so that H¯=0 (see equation 4.1). The entries of H1 were generated as independent N(0,d-1) using the Matlab 9.6 code rng(42,`twister'); H = randn(1000,10)/sqrt(10). The data were generated for an observation dimension of 1000, and the plot shows filter performance using only the first n dimensions of Xt for selected n between 1 and 1000. Filter performance was measured using RMSE (left panel) and computation time (s, right panel) on a single test sequence of length T=104. Because Xt and Zt are uncorrelated, linearization methods (e.g., KF, EKF, and UKF) ignore Xt and always predict ZtE(Zt)=0, giving an RMSE of approximately 1 (black line) in this case. The accuracy of particle filtering increases with the number of particles at the expense of increased computation, and we show performance for different numbers of particles: 101,102,103,104,105 (blue lines, ordered as expected). We also show RMSE for the optimal prediction using only Xt (as opposed to the entire history X1:t), namely, ZtE(Zt|Xt)=f(Xt) (dotted red line). (This serves to demonstrate the performance gain that filtering provides.) Finally, we caution that the model parameters have much more influence on the relative performance of the different gaussian approximation methods when n is small than when n is large. The parameters in this model were chosen so that the DKF also performs well for small n, even though we only have guarantees about its performance in the large n setting.

Figure 1:

Kalman observation mixtures. This figure shows filtering performance on an instance of the model in section 4.2 for various approximation algorithms as the observation dimension n increases. The hidden state dimension is d=10, and the state model parameters are S=Id, A=0.95Id-0.05, and Γ=S-ASA. The number of categories is L=2, the category probabilities are π=(0.5,0.5), and the Kalman parameters are b1=b2=b¯=0, Λ1=In, Λ2=5In, and H2=-H1, so that H¯=0 (see equation 4.1). The entries of H1 were generated as independent N(0,d-1) using the Matlab 9.6 code rng(42,`twister'); H = randn(1000,10)/sqrt(10). The data were generated for an observation dimension of 1000, and the plot shows filter performance using only the first n dimensions of Xt for selected n between 1 and 1000. Filter performance was measured using RMSE (left panel) and computation time (s, right panel) on a single test sequence of length T=104. Because Xt and Zt are uncorrelated, linearization methods (e.g., KF, EKF, and UKF) ignore Xt and always predict ZtE(Zt)=0, giving an RMSE of approximately 1 (black line) in this case. The accuracy of particle filtering increases with the number of particles at the expense of increased computation, and we show performance for different numbers of particles: 101,102,103,104,105 (blue lines, ordered as expected). We also show RMSE for the optimal prediction using only Xt (as opposed to the entire history X1:t), namely, ZtE(Zt|Xt)=f(Xt) (dotted red line). (This serves to demonstrate the performance gain that filtering provides.) Finally, we caution that the model parameters have much more influence on the relative performance of the different gaussian approximation methods when n is small than when n is large. The parameters in this model were chosen so that the DKF also performs well for small n, even though we only have guarantees about its performance in the large n setting.

Define b¯=πb and H¯=πH so that
E(Xt|Zt)=b¯+H¯Zt.
(4.1)
An interesting special case of this model is when H¯=0, so that the mean of Xt given Zt does not depend on Zt, and, consequently, Xt and Zt are uncorrelated. Information about the states is found only in higher-order moments of the observations. Algorithms that are designed around E(Xt|Zt), such as the Kalman filter, EKF, and UKF, are not useful when H¯=0, illustrating the important difference between a gaussian approximation for the observation model and the DKF approximation in equation 2.2. The simulation in Figure 1 used H¯=0, and the ineffectiveness of linearization techniques is easily seen.

4.3  Independent Bernoulli Mixtures

Here we describe a model where observations take values in {0,1}n to further emphasize that our gaussian approximation is in the state space, not in the observation space. Like the example in section 4.2, this example is also engineered so that the states and observations are uncorrelated, rendering linearization-based methods ineffective (see section 2.4). Finally, the specific parameters of this example are chosen to have the peculiar property that the DKF is efficiently computable, whereas the G-ADF is not (insofar as we can tell).

The observation model is a probabilistic mixture of conditionally independent Bernoulli random variables, namely,
p(xt|zt)==1Lπi=1ngi(zt)xti(1-gi(zt))1-xti,
for a probability vector π=π1:L. For each =1,,L and i=1,,n, the functions gi:Rd×1(0,1) are defined by
gi(zt)=αi1{ztdi<γi}+βi1{ztdiγi},
where each γiR, αi,βi(0,1), di{1,,d}, and where ztk indicates the kth coordinate of zt. The ith coordinate of Xt depends on Zt only through the dith coordinate of Zt, and the probability distribution of Xti is different depending on whether Ztdi<γi. Each of the L components of the mixture changes the probability distribution of Xti, via αi and βi, but it does not change the corresponding coordinate di or the change point γi.

For the state dynamics, we use S=Id, which makes it possible to compute f(zt) and Q(zt) exactly (see section A.2). In general, however, the integrals in equation 2.11 are not easily evaluated, so the G-ADF is not a practical approximation technique in this example. Figure 2 suggests that the DKF approximation performs well for a particular instance of this model, in the sense that the DKF's RMSE is near or better than that of a particle filter with a large number of particles. The figure also shows that the computational savings over a particle filter with similar accuracy can be dramatic, especially as n gets large.

Figure 2:

Independent Bernoulli mixtures. This figure shows filtering performance on an instance of the model in section 4.3 for various approximation algorithms as the observation dimension n increases. The state model (Zt) and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. (Using this many particles with higher n was too time-consuming.) The number of categories is L=2, the category probabilities are π=(0.5,0.5), and for each i, α1i=β2i=0.01 and α2i=β1i=0.99, so that each g¯i0.5 (see equation 4.2). The d1:n were chosen as independent uniform{1,,d}, and the γ1:n were chosen as independent N(0,1).

Figure 2:

Independent Bernoulli mixtures. This figure shows filtering performance on an instance of the model in section 4.3 for various approximation algorithms as the observation dimension n increases. The state model (Zt) and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. (Using this many particles with higher n was too time-consuming.) The number of categories is L=2, the category probabilities are π=(0.5,0.5), and for each i, α1i=β2i=0.01 and α2i=β1i=0.99, so that each g¯i0.5 (see equation 4.2). The d1:n were chosen as independent uniform{1,,d}, and the γ1:n were chosen as independent N(0,1).

Define g¯i=πgi, so that
E(Xti|Zt)=P(Xti=1|Zt)=g¯i(Zt).
(4.2)
An interesting special case of this model is when g¯i is constant for each i, so that the mean of Xt given Zt does not depend on Zt, and, consequently, Xt and Zt are uncorrelated. As in the previous section, linearization approximations like the Kalman filter, EKF, and UKF are not useful when g¯i is constant. Furthermore, when g¯i is constant, Xti and Zt are independent, that is, individual coordinates of the observations carry no information about the states. Only the vector of observations Xt can be used for meaningful predictions of Zt. The simulation in Figure 2 used g¯i0.5 for all i, so that each coordinate of the observations is independent of the state.

4.4  Kalman Observation Mixtures with Nonlinear State Dynamics

This example illustrates how the DKF approximation can be combined with other filtering approximations for use with nonlinear state dynamics (see section 2.5). We include it here as a proof of concept and leave for future work a more thorough exploration of when the DKF approximation is useful for filtering with nonlinear state dynamics. We use the same mixture of Kalman observation models from section 4.2 but modify the state dynamics in equation 2.1 as follows. Define the 2×2 rotation matrix R(θ)=sinθ-cosθcosθsinθ, and for even d, define the d×d rotation matrix Rd(θ) to be the block-diagonal matrix with R(θ) repeated along the diagonal:
Rd(θ)=R(θ)000R(θ)000R(θ).
Define the function a:Rd×1Rd×1 via a(z)=ARd(|z|)z, where |·| denotes the Euclidean norm. The new state dynamics are
p(z0)=ηd(z0;0,S),p(zt|zt-1)=ηd(zt;a(zt-1),Γ),
for t=1,2,, where S=ASA+Γ. These are the same dynamics as before except that the conditional mean of Zt given Zt-1 has changed from the linear function AZt-1 to the nonlinear function a(Zt-1). In particular, before being multiplied by A, the state vector is rotated by an amount that depends on its length. This type of nonlinearity was chosen because when S=Id (as in our examples), Zt remains marginally gaussian, an important part of the DKF approximation.

We use an unscented Kalman filter (UKF) approximation for the state dynamics; that is, we replaced νt and Mt in equations 2.7 and 2.8 with the mean and covariance obtained from performing the unscented transform (Julier & Uhlmann, 1997). We used Matlab's unscentedKalmanFilter implementation with alpha=1,beta=kappa=0. The UKF approximations of νt and Mt can also be substituted directly into the G-ADF used in section 4.2.

Figure 3 shows filtering performance for a specific instance of this model and illustrates that at least in this case, a DKF approximation for nonlinear, nongaussian observation models can be usefully combined with other approximations for nonlinear state dynamics and that there is little loss of performance compared to the G-ADF.

Figure 3:

Nonlinear state dynamics. This figure shows filtering performance on an instance of the model in section 4.4 for various approximation algorithms as the observation dimension n increases. The observation model (Xt|Zt) and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. The state model is now nonlinear, and μt and Mt in the DKF, robust DKF, and G-ADF are approximated using a UKF.

Figure 3:

Nonlinear state dynamics. This figure shows filtering performance on an instance of the model in section 4.4 for various approximation algorithms as the observation dimension n increases. The observation model (Xt|Zt) and the figure conventions (and cautions) are the same as those described in the Figure 1 caption. The state model is now nonlinear, and μt and Mt in the DKF, robust DKF, and G-ADF are approximated using a UKF.

4.5  Unknown Observation Model: Macaque Reaching-Task Data

This example illustrates Bayesian filtering in a case where the observation model is unknown and must be learned from data. Flint, Lindberg, Jordan, Miller, and Slutzky (2012) implanted a rhesus monkey with a 96-channel microelectrode array (Blackrock Microsystems LLC) over the arm area of its primary motor cortex (M1). The monkey was trained to move a manipulandum to acquire illuminated targets for a juice reward. While performing this task, the monkey's neural spikes were recorded with a 128-channel acquisition system (Cerebus, Blackrock Microsystems LLC). The signal was sampled at 30 kHz, high-pass-filtered at 300 Hz, and then thresholded and manually sorted into spikes offline. Walker and Kording (2013) continue to make these data publicly available as part of the Database for Reaching Experiments and Models (DREAM). We used data from Flint et al. (2012) and aggregated spike counts over 100 ms bins. The first n=10 principal component analysis (PCA) components of neural data became the observed variable Xt, and we used the d=2-dimensional (horizontal and vertical) cursor velocity (lagged 50 ms after the end of the spike count bin) as the latent variable Zt.

Tables 1 and 2 compare filtering performance using various learning algorithms and filtering methods. For learning the function f:R10R2 for the DKF, we experimented with Nadaraya-Watson (NW) kernel regression, neural network (NN) regression, and gaussian process (GP) regression. In each case, we learned the function Q:R10S2 using NW kernel regression from the approximate residuals as in equation 3.3. For the Kalman filter, parameters are learned in the usual manner via multivariate (linear) regression. For the EKF and UKF (see section 2.4), we learned the conditional mean h:R2R10 defined by
h(z)=E(Xt|Zt=z)
(4.3)
via neural network regression and took the conditional covariance to be constant, namely, Λ(z)=V(Xt|Zt=z)ΛS10, which we learned from the approximate residuals. Finally, we also experimented with an LSTM recurrent neural network for predicting Zt given X1:t. In all cases, we used 5000 training points and a different 1000 testing points. (More details about all of these methods are in sections A.4 to A.7.)
Table 1:
Normalized RMSE (nRMSE) for Various Filtering Methods on the Flint Data Set.
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.7650.9420.7880.7930.7800.7650.805
DKF-NW −21% −18% −17% −23% −20% −23% −20% 
DKF-GP −21% −19% −15% −20% −18% −20% −19% 
DKF-NN −19% −15% −13% −13% −13% −17% −15% 
LSTM −15% −19% −16% −13% −16% −11% −15% 
EKF 2% 24% 12% 18% 12% 3% 12% 
UKF 2% 31% 18% 18% 15% 6% 15% 
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.7650.9420.7880.7930.7800.7650.805
DKF-NW −21% −18% −17% −23% −20% −23% −20% 
DKF-GP −21% −19% −15% −20% −18% −20% −19% 
DKF-NN −19% −15% −13% −13% −13% −17% −15% 
LSTM −15% −19% −16% −13% −16% −11% −15% 
EKF 2% 24% 12% 18% 12% 3% 12% 
UKF 2% 31% 18% 18% 15% 6% 15% 

Notes: The nRMSE is computed by dividing the RMSE by the root mean square of the observation vector, so that predicting identically zero would yield an nRMSE of 1. The top row shows the nRMSE of the Kalman filter. Each remaining row shows the percentage change in nRMSE relative to the Kalman filter, with methods ordered from best (top) to worst (bottom) average performance. Columns 1 to 6 refer to completely separate trials using new training and testing data. The final column gives the average performance across the six trials.

Table 2:
Mean Absolute Angular Error (Radians) for Various Filtering Methods on the Flint Data Set.
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.8890.9551.0250.9330.9640.9260.949
DKF-NW −15% −1% −20% −17% −25% −28% −18% 
DKF-GP −11% 7% −22% −16% −24% −25% −15% 
DKF-NN −7% −2% −17% −16% −21% −23% −14% 
LSTM −2% −2% −12% −6% −10% −8% −7% 
UKF 0% 3% −3% −3% −8% −6% −3% 
EKF 4% 3% −2% −4% −8% −7% −2% 
Trial 1Trial 2Trial 3Trial 4Trial 5Trial 6Average
Kalman0.8890.9551.0250.9330.9640.9260.949
DKF-NW −15% −1% −20% −17% −25% −28% −18% 
DKF-GP −11% 7% −22% −16% −24% −25% −15% 
DKF-NN −7% −2% −17% −16% −21% −23% −14% 
LSTM −2% −2% −12% −6% −10% −8% −7% 
UKF 0% 3% −3% −3% −8% −6% −3% 
EKF 4% 3% −2% −4% −8% −7% −2% 

Notes: Because cursor speed is often adjustable in BCIs (Willett et al., 2019), this may provide a more informative measure of performance. See the caption for Table 1 for more details about the table arrangement. Note that 45=π/40.79 radians, so all of these methods have fairly substantial angular error over 100 ms prediction intervals. Chance performance would be π/21.57 radians.

The DKF using NW kernel regression was the best method among the ones that we tried, and all versions of the DKF were near the top in performance. Under the mean absolute angular error (MAAE) metric (Simeral, Kim, Black, Donoghue, & Hochberg, 2011), each version of the DKF outperformed prediction using the corresponding f^, illustrating the benefit of filtering to combine information from both past and present observations. The EKF and UKF performed poorly. We do not know the degree to which poor performance is a result of errors introduced by the EKF and UKF approximations or a result of errors introduced from learning the function h in equation 4.3. All versions of the DKF outperformed the LSTM that we used. The LSTM and its variants require manually selecting a neural network architecture and several tuning parameters. This is often done by experts through trial and error. While we suspect that there is some combination of architecture and tuning parameters that would allow the LSTM to meet or exceed the DKF performance, automating this process of searching through network architecture remains an area of active research requiring extensive computational resources (Zoph & Le, 2017; Real et al., 2017).

4.6  Closed-Loop Decoding in a Person with Paralysis

Neural decoding for closed-loop brain-computer interfaces (BCIs) provided the motivating application for the development of the DKF. BCIs use neural measurements from the brain to enable voluntary control of external devices (Wolpaw, Birbaumer, McFarland, Pfurtscheller, & Vaughan, 2002; Hochberg & Donoghue, 2006; Brandman, Cash, & Hochberg, 2017). Intracortical BCI systems (iBCIs) have been shown to provide users with paralysis the ability to control computer cursors (Pandarinath et al., 2015; Jarosiewicz et al., 2015; Nuyujukian et al., 2018), robotic arms (Hochberg et al., 2012; Collinger et al., 2013), and functional electrical stimulation systems (Bouton et al., 2016; Ajiboye et al., 2017) with the real-time decoded neural activity generated during attempted movement. State-of-the-art decoding approaches have been based on the Kalman filter (Pandarinath et al., 2017; Jarosiewicz et al., 2015; Gilja et al., 2015), with observed neural features and latent motor intention used to move external devices. To construct a supervised training set, motor intentions are inferred as vectors from the instantaneous cursor position to the target position Zt (Brandman, Hosman et al., 2018).

The DKF is a natural choice for closed-loop neural decoding using iBCIs for a few reasons. First, evidence suggests that neurons have very complex behavior. Neurons in the motor cortex have been shown to encode direction of movement (Georgopoulos, Kettner, & Schwartz, 1988), velocity (Schwartz, 1994), acceleration (Paninski, Fellows, Hatsopoulos, & Donoghue, 2004), muscle activation (Lemon, 2008; Pohlmeyer, Solla, Perreault, & Miller, 2007), proprioception (Bensmaia & Miller, 2014), visual information related to the task (Rao & Donoghue, 2014), and preparatory activity (Churchland et al., 2012). Hence, iBCI-related recordings are highly complex and nonlinear (Vargas-Irwin, Brandman, Zimmermann, Donoghue, & Black, 2015). Moving away from the linear constraints of the Kalman filter could potentially capture more of the inherent complexity of the signals, resulting in higher end-effector control for the user.

Second, evidence suggests that the quality of control directly relates to the rate at which the decoding systems perform real-time decoding. Modern iBCI sytems update velocity estimates on the order of 20 ms (Jarosiewicz et al., 2015) or even 1 ms (Pandarinath et al., 2015). Thus, any potential filtering technique must be computationally feasible to implement for real-time use.

Third, new technologies have allowed neuroscientists to simultaneously record from increasingly large numbers of neurons. In fact, the number of observed brain signals has been growing exponentially (Stevenson & Kording, 2011). By contrast, the dimensionality of the underlying device being controlled remains small, generally not exceeding 10 dimensions (Wodlinger et al., 2015; Vargas-Irwin et al., 2010).

We previously reported how three people with spinal cord injuries used the DKF with GP regression to rapidly gain closed-loop neural control (Brandman, Burkhart et al., 2018; Brandman, Hosman et al., 2018). Here, as an additional proof of concept, we present data from a person with amyotrophic lateral sclerosis (participant T9) using the DKF. In these research sessions, the observations constitute neural data collected from an electrode array surgically implanted in the participant's brain, and the hidden states represent the intended cursor velocity. The DKF prediction of intended cursor velocity is used at each time step to move the cursor. For learning the DKF parameters, training data are collected during an initial calibration phase in which the participant is instructed to attempt to move the cursor to various target locations, and the intended velocity at each time step is assumed to be pointing from the current cursor position to the instructed target. GP regression was used to learn f, and for computational efficiency, Q was assumed to be constant and set as the covariance of the residuals. The participant's performance using an out-of-the-box DKF was comparable to state-of-the-art decoders based on modifications of the Kalman filter designed specifically for the BrainGate2 clinical trials.

4.6.1  Participant

The participant in this study was T9, a 52-year-old right-handed male with paralysis from late-stage amyotrophic lateral sclerosis (ALSFRS-R score = 7; see Cedarbaum et al., 1999, for a detailed explanation of this metric). T9 underwent surgical placement of two 96-channel intracortical silicon microelectrode arrays (Maynard et al., 1997) (1.5 mm electrode length, Blackrock Microsystems, Salt Lake City, UT) in the motor cortex as previously described (Kim, Simeral, Hochberg, Donoghue, & Black, 2008; Simeral et al., 2011). Data were used from trial (postimplant) days 292 and 293.

4.6.2  Signal Acquisition

Raw neural signals for each channel (electrode) were sampled at 30 kHz using the NeuroPort System (Blackrock Microsystems, Salt Lake City, UT). Further signal processing and neural decoding were performed using the xPC target real-time operating system (Mathworks, Natick, MA). Raw signals were downsampled to 15 kHz for decoding and denoised by subtracting an instantaneous common average reference (Gilja et al., 2015; Jarosiewicz et al., 2015) using 40 of the 96 channels on each array with the lowest root mean square value (selected based on their baseline activity during a 1 minute reference block run at the start of each session). The denoised signal was bandpass-filtered between 250 Hz and 5000 Hz using an 8th-order noncausal Butterworth filter (Masse et al., 2015). Spike events were triggered by crossing a threshold set at 3.5 times the root mean square amplitude of each channel, as determined by data from the reference block. The neural feature used was total power in the bandpass-filtered signal (Jarosiewicz et al., 2015; Brandman, Hosman et al., 2018). Neural features were binned in 20 ms nonoverlapping increments for decoding. We used the top 40 features ranked by signal-to-noise-ratio (Malik et al., 2015).

4.6.3  Decoder Calibration

Decoder calibration was performed using the standard radial-8 task (Simeral et al., 2011; Gilja et al., 2015) using custom-built software running Matlab. An LCD monitor was placed 55 to 60 cm at a comfortable angle and orientation to T9. Targets (size = 2.4 cm, visual angle =2.5) were presented sequentially in a pseudorandom order, alternating between one of eight radially distributed targets and a center target (radial target distance from center = 12.1 cm, visual angle =12.6). Successful target acquisition required the user to place the cursor (size = 1.5 cm, visual angle = 1.6) within the target's diameter for 300 ms, before a predetermined timeout of 15 seconds. Target timeouts resulted in the cursor moving directly to the intended target, with immediate presentation of the next target.

Calibration began with 2 minutes of open-loop presentation of a cursor; that is, the cursor moved automatically to pseudorandomly presented targets in a straight path. During this time, T9 was instructed to “imagine” or “attempt” to move the computer cursor as if he had control of it. After 2 minutes, initial hyperparameters for the GP were learned. Next, T9 acquired targets for 3 minutes with 80% of the component of the decoded vector perpendicular to the vector between the cursor and the target (Jarosiewicz et al., 2013; Velliste, Perel, Spalding, Whitford, & Schwartz, 2008), in order to assist with target acquisition. GP hyperparameters were then recomputed with all of the available data. The radial-8 task was repeated two more times with the attenuated components at 50% and 20%, for a total of 11 minutes of calibration data collected. We collected 3000 data points randomly subsampled from the 11 minutes of collected data, using all 192 neural features (96 features per array, two arrays).

4.6.4  Performance Measurement

We quantified the performance of the DKF decoder with the mFitts1 task (Gilja et al., 2015; Simeral et al., 2011). Under the Fitts model (Fitts, 1954), movement time (MT) varies linearly with the index of difficulty (ID) as
MT=a·ID+b,
(4.4)
where the parameters a and b depend on the input device. Parameters are estimated using linear regression on observed (ID, MT) pairs for each input method. These estimates are then used to evaluate filter performance.
A single target was presented on the screen in a pseudorandom location, with one of three pseudorandomly fixed diameters (size = 1.6 cm, 3.5 cm, and 5.6 cm; visual angles 1.7, 3.7, and 5.8). Targets were acquired by having the cursor contact the target for 500 ms, within a timeout of 10 seconds. For the mFitts1 task, the index of difficulty for each trial was calculated as follows,
ID=log2DW+1,
where D is the distance from the cursor's start position to the goal and W is the sum of the target's diameter and cursor's radius. Hence, DW reflects a measure of difficulty for acquiring targets.

4.6.5  Results

T9 acquired 98% of targets presented over two research sessions (N=299) with the mFitts1 task. The Fitts regression parameters were comparable to the previously described performance by different participants (T6 and T7) using the ReFIT decoder (Gilja et al., 2015, Fig. 4.6.5, slope = 1.08±0.06,p<1.2×10-30, intercept =1.6±1.3,p<2.2×10-41). (See Figure 4 for details.)

Figure 4:

On the left, we plot movement time versus index of difficulty for T9 during the radial-8 task. On the right, we compare Fitts metrics for the DKF to those for Kalman ReFit. In particular, the slope and intercept from the line of best fit on the left correspond to the yellow bars for slope and intercept on the right. Error bars correspond to a 95% confidence interval for each estimated parameter. Following the discussion in section 4.6.4, lower values for the slope parameter (a in equation 4.4) correspond to less of an increase in movement time for more difficult targets. Estimates for the intercept parameter correspond to b in equation 4.4.

Figure 4:

On the left, we plot movement time versus index of difficulty for T9 during the radial-8 task. On the right, we compare Fitts metrics for the DKF to those for Kalman ReFit. In particular, the slope and intercept from the line of best fit on the left correspond to the yellow bars for slope and intercept on the right. Error bars correspond to a 95% confidence interval for each estimated parameter. Following the discussion in section 4.6.4, lower values for the slope parameter (a in equation 4.4) correspond to less of an increase in movement time for more difficult targets. Estimates for the intercept parameter correspond to b in equation 4.4.

5  Discussion

The DKF is a novel filtering method that should prove to be a helpful addition to the filtering toolbox. It provides a fast, analytic approximation for models with linear, gaussian dynamics but nonlinear, nongaussian observations. The approximations underlying the DKF tend to improve as the dimensionality of the observation space increases relative to the dimensionality of the state space. For known models, the DKF is quite similar in nature to the G-ADF; however, when models must be learned from training data, as is the case for many practical applications, the G-ADF entails integrals that require approximation and does not provide a closed-form update. In comparison to Laplace or saddle-point approximations, the DKF provides a more global approximation to the true filtering distribution. As we demonstrate in our examples, there are many families of state-space models that render the EKF and UKF ineffective but for which the DKF performs well.

In applications where the model must be learned from supervised training data prior to filtering, off-the-shelf nonlinear or nonparametric regression tools can be used to learn the conditional mean and variance for the DKF directly, avoiding the more complicated task of learning the complete observation model p(xt|zt). Using the DKF in this way appears to be novel within the large literature on state-space models. Most approaches either learn a fully generative model and invert it for filtering (this includes the of use discriminative methods for training filters derived from generative models—Abbeel, Coates, Montemerlo, Ng, & Thrun, 2005; Hess & Fern, 2009) or learn a fully discriminative model that directly predicts states from the sequence of observations. The DKF allows a generative model for the state dynamics to be combined in a principled way with a discriminative model for predicting the states from the observations at individual time steps. We think that the ability to easily incorporate off-the-shelf discriminative learning tools into a closed-form filtering equation is one of the most exciting and useful aspects of this methodology.

Many promising opportunities exist to apply and extend the DKF. For example, using a gaussian approximation for p(zt|xt) can permit a more principled approach to mitigating nonstationarities that occur in the measurement model. In neural decoding, a large change in the behavior of a single neuron that occurs between model training and filter use can result in significant performance degradation for the decoder. In the DKF framework with a GP regression model for p(zt|xt), one can select a kernel function that ignores large differences along any single dimension. Clinical results demonstrate that this modification allows the filter to be more robust to erratic firing patterns in an arbitrary single neuron. (See Brandman, Burkhart et al., 2018, for further details.) It seems that this approach could be readily applied more generally to increase filter resilience to nonstationarities.

While the DKF assumes an approximately gaussian posterior, for general filtering models there may also be ways to incorporate the underlying gaussian approximation for p(zt|xt) to improve performance. Methods that preserve the full form of the filtering distribution, such as particle filters, could be combined with alternatively specified measurement models, as in equation 2.15, to create general-purpose filters that are both more convenient to learn from data and use in filtering applications. The DKF marks a first step in this direction.

Appendix A:  Technical Details

This appendix provides the derivations used in sections 4.2 and 4.3, along with some information on numerical stability and details for the discriminative learning methods employed in section 4.5.

A.1  Kalman Observation Mixtures

For the model in section 4.2 we provide analytic expressions for the integrals in equation 2.11, which are needed for the G-ADF and the DKF (using νt=0 and Mt=S for the DKF). Define
Ut=(Mt-1+HΛ-1H)-1,yt=Ut(Mt-1νt+HΛ-1(xt-b)),κt=ηd(νt;0,Mt)ηn(xt;b,Λ)/ηd(yt;0,Ut).
Then
a=p(xt|zt)ηd(zt;νt,Mt)dzt==1Lπηn(xt;b+Hzt,Λ)ηd(zt;νt,Mt)dzt==1Lπηn(xt;b+Hνt,Λ+HMtH),b=ztp(xt|zt)ηd(zt;νt,Mt)dzt==1Lπztηn(xt;b+Hzt,Λ)ηd(zt;νt,Mt)dzt==1Lπκtztηd(zt;yt,Ut)dzt==1Lπκtyt,c=ztztp(xt|zt)ηd(zt;νt,Mt)dzt==1Lπκtztztηd(zt;yt,Ut)dzt==1Lπκt(Ut+ytyt).

A.2  Independent Bernoulli Mixtures

For the model in section 4.3, we provide analytic expressions for the integrals in equation 2.11 for the special case of νt=0 and Mt=S=Id, which are needed for the DKF. For each k=1,,d, define Nk={i:di=k}, Γk={γi:iNk}, nk=|Γk|, and let γk,1<<γk,nk denote the sorted (distinct) values in Γk, using γk,0=- and γk,nk+1=+. Using η(u)=η1(u;0,1) to denote the standard normal pdf and φ(v)=-vη(u)du to denote the corresponding distribution function, define
Φkj=γk,j-1γk,jη(u)du=φ(γk,j)-φ(γk,j-1),Φkj'=γk,j-1γk,juη(u)du-Φkj=η(γk,j-1)-η(γk,j)-Φkj,Φkj''=γk,j-1γk,ju2η(u)du-Φkj-2Φkj'=γk,j-1η(γk,j-1)-γk,jη(γk,j)-2Φkj',ρij=αi1{γk,jγi}+βi1{γi<γk,j},(iNk),
for k=1,,d and j=1,,nk+1 and =1,,L.
Let xtNk=(xti:iNk) and define
Dkj(xtNk)=iNkρijxti(1-ρij)1-xti,p(xtNk|ztk)=iNkαixti(1-αi)1-xti1{ztk<γi}+βixti(1-βi)1-xti1{ztkγi}=j=1nk+11{γk,j-1ztk<γk,j}Dkj(xtNk),
so that p(xt|zt)==1Lπk=1dp(xtNk|ztk) and (with S=Id),
p(xt|zt)ηd(zt;0,S)=p(xt|zt)k=1dη(ztk)==1Lπk=1dp(xtNk|ztk)η(ztk).
Hence, using δkr=1{k=r},
a=p(xt|zt)ηd(zt;0,S)dzt==1Lπk=1dp(xtNk|ztk)η(ztk)dztk==1Lπk=1dj=1nk+1Dkj(xtNk)γk,j-1γk,jη(ztk)dztk==1Lπk=1dj=1nk+1Dkj(xtNk)Φkj,br=ztrp(xt|zt)ηd(zt;0,S)dzt==1Lπk=1dj=1nk+1Dkj(xtNk)(Φkj+Φkj'δkr),crs=ztrztsp(xt|zt)ηd(zt;0,S)dzt==1Lπk=1dj=1nk+1Dkj(xtNk)(Φkj+Φkj'δkr+Φkj'δks+Φkj''δkrδks),
where in equation 2.11, the vector b=(br:r=1,,d) and the matrix c=(crs:r,s=1,,d). We have f(x)=b/a and Q(x)=c/a-f(x)f(x).

A.3  Measures to Prevent Numerical Instabilities

The covariance matrix Σt must be positive definite for the DKF algorithm to make sense. As n gets large, using Q(xt)=V(Zt|Xt=xt), the probability that Σt is positive definite goes to 1 (see 7 below). However, when n is small or when Q is learned, Σt will often not be positive definite. An easy remedy is to force Q-1(x)-S-1 to be positive semidefinite for every x by shrinking the (generalized) eigenvalues of Q(x) for any x where this constraint is not satisfied. In particular, beginning with a target Q=Q(x) for a given fixed x, consider the generalized eigenvalue decomposition QV=SVD, where VRd×d is invertible and DRd×d is diagonal. (This decomposition can be computed in Matlab using [V,D]=eig(Q,S).) Let D1 denote the element-wise minimum of D and 1, and define Q'=SV(D1)V-1. By redefining Q(x) as Q', we will ensure that Q-1(x)-S-1 is positive semidefinite, as required. Moreover, Q' will be the same as the original Q if this condition was already satisfied by the original Q, showing that this modification to the DKF algorithm does not affect our asymptotic analysis. We used this modification for all of the experiments with the DKF. The robust DKF does not require this modification. Here is a proof of the claims about this method: Q-1-S-1 is positive semidefinite if and only if S-Q is positive semidefinite if and only if S-1/2(S-Q)S-1/2 is positive semidefinite. We have S-1/2(S-Q)S-1/2=S-1/2(S-SVDV-1)S-1/2=I-S1/2VD(S1/2V)-1=(S1/2V)(I-D)(S1/2V)-1, which is positive semidefinite if and only if all entries of D (which is diagonal) are 1. Replacing D with D1 exactly enforces this constraint.

For our DKF experiments with nonlinear state dynamics using an extended Kalman filter (EKF) approximation (not described here), we found that the DKF-EKF became unstable for small n because the EKF approximation to the nonlinearity was quite poor. To remedy this, we modified the DKF algorithm to prevent μt from diverging too far from νt and f(xt) (the posterior means of Zt given X1:t-1 and given Xt, respectively). In particular, we forced |μt|2|νt|2+|f(xt)|2 (by scaling μt whenever its norm exceeded our bound). For larger n, once the DKF approximation becomes more accurate, this constraint was always satisfied in our experiments without intervention, but for smaller n, enforcing it was important for preventing numerical instabilities. The robust DKF did not require this modification. Although not used in this letter, we report this modification in case others find it useful in their application.

A.4  Nadaraya-Watson Kernel Regression

We can learn f:RnRd with a variety of regression methods. The well-known Nadaraya-Watson kernel regression estimator (Nadaraya, 1964; Watson, 1964) is
f^(x)=i=1mzi'κX(x,xi')i=1mκX(x,xi'),
where the κX(x,x') is a nonnegative kernel and m is the size of the training set. Bandwidth can be chosen using rule-of-thumb or with leave-one-out cross-validation, the latter scaling as O(m2). Evaluation of f^ scales like O(m). In the examples, we use a gaussian kernel with a bandwidth chosen by minimizing leave-one-out mean squared error (MSE) on the training set.

A.5  Neural Network Regression

We can also learn f as a neural network (NN). NNs are attractive for online filtering, because evaluation of f^ scales O(1) with the size of the training set. With MSE as an objective function, we optimize parameters over the training set. Typically optimization continues until performance stops improving on a validation subset (to prevent overfitting), but instead we use Bayesian regularization to ensure network generalizability (MacKay, 1992; Foresee & Hagan, 1997). Training costs depend on the training algorithm chosen. Traditional optimizers include stochastic gradient descent, scaling with O(m); scaled conjugate gradient, with O(m2); and Levenberg-Marquardt, with O(m3) (Castillo, Guijarro-Berdiñas, Fontenla-Romero, & Alonso-Betanzos, 2010), where m is the size of the training set. More recently, Hessian-free approaches have been developed to train NNs on larger data sets (Schmidhuber, 2015). Training costs also grow with d, depending on the choice of architecture.

We implemented all feedforward neural networks with Matlab's Neural Network Toolbox R2019a. Our implementation consisted of a single hidden layer of tansig neurons trained via Levenberg-Marquardt optimization (Levenberg, 1944; Marquardt, 1963; Hagan & Menhaj, 1994) with Bayesian regularization.

A.6  Gaussian Process Regression

Gaussian process (GP) regression is another popular method for nonlinear regression (Rasmussen & Williams, 2006). The idea is to put a prior distribution on the function f and approximate f with its posterior mean given training data. We first briefly describe the case d=1. We form an m×n-dimensional matrix X' by concatenating the 1×n-dimensional vectors Xi' and an m×d-dimensional matrix Z' by concatenating the vectors Zi'. We assume that p(zi'|xi',f)=η(zi';f(xi'),σ2), where f is sampled from a mean-zero GP with covariance kernel K(·,·). Under this model,
f^(x)=E(f(x)|Z',X')=K(x,X')(K(X',X')+σ2Im)-1Z',
where K(x,X') denotes the 1×m vector with ith entry K(x,Xi'), K(X',X') denotes the m×m matrix with ijth entry K(Xi',Xj'), Z' is a column vector, and Im is the m×m identity matrix. The noise variance σ2 and any parameters controlling the kernel shape are hyperparameters. For our examples, we used the radial basis function kernel with two parameters: length scale and maximum covariance. These hyperparameters were selected via maximum likelihood. For d>1, we repeated this process for each dimension to separately learn the coordinates of f. Training costs for a single dimension scale as O(m3). Sparse approximations to GPs can reduce training requirements to O(m·NS2) where NS is the size of the sparse GP (Quiñonero Candela & Rasmussen, 2005). Evaluation of f^ scales O(m) for each dimension, or O(NS) for sparse approximations.

All GP training was performed using the publicly available GPML package (Rasmussen & Nickisch, 2010).

A.7  Comparison with a Long Short-Term Memory Neural Network

An LSTM is a stateful recurrent neural network designed to overcome error backflow problems (Hochreiter & Schmidhuber, 1997). Such recurrent neural networks have previously been shown to outperform state-of-the-art Kalman-based filters on this primate neural decoding task and so provide a good point of comparison (Sussillo, Stavisky, Kao, Ryu, & Shenoy, 2016; Sussillo et al., 2012; Pandarinath et al., 2018; Hosman et al., 2019). While there are many variants on the LSTM architecture, none seem to universally improve on the basic design (Jozefowicz, Zaremba, & Sutskever, 2015; Greff, Srivastava, Koutník, Steunebrink, & Schmidhuber, 2016). LSTM optimization uses many of the same methods that work for feedforward NN's (Schmidhuber, 2015). Training and evaluation requirements are similar.

All LSTM trials were conducted with TensorFlow r1.4.0 in a Python 3.6.8 environment. The LSTM cell used in these trials was built from scratch in TensorFlow following Gers, Schmidhuber, and Cummins (2000). Dropout was used to prevent overfitting (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014), but it was only applied to feedforward connections, not recurrent connections (Pham, Bluche, Kermorvant, & Louradour, 2014; Zaremba, Sutskever, & Vinyals, 2014). The recurrent states and outputs at each intermediate time step were batch-normalized to accommodate internal covariate shift (Ioffe & Szegedy, 2015). Model parameters were initialized via a Xavier-type method (Glorot & Bengio, 2010) designed to stabilize variance from layer to layer. Optimization was then performed with Adadelta (Zeiler, 2012), an algorithm designed to improve upon Adagrad (Duchi, Hazan, & Singer, 2011) with the explicit goals of decreasing sensitivity to hyperparameters and permitting the learning rate to sometimes increase.

Appendix B:  Mathematical Results

Our main technical result is 2. After stating the theorem, we translate it into the setting of the letter. Probability density functions (pdfs) are with respect to Lebesgue measure over Rd. ·1 and · denote the L1 and L norms, respectively, w denotes weak convergence of probability measures (equivalent, for instance, to convergence of the expected values of bounded continuous functions), and δc denotes the unit point mass at cRd. Define the Markov transition density τ(y,z)=ηd(z;Ay,Γ), and let τh denote the function
(τh)(z)=τ(y,z)h(y)dy
for an arbitrary, integrable h. Define p(z)=ηd(z;0,S), where S satisfies S=ASA+Γ.
Theorem 2.
Fix pdfs sn and un(n1) so that the pdfs
pn=unτsn/punτsn/p1
(B.1)
are well defined for each n. Suppose that for some bRd and some probability measure P over Rd:
  1. snwP as n;

  2. There exists a sequence of gaussian pdfs (sn') such that sn-sn'10 as n;

  3. unwδb as n;

  4. There exists a sequence of gaussian pdfs (un') such that un-un'10 as n;

  5. pnwδb as n;

Then:

  1. sn'wP as n;

  2. un'wδb as n;

  3. The pdf
    pn'=un'τsn'/pun'τsn'/p1
    is well defined and gaussian for n sufficiently large;
  4. pn'wδb as n;

  5. pn-pn'10 as n.

Remark 1.

The L1 distance between pdfs is equivalent to the total variation distance between the respective probability measures.

Remark 2.

We are not content to show the existence of a sequence of gaussian pdfs (pn') that satisfy C4 and C5. Rather, we are trying to show that the specific pn' defined in C3 satisfies C4 and C5 regardless of the choice of un' and sn'.

Remark 3.
An inspection of the proof shows that the pdf
rn'=pn'p/pn'p1=un'τsn'/un'τsn'1
is well defined and gaussian, with rn'wδb and
pn-rn'1An+Bn+Cn,
where the terms An,Bn,Cn are those defined in equation B.2, each of which tends to zero in the limit. Thus pn-rn'10. These rn' are precisely the estimates formed using the robust DKF.
Remark 4.

Suppose the pdfs sn,sn',un,un'(n1), the constant b, and the probability measure P are themselves random, defined on a common probability space, so that pn is well defined with probability one, and suppose that the limits in A1 to A5 hold in probability. Then the probability that pn' is a well-defined, gaussian pdf converges to one, and the limits in C1 to C5 hold in probability.

For the setting of this letter, first fix t1, and note that p is the common pdf of each Zt and τ is the common conditional pdf of Zt given Zt-1. The limit of interest is for increasing dimension (n) of a single observation. To formalize this, we let each Xt be infinite dimensional and consider observing only the first n dimensions, denoted Xt1:nRn. Similarly, X1:t1:n=(X11:n,,Xt1:n). We will abuse notation and use P(Zt=·|W) to denote the conditional pdf of Zt given another random variable W. These conditional pdfs (formally defined via disintegrations) exist under very mild regularity assumptions (Chang & Pollard, 1997). Note that we are in the setting of 6, where the randomness comes from X1:t,Z1:t. With this in mind, define
un(·)=un(·;Xt1:n)=P(Zt=·|Xt1:n)un'(·)=un'(·;Xt1:n)=ηd(·;fn(Xt1:n),Qn(Xt1:n))sn(·)=sn(·;X1:t-11:n)=P(Zt-1=·|X1:t-11:n)(t>1)sn'(·)=sn'(·;X1:t-11:n)=ηd(·;μt-1,n(X1:t-11:n),Σt-1,n(X1:t-11:n))(t>1)pn(·)=pn(·;X1:t1:n)=P(Zt=·|X1:t1:n)pn'(·)=pn'(·;X1:t1:n)=ηd(·;μt,n(X1:t1:n),Σt,n(X1:t1:n))b=ZtP(·)=P(·;Zt-1)=δZt-1(t>1),
and define snsn'Pp when t=0. The pdf un' is our gaussian approximation of the conditional pdf of Zt for a given Xt1:n. We have added the subscript n to f and Q from the main text to emphasize the dependence on the dimensionality of the observations. The pdfs sn' and pn' are our gaussian approximations of Zt-1 and Zt given X1:t-11:n and X1:t1:n, respectively. Again, we added the subscript n to μt and Σt from the text. Note that equation B.1 is simply a condensed version of equation 2.4 in the main text, and, for the same reason, the pn' defined in C3 is the same pn' defined above.

The Bernstein–von Mises (BvM) theorem gives conditions for the existence of functions fn and Qn so that A3 to A4 hold in probability. We refer readers to van der Vaart (1998) for details. Very loosely speaking, the BvM theorem requires Zt to be completely determined in the limit of increasing amounts of data, but not completely determined after observing only a finite amount of data. The simplest case is when Xt1:n are conditionally independent and identically distributed given Zt, and distinct values of Zt give rise to distinct conditional distributions for Xt1:n, but the result holds in much more general settings. A separate application of the BvM theorem gives A5 (in probability). In applying the BvM theorem to obtain A5, we also obtain the existence of a sequence of (random) gaussian pdfs (pn'') such that pn-pn''10 (in probability), but we do not make use of this result, and, as explained in 4, we care about the specific sequence (pn') defined in C3.

As long as the BvM theorem is applicable, the only remaining thing to show is A1 and A2 (in probability). For the case t=1, we have snsn'Pp, so A1 and A2 are trivially true, and the theorem holds. For any case t>1, we note that sn and sn' are simply pn and pn', respectively, for the case t-1. So the conclusions C4 and C5 in the case t-1 become the assumptions A1 and A2 for the subsequent case t. The theorem then holds for all t1 by induction. The key conclusion is C5, which says that our gaussian filter approximation pn' will be close in total variation distance (see 3) to the true Bayesian filter distribution pn with high probability when n is large.

Proof.
C1 follows immediately from A1 and A2. C2 follows immediately from A3 and A4. C3 and C4 are proved in 7 below. To show C5, we first bound
pn-pn'1pn-pnpp(b)1An+pnpp(b)-pnppnp11Bn+pnppnp1-pn'ppn'p11Cn+pn'ppn'p1-pn'pp(b)1Bn'+pn'pp(b)-pn'1An'.
(B.2)
Since pnwδb and p(z) is bounded and continuous,
An=pn|1-pp(b)|=EZnpn|1-p(Zn)p(b)||1-p(b)p(b)|=0
and
Bn=pnppnp1|pnp1p(b)-1|=|pnp1p(b)-1|=|EZnpn|p(Zn)|p(b)-1||p(b)p(b)-1|=0.
Similarly, since pn'wδb,
An'=pn'|1-pp(b)|=EZnpn'|1-p(Zn)p(b)||1-p(b)p(b)|=0
and
Bn'=pn'ppn'p1|pn'p1p(b)-1|=|pn'p1p(b)-1|=|EZnpn'|p(Zn)|p(b)-1||p(b)p(b)-1|=0.
All that remains is to show that Cn0.
We first observe that
pnppnp1=unτsnunτsn1andpn'ppn'p1=un'τsn'un'τsn'1.
Define
α=E(Y,Z)P×δbηd(Z;AY,Γ)=EYPηd(b;AY,Γ)(0,).
Since snwP, unwδb, and (z,y)τ(y,z)=ηd(z;Ay,Γ) is bounded and continuous, we have
unτsn1=ηd(z;Ay,Γ)sn(y)un(z)dydz=E(Yn,Zn)sn×unηd(Zn;AYn,Γ)α.
Similarly, since sn'wP and un'wδb,
un'τsn'1=ηd(z;Ay,Γ)sn'(y)un'(z)dydz=E(Yn,Zn)sn'×un'ηd(Zn;AYn,Γ)α.
Defining β=ηd(0;0,Γ)(0,), gives
τhsupz|(τh)(z)|supz,yηd(z;Ay,Γ)|h(t)|dtηd(0;0,Γ)h1=βh1
for any integrable h. With these facts in mind, we obtain
Cn=unτsnunτsn1-un'τsn'un'τsn'11unτsnunτsn1-un'τsnunτsn11+un'τsnunτsn1-un'τsn'un'τsn'11τsnunτsn1un-un'1+τsnunτsn1-τsn'un'τsn'1un'1βunτsn1un-un'1+τsnunτsn1-τsnun'τsn'1+τsnun'τsn'1-τsn'un'τsn'1βunτsn1un-un'1+τsnunτsn1|1-unτsn1un'τsn'1|+τsn-τsn'un'τsn'1βunτsn1β/αun-un'10+βunτsn1β/α|1-unτsn1un'τsn'1||1-α/α|=0+βun'τsn'1β/αsn-sn'10.
Since α>0, we see that Cn0, and the proof of the theorem is complete.

6 follows from standard arguments by making use of the equivalence between convergence in probability and the existence of a strongly convergent subsequence within each subsequence. The theorem can be applied to each strongly convergent subsequence.

Lemma 1
(DKF equation). If sn'(z)=ηd(z;an,Vn) and un'(z)=ηd(z;bn,Un), then defining
pn'=un'τsn'/pun'τsn'/p1
gives
pn'(z)=ηd(z;cn,Tn),
where Gn=AVnA+Γ, Tn=(Un-1+Gn-1-S-1)-1, and cn=Tn(Un-1bn+Gn-1Aan), as long as Tn is well defined and positive definite. Furthermore, if sn'wP and un'wδb, then pn' is eventually well defined, and pn'wδb.
Proof.
See above for the definition of τ, p, A, Γ, S. Assuming un'τsn'/p is integrable, we have
pn'(z)ηd(z;bn,Un)ηd(z;0,S)ηd(z;Ay,Γ)ηd(y;an,Vn)dy.
Since
ηd(z;Ay,Γ)ηd(y;an,Vn)dy=ηd(z;Aan,AVnA+Γ)=ηd(z;Aan,Gn)
and
ηd(z;bn,Un)ηd(z;0,S)exp(-12(z-bn)Un-1(z-bn))exp(-12zS-1z)exp-12(z(Un-1-S-1)z-2zUn-1bn)exp-12(z-bn')(Un')-1(z-bn')ηd(z,bn',Un')
for Un'=(Un-1-S-1)-1 and bn'=Un'Un-1bn, we have
pn'(z)ηd(z;bn',Un')ηd(z;Aan,Gn)ηd(z;Tn((Un')-1bn'+Gn-1Aan),Tn)=ηd(z;cn,Tn).
As the normal density integrates to 1, the proportionality constant drops out.
Now suppose in addition that sn'wP and un'wδb. Consider the characteristic functions
φsn'(t)=EXsn'[eitX]=eitan-12tVntandφun'(t)=eitbn-12tUnt
for these random variables. Lévy's continuity theorem (theorem 2.13 in van der Vaart, 1998) implies that φsn'(t)φP(t) and φun'(t)φδb(t) for all tRd where
φP(t)=eita-12tVtandφδb(t)=eitb
denote the characteristic functions for P and δb, respectively. Here, a and V are the mean vector and covariance matrix, respectively, of the distribution P, which must itself be gaussian, although possibly degenerate. It follows that
(itan-12tVnt)(ita-12tVt)
and as φsn'(-t)φP(-t),
(-itan-12tVnt)(-ita-12tVt),
so tanta and tVnttVt for all tRd. Choosing t to be coordinate vectors, we see that this implies ana and VnV coordinate-wise. An analogous argument allows us to conclude that bnb and Un0d×d. Thus, GnG=AVA+Γ, which is invertible, since Γ is positive definite, and so Gn-1G-1.
The Woodbury matrix identity gives
Tn=(Un-1+Gn-1-S-1)-1=Un-Un((Gn-1-S-1)-1+Un)-1Un.
(B.3)
Since Un0d×d and ((Gn-1-S-1)-1+Un)-1G-1-S-1, we see that Tn0d×d.
To show Tn is eventually well defined and strictly positive definite, it suffices to show the same for
Tn-1=Un-1+Dn,
where we set Dn=Gn-1-S-1. For a symmetric matrix MRd×d, let λ1(M)λd(M) denote its ordered eigenvalues. As a corollary to Hoffman and Wielandt's result (see corollary 6.3.8 in Horn & Johnson, 2013), it follows that
maxj|λj(Tn-1)-λj(Un-1)|Dn2,
where ·2 denotes the Frobenius norm. Since Dn2G-1-S-12<, the difference between the jth ordered eigenvalues for Tn-1 and Un-1 is upper-bounded independent of n for 1jd. Since Un is positive definite and since Un0d×d, it follows that λj(Un-1)λd(Un-1)=1/λ1(Un). Hence, all eigenvalues of Tn-1 must eventually become positive, so that Tn-1 becomes positive definite, hence also Tn.
For the means, we have
cn=TnUn-1bn+TnGn-1Aan.
Because Tn0d×d, Gn-1G-1, and ana, we have TnGn-1Aan0. Using equation B.3 for Tn gives
TnUn-1bn=bn-Un((Gn-1-S-1)-1+Un)-1bn,
where the eventual boundedness of (Gn-1-S-1)-1+Un)-1 implies
Un((Gn-1-S-1)-1+Un)-1bn0.
As bnb, we conclude cnb. Hence, pn'wδb.

Notes

1

With ht=bt+Htνt, Ct=MtHt, and Nt=Λt+HtMtHt.

Acknowledgments

We thank participant T9 and T9's family, the anonymous reviewers, and E. Crites for their thoughtful feedback on the manuscript; B. Travers and D. Rosler for administrative support; and C. Grant for clinical assistance. This work was supported by the National Institutes of Health: National Institute on Deafness and Other Communication Disorders, NIDCD (R01DC009899), Rehabilitation Research and Development Service, Department of Veterans Affairs (B6453R and N9228C); National Science Foundation (DMS1309004); National Institute of Health (IDeA P20GM103645, R01MH102840); Massachusetts General Hospital (MGH)–Deane Institute for Integrated Research on Atrial Fibrillation and Stroke; Joseph Martin Prize for Basic Research; the Executive Committee on Research of Massachusetts General Hospital; Canadian Institute of Health Research (336092); Killam Trust Award Foundation; and the Brown Institute of Brain Science. The content of this letter is solely our responsibility and does not necessarily represent the official views of the National Institutes of Health, the Department of Veterans Affairs, or the U.S. government.

References

Abbeel
,
P.
,
Coates
,
A.
,
Montemerlo
,
M.
,
Ng
,
A. Y.
, &
Thrun
,
S.
(
2005
).
Discriminative training of Kalman filters
. In
Proceedings of Robotics: Science and Systems
.
Cambridge, MA
:
MIT Press
.
Ajiboye
,
A. B.
,
Willett
,
F. R.
,
Young
,
D. R.
,
Memberg
,
W. D.
,
Murphy
,
B. A.
,
Miller
,
J. P.
, …
Kirsch
,
R. F.
(
2017
).
Restoration of reaching and grasping movements through brain-controlled muscle stimulation in a person with tetraplegia: A proof-of-concept demonstration
.
Lancet
,
389
,
1821
1830
.
Arasaratnam
,
I.
, &
Haykin
,
S.
(
2009
).
Cubature Kalman filters
.
IEEE Trans. Autom. Control
,
54
(
6
),
1254
1269
.
Arasaratnam
,
I.
,
Haykin
,
S.
, &
Elliott
,
R. J.
(
2007
).
Discrete-time nonlinear filtering algorithms using Gauss–Hermite quadrature
.
Proc. IEEE
,
95
(
5
),
953
977
.
Arulampalam
,
M. S.
,
Maskell
,
S.
,
Gordon
,
N.
, &
Clapp
,
T.
(
2002
).
A tutorial on particle filters for online nonlinear/non-gaussian Bayesian tracking
.
IEEE Trans. Signal Process.
,
50
(
2
),
174
188
.
Battin
,
R. H.
, &
Levine
,
G. M.
(
1970
).
Application of Kalman filtering techniques to the Apollo program
. In
C. T.
Leondes
(Ed.),
Theory and applications of Kalman filtering
.
Neuilly sur Seine
:
NATO, Advisory Group for Aerospace Research and Development
.
Beneš
,
V. E.
(
1981
).
Exact finite-dimensional filters for certain diffusions with nonlinear drift
.
Stochastics
,
5
(
1–2
),
65
92
.
Bensmaia
,
S. J.
, &
Miller
,
L. E.
(
2014
).
Restoring sensorimotor function through intracortical interfaces: Progress and looming challenges
.
Nat. Rev. Neurosci.
,
15
(
5
),
313
325
.
Bishop
,
C. H.
,
Etherton
,
B. J.
, &
Majumdar
,
S. J.
(
2001
).
Adaptive sampling with the ensemble transform Kalman filter. Part I: Theoretical aspects
.
Mon. Weather Rev.
,
129
(
3
),
420
436
.
Bouton
,
C. E.
,
Shaikhouni
,
A.
,
Annetta
,
N. V.
,
Bockbrader
,
M. A.
,
Friedenberg
,
D. A.
,
Nielson
,
D. M.
, …
Rezai
,
A. R.
(
2016
).
Restoring cortical control of functional movement in a human with quadriplegia
.
Nature
,
533
,
247
250
.
Brandman
,
D. M.
,
Burkhart
,
M. C.
,
Kelemen
,
J.
,
Franco
,
B.
,
Harrison
,
M. T.
, &
Hochberg
,
L. R.
(
2018
).
Robust closed-loop control of a cursor in a person with tetraplegia using gaussian process regression
.
Neural Comput.
,
30
(
11
),
2986
3008
.
Brandman
,
D. M.
,
Cash
,
S. S.
, &
Hochberg
,
L. R.
(
2017
).
Review: Human intracortical recording and neural decoding for brain-computer interfaces
.
IEEE Trans. Neural Syst. Rehabil. Eng.
,
25
,
1687
1696
.
Brandman
,
D. M.
,
Hosman
,
T.
,
Saab
,
J.
,
Burkhart
,
M. C.
,
Shanahan
,
B. E.
,
Ciancibello
,
J. G.
, …
Hochberg
,
L. R.
(
2018
).
Rapid calibration of an intracortical brain–computer interface for people with tetraplegia
.
J. Neural Eng.
,
15
(
2
),
1
14
.
Brown
,
R. G.
, &
Hwang
,
P. Y. C.
(
2012
).
Introduction to random signals and applied Kalman filtering
, 4th ed.
Hoboken, NJ
:
Wiley
.
Buehner
,
M.
,
McTaggart-Cowan
,
R.
, &
Heilliette
,
S.
(
2017
).
An ensemble Kalman filter for numerical weather prediction based on variational data assimilation: VarEnKF
.
Mon. Weather Rev.
,
145
(
2
),
617
635
.
Burkhart
,
M. C.
(
2019
).
A discriminative approach to Bayesian filtering with applications to human neural decoding
.
PhD diss., Brown University
.
Butler
,
R. W.
(
2007
).
Saddlepoint approximations with applications
.
Cambridge
:
Cambridge University Press
.
Cappé
,
O.
,
Godsill
,
S. J.
, &
Moulines
,
E.
(
2007
).
An overview of existing methods and recent advances in sequential Monte Carlo
.
Proc. IEEE
,
95
(
5
),
899
924
.
Cappé
,
O.
,
Moulines
,
E.
, &
Ryden
,
T.
(
2005
).
Inference in hidden Markov models
.
Berlin
:
Springer-Verlag
.
Castillo
,
E.
,
Guijarro-Berdiñas
,
B.
,
Fontenla-Romero
,
O.
, &
Alonso-Betanzos
,
A.
(
2010
).
A very fast learning method for neural networks based on sensitivity analysis
.
J. Mach. Learn. Res.
,
7
,
1159
1182
.
Cedarbaum
,
J. M.
,
Stambler
,
N.
,
Malta
,
E.
,
Fuller
,
C.
,
Hilt
,
D.
,
Thurmond
,
B.
, &
Nakanishi
,
A.
(
1999
).
The ALSFRS-R: A revised ALS functional rating scale that incorporates assessments of respiratory function
.
J. Neurol. Sci.
,
169
(
1
),
13
21
.
Chang
,
J. T.
, &
Pollard
,
D.
(
1997
).
Conditioning as disintegration
.
Stat. Neerl.
,
51
(
3
),
287
317
.
Chen
,
Z.
(
2003
).
Bayesian filtering: From Kalman filters to particle filters, and beyond
.
Statistics
,
182
(
1
),
1
69
.
Choo
,
K.
, &
Fleet
,
D. J.
(
2001
).
People tracking using hybrid Monte Carlo filtering
. In
Proc. Int. Conf. Comput. Vis.
(
vol. 2
, pp.
321
328
).
Piscataway, NJ
:
IEEE
.
Churchland
,
M. M.
,
Cunningham
,
J. P.
,
Kaufman
,
M. T.
,
Foster
,
J. D.
,
Nuyujukian
,
P.
,
Ryu
,
S. I.
, &
Shenoy
,
K. V.
(
2012
).
Neural population dynamics during reaching
.
Nature
,
487
(
7405
),
1
20
.
Collinger
,
J. L.
,
Wodlinger
,
B.
,
Downey
,
J. E.
,
Wang
,
W.
,
Tyler-Kabara
,
E. C.
,
Weber
,
D. J.
, …
Schwartz
,
A. B.
(
2013
).
High-performance neuroprosthetic control by an individual with tetraplegia
.
Lancet
,
381
(
9866
),
557
564
.
Daum
,
F. E.
(
1984
).
Exact finite dimensional nonlinear filters for continuous time processes with discrete time measurements
. In
Proceedings of the IEEE Conf. Decis. Control
(pp.
16
22
).
Piscataway, NJ
:
IEEE
.
Daum
,
F. E.
(
1986
).
Exact finite-dimensional nonlinear filters
.
IEEE Trans. Autom. Control
,
31
(
7
),
616
622
.
Daum
,
F. E.
, &
Huang
,
J.
(
2003
).
Curse of dimensionality and particle filters
. In
Proceedings of the 2003 IEEE Aerosp. Conf. Proc.
(
vol. 4
).
Piscataway, NJ
:
IEEE
.
sdel Moral
,
P.
(
1996
).
Nonlinear filtering using random particles
.
Theory Probab. Appl.
,
40
(
4
),
690
701
.
Douc
,
R.
, &
Cappé
,
O.
(
2005
).
Comparison of resampling schemes for particle filtering
. In
Proc. Int. Symp. Image and Signal Process. Anal.
(pp.
64
69
).
Piscataway, NJ
:
IEEE
.
Doucet
,
A.
,
Godsill
,
S.
, &
Andrieu
,
C.
(
2000
).
On sequential Monte Carlo sampling methods for Bayesian filtering
.
Stat. Comput.
,
10
(
3
),
197
208
.
Duchi
,
J.
,
Hazan
,
E.
, &
Singer
,
Y.
(
2011
).
Adaptive subgradient methods for online learning and stochastic optimization
.
J. Mach. Learn. Res.
,
12
,
2121
2159
.
Elliott
,
R.
(
1994
).
Exact adaptive filters for Markov chains observed in gaussian noise
.
Automatica
,
30
(
9
),
1399
1408
.
Evensen
,
G.
(
1994
).
Sequential data assimilation with a nonlinear quasi-geostrophic model using Monte Carlo methods to forecast error statistics
.
J. Geophys. Res: Oceans
,
99
,
10143
10162
.
Fitts
,
P. M.
(
1954
).
The information capacity of the human motor system in controlling the amplitude of movement
.
J. Exp. Pyschol.
,
47
(
6
),
381
391
.
Flint
,
R. D.
,
Lindberg
,
E. W.
,
Jordan
,
L. R.
,
Miller
,
L. E.
, &
Slutzky
,
M. W.
(
2012
).
Accurate decoding of reaching movements from field potentials in the absence of spikes
.
J. Neural Eng.
,
9
(
4
),
1
13
.
Foresee
,
F. D.
, &
Hagan
,
M. T.
(
1997
).
Gauss-Newton approximation to Bayesian learning
. In
Proceedings of the Int. Conf. Neural Netw.
(
3:1930
1935
).
Piscataway, NJ
:
IEEE
.
Gelb
,
A.
(
1974
).
Applied optimal estimation
.
Cambridge, MA
:
MIT Press
.
Georgopoulos
,
A. P.
,
Kettner
,
R. E.
, &
Schwartz
,
A. B.
(
1988
).
Primate motor cortex and free arm movements to visual targets in three-dimensional space. II. Coding of the direction of movement by a neuronal population
.
J. Neurosci.
,
8
(
8
),
2928
2937
.
Gerber
,
M.
, &
Chopin
,
N.
(
2015
).
Sequential quasi Monte Carlo
.
J. Roy. Stat. Soc. Ser. B (Stat. Methodol.)
,
77
(
3
),
509
579
.
Gers
,
F. A.
,
Schmidhuber
,
J.
, &
Cummins
,
F.
(
2000
).
Learning to forget: Continual prediction with LSTM
.
Neural Comput.
,
12
(
10
),
2451
2471
.
Ghahramani
,
Z.
, &
Hinton
,
G. E.
(
2000
).
Variational learning for switching state-space models
.
Neural Comput.
,
12
(
4
),
831
864
.
Gilja
,
V.
,
Pandarinath
,
C.
,
Blabe
,
C. H.
,
Nuyujukian
,
P.
,
Simeral
,
J. D.
,
Sarma
,
A. A.
, …
Henderson
,
J. M.
(
2015
).
Clinical translation of a high-performance neural prosthesis
.
Nat. Med.
,
21
(
10
),
1142
1145
.
Glorot
,
X.
, &
Bengio
,
Y.
(
2010
).
Understanding the difficulty of training deep feedforward neural networks
. In
Proceedings of the Int. Conf. Artif. Intell. Stats.
(
9:249
256
).
PMLR
.
Gordon
,
N. J.
,
Salmond
,
D. J.
, &
Smith
,
A. F. M.
(
1993
).
Novel approach to nonlinear/non-gaussian Bayesian state estimation
.
IEE Proc. F—Radar and Signal Process.
,
140
(
2
),
107
113
.
Greff
,
K.
,
Srivastava
,
R. K.
,
Koutník
,
J.
,
Steunebrink
,
B. R.
, &
Schmidhuber
,
J.
(
2016
).
LSTM: A search space odyssey
.
IEEE Trans. Neural Netw. Learn. Syst.
,
28
(
10
),
1
11
.
Grewal
,
M. S.
, &
Andrews
,
A. P.
(
2010
).
Applications of Kalman filtering in aerospace 1960 to the present
.
IEEE Control Syst. Mag.
,
30
(
3
),
69
78
.
Hagan
,
M. T.
, &
Menhaj
,
M. B.
(
1994
).
Training feedforward networks with the Marquardt algorithm
.
IEEE Trans. Neural Netw.
,
5
(
6
),
989
993
.
Hall
,
E. C.
(
1966
).
Case history of the Apollo guidance computer
.
Cambridge, MA
:
MIT Press
.
Handschin
,
J.
(
1970
).
Monte Carlo techniques for prediction and filtering of non-linear stochastic processes
.
Automatica
,
6
(
4
),
555
563
.
Handschin
,
J. E.
, &
Mayne
,
D. Q.
(
1969
).
Monte Carlo techniques to estimate the conditional expectation in multi-stage non-linear filtering
.
Int. J. Control
,
9
(
5
),
547
559
.
Hess
,
R.
, &
Fern
,
A.
(
2009
).
Discriminatively trained particle filters for complex multi-object tracking
. In
Proceedings of Comput. Vis. Pattern Recognit.
(pp.
240
247
).
Piscataways, NJ
:
IEEE
.
Hochberg
,
L. R.
,
Bacher
,
D.
,
Jarosiewicz
,
B.
,
Masse
,
N. Y.
,
Simeral
,
J. D.
,
Vogel
,
J.
, …
Donoghue
,
J. P.
(
2012
).
Reach and grasp by people with tetraplegia using a neurally controlled robotic arm
.
Nature
,
485
(
7398
),
372
375
.
Hochberg
,
L. R.
, &
Donoghue
,
J. P.
(
2006
).
Sensors for brain-computer interfaces
.
IEEE Eng. Med. Biol. Mag.
,
25
(
5
),
32
38
.
Hochreiter
,
S.
, &
Schmidhuber
,
J.
(
1997
).
Long short-term memory
.
Neural Comput.
,
9
(
8
),
1735
1780
.
Horn
,
R. A.
, &
Johnson
,
C. R.
(
2013
).
Matrix analysis
, 2nd ed.
Cambridge
:
Cambridge University Press
.
Hosman
,
T.
,
Vilela
,
M.
,
Milstein
,
D.
,
Kelemen
,
J. N.
,
Brandman
,
D. M.
,
Hochberg
,
L. R.
, &
Simeral
,
J. D.
(
2019
).
BCI decoder performance comparison of an LSTM recurrent neural network and a Kalman filter in retrospective simulation
. In
Proceedings of the Int. IEEE EMBS Conf. Neural Eng.
Piscataway, NJ
:
IEEE
.
Hunt
,
B. R.
,
Kostelich
,
E. J.
, &
Szunyogh
,
I.
(
2007
).
Efficient data assimilation for spatiotemporal chaos: A local ensemble transform Kalman filter
.
Physica D: Nonlinear Phenom.
,
230
(
1
),
112
126
.
Ioffe
,
S.
, &
Szegedy
,
C.
(
2015
).
Batch normalization: Accelerating deep network training by reducing internal covariate shift
. In
F.
Bach
, &
D.
Blei
(Eds.),
Proceedings of the Int. Conf. Mach. Learn.
,
vol. 37
(pp.
448
456
).
PMLR
.
Ito
,
K.
(
2000
).
Gaussian filter for nonlinear filtering problems
. In
Proceedings of the IEEE Conf. Decis. Control
,
vol. 2.
Piscataway, NJ
:
IEEE
.
Ito
,
K.
, &
Xiong
,
K.
(
2000
).
Gaussian filters for nonlinear filtering problems
.
IEEE Trans. Autom. Control
,
45
,
910
927
.
Jarosiewicz
,
B.
,
Masse
,
N. Y.
,
Bacher
,
D.
,
Cash
,
S. S.
,
Eskandar
,
E.
,
Friehs
,
G.
, …
Hochberg
,
L. R.
(
2013
).
Advantages of closed-loop calibration in intracortical brain-computer interfaces for people with tetraplegia
.
J. Neural Eng.
,
10
(
4
),
1
17
.
Jarosiewicz
,
B.
,
Sarma
,
A. A.
,
Bacher
,
D.
,
Masse
,
N. Y.
,
Simeral
,
J. D.
,
Sorice
,
B.
, …
Hochberg
,
L. R.
(
2015
).
Virtual typing by people with tetraplegia using a self-calibrating intracortical brain-computer interface
.
Sci. Transl. Med.
,
7
(
313
),
1
11
.
Jozefowicz
,
R.
,
Zaremba
,
W.
, &
Sutskever
,
I.
(
2015
).
An empirical exploration of recurrent network architectures
. In
F.
Bach
&
D.
Blei
(Eds.),
Proceedings of the Int. Conf. Mach. Learn.
,
vol. 37
(pp.
2342
2350
).
Julier
,
S. J.
, &
Uhlmann
,
J. K.
(
1997
).
New extension of the Kalman filter to nonlinear systems
.
Proc. SPIE
,
3068
,
182
193
.
Kalman
,
R. E.
(
1960
).
A new approach to linear filtering and prediction problems
.
J. Basic Eng.
,
82
(
1
),
35
45
.
Kalman
,
R. E.
, &
Bucy
,
R. S.
(
1961
).
New results in linear filtering and prediction theory
.
J. Basic Eng.
,
83
(
1
),
95
108
.
Kim
,
S.-P.
,
Simeral
,
J. D.
,
Hochberg
,
L. R.
,
Donoghue
,
J. P.
, &
Black
,
M. J.
(
2008
).
Neural control of computer cursor velocity by decoding motor cortical spiking activity in humans with tetraplegia
.
J. Neural Eng.
,
5
(
4
),
455
476
.
Kitagawa
,
G.
(
1996
).
Monte Carlo filter and smoother for non-Gaussian nonlinear state space models
.
J. Comput. Graph. Stat.
,
5
(
1
).
Koyama
,
S.
,
Pérez-Bolde
,
L. C.
,
Shalizi
,
C. R.
, &
Kass
,
R. E.
(
2010
).
Approximate methods for state-space models
.
J. Am. Stat. Assoc.
,
105
(
489
),
170
180
.
Kushner
,
H.
(
1967
).
Approximations to optimal nonlinear filters
.
IEEE Trans. Autom. Control
,
12
(
5
),
546
556
.
Lemon
,
R. N.
(
2008
).
Descending pathways in motor control
.
Annu. Rev. Neurosci.
,
31
,
195
218
.
Levenberg
,
K.
(
1944
).
A method for the solution of certain non-linear problems in least squares
.
Quart. Appl. Math.
,
2
,
164
168
.
Liu
,
J. S.
(
2008
).
Monte Carlo strategies in scientific computing
.
Berlin
:
Springer
.
MacKay
,
D. J. C.
(
1992
).
Bayesian interpolation
.
Neural Comput.
,
4
(
3
),
415
447
.
Majumdar
,
S. J.
,
Bishop
,
C. H.
,
Etherton
,
B. J.
, &
Toth
,
Z.
(
2002
).
Adaptive sampling with the ensemble transform Kalman filter. Part II: Field program implementation
.
Mon. Weather Rev.
,
130
(
5
),
1356
1369
.
Malik
,
W. Q.
,
Hochberg
,
L. R.
,
Donoghue
,
J. P.
,
Hochberg
,
L. R.
,
Donoghue
,
J. P.
, &
Brown
,
E. N.
(
2015
).
Modulation depth estimation and variable selection in state-space models for neural interfaces
.
IEEE Trans. Biomed. Eng.
,
62
(
2
),
570
581
.
Marquardt
,
D. W.
(
1963
).
An algorithm for least-squares estimation of nonlinear parameters
.
J. Soc. Indust. Appl. Math.
,
11
,
431
441
.
Masse
,
N. Y.
,
Jarosiewicz
,
B.
,
Simeral
,
J. D.
,
Bacher
,
D.
,
Stavisky
,
S. D.
,
Cash
,
S. S.
, …
Donoghue
,
J. P.
(
2015
).
Non-causal spike filtering improves decoding of movement intention for intracortical BCIs
.
J. Neurosci. Methods
,
244
,
94
103
.
Maynard
,
E. M.
,
Nordhausen
,
C. T.
, &
Normann
,
R. A.
(
1997
).
The Utah intracortical electrode array: A recording structure for potential brain-computer interfaces
.
Electroencephalogr. Clin. Neurophysiol.
,
102
(
3
),
228
239
.
Metropolis
,
N.
, &
Ulam
,
S.
(
1949
).
The Monte Carlo method
.
J. Am. Stat. Assoc.
,
44
(
247
),
335
341
.
Minka
,
T. P.
(
2001a
).
Expectation propagation for approximate Bayesian inference
.
Proceedings of the Conf. Uncertain. Artif. Intell.
San Mateo, CA
:
Morgan Kaufmann
.
Minka
,
T. P.
(
2001b
).
A family of algorithms for approximate Bayesian inference
.
PhD diss., MIT
.
Nadaraya
,
E. A.
(
1964
).
On a regression estimate
.
Teor. Verojatnost. i Primenen.
,
9
,
157
159
.
Nørgaard
,
M.
,
Poulsen
,
N. K.
, &
Ravn
,
O.
(
2000
).
New developments in state estimation for nonlinear systems
.
Automatica
,
36
(
11
),
1627
1638
.
Nuyujukian
,
P.
,
Albites Sanabria
,
J.
,
Saab
,
J.
,
Pandarinath
,
C.
,
Jarosiewicz
,
B.
,
Blabe
,
C. H.
, …
Henderson
,
J. M.
(
2018
).
Cortical control of a tablet computer by people with paralysis
.
PLOS One
,
13
(
11
).
Ott
,
E.
,
Hunt
,
B. R.
,
Szunyogh
,
I.
,
Zimin
,
A. V.
,
Kostelich
,
E. J.
,
Corazza
,
M.
, …
Yorke
,
J. A.
(
2004
).
A local ensemble Kalman filter for atmospheric data assimilation
.
Tellus A
,
56
(
5
),
415
428
.
Pandarinath
,
C.
,
Gilja
,
V.
,
Blabe
,
C. H.
,
Nuyujukian
,
P.
,
Sarma
,
A. A.
,
Sorice
,
B. L.
, …
Shenoy
,
K. V.
(
2015
).
Neural population dynamics in human motor cortex during movements in people with ALS
.
eLife
,
4
.
Pandarinath
,
C.
,
Nuyujukian
,
P.
,
Blabe
,
C. H.
,
Sorice
,
B. L.
,
Saab
,
J.
,
Willett
,
F.
, …
Henderson
,
J. M.
(
2017
).
High performance communication by people with paralysis using an intracortical brain-computer interface
.
eLife
, pp.
1
27
.
Pandarinath
,
C.
,
O'Shea
,
D. J.
,
Collins
,
J.
,
Jozefowicz
,
R.
,
Stavisky
,
S. D.
,
Kao
,
J. C.
, …
Sussillo
,
D.
(
2018
).
Inferring single-trial neural population dynamics using sequential auto-encoders
.
Nat. Methods
,
15
(
10
),
805
815
.
Paninski
,
L.
,
Fellows
,
M. R.
,
Hatsopoulos
,
N. G.
, &
Donoghue
,
J. P.
(
2004
).
Spatiotemporal tuning of motor cortical neurons for hand position and velocity spatiotemporal tuning of motor cortical neurons for hand position and velocity
.
J. Clin. Neurophysiol.
,
91
,
515
532
.
Pham
,
V.
,
Bluche
,
T.
,
Kermorvant
,
C.
, &
Louradour
,
J.
(
2014
).
Dropout improves recurrent neural networks for handwriting recognition
. In
Proceedings of the Int. Conf. Front. Handwriting Recognit.
(pp.
285
290
).
Piscataway, NJ
:
IEEE
.
Pohlmeyer
,
E.
,
Solla
,
S.
,
Perreault
,
E. J.
, &
Miller
,
L. E.
(
2007
).
Prediction of upper limb muscle activity from motor cortical discharge during reaching
.
J. Neural Eng.
,
4
,
369
379
.
Quang
,
P. B.
,
Musso
,
C.
, &
Le Gland
,
F.
(
2015
).
The Kalman Laplace filter: A new deterministic algorithm for nonlinear Bayesian filtering
. In
Proceedings of the Intern. Conf. Inf. Fusion
(pp.
1566
1573
).
Piscataway, NJ
:
IEEE
.
Quiñonero Candela
,
J.
, &
Rasmussen
,
C. E.
(
2005
).
A unifying view of sparse approximate gaussian process regression
.
J. Mach. Learn. Res.
,
6
,
1939
1959
.
Rao
,
N. G.
, &
Donoghue
,
J. P.
(
2014
).
Cue to action processing in motor cortex populations
.
J. Neurophysiol.
,
111
(
2
),
441
453
.
Rasmussen
,
C. E.
, &
Nickisch
,
H.
(
2010
).
Gaussian processes for machine learning (GPML) toolbox
.
J. Mach. Learn. Res.
,
11
,
3011
3015
.
Rasmussen
,
C. E.
, &
Williams
,
C. K. I.
(
2006
).
Gaussian processes for machine learning
.
Cambridge, MA
:
MIT Press
.
Real
,
E.
,
Moore
,
S.
,
Selle
,
A.
,
Saxena
,
S.
,
Suematsu
,
Y. L.
,
Le
,
Q.
, &
Kurakin
,
A.
(
2017
).
Large-scale evolution of image classifiers
. In
Proceedings of the Int. Conf. Mach. Learn.
PMLR
.
Särkkä
,
S.
(
2013
).
Bayesian filtering and smoothing
.
Cambridge
:
Cambridge University Press
.
Schmidhuber
,
J.
(
2015
).
Deep learning in neural networks: An overview
.
Neural Netw.
,
61
,
85
117
.
Schmidt
,
S. F.
,
Weinberg
,
J. D.
, &
Lukesh
,
J. S.
(
1970
).
Application of Kalman filtering to the C-5 guidance and control system
. In
C. T.
Leondes
(Ed.),
Theory and applications of Kalman filtering
.
Neuilly sur Seine, NATO, Advisory Group for Aerospace Research and Development
.
Schwartz
,
A. B.
(
1994
).
Direct cortical representation of drawing
.
Science
,
265
(
5171
),
540
542
.
Shumway
,
R. H.
, &
Stoffer
,
D. S.
(
1991
).
Dynamic linear models with switching
.
J. Am. Stat. Assoc.
,
86
(
415
),
763
769
.
Simeral
,
J. D.
,
Kim
,
S.-P.
,
Black
,
M. J.
,
Donoghue
,
J. P.
, &
Hochberg
,
L. R.
(
2011
).
Neural control of cursor trajectory and click by a human with tetraplegia 1000 days after implant of an intracortical microelectrode array
.
J. Neural Eng.
,
8
(
2
),
1
21
.
Srivastava
,
N.
,
Hinton
,
G.
,
Krizhevsky
,
A.
,
Sutskever
,
I.
, &
Salakhutdinov
,
R.
(
2014
).
Dropout: A simple way to prevent neural networks from overfitting
.
J. Mach. Learn. Res.
,
15
,
1929
1958
.
Stevenson
,
I. H.
, &
Kording
,
K. P.
(
2011
).
How advances in neural recording affect data analysis
.
Nat. Neurosci.
,
14
(
2
),
139
142
.
Sugiyama
,
M.
,
Suzuki
,
T.
, &
Kanamori
,
T.
(
2012
).
Density ratio estimation in machine learning
.
Cambridge
:
Cambridge University Press
.
Sussillo
,
D.
,
Nuyujukian
,
P.
,
Fan
,
J. M.
,
Kao
,
J. C.
,
Stavisky
,
S. D.
,
Ryu
,
S.
, &
Shenoy
,
K.
(
2012
).
A recurrent neural network for closed-loop intracortical brain–machine interface decoders
.
J. Neural Eng.
,
9
(
2
),
1
21
.
Sussillo
,
D.
,
Stavisky
,
S. D.
,
Kao
,
J. C.
,
Ryu
,
S. I.
, &
Shenoy
,
K. V.
(
2016
).
Making brain–machine interfaces robust to future neural variability
.
Nat. Commun.
,
7
,
1
12
.
van der Merwe
,
R.
(
2004
).
Sigma-point Kalman filters for probabilistic inference in dynamic state-space models
.
PhD diss., Oregon Health and Science University
.
van der Vaart
,
A. W.
(
1998
).
Asymptotic statistics
.
Cambridge
:
Cambridge University Press
.
Vargas-Irwin
,
C. E.
,
Brandman
,
D. M.
,
Zimmermann
,
J. B.
,
Donoghue
,
J. P.
, &
Black
,
M. J.
(
2015
).
Spike train SIMilarity space (SSIMS): A framework for single neuron and ensemble data analysis
.
Neural Comput.
,
27
(
1
),
1
31
.
Vargas-Irwin
,
C. E.
,
Shakhnarovich
,
G.
,
Yadollahpour
,
P.
,
Mislow
,
J. M. K.
,
Black
,
M. J.
, &
Donoghue
,
J. P.
(
2010
).
Decoding complete reach and grasp actions from local primary motor cortex populations
.
J. Neurosci.
,
30
(
29
),
9659
9669
.
Velliste
,
M.
,
Perel
,
S.
,
Spalding
,
M. C.
,
Whitford
,
A. S.
, &
Schwartz
,
A. B.
(
2008
).
Cortical control of a prosthetic arm for self-feeding
.
Nature
,
453
(
7198
),
1098
101
.
Walker
,
B.
, &
Kording
,
K.
(
2013
).
The database for reaching experiments and models
.
PLOS One
,
8
(
11
).
Wan
,
E. A.
, &
van der Merwe
,
R.
(
2000
).
The unscented Kalman filter for nonlinear estimation
. In
Proceedings of the Adaptive Syst. for Signal Process., Commun., and Control Symp.
(pp.
153
158
).
Washington, DC
:
Society for Neuroscience
.
Watson
,
G. S.
(
1964
).
Smooth regression analysis
.
Sankhyā Ser. A
,
26
,
359
372
.
Willett
,
F. R.
,
Young
,
D. R.
,
Murphy
,
B. A.
,
Memberg
,
W. D.
,
Blabe
,
C. H.
,
Pandarinath
,
C.
, …
Bolu Ajiboye
,
A.
(
2019
).
Principled BCI decoder design and parameter selection using a feedback control model
.
Sci. Rep.
,
9
(
8881
).
Wodlinger
,
B.
,
Downey
,
J. E.
,
Tyler-Kabara
,
E. C.
,
Schwartz
,
A. B.
,
Boninger
,
M. L.
, &
Collinger
,
J. L.
(
2015
).
Ten-dimensional anthropomorphic arm control in a human brain machine interface: Difficulties, solutions, and limitations
.
J. Neural Eng.
,
12
(
1
),
1
17
.
Wolpaw
,
J. R.
,
Birbaumer
,
N.
,
McFarland
,
D. J.
,
Pfurtscheller
,
G.
, &
Vaughan
,
T. M.
(
2002
).
Brain-computer interfaces for communication and control
.
Clin. Neurophysiol.
,
113
(
6
),
767
791
.
Wu
,
W.
,
Black
,
M. J.
,
Gao
,
Y.
,
Bienenstock
,
E.
,
Serruya
,
M.
, &
Donoghue
,
J. P.
(
2002
).
Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter
. In
SAB'02-Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices
(pp.
66
73
).
Washington, DC
:
Society for Neuroscience
.
Zaremba
,
W.
,
Sutskever
,
I.
, &
Vinyals
,
O.
(
2014
).
Recurrent neural network regularization
.
arXiv:1409.2329
.
Zeiler
,
M. D.
(
2012
).
Adadelta: An adaptive learning rate method
.
arXiv:1212.5701
.
Zoph
,
B.
, &
Le
,
Q. V.
(
2017
).
Neural architecture search with reinforcement learning
. In
Proceedings of the Int. Conf. Learn. Represent.
ICLR
.

Author notes

B. Franco is now at NeuroPace, Mountain View, CA, U.S.A.