## Abstract

In this work, a variational Bayesian framework for efficient training of echo state networks (ESNs) with automatic regularization and delay&sum (D&S) readout adaptation is proposed. The algorithm builds on classical batch learning of ESNs. By treating the network echo states as fixed basis functions parameterized with delay parameters, we propose a variational Bayesian ESN training scheme. The variational approach allows for a seamless combination of sparse Bayesian learning ideas and a variational Bayesian space-alternating generalized expectation-maximization (VB-SAGE) algorithm for estimating parameters of superimposed signals. While the former method realizes automatic regularization of ESNs, which also determines which echo states and input signals are relevant for “explaining” the desired signal, the latter method provides a basis for joint estimation of D&S readout parameters. The proposed training algorithm can naturally be extended to ESNs with fixed filter neurons. It also generalizes the recently proposed expectation-maximization-based D&S readout adaptation method. The proposed algorithm was tested on synthetic data prediction tasks as well as on dynamic handwritten character recognition.

## 1.  Introduction

Echo state networks (ESNs) and reservoir computing in general represent a powerful class of recurrent neural networks (Jaeger, Maass, & Principe, 2007; Verstraeten, Schrauwen, & Stroobandt, 2007); they are particularly useful for nonparametric modeling of nonlinear dynamical systems. Due to a very simple training procedure, ESNs have found applications in many areas of signal processing, including speech recognition and audio processing, system modeling and prediction, and filtering (Han & Wang, 2009; Verstraeten, Schrauwen, & Stroobandt, 2006; Xia, Mandic, Hulle, & Principe, 2008; Holzmann & Hauser, 2010), to name just a few.

A typical ESN allows for learning a nonlinear dependence between an M-dimensional input signal u[n] and a P-dimensional output signal y[n] of a nonlinear dynamical system characterized by a nonlinear difference equation y[n] = g(y[n − 1], …, y[n − k], …, u[n], …, u[n − l], …), where the mapping g(·) is typically unknown. The goal of ESN-based modeling is to approximate this mapping by creating a random network of interconnected neurons, called a reservoir, and linearly combining the reservoir outputs and network input signals to form the desired network response. The operation of an ESN with L neurons can be formally described by a system of two equations:
x[n] = f(Cx x[n − 1] + Cu u[n] + Cy y[n − 1]),        (1.1)

y[n] = W[x[n]T, u[n]T]T.        (1.2)
Equation 1.1 is the state equation of the ESN; it specifies how the responses of L neurons x[n] = [x1[n], …, xL[n]]T evolve over time. In equation 1.1, the function f(·) is a vector-valued neuron activation function, such as a hyperbolic tangent, applied to each element of its argument. The matrices Cx, Cu, and Cy are, respectively, the neuron interconnection weights, input signal weights, and output feedback weights. Typically the entries of these matrices are generated and fixed during the network design stage (Jaeger, 2001; Lukoševičius & Jaeger, 2009).

Equation 1.2 is the output equation of the network. It states that the output of the network is formed as a linear combination of network states x[n] and network inputs u[n].1 Under certain conditions (Jaeger, 2001), a sequence of neuron states x[n] forms echoes—a temporal basis used for reconstructing the output signal y[n]. The dynamics of the training data are thus encoded in the echoes generated by the reservoir. The ESN training then reduces to finding an optimal estimate of W so as to minimize the squared distance between the network output and the desired network response. Notice that since W enters equation 1.2 linearly, the ESN training requires solving a system of linear equations.
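As a concrete illustration of this training procedure, the following minimal numpy sketch generates echo states with a randomly initialized reservoir (no output feedback, for simplicity) and fits the readout weights with a pseudoinverse. All variable names, sizes, and signal choices are illustrative assumptions, not taken from the letter.

```python
import numpy as np

rng = np.random.default_rng(0)
L, M, N = 50, 1, 300          # neurons, inputs, training samples

# Fixed random reservoir matrices; the spectral radius of the
# interconnection matrix is scaled below 1, a common sufficient
# heuristic for the echo state property (no output feedback here).
C_x = rng.normal(size=(L, L))
C_x *= 0.9 / max(abs(np.linalg.eigvals(C_x)))
C_u = rng.normal(size=(L, M))

u = rng.normal(size=(N, M))                  # input signal u[n]
y = np.sin(np.cumsum(0.1 * u[:, 0]))         # an illustrative target y[n]

# State equation: x[n] = f(C_x x[n-1] + C_u u[n]), with f = tanh
x = np.zeros(L)
X = np.empty((N, L))
for n in range(N):
    x = np.tanh(C_x @ x + C_u @ u[n])
    X[n] = x

# The output equation is linear in the readout weights, so batch
# training reduces to a linear least-squares problem over [X, U].
Phi = np.hstack([X, u])
w = np.linalg.pinv(Phi) @ y                  # Moore-Penrose solution
mse = np.mean((Phi @ w - y) ** 2)
```

One forward pass to collect states and one linear solve constitute the entire classical batch training loop; the regularization issues discussed next arise in the linear solve.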

Despite the simple learning procedure, a straightforward application of ESNs has two practical shortcomings: an estimation of W, especially for large ESNs with many neurons, requires regularization (Jaeger, 2001), and simple ESNs have been shown to fail for certain learning problems, for example, they cannot learn multiple attractors at the same time (Jaeger, 2007). Regularization has been extensively studied in the literature within the context of ill-posed problems; the Moore-Penrose inverse (Golub & Van Loan, 1996) and ridge regression (Bishop, 2006), also known as Tikhonov regularization, are standard approaches to finding a regularized solution to the linear least-squares (LS) estimation problem. The universality of ESNs can be significantly boosted by introducing filter neurons and delay&sum (D&S) readouts in the ESN structure (Holzmann & Hauser, 2010; Wustlich & Siewert, 2007; Zechner & Shutin, 2010). Equipping neurons with additional filters will result in neurons that are specialized to more relevant frequency bands. This is achieved by applying a linear time-invariant filter to the output of the neuron activation function in equation 1.1. Introducing delays makes it possible to shift the reservoir signals in time and provides a computationally inexpensive method to vastly improve the memory capacity of the network. The parameters of such filter neurons and the corresponding readout delays can be chosen randomly during the network initialization or heuristically through trial and error (Wustlich & Siewert, 2007; Holzmann & Hauser, 2010).

The regularization of ESN LS-based training, on the one hand, and optimization of D&S readout parameters and filter neurons, on the other hand, are typically two unconnected optimization steps. Motivated by the lack of a formal optimization framework that combines both regularization and ESN parameter adaptation, and inspired by the recent developments of the variational Bayesian methods (Bishop, 2006; Beal, 2003) for sparse Bayesian learning (SBL) (Shutin, Buchgraber, Kulkarni, & Poor, 2011b; Seeger & Wipf, 2010; Tzikas, Likas, & Galatsanos, 2008; Tipping, 2001; Bishop & Tipping, 2000) and variational nonlinear parameter estimation (Shutin & Fleury, 2011), we propose a variational Bayesian ESN training framework. In the new framework, the ESN training is formulated as a variational Bayesian inference problem on a directed acyclic graph (DAG) (Bishop, 2006). Specifically, the unknown network parameters are jointly estimated by minimizing the Kullback-Leibler divergence between the true posterior probability density function (pdf) of the network parameters and a variational approximation to this posterior. The estimation of the output coefficients W and regularization parameters is realized using ideas inspired by a variational SBL approach (Bishop & Tipping, 2000). This not only allows an automatic regularization of the network but also provides quantitative information about the relative relevance or importance of individual neurons and network input signals. The estimation of D&S readout parameters is implemented using the variational Bayesian space-alternating generalized expectation-maximization (VB-SAGE) algorithm, which was originally proposed for the variational estimation of superimposed signal parameters (Shutin & Fleury, 2011). 
The VB-SAGE framework allows a monotonic decrease of the Kullback-Leibler divergence between the two pdfs with respect to only a subset of the parameters of interest using latent variables, also called admissible hidden data—an analog of the complete data in the expectation-maximization (EM) framework (Bishop, 2006). We demonstrate that latent variables reduce the complexity of the objective function for estimating delay parameters of a single neuron, which leads to a more efficient numerical optimization.

Previously we have considered the application of the VB-SAGE algorithm within the ESN training framework (Zechner & Shutin, 2010). However, in Zechner and Shutin (2010), the automatic regularization was not part of the estimation scheme. Also, the variational approximation used in Zechner and Shutin (2010) assumes a statistical independence between the elements of W. Although this assumption significantly simplifies the variational inference of the ESN weight coefficients, it leads to poorer performance by the trained models and does not generalize the classical pseudoinverse-based ESN training. Here we do not impose any independence assumptions on the elements of W. We demonstrate that the proposed variational ESN training framework generalizes the existing techniques for ESN training. In particular, the Tikhonov-like regularization of ESNs (Jaeger, 2001) and EM-based estimation of D&S readout parameters (Holzmann & Hauser, 2010) are obtained as special cases of the proposed variational Bayesian ESN training. Moreover, the proposed algorithm automatically regularizes the obtained solution by taking into account the training data and the amount of additive noise.

The rest of the letter is organized as follows. In section 2, we discuss the extended ESN model and explain the variables involved; in section 3, we formulate the probabilistic model and discuss the variational inference of model parameters; in section 4, we discuss the implementation and initialization of the learning algorithm. Finally, in section 5, we consider several learning examples to demonstrate the performance of the proposed scheme.

Throughout the letter, vectors are represented as boldface lowercase letters (e.g., x) and matrices as boldface uppercase letters (e.g., X). For vectors and matrices, (·)T denotes the transpose. Notation A = [X, Y] is used to denote a matrix A obtained by concatenating matrices X and Y; it is assumed that X and Y have the same number of rows. Sets are represented as calligraphic uppercase letters. We use ℒ = {1, …, L} to denote an index set of L neurons. With a slight abuse of notation, we write x𝒜 to denote a set of random variables {xa : a ∈ 𝒜}; also, for l ∈ ℒ, l̄ denotes the set ℒ∖{l}. Two types of proportionality are used: x ∝ y denotes x = αy, and x ∝e y denotes ex = eβey, and thus x = β + y, for arbitrary constants α and β. We use ⟨f(x)⟩q(x) to denote the expectation of a function f(x) with respect to a probability density q(x). Finally, N(x|a, B) denotes a multivariate gaussian pdf with a mean a and a covariance matrix B; Ga(x|a, b) = baxa−1exp(−bx)/Γ(a) denotes a gamma pdf with parameters a and b.

## 2.  Extended ESN Model

Consider a standard batch ESN learning problem with N training samples and N echo state samples generated with an untrained network. The time index n0 ≥ 0 is chosen so as to make sure that the ESN transients due to the initialization of the network fade out. For simplicity, we restrict ourselves to a scalar output signal y[n]. Considering a general P-dimensional output signal merely leads to a more complicated probabilistic signal model without adding any new aspects relevant to the understanding of the new proposed concepts and methods.2

Let xl(τl) = [xl[n0 − τl], …, xl[n0 + N − 1 − τl]]T denote a vector of echo state samples xl[n] of the lth neuron delayed by τl. We will collect these vectors in an N × L matrix X(τ) = [x1(τ1), …, xL(τL)], where τ = [τ1, …, τL]T. In order to ensure the causality of the ESN with D&S readouts, we assume that xl[n0 + i − τl] = 0 when n0 + i − τl < 0, for any τl ≥ 0 and i = 0, …, N − 1. Similarly, we collect N samples of the mth input signal um[n] in a vector um = [um[n0], …, um[n0 + N − 1]]T and define an N × M matrix U = [u1, …, uM]. Now, the output equation of an ESN with D&S readout can be rewritten in the following form:
y = Φ(τ)w + ξ,        (2.1)
where y = [y[n0], …, y[n0 + N − 1]]T is a desired output of the network that is represented as a linear combination of column vectors in Φ(τ) = [X(τ), U] perturbed by a random vector ξ = [ξ[n0], …, ξ[n0 + N − 1]]T. This perturbation models a random error between the predicted network response Φ(τ)w and the desired response y. We will assume that each element of ξ is drawn independently from a zero mean gaussian distribution with variance σ². We also point out that for the scalar output y[n], the output weight matrix W in equation 1.2 reduces to a vector w.
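The construction of the delayed echo-state matrix X(τ) with the causal zero-padding convention above can be sketched as follows; this is a hypothetical helper, assuming n0 = 0 so that samples from before the start of the record are simply unavailable and replaced by zeros.

```python
import numpy as np

def delayed_states(X, tau):
    """Build X(tau) = [x_1(tau_1), ..., x_L(tau_L)] from raw echo states.

    X is an (N, L) array whose column l holds x_l[n] over the training
    window; column l is delayed by tau[l] samples, and entries that would
    require samples from before the start of the window are set to zero,
    mirroring the causality convention x_l[n0 + i - tau_l] = 0 for
    n0 + i - tau_l < 0 (here with n0 = 0).
    """
    N, L = X.shape
    out = np.zeros_like(X)
    for l in range(L):
        d = int(tau[l])
        if d == 0:
            out[:, l] = X[:, l]
        elif d < N:
            out[d:, l] = X[:N - d, l]
        # d >= N leaves the column all zeros
    return out
```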
Observe that delays τ do not influence the generation of echo states xl[n]; they simply shift the signals xl[n] before they are linearly combined to form the network output, leading to the D&S readout terminology. When filter neurons are employed in the reservoir, the generation of echo states becomes dependent on the parameters of the neuron filters. In this case, the output of the lth neuron activation function is computed as x̃l[n] = f(cxlT x[n − 1] + culT u[n] + cylT y[n − 1]), where cul, cxl, and cyl are the lth column vectors from the matrices Cu, Cx, and Cy, respectively. The filtered echo state, that is, the filter neuron output signal, is then computed as

xl[n] = hl[n] ∗ x̃l[n] = ∑k hl[k] x̃l[n − k],

where hl[n] is an impulse response of a stable linear time-invariant filter, for example, a bandpass filter.3 Typically it is assumed that transfer functions of all neuron filters hl[n], l = 1, …, L, are fixed at the network design stage (Holzmann & Hauser, 2010; Wustlich & Siewert, 2007). This is essentially a simplifying assumption; adapting filter parameters is complicated due to a recurrent interdependency of the neurons in the network. Although it is possible to construct an algorithm to estimate neuron filters hl[n] (Zechner & Shutin, 2010), there are no theoretical convergence or monotonicity guarantees for this learning scheme. Henceforth, we assume that the parameters of neuron filters are fixed at the design stage and the adaptation of the filter neuron parameters is left outside the scope of this letter. For all our experiments in section 5, we assume ESNs without filter neurons, which are obtained by choosing hl[n] = δ[n], where δ[n] is a discrete-time unit impulse.4
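Viewed in isolation, a fixed filter neuron amounts to convolving the activation output with a fixed impulse response. The sketch below shows this for one neuron, deliberately ignoring the recurrent feedback of filtered states into the reservoir (the complication the paragraph above refers to); the function name is illustrative.

```python
import numpy as np

def filter_neuron_output(x_act, h):
    """Convolve the activation output of one neuron with a causal FIR
    impulse response h and truncate to the original length; h = [1]
    (a discrete-time unit impulse) recovers the plain, unfiltered neuron."""
    return np.convolve(x_act, h)[: len(x_act)]
```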

In a batch learning regime, the columns of the matrix Φ(τ) corresponding to the generated echo states can be interpreted as parametric basis functions, parameterized by parameters τ. In what follows, we explain how this can be exploited to formulate a variational Bayesian framework to jointly estimate the D&S readout parameters and train the network.

## 3.  Bayesian ESN Learning

We first note that ESN training is equivalent to the maximization of the log-likelihood function:
(ŵ, τ̂) = argmaxw,τ log p(y|w, τ) = argmaxw,τ log N(y|Φ(τ)w, σ²I).        (3.1)
Notice that even when parameters τ are assumed to be fixed, the estimation of w from equation 3.1 typically requires a regularization (Lukoševičius & Jaeger, 2009; Jaeger, 2001). Bayesian methods introduce regularization by imposing constraints on the model parameters using priors. Consider a prior pdf p(τ, w|α), where α is a vector of prior parameters. This prior leads to a posterior pdf that in the log domain can be expressed as
log p(τ, w|y, α) ∝e log p(y|τ, w) + log p(τ, w|α),        (3.2)
where log p(τ, w|α) performs the role of a regularizing function. Depending on the choice of p(τ, w|α), different forms of the regularizing function can be constructed. Henceforth, we will assume that the prior p(τ, w|α) factors as
p(τ, w|α) = p(τ)p(w|α).        (3.3)
The motivation behind this assumption is the following. Through the prior p(w|α), we can control the contribution of individual basis functions in Φ(τ) irrespective of their form, which is specified by the parameters τ. The prior p(w|α) is assumed to fully factor as p(w|α) = ∏_{k=1}^K p(wk|αk), where α = [α1, …, αK]T, K = L + M, and p(wk|αk) is selected as a zero mean symmetric pdf with the prior parameter αk inversely proportional to the width of p(wk|αk). Such factorization of the prior enables more flexible control over the importance of each column in Φ(τ) through the coefficients α: a large value of αk drives the posterior mean of the corresponding weight wk toward zero, thus effectively suppressing the corresponding basis function in Φ(τ) and leading to a regularized solution. Such a formulation of the prior is related to sparse Bayesian learning (SBL) (Shutin et al., 2011b; Tzikas et al., 2008; Tipping, 2001; Bishop & Tipping, 2000). In our work, we will select p(wk|αk) as a gaussian pdf with zero mean and variance αk−1. This choice corresponds to a penalty function ∑k αk|wk|² in equation 3.2, which is a weighted ℓ2 norm of the weight vector w (Bishop, 2006).5 This form of the penalty leads to a Tikhonov-like regularization of the original estimation problem, equation 3.1, with parameters α acting as regularization parameters.6 Additionally, the gaussian prior p(w|α) and the gaussian likelihood of w in equation 3.1 form a conjugate family (Bishop, 2006), which allows computation of the posterior distribution of w in closed form.
Within the SBL framework, the parameters α are then determined from
α̂ = argmaxα p(y|α) = argmaxα ∫∫ p(y|τ, w)p(τ)p(w|α) dτ dw,        (3.4)
which is also known as the marginal likelihood function, or evidence (Tipping, 2001; Tipping & Faul, 2003). Unfortunately, the nonlinear dependence of the integrand in equation 3.4 on the parameters τ precludes the exact evaluation of the marginal p(y|α). Additionally, this nonlinear dependency significantly complicates the optimization of the posterior, equation 3.2. This motivates the use of approximation techniques to estimate the parameters W, τ, and α of the extended ESN model.

### 3.1.  Variational Bayesian Inference.

Note that the joint estimation of ESN parameters W, τ and regularization parameters α is equivalent to the maximization of the posterior pdf,
(τ̂, ŵ, α̂) = argmaxτ,w,α p(τ, w, α|y) = argmaxτ,w,α p(y|τ, w)p(τ)p(w|α)p(α),        (3.5)
which involves equations 3.2 and 3.4 and the prior p(α). Instead of computing equation 3.5 directly, we approximate it with a proxy pdf q(τ, w, α) using variational Bayesian inference methods (Beal, 2003; Bishop, 2006).
Variational inference is realized by maximizing the lower bound on the marginal log-likelihood log p(y),
log p(y) ≥ ∫ q(τ, w, α) log [ p(τ, w, α, y) / q(τ, w, α) ] dτ dw dα,        (3.6)
with respect to q(τ, w, α). It is known (Beal, 2003; Bishop, 2006) that the density q(τ, w, α) that maximizes the lower bound in equation 3.6 also minimizes the Kullback-Leibler divergence between q(τ, w, α) and the often intractable true posterior pdf p(τ, w, α|y).
Observe that optimizing the lower bound in equation 3.6 requires specifying both the approximating pdf and the joint pdf. Using equations 3.1 and 3.3, it is easy to conclude that the joint pdf p(τ, w, α, y) can be represented as
p(τ, w, α, y) = p(y|τ, w)p(τ)p(w|α)p(α).        (3.7)
A DAG in Figure 1a captures this factorization using a graphical model.
Figure 1:

(a) A graphical model for estimating τ, W, and α. (b) A graphical model with the admissible hidden data hl for estimating the delay parameter τl of the lth neuron.


Based on the Bayesian ESN model discussed earlier in this section, it is easy to conclude that p(y|τ, w) = N(y|Φ(τ)w, σ²I) and p(w|α) = N(w|0, A−1), where A = diag(α1, …, αK). The choice of priors p(τ) and p(α) is arbitrary in general. We will, however, assume that both priors factor as p(τ) = ∏_{l=1}^L p(τl) and p(α) = ∏_{k=1}^K p(αk). The choice of p(τl) is arbitrary in the context of our work. As we will show later, any desired form of p(τl) can be used in the algorithm. The prior p(αk), also called a hyperprior, is selected as a gamma pdf, that is, p(αk) = Ga(αk|ak, bk), with the prior parameters ak and bk chosen so as to ensure the desired form of the prior. Practically, we will select ak = bk = 0 to render this prior noninformative (Tipping, 2001). Such formulation of the hyperprior is related to automatic relevance determination (ARD) (Neal, 1996; MacKay, 1994). We stress that the ARD formulation of the hyperprior distribution also leads to a number of very efficient inference algorithms (Shutin et al., 2011b; Tipping & Faul, 2003).
The approximating pdf q(τ, w, α) is typically a free parameter. However, to make the optimization of the bound in equation 3.6 tractable, one typically assumes a suitable factorization of q(τ, w, α) and constrains individual approximating factors to some classes of parametric pdfs. Henceforth, we will assume that
q(τ, w, α) = q(w) ∏_{l=1}^L q(τl) ∏_{k=1}^K q(αk).        (3.8)
The motivation behind such factorization is the following. Selecting q(α) = ∏_{k=1}^K q(αk) follows from the assumption that p(w, α) = ∏_{k=1}^K p(wk|αk)p(αk). The assumption q(τ) = ∏_{l=1}^L q(τl) is made mainly for computational reasons. Essentially, such factorization allows one to reduce a nonlinear L-dimensional optimization with respect to q(τ) to a series of L simpler one-dimensional nonlinear optimizations with respect to q(τl), which makes the numerical estimation problem much simpler.
Consider a random variable a ∈ {τ1, …, τL, w, α1, …, αK}, and assume we are interested in finding q(a) that maximizes the lower bound, equation 3.6. Define now
log q̃(a) ∝e ⟨log p(y, τ, w, α)⟩q(MB(a)),        (3.9)
where MB(a) is the Markov blanket of the variable a.7 It is then easy to show that an unconstrained (form-free) variational solution for q(a), a ∈ {τ1, …, τL, w, α1, …, αK}, which maximizes the bound, equation 3.6, is found as q′(a) ∝ q̃(a). If q(a) is constrained to some suitable class of density functions 𝒬, then a constrained solution is obtained by solving
q′(a) = argmaxq(a)∈𝒬 ∫ q(a) log [ q̃(a) / q(a) ] da.        (3.10)
Note that an unconstrained solution naturally gives a tighter bound on log p(y) in equation 3.6.

Now let us return to the approximating pdf q(τ, w, α). In the case of q(w), it is easy to verify that log q̃(w) computed from equation 3.9 is quadratic in w, which implies that q̃(w) must be a gaussian pdf. This can be easily verified by noting that the posterior is proportional to a product of two gaussian pdfs: p(w|y, α, τ) ∝ p(y|w, τ)p(w|α). Therefore, selecting q(w) = N(w|μw, Σw) is equivalent to a form-free variational solution for this factor. Following the same line of argument, it can be shown that q̃(αk), k = 1, …, K, is a gamma pdf. Therefore, selecting q(αk) = Ga(αk|âk, b̂k) corresponds to a form-free variational solution for q(αk). As a single exception to the above cases, we restrict q(τl) to a set of Dirac measures q(τl) = δ(τl − τ̂l), l = 1, …, L. By doing so, we restrict ourselves to the integer point estimate τ̂l of the lth neuron delay τl. While other forms of the pdfs can be assumed here, their study is outside the scope of this letter.

Now, the variational inference reduces to the estimation of the variational parameters μw and Σw of q(w), and âk, b̂k of q(αk), k = 1, …, K, using equation 3.9, and τ̂l of q(τl), l = 1, …, L, using equation 3.10.

Should our estimation problem be independent of τ, the solution to the variational inference of q(w) and q(α) could be easily computed (Bishop & Tipping, 2000; Shutin, Buchgraber, Kulkarni, & Poor, 2011a; Tipping & Faul, 2003). Unfortunately, the nonlinear dependence of X(τ) on τ significantly complicates the evaluation of equations 3.9 and 3.10. In fact, when y is observed, the variables w and τ become conditionally dependent (Bishop, 2006); the variational estimation of a single factor q(τl) would thus require computing the expectation with respect to q(w) in equation 3.9. As a consequence, the straightforward variational estimation of q(τl) might become computationally quite costly due to the correlations between the elements of w, especially when the number of columns in Φ(τ) is high. Although this approach is practically realizable, these numerical difficulties can be efficiently circumvented by appealing to the EM type of inference schemes used for estimating parameters of superimposed signals (Feder & Weinstein, 1988; Fleury, Tschudin, Heddergott, Dahlhaus, & Pedersen, 1999; Shutin & Fleury, 2011). Here we propose to use one such algorithm known as the VB-SAGE algorithm (Shutin & Fleury, 2011).

The VB-SAGE algorithm—a variational extension of the original SAGE algorithm (Fessler & Hero, 1994)—allows one to simplify the optimization of the bound in equation 3.6 by introducing latent variables termed admissible hidden data. Within the VB-SAGE algorithm, the admissible hidden data is introduced for only a subset of parameters of interest; this distinguishes the VB-SAGE framework from a closely related EM framework and its variational extensions (Sung, Ghahramani, & Bang, 2008a, 2008b; Palmer, Wipf, Kreutz-Delgado, & Rao, 2006; Beal, 2003; Attias, 1999), where complete data are introduced for all the unknown parameters. In our case, we would like to simplify the inference of a single delay parameter τl. The VB-SAGE algorithm is then used to maximize the bound in equation 3.6 with respect to q(τl) by performing a variational inference on a new graph that has been appropriately extended with latent variables. The monotonicity property of the VB-SAGE algorithm guarantees that this optimization strategy necessarily improves the variational bound in equation 3.6 (Shutin & Fleury, 2011).

### 3.2.  Variational Bayesian Space-Alternating Inference.

We begin by formally defining the notion of admissible hidden data. Let θ = {τ1, …, τL, w, α1, …, αK} be the set of all the unknown parameters, and let θS ⊆ θ be a subset of parameters we wish to update.8

Definition 1.
Given a measurement y, hS is said to be admissible hidden data with respect to θS if the factorization

p(y, hS|θ) = p(y|hS, θ∖θS)p(hS|θ)        (3.11)
is satisfied (Shutin & Fleury, 2011; Fessler & Hero, 1994).

The purpose of the hidden data is to make the update procedure for the subset θS a tractable optimization problem. Now let us reinspect equation 2.1. We observe that the network output is represented as a superposition of L neuron responses X(τ) = [x1(τ1), …, xL(τL)] and M input signals U = [u1, …, uM]. Obviously the weight vector w can be partitioned as w = [wxT, wuT]T, where wx and wu are, respectively, the weighting coefficients for the echo states X(τ) and the input signals U. In what follows, we formulate the learning problem so as to estimate the delay τl of a single neuron.

Lemma 1.

Let wxl denote the lth element from the vector wx. Decompose the total perturbation ξ in equation 2.1 into two statistically independent parts such that ξ = ξl + ηl, where ξl ∼ N(0, βlσ²I) and ηl ∼ N(0, (1 − βl)σ²I), for some 0 ≤ βl ≤ 1.

Then, a variable,
hl = xl(τl)wxl + ξl,        (3.12)
is admissible hidden data with respect to τl.
Proof.
With the new variable hl the ESN output expression, equation 2.1, can be rewritten as
y = hl + Xl̄(τl̄)wx,l̄ + Uwu + ηl,        (3.13)
where Xl̄(τl̄) is an N × (L − 1) matrix of delayed echo states with the response of the lth neuron removed and wx,l̄ is a vector with L − 1 elements obtained by removing the lth weight wxl from wx. Then, the modified graphical model that accounts for the introduced variable hl can be represented as shown in Figure 1b, from which it immediately follows that the new joint pdf factors as
p(y, hl, τ, w, α) = p(y|hl, τl̄, w)p(hl|τl, wxl)p(τ)p(w|α)p(α).        (3.14)
By comparing equations 3.14 and 3.11, we conclude that hl defined in equation 3.12 is admissible hidden data with respect to τl.
The key quantities in equation 3.14 that distinguish it from equation 3.7 are the likelihood p(y|hl, τl̄, w) of the admissible hidden data and the new likelihood p(hl|τl, wxl) of τl, which is now a function of wxl and hl. From equation 3.12, it follows that p(hl|w, τl) = p(hl|wxl, τl), where p(hl|wxl, τl) = N(hl|xl(τl)wxl, βlσ²I). The VB-SAGE-based inference of q(τl) now incorporates two steps: (1) a variational inference of the admissible hidden data hl using the augmented graph in Figure 1b, which forms the VB-SAGE-E-step of the scheme, followed by (2) the variational inference of q(τl), which is the VB-SAGE-M-step of the algorithm. Notice that the E-step of the VB-SAGE algorithm requires extending the approximating pdf, equation 3.8, so as to account for the admissible hidden data hl. We assume that
q(hl, τ, w, α) = q(hl)q(w) ∏_{j=1}^L q(τj) ∏_{k=1}^K q(αk).        (3.15)
The same factorization of the approximating pdf also underpins the variational extension of the EM algorithm (Attias, 1999). Once the joint pdf, equation 3.14, and the approximating pdf, equation 3.15, are specified, the variational inference of q(hl) and q(τl) is realized following the standard variational inference on a DAG; that is, the expressions 3.9 and 3.10 are evaluated to estimate the corresponding approximating factors, albeit using the new graph in Figure 1b to determine the Markov blanket of the updated variables.

It has been shown (Shutin & Fleury, 2011) that in order to guarantee the monotonic increase of the variational lower bound with respect to q(τl) using the VB-SAGE algorithm, it suffices to estimate the approximating pdf q(hl) of the admissible hidden data as a form-free solution, equation 3.9, and select βl = 1. In our case, these constraints are easily satisfied. Note that βl is in general a free parameter. However, setting βl = 1 is convenient since it has been proven that for models linear in their parameters, this choice leads to a fast convergence of the algorithm in the early iteration steps (Fessler & Hero, 1994); the same choice has also been adopted in Fleury et al. (1999) and Shutin and Fleury (2011). In the case of q(hl), it follows from equations 3.12 and 3.13 that log q̃(hl) is quadratic in hl since q(w) is gaussian. Thus, selecting q(hl) = N(hl|μhl, Σhl) guarantees the monotonicity property of the VB-SAGE scheme.

### 3.3.  Variational Estimation Expressions.

Here we provide the estimation expressions for the variational parameters of the approximating factors of q(τ, w, α, hl). The updated value of a variational parameter is denoted by (·)′.

We begin with the variational estimation of q(w). By evaluating log q̃(w) from equation 3.9, which is quadratic in w, and minding that q(w) = N(w|μw, Σw), we find the updated variational parameters μ′w and Σ′w as

Σ′w = (σ−2Φ(τ̂)TΦ(τ̂) + Â)−1,  μ′w = σ−2Σ′wΦ(τ̂)Ty.        (3.16)
Here we defined Φ(τ̂) = [X(τ̂), U] and Â = diag(α̂1, …, α̂K), where τ̂ = [τ̂1, …, τ̂L]T and α̂k = ⟨αk⟩q(αk) = âk/b̂k, k = 1, …, K. Let us stress that equation 3.16 is essentially a Tikhonov-like regularized solution for the coefficients w, with α̂1, …, α̂K acting as the regularization parameters.
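This gaussian update can be sketched directly. The function below is a hypothetical helper operating on a fixed (already delayed) basis matrix; the structure, a regularized linear solve with the posterior means of α on the diagonal, follows the Tikhonov-like form described above.

```python
import numpy as np

def update_q_w(Phi, y, alpha_hat, sigma2):
    """Gaussian update of q(w): a Tikhonov-like regularized LS solution
    with diag(alpha_hat) acting as the regularization matrix, for a fixed
    (already delayed) basis matrix Phi and noise variance sigma2."""
    Sigma_w = np.linalg.inv(Phi.T @ Phi / sigma2 + np.diag(alpha_hat))
    mu_w = Sigma_w @ Phi.T @ y / sigma2
    return mu_w, Sigma_w
```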
Following the same inference steps, we compute the variational parameters of the pdfs q(αk) = Ga(αk|âk, b̂k), k = 1, …, K, as

â′k = ak + 1/2,  b̂′k = bk + (|μw,k|² + [Σw]kk)/2.        (3.17)
In equation 3.17, μw,k is the kth element of the vector μw, and [Σw]kk is the kth element on the main diagonal of the posterior covariance matrix Σw.
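The gamma updates and the resulting posterior means ⟨αk⟩ = âk/b̂k, which enter the weight update as regularization parameters, can be sketched as follows (again a hypothetical helper; ak = bk = 0 reproduces the noninformative hyperprior).

```python
import numpy as np

def update_q_alpha(mu_w, Sigma_w, a=0.0, b=0.0):
    """Gamma updates for the q(alpha_k) factors and the posterior means
    <alpha_k> = a'_k / b'_k that feed back into the weight update as
    regularization parameters; a = b = 0 is the noninformative hyperprior."""
    a_new = a + 0.5
    b_new = b + 0.5 * (mu_w ** 2 + np.diag(Sigma_w))
    return a_new, b_new, a_new / b_new
```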
Now we consider the VB-SAGE-based estimation of delay parameters τ. For each neuron, the inference includes a VB-SAGE-E-step to estimate q(hl) and a VB-SAGE-M-step to update the corresponding pdf q(τl). The VB-SAGE-E-step involves a computation of the expectation of log p(y, hl, τ, w, α) with respect to q(w)∏_{j=1}^L q(τj)∏_{k=1}^K q(αk). As we mentioned earlier, q(hl) should be selected as a form-free solution to ensure the monotonicity of the algorithm. Since log q̃(hl) is quadratic in hl, the variational parameters of q(hl) = N(hl|μhl, Σhl) can be easily computed as
μ′hl = (1 − βl)xl(τ̂l)μw,xl + βl(y − Xl̄(τ̂l̄)μw,xl̄ − Uμw,u),  Σ′hl = βl(1 − βl)σ²I.        (3.18)
Observe that with βl = 1, Σ′hl = 0; that is, q(hl) collapses to a Dirac distribution. Now the VB-SAGE-M-step involves a computation of the expectation of log p(hl|τl, wxl) + log p(τl) with respect to q(hl)q(w). Since q(τl) = δ(τl − τ̂l), the solution to equation 3.10 is obtained by finding τ̂′l as a solution to the following optimization problem:
τ̂′l = argmaxτl { 2μw,xl xl(τl)Tμhl + 2βlσ² log p(τl) − (|μw,xl|² + [Σw]ll)‖xl(τl)‖² },        (3.19)
that is, τ̂′l is found such that q(τl) is centered at the maximum of log q̃(τl); naturally, τ̂′l is the maximum a posteriori estimate of the delay parameter τl. In equation 3.19, [Σw]ll is the element on the main diagonal of Σw that corresponds to the posterior variance of the lth echo state weight wxl. Notice that the estimation of the delay τl requires numerical optimization, which, however, can be implemented as a simple one-dimensional line search on the domain of q(τl). The VB-SAGE-E-step, equation 3.18, and the VB-SAGE-M-step, equation 3.19, are then iteratively repeated for all L neurons. Let us also mention that due to q(τ) being fully factorizable, the neurons can be processed in any desired order.
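The one-dimensional line search over integer delays can be sketched as follows. The objective mirrors the correlation, prior, and norm terms discussed above; the exact constants and all input names (posterior means and variances assumed to come from the preceding updates) are assumptions of this sketch.

```python
import numpy as np

def dns_m_step(X_raw, l, mu_h, mu_wxl, var_wxl, sigma2, beta=1.0,
               max_delay=50, log_prior=None):
    """Grid search over integer delays for neuron l: maximize
    2*mu_wxl * x_l(tau)^T mu_h - (mu_wxl**2 + var_wxl) * ||x_l(tau)||**2,
    optionally plus a 2*beta*sigma2*log p(tau) prior term."""
    N = len(mu_h)
    best_tau, best_obj = 0, -np.inf
    for tau in range(max_delay + 1):
        x_tau = np.zeros(N)
        x_tau[tau:] = X_raw[:N - tau, l]          # delayed, causally padded
        obj = (2.0 * mu_wxl * (x_tau @ mu_h)
               - (mu_wxl ** 2 + var_wxl) * (x_tau @ x_tau))
        if log_prior is not None:
            obj += 2.0 * beta * sigma2 * log_prior(tau)
        if obj > best_obj:
            best_tau, best_obj = tau, obj
    return best_tau
```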

Holzmann and Hauser (2010) propose a similar iterative EM-based scheme for D&S readout optimization. Their algorithm is in many respects inspired by the ideas of the original SAGE algorithm (Fessler & Hero, 1994). They propose to estimate the delay τl of the lth neuron by first subtracting the influence of the other echo states and network input signals from the desired network response y to compute the residual signal. This realizes the E-step of the scheme. The delay τl is then found as a value that maximizes the absolute value of the correlation between the computed residual and the echo state xl[n]; this constitutes the M-step of the algorithm. Although the scheme is very effective, several heuristics are employed that distinguish it from the algorithm proposed in this work. Specifically, the weights of the echo states are estimated in two steps: during the D&S readout parameter updates, the weight wxl of the lth echo state is computed as a projection of xl(τl) on the residual signal; then, once the delay parameters of the D&S readout have converged, the Moore-Penrose pseudoinverse is used to estimate the weights w one more time. Also, the objective function used to compute the delay parameters of the D&S readout differs from that obtained with the standard SAGE algorithm. Let us now show that equations 3.18 and 3.19 are the generalizations of this approach.

First, we note that with βl = 1 expression 3.18 naturally realizes the interference cancellation scheme of Holzmann and Hauser (2010). Indeed, in this case, equation 3.18 reads
3.20
where the hatted quantities denote the variational expectations of the corresponding weights, including the expectation of the input weights wu. In other words, in equation 3.20, all input signals and the responses of all neurons but the lth one are subtracted from the target signal Y. The similarity between the VB-SAGE-E-step and the E-step of the scheme proposed in Holzmann and Hauser (2010) arises quite naturally, since both schemes use the SAGE algorithm as a starting point. The actual distinction lies in the way the D&S readout parameters are estimated. In their work, Holzmann and Hauser (2010) depart from the SAGE algorithm and use a heuristic to estimate the delay τl. Specifically, the delay estimate is found as a maximizer of the absolute value of the correlation between the residual signal and the delayed echo state. Under certain assumptions, the objective function, equation 3.19, can be shown to be very similar to that used in Holzmann and Hauser (2010).
Observe that since τl is a delay parameter for the echo state xl[n], we can assume that the term ‖xl(τl)‖2 is independent of the delay τl. This allows us to neglect the last regularization term in equation 3.19. Furthermore, when the prior p(τl) is assumed to be flat, that is, p(τl) ∝ 1, it follows that the optimization problem, equation 3.19, becomes equivalent to
3.21
which estimates τl on a grid so as to maximize the correlation between the residual signal and xl(τl).9 The objective function used in Holzmann and Hauser (2010) is thus an “incoherent” version of equation 3.21, where the weight estimate is ignored and only the magnitude of the correlation is maximized with respect to τl.
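To make the comparison concrete, the one-dimensional line search over the delay can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name is hypothetical, a scalar weight stands in for the estimated echo state weight, the regularization terms of equation 3.19 are dropped as discussed above, and delays are restricted to an integer grid.

```python
import numpy as np

def estimate_delay(residual, x, max_delay, weight=1.0, incoherent=False):
    """Grid search for the delay that best aligns echo state x with the residual.

    'residual' plays the role of the admissible hidden data for neuron l,
    and delaying x by tau models x_l(tau_l).
    """
    best_tau, best_score = 0, -np.inf
    for tau in range(max_delay + 1):
        # Delayed echo state x_l[n - tau], zero-padded at the start.
        x_tau = np.concatenate([np.zeros(tau), x[:len(x) - tau]])
        corr = np.dot(residual, x_tau)
        # The coherent objective keeps the sign and the weight estimate;
        # the incoherent (HH-ESN-style) variant maximizes |corr| only.
        score = abs(corr) if incoherent else weight * corr
        if score > best_score:
            best_tau, best_score = tau, score
    return best_tau
```

For a well-conditioned residual, both variants recover the same delay; they differ when the sign of the weight estimate carries information.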

## 4.  Implementation Issues and Algorithm Initialization

In order to initialize the algorithm, a simple strategy can be used that allows for an inference of the initial variational approximation from the training data. For that, we start with an empty model, that is, with all variational parameters set to 0. The iterations of the algorithm then sequentially update all variational factors. In algorithm 1, we summarize the main steps of the proposed algorithm. Note that in step 3 of the algorithm, we initialize αl = ε. The choice of ε is in general application dependent; we discuss it in more detail in section 5.

An important part of the initialization procedure is the selection of the additive perturbation variance σ2. When the signal y[n] is known to be noisy, the variance σ2 should be selected to reflect this. In general, a large value of σ2 leads to more aggressive regularization and makes the network less sensitive to variations in y[n].

The iterative nature of the algorithm requires a stopping criterion for the parameter updates. In our experiments, we have empirically determined that after five or six update iterations the improvement of the algorithm performance is insignificant; thus, six update iterations are used.

### 4.1.  Computational Complexity of the Algorithm.

Incorporating automatic regularization in the ESN training scheme as well as estimating the D&S readout parameters increases the computational complexity of the network training. Quite naturally, when the D&S readout parameters and regularization parameters are fixed, the variational Bayesian ESN training reduces to an instance of the classical ridge regression-based estimate of W. This requires inverting a K × K posterior covariance matrix, an operation with a computational complexity on the order of O(K3).
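This special case can be sketched in a few lines. The function name and the exact parameterization of the noise and regularization terms below are illustrative assumptions; the K × K system solve is the costly step discussed in the text.

```python
import numpy as np

def ridge_weights(Phi, Y, alpha, sigma2):
    """Ridge-regression (MAP) estimate of the output weights.

    Sketch under a linear model Y ~ Phi @ W + noise: A = diag(alpha) is the
    regularization matrix and sigma2 the noise variance.
    """
    A = np.diag(alpha)
    # Posterior covariance of the weights; inverting this K x K matrix
    # dominates the cost of the update.
    Sigma = np.linalg.inv(Phi.T @ Phi / sigma2 + A)
    W = Sigma @ Phi.T @ Y / sigma2
    return W, Sigma
```

With alpha near zero, this reduces to the ordinary least-squares (pseudoinverse) solution; larger alpha values shrink the corresponding weights toward zero.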

Compared to the computation of the weights W, the estimation of the regularization parameters using equation 3.17 poses an insignificant increase of the total computational complexity. Recently, a new, fast variational SBL (FV-SBL) scheme has been proposed (Shutin et al., 2011a, 2011b) to accelerate the convergence of the sparsity parameter update expressions 3.17 in the case when the hyperpriors p(αk), k = 1, …, K, are chosen to be noninformative. The scheme exploits the fact that the lower bound in equation 3.6 is convex with respect to the factorization, equation 3.8; in other words, the factors in equation 3.8 can be updated in any order without compromising the monotonic increase of the variational lower bound (Bishop, 2006). Then, for a fixed k, the stationary point of the variational updates, equations 3.17 and 3.16, repeated ad infinitum can be computed in closed form, assuming that the other variational parameters are fixed. All K variational factors q(αk), k = 1, …, K, are then updated sequentially, with a single update being significantly cheaper than a joint re-estimation of all parameters. Although in general the total complexity remains on the order of O(K3), the update of a single component can be performed more efficiently. Furthermore, fewer iterations are typically needed to estimate the regularization parameters.

The estimation of the D&S readout parameters also increases the total computational complexity. Specifically, the estimation of the delay parameter for each neuron from equation 3.19 requires evaluating the admissible hidden data from equation 3.18 and solving the one-dimensional optimization problem, equation 3.19. For L neurons, this results in a total computational complexity that is quadratic10 in both the number of learning samples N and the number of neurons L; yet this increase is still dominated by the complexity of estimating the network weights. Note that an ESN with a D&S readout typically requires fewer neurons than a standard ESN to achieve the same memory capacity; in other words, training a smaller ESN with tunable D&S readouts is typically more efficient than training a standard ESN with many neurons.

## 5.  Simulation Results

In this section, we compare the performance of the proposed variational Bayesian learning of ESNs with other state-of-the-art ESN training algorithms using synthetic as well as real-world data. In the first experiment, described in section 5.1, we train an ESN predictor to forecast a chaotic time series generated with a Mackey-Glass system, which is often used to benchmark ESN learning schemes (Jaeger, 2001; Holzmann, 2008). In the second experiment, described in section 5.2, we apply the ESN training schemes to the recognition of handwritten symbols based on measured dynamic pen trajectory data.

In both experiments, we compare an extended ESN trained with the proposed VB-SAGE algorithm (which we will further term VB-ESN) to a standard ESN (STD-ESN), to an ESN with a reservoir extended with a fixed D&S readout and trained using the Moore-Penrose pseudoinverse (EXT-ESN), and to an ESN with a D&S readout that is trained using the algorithm proposed in Holzmann and Hauser (2010); further in the text, we refer to this algorithm as HH-ESN.

### 5.1.  Time-Series Prediction.

In this experiment, we apply ESNs to predict a chaotic time series generated with a Mackey-Glass system. Similar experiments have also been performed in Jaeger (2001) and Holzmann (2008) to benchmark the performance of different ESN training schemes. We assume that an input signal to an ESN is a constant signal u[n] = 0.02 and an output signal is generated using the Mackey-Glass differential equation,
$$\frac{dy(t)}{dt} = \beta\,\frac{y(t-\tau_{mg})}{1+y(t-\tau_{mg})^{n}} - \gamma\,y(t), \qquad (5.1)$$
where γ, β, n, and τmg are the parameters of the system. Following Jaeger (2001) and Holzmann (2008), we select these parameters as follows: γ = 0.1, β = 0.2, n = 10, and τmg = 30. This choice guarantees that the Mackey-Glass system converges to a chaotic attractor.
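Such a series can be generated with a simple Euler discretization of equation 5.1. The step size, the constant initial history, and the function name below are assumptions of this sketch; the experiments in the text do not specify an integration scheme.

```python
import numpy as np

def mackey_glass(n_samples, gamma=0.1, beta=0.2, n=10, tau=30, dt=1.0, y0=1.2):
    """Generate a Mackey-Glass time series by Euler integration.

    Uses the parameter values from the experiment: gamma=0.1, beta=0.2,
    n=10, tau=30, which place the system on a chaotic attractor.
    """
    delay = int(tau / dt)
    y = np.full(n_samples + delay, y0)  # constant initial history
    for t in range(delay, n_samples + delay - 1):
        y_tau = y[t - delay]
        y[t + 1] = y[t] + dt * (beta * y_tau / (1.0 + y_tau ** n) - gamma * y[t])
    return y[delay:]
```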

The reservoir coefficients Cx, Cy, and Cu are randomly generated by uniformly drawing samples from the interval [ − 1, 1]. For all tested networks, the connectivity of the reservoir is set to 5%, and the connectivity matrix Cx is normalized so as to have a spectral radius of 0.8. To avoid instabilities due to the nonlinear feedback mechanism (Jaeger, 2001), a small zero-mean white additive disturbance with variance 1 × 10−6 was added to the feedback signal Cy^T y[n] in the state transition equation 1.1. We set the size of the reservoir for all tested ESNs to L = 200 neurons unless explicitly stated otherwise. The variance of the additive noise ξ was set in this experiment to σ2 = 10−10.
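The reservoir construction just described can be sketched as follows; the masking scheme, dense-matrix representation, and seed handling are assumptions of this illustration.

```python
import numpy as np

def make_reservoir(L, connectivity=0.05, spectral_radius=0.8, seed=0):
    """Random sparse reservoir matrix rescaled to a given spectral radius.

    Entries are drawn uniformly from [-1, 1]; a fraction `connectivity`
    of them is kept nonzero, and the whole matrix is rescaled so that its
    largest eigenvalue magnitude equals `spectral_radius`.
    """
    rng = np.random.default_rng(seed)
    Cx = rng.uniform(-1.0, 1.0, size=(L, L))
    mask = rng.random((L, L)) < connectivity
    Cx = np.where(mask, Cx, 0.0)
    rho = np.max(np.abs(np.linalg.eigvals(Cx)))
    return Cx * (spectral_radius / rho)
```

Rescaling to a spectral radius below 1 is the usual heuristic for keeping the reservoir dynamics stable (the echo state property).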

For the EXT-ESN algorithm, the time delays are generated randomly by independently drawing samples from the interval [0, 100]; 50% of the generated delay values are then set to zero, which ensures that a particular number of echo state functions enters the ESN output without a time delay. Similarly, the VB-ESN and HH-ESN algorithms use this initialization to generate the initial values of neuron delays.

In the case of the VB-SAGE algorithm, it is important to mention that due to the iterative structure of the algorithm, the proper initialization of the network parameters plays an important role. To obtain a consistent starting point, the initial values of the weights W are drawn from the gaussian distribution with zero mean and covariance matrix A−1, where A = diag(α1, …, αK). Obviously, the initial choice of α controls the algorithm's emphasis on the estimation of the time delays τ. Small initial values of α lead to weak regularization of the weights during the early iterations. We have observed that this often drives the algorithm to a local optimum, with the values of τ frozen at the initial values. Setting the initial values of α to large numbers corresponds to the initial weights W being close to zero; as a result, the training algorithm essentially “deregularizes” the solution, which, as our extensive simulations show, leads to better estimation results. In our experiments, we set αk = ε = 1010, k = 1, …, K. The same strategy is also used to initialize the weights of the HH-ESN algorithm.
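The strongly regularized starting point described above amounts to the following sketch (the function name and the one-dimensional shape of W are illustrative assumptions):

```python
import numpy as np

def init_weights(K, eps=1e10, seed=0):
    """Draw initial output weights from N(0, A^{-1}) with A = eps * I.

    With eps = 1e10 the initial standard deviation is 1e-5, so the
    initial weights are effectively zero, as described in the text.
    """
    rng = np.random.default_rng(seed)
    alpha = np.full(K, eps)
    W = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=K)
    return alpha, W
```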

The training and testing of the ESNs are then realized as follows. First, 3300 samples of the Mackey-Glass time series are generated. The first 3000 samples are used to train the network, and the remaining 300 are used to validate the network performance. The network is run from a zero initial state in teacher-forced mode (Jaeger, 2001) using the first 3000 samples of the time series; the initial 1000 samples of the resulting network trajectory are then discarded to ensure that the system settles at the chaotic attractor. The remaining N = 2000 samples are used to train the network and estimate its parameters.

Once the coefficients of the network are estimated, the trained network is run for 300 time steps to generate the predicted trajectory by feeding the output of the trained network back into the reservoir. The performance of the trained network is evaluated by measuring the normalized root-mean-squared error between the 300 samples of the true trajectory ytrue[n] and the trajectory ytrained[n] generated by the trained network; the corresponding results are then averaged over NMC = 300 independent Monte Carlo simulations, where for each simulation run a new ESN is generated and trained using a new realization of the Mackey-Glass time series. The normalized root-mean-squared error between ytrue[n] and ytrained[n] is computed as
$$\mathrm{NRMSE}[n] = \sqrt{\frac{1}{N_{MC}\,\sigma_y^{2}}\sum_{i=1}^{N_{MC}}\left(y_{\mathrm{true}}^{[i]}[n]-y_{\mathrm{trained}}^{[i]}[n]\right)^{2}}, \qquad (5.2)$$
where the superscript [i] denotes the signal computed during the ith Monte Carlo simulation run and σ2y is the variance of the true time series ytrue[n]. Naturally, the longer both systems remain synchronized (i.e., the more slowly NRMSE[n] grows as a function of n), the better the performance of the trained model.
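The Monte Carlo averaging in equation 5.2 can be computed as in the following sketch; the array layout (runs along the first axis) is an assumption.

```python
import numpy as np

def nrmse(y_true, y_trained):
    """Per-time-step NRMSE over Monte Carlo runs.

    y_true, y_trained: arrays of shape (n_runs, n_steps). The squared error
    at each step is averaged over runs and normalized by the variance of
    the true series.
    """
    sigma2 = np.var(y_true)
    return np.sqrt(np.mean((y_true - y_trained) ** 2, axis=0) / sigma2)
```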

In Figure 2 we plot the estimated performance of the compared algorithms. Observe that the predicted output signal obtained with the STD-ESN scheme diverges from the true signal much faster than with the other algorithms; even doubling the size of the network from 200 to 400 neurons does not improve the performance. Introducing a random D&S readout, however, does help: although the EXT-ESN scheme with L = 200 neurons outperforms both STD-ESN schemes, its performance remains below that of the VB-ESN and HH-ESN algorithms. The latter two schemes deliver the lowest prediction error, retaining a 30 dB performance gain over the STD-ESN and EXT-ESN schemes after 300 time steps. Only by boosting the size of the EXT-ESN to 400 neurons does this scheme perform on par with the VB-ESN and HH-ESN with 200 neurons each. Thus, learning the optimal parameters of the D&S readout allows for a significant reduction of the required size of the network. It should be mentioned that in this example, the performance of the VB-ESN and HH-ESN schemes is nearly identical. Let us look into the performance of these two schemes a bit closer.

Figure 2:

An error between the true Mackey-Glass time series and the predicted response for (a) 300 time steps and (b) a zoom-in into the error evolution for time steps between n = 1 and n = 16.


For that, we analyze the estimation results for the delays τ. In Figures 3a and 3b, we plot the histograms of the estimated D&S readout parameters with nonzero delay values computed with the HH-ESN and VB-ESN methods. Interestingly, the histogram for the D&S readout parameters computed with the VB-ESN algorithm shows strong peaks in the range between τ = 20 and τ = 60, which covers the original delay parameter τmg of the Mackey-Glass system. In fact, this is not a mere coincidence; experiments with different values of the delay parameter τmg indicate that the VB-SAGE algorithm indeed sets many of the D&S readout delays to the value closest to the true delay τmg. In contrast, the delay estimation with the HH-ESN algorithm results in more uniformly distributed delay estimates and thus shows only weak peaks around the true delay parameter. Based on the obtained simulation results, we can conclude that exact estimates of the D&S readout parameters are not pivotal for the successful prediction of the Mackey-Glass time series, and deviations in the delay parameter estimates can be compensated to a certain extent by the estimation of the output weights. Also, due to the very low noise level, the distinction between the Bayesian regularization used in the VB-ESN and the Moore-Penrose pseudoinverse-based regularization is minimal. This explains the similarity of the prediction results obtained with the two schemes.

Figure 3:

Histogram of the estimated D&S readout parameters. (a) HH-ESN. (b) VB-ESN.


It is also important to stress the statistical dependency between the estimated delay parameters τ and the estimated regularization parameters α. Recall that α reflects the importance of the particular echo states or input signals: the higher the value of αk, the more regularization is applied to the kth column of the matrix Φ(τ) in equation 2.1, and thus the less relevant this column is in predicting the output signal. Thus, the parameters α can be used to measure the relative quality of individual neuron echo states. In Figure 4 we plot the values of the regularization parameters α versus the corresponding D&S readout delay parameters for the Mackey-Glass time series prediction example. Notice that the histogram has strong peaks at τ ≈ 25 and τ ≈ 35 with relatively small values of α, which indicates the importance of the echo states with these delays for representing the desired output signal.
Figure 4:

An empirical distribution of the D&S readout parameters τ and regularization parameters α for L = 200 neurons.


### 5.2.  Handwritten Character Recognition.

Here we assess the performance of the proposed algorithm using measured multidimensional pen trajectory data for handwritten character recognition. It has already been demonstrated (Zechner, 2010) that ESNs can successfully handle dynamic handwriting data. Here we adopt the same experimental setup as used in Zechner (2010) to test the performance of the VB-SAGE algorithm.

The following simulations are carried out using samples from the WILLIAMS database (Williams, 2010), available at the UCI Machine Learning Repository (Frank & Asuncion, 2010). The database contains 2858 character trajectories from the English alphabet, where only letters that can be written as a single stroke were recorded (20 character classes). Each trajectory is represented as a three-dimensional time series consisting of the X- and Y-velocities as well as the pen pressure. The measured data in the repository have been smoothed using a gaussian filter with a variance set to 4 (see Williams, 2010, for further details). For our purposes, we consider recognition of only a subset of characters from the repository: a, b, c, d, e, g, and h.11 The corresponding 2D patterns for these characters are shown in Figure 5. Lukoševičius and Jaeger (2009) noted that there are two ESN configurations for a classification task. First, one can design and train a single ESN with as many outputs as class labels; the classes are then predicted by the output with the largest amplitude. Alternatively, one can train several ESN predictors, one for each class; then, given a test signal, the class label is selected by choosing the ESN predictor that achieves the smallest prediction error. In this work, we use the first approach.
Figure 5:

Sample X-Y-trajectories for letters a, b, c, d, e, g, and h from the WILLIAMS database.


Let us now describe the settings for the ESN network parameters and the design of the input signals. To this end, we introduce an index set with each element corresponding to one of the seven letters. As an input signal u[n] ≡ uc[n], we use the trajectory corresponding to the class c; the corresponding output signal of the ESN y[n] = [y1[n], …, y7[n]]T is then set to zero except for the element yc[n], which is selected as a gaussian pulse with variance 1 centered at the time instance corresponding to 70% of the input trajectory length. During the testing, a class estimate is obtained by selecting the element of y[n] that attains the maximum value within the input trajectory's time window. The reservoir parameters for this classification task are selected as in the signal prediction example, except for the reservoir connectivity, which is set to 10%. Also, no output feedback is used (i.e., Cy = 0). For each of the EXT-ESN algorithms, 30% of the D&S readout delays are set to zero, and the remaining 70% are uniformly drawn from the interval [0, 100]. The VB-ESN and HH-ESN algorithms use this initialization to generate the initial values of the D&S readout parameters.
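The target design and decision rule described above can be sketched as follows; the function names, the pulse width parameter, and the array layout are assumptions of this illustration.

```python
import numpy as np

def make_target(length, class_idx, n_classes=7, variance=1.0):
    """Target outputs for one training trajectory.

    All outputs are zero except the one for the true class, which is a
    gaussian pulse centered at 70% of the trajectory length.
    """
    y = np.zeros((n_classes, length))
    center = 0.7 * length
    n = np.arange(length)
    y[class_idx] = np.exp(-((n - center) ** 2) / (2.0 * variance))
    return y

def classify(outputs):
    """Class estimate: the output channel attaining the maximum value."""
    return int(np.unravel_index(np.argmax(outputs), outputs.shape)[0])
```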

The classification tasks were performed using five as well as seven character classes: {a, b, c, d, e} and {a, b, c, d, e, g, h}. In both cases, 60% of the character instances are used for training and the remaining 40% for testing. As the performance measure, we compute a per-class classification error rate Ecl, defined as the number of incorrect classifications per class over the number of tested examples in that class, and the total classification error rate Etotal, defined as the number of incorrect classifications for all letters over the total number of tested examples. To better assess the classification performance of the compared algorithms, we estimate Ecl and Etotal over 50 independent runs. For each run, a new ESN reservoir is generated and trained, and the corresponding classification errors are estimated.
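The two error rates can be computed directly from a confusion matrix, as in this sketch (the function name is hypothetical):

```python
import numpy as np

def error_rates(confusion):
    """Per-class and total classification error rates.

    confusion[i, j] counts examples of actual class i predicted as class j;
    the diagonal therefore holds the correct classifications.
    """
    confusion = np.asarray(confusion, dtype=float)
    per_class = 1.0 - np.diag(confusion) / confusion.sum(axis=1)
    total = 1.0 - np.trace(confusion) / confusion.sum()
    return per_class, total
```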

In Figure 6 we summarize the distributions of the per-class classification errors Ecl using box-and-whisker plots for the five-letter case; in Figure 7 the classification errors for the seven-letter case are presented. The edges of the boxes are the 25th and 75th percentiles of the estimated classification errors, and the central mark denotes the median. Whiskers illustrate the degree of error dispersion; they extend from the box to the most extreme data value within 1.5 × IQR, where IQR is the interquartile range of the sample. Data values beyond the ends of the whiskers, marked with crosses, are treated as outliers.
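The whisker and outlier convention used in these figures is the standard 1.5 × IQR rule, which can be reproduced as in the following sketch (not code from the experiments):

```python
import numpy as np

def whisker_bounds(samples):
    """Whisker endpoints and outliers under the 1.5 * IQR rule.

    Whiskers extend to the most extreme data values still within
    [q1 - 1.5*IQR, q3 + 1.5*IQR]; everything beyond is an outlier.
    """
    samples = np.asarray(samples, dtype=float)
    q1, q3 = np.percentile(samples, [25, 75])
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = samples[(samples >= lo_fence) & (samples <= hi_fence)]
    outliers = samples[(samples < lo_fence) | (samples > hi_fence)]
    return (inside.min(), inside.max()), outliers
```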
Figure 6:

Classification results for five-letter case.


Figure 7:

Classification results for seven-letter case.


In the five-letter case, the compared algorithms successfully classify the symbols with the single exception of the letter c. Note that in contrast to the previous example, the VB-ESN by far outperforms the other schemes; moreover, the distinction between the VB-ESN and HH-ESN schemes is now much more apparent. The HH-ESN and EXT-ESN schemes also produce more outliers than the VB-ESN algorithm. The failure of all three schemes to achieve a low classification error rate for the letter c can be explained by its similarity to the letter e. The analysis of the confusion matrix, shown in Table 1, reveals that the letter c is indeed often predicted as e. This leads to a higher dispersion of the classification errors, as can be seen in Figure 6c. Interestingly, the reverse is not true: the letter e is less often confused with c. Notice that the VB-ESN scheme is more successful in properly classifying c, having the smallest dispersion of Ecl. The same tendency is observed when the total classification error is analyzed.

Table 1:
Confusion Matrices for the Five-Letter Case Computed Jointly by EXT-ESN, HH-ESN, and VB-ESN Schemes.
- a: 1859 correctly classified
- b: 1424 correctly classified
- c: 761 correctly classified; 327 predicted as e
- d: 1623 correctly classified
- e: 2219 correctly classified

Notes: Correct classifications are the diagonal entries of the confusion matrix; the misclassification of c as e corresponds to a particularly high class prediction error.

In the five-letter case (see Figure 8a), the VB-ESN also has a much lower dispersion of the total classification error than the HH-ESN and EXT-ESN schemes. These results clearly demonstrate that the proposed joint optimization of the D&S readout parameters and regularization parameters significantly improves the performance of the trained models as compared to the other training schemes.

Figure 8:

Total classification error Etotal for the (a) five-letter case and (b) seven-letter case.


In the seven-letter classification scenario, increasing the number of classes clearly makes the classification more difficult: all schemes now make more errors. As in the five-letter case, the letter c is often confused with the letter e, as can be seen from the confusion matrix in Table 2. Also, the letter g is often confused with a by all compared algorithms. Interestingly, the HH-ESN is able to classify the letter e without error, while the VB-ESN is not (see Figure 7e). This can be explained as follows: increasing the complexity of the classification problem with a fixed reservoir size not only increases the classification errors but, in the multiclass classifier that we employ here, also redistributes the errors between the classes. Considering the distribution of the total classification error, shown in Figure 8b, we see that in contrast to the five-letter case, the performance of all schemes degrades, yet the performance of the VB-ESN algorithm is still slightly better than that of the other compared schemes.

Table 2:
Confusion Matrices for the Seven-Letter Case Computed Jointly by EXT-ESN, HH-ESN, and VB-ESN Schemes.
- a: 5655 correctly classified
- b: 4307 correctly classified
- c: 2288 correctly classified; 919 predicted as e
- d: 4819 correctly classified
- e: 6577 correctly classified
- g: 3223 correctly classified; 681 predicted as a
- h: 3286 correctly classified

Notes: Correct classifications are the diagonal entries of the confusion matrix; the misclassifications of c as e and of g as a correspond to particularly high class prediction errors.

## 6.  Conclusion

In this work we have proposed a variational Bayesian approach to automatic regularization and training of extended echo state networks with delay&sum (D&S) readouts. The proposed training framework combines sparse Bayesian learning methods with the variational Bayesian space-alternating generalized expectation-maximization (VB-SAGE) algorithm. We have demonstrated that the proposed scheme allows for an optimal regularization of the training algorithm, with regularization parameters being determined automatically by the input-output signals, additive noise, and the structure of the reservoir. The standard Tikhonov-like regularization of ESN training is obtained as a special case of the proposed approach. The estimated regularization parameters also provide an objective measure of the weights’ importance: excessive regularization required for some of the echo states or input signals indicates the irrelevance of these signals to the approximation of the target signal. This importance information, together with the estimated delay parameters of the D&S readout, can be potentially used for relating the structure of the reservoir and neuron responses to different features of the training data. However, the detailed analysis required to support this claim is beyond the scope of this letter.

In addition to automatic regularization, the standard ESN structure has been extended with tunable D&S readouts and filter neurons and, when filter neurons are fixed, the D&S readout parameters can be efficiently estimated using the VB-SAGE algorithm. Although in general the optimization of neuron parameters leads to an intractable nonlinear optimization, the variational approach allows a reduction of the optimization problem to a sequence of simpler one-dimensional searches with respect to the delay parameter of each neuron. The VB-SAGE-based estimation of the D&S readout parameters generalizes the ad hoc EM-based D&S readout parameter estimation proposed by Holzmann and Hauser (2010). Thus, the variational Bayesian framework for ESN training presented in this work generalizes some of the existing approaches to regularization and D&S readout parameter estimation, while at the same time providing a formal optimization framework for joint ESN training, regularization, and parameter estimation.

The proposed estimation scheme has been applied to forecasting a chaotic time series generated with a Mackey-Glass system and to a dynamic handwritten symbol recognition problem. Our results demonstrate that for time-series prediction, the proposed variational approach outperforms a simple extended ESN with random D&S readout parameters and performs on par with the algorithm proposed in Holzmann and Hauser (2010). In the handwritten character recognition problem, however, the advantages of the proposed training algorithm become more apparent: the proposed training scheme consistently outperforms the other algorithms, while only marginally increasing the computational complexity as compared to the training scheme discussed in Holzmann and Hauser (2010).

## Acknowledgments

We thank the reviewers for providing valuable comments that helped to significantly improve this letter. This research was supported in part by an Erwin Schrödinger Postdoctoral Fellowship, Austrian Science Fund Project J2909-N23, in part by the U.S. Office of Naval Research under grant N00014-09-1-0342, and in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, and by the U.S. Army Research Office under grant number W911NF-07-1-0185.

## Notes

1

In general, one can also reconstruct the desired output y[n] as , where is a bijective mapping and (Lukoševičius & Jaeger, 2009).

2

Note that in general, each element signal in a P-dimensional network output signal might have a different variance. This case can be accounted for by an appropriate, albeit more elaborate, noise model in a relatively straightforward fashion. Specifically, it will lead to the introduction of non-isotropic noise covariance matrices. This case is left outside the scope of the letter.

3

Note that practically, the filter hl[n] can represent a linear time-invariant system with either an infinite or a finite impulse response.

4

The ESN training algorithm proposed in this work can easily be extended to ESN with arbitrary, although fixed, filter neurons. This extension leads to a more complicated signal model without adding any new aspect relevant to the understanding of the new proposed concepts and methods.

5

It is also possible to extend the inference procedure discussed in the letter to Laplacian priors p(wk|αk). This selection leads to an ℓ1-type log-likelihood penalty ∑kαk|wk| and least-absolute shrinkage and selection operator (LASSO) regression. This development is outside the scope of this work.

6

Strictly speaking, this is so when p(τ) ∝ 1.

7

The Markov blanket of a variable node in a DAG is a set of nodes that includes parent nodes, children nodes, and coparents of the children nodes (Bishop, 2006).

8

We will assume that .

9

It can be shown that in this case, the minimum of is achieved when the correlation between and is maximized.

10

Recall that K = M + L.

11

The letter f was not recorded for the Williams database because it cannot be represented as a single stroke.

## References

Attias
,
H.
(
1999
).
Inferring parameters and structure of latent variable models by variational Bayes
. In
Proc. 15th Conference on Uncertainty in Artificial Intelligence, UAI ’99
(Vol. 2).
San Francisco
:
Morgan Kaufmann
.
Beal
,
M.
(
2003
).
Variational algorithm for approximate bayesian inference
.
Unpublished doctoral dissertation, University College London
.
Bishop
,
C. M.
(
2006
).
Pattern recognition and machine learning
.
New York
:
Springer
.
Bishop
,
C. M.
, &
Tipping
,
M. E.
(
2000
).
Variational relevance vector machines
. In
Proc. 16th Conference on Uncertainty in Artificial Intelligence, UAI ’00
(pp.
46
53
).
Piscataway, NJ
:
IEEE
.
Feder
,
M.
, &
Weinstein
,
E.
(
1988
).
Parameter estimation of superimposed signals using the EM algorithm
.
IEEE Transactions on Acoustics, Speech, and Signal Processing
,
36
(
4
),
477
489
.
Fessler
,
J.
, &
Hero
,
A.
(
1994
).
Space-alternating generalized expectation-maximization algorithm
.
IEEE Transactions on Signal Processing
,
42
(
10
),
2664
2677
.
Fleury
,
B.
,
Tschudin
,
M.
,
Heddergott
,
R.
,
Dahlhaus
,
D.
, &
Pedersen
,
K. I.
(
1999
).
Channel parameter estimation in mobile radio environments using the SAGE algorithm
.
IEEE Journal on Selected Areas in Communications
,
17
(
3
),
434
450
.
Frank
,
A.
, &
Asuncion
,
A.
(
2010
).
UCI machine learning repository
. http://archive.ics.uci.edu/ml/.
Golub
,
G. H.
, &
Van Loan
,
C. F.
(
1996
).
Matrix computations
(3rd ed.).
Baltimore, MD
:
Johns Hopkins University Press
.
Han
,
M.
, &
Wang
Y.
(
2009
). Nonlinear time series online prediction using reservoir Kalman filter. In
Proc. International Joint Conference on Neural Networks IJCNN ’09
(pp.
1090
1094
).
Piscataway, NJ
:
IEEE
.
Holzmann
,
G.
(
2008
).
Echo state networks with filter neurons and a delay&sum readout with applications in audio signal processing
.
Unpublished master's thesis, Graz University of Technology
.
Holzmann
,
G.
, &
Hauser
,
H.
(
2010
).
Echo state networks with filter neurons and a delay&sum readout
.
Neural Networks
,
23
(
2
),
244
256
.
Jaeger
,
H.
(
2001
).
The echo state approach to analyzing and training recurrent neural networks.
(Tech. Rep. No. 148).
Sankt Augustin
:
German National Research Institute for Computer Science
.
Jaeger
,
H.
(
2007
).
Discovering multiscale dynamical features with hierarchical echo state networks.
(Tech. Rep. No. 10).
Bremen, Germany
:
School of Engineering and Science, Jacobs University
.
Jaeger
,
H.
,
Maass
,
W.
, &
Principe
,
J.
(
2007
).
Special issue on echo state networks and liquid state machines
.
Neural Networks
,
20
(
3
),
287
289
.
Lukoševičius
,
M.
, &
Jaeger
,
H.
(
2009
).
Reservoir computing approaches to recurrent neural network training
.
Computer Science Review
,
3
(
3
),
127
149
.
MacKay, D. J. C. (1994). Bayesian methods for backpropagation networks. In E. Domany, J. L. van Hemmen, & K. Schulten (Eds.), Models of neural networks III (pp. 211–254). New York: Springer-Verlag.
Neal, R. (1996). Bayesian learning for neural networks. New York: Springer-Verlag.
Palmer, J., Wipf, D., Kreutz-Delgado, K., & Rao, B. (2006). Variational EM algorithms for non-Gaussian latent variable models. In Y. Weiss, B. Schölkopf, & J. Platt (Eds.), Advances in neural information processing systems, 18 (pp. 1059–1066). Cambridge, MA: MIT Press.
Seeger, M., & Wipf, D. (2010). Variational Bayesian inference techniques. IEEE Signal Processing Magazine, 27(6), 81–91.
Shutin, D., Buchgraber, T., Kulkarni, S. R., & Poor, H. V. (2011a). Fast adaptive variational sparse Bayesian learning with automatic relevance determination. In Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP'10 (pp. 2180–2183). Piscataway, NJ: IEEE.
Shutin, D., Buchgraber, T., Kulkarni, S. R., & Poor, H. V. (2011b). Fast variational sparse Bayesian learning with automatic relevance determination for superimposed signals. IEEE Transactions on Signal Processing, 59(12), 6257–6261.
Shutin, D., & Fleury, B. H. (2011). Sparse variational Bayesian SAGE algorithm with application to the estimation of multipath wireless channels. IEEE Transactions on Signal Processing, 59(8), 3609–3623.
Sung, J., Ghahramani, Z., & Bang, S.-Y. (2008a). Latent-space variational Bayes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12), 2236–2242.
Sung, J., Ghahramani, Z., & Bang, S.-Y. (2008b). Second-order latent-space variational Bayes for approximate Bayesian inference. IEEE Signal Processing Letters, 15, 918–921.
Tipping, M. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1, 211–244.
Tipping, M. E., & Faul, A. C. (2003). Fast marginal likelihood maximisation for sparse Bayesian models. In Proc. 9th International Workshop on Artificial Intelligence and Statistics. N.p.: Society for Artificial Intelligence and Statistics.
Tzikas, D. G., Likas, A. C., & Galatsanos, N. P. (2008). The variational approximation for Bayesian inference. IEEE Signal Processing Magazine, 25(6), 131–146.
Verstraeten, D., Schrauwen, B., & Stroobandt, D. (2006). Reservoir-based techniques for speech recognition. In Proc. International Joint Conference on Neural Networks IJCNN '06 (pp. 1050–1053). Piscataway, NJ: IEEE.
Verstraeten, D., Schrauwen, B., & Stroobandt, D. (2007). An experimental unification of reservoir computing methods. Neural Networks, 20(3), 391–403.
Williams, B. H. (2010). UCI character trajectories. http://archive.ics.uci.edu/ml/datasets/.
Wustlich, W., & Siewert, U. (2007). Echo-state networks with band-pass neurons: Towards generic time-scale-independent reservoir structures (Tech. Rep.). Raben Steinfeld, Germany: PLANET Intelligent Systems GmbH.
Xia, Y., Mandic, D. P., Hulle, M., & Principe, J. C. (2008). A complex echo state network for nonlinear adaptive filtering. In Proc. IEEE Workshop on Machine Learning for Signal Processing MLSP'08 (pp. 404–408). Piscataway, NJ: IEEE.
Zechner, C. (2010). Variational Bayesian reservoir computing and its applications to handwriting recognition. Master's thesis, Graz University of Technology.
Zechner, C., & Shutin, D. (2010). Bayesian learning of echo state networks with tunable filters and delay&sum readouts. In Proc. of International Conference on Acoustics, Speech and Signal Processing. Piscataway, NJ: IEEE.