Abstract

The echo state property is a key for the design and training of recurrent neural networks within the paradigm of reservoir computing. In intuitive terms, this is a passivity condition: a network having this property, when driven by an input signal, will become entrained by the input and develop an internal response signal. This excited internal dynamics can be seen as a high-dimensional, nonlinear, unique transform of the input with a rich memory content. This view has implications for understanding neural dynamics beyond the field of reservoir computing. Available definitions and theorems concerning the echo state property, however, are of little practical use because they do not relate the network response to temporal or statistical properties of the driving input. Here we present a new definition of the echo state property that directly connects it to such properties. We derive a fundamental 0-1 law: if the input comes from an ergodic source, the network response has the echo state property with probability one or zero, independent of the given network. Furthermore, we give a sufficient condition for the echo state property that connects statistical characteristics of the input to algebraic properties of the network connection matrix. The mathematical methods that we employ are freshly imported from the young field of nonautonomous dynamical systems theory. Since these methods are not yet well known in neural computation research, we introduce them in some detail. As a side story, we hope to demonstrate the eminent usefulness of these methods.

1.  Introduction

In this letter, we derive a number of theoretical results concerning the dynamics of input-driven neural systems, using mathematical methods that are still widely unknown in neural computation and mathematical neuroscience. We hope that both the results and the methods will be of interest to readers.

The results shed a new and sharp light on the question of when an input-driven neural dynamics is “stable” or “unstable.” These concepts are certainly understood in different ways in different communities and contexts. Here we address phenomena that are frequently described as “sensitive dependence on initial conditions” or “divergence of perturbed trajectories” and are often related to “chaotic” dynamics. Such intuitions are rooted in the theory of autonomous (i.e., not input-driven) dynamical systems, and it is in fact not trivial to cleanly extend them to input-driven systems. We establish a rigorous formal framework in which this notion of stability becomes well defined for input-driven systems and prove a number of theorems. Among those, we derive a 0-1 law for systems driven by input from an ergodic source, to the effect that the driven system is stable with probability zero or one.

Our work was originally motivated by questions that arise in the field of reservoir computing (RC), more specifically, in the subfield of echo state networks (ESNs). ESNs are artificial recurrent neural networks (RNNs) used in machine learning for the supervised training of temporal pattern recognizers, pattern generators, predictors, controllers, and more. (For a short overview, see Jaeger, 2007; an application-oriented paper is Jaeger & Haas, 2004; and a survey of the state of the art is Lukoševičius & Jaeger, 2009.) Other RC flavors besides ESNs are liquid state machines (Maass, Natschläger, & Markram, 2002), backpropagation-decorrelation learning (Steil, 2004), and temporal recurrent neural networks (Dominey, Arbib, & Joseph, 1995). The basic idea behind RC is to drive a randomly created RNN (the reservoir) with the task input signal and to distill a desired output signal from the input-excited RNN internal dynamics by a trainable readout mechanism—often just a linear readout trained by linear regression of the target output on the excited internal activation traces. A necessary enabling condition for this scheme to work is that the reservoir possesses the echo state property (ESP). This is a particular stability concept for which a number of equivalent definitions are available (Jaeger, 2001). Intuitively, these amount to the property that the reservoir dynamics asymptotically washes out initial conditions, or in other words, is input forgetting or state forgetting. The ESP is connected to spectral properties of the network weight matrix, and some work has been spent on stating and refining these conditions (Jaeger, 2001; Buehner & Young, 2006; Yildiz, Jaeger, & Kiebel, 2012).
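To make the RC scheme concrete, here is a minimal numerical sketch (our illustration, not taken from the cited works): a random tanh reservoir is driven by a scalar signal, and a linear readout is trained by least squares to recover a delayed copy of the input. All sizes and scalings are arbitrary illustrative choices.

```python
# Minimal reservoir computing sketch (illustrative assumptions throughout):
# drive a random tanh reservoir, then fit a linear readout by least squares.
import numpy as np

rng = np.random.default_rng(6)
N, T, delay = 100, 2000, 5
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 0.9   # internal weights
w_in = rng.normal(0, 1.0, N)                        # input weights

u = rng.uniform(-1, 1, T)                           # scalar input signal
X = np.zeros((T, N))                                # collected reservoir states
x = np.zeros(N)
for t in range(T):
    x = np.tanh(w_in * u[t] + W @ x)
    X[t] = x

target = np.roll(u, delay)            # task: recall the input from 5 steps ago
wash = 100                            # discard the initial transient
w_out, *_ = np.linalg.lstsq(X[wash:], target[wash:], rcond=None)
pred = X[wash:] @ w_out
print("readout NRMSE:", np.sqrt(np.mean((pred - target[wash:])**2)) / np.std(target))
```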

Importantly, the ESP is intrinsically tied to the characteristics of the driving input. It may well be the case that for inputs of some kind, a reservoir does not forget initial states, while for others it does. Therefore, the ESP is not a property of a reservoir per se, but a property of a pair (reservoir, “set of admissible inputs”). Concretely, in all available definitions and conditions relating to the ESP, the admissible inputs are characterized solely by their value range: it is presupposed that the input takes values in a compact set U, whereby the ESP becomes a property of a pair (reservoir, U). This setting has been the only one accessible to a mathematical treatment so far, which explains its persistence, but it is hardly relevant for the daily practice of reservoir computing and has given rise to widespread misconceptions (see the discussion in Yildiz et al., 2012).

The troublesome issue with specifying admissible inputs solely through their range is the following. Consider a standard discrete-time reservoir RNN with a tanh sigmoid. It is intuitively clear that the tanh mapping is more contractive for large-amplitude neural activations than for small-amplitude ones because the slope of the tanh is greatest around zero: larger arguments become more strongly quenched by the tanh tails. Thus, when a tanh reservoir is driven by large-amplitude input, the reservoir neurons become highly excited, the tanh quenches strongly, and the result is an overall forgetting of initial conditions. In contrast, for small-amplitude input, one may witness that the washing-out characteristic becomes lost. In particular, a constant zero input is the most dangerous one for losing the ESP. But often in a practical application, the relevant input range contains zero. One then has $0 \in U$, and since all that is stated about possible inputs is their range, one also earns the constant zero signal as an admissible input, which has to be accommodated when ascertaining the ESP. This is unrealistic because more often than not in an application, one will never encounter the constant-zero input. In addition, this leads to unnecessarily strict constraints on the reservoir weight matrix because the ESP also has to be guaranteed for the zero input signal.
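The amplitude effect just described can be observed in a few lines of simulation. The following sketch (our illustration; all sizes and scalings are assumptions) drives one and the same reservoir from two different initial states and records how fast the two state trajectories merge for weak versus strong input.

```python
# Illustrative sketch: the same tanh reservoir forgets its initial state
# quickly under large-amplitude input, while two different initial states
# can stay separated much longer when the input is weak or zero.
import numpy as np

rng = np.random.default_rng(0)
N = 100                                              # reservoir size (assumed)
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 1.2    # internal weights
w_in = rng.normal(0, 1.0, N)                         # input weights

def run(amplitude, steps=200):
    """Drive the reservoir from two different initial states and
    return the final distance between the two state trajectories."""
    x, y = rng.uniform(-1, 1, N), rng.uniform(-1, 1, N)
    for _ in range(steps):
        u = amplitude * rng.normal()                 # i.i.d. scalar input
        x = np.tanh(w_in * u + W @ x)
        y = np.tanh(w_in * u + W @ y)
    return np.linalg.norm(x - y)

for a in (0.0, 0.1, 5.0):
    print(f"input amplitude {a}: state distance after 200 steps = {run(a):.2e}")
```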

This situation has led to some confusion. In many published RC studies, one finds an initial discussion of how the weight matrix was scaled to ensure the ESP (in which case the weight matrix is typically suboptimally scaled), while in other studies, one finds informal statements to the effect that a weight matrix scaling was used that formally forfeits the ESP but worked well nonetheless because the input signal was strong enough (in which case there is no theoretical foundation for what was done). In some other published work, the confusion culminates in the (incorrect) approach of scaling the reservoir weight matrix to the border of chaos by setting it such that the ESP for zero input just gets lost (or does not get lost; see a discussion of these issues in Yildiz et al., 2012).

All in all, an alternative definition of the ESP that respects the nature of the expected input signals in more detail than just through fixing their range would be very welcome. We provide such an alternative definition in this letter. In fact, we define the ESP for a specific single input signal. Our definition is not constrained to RNN settings but covers general input-driven dynamical systems provided their state space is compact. From this single-input-signal-based definition of the ESP, we are able to derive the general 0-1 law (that if the ESP is obtained for a particular input signal, then with probability 1, it is also obtained for other inputs from the same source). Furthermore, returning to the specific case of tanh reservoirs, we relate the statistics of the input signal to spectral properties of the weight matrix such that the ESP is guaranteed. While the bounds that we were able to spell out are still far from tight, we perceive these results as door openers to further progress.

The methods we use come from the young and strongly developing theory of nonautonomous dynamical systems (NDS). In mathematics, a (discrete-time) NDS is a dynamical system whose update map is time varying. That is, while an autonomous dynamical system is governed by a single update map $g: X \to X$ via $x_{n+1} = g(x_n)$, an NDS is updated by a different map $g_n$ at every time $n$ via $x_{n+1} = g_n(x_n)$. Input-driven systems are a special case of NDS: given a particular input sequence $(u_n)_{n \in \mathbb{Z}}$ and an input-respecting update map $g: U \times X \to X$, one obtains the time-dependent maps $g_n$ by $g_n(x) := g(u_n, x)$.
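In code, the passage from an input-driven map to the time-indexed maps of an NDS is just a freezing of the input argument, as the following toy sketch (ours; the map and input values are arbitrary) shows.

```python
# Toy sketch: an input-driven update map g plus a fixed input sequence
# induces the time-indexed maps g_n of a nonautonomous system.
import numpy as np

def g(u, x):
    """Input-driven update map g: U x X -> X (a toy tanh unit)."""
    return np.tanh(0.8 * x + u)

u_seq = {0: 0.3, 1: -1.0, 2: 0.5}        # a fragment of an input sequence

def make_g_n(n):
    """Freeze the input at time n, yielding the map g_n(x) := g(u_n, x)."""
    return lambda x: g(u_seq[n], x)

x = 0.0
for n in range(3):
    x = make_g_n(n)(x)                   # x_{n+1} = g_n(x_n)
print(x)
```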

Biological and artificial neural information processing systems are almost always input driven. The natural background theory to analyze them would thus be the theory of NDS. However, the theory of NDS is significantly more complex, significantly less developed, and much less known than the familiar theory of autonomous systems. In addition, familiar concepts like attractors and bifurcations reappear in new shapes and bear properties that are thoroughly different from the characteristics of their autonomous counterparts. Furthermore, a number of different basic concepts of attractivity are being used in the field. We have started to accommodate attractor concepts from NDS theory for neural dynamics phenomena elsewhere (Manjunath & Jaeger, 2012). Here we reuse some of the definitions and results from that work. For the purpose at hand, only elementary concepts from NDS theory are necessary. For readers not familiar with NDS, this letter might serve as a gentle first notice of the theory of NDS, and we hope that the benefits of this set of methods become clear. We provide essential references as our treatment unfolds. Suggested readings containing important works in this area include Kloeden and Rasmussen (2011) as a reference text, Pötzsche (2010) for a linearization theory, Colonius and Kliemann (2000) for nonautonomous control, Rasmussen (2007) for bifurcation theory, and Arnold (1998) for random dynamics.

We hope that a wider use of proper NDS concepts may help to make some of the neural dynamics analysis more rigorous, and also more appropriate to its subject, than is possible when one tries to adapt concepts from autonomous dynamics. In section 2.1, we highlight the hygienic capacities of NDS theory in a critique of sensitivity-to-perturbation analyses that are sometimes found in the neural dynamics literature.

The letter is organized as follows. In section 2, we redraft the echo state property by defining it with regard to an input sequence; we also analyze a simple special case where the input is periodic. Keeping in view that the ESP has a wider use beyond artificial neural networks, we spell out all our definitions for a general input-driven dynamical system on a metric space. In section 3, we prove a probability 0 or 1 determination of the echo state property for an input-driven system. In section 4, for a given artificial recurrent neural network with standard (tanh) sigmoid nonlinear activations, we establish sufficient conditions on the input for the ESP to hold in terms of an induced norm of the internal weight matrix.

2.  The Echo State Property with Regard to an Input

An input-driven system (IDS) on a metric space X is a continuous map $g: U \times X \to X$, where U is the metric space that contains the values of the driving input. In this letter, we consider only those IDS for which X is compact and U is a complete metric space. This includes discrete-time recurrent neural networks whose neurons have bounded activations and which are driven by real vector-valued input signals. The dynamics of such an IDS, when driven by an input sequence $\bar{u} := (u_n)_{n \in \mathbb{Z}} \in U^{\mathbb{Z}}$, is realized through $x_{n+1} = g(u_n, x_n)$. In the rest of this letter, we denote an IDS either by $g: U \times X \to X$ or just by g, with the assumption that U is a complete and X a compact metric space left implicit. Throughout, we denote the diameter of a set A by $\mathrm{diam}(A) := \sup_{x, y \in A} d(x, y)$. Also, for a vector x in $\mathbb{R}^N$, we denote by $\|x\|$ its Euclidean norm, and the operator or induced norm of any linear transformation T is denoted by $\|T\|$.

The state evolution in an IDS is studied through its orbits or solutions. Any sequence $(x_k)_{k \in \mathbb{Z}}$ is called an entire solution of the IDS if there exists some input sequence $(u_k)_{k \in \mathbb{Z}}$ such that $x_{k+1} = g(u_k, x_k)$ for all $k \in \mathbb{Z}$.

We now recall the original definition of the echo state property, which was stated for an RNN with a compact input space in Jaeger (2001). We formulate it in the more general framework of an IDS and do not restrict the input space U to be compact.

Definition 1 

(Jaeger, 2001).  Let $g: U \times X \to X$ be an input-driven system, where X is compact and U is complete. A sequence $(x_k)_{k \leq 0}$ is said to be compatible with $(u_k)_{k < 0}$ when $x_{k+1} = g(u_k, x_k)$ for all $k < 0$. The input-driven system has the echo state property with respect to U if for any given input sequence $(u_k)_{k < 0}$ and sequences $(x_k)_{k \leq 0}$, $(y_k)_{k \leq 0}$, both compatible with $(u_k)_{k < 0}$, the equality $x_0 = y_0$ holds.

A simple consequence of the ESP with regard to an input space, expressed in terms of entire solutions, follows:

Proposition 1. 

Suppose $g: U \times X \to X$ is an input-driven system that has the echo state property with respect to U. If $(x_k)_{k \leq 0}$ and $(y_k)_{k \leq 0}$ are both compatible with $(u_k)_{k < 0}$, then $x_k = y_k$ for all $k \leq 0$. As a consequence, for any input sequence $(u_k)_{k \in \mathbb{Z}}$, there exists at most one entire solution.

Proof. 

Since $(x_k)_{k \leq 0}$ and $(y_k)_{k \leq 0}$ are both compatible with $(u_k)_{k < 0}$, it follows by the definition of compatibility in definition 1 that $(x'_k)_{k \leq 0}$ and $(y'_k)_{k \leq 0}$ are both compatible with $(u'_k)_{k < 0}$, where $x'_k := x_{k-1}$, $y'_k := y_{k-1}$, and $u'_k := u_{k-1}$. Since g has the ESP, by definition 1, $x'_0 = y'_0$, that is, $x_{-1} = y_{-1}$. When we repeat this argument, an obvious induction yields $x_k = y_k$ for all $k < 0$. If $x_0 = y_0$, then trivially, by the definition of an entire solution obtained from the input $(u_k)_{k \in \mathbb{Z}}$, we have $x_k = y_k$ for all $k > 0$. Thus, $x_k = y_k$ for all $k \in \mathbb{Z}$, and hence there exists at most one entire solution.

We now approach the core matter of this letter: a treatment of the ESP at the resolution of individual input sequences:

Definition 2. 

An input-driven system is said to have the echo state property with respect to an input sequence $\bar{u} = (u_k)_{k \in \mathbb{Z}} \in U^{\mathbb{Z}}$ if there exists exactly one entire solution for this input; that is, if $(x_k)_{k \in \mathbb{Z}}$ and $(y_k)_{k \in \mathbb{Z}}$ are entire solutions obtained from $\bar{u}$, then $x_k = y_k$ for all $k \in \mathbb{Z}$.

This input-sequence-sensitive definition of the ESP is related to the classical version as follows. We will show that any IDS has at least one entire solution for a given input sequence. Acknowledging this fact, it is then straightforward from definitions 1 and 2 and proposition 1 that an IDS has the ESP with regard to the input space U if and only if it has the ESP with regard to every $\bar{u} \in U^{\mathbb{Z}}$.

For a deeper analysis of the ESP with regard to input sequences, we use methods from the theory of nonautonomous dynamical systems (Kloeden & Rasmussen, 2011). Because these methods are not yet widely known in the neural computation world, we recall core concepts and properties.

A discrete-time nonautonomous system on a state space X is a (time-indexed) family of maps $(g_n)_{n \in \mathbb{Z}}$, where each $g_n: X \to X$ is a continuous map and the state of the system at time n satisfies $x_n = g_{n-1}(x_{n-1})$. Since we will be concerned only with the discrete-time case, we drop the qualifier discrete time in the remainder of this letter. Clearly an IDS g, together with an input sequence $(u_n)_{n \in \mathbb{Z}}$, gives rise to a nonautonomous system where $g_n := g(u_n, \cdot)$. Following Kloeden and Rasmussen (2011) and several other authors, we recall the definition of what is called a process for a nonautonomous system. Although the term process has potentially confusing connotations, it is standard terminology in the theory of nonautonomous dynamical systems, so we retain it. In essence, process here simply refers to a particular notation for a nonautonomous system, which will turn out to be very convenient:

Definition 3. 

Let $\mathbb{Z}^2_{\geq} := \{(n, m) \in \mathbb{Z}^2 : n \geq m\}$. A process on a state-space X is a continuous mapping $\phi: \mathbb{Z}^2_{\geq} \times X \to X$ that satisfies these evolution properties:

  • $\phi(m, m, x) = x$ for all $m \in \mathbb{Z}$ and $x \in X$.

  • $\phi(n, m, x) = \phi(n, k, \phi(k, m, x))$ for all $m \leq k \leq n$ and $x \in X$.

A sequence $(x_n)_{n \in \mathbb{Z}}$ is said to be an entire solution of $\phi$ if $x_n = \phi(n, m, x_m)$ for all $n \geq m$.

It is readily observed that a nonautonomous system $(g_n)_{n \in \mathbb{Z}}$ on X generates a process $\phi$ on X by setting $\phi(m, m, x) := x$ and $\phi(n, m, x) := g_{n-1} \circ \cdots \circ g_m(x)$ for $n > m$. To verify that $\phi$ is a process, we need to verify continuity. Continuity in the first two (integer) variables of $\phi$ is trivial. Also, the composition of finitely many continuous mappings makes the map $\phi(n, m, \cdot)$ continuous, and hence $\phi$ is continuous. Conversely, for every given process $\phi$ on X, we obtain a nonautonomous system by defining $g_n := \phi(n+1, n, \cdot)$. Likewise, the notion of an entire solution is equivalently transferred between the NDS and process formulation. Thus, a “process” and an “NDS” provide two equivalent views on the same object. We will switch between these views at our convenience.
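The following toy sketch (ours; the maps are arbitrary stand-ins) mirrors this notation: the process is implemented as the composition of the time-indexed maps, and the two-parameter evolution property of definition 3 can be checked numerically.

```python
# Toy sketch of the process notation: phi(n, m, x) composes the maps
# g_m, ..., g_{n-1} and transports a state from time m to time n.
import numpy as np

def phi(n, m, x, g_maps):
    """Process phi(n, m, x) = g_{n-1} o ... o g_m (x); phi(m, m, x) = x."""
    for k in range(m, n):
        x = g_maps[k](x)
    return x

# Illustrative time-indexed maps g_k (each k captured in its own closure).
g_maps = {k: (lambda k: (lambda x: np.tanh(0.5 * x + np.sin(k))))(k)
          for k in range(-5, 5)}

x = 0.2
# Two-parameter evolution property: phi(n, m, x) = phi(n, k, phi(k, m, x)).
lhs = phi(4, -3, x, g_maps)
rhs = phi(4, 0, phi(0, -3, x, g_maps), g_maps)
print(np.isclose(lhs, rhs))   # True
```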

Next, for each process on X, we define a particular sequence of subsets of X, which carries much information about the qualitative behavior of the process:

Definition 4. 
Let $\phi$ be a process on a compact space X. The sequence $(X_n)_{n \in \mathbb{Z}}$, defined by
$$X_n := \bigcap_{m < n} \phi(n, m, X),$$
is called the natural association of $\phi$ on X.

Since for any n,
$$\phi(n, m-1, X) = \phi\big(n, m, \phi(m, m-1, X)\big) \subseteq \phi(n, m, X), \tag{2.1}$$
$X_n$ is a nested intersection of sets in definition 4. It is clear that if some $g_{n_0}$ in $(g_n)_{n \in \mathbb{Z}}$ is not surjective on X, then the set $(g_{n_0}(X))^c$ (where $^c$ denotes set complement) is nonempty. Hence any entire solution of $\phi$ would not assume a value in the set $(g_{n_0}(X))^c$ at time $n_0 + 1$. Indeed a much stronger condition holds: $X_n$ is exactly the set of points x through which some entire solution passes at time n. The natural association can thus be intuitively regarded as the (tight) envelope of all entire solutions. In order to ultimately establish this fact, we first note that the natural association is $\phi$-invariant:
Proposition 2 

(see Manjunath & Jaeger, 2012).  Let $\phi$ be a process on a compact space X. Then the natural association $(X_n)_{n \in \mathbb{Z}}$ is such that each $X_n$ is a nonempty closed subset of X and is $\phi$-invariant, that is, $\phi(n+1, n, X_n) = X_{n+1}$ for any $n \in \mathbb{Z}$, and hence, for all $n \geq m$, $\phi(n, m, X_m) = X_n$.

The proof is a straightforward exploitation of the compactness of X and can be found in Manjunath and Jaeger (2012) (also reproduced in the appendix). Using this finding, one obtains the desired characterization of the natural association as the envelope of entire solutions:

Lemma 1 
(see Manjunath & Jaeger, 2012).  Let $\phi$ be a process on a compact space X. A sequence $(X_n)_{n \in \mathbb{Z}}$ of subsets of X is the natural association of $\phi$ if and only if for all $n \in \mathbb{Z}$, it holds that
$$X_n = \pi_n\big(\{(x_k)_{k \in \mathbb{Z}} : (x_k)_{k \in \mathbb{Z}} \text{ is an entire solution of } \phi\}\big), \tag{2.2}$$
where $\pi_n: X^{\mathbb{Z}} \to X$ is the projection map onto the nth coordinate.

The proof is adapted from Manjunath and Jaeger (2012) and found in the appendix.

If g is an IDS, it follows from proposition 2 and lemma 1 that the set of entire solutions is nonempty. Hence in definition 2 of the ESP with regard to $\bar{u}$, the required existence of exactly one entire solution singles out the case of exactly one such solution against cases where there is more than one solution. The case with no entire solution cannot arise.

We proceed to give a sufficient condition for a process to have exactly one solution. This condition is technical and will be used later in the proof of a core theorem:

Lemma 2. 

Let $\phi$ be a process on a compact metric space X metrized by d. Suppose that for all $n \in \mathbb{Z}$, there exists a sequence $(a_j^{(n)})_{j \in \mathbb{N}}$ of positive reals converging to 0 such that
$$d\big(\phi(n, n-j, x), \phi(n, n-j, y)\big) \leq a_j^{(n)}\, d(x, y)$$
for all $x, y \in X$ and for all $j \in \mathbb{N}$. Then there is exactly one entire solution of the process $\phi$.

Proof. 
Assume that there are two distinct entire solutions $(x_n)_{n \in \mathbb{Z}}$ and $(y_n)_{n \in \mathbb{Z}}$. Then $x_{n_0} \neq y_{n_0}$ for some $n_0$. For any such $n_0$, by hypothesis there exists $(a_j^{(n_0)})_{j \in \mathbb{N}}$ converging to 0 such that
$$d\big(\phi(n_0, n_0 - j, x_{n_0 - j}), \phi(n_0, n_0 - j, y_{n_0 - j})\big) \leq a_j^{(n_0)}\, d(x_{n_0 - j}, y_{n_0 - j}).$$
Since by definition of an entire solution, $x_{n_0} = \phi(n_0, n_0 - j, x_{n_0 - j})$ and $y_{n_0} = \phi(n_0, n_0 - j, y_{n_0 - j})$, we have $d(x_{n_0}, y_{n_0}) \leq a_j^{(n_0)}\, d(x_{n_0 - j}, y_{n_0 - j})$ for all j. But $d(x_{n_0 - j}, y_{n_0 - j}) \leq \mathrm{diam}(X)$ is finite because X is compact, and hence $a_j^{(n_0)}\, d(x_{n_0 - j}, y_{n_0 - j}) \to 0$ as $j \to \infty$. This is a contradiction of the fact that $d(x_{n_0}, y_{n_0}) > 0$.

Notice that in this lemma, the required null sequences $(a_j^{(n)})_{j \in \mathbb{N}}$, which capture the rate of contraction from the past to the present, depend on n. This allows a time-varying degree of contractivity in the process. It is even possible that for limited periods, the maps are expanding. Specifically, consider the case of an input-driven recurrent neural network with no internal weight adaptation (e.g., exploiting a network after a training phase). State contraction over time in the sense of the lemma is, in intuitive terms, related to input forgetting: when the contraction rate is high (i.e., $(a_j^{(n)})$ converges quickly to zero), information about earlier input is quickly washed out from the network state. In a nonadapting RNN, the temporal variation of the contractivity of the process is entirely due to the time-varying input itself. Again, still in purely intuitive terms, this means that some temporal input patterns can interact with the network such that they will be quickly forgotten, while other input patterns may be better preserved over longer time spans—or may in fact even become enhanced in the network state if the induced maps are expanding.
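The contraction sequences of lemma 2 can be probed numerically. The sketch below (our illustration; network and input are arbitrary assumptions) transports a cloud of initial states from time n − j to time n through a driven tanh reservoir and records the diameter of the image cloud as the lookback j grows.

```python
# Numerical sketch of lemma 2's contraction from the past to the present:
# the diameter of phi(n, n-j, cloud) is an empirical proxy for how strongly
# the process contracts as the lookback j grows.
import numpy as np

rng = np.random.default_rng(1)
N, n, J, M = 50, 0, 30, 20
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 1.1
w_in = rng.normal(0, 1.0, N)
u = {k: 2.0 * rng.normal() for k in range(n - J, n)}   # one frozen input realization

def phi_from(m, x):
    """Transport state x from time m to time n through the driven network."""
    for k in range(m, n):
        x = np.tanh(w_in * u[k] + W @ x)
    return x

for j in (1, 5, 15, 30):
    cloud = rng.uniform(-1, 1, (M, N))
    imgs = np.array([phi_from(n - j, x) for x in cloud])
    d = max(np.linalg.norm(a - b) for a in imgs for b in imgs)
    print(f"j = {j:2d}: diameter of phi(n, n-j, cloud) = {d:.2e}")
```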

2.1.  Some Remarks on Folklore

The ESP is constitutive for the echo state network (ESN) approach to training RNNs. In the literature, some assumptions are tacitly and pervasively taken for granted though they have not yet been proven. Furthermore, we also witness a lack of conceptual rigor in some published work, especially with respect to the use of notions from dynamical systems theory (attractors and chaos in particular). Here we clarify some of these themes and point to leads from nonautonomous dynamical systems theory.

First, we consider the case of an RNN driven by periodic input. This situation arises commonly when such systems are trained as periodic pattern generators or recognizers. It is taken for granted by ESN practitioners (the second author included) that the induced network dynamics is (or asymptotically becomes) periodic of the same period. Our setting here helps us make this intuition rigorous.

Proposition 3. 

Let g be an input-driven system with input $\bar{u} = (u_n)_{n \in \mathbb{Z}}$ being p-periodic; that is, the smallest positive integer s for which $u_{n+s} = u_n$ for all n is p. Suppose g has the ESP with regard to $\bar{u}$; then the entire solution $(x_n)_{n \in \mathbb{Z}}$ satisfies $x_{n+p} = x_n$ for all n. This entails that $(x_n)_{n \in \mathbb{Z}}$ is r-periodic, with $p = kr$ for some integer k.

Proof. 

Let $\phi$ be the process of the IDS corresponding to the input $\bar{u}$, and let $(x_n)_{n \in \mathbb{Z}}$ be the entire solution of $\phi$. From lemma 1, we have $X_n = \{x_n\}$ for any n. Since $\bar{u}$ is p-periodic, by definition of $\phi$, it follows that $\phi(n+p, m+p, x) = \phi(n, m, x)$ for all $m < n$ and $x \in X$. Hence, $X_{n+p} = \bigcap_{m < n+p} \phi(n+p, m, X) \subseteq \bigcap_{m < n} \phi(n+p, m+p, X) = \bigcap_{m < n} \phi(n, m, X) = X_n$. Thus, as subsets of X, the set inclusion $X_{n+p} \subseteq X_n$ holds. But since these sets are singletons and, moreover, nonempty, $X_{n+p} = X_n$, and therefore $x_{n+p} = x_n$ for any n. This directly entails that $(x_n)_{n \in \mathbb{Z}}$ is r-periodic, with $p = kr$ for some integer k, where r is the smallest period of $(x_n)_{n \in \mathbb{Z}}$.

As a noteworthy special case, a constant (i.e., 1-periodic) input $u_n \equiv u^*$ induces a constant entire solution $x_n \equiv x^*$. However, entire solutions extend from the infinite past, whereas in real life, an RNN is started at some time 0 from some initial condition $x_0$, after which its positive time evolution is observed. It follows easily from lemma 1 and proposition 3 that $x_n$ converges to $x^*$ as $n \to \infty$.
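A quick simulation (ours; all settings are illustrative) confirms this picture: after transients are washed out, the state sequence of a reservoir with the ESP repeats with the period of the input (or a divisor of it).

```python
# Illustrative check of proposition 3: under p-periodic input, a reservoir
# with the ESP settles onto a state sequence of the same period (or a
# divisor of it), regardless of the initial condition.
import numpy as np

rng = np.random.default_rng(2)
N, p = 30, 4
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 0.8   # small ||W||: ESP holds
w_in = rng.normal(0, 1.0, N)
u_cycle = rng.normal(0, 1.0, p)                      # one period of input

x = rng.uniform(-1, 1, N)
for k in range(2000):                                # wash out transients
    x = np.tanh(w_in * u_cycle[k % p] + W @ x)

x_ref = x.copy()
for k in range(2000, 2000 + p):                      # advance one full period
    x = np.tanh(w_in * u_cycle[k % p] + W @ x)
print(np.linalg.norm(x - x_ref))                     # ~ 0: state is p-periodic
```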

Jumping from the simplest (constant) to the most complex behavior, we indulge in some cautionary remarks on using the notion of chaos when describing input-driven RNNs. We sometimes find in the literature discussions of RNNs driven by nonperiodic input where statements concerning chaotic behavior are made. Typically, chaos (or the “edge of chaos”) is identified numerically by simulation experiments. The RNN is driven several times with the same input sequence; at some time, the network state is slightly perturbed, and an exponential divergence rate of the perturbed trajectory from the unperturbed reference trajectory is quantified. If this rate is found positive, then chaotic dynamics is claimed.

This way of proceeding is dubious, and new approaches are needed, for a number of reasons. First, the term chaos originates from the theory of autonomous systems—systems without input or driven by constant or periodic input (in which case they can be mathematically treated as autonomous). In the original context of autonomous systems, chaoticity is a property of attractors (sometimes more generally of invariant sets). Thus, for conceptual hygiene, it should be understood as a characteristic of attractors in nonautonomous systems as well. But in general, for nonautonomous systems, attractivity notions are complicated, and typically an attractor is not a single subset of X but a particular sequence of subsets of X that is $\phi$-invariant, where $\phi$ is the process of the nonautonomous system (Kloeden & Rasmussen, 2011). The mathematical theory of attraction in nonautonomous dynamics is in its infancy, and a number of nonequivalent proposals for defining attractors have been forwarded, often depending on specific topological or probabilistic conditions. We bring to the attention of readers interested in nonautonomous attractivity that the natural association of the process is also what is called a pullback attractor of the underlying nonautonomous system (Kloeden & Rasmussen, 2011; also see Manjunath & Jaeger, 2012). Further comments on attractor notions are beyond the scope of this letter.

As we mentioned after lemma 2, it may well be the case that for certain limited periods, a driving input leads to expanding mappings, while on longer time spans, it results in, on average, contracting dynamics. Specifically, if a standard RNN is driven with strong enough input, its units will be driven close to their saturation, which in turn leads to contractive dynamics in the sense of our lemma 2 (or even stronger versions of contraction). If such contraction maps appear over long enough time spans, the role of the expanding maps diminishes, and sensitivity to initial conditions may eventually disappear. Any numerical detection of sensitivity to initial conditions by perturbation experiments is only a temporally local finding in input-driven systems and by itself can neither imply nor preclude chaos. Thus, chaos has to be properly acknowledged to be an attribute of the pair of input signal and driven system. In more technical terms, since chaos is an asymptotic concept (e.g., complexity quantifiers like Lyapunov exponents or topological entropy are obtained as time-related asymptotic quantities), to verify chaos in an input-driven system, the effect of the asymptotics of the input has to be factored into the picture. Notions of topological entropy for nonautonomous systems are under investigation, and some interesting results can be found in Kolyada and Snoha (1996), Oprocha and Wilczynski (2009), and Zhang and Chen (2009). Since an IDS g and an input $(u_n)_{n \in \mathbb{Z}}$ give rise to an NDS $(g_n)_{n \in \mathbb{Z}}$, calculating the topological entropy of $(g_n)_{n \in \mathbb{Z}}$ actually quantifies the dynamical complexity by accounting for the input asymptotics. A formal calculation of topological entropy in this case depends on knowing the individual maps $g_n$ explicitly, unless algorithms are developed for estimating entropy from a time series. When the input sequence is drawn from a finite-valued source, it appears possible to estimate lower bounds on the topological entropy by the methods developed in Zhang and Chen (2009). In a related finding, but a very special case, an anonymous referee points us to Amigó, Gimenez, and Kloeden (2012), where it is shown that when the input sequence is only an arbitrary switching between two values, and the two different $g_n$s obtained are affine transformations on the real line, then the topological entropy of the switching sequence is identically equal to that of the NDS it induces. However, this result does not hold when the $g_n$s are not affine. For instance, in an artificial RNN, by scaling the input sequence by a suitably large number, the dynamics in the reservoir can be made trivial regardless of the input complexity. All of this prompts us to conclude that until estimates of topological entropy as in Zhang and Chen (2009), or other complexity measures factoring in the input asymptotics, are obtained, no rigorous claims of chaos can be made on the basis of perturbation-based detection experiments.
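The temporally local character of such perturbation experiments is easy to reproduce. In the sketch below (our illustration with arbitrary settings), a tiny perturbation first grows during a weak-input episode and is then annihilated when strong input saturates the units, so the measured divergence alone cannot establish chaos.

```python
# Illustrative perturbation experiment: finite-time divergence can look
# "chaotic" during a weak-input episode and vanish once strong input
# saturates the tanh units.
import numpy as np

rng = np.random.default_rng(3)
N = 50
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 1.5
w_in = rng.normal(0, 1.0, N)
# Weak input for 100 steps, then strong input for 100 steps.
u = np.concatenate([0.05 * rng.normal(size=100), 4.0 * rng.normal(size=100)])

x = rng.uniform(-1, 1, N)
y = x + 1e-8 * rng.normal(size=N)        # tiny perturbation at time 0
for k, uk in enumerate(u):
    x = np.tanh(w_in * uk + W @ x)
    y = np.tanh(w_in * uk + W @ y)
    if k in (49, 99, 149, 199):
        print(f"t={k+1:3d}: separation = {np.linalg.norm(x - y):.2e}")
```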

3.  A 0–1 Law for the ESP

In this section we consider an IDS with an input obtained as a realization of a U-valued stationary ergodic process $(U_n)_{n \in \mathbb{Z}}$ defined on a probability space $(\Omega, \mathcal{F}, P)$; that is, for each $\omega \in \Omega$ and n, $U_n(\omega)$ takes values in the set U. Each realization $(U_n(\omega))_{n \in \mathbb{Z}}$, where $\omega \in \Omega$, gives rise to a separate nonautonomous system and hence has its own natural association $(X_n(\omega))_{n \in \mathbb{Z}}$. We thus consider $(X_n)_{n \in \mathbb{Z}}$ to be a set-valued stochastic process. Before we embark on analyzing this object in more detail, we recall some standard notions from ergodic theory (Billingsley, 1979; Krengel, 1985; Skorokhod, Hoppensteadt, & Salehi, 2002; Walters, 1992):

  • Measure-theoretic dynamical systems and measure-preserving dynamical systems (Krengel, 1985; Walters, 1992): A measure-theoretic dynamical system is a quadruplet $(\Omega, \mathcal{F}, \mu, T)$, where $(\Omega, \mathcal{F}, \mu)$ is a measure space and $T: \Omega \to \Omega$ is a measurable map. A measure-theoretic dynamical system is said to be a measure-preserving dynamical system (MPDS) if $\mu(T^{-1}(A)) = \mu(A)$ for all A in $\mathcal{F}$. An MPDS is said to be ergodic if for all $A \in \mathcal{F}$, $T^{-1}(A) = A$ implies $\mu(A) = 0$ or $\mu(A) = 1$.

  • Representing a stationary stochastic process as an MPDS (Krengel, 1985): Let $(\Omega, \mathcal{F}, P)$ be a probability space and S a separable complete metric space. Let $\mathcal{B}(S)$ denote the Borel sigma field of S. Let $(Y_n)_{n \in \mathbb{Z}}$ be an S-valued stationary process. Consider $(S^{\mathbb{Z}}, \mathcal{B}(S^{\mathbb{Z}}))$, where $S^{\mathbb{Z}}$ is the Cartesian product of a bi-infinite countable number of copies of S and $\mathcal{B}(S^{\mathbb{Z}})$ is the sigma field generated by the product topology on $S^{\mathbb{Z}}$. For each $\omega \in \Omega$, there exists a path $\pi(\omega) \in S^{\mathbb{Z}}$ such that $\pi(\omega) = (\ldots, Y_{-1}(\omega), Y_0(\omega), Y_1(\omega), \ldots)$. The process and P induce a measure $\tilde{P}$ on $\mathcal{B}(S^{\mathbb{Z}})$ defined by $\tilde{P}(B) := P(\pi^{-1}(B))$ for all $B \in \mathcal{B}(S^{\mathbb{Z}})$. It holds that the set of all paths is in $\mathcal{B}(S^{\mathbb{Z}})$ and has $\tilde{P}$-measure 1. The process is stationary if and only if $(S^{\mathbb{Z}}, \mathcal{B}(S^{\mathbb{Z}}), \tilde{P}, \sigma)$ is an MPDS, where $\sigma$ is the shift map that sends a point $(\ldots, s_{-1}, s_0, s_1, \ldots)$ in $S^{\mathbb{Z}}$ to $(\ldots, s_0, s_1, s_2, \ldots)$.

  • Ergodic stochastic processes (Skorokhod et al., 2002): An S-valued stationary process $(Y_n)_{n \in \mathbb{Z}}$ is said to be an ergodic process on $(\Omega, \mathcal{F}, P)$ if for every two integers $k, l \geq 1$ and for any finite collections $A_1, \ldots, A_k$ and $B_1, \ldots, B_l$ of elements of the Borel sigma field of S, the limit
    $$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} P\big(Y_1 \in A_1, \ldots, Y_k \in A_k, Y_{1+j} \in B_1, \ldots, Y_{l+j} \in B_l\big)$$
    exists and is equal to
    $$P\big(Y_1 \in A_1, \ldots, Y_k \in A_k\big)\, P\big(Y_1 \in B_1, \ldots, Y_l \in B_l\big).$$
  • Ergodic stochastic processes as ergodic MPDS (Krengel, 1985): It can be shown that $(Y_n)_{n \in \mathbb{Z}}$ is ergodic if and only if the MPDS $(S^{\mathbb{Z}}, \mathcal{B}(S^{\mathbb{Z}}), \tilde{P}, \sigma)$ is ergodic.

  • Birkhoff's ergodic theorem (Krengel, 1985; Walters, 1992): A core result in ergodic theory is Birkhoff's ergodic theorem. It states that if $(S^{\mathbb{Z}}, \mathcal{B}(S^{\mathbb{Z}}), \tilde{P}, \sigma)$ is an ergodic MPDS and if $f \in L^1(\tilde{P})$, that is, $f$ is a complex-valued function defined on $S^{\mathbb{Z}}$ such that $\int |f|\, d\tilde{P} < \infty$, then the limit $\lim_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} f(\sigma^j(s))$ exists $\tilde{P}$-almost surely and, when it exists, is equal to the $\tilde{P}$-average:
    $$\lim_{n \to \infty} \frac{1}{n} \sum_{j=0}^{n-1} f\big(\sigma^j(s)\big) = \int f \, d\tilde{P}. \tag{3.1}$$
    A particular, useful application of equation 3.1 is: if $(Y_n)_{n \in \mathbb{Z}}$ is a real-valued ergodic process and $Y_k$ belongs to $L^1(P)$ for some k (and hence for all k, since for a stationary process, the distribution of $Y_k$ is independent of k), then
    $$\lim_{n \to \infty} \frac{1}{n} \sum_{j=1}^{n} Y_j = E[Y_k] \quad P\text{-almost surely.} \tag{3.2}$$
  • Hausdorff semidistance and Hausdorff metric: When X is a metric space with metric d, we denote by $\mathcal{H}_X$ the collection of all nonempty closed subsets of X. Let $h(A, B) := \sup_{a \in A} \inf_{b \in B} d(a, b)$ be the Hausdorff semidistance between any two $A, B \in \mathcal{H}_X$. It is well known that whenever X is complete (compact), $\mathcal{H}_X$ is also a complete (compact) metric space with the Hausdorff metric $d_H(A, B) := \max(h(A, B), h(B, A))$, equivalently defined by $d_H(A, B) = \inf\{\epsilon > 0 : A \subseteq B_\epsilon \text{ and } B \subseteq A_\epsilon\}$, where $A_\epsilon$ is the open $\epsilon$-neighborhood of A. (A small computational illustration follows this list.)

We now begin our analysis of an IDS whose input comes from a stationary ergodic source. The following lemma concerns the definition of a new real-valued process obtained from an IDS when its input is a realization of $(U_n)_{n \in \mathbb{Z}}$. Since notions of measurability are not obvious for set-valued functions, we prove the measurability of the functions involved in the appendix.

Lemma 3. 
Consider an IDS $g: U \times X \to X$ and a U-valued ergodic process $(U_n)_{n \in \mathbb{Z}}$ defined on $(\Omega, \mathcal{F}, P)$. For each realization $(U_n(\omega))_{n \in \mathbb{Z}}$, $\omega \in \Omega$, define the process $\phi_\omega$ on X where $\phi_\omega(n, m, x) := g^{\omega}_{n-1} \circ \cdots \circ g^{\omega}_m(x)$ with $g^{\omega}_k := g(U_k(\omega), \cdot)$. Let $(X_n)_{n \in \mathbb{Z}}$ be the set-valued stochastic process where $(X_n(\omega))_{n \in \mathbb{Z}}$ is the natural association of the process $\phi_\omega$ (notice that $X_n(\omega) \in \mathcal{H}_X$ as a consequence of proposition 2). Define $\theta_n: \Omega \to \{0, 1\}$ by
$$\theta_n(\omega) := \begin{cases} 0 & \text{if } \mathrm{diam}(X_n(\omega)) = 0, \\ 1 & \text{otherwise.} \end{cases}$$
Then $(\theta_n)_{n \in \mathbb{Z}}$ is an ergodic stochastic process.
Proof. 

Recalling the definition of the natural association and of $\theta_n$, we have $\theta_n = \chi \circ X_n$, where $\chi: \mathcal{H}_X \to \{0, 1\}$ is the map of lemma 7, which assigns 0 to the singletons of X and 1 to all other elements of $\mathcal{H}_X$. The set-valued function $X_n$ is measurable by lemma 6. From lemma 7, we know $\chi$ is also a measurable function. Since the composition of two measurable functions is also measurable, $\theta_n$ is measurable for any n. Applying statement ii of lemma 5, we obtain $(\theta_n)_{n \in \mathbb{Z}}$ to be an ergodic process.

We are now equipped for the main result of this section:

Theorem 1. 
Let $(U_n)_{n \in \mathbb{Z}}$ be a U-valued ergodic process defined on $(\Omega, \mathcal{F}, P)$ and $g: U \times X \to X$ an input-driven system. Then the set of all $\omega \in \Omega$ such that g has the echo state property with regard to $(U_n(\omega))_{n \in \mathbb{Z}}$, that is, the subset of $\Omega$,
$$E := \big\{\omega \in \Omega : g \text{ has the ESP with regard to } (U_n(\omega))_{n \in \mathbb{Z}}\big\},$$
has either probability 1 or 0.
Proof. 
Let $\phi_\omega$, $(X_n)_{n \in \mathbb{Z}}$, and $(\theta_n)_{n \in \mathbb{Z}}$ be defined as in lemma 3. Since $X_n(\omega)$ is nonempty, $\theta_n(\omega) = 0$ if and only if $X_n(\omega)$ contains only a singleton of X. We also know from lemma 1 that there is exactly one entire solution of $\phi_\omega$ if and only if $X_n(\omega)$ is a singleton for all n. Using this and the definition of $\theta_n$,
$$E = \{\omega \in \Omega : \theta_n(\omega) = 0 \text{ for all } n \in \mathbb{Z}\}.$$
By lemma 3 and lemma 5, $(\theta_n)_{n \in \mathbb{Z}}$ is an ergodic process. Since $\theta_i$ takes the values 0 and 1, $\theta_i$ belongs to $L^1(P)$ for any i. By Birkhoff's ergodic theorem in equation 3.2, the limit
$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \theta_i(\omega) \tag{3.3}$$
exists and assumes the same value almost surely. We know that a continuous map cannot map a singleton of X into a set of positive diameter. Hence, by the $\phi_\omega$-invariance of the natural association and lemma 1, it follows that if $\theta_i(\omega) = 0$ for some i, then $\theta_j(\omega) = 0$ for all $j \geq i$. Equivalently, if $\theta_i(\omega) = 1$ for some i, then $\theta_j(\omega) = 1$ for all $j \leq i$. Hence the limit in equation 3.3 is equal to 0 if $\theta_i(\omega) = 0$ for some $i \geq 1$ and equal to 1 if $\theta_i(\omega) = 1$ for all $i \geq 1$; in particular, it takes no values other than 0 and 1. Since the limit in that equation is almost surely the same constant, which by equation 3.2 equals $P(\theta_n = 1)$ for every n, either $P(\theta_n = 1) = 0$ for all n, in which case $P(E) = 1$, or $P(\theta_n = 1) = 1$ for all n, in which case $P(E) = 0$. Hence E has either probability 1 or 0.
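The quantities in this proof can be estimated numerically in the pullback sense. The sketch below (our illustration; all settings are assumptions) approximates $\mathrm{diam}(X_n)$ by pushing a cloud of states forward from time n − j and repeats this over several input realizations.

```python
# Pullback estimate of diam(X_n): push a cloud of states from time n-j to
# time n; repeating over input realizations drawn from an i.i.d. (hence
# ergodic) source illustrates the all-or-nothing character of theorem 1.
import numpy as np

rng = np.random.default_rng(4)
N, M = 40, 30
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 1.2
w_in = rng.normal(0, 1.0, N)

def diam_estimate(j, amplitude):
    """Push a cloud of states from time n-j to time n; return its diameter."""
    u = amplitude * rng.normal(size=j)               # one input realization
    cloud = rng.uniform(-1, 1, (M, N))
    for uk in u:
        cloud = np.tanh(np.outer(np.full(M, uk), w_in) + cloud @ W.T)
    return max(np.linalg.norm(a - b) for a in cloud for b in cloud)

for amp in (0.05, 3.0):
    ds = [diam_estimate(60, amp) for _ in range(5)]
    print(f"amplitude {amp}: diam estimates {np.round(ds, 4)}")
```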

4.  Sufficient Conditions for the ESP in an RNN

In this section, we consider a discrete-time RNN with standard sigmoidal activations (with tanh nonlinearity) and provide sufficient conditions for the ESP with regard to an input. Sufficient conditions for the ESP with regard to an input space were provided by Buehner and Young (2006) and Yildiz et al. (2012) in terms of the internal weight matrix of the RNN. However, since our definition of the ESP is with regard to an input sequence, we also have to bring the role of the input into the sufficient conditions for the ESP. This is established in theorem 2. Furthermore, when the input arises as a realization of a stationary ergodic source, we state sufficient conditions for the ESP with regard to typical realizations in theorem 3. Since, by its definition, for a given IDS the ESP with regard to an input depends on the history of the input, the higher-order correlations or higher-order statistics of the input data would be expected to play a role in determining the ESP. However, basing the ESP on higher-order correlations or statistics of the input would not only be onerous but also of little help, since complete higher-order correlations or statistics are rarely available. In contrast, our sufficient conditions for the ESP in theorems 2 and 3 are based on the intermittent frequencies of expanding and contracting behaviors of the nonautonomous system generated by the IDS g and the input $u_n$ via $g_n := g(u_n, \cdot)$.

Concretely, we consider the following standard RNN model, written as an IDS given by
$$x_{n+1} = \overline{\tanh}\big(W^{in} u_n + W x_n\big), \tag{4.1}$$
where $W^{in}$ and W are $N \times M$– and $N \times N$–dimensional real matrices representing the input and internal weight matrices of the neuronal connections, $u_n$ and $x_n$ are column vector representations of the input and the network state, and the function $\overline{\tanh}: \mathbb{R}^N \to \mathbb{R}^N$ is defined by $\overline{\tanh}(y) := (\tanh(y_1), \ldots, \tanh(y_N))'$ when $y = (y_1, \ldots, y_N)'$ ($'$ denotes transpose).

Owing to the range of the tanh function, the effective dynamics of the IDS in equation 4.1 is always contained in $[-1, 1]^N$. Thus, we can consider the IDS in equation 4.1 to be defined on $X := [-1, 1]^N$. Note that we do not restrict the input space to be compact.

We make use of the following facts in proving theorem 2. A generalization of the mean-value theorem of one-dimensional calculus to higher dimensions is the so-called mean-value inequality (Furi & Martelli, 1991), which states that if $f: V \to \mathbb{R}^N$ is a C1-function, where V is an open convex subset of $\mathbb{R}^N$, then for any $x, y \in V$,
$$\|f(x) - f(y)\| \leq \sup_{z \in V} \|Df(z)\| \; \|x - y\|, \tag{4.2}$$
where $\|\cdot\|$ is the Euclidean norm and $\|Df(z)\|$ is the induced norm of the Jacobian of $f$ at the point z.
We also recall that $\frac{d}{dy}\tanh(y) = \mathrm{sech}^2(y)$, where $\mathrm{sech}(y) = \frac{2}{e^{y} + e^{-y}}$. Furthermore, sech² is such that
$$0 < \mathrm{sech}^2(y) \leq 1 \ \text{for all } y \in \mathbb{R}, \ \text{with equality if and only if } y = 0, \tag{4.3}$$
$$\mathrm{sech}^2(-y) = \mathrm{sech}^2(y) \ \text{for all } y \in \mathbb{R}, \tag{4.4}$$
$$\mathrm{sech}^2(y_1) \leq \mathrm{sech}^2(y_2) \ \text{whenever } |y_1| \geq |y_2|, \tag{4.5}$$
$$\mathrm{sech}^2(y) \leq 4 e^{-2|y|} \ \text{for all } y \in \mathbb{R}. \tag{4.6}$$
The following lemma can be proved by elementary steps and is stated without proof:
Lemma 4. 

Let $(a_i)_{i \in \mathbb{N}}$ be a sequence of real numbers such that $a_i > 0$ for all i. Then in the following statements, the implications i $\Rightarrow$ ii $\Rightarrow$ iii hold:

  • i. $\limsup_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} \log a_i < 0$.

  • ii. $\sum_{i=1}^{n} \log a_i \to -\infty$ as $n \to \infty$.

  • iii. $\lim_{n \to \infty} \prod_{i=1}^{n} a_i = 0$.

Theorem 2. 

Consider the IDS g defined in equation 4.1 with an input $\bar{u} = (u_n)_{n \in \mathbb{Z}}$. Then:

  • i. g has the ESP with regard to $\bar{u}$ if $\|W\| < 1$; or, more generally,

  • ii. even if $\|W\| \geq 1$, g has the ESP with regard to $\bar{u}$ if $\bar{u}$ is such that
    $$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} I\big(C_{n-i} > \sqrt{N}\,\|W\|\big)\Big(\log 4 - 2\big(C_{n-i} - \sqrt{N}\,\|W\|\big)\Big) < -\log\|W\| \tag{4.7}$$
    for some (and hence every) $n \in \mathbb{Z}$, where $C_i$ is the smallest absolute component of the vector $W^{in} u_i$, $C_i := \min(|W^{in} u_i|)$, and I is the indicator function, that is, it takes 1 if its argument is true and 0 otherwise.

Proof. 

Let $X := [-1, 1]^N$. For any given sequence $\bar{u} = (u_n)_{n \in \mathbb{Z}}$, the IDS g specified in equation 4.1 defines a process $\phi$ given by $\phi(n, m, x) := g_{n-1} \circ \cdots \circ g_m(x)$, where $g_k(x) := \overline{\tanh}(W^{in} u_k + W x)$. Also, since $g(u, \cdot)$ is C1-continuous on the interior of X, it follows that each $g_n$ is C1-continuous on the interior of X. Since a finite composition of C1 functions is also C1-continuous, the function $\phi(n, m, \cdot)$ is C1-continuous on the interior of X for any $n > m$.

Since $\phi(m+1, m, \cdot) = g_m$, by the chain rule of differentiation, we know
$$D\phi(m+2, m, \cdot)(x) = Dg_{m+1}\big(g_m(x)\big)\; Dg_m(x).$$
In general, for any $n > m$,
$$D\phi(n, m, \cdot)(x) = Dg_{n-1}\big(\phi(n-1, m, x)\big)\; Dg_{n-2}\big(\phi(n-2, m, x)\big) \cdots Dg_m(x). \tag{4.8}$$
Fix n. By applying the mean-value inequality, equation 4.2, we get
$$d\big(\phi(n, n-j, x), \phi(n, n-j, y)\big) \leq \sup_{z \in X^{\circ}} \big\|D\phi(n, n-j, \cdot)(z)\big\| \; d(x, y), \tag{4.9}$$
where d is the Euclidean metric, $\|D\phi(n, n-j, \cdot)(z)\|$ is the induced norm of the Jacobian of $\phi(n, n-j, \cdot)$ at the point z, and $X^{\circ}$ is the interior of X.

For any m, from equation 4.1, we know $g_m(x) = \overline{\tanh}(W^{in} u_m + W x)$. We know the derivative of $\tanh(y_1)$ with regard to $y_1$ is $\mathrm{sech}^2(y_1)$. Define the function $\Lambda: \mathbb{R}^N \to \mathbb{R}^{N \times N}$ by $\Lambda(y) := \mathrm{diag}\big(\mathrm{sech}^2(y_1), \ldots, \mathrm{sech}^2(y_N)\big)$, where $\mathrm{diag}(\cdots)$ denotes an $N \times N$-dimensional real-valued diagonal matrix whose kth diagonal element is $\mathrm{sech}^2(y_k)$, with $y_k$ the kth element of the vector y. With this notation and by using the chain rule, $Dg_m(x) = \Lambda(W^{in} u_m + W x)\, W$. Again with such notation, from equation 4.8 and taking norms, we can write
$$\sup_{z \in X^{\circ}} \big\|D\phi(n, n-j, \cdot)(z)\big\| \leq \prod_{i=1}^{j} \Big(\|W\| \sup_{x \in X} \big\|\Lambda(W^{in} u_{n-i} + W x)\big\|\Big). \tag{4.10}$$

Proof of i. We now proceed to find an upper bound on $\sup_{x \in X} \|\Lambda(W^{in} u_m + W x)\|$. First, we find an upper bound on $\|\Lambda(y)\|$ regardless of the argument y. We know that $\Lambda(y)$ is a diagonal matrix, and hence $\|\Lambda(y)\|$ is upper-bounded by the maximum of the absolute values of the diagonal elements. Since the diagonal elements belong to the range of the sech² function, and sech² is a nonnegative function that takes a maximum value of 1, we have $\|\Lambda(W^{in} u_m + W x)\| \leq 1$ for any $u_m$ and any $x \in X$.

We can get a tighter upper bound on $\|\Lambda(W^{in} u_m + W x)\|$ if $u_m$ satisfies certain conditions. Denoting the maximum (minimum) of the elements of a vector v by max(v) (min(v)) and writing $|v|$ for the componentwise absolute value, we have by equations 4.4 and 4.5,
$$\|\Lambda(y)\| = \max\big(\mathrm{sech}^2(y_1), \ldots, \mathrm{sech}^2(y_N)\big) = \mathrm{sech}^2\big(\min(|y|)\big). \tag{4.11}$$
Recall that $C_m = \min(|W^{in} u_m|)$. Suppose $C_m > \sqrt{N}\,\|W\|$ and let x take any value in $[-1, 1]^N$. Then, since $\max(|W x|) \leq \|W x\| \leq \sqrt{N}\,\|W\|$, by definition of $C_m$, clearly $\min(|W^{in} u_m + W x|) \geq C_m - \sqrt{N}\,\|W\|$. By equation 4.5, we have $\mathrm{sech}^2\big(\min(|W^{in} u_m + W x|)\big) \leq \mathrm{sech}^2\big(C_m - \sqrt{N}\,\|W\|\big)$. Using this in equation 4.11, we can write
$$\sup_{x \in X} \big\|\Lambda(W^{in} u_m + W x)\big\| \leq \mathrm{sech}^2\big(C_m - \sqrt{N}\,\|W\|\big) \quad \text{whenever } C_m > \sqrt{N}\,\|W\|. \tag{4.12}$$
Let $a_j^{(n)} := \sup_{z \in X^{\circ}} \|D\phi(n, n-j, \cdot)(z)\|$. Using lemma 2 in equation 4.9, we infer that if $a_j^{(n)} \to 0$ as $j \to \infty$ for all n, then only one entire solution exists for $\phi$, and thus g has the ESP with regard to $\bar{u}$. We find an upper bound on $a_j^{(n)}$ starting from equation 4.10. Abbreviating $\lambda_m := \mathrm{sech}^2\big((C_m - \sqrt{N}\,\|W\|)\, I(C_m > \sqrt{N}\,\|W\|)\big)$, so that $\lambda_m = 1$ when $C_m \leq \sqrt{N}\,\|W\|$, equations 4.10 and 4.12 give
$$a_j^{(n)} \leq \prod_{i=1}^{j} \|W\|\, \lambda_{n-i}. \tag{4.13}$$
Clearly $a_j^{(n)} \to 0$ as $j \to \infty$ whenever the right-hand side of equation 4.13 converges to 0 as $j \to \infty$. The right-hand side of equation 4.13 is a product of j positive reals, and using lemma 4, it converges to 0 whenever
$$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} \log \lambda_{n-i} < -\log\|W\|. \tag{4.14}$$
The left-hand side of the inequality in equation 4.14 is upper-bounded by zero since $\lambda_{n-i}$ is upper-bounded by 1. Let $\|W\| < 1$. Then the right-hand side is positive (see equation 4.14), and hence the inequality in that equation is always true. Moreover, equation 4.14 then holds independent of any n, and hence whenever $\|W\| < 1$, $a_j^{(n)} \to 0$ for all n. Using lemma 2, we infer that only one entire solution exists for $\phi$, and thus g has the ESP with regard to $\bar{u}$.
Proof of ii. We proceed to rearrange equation 4.14 and deduce further. Writing $y_{n-i} := \big(C_{n-i} - \sqrt{N}\,\|W\|\big)\, I\big(C_{n-i} > \sqrt{N}\,\|W\|\big)$, equation 4.14 reads
$$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} \log \mathrm{sech}^2(y_{n-i}) < -\log\|W\|, \tag{4.15}$$
$$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} \log \frac{4}{\big(e^{y_{n-i}} + e^{-y_{n-i}}\big)^2} < -\log\|W\|, \tag{4.16}$$
$$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} \Big(\log 4 - 2 y_{n-i} - 2 \log\big(1 + e^{-2 y_{n-i}}\big)\Big) < -\log\|W\|, \tag{4.17}$$
$$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} I\big(C_{n-i} > \sqrt{N}\,\|W\|\big)\Big(\log 4 - 2\big(C_{n-i} - \sqrt{N}\,\|W\|\big)\Big) < -\log\|W\|. \tag{4.18}$$
Here, equation 4.16 follows from equation 4.15 by definition of the function sech; equation 4.17 follows from equation 4.16 by using $\big(e^{y} + e^{-y}\big)^2 = e^{2y}\big(1 + e^{-2y}\big)^2$; and equation 4.18 is a sufficient condition for equation 4.17 by using the fact that $2\log(1 + e^{-2 y_{n-i}}) \geq 0$, noting that at indices where the indicator vanishes, the summand of equation 4.17 equals $\log 4 - 2\log 2 = 0$ (we can in fact ignore the term $2\log(1 + e^{-2 y_{n-i}})$ without much loosening the bound, as it attains values very close to zero whenever $y_{n-i}$ is large). If equation 4.18 holds for some n, by the property of limsup, it also holds for any other n. Hence, equation 4.7 is true if and only if equation 4.18 holds for every n. Since equation 4.18 thus holds independent of any n, $a_j^{(n)} \to 0$ for all n. This proves ii.

The result from ii readily extends to a probabilistic version for networks driven by an ergodic source:

Theorem 3. 
Let $(U_n)_{n \in \mathbb{Z}}$ be an $\mathbb{R}^M$-valued ergodic process defined on $(\Omega, \mathcal{F}, P)$ such that the expectation of $\|U_n\|$ is finite, that is, $E\|U_n\| < \infty$, and let g be as specified in equation 4.1. Define the random variables $C_i := \min(|W^{in} U_i|)$ and
$$S_i := I\big(C_i > \sqrt{N}\,\|W\|\big)\Big(\log 4 - 2\big(C_i - \sqrt{N}\,\|W\|\big)\Big).$$
Then g has ESP almost surely whenever for some i, the following inequality holds:
$$E[S_i] < -\log\|W\|.$$
Proof. 
Let $\bar{u} := (U_n(\omega))_{n \in \mathbb{Z}}$. Then by equation 4.7, if
$$\limsup_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} S_{-i}(\omega) < -\log\|W\| \tag{4.19}$$
holds, g has ESP with regard to $\bar{u}$.

Define random variables on the space $(\Omega, \mathcal{F}, P)$ through $V_i := |W^{in} U_i|$, where $|\cdot|$ stands for taking the absolute value of individual vector components. Furthermore, define random variables $C_i := \min(V_i)$. It can easily be verified that $V_i$, and hence $C_i$, belong to $L^1(P)$ if $(U_n)_{n \in \mathbb{Z}}$ is such that $E\|U_i\| < \infty$. Now $(C_i)_{i \in \mathbb{Z}}$, and hence $(S_i)_{i \in \mathbb{Z}}$, is an ergodic process by lemma 5, with $S_i \in L^1(P)$. Hence we can apply Birkhoff's ergodic theorem, equation 3.2, to evaluate the limit in equation 4.19.

In order to include the indicator function in the Lebesgue integral on the space $(\Omega, \mathcal{F}, P)$, for any $c > 0$, define $J_c: \mathbb{R} \to \mathbb{R}$ by
$$J_c(y) := I(y > c)\big(\log 4 - 2(y - c)\big),$$
so that $S_i = J_{\sqrt{N}\,\|W\|}(C_i)$. Using this notation and applying equation 3.2, the limit in equation 4.19 exists and
$$\lim_{j \to \infty} \frac{1}{j} \sum_{i=1}^{j} S_{-i}(\omega) = E\big[J_{\sqrt{N}\,\|W\|}(C_i)\big] = E[S_i] \quad \text{almost surely.} \tag{4.20}$$
From this and equation 4.19, if $E[S_i] < -\log\|W\|$, we have the ESP with regard to almost all realizations of $(U_n)_{n \in \mathbb{Z}}$. Hence, the theorem is proved.

The bounds offered by the theorems in this section are admittedly weak. Specifically, the reliance on the smallest absolute input component, $C_i := \min(|W^{in} u_i|)$, will often bar practically useful applications of these theorems, since the condition easily becomes unachievable when some of the input weights are small, which they usually are. However, we still think these results are worth reporting because they demonstrate the application of methods that may guide further investigation, eventually leading to tighter bounds.
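The condition in theorems 2(ii) and 3 can at least be tested empirically for a concrete network and input model. The following sketch (our illustration; the input model and all scalings are assumptions) averages the summand of the condition over a long input sample and compares it with $-\log\|W\|$; with typical scalings the condition indeed fails, in line with the remark about weak bounds.

```python
# Empirical test of the sufficient ESP condition of theorems 2(ii)/3:
# estimate E[S_i] by a long-run average over an i.i.d. (ergodic) input
# sample and compare it with -log ||W||.
import numpy as np

rng = np.random.default_rng(5)
N, M, T = 50, 3, 100_000
W = rng.normal(0, 1.0 / np.sqrt(N), (N, N)) * 1.3    # ||W|| >= 1 allowed
W_in = rng.normal(0, 3.0, (N, M))
norm_W = np.linalg.norm(W, 2)                        # induced Euclidean norm

U = rng.normal(0, 2.0, (T, M))                       # ergodic (i.i.d.) input
C = np.min(np.abs(U @ W_in.T), axis=1)               # C_i = min(|W_in u_i|)
thresh = np.sqrt(N) * norm_W
summand = np.where(C > thresh, np.log(4.0) - 2.0 * (C - thresh), 0.0)

print("empirical average E[S_i]:", summand.mean())
print("must be below -log||W|| :", -np.log(norm_W))
print("condition holds         :", summand.mean() < -np.log(norm_W))
```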

5.  Conclusion

With this letter, we hope to have served two purposes. First is to put the field of reservoir computing on a more appropriate foundation than it had before by presenting a version of the echo state property that respects inputs as individual signals—and not only as “anything that comes from some admissible value range.” Second is to demonstrate the usefulness and power of concepts and insights from the field of nonautonomous dynamical systems.

Appendix:  Proofs and Some Intermediate Results

Proof of Proposition 2. 

Since $\phi(n, m, \cdot)$ is continuous for every $n \geq m$ and X is compact, we have that $\phi(n, m, X)$ is also a compact subset of X. Hence, $X_n$ is an intersection of closed subsets of X, which implies that $X_n$ is closed. Further, it is also a nested intersection of nonempty closed sets in view of equation 2.1. Hence, $X_n$ is nonempty for any n.

We next prove the following claim: Let $\{A_i\}_{i \in \mathbb{N}}$ be a collection of nonempty closed subsets of X such that $A_{i+1} \subseteq A_i$ for all i, and let $f$ be a continuous function on X. Then
$$f\Big(\bigcap_{i \in \mathbb{N}} A_i\Big) = \bigcap_{i \in \mathbb{N}} f(A_i). \tag{A.1}$$

Proof of claim. When $A := \bigcap_{i \in \mathbb{N}} A_i$, then one directly obtains $f(A) \subseteq \bigcap_{i \in \mathbb{N}} f(A_i)$. Next, let us show the inclusion $\bigcap_{i \in \mathbb{N}} f(A_i) \subseteq f(A)$. Let $y \in \bigcap_{i \in \mathbb{N}} f(A_i)$; that is, for all i there exists $x_i \in A_i$ such that $f(x_i) = y$. Let $x$ be an accumulation point of $(x_i)_{i \in \mathbb{N}}$, which exists since X is compact. Since A is a nested intersection of the $A_i$, by definition of the limsup of sets, $\limsup_{i \to \infty} A_i = A$. Also by definition of the limsup of sets, all the accumulation points of $(x_i)_{i \in \mathbb{N}}$ are contained in A. Hence, $x \in A$. By continuity of $f$, $f(x) = y$. Thus, $y \in f(A)$.

Starting from the definition of $X_n$, we deduce
$$\phi(n+1, n, X_n) = \phi\Big(n+1, n, \bigcap_{m<n} \phi(n, m, X)\Big) = \bigcap_{m<n} \phi\big(n+1, n, \phi(n, m, X)\big) = \bigcap_{m<n} \phi(n+1, m, X) = X_{n+1},$$
where the second equality uses the claim in equation A.1 (the sets $\phi(n, m, X)$ being nested by equation 2.1), and the last equality uses the fact that $\phi(n+1, m, X) \subseteq \phi(n+1, n, X)$ for all $m < n$, so that the term $\phi(n+1, n, X)$ in the definition of $X_{n+1}$ is redundant.
Proof of Lemma 1. 

We first show $\subseteq$ in equation 2.2; that is, every point of $X_n$ is visited by an entire solution. We first prove the following claim: A sequence $(Y_n)_{n \in \mathbb{Z}}$ of nonempty subsets of X is $\phi$-invariant if and only if for every pair $(k_0, x)$ with $x \in Y_{k_0}$, there exists an entire solution $(x_k)_{k \in \mathbb{Z}}$ such that $x_{k_0} = x$ and $x_k \in Y_k$ for all $k \in \mathbb{Z}$.

Proof of claim (from Kloeden & Rasmussen, 2011). Let $k_0 \in \mathbb{Z}$, and choose $x \in Y_{k_0}$. For $k \geq k_0$, define the sequence $x_k := \phi(k, k_0, x)$. Then by $\phi$-invariance, $x_k \in Y_k$ for any $k > k_0$. On the other hand, $Y_{k_0} = \phi(k_0, k_0 - 1, Y_{k_0 - 1})$, and so there exists an $x_{k_0 - 1} \in Y_{k_0 - 1}$ with $\phi(k_0, k_0 - 1, x_{k_0 - 1}) = x$. Proceeding recursively, for every $k < k_0$ one obtains an $x_k \in Y_k$ with $\phi(k+1, k, x_k) = x_{k+1}$. This completes the definition of the entire solution $(x_k)_{k \in \mathbb{Z}}$.

Conversely, suppose for any $k_0$ and $x \in Y_{k_0}$, there is an entire solution $(x_k)_{k \in \mathbb{Z}}$ satisfying $x_{k_0} = x$ and $x_k \in Y_k$ for all k. This implies $x \in \phi(k_0, k, Y_k)$ for all $k < k_0$. Hence, $Y_{k_0} \subseteq \phi(k_0, k, Y_k)$. The other inclusion, $\phi(k_0, k, Y_k) \subseteq Y_{k_0}$, follows from the fact that the entire solution through any point of $Y_k$ remains in the sequence $(Y_n)_{n \in \mathbb{Z}}$.

Thus, if $(X_n)_{n \in \mathbb{Z}}$ is a natural association, then it is $\phi$-invariant by proposition 2. We next use the claim to show $\subseteq$ in equation 2.2: if there is an $x \in X_n$, then by the claim there is an entire solution $(x_k)_{k \in \mathbb{Z}}$ such that $x_n = x$.

For the inclusion $\supseteq$ in equation 2.2, let $(x_k)_{k \in \mathbb{Z}}$ be an entire solution. Now consider some $n \in \mathbb{Z}$. By definition, $x_n = \phi(n, k, x_k)$ for all $k < n$. Clearly, $x_n \in \phi(n, k, X)$ for all $k < n$. This implies $x_n \in X_n$. Since n was chosen arbitrarily, $x_i \in X_i$ for all i.

The following is an elementary result for stochastic processes (Krengel, 1985), which we recall for use in our results:

Lemma 5. 

Let $(Y_n)_{n \in \mathbb{Z}}$ be an S-valued ergodic process defined on $(\Omega, \mathcal{F}, P)$. Then:

  • For any measurable space R and measurable map $h: S \to R$, the process $(h(Y_n))_{n \in \mathbb{Z}}$ is an R-valued ergodic process.

  • If R is some measurable space and $f: S^{\mathbb{Z}} \to R$ is a measurable function and
    $$Z_n(\omega) := f\big(\sigma^n(\pi(\omega))\big), \quad n \in \mathbb{Z},$$
    where $\pi$ and $\sigma$ are the path and shift maps defined in section 3, then $(Z_n)_{n \in \mathbb{Z}}$ is an R-valued ergodic process.

The following result is borrowed from Manjunath and Jaeger (2012), which is currently under review.

Lemma 6. 

Let the random variable $X_n$ be defined as in lemma 3. With respect to the given sigma algebra $\mathcal{F}$ on $\Omega$ and the Borel sigma algebra defined on $\mathcal{H}_X$ obtained from the Hausdorff distance, $X_n: \Omega \to \mathcal{H}_X$ is a measurable function.

Proof. 

For the U-valued stationary process $(U_n)_{n \in \mathbb{Z}}$, let its MPDS be denoted by $(U^{\mathbb{Z}}, \mathcal{B}(U^{\mathbb{Z}}), \tilde{P}, \sigma)$.

Given any pair $(n, i)$ such that $n > i$, we define $h_{n,i}: U^{\mathbb{Z}} \to \mathcal{H}_X$ by
$$h_{n,i}(\bar{u}) := g_{n-1} \circ \cdots \circ g_i(X),$$
where $\bar{u} = (u_k)_{k \in \mathbb{Z}}$, $g_k$ is defined by $g_k := g(u_k, \cdot)$, the map g being as in theorem 1.

Let $d_U$ denote some metric on U that gives rise to its topology. Then $\min(d_U, 1)$ also generates the same topology. Let $d_{U^{\mathbb{Z}}}(\bar{u}, \bar{v}) := \sum_{k \in \mathbb{Z}} 2^{-|k|} \min\big(d_U(u_k, v_k), 1\big)$ be the metric on $U^{\mathbb{Z}}$. It may be verified that $d_{U^{\mathbb{Z}}}$ generates the product topology on $U^{\mathbb{Z}}$. We now claim that $h_{n,i}$ is a continuous map for each n and i. To show this, let $(\bar{u}^{(j)})_{j \in \mathbb{N}}$ be any sequence such that $\bar{u}^{(j)} \to \bar{u}$ as $j \to \infty$. We will show that $h_{n,i}$ is continuous by proving $h_{n,i}(\bar{u}^{(j)}) \to h_{n,i}(\bar{u})$ as $j \to \infty$.

Note that $h_{n,i}(\bar{u})$ depends only on the finite input block $(u_i, \ldots, u_{n-1})$. Since g is continuous and X is compact, it follows from a uniform continuity argument that given any $\epsilon > 0$, there exists a $\delta > 0$ such that
$$\max_{i \leq k \leq n-1} d_U(u_k, v_k) < \delta \implies d_H\big(h_{n,i}(\bar{u}), h_{n,i}(\bar{v})\big) < \epsilon. \tag{A.2}$$
Since $\bar{u}^{(j)} \to \bar{u}$ as $j \to \infty$, and convergence in the product metric implies coordinatewise convergence, we can find an integer K such that for all $j \geq K$, $d_U(u^{(j)}_k, u_k) < \delta$ holds for all $i \leq k \leq n-1$. Hence for all $j \geq K$, from equation A.2, we have $d_H(h_{n,i}(\bar{u}^{(j)}), h_{n,i}(\bar{u})) < \epsilon$. Since $\epsilon$ was chosen arbitrarily, $h_{n,i}(\bar{u}^{(j)}) \to h_{n,i}(\bar{u})$ as $j \to \infty$. This implies that $h_{n,i}$ is continuous.
Define $h_n: U^{\mathbb{Z}} \to \mathcal{H}_X$ by
$$h_n(\bar{u}) := \bigcap_{j \geq 1} h_{n, n-j}(\bar{u}). \tag{A.3}$$

Since $h_{n,n-j}$ is continuous for any $j \geq 1$, $h^{-1}_{n,n-j}(B)$ is a Borel subset of $U^{\mathbb{Z}}$ for any Borel subset B contained in $\mathcal{H}_X$. Moreover, since the sets $h_{n,n-j}(\bar{u})$ are compact and nested in j, $h_n(\bar{u})$ is the limit of $h_{n,n-j}(\bar{u})$ in the Hausdorff metric, so $h_n$ is a pointwise limit of measurable functions into the metric space $\mathcal{H}_X$. This implies $h^{-1}_n(B)$ is a Borel subset of $U^{\mathbb{Z}}$ for any Borel subset B contained in $\mathcal{H}_X$. This implies $h_n$ is measurable.

For each $\omega \in \Omega$, there exists an element $\pi(\omega) \in U^{\mathbb{Z}}$ such that $\pi(\omega) = (\ldots, U_{-1}(\omega), U_0(\omega), U_1(\omega), \ldots)$. Hence
$$X_n(\omega) = h_n\big(\pi(\omega)\big).$$
Since $X_n = h_n \circ \pi$, $h_n$ is measurable, and $\pi$ is obviously measurable, it follows that $X_n$ is measurable.
Lemma 7. 
Let X be a compact metric space. Let a map $\chi: \mathcal{H}_X \to \{0, 1\}$ be defined by
$$\chi(A) := \begin{cases} 0 & \text{if } A \text{ is a singleton, i.e., } \mathrm{diam}(A) = 0, \\ 1 & \text{otherwise.} \end{cases}$$
Then the function $\chi$ is measurable.
Proof. 
Since $\chi$ assumes values in $\{0, 1\}$, to prove $\chi$ is measurable, it is sufficient to show that $\chi^{-1}(\{0\})$ is a Borel subset of $\mathcal{H}_X$. We will show something stronger than this by proving that $\chi^{-1}(\{0\})$ is a closed subset of $\mathcal{H}_X$. Let $S_X := \{\{x\} : x \in X\}$; that is, $S_X$ is the set containing all singletons of the space X. By definition of $\chi$, $\chi(A) = 0$ if and only if A is a singleton of the space X. Hence, $\chi^{-1}(\{0\}) = S_X$. To show that the complement of $S_X$ in $\mathcal{H}_X$ (i.e., $\mathcal{H}_X \setminus S_X$) is an open set, we prove that for any $e_X \in \mathcal{H}_X \setminus S_X$, there exists an open neighborhood of $e_X$ contained in $\mathcal{H}_X \setminus S_X$. Let $e_X \in \mathcal{H}_X \setminus S_X$. Now, considering $e_X$ as a subset of the space X, we have at least two distinct elements $x, y \in e_X$; let $\epsilon := d(x, y) > 0$. By the triangle inequality, for any $z \in X$, it follows that at least one of the following holds: $d(x, z) \geq \epsilon/2$ or $d(y, z) \geq \epsilon/2$. Now, for any singleton $\{z\} \in S_X$,
$$d_H\big(e_X, \{z\}\big) \geq \max\big(d(x, z), d(y, z)\big) \geq \epsilon/2.$$
Thus, the open ball of radius $\epsilon/2$ around $e_X$ does not intersect $S_X$. Hence $\mathcal{H}_X \setminus S_X$ is open in $\mathcal{H}_X$, and $S_X$ is closed and has to be a Borel subset of $\mathcal{H}_X$. Thus, $\chi$ is a measurable function.

Acknowledgments

The research reported here was funded by the FP7 European projects ORGANIC (http://organic.elis.ugent.be/organic) and AMARSi (http://www.amarsi-project.eu/).

References

Amigó, J. M., Gimenez, A., & Kloeden, P. E. (2012). Switching systems and entropy. Manuscript submitted for publication.

Arnold, L. (1998). Random dynamical systems. Heidelberg: Springer-Verlag.

Billingsley, P. (1979). Probability and measure. New York: Wiley.

Buehner, M., & Young, P. (2006). A tighter bound for the echo state property. IEEE Trans. Neural Networks, 17(3), 820–824.

Colonius, F., & Kliemann, W. (2000). The dynamics of control. Berlin: Birkhäuser.

Dominey, P. F., Arbib, M., & Joseph, J.-P. (1995). A model of corticostriatal plasticity for learning oculomotor associations and sequences. J. Cognitive Neuroscience, 7(3), 311–336.

Furi, M., & Martelli, M. (1991). On the mean value theorem, inequality and inclusion. Am. Math. Monthly, 98, 840–847.

Jaeger, H. (2001). The echo state approach to analysing and training recurrent neural networks (Tech. Rep. GMD 148). Bremen: German National Research Institute for Computer Science. http://minds.jacobs-university.de/pubs

Jaeger, H. (2007). Echo state network. Scholarpedia, 2(9), 2330.

Jaeger, H., & Haas, H. (2004). Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667), 78–80.

Kloeden, P. E., & Rasmussen, M. (2011). Nonautonomous dynamical systems. Providence, RI: American Mathematical Society.

Kolyada, S., & Snoha, L. (1996). Topological entropy of nonautonomous dynamical systems. Random Comput. Dynamics, 4(2–3), 205–233.

Krengel, U. (1985). Ergodic theorems. Berlin: Walter de Gruyter.

Lukoševičius, M., & Jaeger, H. (2009). Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3), 127–149.

Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.

Manjunath, G., & Jaeger, H. (2012). The dynamics of random difference equations is remodeled by closed relations. Manuscript submitted for publication.

Oprocha, P., & Wilczynski, P. (2009). Chaos in nonautonomous dynamical systems. An. Stiint. Univ. Ovidius Constanta Ser. Mat., 17(3), 209–221.

Pötzsche, C. (2010). Geometric theory of discrete nonautonomous dynamical systems. Berlin: Springer.

Rasmussen, M. (2007). Attractivity and bifurcation for nonautonomous dynamical systems. Berlin: Springer.

Skorokhod, A. V., Hoppensteadt, F. C., & Salehi, H. (2002). Random perturbation methods. New York: Springer-Verlag.

Steil, J. (2004). Backpropagation-decorrelation: Online recurrent learning with O(N) complexity. In Proceedings of the IEEE International Joint Conference on Neural Networks (pp. 843–848). Piscataway, NJ: IEEE.

Walters, P. (1992). An introduction to ergodic theory. Berlin: Springer-Verlag.

Yildiz, I. B., Jaeger, H., & Kiebel, S. J. (2012). Re-visiting the echo state property. Neural Networks, 35, 1–9.

Zhang, J., & Chen, L. (2009). Lower bounds of the topological entropy for nonautonomous dynamical systems. Appl. Math. J. Chinese Univ. Ser. B, 24(1), 76–82.