## Abstract

Cortical networks are hypothesized to rely on transient network activity to support short-term memory (STM). In this letter, we study the capacity of randomly connected recurrent linear networks for performing STM when the input signals are approximately sparse in some basis. We leverage results from compressed sensing to provide rigorous nonasymptotic recovery guarantees, quantifying the impact of the input sparsity level, the input sparsity basis, and the network characteristics on the system capacity. Our analysis demonstrates that network memory capacities can scale superlinearly with the number of nodes and in some situations can achieve STM capacities that are much larger than the network size. We provide perfect recovery guarantees for finite sequences and recovery bounds for infinite sequences. The latter analysis predicts that network STM systems may have an optimal recovery length that balances errors due to omission and recall mistakes. Furthermore, we show that the conditions yielding optimal STM capacity can be embodied in several network topologies, including networks with sparse or dense connectivities.

## 1. Introduction

Short-term memory (STM) is critical for neural systems to understand nontrivial environments and perform complex tasks. While individual neurons could potentially account for very long or very short stimulus memory (e.g., through changing synaptic weights or membrane dynamics, respectively), useful STM on the order of seconds is conjectured to be due to transient network activity. Specifically, stimulus perturbations can cause activity in a recurrent network long after the input has been removed, and recent research hypothesizes that cortical networks may rely on transient activity to support STM (Jaeger & Haas, 2004; Maass, Natschläger, & Markram, 2002; Buonomano & Maass, 2009).

Understanding the role of memory in neural systems requires determining the fundamental limits of STM capacity in a network and characterizing the effects on that capacity of the network size, topology, and input statistics. Various approaches to quantifying the STM capacity of linear (Jaeger, 2001; White, Lee, & Sompolinsky, 2004; Ganguli, Huh, & Sompolinsky, 2008; Hermans & Schrauwen, 2010) and nonlinear (Wallace, Hamid, & Latham, 2013) recurrent networks have been used, often assuming gaussian input statistics (Jaeger, 2001; White et al., 2004; Hermans & Schrauwen, 2010; Wallace et al., 2013). These analyses show that even under optimal conditions, the STM capacity (i.e., the length of the stimulus able to be recovered) scales only linearly with the number of nodes in the network. While conventional wisdom holds that signal structure could be exploited to achieve more favorable capacities, this idea has generally not been the focus of significant rigorous study.

Recent work in computational neuroscience and signal processing has shown that many signals of interest have statistics that are strongly nongaussian, with low-dimensional structure that can be exploited for many tasks. In particular, sparsity-based signal models (i.e., representing a signal using relatively few nonzero coefficients in a basis) have recently been shown to be especially powerful. In the computational neuroscience literature, sparse encodings increase the capacity of associative memory models (Baum, Moody, & Wilczek, 1988) and are sufficient neural coding models to account for several properties of neurons in primary visual cortex (e.g., response preferences (Olshausen & Field, 1996) and nonlinear modulations (Zhu & Rozell, 2013)). In the signal processing literature, the recent work in compressed sensing (CS) (Candes, Romberg, & Tao, 2006; Ganguli & Sompolinsky, 2012) has established strong guarantees on sparse signal recovery from highly undersampled measurement systems.

Ganguli and Sompolinsky (2010) have previously conjectured that the ideas of CS can be used to achieve STM capacities that exceed the number of network nodes in an orthogonal recurrent network when the inputs are sparse in the canonical basis (i.e., the input sequences have temporally localized activity). While these results are compelling and provide a great deal of intuition, the theoretical support for this approach remains an open question, as the results of Ganguli and Sompolinsky (2010) use an asymptotic analysis on an approximation of the network dynamics to support empirical findings. In this letter, we establish a theoretical basis for CS approaches in network STM by providing rigorous nonasymptotic recovery error bounds for an exact model of the network dynamics and input sequences that are sparse in any general basis (e.g., sinusoids, wavelets). Our analysis shows conclusively that STM capacity can scale superlinearly with the number of network nodes and quantifies the impact of the input sparsity level, the input sparsity basis, and the network characteristics on system capacity. We provide both perfect recovery guarantees for finite inputs and bounds on the recovery performance when the network has an arbitrarily long input sequence. The latter analysis predicts that network STM systems based on CS may have an optimal recovery length that balances errors due to omission and recall mistakes. Furthermore, we show that the structural conditions yielding optimal STM capacity in our analysis can be embodied in many different network topologies, including networks with both sparse and dense connectivities.

## 2. Background

### 2.1. Short-Term Memory in Recurrent Networks.

Since understanding the STM capacity of networked systems would lead to a better understanding of how such systems perform complex tasks, STM capacity has been studied in several network architectures, including discrete-time networks (Jaeger, 2001; White et al., 2004; Ganguli et al., 2008), continuous-time networks (Hermans & Schrauwen, 2010; Büsing, Schrauwen, & Legenstein, 2010), and spiking networks (Maass et al., 2002; Mayor & Gerstner, 2005; Legenstein & Maass, 2007; Wallace et al., 2013). While many different analysis methods have been used, each tries to quantify the amount of information present in the network states about the past inputs (i.e., how different inputs induce different network states as in Figure 1). For example, in one approach taken to study echo state networks (ESNs) (White et al., 2004; Ganguli et al., 2008; Hermans & Schrauwen, 2010), this information preservation is quantified through the correlation between the past input and the current state. When the correlation is too low, that input is said to no longer be represented in the state. The results of these analyses conclude that for gaussian input statistics, the number of previous inputs significantly correlated with the current network state is bounded by a linear function of the network size.

In another line of analysis, researchers have sought to directly quantify the degree to which different inputs lead to unique network states (Jaeger, 2001; Maass et al., 2002; Legenstein & Maass, 2007; Strauss, Wustlich, & Labahn, 2012). In essence, the main idea of this work is that a one-to-one relationship between input sequences and the network states should allow the system to perform an inverse computation to recover the original input. A number of specific properties have been proposed to describe the uniqueness of the network state with respect to the input. For spiking liquid state machines (LSMs), Maass et al. (2002) suggest a separability property that guarantees distinct network states for distinct inputs, and follow-up work (Legenstein & Maass, 2007) relates the separability property to practical computational tasks through the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971). More recent work analyzing similar networks using separation properties (Wallace et al., 2013; Büsing et al., 2010) gives an upper bound for the STM capacity that scales like the logarithm of the number of network nodes.

In discrete ESNs, the echo-state property (ESP) ensures that every network state at a given time is uniquely defined by some left-infinite sequence of inputs (Jaeger, 2001). The necessary condition for the ESP is that the maximum eigenvalue magnitude of the system is less than unity (an eigenvalue with a magnitude of one would correspond to a linear system at the edge of instability). While the ESP ensures uniqueness, it does not ensure robustness, and output computations can be sensitive to small perturbations (i.e., noisy inputs). A slightly more robust property looks at the conditioning of the matrix describing how the system acts on an input sequence (Strauss et al., 2012). The condition number not only describes a one-to-one correspondence but also quantifies how small perturbations in the input affect the output. While work by Strauss et al. (2012) is closest in spirit to the analysis in this letter, it ultimately concludes that the STM capacity still scales linearly with network size.

Determining whether a system abides by one of the separability properties depends heavily on the network's construction. In some cases, different architectures can yield very different results. For example, in the case of randomly connected spiking networks, low connectivity (each neuron is connected to a small number of other neurons) can lead to large STM capacities (Legenstein & Maass, 2007; Büsing et al., 2010), whereas high connectivity leads to chaotic dynamics and smaller STM capacities (Wallace et al., 2013). In contrast, linear ESNs with high connectivities (appropriately normalized) (Büsing et al., 2010) can have relatively large STM capacities, on the order of the number of nodes in the network (Ganguli et al., 2008; Strauss et al., 2012). Much of this work centers around using systems with orthogonal connectivity matrices, which leads to a topology that robustly preserves information. Interestingly, such systems can be constructed to have arbitrary connectivity while retaining these information-preserving properties (Strauss et al., 2012).

While a variety of networks have been analyzed using the properties described, these analyses ignore any structure of the input sequences that could be used to improve the analysis (Jaeger, 2001; Mayor & Gerstner, 2005). Conventional wisdom has suggested that STM capacities could be increased by exploiting structure in the inputs, but formal analysis has rarely addressed this case. For example, work by Ganguli and Sompolinsky (2010) builds significant intuition for the role of structured inputs in increasing STM capacity, specifically proposing to use the tools of CS to study the case when the input signals are temporally sparse. However, the analysis by Ganguli and Sompolinsky (2010) is asymptotic and focuses on an annealed (i.e., approximate) version of the system that neglects correlations between the network states over time. This letter can be viewed as a generalization of this work to provide formal guarantees for the STM capacity of the exact system dynamics, extensions to arbitrary orthogonal sparsity bases, and recovery bounds when the input exceeds the capacity of the system (i.e., the input is arbitrarily long).

### 2.2. Compressed Sensing.

There is substantial evidence from the signal processing and computational neuroscience communities that many natural signals are sparse in an appropriate basis (Olshausen & Field, 1996; Elad, Figueiredo, & Ma, 2008). The recovery problem requires that the system knows the sparsity basis to perform the recovery, which neural systems may not know a priori. We note that work has shown that appropriate sparsity bases can be learned from example data (Olshausen & Field, 1996), even in the case where the system observes the inputs only through compressed measurements (Isley, Hillar, & Sommer, 2011). While the analysis does not depend on the exact method for solving the optimization in equation 2.2, we also note that this type of optimization can be solved in biologically plausible network architectures (Rozell, Johnson, Baraniuk, & Olshausen, 2010; Rehn & Sommer, 2007; Hu, Genkin, & Chklovskii, 2012; Balavoine, Romberg, & Rozell, 2012; Balavoine, Rozell, & Romberg, 2013; Shapero, Charles, Rozell, & Hasler, 2011).

The restricted isometry property of order $2K$ with conditioning $\delta$, denoted RIP-$(2K, \delta)$, holds for the measurement matrix $\mathbf{A}$ in the basis $\boldsymbol{\Psi}$ if for any vector $\mathbf{s}$ that is $2K$-sparse in $\boldsymbol{\Psi}$ we have that

$$C(1-\delta)\,\|\mathbf{s}\|_2^2 \;\le\; \|\mathbf{A}\mathbf{s}\|_2^2 \;\le\; C(1+\delta)\,\|\mathbf{s}\|_2^2 \tag{2.3}$$

holds for constants $C>0$ and $0<\delta<1$. Said another way, the RIP guarantees that all pairs of vectors that are $K$-sparse in $\boldsymbol{\Psi}$ have their distances preserved after projecting through the matrix $\mathbf{A}$. This can be seen by observing that for a pair of $K$-sparse vectors, their difference has at most $2K$ nonzeros. In this way, the RIP can be viewed as a type of separation property for sparse signals that is similar in spirit to the separation properties used in previous studies of network STM (Jaeger, 2001; White et al., 2004; Maass et al., 2002; Hermans & Schrauwen, 2010).
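The distance-preservation reading of the RIP can be illustrated numerically. The following is a minimal sketch, assuming a random gaussian measurement matrix with entry variance $1/M$ (so that $C = 1$); the function names and the problem sizes are hypothetical choices made for illustration, not values from this letter.

```python
import math
import random

def sparse_vector(n, k, rng):
    """Return a length-n list with k nonzero gaussian entries at random positions."""
    v = [0.0] * n
    for idx in rng.sample(range(n), k):
        v[idx] = rng.gauss(0.0, 1.0)
    return v

def distance_ratio(m, n, k, seed=0):
    """Ratio ||A(s1 - s2)||^2 / ||s1 - s2||^2 for a random gaussian A (M x N)."""
    rng = random.Random(seed)
    s1 = sparse_vector(n, k, rng)
    s2 = sparse_vector(n, k, rng)
    diff = [a - b for a, b in zip(s1, s2)]
    support = [i for i, d in enumerate(diff) if d != 0.0]
    # Only the columns of A on the support of the difference matter, so we
    # draw just those entries (variance 1/M gives E||A d||^2 = ||d||^2).
    scale = 1.0 / math.sqrt(m)
    projected_sq = 0.0
    for _ in range(m):
        row_dot = sum(rng.gauss(0.0, scale) * diff[i] for i in support)
        projected_sq += row_dot ** 2
    original_sq = sum(d * d for d in diff)
    return projected_sq / original_sq, len(support)

ratio, nnz = distance_ratio(m=200, n=1000, k=5)
print(f"difference has {nnz} nonzeros; distance ratio = {ratio:.3f}")
```

The difference of two $K$-sparse vectors has at most $2K$ nonzeros, and the ratio stays close to 1 even though $M$ is much smaller than $N$. A single random draw only illustrates the property; it does not certify the RIP, which must hold for all sparse vectors simultaneously.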

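If $\mathbf{A}$ satisfies the RIP-$(2K, \delta)$ in the basis $\boldsymbol{\Psi}$ with reasonable $\delta$ (e.g., $\delta \le \sqrt{2}-1$) and the signal estimate is $\hat{\mathbf{s}}$, canonical results establish the following bound on signal recovery error:

$$\|\mathbf{s}-\hat{\mathbf{s}}\|_2 \;\le\; \alpha\,\|\boldsymbol{\epsilon}\|_2 + \beta\,\frac{\|\boldsymbol{\Psi}^T(\mathbf{s}-\mathbf{s}_K)\|_1}{\sqrt{K}}, \tag{2.4}$$

where $\alpha$ and $\beta$ are constants, $\boldsymbol{\epsilon}$ is the measurement noise, and $\mathbf{s}_K$ is the best $K$-term approximation to $\mathbf{s}$ in the basis $\boldsymbol{\Psi}$ (i.e., using the $K$ largest coefficients in $\mathbf{a} = \boldsymbol{\Psi}^T\mathbf{s}$) (Candes, 2006). Equation 2.4 shows that signal recovery error is determined by the magnitude of the measurement noise and sparsity of the signal. In the case that the signal is exactly $K$-sparse and there is no measurement noise, this bound guarantees perfect signal recovery.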
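To make the approximation term of the recovery bound concrete, the sketch below computes a best $K$-term approximation and the resulting tail quantity. It assumes the standard CS form for that term, namely the $\ell_1$ norm of the discarded coefficients divided by $\sqrt{K}$; the helper names and the example coefficients are hypothetical.

```python
import math

def best_k_term(coeffs, k):
    """Keep the k largest-magnitude coefficients and zero out the rest."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

def approximation_term(coeffs, k):
    """l1 norm of the discarded coefficients, divided by sqrt(k)."""
    approx = best_k_term(coeffs, k)
    tail_l1 = sum(abs(c - a) for c, a in zip(coeffs, approx))
    return tail_l1 / math.sqrt(k)

a = [3.0, -0.1, 0.0, 2.0, 0.05]      # nearly 2-sparse coefficient vector
exact = [3.0, 0.0, 0.0, 2.0, 0.0]    # exactly 2-sparse coefficient vector
print(best_k_term(a, 2))
print(approximation_term(a, 2))
print(approximation_term(exact, 2))
```

For the exactly sparse vector the term vanishes, which is why the bound reduces to perfect recovery when there is also no measurement noise.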

While the guarantees above are deterministic and nonasymptotic, the canonical CS results state that measurement matrices generated randomly from “nice” independent distributions (e.g., gaussian, Bernoulli) can satisfy RIP with high probability when $M = O(K\log N)$ (Rauhut, 2010). For example, random gaussian measurement matrices (perhaps the most highly used construction in CS) satisfy the RIP condition for any sparsity basis with probability $1-O(1/N)$ when $M \ge C\delta^{-2}K\log(N)$. This extremely favorable scaling law (i.e., linear in the sparsity level) for random gaussian matrices is in part due to the fact that gaussian matrices have many degrees of freedom, resulting in $M$ statistically independent observations of the signal. In many practical examples, there exists a high degree of structure in $\mathbf{A}$ that causes the measurements to be correlated. Structured measurement matrices with correlations between the measurements have been recently studied due to their computational advantages. While these matrices can still satisfy the RIP, they typically require more measurements to reconstruct a signal with the same fidelity, and the performance may change depending on the sparsity basis (i.e., they are no longer “universal” because they do not perform equally well for all sparsity bases). One example that arises often in the signal processing community is the case of random circulant matrices (Krahmer, Mendelson, & Rauhut, 2012), where the number of measurements needed to ensure that the RIP holds with high probability for temporally sparse signals (i.e., $\boldsymbol{\Psi}$ is the identity) increases to $M = O(K\log^2(K)\log^2(N))$. Other structured systems analyzed in the literature include Toeplitz matrices (Haupt, Bajwa, Raz, & Nowak, 2010), partial circulant matrices (Krahmer et al., 2012), block diagonal matrices (Eftekhari, Yap, Rozell, & Wakin, in press; Park, Yap, Rozell, & Wakin, 2011), subsampled unitary matrices (Bajwa, Sayeed, & Nowak, 2009), and randomly subsampled Fourier matrices (Rudelson & Vershynin, 2008).
These types of results are used to demonstrate that signal recovery is possible with highly undersampled measurements, where the number of measurements scales linearly with the information level of the signal (i.e., the number of nonzero coefficients) and only logarithmically with the ambient dimension.

## 3. STM Capacity Using the RIP

### 3.1. Network Dynamics as Compressed Sensing.

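We consider a discrete-time ESN whose state evolves as

$$\mathbf{x}[n] = f\!\left(\mathbf{W}\mathbf{x}[n-1] + \mathbf{z}\,s[n] + \tilde{\boldsymbol{\epsilon}}[n]\right), \tag{3.1}$$

where $\mathbf{x}[n] \in \mathbb{R}^M$ is the network state at time $n$, $\mathbf{W}$ is the ($M \times M$) recurrent (feedback) connectivity matrix, $s[n]$ is the input sequence at time $n$, $\mathbf{z}$ is the ($M \times 1$) projection of the input into the network, $\tilde{\boldsymbol{\epsilon}}[n]$ is a potential network noise source, and $f$ is a possible pointwise nonlinearity. As in previous studies (Jaeger, 2001; White et al., 2004; Ganguli et al., 2008; Ganguli & Sompolinsky, 2010), this letter considers the STM capacity of a linear network (i.e., $f(\mathbf{x})=\mathbf{x}$).

The recurrent dynamics of equation 3.1 can be used to write the network state at time $N$:

$$\mathbf{x}[N] = \mathbf{A}\mathbf{s} + \boldsymbol{\epsilon}, \tag{3.2}$$

where $\mathbf{A}$ is an $M \times N$ matrix, the $k$th column of $\mathbf{A}$ is $\mathbf{W}^{k-1}\mathbf{z}$, $\mathbf{s} = [s[N], \ldots, s[1]]^T$, the initial state of the system is $\mathbf{x}[0]=0$, and $\boldsymbol{\epsilon}$ is the node activity not accounted for by the input stimulus (e.g., the sum of network noise terms $\boldsymbol{\epsilon} = \sum_{k=1}^{N}\mathbf{W}^{N-k}\tilde{\boldsymbol{\epsilon}}[k]$). With this network model, we assume that the input sequence $\mathbf{s}$ is $K$-sparse in an orthonormal basis $\boldsymbol{\Psi}$ (i.e., there are $K$ nonzeros only in $\mathbf{a} = \boldsymbol{\Psi}^T\mathbf{s}$).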
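The equivalence between running the linear recurrence and applying a single measurement matrix can be checked on a toy network. This is a sketch under stated assumptions: the noiseless linear case, a recurrent matrix built from independent 2-by-2 rotation blocks (chosen here only so the example needs no linear algebra library), and an arbitrary feedforward vector. The function names and all numeric values are hypothetical.

```python
import math
import random

def apply_w(state, angles):
    """Apply a block-diagonal W made of 2x2 rotations (one angle per node pair)."""
    out = []
    for (a, b), theta in zip(state, angles):
        c, s = math.cos(theta), math.sin(theta)
        out.append((c * a - s * b, s * a + c * b))
    return out

def run_network(inputs, angles, z):
    """Iterate x[n] = W x[n-1] + z s[n] from x[0] = 0 and return x[N]."""
    state = [(0.0, 0.0)] * len(angles)
    for s_n in inputs:
        rotated = apply_w(state, angles)
        state = [(ra + za * s_n, rb + zb * s_n)
                 for (ra, rb), (za, zb) in zip(rotated, z)]
    return state

def state_from_matrix(inputs, angles, z):
    """Compute A s, where column k of A is W^(k-1) z and s = [s[N], ..., s[1]]."""
    total = [(0.0, 0.0)] * len(angles)
    column = list(z)                      # first column: W^0 z
    for s_k in reversed(inputs):          # most recent input first
        total = [(ta + ca * s_k, tb + cb * s_k)
                 for (ta, tb), (ca, cb) in zip(total, column)]
        column = apply_w(column, angles)  # next column: one more power of W
    return total

rng = random.Random(1)
pairs = 3                                  # M = 6 nodes
angles = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(pairs)]
z = [(1.0 / math.sqrt(2 * pairs),) * 2 for _ in range(pairs)]
inputs = [0.0, 1.5, 0.0, 0.0, -0.7, 0.0, 0.0, 2.0]   # s[1], ..., s[N], temporally sparse

x_direct = run_network(inputs, angles, z)
x_matrix = state_from_matrix(inputs, angles, z)
max_gap = max(abs(p - q) for u, v in zip(x_direct, x_matrix) for p, q in zip(u, v))
print(max_gap)
```

The two computations agree up to floating-point rounding, which is the sense in which the final network state acts as a set of $M$ linear measurements of the length-$N$ input history.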

### 3.2. STM Capacity of Finite-Length Inputs.

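Consider first the case in which a length-$N$ input signal drives a network and the current state of the $M$ network nodes at time $N$ is used to recover the input history via equation 2.2. If $\mathbf{A}$ derived from the network dynamics satisfies the RIP for the sparsity basis $\boldsymbol{\Psi}$, the bounds in equation 2.4 establish strong guarantees on recovering $\mathbf{s}$ from the current network states $\mathbf{x}[N]$. Given the significant structure in $\mathbf{A}$, it is not immediately clear that any network construction can result in $\mathbf{A}$ satisfying the RIP. However, the structure in $\mathbf{A}$ is very regular and in fact depends on only powers of $\mathbf{W}$ applied to $\mathbf{z}$:

$$\mathbf{A} = \left[\,\mathbf{z} \;\big|\; \mathbf{W}\mathbf{z} \;\big|\; \mathbf{W}^2\mathbf{z} \;\big|\; \cdots \;\big|\; \mathbf{W}^{N-1}\mathbf{z}\,\right].$$

Writing the eigendecomposition of the recurrent matrix $\mathbf{W}=\mathbf{U}\mathbf{D}\mathbf{U}^{-1}$, we rewrite the measurement matrix as

$$\mathbf{A} = \mathbf{U}\left[\,\tilde{\mathbf{z}} \;\big|\; \mathbf{D}\tilde{\mathbf{z}} \;\big|\; \mathbf{D}^2\tilde{\mathbf{z}} \;\big|\; \cdots \;\big|\; \mathbf{D}^{N-1}\tilde{\mathbf{z}}\,\right],$$

where $\tilde{\mathbf{z}} = \mathbf{U}^{-1}\mathbf{z}$. Rearranging, we get

$$\mathbf{A} = \mathbf{U}\tilde{\mathbf{Z}}\mathbf{F}, \tag{3.3}$$

where $F_{k,l}=d_k^{\,l-1}$ is the $k$th eigenvalue of $\mathbf{W}$ raised to the $(l-1)$th power and $\tilde{\mathbf{Z}} = \mathrm{diag}(\mathbf{U}^{-1}\mathbf{z})$.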
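The rearrangement into a product of three matrices can be verified one column at a time. The short derivation below is a sketch that assumes the notation $\tilde{\mathbf{z}} = \mathbf{U}^{-1}\mathbf{z}$ and $\tilde{\mathbf{Z}} = \mathrm{diag}(\tilde{\mathbf{z}})$, with $d_1, \ldots, d_M$ the eigenvalues on the diagonal of $\mathbf{D}$ and $\mathbf{F}_{:,l}$ denoting the $l$th column of $\mathbf{F}$:

```latex
\begin{aligned}
\mathbf{W}^{l-1}\mathbf{z}
  &= \mathbf{U}\mathbf{D}^{l-1}\mathbf{U}^{-1}\mathbf{z}
   = \mathbf{U}\mathbf{D}^{l-1}\tilde{\mathbf{z}} \\
  &= \mathbf{U}\,\mathrm{diag}(\tilde{\mathbf{z}})
     \begin{bmatrix} d_1^{\,l-1} \\ \vdots \\ d_M^{\,l-1} \end{bmatrix}
   = \mathbf{U}\tilde{\mathbf{Z}}\,\mathbf{F}_{:,l}.
\end{aligned}
```

The second line uses the fact that a diagonal matrix applied to a vector is an entrywise product, so the roles of $\mathbf{D}^{l-1}$ and $\tilde{\mathbf{z}}$ can be exchanged. Stacking the columns for $l = 1, \ldots, N$ gives the factorization, with $\mathbf{F}$ a Vandermonde matrix in the eigenvalues of $\mathbf{W}$.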

While the RIP conditioning of $\mathbf{A}$ depends on all of the matrices in the decomposition of equation 3.3, the conditioning of $\mathbf{F}$ is the most challenging because it is the only matrix that is compressive (i.e., not square). Due to this difficulty, we start by specifying a network structure for $\mathbf{U}$ and $\tilde{\mathbf{Z}}$ that preserves the conditioning properties of $\mathbf{F}$ (other network constructions are discussed in section 4). Specifically, as in White et al. (2004), Ganguli et al. (2008), and Ganguli and Sompolinsky (2010), we choose $\mathbf{W}$ to be a random orthonormal matrix, assuring that the eigenvector matrix $\mathbf{U}$ has orthonormal columns and preserves the conditioning properties of $\mathbf{F}$. Likewise, we choose the feedforward vector $\mathbf{z}$ to be $\frac{1}{\sqrt{M}}\mathbf{U}\mathbf{1}_M$, where $\mathbf{1}_M$ is a vector of $M$ ones (the constant $\sqrt{M}$ simplifies the proofs but has no bearing on the result). This choice for $\mathbf{z}$ ensures that $\tilde{\mathbf{Z}}$ is the identity matrix scaled by $1/\sqrt{M}$ (analogous to Ganguli et al., 2008, where $\mathbf{z}$ is optimized to maximize the SNR in the system). Finally, we observe that the richest information preservation apparently arises for a real-valued $\mathbf{W}$ when its eigenvalues are complex, distinct in phase, have unit magnitude, and appear in complex conjugate pairs.

For the above network construction, our main result shows that $\mathbf{A}$ satisfies the RIP in the basis $\boldsymbol{\Psi}$ (implying the bounds from equation 2.4 hold) when the network size scales linearly with the sparsity level of the input. This result is made precise in the following theorem:

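*Suppose* $N \ge M$, $N \ge K$, *and* $N \ge O(1)$.^{1} Let $\mathbf{U}$ be any unitary matrix of eigenvectors (containing complex conjugate pairs) and set $\mathbf{z} = \frac{1}{\sqrt{M}}\mathbf{U}\mathbf{1}_M$ so that $\tilde{\mathbf{Z}} = \frac{1}{\sqrt{M}}\mathbf{I}$. For $M$ an even integer, denote the eigenvalues of $\mathbf{W}$ by $\{e^{jw_m}\}_{m=1}^{M}$. Let the first $M/2$ eigenvalues be chosen uniformly at random on the complex unit circle (i.e., we choose $\{w_m\}_{m=1}^{M/2}$ uniformly at random from $[0, 2\pi)$) and the other $M/2$ eigenvalues as the complex conjugates of these values (i.e., for $M/2 < m \le M$, $e^{jw_m} = e^{-jw_{m-M/2}}$). Under these conditions, for a given RIP conditioning $\delta < 1$ and failure probability $\eta$, if

$$M \;\ge\; C\,\frac{K}{\delta^2}\,\mu^2(\boldsymbol{\Psi})\,\log^4(N)\,\log(\eta^{-1})$$

*for a universal constant* $C$, then for any $\mathbf{s}$ that is $K$-sparse (i.e., has no more than $K$ non-zero entries)

$$(1-\delta) \;\le\; \frac{\|\mathbf{A}\boldsymbol{\Psi}\mathbf{s}\|_2^2}{\|\mathbf{s}\|_2^2} \;\le\; (1+\delta)$$

*with probability exceeding* $1-\eta$.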
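The eigenvalue construction in the theorem is easy to simulate. The sketch below draws $M/2$ phases uniformly at random, adds their conjugates, and measures how well the resulting operator preserves the energy of one temporally sparse input (canonical sparsity basis). It assumes the normalization $\mathbf{A} = \mathbf{U}\mathbf{F}/\sqrt{M}$ and drops the unitary $\mathbf{U}$, since it does not change norms; the function names and the sizes $M$, $N$, $K$ are hypothetical.

```python
import cmath
import math
import random

def conjugate_pair_phases(m, rng):
    """Draw M/2 phases uniformly on [0, 2*pi) and append their negatives."""
    half = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(m // 2)]
    return half + [-w for w in half]

def energy_ratio(phases, support, values):
    """||A s||^2 / ||s||^2 for A = F / sqrt(M), F[k][l] = exp(1j * w_k * l).

    Here l is the zero-based column index, so the exponent matches the
    power of the eigenvalue in that column. U is omitted because a unitary
    U does not change the norm of A s.
    """
    m = len(phases)
    total = 0.0
    for w in phases:
        row = sum(v * cmath.exp(1j * w * l) for l, v in zip(support, values))
        total += abs(row) ** 2
    return (total / m) / sum(v * v for v in values)

rng = random.Random(0)
M, N, K = 400, 2000, 8
phases = conjugate_pair_phases(M, rng)
support = rng.sample(range(N), K)
values = [rng.uniform(0.5, 1.5) for _ in range(K)]
ratio = energy_ratio(phases, support, values)
print(f"M={M}, N={N}, K={K}: energy ratio = {ratio:.3f}")
```

A ratio near 1 for an input five times longer than the network is consistent with the theorem, although one random trial is only an illustration: the RIP is a statement about all $K$-sparse inputs at once.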

The complex-conjugate structure of the eigenvalues and eigenvectors ensures that $\mathbf{W}$ is a real-valued matrix. The quantity $\mu(\boldsymbol{\Psi})$ (known as the coherence) captures the largest inner product between the sparsity basis and the Fourier basis and is calculated as

$$\mu(\boldsymbol{\Psi}) = \max_{n=1,\ldots,N}\;\sup_{t\in[0,2\pi]}\;\left|\sum_{m=0}^{N-1}\Psi_{m,n}\,e^{-jtm}\right|.$$

In the result above, the coherence is lower (therefore the STM capacity is higher) when the sparsity basis is more “different” from the Fourier basis.

The main observation of the result above is that STM capacity scales superlinearly with network size. Indeed, for some values of $K$ and $\mu(\boldsymbol{\Psi})$, it is possible to have STM capacities much greater than the number of nodes (i.e., $N \gg M$). To illustrate the perfect recovery of signal lengths beyond the network size, Figure 2 shows an example recovery of a single long input sequence. Specifically, we generate a 100-node random orthogonal connectivity matrix $\mathbf{W}$ and generate $\mathbf{z} = \frac{1}{\sqrt{M}}\mathbf{U}\mathbf{1}_M$. We then drive the network with an input sequence that is 480 samples long and constructed using 24 nonzero coefficients (chosen uniformly at random) of a wavelet basis. The values at the nonzero entries were chosen uniformly in the range [0.5,1.5]. In this example, we omit noise so that we can illustrate the noiseless recovery. At the end of the input sequence, the resulting 100 network states are used to solve the optimization problem in equation 2.2 for recovering the input sequence (using the network architecture in Rozell et al., 2010). The recovered sequence, as depicted in Figure 2, is identical to the input sequence, clearly indicating that the 100 nodes were able to store the 480 samples of the input sequence (achieving STM capacity higher than the network size).

Directly checking the RIP condition for specific matrices is NP-hard (one would need to check every possible $2K$-sparse signal). In light of this difficulty in verifying recovery of all possible sparse signals (which the RIP implies), we will explore the qualitative behavior of the RIP bounds above by examining in Figure 3 the average recovery relative MSE (rMSE) in simulation for a network with $M$ nodes when recovering input sequences of length $N$ with varying sparsity bases. Figure 3 uses a plotting style similar to the Donoho-Tanner phase transition diagrams (Donoho & Tanner, 2005) where the average recovery rMSE is shown for each pair of variables under noisy conditions. While the traditional Donoho-Tanner phase transitions plot noiseless recovery performance to observe the threshold between perfect and imperfect recovery, here we also add noise to illustrate the stability of the recovery guarantees. The noise is generated as random additive gaussian noise at the input ($\tilde{\boldsymbol{\epsilon}}$ in equation 3.1) to the system with zero mean and variance such that the total noise in the system ($\boldsymbol{\epsilon}$ in equation 3.2) has a norm of approximately 0.01. To demonstrate the behavior of the system, the phase diagrams in Figure 3 sweep the ratio of measurements to the total signal length ($M/N$) and the ratio of the signal sparsity to the number of measurements ($K/M$). Thus, at the upper left-hand corner, the system is recovering a dense signal from almost no measurements (which should almost certainly yield poor results) and at the right-hand edge of the plots the system is recovering a signal from a full set of measurements (enough to recover the signal well for all sparsity ranges). We generate 10 random ESNs for each combination of ratios ($M/N$, $K/M$). The simulated networks are driven with input sequences that are sparse in one of four different bases (Canonical, Daubechies-10 wavelet, Symlet-3 wavelet, and DCT) that have varying coherence with the Fourier basis.
We use the node values at the end of the sequence to recover the inputs.^{2}

In each plot of Figure 3, the dashed line denotes the boundary where the system is able to perform essentially perfect recovery (recovery error below a small threshold) up to the noise floor. Note that the area under this line (the white area in the plot) denotes the region where the system is leveraging the sparse structure of the input to get capacities of $N>M$. We also observe that the dependence of the RIP bound on the coherence with the Fourier basis is clearly shown qualitatively in these plots, with the DCT sparsity basis showing much worse performance than the other bases.
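The role of coherence in these plots can be explored numerically. The sketch below assumes that the coherence is the largest magnitude of the inner product between a basis column and a complex sinusoid, approximates the supremum over frequency on a uniform grid, and compares the canonical basis with an orthonormal DCT-II basis. The DCT variant, the grid resolution, the size $N = 16$, and all function names are assumptions made for illustration.

```python
import cmath
import math

def coherence(basis, grid=512):
    """Approximate max over columns n and frequencies t of |sum_m basis[m][n] e^{-jtm}|.

    The supremum over t in [0, 2*pi) is approximated on a uniform grid.
    """
    size = len(basis)
    best = 0.0
    for n in range(size):
        for g in range(grid):
            t = 2.0 * math.pi * g / grid
            inner = sum(basis[m][n] * cmath.exp(-1j * t * m) for m in range(size))
            best = max(best, abs(inner))
    return best

def identity_basis(size):
    """Canonical basis: column n is a unit impulse at time n."""
    return [[1.0 if m == n else 0.0 for n in range(size)] for m in range(size)]

def dct_basis(size):
    """Orthonormal DCT-II basis vectors stored as columns (row index = time)."""
    rows = []
    for m in range(size):
        row = []
        for n in range(size):
            scale = math.sqrt(1.0 / size) if n == 0 else math.sqrt(2.0 / size)
            row.append(scale * math.cos(math.pi * (m + 0.5) * n / size))
        rows.append(row)
    return rows

N = 16
mu_canonical = coherence(identity_basis(N))
mu_dct = coherence(dct_basis(N))
print(mu_canonical, mu_dct)
```

Under this definition the canonical basis attains the smallest possible value, 1, while the DCT basis reaches $\sqrt{N}$ because its constant column aligns perfectly with the zero-frequency sinusoid. Since the required network size grows with the squared coherence, this gap is in line with the weaker DCT performance noted above.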

### 3.3. STM Capacity of Infinite-Length Inputs.

After establishing the perfect recovery bounds for finite-length inputs in the previous section, we turn here to the more interesting case of a network that has received an input beyond its STM capacity (perhaps infinitely long). Whereas the favorable constructions for $\mathbf{W}$ in the finite-length case used random unit-norm eigenvalues, that construction would be unstable for infinitely long inputs. In this case, we take $\mathbf{W}$ to have all eigenvalue magnitudes equal to $q<1$ to ensure stability. The matrix constructions we consider in this section are otherwise identical to those described in the previous section.

In this scenario, the recurrent application of $\mathbf{W}$ in the system dynamics ensures that each input perturbation will decay steadily until it has zero effect on the network state. While good for system stability, this decay means that each input will slowly recede into the past until the network activity contains no usable memory of the event. In other words, any network with this decay can only hope to recover a proxy signal that accounts for the decay in the signal representation induced by the forgetting factor $q$. Specifically, we define this proxy signal to be $\mathbf{Q}\mathbf{s}$, where $\mathbf{Q} = \mathrm{diag}([1, q, q^2, \ldots])$. Previous work (Ganguli et al., 2008; Jaeger, 2001; White et al., 2004) has characterized recoverability by using statistical arguments to quantify the correlation of the node values to each past input perturbation. In contrast, our approach is to provide recovery bounds on the rMSE for a network attempting to recover the $n$ past samples of $\mathbf{Q}\mathbf{s}$, which corresponds to the weighted length-$n$ history of $\mathbf{s}$. Note that in contrast to the previous section where we established the length of the input that can be perfectly recovered, the amount of time we attempt to recall ($n$) is now a parameter that can be varied.

Our technical approach to this problem comes from observing that activity due to inputs older than $n$ acts as interference when recovering more recent inputs. In other words, we can group older terms (i.e., from further back than $n$ time samples ago) with the noise term, resulting again in $\mathbf{A}$ being an $M$-by-$n$ linear operation that can satisfy RIP for length-$n$ inputs. In this case, after choosing the length of the memory to recover, the guarantees in equation 2.4 hold when considering every input older than $n$ as contributing to the “noise” part of the bound.
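The size of the interference from older inputs can be sketched with a geometric sum. The example assumes that every input is bounded in magnitude and that the weight on the input from $k$ steps in the past decays as $q^k$, so the total weight on everything older than $n$ steps is $q^n/(1-q)$. This is a simplified illustration that ignores the scaling contributed by the network matrices; the function names are hypothetical, and the value $q = 0.999$ is borrowed from the simulation described later in this section.

```python
def tail_weight(q, n):
    """Closed form of sum_{k >= n} q**k, the total weight on inputs older than n."""
    return q ** n / (1.0 - q)

def tail_weight_numeric(q, n, terms=2000):
    """Direct partial sum of q**k for k = n, ..., n + terms - 1."""
    return sum(q ** k for k in range(n, n + terms))

q = 0.999
for n in (500, 2000, 8000):
    print(n, tail_weight(q, n))
```

Choosing a longer recovery length shrinks this tail, so less signal energy is counted as noise, which is the omission side of the trade-off discussed in this section.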

For example, when $\mathbf{s}$ is sparse in the canonical basis ($\boldsymbol{\Psi} = \mathbf{I}$) with a maximum signal value $s_{\max}$ and the maximum noise term is $\epsilon_{\max}$, we can bound the first term of equation 2.4 using a geometric sum that depends on $n$, $K$, and $q$. For a given scenario (i.e., a choice of $q$, $K$, and the RIP conditioning of $\mathbf{A}$), a network can support signal recovery up to a certain sparsity level $K^*$, whose expression involves a scaling constant (the value obtained using the present techniques differs from the value that is conjectured; Rudelson & Vershynin, 2008). We can also bound the second term of equation 2.4 by the sum of the energy in the past $n$ perturbations that are beyond this sparsity level $K^*$. Together these terms yield the bound on the recovery of the proxy signal (equation 3.7). The derivation of the first two terms in that bound is detailed in the appendix and the final term is simply the accumulated noise, which should have a bounded norm due to the exponential decay of the eigenvalues of $\mathbf{W}$.

Intuitively, we see that this approach implies the presence of an optimal value for the recovery length $n$. For example, choosing $n$ too small means that there is useful signal information in the network that the system is not attempting to recover, resulting in omission errors (i.e., an increase in the first term of equation 2.4 by counting too much signal as noise). On the other hand, choosing $n$ too large means that the system is encountering recall errors by trying to recover inputs with little or no residual information remaining in the network activity (i.e., an increase in the second term of equation 2.4 from making the signal approximation worse by using the same number of nodes for a longer signal length).

The intuitive argument above can be made precise in the sense that the bound in equation 3.7 has at least one local minimum for some finite value of $n$. First, we note that the noise term (i.e., the third term on the right side of equation 3.7) does not depend on $n$ (the choice of origin does not change the infinite summation), implying that the optimal recovery length depends on only the first two terms. We also note the important fact that the supported sparsity level $K^*$ is nonnegative and monotonically decreasing with increasing $n$. It is straightforward to observe that the bound in equation 3.7 tends to infinity as $n$ increases (due to the presence of $K^*$ in the denominator of the second term). Furthermore, for small values of $n$, the second term in equation 3.7 is zero (because no perturbations lie beyond the supported sparsity level when $n \le K^*$), and the first term is monotonically decreasing with $n$. Taken together, since the function is continuous in $n$, has negative slope for small $n$, and tends to infinity for large $n$, we can conclude that it must have at least one local minimum in the range $0 < n < \infty$. This result predicts that there is (at least one) optimal value for the recovery length $n$.

The prediction of an optimal recovery length above is based on the fact that the error bound in equation 3.7 has a nontrivial minimum, and it is possible that the error itself will not actually show this behavior (since the bound may not be tight in all cases). To test the qualitative intuition from equation 3.7, we simulate recovery over a range of recovery lengths and show the results in Figure 4. Specifically, we generate 50 ESNs with 500 nodes and a decay rate of *q*=0.999. The input signals are length 8000 sequences that have 400 nonzeros whose locations are chosen uniformly at random and whose amplitudes are chosen from a gaussian distribution (zero mean and unit variance). After presenting the full 8000 samples of the input signal to the network, we use the network states to recover the input history with varying lengths and compare the resulting MSE to the bound in equation 3.7. Note that while the theoretical bound may not be tight for large signal lengths, the recovery MSE matches the qualitative behavior of the bound by achieving a minimum value at *N*>*M*.

## 4. Other Network Constructions

### 4.1. Alternate Orthogonal Constructions.

Our results in the previous section focus on the case where **W** is orthogonal and **z** projects the signal evenly into all eigenvectors of **W**. When either **W** or **z** deviates from this structure, the STM capacity of the network apparently decreases. In this section we revisit those specifications, considering alternate network structures allowed under these assumptions, as well as the consequences of deviating from these assumptions in favor of other structural advantages for a system (e.g., wire length).

To begin, we consider the assumption of orthogonal network connectivity, where the eigenvalues have constant magnitude and the eigenvectors are orthonormal. Constructed in this way, **U** exactly preserves the conditioning of the matrix it multiplies. While this construction may seem restrictive, orthogonal matrices are relatively simple to generate and encompass a number of distinct cases. For small networks, selecting the eigenvalues uniformly at random from the unit circle (and including their complex conjugates to ensure real connectivity weights) and choosing an orthonormal set of complex conjugate eigenvectors creates precisely these optimal properties. For larger matrices, the connectivity matrix can instead be constructed directly by choosing **W** at random and orthogonalizing the columns. Previous results on random matrices (Diaconis & Shahshahani, 1994) guarantee that as the size of **W** increases, the eigenvalue probability density approaches the uniform distribution as desired. Some recent work in STM capacity demonstrates an alternate method by which orthogonal matrices can be constructed while constraining the total connectivity of the network (Strauss et al., 2012). This method iteratively applies rotation matrices to obtain orthogonal matrices with varying degrees of connectivity. We note here that one class of connectivity matrices not well suited to the STM task, even when made orthogonal, is symmetric networks, where the strictly real-valued eigenvalues generate poor RIP conditioning for **F**.

While simple to generate in principle, the matrix constructions discussed above are generally densely connected and may be impractical for many systems. However, many other special network topologies that may be more biophysically realistic (e.g., block diagonal connectivity matrices and small-world networks; Mongillo, Barak, & Tsodyks, 2008) can be constructed so that **W** still has orthonormal columns.^{3}

For example, consider the case of a block diagonal connection matrix (illustrated in Figure 5), where many unconnected networks of at least two nodes each are driven by the same input stimulus and evolve separately. Such a structure lends itself to a modular framework, where more of these subnetworks can be recruited to recover input stimuli further in the past. In this case, each block can be created independently as above and pieced together. Each column of the block diagonal matrix will still have unit norm and will be orthogonal both to the other columns within its own block (since each of the diagonal submatrices is orthonormal) and to all columns in other blocks (since there is no overlap in the nonzero indices).
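The block diagonal construction can be checked numerically. The following is a minimal sketch, assuming NumPy and assuming that each block is produced by orthogonalizing gaussian columns with a QR factorization; the function names `random_orthogonal` and `block_diagonal_orthogonal` and the block sizes are hypothetical choices made for illustration, not taken from the letter.

```python
import numpy as np

def random_orthogonal(n, rng):
    """Draw an n x n orthogonal matrix by orthogonalizing gaussian columns (QR)."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    # Fix column signs so the result does not depend on the QR sign convention.
    return q * np.sign(np.diag(r))

def block_diagonal_orthogonal(block_sizes, rng):
    """Place independently drawn orthogonal blocks on the diagonal of W."""
    m = sum(block_sizes)
    w = np.zeros((m, m))
    start = 0
    for size in block_sizes:
        w[start:start + size, start:start + size] = random_orthogonal(size, rng)
        start += size
    return w

rng = np.random.default_rng(0)
W = block_diagonal_orthogonal([2, 3, 4, 3], rng)   # illustrative block sizes

# Columns are orthonormal, so W^T W is the identity.
assert np.allclose(W.T @ W, np.eye(W.shape[0]))
# All eigenvalues of an orthogonal matrix lie on the complex unit circle.
assert np.allclose(np.abs(np.linalg.eigvals(W)), 1.0)
# Connectivity is sparse: entries outside the diagonal blocks are exactly zero.
print(np.count_nonzero(W), "nonzero weights out of", W.size)
```

The two assertions mirror the argument in the text: unit-norm columns that are orthogonal within a block and trivially orthogonal across blocks.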

Similarly, a small-world topology can be achieved by taking a few of the nodes in every group of the block diagonal case and allowing connections to all other neurons (either unidirectional or bidirectional connections). To construct such a matrix, a block diagonal orthogonal matrix can be taken, a number of columns can be removed and replaced with full columns, and the resulting columns can be made orthonormal with respect to the remaining block-diagonal columns. In these cases, the same eigenvalue distribution and eigenvector properties hold as in the fully connected case, resulting in the same RIP guarantees (and therefore the same recovery guarantees) demonstrated earlier. We note that this is only one approach to constructing a network with favorable STM capacity, and not all networks with small-world properties will perform well.
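The column replacement step can be sketched as follows. This is an illustrative sketch only, assuming NumPy, 2 x 2 rotation blocks as the starting block diagonal matrix, and a projection followed by a QR factorization as the orthonormalization; the helper names `rotation_blocks` and `add_dense_columns`, the angles, and the chosen column indices are hypothetical. One column is replaced in every block, in line with "a few of the nodes in every group": a new column that is orthogonal to all remaining columns must lie in the span of the removed ones, so it can reach every node only if every block gives up at least one column.

```python
import numpy as np

def rotation_blocks(angles):
    """Block diagonal orthogonal matrix built from 2 x 2 rotations."""
    m = 2 * len(angles)
    w = np.zeros((m, m))
    for i, theta in enumerate(angles):
        c, s = np.cos(theta), np.sin(theta)
        w[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[c, -s], [s, c]]
    return w

def add_dense_columns(w_block, replace_cols, rng):
    """Replace the chosen columns with dense ones, keeping W orthogonal."""
    m = w_block.shape[0]
    replaced = set(replace_cols)
    keep = [c for c in range(m) if c not in replaced]
    kept = w_block[:, keep]                      # untouched block diagonal columns
    dense = rng.standard_normal((m, len(replace_cols)))
    dense -= kept @ (kept.T @ dense)             # project out the kept columns
    dense, _ = np.linalg.qr(dense)               # orthonormalize the new columns
    w = w_block.copy()
    w[:, replace_cols] = dense
    return w

rng = np.random.default_rng(0)
W_block = rotation_blocks([0.3, 1.1, 2.0, 2.7])
# One column is taken from every block so the new columns can reach all nodes.
W_sw = add_dense_columns(W_block, [0, 2, 4, 6], rng)

assert np.allclose(W_sw.T @ W_sw, np.eye(8))                 # still orthogonal
assert np.allclose(np.abs(np.linalg.eigvals(W_sw)), 1.0)     # unit-modulus eigenvalues
```

The sketch only verifies orthogonality and unit-modulus eigenvalues; it does not check the eigenvalue distribution discussed in the text.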

Additionally, we note that in contrast to networks analyzed in prior work (in particular, the work of Wallace et al., 2013, demonstrating that random networks with high connectivity have short STM), the average connectivity does not play a dominant role in our analysis. Specifically, it has been observed in spiking networks that higher network connectivity can reduce the STM capacity so that it scales only with log(*M*) (Wallace et al., 2013). However, in our ESN analysis, networks can have low connectivity (e.g., block-diagonal matrices with two-node blocks, the extreme case of the block diagonal structure described above) or high connectivity (e.g., fully connected networks) and have the same performance.

### 4.2. Suboptimal Network Constructions.

Finally, we can also analyze some variations to the network structure assumed in this letter to see how much performance decreases. First, instead of the deterministic construction for **z** discussed in the earlier sections, there has also been interest in choosing **z** as independent and identically distributed (i.i.d.) random gaussian values (Ganguli et al., 2008; Ganguli & Sompolinsky, 2010). In this case, it is also possible to show that **A** satisfies the RIP (with respect to the sparsity basis and with the same RIP conditioning as before) by paying an extra log(*N*) penalty in the number of measurements. Specifically, we have also established the following theorem:

*Theorem 2. Suppose , and . Let* **U** *be any unitary matrix of eigenvectors (containing complex conjugate pairs) and the entries of* **z** *be independent and identically distributed zero-mean gaussian random variables with variance . For M an even integer, denote the eigenvalues of* **W** *by . Let the first M/2 eigenvalues () be chosen uniformly at random on the complex unit circle (i.e., we choose uniformly at random from ) and the other M/2 eigenvalues as the complex conjugates of these values. Then, for a given RIP conditioning and failure probability , if*

**A** *satisfies RIP- with probability exceeding for a universal constant C.*

The proof of this theorem can be found in the appendix. The additional log factor in the bound in equation 4.1 reflects that a random feedforward vector may not optimally spread the input energy over the different eigendirections of the system. Thus, some nodes may see less energy than others, making them slightly less informative. Note that while this construction does perform worse than the optimal constructions from theorem 1, the STM capacity is still very favorable (i.e., a linear scaling in the sparsity level and logarithmic scaling in the signal length).
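The network construction assumed by the theorem can be sketched numerically. This is a hedged illustration, assuming NumPy; the particular way of generating **U** (from the two halves of a random real orthogonal matrix `Q`), the network size, and the variance 1/*M* used for the gaussian feedforward vector are assumptions made here for illustration, since the theorem only requires a unitary **U** with complex conjugate column pairs.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 8                                   # number of nodes (even)

# Eigenvectors in complex conjugate pairs: U = [V | conj(V)].
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
V = (Q[:, :M // 2] + 1j * Q[:, M // 2:]) / np.sqrt(2)
U = np.hstack([V, V.conj()])

# First M/2 eigenvalues uniform on the unit circle, the rest their conjugates.
w = rng.uniform(0.0, 2.0 * np.pi, size=M // 2)
eigvals = np.concatenate([np.exp(1j * w), np.exp(-1j * w)])

W = (U * eigvals) @ U.conj().T          # W = U diag(eigvals) U^H
assert np.allclose(U.conj().T @ U, np.eye(M))   # U is unitary
assert np.allclose(W.imag, 0.0)                 # connectivity weights are real
W = W.real
assert np.allclose(W.T @ W, np.eye(M))          # W is orthogonal

# Gaussian feedforward vector; the variance 1/M is an assumed normalization.
z = rng.standard_normal(M) / np.sqrt(M)
```

Pairing each eigenvalue and eigenvector with its complex conjugate is what makes the imaginary parts cancel, so the connectivity weights come out real.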

When the eigenvalues of **W** still lie on the complex unit circle, we can analyze how nonorthogonal matrices affect the RIP results. In this case, the decomposition in equation 3.3 still holds and theorem 1 still applies to guarantee that **F** satisfies the RIP. However, the nonorthogonality changes the conditioning of **U** and subsequently the total conditioning of **A**. Specifically, the conditioning of **U** (the ratio of the maximum and minimum singular values) will affect the total conditioning of **A**. We can use the RIP of **F** and the extreme singular values of **U** to bound how close **UF** is to an isometry for sparse vectors, both from above and from below. Consolidating these bounds gives a new RIP statement for the composite matrix, and the resulting relationships can be solved for the new RIP constants. These constants demonstrate that as the conditioning of **U** improves, the RIP conditioning does not change from the optimal case of an orthogonal network. However, as the conditioning of **U** gets worse, the constants associated with the RIP statement also get worse (implying more measurements are likely required to guarantee the same recovery performance).

In the preceding analysis, the eigenvalues of **W** are still unit norm; however, **U** is not orthogonal. Generally, when the eigenvalues of **W** differ from unity and are not all of equal magnitude, the current approach becomes intractable. In one case, however, there are theoretical guarantees: when **W** is rank deficient. If **W** only has unit-norm eigenvalues and the remaining eigenvalues are zero, then the resulting matrix **A** is composed the same way, except that the bottom rows are all zero. This means that the effective measurements depend on only an subsampled DTFT, where is the matrix consisting of the nonzero rows of **F**. In this case, we can choose any of the nodes and the previous theorems will all hold, replacing the true number of nodes *M* with the effective number of nodes .
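For the nonorthogonal case discussed earlier in this section, the consolidation of the upper and lower bounds can be worked through explicitly. The following is a sketch under assumed notation, not a reproduction of the letter's expressions: $\delta$ and $C$ denote the RIP conditioning and scaling constant of $\mathbf{F}$ on $K$-sparse vectors $\mathbf{s}$, $\sigma_{\max}$ and $\sigma_{\min}$ denote the extreme singular values of $\mathbf{U}$, $\gamma = \sigma_{\max}/\sigma_{\min}$ is its conditioning, and $\delta'$ and $C'$ are the constants of the composite matrix.

```latex
\begin{align*}
% RIP of F combined with the extreme singular values of U
\sigma_{\min}^{2}\, C(1-\delta)\,\|\mathbf{s}\|_2^2
  \;\le\; \|\mathbf{U}\mathbf{F}\mathbf{s}\|_2^2
  \;\le\; \sigma_{\max}^{2}\, C(1+\delta)\,\|\mathbf{s}\|_2^2 , \\
% match against a new RIP statement C'(1 \pm \delta')
C'(1+\delta') = \sigma_{\max}^{2}\, C(1+\delta), \qquad
C'(1-\delta') = \sigma_{\min}^{2}\, C(1-\delta), \\
% solve the two equations for the new constants
C' = \frac{C}{2}\left[\sigma_{\max}^{2}(1+\delta) + \sigma_{\min}^{2}(1-\delta)\right], \qquad
\delta' = \frac{\gamma^{2}(1+\delta) - (1-\delta)}{\gamma^{2}(1+\delta) + (1-\delta)} .
\end{align*}
```

Under these assumptions, $\gamma = 1$ gives $\delta' = \delta$, and $\delta'$ approaches 1 as $\gamma$ grows, which is consistent with the qualitative behavior described above.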

## 5. Discussion

We have seen that the tools of the CS literature can provide a way to quantify the STM capacity in linear networks using rigorous nonasymptotic recovery error bounds. Of particular note is that this approach leverages the nongaussianity of the input statistics to show STM capacities that are superlinear in the size of the network and depend linearly on the sparsity level of the input. This work provides a concrete theoretical understanding for the approach conjectured in Ganguli and Sompolinsky (2010), along with a generalization to arbitrary sparsity bases and infinitely long input sequences. This analysis also predicts that there exists an optimal recovery length that balances omission errors and recall mistakes.

In contrast to previous work on ESNs that leverages nonlinear network computations for computational power (Jaeger & Haas, 2004), this letter uses a linear network and nonlinear computations for signal recovery. Despite the nonlinearity of the recovery process, the fundamental results of the CS literature also guarantee that the recovery process is stable and robust. For example, with access to only a subset of nodes (due to failures or communication constraints), signal recovery generally degrades gracefully by still achieving the best possible approximation of the signal using fewer coefficients. Beyond signal recovery, we also note that the RIP can guarantee performance on many tasks (e.g., detection, classification) performed directly on the network states (Davenport, Boufounos, Wakin, & Baraniuk, 2010). Finally, we note that while this work addresses only the case where a single input is fed to the network, there may be networks of interest that have a number of input streams all feeding into the same network (with different feedforward vectors). We believe that the same tools used here can be used in the multi-input case, since the overall network state is still a linear function of the inputs.

## Appendix

### A.1. Proof of RIP.

We show that the matrix satisfies the RIP under the conditions stated in equation 3.4 in order to prove theorem 1. We note that Rauhut (2010) shows that for the canonical basis (), the bounds for *M* can be tightened to using a more complex proof technique than we will employ here. For , the result in Rauhut (2010) represents an improvement of several log(*N*) factors when restricted to only the canonical basis for . We also note that the scaling constant *C* found in the general RIP definition of equation 2.3 is unity due to the scaling of **z**.

While the proof of theorem 1 is fairly technical, the procedure follows very closely the proof of theorem 8.1 in Rauhut (2010) on subsampled discrete time Fourier transform (DTFT) matrices. While the basic approach is the same, the novelty in our presentation is the incorporation of the sparsity basis and considerations for a real-valued connectivity matrix **W**.

Because **U** is assumed unitary, for any signal **s**, . Thus, it suffices to establish the conditioning properties of the matrix . For the upcoming proof, it will be useful to write this matrix as a sum of rank-1 operators. The specific rank-1 operator that will be useful for our purposes is *X*_{l}*X*^{H}_{l} with , the conjugate of the *l*th row of , where is the conjugated *l*th row of **F**. Because of the way the “frequencies” are chosen, for any , . The *l*th row of is where is the *l*th diagonal entry of the diagonal matrix , meaning that we can use the sum of rank-1 operators to write the decomposition . If we define the random variable and the norm , we can equivalently say that has RIP conditioning if

*w*_{m}’s. Under the assumption of theorem 1, for . Therefore, where it is straightforward to check that . By the same reasoning, we also have . This implies that we can rewrite **B** as

The main proof of the theorem has two steps. First, we establish a bound on the moments of the quantity of interest . Next we use these moments to derive a tail bound on , which will lead directly to the RIP statement we seek. The following two lemmas from the literature are critical for these two steps.

*Lemma 1 (lemma 8.2 of Rauhut, 2010). Suppose , and suppose we have a sequence of (fixed) vectors for such that . Let be a Rademacher sequence, that is, a sequence of i.i.d. random variables. Then for p = 1 and for and , where are universal constants.*

*Lemma 2 (adapted from proposition 6.5 of Rauhut, 2010). Suppose Z is a random variable satisfying for all , and for constants . Then for all ,*

Armed with this notation and these lemmas, we now prove theorem 1:

*M* in theorem 1, . Since **B** = **B**_{1} + **B**_{2} and , Thus, it will suffice to bound since implies that . In this presentation, we let be some universal constant that may not be the same from line to line.

To get to log^{4} *N* in the third line, we used our assumption that , and in theorem 1. Now, using the definition of from lemma 1, we can bound this quantity as Therefore, we have the following implicit bound on the moments of the random variable of interest: The above can be written as , where . By squaring, rearranging the terms, and completing the square, we have . By assuming , this bound can be simplified to . Now this assumption is equivalent to having an upper bound on the range of values of *p*:

*p*_{0} = 2, and , we obtain the following tail bound for : If we pick such that and *u* such that then we have our required tail bound of . First, observe that equation A.3 is equivalent to having Also, because of the limited range of values *u* can take (i.e., ), we require that which, together with the earlier condition on *M*, completes the proof.

### A.2. RIP with Gaussian Feedforward Vectors.

In this section we extend the RIP analysis of section A.1 to the case when **z** is chosen to be a gaussian i.i.d. vector, as presented in theorem 2. Unfortunately, with the additional randomness in the feedforward vector, the same proof procedure as in theorem 1 cannot be used. In the proof of theorem 1, we showed that the random variable has *p*th moments that scale like (through lemma 1) for a range of *p*, suggesting that it has a subgaussian tail (i.e., ) for a range of deviations *u*. We then used this tail bound to bound the probability that exceeds a fixed conditioning . With gaussian uncertainties in the feedforward vector **z**, lemma 1 will not yield the required subgaussian tail but instead gives us moment estimates that result in suboptimal scaling of *M* with respect to *n*. Therefore, we will instead follow the proof procedure of theorem 16 from Tropp, Laska, Duarte, Romberg, and Baraniuk (2009), which will yield the better measurement rate given in theorem 2.

Let us begin by recalling a few notations from the proof of theorem 1 and introducing further notations that will simplify our exposition later. First, recall that we let *X*^{H}_{l} be the *l*th row of . Thus, the *l*th row of our matrix of interest is where is the *l*th diagonal entry of the diagonal matrix . Whereas before, for any , here it will be a random variable. To understand the resulting distribution of , note that for the connectivity matrix **W** to be real, we need to assume that the second *M*/2 columns of **U** are complex conjugates of the first *M*/2 columns. Thus, we can write **U** = [*U*_{R} | *U*_{R}] + *j*[*U*_{I} | −*U*_{I}], where . Because **U**^{H}**U** = **I**, we can deduce that *U*^{T}_{R}*U*_{I} = **0** and that the norms of the columns of both *U*_{R} and *U*_{I} are 1/√2.^{4}

With these matrices *U*_{R}, *U*_{I}, we rewrite the random vector to illustrate its structure. Consider the matrix , a scaled unitary matrix (because we can check that ). Next, consider the random vector . Because is (scaled) unitary and **z** is composed of i.i.d. zero-mean gaussian random variables of variance , the entries of are also i.i.d. zero-mean gaussian random variables, but now with variance . Then, from our definition of **U** in terms of *U*_{R} and *U*_{I}, for any , we have and for , we have . This clearly shows that each of the first entries of is made up of two i.i.d. random variables (one being the real component, the other imaginary) and that the other entries are just complex conjugates of the first . Because of this, for , is the sum of squares of two i.i.d. gaussian random variables.
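The structure described above can be checked numerically. This is a minimal sketch, assuming NumPy; the way *U*_{R} and *U*_{I} are generated (as scaled halves of a random real orthogonal matrix `Q`), the name `z_tilde` for the vector **U**^{H}**z**, and the variance 1/*M* of **z** are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
M = 6
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
U_R = Q[:, :M // 2] / np.sqrt(2)        # real part of the first M/2 eigenvectors
U_I = Q[:, M // 2:] / np.sqrt(2)        # imaginary part of the first M/2 eigenvectors
U = np.hstack([U_R, U_R]) + 1j * np.hstack([U_I, -U_I])

assert np.allclose(U.conj().T @ U, np.eye(M))                    # U^H U = I
assert np.allclose(U_R.T @ U_I, 0.0)                             # U_R^T U_I = 0
assert np.allclose(np.linalg.norm(U_R, axis=0), 1 / np.sqrt(2))  # column norms
assert np.allclose(np.linalg.norm(U_I, axis=0), 1 / np.sqrt(2))

z = rng.standard_normal(M) / np.sqrt(M)   # gaussian z; variance 1/M is assumed
z_tilde = U.conj().T @ z                  # hypothetical name for U^H z

# The last M/2 entries are complex conjugates of the first M/2 entries.
assert np.allclose(z_tilde[M // 2:], z_tilde[:M // 2].conj())
# Each squared magnitude is a sum of squares of two real gaussian projections.
assert np.allclose(np.abs(z_tilde[:M // 2]) ** 2,
                   (U_R.T @ z) ** 2 + (U_I.T @ z) ** 2)
```

Because the columns of [*U*_{R} | *U*_{I}] are mutually orthogonal, the two projections of a gaussian **z** are independent, which is the property the lemma below relies on.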

Before moving on to the proof, we first present a lemma regarding the random sequence |*z*_{l}|^{2} that will be useful in the sequel.

*l* used as a variable for a maximization will be taken over the set without explicitly writing the index set. To calculate , we use the following result that allows us to bound the expected value of a positive random variable by its tail probability (see proposition 6.1 of Rauhut, 2010): Using the union bound, we have the estimate (since the are identically distributed). Now, because is a sum of squares of two gaussian random variables and thus is a (generalized) chi-squared random variable with 2 degrees of freedom (which we shall denote by ),^{5} we have where is the gamma function and the 2*Mu* appears instead of *u* in the exponential because of the standardization of the gaussian random variables (initially of variance ). To proceed, we break the integral in equation A.6 into two parts. To do so, notice that if , then the trivial upper bound of is a better estimate than . In other words, our estimate for the tail bound of is not very good for small *u* but gets better with increasing *u*. Therefore, we have This is the bound in expectation that we seek for in equation A.6.
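The splitting of the integral can be illustrated with a small simulation. This is a sketch under stated assumptions, not the letter's exact expressions: it assumes NumPy, assumes each squared magnitude is the sum of squares of two gaussians of variance 1/(2*M*) (so that its tail probability is exp(−*Mu*)), and takes the maximum over *M*/2 independent entries; the values of `M` and the number of trials are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 64            # network size; each squared magnitude is assumed exponential with rate M
n = M // 2        # number of independent entries (the rest are complex conjugates)

# Sum of squares of two gaussians of variance 1/(2M): tail P(X > u) = exp(-M u).
samples = (rng.standard_normal((20000, n, 2)) ** 2).sum(axis=2) / (2 * M)
empirical = samples.max(axis=1).mean()    # Monte Carlo estimate of E[max]

# Integrate min(1, n exp(-M u)): use the trivial bound 1 up to u0 = log(n) / M,
# then the union bound n exp(-M u) beyond u0.
u0 = np.log(n) / M
bound = u0 + n * np.exp(-M * u0) / M      # equals (log n + 1) / M

assert empirical <= bound
print(f"empirical E[max] = {empirical:.4f}, union-bound estimate = {bound:.4f}")
```

The breakpoint `u0` is where the union bound drops below the trivial bound of 1, which is the same reason the text gives for breaking the integral into two parts.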

*X* can be estimated once we know the moments of *X*. Therefore, we require the moments of the random variable . For this, for any *p* > 0, we use the following simple estimate: where the first step comes from writing the expectation as an integral of the cumulative distribution (as seen in equation A.6) and taking the union bound, and the second step comes from the fact that the are identically distributed. Now, is a subexponential random variable since it is a sum of squares of gaussian random variables (Vershynin, 2012).^{6} Therefore, for any *p* > 0, its *p*th moment can be bounded by where the division by *M* comes again from the variance of the gaussian random variables that make up . Putting this bound with equation A.7, we have the following estimate for the *p*th moments of :^{7} Therefore, by lemma 2 with , , and , we have By choosing , we have our desired tail bound of

Armed with this lemma, we can now turn our attention to the main proof. As stated earlier, this follows essentially the same form as Tropp et al. (2009), with the primary difference of including the results from lemma 3. As before, because with , we just have to consider bounding the tail bound . This proof differs from that in section A.1 in that here, we first show that is small when *M* is large enough and then show that *Z*_{1} does not differ much from with high probability.

#### A.2.1. Expectation.

In this section, we show that is small. This basically follows from lemma 1 and equation A.4. To be precise, the remainder of this section is to prove:

*Choose any . If , then .*

*C* is some universal constant that may not be the same from line to line. We follow the same symmetrization step found in the proof in section A.1 to arrive at where the outer expectation is over the Rademacher sequence and the inner expectation is over the random “frequencies” and feedforward vector . As before, for , we set . Observe that by definition, and thus is a random variable. We then use lemma 1 with *p* = 1 to get where the second line uses the Cauchy-Schwarz inequality for expectations and the third line uses the triangle inequality. Again, to get to log^{4} *N* in the second line, we used our assumption that , , and in theorem 2. It therefore remains to calculate . Now, . First, we have . Next, equation A.4 in lemma 3 tells us that . Thus, we have . Putting everything together, we have Now, the above can be written as , where . By squaring it, rearranging the terms, and completing the squares, we have . By supposing , this can be simplified as . To conclude, let us choose *M* such that where is our predetermined conditioning (which incidentally fulfills our previous assumption that ). By applying the formula for *a*, we have that if , then .

#### A.2.2. Tail Probability.

To give a probability tail-bound estimate to *Z*_{1}, we use the following lemma found in Tropp et al. (2009) and Rauhut (2010):

The goal of this section is to prove:

*Pick any and suppose . Suppose . Then .*

*Y*_{l} to look like the summands of However, this poses several problems. First, they are not symmetric,^{8} and thus, we need to symmetrize it by defining where are independent copies of and *X*_{l}, respectively, and is an independent Rademacher sequence. Here, the relation for two random variables *X*, *Y* means that *X* has the same distribution as *Y*. To form , what we have done is take each summand of *Z*_{1} and take its difference with an independent copy of itself. Because is symmetric, adding a Rademacher sequence does not change its distribution, and this sequence is introduced only to resolve a technicality that will arise later. If we let , then the random variables (symmetrized) and *Z*_{1} (unsymmetrized) are related via the following estimates (Rauhut, 2010):

*Y*_{l} in lemma 4 is that almost surely. Because of the unbounded nature of the gaussian random variables and in , this condition is not met. Therefore, we need to define a *Y*_{l} that is conditioned on the event that these gaussian random variables are bounded. To do so, define the following event:

*F*, the norm of is well bounded: where in the last line we used the fact that the ratio between the and norms of a *K*-sparse vector is *K*, and the estimate we derived for in section A.1.

*F*, where is the indicator function of event

^{c}*F*. If we define , the random variables

_{l}*Y*(truncated) and (untruncated) are related by (Tropp et al., 2009) (see also lemma 1.4.3 of De La Peña & Giné, 1999) When are held constant so only the Rademacher sequence is random, then the contraction principle (Tropp et al., 2009; Ledoux & Talagrand, 1991) tells us that . Note that the sole reason for introducing the Rademacher sequences is for this use of the contraction principle. As this holds point-wise for all , we have

*M*, we have Using these estimates for and and choosing and , lemma 4 says that Then, using the relation between the tail probabilities of *Y* and , equation A.11, together with our estimate for , we have Finally, using the relation between the tail probabilities of and *Z*, equation A.10, we have where we used the fact that . Then, for a predetermined conditioning , pick for a constant , which will be chosen appropriately later. With this choice of and with our assumptions that and , the three terms in the tail bound become As for the last term, if , then (where we further supposed that ). If (where the lower bound is from the theorem assumptions), then . By choosing appropriately large, we then have Putting the formula for into completes the proof.

### A.3. Derivation of Recovery Bound for Infinite Length Inputs.

In this section, we derive the bound in equation 3.7. The approach we take is to bound the individual components of equation 2.4. Because the noise term due to noise in the inputs is unaffected, we will bound the noise term due to the unrecovered signal (the first term in equation 2.4) by the component of the input history that is beyond the attempted recovery, and we will bound the signal approximation term (the second term in equation 2.4) by the quality of the signal recovery possible in the attempted recovery length. In this way we can observe how different properties of the system and input sequence affect signal recovery.

**As**, as is discussed in the letter. The second summation here, , essentially acts as an additional noise term in the recovery. We can further analyze the effect of this noise term by understanding that is bounded for well-behaved input sequences *s*[*n*] (in fact, all that is needed is that the maximum value or the expected value and variance are reasonably bounded) when the eigenvalues of **W** are of magnitude . We can explicitly calculate the worst-case scenario bounds on the norm of , where is the diagonal matrix containing the normalized eigenvalues of **W**. If we assume that **z** is chosen as mentioned in section 3.2 so that , the eigenvalues of **W** are uniformly spread around a complex circle of radius *q*, and that for all *n*, then we can bound this quantity as where *d*_{k} is the *k*th normalized eigenvalue of **W**. In the limit of large input signal lengths (), we have and so , which leaves the approximate expression

*K*-sparse) and the approximation to the signal with the largest coefficients. In the worst-case scenario, there are coefficients that cannot be guaranteed to be recovered by the RIP conditions, and these coefficients all take the maximum value *s*_{max}. In this case, we can bound the signal approximation error as stated in the main text:

## Acknowledgments

We are grateful to J. Romberg for valuable discussions related to this work. This work was partially supported by NSF grant CCF-0830456 and DSO National Laboratories, Singapore.

## References

## Notes

^{1}

The notation means that for some constant *C*. For clarity, we do not keep track of the constants in our proofs. See Rauhut (2010) for specific values of the constants.

^{3}

Small-world structures are typically taken to be networks where small groups of neurons are densely connected among themselves, yet sparse connections to other groups reduce the maximum distance between any two nodes.

^{5}

The pdf of a chi-squared random variable with *q* degrees of freedom is given by . Therefore, its tail probability can be obtained by integration: .

^{6}

A subexponential random variable is a random variable whose tail probability is bounded by exp^{−Cu} for some constant *C*. Thus, a chi-squared random variable is a specific instance of a subexponential random variable.

^{7}

We remark that this bound gives a worse estimate for the expected value than that calculated before because of the crude bound given by equation A.7.

^{8}

A random variable *X* is symmetric if *X* and −*X* have the same distribution.