Cortical networks are hypothesized to rely on transient network activity to support short-term memory (STM). In this letter, we study the capacity of randomly connected recurrent linear networks for performing STM when the input signals are approximately sparse in some basis. We leverage results from compressed sensing to provide rigorous nonasymptotic recovery guarantees, quantifying the impact of the input sparsity level, the input sparsity basis, and the network characteristics on the system capacity. Our analysis demonstrates that network memory capacities can scale superlinearly with the number of nodes and in some situations can achieve STM capacities that are much larger than the network size. We provide perfect recovery guarantees for finite sequences and recovery bounds for infinite sequences. The latter analysis predicts that network STM systems may have an optimal recovery length that balances errors due to omission and recall mistakes. Furthermore, we show that the conditions yielding optimal STM capacity can be embodied in several network topologies, including networks with sparse or dense connectivities.
Short-term memory (STM) is critical for neural systems to understand nontrivial environments and perform complex tasks. While individual neurons could potentially account for very long or very short stimulus memory (e.g., through changing synaptic weights or membrane dynamics, respectively), useful STM on the order of seconds is conjectured to be due to transient network activity. Specifically, stimulus perturbations can cause activity in a recurrent network long after the input has been removed, and recent research hypothesizes that cortical networks may rely on transient activity to support STM (Jaeger & Haas, 2004; Maass, Natschläger, & Markram, 2002; Buonomano & Maass, 2009).
Understanding the role of memory in neural systems requires determining the fundamental limits of STM capacity in a network and characterizing the effects on that capacity of the network size, topology, and input statistics. Various approaches to quantifying the STM capacity of linear (Jaeger, 2001; White, Lee, & Sompolinsky, 2004; Ganguli, Huh, & Sompolinsky, 2008; Hermans & Schrauwen, 2010) and nonlinear (Wallace, Hamid, & Latham, 2013) recurrent networks have been used, often assuming gaussian input statistics (Jaeger, 2001; White et al., 2004; Hermans & Schrauwen, 2010; Wallace et al., 2013). These analyses show that even under optimal conditions, the STM capacity (i.e., the length of the stimulus able to be recovered) scales only linearly with the number of nodes in the network. While conventional wisdom holds that signal structure could be exploited to achieve more favorable capacities, this idea has generally not been the focus of significant rigorous study.
Recent work in computational neuroscience and signal processing has shown that many signals of interest have statistics that are strongly nongaussian, with low-dimensional structure that can be exploited for many tasks. In particular, sparsity-based signal models (i.e., representing a signal using relatively few nonzero coefficients in a basis) have recently been shown to be especially powerful. In the computational neuroscience literature, sparse encodings increase the capacity of associative memory models (Baum, Moody, & Wilczek, 1988) and are sufficient neural coding models to account for several properties of neurons in primary visual cortex (i.e., response preferences (Olshausen & Field, 1996) and nonlinear modulations (Zhu & Rozell, 2013)). In the signal processing literature, the recent work in compressed sensing (CS) (Candes, Romberg, & Tao, 2006; Ganguli & Sompolinsky, 2012) has established strong guarantees on sparse signal recovery from highly undersampled measurement systems.
Ganguli and Sompolinsky (2010) have previously conjectured that the ideas of CS can be used to achieve STM capacities that exceed the number of network nodes in an orthogonal recurrent network when the inputs are sparse in the canonical basis (i.e., the input sequences have temporally localized activity). While these results are compelling and provide a great deal of intuition, the theoretical support for this approach remains an open question, as the results of Ganguli and Sompolinsky (2010) use an asymptotic analysis on an approximation of the network dynamics to support empirical findings. In this letter, we establish a theoretical basis for CS approaches in network STM by providing rigorous nonasymptotic recovery error bounds for an exact model of the network dynamics and input sequences that are sparse in any general basis (e.g., sinusoids, wavelets). Our analysis shows conclusively that STM capacity can scale superlinearly with the number of network nodes and quantifies the impact of the input sparsity level, the input sparsity basis, and the network characteristics on system capacity. We provide both perfect recovery guarantees for finite inputs and bounds on the recovery performance when the network has an arbitrarily long input sequence. The latter analysis predicts that network STM systems based on CS may have an optimal recovery length that balances errors due to omission and recall mistakes. Furthermore, we show that the structural conditions yielding optimal STM capacity in our analysis can be embodied in many different network topologies, including networks with both sparse and dense connectivities.
2.1. Short-Term Memory in Recurrent Networks.
Since understanding the STM capacity of networked systems would lead to a better understanding of how such systems perform complex tasks, STM capacity has been studied in several network architectures, including discrete-time networks (Jaeger, 2001; White et al., 2004; Ganguli et al., 2008), continuous-time networks (Hermans & Schrauwen, 2010; Büsing, Schrauwen, & Legenstein, 2010), and spiking networks (Maass et al., 2002; Mayor & Gerstner, 2005; Legenstein & Maass, 2007; Wallace et al., 2013). While many different analysis methods have been used, each tries to quantify the amount of information present in the network states about the past inputs (i.e., how different inputs induce different network states as in Figure 1). For example, in one approach taken to study echo state networks (ESNs) (White et al., 2004; Ganguli et al., 2008; Hermans & Schrauwen, 2010), this information preservation is quantified through the correlation between the past input and the current state. When the correlation is too low, that input is said to no longer be represented in the state. The results of these analyses conclude that for gaussian input statistics, the number of previous inputs significantly correlated with the current network state is bounded by a linear function of the network size.
In another line of analysis, researchers have sought to directly quantify the degree to which different inputs lead to unique network states (Jaeger, 2001; Maass et al., 2002; Legenstein & Maass, 2007; Strauss, Wustlich, & Labahn, 2012). In essence, the main idea of this work is that a one-to-one relationship between input sequences and the network states should allow the system to perform an inverse computation to recover the original input. A number of specific properties have been proposed to describe the uniqueness of the network state with respect to the input. In spiking liquid state machines (LSMs), in work by Maass et al. (2002), a separability property is suggested that guarantees distinct network states for distinct inputs and follow-up work (Legenstein & Maass, 2007) that relates the separability property to practical computational tasks through the Vapnik-Chervonenkis (VC) dimension (Vapnik & Chervonenkis, 1971). More recent work analyzing similar networks using separation properties (Wallace et al., 2013; Büsing et al., 2010) gives an upper bound for the STM capacity that scales like the logarithm of the number of network nodes.
In discrete ESNs, the echo-state property (ESP) ensures that every network state at a given time is uniquely defined by some left-infinite sequence of inputs (Jaeger, 2001). The necessary condition for the ESP is that the maximum eigenvalue magnitude of the system is less than unity (an eigenvalue with a magnitude of one would correspond to a linear system at the edge of instability). While the ESP ensures uniqueness, it does not ensure robustness, and output computations can be sensitive to small perturbations (i.e., noisy inputs). A slightly more robust property looks at the conditioning of the matrix describing how the system acts on an input sequence (Strauss et al., 2012). The condition number not only describes a one-to-one correspondence but also quantifies how small perturbations in the input affect the output. While work by Strauss et al. (2012) is closest in spirit to the analysis in this letter, it ultimately concludes that the STM capacity still scales linearly with network size.
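As a small numerical illustration of the eigenvalue condition described above, the following sketch computes the largest eigenvalue magnitude of a connectivity matrix. The helper name `spectral_radius`, the network size, and the scaling factor 0.9 are illustrative assumptions and are not taken from the analysis in this letter.

```python
import numpy as np

def spectral_radius(W):
    """Largest eigenvalue magnitude of the connectivity matrix W."""
    return np.max(np.abs(np.linalg.eigvals(W)))

rng = np.random.default_rng(0)
M = 50  # illustrative network size (assumption)

# Orthonormalizing a gaussian matrix gives an orthogonal matrix,
# whose eigenvalues all have magnitude one.
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))

W_edge = Q           # eigenvalue magnitudes of one: edge of instability
W_stable = 0.9 * Q   # eigenvalue magnitudes of 0.9: inside the unit circle

print(spectral_radius(W_edge), spectral_radius(W_stable))
```

Scaling an orthogonal matrix by a factor below one is only one convenient way to place all eigenvalues strictly inside the unit circle; other constructions are equally possible.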
Determining whether a system abides by one of the separability properties depends heavily on the network's construction. In some cases, different architectures can yield very different results. For example, in the case of randomly connected spiking networks, low connectivity (each neuron is connected to a small number of other neurons) can lead to large STM capacities (Legenstein & Maass, 2007; Büsing et al., 2010), whereas high connectivity leads to chaotic dynamics and smaller STM capacities (Wallace et al., 2013). In contrast, linear ESNs with high connectivities (appropriately normalized) (Büsing et al., 2010) can have relatively large STM capacities (on the order of the number of nodes in the network) (Ganguli et al., 2008; Strauss et al., 2012). Much of this work centers around using systems with orthogonal connectivity matrices, which leads to a topology that robustly preserves information. Interestingly, such systems can be constructed to have arbitrary connectivity while retaining the information-preserving properties (Strauss et al., 2012).
While a variety of networks have been analyzed using the properties described, these analyses ignore any structure of the input sequences that could be used to improve the analysis (Jaeger, 2001; Mayor & Gerstner, 2005). Conventional wisdom has suggested that STM capacities could be increased by exploiting structure in the inputs, but formal analysis has rarely addressed this case. For example, work by Ganguli and Sompolinsky (2010) builds significant intuition for the role of structured inputs in increasing STM capacity, specifically proposing to use the tools of CS to study the case when the input signals are temporally sparse. However, the analysis by Ganguli and Sompolinsky (2010) is asymptotic and focuses on an annealed (i.e., approximate) version of the system that neglects correlations between the network states over time. This letter can be viewed as a generalization of this work to provide formal guarantees for the STM capacity of the exact system dynamics, extensions to arbitrary orthogonal sparsity bases, and recovery bounds when the input exceeds the capacity of the system (i.e., the input is arbitrarily long).
2.2. Compressed Sensing.
There is substantial evidence from the signal processing and computational neuroscience communities that many natural signals are sparse in an appropriate basis (Olshausen & Field, 1996; Elad, Figueiredo, & Ma, 2008). The recovery problem requires that the system knows the sparsity basis to perform the recovery, which neural systems may not know a priori. We note that work has shown that appropriate sparsity bases can be learned from example data (Olshausen & Field, 1996), even in the case where the system observes the inputs only through compressed measurements (Isley, Hillar, & Sommer, 2011). While the analysis does not depend on the exact method for solving the optimization in equation 2.2, we also note that this type of optimization can be solved in biologically plausible network architectures (Rozell, Johnson, Baraniuk, & Olshausen, 2010; Rhen & Sommer, 2007; Hu, Genkin, & Chklovskii, 2012; Balavoine, Romberg, & Rozell, 2012; Balavoine, Rozell, & Romberg, 2013; Shapero, Charles, Rozell, & Hasler, 2011).
While the guarantees above are deterministic and nonasymptotic, the canonical CS results state that measurement matrices generated randomly from “nice” independent distributions (e.g., gaussian, Bernoulli) can satisfy RIP with high probability when M = O(K log N) (Rauhut, 2010). For example, random gaussian measurement matrices (perhaps the most highly used construction in CS) satisfy the RIP condition for any sparsity basis with probability 1−O(1/N) when . This extremely favorable scaling law (i.e., linear in the sparsity level) for random gaussian matrices is in part due to the fact that gaussian matrices have many degrees of freedom, resulting in M statistically independent observations of the signal. In many practical examples, there exists a high degree of structure in A that causes the measurements to be correlated. Structured measurement matrices with correlations between the measurements have been recently studied due to their computational advantages. While these matrices can still satisfy the RIP, they typically require more measurements to reconstruct a signal with the same fidelity, and the performance may change depending on the sparsity basis (i.e., they are no longer “universal” because they do not perform equally well for all sparsity bases). One example that arises often in the signal processing community is the case of random circulant matrices (Krahmer, Mendelson, & Rauhut, 2012), where the number of measurements needed to ensure that the RIP holds with high probability for temporally sparse signals (i.e., the sparsity basis is the identity) increases to . Other structured systems analyzed in the literature include Toeplitz matrices (Haupt, Bajwa, Raz, & Nowak, 2010), partial circulant matrices (Krahmer et al., 2012), block diagonal matrices (Eftekhari, Yap, Rozell, & Wakin, in press; Park, Yap, Rozell, & Wakin, 2011), subsampled unitary matrices (Bajwa, Sayeed, & Nowak, 2009), and randomly subsampled Fourier matrices (Rudelson & Vershynin, 2008).
These types of results are used to demonstrate that signal recovery is possible with highly undersampled measurements, where the number of measurements scales linearly with the information level of the signal (i.e., the number of nonzero coefficients) and only logarithmically with the ambient dimension.
3. STM Capacity Using the RIP
3.1. Network Dynamics as Compressed Sensing.
3.2. STM Capacity of Finite-Length Inputs.
While the RIP conditioning of A depends on all of the matrices in the decomposition of equation 3.3, the conditioning of F is the most challenging because it is the only matrix that is compressive (i.e., not square). Due to this difficulty, we start by specifying a network structure for U and that preserves the conditioning properties of F (other network constructions are discussed in section 4). Specifically, as in White et al. (2004), Ganguli et al. (2008), and Ganguli and Sompolinsky (2010), we choose W to be a random orthonormal matrix, assuring that the eigenvector matrix U has orthonormal columns and preserves the conditioning properties of F. Likewise, we choose the feedforward vector z to be , where 1_M is a vector of M ones (the constant simplifies the proofs but has no bearing on the result). This choice for z ensures that is the identity matrix scaled by (analogous to Ganguli et al., 2008, where z is optimized to maximize the SNR in the system). Finally, we observe that the richest information preservation apparently arises for a real-valued W when its eigenvalues are complex, distinct in phase, have unit magnitude, and appear in complex conjugate pairs.
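The construction described above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the exact construction used in the proofs: the random orthogonal W is obtained by QR orthonormalization of a gaussian matrix, and the feedforward vector is assumed to be the eigenvector matrix U applied to a vector of ones, scaled by 1/sqrt(M), so that z has equal weight on every eigenvector. The size M = 64 and all variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
M = 64  # illustrative network size (assumption)

# Random orthogonal connectivity: orthonormalize a gaussian matrix via QR.
W, _ = np.linalg.qr(rng.standard_normal((M, M)))

# Eigendecomposition W = U diag(lam) U^{-1}. For a real orthogonal W the
# eigenvalues lie on the unit circle and appear in complex-conjugate pairs.
lam, U = np.linalg.eig(W)

# Assumed feedforward vector: equal weight on every eigenvector.
z_complex = U @ np.ones(M) / np.sqrt(M)
z = z_complex.real  # imaginary part is rounding error for conjugate-paired U

# Coefficients of z in the eigenvector basis, U^{-1} z.
coeffs = np.linalg.solve(U, z.astype(complex))

print(np.max(np.abs(np.abs(lam) - 1.0)))                  # distance from unit circle
print(np.max(np.abs(z_complex.imag)))                     # residual imaginary part
print(np.max(np.abs(np.abs(coeffs) - 1.0 / np.sqrt(M))))  # evenness of projection
```

All three printed quantities should be at the level of floating-point rounding, which reflects that z is real valued and projects evenly onto the eigenvectors of W.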
For the above network construction, our main result shows that A satisfies the RIP in the basis (implying the bounds from equation 2.4 hold) when the network size scales linearly with the sparsity level of the input. This result is made precise in the following theorem:
The main observation of the result above is that STM capacity scales superlinearly with network size. Indeed, for some values of K and , it is possible to have STM capacities much greater than the number of nodes (i.e., ). To illustrate the perfect recovery of signal lengths beyond the network size, Figure 2 shows an example recovery of a single long input sequence. Specifically, we generate a 100-node random orthogonal connectivity matrix W and generate . We then drive the network with an input sequence that is 480 samples long and constructed using 24 nonzero coefficients (chosen uniformly at random) of a wavelet basis. The values at the nonzero entries were chosen uniformly in the range [0.5,1.5]. In this example, we omit noise so that we can illustrate the noiseless recovery. At the end of the input sequence, the resulting 100 network states are used to solve the optimization problem in equation 2.2 for recovering the input sequence (using the network architecture in Rozell et al., 2010). The recovered sequence, as depicted in Figure 2, is identical to the input sequence, clearly indicating that the 100 nodes were able to store the 480 samples of the input sequence (achieving STM capacity higher than the network size).
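A much smaller toy version of this kind of experiment can be sketched as below. Several choices here are assumptions made for illustration and differ from the simulation described above: the network update is taken to be the noiseless linear rule x[n] = W x[n-1] + z s[n], the input is sparse in the canonical basis rather than a wavelet basis, the sizes (M = 200, N = 300, K = 3) are arbitrary, and recovery uses a simple greedy orthogonal matching pursuit in place of the optimization in equation 2.2. The helper names `measurement_matrix` and `omp` are hypothetical.

```python
import numpy as np

def measurement_matrix(W, z, N):
    """Columns are W^(N-1-n) z, so the final state equals A @ s
    for the assumed update x[n] = W x[n-1] + z s[n]."""
    A = np.empty((W.shape[0], N))
    col = z.copy()
    for n in range(N - 1, -1, -1):  # most recent input first
        A[:, n] = col
        col = W @ col
    return A

def omp(A, y, K):
    """Orthogonal matching pursuit: greedily select K columns of A,
    refitting by least squares after each selection."""
    support, residual = [], y.copy()
    for _ in range(K):
        support.append(int(np.argmax(np.abs(A.T @ residual))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        residual = y - A[:, support] @ coef
    s_hat = np.zeros(A.shape[1])
    s_hat[support] = coef
    return s_hat

rng = np.random.default_rng(2)
M, N, K = 200, 300, 3  # nodes, input length, sparsity (illustrative)

# Random orthogonal W and a feedforward vector spread evenly over eigenvectors.
W, _ = np.linalg.qr(rng.standard_normal((M, M)))
_, U = np.linalg.eig(W)
z = (U @ np.ones(M)).real / np.sqrt(M)

# K-sparse input with amplitudes in [0.5, 1.5] at random locations.
s = np.zeros(N)
s[rng.choice(N, size=K, replace=False)] = rng.uniform(0.5, 1.5, size=K)

# Drive the network and keep only the final M-dimensional state.
x = np.zeros(M)
for n in range(N):
    x = W @ x + z * s[n]

A = measurement_matrix(W, z, N)
s_hat = omp(A, x, K)
print(np.linalg.norm(A @ s - x), np.max(np.abs(s_hat - s)))
```

In this sketch the final state of 200 nodes is used to recover a length-300 input, so the recovered sequence is longer than the network is large. The greedy solver is chosen only because it is short to write down; it does not carry the same guarantees as the optimization analyzed in this letter.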
Directly checking the RIP condition for specific matrices is NP-hard (one would need to check every possible 2K-sparse signal). In light of this difficulty in verifying recovery of all possible sparse signals (which the RIP implies), we will explore the qualitative behavior of the RIP bounds above by examining in Figure 3 the average recovery relative MSE (rMSE) in simulation for a network with M nodes when recovering input sequences of length n with varying sparsity bases. Figure 3 uses a plotting style similar to the Donoho-Tanner phase transition diagrams (Donoho & Tanner, 2005) where the average recovery rMSE is shown for each pair of variables under noisy conditions. While the traditional Donoho-Tanner phase transitions plot noiseless recovery performance to observe the threshold between perfect and imperfect recovery, here we also add noise to illustrate the stability of the recovery guarantees. The noise is generated as random additive gaussian noise at the input ( in equation 3.1) to the system with zero mean and variance such that the total noise in the system ( in equation 3.2) has a norm of approximately 0.01. To demonstrate the behavior of the system, the phase diagrams in Figure 3 sweep the ratio of measurements to the total signal length (M/N) and the ratio of the signal sparsity to the number of measurements (K/M). Thus, at the upper left-hand corner, the system is recovering a dense signal from almost no measurements (which should almost certainly yield poor results) and at the right-hand edge of the plots the system is recovering a signal from a full set of measurements (enough to recover the signal well for all sparsity ranges). We generate 10 random ESNs for each combination of ratios (M/N, K/M). The simulated networks are driven with input sequences that are sparse in one of four different bases (Canonical, Daubechies-10 wavelet, Symlet-3 wavelet, and DCT) that have varying coherence with the Fourier basis. 
We use the node values at the end of the sequence to recover the inputs.
In each plot of Figure 3, the dashed line denotes the boundary where the system is able to essentially perform perfect recovery (recovery error ) up to the noise floor. Note that the area under this line (the white area in the plot) denotes the region where the system is leveraging the sparse structure of the input to get capacities of N>M. We also observe that the dependence of the RIP bound on the coherence with the Fourier basis is clearly shown qualitatively in these plots, with the DCT sparsity basis showing much worse performance than the other bases.
3.3. STM Capacity of Infinite-Length Inputs.
After establishing the perfect recovery bounds for finite-length inputs in the previous section, we turn here to the more interesting case of a network that has received an input beyond its STM capacity (perhaps infinitely long). In contrast to the finite-length input case where favorable constructions for W used random unit-norm eigenvalues, this construction would be unstable for infinitely long inputs. In this case, we take W to have all eigenvalue magnitudes equal to q<1 to ensure stability. The matrix constructions we consider in this section are otherwise identical to that described in the previous section.
In this scenario, the recurrent application of W in the system dynamics ensures that each input perturbation will decay steadily until it has zero effect on the network state. While good for system stability, this decay means that each input will slowly recede into the past until the network activity contains no useable memory of the event. In other words, any network with this decay can only hope to recover a proxy signal that accounts for the decay in the signal representation induced by the forgetting factor q. Specifically, we define this proxy signal to be Qs, where . Previous work (Ganguli et al., 2008; Jaeger, 2001; White et al., 2004) has characterized recoverability by using statistical arguments to quantify the correlation of the node values to each past input perturbation. In contrast, our approach is to provide recovery bounds on the rMSE for a network attempting to recover the n past samples of Qs, which corresponds to the weighted length-n history of s. Note that in contrast to the previous section where we established the length of the input that can be perfectly recovered, the amount of time we attempt to recall (n) is now a parameter that can be varied.
Our technical approach to this problem comes from observing that activity due to inputs older than n acts as interference when recovering more recent inputs. In other words, we can group older terms (i.e., from further back than n time samples ago) with the noise term, resulting again in A being an M-by-n linear operation that can satisfy RIP for length-n inputs. In this case, after choosing the length of the memory to recover, the guarantees in equation 2.4 hold when considering every input older than n as contributing to the “noise” part of the bound.
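The grouping of old inputs with the noise term can be illustrated with a short numerical sketch. The assumptions here are the noiseless linear update x[n] = W x[n-1] + z s[n], a connectivity matrix W equal to q times a random orthogonal matrix, a unit-norm feedforward vector, and illustrative values for the sizes and for q. The final bound is a plain triangle-inequality estimate included for illustration; it is not the bound of equation 3.7.

```python
import numpy as np

rng = np.random.default_rng(3)
M, L, n, q = 40, 400, 60, 0.98  # nodes, history length, recall length, decay

Q0, _ = np.linalg.qr(rng.standard_normal((M, M)))
W = q * Q0                       # all eigenvalue magnitudes equal q
z = rng.standard_normal(M)
z /= np.linalg.norm(z)           # unit-norm feedforward vector (assumption)

s = rng.uniform(-1.0, 1.0, size=L)  # long input history, s[-1] is most recent

# Drive the network with the full history.
x = np.zeros(M)
for value in s:
    x = W @ x + z * value

# M-by-n operator acting on the n most recent inputs.
# Column j corresponds to an input of age n-1-j and has norm q^(n-1-j).
A = np.empty((M, n))
col = z.copy()
for j in range(n - 1, -1, -1):
    A[:, j] = col
    col = W @ col

recent = s[-n:]
interference = x - A @ recent    # contribution of everything older than n steps

# Triangle-inequality bound on the interference: max|s| * q^n / (1 - q).
bound = np.max(np.abs(s)) * q**n / (1 - q)
print(np.linalg.norm(interference), bound)
```

The column norms of A decay geometrically with input age, which is the weighting that a proxy signal has to absorb, and the interference from older inputs shrinks as the recall length n grows.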
Intuitively, we see that this approach implies the presence of an optimal value for the recovery length n. For example, choosing n too small means that there is useful signal information in the network that the system is not attempting to recover, resulting in omission errors (i.e., an increase in the first term of equation 2.4 by counting too much signal as noise). On the other hand, choosing n too large means that the system is encountering recall errors by trying to recover inputs with little or no residual information remaining in the network activity (i.e., an increase in the second term of equation 2.4 from making the signal approximation worse by using the same number of nodes for a longer signal length).
The intuitive argument above can be made precise in the sense that the bound in equation 3.7 has at least one local minimum for some value of . First, we note that the noise term (i.e., the third term on the right side of equation 3.7) does not depend on n (the choice of origin does not change the infinite summation), implying that the optimal recovery length depends on only the first two terms. We also note the important fact that is nonnegative and monotonically decreasing with increasing n. It is straightforward to observe that the bound in equation 3.7 tends to infinity as n increases (due to the presence of in the denominator of the second term). Furthermore, for small values of n, the second term in equation 3.7 is zero (due to ), and the first term is monotonically decreasing with n. Taken together, since the function is continuous in n, has negative slope for small n, and tends to infinity for large n, we can conclude that it must have at least one local minimum in the range . This result predicts that there is (at least one) optimal value for the recovery length n.
The prediction of an optimal recovery length above is based on the fact that the error bound in equation 3.7 has a nontrivial minimum, and it is possible that the error itself will not actually show this behavior (since the bound may not be tight in all cases). To test the qualitative intuition from equation 3.7, we simulate recovery of input lengths and show the results in Figure 4. Specifically, we generate 50 ESNs with 500 nodes and a decay rate of q=0.999. The input signals are length 8000 sequences that have 400 nonzeros whose locations are chosen uniformly at random and whose amplitudes are chosen from a gaussian distribution (zero mean and unit variance). After presenting the full 8000 samples of the input signal to the network, we use the network states to recover the input history with varying lengths and compare the resulting MSE to the bound in equation 3.7. Note that while the theoretical bound may not be tight for large signal lengths, the recovery MSE matches the qualitative behavior of the bound by achieving a minimum value at N>M.
4. Other Network Constructions
4.1. Alternate Orthogonal Constructions.
Our results in the previous section focus on the case where W is orthogonal and z projects the signal evenly into all eigenvectors of W. When either W or z deviates from this structure, the STM capacity of the network apparently decreases. In this section we revisit those specifications, considering alternate network structures allowed under these assumptions, as well as the consequences of deviating from these assumptions in favor of other structural advantages for a system (e.g., wire length).
To begin, we consider the assumption of orthogonal network connectivity, where the eigenvalues have constant magnitude and the eigenvectors are orthonormal. Constructed in this way, U exactly preserves the conditioning of . While this construction may seem restrictive, orthogonal matrices are relatively simple to generate and encompass a number of distinct cases. For small networks, selecting the eigenvalues uniformly at random from the unit circle (and including their complex conjugates to ensure real connectivity weights) and choosing an orthonormal set of complex conjugate eigenvectors creates precisely these optimal properties. For larger matrices, the connectivity matrix can instead be constructed directly by choosing W at random and orthogonalizing the columns. Previous results on random matrices (Diaconis & Shahshahani, 1994) guarantee that as the size of W increases, the eigenvalue probability density approaches the uniform distribution as desired. Some recent work in STM capacity demonstrates an alternate method by which orthogonal matrices can be constructed while constraining the total connectivity of the network (Strauss et al., 2012). This method iteratively applies rotation matrices to obtain orthogonal matrices with varying degrees of connectivity. We note here that one special case of connectivity matrices not well suited to the STM task, even when made orthogonal, is the symmetric network, whose strictly real-valued eigenvalues generate poor RIP conditioning for F.
While simple to generate in principle, the matrix constructions discussed above are generally densely connected and may be impractical for many systems. However, many other special network topologies that may be more biophysically realistic (i.e., block diagonal connectivity matrices and small-world networks (Mongillo, Barak, & Tsodyks, 2008)) can be constructed so that W still has orthonormal columns. For example, consider the case of a block diagonal connection matrix (illustrated in Figure 5), where many unconnected networks of at least two nodes each are driven by the same input stimulus and evolve separately. Such a structure lends itself to a modular framework, where more of these subnetworks can be recruited to recover input stimuli further in the past. In this case, each block can be created independently as above and pieced together. The columns of the block diagonal matrix will still have unit norm and will be both orthogonal to vectors within their own block (since each of the diagonal submatrices is orthonormal) and orthogonal to all columns in other blocks (since there is no overlap in the nonzero indices).
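The block diagonal case can be sketched with the smallest possible blocks. The following illustration assumes 2-by-2 rotation blocks with angles drawn uniformly at random; the helper name `block_rotation_network` and the size M = 20 are hypothetical choices for this sketch, and larger orthogonal blocks could be substituted in the same way.

```python
import numpy as np

def block_rotation_network(M, rng):
    """Block diagonal orthogonal W built from independent 2x2 rotations.
    Each block has eigenvalues exp(+i*theta) and exp(-i*theta)."""
    if M % 2 != 0:
        raise ValueError("M must be even for 2x2 blocks")
    W = np.zeros((M, M))
    thetas = rng.uniform(0.0, 2.0 * np.pi, size=M // 2)
    for k, theta in enumerate(thetas):
        c, s = np.cos(theta), np.sin(theta)
        W[2 * k:2 * k + 2, 2 * k:2 * k + 2] = [[c, -s], [s, c]]
    return W, thetas

rng = np.random.default_rng(4)
M = 20  # illustrative network size (assumption)
W, thetas = block_rotation_network(M, rng)

print(np.allclose(W.T @ W, np.eye(M)))                 # columns are orthonormal
print(np.count_nonzero(W), "nonzero weights out of", M * M)
print(np.max(np.abs(np.abs(np.linalg.eigvals(W)) - 1.0)))
```

Each node is connected to only one other node, yet the matrix remains orthogonal with eigenvalues on the unit circle in complex-conjugate pairs.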
Similarly, a small-world topology can be achieved by taking a few of the nodes in every group of the block diagonal case and allowing connections to all other neurons (either unidirectional or bidirectional connections). To construct such a matrix, a block diagonal orthogonal matrix can be taken, a number of columns can be removed and replaced with full columns, and the resulting columns can be made orthonormal with respect to the remaining block-diagonal columns. In these cases, the same eigenvalue distribution and eigenvector properties hold as the fully connected case, resulting in the same RIP guarantees (and therefore the same recovery guarantees) demonstrated earlier. We note that this is only one approach to constructing a network with favorable STM capacity and not all networks with small-world properties will perform well.
Additionally, we note that as opposed to networks analyzed in prior work—in particular the work in Wallace et al. (2013) demonstrating that random networks with high connectivity have short STM—the average connectivity does not play a dominant role in our analysis. Specifically, it has been observed in spiking networks that higher network connectivity can reduce the STM capacity so that it scales only with log(M) (Wallace et al., 2013). However, in our ESN analysis, networks can have low connectivity (e.g., block-diagonal matrices—the extreme case of the block diagonal structure described above) or high connectivity (e.g., fully connected networks) and have the same performance.
4.2. Suboptimal Network Constructions.
Finally, we can also analyze some variations to the network structure assumed in this letter to see how much performance decreases. First, instead of the deterministic construction for z discussed in the earlier sections, there has also been interest in choosing z as independent and identically distributed (i.i.d.) random gaussian values (Ganguli et al., 2008; Ganguli & Sompolinsky, 2010). In this case, it is also possible to show that A satisfies the RIP (with respect to the basis and with the same RIP conditioning as before) by paying an extra log(N) penalty in the number of measurements. Specifically, we have also established the following theorem:
The proof of this theorem can be found in the appendix. The additional log factor in the bound in equation 4.1 reflects that a random feedforward vector may not optimally spread the input energy over the different eigendirections of the system. Thus, some nodes may see less energy than others, making them slightly less informative. Note that while this construction does perform worse than the optimal constructions from theorem 1, the STM capacity is still very favorable (i.e., a linear scaling in the sparsity level and logarithmic scaling in the signal length).
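The uneven spreading of energy can be seen in a small numerical sketch. The assumptions here are a random orthogonal W obtained by QR orthonormalization, a deterministic feedforward vector with equal weight on every eigenvector, and a gaussian feedforward vector whose entries have variance 1/M (the variance is an illustrative choice). The helper name `eigen_energy` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)
M = 64  # illustrative network size (assumption)

W, _ = np.linalg.qr(rng.standard_normal((M, M)))
_, U = np.linalg.eig(W)

z_even = (U @ np.ones(M)).real / np.sqrt(M)    # equal weight on each eigenvector
z_gauss = rng.standard_normal(M) / np.sqrt(M)  # i.i.d. gaussian entries

def eigen_energy(z):
    """Squared magnitude of z along each eigenvector, |U^{-1} z|^2."""
    return np.abs(np.linalg.solve(U, z.astype(complex))) ** 2

e_even = eigen_energy(z_even)
e_gauss = eigen_energy(z_gauss)

# Ratio of the largest to the smallest energy across eigendirections.
print(e_even.max() / e_even.min())
print(e_gauss.max() / e_gauss.min())
```

The first ratio is essentially one, while the second is typically much larger, meaning that some eigendirections receive far less input energy when z is drawn at random.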
We have seen that the tools of the CS literature can provide a way to quantify the STM capacity in linear networks using rigorous nonasymptotic recovery error bounds. Of particular note is that this approach leverages the nongaussianity of the input statistics to show STM capacities that are superlinear in the size of the network and depend linearly on the sparsity level of the input. This work provides a concrete theoretical understanding for the approach conjectured by Ganguli and Sompolinsky (2010), along with a generalization to arbitrary sparsity bases and infinitely long input sequences. This analysis also predicts that there exists an optimal recovery length that balances omission errors and recall mistakes.
In contrast to previous work on ESNs that leverages nonlinear network computations for computational power (Jaeger & Haas, 2004), this letter uses a linear network and nonlinear computations for signal recovery. Despite the nonlinearity of the recovery process, the fundamental results of the CS literature also guarantee that the recovery process is stable and robust. For example, with access to only a subset of nodes (due to failures or communication constraints), signal recovery generally degrades gracefully by still achieving the best possible approximation of the signal using fewer coefficients. Beyond signal recovery, we also note that the RIP can guarantee performance on many tasks (e.g., detection, classification) performed directly on the network states (Davenport, Boufounos, Wakin, & Baraniuk, 2010). Finally, we note that while this work addresses only the case where a single input is fed to the network, there may be networks of interest that have a number of input streams all feeding into the same network (with different feedforward vectors). We believe that the same tools used here can be used in the multi-input case, since the overall network state is still a linear function of the inputs.
A.1. Proof of RIP.
We show that the matrix satisfies the RIP under the conditions stated in equation 3.4 in order to prove theorem 1. We note that Rauhut (2010) shows that for the canonical basis (), the bounds for M can be tightened to using a more complex proof technique than we will employ here. For , the result in Rauhut (2010) represents an improvement of several log(N) factors when restricted to only the canonical basis for . We also note that the scaling constant C found in the general RIP definition of equation 2.3 is unity due to the scaling of z.
While the proof of theorem 1 is fairly technical, the procedure closely follows the proof of theorem 8.1 in Rauhut (2010) on subsampled discrete-time Fourier transform (DTFT) matrices. The basic approach is the same; the novelty in our presentation is the incorporation of the sparsity basis and considerations for a real-valued connectivity matrix W.
The proof of the theorem has two main steps. First, we establish a bound on the moments of the quantity of interest . Next, we use these moments to derive a tail bound on , which will lead directly to the RIP statement we seek. The following two lemmas from the literature are critical for these two steps.
Armed with this notation and these lemmas, we now prove theorem 1:
A.2. RIP with Gaussian Feedforward Vectors.
In this section, we extend the RIP analysis of section A.1 to the case when z is chosen to be a gaussian i.i.d. vector, as presented in theorem 2. Unfortunately, with the additional randomness in the feedforward vector, the same proof procedure as in theorem 1 cannot be used. In the proof of theorem 1, we showed that the random variable has pth moments that scale like (through lemma 1) for a range of p, suggesting that it has a subgaussian tail (i.e., ) for a range of deviations u. We then used this tail bound to bound the probability that exceeds a fixed conditioning . With gaussian uncertainties in the feedforward vector z, lemma 1 will not yield the required subgaussian tail but instead gives moment estimates that result in suboptimal scaling of M with respect to n. Therefore, we instead follow the proof procedure of theorem 16 from Tropp, Laska, Duarte, Romberg, and Baraniuk (2009), which yields the better measurement rate given in theorem 2.
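The step from moment growth to a subgaussian tail is a standard argument, sketched here in generic form. The random variable X, the constant C, and the square-root moment growth are placeholders chosen for illustration; they are not the specific quantities or constants of theorem 1 or lemma 1.

```latex
% Generic illustration: X, C, and the \sqrt{p} growth rate are placeholders,
% not the quantities or constants appearing in theorem 1.
\begin{align*}
\Pr\{|X| \ge u\}
  &\le \frac{\mathbb{E}|X|^p}{u^p}
     && \text{(Markov's inequality applied to } |X|^p\text{)} \\
  &\le \left(\frac{C\sqrt{p}}{u}\right)^{p}
     && \text{(assumed growth } (\mathbb{E}|X|^p)^{1/p} \le C\sqrt{p}\text{)} \\
  &= \exp\!\left(-\frac{u^2}{e^2 C^2}\right)
     && \text{(choosing } p = u^2/(e^2 C^2)\text{)}.
\end{align*}
```

The last step is valid only when the chosen p lies in the range where the moment bound holds, which is why a moment bound over a limited range of p yields a tail bound over a limited range of deviations u.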
Let us begin by recalling some notation from the proof of theorem 1 and introducing further notation that will simplify our exposition later. First, recall that we let $X_l^H$ be the lth row of . Thus, the lth row of our matrix of interest is where is the lth diagonal entry of the diagonal matrix . Whereas before, for any , here it will be a random variable. To understand the resulting distribution of , note that for the connectivity matrix W to be real, we need to assume that the second columns of U are complex conjugates of the first columns. Thus, we can write $U = [U_R \mid U_R] + j[U_I \mid -U_I]$, where . Because $U^H U = I$, we can deduce that $U_R^T U_I = 0$ and that the norms of the columns of both $U_R$ and $U_I$ are .4
With these matrices $U_R$ and $U_I$, we rewrite the random vector to illustrate its structure. Consider the matrix , a scaled unitary matrix (because we can check that ). Next, consider the random vector . Because is (scaled) unitary and z is composed of i.i.d. zero-mean gaussian random variables of variance , the entries of are also i.i.d. zero-mean gaussian random variables, but now with variance . Then, from our definition of U in terms of $U_R$ and $U_I$, for any , we have and for , we have . This clearly shows that each of the first entries of is made up of two i.i.d. random variables (one being the real component, the other imaginary) and that the other entries are just complex conjugates of the first . Because of this, for , is the sum of squares of two i.i.d. gaussian random variables.
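The conjugate-pair structure described above can be illustrated numerically. The following is a minimal sketch under stated assumptions: the eigenvector matrix is built from a random orthonormal basis Q (so that its real and imaginary parts have the block form given in the text), the eigenvalues are unit-modulus conjugate pairs, z has unit-variance entries, and U is taken to be exactly unitary rather than scaled. The names `Q`, `theta`, `z_tilde`, and `power` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 8  # number of nodes (even); illustrative value

# Pair up the columns of a random orthonormal basis to form real and imaginary parts.
Q, _ = np.linalg.qr(rng.standard_normal((M, M)))
U_R = Q[:, 0::2] / np.sqrt(2)
U_I = Q[:, 1::2] / np.sqrt(2)

# U = [U_R | U_R] + j [U_I | -U_I]: the second M/2 columns conjugate the first M/2.
U = np.hstack([U_R, U_R]) + 1j * np.hstack([U_I, -U_I])

# Eigenvalues in conjugate pairs (assumed unit modulus for this sketch).
theta = rng.uniform(0.1, np.pi - 0.1, size=M // 2)
d = np.concatenate([np.exp(1j * theta), np.exp(-1j * theta)])

# The resulting connectivity matrix is real up to rounding error.
W_complex = U @ np.diag(d) @ U.conj().T
W = W_complex.real

# Gaussian feedforward vector (unit variance assumed) expressed in the eigenbasis.
z = rng.standard_normal(M)
z_tilde = U.conj().T @ z
power = np.abs(z_tilde) ** 2

print("max |imag(W)|:", np.abs(W_complex.imag).max())
print("first half of z_tilde: ", z_tilde[: M // 2])
print("second half of z_tilde:", z_tilde[M // 2 :])
```

Under these assumptions, each of the first M/2 entries of `z_tilde` has a real part `U_R.T @ z` and an imaginary part `-U_I.T @ z`, so each entry of `power` is a sum of squares of two gaussian projections, and the last M/2 entries are the complex conjugates of the first M/2.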
Before moving on to the proof, we first present a lemma regarding the random sequence $|z_l|^2$ that will be useful in the sequel.
Armed with this lemma, we can now turn our attention to the main proof. As stated earlier, this follows essentially the same form as Tropp et al. (2009), with the primary difference being the inclusion of the results from lemma 3. As before, because with , we need only bound the tail probability . This proof differs from that in section A.1 in that here, we first show that is small when M is large enough and then show that $Z_1$ does not differ much from with high probability.
In this section, we show that is small. This follows from lemma 1 and equation A.4. To be precise, the remainder of this section proves the following:
Choose any . If , then .
A.2.2. Tail Probability.
The goal of this section is to prove:
Pick any and suppose . Suppose . Then .
A.3. Derivation of Recovery Bound for Infinite Length Inputs.
In this section, we derive the bound in equation 3.7. The approach we take is to bound the individual components of equation 2.4. The noise term due to noise in the inputs is unaffected. We bound the noise term due to the unrecovered signal (the first term in equation 2.4) by the component of the input history that lies beyond the attempted recovery length, and we bound the signal approximation term (the second term in equation 2.4) by the quality of the signal recovery possible within the attempted recovery length. In this way, we can observe how different properties of the system and input sequence affect signal recovery.
We are grateful to J. Romberg for valuable discussions related to this work. This work was partially supported by NSF grant CCF-0830456 and DSO National Laboratories, Singapore.
The notation means that for some constant C. For clarity, we do not keep track of the constants in our proofs. See Rauhut (2010) for specific values of the constants.
Small-world structures are typically taken to be networks where small groups of neurons are densely connected among themselves, yet sparse connections to other groups reduce the maximum distance between any two nodes.
The pdf of a $\chi^2$ random variable with q degrees of freedom is given by . Therefore, its tail probability can be obtained by integration: .
A subexponential random variable is a random variable whose tail probability is bounded by $\exp(-Cu)$ for some constant C. Thus, a $\chi^2$ random variable is a specific instance of a subexponential random variable.
We remark that this bound gives a worse estimate for the expected value than that calculated before because of the crude bound given by equation A.7.
A random variable X is symmetric if X and −X have the same distribution.