In short-term memory networks, transient stimuli are represented by patterns of neural activity that persist long after stimulus offset. Here, we compare the performance of two prominent classes of memory networks, feedback-based attractor networks and feedforward networks, in conveying information about the amplitude of a briefly presented stimulus in the presence of gaussian noise. Using Fisher information as a metric of memory performance, we find that the optimal form of network architecture depends strongly on assumptions about the forms of nonlinearities in the network. For purely linear networks, we find that feedforward networks outperform attractor networks because noise is continually removed from feedforward networks when signals exit the network; as a result, feedforward networks can amplify signals they receive faster than noise accumulates over time. By contrast, attractor networks must operate in a signal-attenuating regime to avoid the buildup of noise. However, if the amplification of signals is limited by a finite dynamic range of neuronal responses or if noise is reset at the time of signal arrival, as suggested by recent experiments, we find that attractor networks can outperform feedforward ones. Under a simple model in which neurons have a finite dynamic range, we find that the optimal attractor networks are forgetful if there is no mechanism for noise reduction with signal arrival but nonforgetful (perfect integrators) in the presence of a strong reset mechanism. Furthermore, we find that the maximal Fisher information for the feedforward and attractor networks exhibits power law decay as a function of time and scales linearly with the number of neurons. These results highlight prominent factors that lead to trade-offs in the memory performance of networks with different architectures and constraints, and suggest conditions under which attractor or feedforward networks may be best suited to storing information about previous stimuli.
Short-term memory is thought to be maintained by patterns of neural activity that are initiated by a memorized stimulus and persist long after its offset. Because memory periods are relatively long compared to biophysical time constants of individual neurons, it has been suggested that network interactions can extend the time over which neural activities are sustained (Brody, Romo, & Kepecs, 2003; Durstewitz, Seamans, & Sejnowski, 2000; Major & Tank, 2004; Wang, 2001). However, the form of such interactions is currently unknown in most systems, and experimental and theoretical work has suggested a range of different network architectures that could subserve short-term memory.
A critical factor for robustly maintaining the memory of a stimulus is being able to resist the effects of noise that can accumulate over time. This is a particularly acute problem for the representation of analog values in memory. In many memory-storing paradigms during which neurophysiological recordings have been obtained (for example, see Aksay, Baker, Seung, & Tank, 2000; Goldman-Rakic, 1995; Robinson, 1989; Romo, Brody, Hernandez, & Lemus, 1999; Sharp, Blair, & Cho, 2001; Taube & Bassett, 2003), neurons have been shown to exhibit what appear to be continuously varying response levels that change in a graded manner with the stored stimulus value. With such analog representations, any noise-induced change in neural activity has the potential to affect the encoding of the stimulus. Thus, such networks are faced with apparently conflicting demands. On the one hand, the networks must be able to maintain the value of a signal in memory for long durations. On the other hand, the mechanism for performing this maintenance must keep the signal from being contaminated by excessive buildup of noise.
The most common models for how activity evoked by a transient stimulus is maintained over time are the so-called attractor networks. In attractor networks, individual neurons do not intrinsically maintain activity over long timescales and thus cannot in isolation store a memory. Instead, activity is maintained by positive feedback whereby neurons that are connected by excitatory or disinhibitory positive feedback loops maintain one another’s activity following the offset of the external drive provided by the stimulus. In such models, the network structure determines which patterns of activity can be sustained by positive feedback, and typically only a small, specially designed set of patterns can be maintained. These maintained patterns of activity are called attractors of the network dynamics, because perturbing the dynamics away from such patterns leads to a rapid return to the attractor. A number of models of analog memory storage have utilized attractor dynamics (for review, see Brody et al., 2003; Durstewitz et al., 2000; Major & Tank, 2004; Wang, 2001), and recent analyses of neocortical data provide suggestive evidence for such attractors in tasks involving a working memory component (Ganguli, Bisley et al., 2008).
Recently, both theoretical models (Ganguli, Huh, & Sompolinsky, 2008; Goldman, 2009; Mauk & Buonomano, 2004; Rabinovich, Huerta, & Laurent, 2008; Savin & Triesch, 2009; White, Lee, & Sompolinsky, 2004) and experimental observations (MacDonald, Lepage, Eden, & Eichenbaum, 2011; Pastalkova, Itskov, Amarasingham, & Buzsaki, 2008) have suggested instead that purely feedforward networks can store the memory of a stimulus in their transient dynamics. Experimentally, a feedforward progression of neuronal activity has been reported in hippocampal neurons during memory delay periods (MacDonald et al., 2011; Pastalkova et al., 2008), and theoretical work suggests mechanistically how an analog signal can be represented over time by activity that slowly propagates through a feedforward chain of neurons or, in recurrent networks, through a sequence of distinct and nonoverlapping patterns of network activity (Ganguli, Huh, et al., 2008; Goldman, 2009; White et al., 2004).
Here, we compare the performance of attractor and feedforward models in the presence of noise. Our work builds on the information-theoretic frameworks for quantifying memory performance of White et al. (2004) and Ganguli, Huh et al. (2008), who considered the performance of linear neural networks with discrete dynamics (i.e., defined with difference equations so that time is measured in discrete units that facilitate analytic calculation). We measure memory performance by calculating the Fisher information that is maintained about a transient stimulus at a time T into the future. Unlike previous work in neuronal systems (but as in the fluid mechanics example of Ganguli, Huh et al., 2008), the networks we study are defined by differential equations that consider the more realistic situation of continuous time dynamics. However, to facilitate analytic calculations, we also, when appropriate, compare to networks constructed with discrete dynamics.
The structure of this letter is as follows. First, in analogy to previous studies of linear networks with discrete dynamics, we analytically calculate the memory-storing performance of linear, continuous-time networks and determine the properties that optimize the Fisher information storage capacity of both attractor and feedforward networks. We then consider the effects of two nonlinearities suggested by neuronal recording data. First, we consider the effects on memory performance of reset mechanisms that, for example, remove noise from the system near the time of stimulus arrival (Churchland et al., 2010; Rajan, Abbott, & Sompolinsky, 2010; Weber & Daroff, 1972) or keep the network from entering the memory-storing state until the time of stimulus onset (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001). Second, we consider the effect of limiting neurons to having a finite range of firing rates with which they can encode a memorized stimulus.
2. Material and Methods
In this letter, we compare the performance of attractor and feedforward network models in maintaining the memory of a brief, analog-valued stimulus for a fixed or known delay period T in the presence of noise. Here, we define the dynamics of each network model, as well as the Fisher information used to quantify the memory performance.
2.1. Linear Network Models.
In equation 2.7, n0 replaces t0 in equation 2.4 and the power of τ in the denominator is reduced by one relative to that in equation 2.4 because the differential dt’ in equation 2.4 is set equal to τ in discrete dynamics and therefore cancels one factor of τ in the denominator.
The evolution of the network activity under linear dynamics can be computed by decomposing the activity into linearly independent modes. Here, we consider two such decompositions and use them to characterize the dynamics of the attractor and feedforward network models in the absence of noise.
In attractor networks, positive feedback sustains the activity evoked by the transient stimulus, for example, due to mutual excitatory connections between neurons that form a positive feedback loop (see Figure 1B). To identify such positive feedback, the eigenvector decomposition is commonly used to decompose the coupled networks into noninteracting modes of activity that can be considered independently. In the eigenvector decomposition, the pattern of neural activity at any given time is defined in terms of the eigenvectors êi and corresponding eigenvalues λi of the connectivity matrix W, which satisfy the equation W êi = λi êi for i = 1 to N. Geometrically, the eigenvector decomposition corresponds to a change of basis into a new coordinate system whose axes are defined by the eigenvectors êi. In this new basis, the connectivity matrix is represented by a diagonal matrix Λ having the eigenvalues as diagonal entries, such that W = EΛE−1, where the column vectors of E are the eigenvectors êi. When the eigenvectors are orthogonal to each other, W is known as a normal matrix. In this case, the activity in each mode is equal to the Cartesian projection of the network activity onto that mode, and there is no overlap among the activities in the different modes.
Activity in any eigenmode exhibits exponential growth or decay with a time constant τieff = τ/|1 − Re(λi)|. If λi = 1, activity is sustained without decay, and the mode can integrate any input perfectly. If Re (λi) < 1, activity decays with a time constant that decreases as λi decreases, and for Re (λi)>1, activity grows exponentially. Attractor networks are defined by having a small number of modes (the attractor modes) with λi’s much larger than the other eigenvalues. For such networks, activity in all except the attractor modes decays exponentially quickly to zero so that after a transient period, the only remaining activity is along these modes. The resulting subspace spanned by these modes is then called an attractor of the network dynamics. We illustrate a simple attractor network consisting of two symmetric excitatory neurons in Figure 1B. In such a network, the noninteracting modes correspond to the sum and difference of the activities and are called the common and difference modes (see Figure 1C). In the common mode, which is proportional to the average activity in the network, activity evoked by a transient input is maintained by mutual excitation of the neurons. By contrast, because the symmetric mutual excitation tends to make the neurons fire at equal rates, the difference mode is sharply attenuated by the network interactions, leading to rapid decay of any initial activity in this mode. Thus, after a transient period, only the activity along the common mode remains and the common mode is called an attractor of the network dynamics. Generally if there exist multiple modes with strong positive feedback, the signal can be stored in any of these modes, and the network is called a d-dimensional attractor network, where d denotes the number of modes with large λi’s (see Figure 1D). In the special case when d equals one or two, the attractor is called a line or plane attractor, respectively.
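A minimal numeric sketch of the two-neuron example above makes the separation of time constants concrete. The feedback strength w, intrinsic time constant τ, and initial state below are illustrative assumptions, not values from the text:

```python
import numpy as np

# Hypothetical two-neuron network with symmetric excitation of strength w
# (a sketch of the network in Figure 1B; parameter values are illustrative).
tau, w, dt = 0.1, 0.95, 1e-4          # intrinsic time constant (s), feedback, step
W = np.array([[0.0, w], [w, 0.0]])

# Eigenmodes: common mode (1,1)/sqrt(2) with eigenvalue +w,
# difference mode (1,-1)/sqrt(2) with eigenvalue -w.
lams, vecs = np.linalg.eigh(W)         # eigenvalues in ascending order: [-w, +w]
tau_eff = tau / np.abs(1.0 - lams)     # tau_i_eff = tau / |1 - Re(lambda_i)|

# Simulate tau * dr/dt = -r + W r from an arbitrary initial state.
r = np.array([1.0, 0.2])
common, diff = [], []
for _ in range(int(0.5 / dt)):
    r = r + dt / tau * (-r + W @ r)
    common.append((r[0] + r[1]) / np.sqrt(2))
    diff.append((r[0] - r[1]) / np.sqrt(2))

# The difference mode decays with tau/(1+w); the common mode decays with
# tau/(1-w), roughly 40x more slowly here, so only the common mode survives.
print(tau_eff)   # [tau/(1+w), tau/(1-w)] = [~0.051, 2.0]
```

After half a second, activity along the difference mode is essentially gone while the common mode retains most of its initial amplitude, illustrating why the common mode acts as the attractor.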
Feedforward networks use a different mechanism for storing a signal. Rather than maintaining a stable pattern of activity through positive feedback, as in the attractor networks, the signal is carried by different neurons at different times. For example, in feedforward networks composed of neurons connected as a chain, the memory can be maintained as long as activity continues to propagate along the chain, with the activity of each neuron passed on to the next neuron and filtered at each stage (see Figures 1E and 1G). The feedforward networks cannot be decomposed into a full set of N eigenmodes because, by the definition of an eigenmode, activity that starts in an eigenmode remains in that mode (see Figure 1F, left). By contrast, the fundamental characteristic of the feedforward networks is that the activities of all neurons except the final one are passed on to the next neurons instead of being sustained.
The Schur decomposition is more suitable for describing feedforward networks (Ganguli, Huh et al., 2008; Goldman, 2009; Murphy & Miller, 2009). Rather than diagonalizing the matrix W, the Schur decomposition changes to a basis in which W is triangular; that is, it decomposes any connectivity matrix into orthogonal modes that can have both feedforward and self-connections, but no feedback connections from later-stage neurons to earlier neurons. Formally, the Schur decomposition transforms the matrix W into a lower triangular matrix S such that W = USU⊤, where the columns of U are the orthogonal modes, called Schur modes, and the values of S along the diagonal are the eigenvalues of W (equivalently, S can be made into an upper triangular matrix; Horn & Johnson, 1985). As in the eigenvector decomposition for normal networks, the diagonal entries of S give the feedback of the Schur modes onto themselves (for normal W, the Schur and eigenvector decompositions are identical). If W is nonnormal, then the Schur decomposition will contain nonzero lower triangular entries that correspond to feedforward connections between the Schur modes. In this case, activity may be transiently amplified as it propagates through the feedforward connections between modes, even when all the eigenmodes are decaying (i.e., when all λi < 1; Trefethen & Embree, 2005).
Here, we consider two types of feedforward networks. First, we consider literally feedforward networks for which the connectivity matrix itself is lower triangular with zeros along the diagonal, so that all connections are feedforward. Thus, the Schur mode patterns of activity correspond to individual neurons (see Figure 1F, right). In particular, we consider a simple chain-like structure whose connectivity matrix is of the form Wij = α > 0 for all i and j such that i = j + 1 and zero otherwise. For networks with many neurons arranged in a chain, the propagation of activity can continue for a duration proportional to the chain length, with each neuron’s activity peaking at a different time. With this diversity of temporal profiles of neural activities, the network can generate persistent activity with a simple readout that linearly sums the activities of the different neurons with appropriate weights (see Figure 1G; Goldman, 2009).
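A short linear simulation sketches this propagation (chain length N, connection strength α, and time constant τ below are illustrative choices, not values from the text). Each neuron's activity peaks roughly one intrinsic time constant after its predecessor:

```python
import numpy as np

# Sketch of a literally feedforward chain (cf. Figures 1E and 1G):
# W_ij = alpha for i = j + 1, zero otherwise.
N, alpha, tau, dt = 10, 1.0, 0.1, 1e-4
W = np.diag(alpha * np.ones(N - 1), k=-1)   # subdiagonal: stage j drives stage j+1

r = np.zeros(N); r[0] = 1.0                 # transient stimulus loads stage 1
steps = int(3.0 / dt)
traj = np.empty((steps, N))
for t in range(steps):
    r = r + dt / tau * (-r + W @ r)         # tau * dr/dt = -r + W r
    traj[t] = r

# The impulse response of stage i is proportional to (t/tau)^(i-1) exp(-t/tau),
# so stage i peaks near t = (i-1)*tau, giving the staggered temporal profiles
# that a weighted linear readout can sum into persistent activity.
peak_times = traj.argmax(axis=0) * dt
print(peak_times)   # approximately [0, tau, 2*tau, ..., 9*tau]
```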
Second, we consider recurrent matrices whose Schur decomposition has a feedforward (lower triangular, with zeros along the diagonal) structure; we call these functionally feedforward networks because the activity patterns defined by the Schur modes, rather than the neuronal activity itself, propagate in a feedforward manner (Ganguli, Huh et al., 2008; Goldman, 2009; Murphy & Miller, 2009). As in the case of the literally feedforward networks, we consider simple functionally feedforward chains of the form Mij = α>0 for all i and j such that i = j + 1 and zero otherwise. An example of a functionally feedforward chain is shown in Figures 1H and 1I, in which a two-neuron network with one excitatory and one inhibitory neuron is decomposed into common and difference modes by the Schur decomposition. The modes make a feedforward chain such that the activity of the difference mode drives the activity in the common mode (see Figure 1I). More neurons can form a longer functionally feedforward chain, allowing progression of activity patterns that persist for longer periods of time (see Figure 1J).
2.2. Fisher Information Measure for Memory Performance.
To achieve good memory performance, a network must maintain a memory of the stimulus while resisting the excessive accumulation of noise. The ability to achieve this can be quantified as the ratio between the amount of signal and noise stored in the system at a given time following stimulus offset. Here we use a closely related measure, the Fisher information IF, which quantifies the amount of information carried about a signal by the distribution of neural activities and which, for linear networks and gaussian white noise, is shown below to represent a ratio between the factor by which the network amplifies signals and the amount of noise accumulated by the network (Ganguli, Huh et al., 2008).
To get an intuition for this measure, we show in Figure 2A how to compute the Fisher information for an example of the activity of a single neuron (or a single eigenmode of an attractor network) in the presence of noise. The neuron (or mode) must distinguish between different pulse-like stimuli of amplitudes s and s + δs that it receives at time 0. Making this discrimination more difficult, noise is presented to the neuron (or mode) continually in time (see Figure 2A). We model the transient stimulus as a delta function δ(t) so that the stimulus causes a jump in the mean neural activity at time 0, with size proportional to the stimulus strength s or s + δs (see Figures 2B and 2C, thick lines). Due to the noise, each presentation of the stimulus leads to a different trajectory so that there is trial-to-trial variability in the response (the black and gray noisy trajectories in Figure 2B).
The memory of the stimulus is carried by the distribution of the firing activities of the neurons. In order to perform well in maintaining the distinction among stimuli, the distributions for different stimuli must remain well separated: the more the noise makes the two distributions overlap, the greater will be the corruption of the stored memory. In linear networks, the mean activities of the neurons (gray circle and black asterisk in Figure 2C) carry the information about the presented stimulus, and the signal gain is measured as the difference in the mean activities δ〈r〉 divided (i.e., normalized) by the separation δs of the signals to be discriminated (see Figure 2D). The noise in the neural activities is given by the spread in the firing rate distribution. The Fisher information IF conveyed by the network is defined as the ratio of the square of the signal gain to the noise variance at time T. Thus, either a wider separation between the means (high signal gain) or narrower distributions about these means (small noise variance) leads to higher Fisher information.
Here, d〈r〉/ds is the derivative of the mean activity with respect to s, called the signal gain. Thus, equation 2.9 shows that IF is of the form of a (matrix) ratio between the signal gain squared and the noise covariance. Note that σ2IF(t) is independent of the stimulus strength s and the injected noise level σ2 and depends only on the properties of the network connectivity. Therefore, in the following, we calculate σ2IF instead of IF and often refer to σ2IF as the Fisher information for brevity (see Figure 2D). Note that this quantity has the same units as σ2, since IF is unitless (assuming that s is unitless).
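As a sanity check on this signal-to-noise interpretation, a Monte Carlo sketch of a single discrete-time leaky mode (all parameters below are illustrative assumptions) recovers the analytic value of σ2IF for that toy mode. For a mode updated as r[t+1] = a·r[t] + s·δ[t,0] + noise, the mean response at time T is s·a^T and the stationary noise variance is σ2/(1 − a2), so σ2IF(T) = a^(2T)(1 − a2):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy discrete-time leaky mode with feedback a; noise is present at all
# times, so it is allowed to reach stationarity before the pulse arrives.
a, sigma, T, trials = 0.9, 1.0, 10, 500_000
burn = 100                                  # pre-stimulus noise accumulation

def response(s):
    r = np.zeros(trials)
    for _ in range(burn):                   # noise builds up before stimulus onset
        r = a * r + sigma * rng.standard_normal(trials)
    r = r + s                               # pulse of amplitude s at time 0
    for _ in range(T):
        r = a * r + sigma * rng.standard_normal(trials)
    return r

ds = 0.5
gain = (response(1.0 + ds).mean() - response(1.0).mean()) / ds   # ~ a^T
var = response(1.0).var()                                        # ~ sigma^2/(1-a^2)
empirical = sigma**2 * gain**2 / var
analytic = a**(2 * T) * (1 - a**2)
print(empirical, analytic)    # both near 0.023 for these parameters
```

Either a larger signal gain (numerator) or a smaller noise variance (denominator) increases the estimate, matching the intuition in the text.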
In equation 2.9, the readout of the network activities is not specified. In particular, in a linear system with gaussian noise, it can be shown that σ2IF is greater than or equal to the (signal gain)-to-(noise gain) ratio in any linear readout of the network (see section A.1 in the appendix). The equality holds when the linear readout is along the optimal direction (see the optimal linear estimator in population decoding, as in Salinas & Abbott, 1994; Sompolinsky, Yoon, Kang, & Shamir, 2001). Note that in the feedforward networks, the optimal linear readout will generally vary over time, reflecting that information about the signal propagates from earlier to later stages in the feedforward chain.
3. Results

Here we compare how attractor and attractorless (literally feedforward or functionally feedforward) models perform in storing the amplitude of a brief stimulus. The memory performance is measured by the Fisher information, a measure of how much the network amplifies the signal corresponding to the stimulus compared to how much it amplifies ambient noise (see section 2.2). In section 3.1, we consider purely linear networks that allow us to isolate how the structures of attractor and feedforward networks influence memory performance in the absence of nonlinear influences. Then we consider the effect of two biologically observed nonlinearities. In section 3.2, we consider a condition that we term a reset nonlinearity under which either noise does not begin to accumulate strongly until the memory period commences (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001) or in which noise that is present before the stimulus arrival is “reset” by the appearance of the stimulus (Churchland et al., 2010; Rajan et al., 2010; Weber & Daroff, 1972). In section 3.3, we consider the effects of restricting neurons to having a finite range of firing rates with which they encode a stimulus.
3.1. Linear Networks.
In this section, we compare the memory performance of attractor and feedforward network models with continuous linear dynamics. This allows us to focus on how signal and noise are propagated through the network as a function of the structure of the network connectivity without the complicating influence of nonlinearities. We first compute the Fisher information in line attractor networks and then extend our results to higher-dimensional attractor networks. Then we compute σ2IF for feedforward networks and compare their performance to that of the attractor networks.
3.1.1. One-Dimensional Attractor Networks.
We first consider the line attractor networks, which are defined by having only a single stationary or slowly decaying (or possibly growing) pattern of activity that defines the attracting mode (see Figures 1B and 1C). When stimulated by a brief stimulus, both signal and accumulated noise in line attractor networks quickly converge to this attractor. As a result, for times after the transient responses of the nonattractor modes decay away, all information conveyed by the line attractor networks is contained in the attractor mode, and we can closely approximate the Fisher information by the (signal gain)-to-(noise gain) ratio in this mode (see Figure 2).
The memory-storing performance of the line attractor models reflects a balance between two factors. First, the network must be able to sustain the signal for the full duration of the memory period. As shown in Figure 1C, this is accomplished in attractor networks by having sufficiently large positive feedback in the attracting mode. Second, the network must not accumulate too much noise over time. Since noise is assumed to be presented at all times, including prior to stimulus onset, this implies that inputs to the network should not be sustained indefinitely or noise will accumulate without bound. Thus, there is a trade-off in attractor networks between sustaining signals for sufficiently long to maintain signal strength and having enough decay of signals that noise does not accumulate excessively.
To quantify this trade-off between sustaining the signal and accumulating noise, we examine the memory performance of the attractor network in terms of the amount of positive feedback α in the attracting mode, where α is the eigenvalue associated with the attracting eigenmode and the time constant of decay (or growth, for Re(α)>1) of activity in this mode is given by τeff = τ/|1 − Re(α)|. When the feedback is too weak (see Figure 3A), the signal decays quickly to zero and any memory of the initial stimulus amplitude s is forgotten. Thus, σ2IF is close to zero in this case (see Figure 3D). Increasing the recurrent feedback leads to slower decay of signals corresponding to the memorized stimulus, and when the feedback is tuned to be large enough to offset the intrinsic leak of the neurons (α ≈ 1), the mean responses to different amplitude stimuli stay well separated (black and gray thick traces in Figure 3C). However, because noise along the attractor mode is subject to the same dynamics as signals along this mode, noise also accumulates without decay. Because noise is present at all times before the stimulus arrives, this leads to an extremely large variance in the responses (see Figure 3C; note the wide spread of trajectories even before time 0). For networks that are either nonforgetful (α = 1) or amplifying (α>1), this noise becomes infinite in magnitude so that the Fisher information is zero (see Figure 3D). Thus, there is an optimal amount of feedback in linear attractor networks, and a corresponding optimal time constant of decay of network activity, that balances having a long time constant so that the signal does not decay and having a short time constant so that noise does not accumulate too much (see Figures 3B and 3D).
This example shows that the attractor network performance is benefited by having an imperfect memory-holding mechanism. To find the optimal forgetting time constant of network activity decay, we analytically calculated σ2IF for the line attractor networks (see sections A.2.1 and A.2.2). We find that σ2IF achieves its maximum when the decay time constant of network activity is τeff,opt = 2T, where T is the duration over which the signal is to be stored. Thus, the memory duration T sets the scale for the optimal network decay time. When activity decays much faster than the memory duration T, the signal decays away before the end of the memory period. When activity decays much more slowly than T or grows exponentially, noise accumulation overwhelms the signal.
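This optimum can be recovered with a short numeric sweep. The sketch below assumes, as for an Ornstein-Uhlenbeck process, that the signal gain at time T decays as exp(−T/τeff) while the stationary noise variance along the attractor mode grows linearly with τeff, so that σ2IF is proportional to exp(−2T/τeff)/τeff:

```python
import numpy as np

# Numeric check of the optimal decay time constant for a continuous line
# attractor mode: signal gain^2 decays as exp(-2T/tau_eff) while stationary
# noise variance grows as tau_eff, so (up to constants)
# sigma^2 * I_F  propto  exp(-2*T/tau_eff) / tau_eff.
T = 1.0
tau_eff = np.linspace(0.1, 10.0, 100_000)
info = np.exp(-2 * T / tau_eff) / tau_eff
tau_opt = tau_eff[np.argmax(info)]
print(tau_opt)    # ~2.0 = 2T, matching the analytic optimum
```

Differentiating exp(−2T/τ)/τ with respect to τ and setting the result to zero gives τ = 2T directly, consistent with the sweep.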
3.1.2. Higher-Dimensional Attractor Networks.
We next extend the result for line attractor networks to higher-dimensional attractor networks having many modes with slow decay of activity. In line attractor networks, the input is stored along the one-dimensional attractor. In higher-dimensional attractor networks, signal and noise can accumulate in any direction spanned by the multiple attractor modes. We show that these extra dimensions do not affect σ2IF of networks with optimally arranged inputs and readout. However, for imperfectly arranged inputs or outputs, we find that the memory performance of the line attractor networks is sensitive to the input direction but insensitive to the readout direction. In contrast, higher-dimensional attractor networks are more sensitive to the readout but less sensitive to the input direction.
For illustration, in Figure 4 we compare the line attractor networks to plane attractor networks defined by having two slowly decaying modes of activity. To convey geometrical intuition, we plot only the two modes with the largest eigenvalues and assume all the eigenmodes are orthogonal. For simplicity, we assume that each attracting mode has the same eigenvalue, so that all directions in the attracting plane have equal decay times and the signal can be stored equally well in any direction on the plane. Likewise, noise accumulates in the same manner in any direction on the plane. Then, when gaussian white noise is presented equally to all neurons and thus to all orthogonal modes (see Figure 4A), the resulting noise at any time is also equivalent in all directions of the plane (see Figure 4C). By contrast, in the line attractor networks, noise along directions other than the line attractor is filtered out so that noise along the attracting mode has a larger variance than that along the other modes (see Figure 4B).
To maximize the strength of the signal carried by attractor networks, the inputs should be arranged so that none of the input is lost due to being sent into decaying modes whose amplitudes quickly fall to zero. In line attractor networks, this corresponds to aligning the input direction along the direction of the attracting eigenmode (see Figure 4E). When there is more than one attracting mode, as in the plane attractor networks, the optimal input direction can be along any linear combination of these attracting modes (see Figure 4F). The Fisher information is proportional to the square of the projection of the signal onto the attracting modes. Thus, the Fisher information is identical and equal to its maximal value for both the line and higher-dimensional attractors as long as the input is aligned along the subspace defined by the attracting eigenmodes (see Figure 4D, θ = 0). If the input direction is not aligned along the attracting modes, a portion of the signal is lost to the decaying modes and σ2IF decreases from this maximum value. In the line attractor, there exists only a single attracting mode storing the signal, so σ2IF decreases as the input direction deviates from the attracting mode (see Figure 4D, solid curve). On the other hand, σ2IF stays the same in the plane attractor networks for any input direction in the attracting plane (see Figure 4D, dashed line). Note that σ2IF in the plane attractor networks would decrease as in the line attractor networks if the input direction were to deviate from the attracting plane (not shown). However, because the dimension of the plane attractor is higher than that of the line attractor, the alignment of the input vector is less constrained in the plane attractor (or, more generally, higher-dimensional attractor) networks.
Next we consider the arrangement of the readout for the maximal memory performance. As discussed in section 2.2, the Fisher information measure implicitly assumes an optimal readout because σ2IF is equal to the (signal gain)-to-(noise gain) ratio along the optimal linear readout direction. However, for nonoptimal readout, the memory performance may be less than σ2IF, and the sensitivity to the direction of the readout may differ between the line and plane attractor networks (see section A.2.4).
In the line attractor networks, mistuning of the readout does not have much effect on the (signal gain)-to-(noise gain) ratio because the signal and noise accumulate along the one-dimensional attracting mode and their ratio is maintained for the projection onto any readout direction (see Figure 4H). Thus, the memory performance of line attractor networks remains near the maximal σ2IF even when the readout direction is well away from the attractor mode (see Figure 4G, solid line). Only when the readout direction becomes close to orthogonal to the attractor direction, so that the signal becomes smaller than or comparable to the small but finite noise accumulated in the nonattracting modes, does the memory performance fall off by a significant amount. By contrast, in plane attractor networks, the memory performance is far more sensitive to the direction of the readout (see Figure 4G, dashed line). Because noise develops along all attractor dimensions but the signal lies only along the direction defined by the input, projections that are not along the input direction pick up additional noise and lower the (signal gain)-to-(noise gain) ratio (see Figure 4I). Hence, optimal performance requires a more precise readout mechanism in higher-dimensional attractor networks.
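This contrast in readout sensitivity can be sketched numerically. The gain g and the variances V (along the attractor/input direction) and v (along the other modes) below are illustrative assumptions chosen so that v ≪ V:

```python
import numpy as np

# Readout sensitivity sketch (cf. Figure 4G): signal of gain g is stored
# along mode e1, and the readout direction makes angle theta with e1.
# Line attractor: noise variance V along e1 but only v << V along e2.
# Plane attractor: noise variance V in every direction of the plane.
g, V, v = 1.0, 1.0, 0.01
theta = np.linspace(0, 0.98 * np.pi / 2, 200)

snr_line = (g * np.cos(theta)) ** 2 / (V * np.cos(theta) ** 2 + v * np.sin(theta) ** 2)
snr_plane = (g * np.cos(theta)) ** 2 / V

# The line attractor's ratio stays near g^2/V until the readout is nearly
# orthogonal to the attractor, while the plane attractor's ratio falls off
# as cos(theta)^2 as mistuned projections pick up extra noise.
print(snr_line[100] / snr_line[0], snr_plane[100] / snr_plane[0])
```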
In summary, both the line attractor and higher-dimensional attractor networks were shown to have the same maximal memory performance, as characterized by the Fisher information. For the line attractor networks, memory performance was highly insensitive to the readout direction but more sensitive to the direction of the input. Conversely, for higher-dimensional attractor networks, the memory performance was highly sensitive to the readout direction but less sensitive to the direction of the inputs. These results suggest that line attractor networks might be more useful if the stored memory needed to be used by multiple networks that each projected activity along a different direction. By contrast, higher-dimensional attractors might be more useful in storing memories that can arrive from multiple input networks that each sends in different input patterns encoding the stored variable.
3.1.3. Feedforward Networks.
Next, we compute the memory performance of linear feedforward networks and focus on networks with a chain-like structure that were proposed recently as a neural substrate for short-term memory storage (Ganguli, Huh et al., 2008; Goldman, 2009; White et al., 2004). A critical difference between feedforward and attractor networks is that, unlike in attractor networks, activity in feedforward networks eventually exits out the end of the network. Thus, the memory of any input is lost after some finite time in feedforward networks. Although this finite time of signal propagation might at first seem to be disadvantageous, finite memory duration can be advantageous because it prevents noise from building up in the network (Ganguli, Huh et al., 2008). These relative advantages and disadvantages are quantified below, where we compute the Fisher information conveyed by linear feedforward networks and compare their performance to that of attractor networks.
Here, we consider simple feedforward chains having uniform strength α of the feedforward connections and compute the Fisher information as a function of α. When the strength of the connections is weak, the activity decays before it reaches the last stage and is close to zero (see Figures 5A and 5C). On the other hand, if the feedforward connections are stronger, the signal decays more slowly (for α < 1) or can grow exponentially (for α > 1). Noise entering the system at any given time similarly gets amplified as it passes down the chain. However, unlike in the attractor networks, accumulation of noise in the feedforward networks is limited because noise exits the system when it reaches the end of the chain. Moreover, if signals are amplified along the feedforward chain, then inputs entering the first stage of the network get amplified more than inputs entering later stages. Thus, by arranging to have the signal enter the network at the first stage, the network can make the signal at time T arbitrarily larger than the noise entering at later times by using strong connection strengths α that allow the signal to be amplified faster than noise enters the system (see Figure 5B). This implies that the Fisher information can increase indefinitely with increasing α, so that linear feedforward networks could in principle convey signals to arbitrary precision (see Figure 5C; for how this result changes when neurons have a finite dynamic range, see section 3.3). This result is consistent with that of Ganguli, Huh et al. (2008), who also showed a monotonic increase of the Fisher information with α in models with discrete dynamics.
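The intuition that amplification can outpace noise entry is easy to check in a minimal discrete-time sketch (a simplification of the continuous dynamics analyzed here; `chain_fisher`, `alpha`, `T`, and `sigma` are hypothetical names, not the paper's notation). The signal picks up a factor of α at every stage it traverses, while noise injected k steps before readout is amplified only k times:

```python
def chain_fisher(alpha, T, sigma=1.0):
    """Fisher information at the stage the signal reaches after T steps of a
    discrete feedforward chain: the signal enters stage 0 at t = 0, noise of
    standard deviation sigma enters every stage at every step, and activity
    moves one stage per step with gain alpha."""
    signal_gain_sq = alpha ** (2 * T)          # signal amplified T times
    # Noise injected k steps before readout has been amplified k times.
    noise_var = sigma ** 2 * sum(alpha ** (2 * k) for k in range(T + 1))
    return signal_gain_sq / noise_var

# Fisher information grows monotonically with the connection strength alpha.
vals = [chain_fisher(a, T=20) for a in (0.8, 1.0, 1.2, 1.5)]
assert all(low < high for low, high in zip(vals, vals[1:]))
```

Dividing numerator and denominator by α^(2T) shows that the ratio equals 1/Σ_{j=0}^{T} α^(−2j), which increases without bound as α grows, consistent with the unbounded Fisher information described above.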
Comparison with the attractor networks reveals two important features of the feedforward networks that reflect the advantages and disadvantages of having finite memory duration (see Figure 5D). First, because the feedforward networks can transiently amplify signals over the memory period but still remove noise due to the eventual exiting of signals from the chain, these networks can greatly outperform the attractor networks. Second, for smaller values of α, the attractor networks outperform the feedforward networks. This latter result reflects the smearing out of signals by the continuous dynamics (see section 3.2 and Figure 6) and differs from that found when comparing feedforward and attractor networks with discrete dynamics (Ganguli, Huh et al., 2008), in which the feedforward networks outperformed the attractor networks for all settings of α.
In summary, with continuous buildup of noise, short-term memory networks need to forget to prevent the excessive accumulation of noise. In feedforward networks, this forgetting mechanism is inherent in the finite length of the feedforward chain, and the networks can amplify signals transiently without noise building up in an unbounded manner. In contrast, in attractor networks, the duration of signal and noise accumulation is not limited, and a perfect memory holding mechanism is inferior to a forgetful one in which signal decay and noise accumulation are optimally balanced. Comparing the purely linear attractor and feedforward network models, we find that the feedforward networks can outperform the attractor models due to their ability to transiently amplify signals without building up excessive noise. In the following sections, we show how these results may change in the presence of select, biologically motivated nonlinearities.
3.2. Networks with a Reset Mechanism.
In the previous section, we found that the feedforward networks stored more information than the attractor networks because they could amplify the signal without infinite buildup of noise. In contrast, the attractor networks needed to be forgetful in order to prevent infinite noise buildup. However, what if the accumulation of noise before the memory period is limited, or there exists a mechanism to reset the network state near the onset of the signal? Recent experimental studies in several cortical regions showed that variability in neural activity is reduced with stimulus onset (Churchland et al., 2010), and theoretical work suggests this may be a general feature of many nonlinear recurrent networks (Rajan et al., 2010). Alternatively, networks may not switch into a memory-storing state that accumulates noise until close to the start of the memory period; for example, such a switch may occur due to a change in network state triggered by attention or neuromodulation (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001). Finally, if a network receives feedback about its deviation from a desired level and is able to correct these errors, then infinite buildup of errors is also prevented. For example, in the oculomotor system, drift in the networks that control eye position triggers corrective saccades that can correct errors caused by accumulation of noise or systematic drift of network activity (Weber & Daroff, 1972). Motivated by these examples, here we consider the effect of allowing a network to reset its activities with the arrival of a signal and remove previously accumulated noise.
Note that the level of spontaneous activity before the memory period can differ between these different reset mechanisms, being low even before stimulus arrival if there is a stable low-rate spontaneous state (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001) or being higher during spontaneous activity and reduced only at stimulus onset (Rajan et al., 2010; Churchland et al., 2010); however, for any reset mechanism, the variability of network activity would be low at the beginning of the memory period. For simplicity, we implement this “reset nonlinearity” by setting the noise to zero at the time of signal arrival, so that noise accumulates only during the memory period of duration T.
First, we consider the attractor networks. Before the signal arrives, noise accumulates, and this accumulation can grow without bounds along any nondecaying (α ⩾ 1) modes of the network (see Figures 6A and 6B). However, at the time of signal arrival, the reset mechanism quenches the neural activities to zero. Therefore, only noise presented after t = 0 degrades the memory performance, and unlike in the attractor networks without reset, perfectly integrating or exponentially growing modes can convey information about the signal. In fact, the Fisher information monotonically increases with increasing α (see Figure 6C and section A.2.2), showing that memory performance is enhanced by amplifying signals in the network. This result can be understood by recalling that the signal is presented only at time t = 0, whereas noise enters continually throughout the entire memory period: by amplifying the input over time, more weight is given to inputs at earlier times, allowing the signal to be amplified faster than noise enters the system. In the limit of infinite signal amplification, the signal can be made arbitrarily larger than the noise, so that the Fisher information approaches infinity and signals can be discriminated with perfect precision.
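A single-mode, discrete-time sketch makes the contrast with the no-reset case concrete (hypothetical names; the paper's continuous dynamics differ in detail). With a reset, noise accumulates only over the T steps of the memory period, so amplification (α > 1) helps; without a reset, noise also accumulates over a pre-stimulus period, and for α ⩾ 1 this pre-period noise swamps the signal, favoring a forgetful mode:

```python
def attractor_fisher(alpha, T, pre_T, sigma=1.0):
    """Single attractor mode in a discrete sketch, x <- alpha * x + noise.
    The signal arrives at t = 0; noise has been entering since t = -pre_T,
    so a reset at signal arrival corresponds to pre_T = 0."""
    signal_gain_sq = alpha ** (2 * T)
    # Noise injected j steps before readout has been amplified j times,
    # for j up to T + pre_T - 1.
    noise_var = sigma ** 2 * sum(alpha ** (2 * j) for j in range(T + pre_T))
    return signal_gain_sq / noise_var

# With a reset, amplifying modes help: Fisher information grows with alpha.
assert attractor_fisher(1.2, T=20, pre_T=0) > attractor_fisher(1.0, T=20, pre_T=0)
# Without a reset and a long noisy pre-period, a forgetful mode (alpha < 1)
# outperforms a perfect integrator (alpha = 1).
assert attractor_fisher(0.95, T=20, pre_T=500) > attractor_fisher(1.0, T=20, pre_T=500)
```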
In feedforward networks, the reset mechanism also enhances memory performance by removing noise accumulated prior to the stimulus onset (see Figures 6D and 6E) and thereby increases the Fisher information relative to a network without reset (see Figure 6F). However, the increase in Fisher information for the feedforward chain network is not nearly as large as that in the attractor networks. This reflects that, even without an externally imposed reset, the finite length of the feedforward chain already provides a mechanism for removing noise because noise exits the system when it reaches the end of the chain.
By comparing the performance of the attractor and feedforward networks, we find that the attractor networks perform better than the feedforward networks when there exists a reset mechanism (see Figure 6I). To understand what factors contribute to this result, we first consider the case of networks with discrete dynamics (Ganguli, Huh et al., 2008). In feedforward chains exhibiting discrete dynamics, all activity at one stage of the chain passes in the next time step to the following stage. When α = 1 (see Figure 6G), activity is passed from neuron to neuron in discrete time steps without loss of amplitude. Thus, there is no smearing out of activity across neurons as the activity progresses through the feedforward chain. In this case, the Fisher information for the feedforward and attractor networks is identical (see section A.4), reflecting a deep similarity between the feedforward and attractor networks: whereas in attractor networks, activity at each time step is sent from a given neuron (or mode) onto the same neuron (or mode), in feedforward networks, the activity is similarly propagated over time, but instead from one neuron to the next (see Figure 6G; for discussion of a more general mathematical framework that formalizes the similarity between feedforward and attractor networks, based on pseudospectral analysis (Trefethen & Embree, 2005), see the supplement of Goldman, 2009).
The discrete dynamics example suggests that the key factor explaining the poorer performance of feedforward networks with continuous dynamics is the spreading of activity across neurons or modes that occurs in the continuous feedforward networks. This spreading has two effects. First, it reduces the amplitude (vector length) of the signal carried by the network by spreading activity across different neurons. To understand this, note, for example, that dividing a signal equally among two neurons, so that the activity can be described by a vector (s/2, s/2), reduces the vector amplitude of the signal by a factor of √2 compared to when the entire signal is carried by a single neuron, that is, (s, 0). This loss of signal is evident in Figure 6H, which shows how the signal gain decreases over time in the continuous feedforward networks (dashed curve) but is maintained at a constant level in the discrete feedforward networks (circles). Second, the spreading of activity causes activity to exit the network before the end of the memory period. This is evidenced by the dip in signal gain seen near the end of the memory period (T = 2) in the same figure. Due to both the spreading of the signal and the loss of signal out the end of the chain, the Fisher information for continuous-time feedforward networks becomes lower than that of the attractor networks (see Figure 6I).
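The √2 factor can be verified directly; more generally, splitting a signal of size s equally across n neurons leaves a vector of length s/√n (a small numeric check with NumPy):

```python
import numpy as np

s = 1.0
concentrated = np.array([s, 0.0])      # entire signal on one neuron
spread = np.array([s / 2, s / 2])      # signal split equally over two neurons
# Splitting halves each component but reduces the vector length only by sqrt(2).
assert np.isclose(np.linalg.norm(concentrated) / np.linalg.norm(spread),
                  np.sqrt(2))
# More generally, spreading s equally over n neurons leaves length s / sqrt(n).
n = 8
assert np.isclose(np.linalg.norm(np.full(n, s / n)), s / np.sqrt(n))
```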
In the example above, we showed that the attractor network outperforms the feedforward network with the same strength between the modes. However, a more direct biological constraint is to compare the attractor and feedforward networks when the maximum connection strength between neurons is held fixed. In this case, we again find that the optimal attractor network outperforms any (literally or functionally) feedforward network. The proof is given in section A.5. There, we show that if the maximal synaptic strengths between the neurons are bounded by w_max and the eigenmodes or Schur modes are orthogonal to each other, then the connectivity strength between the modes is bounded by Nw_max. For the attractor networks, we find a network that reaches this bound. By contrast, the feedforward networks cannot achieve this bound for all connections between modes. Further, even if we assume there exists a feedforward network with all connections between modes set to Nw_max, the previous comparison of memory performance for a given connectivity between the modes shows that the attractor network still outperforms the feedforward network when both networks have connectivity strength Nw_max between modes (see Figures 6H and 6I). Thus, the optimal attractor network outperforms any feedforward network when the synaptic connectivity between the neurons is bounded. Alternatively, we can also consider the constraint that the total postsynaptic weight is bounded. We note that, at least for the case of excitatory (literally feedforward or excitatory attractor) networks, the maximal memory performance under this constraint corresponds to the previous result, in which there were fixed connections between modes (see Figures 6H and 6I). The optimal feedforward networks use this maximal connection strength w_postsynaptic,max between all neurons, and the optimal attractor networks have maximum eigenvalue w_postsynaptic,max.
Thus, the optimal attractor networks outperform the optimal feedforward networks.
In summary, in this section we considered the effect on memory performance of reset mechanisms that remove accumulated noise at the time of signal arrival. As a result of this reset, the attractor networks could amplify signals without having a buildup of noise prior to signal arrival affect the memory performance. Moreover, for a given level of amplification between the neurons or modes, the attractor networks perform better than the feedforward networks since the activity in the feedforward networks spreads out along the chain and is lost when it exits the end of the chain.
3.3. Bounds on the Neuronal Activity.
In the previous sections, we found that the networks exhibiting the best memory performance depended on strong amplification of signals that led to large and potentially unbounded growth of network activity. However, unbounded amplification of activities is not possible since neurons have a limited dynamic range. This limited range takes several forms. Biophysically, there are absolute limits on the maximal firing rates that neurons can achieve (typically in the hundreds of Hz) due to postspike refractoriness. Additionally, experimental work (Baddeley et al., 1997) suggests that neurons are also constrained in the average firing rates they can sustain over long time periods. During working memory periods, most neurons do not sustain average firing rates beyond several tens of Hz, even though trial-to-trial fluctuations may be much larger than this for brief periods of time.
Given these observations, in this section we consider the effects of imposing constraints on the range of firing rates with which neurons encode signals in memory. Throughout much of the discussion, we confine ourselves to limits on the mean firing rates attained over the course of the memory period. This constraint is motivated by the observation that memory neurons typically have much lower (trial-averaged) mean firing rates than are allowed by their moment-to-moment biophysical constraints, and for analytical tractability is implemented by adjusting the inputs of an otherwise linear network such that the activity never exceeds hard bounds on the mean rates. Then, in order to gain insights into the effects of constraints on the absolute size of neuronal fluctuations, we consider what happens when we additionally apply a hard bound on the variance of firing rates about the mean.
3.3.1. Effects of Bounded Rates on the Form of External Inputs.
Before considering the effects of a finite dynamic range of firing activity in specific networks, we investigate the constraints it places on the form of the input vector. Since the input vector drives the mean firing activity in linear networks (see equation 2.3), putting a limit on the mean firing rate correspondingly constrains the input vector. Note that this differs from our treatment of networks with unconstrained firing rates, for which we normalized the magnitude of the input vector to 1 because the Fisher information for all networks simply scaled with the square of the input magnitude and could be made arbitrarily large by increasing it (see equation 2.9; compare to Ganguli, Huh et al., 2008, who assumed that the input magnitude is still limited to 1 under a similar constraint on the dynamic range). To implement the constraint on mean firing rates, we assume that each neuron has its mean (absolute) activity bounded by a maximal value r0 (where negative rates can be considered as the firing rate of an “anti-neuron” with opposite stimulus preference; Shadlen, Britten, Newsome, & Movshon, 1996). This is illustrated geometrically in Figure 7A, which shows that the hard limits on the mean firing rate define a hypercube (a square in 2D) in the space of possible firing rates. To stay within these limits, the magnitude of the input vector must be set such that the mean firing rate of any given neuron never exceeds its bound. As we show below, this leads to different maximal amplitudes of the input vector for different network architectures.
The constraint on the mean firing rate has immediate implications as well for the spatial pattern of inputs that are conveyed most faithfully by the network. Given the limitation on how much information any given neuron can convey with its limited dynamic range, the maximal information carried by a network is achieved when all neurons are used and each of these neurons uses its full dynamic range. When this idea is represented geometrically, information storage is maximized if the attracting or Schur modes of the networks lie along the directions pointing to the vertices of the hypercube that defines the maximal range of mean responses (see Figure 7B, open circles). With this arrangement, the strength (vector length) of signals conveyed by the networks is proportional to √N, illustrating the benefits of having more neurons in the network when each individual neuron has limited dynamic range. As shown in section A.6, this scaling leads to the Fisher information for the best attractor and feedforward networks scaling with the network size N (see Figure 8G).
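Geometrically, a mode that points at a vertex of the rate hypercube puts every neuron at its bound ±r0, so its length grows as √N; since the Fisher information scales with the square of the signal length, it grows linearly with N. A small check (hypothetical variable names):

```python
import numpy as np

r0 = 1.0
for N in (4, 16, 64):
    rng = np.random.default_rng(N)
    # A mode along a hypercube vertex: every neuron at its bound +/- r0.
    vertex_mode = r0 * rng.choice([-1.0, 1.0], size=N)
    assert np.isclose(np.linalg.norm(vertex_mode), r0 * np.sqrt(N))
    # Squared signal length, and hence Fisher information for a fixed amount
    # of noise per neuron, grows linearly with N.
    assert np.isclose(np.linalg.norm(vertex_mode) ** 2, r0 ** 2 * N)
```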
3.3.2. Attractor Networks with Finite Dynamic Range.
We first consider the performance of attractor networks with a limited range of mean firing rates and no reset nonlinearity. In this case, our results follow closely that found for the linear networks of section 3.1: if the decay time constant of the network is too small, the signal decays to zero before the end of the memory period (see Figure 8A, bottom trace, and Figure 8B, probability distribution of activity in this mode). By contrast, if the network does not exhibit decay or decays too slowly, then noise builds up to the point that the signal becomes overwhelmed by noise (see Figures 8A and 8B, top traces). To perform optimally, the network must balance signal decay and the accumulation of noise, and we find that there is a finite optimal time constant of network decay that achieves this balance (see Figures 8A and 8B, middle traces; in Figure 8E, the dotted line shows memory performance as a function of α). We note that this optimal value is identical to that found in section 3.1 for networks with no bounds on the mean firing rates. This identical result reflects that, due to the need to remove noise through decay of network activity, the activity of the network never needs to be constrained by the limited dynamic range. However, the limited dynamic range does bound the total information that can be conveyed by the network because it constrains the amount of input that the network can receive at the time of the stimulus.
When there is a reset nonlinearity at the time of signal arrival, the optimal strength of network feedback does change compared to that obtained without a limited dynamic range. Recall that, without limits on the dynamic range (see section 3.2), the optimal networks were found to have strong feedback (α > 1) so that they could amplify their signals faster than noise entered the system. With a finite dynamic range, unconstrained amplification of activity is no longer possible. Figure 8C illustrates mean trajectories of three modes with different recurrent feedback α or, equivalently, different effective time constants τ_eff, corresponding to decaying (τ_eff > 0), perfectly integrating (τ_eff = ∞), and exponentially growing (τ_eff < 0) modes, respectively. Compared to the decaying mode, the perfectly integrating mode performs better because it maintains the signal faithfully yet has only a finite buildup of noise due to the reset at time t = 0. For the amplifying mode that exhibits exponential growth (the increasing trajectory in Figure 8C), we set the input amplitude to a value such that activity propagates linearly through the network until, at the end of the memory period, it just reaches the limit of the dynamic range. Thus, the maximal signal that can be carried by the network is identical to that obtained in the perfectly integrating mode. However, due to the amplification, noise accumulates faster than in the perfectly integrating mode, resulting in a larger variance in the neuronal firing rates (see Figure 8D). Thus, with a finite dynamic range and reset of activities with arrival of the signal, we find that perfectly sustaining the activity during the memory period is optimal (see Figure 8E; see section A.6 for the detailed calculation).
More generally, in both the networks with and without a reset nonlinearity, we find that the Fisher information for optimally tuned networks increases linearly with the number of neurons N (see Figure 8G) and decreases inversely with the memory duration T (see Figure 8H); thus, it scales as N/T. The former result reflects that more neurons allow more signal to be carried by the network, as discussed in the preceding section. The latter result reflects the accumulation of the continually presented noise, which results in a linear increase in noise variance over the memory period (see section A.6).
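Under the assumptions above, the N/T scaling can be written out directly: the signal vector, of length r0·√N, is held fixed by the bound on mean rates, while the noise variance grows linearly over the memory period (a sketch with hypothetical names, not the paper's exact derivation):

```python
def integrator_fisher(N, T, r0=1.0, sigma=1.0):
    """Perfect integrator with a reset at signal arrival: the signal vector,
    of length r0 * sqrt(N), is held constant over the memory period, while
    the noise variance grows linearly with the memory duration T."""
    signal_sq = N * r0 ** 2       # squared signal length, scales with N
    noise_var = sigma ** 2 * T    # noise variance accumulates linearly in T
    return signal_sq / noise_var

base = integrator_fisher(N=100, T=10)
assert integrator_fisher(N=200, T=10) == 2 * base    # linear in N
assert integrator_fisher(N=100, T=20) == base / 2    # inverse in T
```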
Note that the constraint on the mean firing rate alone may allow infinite accumulation of noise, which is not biologically plausible. Thus, in addition to constraining the mean activity, we further consider bounds on the variance of neural activity. For analytical tractability, we place a simple bound on the maximal variance of activity without affecting the underlying dynamics: if the variance of activity obtained from the dynamics exceeds the bound, it is simply clipped at the bound (see section A.6).
To see the effect of the bound on variance, we note that there are two possible cases. In the first case, the Fisher information is maximized in a regime where the noise variance bound is saturated. Since the noise variance is saturated, this corresponds to a regime in which the signal-to-noise ratio is, at best, very low and information transmission is exceptionally small. Thus, without some additional nonlinear mechanism for noise reduction, the system would fail to transmit much information. Furthermore, it is not even clear that Fisher information provides a good metric in such cases where only very coarse discrimination of signals may be performed (see Butts & Goldman, 2006). We therefore do not consider this case further.
In the second case, corresponding to a higher signal-to-noise regime, the maximal Fisher information is obtained when the noise variance is not saturated. In this case, as shown in section A.6, we find that the optimal value of the network feedback α is not different from that obtained without a bound on the noise variance (compare Figures 8E and 8F). Furthermore, we derive conditions on σ and N such that this higher signal-to-noise regime is attained. In a similar manner, the optimal memory performance of the feedforward networks in the high signal-to-noise regime is not affected by the bound on the variability. Therefore, for the feedforward networks discussed below, we consider only the effect of the bound on the mean activity.
3.3.3. Feedforward Networks with Finite Dynamic Range.
In sections 3.1 and 3.2, we found that the optimal feedforward networks used transient amplification of signals to increase the (signal gain)-to-(noise gain) ratio that determines the Fisher information. However, as we noted for the attractor networks with a reset, unbounded signal amplification is no longer possible when there is a finite dynamic range, and firing rates will saturate unless the inputs entering the network are reduced. The consequences of this limited dynamic range for the feedforward networks are delineated below.
We consider first the case of networks with discrete dynamics (see Figure 6G) for which analytical calculation of the optimal Fisher information is tractable. Similar to the above results for the attractor networks with a reset, we find that the optimal memory performance is obtained under two conditions. First, the input vector should be made as strong as possible for each neuron so that each neuron in the network uses the full extent of its mean dynamic range. This immediately implies that the optimal feedforward networks must have a functionally, rather than literally, feedforward architecture (because in a literally feedforward architecture, by definition the first stage does not contain all neurons).
Second, the networks should use a value α = 1 that corresponds to perfect maintenance of the signal as it propagates down the chain of modes (see Figure 9A). If network activity decays more quickly than this (α < 1), part of the signal will be lost. If network activity grows more quickly (α > 1), then in order to use its full dynamic range of mean activity and not saturate, the network will need to have initial activity that is less than maximal and will need to amplify this activity over time, leading to an amplification of noise as well. This is precisely analogous to the case for the attractor network with a reset, and indeed the discrete feedforward and attractor networks with a reset have identical Fisher information (see Figure 10C). Without a reset, the feedforward networks with discrete dynamics can outperform the attractor networks because they do not need to forget in order to remove noise and, in fact, the performance of the feedforward networks is identical with or without a reset (see Figures 10A and 10C).
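This trade-off can be sketched with a simple bound on the mean rates (hypothetical function and parameter names): for α ⩾ 1 the trajectory peaks at the end of the memory period, so the input must be scaled down to r0/α^T and the final signal is pinned at r0 while the noise is amplified; for α < 1 the signal itself decays. A scan over α then picks out α = 1:

```python
def bounded_chain_fisher(alpha, T, r0=1.0, sigma=1.0):
    """Discrete feedforward chain whose mean rates are bounded by r0.
    For alpha >= 1 the mean trajectory peaks at the end of the memory
    period, so the input starts at r0 / alpha**T and the final signal is r0;
    for alpha < 1 it peaks at the start, so the final signal is r0 * alpha**T."""
    end_signal = r0 if alpha >= 1 else r0 * alpha ** T
    # Noise injected k steps before readout has been amplified k times.
    noise_var = sigma ** 2 * sum(alpha ** (2 * k) for k in range(T))
    return end_signal ** 2 / noise_var

T = 12
candidates = (0.8, 0.9, 1.0, 1.1, 1.3)
best = max(candidates, key=lambda a: bounded_chain_fisher(a, T))
assert best == 1.0   # perfect signal maintenance is optimal
```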
We note that the two conditions above imply that for the feedforward networks to maintain neuronal activity at a level that uses the full dynamic range of all neurons, the activities of the neurons at all times need to be directed along the vertices of the hypercube that defines the allowed range of mean firing rates (see Figures 7B and 9B). It is not immediately obvious that this condition can be met for the feedforward networks, because it implies that there must be N orthogonal modes of the network that each lie along a different vertex of the hypercube. In section A.6, we show that networks can be constructed that obey this criterion, at least in the case that the number of neurons N is equal to a power of 2. When N is restricted to this case, we show that the Fisher information conveyed by the network scales as N/T, similar to the case of the attractor networks with a reset. Building on this case, we show in section A.6 that for general N, the maximal Fisher information is still of the order of N/T when the number of neurons is at least twice the number of feedforward stages.
Literally feedforward networks perform more poorly than the optimal, functionally feedforward networks already described. In the literally feedforward networks, different sets of neurons are used to convey information in each stage. In particular, since input is applied to neurons only in the first stage and is carried only by a subset of neurons at any time, the feedforward networks cannot convey as much information as functionally feedforward networks that use all neurons at every stage. Interestingly, when we considered storage of a one-dimensional stimulus in a literally feedforward network with the number of stages equal to the duration of the memory period (as in Ganguli, Huh et al., 2008), we found that the memory performance of the optimal networks scaled only as N/T² rather than as N/T (see section A.6). Furthermore, although the previous study focused on networks having a fan-out structure with more neurons at later stages, we found that the optimal network architecture in this case contained equal numbers of neurons at all times (see section 4 for further commentary). This uniform structure provides an optimal balance between the fan-out architecture, which allows larger signal amplification between stages, and the fan-in architecture, which reduces noise (and particularly the amount of noise that is common among neurons) by pooling stages with more neurons into stages with fewer neurons (see Figures 9C to 9E). Figure 9D shows how the Fisher information depends on the fan-out rate, defined as the ratio of the numbers of neurons in successive stages: a fan-out rate less than 1 (greater than 1) corresponds to a fan-in (fan-out) structure. As seen in this figure, the Fisher information is maximized when the fan-out rate is 1, that is, for a uniform structure. We note that this result holds irrespective of the modeling assumptions of Ganguli, Huh et al. (2008) (calculation not shown).
For feedforward networks with continuous dynamics, the Fisher information cannot be expressed in a simple analytical form, making it difficult to find the structure that optimizes memory performance. To obtain a lower bound on the maximal Fisher information and to gain intuition for how the results obtained in discrete dynamics might change when the dynamics are continuous, we therefore calculated the Fisher information for networks with the structure found to be optimal under discrete dynamics. Numerical simulation in this case shows that the feedforward networks perform worse than the attractor networks, either with or without reset (see Figures 10B and 10D). Furthermore, with a reset, we show in section A.6 that the attractor network saturates the bound on information transmission achievable by any network with a finite dynamic range, whereas no feedforward network can achieve this bound. Thus, at least for networks with a reset, the attractor networks strictly outperform the feedforward networks. Without a reset, the worse performance of the feedforward networks could in principle be due to the nonoptimal architecture taken from the optimal discrete network. However, we think this is unlikely because the reduced memory performance for the feedforward networks with continuous dynamics is analogous to the similar result found in section 3.2 (Figures 6H and 6I and accompanying text), which could be explained by the combination of spreading of signals across the modes of the network and signal loss through the end of the chain.
We have compared the memory performance of two prominent classes of short-term memory networks in storing the amplitude of a briefly presented stimulus in the presence of gaussian white noise. In one class of networks, memory was sustained by positive feedback that was mediated by recurrent connections and resulted in the formation of low-dimensional attractors (Robinson, 1989; Seung, 1996). In the other class, memory was sustained by passing activity through either long feedforward chains of neurons or through a chain of orthogonal activity patterns (Schur modes) in a recurrent network (Ganguli, Huh et al., 2008; Goldman, 2009; White et al., 2004). In each case, memory performance was quantified with the Fisher information, which, for the linear network dynamics considered here, represents a ratio of the amount that the network amplifies the signals versus the noise received.
Our primary results were as follows. For the attractor networks, including those with a limited range of firing rates, we found that the best-performing networks were forgetful if noise is allowed to build up without constraint before the stimulus arrives. This forgetfulness reflected a fundamental trade-off between requiring a long time constant of decay of network activity to maintain signals throughout the memory period and needing some decay of network activity in order to remove noise that enters the system before the stimulus arrives. However, if there exists a mechanism to remove noise from the system near the time of stimulus arrival or if networks enter the memory-storing state only close to the time of the stimulus onset, then we found that the optimal networks with a limited dynamic range perfectly maintain their signals throughout the memory period.
Comparison of the memory performance between line attractor and higher-dimensional attractor networks showed that the optimal memory performance with or without reset is independent of the dimension of the attracting modes. However, optimal memory performance in the line attractor networks requires an optimal alignment of the input vector, whereas optimal memory performance in the higher-dimensional attractors requires an optimal alignment of the readout vector. These results suggest that line attractor networks might be more useful if the stored memory needs to be used by multiple networks that each project activity along a different direction. By contrast, higher-dimensional attractors might be more useful in storing memories that arrive from multiple networks that each encode the stimulus along a different direction.
For the feedforward networks, the optimal network architectures did not depend strongly on the presence of a resetting mechanism because the feedforward networks naturally remove noise from the system when it exits from the end of the feedforward chain. Due to this inherent noise-removal mechanism, the feedforward networks could transiently amplify signals without excessive noise buildup (Ganguli, Huh et al., 2008) and, for linear networks with no reset or bounds on activity, perform better than the attractor networks. However, when the firing rates were bounded, the ability of the feedforward network to amplify inputs was limited, and the optimal feedforward networks propagated activity without amplification or decay (α = 1).
Comparing the networks, we found that the Fisher information for both the optimal attractor and feedforward networks increased linearly with the number of neurons N, reflecting that additional neurons allow more signal to be carried by the network. Additionally, the optimal networks in both cases exhibited a power law decay in memory performance. For the attractor networks and for feedforward networks in a discrete approximation, this decay was inversely proportional to time and reflected the linear increase in noise variance over time. Interestingly, we note that such a linear increase in variance has been observed experimentally in spatial working memory tasks (Ploner, Gaymard, Rivaud, Agid, & Pierrot-Deseilligny, 1998; White, Sparks, & Stanford, 1994). Feedforward networks with continuous dynamics performed less well than those with discrete dynamics, reflecting two factors: the signals in continuous networks spread out over time, leading to a reduction in the signal gain; and due to this spreading, signals exit from the end of the chain before the end of the memory period. Together these factors lead to worse performance of the feedforward networks relative to the attractor networks when there is a noise reset, and quite likely (although we could only compute a lower bound approximation on the feedforward networks) even in the absence of such a reset.
4.1. Comparison to Previous Work.
Many previous studies have proposed perfectly tuned attractor networks as a substrate for holding short-term memories in the absence of noise (for reviews, see Brody et al., 2003; Goldman, Compte, & Wang, 2009; Wang, 2001). Here, we have explicitly considered the effects of noise on both attractor and nonattractor (feedforward or functionally feedforward) networks. For networks with underlying linear dynamics and both a reset nonlinearity and a finite dynamic range on neuronal responses, our results are consistent with the optimality of perfectly tuned attractor networks. Similarly, we note that the perfectly tuned attractor network (integrator) was found in a recent study to be the optimal architecture for storing the running total of a continuously presented input in which noise likewise started with the arrival of the signal (Brown et al., 2005).
Perfect integrator networks face a fine-tuning problem of network connectivity in that the feedback connections must precisely offset intrinsic neuronal decay processes in order to sustain activity at a constant rate in the absence of external input. Several mechanisms have been suggested to lessen the strictness of this tuning requirement. These include the use of long intrinsic (Marder, Abbott, Turrigiano, Liu, & Golowasch, 1996) or synaptic (Hempel, Hartman, Wang, Turrigiano, & Nelson, 2000; Wang et al., 2006; Mongillo, Barak, & Tsodyks, 2008) time constants. In addition, bistability is a nonlinear mechanism for maintaining the robustness of memory storage (Camperi & Wang, 1998; Koulakov, Raghavachari, Kepecs, & Lisman, 2002; Goldman, Levine, Major, Tank, & Seung, 2003), and homeostatic learning rules have been suggested as a way to keep short-term memory-storing circuits tuned (Goldman, 2009; Renart, Song, & Wang, 2003). Further investigation is needed to analyze the robustness of different network architectures to synaptic weight changes.
In the absence of a reset nonlinearity, we find that noise buildup before the time of the stimulus presentation makes the perfect attractor network nonoptimal; instead, at least in the high signal-to-noise regime, we find that the optimal attractor networks must be forgetful in order to reduce noise accumulation. This result is similar to that of White et al. (2004), who considered the storage of temporal sequences in memory networks with discrete dynamics and who noted that forgetting was necessary in order to prevent the buildup of noise that arrived at all times before the stimulus onset.
For the feedforward networks, previous work that examined networks with a finite dynamic range of neural activity focused on network architectures with a fan-out structure (Ganguli, Huh et al., 2008; Ganguli & Latham, 2009). This previous work showed that under a finite dynamic range constraint, a fan-out network can achieve the same scaling as the optimal network; however, this study did not check whether other structures may achieve this bound or whether there exists a structure having better memory performance. By contrast, at least for storage of a one-dimensional stimulus, we explicitly calculated that the optimal network architectures for the (discrete) feedforward networks had a uniform structure and that the fan-out structure was suboptimal. A key difference between our study and that of Ganguli, Huh et al. (2008) is that they primarily focused their discussion on memory for sequences, whereas here we explicitly focus on memory for a single-dimensional input. Higher-dimensional signals cannot be stored in attractor networks if the dimension of the attractor is lower than the dimension of the signal. Therefore, if the stimulus to be remembered is higher dimensional, such as remembering an entire sequence of inputs, this may favor a high-dimensional or feedforward network in which time is explicitly represented by patterns of activity that are sequentially activated as signals propagate through the network (Ganguli, Huh et al., 2008; Goldman, 2009; White et al., 2004). Ganguli, Huh et al. (2008) showed that the duration T for which a network could reliably convey information about a temporal sequence increased only in proportion to √N. This contrasts with our result for storing a single-dimensional stimulus, in which memory increases proportional to the network size N.
The reason for this difference is that for storage of a single-dimensional stimulus, our optimal networks (both attractor and functionally feedforward) could use their entire finite dynamic range to store this one dimension. By contrast, when the stimulus dimension scales with time, as in sequence memory, the network must divide its dynamic range among all stored dimensions. This leads to memory performance that scales as N/T², rather than N/T, so that the duration T for which a network can reliably convey information about a temporal sequence increases only in proportion to √N. Consistent with this observation, the memory performance of our optimal literally feedforward networks (which use approximately 1/T of the entire network’s range at any given time) scaled only as N/T². Furthermore, literally feedforward networks might have an advantage over functionally feedforward networks or generic high-dimensional attractors because the literally feedforward networks keep the elements of a sequence arriving at different times cleanly segregated.
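The square-root scaling of the reliable memory duration for sequences follows in one line from the N/T² performance quoted above: requiring the Fisher information to stay above some fixed threshold $I_{\min}$ gives

```latex
I_F \sim \frac{N}{T^2} \ge I_{\min}
\quad\Longrightarrow\quad
T \lesssim \sqrt{\frac{N}{I_{\min}}} \propto \sqrt{N},
```

whereas the N/T scaling for a single-dimensional stimulus yields a reliable duration that grows linearly in N.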
4.2. Temporal Information in Memory Networks.
In this work, we focused on mechanisms for storing the amplitude of a stimulus when the memory period is a fixed (or known) duration, so that there is no need for encoding the time since the pulse occurred. However, if the duration of the memory period is variable and unknown, then joint information about the amplitude of an input pulse and the time at which the pulse occurs needs to be encoded. A one-dimensional attractor network is not suitable for extracting joint information about the amplitude and time of an input, since a one-dimensional network cannot represent such a two-dimensional quantity. Rather, at least a two-dimensional network is required. Feedforward networks seem advantageous for processing time and storing signals since different sets of neurons or modes are used at different times. However, it is unclear whether time and amplitude are dependently encoded by the network activity, as in feedforward networks, versus encoded in independent modes of activity (either with time encoded in a completely separate network from amplitude, or with independent modes of activity that represent time and that represent amplitude, as suggested by the recent work of Machens, Romo, and Brody, 2010). For example, it has been suggested that in the circuits underlying bird song, time is represented through a feedforward chain of bistable units, which provides a more robust temporal encoding than graded networks (but at the cost of losing any representation of amplitude information; see Long, Jin, & Fee, 2010). Alternatively, high-dimensional attractor networks have been suggested to encode both time and amplitude (Machens et al., 2010; Singh & Eliasmith, 2006). Further work, both experimental and theoretical, is needed to address the joint processing of amplitude and temporal information.
4.3. Effect of Correlated or Signal-Dependent Noise.
In this study, we assumed for simplicity that the external noise received by each neuron was equal in amplitude and uncorrelated across neurons. However, similar to studies in sensory systems that have shown strong effects on neural coding in the presence of correlated noise (Abbott & Dayan, 1999; Averbeck, Latham, & Pouget, 2006; Latham, Deneve, & Pouget, 2003; Sompolinsky et al., 2001; Zohary, Shadlen, & Newsome, 1994), we found that the correlation structure of noise received by the network may dramatically affect the optimal architecture of memory networks. As illustrated in Figure 11A, line attractor networks may be advantageous when noise is correlated: if the input direction can be chosen independent of the profile of the injected noise, then the attractor and the input direction can be oriented orthogonal to the directions of high noise and along directions with low noise. In contrast, activity in feedforward networks is passed through many different orthogonal patterns of activity (see Figure 11B), so that it may be difficult to take advantage of correlated noise that has a particularly non-noisy direction. If instead the input direction and the profile of injected noise are dependent, a more careful examination is required to determine the architectures of the best-performing attractor and feedforward networks.
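The advantage of orienting an attractor along a low-noise direction can be illustrated numerically. In the sketch below (all parameters are illustrative), the Fisher information per unit signal along a direction u is proportional to 1/(uᵀΣu), so aligning the attractor with the minimal-variance eigendirection of a correlated noise covariance Σ beats averaging over the many orthogonal directions a feedforward chain is forced to visit:

```python
import numpy as np

rng = np.random.default_rng(1)

# Correlated noise covariance with one low-noise and one high-noise direction.
Q = np.linalg.qr(rng.standard_normal((2, 2)))[0]
Sigma = Q @ np.diag([0.1, 10.0]) @ Q.T

def info_along(u, Sigma):
    """Fisher information (up to a constant) when signal and readout both lie along u."""
    u = u / np.linalg.norm(u)
    return 1.0 / (u @ Sigma @ u)

# Align the attractor with the low-noise eigendirection of Sigma.
eigvals, eigvecs = np.linalg.eigh(Sigma)
u_best = eigvecs[:, np.argmin(eigvals)]

# A feedforward chain passes activity through many orthogonal directions;
# crudely model this as an average over random orientations.
random_info = np.mean([info_along(rng.standard_normal(2), Sigma)
                       for _ in range(1000)])
print(info_along(u_best, Sigma), random_info)
```

The "random orientation" average is only a crude stand-in for a feedforward chain, but it captures why a freely oriented line attractor can exploit noise correlations that a chain cannot.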
We considered only additive noise in this study. When noise is instead multiplicative or signal dependent, different optimal architectures may be necessary. Although this question deserves much further study, multiplicative noise is more disruptive at high firing rates than additive noise and therefore might lead to better memory performance for networks with relatively faster decay of signals, with smaller amplitudes of input that drive neurons to lower firing rates, or with other differences in network architecture that decrease the use of high firing rates.
4.4. Nonlinear Dynamics.
In this letter, we have modeled the finite range of neuronal activities in an analytically tractable manner by imposing a finite dynamic range on the mean firing rates and their variances and arranging the network inputs so that the trajectories of neuronal firing never exceed this range. More realistically, neurons have hard or soft limits on their observed firing rates that are best modeled with explicitly nonlinear network models. Having the underlying dynamics of the network be nonlinear rather than imposing the finite dynamic range as a simple constraint on a linear network may influence the memory performance in various ways. For example, explicit inclusion of nonlinear dynamics may reduce the buildup of noise relative to a network with linear dynamics, and even without an explicit reset mechanism, the optimal architecture of the attractor networks may become less forgetful. Furthermore, recent theoretical work shows that randomly connected nonlinear networks with sigmoidal neuronal input-output functions exhibit a sharp reduction of neural variability with the arrival of a stimulus (Rajan et al., 2010), suggesting a mechanistic explanation for the reset mechanism considered in our study. More dramatically, the presence of strong nonlinearity can lead to bistable responses, which may be useful in robustly maintaining memories in the presence of noise (Toyoizumi, 2010) or lessening the need to fine-tune synaptic connection weights (Camperi & Wang, 1998; Goldman et al., 2003; Koulakov et al., 2002). Further work is needed to explore the possibilities offered by nonlinear networks and to develop analysis methodologies that allow a rigorous understanding of networks in which the conveniences offered by linear analysis no longer apply.
A.1. Relation Between Fisher Information and (Signal Gain)-to-Noise Ratio in Linear Systems.
The above relation is the Cauchy-Schwarz inequality, and the equality holds when ki ∝ gi/ci. For a nondiagonal noise covariance matrix, we can get the same result by changing to a coordinate system in which the covariance matrix is diagonalized; the condition for equality then becomes that the readout vector is aligned with the inverse noise covariance matrix applied to the signal gain vector.
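This equality condition can be checked numerically: for a linear readout kᵀx, the readout k ∝ Σ⁻¹g attains the Cauchy-Schwarz bound gᵀΣ⁻¹g exactly, and every other readout falls below it (the dimensions and covariance below are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 6

g = rng.standard_normal(N)                  # signal gain vector
A = rng.standard_normal((N, N))
Sigma = A @ A.T + N * np.eye(N)             # generic (nondiagonal) noise covariance

def snr(k):
    """(signal gain)-to-noise ratio of the linear readout k^T x."""
    return (k @ g) ** 2 / (k @ Sigma @ k)

k_opt = np.linalg.solve(Sigma, g)           # readout proportional to Sigma^{-1} g
bound = g @ np.linalg.solve(Sigma, g)       # Cauchy-Schwarz bound g^T Sigma^{-1} g

# The optimal readout achieves the bound with equality.
print(snr(k_opt), bound)
```

Scaling k_opt by any nonzero constant leaves snr unchanged, which is why only the direction of the readout matters.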
An alternative proof of the relation between IF and the SNR (not shown) can be obtained using the Cramér-Rao bound relationship between IF and the maximum likelihood estimator of the stimulus.
A.2. Analytic Expression for the Fisher Information in Attractor Networks.
A.2.1. Calculation of Fisher Information for Line Attractors.
Above, we showed general conditions under which the Fisher information of the line attractor networks is approximated by the signal-to-noise ratio along the attractor. For ease of computation, in the calculations below for the attractor networks, we consider only the case of (normal) networks in which all modes of the attractor networks are orthogonal. In this case, the left and right eigenvectors corresponding to a given eigenvalue are identical, and we use this common set of eigenvectors below.
A.2.2. Optimal Decay Time Constant for Line Attractors with or Without Reset.
A.2.3. Fisher Information for Plane Attractors.
A.2.4. SNR for Line and Plane Attractor Networks with a Linear Readout.
Here we calculate the memory performance of line and plane attractor networks if neural activity is linearly read out by projecting along a fixed readout direction.
In line attractors, the (signal gain)-to-noise ratio is independent of the choice of readout direction as long as it is not close to orthogonal to the attracting mode, since both the signal and the noise accumulate almost exclusively along the one-dimensional attracting mode. However, in the plane attractor networks, noise develops isotropically in the plane, and in order not to collect noise along the noninput direction, the readout direction should exactly match the input direction (see Figure 4).
A.3. Calculation of Noise Covariance Matrix of Feedforward Networks.
A.4. Analysis of Fisher Information of Attractor and Feedforward Networks with Reset.
In section A.2, we derived analytical expressions for IF for attractor networks as a function of the strength of network feedback (or, equivalently, network time constant). In this section, we provide an analysis of IF as a function of the signal gain achieved by the network (Ganguli, Huh et al., 2008). Specifically, here we provide analytical bounds for the Fisher information IF and show which types of attractor and feedforward networks achieve these bounds.
Moreover, the form of the upper-left submatrix of the connectivity matrix, that is, the transformation of the subspace Z to itself, is constrained by equation A.16. If we choose the signal gain direction as the first coordinate of Z and choose the remaining coordinates as vectors of Z orthogonal to it, then equation A.16 implies that, for any power n of the connectivity matrix, all columns except the first remain orthogonal to the first column.
Using the above considerations, it can be verified directly that the upper bound in equation A.16 is achieved by all orthogonal matrices, feedforward chains, and networks with a ring structure constructed by connecting the final element of a feedforward chain to the first (this list is not exhaustive; other matrices can also satisfy the bound).
Attractor networks can also satisfy the upper bound exactly or very closely. If we assume that all modes of the attractor network are orthogonal and the input is aligned to one of the modes, then the attractor networks satisfy equation A.16 with dim(Z) = 1, since by orthogonality the input has zero projection onto mode i for i ≠ 1. If the modes are not orthogonal but there exists a mode with strong recurrent feedback compared to the other modes, we can treat the activity in the other modes as negligibly small, so that the network performs similarly to a network with zero eigenvalues in the modes other than the attractor mode. Thus, the low-dimensional attractor networks also satisfy the upper bound closely.
A.5. Relation Between the Bounds on the Connectivity Strength Between Neurons and the Connectivity Strength Between Modes.
Similarly, it can be shown that the bound on the synaptic connectivity between neurons leads to bounds on the strengths of the feedforward connectivity between the Schur modes. For the feedforward networks, write the connectivity matrix W in Schur form W = UMU⊤, where the columns of U are the Schur modes and M is the lower triangular matrix of connectivity between modes. Since U is an orthogonal matrix for any Schur decomposition, the Frobenius norm of M is equal to that of W, which is at most Nwmax as in equation A.19. The Frobenius norm of the lower triangular matrix M is the square root of the sum of its squared entries mij. Furthermore, not all mij can be Nwmax at the same time. As discussed in section 3.2, these bounds lead to the result that the attractor networks with a reset outperform the feedforward networks for a given bound on synaptic strengths.
A.6. Optimal Network Structures When Neuronal Activity is Bounded.
A.6.1. Attractor Networks.
As discussed in the main text, we consider only the higher signal-to-noise regime in which the maximal Fisher information is obtained when the noise variance is not saturated. Here, we derive a simple estimate of when this regime is attained and show that the optimal value of the network feedback α is not different from that obtained without a bound on the noise variance. In this high signal-to-noise regime, α0 is greater than the value αopt = 1 − τ/(2T) (see Figure 8F, peak of dashed line) at which IF was maximized with only a constraint on the mean activity. Then, for α < α0, because the noise has not yet saturated, there is a maximum in IF at α = αopt of value c0²N/(σe²T) (see equation A.25). For α ⩾ α0, the maximal IF is given from equations A.26 and A.21 as c0²N/(Nc1r0²) = c0²/(c1r0²). This value is attained when the numerator of equation A.26 (the signal gain) has reached its maximal value, which occurs for α ⩾ 1. Comparing the expressions for the variance-saturating and nonsaturating regimes, we see that the maximal IF is achieved in the nonsaturating regime α < α0 when Nc1r0² > σe²T, that is, for small σ or large N. Furthermore, we note as claimed above that this maximum occurs at the same α = αopt that was optimal without considering the finite variance of activity.
To find the optimal τeff, we consider separately the cases of positive and negative τeff. For positive τeff (decaying attractor mode), the signal gain is maximal at t = 0 as in the case without reset, and IF is maximal for a perfect integrator (τeff = ∞) with the attracting mode pointing to one of the vertices of the hypercube so that the full dynamic range is used. For negative τeff (amplifying attractor mode), both the signal gain in the numerator and the noise variance in the denominator increase with τeff. However, the signal gain is limited by the constraint in equation A.28, so that the maximum of IF occurs when the noise variance is minimized. This occurs at τeff = −∞, corresponding to the perfect integrator. Combining the results for positive and negative τeff shows that the perfect integrator is optimal and its maximal IF is c0²N/(σ²T).
The optimal Fisher information for line attractors with a reset and a bound on the variability, as well as the mean, can be computed analogously to the case without a reset. Once the variance of the noise reaches the bound, then for sufficiently large N or small σ, the maximum of IF over α still occurs at the same value α = 1, which was optimal without considering the finite variance of activity.
In summary, we have shown that the optimal IF for the line attractor networks occurs for the perfect integrator. Moreover, the perfect integrator with a reset has the optimal memory performance of any continuous-dynamics network with a bounded firing rate. In section A.4, we found the upper limit of IF in terms of the magnitude of the signal gain vector (see equation A.17). The uniform bound on the mean firing rate sets the upper limit on the magnitude of the signal gain vector. Substituting this bound into the expression for the upper limit of IF, we obtain IF ≤ c0²N/(σ²T), which (comparing to the above) shows that the perfect integrator saturates the bound on memory performance.
A.6.2. Feedforward Networks.
Here, we calculate the optimal structure of feedforward networks when the mean activity is constrained. For analytical tractability, we perform this calculation under a discrete dynamics approximation so that, as noted in section A.4, IF for the feedforward networks achieves the equality in equation A.15.
The equality holds when the signal gain at every stage achieves its maximal bound, that is, when each mode of the feedforward network points to the vertices of the hypercube in the state space (see Figure 9B). It is not obvious that this condition can be attained given that the states of the functionally feedforward network are additionally required to be orthogonal to each other. Below, we show that at least for the case when N is a power of 2, we can construct N mutually orthogonal modes that point to the vertices of the hypercube and use this result to show more generally that the maximal Fisher information of functionally feedforward networks is of order N/T if N is at least twice the number of feedforward stages. We next present the proof of the existence of N orthogonal modes pointing to the vertices of the N-hypercube when N = 2^i, where i is a natural number:
Proof. We perform the proof by induction. For N = 2, (1, 1)^T and (1, −1)^T satisfy the condition. Now assume that there exist 2^i orthogonal modes whose N = 2^i elements are all either −1 or 1. Then for N = 2^(i+1), we can construct 2^(i+1) orthogonal modes from the orthogonal modes corresponding to N = 2^i as follows: for each mode v in the set for N = 2^i, form the two concatenated vectors (v, v) and (v, −v). The resulting 2^(i+1) vectors have entries ±1, and they are mutually orthogonal because the concatenated inner products either add two orthogonal inner products or cancel identical ones.
The above construction can also be used to show that for any N, the maximal Fisher information of functionally feedforward networks is of order N/T if N is larger than twice the number of feedforward stages l = T/τ. For general N, we can generate at least N/2 orthogonal modes by applying the above construction to the maximal power of 2 less than N. This creates an N/2-length feedforward network that uses more than half of the full dynamic range. As in the calculation leading to equation A.31, the Fisher information of such networks is of order (N/2)/T when the number of neurons N is greater than twice l (N/2 ⩾ l). Thus, for general N, the maximal Fisher information is still of order N/T for N ⩾ 2l. By contrast, for a continuous (rather than discrete) feedforward network, the spreading out of activity over time implies that it is impossible for the network to maintain a signal gain vector pointing to a vertex at all times. Therefore, the continuous feedforward networks, unlike their discrete counterparts, strictly cannot attain the maximal bound on the Fisher information.
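The inductive construction above is the standard Sylvester-Hadamard recursion and can be written in a few lines of code (the function name and the check for N = 8 are our illustrative choices):

```python
import numpy as np

def hadamard_modes(n):
    """Build n mutually orthogonal +/-1 modes for n a power of 2.

    Each doubling step takes the current set of modes V and forms the
    concatenations (v, v) and (v, -v), exactly as in the inductive proof.
    """
    assert n >= 1 and (n & (n - 1)) == 0, "n must be a power of 2"
    V = np.array([[1.0]])
    while V.shape[0] < n:
        V = np.block([[V, V], [V, -V]])
    return V

H = hadamard_modes(8)
# Every entry is +/-1, so each mode points to a vertex of the hypercube,
print(np.all(np.abs(H) == 1))
# and the modes are mutually orthogonal: H H^T = n I.
print(np.allclose(H @ H.T, 8 * np.eye(8)))
```

Each row uses the full dynamic range at every stage, which is what allows the discrete feedforward network to attain the N/T bound.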
This research was supported by a Sloan Foundation Research Fellowship, NIH grant R01 MH069726, and a UC Davis Ophthalmology Research to Prevent Blindness grant. We thank T. Toyoizumi, A. T. Sornborger, and E. Aksay for valuable discussions and D. Fisher and J. Ditterich for valuable discussions and feedback on the manuscript. This research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
An online supplement is available at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00234.