## Abstract

In short-term memory networks, transient stimuli are represented by patterns of neural activity that persist long after stimulus offset. Here, we compare the performance of two prominent classes of memory networks, feedback-based attractor networks and feedforward networks, in conveying information about the amplitude of a briefly presented stimulus in the presence of gaussian noise. Using Fisher information as a metric of memory performance, we find that the optimal form of network architecture depends strongly on assumptions about the forms of nonlinearities in the network. For purely linear networks, we find that feedforward networks outperform attractor networks because noise is continually removed from feedforward networks when signals exit the network; as a result, feedforward networks can amplify signals they receive faster than noise accumulates over time. By contrast, attractor networks must operate in a signal-attenuating regime to avoid the buildup of noise. However, if the amplification of signals is limited by a finite dynamic range of neuronal responses or if noise is reset at the time of signal arrival, as suggested by recent experiments, we find that attractor networks can outperform feedforward ones. Under a simple model in which neurons have a finite dynamic range, we find that the optimal attractor networks are forgetful if there is no mechanism for noise reduction with signal arrival but nonforgetful (perfect integrators) in the presence of a strong reset mechanism. Furthermore, we find that the maximal Fisher information for the feedforward and attractor networks exhibits power law decay as a function of time and scales linearly with the number of neurons. These results highlight prominent factors that lead to trade-offs in the memory performance of networks with different architectures and constraints, and suggest conditions under which attractor or feedforward networks may be best suited to storing information about previous stimuli.

## 1. Introduction

Short-term memory is thought to be maintained by patterns of neural activity that are initiated by a memorized stimulus and persist long after its offset. Because memory periods are relatively long compared to biophysical time constants of individual neurons, it has been suggested that network interactions can extend the time over which neural activities are sustained (Brody, Romo, & Kepecs, 2003; Durstewitz, Seamans, & Sejnowski, 2000; Major & Tank, 2004; Wang, 2001). However, the form of such interactions is currently unknown in most systems, and experimental and theoretical work has suggested a range of different network architectures that could subserve short-term memory.

A critical factor for robustly maintaining the memory of a stimulus is being able to resist the effects of noise that can accumulate over time. This is a particularly acute problem for the representation of analog values in memory. In many memory-storing paradigms during which neurophysiological recordings have been obtained (for example, see Aksay, Baker, Seung, & Tank, 2000; Goldman-Rakic, 1995; Robinson, 1989; Romo, Brody, Hernandez, & Lemus, 1999; Sharp, Blair, & Cho, 2001; Taube & Bassett, 2003), neurons have been shown to exhibit what appear to be continuously varying response levels that change in a graded manner with the stored stimulus value. With such analog representations, any noise-induced change in neural activity has the potential to affect the encoding of the stimulus. Thus, such networks are faced with apparently conflicting demands. On the one hand, the networks must be able to maintain the value of a signal in memory for long durations. On the other hand, the mechanism for performing this maintenance must keep the signal from being contaminated by excessive buildup of noise.

The most common models for how activity evoked by a transient stimulus is maintained over time are the so-called attractor networks. In attractor networks, individual neurons do not intrinsically maintain activity over long timescales and thus cannot in isolation store a memory. Instead, activity is maintained by positive feedback whereby neurons that are connected by excitatory or disinhibitory positive feedback loops maintain one another’s activity following the offset of the external drive provided by the stimulus. In such models, the network structure determines which patterns of activity can be sustained by positive feedback, and typically only a small, specially designed set of patterns can be maintained. These maintained patterns of activity are called attractors of the network dynamics, because perturbing the dynamics away from such patterns leads to a rapid return to the attractor. A number of models of analog memory storage have utilized attractor dynamics (for review, see Brody et al., 2003; Durstewitz et al., 2000; Major & Tank, 2004; Wang, 2001), and recent analyses of neocortical data provide suggestive evidence for such attractors in tasks involving a working memory component (Ganguli, Bisley et al., 2008).

Recently both theoretical models (Ganguli, Huh, & Sompolinsky, 2008; Goldman, 2009; Mauk & Buonomano, 2004; Rabinovich, Huerta, & Laurent, 2008; Savin & Triesch, 2009; White, Lee, & Sompolinsky, 2004) and experimental observations (MacDonald, Lepage, Eden, & Eichenbaum, 2011; Pastalkova, Itskov, Amarasingham, & Buzsaki, 2008) have suggested instead how purely feedforward networks can store the memory of a stimulus in their transient dynamics. Experimentally, a feedforward progression of neuronal activity has been reported in hippocampal neurons during memory delay periods (MacDonald et al., 2011; Pastalkova et al., 2008), and theoretical work suggests mechanistically how an analog signal can be represented over time by activity that slowly propagates through a feedforward chain of neurons or, in recurrent networks, through a sequence of distinct and nonoverlapping patterns of network activity (Ganguli, Huh, et al., 2008; Goldman, 2009; White et al., 2004).

Here, we compare the performance of attractor and feedforward models in the presence
of noise. Our work builds on the information-theoretic frameworks for quantifying
memory performance of White et al. (2004) and
Ganguli, Huh et al. (2008), who considered
the performance of linear neural networks with discrete dynamics (i.e., defined with
difference equations so that time is measured in discrete units that facilitate
analytic calculation). We measure memory performance by calculating the Fisher
information that is maintained about a transient stimulus at a time *T* into the future. Unlike previous work in neuronal systems
(but as in the fluid mechanics example of Ganguli, Huh et al., 2008), the networks we study are defined by differential
equations that consider the more realistic situation of continuous time dynamics.
However, to facilitate analytic calculations, we also, when appropriate, compare to
networks constructed with discrete dynamics.

The structure of this letter is as follows. First, in analogy to previous studies of linear networks with discrete dynamics, we analytically calculate the memory-storing performance of linear, continuous-time networks and determine the properties that optimize the Fisher information storage capacity of both attractor and feedforward networks. We then consider the effects of two nonlinearities suggested by neuronal recording data. First, we consider the effects on memory performance of reset mechanisms that, for example, remove noise from the system near the time of stimulus arrival (Churchland et al., 2010; Rajan, Abbott, & Sompolinsky, 2010; Weber & Daroff, 1972) or keep the network from entering the memory-storing state until the time of stimulus onset (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001). Second, we consider the effect of limiting neurons to having a finite range of firing rates with which they can encode a memorized stimulus.

## 2. Materials and Methods

In this letter, we compare the performance of attractor and feedforward network
models in maintaining the memory of a brief, analog-valued stimulus for a fixed or
known delay period *T* in the presence of noise. Here, we define the
dynamics of each network model, as well as the Fisher information used to quantify
the memory performance.

### 2.1. Linear Network Models.

We consider the storage in memory of the amplitude *s* of a briefly presented stimulus occurring at time *t* = 0. In the majority of this work, we employ a firing rate model with continuous linear dynamics described by

$$ \tau \frac{d\mathbf{r}}{dt} = -\mathbf{r}(t) + W\mathbf{r}(t) + \mathbf{e}\,s\,\delta(t) + \sigma\,\boldsymbol{\xi}(t), \tag{2.1} $$

where **r**(*t*) is a vector containing the firing rates of the *N* neurons in the network, each of which has intrinsic time constant *τ*. Inputs to the neurons include recurrent feedback from other neurons *W***r**(*t*), the pulse-like input whose strength *s* is to be remembered, and gaussian white noise **ξ**(*t*) of mean 0 and amplitude *σ* that is presented at all times (see Figure 1A). Here, the elements *W*_{ij} of the connectivity matrix *W* represent the strength of the synaptic connection from the *j*th to the *i*th neuron, and the elements of the input vector **e** indicate the relative weights of the inputs to each neuron. The input is presented as a transient pulse at time 0, modeled by a delta function δ(*t*).

Solving equation 2.1 gives the network activity at time *t* when the stimulus strength is *s* and noise starts entering the system at time *t*_{0}:

$$ \mathbf{r}(t) = \frac{s}{\tau}\, e^{(W-I)t/\tau}\,\mathbf{e} \;+\; \frac{\sigma}{\tau}\int_{t_0}^{t} e^{(W-I)(t-t')/\tau}\,\boldsymbol{\xi}(t')\,dt', \tag{2.2} $$

where *I* is the identity matrix. The corresponding mean and covariance matrix of the neural activities are

$$ \langle \mathbf{r}(t)\rangle = \frac{s}{\tau}\, e^{(W-I)t/\tau}\,\mathbf{e}, \tag{2.3} $$

$$ \mathbf{C}(t) = \frac{\sigma^2}{\tau^2}\int_{t_0}^{t} e^{(W-I)(t-t')/\tau}\, e^{(W^{\mathsf T}-I)(t-t')/\tau}\,dt', \tag{2.4} $$

which we evaluate at the memory recall time *T*. Note that the mean neural activity scales linearly with *s* and the magnitude of **e**. We set |**e**| = 1 except in section 3.3, where we consider the effects of imposing a finite dynamic range on neuronal responses. The covariance matrix scales linearly with σ², and we denote the integral factor in equation 2.4 by **N**(*t*). In linear networks with no reset, we set *t*_{0} to −∞ to account for noise building up continuously at all times before the stimulus onset. When considering networks in which the appearance of the stimulus resets the noise, *t*_{0} is set to 0, so that only noise presented after the stimulus onset affects memory performance.

To facilitate comparison with networks defined in discrete time, we rescale time by the intrinsic neuronal time constant, defining *t*′ = *t*/τ. Under this rescaling, the gaussian white noise satisfies mean(ξ_{i}(*t*_{1})ξ_{i}(*t*_{2})) = mean(ξ_{i}(τ*t*′_{1})ξ_{i}(τ*t*′_{2})) = δ(τ(*t*′_{1} − *t*′_{2})) = τ^{−1}δ(*t*′_{1} − *t*′_{2}). By discretizing time with time step Δ*t*′ and replacing the derivative of **r** with a finite difference, we obtain the discrete dynamics approximating the continuous dynamics as follows:

$$ \mathbf{r}(n) = W\mathbf{r}(n-1) + \mathbf{e}\,\frac{s}{\tau}\,\delta(n) + \sigma\,\boldsymbol{\xi}(n), \tag{2.5} $$

where *n* indexes time in steps of Δ*t*′ = 1. *W*, **e**, and σ are the same as in equation 2.1, and δ(*n*) and **ξ**(*n*) are the delta function and gaussian white noise in discrete time, respectively. The mean and covariance matrix for the above equation can be obtained as

$$ \langle \mathbf{r}(n)\rangle = \frac{s}{\tau}\, W^{n}\,\mathbf{e}, \tag{2.6} $$

$$ \mathbf{C}(n) = \frac{\sigma^2}{\tau}\sum_{k=0}^{n-n_0-1} W^{k}\left(W^{\mathsf T}\right)^{k}. \tag{2.7} $$

In equation 2.7, *n*_{0} replaces *t*_{0} in equation 2.4, and the power of τ in the denominator is reduced by one relative to that in equation 2.4 because the differential *dt*′ in equation 2.4 is set equal to τ in the discrete dynamics and therefore cancels one factor of τ in the denominator.
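As a sanity check on the discrete formulation, the update and its analytic mean can be simulated directly. The sketch below uses hypothetical parameters of our own choosing (a two-neuron network, τ set to 1, and the pulse absorbed into the first update, so the trial-averaged activity after *n* steps should match *s W*^{n−1}**e**):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-neuron network (tau = 1): symmetric excitation with
# eigenvalues 0.9 (common mode) and 0.0 (difference mode).
W = np.array([[0.45, 0.45],
              [0.45, 0.45]])
e = np.array([1.0, 1.0]) / np.sqrt(2)   # input aligned with the slow mode
s, sigma, n_steps, n_trials = 2.0, 0.1, 20, 20000

# Discrete dynamics: r(n+1) = W r(n) + e*s*delta(n) + sigma*xi(n)
r = np.zeros((n_trials, 2))
for n in range(n_steps):
    pulse = e * s if n == 0 else np.zeros(2)
    r = r @ W.T + pulse + sigma * rng.standard_normal((n_trials, 2))

# Trial-averaged activity matches the analytic mean s * W^(n-1) @ e.
mean_emp = r.mean(axis=0)
mean_th = s * np.linalg.matrix_power(W, n_steps - 1) @ e
```

Because the attracting eigenvalue is 0.9, the mean decays as 0.9^{n} while noise accumulates step by step, foreshadowing the signal-versus-noise trade-off analyzed in section 3.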

The evolution of the network activity under linear dynamics can be computed by decomposing the activity into linearly independent modes. Here, we consider two such decompositions and use them to characterize the dynamics of the attractor and feedforward network models in the absence of noise.

In attractor networks, positive feedback sustains the activity evoked by the
transient stimulus, for example, due to mutual excitatory connections between
neurons that form a positive feedback loop (see Figure 1B). To identify such positive feedback, the
eigenvector decomposition is commonly used to decompose the coupled networks
into noninteracting modes of activity that can be considered independently. In
the eigenvector decomposition, the pattern of neural activity at any given time is defined in terms of the eigenvectors **v**_{i} and corresponding eigenvalues *λ*_{i} of the connectivity matrix *W*, which satisfy the equation *W***v**_{i} = λ_{i}**v**_{i} for *i* = 1 to *N*.
Geometrically, the eigenvector decomposition corresponds to a change of basis
into a new coordinate system whose axes are defined by the eigenvectors **v**_{i}. In this new basis, the connectivity matrix is represented by a diagonal matrix Λ having the eigenvalues as diagonal entries, such that *W* = *V*Λ*V*^{−1}, where the column vectors of *V* are the eigenvectors. When the eigenvectors are orthogonal to
each other, *W* is known as a normal matrix. In this case, the activity in
each mode is equal to the Cartesian projection of the network activity onto that
mode, and there is no overlap among the activities in the different modes.

Activity in any eigenmode exhibits exponential growth or decay with a time
constant τ^{i}_{eff} = τ/|1 − Re(λ_{i})|. If λ_{i} = 1, activity is sustained without decay, and the mode can
integrate any input perfectly. If Re(λ_{i}) < 1, activity decays with a time constant that decreases as Re(λ_{i}) decreases, and for Re(λ_{i}) > 1, activity grows exponentially. Attractor networks are defined
by having a small number of modes (the attractor modes) with λ_{i}’s much larger than the other eigenvalues. For such networks,
activity in all except the attractor modes decays exponentially quickly to zero
so that after a transient period, the only remaining activity is along these
modes. The resulting subspace spanned by these modes is then called an attractor
of the network dynamics. We illustrate a simple attractor network consisting of
two symmetric excitatory neurons in Figure 1B. In such a network, the noninteracting modes correspond to the
sum and difference of the activities and are called the common and difference
modes (see Figure 1C). In the common mode,
which is proportional to the average activity in the network, activity evoked by
a transient input is maintained by mutual excitation of the neurons. By
contrast, because the symmetric mutual excitation tends to make the neurons fire
at equal rates, the difference mode is sharply attenuated by the network
interactions, leading to rapid decay of any initial activity in this mode. Thus,
after a transient period, only the activity along the common mode remains and
the common mode is called an attractor of the network dynamics. Generally if
there exist multiple modes with strong positive feedback, the signal can be
stored in any of these modes, and the network is called a *d*-dimensional attractor network, where *d* denotes the number of modes with large λ_{i}’s (see Figure 1D). In the
special case when *d* equals one or two, the attractor is called
a line or plane attractor, respectively.
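The two-neuron example of Figures 1B and 1C can be made concrete with a few lines of linear algebra. This sketch (with an assumed coupling strength *w* = 0.9 of our own choosing) recovers the common and difference modes and their effective time constants τ_eff = τ/|1 − Re(λ)|:

```python
import numpy as np

tau, w = 1.0, 0.9                  # intrinsic time constant, coupling strength
W = np.array([[0.0, w],
              [w, 0.0]])           # two mutually excitatory neurons

lam, V = np.linalg.eigh(W)         # symmetric W => orthogonal eigenvectors
tau_eff = tau / np.abs(1.0 - lam)  # effective decay time of each mode

# lam = [-0.9, 0.9]: the difference mode (eigenvector (1,-1)/sqrt(2))
# decays with tau/(1+w) ~ 0.53, while the common mode ((1,1)/sqrt(2))
# persists with tau/(1-w) = 10, i.e., 19 times longer.
```

Tuning *w* toward 1 stretches the common mode's time constant toward infinity (a perfect integrator) while the difference mode always decays quickly, which is exactly the attractor structure described above.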

Feedforward networks use a different mechanism for storing a signal. Rather than
maintaining a stable pattern of activity through positive feedback, as in the
attractor networks, the signal is carried by different neurons at different
times. For example, in feedforward networks composed of neurons connected as a
chain, the activity can be maintained as long as activity continues to propagate
along a chain in which the activity in the previous neuron is passed onto the
next neuron and filtered at each stage (see Figures 1E and 1G). The
feedforward networks cannot be decomposed into a full set of *N* eigenmodes because, by the definition of an eigenmode, the activity that starts
in an eigenmode remains in that mode (see Figure 1F, left). By contrast, the fundamental characteristic of the
feedforward networks is that the activities of all neurons except the final one
are passed onto the next neurons instead of being sustained.

The Schur decomposition is more suitable for describing feedforward networks
(Ganguli, Huh et al., 2008; Goldman, 2009; Murphy & Miller, 2009). Rather than diagonalizing the matrix *W*, the Schur decomposition changes to a basis in which *W* is triangular; that is, it decomposes any connectivity matrix
into orthogonal modes that can have both feedforward and self-connections, but
no feedback connections from later-stage neurons to earlier neurons. Formally,
the Schur decomposition transforms the matrix *W* into a lower triangular matrix *M* such that *W* = *UMU*^{T}, where the columns of *U* are the orthogonal modes, called Schur modes, and the values
of *M* along the diagonal are the eigenvalues of *W* (equivalently, *M* can be made into an upper triangular matrix; Horn &
Johnson, 1985). As in the eigenvector
decomposition for normal networks, the diagonal entries of *M* give the feedback of the Schur modes onto themselves (for
normal *W*, the Schur and eigenvector decompositions are identical). If *W* is nonnormal, then the Schur decomposition will contain
nonzero lower triangular entries that correspond to feedforward connections
between the Schur modes. In this case, activity may be transiently amplified as
it propagates through the feedforward connections between modes, even when all
the eigenmodes are decaying (i.e., when all λ_{i} < 1; Trefethen & Embree, 2005).
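Transient amplification in a nonnormal network can be illustrated with a toy two-neuron connectivity matrix (values chosen purely for illustration), using `scipy.linalg.schur`: both eigenvalues decay, yet the Schur form exposes a strong feedforward weight between modes and the activity norm transiently grows:

```python
import numpy as np
from scipy.linalg import schur

# Toy nonnormal connectivity: both eigenvalues equal 0.5 (decaying),
# but a large hidden feedforward weight couples the Schur modes.
W = np.array([[0.5, 0.0],
              [5.0, 0.5]])

T_schur, U = schur(W)        # W = U @ T_schur @ U.T, T_schur triangular
# diag(T_schur) holds the (decaying) eigenvalues; its off-diagonal entry
# is the feedforward connection between the orthogonal Schur modes.

# Iterating the decaying dynamics still transiently amplifies activity:
r = np.array([1.0, 0.0])
norms = [np.linalg.norm(r)]
for _ in range(10):
    r = W @ r
    norms.append(np.linalg.norm(r))
# norms rises well above 1 before eventually decaying toward 0.
```

This is the hallmark of nonnormal dynamics: no single eigenmode grows, but activity passing through the feedforward link between Schur modes is amplified before it dies away.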

Here, we consider two types of feedforward networks. First, we consider literally
feedforward networks for which the connectivity matrix *W* itself is lower triangular with zeros along the diagonal, so
that all connections are feedforward. Thus, the Schur mode patterns of activity
correspond to individual neurons (see Figure 1F, right). In particular, we consider a simple chain-like structure in which the connectivity between neurons is of the form *W*_{ij} = α > 0 for all *i* and *j* such that *i* = *j* + 1, and zero otherwise. For networks with many neurons arranged in a chain, the propagation of activity can continue for a duration proportional to the chain length, with each neuron’s activity peaking at a different time. With this diversity of temporal profiles of neural activity, the network can generate persistent activity with a simple readout that linearly sums the activities of the different neurons with appropriate weights (see Figure 1G; Goldman, 2009).
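A minimal simulation of such a chain (illustrative parameters of our own choosing: α = 1, Euler integration, pulse delivered as an initial condition on the first neuron) shows each neuron's activity peaking at a successively later time while a uniform linear readout of the whole chain stays roughly constant for a duration proportional to the chain length:

```python
import numpy as np

tau, alpha, N = 1.0, 1.0, 20
dt, t_max = 0.01, 30.0

# Chain connectivity: neuron j drives neuron j+1 with weight alpha.
W = np.diag(alpha * np.ones(N - 1), k=-1)

r = np.zeros(N)
r[0] = 1.0                        # pulse delivered to the first neuron at t = 0
readout = []
times = np.arange(0.0, t_max, dt)
for _ in times:
    readout.append(r.sum())       # uniform linear readout across the chain
    r += dt / tau * (-r + W @ r)  # Euler step of tau dr/dt = -r + W r

readout = np.array(readout)
# The summed readout stays near 1 for roughly N*tau, then falls off,
# even though each individual neuron's activity is transient.
```

Analytically, neuron *k* carries a gamma-function-shaped pulse peaking near *t* ≈ (*k* − 1)τ, and the uniform sum of these pulses approximates a step of duration *N*τ, which is the mechanism behind the persistent readout in Figure 1G.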

Second, we consider recurrent connectivity matrices *W* whose Schur decomposition has a feedforward (lower triangular, with zeros along the
diagonal) structure; we call these functionally feedforward networks because the
activity patterns defined by the Schur modes, rather than the neuronal activity
itself, propagate in a feedforward manner (Ganguli, Huh et al., 2008; Goldman, 2009; Murphy & Miller, 2009). As in the case of the literally feedforward networks, we
consider simple functionally feedforward chains of the form *M*_{ij} = α > 0 for all *i* and *j* such that *i* = *j* + 1, and zero otherwise. An example of a functionally feedforward chain is shown in Figures 1H and 1I, in which a two-neuron network with one excitatory and one inhibitory neuron is decomposed into common and difference modes by the Schur decomposition. The modes make a feedforward chain such that the activity of the difference mode drives the activity in the common mode (see Figure 1I). More neurons can form a longer functionally feedforward chain, allowing progression of activity patterns that persist for longer periods of time (see Figure 1J).
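The two-neuron excitatory-inhibitory example can be sketched explicitly. With an assumed connectivity (chosen for illustration) in which each neuron receives +*k* from the excitatory neuron and −*k* from the inhibitory one, both eigenvalues are zero, yet rotating into the common/difference basis exposes a purely feedforward link from the difference mode to the common mode:

```python
import numpy as np

k = 1.0
# Assumed circuit: column 1 is the excitatory neuron (+k),
# column 2 the inhibitory neuron (-k).
W = np.array([[k, -k],
              [k, -k]])           # trace 0, det 0 => both eigenvalues are 0

common = np.array([1.0, 1.0]) / np.sqrt(2)
diff = np.array([1.0, -1.0]) / np.sqrt(2)
U = np.column_stack([common, diff])

# In the (common, difference) basis the connectivity is strictly
# feedforward: only the difference -> common weight (2k) survives.
M = U.T @ W @ U                   # expected form: [[0, 2k], [0, 0]]
```

Even though neither neuron feeds back on itself through a slow eigenmode, an input that excites the difference mode is handed off to the common mode, stretching out the network's response in time.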

### 2.2. Fisher Information Measure for Memory Performance.

To achieve good memory performance, a network must maintain a memory of the
stimulus while resisting the excessive accumulation of noise. The ability to
achieve this can be quantified as the ratio between the amount of signal and
noise stored in the system at a given time following stimulus offset. Here we
use a closely related measure, the Fisher information *I _{F}*, which quantifies the amount of information carried about a signal by
the distribution of neural activities and which, for linear networks and
gaussian white noise, is shown below to represent a ratio between the factor by
which the network amplifies signals and the amount of noise accumulated by the
network (Ganguli, Huh et al., 2008).

To get an intuition for this measure, we show in Figure 2A how to compute the Fisher information for an
example of the activity of a single neuron (or a single eigenmode of an
attractor network) in the presence of noise. The neuron (or mode) must
distinguish between different pulse-like stimuli of amplitudes *s* and *s* + δ*s* that
it receives at time 0. Making this discrimination more difficult, noise is
presented to the neuron (or mode) continually in time (see Figure 2A). We model the transient stimulus as a delta
function δ(*t*) so that the stimulus causes a jump in the
mean neural activity at time 0, with size proportional to the stimulus strength *s* or *s* + δ*s* (see
Figures 2B and 2C, thick lines). Due to the noise, each presentation
of the stimulus leads to a different trajectory so that there is trial-to-trial
variability in the response (the black and gray noisy trajectories in Figure 2B).

The memory of the stimulus is carried by the distribution of the firing
activities of the neurons. In order to perform well in maintaining the
distinction among stimuli, the distributions for different stimuli must remain
well separated: the more the noise makes the two distributions overlap, the
greater will be the corruption of the stored memory. In linear networks, the
mean activities of the neurons (gray circle and black asterisk in Figure 2C) carry the information about the presented
stimulus, and the signal gain is measured as the difference in the mean
activities δ〈*r*〉 divided (i.e., normalized)
by the separation δ*s* of the signals to be discriminated
(see Figure 2D). The noise in the neural
activities is given by the spread in the firing rate distribution. The Fisher
information *I*_{F} conveyed by the network is defined as the ratio of the square of the signal gain to the noise variance at time *T*. Thus, either a wider separation between the means (high signal gain) or narrower distributions about these means (small noise variance) leads to higher Fisher information.

More generally, the memory of the stimulus is quantified by the Fisher information,

$$ I_F(T) = \left\langle -\frac{\partial^2}{\partial s^2}\,\ln P(\mathbf{r}(T)\mid s)\right\rangle, \tag{2.8} $$

where *P*(**r**(*T*) | *s*) is the distribution of neural activities at time *T* given stimulus amplitude *s*. Because the network dynamics are linear and the injected noise is gaussian, this distribution is gaussian with mean proportional to the stimulus strength, ⟨*r*_{i}(*T*)⟩ = *c*_{i}*s*, where the *c*_{i} are constants, and with covariance matrix **C**(*T*) independent of *s*. *I*_{F} is in this case given by

$$ I_F(T) = \left(\frac{d\langle \mathbf{r}(T)\rangle}{ds}\right)^{\mathsf T} \mathbf{C}(T)^{-1} \left(\frac{d\langle \mathbf{r}(T)\rangle}{ds}\right). \tag{2.9} $$

Here, *d*⟨**r**⟩/*ds* is the derivative of the mean with respect to *s*, called the signal gain. Thus, equation 2.9 shows that *I*_{F} is of the form of a (matrix) ratio between the signal gain squared and
the noise covariance. Note that σ²*I*_{F}(*t*) is independent of the stimulus strength *s* and injected noise level σ² and depends only on the properties of the network connectivity. Therefore, in the following, we calculate σ²*I*_{F} instead of *I*_{F} and often refer to σ²*I*_{F} as the Fisher information for brevity (see Figure 2D). Note that this quantity has the same units as σ² since *I*_{F} is unitless (and assuming that *s* is unitless).

In equation 2.9, the readout of the network activities is not specified. In particular, in a linear system with gaussian noise, it can be shown that the Fisher information is greater than or equal to the (signal gain)-to-(noise gain) ratio obtained from any linear readout of the network (see section A.1 in the appendix). The equality holds when the linear readout is along the direction given by applying the inverse of the noise covariance matrix to the signal gain (the optimal linear estimator in population decoding, as in Salinas & Abbott, 1994; Sompolinsky, Yoon, Kang, & Shamir, 2001). Note that in the feedforward networks, the optimal linear readout will generally vary over time, reflecting that information about the signal propagates from earlier to later stages in the feedforward chain.
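The relationship between the Fisher information and the best linear readout can be checked numerically. The sketch below (arbitrary example connectivity of our own choosing, noise taken to begin at stimulus onset, and σ² factored out) computes the signal gain and covariance for a discrete-time network and compares the (signal gain)²-to-noise ratio of different readout directions against the full Fisher information:

```python
import numpy as np

# Arbitrary example connectivity and input (illustrative values only).
W = np.array([[0.6, 0.3],
              [0.2, 0.7]])
e = np.array([1.0, 0.0])
n = 8    # recall time in steps; noise assumed reset at stimulus onset

# Signal gain d<r>/ds and covariance / sigma^2 after n steps.
gain = np.linalg.matrix_power(W, n - 1) @ e
C = sum(np.linalg.matrix_power(W, k) @ np.linalg.matrix_power(W, k).T
        for k in range(n))

IF = gain @ np.linalg.solve(C, gain)     # sigma^2 * Fisher information

def readout_snr(w):
    """(signal gain)^2-to-noise ratio of the linear readout direction w."""
    return (w @ gain) ** 2 / (w @ C @ w)

w_opt = np.linalg.solve(C, gain)         # optimal readout: C^-1 @ gain
# readout_snr(w_opt) equals IF; any other direction does strictly worse.
```

The equality at `w_opt` is the optimal-linear-estimator result cited above; reading out a single neuron instead discards information and yields a lower ratio.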

## 3. Results

Here we compare how attractor and attractorless (literally feedforward or functionally feedforward) models perform in storing the amplitude of a brief stimulus. The memory performance is measured by the Fisher information, a measure of how much the network amplifies the signal corresponding to the stimulus compared to how much it amplifies ambient noise (see section 2.2). In section 3.1, we consider purely linear networks that allow us to isolate how the structures of attractor and feedforward networks influence memory performance in the absence of nonlinear influences. Then we consider the effect of two biologically observed nonlinearities. In section 3.2, we consider a condition that we term a reset nonlinearity under which either noise does not begin to accumulate strongly until the memory period commences (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001) or in which noise that is present before the stimulus arrival is “reset” by the appearance of the stimulus (Churchland et al., 2010; Rajan et al., 2010; Weber & Daroff, 1972). In section 3.3, we consider the effects of restricting neurons to having a finite range of firing rates with which they encode a stimulus.

### 3.1. Linear Networks.

In this section, we compare the memory performance of attractor and feedforward network models with continuous linear dynamics. This allows us to focus on how signal and noise are propagated through the network as a function of the structure of the network connectivity without the complicating influence of nonlinearities. We first compute the Fisher information in line attractor networks and then extend our results to higher-dimensional attractor networks. Then we compute the Fisher information for feedforward networks and compare their performance to that of the attractor networks.

#### 3.1.1. One-Dimensional Attractor Networks.

We first consider the line attractor networks, which are defined by having only a single stationary or slowly decaying (or possibly growing) pattern of activity that defines the attracting mode (see Figures 1B and 1C). When stimulated by a brief stimulus, both signal and accumulated noise in line attractor networks quickly converge to this attractor. As a result, for times after the transient responses of the nonattractor modes decay away, all information conveyed by the line attractor networks is contained in the attractor mode, and we can closely approximate the Fisher information by the (signal gain)-to-(noise gain) ratio in this mode (see Figure 2).

The memory-storing performance of the line attractor models reflects a balance between two factors. First, the network must be able to sustain the signal for the full duration of the memory period. As shown in Figure 1C, this is accomplished in attractor networks by having sufficiently large positive feedback in the attracting mode. Second, the network must not accumulate too much noise over time. Since noise is assumed to be presented at all times, including prior to stimulus onset, this implies that inputs to the network should not be sustained indefinitely or noise will accumulate without bound. Thus, there is a trade-off in attractor networks between sustaining signals for sufficiently long to maintain signal strength and having enough decay of signals that noise does not accumulate excessively.

To quantify this trade-off between sustaining the signal and accumulating
noise, we examine the memory performance of the attractor network in terms
of the amount of positive feedback α in the attracting mode, where
α is the eigenvalue associated with the attracting eigenmode and the
time constant of decay (or growth, for Re(α) > 1) of activity in this mode is given by τ_eff = τ/|1 − Re(α)|. When the feedback is too weak (see Figure 3A), the signal decays quickly to zero and any memory of the initial stimulus amplitude *s* is forgotten. Thus, the Fisher information is close to zero in this case (see Figure 3D). Increasing the recurrent feedback leads to
slower decay of signals corresponding to the memorized stimulus, and when
the feedback is tuned to be large enough to offset the intrinsic leak of the
neurons (α ≈ 1), the mean responses to different amplitude
stimuli stay well separated (black and gray thick traces in Figure 3C). However, because noise along the
attractor mode is subject to the same dynamics as signals along this mode,
noise also accumulates without decay. Because noise is present at all times
before the stimulus arrives, this leads to an extremely large variance in
the responses (see Figure 3C; note the
wide spread of trajectories even before time 0). For networks that are
either nonforgetful (α = 1) or amplifying (α>1),
this noise becomes infinite in magnitude so that the Fisher information is zero (see Figure 3D). Thus, there is an optimal amount of feedback in linear
attractor networks, and a corresponding optimal time constant of decay of
network activity, that balances having a long time constant so that the
signal does not decay and having a short time constant so that noise does
not accumulate too much (see Figures 3B
and 3D).

This example shows that attractor network performance benefits from
having an imperfect memory-holding mechanism. To find the optimal forgetting
time constant of network activity decay, we analytically calculated the Fisher information for the line attractor networks (see sections A.2.1 and A.2.2). We find that it achieves its maximum when the decay time constant of
network activity is τ_{eff, opt} = 2*T*, where *T* is the
duration over which the signal is to be stored. Thus, the memory duration *T* sets the scale for the optimal network decay time.
When activity decays much faster than the memory duration *T*, the signal decays away before the end of the memory
period. When activity decays much more slowly than *T* or
grows exponentially, noise accumulation overwhelms the signal.
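This optimum is easy to verify numerically for a single decaying mode. Under the simplifying assumptions that the signal gain at time *T* decays as e^{−T/τ_eff} and that noise presented at all past times accumulates a variance proportional to τ_eff/2 (the steady state of a leaky integrator driven by white noise), scanning candidate time constants recovers the maximum at τ_eff = 2*T*:

```python
import numpy as np

T = 5.0                                   # memory delay
tau_eff = np.linspace(0.1, 40.0, 4000)    # candidate decay time constants

# Single decaying mode with noise present for all t < T (t0 -> -inf):
# signal gain ~ exp(-T/tau_eff), noise variance ~ tau_eff / 2.
fisher = np.exp(-2.0 * T / tau_eff) / (tau_eff / 2.0)

tau_opt = tau_eff[np.argmax(fisher)]
# tau_opt comes out at ~2*T, matching the analytic optimum.
```

Differentiating 2e^{−2T/x}/x with respect to x and setting the result to zero gives x = 2*T* directly, so the numerical scan and the calculus agree.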

#### 3.1.2. Higher-Dimensional Attractor Networks.

We next extend the result for line attractor networks to higher-dimensional attractor networks having many modes with slow decay of activity. In line attractor networks, the input is stored along the one-dimensional attractor. In higher-dimensional attractor networks, signal and noise can accumulate in any direction spanned by the multiple attractor modes. We show that these extra dimensions do not affect the Fisher information of networks with optimally arranged inputs and readout. However, for imperfectly arranged inputs or outputs, we find that the memory performance of the line attractor networks is sensitive to the input direction but insensitive to the readout direction. In contrast, higher-dimensional attractor networks are more sensitive to the readout but less sensitive to the input direction.

For illustration, in Figure 4 we compare the line attractor networks to plane attractor networks defined by having two slowly decaying modes of activity. To convey geometrical intuition, we plot only the two modes with the largest eigenvalues and assume all the eigenmodes are orthogonal. For simplicity, we assume that each attracting mode has the same eigenvalue, so that all directions in the attracting plane have equal decay times and the signal can be stored equally well in any direction on the plane. Likewise, noise accumulates in the same manner in any direction on the plane. Then, when gaussian white noise is presented equally to all neurons and thus to all orthogonal modes (see Figure 4A), the resulting noise at any time is also equivalent in all directions of the plane (see Figure 4C). By contrast, in the line attractor networks, noise along directions other than the line attractor is filtered out so that noise along the attracting mode has a larger variance than that along the other modes (see Figure 4B).

To maximize the strength of the signal carried by attractor networks, the inputs should be arranged so that none of the input is lost due to being sent into decaying modes whose amplitudes quickly fall to zero. In line attractor networks, this corresponds to aligning the input direction along the direction of the attracting eigenmode (see Figure 4E). When there is more than one attracting mode, as in the plane attractor networks, the optimal input direction can be along any linear combination of these attracting modes (see Figure 4F). The Fisher information is proportional to the square of the projection of the signal onto the attracting modes. Thus, the Fisher information is identical and equal to its maximal value for both the line and higher-dimensional attractors as long as the input is aligned along the subspace defined by the attracting eigenmodes (see Figure 4D, θ = 0). If the input vector is not aligned along the attracting modes, a portion of the signal is lost to the decaying modes and the Fisher information decreases from this maximum value. In the line attractor, there exists only a single attracting mode storing the signal, so the Fisher information decreases as the input direction deviates from the attracting mode (see Figure 4D, solid curve). On the other hand, the Fisher information stays the same in the plane attractor networks for any input direction in the attracting plane (see Figure 4D, dashed line). Note that the Fisher information in the plane attractor networks would decrease as in the line attractor networks if the input were to deviate from the attracting plane (not shown). However, because the dimension of the plane attractor is higher than that of the line attractor, the alignment of the input vector is less constrained in the plane attractor (or, more generally, higher-dimensional attractor) networks.

Next, we consider the arrangement of the readout for maximal memory performance. As discussed in section 2.2, the Fisher information measure implicitly assumes an optimal readout because the Fisher information is equal to the (signal gain)-to-(noise gain) ratio along the optimal linear readout direction. However, for a nonoptimal readout, the memory performance may be less than this maximum, and the sensitivity to the direction of the readout may differ between the line and plane attractor networks (see section A.2.4).

In the line attractor networks, mistuning of the readout does not have much effect on the (signal gain)-to-(noise gain) ratio because the signal and noise accumulate along the one-dimensional attracting mode and their ratio is maintained for the projection onto any readout direction (see Figure 4H). Thus, the memory performance of line attractor networks remains near its maximal value even when the readout direction is well away from the attractor mode (see Figure 4G, solid line). Only when the readout direction becomes close to orthogonal to the attractor direction, so that the signal becomes smaller than or comparable to the small but finite noise accumulated in the nonattracting modes, does the memory performance fall off by a significant amount. By contrast, in plane attractor networks, the memory performance is far more sensitive to the direction of the readout (see Figure 4G, dashed line). Because noise develops along all attractor dimensions but the signal lies only along the direction defined by the input, projections that are not along the input direction pick up additional noise and lower the (signal gain)-to-(noise gain) ratio (see Figure 4I). Hence, optimal performance requires a more precise readout mechanism in higher-dimensional attractor networks.
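The differing readout sensitivity can be illustrated with a toy signal-to-noise calculation. The variances `sig_big2`, `sig_small2`, and `sig2` below are illustrative placeholders of our own choosing, not values from the paper.

```python
import numpy as np

# Toy calculation of the (signal gain)-to-(noise gain) ratio along a
# readout direction rotated by phi from the stored-signal direction.
# In the line attractor, signal and noise both lie along the attracting
# mode (variance sig_big2), with only a small variance sig_small2 in
# other modes, so signal and noise shrink together under rotation.
# In the plane attractor, the in-plane noise is isotropic (variance
# sig2), so rotating the readout loses signal without losing noise.
def snr_line(phi, S=1.0, sig_big2=1.0, sig_small2=1e-3):
    signal = (S * np.cos(phi)) ** 2
    noise = sig_big2 * np.cos(phi) ** 2 + sig_small2 * np.sin(phi) ** 2
    return signal / noise

def snr_plane(phi, S=1.0, sig2=1.0):
    return (S * np.cos(phi)) ** 2 / sig2

print(snr_line(0.0), snr_line(1.0))    # nearly unchanged
print(snr_plane(0.0), snr_plane(1.0))  # falls as cos(phi)**2
```

As in Figure 4G, the line attractor's ratio stays flat until the readout is nearly orthogonal to the attractor, whereas the plane attractor's ratio falls off immediately with misalignment.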

In summary, both the line attractor and higher-dimensional attractor networks were shown to have the same maximal memory performance, as characterized by the Fisher information. For the line attractor networks, memory performance was highly insensitive to the readout direction but more sensitive to the direction of the input. Conversely, for higher-dimensional attractor networks, the memory performance was highly sensitive to the readout direction but less sensitive to the direction of the inputs. These results suggest that line attractor networks might be more useful if the stored memory needed to be used by multiple networks that each projected activity along a different direction. By contrast, higher-dimensional attractors might be more useful in storing memories that arrive from multiple input networks, each sending in a different input pattern encoding the stored variable.

#### 3.1.3. Feedforward Networks.

Next, we compute the memory performance of linear feedforward networks and focus on networks with a chain-like structure that were proposed recently as a neural substrate for short-term memory storage (Ganguli, Huh et al., 2008; Goldman, 2009; White et al., 2004). A critical difference between feedforward and attractor networks is that, unlike in attractor networks, activity in feedforward networks eventually exits out the end of the network. Thus, the memory of any input is lost after some finite time in feedforward networks. Although this finite time of signal propagation might at first seem to be disadvantageous, finite memory duration can be advantageous because it prevents noise from building up in the network (Ganguli, Huh et al., 2008). These relative advantages and disadvantages are quantified below, where we compute the Fisher information conveyed by linear feedforward networks and compare their performance to that of attractor networks.

Here, we consider simple feedforward chains having uniform strength *α* of the feedforward connections and compute the Fisher information as a function of *α*. When the
strength of the connections is weak, the activity decays before it reaches
the last stage and the Fisher information is close to zero (see Figures 5A and 5C).
On the other hand, if the feedforward connections are stronger, the signal
decays more slowly (for α < 1) or can grow exponentially (for
α>1). Noise entering the system at any given time similarly
gets amplified as it passes down the chain. However, unlike in the attractor
networks, accumulation of noise in the feedforward networks is limited
because noise exits the system when it reaches the end of the chain.
Moreover, if signals are amplified along the feedforward chain, then inputs
entering the first stage of the network get amplified more than inputs
entering later stages. Thus, by arranging to have the signal enter the
network at the first stage, the network can make the signal at time *T* arbitrarily larger than the noise entering at later
times by using strong connection strengths α that allow the signal to
be amplified faster than noise enters the system (see Figure 5B). This implies that the Fisher information can increase indefinitely with increasing *α*, so that linear feedforward networks could in principle convey signals to arbitrary precision (see Figure 5C; for how this result changes when neurons have a finite dynamic range, see section 3.3). This result is consistent with that of Ganguli, Huh et al. (2008), who also showed a monotonic increase of the Fisher information with *α* in models with discrete dynamics.
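As a rough illustration of why stronger feedforward connections help, the following Monte Carlo sketch simulates a hypothetical discrete-time chain (fresh unit-variance noise entering every stage at every step, readout from the last stage); the model details are our simplifying assumptions, not the paper's exact equations.

```python
import numpy as np

# Monte Carlo sketch of a hypothetical discrete-time feedforward chain:
# activity is passed down the chain with gain alpha at each step, fresh
# unit-variance noise enters every stage at every step, and activity
# that passes the last stage exits the network.  The readout is the
# last stage after T - 1 steps; (mean gain)^2 / variance plays the
# role of the Fisher information.
def readout_snr(alpha, T=10, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    gains = np.empty(trials)
    for k in range(trials):
        x = np.zeros(T)
        x[0] = 1.0                        # unit signal enters stage 0
        for _ in range(T - 1):
            x = alpha * np.roll(x, 1)     # pass activity to next stage
            x[0] = 0.0                    # nothing re-enters stage 0
            x += rng.normal(0.0, 1.0, T)  # fresh noise at every stage
        gains[k] = x[-1]
    return np.mean(gains) ** 2 / np.var(gains)

print(readout_snr(1.0))  # ~ 1/(T-1): accumulated noise limits precision
print(readout_snr(1.5))  # larger: signal amplified faster than noise
```

Because the signal enters the first stage, it is amplified over the full chain length, while noise entering later stages sees fewer amplification steps; increasing `alpha` therefore raises the signal-to-noise ratio without bound in this linear model.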

Comparison with the attractor networks reveals two important features of the
feedforward networks that reflect the advantages and disadvantages of having
finite memory duration (see Figure 5D).
First, because the feedforward networks can transiently amplify signals over
the memory period but still remove noise due to the eventual exiting of
signals from the chain, these networks can greatly outperform the attractor
networks. Second, for smaller values of *α*, the
attractor networks outperform the feedforward networks. This latter result
reflects the smearing out of signals by the continuous dynamics (see section 3.2 and Figure 6) and differs from that found when comparing
feedforward and attractor networks with discrete dynamics (Ganguli, Huh et
al., 2008), in which the feedforward
networks outperformed the attractor networks for all settings of *α*.

In summary, with continuous buildup of noise, short-term memory networks need to forget to prevent the excessive accumulation of noise. In feedforward networks, this forgetting mechanism is inherent in the finite length of the feedforward chain, and the networks can amplify signals transiently without noise building up in an unbounded manner. In contrast, in attractor networks, the duration of signal and noise accumulation is not limited, and a perfect memory holding mechanism is inferior to a forgetful one in which signal decay and noise accumulation are optimally balanced. Comparing the purely linear attractor and feedforward network models, we find that the feedforward networks can outperform the attractor models due to their ability to transiently amplify signals without building up excessive noise. In the following sections, we show how these results may change in the presence of select, biologically motivated nonlinearities.

### 3.2. Networks with a Reset Mechanism.

In the previous section, we found that the feedforward networks stored more
information than the attractor networks because they could amplify the signal
without infinite buildup of noise. In contrast, the attractor networks needed to
be forgetful in order to prevent infinite noise buildup. However, what if the
accumulation of noise before the memory period is limited, or there exists a
mechanism to reset the network state near the onset of the signal? Recent
experimental studies in several cortical regions showed that variability in
neural activity is reduced with stimulus onset (Churchland et al., 2010), and theoretical work suggests this
may be a general feature of many nonlinear recurrent networks (Rajan et al., 2010). Alternatively, networks may
not switch into a memory-storing state that accumulates noise until close to the
start of the memory period; for example, such a switch may occur due to a change
in network state triggered by attention or neuromodulation (Amit & Brunel, 1997; Durstewitz et al., 2000; Wang, 2001). Finally, if a network receives feedback about its deviation
from a desired level and is able to correct these errors, then infinite buildup
of errors is also prohibited. For example, in the oculomotor system, drift in
the networks that control eye position triggers corrective saccades that can
correct errors caused by accumulation of noise or systematic drift of network
activity (Weber & Daroff, 1972).
Motivated by these examples, we here consider the effect of allowing a network
to reset its activities with the arrival of a signal and remove previously
accumulated noise. Note that the level of spontaneous activity before the memory
period can differ between these different reset mechanisms, being low even
before stimulus arrival if there is a stable low-rate spontaneous state (Amit
& Brunel, 1997; Durstewitz et al., 2000; Wang, 2001) or being higher during spontaneous activity and
reduced only at stimulus onset (Rajan et al., 2010; Churchland et al., 2010); however, for any reset mechanism, the variability of network
activity would be low at the beginning of the memory period. For simplicity, we
implement this “reset nonlinearity” by setting the noise to zero
at the time of signal arrival, so that noise accumulates only during the memory
period of duration *T*.

First, we consider the attractor networks. Before the signal arrives, noise
accumulates, and this accumulation can grow without bounds along any nondecaying
(α ⩾ 1) modes of the network (see Figures 6A and 6B).
However, at the time of signal arrival, the reset mechanism quenches the neural
activities to zero. Therefore, only noise presented after *t* = 0 degrades the memory performance, and unlike in the attractor networks
without reset, perfectly integrating or exponentially growing modes can convey
information about the signal. In fact, the Fisher information monotonically increases with increasing *α* (see Figure 6C and section A.2.2), showing that memory performance is enhanced by amplifying signals in the network. This result can be understood by recalling that the signal is presented only at time *t* = 0, whereas noise is presented equally during the entire memory period: by amplifying the input over time, more
weight is given to inputs at earlier times, allowing the signal to be amplified
faster than noise enters the system. In the limit of infinite signal
amplification, the signal can be made arbitrarily larger than the noise, so that
the Fisher information approaches infinity and signals can be discriminated with
perfect precision.
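In a simple discrete-time caricature of a single attractor mode with reset, the Fisher information has the closed form below. This toy model is our assumption, intended only to reproduce the monotonic trend with *α*, not the paper's full calculation (section A.2.2).

```python
# Closed-form toy model (our assumption): a single discrete-time
# attractor mode x(t+1) = alpha * x(t) + noise, with the reset ensuring
# that noise of variance sigma2 accumulates only over the T steps of
# the memory period.  Fisher information ~ (signal gain)^2 / noise var.
def attractor_info(alpha, T=10, sigma2=1.0):
    signal_gain = alpha ** T
    noise_var = sigma2 * sum(alpha ** (2 * k) for k in range(T))
    return signal_gain ** 2 / noise_var

for a in (0.9, 1.0, 1.2):
    print(a, attractor_info(a))  # increases monotonically with alpha
```

For large *α*, the ratio approaches α² − 1, so amplification makes the signal arbitrarily larger than the finite post-reset noise.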

In feedforward networks, the reset mechanism also enhances memory performance by removing noise accumulated prior to the stimulus onset (see Figures 6D and 6E) and thereby increases the Fisher information relative to a network without reset (see Figure 6F). However, the increase in the Fisher information for the feedforward chain network is not nearly as large as that in the attractor networks. This reflects the fact that, even without an externally imposed reset, the finite length of the feedforward chain already provides a mechanism for removing noise because noise exits the system when it reaches the end of the chain.

By comparing the performance of the attractor and feedforward networks, we find that the attractor networks perform better than the feedforward networks when there exists a reset mechanism (see Figure 6I). To understand what factors contribute to this result, we first consider the case of networks with discrete dynamics (Ganguli, Huh et al., 2008). In feedforward chains exhibiting discrete dynamics, all activity at one stage of the chain passes in the next time step to the following stage. When α = 1 (see Figure 6G), activity is passed from neuron to neuron in discrete time steps without loss of amplitude. Thus, there is no smearing out of activity across neurons as the activity progresses through the feedforward chain. In this case, the Fisher information for the feedforward and attractor networks is identical (see section A.4), reflecting a deep similarity between the feedforward and attractor networks: whereas in attractor networks, activity at each time step is sent from a given neuron (or mode) onto the same neuron (or mode), in feedforward networks, the activity is similarly propagated over time, but instead from one neuron to the next (see Figure 6G; for discussion of a more general mathematical framework that formalizes the similarity between feedforward and attractor networks, based on pseudospectral analysis (Trefethen & Embree, 2005), see the supplement of Goldman, 2009).

The discrete dynamics example suggests that the key factor explaining the poorer
performance of feedforward networks with continuous dynamics is the spreading of
activity across neurons or modes that occurs in the continuous feedforward
networks. This spreading has two effects. First, it reduces the amplitude
(vector length) of the signal carried by the network by spreading activity
across different neurons. To understand this, note, for example, that dividing a
signal equally among two neurons, so that the activity can be described by a
vector (*s*/2, *s*/2), reduces the vector
amplitude of the signal by a factor of √2 compared to when the entire signal is carried by a single
neuron, that is, (*s*, 0). This loss of signal is evident in
Figure 6H, which shows how the signal gain
decreases over time in the continuous feedforward networks (dashed curve) but is
maintained at a constant level in the discrete feedforward networks (circles).
Second, the spreading of activity causes activity to exit the network before the
end of the memory period. This is evidenced by the dip in signal gain seen near
the end of the memory period (*T* = 2) in the same figure.
Due to both the spreading of the signal and the loss of signal out the end of
the chain, the Fisher information for continuous-time feedforward networks becomes lower than
that of the attractor networks (see Figure 6I).
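The factor-of-√2 spreading argument can be checked directly:

```python
import numpy as np

# Numeric check of the spreading argument: dividing a signal s equally
# between two neurons shortens the signal vector by a factor of sqrt(2).
s = 1.0
concentrated = np.array([s, 0.0])
spread = np.array([s / 2, s / 2])
ratio = np.linalg.norm(concentrated) / np.linalg.norm(spread)
print(ratio)  # sqrt(2) ~ 1.414
```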

In the example above, we showed that the attractor network outperforms the
feedforward network with the same strength between the modes. However, a more
direct biological constraint is to compare the attractor and feedforward
networks when the maximum connection strength between neurons is held fixed. In
this case, we again find that the optimal attractor network outperforms any
(literally or functionally) feedforward network. The proof is given in section A.5. There, we show that if the
maximal synaptic strengths between the neurons are bounded by *w*_{max} and the eigenmodes or Schur modes are orthogonal to each other,
the connectivity strength between the modes is bounded and given by *Nw*_{max}. For the attractor networks, we find a network that reaches this
bound. By contrast, the feedforward networks cannot achieve this bound for all
connections between modes. Further, even if we assume there exists a feedforward
network with all connections between modes set to *Nw*_{max}, the previous comparison of memory performance for a given
connectivity between the modes shows that the attractor network still
outperforms the feedforward network when both networks have connectivity
strength *Nw*_{max} between modes (see Figures 6H and 6I). Thus, the optimal
attractor network outperforms any feedforward network when the synaptic
connectivity between the neurons is bounded. Alternatively, we can also consider
the constraint that the total postsynaptic weight is bounded. We note that at
least for the case of excitatory (literally feedforward or excitatory attractor)
networks, the maximal memory performance under this constraint corresponds to
the previous result, in which there were fixed connections between modes (see
Figures 6H and 6I). The optimal feedforward networks use this maximal
connection strength *w*_{postsynaptic, max} between all neurons, and the
optimal attractor networks have a maximum eigenvalue *w*_{postsynaptic, max}. Thus, the optimal attractor
networks outperform the optimal feedforward networks.

In summary, in this section we considered the effect on memory performance of reset mechanisms that remove accumulated noise at the time of signal arrival. With such a reset, the attractor networks could amplify signals without having the buildup of noise prior to signal arrival degrade memory performance. Moreover, for a given level of amplification between the neurons or modes, the attractor networks perform better than the feedforward networks because activity in the feedforward networks spreads out along the chain and is lost when it exits the end of the chain.

### 3.3. Bounds on the Neuronal Activity.

In the previous sections, we found that the networks exhibiting the best memory performance depended on strong amplification of signals that led to large and potentially unbounded growth of network activity. However, unbounded amplification of activities is not possible since neurons have a limited dynamic range. This limited range assumes several forms. Biophysically, there are absolute limits on the maximal firing rates that neurons can achieve (typically in the hundreds of Hz) due to postspike refractoriness. Additionally, neurons have been suggested experimentally (Baddeley et al., 1997) to have constraints on the average firing rates they can assume over long time periods. During working memory periods, most neurons do not sustain average firing rates beyond several tens of Hz, even though trial-to-trial fluctuations may be much larger than this for brief periods of time.

Given these observations, in this section we consider the effects of imposing constraints on the range of firing rates with which neurons encode signals in memory. Throughout much of the discussion, we confine ourselves to limits on the mean firing rates attained over the course of the memory period. This constraint is motivated by the observation that memory neurons typically have much lower (trial-averaged) mean firing rates than are allowed by their moment-to-moment biophysical constraints, and for analytical tractability is implemented by adjusting the inputs of an otherwise linear network such that the activity never exceeds hard bounds on the mean rates. Then, in order to gain insights into the effects of constraints on the absolute size of neuronal fluctuations, we consider what happens when we additionally apply a hard bound on the variance of firing rates about the mean.

#### 3.3.1. Effects of Bounded Rates on the Form of External Inputs.

Before considering the effects of a finite dynamic range of firing activity
in specific networks, we investigate the constraints it places on the form
of the input vector. Since the input vector drives the mean firing activity
in linear networks (see equation 2.3), putting a limit on the mean firing rate correspondingly
constrains the input vector. Note that this differs from our treatment of
networks with unconstrained firing rates, for which we normalized the input magnitude to 1 because the Fisher information for all networks simply scaled with the square of the input magnitude and could be made arbitrarily large by increasing it (see equation 2.9, and compare to Ganguli, Huh et al., 2008, who assumed that the input magnitude is still limited to 1 under a similar constraint on the dynamic range). To implement the constraint on mean firing rates, we assume
that each neuron has its mean (absolute) activity bounded by a maximal value *r*_{0} (where negative rates can be considered as the firing rate of
an “anti-neuron” with opposite stimulus preference; Shadlen,
Britten, Newsome, & Movshon, 1996). This is illustrated geometrically in Figure 7A, which shows that the hard limits on the mean
firing rate define a hypercube (a square in 2D) in the space of possible
firing rates. To stay within these limits, the magnitude of the input vector
must be set such that the mean firing rate of any given neuron never exceeds
its bound. As we show further, this leads to different maximal amplitudes of
the input vector for different network architectures.

The constraint on the mean firing rate has immediate implications as well for
the spatial pattern of inputs that are conveyed most faithfully by the
network. Given the limitation on how much information any given neuron can
convey with its limited dynamic range, the maximal information carried by a
network is achieved when all neurons are used and each of these neurons uses
its full dynamic range. When this idea is represented geometrically,
information storage is maximized if the attracting or Schur modes of the
networks lie along the directions pointing to the vertices of the hypercube
that defines the maximal range of mean responses (see Figure 7B, open circles). With this arrangement, the
strength (vector length) of signals conveyed by the networks is proportional
to √*N*, illustrating the benefits of having more neurons in the
network when each individual neuron has limited dynamic range. As shown in
section A.6, this scaling leads
to the Fisher information for the best attractor and feedforward networks
scaling with the network size *N* (see Figure 8G).
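The √N scaling of the signal length follows directly from the geometry of the hypercube:

```python
import numpy as np

# Geometry of the mean-rate bound: with each neuron's mean rate bounded
# by r0, the strongest signal patterns point to hypercube vertices such
# as (r0, ..., r0), whose vector length is r0 * sqrt(N).
r0 = 1.0
for N in (4, 16, 64):
    vertex = r0 * np.ones(N)
    print(N, np.linalg.norm(vertex))  # r0 * sqrt(N): 2, 4, 8
```

Since the Fisher information scales with the square of the signal length, this vertex arrangement underlies the linear scaling with *N* shown in Figure 8G.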

#### 3.3.2. Attractor Networks with Finite Dynamic Range.

We first consider the performance of attractor networks with a limited range of mean firing rates and no reset nonlinearity. In this case, our results follow closely those found for the linear networks of section 3.1: if the decay time constant of the network is too small, the signal decays to zero before the end of the memory period (see Figure 8A, bottom trace, and Figure 8B, probability distribution of activity in this mode). By contrast, if the network does not exhibit decay or decays too slowly, then noise builds up to the point that the signal becomes overwhelmed by noise (see Figures 8A and 8B, top traces). To perform optimally, the network must balance signal decay and the accumulation of noise, and we find an optimal time constant of network decay that achieves this balance (see Figures 8A and 8B, middle traces; in Figure 8E, the dotted line shows memory performance as a function of α). We note that this result is identical to that found in section 3.1 for networks with no bounds on the mean firing rates. This identical result reflects that, due to the need to remove noise through decay of network activity, the activity of the network never needs to be constrained by the limited dynamic range. However, the limited dynamic range does bound the total information that can be conveyed by the network because it constrains the amount of input that the network can receive at the time of the stimulus.
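This balance can be made concrete with an illustrative single-mode calculation. The model below is our simplified assumption (signal gain e^(−T/τ) at the end of the period, stationary noise variance proportional to τ), not the paper's full derivation (sections 3.1 and A.2); under these assumptions the optimum falls at τ = 2T.

```python
import numpy as np

# Illustrative single-mode calculation (our simplified assumptions): a
# mode with decay time constant tau has signal gain exp(-T/tau) at the
# end of a memory period T, while its stationary noise variance grows
# in proportion to tau.  Maximizing exp(-2T/tau) / tau over tau
# balances signal decay against noise accumulation.
T = 1.0
taus = np.linspace(0.1, 10.0, 10000)
info = np.exp(-2 * T / taus) / (taus / 2)
best_tau = taus[np.argmax(info)]
print(best_tau)  # close to 2 * T
```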

When there is a reset nonlinearity at the time of signal arrival, the optimal
strength of network feedback does change compared to that obtained without a
limited dynamic range. Recall that, without limits on the dynamic range (see
section 3.2), the optimal networks
were found to have strong feedback (α>1) so that they could
amplify their signals faster than noise entered the system. With a finite
dynamic range, unconstrained amplification of activity is no longer
possible. Figure 8C illustrates mean
trajectories of three modes with different recurrent feedback *α*, or, equivalently τ_{eff}, corresponding to decaying (τ^{i}_{eff}>0), perfectly integrating (τ^{i}_{eff} = ∞), and exponentially growing modes (τ^{i}_{eff} < 0), respectively. Compared to the decaying mode, the perfectly
integrating mode performs better because it maintains the signal faithfully
yet has only a finite buildup of noise due to the reset at time *t* = 0. For the amplifying mode that exhibits
exponential growth (the increasing trajectory in Figure 8C), we set the input amplitude to a value such that activity propagates linearly through
the network until, at the end of the memory period, it just reaches the
limit of the dynamic range. Thus, the maximal signal that can be carried by
the network is identical to that obtained in the perfectly integrating mode.
However, due to the amplification, noise accumulates faster than in the
perfectly integrating mode, resulting in a larger variance in the neuronal
firing rates (see Figure 8D). Thus,
with a finite dynamic range and reset of activities with arrival of the
signal, we find that perfectly sustaining the activity during the memory
period is optimal (see Figure 8E; see
section A.6 for the detailed
calculation).

More generally, in both the networks with and without a reset nonlinearity,
we find that the Fisher information for optimally tuned networks increases linearly with the
number of neurons *N* (see Figure 8G) and decreases inversely with the memory
duration *T* (see Figure 8H). Thus, it scales with *N*/*T*.
The former result reflects that more neurons allow more signal to be carried
by the network, as discussed in the preceding section. This latter result
reflects the accumulation of the continually presented noise, which results
in a linear increase in noise variance over the memory period (see section A.6).

Note that the constraint on the mean firing rate alone may allow infinite accumulation of noise, which is not biologically plausible. Thus, in addition to constraining the mean activity, we further consider bounds on the variance of neural activity. For analytical tractability, we place a simple bound on the maximal variance of activity without affecting the underlying dynamics: if the variance of activity obtained from the dynamics exceeds the bound, the variance is simply set to the bound (see section A.6).

To see the effect of the bound on variance, we note that there are two possible cases. In the first case, the Fisher information is maximized in a regime where the noise variance bound is saturated. Since the noise variance is saturated, this corresponds to a regime in which the signal-to-noise ratio, at best, is very low and the information transmission is exceptionally small. Thus, without some additional nonlinear mechanism for noise reduction, the system would fail to transmit much information. Furthermore, it is not even clear that Fisher information provides a good metric in such cases, where only very coarse discrimination of signals may be performed (see Butts & Goldman, 2006). We therefore do not consider this case further.

In the second case, corresponding to a higher signal-to-noise regime, the
maximal Fisher information is obtained when the noise variance is not
saturated. In this case, as shown in section A.6, we find that the optimal value of the network
feedback *α* is not different from that obtained
without a bound on the noise variance (compare Figures 8E and 8F).
Furthermore, we derive conditions on *σ* and *N* such that this higher signal-to-noise regime is
attained. In a similar manner, the optimal memory performance of the
feedforward networks in the high signal-to-noise regime is not affected by
the bound on the variability. Therefore, for the feedforward networks
discussed below, we consider only the effect of the bound on the mean
activity.

#### 3.3.3. Feedforward Networks with Finite Dynamic Range.

In sections 3.1 and 3.2, we found that the optimal feedforward networks used transient amplification of signals to increase the (signal gain)-to-(noise gain) ratios that determine the Fisher information. However, as we noted for the attractor networks with a reset, unbounded signal amplification is no longer possible when there is a finite dynamic range, and firing rates will saturate unless the inputs entering the network are reduced. The consequences of this limited dynamic range for the feedforward networks are delineated below.

We consider first the case of networks with discrete dynamics (see Figure 6G) for which analytical calculation of the optimal Fisher information is tractable. Similar to the above results for the attractor networks with a reset, we find that the optimal memory performance is obtained under two conditions. First, the input vector should be made as strong as possible for each neuron so that each neuron in the network uses the full extent of its mean dynamic range. This immediately implies that the optimal feedforward networks must have a functionally, rather than literally, feedforward architecture (because in a literally feedforward architecture, by definition the first stage does not contain all neurons).

Second, the networks should use a value α = 1 that corresponds
to perfect maintenance of the signal as it propagates down the chain of
modes (see Figure 9A). If network
activity decays more quickly than this (*α*< 1),
part of the signal will be lost. If network activity grows more quickly
(*α* > 1), then in order to use its full
dynamic range of mean activity and not saturate, the network will need to
have initial activity that is less than maximal and will need to amplify
this activity over time, leading to an amplification of noise as well. This
is precisely analogous to the case for the attractor network with a reset,
and indeed the discrete feedforward and attractor networks with a reset have
identical Fisher information (see Figure 10C).
Without a reset, the feedforward networks with discrete dynamics can
outperform the attractor networks because they do not need to forget in
order to remove noise and, in fact, the performance of the feedforward
networks is identical with or without a reset (see Figures 10A and 10C).
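The optimality of α = 1 under a rate bound can be seen in a discrete-time sketch. The model below is our simplified assumption: the input is scaled so activity never exceeds the bound *r*₀, so a growing mode gains no extra signal at the end of the period but does amplify noise.

```python
# Discrete-time sketch under a mean-rate bound r0 (our simplified
# model): the input is scaled so activity never exceeds r0, so the
# signal at the end of the T-step period is r0 * min(alpha**T, 1),
# while the accumulated noise variance is independent of that scaling.
def bounded_info(alpha, T=10, r0=1.0, sigma2=1.0):
    signal_end = r0 * min(alpha ** T, 1.0)
    noise_var = sigma2 * sum(alpha ** (2 * k) for k in range(T))
    return signal_end ** 2 / noise_var

for a in (0.9, 1.0, 1.1):
    print(a, bounded_info(a))  # peaks at alpha = 1
```

For α < 1 part of the signal decays away; for α > 1 the end-of-period signal is pinned at *r*₀ while the noise is amplified, so perfect maintenance (α = 1) is optimal.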

We note that the two conditions above imply that for the feedforward networks
to maintain neuronal activity at a level that uses the full dynamic range of
all neurons, the activities of the neurons at all times need to be directed
along the vertices of the hypercube that defines the allowed range of mean
firing rates (see Figures 7B and 9B). It is not immediately obvious that
this condition can be met for the feedforward networks, because it implies
that there must be *N* orthogonal modes of the network that
each lie along a different vertex of the hypercube. In section A.6, we show that networks can be
constructed that obey this criterion, at least in the case that the number
of neurons *N* is equal to a power of 2. When *N* is restricted to this case, we show that the Fisher
information conveyed by the network scales as *N*/*T*, similar to the case of the
attractor networks with a reset. Building on this case, we show in section A.6 that for general *N*, the maximal Fisher information is still of the order
of *N*/*T* when the number of neurons is at
least twice the number of feedforward stages.
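One way to obtain *N* orthogonal modes that all lie along hypercube vertices when *N* is a power of 2 is via a Sylvester-type Hadamard matrix; whether this matches the construction in section A.6 is our assumption.

```python
import numpy as np

# A Sylvester-type Hadamard matrix gives N mutually orthogonal +-1
# vectors for N a power of 2, i.e., N orthogonal modes that each lie
# along a vertex of the mean-rate hypercube.
def hadamard(n):
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

H = hadamard(8)
print(np.allclose(H @ H.T, 8 * np.eye(8)))  # True: orthogonal rows
print(np.all(np.abs(H) == 1.0))             # True: hypercube vertices
```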

Literally feedforward networks perform more poorly than the optimal,
functionally feedforward networks already described. In the literally
feedforward networks, different sets of neurons are used to convey
information in each stage. In particular, since input is applied to neurons
only in the first stage and is carried only by a subset of neurons at any
time, the feedforward networks cannot convey as much information as
functionally feedforward networks that use all neurons at every stage.
Interestingly, when we considered storage of a single-dimensional stimulus
in a literally feedforward network with number of stages equal to the
duration of the memory period (as in Ganguli, Huh et al., 2008), we found that the memory performance of
the optimal networks scaled only as *N*/*T*^{2} rather than as *N*/*T* (see
section A.6). Furthermore,
although the previous study focused on networks having a fan-out structure
with more neurons at later stages, we found that the optimal network
architecture in this case contained equal numbers of neurons at all times
(see section 4 for further commentary).
This uniform structure provides an optimal balance between the fan-out
architecture, which allows larger signal amplification between stages, and
the fan-in architecture, which reduces noise (and particularly the amount of
noise that is common among neurons) by pooling stages with more neurons into
stages with fewer neurons (see Figures 9C to 9E). Figure 9D shows how the Fisher information depends on the fan-out rate, defined as the ratio of the numbers of neurons in successive stages: a fan-out rate less than 1 (greater than 1) corresponds to a fan-in (fan-out) structure. As seen in this figure, the Fisher information is maximized when the fan-out rate is 1, that is, for a
uniform structure. We note that this result holds regardless of whether , as in Ganguli, Huh et al. (2008) (calculation not shown).

For feedforward networks with continuous dynamics, the Fisher information cannot be expressed in a simple analytical form, making it difficult to find the structure that optimizes memory performance. To obtain a lower bound on the maximal *I _{F}* and gain an intuition for how the results obtained in discrete dynamics might change when the dynamics are continuous, we therefore calculated *I _{F}* for networks with the structure found to be optimal under discrete dynamics. Numerical simulation in this case shows that the feedforward networks perform worse than the attractor networks, either with or without reset (see Figures 10B and 10D). Furthermore, with a reset, we show in section A.6 that the attractor network saturates the bound on information transmission achievable by any network with a finite dynamic range, whereas no feedforward network can achieve this bound. Thus, at least for networks with a reset, the attractor networks strictly outperform the feedforward networks. Without a reset, the worse performance of the feedforward networks could in principle be due to the nonoptimal architecture taken from the optimal discrete network. However, we think this is unlikely because the reduced memory performance for the feedforward networks with continuous dynamics is analogous to the similar result found in section 3.2 (Figures 6H and 6I and accompanying text), which could be explained by the combination of spreading of signals across the modes of the network and signal loss through the end of the chain.
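The comparison described here can be illustrated with a minimal numerical sketch. The network sizes, time constants, unit-subdiagonal chain weights, and Euler integration below are illustrative choices, not the paper's exact simulations: the code integrates the signal gain vector and the noise covariance forward from a reset at t = 0 and evaluates the Fisher information as g^T C^{-1} g for a perfect integrator versus a leaky feedforward chain.

```python
import numpy as np

def fisher_info(A, u, T, sigma2=1.0, dt=1e-3):
    """Fisher information I_F = g^T C^{-1} g for dx/dt = A x + white noise,
    with the noise reset at t = 0 and a unit input pulse along u at t = 0.
    The signal gain g(T) and noise covariance C(T) are integrated by forward Euler."""
    N = A.shape[0]
    g = u.astype(float).copy()           # signal gain vector: g(0) = u
    C = np.zeros((N, N))                 # noise covariance: C(0) = 0 (reset)
    for _ in range(int(round(T / dt))):
        g = g + dt * (A @ g)                                     # gain follows the dynamics
        C = C + dt * (A @ C + C @ A.T + sigma2 * np.eye(N))      # noise accumulates
    return float(g @ np.linalg.solve(C, g))

N, T, tau = 8, 4.0, 1.0
u = np.zeros(N); u[0] = 1.0
A_att = np.zeros((N, N))                 # perfect integrator with reset: no decay of any mode
W = np.diag(np.ones(N - 1), k=-1)        # unit-weight feedforward chain
A_ff = (W - np.eye(N)) / tau             # leaky chain dynamics
I_att = fisher_info(A_att, u, T)
I_ff = fisher_info(A_ff, u, T)
print(I_att, I_ff)                       # the integrator retains more information
```

For the integrator, the noise variance along the signal grows as σ²T, so I_F ≈ 1/(σ²T); the chain's signal spreads across modes and leaks out the end, reducing I_F.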

## 4. Discussion

We have compared the memory performance of two prominent classes of short-term memory networks in storing the amplitude of a briefly presented stimulus in the presence of gaussian white noise. In one class of networks, memory was sustained by positive feedback that was mediated by recurrent connections and resulted in the formation of low-dimensional attractors (Robinson, 1989; Seung, 1996). In the other class, memory was sustained by passing activity through either long feedforward chains of neurons or through a chain of orthogonal activity patterns (Schur modes) in a recurrent network (Ganguli, Huh et al., 2008; Goldman, 2009; White et al., 2004). In each case, memory performance was quantified with the Fisher information, which, for the linear network dynamics considered here, represents a ratio of the amount that the network amplifies the signals versus the noise received.

Our primary results were as follows. For the attractor networks, including those with a limited range of firing rates, we found that the best-performing networks were forgetful if noise is allowed to build up without constraint before the stimulus arrives. This forgetfulness reflected a fundamental trade-off between requiring a long time constant of decay of network activity to maintain signals throughout the memory period and needing some decay of network activity in order to remove noise that enters the system before the stimulus arrives. However, if there exists a mechanism to remove noise from the system near the time of stimulus arrival or if networks enter the memory-storing state only close to the time of the stimulus onset, then we found that the optimal networks with a limited dynamic range perfectly maintain their signals throughout the memory period.

Comparison of the memory performance between line attractor and higher-dimensional attractor networks showed that the optimal memory performance with or without reset is independent of the dimension of the attracting modes. However, optimal memory performance in the line attractor networks requires an optimal alignment of the input vector, whereas optimal memory performance in the higher-dimensional attractors requires an optimal alignment of the readout vector. These results suggest that line attractor networks might be more useful if the stored memory needs to be used by multiple networks that each project activity along a different direction. By contrast, higher-dimensional attractors might be more useful in storing memories that arrive from multiple networks that each encodes the stimulus along a different direction.

For the feedforward networks, the optimal network architectures did not depend strongly on the presence of a resetting mechanism because the feedforward networks naturally remove noise from the system when it exits from the end of the feedforward chain. Due to this inherent noise-removal mechanism, the feedforward networks could transiently amplify signals without excessive noise buildup (Ganguli, Huh et al., 2008) and, for linear networks with no reset or bounds on activity, perform better than the attractor networks. However, when the firing rates were bounded, the ability of the feedforward network to amplify inputs was limited, and the optimal feedforward networks propagated activity without amplification or decay (α = 1).

Comparing the networks, we found that the Fisher information for both the optimal
attractor and feedforward networks increased linearly with the number of neurons *N*, reflecting that additional neurons allow more signal to be
carried by the network. Additionally, the optimal networks in both cases exhibited a
power law decay in memory performance. For the attractor networks and for
feedforward networks in a discrete approximation, this decay was inversely
proportional to time and reflected the linear increase in noise variance over time.
Interestingly, we note that such a linear increase in variance has been observed
experimentally in spatial working memory tasks (Ploner, Gaymard, Rivaud, Agid, &
Pierrot-Deseilligny, 1998; White, Sparks,
& Stanford, 1994). Feedforward networks
with continuous dynamics performed less well than those with discrete dynamics,
reflecting two factors: the signals in continuous networks spread out over time,
leading to a reduction in the signal gain; and due to this spreading, signals exit
from the end of the chain before the end of the memory period. Together these
factors lead to worse performance of the feedforward networks relative to the
attractor networks when there is a noise reset, and quite likely (although we could
only compute a lower bound approximation on the feedforward networks) even in the
absence of such a reset.
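The *N*/*T* scaling summarized above can be reduced to a toy calculation. In the sketch below (all parameters illustrative), a perfect integrator with reset places its signal gain vector at a vertex of the firing-rate hypercube while its noise variance grows linearly with time, giving I_F = N·r₀²/(σ²T).

```python
import numpy as np

def integrator_fisher(N, T, r0=1.0, sigma2=1.0):
    """Perfect integrator with reset: the signal gain sits at a hypercube vertex
    (every neuron at its maximal rate r0), and noise variance grows as sigma2 * T."""
    g = np.full(N, r0)              # signal gain vector: all components at the bound
    C = sigma2 * T * np.eye(N)      # accumulated noise covariance after time T
    return float(g @ np.linalg.solve(C, g))   # = N * r0**2 / (sigma2 * T)

# linear scaling in N, power-law (1/T) decay in time
assert np.isclose(integrator_fisher(20, 5.0), 2 * integrator_fisher(10, 5.0))
assert np.isclose(integrator_fisher(10, 10.0), 0.5 * integrator_fisher(10, 5.0))
```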

### 4.1. Comparison to Previous Work.

Many previous studies have proposed perfectly tuned attractor networks as a substrate for holding short-term memories in the absence of noise (for reviews, see Brody et al., 2003; Goldman, Compte, & Wang, 2009; Wang, 2001). Here, we have explicitly considered the effects of noise on both attractor and nonattractor (feedforward or functionally feedforward) networks. For networks with underlying linear dynamics and both a reset nonlinearity and a finite dynamic range on neuronal responses, our results are consistent with the optimality of perfectly tuned attractor networks. Similarly, we note that the perfectly tuned attractor network (integrator) was found in a recent study to be the optimal architecture for storing the running total of a continuously presented input in which noise likewise started with the arrival of the signal (Brown et al., 2005).

Perfect integrator networks face a fine-tuning problem of network connectivity in that the feedback connections must precisely offset intrinsic neuronal decay processes in order to sustain activity at a constant rate in the absence of external input. Several mechanisms have been suggested to lessen the strictness of this tuning requirement. These include the use of long intrinsic (Marder, Abbott, Turrigiano, Liu, & Golowasch, 1996) or synaptic (Hempel, Hartman, Wang, Turrigiano, & Nelson, 2000; Wang et al., 2006; Mongillo, Barak, & Tsodyks, 2008) time constants. In addition, bistability is a nonlinear mechanism for maintaining the robustness of memory storage (Camperi & Wang, 1998; Koulakov, Raghavachari, Kepecs, & Lisman, 2002; Goldman, Levine, Major, Tank, & Seung, 2003) and homeostatic learning rules have been suggested to be able to keep short-term memory-storing circuits tuned (Goldman, 2009; Renart, Song, & Wang, 2003). Further investigation is needed to analyze the robustness of different network architectures to synaptic weight changes.

In the absence of a reset nonlinearity, we find that noise buildup before the time of the stimulus presentation makes the perfect attractor network nonoptimal; instead, at least in the high signal-to-noise regime, we find that the optimal attractor networks must be forgetful in order to reduce noise accumulation. This result is similar to that of White et al. (2004), who considered the storage of temporal sequences in memory networks with discrete dynamics and who noted that forgetting was necessary in order to prevent the buildup of noise that arrived at all times before the stimulus onset.

For the feedforward networks, previous work that examined networks with a finite
dynamic range of neural activity focused on network architectures with a fan-out
structure (Ganguli, Huh et al., 2008;
Ganguli & Latham, 2009). This
previous work showed that under a finite dynamic range constraint, a fan-out
network can achieve the same scaling as the optimal network; however, this study
did not check whether other structures may achieve this bound or whether there
exists a structure having better memory performance. By contrast, at least for
storage of a one-dimensional stimulus, we explicitly calculated that the optimal
network architectures for the (discrete) feedforward networks had a uniform
structure and that the fan-out structure was suboptimal. A key difference
between our study and that of Ganguli, Huh et al. (2008) is that they primarily focused their discussion
on memory for sequences, whereas here we explicitly focus on memory for a
single-dimensional input. Higher-dimensional signals cannot be stored in
attractor networks if the dimension of the attractor is lower than the dimension
of the signal. Therefore, if the stimulus to be remembered is higher
dimensional, such as remembering an entire sequence of inputs, this may favor a
high-dimensional or feedforward network in which time is explicitly represented
by patterns of activity that are sequentially activated as signals propagate
through the network (Ganguli, Huh et al., 2008; Goldman, 2009; White et
al., 2004). Ganguli, Huh et al. (2008) showed that the duration *T* for which a network could reliably convey information
about a temporal sequence increased only in proportion to √*N*. This contrasts with our result for storing a
single-dimensional stimulus, in which memory increases proportional to the
network size *N*. The reason for this difference is that for
storage of a single-dimensional stimulus, our optimal networks (both attractor
and functionally feedforward) could use their entire finite dynamic range to
store this one dimension. By contrast, when the stimulus dimension scales with
time, as in sequence memory, the network must divide its dynamic range among all
stored dimensions. This leads to memory performance that scales as *N*/*T*^{2} rather than *N*/*T*, so that the
duration *T* for which a network can reliably convey information
about a temporal sequence increases only in proportion to √*N*. Consistent with this observation, the memory performance of
our optimal literally feedforward networks (which use approximately 1/*T* of the
entire network’s range at any given time) scaled only as *N*/*T*^{2}. Furthermore, literally feedforward networks might have an
advantage over functionally feedforward networks or generic high-dimensional
attractors because the literally feedforward networks keep the elements of a
sequence arriving at different times cleanly segregated.

### 4.2. Temporal Information in Memory Networks.

In this work, we focused on mechanisms for storing the amplitude of a stimulus when the memory period is a fixed (or known) duration, so that there is no need for encoding the time since the pulse occurred. However, if the duration of the memory period is variable and unknown, then joint information about the amplitude of an input pulse and the time at which the pulse occurs needs to be encoded. A one-dimensional attractor network is not suitable to extract joint information about the amplitude and time of input since a one-dimensional network cannot represent such a two-dimensional quantity. Rather, at least a two-dimensional network is required. Feedforward networks seem advantageous for processing time and storing signals since different sets of neurons or modes are used at different times. However, it is unclear whether time and amplitude are dependently encoded by the network activity, as in feedforward networks, versus encoded in independent modes of activity (either with time encoded in a completely separate network from amplitude, or with independent modes of activity that represent time and that represent amplitude, as suggested by the recent work of Machens, Romo, and Brody, 2010). For example, it has been suggested in the circuits underlying bird song that time is represented through a feedforward chain of bistable units that are more robust to temporal encoding than graded networks (but with loss of any representation of amplitude information; see Long, Jin, & Fee, 2010). Alternatively, high-dimensional attractor networks have been suggested to encode both time and amplitude (Machens et al., 2010; Singh & Eliasmith, 2006). Further work, both experimental and theoretical, is needed to address the joint processing of amplitude and temporal information.

### 4.3. Effect of Correlated or Signal-Dependent Noise.

In this study, we assumed for simplicity that the external noise received by each neuron was equal in amplitude and uncorrelated across neurons. However, similar to studies in sensory systems that have shown strong effects on neural coding in the presence of correlated noise (Abbott & Dayan, 1999; Averbeck, Latham, & Pouget, 2006; Latham, Deneve, & Pouget, 2003; Sompolinsky et al., 2001; Zohary, Shadlen, & Newsome, 1994), we found that the correlation structure of noise received by the network may dramatically affect the optimal architecture of memory networks. As illustrated in Figure 11A, line attractor networks may be advantageous when noise is correlated: if the input direction can be chosen independent of the profile of the injected noise, then the attractor and the input direction can be oriented orthogonal to the directions of high noise and along directions with low noise. In contrast, activity in feedforward networks is passed through many different orthogonal patterns of activity (see Figure 11B), so that it may be difficult to take advantage of correlated noise that has a particularly non-noisy direction. If instead the input direction and the profile of injected noise are dependent, a more careful examination is required to determine the architectures of the best-performing attractor and feedforward networks.
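The advantage of aligning an attractor with a low-noise direction can be seen in a two-neuron toy example; the covariance values below are illustrative. The (signal gain)-to-noise ratio g^T C^{-1} g is large when the input direction lies along the quiet eigenvector of the noise covariance and small when it lies along the noisy one.

```python
import numpy as np

# Correlated noise across two neurons: variance 10 along (1,1)/sqrt(2),
# variance 0.1 along the orthogonal direction (1,-1)/sqrt(2)
v_hi = np.array([1.0, 1.0]) / np.sqrt(2)
v_lo = np.array([1.0, -1.0]) / np.sqrt(2)
C = 10.0 * np.outer(v_hi, v_hi) + 0.1 * np.outer(v_lo, v_lo)

def snr(g, C):
    """(signal gain)-to-noise ratio g^T C^{-1} g for signal gain vector g."""
    return float(g @ np.linalg.solve(C, g))

# aligning the attractor/input with the quiet direction preserves far more information
print(snr(v_lo, C), snr(v_hi, C))   # 1/0.1 = 10.0 vs 1/10 = 0.1
```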

We considered only additive noise in this study. When noise is instead multiplicative or signal dependent, different optimal architectures may be necessary. Although this question deserves much further study, multiplicative noise is more disruptive to higher firing rates than additive noise and therefore might lead to better memory performance for networks with relatively faster decay of signals, with smaller amplitudes of input that drive neurons to less high firing rates, or with other differences in network architecture that decrease the use of high firing rates.

### 4.4. Nonlinear Dynamics.

In this letter, we have modeled the finite range of neuronal activities in an analytically tractable manner by imposing a finite dynamic range on the mean firing rates and their variances and arranging the network inputs so that the trajectories of neuronal firing never exceed this range. More realistically, neurons have hard or soft limits on their observed firing rates that are best modeled with explicitly nonlinear network models. Having the underlying dynamics of the network be nonlinear rather than imposing the finite dynamic range as a simple constraint on a linear network may influence the memory performance in various ways. For example, explicit inclusion of nonlinear dynamics may reduce the buildup of noise relative to a network with linear dynamics, and even without an explicit reset mechanism, the optimal architecture of the attractor networks may become less forgetful. Furthermore, recent theoretical work shows that randomly connected nonlinear networks with sigmoidal neuronal input-output functions exhibit a sharp reduction of neural variability with the arrival of a stimulus (Rajan et al., 2010), suggesting a mechanistic explanation for the reset mechanism considered in our study. More dramatically, the presence of strong nonlinearity can lead to bistable responses, which may be useful in robustly maintaining memories in the presence of noise (Toyoizumi, 2010) or in lessening the need to fine-tune synaptic connection weights (Camperi & Wang, 1998; Goldman et al., 2003; Koulakov et al., 2002). Further work is needed to explore the possibilities offered by nonlinear networks and to develop analysis methodologies that allow a rigorous understanding of networks in which the conveniences offered by linear analysis no longer apply.

## Appendix

### A.1. Relation Between Fisher Information and (Signal Gain)-to-Noise Ratio in Linear Systems.

Here we show that *I _{F}* is greater than or equal to the (signal gain)-to-noise ratio for a network with linear dynamics and linear readout of the network activity. Denoting the signal gain vector as the vector with components *g _{i}* and assuming for the moment that the noise covariance matrix is diagonal with elements *c _{i}*, *I _{F}* in equation 2.9 can be expressed as

*I _{F}* = ∑ _{i} *g*^{2} _{i}/*c _{i}* ⩾ (∑ _{i} *k _{i}g _{i}*)^{2}/(∑ _{i} *k*^{2} _{i}*c _{i}*) = SNR.

The above relation is the Cauchy-Schwarz inequality, and the equality holds when *k _{i}* ∝ *g _{i}*/*c _{i}*. For a nondiagonal noise covariance matrix, we can get the same result by changing to a coordinate system in which the covariance matrix is transformed to a diagonal matrix, and the condition for equality becomes that the readout vector is along the product of the inverse covariance matrix with the signal gain vector.

An alternative proof of the relation between *I _{F}* and the SNR (not shown) can be obtained using the Cramer-Rao bound relationship between *I _{F}* and the maximum likelihood estimator of the stimulus.
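The Cauchy-Schwarz relation between the Fisher information and the readout SNR can be checked numerically. In the sketch below, the readout vector k, the random positive-definite covariance, and the dimensionality are illustrative choices: the SNR of the optimal readout k ∝ C^{-1}g equals I_F, and any other readout falls below it.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 6
g = rng.normal(size=N)                       # signal gain vector
A = rng.normal(size=(N, N))
C = A @ A.T + np.eye(N)                      # positive-definite noise covariance

I_F = float(g @ np.linalg.solve(C, g))       # Fisher information g^T C^{-1} g

def snr(k):
    """(signal gain)-to-noise ratio for a linear readout along k."""
    return float((k @ g) ** 2 / (k @ C @ k))

k_opt = np.linalg.solve(C, g)                # equality case: k proportional to C^{-1} g
k_rand = rng.normal(size=N)
assert np.isclose(snr(k_opt), I_F)           # bound saturated by the optimal readout
assert snr(k_rand) <= I_F + 1e-9             # any other readout falls below I_F
```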

### A.2. Analytic Expression for the Fisher Information in Attractor Networks.

#### A.2.1. Calculation of Fisher Information for Line Attractors.

If the eigenvalues λ _{i} are arranged in descending order by their real parts and the real part of the first eigenvalue is much larger than the real parts of the remaining eigenvalues, as in the line attractors, then for *i* > 1, the activity along the first mode decays much more slowly than that along the *i*th mode, so that τ^{(1)}_{eff} ≫ τ^{(i)}_{eff}. Furthermore, if the decay in the other modes relative to the activity along the line attractor is fast enough to overcome the nonorthogonality of the eigenvectors implicit in the lengths of the left eigenvectors, then the network activity in equation A.5 can be expressed approximately in terms of λ _{1} and the corresponding left and right eigenvectors.

To compute *I _{F}*, the signal gain vector and the noise covariance are expressed in the coordinates of the right eigenvectors. In equation A.7, the relevant matrix is the one whose elements are the inner products of the left eigenvectors, and its (*i*, *j*)th element is computed accordingly. If this matrix is orthogonal, then all the off-diagonal terms become zero, and the (1, 1)th element of the inverse matrix is the reciprocal of the (1, 1)th element. If the eigenvectors are not orthogonal but τ^{(1)}_{eff} is large enough to overcome the nonorthogonal factor in equation A.7, then only the (1, 1)th element is of order τ^{(1)}_{eff}, whereas all the other elements are of order τ^{(i)}_{eff} for some *i* > 1. Thus, the (1, 1)th element of the inverse matrix is still close to its value in the orthogonal case.

Expressing the signal gain in the coordinates of the right eigenvectors, the Fisher information *I _{F}* becomes the product of the square of the signal and the (1, 1)th element of the inverse of the noise covariance matrix (equation A.8). This approximation breaks down if the nonorthogonal factors become large, for instance, in feedforward networks that are not eigen-decomposable.

_{F}Above, we showed general conditions under which the Fisher information of the
line attractor networks is approximated by the signal-to-noise ratio along
the attractor. For ease of computation, in the calculations below for the
attractor networks, we consider only the case of (normal) networks in which
all modes of the attractor networks are orthogonal, . In this case, the left and right eigenvectors
corresponding to a given eigenvalue are identical, and we denote the *i*th eigenvector by .

#### A.2.2. Optimal Decay Time Constant for Line Attractors with or Without Reset.

Here we calculate *I _{F}* in line attractors with or without reset and obtain the optimal *I _{F}* in each case. First, we consider the case of an attractor network with no reset. In this case, noise builds up at all times before the signal arrives, so that *t* _{0} = −∞, and *I _{F}* is obtained from equation A.8 as equation A.9. Differentiating equation A.9 with respect to τ^{(1)}_{eff} shows that *I _{F}* attains its maximum value at time *T* when τ^{(1)}_{eff} = 2*T*.

Next, with a reset, setting *t* _{0} = 0, equation A.8 becomes equation A.10. Instead of differentiating with respect to τ^{(1)}_{eff}, which is undefined at λ _{1} = 1, differentiating with respect to λ _{1} shows that *I _{F}* monotonically increases with λ _{1}. That is, it is an increasing function in both the signal-decaying regime, λ _{1} < 1, and the nondecaying regime, λ _{1} ⩾ 1.
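The optimal forgetful time constant can be verified numerically. Without a reset, I_F along the attractor is proportional to exp(−2T/τ)/τ (signal decay squared over accumulated noise variance σ²τ/2), which is maximized at τ = 2T; the unit signal gain and noise amplitude in the sketch below are illustrative.

```python
import numpy as np

def fisher_no_reset(tau, T, sigma2=1.0, g2=1.0):
    """Line attractor without reset: signal decays as exp(-T/tau), while noise
    accumulating from t0 = -inf has variance sigma2 * tau / 2 along the attractor."""
    return g2 * np.exp(-2 * T / tau) / (sigma2 * tau / 2)

T = 5.0
taus = np.linspace(0.5, 50.0, 4000)
tau_star = taus[np.argmax(fisher_no_reset(taus, T))]
print(tau_star)   # close to 2T = 10: the optimal network is forgetful without a reset
```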

#### A.2.3. Fisher Information for Plane Attractors.

Here we consider plane attractor networks in which multiple modes share a long effective time constant τ _{eff}. Since any neural activity in the modes other than the plane attractor decays to zero after a transient time, the remaining neural activity lies along the projection of the input vector onto the plane. Then, for any mode on the plane, the projection of neural activity onto this mode, denoted *x _{i}*, evolves in the same manner as activity along a line attractor. *I _{F}* gives the maximal (signal gain)-to-noise ratio among these modes, which occurs for the mode along the projected input vector. The form of *I _{F}* in the plane attractor networks is then similar to *I _{F}* for the line attractor, with the projection of the input vector onto the plane in place of the input vector in equation A.8.

#### A.2.4. SNR for Line and Plane Attractor Networks with a Linear Readout.

Here we calculate the memory performance of line and plane attractor networks if neural activity is linearly read out by projecting it along a readout direction.

In line attractors, the (signal gain)-to-noise ratio is independent of the choice of readout direction as long as it is not close to orthogonal to the attracting mode, since both the signal and noise accumulate almost exclusively along the one-dimensional attracting mode. However, in the plane attractor networks, noise develops isotropically in the plane, and in order not to collect noise in the noninput direction, the readout should match the input direction exactly (see Figure 4).
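This readout-sensitivity difference can be illustrated with a two-mode toy model. The noise variances below are illustrative, and the line attractor's small off-attractor noise is represented by a small residual term: for the line attractor, the SNR is nearly insensitive to tilting the readout, whereas for the plane attractor a tilted readout collects noise from the noninput direction and the SNR drops.

```python
import numpy as np

v, eps = 4.0, 1e-6     # accumulated noise variance; residual off-attractor noise (illustrative)

def snr(k, g, C):
    """(signal gain)-to-noise ratio for readout k, signal gain g, noise covariance C."""
    return float((k @ g) ** 2 / (k @ C @ k))

g = np.array([1.0, 0.0])            # signal gain along the input mode
C_line = np.diag([v, eps])          # line attractor: noise accumulates only along the attractor
C_plane = v * np.eye(2)             # plane attractor: noise develops isotropically in the plane

k_aligned = np.array([1.0, 0.0])
k_tilted = np.array([1.0, 1.0]) / np.sqrt(2)

print(snr(k_aligned, g, C_line), snr(k_tilted, g, C_line))    # both close to 1/4: readout-insensitive
print(snr(k_aligned, g, C_plane), snr(k_tilted, g, C_plane))  # 1/4 vs 1/8: alignment matters
```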

### A.3. Calculation of Noise Covariance Matrix of Feedforward Networks.

Without a reset, *t* _{0} in equation 2.4 is set to −∞, and the noise covariance matrix can be written as an integral over the entire past. As noted in the supplement of Ganguli, Huh et al. (2008), a recursive relation for the covariance can be derived by differentiating the integrand and using the fundamental theorem of calculus. This recursion relationship is in the form of a continuous Lyapunov equation and can be solved in Matlab using the lyap function.

With a reset, *t* _{0} is set to 0, and no simple recursive relation can be obtained. In this case, we obtain the analytical form of the noise covariance matrix directly from the explicit expression of neural variability at time *t* in equation 2.2. In this expression, the variability in the *i*th neuron contains the filtered noise generated from the *k*th neuron for *k* ≤ *i*. The (*i*, *j*)th component of the noise covariance matrix is the sum of correlations due to noise generated by all neurons *k* with *k* ≤ min(*i*, *j*), and involves γ(*i* + *j* − 2*k* + 1, 2*t*/τ), the lower incomplete gamma function.
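In Python, the no-reset steady-state covariance can be obtained with SciPy's solve_continuous_lyapunov, the analogue of Matlab's lyap mentioned above. The chain size, time constant, and unit subdiagonal couplings below are illustrative: the covariance C of dx/dt = A x + white noise solves A C + C Aᵀ + σ²I = 0.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Feedforward chain with leak: dx/dt = A x + noise, A = (W - I)/tau,
# with W having unit subdiagonal couplings (illustrative parameter choice)
N, tau, sigma2 = 5, 1.0, 1.0
W = np.diag(np.ones(N - 1), k=-1)
A = (W - np.eye(N)) / tau

# Steady-state covariance (t0 = -inf) solves the continuous Lyapunov equation
# A C + C A^T + sigma2 * I = 0; pass Q = -sigma2 * I to solve_continuous_lyapunov
C = solve_continuous_lyapunov(A, -sigma2 * np.eye(N))
print(C[0, 0])   # first neuron filters only its own noise: sigma2 * tau / 2
```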

### A.4. Analysis of Fisher Information of Attractor and Feedforward Networks with Reset.

In section A.2, we derived analytical expressions for *I _{F}* for attractor networks as a function of the strength of network feedback (or, equivalently, the network time constant). In this section, we provide an analysis of *I _{F}* as a function of the signal gain achieved by the network (Ganguli, Huh et al., 2008). Specifically, we provide analytical bounds for the Fisher information *I _{F}* and show which types of attractor and feedforward networks achieve these bounds.

For networks with discrete dynamics, *I _{F}* has an upper bound that can be expressed solely as a function of the magnitudes of the signal gain vectors at each time step *m* (equation A.15). For networks without a reset, Ganguli, Huh et al. (2008) showed that only feedforward networks satisfy the equality. However, with a reset, the condition for the equality is relaxed (Ganguli, Huh et al., 2008; equation A.16). In this condition, a projection operator removes the activity in the direction of the signal gain vector at the *i*th step, so that the resulting activity is orthogonal to it. Equation A.16 then states that any activity orthogonal to the signal gain remains orthogonal to the signal over the subsequent *m* − *i* steps of the dynamics.

If we denote the subspace spanned by the signal gain vectors as *Z* and its orthogonal complement as *Z*^{⊥}, then the above condition yields that the evolution of the dynamics does not mix the two spaces, so the connectivity matrix can be decomposed into blocks acting separately on *Z* and *Z*^{⊥}.

Moreover, the form of the upper-left submatrix, that is, the transformation of *Z* to *Z*, is constrained by equation A.16. If we choose the signal gain vector as the first coordinate of *Z* and choose the remaining coordinates as vectors of *Z* orthogonal to it, then equation A.16 implies that, for any power *n*, all columns of the submatrix raised to the *n*th power except the first column remain orthogonal to the first column.

Using the above considerations, it can be verified directly that the upper bound in equation A.16 is achieved by all orthogonal matrices, feedforward chains, and networks with a ring structure constructed by connecting the final element of a feedforward chain to the first (this list is not exclusive; other matrices can also satisfy the bound).

Attractor networks can also satisfy the upper bound exactly or very closely. If
we assume that all modes of the attractor network are orthogonal and that the signal gain vector is aligned to one of the modes, then the attractor networks satisfy equation A.16 with dim(*Z*) = 1, since all other modes remain orthogonal to this mode. If
the modes are not orthogonal but there exists a mode with strong recurrent
feedback compared to the other modes, we can treat the activity in the other
modes as negligibly small, so that the network performs similarly to a network
with zero eigenvalues in the modes other than the attractor mode. Thus, the
low-dimensional attractor networks also satisfy the upper bound closely.

The bound on *I _{F}* with continuous dynamics is obtained in the same way as for discrete dynamics (equation A.17), where the equality condition is given by equation A.18. As in the discrete networks, this condition holds for attractor networks with orthogonal eigenmodes and for line attractor networks as long as the input vector is set along one of the attractor modes. For feedforward networks, equation A.18 does not strictly hold. However, we have checked that the feedforward networks come close to achieving the bound given in equation A.17. This is shown in Figure 12, where the numerically calculated *I _{F}* (panel B, solid line) is compared to the bound of equation A.17 (panel B, dotted line) calculated numerically from the signal gain vector (panel A).

### A.5. Relation Between the Bounds on the Connectivity Strength Between Neurons and the Connectivity Strength Between Modes.

Here we relate a bound *w* _{max} on the connectivity strength between neurons to bounds on the connectivity strength between modes when the modes are orthogonal to each other. First, we consider the attractor networks. If λ _{i} denotes the *i*th eigenvalue of the connectivity matrix, then the connectivity matrix can be decomposed into the product of the matrix of eigenvectors, the diagonal matrix with the eigenvalues as its diagonal entries, and the inverse of the eigenvector matrix. If the eigenvector matrix is orthogonal, the Frobenius norms ‖ · ‖ _{F} of the connectivity matrix and of the diagonal eigenvalue matrix are the same, which can be proven by using that the trace of a matrix is preserved under an orthogonal change of coordinates (equation A.19). If *w* _{max} denotes the maximal synaptic strength, that is, if the absolute value of each element of the connectivity matrix is bounded above by *w* _{max}, then the Frobenius norm of the connectivity matrix is bounded above by *Nw* _{max}. From equation A.19, the Frobenius norm of the eigenvalue matrix has the same bound, and thus each eigenvalue λ _{i} is at most *Nw* _{max}. Furthermore, there exists an attractor network that reaches this bound: the matrix with all elements equal to *w* _{max} has a maximal eigenvalue equal to *Nw* _{max}.

Similarly, it can be shown that the bound on the synaptic connectivity between
neurons leads to bounds on the strengths of the feedforward connectivity between
the Schur modes. For the feedforward networks, the Schur decomposition expresses the connectivity matrix as an orthogonal change of basis applied to a triangular matrix of couplings *m _{ij}* between the Schur modes. Since the change of basis is orthogonal for any Schur decomposition, the
Frobenius norm of the triangular coupling matrix is equal to that of the connectivity matrix, which is at most *Nw*_{max} as in equation A.19. The Frobenius norm of the triangular matrix is the square root of the sum of the squares of its elements, so not all *m _{ij}* can equal *Nw*_{max} at the same time. As discussed in section 3.2, these bounds lead to the result that the attractor networks with a reset outperform the feedforward networks for a given bound on synaptic strengths.
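Both norm relations can be checked numerically. In the sketch below (network size and weight values are illustrative), the all-to-all matrix at the synaptic bound attains the maximal eigenvalue N·w_max, and the Schur basis, being orthogonal, preserves the Frobenius norm of any connectivity matrix.

```python
import numpy as np
from scipy.linalg import schur

N, w_max = 6, 0.5
M = np.full((N, N), w_max)            # all-to-all network with every weight at the bound
assert np.isclose(np.linalg.norm(M, 'fro'), N * w_max)           # ||M||_F = N * w_max
assert np.isclose(np.max(np.linalg.eigvals(M).real), N * w_max)  # lambda_max = N * w_max

# The Schur basis is orthogonal, so the Frobenius norm of the mode-coupling
# (triangular) matrix equals that of the neuronal connectivity matrix
W = np.random.default_rng(1).normal(size=(N, N)) * w_max
T_modes, U = schur(W)
assert np.isclose(np.linalg.norm(T_modes, 'fro'), np.linalg.norm(W, 'fro'))
```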

### A.6. Optimal Network Structures When Neuronal Activity Is Bounded.

In this section, we calculate the maximal Fisher information, *I _{F}*, under constraints on the dynamic range of neural activity. First, we consider a constraint on the mean firing rates in which each neuron is constrained to have the absolute value of its firing rate bounded by a maximal value *r* _{0}. Formally, such a bound on every element is expressed by the infinity norm of the vector of mean firing rates. If we assume that the stimulus strength to be remembered is in the range [−*s* _{0}, *s* _{0}], then this yields a constraint on the signal gain (equation A.20). Geometrically, this constraint corresponds to limiting the signal gain vector to reside within a hypercube (see Figure 7), and the magnitude of the signal gain vector is then bounded by the distance to the vertices of the hypercube (equation A.21). Second, we consider a constraint on the variance of activity, given by equation A.22, in which the second term is the amount of accumulated noise in the *i*th neuron when the dynamics are not constrained.

#### A.6.1. Attractor Networks.

The memory performance of the attractor networks depends on the effective time constant τ _{eff}. If only the mean is constrained, we can obtain the maximal Fisher information from the expression for *I _{F}* in equation A.9 and the constraint on the mean activity, equation A.20, as equation A.23.

Without a reset, the network activity must decay (τ _{eff} positive) to prevent the infinite accumulation of noise. In this case, the magnitude of the signal gain vector decreases over time and is largest at time 0, so that equation A.24 can be replaced by a constraint on the initial gain.

*I _{F}* can be maximized by separately maximizing the squared magnitude of the signal gain vector and the term exp(−2*T*/τ _{eff})/(σ^{2}τ _{eff}/2) in equation A.23. As noted, the magnitude of the signal gain vector attains its maximal value when it points to one of the vertices of the hypercube. Moreover, in section A.2.2, it was found that the second term achieves its maximum when τ _{eff} is equal to the optimal time constant of decay. Thus, altogether, the maximum of *I _{F}* is given by equation A.25.

Next, we consider what happens when the maximal variability is also bounded, as in equation A.22. In this case, the maximum Fisher information is obtained as in equation A.26, whose denominator is the noise along the signal gain vector. Note that the accumulated noise in each neuron is not independent, since it is the projection of noise in the attractor onto each neuron; the noise along the gain vector is the sum of the noise variances for each neuron. In the optimally arranged networks, in which the signal gain vector points along a vertex (i.e., has all components equal in magnitude), the maximal variance is equal for all neurons, so that the summed variance attains a maximal value of *Nc* _{1}*r*^{2}_{0}. The variance of noise from the dynamics exceeds the maximal variability when σ^{2}τ _{eff}*q*^{2}_{i}/2 = σ^{2}τ _{eff}/(2*N*) ⩾ *c* _{1}*r*^{2}_{0} or, in terms of the feedback strength *α*, when α ⩾ α _{0} = 1 − σ^{2}τ/(2*Nc* _{1}*r*^{2}_{0}). In this regime, *I _{F}* takes its variance-saturated form.

_{F}As discussed in the main text, we consider only the higher signal-to-noise
regime in which the maximal Fisher information is obtained when the noise
variance is not saturated. Here, we derive a simple estimate of when this
regime is attained and show that the optimal value of the network feedback *α* is not different from that obtained without a
bound on the noise variance. In this high signal-to-noise regime,
α_{0} is greater than the value α_{opt} = 1 − τ/(2*T*) (see Figure 8F, peak of dashed line) at which *I _{F}* was maximized with only a constraint on the mean activity. Then,
for α < α_{0}, because the noise has not yet saturated, there is a maximum in *I*_{F} at *α* = *α*_{opt} of value *c*^{2}_{0}*N*/(σ^{2}*eT*) (see equation A.25). For α ⩾ α_{0}, the maximal *I*_{F} is given from equations A.26 and A.21 as *c*^{2}_{0}*N*/(*Nc*_{1}*r*^{2}_{0}) = *c*^{2}_{0}/(*c*_{1}*r*^{2}_{0}). This value is attained when the numerator of equation A.26 (the signal gain) has reached its maximal value, which occurs for α ⩾ 1. Comparing the expressions for the variance-saturating and nonsaturating regimes, we see that the maximal *I*_{F} is achieved in the nonsaturating regime α < α_{0} when *Nc*_{1}*r*^{2}_{0} ⩾ σ^{2}*eT*, that is, for small *σ* or large *N*. Furthermore, we note as claimed above that this maximum occurs at the same *α* = *α*_{opt} that was optimal without considering the finite variance of activity.
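The location of this optimum can be checked numerically. The sketch below (illustrative parameter values; `tau` and `T` are arbitrary choices) maximizes the decay factor exp(−2*T*/τ_{eff})/τ_{eff} over the feedback strength α, with τ_{eff} = τ/(1 − α), and confirms that the peak sits at α_{opt} = 1 − τ/(2*T*), i.e., τ_{eff} = 2*T*, where the factor equals 1/(2*eT*):

```python
import numpy as np

# Sketch: find the feedback strength alpha maximizing the decay factor
# exp(-2*T/tau_eff) / tau_eff, with tau_eff = tau / (1 - alpha).
# tau and T are arbitrary illustrative values, not taken from the text.
tau, T = 0.01, 1.0
alphas = np.linspace(0.5, 0.9999, 200001)
tau_eff = tau / (1.0 - alphas)
factor = np.exp(-2.0 * T / tau_eff) / tau_eff

alpha_star = alphas[np.argmax(factor)]   # numerical optimum
alpha_opt = 1.0 - tau / (2.0 * T)        # predicted optimum: tau_eff = 2T
assert abs(alpha_star - alpha_opt) < 1e-3

# At the optimum the factor equals 1/(2*e*T), which is the origin of the
# factor e in the maximal Fisher information c0^2 * N / (sigma^2 * e * T).
assert abs(factor.max() - 1.0 / (2.0 * np.e * T)) < 1e-6
```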

With a reset of the noise at the time of signal arrival, *I*_{F} is calculated in a similar manner to the case without a reset. However, noise accumulates only during the memory period, so that exponential growth of activity is possible (τ_{eff} can be negative). *I*_{F} with the constraint on mean activity is given from equations A.10 and A.20.

_{F}To find the optimal τ_{eff}, we consider separately the cases of positive and negative τ_{eff}. For positive τ_{eff} (decaying attractor mode), the signal gain is maximal at *t* = 0 as in the case without reset, and *I _{F}* is maximal for a perfect integrator (τ

_{eff}= ∞) with pointing to one of the vertices so that . For negative τ

_{eff}(amplifying attractor mode), both the signal gain in the numerator and the noise variance in the denominator increase with τ

_{eff}. However, the signal gain is limited by the constraint in equation A.28, so that the maximum of

*I*occurs when the noise variance is minimized. This occurs at τ

_{F}_{eff}= −∞, corresponding to the perfect integrator. Combining the results for positive and negative τ

_{eff}provides that the perfect integrator is optimal and its maximal

*I*is

_{F}*c*

^{2}

_{0}

*N*/(σ

^{2}

*T*).

The optimal Fisher information for line attractors with a reset and bound on
the variability, as well as the mean, can be computed analogously to the
case without a reset. The variance of the noise reaches the bound when the noise accumulated along the attractor mode equals its maximal allowed value and, for sufficiently large *N* or small *σ*, the maximum *I*_{F} over *α* still occurs at the same value α = 1, which was optimal without considering the finite variance of activity.

In summary, we have shown that the optimal *I _{F}* for the line attractor networks occurs for the perfect integrator.
Moreover, the perfect integrator with a reset has the optimal memory
performance of any continuous-dynamics networks with a bounded firing rate.
In section A.4, we found the upper limit of *I*_{F} in terms of the magnitude of the signal gain vector (see equation A.17). The uniform bound on the mean firing rate sets the upper limit on the magnitude of the signal gain vector to √*N* *c*_{0}. Substituting this bound into the expression for the upper limit of *I*_{F}, we obtain that *I*_{F} ⩽ *c*^{2}_{0}*N*/(σ^{2}*T*), which (comparing to the result above) shows that the perfect integrator saturates the bound on memory performance.

#### A.6.2. Feedforward Networks.

Here, we calculate the optimal structure of feedforward networks when the
mean activity is constrained. For analytical tractability, we perform this
calculation under a discrete dynamics approximation so that, as noted in
section A.4, *I _{F}* for the feedforward networks achieves the equality in equation A.15.

The number of feedforward stages *l* is equal to *T*/τ so that activity reaches the final stage at time *T*. In the literally feedforward networks, the signal amplification in each stage is restricted to lie in an *N*_{m}-dimensional hypercube, where *N*_{m} denotes the number of neurons in the *m*th stage. The bound on *I*_{F} then follows from equation A.15, with *c*_{0} defined in equation A.20. Maximizing over the stage sizes subject to the constraint Σ_{m}*N*_{m} = *N*, and noting that the maximum is attained when all *N*_{m} are equal, with *N*_{m} = *N*/*l*, we find that *I*_{F} attains a maximal value of order *N*/*T* (see equation A.31). The equality holds when the signal gain at every stage achieves its maximal
bound, that is, when each mode of the feedforward network points to the
vertices of the hypercube in the state space (see Figure 9B). It is not obvious that this condition can be
attained given that the states of the functionally feedforward network are
additionally required to be orthogonal to each other. Below, we show that at
least for the case when *N* is a power of 2, we can construct *N* mutually orthogonal modes that point to the vertices
of the hypercube and use this result to show more generally that the maximal
Fisher information of functionally feedforward networks is of order *N*/*T* if *N* is at least
twice the number of feedforward stages. We next present the proof of the
existence of *N* orthogonal modes pointing to the vertices of
the *N*-hypercube when *N* = 2^{i}, where *i* is a natural number:

**Proof.** We perform the proof by induction. For *N* = 2, the modes (1, 1)^{T} and (1, −1)^{T} satisfy the condition. Now assume that there exist 2^{i} orthogonal modes whose *N* = 2^{i} elements are either −1 or 1. Then, for *N* = 2^{i+1}, we can construct 2^{i+1} orthogonal modes from the orthogonal modes corresponding to *N* = 2^{i} as follows: for each mode **v** of the previous set, form the two concatenated modes (**v**, **v**)^{T} and (**v**, −**v**)^{T}. These 2^{i+1} modes have elements ±1 and are mutually orthogonal.

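A minimal sketch of this recursive doubling (assuming the standard Sylvester-style construction, in which each mode **v** yields the concatenations (**v**, **v**) and (**v**, −**v**)):

```python
import numpy as np

def orthogonal_pm1_modes(i):
    """Recursively build 2**i mutually orthogonal modes with entries +/-1.

    Each mode v of size 2**i yields two modes of size 2**(i+1): the
    concatenations (v, v) and (v, -v). This is the Sylvester construction
    of a Hadamard matrix; every mode points to a vertex of the hypercube.
    """
    modes = np.array([[1.0]])
    for _ in range(i):
        modes = np.block([[modes, modes], [modes, -modes]])
    return modes

M = orthogonal_pm1_modes(3)       # 8 modes of length N = 8
assert M.shape == (8, 8)
assert np.all(np.abs(M) == 1)     # every entry is +/-1 (a vertex)
# Rows are mutually orthogonal: M @ M.T = N * I
assert np.array_equal(M @ M.T, 8 * np.eye(8))
```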
The above construction can also be used to show that for any *N*, the maximal Fisher information of functionally
feedforward networks is of order *N*/*T* if *N* is larger than twice the number of feedforward stages *l* = *T/τ*. For general *N*, we can generate at least *N*/2
orthogonal modes by applying the above construction to the maximal power of
2 less than *N*. This creates an *N*/2-length
feedforward network that uses more than half of the full dynamic range. As
in the calculation leading to equation A.31, the Fisher information of such
networks is of order (*N*/2)/*T* when the
number of neurons *N* is greater than twice *l* (*N*/2 ⩾ *l*).
Thus, for general *N*, the maximal Fisher information is
still of order *N*/*T* for *N* ⩾ 2*l*. By contrast, for a continuous
(rather than discrete) feedforward network, the spreading out of activity
over time implies that it is impossible for the network to maintain a signal
gain vector pointing to a vertex at all times. Therefore, the continuous
feedforward networks, unlike their discrete counterparts, strictly cannot
attain the maximal bound on the Fisher information.

## Acknowledgments

This research was supported by a Sloan Foundation Research Fellowship, NIH grant R01 MH069726, and a UC Davis Ophthalmology Research to Prevent Blindness grant. We thank T. Toyoizumi, A. T. Sornborger, and E. Aksay for valuable discussions and D. Fisher and J. Ditterich for valuable discussions and feedback on the manuscript. This research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.


## Author notes

An online supplement is available at http://www.mitpressjournals.org/doi/suppl/10.1162/NECO_a_00234.