## Abstract

Gated working memory is defined as the capacity to hold arbitrary information at any time so that it can be used at a later time. Based on electrophysiological recordings, several computational models have tackled the problem using dedicated and explicit mechanisms. We propose instead to consider an implicit mechanism based on a random recurrent neural network. We introduce a robust yet simple reservoir model of gated working memory with instantaneous updates. The model is able to store an arbitrary real value at a random time over an extended period of time. The dynamics of the model form a line attractor that learns to exploit reentry and a nonlinearity during the training phase using only a few representative values. A deeper study of the model shows that the results actually hold over a large range of hyperparameters (e.g., number of neurons, sparsity, global weight scaling), such that any large enough population, mixing excitatory and inhibitory neurons, can quickly learn to realize such gated working memory. In a nutshell, with a minimal set of hypotheses, we show that we can have a robust model of working memory. This suggests that gated working memory could be an implicit property of any random population, one that can be acquired through learning. Furthermore, considering working memory to be a physically open but functionally closed system, we account for some counterintuitive electrophysiological recordings.

## 1 Introduction

The prefrontal cortex (PFC), noteworthy for its highly recurrent connections (Goldman-Rakic, 1987), is involved in many high-level capabilities, such as decision making (Bechara, Damasio, Tranel, & Anderson, 1998), working memory (Goldman-Rakic, 1987), goal-directed behavior (Miller & Cohen, 2001), and temporal organization and reasoning (Fuster, 2001). In this letter, we are more specifically interested in gated working memory (O'Reilly & Frank, 2006), defined as the capacity of holding arbitrary information at a given random time $t_0$ so as to be accessible at a later random time $t_1$ (see Figure 1). Between times $t_0$ and $t_1$, we make no assumption about the inner mechanisms of the working memory. The only measures we are interested in are the precision of the output (compared to the initial information) and the maximal delay during which this information can be accessed within a given precision range. One obvious and immediate solution to the task is to make an explicit copy (inside the memory) of the information at time $t_0$ and hold it unchanged until it is read at time $t_1$, much like a computer program variable that is first assigned a value in order to be read later. Such a solution can be easily characterized by a fixed pattern of sustained activities inside the memory. This is precisely what led researchers to search for such sustained activity inside the frontal cortex (Funahashi, 2017; Constantinidis et al., 2018), where an important part of our working memory capacities is believed to be located. Romo, Brody, Hernández, and Lemus (1999) have shown that PFC neurons of nonhuman primates can maintain information about a stimulus for several seconds. Their firing rate was correlated with the coding of a specific dimension (frequency) of the stimulus maintained in memory.
However, when Machens, Romo, and Brody (2010) later reanalyzed the data of this experiment, they showed that the stimulus was actually encoded over a subpopulation using a distributed representation. Similarly, when Rigotti et al. (2013) analyzed single-neuron activity recorded in the lateral PFC of monkeys performing complex cognitive tasks, they found several neurons displaying task-related activity. Once they discarded all the neurons that were displaying task-related activity, they were still able to decode task information with a linear decoder. They proposed that the PFC hosts high-dimensional linear and nonlinear mixed-selectivity activity. The question is thus, if working memory is not encoded in the sustained activity, what can the alternatives be?

Before answering that question, we first characterize the type and the properties of information we consider before defining what it means to access the information.

The type of information that can be stored inside working memory has been characterized using different cognitive tasks—for example, delayed matching-to-sample (DMTS), the N-back task, or the Wisconsin Card Sorting Task (WCST). From these different tasks, we can assume that virtually any kind of information, be it an auditory or visual stimulus, textual or verbal instruction, implicit or explicit cue, can be memorized and processed inside the working memory. From a computational point of view, this can be abstracted into a set of categorical, discrete, or continuous values. In this work, we are interested only in the most general case, the continuous one, which can be reduced to a single scalar value. The question we want to address is how a neural population can gate and maintain an arbitrary (e.g., random) value, in spite of noise and distractors, such that it can be decoded nonambiguously at a later time.

To answer this question, we can search the extensive literature on computational models of working memory, reviewed by Durstewitz, Seamans, and Sejnowski (2000), Compte (2006), and, more recently, Barak and Tsodyks (2014). More specifically, Compte (2006) explains that the retention of information is often associated with attractors. In the simplest case, continuous scalar information can be identified with a position on a line attractor. However, this kind of memory exhibits stability issues. Unlike a point attractor (i.e., a stable fixed point), a line attractor is marginally stable, such that it cannot be robust against all small perturbations. Such a line attractor can be stable against orthogonal perturbations but not against collinear perturbations (i.e., perturbations along the line). Furthermore, the design (and numerical implementation) of a line attractor is tricky because even small imperfections (e.g., numerical errors) can lead to instability. Nevertheless, several models can overcome these limitations.

This is the case for the theoretical model by Amari (1977), who proved that a local excitation could persist in the absence of stimuli, in the form of a localized bump of activity in an homogeneous and isotropic neural field model, using long-range inhibition and short-range excitation. This model represents de facto a spatial attractor formed by a collection of localized bumps of activity over the plane. A few decades later, Compte (2000) showed that the same lasting bump property can also be achieved using leaky integrate-and-fire neurons arranged on a ring, with short-range excitation and constant inhibition (ring bump attractor). This model has since then been extended (Edin et al., 2009; Wei, Wang, & Wang, 2012) with the characterization of the conditions allowing simultaneous bumps of activity. This would explain multi-item memorization, where each bump represents different information that is maintained simultaneously with the other bumps. Similarly, Bouchacourt and Buschman (2019) proposed handling multi-item memorization by duplicating the bump attractor model. They explicitly limited the number of items to be maintained in memory through the interaction of the different bump attractor models (using a random layer of neurons).

If all of these models can cope with the memorization of graded information, this information is precisely localized in the bumps of activity and corresponds to sustained activity. Such patterns of activity have been identified in several cerebral structures—for example, head direction cells (Zhang, 1996) in mammals and the superior colliculus (Gandhi & Katnani, 2011) in primates—but it is not yet clear to what extent this can give an account of a general working memory mechanism. Such sustained activity is also present in the model of Koulakov, Raghavachari, Kepecs, and Lisman (2002), who consider a population of bistable units that encodes (quasi) graded information using distributed encoding (percentage of units in the high state). This solves both the robustness and stability issues of the line attractor by virtue of discretization. Finally, some authors (Zipser, Kehoe, Littlewort, & Fuster, 1993; Lim & Goldman, 2013) consider the encoding of the value to be correlated with the firing rate of a neuron or of a group of neurons. This is the case for the model proposed by Lim and Goldman (2013), who obtain stability of the firing rate by adding negative derivative self-feedback (hence artificially increasing the time constant of neurons). They show how such a mechanism can be implemented by the interaction of two populations of excitatory and inhibitory neurons evolving at different timescales. However, independent of the encoding of the graded value, most model authors are interested in characterizing the mechanism responsible for the maintenance property. They tend to consider the memory as an isolated system, not prone to external perturbations, with the notable exception of the model by Zipser et al. (1993), which is constantly fed by an input.

In this work, we consider working memory to be an open system under the constant influence of external activities (even when the gate is closed). Thus, we cannot rely on a dynamical system that hosts a line attractor in the absence of inputs. We have to design an input-dependent dynamical system that is robust against all kinds of perturbations (input, internal, output feedback). First, we formalize a set of tasks that we used to study the features and performance of the different models we have considered. Then, we introduce a minimal model that will help us explain the mechanism needed for a more general model. For this general one, we consider a particular instance of reservoir: an echo state network (ESN; Jaeger, 2001). The analysis of this model will allow us to show that reservoir activity is characterized by a combination of both sustained and transient activities. Moreover, we show that none of these activities are critical for the decoding of the correct output. Finally, we show that in the absence of input, the dynamics of the model implement a segment attractor (i.e., a line attractor with bounding values).

## 2 Methods

In this section, we formalize and extend the gated working memory task that we described in section 1 (see Figure 1). We consider four tasks that will illustrate the generic capacity of the reservoir model to store continuous values or a discrete set of values. The first three are variations of the working memory (WM) task for continuous values with various numbers of input values ($n$-value) and gating WM units ($n$-gate). The last task includes nonlinear computation (digit recognition from pixels) in addition to the gating task for discrete values.

### 2.1 The $n$-Value $p$-Gate Scalar Working Memory Tasks

This can be easily generalized to an $n$-value, $p$-gate scalar task with $n$ input signals, $p$ input triggers, and $p$ outputs. Only the first input signal and the $p$ input triggers determine the $p$ outputs. The other $n-1$ input signals are additional inputs irrelevant to the task.

### 2.2 The Digit One-Value, One-Gate Working Memory Task

For this task, we draw a gray-scaled bitmap representation of a sequence of random digits (0 to 9), each digit being of size $6\times7$ pixels (after having cropped top and bottom empty lines) and the trigger signal being expanded to the width of a glyph. The output is defined as a discrete and normalized value. There is no possible linear interpolation between the different inputs, as was the case for the scalar version. Formally, we can define the output as

### 2.3 The Minimal Model

Consequently, the trigger $T[n]$ fully dictates the output. When $T[n]=0$, the output $M[n+1]$ is unchanged and corresponds to the current memory ($M[n]$). When $T[n]=1$, the output $M[n+1]$ is assigned the value $V[n]$ of the input. We think this represents the minimal model able to solve the gated working memory task using such simple neurons (with a tanh activation function). By taking advantage of the linear regime around 0 and the asymptotic behavior toward infinity, this model gracefully solves the task using only two parameters ($a$ and $b$). However, we have no formal proof that a model having only two neurons in the reservoir part cannot solve the task.
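The behavior described above can be sketched in a few lines. The update equations below (three tanh neurons combined into the output $M$, with $X1$ tracking $V$, and $X2$, $X3$ saturating on a trigger) are our reconstruction from the description, not necessarily the letter's exact formulation:

```python
import numpy as np

def minimal_model(V, T, a=1000.0, b=1e-3):
    """Reconstructed sketch of the three-neuron minimal model.

    X1 follows V, X2 follows V but saturates on a trigger, and X3
    follows M and saturates on a trigger (an assumption consistent
    with the text, not the letter's verbatim equations).
    """
    M = np.zeros(len(V) + 1)
    for n in range(len(V)):
        x1 = np.tanh(b * V[n])              # quasi-linear copy of V
        x2 = np.tanh(a * T[n] + b * V[n])   # saturates when T = 1
        x3 = np.tanh(a * T[n] + b * M[n])   # saturates when T = 1
        # T = 0: x1 and x2 cancel, so M[n+1] ≈ tanh(b*M[n])/b ≈ M[n].
        # T = 1: x2 and x3 cancel (both saturate), so M[n+1] ≈ V[n].
        M[n + 1] = (x1 - x2 + x3) / b
    return M[1:]
```

With $a$ large and $b$ small, a value gated at a trigger is held with a per-step drift of order $b^2$, matching the quasi-line-attractor behavior analyzed in section 3.1.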

This gating is achieved using only simple $\tanh$ neurons, in comparison to handcrafted LSTM-like cells such as the gated recurrent unit (GRU). Without the reset gate, the dynamics of a GRU cell can be simplified to
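In standard notation (our reconstruction of the usual GRU equations, with $z$ the update gate, $\sigma$ the logistic sigmoid, and $\odot$ the elementwise product):

$$ z[n] = \sigma\left(W_z u[n] + U_z h[n-1]\right), \qquad h[n] = (1 - z[n]) \odot h[n-1] + z[n] \odot \tanh\left(W_h u[n] + U_h h[n-1]\right), $$

so that $z[n]$ plays a role analogous to the trigger $T$, interpolating between holding the previous state and loading a new candidate value.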

### 2.4 The Reservoir Model
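As a hedged sketch of the standard leaky ESN update with output feedback (the formulation we assume for equation 2.11), using illustrative dimensions and the scalings of Table 1:

```python
import numpy as np

rng = np.random.default_rng(42)
n_units, n_in = 100, 2   # illustrative size; the two inputs are V and T

# Sparse random recurrent weights, rescaled to spectral radius 0.1.
W = rng.uniform(-1, 1, (n_units, n_units)) * (rng.random((n_units, n_units)) < 0.5)
W *= 0.1 / max(abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-1, 1, (n_units, n_in))   # input scaling 1.0
W_fb = rng.uniform(-1, 1, (n_units, 1))      # feedback scaling 1.0

def step(x, u, y, leak=1.0, noise=1e-4):
    """One leaky-integrator update (a sketch of equation 2.11)."""
    pre = W @ x + W_in @ u + W_fb @ y + noise * rng.uniform(-1, 1, n_units)
    return (1 - leak) * x + leak * np.tanh(pre)

x = step(np.zeros(n_units), np.array([0.5, 1.0]), np.array([0.0]))
```

In this formulation, the linear readout producing the output (and fed back into the reservoir) is the only trained part.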

## 3 Results

### 3.1 The Reduced Model

The reduced model displays quasi-perfect performance (RMSE $= 2\times10^{-6}$ with $a=10^3$ and $b=10^{-3}$), as shown in Figure 5, and the three neurons $X1$, $X2$, and $X3$ behave as expected. $X1$ is strongly correlated with $V$ (see Figure 5B), $X2$ is strongly correlated with $V$ and saturates in the presence of a tick in $T$ (see Figure 5C), and $X3$ is strongly correlated with $M$ and saturates in the presence of a tick in $T$ (see Figure 5D). This reduced model is actually a very good approximation of a line attractor (i.e., a line of points with very slow dynamics), even though we can prove that due to the $\tanh$ nonlinearity, in the absence of inputs, the model will converge to a null state (possibly after a very long time), independent of parameters $a$ and $b$ and of the initial state. Nonetheless, Figure 5 clearly shows that information can be maintained provided $b$ is small enough. There is a drift, but this drift is so small that it can be considered negligible relative to the system time constants: these slow points can be considered as a line or segment attractor (Seung, 1996; Sussillo & Barak, 2013). As Seung (1998) explained, “The reader should be cautioned that the term ‘continuous attractor’ is an idealization and should not be taken too literally. In real networks, a continuous attractor is only approximated by a manifold in state space along which drift is very slow.” Nevertheless, it is worth mentioning that in order to have a true line attractor, one can replace the $\tanh$ activation function with a linear function saturated at 1 and $-1$.
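The difference between the $\tanh$ unit (slow drift toward 0) and its saturated-linear replacement (a true segment attractor) can be checked numerically. The scalar no-trigger update $M \leftarrow \tanh(bM)/b$ used below is our reduction of the memory dynamics described above:

```python
import math

b = 1e-3
m_tanh = m_clip = 0.8   # stored value; no trigger from here on

for _ in range(100_000):
    m_tanh = math.tanh(b * m_tanh) / b            # drifts slowly toward 0
    m_clip = max(-1.0, min(1.0, b * m_clip)) / b  # saturated linear: holds
```

After $10^5$ steps, the $\tanh$ version has drifted by roughly 2% while the clipped version is essentially unchanged, illustrating the “very slow drift” caveat quoted above.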

### 3.2 The Reservoir Model

Unless specified otherwise, all the reservoir models were parameterized using values given in Table 1. These values were chosen to be simple and do not have an impact on the performance of the model, as we explain in section 3.3. All simulations and figures were produced using the Python scientific stack: SciPy (Jones, Oliphant, & Peterson, 2001), Matplotlib (Hunter, 2007), and NumPy (van der Walt, Colbert, & Varoquaux, 2011). (Sources are available at github.com/rougier/ESN-WM.)

| Parameter | Value |
|---|---|
| Spectral radius | 0.1 |
| Sparsity | 0.5 |
| Leak | 1.0 (no leak) |
| Input scaling | 1.0 |
| Feedback scaling | 1.0 |
| Number of units | 1000 |
| Noise | 0.0001 |
| Training time steps | 25,000 |
| Testing time steps | 2,500 |
| Trigger probability | 0.01 |


Note: Unless specified otherwise, these are the parameters used in all the simulations.

Results for the reservoir model show very good generalization performance, with a precision on the order of $10^{-3}$ for the level of noise considered ($10^{-4}$). Better precision can be obtained for lower noise levels, as shown in Figure 7. Surprisingly, this generalization property stands with as few as four random training values, for which we can already achieve a $10^{-3}$ level of precision.

#### 3.2.1 One-Value, One-Gate Scalar Task

The model has been trained using the parameters given in Table 1. The $V$ signal is made of 25,000 random values sampled from a pseudo-random uniform distribution between $-1$ and $+1$. The $T$ signal is built from 25,000 random binary values with probability 0.01 of having $T=1$ and $T=0$ otherwise. During training, each of the inputs is presented to the model, and the output is forced with the last triggered input. All the input ($u$) and internal ($x$) states are collected, and the matrix $W_{out}$ is computed according to equation 2.12. The model has been tested using a $V$ signal made of 2500 random values sampled from a pseudo-random uniform distribution between $-1$ and $+1$. For readability of the figure (see Figure 6A), this signal has been smoothed using a fixed-size Hann window filter. The corresponding $T$ signal has been generated following the same procedure as during the training stage. Figure 6A displays an illustrative test run of the model (for a more thorough analysis of the performance, see section 3.3) where the error in the output is always kept below $10^{-2}$ and the RMSE is about $3\times10^{-3}$.
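The offline readout computation can be sketched as follows. This is a reconstruction: we assume equation 2.12 is a regularized least-squares fit of the collected states onto the teacher-forced output, and the ridge term is our assumption (a plain pseudo-inverse would serve the same role):

```python
import numpy as np

def train_readout(X, Y, ridge=1e-6):
    """Fit W_out by ridge regression on collected states.

    X: (timesteps, features) concatenated input and internal states,
    Y: (timesteps, outputs) teacher-forced target (last triggered value).
    """
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ Y)   # W_out: (features, outputs)

# Toy check: the fit recovers a known linear map from random states.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
W_true = rng.normal(size=(20, 1))
W_out = train_readout(X, X @ W_true)
```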

#### 3.2.2 One-Value, Three-Gate Scalar Task

We trained the model on the one-value, three-gate task using the same protocol as for the one-value, one-gate task, with a single-value input, three input triggers, and three corresponding outputs. Since there are now three feedbacks, we divided the respective feedback scaling by 3. Figure 6B shows that maintaining information simultaneously has an impact on the performance of the model (illustrative test run). There is no catastrophic effect, but performance is clearly degraded when compared to the one-value, one-gate task. The RMSE on this test run increased by one order of magnitude and is about $2\times10^{-2}$. Nevertheless, in the majority of the cases we tested, the error does stay below $10^{-2}$. However, in a few cases, one memory (and not necessarily all) degrades faster.

#### 3.2.3 Three-Value, One-Gate Scalar Task

We used the same protocol as for the one-value, one-gate scalar task, but there are now two additional inputs unrelated to the task that can be considered as noise. Adding such irrelevant inputs had no effect on solving the task, as illustrated in Figure 6C, which shows an illustrative test run of the model. The error in the output is also always kept below $10^{-2}$, and the RMSE is about the same ($3\times10^{-3}$). This is an interesting result because it means the network is not only able to deal with background noise at $10^{-4}$, but is also able to deal with noise that has the same amplitude as the input. This is an important property to be considered for the modeling of the prefrontal cortex: being an integrative area, the PFC typically deals with multimodal information, much of it not relevant for the working memory task at hand (Mante, Sussillo, Shenoy, & Newsome, 2013).

#### 3.2.4 One-Value, One-Gate Digit Task

The model has been trained using 25,000 random integer values between 0 and 9 sampled from a uniform distribution (these values are then divided by 10 to fit in [0, 1]). Each of these values is drawn one after the other onto an image, each character covering six columns of the image. The input $V$ then consists of the rows of this image. The $T$ signal is sampled as in the one-value, one-gate task and then expanded sixfold to match the conversion of each value into an image, meaning that the trigger lasts six time steps. Interestingly, as we show in Figure 6D, even if the value to maintain is no longer explicit, it can still be extracted and maintained. On the test run, the RMSE is about $4\times10^{-2}$. It is to be noted that the recognition of a digit is not straightforward and may require a few time steps before the digit is actually identified. However, once a good value is maintained, it seems to last: the absolute error stays below 0.05, the threshold under which two consecutive digit values can still be distinguished. The reservoir parameters that we found are thus robust enough to enable not only a pure memory task (i.e., gating), but also a discrimination task (i.e., digit recognition).

Dambre, Verstraeten, Schrauwen, and Massar (2012) demonstrated the existence of a universal trade-off between the nonlinearity of the computation and the short-term memory in the information processing capacity of any dynamical system, including echo state networks. In other words, the hyperparameters used to generate an optimal reservoir for solving a given memory task would not be optimal for a nonlinear computation task. Here we see that even if the reservoir is made to memorize values for a while, it is still able to perform a nonlinear task such as discriminating a stream of digits. Pascanu and Jaeger (2011) initiated the concept of such working memory (WM) units for reservoirs. They processed streams of characters using six binary WM units to record the nesting depth of curly brackets. We made such reservoir-WM-unit coupling more general: from binary to continuous values. Instead of relying on a collection of $N$ binary WM units to encode $N$ values, or at most $2^N$ values, we have shown that a reservoir can use only one WM unit to encode a continuous value with good precision.

### 3.3 Robustness

We analyzed the robustness of the model first by measuring its sensitivity to each of the model hyperparameters (see Figures 7A–7F): input scaling, feedback scaling, spectral radius, sparsity of the reservoir weight matrix, leak term ($\alpha$) in equation 2.11, and number of units in the reservoir. We also measured its sensitivity to the task hyperparameters (see Figures 7G–7L): the noise level ($\xi$), the number of discrete values used during training (when there is a trigger, $V$ takes one of a fixed number of discrete values uniformly spread between $-1$ and 1), the temporal delay between successive gating signals in training ($T$ is built by sampling its interval between triggers uniformly between 0 and a bound), the bound used to sample the input value during training ($V$ is uniformly sampled between $-b$ and $b$, where $b$ is the bound), the total number of input gates (with a corresponding number of outputs), and the number of input values. For most hyperparameters, we set a minimum and maximum value and picked 20 values logarithmically spread inside this range. For each task and model hyperparameter, we ran 20 simulation instances for 25,000 time steps and recorded the mean performance using 2500 values. Results are shown in Figure 7.
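The sweep protocol above can be sketched as a small helper; the evaluation function below is a placeholder standing in for a full 25,000-step simulation:

```python
import numpy as np

def sweep(lo, hi, evaluate, n_values=20, n_instances=20):
    """Evaluate a hyperparameter over a logarithmic grid.

    For each of n_values logarithmically spaced values, run
    n_instances independent instances and return the mean score.
    """
    values = np.logspace(np.log10(lo), np.log10(hi), n_values)
    scores = np.array([[evaluate(v, seed) for seed in range(n_instances)]
                       for v in values])
    return values, scores.mean(axis=1)   # mean performance per value

# Dummy usage: a fake RMSE that depends only on the swept value.
values, mean_rmse = sweep(1e-2, 1e2, lambda v, seed: abs(np.log10(v)))
```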

First, we can see a nonsensitivity to sparsity (i.e., minor differences in performance when this parameter varies). Similarly, we can see a nonsensitivity to the leak term, input scaling, and feedback scaling as long as they are not too small. Note that input and feedback scaling should also not be too large. As expected, performance increases with the number of neurons. Surprisingly, we note that performance decreases with the spectral radius. In supplementary Figure S5, we analyzed the behavior of the reservoir model with various spectral radii and showed that even with a larger spectral radius, the reservoir continues to maintain something relevant but less precise (the segment attractor slowly degenerates). Globally, the reservoir model is very robust against model hyperparameter changes (as long as it is trained in each condition).

Concerning the task hyperparameters, one can see in Figure 7 that only the trigger range has no impact: whatever the time elapsed between two triggers, performance is unaffected. Performance naturally decreases with an increase in the noise level (see Figure 7G), the number of input gates (see Figure 7I), or the number of input values (see Figure 7J). We note that the number of discrete values used during training affects performance in a very specific way (see Figure 7H). With between four and seven training values, performance is already good and does not improve further with supplementary training values. This means that even if the reservoir model has been trained only to build a few stable points, it is able to interpolate and maintain the other points on the segment attractor. Interestingly, in Figure 7K, we can see a similar case of interpolation relative to the input value bound $x$ (i.e., the interval $[-x,x]$ on which the output is trained). Performance reaches a plateau once the bound reaches 0.5, while the interval used for testing is always $[-1,1]$.

### 3.4 Dynamics

#### 3.4.1 A Segment Attractor

Figure 8 shows how the model evolves after having been trained for the one-value, one-gate task, using different starting positions and receiving no input. This results in the formation of a segment attractor, even though the model was trained only to gate and memorize continuous values. If we compute a principal component analysis (PCA) on the reservoir states and look at the principal components (PCs) in the absence of inputs, we can observe (see supplementary Figure S4) that all the reservoir states are organized along a straight line on the first component (the one that explains most of the variance), and each point of this line can be associated with a corresponding memory value. Interestingly enough, there are points on this line that correspond to values outside the $[-1,1]$ range, that is, values for which the model has not been trained. However, these points are not stable, and any dynamics starting from these points converge toward the points associated with the values 1 or $-1$ (see Figure 8).
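This PCA analysis can be reproduced in a few lines. The synthetic states below are invented for illustration (memory values spread along a single direction plus noise, mimicking the structure reported in supplementary Figure S4):

```python
import numpy as np

def principal_components(states, k=2):
    """PCA via SVD on (timesteps, units) reservoir states (a sketch)."""
    centered = states - states.mean(axis=0)
    U, S, Vt = np.linalg.svd(centered, full_matrices=False)
    explained = S**2 / np.sum(S**2)       # explained variance ratios
    return centered @ Vt[:k].T, explained[:k]

# Synthetic states: each stored memory value m places the state on a line.
rng = np.random.default_rng(1)
direction = rng.normal(size=50)
m = rng.uniform(-1, 1, size=(500, 1))     # stored memory values
states = m * direction + 0.01 * rng.normal(size=(500, 50))
proj, explained = principal_components(states)
```

Here the first component explains nearly all the variance, and its projection recovers the stored value, as in the trained reservoir.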

#### 3.4.2 $V$-Like and $M$-Like Neurons

Similar to the minimal model, in the absence of input, the inner dynamics of the reservoir model are a combination of both sustained and highly variable activities. More precisely, in the one-value, one-gate task, we notice two types of neurons that are similar to neurons $X1$ and $X3$ in the reduced model: (1) neurons that solely follow the input $V$ (i.e., $V$-like neurons) and (2) neurons that mostly follow the output $M$ (i.e., $M$-like neurons). We also notice that the average activity of $M$-like neurons is linearly linked with the $M$ value and fluctuates around this mean according to the input $V$. In Figure 9, we show $M$-like neurons for the different tasks. These neurons were found by taking the neurons most correlated with the output $M$. From Figures 9A to 9D, we see that the link between the $M$-like neurons and the $M$ output gets weaker and weaker: the average sustained activity goes from nearly flat to highly perturbed.
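Ranking units by their correlation with the output, as done to find the $M$-like neurons, can be sketched as follows (function and variable names are ours):

```python
import numpy as np

def most_correlated_units(states, M, k=10):
    """Rank reservoir units by |Pearson correlation| with the output M."""
    xc = states - states.mean(axis=0)
    mc = M - M.mean()
    corr = (xc * mc[:, None]).sum(axis=0) / (
        np.linalg.norm(xc, axis=0) * np.linalg.norm(mc) + 1e-12)
    # Indices of the k most correlated units, plus all correlations.
    return np.argsort(-np.abs(corr))[:k], corr
```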

This “degradation” of sustained activity is explained by the change in the distribution of correlations of the entire reservoir population with the $M$ output. In Figure 9 (right), we see that the correlation with the $M$ output quickly shrinks from panels A to D. For the one-value, one-gate task (see Figure 9A), almost all neurons stay at the same value while maintaining the memory. However, when more values have to be maintained (see Figure 9B) or when more inputs are received (see Figure 9C), most of the activities no longer stay at the same value while maintaining the memory. In fact, in the one-value, one-gate digit task, neurons do not display sustained activity at all (see Figure 9D). Interestingly, similar behavior (no sustained activity) can be obtained by lowering the feedback scaling (see supplementary Figure S6) or by enforcing the input weights to be big enough (see supplementary Figure S8). More formally, when there is no trigger, the activity of a neuron can be rewritten as $\tanh(aX+bM)$. The two proposed modifications make the ratio $a/b$ bigger, and eventually, when $a \gg b$, $\tanh(aX+bM) \approx \tanh(aX)$. Consequently, $\tanh(aX)$ is highly correlated with $X$, as $aX$ stays bounded between $-1$ and 1 and does not depend on $M$. Similarly, when $a \ll b$, $\tanh(aX+bM) \approx \tanh(bM)$, which is in turn highly correlated with $M$ for the same reasons.

#### 3.4.3 Linear Decoder

To go further in understanding the role played by sustained activities, we wanted to know how much of these sustained activities were necessary to decode the output memory $M$. For the one-value, one-gate task, we trained a separate decoder based on a subsample of the reservoir population. We increasingly subsampled neurons under three conditions: choosing the most correlated ones first, choosing the least correlated ones first, or selecting them at random. In Figure 10, we can see two interesting facts. First, there is no need for the whole reservoir population to decode the memory $M$ well enough: taking 100 neurons among 1000 is sufficient. Second, if we take enough neurons, there is no advantage in taking the most correlated ones first; random ones are enough. Surprisingly, it even seems better to rely on randomly selected units than on the most correlated ones. This suggests that randomly distributed activities contain more information than the most correlated unit activities alone, and it offers a complementary perspective on the Rigotti et al. (2013) decoding of neural populations recorded in monkeys: when task-related neurons are not kept for decoding, information (though less accurate) can still be decoded.
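This subsampled decoding can be sketched as below. The synthetic states are invented for illustration (every unit mixes the memory with private noise), so only the qualitative effect carries over: a modest random subset decodes almost as well as the full population:

```python
import numpy as np

def decode_rmse(states, M, idx, ridge=1e-6):
    """Fit a linear decoder on the unit subset idx; return training RMSE."""
    X = states[:, idx]
    W = np.linalg.solve(X.T @ X + ridge * np.eye(len(idx)), X.T @ M)
    return np.sqrt(np.mean((X @ W - M) ** 2))

rng = np.random.default_rng(2)
m = rng.uniform(-1, 1, 2000)                   # memory values to decode
states = m[:, None] * rng.normal(size=50) + 0.1 * rng.normal(size=(2000, 50))
full = decode_rmse(states, m, np.arange(50))                 # all 50 units
subset = decode_rmse(states, m, rng.choice(50, 10, replace=False))
```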

### 3.5 Equivalence between the Minimal and the Reservoir Model

In order to understand the equivalence between the minimal and the reservoir model, it is important to note that there are actually two different regimes, as shown in Figure 11. One regime corresponds to the absence of a trigger $(T=0)$, and the other regime corresponds to the presence of a trigger $(T=1)$. When there is no trigger $(T=0)$, the activities of $X1$ and $X2$ compensate each other because they are in a quasi-linear regime ($b$ being very small), and their summed contribution to the output is nil.

In the reservoir model, we can identify an equivalent population by discriminating neurons inside the reservoir based on the strength of their input weight relative to $V$. More formally, we define $R12$ as the group of neurons whose input weight from $V$ (absolute value) is greater than 0.1. Figure 12H shows that the summed output of these neurons is quasi nil, while the complementary population, $R3$, is fully correlated with the output and is thus equivalent to the $X3$ unit in the minimal model.

Symmetrically, in the presence of a trigger $(T=1)$, the activities of $X2$ and $X3$ compensate each other because they are in a saturated regime ($a$ being very large) and their summed contribution to the output is nil. We can again identify an equivalent population in the reservoir model by discriminating neurons inside the reservoir based on the strength of their input weights relative to both $T$ and $V$. More formally, we define $R23$ as the group of neurons whose input weight from $T$ (absolute value) is greater than 0.05 and whose input weight from $V$ (absolute value) is smaller than 0.1. Figure 12J shows that the summed output of these neurons is quasi nil, while the complementary population $R1$ is fully correlated with the input $V$ and is thus equivalent to the $X1$ unit in the minimal model.

We can identify $R2$ by taking the intersection of $R12$ and $R23$, whose activity is similar to $X2$ (see Figure 12L). Consequently, we have identified in the reservoir disjoint subpopulations that are respectively and collectively equivalent to the activity of $X1$, $X2$, and $X3$ in the minimal model. Table 2 quantifies this equivalence using simple correlations. To explore this equivalence further, we also conducted comparative lesion studies between the two models (see the supplementary materials). However, with the original set of parameters for the reduced model ($a=1000, b=0.001$), lesioning $X2$ or $X3$ makes the reduced model behave in a degraded and extreme mode: outputs range from 0 to $\pm1000$ whenever there is a trigger, which makes the comparison with the reservoir difficult (see supplementary Figure S9). By choosing an alternative set of parameters ($a=1000, b=1$, which, incidentally, makes the reduced model unable to sustain memory), we can show a strong correlation with the reservoir when $X1/R1$ (resp. $X2/R2$, $X3/R3$) are silenced (see supplementary Figure S10), and this further tightens the relationship between the two models.

| Populations | Output Correlation |
|---|---|
| X1 / R1 $(T=1)$ | 0.9996 |
| X2 / R2 $(T=0)$ | 0.9997 |
| X3 / R3 $(T=0)$ | 0.9997 |

Notes: Output is restricted to the contribution of the considered subpopulation. Correlations are computed only at the time steps relevant for the subpopulation ($T=0$ or $T=1$).
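The trigger-restricted correlations of Table 2 could be computed along these lines (a hypothetical helper; the array names are assumptions, not the paper's code):

```python
import numpy as np

def gated_correlation(pop_output, reference, trigger, when=0):
    """Pearson correlation between a subpopulation's summed output
    contribution and a reference signal, restricted to the time steps
    where the trigger equals `when` (0 or 1)."""
    mask = trigger == when
    return np.corrcoef(pop_output[mask], reference[mask])[0, 1]
```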

## 4 Discussion

In computational neuroscience, the reservoir computing (RC) paradigm (Jaeger, 2001; Maass, Natschläger, & Markram, 2002; Verstraeten, Schrauwen, D'Haene, & Stroobandt, 2007), originally proposed independently by Dominey (1995) and Buonomano and Merzenich (1995),^{4} is often used as a model of canonical microcircuits (Maass et al., 2002; Hoerzer, Legenstein, & Maass, 2012; Sussillo, 2014). It is composed of a random recurrent neural network (i.e., a reservoir) from which readout units (i.e., outputs) are trained to linearly extract information from the high-dimensional, nonlinear dynamics of the reservoir. Several authors have taken advantage of this paradigm to model cortical areas such as the PFC (Hinaut & Dominey, 2013; Mannella & Baldassarre, 2015; Hinaut et al., 2015; Enel, Procyk, Quilodran, & Dominey, 2016) because most of the connections, especially the recurrent ones, are not trained. Another reason to use the reservoir computing paradigm for PFC modeling is that the PFC also hosts high-dimensional, nonlinear dynamics (Rigotti et al., 2013). RC offers a neuroanatomically plausible view of how cortex-to-basal-ganglia (i.e., cortico-basal) connections could be trained with dopamine: the reservoir plays the role of a cortical area (e.g., trained with unsupervised learning), and the readout units play the role of the basal ganglia input (i.e., the striatum).
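A minimal echo state network sketch of this paradigm, with a fixed random reservoir and a linear readout trained offline by ridge regression; the sizes, scalings, and spectral radius below are illustrative assumptions, not the settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res = 2, 300  # illustrative sizes

# Fixed random weights: neither the input nor the recurrent
# connections are ever trained.
W_in = rng.uniform(-1, 1, (n_res, n_in))
W = rng.normal(0, 1, (n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9

def run(inputs):
    """Collect reservoir states for a sequence of input vectors."""
    x = np.zeros(n_res)
    states = []
    for u in inputs:
        x = np.tanh(W @ x + W_in @ u)
        states.append(x.copy())
    return np.array(states)

def train_readout(states, targets, ridge=1e-6):
    """Train the linear readout with ridge regression (offline)."""
    A = states.T @ states + ridge * np.eye(states.shape[1])
    return np.linalg.solve(A, states.T @ targets)
```

Only `train_readout` involves learning; everything upstream of the readout stays random, which is what makes the paradigm attractive as a cortical model.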

However, in many dynamical systems, reservoirs included, there exists a trade-off between memory capacity and nonlinear computation (Dambre et al., 2012).^{5} This is why some studies have focused on reservoirs with dedicated readout units acting as WM units (Hoerzer et al., 2012; Pascanu & Jaeger, 2011; Nachstedt & Tetzlaff, 2017). These WM units have feedback connections projecting to the reservoir and are trained to store input-dependent binary values. This somewhat simplifies the task and enables the reservoir to access and use such long-term dependency information to perform a more complex task, freeing the system from having to constrain the reservoir's short-term dynamics. Such ideas already had some theoretical support; for instance, Maass, Joshi, and Sontag (2007) showed that with appropriate readout and feedback functions, readout units could be used to approximate any $k$-order differential equation. Pascanu and Jaeger (2011) used up to six binary WM units to store information in order to solve a nested bracketing-levels task. Using principal component analysis, they showed that these binary WM units constrain the reservoir in lower-dimensional “attractors.” In addition, Hoerzer et al. (2012) showed that analog WM units (encoding binary information) also drive the reservoir into a lower-dimensional space (i.e., 99% of the variability of the reservoir activities is explained by fewer principal components). More recently, Strock, Rougier, and Hinaut (2018) and Beer and Barak (2019) used such WM units to store analog values (as opposed to binary ones) in order to build a line attractor (Seung, 1996; Sussillo & Barak, 2013). In particular, Beer and Barak (2019) explored how a line attractor can be built online, comparing the FORCE (Sussillo & Abbott, 2009) and LMS algorithms, using a WM unit to maintain a continuous value in the absence of input perturbations.
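The feedback architecture described here can be sketched as a reservoir update that receives the WM readout back as an input; all weights, dimensions, and scalings below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n_res = 300
W = rng.normal(0, 1.0 / np.sqrt(n_res), (n_res, n_res))  # recurrent weights
W_in = rng.uniform(-1, 1, (n_res, 2))  # inputs: value V and trigger T
W_fb = rng.uniform(-1, 1, n_res)       # feedback from the WM readout
w_out = rng.normal(0, 0.01, n_res)     # readout weights (to be trained)

def step(x, v, t, m):
    """One update: the WM unit's previous value m is fed back into the
    reservoir alongside the external inputs V and T."""
    x = np.tanh(W @ x + W_in @ np.array([v, t]) + W_fb * m)
    return x, w_out @ x  # new state and new WM value
```

Because the readout value `m` re-enters the state update, training `w_out` shapes a closed loop rather than a feedforward mapping, which is what lets the trained unit hold information beyond the reservoir's intrinsic short-term memory.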

In that context, the minimal model of three neurons we have proposed helps to understand the mechanisms that allow a reservoir model to gate and maintain scalar values, instead of just binary ones, in the presence of input perturbations. As explained previously, this minimal model exploits the nonlinearity and the asymptotic behavior of the three $tanh$ units and mimics a select operator between the input signal and the output. In the case of the reservoir model, there is no precise architecture or crafted weights, but this is compensated for by the size of the population inside the reservoir, along with the training of the output weights. More precisely, we have shown that virtually any population of randomly connected units is able to maintain an analog value at any time and for an arbitrary delay. Taking advantage of the nonlinearity of the neuron transfer function, we have shown how such a population can learn a set of weights during the training phase using only a few representative values. Given the random nature of the architecture and the large set of hyperparameters for which the precision of the output remains acceptable, this suggests that this property could be a structural property of any population, acquired through learning. To achieve this property, we mainly used offline training in our analyses (for efficiency reasons), but we have shown that it also works with online FORCE learning (see the supplementary materials and Figures S2 and S3).

We have shown that the reservoir model behaves similarly to the minimal model, with the presence of two “macrostates” that are implemented by compensatory clusters. In a nutshell, this working memory uses two distinct mechanisms: a selection mechanism (i.e., a switch) and a line attractor. Such mechanisms have also been reported in a recurrent neural network fully trained with backpropagation (Mante et al., 2013). The authors proposed a context-dependent selective integration task and showed that “the rich dynamics of PFC responses during selection and integration of inputs can be characterized and understood with just two features of a dynamical system—the line attractor and the selection vector, which are defined only at the level of the neural population.” However, in our case, we did not only rely on the analysis of the dynamical system to understand the behavior of the system: we were able to design a minimal model implementing these mechanisms and to show that these same mechanisms are also present in the reservoir model, but in a distributed way.

Finally, one important feature of the model is that it is actually an open system and, as such, is continuously under the direct influence of external activities. More precisely, the model is able to retain information when the gate is closed, but this closed gate corresponds to a functional state rather than a physical state in which input signals would be blocked: information continues to enter the system. This is illustrated quite clearly when one looks at the internal activities inside the reservoir: a large number of neurons are directly (and only) correlated with the input signals. This has consequences for the analysis of the dynamics of the population: this population is partly driven by the (working memory) task and partly driven by uncorrelated external activities. If we go back to biology, this makes perfect sense: a population of neurons is never isolated from the rest of the brain. When studying electrophysiological recordings, we have to keep in mind that such activities can be fully uncorrelated with the task being observed. This might be one of the reasons for the variability of hypotheses about working memory encoding mechanisms.

## Notes

^{1}

The Inconsolata font is available from https://www.levien.com/type/myfonts/inconsolata.html.

^{2}

We can find a similar LSTM variant in Greff, Srivastava, Koutnik, Steunebrink, and Schmidhuber (2017), the coupled input and forget gate.

^{3}

Because a trigger input has substantial influence on the reservoir states, we made the categorization by ignoring the time steps when there is a trigger.

^{4}

Earlier formulations of very similar concepts can be found in Jaeger (2007).

^{5}

For reservoirs, this trade-off depends on the hyperparameters (HP) chosen. Some HP sets give more memory, others more computational capacity (Legenstein & Maass, 2007).

## References

## Author notes

X.H. and N.R. contributed equally.