How episodic memories are formed in the brain is a continuing puzzle for the neuroscience community. The brain areas that are critical for episodic learning (e.g., the hippocampus) are characterized by recurrent connectivity and generate frequent offline replay events. The function of the replay events is a subject of active debate. Recurrent connectivity, computational simulations show, enables sequence learning when combined with a suitable learning algorithm such as backpropagation through time (BPTT). BPTT, however, is not biologically plausible. We describe here, for the first time, a biologically plausible variant of BPTT in a reversible recurrent neural network, R2N2, that critically leverages offline replay to support episodic learning. The model uses forward and backward offline replay to transfer information between two recurrent neural networks, a cache and a consolidator, that perform rapid one-shot learning and statistical learning, respectively. Unlike replay in standard BPTT, this architecture requires no artificial external memory store. This approach outperforms existing solutions such as random feedback local online learning and reservoir networks. It also accounts for the functional significance of hippocampal replay events. We demonstrate the R2N2 network properties using benchmark tests from computer science and simulate the rodent delayed alternation T-maze task.

Forming memories of our lives’ episodes requires the ability to encode and store extended temporal sequences. Those sequences could be things said, places visited, or innumerable other sequences of states. Beyond enabling humans’ pastime of recounting our prior experiences, episodic memories are the basis of predictive models of how the world works that support adaptive decision making (Murty et al., 2016; Gershman, 2018). How brains build memories of temporal sequences remains poorly understood. It is known that specific brain circuits (e.g., the hippocampal formation; Diba & Buzsáki, 2007; Hemberger et al., 2019; Schuck & Niv, 2019; Vaz et al., 2020; Eichenlaub et al., 2020; Fernández-Ruiz et al., 2019; Michon et al., 2019) and functional dynamics (e.g., hippocampal replay events) are particularly important. However, the functional principles by which the hippocampus and replay events enable sequence or episode encoding remain a puzzle. Machine learning approaches can solve this problem but are biologically implausible. Here, we explore how modifying the principles that underlie those machine learning approaches to be biologically plausible may advance our understanding of how the brain builds episodic memories.

Artificial neural networks, when trained with engineered machine learning approaches, are capable of encoding protracted temporal sequences. Temporal sequence learning is a task solved particularly well by recurrent neural networks (RNNs). RNNs contain one or more layers of neurons with reciprocal connections among the neurons in that layer (i.e., recurrent connections). This means that the activity of a neuron is a function of both the activity in other layers and the activity of its own layer a moment prior. When the recurrent connections are tuned appropriately, the network becomes capable of recognizing sequences, predicting upcoming transitions, and intrinsically recalling sequences. The key challenge, of course, is how to tune the connections. This breaks down into two specific questions: “What learning algorithm allows for reliable encoding and storage of extended sequences from as little as a single experience with that sequence?” and “How can this learning be done in a biologically plausible way?”

A potent learning algorithm that enables artificial RNNs to effectively encode extended temporal sequences is backpropagation through time (BPTT; Werbos, 1990). Briefly, BPTT works as follows: A sequence of patterns is applied to an input layer of an RNN. A recurrent layer integrates this input along with its own state, generating a temporally evolving pattern of activity. A full record of the spatiotemporal activity of the input and recurrent layers is stored. Given the activity of the recurrent layer, the network can then predict subsequent outputs or any signals that it is trained with. Differences between the predicted and actual next states are errors. Errors are the product of the current connection strengths and the past activity of the input and recurrent network layers. Following the presentation of a sequence, during an offline learning phase, BPTT combines information about the connection strengths and the activity history while propagating the error back along the computation graph to attribute blame for the errors to individual network connections (see Figure 1A). Finally, individual connections are weakened or strengthened proportionally to their share of the blame for the error. With repeated presentations and epochs of offline learning, the network becomes able to accurately predict or generate the sequence.
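To make the bookkeeping explicit, here is a minimal NumPy sketch of this procedure for a vanilla RNN trained to predict its next input; the dimensions, tanh nonlinearity, learning rate, and squared-error loss are illustrative assumptions, not details of any model discussed in this article.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, T = 4, 16, 10
W_in = rng.normal(0, 0.3, (n_h, n_in))    # input -> recurrent
W_h = rng.normal(0, 0.3, (n_h, n_h))      # recurrent -> recurrent
W_out = rng.normal(0, 0.3, (n_in, n_h))   # recurrent -> output

x = rng.normal(size=(T, n_in))            # input sequence
target = np.roll(x, -1, axis=0)           # task: predict the next input

# ---- forward pass: store the full activation history ----
h = np.zeros((T + 1, n_h))
for t in range(T):
    h[t + 1] = np.tanh(W_in @ x[t] + W_h @ h[t])
y = h[1:] @ W_out.T                       # predictions
err = y - target                          # output errors

# ---- offline phase: propagate error backward through time ----
dW_in, dW_h, dW_out = np.zeros_like(W_in), np.zeros_like(W_h), np.zeros_like(W_out)
e_h = np.zeros(n_h)                       # error arriving from the future
for t in reversed(range(T)):
    e_h = W_out.T @ err[t] + W_h.T @ e_h  # note the transposed weights
    e_h *= 1 - h[t + 1] ** 2              # tanh derivative
    dW_out += np.outer(err[t], h[t + 1])
    dW_in += np.outer(e_h, x[t])          # needs the stored history x[t], h[t]
    dW_h += np.outer(e_h, h[t])

lr = 0.01
W_in -= lr * dW_in; W_h -= lr * dW_h; W_out -= lr * dW_out
```

Note how the backward loop depends on both the stored activation history and the transposed weight matrices; these are exactly the two ingredients whose biological plausibility is questioned below.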

Figure 1:

(A) Temporally unfolded computational process of an RNN in BPTT. The recurrent layer at each time step generates an output through the blue projection; error signals (yellow circles) are computed according to the output layer and propagated to the recurrent layer via the yellow projection. To compute the gradient at different time steps, error signals also need to be propagated temporally, that is, backward in time. (B) Consolidator-cache model. The consolidator-cache system first generates output and gets feedback from the environment. Then the cache network stores sequences and plays them back in reverse order to the consolidator network, which in turn optimizes itself based on the replayed content.


Though BPTT is an effective RNN learning algorithm, its value for explaining how brains form episodic memories is questionable because it is not biologically plausible. The implausibility results from violations of the locality constraint. The locality constraint captures the fact that biological connections, synapses, can only be changed given locally available information. BPTT violates this constraint in two key ways.

First, BPTT stores and uses an external record of the network’s past activity. Though this activity information was available as it propagated across the network during the presentation of the input sequence, it is no longer locally available during the offline learning phase. Moreover, the prior activation states cannot be recomputed with locally available information: the state at time $t$ depends on the state at time $t-1$, information that is no longer available at time $t+1$. Ordinary BPTT solves this by literally saving a record of the activation history (e.g., in RAM or GPU memory). Resolving this form of biological implausibility requires addressing how information about the activation history can be obtained in reverse chronological order with only locally available information.

Second, BPTT violates the locality constraint in how information about current connection strengths is used in the offline learning phase. To attribute blame for error to individual connections, BPTT propagates error backward over the computation graph. This can be accomplished in two conceptually distinct ways, but both are biologically implausible. The first is to implement a separate error network wherein each connection (e.g., between neurons A and B) is defined as the transpose of the corresponding connection in the main network. In other words, the strength of connection BA in the error network is identical to the strength of AB in the main network. This makes the error passed from B to A in the error network proportional to the amount of activity passed from A to B in the main network. In this way, the error network accurately attributes errors to each connection, and the main network is then updated accordingly. The key implausibility, called the “weight transport problem,” is how the connections of the error network come to mirror those of the main network and how the main network updates are informed by the processing in the error network. The second way to backpropagate errors addresses the weight transport problem but suffers from another implausibility: it propagates error backward in the main network itself, running connections in reverse to carry information about the error. This removes the need to transfer connection information between networks but creates the need to pass information backward across connections. Though dedicated backward projections exist, particularly in the sensory processing stream, and are essential in theories like predictive coding (Huang & Rao, 2011), individual synapses generally do not run in reverse in terms of propagating downstream neural activity (Lillicrap et al., 2020). Resolving these implausibilities requires addressing the question of how connection strength information can be factored into learning so that only locally available information is used.

Establishing biologically plausible means of tuning neural networks to store temporal sequences is an important and ambitious goal. It is important for its potential to offer a functional hypothesis for how neural systems support memory for episodes, or protracted sequential events. It is ambitious because the native mechanisms of BPTT were engineered specifically to meet these functional needs. Reaching this goal requires addressing the biological implausibilities in retrieving the activation history of the input and recurrent layers and in how information about connection strengths is used to attribute error across connections and backward in time.

Solutions have been proposed previously for both implausibilities, but each has suffered from notable limitations. The reliance on external storage, for example, has been addressed with various approaches. The specifics of those approaches differ, but they share a common feature: they omit the need for offline access to a record of the activity through clever handling of the activity while it is still present during the original presentation of the sequence. That is, they compute the sources of error in an online way (Depasquale et al., 2018; Murray, 2019; Tallec & Ollivier, 2017; Bellec et al., 2019). These approaches are remarkable for their ability to leverage on-the-fly computations to support learning, but the adaptations come at a substantial cost to final performance. Moreover, the omission of offline replay does not improve biological plausibility. Offline forward and reverse replays are well established to occur in biological neural networks (Foster & Wilson, 2006). Indeed, there is strong evidence that offline replay is essential for learning (Jadhav et al., 2012). For offline replay to exist, however, there must be a way to regenerate the patterns. Defining how this occurs is a puzzle that we address in this article.

Solutions also exist to address the weight transport problem and the reversible connections problem, but they too have limitations. For example, the reversible connection problem comes up in training feedforward networks. In that setting, it was shown that knowledge of the weights is not needed and that fixed random top-down connections can function to train various networks (Lillicrap et al., 2016; see also Akrout et al., 2019). This is referred to as feedback alignment. Though originally designed for feedforward networks, feedback alignment can also be used in RNNs. One such variant is random feedback local online (RFLO; Murray, 2019). The feedback alignment approach of RFLO effectively addresses the biological implausibility issue. Critically, however, RFLO is functionally limited to propagating error only one step backward in time (Marschall et al., 2020). A method capable of tracking the temporal gradient over many time steps in RNNs remains lacking; this is a gap we address in this article. There are also variants such as e-prop (Bellec et al., 2019), which is claimed to be a biologically plausible learning algorithm for spiking and other types of RNNs. However, e-prop comes in several major versions. The first version, e-prop-1, to our knowledge, works similarly to RFLO, applying direct feedback alignment in RNNs. Later versions, e-prop-2 and -3, leverage extra information from modules learned via BPTT to improve the effect of online error propagation.

Though evolution-based explanations can be used to argue for the biological plausibility of such BPTT-optimized external error modules, we do not consider them pure, BPTT-free, biologically plausible learning algorithms for RNNs. We present here a biologically plausible recurrent network model of episodic learning that is based on BPTT but free of its biological implausibilities. This model, referred to here as R2N2 (short for reversible recurrent neural network), fully satisfies the locality constraint. R2N2 uses no external record of the network’s past activity. Instead, it leverages two previously described solutions for enabling reversible reactivation of a network, one for the input layer and one for the recurrent layer. Further, R2N2 neither transports weights nor assumes reversible synapses. Instead, it leverages an error network that is controlled by, and controls, the main network in a way that allows error backpropagation to train the network without weight transport. The individual components of R2N2 are each based on established approaches; the full R2N2 model, and the fact that it collectively represents a high-functioning, biologically plausible replacement for BPTT, is novel. The specifics of each separate solution and the operation of the full R2N2 model are described in section 2. In section 3, we demonstrate the sufficiency of each solution separately. We then combine the components to form R2N2 and benchmark the performance of this fully biologically plausible implementation of BPTT, showing that it surpasses current state-of-the-art biologically plausible implementations. Finally, to facilitate comparison to sequence learning in brains, we show that R2N2 can learn the classic delayed alternation T-maze task. While our model is designed to replace biologically implausible components with ones that are plausible in principle, it is not designed to simulate specific anatomy or physiology. Nonetheless, as illustrated and discussed, the full model recapitulates several key functional properties of the hippocampal formation, including place cells and offline replay.

The full model consists of two interconnected RNNs referred to here as the consolidator and the cache (see Figure 1B). We refer to the full model as the reversible recurrent neural network (R2N2). The consolidator and cache are both RNNs but with different architectures and learning rules, as each is designed for distinct functions. These are briefly summarized here, and the specifics are unpacked in detail in the sections that follow. The consolidator is the primary RNN, designed to have a large storage capacity and robust generalization ability. A trained consolidator is functionally akin to an RNN trained with BPTT. The cache is an auxiliary network, designed to support the training of the consolidator. The cache is optimized for rapid encoding and high-precision bidirectional retrieval of input sequences. This enables it to perform one-shot learning of to-be-learned sequences. During offline processing, the cache replays the sequences in reverse order to the consolidator to train a durable memory trace.

2.1  The Consolidator Network

Taking inspiration from Chang et al. (2017), we developed the consolidator network, an RNN composed of at least two interconnected populations of neurons, A and B (see Figure 2A). While it is possible to implement this architecture with more than two neuron groups, two is the minimum number required for reciprocally dependent firing rate reconstruction. Neuron group A receives input signals from neuron group B and from itself, and vice versa. More specifically, we divide the incoming projections for each group into two parts, and the corresponding dynamics equations can be written as follows:
$$\frac{dh_A}{dt} = f_A(h_B) - h_A + g_A(h_B), \qquad \frac{dh_B}{dt} = f_B(h_A) - h_B + g_B(h_A), \qquad y = k(h_A, h_B) \tag{2.1}$$
Figure 2:

Schematic structure of consolidator and cache. (A) Projections of neuronal groups A and B in the consolidator network. The consolidator network is composed of an activity network and an error network. The activity network, shown in red, represents f*, while the network in blue represents g*. Notice that the direction of sequence play (forward or reverse) is not fixed by the arrow but determined by the competition between f* and g*, unlike in Chang et al. (2017), where the forward and backward connections are the same and are controlled explicitly through the code logic. (B) The temporal unfolding of forward and backward computation in the consolidator. When projection f* is stronger, the whole network operates in the forward mode. In this case, the network updates its activations in both groups A and B, generates outputs, and computes error signals accordingly. When the g* connection is stronger, the whole network turns into backward mode. Neuronal groups A and B generate a reversed activation sequence and thus can be used to propagate activation backward to reconstruct the previous time step's activities, bypassing the temporal credit assignment issue caused by nonlocality and obviating the need to store previous activities. The error network recursively multiplies the error vector by the feedback alignment random matrix to compute the error vector at the previous time step. (C) Left: Projections in the cache. In the cache, each neuron receives projections from both $W_E$ (orange) and $W_O$ (green). Right: Detailed description of the connection pattern in a single neuron $i$. The final incoming weight of neuron $i$ is determined by $\lambda \in (0, 1)$, which can be viewed as the result of competing oscillating interneurons tuning the $W_E$ and $W_O$ synaptic inputs. (D) A schematic description of state transitions in the cache. By periodically switching between $W_O$ and $W_E$ via the control signal $\lambda$ (lower panel), the network alternates between two sets of weights. Each of them builds state attractors between successive states $S_n$ and $S_{n+1}$ (energy landscape slopes between two edges of the same phase). Different successive state pairs are connected in a chaining way and thus form a long state sequence (upper panel).


We define the $f_*$ projections ($*$ can be $A$ or $B$) as forward connections, $g_*$ as backward projections, and $k$ as the output projection. In this section, we focus on the recurrent units in $A$ and $B$ and leave the discussion of the output $y$ to later sections. The network firing-rate dynamics generated by $f_*$ is $H_f = \{(h_A^0, h_B^0), \ldots, (h_A^T, h_B^T)\}$, called the forward sequence, which represents the normal running or forward replay phase of this RNN. To generate a reversed version of $H_f$ mathematically, one simply flips the sign of the derivatives in the dynamics equation and integrates from $t = 0$ to $t = T$. Discretizing the dynamics equations into difference form recovers the reversible deep neural network block of Chang et al. (2017), which has been shown to require only constant memory in various tasks, as the storage requirement for neural activity equals the number of units in the network rather than growing with sequence length.

The intuition is depicted in Figure 2B. The blue circuit represents a transformation ($\oplus$ denotes the operation that combines $A$ and $B$) from $A_t, B_t$ to $A_{t+1}, B_{t+1}$ in the next time step: $A_t \oplus B_t \rightarrow A_{t+1}$ and $B_t \oplus A_{t+1} \rightarrow B_{t+1}$. Then, as long as there exists an inverse operation $\ominus$ that satisfies $X \oplus Y = Z \Rightarrow Z \ominus Y = X$, we can construct a circuit that turns $A_{t+1}, B_{t+1}$ back into $A_t, B_t$: $B_{t+1} \ominus A_{t+1} \rightarrow B_t$ and $A_{t+1} \ominus B_t \rightarrow A_t$.

In effect, these two operators ($\oplus$ and $\ominus$) reverse the update process without requiring any other constraints: the two variables $A$ and $B$ could be either scalars (single-neuron case) or high-dimensional vectors (neuron-group case). One of the simplest pairs of operators that meets the above conditions consists of addition and subtraction, which is the operator set that Chang et al. (2017) used. Mapping these two operators directly onto the brain seems implausible, as it would require flipping the excitability of a synapse within a short time range. However, we can approximately achieve the same effect by introducing competition from another group of learnable backward projections $g_*$ ($*$ can be $A$ or $B$). To function, the $g_*$ projections must generate currents that have the same amplitude as, but the opposite sign of, $f_* - h_*$, so as to cancel it: $f_* - h_* + g_* = 0$:
$$\Delta\theta_{g_*} \propto -\nabla_{\theta_{g_*}} \left\| f_* - h_* + g_* \right\|^2 \tag{2.2}$$

This can be easily implemented with local dendritic propagation and local training (see equation 2.2) on $g_*$'s parameters $\theta_{g_*}$ (Poirazi et al., 2003; Guerguiev et al., 2017). Although equation 2.2 does not guarantee that $f_* - h_* + g_*$ is exactly zero after training, in practice we find that it works extremely well (see Figure 3). This detailed balance between the $f_*$ and $g_*$ projections thus makes it possible to run the whole system in a backward manner: if the excitability of a trained $g_*$ projection is scaled by a factor of 2, the network will switch from forward to backward replay. The resulting effective projection will be approximately $h_* - f_*$, since when $g_* = 2(h_* - f_*)$, $f_* - h_* + g_*$ becomes $h_* - f_*$. This is precisely the method we used to switch the consolidator between forward and reverse play in the simulations below.
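As a minimal numerical sketch of the underlying reversibility (here decoupled from the synaptic cancellation mechanism), the snippet below uses addition and subtraction as the $\oplus$ and $\ominus$ operators, as in Chang et al. (2017); the coupling functions and their fixed random weights are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
Wf_A = rng.normal(0, 0.5, (n, n))
Wf_B = rng.normal(0, 0.5, (n, n))
f_A = lambda b: np.tanh(Wf_A @ b)   # A's increment, computed from B
f_B = lambda a: np.tanh(Wf_B @ a)   # B's increment, computed from A

A, B = rng.normal(size=n), rng.normal(size=n)
states = [(A.copy(), B.copy())]

# forward: A_{t+1} = A_t (+) f_A(B_t);  B_{t+1} = B_t (+) f_B(A_{t+1})
for _ in range(20):
    A = A + f_A(B)
    B = B + f_B(A)
    states.append((A.copy(), B.copy()))

# backward: invert each step with (-); no stored activation history is needed
for t in reversed(range(20)):
    B = B - f_B(A)
    A = A - f_A(B)
    assert np.allclose(A, states[t][0]) and np.allclose(B, states[t][1])
```

The asserts pass because each update touches only one of the two groups at a time, so the other group always retains exactly the information needed to undo it.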

Figure 3:

Normalized firing rates of neuronal groups A and B in the consolidator (for each group, the first 5 neurons are shown; overall the consolidator has 128 neurons). (A, D) Solid lines represent firing rates during forward running. (B, E) Dashed lines represent firing rates during backward running. (C, F) Difference between the firing rates during forward and backward (flipped) running. The symmetry is clearly visible in the comparison between each pair.


There are many possible choices for the $f_*$ and $g_*$ projections. For example, each can be a simple two-layer neural network if we consider the branching structure of dendrites. Neurons with dendritic structures have been demonstrated experimentally to be more complex than a simple neural network with just one activation function (Poirazi et al., 2003), since the inputs, prior to summation and thresholding in the soma, are initially aggregated with another nonlinear activation function in elongated terminal dendrites. In this case, a complex $f_* - h_*$ architecture can be approximated by $g_*$ as long as the complexity of $f_*$ is not higher than that of $g_*$. For demonstration, in the following experiments we choose the simplest form of $f_*$, a single-layer network of the form $y = W\varphi(x) + b$ ($\varphi$ is tanh), and $g_*$ to be another network of the same form, $y = W'\varphi(x) + b'$. Mathematically, this reduces the learning of the backward projection $g_*$ to a regression problem, which is known to be trivial to solve with local learning rules but is still enough to support complex sequential computation, as nonlinearity is involved in each time step. This architecture (see Figure 2A) is scalable and thus can support more complicated computations. If one adds another group along with groups A and B, the coupling can be extended, and backward running is still preserved.
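To make the regression character of this learning concrete, here is a sketch under simplifying assumptions: $f_*$ is frozen, $g_*$ receives the presynaptic input through a dendritic nonlinearity together with the neuron's own rate (both locally available quantities), and $g_*$ is fit by plain stochastic gradient descent on randomly drawn states. All names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 16
W = rng.normal(0, 0.5, (n, n))                 # frozen forward projection f*
f = lambda x: W @ np.tanh(x)

# g* sees tanh(x) (dendritic nonlinearity on presynaptic input) and the
# neuron's own rate h; it must learn to output h - f(x)
Wp = np.zeros((n, 2 * n))
phi = lambda x, h: np.concatenate([np.tanh(x), h])
g = lambda x, h: Wp @ phi(x, h)

lr = 0.02
for step in range(5000):
    x = rng.normal(size=n)                     # presynaptic activity
    h = rng.normal(size=n)                     # postsynaptic state h*
    err = g(x, h) - (h - f(x))                 # local regression error
    Wp -= lr * np.outer(err, phi(x, h))        # delta-rule update, purely local

x, h = rng.normal(size=n), rng.normal(size=n)
print(np.abs(f(x) - h + g(x, h)).max())        # small after training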

Figure 2A depicts a network that can produce sequences of activity in reverse, and the same network structure can also be used to propagate error signals in reverse. Figure 2B shows two network graphs, one that runs activity sequences in reverse and one that runs error sequences in reverse.

In BPTT (see equation 2.3), the error signal $e^h_t$ ($h$ can be either $h_A$ or $h_B$) used to update the hidden layer connections $\theta_{f_*}$ is recursively computed in a circuit mirroring the forward circuit (yellow projections in Figures 2A and 2B) from “future” to “past”; that is, $e^h_t$ relies on both the future hidden layer error $e^h_{t+1}$ and the transient output error $\nabla L_t$. The loss $L_t$ can be mean squared error or another type of loss function depending on the task. In our later experiments, mean squared error is used:
$$e^h_t = W^{\top}\!\left(e^h_{t+1} \odot \varphi'(h_{t+1})\right) + k^{\top}\,\nabla L_t,$$
$$\Delta\theta_{f_*} \propto -\sum_{t} \left(e^h_t \odot \varphi'(h_t)\right) X_t^{\top} \tag{2.3}$$

Two types of implausibility exist in this process: (1) the weight transport problem (i.e., how to compute the transient component of $e^h_t$ from $\nabla L_t$; Whittington & Bogacz, 2019) and (2) the external storage of activations (i.e., how to compute $X_t$ in equation 2.3). For the first issue, Lillicrap et al. (2016) proposed an alternative local error circuit, feedback alignment (FA). With fixed random projections, it has been shown to be effective on various deep network architectures and tasks (Nøkland, 2016; Moskovitz et al., 2018). Our synaptic competition balance mechanism described above addresses the second issue, as it reconstructs the previous network states $X_t$ in a backward manner. This eliminates the need to store the neural activities at multiple past time steps.

Together these two mechanisms propagate the error backward in time with a simple linear tuning of projection excitability and without any nonlocal information (the first line in equation 2.3). The FA algorithm approximates the hidden layer errors $e^h_t$ in equation 2.3 with $\hat{e}^h_t$ at each time step, using $\nabla L_t$ and a fixed random feedback matrix. (For the specific derivation of $\hat{e}^h_t$ and its recursive update equation, see appendix A.) The backward projections (red connections in Figure 2A) in the consolidator network replace $X_t$ with $\hat{X}_t$ by approximating $h_t$ with $\hat{h}_t$, the reconstructed neural activities generated in a reverse replay. The product of $\hat{e}^h_t$ and $\hat{X}_t$ is then used to update the forward projections $\theta_{f_*}$:
$$\Delta\theta_{f_*} \propto -\sum_{t} \left(\hat{e}^h_t \odot \varphi'(\hat{h}_t)\right) \hat{X}_t^{\top} \tag{2.4}$$
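The flow of this computation can be sketched as follows; the output errors and reconstructed activities below are random stand-ins for $\nabla L_t$ and $\hat{h}_t$, and the fixed random matrices play the role that $W^{\top}$ and $k^{\top}$ play in equation 2.3.

```python
import numpy as np

rng = np.random.default_rng(3)
n_h, n_out, T = 16, 4, 10
B_out = rng.normal(0, 0.3, (n_h, n_out))   # fixed random feedback (replaces k^T)
B_h = rng.normal(0, 0.3, (n_h, n_h))       # fixed random feedback (replaces W^T)

out_err = rng.normal(size=(T, n_out))      # stand-ins for the output errors
h_hat = rng.uniform(-1, 1, (T, n_h))       # stand-ins for reconstructed activities

e_hat = np.zeros(n_h)
e_hats = np.zeros((T, n_h))
for t in reversed(range(T)):
    # transient component from the loss plus the component relayed from the
    # future, both through fixed random matrices: no weight transport needed
    e_hat = B_out @ out_err[t] + B_h @ e_hat
    e_hat *= 1 - h_hat[t] ** 2             # tanh derivative at the reconstructed state
    e_hats[t] = e_hat
# pairing e_hats[t] with the reconstructed presynaptic activity then yields the
# weight update of equation 2.4
```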

2.2  The Cache Network

The learning mechanism described so far (see equation 2.3) is effective when the consolidator network is solely determined by its previous states (i.e., without external inputs); the reverse replay equation no longer holds if we add a time-varying term to equation 2.1. This limits the use of the consolidator, as most sequence learning tasks involve temporal inputs. To perform reverse replay in a running consolidator that integrates time-varying input sequences $I$ (see equation 2.5) through the mapping $b$, external storage of the sensory input sequence $I$ becomes necessary, so equation 2.1 is modified as
$$\frac{dh_*}{dt} = f_*(h) - h_* + g_*(h) + b(I_t) \tag{2.5}$$

In the replay phase, the consolidator itself cannot generate the dynamics without knowing $b$ and, by extension, $I$. Superficially, this brings us back to the original dilemma: designing another RNN that can run backward. The difference is that this one must be able to memorize a given sequence after as little as a single exposure, which makes the problem harder. Nevertheless, sensory input sequences usually live in a space with many fewer dimensions than the number of neurons in the consolidator, and this suggests a solution.

To memorize sequential sensory inputs and play them back in reverse, one can build point attractors representing the inputs in the state space and connect them with directed line attractors (see Figure 2D). A modified Hopfield RNN (see Figure 2C) matches these desired characteristics. A classical Hopfield network builds energy basins that allow noisy inputs to settle into corresponding attractors. A modification to its learning rule, from $\Delta W \propto I I^{\top}$ to equation 2.6, then links one attractor to the next ($I_t \rightarrow I_{t+1}$) in the state space, with the directed link stored only if the trial is rewarded ($r = 1$). This is the general Hopfield weight update equation for the cache network:
$$\Delta W \propto r \, I_t\, I_{t+1}^{\top} \tag{2.6}$$
By linking multiple $(I_t, I_{t+1})$ pairs from $t = 0$ to $t = T$ with the modified learning rule, a reversed pattern sequence $\{I_T, I_{T-1}, \ldots, I_0\}$ is built. The cache is an RNN that uses the above learning rule in combination with time-varying weights (see equation 2.7): unlike traditional Hopfield networks with one set of weights, Lee (2002) has shown that time-varying weights enhance the stability of transitions between successive sensory patterns. Once $I_t$ is stably transformed to $I_{t-1}$ through $W_E$, another group of weights, $W_O$, will dominate the transition from $I_{t-1}$ to $I_{t-2}$ through the tuning of $\lambda$ (see in Figure 2C the orange and green projections tuned by two competing interneurons), which can be viewed as an external periodic control signal or a signal indicating the stability of cache activations (Sompolinsky & Kanter, 1986). In terms of a physical analogy, one can imagine this as a reciprocating pump in which $W_E$ drives the system from $I_t$ to $I_{t-1}$, then $W_O$ drives the system from $I_{t-1}$ to $I_{t-2}$, and so on, back and forth between $W_E$ and $W_O$, according to the following equation that governs the activity of the cache network:
$$s_{t+1} = \operatorname{sgn}\!\left[\left(\lambda_t W_E + (1 - \lambda_t)\, W_O\right) s_t\right] \tag{2.7}$$

The cache network can thus learn arbitrary sequences of patterns in a local, stable, one-shot manner, as the weight update rule of the Hopfield network is local and can be computed with only a single exposure to the inputs. The cache network must discretize the input stimuli in time, so a continuous Hopfield network would not be appropriate. This is not a problem, though, as arbitrary sequences can be learned provided the sampling rate is high enough to avoid aliasing. Moreover, this is appropriate as a model of biology, as there is evidence that human perception itself is discretized in time (Landau et al., 2015).
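A minimal sketch combining the one-shot storage of equation 2.6 with the alternating-weights retrieval of equation 2.7; the pattern dimension, the square-wave $\lambda$, and the assumption that every transition is rewarded ($r = 1$) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
d, T = 256, 20
I = np.sign(rng.normal(size=(T, d)))        # random +-1 patterns I_0 ... I_{T-1}

# one-shot storage (eq. 2.6 with r = 1): successive transitions alternate
# between the two matrices, each stored so that pattern t+1 maps back to t
W_E = np.zeros((d, d))
W_O = np.zeros((d, d))
for t in range(T - 1):
    W = W_E if t % 2 == 0 else W_O
    W += np.outer(I[t], I[t + 1]) / d       # single Hebbian-style update

# reverse replay (eq. 2.7): start from the terminal pattern and alternate
# the matrices with a square-wave control signal lambda
s = I[-1].copy()
recalled = [s.copy()]
for t in reversed(range(T - 1)):
    lam = 1.0 if t % 2 == 0 else 0.0
    s = np.sign((lam * W_E + (1 - lam) * W_O) @ s)
    recalled.append(s.copy())

print(all(np.array_equal(r, p) for r, p in zip(recalled, I[::-1])))  # True
```

The alternation matters: with a single weight matrix, the crosstalk between stored transitions destabilizes the chain, whereas splitting odd and even transitions lets each retrieval step settle before the control signal hands off to the other matrix.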

2.3  R2N2: Sequence Learning with Consolidator and Cache

In training a vanilla RNN with BPTT, one needs to perform the following steps:

  • Initialize the RNN and run forward with temporally varying inputs.

  • Store the input sequence, hidden unit activations, generated output sequence, and the target sequence in an external memory device.

  • After the whole input sequence has been received, compute the error between output and target for the last time step.

  • Extract input, output, and target pairs from the memory device in a temporally reversed order, propagate the error in a backward manner, and compute weight changes simultaneously.

  • Apply the accumulated weight changes after finishing the backward running phase.

The first point to note here is that the external storage is where the main biological implausibility lies. It is unclear how the brain could store the activity of each cell at each time step somewhere else and replay it precisely. However, with the consolidator and cache, this activity memory can be reconstructed dynamically. Notably, the storage size requirement is substantially reduced, as the consolidator can reproduce its historical activations as a reverse-play sequence with the help of the cache. Consider a case in which the consolidator has 128 neurons and the channel size of the inputs is 16. Standard BPTT needs to store a sequence of $16 + 128 = 144$-dimensional vectors, as all input and hidden state vectors must be preserved in the temporal unfolding process. In our model, by contrast, the system only requires a cache storing a sequence of 16-dimensional vectors representing the input sequence alone, because the consolidator can reconstruct its activity by itself. This means a memory of sensory experience, rather than of all neural activity, is enough to support sequence learning. This also matches the empirical finding that the replay of location sequences improves animals’ performance in spatial navigation tasks (Ambrose et al., 2016).

Second, we modify the standard learning process in BPTT to fit the R2N2 model. In BPTT, the input and target channels usually belong to different categories. Taking the classical random dots perceptual decision-making task as an example, the input is usually set to the coherence of randomly moving dots’ directions, and the desired output target is the eye motion direction (Lo & Wang, 2006). This makes the backward running phase more complicated as the system needs to store the desired target and input patterns together and only compute the error signal based on the difference between generated outputs and desired targets. Instead, the process can be simplified if there is no categorical difference between desired outputs and inputs. Taking inspiration from predictive coding in sequence learning (Zhang et al., 2019), we view performing cognitive tasks as a process of online sequence prediction: the task-relevant stimuli, action signal, and reward signal are treated equally and are concatenated into an integrated sensory inputs vector. Regardless of their structure, various cognitive tasks then can be reduced to the same type of sequence prediction task. Thus, the task reduces to predicting the future state at time t+1 on the basis of task-relevant variables at time t.

Based on these assumptions and modifications, we propose that learning a specific task can be divided into two phases, with the first one mapped to fast learning and the second one to slower statistical learning, essentially as a consolidation process. In the first stage, the animal explores the task settings and environment randomly, generating both rewarded and unrewarded sensory sequences involving all task-relevant variables. During this initial phase, the cache memorizes sensory sequences that are rewarded at the end of each trial, which can be learned in a one-shot fashion as it is a Hopfield network in principle (see equation 2.6).

In the second phase, the cache starts reverse replay, sending signals to the consolidator and thereby training it. A target for the cache at time $t$ is simply the input for both the consolidator and cache at time $t+1$, so the cache does not need to store a target sequence separately. The consolidator in the second phase then optimizes its forward projections $f_*$ according to the targets provided by the cache and its own reconstructed reversed activations. Once its forward projections change, the backward projections $g_*$ are adjusted accordingly to cancel $f_*$. Notice that the adjustments of $f_*$ and $g_*$ (see equations 2.4 and 2.2) can occur simultaneously, as the learning of the backward projection is an online process. Consequently, knowledge about the rewarded sensory experience is transferred from cache to consolidator, via fast learning at first and statistical learning later. Moreover, as the consolidator can return to states it has experienced, it can also perform forward replay using the projections $f_*$, which could be used to explore possible future outcomes when the model is in an intermediate state (Van Der Meer & Redish, 2010; Pfeiffer & Foster, 2013). In sum, we view this process as an implementation of Buzsáki's two-phase model (Lörincz & Buzsáki, 2000) for training long-term memories, as the interplay between consolidator and cache across the two phases simulates entorhinal-hippocampal communication.
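The overall two-phase loop can be sketched as follows. To keep the sketch self-contained and runnable, the environment is a random stand-in and the consolidator is reduced to a toy one-layer next-step predictor; in the full model, the storage step is the Hebbian update of equation 2.6 and the consolidation step is the feedback alignment update of equations 2.3 and 2.4 applied to reconstructed activity.

```python
import numpy as np

rng = np.random.default_rng(5)
d = 16

def run_trial():
    # stand-in environment: a random +-1 sensory sequence and a reward flag;
    # in the full model these come from the task (e.g., the T-maze)
    return np.sign(rng.normal(size=(12, d))), rng.random() < 0.5

# Phase 1: exploration; the cache one-shot-stores only the rewarded sequences
cache = [seq for seq, rewarded in (run_trial() for _ in range(100)) if rewarded]

# Phase 2: offline consolidation; the cache replays each sequence in reverse,
# and the consolidator (here a toy one-layer predictor) learns the transitions
W, lr = np.zeros((d, d)), 0.01
for _ in range(50):                              # replay epochs
    for seq in cache:
        for t in reversed(range(len(seq) - 1)):
            pred = np.tanh(W @ seq[t])           # predict the next state
            err = pred - seq[t + 1]
            W -= lr * np.outer(err * (1 - pred ** 2), seq[t])
```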

3.1  Consolidator

The consolidator is the primary long-term memory store of R2N2. The architecture is illustrated in Figures 2A and 2B and described in section 2.1. To be a biologically plausible, high-functioning network, the consolidator must demonstrate two capabilities using only local information: (1) reversibility, that is, the ability to reconstruct its activity backward, and (2) error attribution, that is, the ability to tune connections in proportion to the errors produced so that error can be reduced. BPTT uses nonlocal solutions to achieve both capabilities. In this section, we show functioning solutions to both that use only local information. Competing complementary subnetworks (Chang et al., 2017) can reconstruct the spatiotemporal consolidator activity patterns in reverse without an external record. Feedback alignment (Lillicrap et al., 2016) can reduce the reconstruction error over training without weight transport or reversible synapses. The consolidator should also outperform the highest-functioning biologically plausible algorithms; we benchmark its performance and demonstrate that it performs comparably to BPTT and better than RFLO and echo state networks.

3.1.1  Reversibility

To achieve reversibility in the consolidator network, we used the competing subnetwork approach described previously (Chang et al., 2017). To demonstrate reversibility, the consolidator should be able to reconstruct an activation sequence in reversed temporal order without external signals so that the error can be aligned to the dynamics that led to the error.

A random time-varying pattern of activity was applied to the input layer of the consolidator, inducing a complex time-varying pattern of consolidator activity. A representative example of the consolidator activity is shown for five neurons in each of the consolidator subnetworks in Figure 3A. The activity of each consolidator subnetwork is in part a function of the input from the other subnetwork by way of connections $f_*$ (see section 2 for implementation details). A separate set of intersubnetwork connections, $g_*$, learns to be equal in magnitude and opposite in sign to $f_*$ through a local learning rule (i.e., without the use of nonlocal information). It is the proper training of $g_*$ that allows reversible reconstruction of the consolidator activity. Figure 3B illustrates the reversibility of the consolidator after four epochs of training. Shown are 100 time steps of the same five neurons as in Figure 3A as the newly trained $g_*$ connections control the consolidator activity. Flipping this activity along the time axis and subtracting it from the forward pattern reveals that the two are well matched, as shown in Figure 3C.

3.1.2  Error Backpropagation with Feedback Alignment

Backpropagation allows a network to take error information that becomes available at the end of a sequence and retroactively tune connection strengths to reduce error. Successful backpropagation requires both a record of the prior activity and a means of relaying the error signal. The record of prior activity in the consolidator is provided by the reversibility property shown above. To relay the error signal, we used the feedback alignment approach described previously (Lillicrap et al., 2016). To demonstrate successful backpropagation, the consolidator should be able to adjust its connections to be able to minimize error and thereby reconstruct an input sequence.

A random binary vector data stream was generated to serve as inputs, as shown in Figure 4A. Notably, the random inputs included repeated elements at both adjacent and remote time points, challenging the network to attribute the error appropriately as a function of time (i.e., not simply learn that state Y always follows X). The consolidator was trained to generate the input pattern of the next time step (i.e., predict transitions) using the data at the current time step as a cue. After 50,000 training steps, the consolidator was able to predict the random binary vector, as shown in Figure 4B. This experiment shows that the consolidator can learn to map input patterns to outputs at a given time despite using shared weights across multiple time steps. This implies that the temporal credit assignment problem is solved effectively (i.e., the correct connections were adjusted for an error resulting from an earlier activity pattern) through the use of feedback alignment and consolidator reversibility.

Figure 4:

Sequence memorization task for the consolidator. The consolidator is trained to recall elements in the next step using data from the current time step. The top row represents the input data stream, and the bottom row represents the sequence generated by the consolidator.


3.1.3  Performance

We benchmarked the consolidator's performance by comparing its memorization capacity to that of BPTT, RFLO (random feedback local online; Murray, 2019), and ESN (echo state network; Maass et al., 2002), the latter two being high-performing, biologically plausible sequence encoders. e-prop is not included because, per our discussion in section 1, we see no fundamental difference between BPTT-free e-prop and RFLO. We therefore compared the performance of the four algorithms (consolidator, BPTT, RFLO, and ESN) on the $a^n b^n$ character prediction task (see Figure 5A). The $a^n b^n$ character prediction task (Rodriguez et al., 1999) is a classic test for assessing an RNN's capability to encode sequences in the face of strong interference. In short, the input stimuli are a stream of $n$ a's followed by $n$ b's (see section 2 for details). Performing this task requires accurately predicting whether the next character is another repeat or a switch, and this requires staying oriented to how many repeats have already occurred. We also tested the sequential MNIST task, but this turned out to be surprisingly easy and not a useful way to discriminate among models: R2N2 could make use of local or compressed information to perform the task, and even an echo state network can solve the problem.
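For reference, a generator for this task might look like the following; the one-hot vocabulary encoding and the newline placement (giving $T = 2n + 2$, as in Figure 5) reflect our reading of the task description.

```python
import numpy as np

def make_anbn(n, vocab=("a", "b", "\n")):
    # one a^n b^n example: n a's, a newline, n b's, a closing newline (T = 2n + 2)
    chars = ["a"] * n + ["\n"] + ["b"] * n + ["\n"]
    idx = [vocab.index(c) for c in chars]
    onehot = np.eye(len(vocab))[idx]
    # the network sees each character and must predict the one that follows
    return onehot[:-1], onehot[1:]

inputs, targets = make_anbn(3)   # T = 8, so 7 (input, target) pairs
```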

Figure 5:

(A) A schematic view of the $a^n b^n$ task. Each model is trained to predict the next character based on the previous inputs. (B) Performance comparison on the $a^n b^n$ task. The upper rows show averaged loss curves, and the lower rows show prediction accuracies. For all curves, solid lines represent the running average (window = 1000), and the shaded regions represent the corresponding standard error. From the left-most to the right-most column, the range of $n$ in each data set is linearly increased, from a bin covering 1 to 4 to a bin covering 21 to 24; thus, the corresponding sequence length $T$ ranges from $[4, 10]$ to $[44, 50]$ ($T = 2n + 2$: $2n$ because both a and b are repeated $n$ times, and the extra two elements are the newline symbols in the middle and at the end; see panel A for details). Notice that for each column, the results are visualized for all sequence lengths within its range. Orange lines represent the consolidator with the transpose of the forward matrices as backward matrices. Blue lines represent the consolidator with fixed random backward matrices, implementing feedback alignment. Green lines represent RFLO, and red lines represent an echo state network (ESN).


The panels of Figure 5B show the mean squared error (MSE) and correct rates of the consolidator (blue), BPTT (orange), RFLO (green), and ESN (red) for sequences of increasing length. We used 64 neurons for all networks in this experiment. Since the consolidator is composed of two groups of neurons, for fair comparison each group is set to 32 neurons. The left-most panels show the performance for input sequences where $n \in \{1, 2, 3, 4\}$, and the right-most panels show the performance when $n \in \{21, 22, 23, 24\}$. Given differences in how each algorithm handles the temporal gradient (as unpacked in section 2), we expected the consolidator's performance to be close to BPTT's and better than RFLO's (performance: BPTT $\approx$ consolidator > RFLO > ESN). The results match this prediction and are shown in Figure 5B. Across sequence lengths, the consolidator performs consistently better than RFLO and ESN, approaching the performance of the biologically implausible BPTT. Trained RFLO networks reach a correct rate with an upper bound of around 0.5, which reflects that the error gradient in RFLO is essentially limited to one step backward in time (Marschall et al., 2020), while the consolidator can propagate the error gradient backward over multiple time steps. By comparison, the ESN can barely generate any appropriate outputs once the sequence length exceeds 10 (red lines in the right four columns of Figure 5B). Note that the total sequence length is roughly double $n$, since each letter in the $a^n b^n$ task is repeated $n$ times. This shows the advantage of multi-time-step temporal error propagation over online error minimization, even under the constraint of no external storage of the neural activation history.

For shorter sequences (the first two columns in Figure 5B), the consolidator and BPTT have similar asymptotic performance. As the sequence length increases, the performance of all models decreases (bottom row of Figure 5B), and the divergence between a consolidator implemented with BPTT (orange lines) and one implemented with feedback alignment (blue lines) gradually increases. This divergence shows that feedback alignment does have limits, relative to pure BPTT with its perfect record, when propagating error across many time steps.

3.2  Cache

The cache network functions as the primary input to the consolidator. Functionally, it performs rapid memorization of the input sequence for subsequent playback to the consolidator during the offline learning phase. BPTT uses an externally stored record of the input sequence that is aligned with the backpropagated error. To be a biologically plausible high-functioning network, the network must be capable of storing a sequence of states in a way that can be retrieved in reverse order (to synchronize with the reverse replay in the consolidator) after a single training trial.

To achieve this, the cache is itself a classic form of recurrent neural network, using well-established learning principles (akin to a Hopfield network with multiple weight matrices) that enable retrieval of stored states in forward or reverse order, as described in full detail in section 2. To demonstrate this ability, we tested the ability of an isolated cache network (i.e., with no consolidator network connected) to retrieve a sequence of randomly generated binary vectors.

As shown in Figure 6, the cache was presented with 20 distinct binary vectors over time. Transitions between adjacent vectors were encoded by alternating weight matrices based on a control signal (see Figure 6A). After only this single presentation, the cache can step through the same set of states (see Figure 6B). This playback can be performed in the forward or backward direction depending on which pattern the cache is initialized with. In our experiments, for the purpose of imposing learning on the consolidator, the cache network is usually initialized with its terminal state, acting as a cue to trigger learning. The timing of each transition is tuned by the control signal, allowing the network to intrinsically regulate the retrieval. With this ability, the cache can support playback of the input sequence synchronized with consolidator processing.

Figure 6:

One-shot sequence memorization task for the cache. (A) A cache trained to recall a random binary sequence with an external periodic signal. Top: The oscillating external control signal $\lambda$. Bottom: The relative Hamming distance between the cache's activation and all patterns. A larger pattern index represents a pattern that appears later in the given sequence. (B) The same as panel A except that an internally generated control signal is used.


3.3  R2N2 Solving Sequence Learning Problems

The results shown thus far demonstrate that each component of R2N2 is capable of performing its function as intended. In this section, we demonstrate that nothing is lost and nothing additional is needed when the components are assembled into the full R2N2 model while adhering to the locality constraint. That is, we show that R2N2 is capable of encoding the memory of a temporal sequence into a recurrent neural network using only local information. Given that our motivation was to understand how brains enable episodic memory, we applied R2N2 to a simulation of a T-maze navigation task, in which the animal must decide to turn toward either the left or the right end of the horizontal branch to obtain a reward, based on the visual cue at the beginning of the maze (see Figure 7). In our simulations, the consolidator and cache networks are discretized with the time step set to 10 ms. Under this setting, single trials can be encoded into sequences with lengths up to 106.

Figure 7:

T-maze task trained with consolidator and cache. (A) Task structure and training paradigm. Upper: The animal decides to run and then turn to the left or right according to the cue type (black or gray block at the top of the T-maze) to obtain a reward. Bottom: The system first selectively receives rewarded sensory sequences (red trajectories in the T-maze) and stores them in the cache, which then performs reverse replay, providing a reversed input sequence to the consolidator. The consolidator is then trained to generate reward predictions. (B) Place representations in the hidden unit firing rates of the consolidator. For left- and right-rewarded trials, neurons are sorted according to the distance between the starting point and the positions of their highest firing rates. The first vertical dashed line represents the distance at which the cue ends, and the second one indicates where the left or right decision point lies. (C) The distribution of place representation density across neurons. Mann-Whitney tests are performed between the density of the three crucial regions (cue, turning point, and reward) and that of all other regions. For all of them, $p < 10^{-4}$.


This was not intended to be a simulation of the brain itself. Rather, it was to test and examine the functionality of the model in a setting parallel to one commonly used to study memory in rodent models.

Briefly, the model alternately explored a T-maze and performed offline learning after collecting enough experiences. As with rodents learning to complete the task, the model generated actions that directly affected the sensory inputs that formed the episodic sequences. Thus, the learning task was two-fold: encoding the experienced sequences to enable accurate prediction of upcoming transitions and adaptive selection of actions to collect rewards. This is fundamentally different from the benchmark tests presented above in that the question is not whether the network can simply recall a training sequence.

Similar to the paradigm discussed previously, we performed another simulation to test both performance and the match with empirical data from the hippocampus. We set the sensory input to both consolidator and cache as a concatenated binary vector $(o_t, a_t, r_t)$ at time step $t$, with $o_t$ the visual observation, $a_t$ the action, and $r_t$ the presence of the reward.
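Concretely, the construction of this input vector might look like the following; the channel sizes and one-hot encodings are illustrative assumptions, not the dimensions used in the reported simulations.

```python
import numpy as np

def make_input(obs, action, reward, n_obs=8, n_act=4):
    # concatenate one-hot observation o_t, one-hot action a_t, and scalar
    # reward r_t into the single sensory vector fed to consolidator and cache
    o = np.eye(n_obs)[obs]
    a = np.eye(n_act)[action]
    return np.concatenate([o, a, [float(reward)]])

x_t = make_input(obs=2, action=1, reward=0)   # a 13-dimensional vector
```

Treating observation, action, and reward as interchangeable channels is what reduces the different cognitive tasks to a single next-state prediction problem.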

During training, the R2N2 model learns from successive trials, and its capability to generate correct responses increases over time; this can be treated as a biased trajectory sampling process over the distribution of all possible behaviors. After 500 epochs of training (each epoch is composed of 100 trials), the consolidator-cache system successfully mastered the task with a correct performance rate above 90% (see Figure 8D), comparable to the results of animal experiments (see Figure 8A), in which the rat learns the task after around eight sessions (15 to 20 minutes per session). The performance rise is accompanied by a gradually decreasing rate of replay over training epochs (see Figure 8E). Note that since we did not intend to model the exact change in reverse replay frequency observed in animals (see Figure 8B), the correct rate curve of the model (see Figure 8D) may be distorted if the decrease in replay frequency is slower in the initial stage. Nonetheless, as long as the model undergoes sufficient reverse replay, the final correct ratio will eventually converge to the same level.

Figure 8: Model versus rat behavior in navigation tasks; animal data reproduced with permission from Shin et al. (2019). Panels A, B, and C are from figure 4 in Shin et al. (2019); panels D, E, and F are results from the consolidator-cache model. (A, D) Throughout training, model and animal performance steadily increase. (B, E) During training, the replay rate gradually decreases as performance increases. (C) Animals show relatively balanced forward and reverse replay across all training sessions, regardless of performance. (F) In the cache-consolidator system, the replay rate is balanced by definition, as forward replay involves learning of the backward circuits and backward replay trains the forward circuits.

Moreover, since this system requires training of both forward and backward circuits in the consolidator, the replay rate for both directions is balanced by definition (see Figure 8F). These results account for the empirical finding that as the animal becomes familiar with the task, replay events occur less often (Shin et al., 2019; see Figures 8A, 8B, and 8C). To investigate how the task is represented in the system, we sorted neurons' normalized firing rates by the distance between the position of their highest firing rate and the starting point. Consistent with previous work (Ziv et al., 2013; Driscoll et al., 2017), the results (see Figure 7B) show that place-cell-like structure emerged after training, with some neurons biased toward crucial positions such as the end of the cue and the turning point. We verified this by calculating the place-representation density for different regions of the task (see Figure 7C). First, we categorized the task into four region types: the cue region, the turning-point region where the rat is about to make a decision, the reward region, and all other regions. Next, for both left and right trials, we calculated the place-representation density of each region type as the number of cells whose firing rate peaks in that region divided by the region length. The results show that the density for ordinary regions was significantly lower than that of the three crucial regions.
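A short sketch of this density computation follows; the region boundaries, array shapes, and peak extraction below are illustrative assumptions rather than the exact analysis code.

```python
import numpy as np

def place_density(rates, regions):
    """rates: (n_neurons, n_positions) array of normalized firing rates.
    regions: dict mapping region name -> (start, end) position indices.
    Returns, per region, the number of cells whose firing peaks in that
    region divided by the region length."""
    peak_pos = rates.argmax(axis=1)  # position of each neuron's peak rate
    return {name: np.sum((peak_pos >= lo) & (peak_pos < hi)) / (hi - lo)
            for name, (lo, hi) in regions.items()}

rates = np.random.rand(128, 100)  # stand-in for consolidator hidden rates
regions = {"cue": (0, 20), "turning point": (45, 55),
           "reward": (90, 100), "other": (20, 45)}
print(place_density(rates, regions))
```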

4.1  The R2N2 Model

Deep learning models have been massively successful, and their similarities with biological neural networks at both the behavioral and neural-dynamics levels have drawn the attention of the neuroscience community to many questions. One of the most crucial is how full error-gradient learning in RNNs might be implemented in the brain, given the brain's recurrent (lateral) and hierarchical (multilayer) connectivity.

To form episodic memories, brains need to encode protracted temporal sequences of states with high efficiency. RNNs can accomplish this task when trained with the biologically implausible machine learning algorithm BPTT. To build a biologically plausible alternative to BPTT, we divided the problem into three pieces and applied a solution to each, forming a novel integrated learning system: R2N2. The first two pieces eliminate the need for an external record of the spatiotemporal activity pattern during offline learning. The first establishes a way to reconstruct the input-sequence activity in the backward direction; this was solved with an RNN using a reciprocating weight structure. The second establishes a way to reconstruct the recurrent-layer activity; this was solved in the consolidator network with a pair of competing, complementary subnetworks. The third piece eliminates the need for either weight transport or reversible synapses to backpropagate error; this was solved using feedback alignment.

Consequently, R2N2 consists of two major RNNs: (1) the consolidator, the main network to be trained, which serves as a long-term memory store for inference, and (2) the cache, an auxiliary network that supports the training of the consolidator by performing one-shot learning on the input-layer activity sequences and thereby providing training samples to the consolidator. Under the constraint that synapses have access only to local information, a seemingly obvious biological constraint, much of the existing work on this problem has focused on online learning as a workaround (see Whittington & Bogacz, 2019, for a comprehensive and in-depth discussion). For example, Whittington and Bogacz (2017), Han et al. (2018), Ororbia and Kifer (2020), and Song et al. (2020) employ the predictive coding approach to address complex global error propagation using stateful neurons combined with error-correcting units. Some models, like Scellier and Bengio (2017), take an energy-based approach to error-signal estimation. Other models focus on the biological realization of such learning algorithms; Guerguiev et al. (2017), for instance, use apical dendrites to perform error propagation. Given certain assumptions, these frameworks and their variants have been proved mathematically equivalent to backpropagation (Song et al., 2020). It is also worth noting that these models are not mutually exclusive; as Whittington and Bogacz (2019) highlight, they can converge to account for multifaceted learning in the brain. Unlike these online alternatives to BPTT, however, R2N2's underlying principle is a backward phase, inspired by learning-related reverse-replay phenomena, that computes and assigns credit to synapses in recurrent projections without violating the locality constraint.

The ability to compute the error feedback signal across numerous time steps may account for R2N2's advantage over previous localist supervised sequence-learning models such as the echo state network (Maass et al., 2002; Jaeger, 2002) and RFLO (Murray, 2019), as it extends the error gradient further back in time.

Since the consolidator is gradually trained to perform reverse replay nearly perfectly, we further speculate that the consolidator could, in turn, train other consolidator instances in the cortex in a bootstrapped manner, implementing distributed knowledge representation across distinct brain regions.

4.2  Biological Implications

Our model bears some similarity to the complementary learning systems (CLS) framework regarding the relative roles of the hippocampus and cortex. Typically the hippocampus is cast as the fast learner and the cortex as the slower learner (McClelland et al., 1995), and more recently the role of replay has been incorporated into the framework (Kumaran et al., 2016). The R2N2 model suggests that the cache and consolidator functions (analogous to fast and statistical learning, respectively) may both be carried out within the hippocampal region, as well as between the hippocampus and neocortex. For example, the cache could be implemented by CA3 pyramidal neurons with recurrent lateral excitatory projections, which have the requisite arbitrary-association and pattern-completion capabilities. With a trainer providing reversed sequence samples, the consolidator, which could be a circuit in the entorhinal cortex receiving inputs from CA3, could learn the statistics in the data stream and solidify the short-term memory in the hippocampus into longer-lasting memories. Assembled together, this system could be triggered and tuned by reward-related signals, as we did in the T-maze simulation, to ensure that the sequence being replayed and learned is rewarded and beneficial for the animal, which has been found to be the case in the hippocampus (Ambrose et al., 2016).

Recent work has similarly argued that both fast and statistical learning may take place within the hippocampus, with the entorhinal cortex to CA1 pathway providing statistical learning and the dentate gyrus to CA3 to CA1 pathway providing fast learning (Schapiro et al., 2017). The R2N2 model is consistent with this anatomical delineation but does not exclude other possible functional mappings. Another implication of R2N2 is its utilization of the reverse-replay phenomenon found in the hippocampus (Foster & Wilson, 2006; Diba & Buzsáki, 2007), which is the key element that drives the whole model to learn sequences. Many existing models of reverse and forward replay (Haga & Fukai, 2018; Evangelista et al., 2020), however, do not account for sequence learning at all or have limited learning capacity: they are built on handcrafted attractor connectivity patterns and thus usually have only one or a few spatially clustered neurons active at each moment, which is functionally equivalent to a one-hot encoding and caps their learning capacity at N, the number of neurons. In contrast, the consolidator in our model builds connectivity matrices for the reverse replay of arbitrary neuronal activation-pattern sequences, without any prior assumptions on the spatial distribution of synapse strengths, which is far more flexible and biologically realistic given the high-dimensional nature of spiking activity in the brain. This also matches previous observations that reverse replay in the brain is key for sequence learning (Diba & Buzsáki, 2007; Hemberger et al., 2019; Schuck & Niv, 2019; Vaz et al., 2020; Eichenlaub et al., 2020; Fernández-Ruiz et al., 2019; Michon et al., 2019). It implies that the hippocampal-cortical system may be a neural instantiation of BPTT, and our proposed model may account for the underlying mechanism of reverse replay as well as its computational role in learning.

A possible neural realization of this consolidator-cache system is the entorhinal-hippocampal communication system. First, there is empirical evidence that it is the backward-running phase (reverse replay), rather than the forward-running phase (forward replay), in the hippocampus during immobility that is crucial for the animal's later performance after experiencing the task environment (Ambrose et al., 2016). Other recent studies further show that prolonged reverse replay enhances task performance (Fernández-Ruiz et al., 2019), while disrupted reverse replay leads to failures in task performance (Michon et al., 2019). Second, R2N2 suggests the importance of internal clock signals, as the consolidator and cache each oscillate to generate state updates. The importance of oscillating clock signals is consistent with several hippocampal cell types that show either their greatest or smallest activity levels at the peak of the theta cycle or during a ripple (Klausberger & Somogyi, 2008). Our simulation results also reveal that during learning, the system shows internal representations and characteristics similar to the place cells and replay-rate effects observed in previous studies (Ziv et al., 2013; Driscoll et al., 2017; Shin et al., 2019).

Together, these observations imply the existence of offline backward learning in recurrent neuronal networks, which is conceptually isomorphic to the temporal unfolding process in BPTT. However, the current implementation of R2N2 also has limitations. For instance, forward and reverse replay run at the same speed in our simulation, whereas reverse replay in the hippocampus is usually highly compressed in time compared with the forward-running process (Foster & Wilson, 2006).

The temporal symmetry of reverse replay in R2N2 is due to the single time constant used in equation 2.1 for both forward and reverse modes. In future research, we will explore addressing this issue at the level of spiking networks: since the speed change arises from a reduction in spike intervals, it could be implemented with a discrete version of the consolidator that preserves the symmetry but allows the timing to be freely tuned. The speed of network evolution may also be controlled by a clocking mechanism similar to a CPU clock, in which the frequency of an oscillatory signal, as in Figure 6, sets the speed of the network. Furthermore, the application of R2N2 to the T-maze task shown in this study is limited in that the model is trained only on rewarded trials; we did this purely to demonstrate the computational and modeling power of R2N2. There are several potential directions for evolving R2N2 into a more realistic model of learning in recurrent neural networks. For instance, we could adopt a pretraining/fine-tuning approach that first forces the model to learn the task dynamics (unbiased prediction-error reduction) regardless of reward and then selectively trains it to build a reward preference. This is similar to the strategy now common among large autoregressive language models (Brown et al., 2020), which likewise formalize complex tasks as simple sequence prediction and have been shown to share computational principles with humans (Goldstein et al., 2022).
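As a toy illustration of that clocking idea (none of this reflects the paper's actual implementation; the sinusoidal clock and edge-triggered update are our assumptions), state updates can be gated by an oscillatory signal so that the clock frequency sets the replay speed:

```python
import numpy as np

def clocked_replay(step_fn, h0, clock_freq, n_ticks):
    """Advance the network state once per clock cycle: an update fires on
    each rising zero-crossing of an oscillatory clock signal, so a faster
    clock yields more state updates in the same number of ticks."""
    h, states = h0, [h0]
    prev = np.sin(-2 * np.pi * clock_freq)
    for k in range(n_ticks):
        clock = np.sin(2 * np.pi * clock_freq * k)
        if prev < 0 <= clock:          # rising edge: one update per cycle
            h = step_fn(h)
            states.append(h)
        prev = clock
    return states

# doubling clock_freq doubles the number of updates in the same tick budget
slow = clocked_replay(lambda h: 0.95 * h, np.ones(4), 0.02, n_ticks=200)
fast = clocked_replay(lambda h: 0.95 * h, np.ones(4), 0.04, n_ticks=200)
print(len(slow) - 1, len(fast) - 1)  # 4 8
```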

Another limitation lies in the cache network. Because we currently model the cache with a classical Hopfield network, it suffers from linear storage capacity, preventing its use in tasks with an extremely vast input space, such as language modeling. A potential solution is to replace the classical Hopfield network with a modern associative memory network such as those of Ramsauer et al. (2020) and Krotov and Hopfield (2020). By employing higher-order energy functions through recurrent interactions, these networks can store continuous-time, continuous-state variables in an RNN far more efficiently than the classical Hopfield network while maintaining biological plausibility. In addition, since the current implementation relies on FA to propagate error signals across layers, it is hard to build spatially deep architectures with the consolidator, limiting the framework's use in tasks with high expressive-power requirements (Raghu et al., 2017).
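To make the contrast concrete, here is a minimal sketch of the retrieval update from modern continuous Hopfield networks (Ramsauer et al., 2020), in which a softmax over pattern similarities replaces the classical sign-based update; the inverse temperature beta and the random stored patterns are illustrative choices:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def hopfield_retrieve(X, xi, beta=8.0, n_steps=3):
    """Modern Hopfield update (Ramsauer et al., 2020):
    xi <- X^T softmax(beta * X @ xi), with rows of X as stored patterns.
    The query state converges toward the most similar stored pattern."""
    for _ in range(n_steps):
        xi = X.T @ softmax(beta * (X @ xi))
    return xi

rng = np.random.default_rng(1)
X = rng.standard_normal((50, 64))             # 50 stored patterns, dim 64
query = X[7] + 0.3 * rng.standard_normal(64)  # noisy cue for pattern 7
recalled = hopfield_retrieve(X, query)
print(np.argmax(X @ recalled))                # 7: the corrupted cue is completed
```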

4.3  Summary

In summary, this article provides a new possible approach by which biological RNNs could learn sequential tasks: R2N2. R2N2 can memorize sequences in one shot and transfer the experience into long-lasting synaptic changes through reverse replay. For cognitive tasks, the consolidator-cache system treats different task types under a unified sequence-prediction framework and solves them with reward as the signal for reverse replay. The whole process is based on competition between different synaptic projections, that is, competition between f* and g* in the consolidator and between W_O and W_E in the cache, and requires no nonlocal information or weight symmetry. Compared with other online alternatives to BPTT, R2N2 better propagates temporal information during the training phase and thus performs better on some tasks. This computational advantage is driven by the use of reverse replay as an error-propagating mechanism, which also yields several experimentally testable predictions for future research on sequence learning in the brain. First, an imbalanced synaptic projection (e.g., a decreased excitatory level in one projection) between neural assemblies may impair reverse replay, since in our model the reverse replay in the consolidator relies on competition between projections connecting neuronal groups. Second, since in the cache an internally generated pseudo-periodic signal is responsible for transitions between firing-pattern attractors, one may expect external periodic signals acting on the gating neurons of the CA3 network to induce reverse replay, and aperiodic perturbations to corrupt it.

In this article, we develop a novel learning system, R2N2, to address the long-standing question of biologically plausible learning in RNNs. It is composed of two components: a fast RNN that stores and replays experiences (the cache) and a statistical-learning RNN (the consolidator).

We have shown that R2N2 can run itself in reverse order and that, across various tasks, recurrent weight updates computed during this backward phase yield improved performance relative to other models. In addition, by applying the model to a rat navigation task, we demonstrated its power for sequence learning as a whole and showed that it captures several phenomena observed in previous experiments, such as balanced replay and place-cell encodings.

In our simulation, equation 2.1 is discretized and implemented with
(A.1)
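Equation A.1 appears only as a graphic in this version. For reference, if equation 2.1 takes the standard leaky rate-equation form (an assumption on our part, since that equation is not reproduced here), a forward-Euler discretization with step size Δt would read:

```latex
% Hedged reconstruction: assumes equation 2.1 is the standard leaky
% rate equation \tau \, dh/dt = -h + \phi(W h + W_{\mathrm{in}} x).
h_{t+1} = \Bigl(1 - \tfrac{\Delta t}{\tau}\Bigr) h_t
        + \tfrac{\Delta t}{\tau}\, \phi\!\bigl(W h_t + W_{\mathrm{in}} x_t\bigr)
```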
Next, the update of the postsynaptic error component e_h^t in equation 2.3 is discretized as a difference equation from t = 0 to t = T. Consider a simple case in which, over (0, T), the task of the network is only to generate a proper output O_T at the last time step t = T, and the overall loss function L is defined as ½(O*_T − O_T)², with O*_T being the desired output. If neuronal group B is connected to the output layer O through a linear mapping W_O, then for the last time step we have e_h^T = ∂L/∂h_T:
(A.2)
For all previous steps t ∈ (0, T−1), we have two coupled equations that compute the error signals for both A and B iteratively:
(A.3)

We include a leaky term in the above update equations to maintain consistency with the leaky nature of the activity updates in equation 2.1.

With the help of the feedback alignment algorithm, W_O, W_A, and W_B in the above equations can further be replaced by fixed random matrices β_O, β_A, and β_B. For t = T we then have
(A.4)
and for t ∈ (0, T−1),
(A.5)
If we generalize from generating a single output O_T at time T to generating outputs O_1, O_2, …, O_T at every time step, the corresponding e_A and e_B take a similar form, with the neuronal group connected to the output layer receiving, at each time step, an extra nonzero output-error term of the kind used in equation A.4:
(A.6)

When connected with the cache network, since we adopt a sequence-prediction paradigm, O_1, O_2, …, O_T is the activation sequence of the cache, which can itself perform reverse replay to provide the reversed desired-output sequence required in equation A.5.
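Because equations A.2 through A.5 are rendered only as graphics here, the following sketch shows one plausible reading of this backward phase with feedback alignment, in which fixed random matrices stand in for the transposed forward weights. The leaky coupling form, the omission of nonlinearity derivatives, and all shapes are our assumptions.

```python
import numpy as np

def fa_errors(T, n, target, output, beta_O, beta_A, beta_B, dt_over_tau=0.1):
    """Backward phase with feedback alignment: errors are carried backward
    in time by a leaky recursion and exchanged between groups A and B via
    fixed random feedback matrices instead of transposed forward weights."""
    e_A, e_B = np.zeros((T, n)), np.zeros((T, n))
    e_B[T - 1] = beta_O @ (target - output)  # output error enters group B
    for t in range(T - 2, -1, -1):           # iterate backward in time
        e_A[t] = (1 - dt_over_tau) * e_A[t + 1] + dt_over_tau * (beta_B @ e_B[t + 1])
        e_B[t] = (1 - dt_over_tau) * e_B[t + 1] + dt_over_tau * (beta_A @ e_A[t + 1])
    return e_A, e_B

# toy shapes: 19 time steps, 64 neurons per group, 10-dimensional output
rng = np.random.default_rng(0)
e_A, e_B = fa_errors(19, 64,
                     target=rng.standard_normal(10),
                     output=rng.standard_normal(10),
                     beta_O=rng.standard_normal((64, 10)),
                     beta_A=rng.standard_normal((64, 64)),
                     beta_B=rng.standard_normal((64, 64)))
```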

R2N2 has two main components: the consolidator and the cache. When learning sequences, the sequence information is first stored in the cache, where a fast, erasable, and nongeneralizable memory is formed. Then, through reverse replay, the cache provides the consolidator with the task information, which the consolidator learns. The temporal procedure for R2N2 learning is shown in algorithm 1; a schematic restatement follows below.
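Algorithm 1 is rendered as a graphic in this version. The sketch below restates the loop described in the text; the class and function names are placeholders, not the repository's API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Cache:
    """Stands in for the one-shot, erasable Hopfield-style cache."""
    stored: List[list] = field(default_factory=list)

    def store_one_shot(self, seq):
        self.stored.append(seq)

    def reverse_replay(self):
        for seq in self.stored:
            yield list(reversed(seq))  # replay each sequence back to front
        self.stored.clear()            # the cache memory is erasable

def r2n2_learn(episodes, cache, train_backward_phase):
    """Schematic R2N2 loop: rewarded episodes are stored one-shot in the
    cache, then reverse-replayed to drive the consolidator's offline
    backward training phase."""
    for inputs, rewarded in episodes:            # online phase
        if rewarded:
            cache.store_one_shot(inputs)
    for reversed_seq in cache.reverse_replay():  # offline phase
        train_backward_phase(reversed_seq)

# toy usage: only the rewarded episode reaches the consolidator
episodes = [([1, 2, 3], True), ([4, 5, 6], False)]
r2n2_learn(episodes, Cache(), train_backward_phase=print)  # prints [3, 2, 1]
```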


This appendix presents some hyperparameters we used for experiments in the main results.

Table 1: Sequence Memorization Task for the Consolidator Network.

Hyperparameter                             Value
Number of neurons (group A + group B)      64 + 64
Learning rate                              0.005
τ                                          10
Time steps                                 19

Table 2: aⁿbⁿ Task.

Hyperparameter                                                  Value
Number of neurons (group A + group B) in the consolidator RNN   32 + 32
Number of hidden neurons in ESN RNN                             64
Number of neurons in RFLO RNN                                   64
Number of neurons in BPTT RNN                                   64
Learning rate (shared by all RNNs)                              0.001
τ (shared by all RNNs)                                          10

Table 3: One-Shot Sequence Memorization Task for the Cache Network.

Hyperparameter        Value
Number of neurons     500
Sequence length       570
τ

Table 4: T-Maze Task.

Hyperparameter                                 Value
Number of hidden neurons (group A + group B)   64 + 64
Learning rate                                  0.001
τ                                              10

We declare no competing interests.

We thank Ehren Newman for extensive and helpful discussions and comments on the manuscript.

The code for the simulations in this article can be found at https://github.com/CogControlLab/R2N2.

References

Akrout, M., Wilson, C., Humphreys, P. C., Lillicrap, T., & Tweed, D. (2019). Deep learning without weight transport. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, & R. Garnett (Eds.), Advances in neural information processing systems, 32. Curran.

Ambrose, R. E., Pfeiffer, B. E., & Foster, D. J. (2016). Reverse replay of hippocampal place cells is uniquely modulated by changing reward. Neuron, 91(5), 1124–1136.

Bellec, G., Scherr, F., Hajek, E., Salaj, D., Legenstein, R. A., & Maass, W. (2019). Biologically inspired alternatives to backpropagation through time for learning in recurrent neural nets. arXiv:1901.09049.

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., . . . Amodei, D. (2020). Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 1877–1901). Curran.

Chang, B., Meng, L., Haber, E., Ruthotto, L., Begert, D., & Holtham, E. (2017). Reversible architectures for arbitrarily deep residual neural networks. arXiv:1709.03698.

DePasquale, B., Cueva, C. J., Rajan, K., Escola, G. S., & Abbott, L. F. (2018). full-FORCE: A target-based method for training recurrent networks. arXiv:1710.

Diba, K., & Buzsáki, G. (2007). Forward and reverse hippocampal place-cell sequences during ripples. Nature Neuroscience, 10(10), 1241–1242.

Driscoll, L. N., Pettit, N. L., Minderer, M., Chettih, S. N., & Harvey, C. D. (2017). Dynamic reorganization of neuronal activity patterns in parietal cortex. Cell, 170(5), 986–999.

Eichenlaub, J. B., Jarosiewicz, B., Saab, J., Franco, B., Kelemen, J., Halgren, E., . . . Cash, S. S. (2020). Replay of learned neural firing sequences during rest in human motor cortex. Cell Reports, 31(5), 107581.

Evangelista, R., Cano, G., Cooper, C., Schmitz, D., Maier, N., & Kempter, R. (2020). Generation of sharp wave-ripple events by disinhibition. Journal of Neuroscience, 40(41), 7811–7836.

Fernández-Ruiz, A., Oliva, A., de Oliveira, E. F., Rocha-Almeida, F., Tingley, D., & Buzsáki, G. (2019). Long-duration hippocampal sharp wave ripples improve memory. Science, 364(6445), 1082–1086.

Foster, D. J., & Wilson, M. A. (2006). Reverse replay of behavioural sequences in hippocampal place cells during the awake state. Nature, 440(7084), 680–683.

Gershman, S. J. (2018). The successor representation: Its computational logic and neural substrates. Journal of Neuroscience, 38(33), 7193–7200.

Goldstein, A., Zada, Z., Buchnik, E., Schain, M., Price, A., Aubrey, B., . . . Hasson, U. (2022). Shared computational principles for language processing in humans and deep language models. Nature Neuroscience, 25(3), 369–380.

Guerguiev, J., Lillicrap, T. P., & Richards, B. A. (2017). Towards deep learning with segregated dendrites. eLife, 6, e22901.

Haga, T., & Fukai, T. (2018). Recurrent network model for learning goal-directed sequences through reverse replay. eLife, 7, e34171.

Han, K., Wen, H., Zhang, Y., Fu, D., Culurciello, E., & Liu, Z. (2018). Deep predictive coding network with local recurrent processing for object recognition. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), Advances in neural information processing systems, 31. Curran.

Hemberger, M., Shein-Idelson, M., Pammer, L., & Laurent, G. (2019). Reliable sequential activation of neural assemblies by single pyramidal cells in a three-layered cortex. Neuron, 104(2), 353–369.

Huang, Y., & Rao, R. P. (2011). Predictive coding. Wiley Interdisciplinary Reviews: Cognitive Science, 2(5), 580–593.

Jadhav, S. P., Kemere, C., German, P. W., & Frank, L. M. (2012). Awake hippocampal sharp-wave ripples support spatial memory. Science, 336(6087), 1454–1458.

Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. GMD-Forschungszentrum Informationstechnik Bonn.

Klausberger, T., & Somogyi, P. (2008). Neuronal diversity and temporal dynamics: The unity of hippocampal circuit operations. Science, 321(5885), 53–57.

Krotov, D., & Hopfield, J. (2020). Large associative memory problem in neurobiology and machine learning. arXiv:2008.06996.

Kumaran, D., Hassabis, D., & McClelland, J. L. (2016). What learning systems do intelligent agents need? Complementary learning systems theory updated. Trends in Cognitive Sciences, 20(7), 512–534.

Landau, A. N., Schreyer, H. M., van Pelt, S., & Fries, P. (2015). Distributed attention is implemented through theta-rhythmic gamma modulation. Current Biology, 25(17), 2332–2337.

Lee, D.-L. (2002). Pattern sequence recognition using a time-varying Hopfield network. IEEE Transactions on Neural Networks, 13(2), 330–342.

Lillicrap, T. P., Cownden, D., Tweed, D. B., & Akerman, C. J. (2016). Random synaptic feedback weights support error backpropagation for deep learning. Nature Communications, 7, 1–10.

Lillicrap, T. P., Santoro, A., Marris, L., Akerman, C. J., & Hinton, G. (2020). Backpropagation and the brain. Nature Reviews Neuroscience, 21, 335–346.

Lo, C.-C., & Wang, X.-J. (2006). Cortico–basal ganglia circuit mechanism for a decision threshold in reaction time tasks. Nature Neuroscience, 9(7), 956–963.

Lörincz, A., & Buzsáki, G. (2000). Two-phase computational model training long-term memories in the entorhinal-hippocampal region. Annals of the New York Academy of Sciences, 911(1), 83–111.

Maass, W., Natschläger, T., & Markram, H. (2002). Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11), 2531–2560.

Marschall, O., Cho, K., & Savin, C. (2020). A unified framework of online learning algorithms for training recurrent neural networks. Journal of Machine Learning Research, 21(135), 1–34.

McClelland, J. L., McNaughton, B. L., & O'Reilly, R. C. (1995). Why there are complementary learning systems in the hippocampus and neocortex: Insights from the successes and failures of connectionist models of learning and memory. Psychological Review, 102(3), 419.

Michon, F., Sun, J.-J., Kim, C. Y., Ciliberti, D., & Kloosterman, F. (2019). Post-learning hippocampal replay selectively reinforces spatial memory for highly rewarded locations. Current Biology, 29(9), 1436–1444.

Moskovitz, T. H., Litwin-Kumar, A., & Abbott, L. (2018). Feedback alignment in deep convolutional networks. arXiv:1812.06488.

Murray, J. M. (2019). Local online learning in recurrent networks with random feedback. eLife, 8, 1–25.

Murty, V. P., FeldmanHall, O., Hunter, L. E., Phelps, E. A., & Davachi, L. (2016). Episodic memories predict adaptive value-based decision-making. Journal of Experimental Psychology: General, 145(5), 548.

Nøkland, A. (2016). Direct feedback alignment provides learning in deep neural networks. arXiv:1609.01596.

Ororbia, A., & Kifer, D. (2020). The neural coding framework for learning generative models. Nature Communications, 13.

Pfeiffer, B. E., & Foster, D. J. (2013). Hippocampal place-cell sequences depict future paths to remembered goals. Nature, 497(7447), 74–79.

Poirazi, P., Brannon, T., & Mel, B. W. (2003). Pyramidal neuron as two-layer neural network. Neuron, 37(6), 989–999.

Raghu, M., Poole, B., Kleinberg, J., Ganguli, S., & Sohl-Dickstein, J. (2017). On the expressive power of deep neural networks. In Proceedings of the International Conference on Machine Learning (pp. 2847–2854).

Ramsauer, H., Schäfl, B., Lehner, J., Seidl, P., Widrich, M., Adler, T., . . . Hochreiter, S. (2020). Hopfield networks is all you need. arXiv:2008.02217.

Rodriguez, P., Wiles, J., & Elman, J. L. (1999). A recurrent neural network that learns to count. Connection Science, 11(1), 5–40.

Scellier, B., & Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11, 24.

Schapiro, A. C., Turk-Browne, N. B., Botvinick, M. M., & Norman, K. A. (2017). Complementary learning systems within the hippocampus: A neural network modelling approach to reconciling episodic memory with statistical learning. Philosophical Transactions of the Royal Society B: Biological Sciences, 372(1711), 20160049.

Schuck, N. W., & Niv, Y. (2019). Sequential replay of nonspatial task states in the human hippocampus. Science, 364(6447).

Shin, J. D., Tang, W., & Jadhav, S. P. (2019). Dynamics of awake hippocampal-prefrontal replay for spatial learning and memory-guided decision making. Neuron, 104(6), 1110–1125.

Sompolinsky, H., & Kanter, I. (1986). Temporal association in asymmetric neural networks. Physical Review Letters, 57(22), 2861.

Song, Y., Lukasiewicz, T., Xu, Z., & Bogacz, R. (2020). Can the brain do backpropagation? Exact implementation of backpropagation in predictive coding networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, & H. Lin (Eds.), Advances in neural information processing systems, 33 (pp. 22566–22579). Curran.

Tallec, C., & Ollivier, Y. (2017). Unbiased online recurrent optimization. arXiv:1702.05043.

Van Der Meer, M. A., & Redish, A. D. (2010). Expectancies in decision making, reinforcement learning, and ventral striatum. Frontiers in Neuroscience, 3, 6.

Vaz, A. P., Wittig, J. H., Inati, S. K., & Zaghloul, K. A. (2020). Replay of cortical spiking sequences during human memory retrieval. Science, 367(6482), 1131–1134.

Werbos, P. J. (1990). Backpropagation through time: What it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.

Whittington, J. C., & Bogacz, R. (2017). An approximation of the error backpropagation algorithm in a predictive coding network with local Hebbian synaptic plasticity. Neural Computation, 29(5), 1229–1262.

Whittington, J. C., & Bogacz, R. (2019). Theories of error back-propagation in the brain. Trends in Cognitive Sciences, 23(3), 235–250.

Zhang, Z., Cheng, H., & Yang, T. (2019). A recurrent neural network model for flexible and adaptive decision making based on sequence learning. bioRxiv:555862.

Ziv, Y., Burns, L. D., Cocker, E. D., Hamel, E. O., Ghosh, K. K., Kitch, L. J., . . . Schnitzer, M. J. (2013). Long-term dynamics of CA1 hippocampal place codes. Nature Neuroscience, 16(3), 264.