We show that Hopfield neural networks with synchronous dynamics and asymmetric weights admit stable orbits that form sequences of maximal length. For N units, these sequences have length 2^N; that is, they cover the full state space. We present a mathematical proof that maximal-length orbits exist for all N, and we provide a method to construct both the sequence and a weight matrix that allows its production. The orbit is relatively robust to dynamical noise, and perturbations of the optimal weights reveal other periodic orbits that are not maximal but typically still very long. We discuss how the resulting dynamics on slow timescales can be used to generate desired output sequences.
Humans and some other animals can learn complex sequential behavior, such as dancing, singing, playing a musical instrument, or writing. These behaviors require precise coordination of many muscles on the timescale of seconds or minutes. That the brain achieves this coordination is remarkable, in particular, given that typical processes on a neuronal level, like action potentials or synaptic transmission, operate on a timescale of milliseconds.
To introduce a neuronal mechanism that could underlie such computations, we give an operational definition of sequence: a sequence is a map from an ordered set of indices to a set of sequence elements. We can take, for example, the natural numbers as the ordered index set and lowercase roman letters as the sequence elements; an example of a map is then 1 ↦ a, 2 ↦ b, 3 ↦ a, and so on. A putative neuronal mechanism uses a recurrent network of neurons to represent the ordered set of indices and a group of readout neurons to represent the set of sequence elements (see Figure 1A). Each neuronal activity pattern in the index network encodes an index, and the ordering is established by the autonomous dynamics. Neurons in the index network are recurrently connected to each other such that when the network is initialized in a particular state, the activity patterns evolve through a fixed sequence. The activity in readout neurons could encode motor commands that lead to a specific coactivation of muscles. To produce complex movements, it is sufficient to learn a map from index patterns to motor commands such that the first motor command is activated by the first index pattern and so forth.
It has been hypothesized that songbirds use this mechanism to learn songs (Fee, Kozhevnikov, & Hahnloser, 2004). For example, zebra finches produce songs that consist of motifs (sequences), each defined by a specific ordering of sounds (elements). The activity in premotor area RA (robust nucleus of the arcopallium) is highly correlated with the vocalization of single sounds and can thus be seen as encoding sequence elements. Neurons in RA receive input from brain area HVC (hyperstriatum ventrale, pars caudalis). Most of the neurons in HVC that project to RA are active only once during a motif, and the time of activity is locked relative to the onset of the motif itself (Hahnloser, Kozhevnikov, & Fee, 2002). This observation leads to the hypothesis that neurons in HVC form a recurrent neural network that produces a chain-like activity pattern, where one group of neurons excites the next group of neurons and so forth (see Figure 1B). This can be seen as implementing the index network, where an index is associated with the activity of a particular group of neurons. In this way, each neuron is active only once during a sequence.
The main limitation of reading out from a chain-like activity is the maximal length of the sequence that can be generated in the recurrent network. Indeed, with each neuron in the recurrent network being active only once during a sequence, the length of learnable sequences is severely limited: the maximal length scales linearly with the number of neurons. If each recurrently connected neuron were allowed to spike more than once, one would expect the recurrent network to generate much longer sequences. Here we focus on intrinsically generated sequential activity that allows overcoming the linear scaling limit.
Models of recurrent neural networks come in different flavors. We can distinguish between discrete and continuous temporal dynamics, between deterministic and stochastic updates, and between binary (spiking) and real-valued (rate-based) signal transmission. Each flavor comes with its own ways to overcome the linear scaling limit.
In systems with an infinite state space, typically the case for models with continuous temporal dynamics, a better scaling behavior is possible by exploiting the chaotic regime. Under specific conditions, transients in random networks of coupled oscillators (Zumdieck, Timme, Geisel, & Wolf, 2004) have been shown to scale exponentially with the number of units. A similar phenomenon can be observed in spiking networks (Zillmer, Brunel, & Hansel, 2009). Rate-based networks were shown to be useful to implement the index network (Sussillo & Abbott, 2009; Laje & Buonomano, 2013). In this case, each index corresponds to a certain configuration in the state space, and the order is determined by the intrinsic dynamics of the network.
The linear scaling limit can also be overcome in rate-based networks without relying on chaotic trajectories. One remarkable example is the coding strategy of grid cells, where the combination of cells with different (real-valued) periods leads to a representation capability that is exponential in the number of units (Fiete, Burak, & Brookings, 2008; Sreenivasan & Fiete, 2011; Mathis, Herz, & Stemmler, 2012). Although grid cells code for space, a translation of the same mechanism to the temporal domain could be possible (Gorchetchnikov & Grossberg, 2007; Eichenbaum, 2014).
Here we consider discrete dynamics with binary signal transmission, which does not allow making use of the chaotic regime, since the state space is finite. More specifically, we study Hopfield neural networks with synchronous update and asymmetric weights. The dynamics of these networks usually converges to a fixed point or to a limit cycle with a short period. Indeed, sequence generation in a Hopfield network can be related to linear separability in perceptron learning (Gardner, 1988; Brea, Senn, & Pfister, 2013). This implies that the expectation of having an admissible sequence made of random patterns goes to zero when its length is larger than 2N, where N is the number of units. Therefore, using random patterns does not lead to any significant advantage with respect to the activity chain approach.
However, there are examples of very long sequences that can be generated with such networks. Distinct subnetworks could, for example, produce activity chains of different lengths. A network of 10 units produces a periodic orbit of length 2 · 3 · 5 = 30 steps if it is divided into subnetworks of 2, 3, and 5 units, with each subnetwork generating an activity chain of corresponding length. Generally, combinations of chains of co-prime length yield a very fast growth of the sequence length. This idea is related to the coding strategy of grid cells (see, e.g., Fiete et al., 2008).
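The co-prime construction can be sketched in a few lines. The chain weights below (a strong positive weight from each unit's predecessor, weak negative weights elsewhere) are one plausible choice, not taken from the text:

```python
import numpy as np

def chain_weights(n):
    """One plausible weight choice for an n-unit activity chain:
    strong positive weight from the predecessor unit, weak negative
    weights elsewhere.  With a one-hot (+1/-1) state, unit i turns
    on exactly when unit i-1 was on."""
    W = -np.ones((n, n))
    for i in range(n):
        W[i, (i - 1) % n] = n
    return W

def coprime_network(sizes):
    """Block-diagonal combination of independent chains."""
    N = sum(sizes)
    W = np.zeros((N, N))
    ofs = 0
    for n in sizes:
        W[ofs:ofs + n, ofs:ofs + n] = chain_weights(n)
        ofs += n
    return W

def period(W, x0):
    """Iterate x(t+1) = sign(W x(t)) until a state repeats."""
    seen = {}
    x = x0.copy()
    for t in range(10**6):
        key = tuple(x)
        if key in seen:
            return t - seen[key]
        seen[key] = t
        x = np.sign(W @ x).astype(int)
    return None

sizes = [2, 3, 5]
W = coprime_network(sizes)
# initialize each chain with exactly one active (+1) unit
x0 = -np.ones(sum(sizes), dtype=int)
ofs = 0
for n in sizes:
    x0[ofs] = 1
    ofs += n
print(period(W, x0))  # lcm(2, 3, 5) = 30
```

Because the subnetworks are uncoupled, the combined period is the least common multiple of the chain lengths, which for pairwise co-prime lengths is simply their product.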
The occurrence of long periodic orbits in Hopfield networks raises the question: What are the longest sequences that such a network can generate? Here we prove that for each network size, it is possible to find weights such that the dynamics generates an orbit of maximal length. Moreover, our proof provides an algorithm to construct the weight matrix. In contrast to the network with chains of co-prime lengths, this network produces orbits of length 2^N, and it cannot be split into distinct subnetworks. Finally, we show that this network is surprisingly robust to dynamical noise and that small perturbations of the optimal weights lead to networks that are likely to produce nonmaximal but long orbits.
2.1 Maximal-Length Orbits Need Reflection Symmetry
Sequences that satisfy the hypothesis of lemma 1 will be referred to as maximal-length orbits. Lemma 1 illustrates a necessary condition that a maximal-length sequence needs to satisfy in order to be linearly separable, that is, implementable in a recurrent network. However, the condition is not sufficient, and one could construct maximal-length sequences that have the reflection symmetry but are not linearly separable.
2.2 Existence of Maximal-Length-Period Orbit
In this section, we illustrate a recursive procedure that allows us to construct linearly separable sequences of maximal length. The procedure is inspired by lemma 1. Suppose we have a sequence of maximal length for a network of n units. We denote this sequence by ξ^(n). To increase its dimensionality, we add a unit to the network. This new unit takes a constant value, so that we obtain an (n+1)-dimensional sequence that explores half of the (n+1)-dimensional state space. Lemma 1 tells us that the second half should be the reflection of the first half in order to allow linear separability. The reflection step concludes the construction of an (n+1)-dimensional sequence of length 2^(n+1) starting from ξ^(n). Algorithm 1 summarizes the sequence construction algorithm.
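The recursive recipe above can be sketched directly; the sign of the constant value taken by the appended unit is an arbitrary convention here:

```python
import numpy as np

def maximal_sequence(n):
    """Construct a 2^n-step sequence over {-1,+1}^n following the
    recursive recipe: append a unit that is constant during the
    first half, then take the second half as the reflection (global
    sign flip) of the first.  The sign of the appended constant is
    an arbitrary convention."""
    S = np.array([[1], [-1]])          # maximal-length orbit for n = 1
    for _ in range(n - 1):
        const = -np.ones((len(S), 1), dtype=int)
        half = np.hstack([S, const])   # old sequence, new unit constant
        S = np.vstack([half, -half])   # second half = reflection
    return S

S = maximal_sequence(4)
print(S.shape)                        # (16, 4): one row per time step
print(len({tuple(row) for row in S}))  # 16: covers the whole state space
```

By induction, all 2^n states are distinct: the first half consists of distinct n-dimensional patterns with a constant last component, and the second half is its global negation with the last component flipped.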
In the appendix, we prove that the sequences devised according to algorithm 1 are linearly separable and that the weights for an implementation in a recurrent neural network can be constructed recursively. Here we provide the intuition of the proof and a simple algorithm for the construction of the weights.
The proof is done by induction: assuming that we have a linearly separable sequence for the n-dimensional case, we show the existence of one in the (n+1)-dimensional case. We notice that the dynamics in equation 2.2 is symmetric under a simultaneous sign change of both ξ^t and ξ^(t+1), since this would correspond to a sign change of both sides of the equation. Given that ξ^(n+1) is constructed according to algorithm 1 (i.e., the second half is the reflection of the first), we have only to show that the first half of the sequence, from t = 1 to t = 2^n, is linearly separable. Notice that this first half of ξ^(n+1) is different from ξ^(n), since it is its (n+1)-dimensional extension. We restrict to the case in which we do not modify the weights w_ij for i, j ≤ n. We introduce new weights to and from the added unit, w_(n+1),j and w_i,(n+1). The proof consists in showing that the new weights can be chosen in a way that the (n+1)-dimensional sequence is linearly separable.
As we can see in Figure 1C, the (n+1)th unit stays constant for the whole first half of the sequence. It flips its sign at the transition between the two halves and then stays constant for the second half. Due to the special role of the switching point, we refer to it as the critical time point. The activity of the first n units evolves as in the n-dimensional case except at the critical time point. Indeed, while in the n-dimensional case all the units go from the state at the critical time point to the all-plus state (see Figure 1C), in the (n+1)-dimensional case the first n units should go to the all-minus state (see Figure 1D). Since we do not change the weights between these units, this new transition has to be caused by the interaction with the added unit.
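The appendix constructs the weights explicitly. As an independent sanity check, suitable weights can also be found numerically: the sketch below does not reproduce algorithm 2, but builds the sequence of algorithm 1 and fits each row of the weight matrix with plain perceptron learning, which must converge because the sequence is linearly separable, and then verifies that the resulting network reproduces the full 2^n cycle:

```python
import numpy as np

def maximal_sequence(n):
    """Sequence construction following algorithm 1 (sign conventions arbitrary)."""
    S = np.array([[1], [-1]])
    for _ in range(n - 1):
        half = np.hstack([S, -np.ones((len(S), 1), dtype=int)])
        S = np.vstack([half, -half])
    return S

def fit_weights(S, max_epochs=100000):
    """Per-unit perceptron learning on the transition pairs
    (state at t -> state at t+1).  Convergence is guaranteed for a
    linearly separable sequence."""
    T, n = S.shape
    target = np.roll(S, -1, axis=0)      # desired next state
    W = np.zeros((n, n))
    for _ in range(max_epochs):
        errors = 0
        for t in range(T):
            h = W @ S[t]
            wrong = target[t] * h <= 0   # misclassified units at step t
            W[wrong] += np.outer(target[t][wrong], S[t])
            errors += int(wrong.sum())
        if errors == 0:
            return W
    raise RuntimeError("did not converge")

n = 4
S = maximal_sequence(n)
W = fit_weights(S)
# simulate x(t+1) = sign(W x(t)) and check the full 2^n cycle
x = S[0].copy()
for t in range(2 ** n):
    x = np.sign(W @ x).astype(int)
    assert np.array_equal(x, S[(t + 1) % 2 ** n])
print("orbit of length", 2 ** n, "reproduced")
```

Note that, unlike the recursive construction in the appendix, perceptron learning is free to change all weights at every step; it merely demonstrates existence of a solution for small n.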
2.3 Exact Bounds on the Weights
Algorithm 2 is a special case within the more general conditions that the weights must satisfy. In the appendix, we derive the exact bounds that the new weight elements have to satisfy at each recursive step. Here we only report these bounds. In the following equations, ξ_i^t are the elements of the maximal-length orbit constructed according to algorithm 1:
Equation 2.19 represents the tightest bound to be satisfied. As we can see in Figure 2B, both the upper and the lower bound on the new column elements go exponentially to zero with n, as does the distance between them. This means that the new column elements need to be exponentially fine-tuned.
2.4 Comparison to Co-Prime Chains
It is straightforward to find weights such that a network of n units produces a chain-like activity pattern, in which exactly one unit is active at each time step and activity is passed from one unit to the next (e.g., a strong positive weight from each unit to its successor and negative weights otherwise). If m such networks with n_1, ..., n_m units are combined into one network with N = n_1 + ... + n_m units and if n_1, ..., n_m are pairwise co-prime (i.e., any two of them have greatest common divisor 1), then the combined network will show a periodic orbit of length n_1 · n_2 · ... · n_m. Figure 3A shows an example. Although the maximal sequence length of co-prime chains grows asymptotically like exp(√(N ln N)) (Sloane & Conway, 2011) and thus much faster than the number of units (Bach & Shallit, 1996), it stays considerably below the maximal sequence length 2^N (see Figure 3B).
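The gap between the two scaling behaviors can be checked numerically. The brute-force search below computes the longest orbit obtainable by splitting N units into chains (the maximal least common multiple over partitions of N, known as Landau's function) and compares it with 2^N; the function name and the recursive search are illustrative choices, not from the text:

```python
from math import gcd

def max_coprime_orbit(N):
    """Largest lcm of any partition of N: the longest orbit reachable
    by splitting N units into activity chains.  Parts of size 1 and
    repeated parts never increase the lcm, so it suffices to search
    over distinct parts >= 2 that sum to at most N."""
    def rec(remaining, min_part, cur_lcm):
        best = cur_lcm
        for p in range(min_part, remaining + 1):
            new_lcm = cur_lcm * p // gcd(cur_lcm, p)
            best = max(best, rec(remaining - p, p + 1, new_lcm))
        return best
    return rec(N, 2, 1)

for N in (5, 10, 15, 20):
    print(N, max_coprime_orbit(N), 2 ** N)
```

For N = 10 the best chain combination (2 + 3 + 5) yields an orbit of 30 steps, while the maximal-length orbit has 2^10 = 1024 steps; the gap widens rapidly with N.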
In contrast to the network with chains of co-prime lengths, the maximal length orbit is produced by a network that cannot be split into distinct subnetworks; the weight matrix in Figure 2A does not show block structure but reveals the all-to-all connectivity of the network.
2.5 Robustness to Noise
Given the tightness of the bounds on the weight matrix, one may wonder whether the maximal-length orbit is robust to perturbations. We considered two types of noise: dynamical noise (i.e., perturbations of the total input onto each unit), and weight noise (i.e., perturbations of the weight matrix).
2.5.1 Dynamical Noise
The maximal-length orbit covers the whole state space; therefore, the orbit cannot be attractive. Indeed, for any “mistake” in the update, the network state jumps to a different point of the orbit. We define the size of a jump as the distance measured along the noiseless orbit, and we estimate the distribution of jump sizes for different network sizes N and noise intensities σ. The result can be seen in Figure 4A. The probability of having a jump of a certain size decreases rapidly with the size itself and increases with σ. This result is due to the fact that the average distance from the threshold of the input onto a unit increases approximately linearly with the unit index (not shown) and to the fact that large jumps require a large-index unit to flip sign. The distributions are slightly asymmetric toward positive jump sizes, as can be seen by looking at their means (orange dots). Nonetheless, the probability of mistakes increases with the noise level, and, due to the asymmetry in the jump-size distribution, errors accumulate, causing an effective shortening of the orbit for high levels of noise.
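A minimal version of this experiment can be sketched as follows. The weights are obtained by perceptron learning rather than by algorithm 2 (an illustrative substitution), noise is added to the total input of each unit, and the jump size is the signed distance along the noiseless orbit. Because the orbit covers the whole state space, every noisy update still lands on the orbit:

```python
import numpy as np

def maximal_sequence(n):
    """Sequence construction following algorithm 1 (sign conventions arbitrary)."""
    S = np.array([[1], [-1]])
    for _ in range(n - 1):
        half = np.hstack([S, -np.ones((len(S), 1), dtype=int)])
        S = np.vstack([half, -half])
    return S

def fit_weights(S, max_epochs=100000):
    """Perceptron learning of the transition map (illustrative, not algorithm 2)."""
    T, n = S.shape
    target = np.roll(S, -1, axis=0)
    W = np.zeros((n, n))
    for _ in range(max_epochs):
        errors = 0
        for t in range(T):
            wrong = target[t] * (W @ S[t]) <= 0
            W[wrong] += np.outer(target[t][wrong], S[t])
            errors += int(wrong.sum())
        if errors == 0:
            return W
    raise RuntimeError("did not converge")

def jump_sizes(W, S, sigma, steps=2000, seed=0):
    """Signed distance along the noiseless orbit between where the
    network lands and where it should have gone."""
    rng = np.random.default_rng(seed)
    T, n = S.shape
    pos = {tuple(s): t for t, s in enumerate(S)}
    x, jumps = S[0].copy(), []
    for _ in range(steps):
        expected = (pos[tuple(x)] + 1) % T
        h = W @ x + sigma * rng.normal(size=n)   # noisy total input
        x = np.where(h >= 0, 1, -1)
        d = (pos[tuple(x)] - expected) % T
        jumps.append(d - T if d > T // 2 else d)  # signed, in (-T/2, T/2]
    return jumps

S = maximal_sequence(4)
W = fit_weights(S)
print(set(jump_sizes(W, S, sigma=0.0)))  # {0}: no jumps without noise
```

With σ > 0, a histogram of the returned jump sizes gives an estimate of the distribution discussed above for this particular weight matrix.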
2.5.2 Weight Noise
2.6 A Substrate to Read Out Slow Sequences
However, the orbit constructed according to algorithm 1 is well suited to read out sequences with slow timescales. Indeed, if we measure for each unit the average number of time steps between two switches across the whole sequence (the mean interswitch interval), we see that it increases exponentially with the unit index (not shown). We can therefore say that higher-index units have longer effective timescales, because they change their state with an average interval much longer than the intrinsic timescale, which is equal to one time step. It is therefore possible to read out sequences that evolve on a slow timescale. A trivial example is a readout unit that copies the activity of one of the slow units. By combining the activity of several “slow” units, one could generate nontrivial sequences. Since the readout is not the main focus of this letter, we provide only two examples of how this can be done.
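The mean interswitch intervals are easy to compute from the sequence itself, without simulating the network; the sketch below counts sign flips per unit over one full cycle:

```python
import numpy as np

def maximal_sequence(n):
    """Sequence construction following algorithm 1 (sign conventions arbitrary)."""
    S = np.array([[1], [-1]])
    for _ in range(n - 1):
        half = np.hstack([S, -np.ones((len(S), 1), dtype=int)])
        S = np.vstack([half, -half])
    return S

n = 8
S = maximal_sequence(n)
T = 2 ** n
# per-unit number of sign flips over one (cyclic) period
switches = (S != np.roll(S, -1, axis=0)).sum(axis=0)
mean_interval = T / switches
print(mean_interval)  # roughly doubles with the unit index
```

For this construction the last unit switches only twice per cycle, so its mean interswitch interval is 2^n / 2 = 2^(n-1) time steps, while the first unit switches on almost every step.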
If a real-valued variable is read out from our maximal-length orbit, it will produce some form of oscillations on possibly multiple timescales. Figure 5A shows an example, generated with random readout weights, in which the slow timescales are well visible. As expected, if we add dynamical noise to the reservoir network, the slow timescales are maintained more than the fast ones. Noise has the effect of producing small shifts either backward or forward, but it will very rarely cause a jump to a very distant point.
A second possible application could be the readout of a pattern detector: a binary readout unit that becomes active only when the network is in a specific pattern. Since the reservoir network is in a specific pattern only once per cycle, the unit will be regularly active at intervals of 2^N time steps in the noiseless case. In order to set up this kind of readout, one could choose readout weights equal to the components of the pattern that we want to detect and a threshold just below N. As before, we can study what happens in the presence of noise in the reservoir dynamics. In Figure 5B, we show the distribution of the activation periods of the readout unit. We see that for small amounts of noise, the performance of this type of readout unit degrades gracefully, with an asymmetric diffusion caused by the positive bias of jump sizes that was observed in Figure 4A.
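A sketch of such a pattern detector, applied to the sequence of algorithm 1 directly: with ±1 components, the overlap between pattern and state equals N minus twice the number of mismatches, so any threshold between N − 2 and N fires only on an exact match. The index of the detected pattern is an arbitrary choice:

```python
import numpy as np

def maximal_sequence(n):
    """Sequence construction following algorithm 1 (sign conventions arbitrary)."""
    S = np.array([[1], [-1]])
    for _ in range(n - 1):
        half = np.hstack([S, -np.ones((len(S), 1), dtype=int)])
        S = np.vstack([half, -half])
    return S

n = 6
S = maximal_sequence(n)
pattern = S[17]                  # arbitrary state to detect
# pattern @ x = n - 2 * (number of mismatching components),
# so a threshold between n - 2 and n fires only on an exact match
theta = n - 1
detector = (S @ pattern > theta).astype(int)
print(detector.sum())  # 1: active exactly once per cycle of 2^n steps
```

In the noiseless case, this readout unit is therefore active exactly once every 2^n time steps, as stated above.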
We have shown that a simple recurrent binary neural network with deterministic synchronous update dynamics can exhibit periodic orbits of maximal length 2^N. To prove this result, we explicitly built a weight matrix that produces such an orbit. Although in principle it would have been possible to search for long orbits or transients using random weights, the limit of learnability in the perceptron (Hertz et al., 1991; Gardner, 1988) suggests that the expectation of finding a long orbit or transient would have been very low. However, the improvement in the length of the orbit comes at the cost of fine-tuning the weights: the bounds in equation 2.19 become progressively tighter, and the weights need to span multiple orders of magnitude. This requirement is rather unlikely to be exactly met by biological neural networks, but the simulations with weight noise showed that very long orbits are also possible with less fine-tuning. The bounds in equation 2.19 were found in a constructive proof that relies, in the inductive step (n to n+1), on appending a row and a column to the n × n weight matrix while keeping the rest of the weight matrix fixed. It is possible that a different procedure would reveal a larger region of the weight space whose elements produce the desired orbit. However, the limit of learnability in the perceptron (Hertz et al., 1991; Gardner, 1988) suggests that fine-tuning would be necessary anyway.
3.1 Other Maximal-Length Orbits
The sequence presented above is not the unique maximal-length orbit. Trivially, if we have one maximal-length orbit, we can find other ones by relabeling unit indices, provided that one also permutes rows and columns of the weight matrix accordingly. Another allowed operation is to flip the sign of one unit along the entire orbit. Indeed, it is easy to show that changing the signs of all the weights in the row and column containing the flipped index, except for the diagonal element, can produce the modified orbit.
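This sign-flip operation can be checked numerically: with F a diagonal ±1 matrix that flips unit k, the transformed weight matrix F W F changes the signs of row k and column k except for the diagonal element (which is flipped twice), and its dynamics is the sign-flipped image of the original:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
W = rng.normal(size=(n, n))      # arbitrary weight matrix
x = rng.choice([-1, 1], size=n)  # arbitrary network state

k = 2                            # unit whose sign we flip along the orbit
F = np.eye(n)
F[k, k] = -1                     # F flips component k of a state vector

# one update of the transformed network vs. transformed update of the original:
lhs = np.sign((F @ W @ F) @ (F @ x))
rhs = F @ np.sign(W @ x)
print(np.array_equal(lhs, rhs))  # True
```

The identity holds because F² = I, so (F W F)(F x) = F(W x), and applying a diagonal ±1 matrix commutes with the componentwise sign function; hence every orbit of W maps to an orbit of F W F with unit k flipped.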
3.2 Noise Robustness and Other Approaches
In section 2, we showed that in the presence of dynamical noise, the network state is unlikely to jump to an exponentially distant state on the orbit; rather, it goes to the vicinity of the “correct” state. However, even small perturbations of the weights can significantly reduce the length of the longest orbit produced by the system unless the noise level is also scaled down exponentially with N. This behavior is in contrast to that of co-prime chains, which are robust to weight noise, since no fine-tuning of the weights is needed. However, dynamical noise is detrimental for co-prime chains. First, if individual chains are unstable, the activity in one subnetwork may vanish (all units inactive) or saturate at a maximal level (all units active). Second, even if we enforce only one unit per subnetwork to be active at each time step, such that jumps relative to the noiseless orbit can be measured as described in the paragraph after equation 2.20, the distribution of jumps is not peaked around small values (not shown). This is not surprising, since the subnetworks are uncoupled. For similar reasons, temporal versions of grid cell coding with different periods (Fiete et al., 2008; Sreenivasan & Fiete, 2011; Mathis et al., 2012) are likely to suffer from a high sensitivity to dynamical noise.
Models with continuous state space that rely on chaos to produce long transients are by definition sensitive to noise. It has been shown that the time interval in which the activity of a noisy network is reliable scales only linearly with the number of neurons (Ganguli, Huh, & Sompolinsky, 2008). Therefore, reading out from a chaotic or nearly chaotic network also presents severe limitations in terms of noise robustness.
Although there is no obvious mapping between a binary network and a biological system, Hopfield networks have been shown to be useful conceptual tools. For example, the Hopfield model (Hopfield, 1982) had a strong conceptual influence on many associative memory models (Amit, Gutfreund, & Sompolinsky, 1985; Amit & Fusi, 1994; Brunel, 2000). Moreover, a Hopfield network can be approximately mapped to a biological substrate, such as a multistable neural population (Zenke, Agnes, & Gerstner, 2015). Seen from this perspective, the orbit discussed above could provide a method to produce long-timescale sequences in a system that has only fast timescales, without exploiting any intrinsic slow timescale. Interestingly, this feature of the orbit is largely robust to dynamical noise because, as we have already mentioned, the “slower” units are also more resistant to dynamical perturbations.
Appendix: Proof of the Theorem
For convenience, we rewrite here the theorem of section 2.
To prove the theorem, we need to show the existence of at least one sequence that covers the whole state space and is linearly separable. Our approach is to explicitly construct one particular maximal-length sequence and show that it is linearly separable. The theorem does not contain any restriction on the structure of the weights; therefore, we are free to constrain them in any way as long as we show their existence.
We proceed by induction, building recursively both the sequence ξ, according to algorithm 1, and the weight matrix w. For ξ to be a periodic orbit of the dynamics in equation 2.2, the weights have to satisfy linear separability constraints. We choose to perform the inductive step by extending the weight matrix (i.e., adding one column and one row without changing the other matrix elements). We stress that this does not restrict the statement of the theorem, since the theorem requires only the existence of one set of weights, regardless of how it is constructed.
Our inductive hypothesis contains the linear separability of the sequence in the n-dimensional case and an additional constraint on the weights that is necessary for constructing the weights by extension. This procedure not only shows the existence of a linearly separable sequence of maximal length but also provides a construction method for both ξ and w.
A.1 Inductive Hypothesis and Base Case
where is the set of all time points from to for which .
A.2 The Inductive Step for Linear Separability Requires Bounds on the Weights
A.3 The Inductive Step on the Weight Constraints Requires Tighter Bounds on the Weights
We now prove that equation A.3 holds true given the inductive hypothesis.
S.P.M. was supported by the Swiss National Science Foundation, grant 200020_147200. J.B. was supported by the European Research Council, grant agreement 268 689.