## Abstract

Hopfield attractor networks are robust distributed models of human memory, but they lack a general mechanism for effecting state-dependent attractor transitions in response to input. We propose construction rules such that an attractor network may implement an arbitrary finite state machine (FSM), where states and stimuli are represented by high-dimensional random vectors and all state transitions are enacted by the attractor network’s dynamics. Numerical simulations show the capacity of the model, in terms of the maximum size of implementable FSM, to be linear in the size of the attractor network for dense bipolar state vectors and approximately quadratic for sparse binary state vectors. We show that the model is robust to imprecise and noisy weights, and so a prime candidate for implementation with high-density but unreliable devices. By endowing attractor networks with the ability to emulate arbitrary FSMs, we propose a plausible path by which FSMs could exist as a distributed computational primitive in biological neural networks.

## 1 Introduction

Hopfield attractor networks are one of the most celebrated models of robust neural autoassociative memory, as from a simple Hebbian learning rule they display emergent attractor dynamics that allow for reliable pattern recall, completion, and correction even in situations with considerable nonidealities imposed (Amit, 1989; Hopfield, 1982). Attractor models have since found widespread use in neuroscience as a functional and tractable model of human memory (Chaudhuri & Fiete, 2016; Eliasmith, 2005; Khona & Fiete, 2022; Little, 1974; Rolls, 2013; Schneidman et al., 2006). The assumption of these models is that the network represents different states by different, usually uncorrelated, global patterns of persistent activity. When the network is presented with an input that closely resembles one of the stored states, the network state converges to the corresponding fixed-point attractor.

This process of switching between discrete attractor states is thought to be fundamental both for describing biological neural activity and for modeling higher cognitive decision-making processes (Brinkman et al., 2022; Daelli & Treves, 2010; Mante et al., 2013; Miller, 2016; Tajima et al., 2017). What attractor models currently lack, however, is the ability to perform state-dependent computation, a hallmark of human cognition (Buonomano & Maass, 2009; Dayan, 2008; Granger, 2020). That is, when the network is presented with an input, the attractor state to which the network switches ought to depend on both the input stimulus and the state the network currently inhabits, rather than on the input alone.

We thus seek to endow a classical neural attractor model, the Hopfield network, with the ability to perform state-dependent switching between attractor states, without resorting to the use of biologically implausible mechanisms, such as training via backpropagation algorithms. The resulting attractor networks will then be able to robustly emulate any arbitrary finite state machine (FSM), considerably improving their usefulness as a neural computational primitive.

We achieve this by leaning heavily on the framework of vector symbolic architectures (VSAs), also known as hyperdimensional computing (HDC). VSAs treat computation in an entirely distributed manner by letting symbols be represented by high-dimensional random vectors, hypervectors (Gayler, 1998; Kanerva, 1997; Kleyko et al., 2022; Plate, 1995). When equipped with a few basic operators for binding and superimposing hypervectors, often corresponding to component-wise multiplication and addition, respectively, these architectures are able to store primitives such as sets, sequences, graphs, and arbitrary data bindings, as well as enable more complex relations, such as analogical and figurative reasoning (Kanerva, 2009; Kleyko et al., 2021). Although different VSA models often have differing representations and binding operations (Kleyko et al., 2022), they all share the need for an autoassociative cleanup memory, which can recover a clean version of the most similar stored hypervector, given a noisy version of itself. We here use the recurrent dynamics of a Hopfield-like attractor neural network as a state-holding autoassociative memory (Gritsenko et al., 2017).

Each symbolic FSM state will thus be represented by a hypervector and stored within the attractor network as a fixed-point attractor. Stimuli will also be represented by hypervectors, which, when input to the attractor network, will trigger the network dynamics to transition between the correct attractor states. We make use of common VSA techniques to construct a weights matrix to achieve these dynamics, where we use the Hadamard product between bipolar hypervectors $\{-1, 1\}^N$ as the binding operation (the multiply-add-permute (MAP) VSA model; Gayler, 1998). We thus claim that attractor-based FSMs are a plausible biological computational primitive insofar as Hopfield networks are.
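As a concrete illustration, the basic MAP operations assumed throughout can be sketched in a few lines of NumPy (a minimal sketch; the dimension and helper names are ours, not from the original):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # hypervector dimension

def random_hv(n=N):
    """A random bipolar hypervector in {-1, +1}^N."""
    return rng.choice([-1, 1], size=n)

def bind(a, b):
    """MAP binding: the component-wise (Hadamard) product."""
    return a * b

def sim(a, b):
    """Normalized inner-product similarity."""
    return (a @ b) / len(a)

x, s = random_hv(), random_hv()
# Binding is its own inverse for bipolar vectors: (x * s) * s == x
assert np.array_equal(bind(bind(x, s), s), x)
# Independent random hypervectors are quasi-orthogonal: similarity near 0
assert abs(sim(x, s)) < 0.05
```

The self-inverse property of the Hadamard product is what later allows a stimulus to act as a reversible mask on the network state.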

This represents a computational paradigm that is a departure from conventional von Neumann architectures, wherein the separation of memory and computation is a major limiting factor in current advances in conventional computational performance (the von Neumann bottleneck—Backus, 1978; Indiveri & Liu, 2015). Similarly, the high redundancy and lack of reliance on individual components makes this architecture fit for implementation with novel in-memory computing technologies such as resistive RAM (RRAM) or phase-change memory (PCM) devices, which could perform the network’s matrix-vector-multiplication (MVM) step in a single operation (Ielmini & Wong, 2018; Xia & Yang, 2019; Zidan & Lu, 2020).

## 2 Methods

### 2.1 Hypervector Arithmetic

to create a multiplicative mask $H(b)$, setting to 0 all components where $b_i = -1$. In the second line, we have split the summation over all components into summations over components where $b_i = 1$ and $-1$, respectively. The final similarity of $\frac{1}{2}$ is a consequence of approximately half of all values in any hypervector being $+1$ (see equation 2.2).
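The surviving similarity of about $\frac{1}{2}$ under masking can be checked numerically (a sketch; `mask` is our hypothetical name for $H(b)$):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 10_000

def mask(b):
    """H(b): a 0/1 mask that zeroes all components where b_i = -1."""
    return (b + 1) // 2

x = rng.choice([-1, 1], size=N)
b = rng.choice([-1, 1], size=N)

# Only the ~N/2 components surviving the mask contribute to the inner
# product, so the similarity of x with its masked version is about 1/2.
d = (x @ (mask(b) * x)) / N
assert abs(d - 0.5) < 0.05
```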

### 2.2 Hopfield Networks

to $+1$ (Hopfield, 1982). We know that if we want to store $P$ uncorrelated patterns $\{x^\nu\}_{\nu=1}^{P}$ within a Hopfield network, we can construct the weights matrix $W$ according to
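The outer-product (Hebbian) construction and the resulting pattern completion can be sketched as follows, with sizes chosen well below the $0.14N$ capacity limit (the $1/N$ normalization is assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 2000, 20  # network size; number of stored patterns (P << 0.14 N)

X = rng.choice([-1, 1], size=(P, N))   # patterns x^nu as rows
W = (X.T @ X) / N                      # Hebbian outer-product weights
np.fill_diagonal(W, 0)                 # no self-connections

# Recall from a corrupted cue: flip 20% of one pattern's components
z = X[0].copy()
flip = rng.choice(N, size=N // 5, replace=False)
z[flip] *= -1
for _ in range(5):                     # synchronous sign updates
    z = np.sign(W @ z).astype(int)
    z[z == 0] = 1
assert np.array_equal(z, X[0])         # the stored pattern is recovered
```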

### 2.3 Finite State Machines

## 3 Attractor Network Construction

We now show how a Hopfield-like attractor network may be constructed to emulate an arbitrary FSM, where the states within the FSM are stored as attractors in the network and the stimuli for transitions between FSM states trigger all corresponding transitions between attractors. More specifically, for every FSM state $\chi \in X_{\mathrm{FSM}}$, an associated hypervector $x$ is randomly generated and stored as an attractor within the network, the set of which we denote $X_{\mathrm{AN}}$. We henceforth refer to these hypervectors as node hypervectors or node attractors. Every unique stimulus $\varsigma \in S_{\mathrm{FSM}}$ in the FSM is also now associated with a randomly generated hypervector $s \in S_{\mathrm{AN}}$, where $S_{\mathrm{AN}}$ is the set of all stimulus hypervectors. For the FSM edge outputs $\rho \in R_{\mathrm{FSM}}$, a corresponding set of output hypervectors $r \in R_{\mathrm{AN}}$ is similarly generated. These correspondences are summarized in Table 1.

| FSM (Symbols) | | Attractor Network (Hypervectors) | |
|---|---|---|---|
| States | $\chi \in X_{\mathrm{FSM}}$ | Attractors | $x \in X_{\mathrm{AN}}$ |
| Stimuli | $\varsigma \in S_{\mathrm{FSM}}$ | Stimuli | $s \in S_{\mathrm{AN}}$ |
| Outputs | $\rho \in R_{\mathrm{FSM}}$ | Outputs | $r \in R_{\mathrm{AN}}$ |

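The correspondences in Table 1 amount to generating one random hypervector per symbol. A minimal sketch (the toy FSM, the output coding level `f_r`, and all names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
N, f_r = 10_000, 0.02   # dimension; output coding level (illustrative value)

states  = ["Uranus", "Kronos", "Zeus"]     # chi in X_FSM
stimuli = ["overthrown_by", "father_is"]   # varsigma in S_FSM
outputs = ["out0"]                         # rho in R_FSM

# Dense bipolar hypervectors for states; a pair (s_a, s_b) per stimulus
X = {c: rng.choice([-1, 1], size=N) for c in states}
S = {c: (rng.choice([-1, 1], size=N), rng.choice([-1, 1], size=N))
     for c in stimuli}

def sparse_ternary():
    """A sparse ternary output hypervector with a fraction f_r nonzero."""
    r = np.zeros(N, dtype=int)
    idx = rng.choice(N, size=int(f_r * N), replace=False)
    r[idx] = rng.choice([-1, 1], size=len(idx))
    return r

R = {c: sparse_ternary() for c in outputs}
```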

### 3.1 Constructing Transitions

as

*and* the network is being masked by the stimulus $s_a$. When both of these conditions are met, the $(x \circ s_a)^\intercal$ term will have a nonzero inner product with the network state, projecting out the $(e - x)$ term, which “pushes” the network from the $x$ to the $e$ attractor state. This allows terms to be stored in $W$, where they are effectively obfuscated, not affecting network dynamics considerably until a specific stimulus is applied as a mask to the network. Likewise, the third set of terms enacts the $e \xrightarrow{s_b} y$ transition.
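The mechanism can be sketched for a single edge $x \xrightarrow{s_a} e \xrightarrow{s_b} y$. The $1/N$ normalization and the exact form of the masked update below are our assumptions, standing in for equations 3.2 and 3.4:

```python
import numpy as np

rng = np.random.default_rng(4)
N = 2000
hv = lambda: rng.choice([-1, 1], size=N)

x, e, y = hv(), hv(), hv()   # source, edge, and target attractors
sa, sb = hv(), hv()          # the stimulus pair for this edge

# Attractor terms plus the asymmetric cross-terms. The (e - x)(x o s_a)^T
# term only gains a large inner product with the network state when the
# state is x AND the mask s_a is applied, pushing the network toward e.
W = (np.outer(x, x) + np.outer(e, e) + np.outer(y, y)
     + np.outer(e - x, x * sa)
     + np.outer(y - e, e * sb)) / N

def step(z, s=None):
    """One synchronous update; a stimulus s masks the state to 0 where s_i = -1."""
    u = z if s is None else z * (s == 1)
    out = np.sign(W @ u).astype(int)
    out[out == 0] = 1
    return out

z = x.copy()
for _ in range(5):
    z = step(z, sa)          # s_a drives x -> e
assert (z @ e) / N > 0.9
for _ in range(5):
    z = step(z, sb)          # s_b completes the transition e -> y
assert (z @ y) / N > 0.9
```

Without the mask, the cross-terms have only an $O(\sqrt{N})$ inner product with any attractor state, which is why they stay effectively obfuscated.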

### 3.2 Edge Outputs

### 3.3 Sparse Activity States

## 4 Results

### 4.1 FSM Emulation

To show the generality of FSM construction, we chose to implement a directed graph representing the relationships between gods in ancient Greek mythology due to the graph’s dense connectivity. The graph, and thus FSM to be implemented, is shown in Figure 1. From the graph, it is clear that a state machine representing the graph must explicitly be capable of state-dependent transitions; for example, the input “overthrown_by” must result in a transition to state “Kronos” when in state “Uranus,” but to state “Zeus” when in state “Kronos.”

To construct $W$, the necessary hypervectors are first generated. For every state $\chi \in X_{\mathrm{FSM}}$ in the FSM (e.g., “Zeus,” “Kronos”) a random bipolar hypervector $x$ is generated according to equation 2.2. For every unique stimulus $\varsigma \in S_{\mathrm{FSM}}$ (e.g., “overthrown_by,” “father_is”) a pair of random bipolar stimulus hypervectors $s_a$ and $s_b$ is likewise generated. Similarly, sparse ternary output hypervectors $r$ are also generated. The weights matrix $W$ is then iteratively constructed as per equations 3.4 and 3.13, with a new hypervector $e$ also being generated for every edge. The matrix generated from this procedure we denote $W^{\mathrm{ideal}}$.

For all of the following results, the attractor network is first initialized to be in a certain node attractor state, in this case, “Hades.” The network is then allowed to evolve freely for 10 time steps (chosen arbitrarily) as per equation 2.10, with every neuron being updated simultaneously on every time step. During this period, it is desired that the network state $z_t$ remain in the attractor state in which it was initialized. An input stimulus $s_a$ is then presented to the network for 10 time steps, during which time the network state is masked by the stimulus hypervector, and the network evolves synchronously according to equation 3.2. If the stimulus corresponds to a valid edge in the FSM, the network state $z_t$ should then be driven toward the correct edge state attractor $e$.
After these 10 time steps, the second stimulus hypervector $s_b$ for a particular input is presented for 10 time steps. Again, the network evolves according to equation 3.2, and the network should be driven toward the target attractor state $y$, completing the transition. This process is repeated every 30 time steps, causing the network state $z_t$ to travel between node attractor states $x \in X_{\mathrm{AN}}$, corresponding to a valid walk between states $\chi \in X_{\mathrm{FSM}}$ in the represented FSM. To view the resulting network dynamics, the similarity between the network state $z_t$ and the edge and node attractor states is calculated as per equation 2.3, such that a similarity of 1 between $z_t$ and some attractor state $x^\nu$ implies $z_t = x^\nu$ and thus that the network is inhabiting that attractor. The similarity between the network state $z_t$ and the output states $r \in R_{\mathrm{AN}}$ is also calculated, but due to the output hypervectors being sparse, the maximum value that the similarity can take is $d(z_t, r) = f_r$, which would be interpreted as that output symbol being present.
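This presentation protocol can be sketched on a toy single-edge network, including the check that an invalid stimulus leaves the inhabited attractor untouched (the $1/N$ normalization and masked update are assumptions; this is not the full Figure 2 setup):

```python
import numpy as np

rng = np.random.default_rng(5)
N = 2000
hv = lambda: rng.choice([-1, 1], size=N)

x, e, y, sa, sb = hv(), hv(), hv(), hv(), hv()
s_inv = hv()   # a stimulus corresponding to no stored edge
W = (np.outer(x, x) + np.outer(e, e) + np.outer(y, y)
     + np.outer(e - x, x * sa) + np.outer(y - e, e * sb)) / N

def step(z, s=None):
    u = z if s is None else z * (s == 1)   # stimulus masks the state
    z = np.sign(W @ u).astype(int)
    z[z == 0] = 1
    return z

# Protocol: 10 free steps, 10 steps with s_a, 10 with s_b, 10 with s_inv
z = x.copy()
sims = []
for s in [None] * 10 + [sa] * 10 + [sb] * 10 + [s_inv] * 10:
    z = step(z, s)
    sims.append((z @ x / N, z @ e / N, z @ y / N))

assert sims[9][0] > 0.9    # holds the initial attractor during free evolution
assert sims[19][1] > 0.9   # s_a drives the state into the edge attractor e
assert sims[29][2] > 0.9   # s_b completes the transition into y
assert sims[39][2] > 0.9   # an invalid stimulus causes no transition
```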

An attractor network performing a walk is shown in Figure 2, with parameters $N = 10{,}000$, $N f_r = 200$, $N_Z = 8$, and $N_E = 16$. This corresponds to the network having a per neuron noise (the finite-size effect resulting from random hypervectors having a nonzero similarity to each other) of $\sigma \approx 0.07$, calculated via equation 3.7. The magnitude of the noise is thus small compared with the desired signal of magnitude 1 (see equation 3.6), and so we are far away from reaching the memory capacity of the network. The network performs the walk as intended, transitioning between the correct node attractor states and corresponding edge states with their associated outputs. The specific sequence of inputs was chosen to show the generality of implementable state transitions. First, there is the explicit state dependence in the repeated input of “father_is, father_is.” Second, it contains an input stimulus that does not correspond to a valid edge for the currently inhabited state (“Zeus overthrown_by”), which should not cause a transition. Third, it contains bidirectional edges (“consort_is”), whose repeated application causes the network to flip between two states (between “Kronos” and “Rhea”). And fourth, it contains self-connections, whose target states and source states are identical. Since the network traverses all these edges as expected, we do not expect the precise structure of an FSM’s graph to limit whether it can be emulated by the attractor network.

### 4.2 Network Robustness

where $\chi_{ij} \in \mathbb{R}$ are independently sampled standard gaussian variables, sampled once during matrix construction, and $\sigma_{\mathrm{noise}} \in \mathbb{R}$ is a scaling factor on the strength of the imposed noise. The $\mathrm{sgn}(\cdot)$ function forces the weights to be bipolar, emulating synapses with only one bit of precision, while the $\chi_{ij}$ random variables act as a smearing on the weight state, emulating that the two weight states have a finite width. A $\sigma_{\mathrm{noise}}$ value of 2 thus corresponds to the magnitude of the noise being equal to that of the signal (whether $W_{ij}^{\mathrm{ideal}} \ge 0$), and so, for example, for a damaged weight value of $W_{ij}^{\mathrm{noisy}} = +1$, there is a 38% chance that the predamaged weight was $W_{ij}^{\mathrm{ideal}} = -1$. This level of degradation is far worse than is expected even from novel binary memory devices (Xia & Yang, 2019), and presumably also for biology. We used the same set of hypervectors and sequence of inputs as in Figure 2, but this time using the degraded weights matrix $W^{\mathrm{noisy}}$ to test the network’s robustness. The results are shown in Figure 3 for weight degradation values of $\sigma_{\mathrm{noise}} = 2$ and $\sigma_{\mathrm{noise}} = 5$, corresponding to signal-to-noise ratios (SNRs) of 0 dB and $-8$ dB, respectively. We see that for $\sigma_{\mathrm{noise}} = 2$, the attractor network performs the walk just as well as in Figure 2, which used the ideal weights matrix, despite the fact that here, the binary weight distributions overlap each other considerably. Furthermore, we have that $d(z_t, x^\nu) \approx 1$, where $x^\nu$ is the attractor that the network should be inhabiting at any time, indicating that the attractor stability and recall accuracy are unaffected by the nonidealities. For $\sigma_{\mathrm{noise}} = 5$, a scenario where the realized weight carries very little information about the ideal weight’s value, we see that the network nonetheless continues to function, performing the correct walk between attractor states.
However, there is a degradation in the recall of stored attractor states, with the network state no longer converging to a similarity of 1 with the stored attractor states. For greater values of $\sigma_{\mathrm{noise}}$, the network ceases to perform the correct walk, and indeed does not converge on any stored attractor state (not shown).
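The binarize-and-smear degradation can be sketched on a toy single-edge network. The noise scaling below (noise standard deviation $\sigma_{\mathrm{noise}}/2$ against a unit binarized signal, so that $\sigma_{\mathrm{noise}} = 2$ gives equal signal and noise magnitudes) is our assumption in place of the paper's exact equation:

```python
import numpy as np

rng = np.random.default_rng(6)
N = 2000
hv = lambda: rng.choice([-1, 1], size=N)

x, e, y, sa, sb = hv(), hv(), hv(), hv(), hv()
W_ideal = (np.outer(x, x) + np.outer(e, e) + np.outer(y, y)
           + np.outer(e - x, x * sa) + np.outer(y - e, e * sb)) / N

# 1-bit weights smeared by frozen gaussian noise, then re-binarized
sigma_noise = 2.0
chi = rng.standard_normal((N, N))
W_noisy = np.sign(np.sign(W_ideal) + (sigma_noise / 2) * chi)

def step(z, s=None):
    u = z if s is None else z * (s == 1)
    z = np.sign(W_noisy @ u).astype(int)
    z[z == 0] = 1
    return z

# The walk x -> e -> y still succeeds despite heavily corrupted weights
z = x.copy()
for _ in range(5):
    z = step(z, sa)
assert (z @ e) / N > 0.8
for _ in range(5):
    z = step(z, sb)
assert (z @ y) / N > 0.8
```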

where $\theta$ is a threshold set such that $W^{\mathrm{sparse}} \in \{-1, 0, 1\}^{N \times N}$ has the desired sparsity. Through this procedure, only the most extreme weight values are allowed to be nonzero. Since the terms inside $W^{\mathrm{ideal}}$ are symmetrically distributed around zero, there are approximately as many $+1$ entries in $W^{\mathrm{sparse}}$ as $-1$s. Using the same hypervectors and sequence of inputs as before, an attractor network performing a walk using the sparse weights matrix $W^{\mathrm{sparse}}$ is shown in Figure 4, with sparsities of 98% and 99%. We see that for the 98% sparse case, there is again very little difference with the ideal case shown in Figure 2, with the network still having a similarity of $d(z_t, x) \approx 1$ with stored attractor states and performing the correct walk. When the sparsity is pushed further to 99%, however, we see that despite the network performing the correct walk, the attractor states are again slightly degraded, with the network converging on states with $d(z_t, x^\nu) < 1$ with stored attractor states $x^\nu$. For greater sparsities, the network ceases to perform the correct walk and again does not converge on any stored attractor state (not shown).

These two tests thus highlight the extreme robustness of the model to imprecise and unreliable weights. The network may be implemented with 1-bit precision weights whose weight distributions are entirely overlapping, or with 98% of the weights set to zero, and still continue to function without any discernible loss in performance. The extent to which the weights matrix may be degraded with the network still remaining stable is of course a function not only of the level of degradation but also of the size of the network $N$, as well as the number of FSM states $N_Z$ and edges $N_E$ stored within the network. For conventional Hopfield models with Hebbian learning, these two factors are normally treated alike theoretically, as contributing an effective noise to the postsynaptic sum as in equation 3.7, and so the magnitude of withstandable synaptic noise increases with increasing $N$ (Amit, 1989; Sompolinsky, 1987). Although a thorough mathematical investigation into the scaling of weight degradation limits is justified, as a first result, we have here given numerical data showing stability even in the most extreme cases of nonideal weights, and we expect that any implementation of the network with novel devices would be far away from such extremities.

### 4.3 Asynchronous Updates

Another useful property of Hopfield networks is the ability to function robustly even with asynchronously updating neurons, wherein not every neuron experiences a simultaneous state update. This property is especially important for any architecture claiming to be biologically plausible, as biological neurons update asynchronously and largely independently of one another, without the need for global clock signals. To this end, we ran a similar experiment to that in Figure 2, using the undamaged weights matrix $W^{\mathrm{ideal}}$ but with an asynchronous neuron update rule, wherein on each time step, every neuron has only a 10% chance of updating its state. The remaining 90% of the time, the neuron retains its state from the previous time step, regardless of its postsynaptic sum. There is thus no fixed order of neuron updates, and indeed it is not even certain that a neuron will update within any finite time. To account for the slower dynamics of the network state, the time for which inputs were presented to the network, as well as the periods without any input, was increased from 10 to 40 time steps. To be able to easily view the gradual state transition, three of the node hypervectors were chosen to be columns of the $N$-dimensional Hadamard matrix rather than being randomly generated. The results are shown in Figure 5 for a shorter sequence of stimulus inputs. We see that the network functions as intended, but with the network now converging on the correct attractors over a finite number of updates rather than in just one. The model proposed here is thus not reliant on synchronous dynamics, which is important not only for biological plausibility but also when considering possible implementations on asynchronous neuromorphic hardware (Davies et al., 2018; Liu et al., 2014).
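The asynchronous rule can be sketched by letting each neuron update with probability 0.1 per time step while the rest hold their state (toy single-edge network; the longer presentation window compensates for the slower dynamics):

```python
import numpy as np

rng = np.random.default_rng(8)
N = 2000
hv = lambda: rng.choice([-1, 1], size=N)

x, e, y, sa, sb = hv(), hv(), hv(), hv(), hv()
W = (np.outer(x, x) + np.outer(e, e) + np.outer(y, y)
     + np.outer(e - x, x * sa) + np.outer(y - e, e * sb)) / N

def async_step(z, s=None, p=0.1):
    """Each neuron updates with probability p; the rest hold their state."""
    u = z if s is None else z * (s == 1)
    target = np.sign(W @ u).astype(int)
    target[target == 0] = 1
    updating = rng.random(N) < p
    return np.where(updating, target, z)

z = x.copy()
for _ in range(40):            # inputs presented for 40 steps, not 10
    z = async_step(z, sa)
assert (z @ e) / N > 0.9       # gradual convergence into the edge attractor
for _ in range(40):
    z = async_step(z, sb)
assert (z @ y) / N > 0.9
```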

### 4.4 Storage Capacity

It is well known that the number of patterns $P$ that can be stored and reliably retrieved in a Hopfield network is proportional to the size of the network, via $P < 0.14N$ (Amit, 1989; Hopfield, 1982). When one tries to store more than $P$ attractors within the network, the so-called memory blackout occurs, after which no pattern can be retrieved. We thus perform numerical simulations over a large range of attractor network and FSM sizes to see whether an analogous relationship exists. In other words, for an attractor network of finite size $N$, what sizes of FSM can the network successfully emulate?

For a given $N$, number of FSM states $N_Z$, and number of edges $N_E$, a random FSM was generated and an attractor network constructed to represent it as described in section 3. To ensure a reasonable FSM was generated, the FSM’s graph was first generated to have all nodes connected in a sequential ring structure (i.e., every state $\chi^\nu \in X_{\mathrm{FSM}}$ connects to $\chi^{(\nu+1) \bmod N_Z}$). The remaining edges between nodes were selected at random until the desired number of edges $N_E$ was reached. For each edge, an associated stimulus is then required. Although one option would be to allocate as few unique stimuli as possible, so that the state transitions are maximally state-dependent, this results in some advantageous cancellation effects between the $E^\eta$ transition terms and the stored attractors $x^\nu x^{\nu\intercal}$. To instead probe a worst-case scenario, each edge was assigned a unique stimulus.
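The random-FSM generation procedure (ring backbone, random extra edges, one unique stimulus per edge) can be sketched as follows (the function name and stimulus labels are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)

def random_fsm(n_states, n_edges):
    """Ring backbone chi^nu -> chi^(nu+1 mod N_Z), plus random extra edges;
    each edge gets its own unique stimulus (the worst case probed here)."""
    assert n_edges >= n_states
    edges = [(v, (v + 1) % n_states) for v in range(n_states)]
    existing = set(edges)
    while len(edges) < n_edges:
        u, v = rng.integers(n_states, size=2)
        if (u, v) not in existing:
            existing.add((u, v))
            edges.append((u, v))
    return [(u, v, f"s{k}") for k, (u, v) in enumerate(edges)]

fsm = random_fsm(8, 16)
assert len(fsm) == 16
assert len({s for _, _, s in fsm}) == 16   # one unique stimulus per edge
```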

### 4.5 Storage Capacity with Sparse States

We now apply the same procedure as in the dense case for determining the memory capacity of the sparse-activity attractor network. For direct comparison with the dense case, we define the memory capacity $C(N)$ to be the largest FSM with $N_E = N_Z$ for which walk success and failure are equiprobable. For every tested $(N, f, N_Z)$ tuple, we generate a corresponding set of hypervectors and weights matrix as discussed in section 3.3 and then randomly choose a walk between six node attractor states to be completed. The chosen walk then determines the sequence of stimuli to be input, and each stimulus is then applied for 10 time steps. Each $(N, f, N_Z)$ tuple was then determined to have passed or failed, with the success criterion that $d(z^{\mathrm{sp}}_t, x^{\nu}_{\mathrm{sp}}) > \frac{1}{2}(f + f^2)$ in the middle of all intervals when the network should be in a certain node attractor state. This criterion was chosen as it is the sparse analogue of that used in the dense case: at most one attractor state may satisfy it at any time.

The results are shown in Figure 8. We see that for a fixed number of neurons $N$, the size of FSM that may be stored initially increases as $f$ is decreased, but below a certain $f$, drops off rapidly. To estimate the optimal coding level $f$ and maximum FSM size $N_Z$ for an attractor network of size $N$, we apply a 2D gaussian convolutional filter with standard deviation 3 over the grid of successes and failures for each $N$ value separately, in order to obtain a kernel density estimate (KDE) $p_{\mathrm{KDE}}$ of the walk success probability. The capacity $C(N)$ was then obtained by taking the maximum $N_Z$ value for which $p_{\mathrm{KDE}} \ge 0.5$. This procedure was chosen in order to be comparable to that performed in the dense bipolar case (see Figure 6), where a linear separation boundary between success and failure was used instead. Plotting capacity $C$ against $N$ and applying a linear fit in the log-log domain reveals a scaling relation of $C \sim N^{1.90}$. This approximately quadratic scaling in the sparse case is a vast improvement over the linear scaling shown in the dense case (see Figure 6) and is in keeping with the theoretical scaling estimates of $P_{\max} \sim N^2/(\log N)^2$ for sparsely coded binary attractor networks (Amari, 1989). The optimal coding level $f$ is also shown, and a linear fit in the log-log domain implies a scaling relation of the form $f \sim N^{-0.949}$. Again, this is similar to the theoretically optimal $f(N)$ scaling relation for sparse binary attractor networks, where the coding level scales like $f \sim (\log N)/N$ (Amari, 1989).
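The log-log fit used to extract the scaling exponent can be sketched as follows. The capacity values below are made-up placeholders chosen to follow a rough $C \sim N^{1.9}$ trend, purely to illustrate the fitting procedure; the real values come from the simulations behind Figure 8:

```python
import numpy as np

# Hypothetical (N, C(N)) pairs, for illustrating the procedure only
Ns = np.array([1000, 2000, 4000, 8000])
Cs = np.array([12, 45, 170, 640])

# Linear fit in the log-log domain: log C = slope * log N + const
slope, intercept = np.polyfit(np.log(Ns), np.log(Cs), 1)
print(f"estimated scaling exponent: {slope:.2f}")   # close to 1.9 here
```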

## 5 Relation to Other Architectures

### 5.1 FSM Emulation

While there is a large body of work concerning the equivalence between RNNs and FSMs, their implementations broadly fall into a few categories. There are those that require iterative gradient descent methods to mimic an FSM (Das & Mozer, 1994; Lee Giles et al., 1995; Pollack, 1991; Zeng et al., 1993), which makes them difficult to train for large FSMs and improbable for use in biology. There are those that require creating a new FSM with an explicitly expanded state set, $Z' := Z \times S$, such that there is a new state for every old state-stimulus pair (Alquézar & Sanfeliu, 1995; Minsky, 1967), which is unfavorable due to the explosion of (usually one-hot) states needing to be represented, as well as the difficulty of adding new states or stimuli iteratively. There are those that require higher-order weight tensors in order to explicitly provide a weight entry for every unique state-stimulus pair (Forcada & Carrasco, 2001; Mali et al., 2020; Omlin et al., 1998) which, as well as being nondistributed, may be more difficult to implement, for example, requiring the use of sigma-pi units (Groschner et al., 2022; Koch, 1998) or a large number of hidden neurons with two-body synaptic interactions only (Krotov & Hopfield, 2021).

In Recanatesi et al. (2017), transitions are triggered by adiabatically modulating a global inhibition parameter, such that the network may transition between similar stored patterns. Lacking, however, is a method to construct a network to perform arbitrary, controllable transitions between states. In Chen & Miller (2020), an in-depth analysis of small populations of rate-based neurons is conducted, wherein synapses with short-term synaptic depression enable a rich itinerancy between attractor states, but the approach does not scale to large systems and arbitrary stored memories.

Most closely resembling our approach, however, are earlier works concerned with the related task of creating a sequence of transitions between attractor states in Hopfield-like neural networks. The majority of these efforts rely on the use of synaptic delays, such that the postsynaptic sum on a time step $t$ depends, for example, also on the network state at time $t - 10$ rather than just $t - 1$. These delay synapses thus allow attractor cross-terms of the form $x^{\nu+1} x^{\nu\intercal}$ to become influential only after the network has inhabited an attractor state for a certain amount of time, triggering a walk between attractor states (Kleinfeld, 1986; Sompolinsky & Kanter, 1986). This then also allowed for the construction of networks with state-dependent input-triggered transitions (Amit, 1988; Drossaers, 1992; Gutfreund & Mezard, 1988). Similar networks were shown to function without the need for synaptic delays, but require fine tuning of network parameters and suffer from extremely low storage capacity (Amit, 1989; Buhmann & Schulten, 1987). In any case, the need for synaptic delay elements places a heavy requirement on any substrate that might implement such a network, and such elements are indeed problematic to implement in neuromorphic systems (Nielsen et al., 2017).

State-dependent computation in spiking neural networks was realized in Neftci et al. (2013) and Liang et al. (2019), where they used population attractor dynamics to achieve robust state representations via sustained spiking activity. Additionally, these works highlight the need for robust yet flexible neural state machine primitives if one is to succeed in designing intelligent end-to-end neuromorphic cognitive systems. These approaches differ from this work, however, in that the state representations are still fundamentally population-based rather than distributed, and so pose difficulties such as the requirement of finding a new population of neurons to represent any new state (Rutishauser & Douglas, 2009).

Rigotti et al. (2010) discuss the need for a mechanism to induce flips in the neuron state (an operation akin to a Hadamard product) in order to directly implement nontrivial switching between different attractor states but disqualify such a mechanism from plausibly existing using synaptic currents alone. We also reject such a mechanism as a biologically plausible solution, but on the grounds that it would not robustly function in an asynchronous neural system (see section A.3). They instead show the necessity of a population of neurons with mixed selectivity, connected to both the input and attractor neurons, in order to achieve the desired attractor itinerancy dynamics. This requirement arose by demanding that the network state switch to resembling the target state immediately upon receiving a stimulus. We instead show that similar results can be achieved without this extra population if we relax this demand, requiring only that the network soon evolve toward the target state.

The main contribution of this article is thus to introduce a method by which attractor networks may be endowed with state-dependent, attractor-switching capabilities that scales up efficiently, without requiring biologically implausible elements or components that are expensive to implement (e.g., precise synaptic delays). The extension to arbitrary FSM emulation shows the generality of the method and that its limitations can be overcome by appropriate modifications, like introducing the edge state attractors (see section A.7).

### 5.2 VSA Embeddings

This work also differs from more conventional methods to implement graphs and FSMs in VSAs (Kleyko et al., 2022; Osipov et al., 2017; Poduval et al., 2022; Teeters et al., 2023; Yerxa et al., 2018) in that the network state does not need to be read by an outsider in order to implement the state transition dynamics. That is, where in previous works a graph is encoded by a hypervector (or an associative memory composed of hypervectors) such that the desired dynamics and outputs may be reliably decoded by external circuitry, we instead encode the graph’s connectivity within the attractor network’s weights matrix, such that its recurrent neural dynamics realize the desired state machine behavior.

The use of a Hopfield network as an autoassociative cleanup memory in conjunction with VSAs has been explored in previous works, including theoretical analyses of their capacity to store bundled hypervectors with different representations (Clarkson et al., 2023), and using single attractor states to retrieve knowledge structures from partial cues (Steinberg & Sompolinsky, 2022). Further links between VSAs and attractor networks have also been demonstrated with the use of complex phasor hypervectors, rather than binary or bipolar hypervectors, being stored as attractors within phasor neural networks (Frady & Sommer, 2019; Kleyko et al., 2022; Noest, 1987; Plate, 2003). Complex phasor hypervectors are of particular interest in neuromorphic computing, since they may be very naturally implemented with spike-timing phasor codes, wherein the value represented by a neuron is encoded by the precise timing of its spikes with respect to other neurons or a global oscillatory reference signal, and hypervector binding may be implemented by phase addition (Auge et al., 2021; Orchard & Jarvis, 2023).

Osipov et al. (2017) show the usefulness of VSA representations for synthesizing state machines from observable data, which might be combined with this work to realize a neural system that can synthesize appropriate attractor itinerancy dynamics to best fit observed data. Similarly, if equally robust attractor-based neural implementations of other primitive computational blocks could be created, such as a stack, then they might be combined to create more complex VSA-driven cognitive computational structures, such as neural Turing machines (Graves et al., 2014; Grefenstette et al., 2015; Yerxa et al., 2018). Looking further, this, combined with the end-to-end trainability of VSA models, could pave the way for neural systems that have the explainability, compositionality, and robustness of VSA models, together with the flexibility and performance of deep neural networks (Hersche et al., 2023; Schlag et al., 2020).

## 6 Biological Plausibility

Transitions between discrete neural attractor states are thought to be a crucial mechanism for performing context-dependent decision making in biological neural systems (Daelli & Treves, 2010; Mante et al., 2013; Miller, 2016; Tajima et al., 2017). Attractor dynamics enable a temporary retention of received information and ensure that irrelevant inputs do not produce stable deviations in the neural state. Such networks are widely theorized to exist in the brain—for example, in the hippocampus for its pattern completion and working memory capabilities (Khona & Fiete, 2022; Rolls, 2013). As such, we showed that a Hopfield attractor network and its sparse variant can be modified such that they can perform stimulus-triggered, state-dependent attractor transitions without resorting to additional biologically implausible mechanisms and while abiding by the principles of distributed representation. The changes we introduced are (1) an altered weights matrix construction with additional asymmetric cross-terms (which does not incur any considerable extra complexity) and (2) the ability for a stimulus to mask a subset of neurons within the attractor population. As long as such a mechanism exists, the network proposed here could thus map onto brain areas theorized to support attractor dynamics. The masking mechanism could, for example, feasibly be achieved by a population of inhibitory neurons representing the stimuli, which selectively project to neurons within the attractor population.

### 6.1 Robustness

The robust functioning of the network despite noisy and unreliable weights is a crucial prerequisite for the model to plausibly be able to exist in biological systems. As we have shown, the network weights may be considerably degraded without affecting the behavior of the network, and indeed beyond this, the network exhibits a so-called graceful degradation in performance. Furthermore, biological synapses are expected to have only a few bits of precision (Baldassi et al., 2016; Bartol et al., 2015; O’Connor et al., 2005), and the network has been shown to function even in the worst case of binary weights. These properties stem from the massive redundancy arising from storing the attractor states across the entire synaptic matrix in a distributed manner, a technique that the brain is expected to use (Crawford et al., 2016; Rumelhart & McClelland, 1986). Of course, we expect there to be a trade-off between the amount of each nonideality that the network can withstand before failure. That is, an attractor network with dense noisy weights may withstand a greater degree of synaptic noise than if the weights matrix were also made sparse. Likewise, larger networks storing the same-sized FSM should be able to withstand greater nonidealities than smaller networks, as is the case for attractor networks in general (Amit, 1989; Sompolinsky, 1987).
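This robustness to weight quantization can be illustrated with a minimal sketch (not the full model from this article): store a few dense bipolar patterns in a standard Hopfield network, quantize the weights to the worst case of binary values via their sign, and check that recall from a corrupted probe still succeeds. All sizes and the noise level here are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 3  # network size, number of stored patterns (well below capacity)

# Store dense bipolar patterns with the standard Hebbian outer-product rule.
patterns = rng.choice([-1, 1], size=(P, N))
W = sum(np.outer(p, p) for p in patterns).astype(float)
np.fill_diagonal(W, 0)  # no self-connections

# Worst-case quantization: keep only the sign of each weight (binary weights).
W_bin = np.sign(W)

def recall(W, z, steps=5):
    """Synchronous sign-threshold updates."""
    for _ in range(steps):
        z = np.sign(W @ z)
    return z

# Start from a noisy version of the first pattern (10% of components flipped).
probe = patterns[0].copy()
flip = rng.choice(N, size=N // 10, replace=False)
probe[flip] *= -1

out = recall(W_bin, probe)
overlap = np.mean(out == patterns[0])
print(f"overlap after recall with binary weights: {overlap:.2f}")
```

Even with all weight magnitude information discarded, the distributed storage leaves enough redundancy for the corrupted probe to be cleaned up to the stored prototype.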

Since the network is still an attractor network, it retains all of the properties that make such networks suitable for modeling cognitive function: it can perform robust pattern completion and correction, that is, the recovery of a stored prototypical memory given a damaged, incomplete, or noisy version, and thereafter function as a stable working memory (Amit, 1989; Hopfield, 1982).

The robustness of the network to weight nonidealities also makes it a prime candidate for implementation with novel memristive crossbar technologies, which would allow an efficient and high-density implementation of the matrix-vector multiplication required in the neural state update rule (see equation 3.2) to be performed in one operation (Ielmini & Wong, 2018; Verleysen & Jespers, 1989; Xia & Yang, 2019). Akin to the biological synapses they emulate, such devices also often have only a few bits of precision and suffer from considerable per-device mismatch in the programmed conductance states. The network proposed in this article is thus highly suitable for implementation with such architectures, as we have shown that robust performance is retained even when the network is subjected to a very high degree of such nonidealities.

The continued functionality of the network when its dynamics are asynchronous is another important factor when considering its biological plausibility. In a biological neural system, neurons will produce action potentials whenever their membrane potential happens to exceed the neuron’s spiking threshold, rather than all updating synchronously at fixed time intervals. We tested the regime where the timescale of the neuron dynamics is much slower than the timescale of the input by replacing the synchronous neuron update rule with a stochastic asynchronous variant thereof, and showed that the network is robust to this asynchrony. Similarly, we tested the regime where neuron dynamics are much faster than the input by considering input that is applied stochastically and asynchronously instead (see section A.3). The continued robustness of the model in these two extreme asynchronous regimes implies that the network is not dependent on the exact timing of inputs to the network or on the neuron updates within the network, and so would function robustly in both biological neural systems and asynchronous neuromorphic systems where the exact timing of events cannot be guaranteed (Davies et al., 2018; Liu et al., 2014).

### 6.2 Learning

The simplicity of the procedure for generating the weights matrix $W$ makes the proposed network more biologically plausible than other, more complex approaches (e.g., those using gradient descent methods). The matrix can be learned in one shot in a fully online fashion, since adding a new node or edge involves only an additive contribution to the weights matrix, which does not require knowledge of irrelevant edges, nodes, their hypervectors, or the weight values themselves. Furthermore, as a result of the entirely distributed representation of states and transitions, new behaviors may be added to the weights matrix at a later date without having to allocate new hardware and without having to recalculate $W$ from all previous data. Both of these factors are critical for continual online learning.

From the hardware perspective, the locality of the learning rule means that if the matrix-vector multiplication step in the neuron state update rule is implemented using novel memristive crossbar circuits (Ielmini & Wong, 2018; Xia & Yang, 2019; Zidan & Lu, 2020), then the weights matrix could be learned online and in-memory via a sequence of parallel conductance updates rather than by computing the weights matrix offline and then writing the summed values to the devices’ conductances. As long as the updates in the memristors’ conductances are sufficiently linear and symmetric, then attractors and transitions could be sequentially learned in one shot and in parallel by specifying the two hypervectors in the outer product weight update at the crossbar’s inputs and outputs by appropriately shaped voltage pulses (Alibart et al., 2013; Li et al., 2021).
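The locality and additivity of the learning rule can be sketched as follows. The `add_transition` term shown here is a deliberately simplified, hypothetical stand-in for the article's actual masked cross-term construction (which is defined earlier in the article and not reproduced here); the point of the sketch is only that every update is a single additive outer product that never reads the existing weights or any previously stored hypervectors.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100

def add_attractor(W, x):
    """One-shot Hebbian storage of a new attractor: a purely additive,
    local update that never reads the existing weights."""
    return W + np.outer(x, x)

def add_transition(W, x_from, s, x_to):
    """Hypothetical simplified transition term (the article's actual
    construction adds asymmetric cross-terms gated by the stimulus
    mask): an additive outer product from the masked source state to
    the target state."""
    masked = x_from * (s > 0)          # stimulus masks a subset of neurons
    return W + np.outer(x_to, masked)  # asymmetric cross-term

W = np.zeros((N, N))
x1, x2 = rng.choice([-1, 1], size=(2, N))
s = rng.choice([-1, 1], size=N)

W1 = add_attractor(W, x1)
W2 = add_transition(W1, x1, s, x2)

# Locality: each update changed W only by a rank-1 outer product.
assert np.array_equal(W1 - W, np.outer(x1, x1))
assert np.array_equal(W2 - W1, np.outer(x2, x1 * (s > 0)))
```

Because each update is rank 1 and depends only on the new hypervectors, it maps directly onto the parallel in-memory conductance updates described above.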

### 6.3 Scaling

When the FSM states are represented by dense bipolar hypervectors within the attractor network, we found a linear scaling between the size of the network $N$ and the capacity $C$ in terms of the size of FSM that could be embedded without errors. Although this is in keeping with the results in the Hopfield paper, this is not a favorable result when considering the biological plausibility of the system for large $N$ (Hopfield, 1982). Since the attractor network is fully connected, the capacity actually scales sublinearly, $C \sim \sqrt{N_{\mathrm{syn}}}$, with the number of synapses $N_{\mathrm{syn}} = N^2$, meaning that an increasing number of synapses is required per stored attractor and transition for large $N$, and so the network becomes increasingly inefficient. Additionally, the fact that every neuron is active at any time (or half of them, depending on the interpretation of the $-1$ state) represents an unnecessarily large energy burden for any system using this model. This is in contrast to data from neural recordings, where a low per-neuron mean activity is ensured by the sparse coding of information (Barth & Poulet, 2012; Olshausen & Field, 2004; Rolls & Treves, 2011).

We thus tested how the capacity of the network scales with $N$ when the FSM states are instead represented by sparse binary hypervectors with coding level $f$, since it is well known that the number of sparse binary vectors that can be stored in an attractor network scales much more favorably, $P \sim N^2/(\log N)^2$ (Amari, 1989). We found indeed that the sparse coding of the FSM states vastly improved the capacity of the network, scaling approximately quadratically, $C \sim N^{1.90}$, and so approximately linearly in the number of synapses. This linear scaling with the number of synapses not only ensures the efficient use of available synaptic resources in biological systems, but is especially important when one considers a possible implementation in neuromorphic hardware, where the number of synapses usually represents the main size constraint rather than the number of neurons (Davies et al., 2018; Manohar, 2022).

The coding level $f$ was found to have an approximately inverse relationship with the attractor network size, $f \sim N^{-0.949}$, which would imply that the number of active neurons $Nf$ in any attractor state grows very slowly, $Nf \sim N^{0.051}$. This is in agreement with the theoretically optimal case, where the coding level for a sparse binary attractor network should scale like $f \sim (\log N)/N$, and so the number of active neurons in any pattern scales like $Nf \sim \log N$ (Amari, 1989).

Sparsity in the stored hypervectors is especially important when one considers how the weights matrix $W$ could be learned in an online fashion if the synapses are restricted to have only a few bits of precision. So far we have considered quantization of the weights only after the summed values have been determined, whereas including weight quantization while new patterns are being iteratively learned is a much harder problem and implies attractor capacity relations as poor as $P \sim \log N$. One solution is for the states to be increasingly sparse, in which case the optimal scaling of $P \sim N^2/(\log N)^2$ can be recovered (Amit & Fusi, 1994; Brunel et al., 1998).

In short, by letting the FSM states be represented by sparse binary hypervectors rather than dense bipolar hypervectors, we not only move closer to a more biologically realistic model of neural activity, but also benefit from the superior scaling properties of sparse binary attractor networks, which lets the maximum size of FSM that can be embedded scale approximately quadratically with the attractor network size rather than linearly.

## 7 Conclusion

Attractor neural networks are robust abstract models of human memory, but previous attempts to endow them with complex and controllable attractor-switching capabilities have suffered mostly from being nondistributed, not scalable, or not robust. We have here introduced a simple procedure by which any arbitrary FSM may be embedded into a large-enough Hopfield-like attractor network, where states and stimuli are represented by high-dimensional random hypervectors and all information pertaining to FSM transitions is stored in the network’s weights matrix in a fully distributed manner. Our method of modeling input to the network as a masking of the network state allows cross-terms between attractors to be stored in the weights matrix in a way that they are effectively obfuscated until the correct state-stimulus pair is present, in a manner similar to the standard binding-unbinding operation in more conventional VSAs.

We showed that the network retains many of the features of attractor networks that make them suitable for biology, namely, that the network is not reliant on synchronous dynamics and is robust to unreliable and imprecise weights, thus also making it highly suitable for implementation with high-density but noisy devices. We presented numerical results showing that the network capacity in terms of implementable FSM size scales linearly with the size of the attractor network for dense bipolar hypervectors and approximately quadratically for sparse binary hypervectors.

In summary, we introduced an attractor-based neural state machine that overcomes many of the shortcomings that made previous models unsuitable for use in biology and propose that attractor-based FSMs represent a plausible path by which FSMs may exist as a distributed computational primitive in biological neural networks.

## Appendix: Supplementary Material

### A.1 Dynamics without Masking

### A.2 Dynamics with Masking

### A.3 Why Model Input as Masking?

We then test the functionality of the attractor network with Hadamard input when the exact simultaneous arrival of input stimuli cannot be guaranteed (i.e., the input to the network is asynchronous). To model this, we consider that the arrival time of the stimulus is component-wise randomly and uniformly spread over five time steps rather than just one. The same attractor network receiving the same sequence of Hadamard-product stimuli, but now asynchronously, is shown in Figure 9c. The network does not perform the correct walk between attractor states and instead remains localized near the initial attractor state across all time steps. This is because, although the network begins to move away from the initial attractor state when input is applied, these changes are immediately undone by the network’s inherent attractor dynamics, since the neural state is still within the initial attractor’s basin of attraction. Only when the timescale of the input is far faster than the timescale of the attractor dynamics (e.g., when input is synchronous) may the input accumulate fast enough to escape the initial basin of attraction.

When input to the network is treated as a masking operation, however (see equation 3.2), the attractor itinerancy dynamics are robust to input asynchrony. To model this, the input stimulus is stochastically applied, with each component being delayed randomly and uniformly by up to 20 time steps. The stimulus is then held for 10 time steps and stochastically removed over 20 time steps in the same manner. The attractor network with asynchronous masking input is shown in Figure 10 and functions as desired, performing the correct walk between attractor states. Modeling input to the network as a masking operation thus allows the network to operate robustly in asynchronous regimes, while modeling input to the network as a Hadamard product does not.
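The stochastic application schedule described above can be sketched as follows: per-component onset delays, a hold period, and per-component removal delays, with window lengths taken from the text (all variable names are illustrative).

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 100, 60  # neurons, simulated time steps

# Per-component onset and removal delays, uniform over 20 time steps.
onset = rng.integers(0, 20, size=N)         # stimulus arrives at t = onset[i]
removal = 30 + rng.integers(0, 20, size=N)  # held ~10 steps, then removed

# mask[t, i] is True while component i of the stimulus is applied.
t = np.arange(T)[:, None]
mask = (t >= onset) & (t < removal)

# During the hold window every component is applied; afterward none are.
assert mask[25].all()      # fully applied between max onset (20) and min removal (30)
assert not mask[55].any()  # fully removed by t = 50
```

The resulting time-varying mask can then be applied to the neural state at each update step, which is the regime tested in Figure 10.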

### A.4 The Need for Edge States

In the fully synchronous case, when input is applied for one time step only, there is no need for edge states. When the stimulus $s$ is applied, the network will make one transition only. In the asynchronous case, however, one cannot ensure that the stimulus is applied for one time step only. Thus, starting from $x_1$, when the stimulus is applied “once” for an arbitrary number of time steps, the network may show the unwanted behavior of transitioning to $x_2$ on the first time step and then to $x_3$ on the second, effectively overshooting and skipping $x_2$. In Figure 11 we see the dynamics of an attractor network constructed without any edge states, with inputs applied for 10 time steps each, and we indeed see the undesirable skipping behavior. Similarly, bidirectional edges with the same stimulus (e.g., “consort_is”) cause an unwanted oscillation between attractor states. The edge states offer a solution to this problem: by adding an intermediate attractor state for every edge and splitting each edge into two transitions with stimuli $s_a$ and $s_b$, we ensure that there are no consecutive edges with the same stimulus.
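The edge-splitting construction can be sketched as a simple preprocessing step on the FSM’s edge list (the names here are illustrative; in the network itself each name would be a random hypervector): each edge gains a fresh intermediate edge state and is replaced by two transitions driven by the $s_a$ and $s_b$ phases of its stimulus, so no two consecutive transitions share a stimulus.

```python
def split_edges(edges):
    """Replace each FSM edge (src, stimulus, dst) with two transitions
    through a fresh intermediate edge state, driven by the 'a' and 'b'
    phases of the stimulus."""
    transitions = []
    for i, (src, stim, dst) in enumerate(edges):
        edge_state = f"e{i}"  # one fresh edge state per edge
        transitions.append((src, f"{stim}_a", edge_state))
        transitions.append((edge_state, f"{stim}_b", dst))
    return transitions

# A bidirectional pair of edges sharing one stimulus, which would otherwise
# oscillate (cf. Figure 11), becomes four non-conflicting transitions.
edges = [("x1", "s", "x2"), ("x2", "s", "x1")]
for tr in split_edges(edges):
    print(tr)
```

Because every walk through the split FSM alternates between `_a` and `_b` stimuli, a stimulus held for many time steps can drive at most one transition, eliminating both skipping and oscillation.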

If we do not need to embed FSMs with consecutive edges sharing the same stimulus, then we can dispense with the edge states and construct the weights matrix with the simpler transition terms of equation 6.2. An attractor network constructed in this way is shown in Figure 12 for a chosen FSM that does not require edge states but still contains state-dependent transitions. The network performs the correct walk between attractor states as intended and does not suffer from any of the unwanted skipping or oscillatory phenomena seen in Figure 11. Thus, while the edge states are required to ensure that any FSM can be implemented in a “large enough” attractor network, they are not strictly necessary to achieve state-dependent, stimulus-triggered attractor transition dynamics.

### A.5 Sparse Stimuli

One shortcoming of the model might be that we used dense bipolar hypervectors $s$ to represent the stimuli, meaning that when $s$ is input to the network, masking all neurons for which $s_j = -1$, approximately half of all neurons within the network are silenced. This was initially chosen because unbiased bipolar hypervectors are arguably the simplest and most common choice of VSA representation, and this choice highlights the fact that VSA-based methods can be applied to the design of attractor networks with very little required tweaking (Gayler, 1998; Kleyko et al., 2022).
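Concretely, the masking operation with a dense bipolar stimulus can be sketched as follows (an illustrative fragment; the full neural update rule is given in equation 3.2):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 10_000

z = rng.choice([-1, 1], size=N)  # current network state
s = rng.choice([-1, 1], size=N)  # dense bipolar stimulus hypervector

# Masking: neurons with s_j = -1 are silenced (set to 0) while the
# stimulus is present; roughly half of the network goes quiet.
z_masked = np.where(s == 1, z, 0)

frac_silenced = np.mean(z_masked == 0)
print(f"fraction of neurons silenced: {frac_silenced:.3f}")
```

The silenced fraction concentrates around $1/2$ for an unbiased $s$, which is exactly the drastic activity change the next paragraph questions on biological grounds.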

From the biological perspective, however, it could be seen as somewhat implausible that the number of active neurons should change so drastically (halving) while a stimulus is present. Furthermore, if implemented with spiking neurons, the large changes in the total spiking activity could cause unwanted effects in the spike rate of the nonmasked neurons. Also, this means that while the network is being masked, the size of the network (and so its capacity) is reduced to $N/2$, and so the network is especially prone to instability during the transition periods if the network is nearing its memory capacity limits.

To keep the notation consistent with the notation used for sparse binary hypervectors, we denote the coding level of the attractor states as $f_z$ (where previously it was simply $f$) and the coding level of the stimulus hypervectors as $f_s$.^{3} The coding level of the stimulus hypervectors $f_s$ we define to be the fraction of components for which $s_j > 0$. A stimulus hypervector with $f_s > 0.5$ thus silences fewer neurons from the network during a masking operation. This is not the only change we need to make, however. If we turn to our (sparse) edge terms (see equation 3.15), they were previously constructed such that they would produce a nonnegligible overlap with the network state $z^{sp}$ if and only if the network is in the correct attractor state *and* is being masked by the correct stimulus. The important condition to be fulfilled is then that the edge terms produce no such overlap when the correct stimulus is *not* present. This condition is satisfied if the components of $s$ are generated according to

To be noted is that as we approach $f_s \to 1$, the stimuli become less and less distributed, with the limiting case $f_s = 1 - 1/N$ implying that only one component of $s$ is negative, and so by masking only one neuron, the network will switch between attractor states. This case is obviously a stark departure from the robustness that the more distributed representations afford us, since if that single neuron is faulty or dies, it would be catastrophic for the functioning of the network. Similarly, if another independent stimulus were to, by chance, choose the same component to be negative, this would cause similarly unwanted dynamics. Less catastrophic but still worth considering is that the noise added per edge term, as a result of the negative components becoming very large in magnitude, has variance that scales like $\mathrm{Var}[s_j] \approx 1/(1-f_s)$, and so for $f_s \to 1$ contributes an increasing amount of unwanted noise to the system, destabilizing the attractor dynamics. Nevertheless, this represents yet another trade-off in the attractor network’s design, as needing to mask fewer neurons might be worth the increased noise within the system and the resulting decrease in memory capacity.

### A.6 Symbols and Definitions

| Symbol | Definition |
|---|---|
| $N$ | Number of neurons within the attractor network |
| $N_Z$ | Number of FSM states |
| $N_E$ | Number of FSM edges |
| $a, b, c, \ldots$ | Dense bipolar hypervectors |
| $a^{sp}, b^{sp}, c^{sp}, \ldots$ | Sparse binary hypervectors |
| $f$ | Coding level of a hypervector (fraction of nonzero components) |
| $z_t$ | Neuron state vector at time step $t$ |
| $x, y$ | Node hypervectors representing an FSM state |
| $e$ | Edge-state hypervectors |
| $s, s_a, s_b$ | Stimulus hypervectors |
| $r$ | Ternary output hypervectors |
| $\mathbf{1}$ | A hypervector of all ones |
| $W$ | Recurrent weights matrix |
| $w_{ij}$ | Synaptic weight from neuron $j$ to neuron $i$ |
| $E$ | Matrices added to $W$ to implement transitions |
| $\circ$ | Hadamard product (component-wise multiplication) |
| $x^{\top}$ | Transpose of $x$ |
| $H(\cdot)$ | Component-wise Heaviside function |
| $\mathrm{sgn}(\cdot)$ | Component-wise sign function |


## Acknowledgments

We thank Dr. Federico Corradi, Dr. Nicoletta Risi, and Prof. Matthew Cook for their invaluable input and suggestions, as well as their help with proofreading this article.

The work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation; Project MemTDE, project number 441959088) as part of the DFG priority program SPP 2262 MemrisTec (project number 422738993) and Project NMVAC (project number 432009531). We acknowledge the financial support of the CogniGron research center and the Ubbo Emmius Funds (University of Groningen).

^{1} Though this arbitrary choice may seem to incur a bias toward a particular state, in practice, the postsynaptic sum very rarely equals 0.

^{2} We have here ignored that the diagonal of $W$ is set to zero (no self-connections), but this does not significantly affect the following results.

^{3} We could also use binary $s$ hypervectors, rather than positive/negative ones, and then alter the transition terms $E_\eta$ to include $f$ and $1/(1-f)$ terms to achieve the same result. We believe it is more intuitive not to make this change for this section, however.
