Vector Symbolic Finite State Machines in Attractor Neural Networks

Abstract Hopfield attractor networks are robust distributed models of human memory, but they lack a general mechanism for effecting state-dependent attractor transitions in response to input. We propose construction rules such that an attractor network may implement an arbitrary finite state machine (FSM), where states and stimuli are represented by high-dimensional random vectors and all state transitions are enacted by the attractor network’s dynamics. Numerical simulations show the capacity of the model, in terms of the maximum size of implementable FSM, to be linear in the size of the attractor network for dense bipolar state vectors and approximately quadratic for sparse binary state vectors. We show that the model is robust to imprecise and noisy weights, and so a prime candidate for implementation with high-density but unreliable devices. By endowing attractor networks with the ability to emulate arbitrary FSMs, we propose a plausible path by which FSMs could exist as a distributed computational primitive in biological neural networks.


I. INTRODUCTION
Hopfield attractor networks are one of the most celebrated models of robust neural auto-associative memory: from a simple Hebbian learning rule they display emergent attractor dynamics which allow for reliable pattern recall, completion, and correction even in situations with considerable non-idealities imposed (Amit 1989; Hopfield 1982). Attractor models have since found widespread use in neuroscience as a functional and tractable model of human memory (Chaudhuri and Fiete 2016; Eliasmith 2005; Khona and Fiete 2022; Little 1974; Rolls 2013; Schneidman et al. 2006). The assumption of these models is that the network represents different states by different, usually uncorrelated, global patterns of persistent activity. When the network is presented with an input that closely resembles one of the stored states, the network state converges to the corresponding fixed-point attractor.
This process of switching between discrete attractor states is thought to be fundamental both for describing biological neural activity and for modelling higher cognitive decision-making processes (Brinkman et al. 2022; Daelli and Treves 2010; Mante et al. 2013; Miller 2016; Tajima et al. 2017). What attractor models currently lack, however, is the ability to perform state-dependent computation, a hallmark of human cognition (Buonomano and Maass 2009; Dayan 2008; Granger 2020). That is, when the network is presented with an input, the attractor state to which the network switches ought to depend on both the input stimulus and the state the network currently inhabits, rather than on the input alone.
We thus seek to endow a classical neural attractor model, the Hopfield network, with the ability to perform state-dependent switching between attractor states, without resorting to the use of biologically implausible mechanisms, such as training via backpropagation algorithms. The resulting attractor networks will then be able to robustly emulate any arbitrary Finite State Machine (FSM), considerably improving their usefulness as a neural computational primitive.
We achieve this by leaning heavily on the framework of Vector Symbolic Architectures (VSAs), also known as Hyperdimensional Computing (HDC). VSAs treat computation in an entirely distributed manner, by letting symbols be represented by high-dimensional random vectors, called hypervectors (Gayler 1998; Kanerva 1997; Kleyko, Rachkovskij, et al. 2022; Plate 1995). When equipped with a few basic operators for binding and superimposing hypervectors, often corresponding to component-wise multiplication and addition respectively, these architectures are able to store primitives such as sets, sequences, graphs and arbitrary data bindings, as well as enabling more complex relations, such as analogical and figurative reasoning (Kanerva 2009; Kleyko, Davies, et al. 2021). Although different VSA models often have differing representations and binding operations (Kleyko, Rachkovskij, et al. 2022), they all share the need for an auto-associative cleanup memory, which can recover a clean version of the most similar stored hypervector, given a noisy version of it. We here use the recurrent dynamics of a Hopfield-like attractor neural network as a state-holding auto-associative memory (Gritsenko et al. 2017).
Symbolic FSM states will thus each be represented by a hypervector and stored within the attractor network as a fixed-point attractor. Stimuli will also be represented by hypervectors, which, when input to the attractor network, will trigger the network dynamics to transition between the correct attractor states. We make use of common VSA techniques to construct a weights matrix to achieve these dynamics, where we use the Hadamard product between bipolar hypervectors in $\{-1, 1\}^N$ as the binding operation (the Multiply-Add-Permute (MAP) VSA model) (Gayler 1998). We thus claim that attractor-based FSMs are a plausible biological computational primitive insofar as Hopfield networks are.
This represents a computational paradigm that departs from conventional von Neumann architectures, wherein the separation of memory and computation is a major limiting factor in current advances in conventional computational performance (the von Neumann bottleneck (Backus 1978; Indiveri and Liu 2015)). Similarly, the high redundancy and lack of reliance on individual components make this architecture fit for implementation with novel in-memory computing technologies such as resistive RAM (RRAM) or phase-change memory (PCM) devices, which could perform the network's matrix-vector-multiplication (MVM) step in a single operation (Ielmini and Wong 2018; Xia and Yang 2019; Zidan and Lu 2020).

A. Hypervector arithmetic
Throughout this article, symbols will be represented by high-dimensional randomly-generated dense bipolar hypervectors $\mathbf{x} \in \{-1, 1\}^N$, where the number of dimensions $N$ is generally taken to be greater than 10,000. Unless explicitly stated otherwise, any bold lowercase Latin letter may be assumed to be a new, independently generated hypervector, with the value $x_i$ at any index $i$ in $\mathbf{x}$ generated according to

$$P(x_i = +1) = P(x_i = -1) = \tfrac{1}{2}. \qquad (2)$$

For any two arbitrary hypervectors $\mathbf{a}$ and $\mathbf{b}$, we define the similarity between the two hypervectors by the normalised inner product

$$d(\mathbf{a}, \mathbf{b}) := \frac{1}{N}\,\mathbf{a}^\top \mathbf{b} = \frac{1}{N} \sum_{i=1}^{N} a_i b_i, \qquad (3)$$

where the similarity between a hypervector and itself is $d(\mathbf{a}, \mathbf{a}) = 1$, and $d(\mathbf{a}, -\mathbf{a}) = -1$. Due to the high dimensionality of the hypervectors, the similarity between any two unrelated (and so independently generated) hypervectors is the mean of an unbiased random sequence of $-1$s and $+1$s, which tends to 0 for $N \to \infty$. It is from this result that we get the requirement of high dimensionality, as it ensures that the inner product between two random hypervectors is approximately 0. We can thus say that independently generated hypervectors are pseudo-orthogonal (Kleyko, Davies, et al. 2021). For a set of independently generated states $\{\mathbf{x}^\mu\}$, these results can be summarised by

$$d(\mathbf{x}^\mu, \mathbf{x}^\nu) \approx \delta^{\mu\nu},$$

where $\delta^{\mu\nu}$ is the Kronecker delta. Hypervectors may be combined via a so-called binding operation to produce a new hypervector that is dissimilar to both its constituents. We here choose the Hadamard product, or component-wise multiplication, as our binding operation, denoted "$\circ$".
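The statement that the binding of two hypervectors is dissimilar to its constituents is written as

$$d(\mathbf{a} \circ \mathbf{b}, \mathbf{a}) \approx 0, \qquad d(\mathbf{a} \circ \mathbf{b}, \mathbf{b}) \approx 0,$$

where we implicitly assume that $N$ is large enough that we can ignore the $O(1/\sqrt{N})$ noise terms. If we wish to recover a similarity between the hypervectors $\mathbf{a} \circ \mathbf{b}$ and $\mathbf{a}$, we could bind the $\mathbf{b}$ hypervector to the lone $\mathbf{a}$ term as well, in which case we would have $d(\mathbf{a} \circ \mathbf{b}, \mathbf{a} \circ \mathbf{b}) = 1$. For reasons of ease and robustness of implementation in an asynchronous neural system, we focus instead on another method to recover the similarity (see Sections III-A & VII-C). If we mask the system using $\mathbf{b}$, such that only components where $b_i = 1$ remain, then we have

$$\begin{aligned} d\big(\mathbf{a} \circ \mathbf{b},\, \mathbf{a} \circ H(\mathbf{b})\big) &= \frac{1}{N} \sum_{i=1}^{N} a_i b_i\, a_i H(b_i) = \frac{1}{N} \sum_{i=1}^{N} b_i H(b_i) \\ &= \frac{1}{N} \Big[ \sum_{i:\, b_i = 1} (1)(1) + \sum_{i:\, b_i = -1} (-1)(0) \Big] \approx \frac{1}{2}, \end{aligned}$$

where we have used $a_i^2 = 1$ and the Heaviside step function $H(\cdot)$, applied component-wise and defined by

$$H(x) = \begin{cases} 1 & x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

to create a multiplicative mask $H(\mathbf{b})$, setting to 0 all components where $b_i = -1$. In the second line, we have split the summation over all components into summations over components where $b_i = 1$ and $-1$ respectively. The final similarity of $\tfrac{1}{2}$ is a consequence of approximately half of all values in any hypervector being $+1$ (Equation 2).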
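The similarity relations above can be checked numerically. The following is a minimal sketch, assuming NumPy; the helper names `random_hv`, `similarity` and `heaviside`, the seed, and the choice N = 10,000 are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000  # hypervector dimension (illustrative choice)

def random_hv(n=N):
    """Dense bipolar hypervector with P(+1) = P(-1) = 1/2 (Equation 2)."""
    return rng.choice(np.array([-1, 1]), size=n)

def similarity(a, b):
    """Normalised inner product d(a, b) = a.b / N (Equation 3)."""
    return float(a @ b) / len(a)

def heaviside(v):
    """Component-wise mask H(v): 1 where v > 0, else 0."""
    return (v > 0).astype(int)

a, b = random_hv(), random_hv()
d_self = similarity(a, a)                        # exactly 1
d_rand = similarity(a, b)                        # O(1/sqrt(N)), close to 0
d_bound = similarity(a * b, a)                   # binding hides a: close to 0
d_masked = similarity(a * b, a * heaviside(b))   # masking by b recovers about 1/2
```

The masked similarity equals the fraction of +1 entries in `b`, which is why it concentrates around 1/2 rather than returning to 1.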

B. Hopfield networks
A Hopfield network is a dynamical system defined by its internal state vector $\mathbf{z}$ and fixed recurrent weights matrix $W$, with a state update rule given by

$$\mathbf{z}_{t+1} = \mathrm{sgn}(W \mathbf{z}_t) \qquad (10)$$

where $\mathbf{z}_t$ is the network state at discrete time step $t$, and $\mathrm{sgn}(\cdot)$ is a component-wise sign function, with zeroes resolving to $+1$ (Hopfield 1982). If we want to store $P$ uncorrelated patterns $\{\mathbf{x}^\nu\}_{\nu=1}^{P}$ within a Hopfield network, we can construct the weights matrix $W$ according to

$$W = \frac{1}{N} \sum_{\nu=1}^{P} \mathbf{x}^\nu \mathbf{x}^{\nu\top}.$$

Then, as long as not too many patterns are stored ($P < 0.14N$ (Hopfield 1982)), the patterns will become fixed-point attractors of the network's dynamics, and the network can perform robust auto-associative pattern completion and correction.
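As an illustration of the storage rule and update rule just described, the sketch below stores a handful of random patterns and recalls one of them from a corrupted cue. It assumes NumPy and synchronous updates; the sizes (N = 1000, P = 20), the 10% corruption level and the function names are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
N, P = 1000, 20  # P is kept well below the ~0.14 N capacity limit

X = rng.choice(np.array([-1.0, 1.0]), size=(P, N))  # stored patterns, one per row
W = X.T @ X / N                                     # Hebbian outer-product rule

def sgn(v):
    """Component-wise sign, with zeroes resolving to +1."""
    return np.where(v >= 0, 1.0, -1.0)

def run(z, steps=10):
    """Synchronous updates z <- sgn(W z) (Equation 10)."""
    for _ in range(steps):
        z = sgn(W @ z)
    return z

# Corrupt 10% of the entries of the first pattern, then let the network relax.
noisy = X[0].copy()
flip = rng.choice(N, size=N // 10, replace=False)
noisy[flip] *= -1

recalled = run(noisy)
overlap = float(recalled @ X[0]) / N  # similarity to the stored pattern
```

At this low load the corrupted cue is expected to fall back into the stored pattern, so `overlap` should be very close to 1.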

C. Finite State Machines
A Finite State Machine (FSM) $M$ is a discrete system with a finite state set $X_{FSM} = \{\chi_1, \chi_2, \ldots, \chi_{N_Z}\}$, a finite input stimulus set $S_{FSM} = \{\varsigma_1, \varsigma_2, \ldots, \varsigma_{N_S}\}$, and a finite output response set $R_{FSM} = \{\rho_1, \rho_2, \ldots, \rho_{N_R}\}$. The FSM $M$ is then fully defined with the addition of the transition function $F(\cdot) : X_{FSM} \times S_{FSM} \to X_{FSM}$ and the output response function $G(\cdot) : X_{FSM} \times S_{FSM} \to R_{FSM}$, such that

$$x_{t+1} = F(x_t, s_t)$$
$$r_{t+1} = G(x_t, s_t)$$

where $x_t \in X_{FSM}$, $r_t \in R_{FSM}$ and $s_t \in S_{FSM}$ are the state, output and stimulus at time step $t$ respectively. The transition function $F(\cdot)$ thus provides the next state for any state-stimulus pair, while $G(\cdot)$ provides the output, and both may be chosen arbitrarily. The FSM $M$ can thus be represented by a directed graph, where each node represents a different state $\chi$, and every edge has a stimulus $\varsigma$ and optional output $\rho$ associated with it.
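For reference, a symbolic FSM of this kind can be written as a plain lookup table. The sketch below encodes a few of the transitions named later in the text ("overthrown by", "father is", "consort is"). The dictionary layout, the `step` and `walk` helpers, and the convention that an unlisted state-stimulus pair leaves the state unchanged are assumptions made for illustration; outputs are left as `None` here.

```python
from typing import Dict, List, Optional, Tuple

State = str
Stimulus = str
Output = Optional[str]

# (state, stimulus) -> (next state, output); together these play the roles of F and G.
TRANSITIONS: Dict[Tuple[State, Stimulus], Tuple[State, Output]] = {
    ("Uranus", "overthrown by"): ("Kronos", None),
    ("Kronos", "overthrown by"): ("Zeus", None),
    ("Hades", "father is"): ("Kronos", None),
    ("Kronos", "father is"): ("Uranus", None),
    ("Kronos", "consort is"): ("Rhea", None),
    ("Rhea", "consort is"): ("Kronos", None),
}

def step(state: State, stimulus: Stimulus) -> Tuple[State, Output]:
    """Apply one stimulus; pairs without an edge leave the state unchanged (assumption)."""
    return TRANSITIONS.get((state, stimulus), (state, None))

def walk(state: State, stimuli: List[Stimulus]) -> List[State]:
    """Return the sequence of states visited while applying the stimuli in order."""
    visited = [state]
    for s in stimuli:
        state, _ = step(state, s)
        visited.append(state)
    return visited

path = walk("Hades", ["father is", "father is", "overthrown by"])
```

The same stimulus ("father is") leads to different targets depending on the current state, which is the state dependence the attractor network is meant to reproduce.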

III. ATTRACTOR NETWORK CONSTRUCTION
We now show how a Hopfield-like attractor network may be constructed to emulate an arbitrary FSM, where the states within the FSM are stored as attractors in the network, and the stimuli for transitions between FSM states trigger all corresponding transitions between attractors. More specifically, for every FSM state $\chi \in X_{FSM}$, an associated hypervector $\mathbf{x}$ is randomly generated and stored as an attractor within the network, the set of which we denote $X_{AN}$. We henceforth refer to these hypervectors as node hypervectors, or node attractors. Every unique stimulus $\varsigma \in S_{FSM}$ in the FSM is also now associated with a randomly generated hypervector $\mathbf{s} \in S_{AN}$, where $S_{AN}$ is the set of all stimulus hypervectors. For the FSM edge outputs $\rho \in R_{FSM}$, a corresponding set of output hypervectors $\mathbf{r} \in R_{AN}$ is similarly generated. These correspondences are summarised in Table I.

A. Constructing transitions
We consider the general situation that we want to initiate a transition from source attractor state x ∈ X AN to target attractor state y ∈ X AN , by imposing some stimulus hypervector s ∈ S AN as input onto the network.
To ensure the plausible functionality of the network in a biological system, the mechanism for enacting transitions in the network should make very few timing assumptions about the system, and should be robust to an arbitrary degree of asynchrony. How we model input to the network is thus of crucial importance to its functionality in these regimes. We model input to the network as a masking of the network state, such that all components where the stimulus $\mathbf{s}$ is $-1$ are set to 0. This may be likened to input that selectively silences half of all neurons according to the stimulus hypervector. This mechanism was chosen as it allows the network to function even when the input is applied asynchronously and with random delays (see Section VII-C). While a stimulus hypervector $\mathbf{s}$ is being imposed upon the network, the modified state update rule is given by

$$\mathbf{z}_{t+1} = \mathrm{sgn}\big(W (\mathbf{z}_t \circ H(\mathbf{s}))\big) \qquad (14)$$

where the Hadamard product of the network state with $H(\mathbf{s})$ enacts the masking operation, and the weights matrix $W$ is constructed such that $\mathbf{z}_{t+1}$ will resemble the desired target state (Section III-A).
For every edge in the FSM, we randomly generate an "edge state" $\mathbf{e}$, which is also stored as an attractor within the network. Each edge will use this $\mathbf{e}$ state as an intermediate attractor state, en route to $\mathbf{y}$. Additionally, each unique stimulus $\varsigma \in S_{FSM}$ will now have two stimulus hypervectors associated with it, $\mathbf{s}_a$ and $\mathbf{s}_b$, which trigger transitions from source state $\mathbf{x}$ to edge state $\mathbf{e}$ and from edge state $\mathbf{e}$ to target state $\mathbf{y}$ respectively. The edge states are introduced to allow the system to function even when stimuli are input to the network for arbitrarily many time steps, and to prevent unwanted effects such as skipping over certain attractor states, or oscillations between states (see Section VII-D). A general transition now looks like

$$\mathbf{x} \xrightarrow{\;\mathbf{s}_a\;} \mathbf{e} \xrightarrow{\;\mathbf{s}_b\;} \mathbf{y}$$

where $\mathbf{x}, \mathbf{y} \in X_{AN}$ are node attractor states but $\mathbf{e}$ exists purely to facilitate the transition. The weights matrix is constructed as

$$W = \frac{1}{N} \sum_{\nu=1}^{N_Z} \mathbf{x}^\nu \mathbf{x}^{\nu\top} + \underbrace{\frac{1}{N} \sum_{\eta=1}^{N_E} E^\eta}_{\text{Asymmetric transition terms}} \qquad (16)$$

where $\mathbf{x}^\nu \in X_{AN}$ is the node hypervector corresponding to the $\nu$'th node in the graph to be implemented, $N_Z$ and $N_E$ are the number of nodes and edges respectively, and $E^\eta$ is the addition to the weights matrix required to implement an individual edge, given by

$$E = \mathbf{e}\mathbf{e}^\top + H(\mathbf{s}_a) \circ (\mathbf{e} - \mathbf{x})(\mathbf{x} \circ \mathbf{s}_a)^\top + H(\mathbf{s}_b) \circ (\mathbf{y} - \mathbf{e})(\mathbf{e} \circ \mathbf{s}_b)^\top \qquad (17)$$

where $\mathbf{x}$, $\mathbf{e}$ and $\mathbf{y}$ are the source, edge, and target states of the edge $\eta$ respectively, and $\mathbf{s}_a$ and $\mathbf{s}_b$ are the input stimulus hypervectors associated with this edge's label. The edge index $\eta$ has been dropped for brevity. The $\mathbf{e}\mathbf{e}^\top$ term is the edge state attractor we have introduced as an intermediary for the transition. The second set of terms enacts the $\mathbf{x} \xrightarrow{\mathbf{s}_a} \mathbf{e}$ transition, by giving a nonzero inner product with the network state $\mathbf{z}_t$ only when the network is in state $\mathbf{x}$, and the network is being masked by the stimulus $\mathbf{s}_a$. When both of these conditions are met, the $(\mathbf{x} \circ \mathbf{s}_a)^\top$ term will have a nonzero inner product with the network state, projecting out the $(\mathbf{e} - \mathbf{x})$ term, which "pushes" the network from the $\mathbf{x}$ to the $\mathbf{e}$ attractor state. This allows terms to be stored in $W$ which are effectively obfuscated, not affecting network dynamics considerably, until a specific stimulus is applied as a mask to the network. Likewise, the third set of terms enacts the $\mathbf{e} \xrightarrow{\mathbf{s}_b} \mathbf{y}$ transition.

In the absence of input, the network functions like a standard Hopfield attractor network,

$$W\mathbf{x} \approx \mathbf{x} + \sigma \mathbf{n} \qquad (18)$$

where $\mathbf{n} \in \mathbb{R}^N$ is a standard normally-distributed random vector, and

$$\sigma := \sqrt{\frac{N_Z + 3 N_E}{N}} \qquad (19)$$

is the magnitude of noise due to the undesired finite inner product with other stored terms (see Section VII-A for proof). Thus, as long as the magnitude of the noise is not too large, $\mathbf{x}$ will be a solution of $\mathbf{z} = \mathrm{sgn}(W\mathbf{z})$ and so a fixed-point attractor of the dynamics. When a valid stimulus is presented as input to the network, however, masking the network state, the previously obfuscated asymmetric transition terms become significant and dominate the dynamics. Assuming there is a stored transition term $E$ corresponding to a valid edge with hypervectors $\mathbf{x}, \mathbf{e}, \mathbf{y}, \mathbf{s}_a, \mathbf{s}_b$ having the same meaning as in Equation 17, during a masking operation we have

$$W\big(\mathbf{x} \circ H(\mathbf{s}_a)\big) \;\overset{\propto}{\sim}\; H(\mathbf{s}_a) \circ \mathbf{e} + H(-\mathbf{s}_a) \circ \mathbf{x} + \sqrt{2}\,\sigma \mathbf{n}$$

where $\overset{\propto}{\sim}$ implies approximate proportionality (see Section VII-B for proof). The second set of terms can be ignored, as they project only to neurons which are currently being masked. Thus the only significant term is that containing the edge state $\mathbf{e}$, which consequently drives the network to the $\mathbf{e}$ state, enacting the $\mathbf{x} \xrightarrow{\mathbf{s}_a} \mathbf{e}$ transition. Since the state $\mathbf{e}$ is also stored as an attractor within the network, we have

$$W\big(\mathbf{e} \circ H(\mathbf{s}_a)\big) \;\overset{\propto}{\sim}\; \mathbf{e} + \sqrt{2}\,\sigma \mathbf{n}$$

and

$$W\mathbf{e} \approx \mathbf{e} \pm \sigma \mathbf{n} \qquad (22)$$

thus the edge states $\mathbf{e}$ are also fixed-point attractors of the network dynamics. To complete the transition from state $\mathbf{x}$ to $\mathbf{y}$, the second stimulus $\mathbf{s}_b$ is applied, giving

$$W\big(\mathbf{e} \circ H(\mathbf{s}_b)\big) \;\overset{\propto}{\sim}\; H(\mathbf{s}_b) \circ \mathbf{y} + H(-\mathbf{s}_b) \circ \mathbf{e} + \sqrt{2}\,\sigma \mathbf{n}$$

which drives the network state towards $\mathbf{y} \in X_{AN}$, the desired target attractor state. By consecutive application of the inputs $\mathbf{s}_a$ and $\mathbf{s}_b$, the transition terms $E^\eta$ stored in $W$ have thus caused the network to controllably transition from the source attractor state to the target attractor state. Due to the robustness of the masking mechanism, the stimuli can be applied asynchronously and with arbitrary delays (see Section VII-C). Transition terms $E^\eta$ may be iteratively added to $W$ to achieve any arbitrary transition between attractor states, and so any arbitrary FSM may be implemented within a large enough attractor network.
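The construction can be exercised on a toy machine. The sketch below is a minimal illustration, assuming that each edge term has the form E = e eᵀ + H(s_a)∘(e − x)(x∘s_a)ᵀ + H(s_b)∘(y − e)(e∘s_b)ᵀ and that input is applied by masking the state with H(s). The three-node chain A → B → C under a single stimulus label, the size N = 2000, and all function names are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 2000  # number of neurons (illustrative; the paper uses N of order 10,000)

def hv():
    """Random dense bipolar hypervector."""
    return rng.choice(np.array([-1.0, 1.0]), size=N)

def H(v):
    """Component-wise Heaviside mask."""
    return (v > 0).astype(float)

def sgn(v):
    """Component-wise sign, with zeroes resolving to +1."""
    return np.where(v >= 0, 1.0, -1.0)

def edge_term(x, e, y, sa, sb):
    """Assumed edge term: e e^T + H(sa)o(e-x)(x o sa)^T + H(sb)o(y-e)(e o sb)^T."""
    return (np.outer(e, e)
            + np.outer(H(sa) * (e - x), x * sa)
            + np.outer(H(sb) * (y - e), e * sb))

nodes = {name: hv() for name in ("A", "B", "C")}
sa, sb = hv(), hv()                # stimulus pair shared by both edges (one label)
edges = [("A", "B"), ("B", "C")]   # same stimulus, different target per source state

W = sum(np.outer(x, x) for x in nodes.values())
for src, dst in edges:
    W = W + edge_term(nodes[src], hv(), nodes[dst], sa, sb)
W = W / N

def run(z, stimulus=None, steps=10):
    """Free evolution, or masked evolution while a stimulus is applied."""
    mask = np.ones(N) if stimulus is None else H(stimulus)
    for _ in range(steps):
        z = sgn(W @ (z * mask))
    return z

def sim(a, b):
    return float(a @ b) / N

z = run(nodes["A"])                # no input: the network stays in A
z = run(run(z, sa), sb)            # stimulus applied: A -> edge state -> B
sim_B = sim(z, nodes["B"])
z = run(run(z, sa), sb)            # same stimulus again: B -> edge state -> C
sim_C = sim(z, nodes["C"])
z = run(run(z, sa), sb)            # C has no outgoing edge: the network stays in C
sim_C_again = sim(z, nodes["C"])
```

Each stimulus is held for ten steps here, so the sketch also illustrates why the intermediate edge state is useful: holding s_a longer parks the network in the edge state instead of letting it skip ahead from B to C.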

B. Edge outputs
Until now we have not mentioned the other critical component of FSMs: the output associated with every edge. We have separated the construction of transitions and edge outputs for clarity, since the two may be effectively decoupled. Much like for the nodes and edges in the FSM to be implemented, for every unique FSM output $\rho \in R_{FSM}$, we generate a corresponding hypervector $\mathbf{r} \in R_{AN}$, where $R_{AN}$ is the set of all output hypervectors. We then seek to embed these hypervectors into the attractor network, such that every transition between node attractor states may contain one of these hypervectors $\mathbf{r}$. A natural solution would be to embed the $\mathbf{r}$ hypervector into the edge state attractors $\mathbf{e}\mathbf{e}^\top$, since there already exists one for every edge. We can consider altering the edge state attractors from $\mathbf{e}\mathbf{e}^\top$ to $\mathbf{e}_r \mathbf{e}_r^\top$, where $\mathbf{e}_r$ resembles the original $\mathbf{e}$ state with $\mathbf{r}$ embedded within it, such that its presence can be detected via a linear projection. If multiple edges have the same $\mathbf{r}$ hypervector, however, then the $\mathbf{e}_r \mathbf{e}_r^\top$ terms for different edges will be correlated, incurring unwanted interference between attractor states and violating the assumption that the inner product between different attractor terms is small enough that it can be ignored. We avoid this by instead storing altered edge state attractors of the form $\mathbf{e}_r \mathbf{e}^\top$. We then choose $\mathbf{e}_r$ such that it is minimally different from $\mathbf{e}$ (i.e. $d(\mathbf{e}_r, \mathbf{e}) \approx 1$), so that we still retain the desired attractor dynamics. We thus choose the output hypervectors $\mathbf{r} \in R_{AN}$ to be sparse ternary hypervectors $\mathbf{r} \in \{-1, 0, 1\}^N$ with coding level $f_r := \frac{1}{N} \sum_{i}^{N} |r_i|$, the fraction of nonzero components. These output hypervectors are then embedded in the edge state attractors, altering the $\mathbf{e}\mathbf{e}^\top$ terms in each $E$ term according to

$$\mathbf{e}\mathbf{e}^\top \;\to\; \mathbf{e}_r \mathbf{e}^\top$$

where the composite vector $\mathbf{e}_r$ introduced above is here defined as

$$\mathbf{e}_r := \mathbf{e} \circ (\mathbf{1} - \mathbf{r} \circ \mathbf{r}) + \mathbf{r}$$

and $\mathbf{1}$ is a hypervector of all ones. As a result of this modification, the edge states $\mathbf{e}$ themselves will no longer be exact attractors of the space. The composite state $\mathbf{e}_r$ will however be stable, in which the presence of $\mathbf{r}$ can be easily detected by a linear projection ($\mathbf{e}_r^\top \mathbf{r} = N f_r$). This has been achieved without incurring any similarity, and thus interference, between attractors, which would otherwise alter the dynamics of the previously described transitions. A full transition term $E^\eta$, including its output, is thus given by

$$E^\eta := \mathbf{e}_r \mathbf{e}^\top + H(\mathbf{s}_a) \circ (\mathbf{e} - \mathbf{x})(\mathbf{x} \circ \mathbf{s}_a)^\top + H(\mathbf{s}_b) \circ (\mathbf{y} - \mathbf{e})(\mathbf{e} \circ \mathbf{s}_b)^\top \qquad (25)$$

which, combined with the network state masking operation, is solely responsible for storing the FSM connectivity and enabling the desired inter-attractor transition dynamics.
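The output embedding can be checked in isolation. The sketch below assumes the composite state takes the value of r wherever r is nonzero and the value of e elsewhere, written here as e∘(1 − r∘r) + r; the values N = 10,000 and N·f_r = 200 follow the parameters quoted for the simulations, while the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
N, n_active = 10_000, 200  # N and N * f_r

e = rng.choice(np.array([-1, 1]), size=N)      # dense bipolar edge state

# Sparse ternary output hypervector with exactly n_active nonzero entries.
r = np.zeros(N, dtype=int)
idx = rng.choice(N, size=n_active, replace=False)
r[idx] = rng.choice(np.array([-1, 1]), size=n_active)

f_r = np.abs(r).sum() / N                      # coding level of r
e_r = e * (1 - r * r) + r                      # r overwrites e where r is nonzero

projection = int(e_r @ r)                      # linear read-out, equals N * f_r
d_er_e = float(e_r @ e) / N                    # stays close to 1 for small f_r
```

Because only a fraction f_r of the components is overwritten, `e_r` remains a bipolar vector that is almost identical to `e`, while the projection onto `r` is maximal.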

C. Sparse activity states
It is well known that the memory capacity of attractor networks can be vastly increased by storing sparsely-coded activity patterns, rather than dense patterns as we have done thus far (Amari 1989;Amit 1989;Tsodyks and Feigel'man 1988).We therefore adapt the construction of the attractor network to the case that the network state z t and its stored hypervectors x ν are binary and f −sparse, i.e. contain mostly zeroes, with very few entries being +1, to test if there are similar gains in the size of FSM that can be reliably embedded.To distinguish these hypervectors from the dense bipolar hypervectors we have been using thus far, we denote sparse binary hypervectors x sp ∈ {0, 1} N with |x sp | 1 = N f , where f is the fixed coding level of the states, the fraction of nonzero components.Note that we here construct hypervectors which have exactly N f nonzero components, and so they may better be described as a sparse N -of-M code (Furber et al. 2004).The attractor network's weights matrix is constructed as where E η are the equivalent sparse edge terms to be defined.
If the neuron state update rule (Equation 10) is replaced with a sparse binary variant, e.g. a top-k activation function or a Heaviside function with an appropriately chosen threshold, then the stored states x ν sp will be attractors of the network's dynamics (Amari 1989).The additional edge terms E η are analogously constructed as where the first set of terms embeds the sparse binary edge state e sp as an attractor, while the second and third terms embed the source-to-edge and edge-to-target transitions respectively.The stimulus hypervectors s a and s b can also be made sparse, such that fewer than half of all neurons are masked by the stimuli, but at the cost of decreased memory capacity (Section VII-E).For this reason, we here keep them as bipolar hypervectors, with an approximately equal number of +1 as −1 entries.Each set of terms within each E η term performs the same role as in the dense bipolar case as discussed in Section III-A.How output states should be embedded into each transition in the sparse case is unclear, because unlike in the dense case, they cannot be embedded into the edge state attractors without considerably affecting the network dynamics and thus attractor stabilities.

A. FSM emulation
To show the generality of FSM construction, we chose to implement a directed graph representing the relationships between gods in ancient Greek mythology, due to the graph's dense connectivity. The graph, and thus the FSM to be implemented, is shown in Figure 1. From the graph it is clear that a state machine representing the graph must explicitly be capable of state-dependent transitions, e.g. the input "overthrown by" must result in a transition to state "Kronos" when in state "Uranus", but to state "Zeus" when in state "Kronos". To construct W, the necessary hypervectors are first generated. For every state χ ∈ X_FSM in the FSM (e.g. "Zeus", "Kronos") a random bipolar hypervector x is generated according to Equation 2. For every unique stimulus ς ∈ S_FSM (e.g. "overthrown by", "father is") a pair of random bipolar stimulus hypervectors s_a and s_b is likewise generated. Similarly, sparse ternary output hypervectors r are also generated. The weights matrix W is then iteratively constructed as per Equations 16 and 25, with a new hypervector e also being generated for every edge. The matrix generated from this procedure we denote W_ideal.

For all of the following results, the attractor network is first initialised to be in a certain node attractor state, in this case, "Hades". The network is then allowed to freely evolve for 10 time steps (chosen arbitrarily) as per Equation 10, with every neuron being updated simultaneously on every time step. During this period, it is desired that the network state z_t remains in the attractor state in which it was initialised. An input stimulus s_a is then presented to the network for 10 time steps, during which time the network state is masked by the stimulus hypervector, and the network evolves synchronously according to Equation 14. If the stimulus corresponds to a valid edge in the FSM, the network state z_t should then be driven towards the correct edge state attractor e. After these 10 time steps, the second stimulus hypervector s_b for a particular input is presented for 10 time steps. Again, the network evolves according to Equation 14, and the network should be driven towards the target attractor state y, completing the transition. This process is repeated every 30 time steps, causing the network state z_t to travel between node attractor states x ∈ X_AN, corresponding to a valid walk between states χ ∈ X_FSM in the represented FSM.

To view the resulting network dynamics, the similarity between the network state z_t and the edge and node attractor states is calculated as per Equation 3, such that a similarity of 1 between z_t and some attractor state x^ν implies z_t = x^ν and thus that the network is inhabiting that attractor. The similarity between the network state z_t and the output states r ∈ R_AN is also calculated, but due to the output hypervectors being sparse, the maximum value that the similarity can take is d(z_t, r) = f_r, which would be interpreted as that output symbol being present.

Fig. 1: An example FSM which we implement within the attractor network. Each node within the graph (e.g. "Zeus") is represented by a new hypervector x^µ and stored as an attractor within the network. Every edge is labelled by its stimulus (e.g. "father is"), for which corresponding hypervectors s_a and s_b are also generated. When a stimulus' hypervector is input to the network, it should allow all corresponding attractor transitions to take place. Each edge may also have an associated output symbol, where we here choose the edges labelled "type" to output the generation of the god {"Primordial", "Titans", "Olympians"}. This graph was chosen as it displays the generality of the embedding: it contains cycles, loops, bidirectional edges and state-dependent transitions.
An attractor network performing a walk is shown in Figure 2, with parameters N = 10,000, N f_r = 200, N_Z = 8, and N_E = 16. This corresponds to the network having a per-neuron noise (the finite-size effect resulting from random hypervectors having a nonzero similarity to each other) of σ ≈ 0.07, calculated via Equation 19. The magnitude of the noise is thus small compared with the desired signal of magnitude 1 (Equation 18), and so we are far away from reaching the memory capacity of the network. The network performs the walk as intended, transitioning between the correct node attractor states and corresponding edge states with their associated outputs. The specific sequence of inputs was chosen to show the generality of implementable state transitions. First, there is the explicit state dependence in the repeated input of "father is, father is". Second, it contains an input stimulus that does not correspond to a valid edge for the currently inhabited state ("Zeus overthrown by"), which should not cause a transition. Third, it contains bidirectional edges ("consort is"), whose repeated application causes the network to flip between two states (between "Kronos" and "Rhea"). Fourth, it contains self-connections, whose target states and source states are identical. Since the network traverses all these edges as expected, we do not expect the precise structure of an FSM's graph to limit whether or not it can be emulated by the attractor network.
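The quoted noise level can be cross-checked numerically. The sketch below assumes the per-neuron noise scales as σ = sqrt((N_Z + 3 N_E)/N), an expression that reproduces the quoted σ ≈ 0.07 for the parameters above, and compares it with the spread of Wx − x measured on a smaller network. For simplicity every edge here receives its own random stimulus pair and a random source and target, which is an assumption of this sketch and not the graph of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(5)
N, N_Z, N_E = 2000, 8, 16  # smaller N than in the paper, to keep the matrix light

def hv():
    """Random dense bipolar hypervector."""
    return rng.choice(np.array([-1.0, 1.0]), size=N)

def H(v):
    """Component-wise Heaviside mask."""
    return (v > 0).astype(float)

nodes = [hv() for _ in range(N_Z)]
W = sum(np.outer(x, x) for x in nodes)
for _ in range(N_E):
    i, j = rng.choice(N_Z, size=2, replace=False)   # random source and target
    x, y, e, sa, sb = nodes[i], nodes[j], hv(), hv(), hv()
    W += (np.outer(e, e)
          + np.outer(H(sa) * (e - x), x * sa)
          + np.outer(H(sb) * (y - e), e * sb))
W /= N

sigma_pred = np.sqrt((N_Z + 3 * N_E) / N)                      # assumed scaling
sigma_meas = float(np.mean([np.std(W @ x - x) for x in nodes]))  # measured spread
sigma_paper = np.sqrt((8 + 3 * 16) / 10_000)                   # parameters in the text
```

With N = 10,000 the assumed expression gives about 0.075, in line with the value quoted in the text; the measured spread on the smaller network should agree with `sigma_pred` up to finite-sample fluctuations.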

B. Network robustness
One of the advantages of attractor neural networks that make them suitable as plausible biological models is their robustness to imperfect weights (Amit 1989). That is, individual synapses may have very few bits of precision or become damaged, yet the relevant brain region must still be able to carry out its functional task. To this end, we subjected the network presented here to similar non-idealities, to check that the network retains the feature of global stability and robustness despite being implemented with low-precision and noisy weights. In the first of these tests, the ideal weights matrix $W^{\text{ideal}}$ was binarised and then additive noise was applied, via

$$W^{\text{noisy}}_{ij} = \mathrm{sgn}\big(W^{\text{ideal}}_{ij}\big) + \sigma_{\text{noise}}\, \chi_{ij} \qquad (28)$$

where $\chi_{ij} \in \mathbb{R}$ are independently sampled standard Gaussian variables, sampled once during matrix construction, and $\sigma_{\text{noise}} \in \mathbb{R}$ is a scaling factor on the strength of noise being imposed. The $\mathrm{sgn}(\cdot)$ function forces the weights to be bipolar, emulating that the synapses may have only 1 bit of precision, while the $\chi_{ij}$ random variables act as a smearing on the weight state, emulating that the two weight states have a finite width. A $\sigma_{\text{noise}}$ value of 2 thus corresponds to the magnitude of the noise being equal to that of the signal (whether $W^{\text{ideal}}_{ij} \geq 0$), and so, for example, for a damaged weight value of $W^{\text{noisy}}_{ij} = +1$ there is a 38% chance that the pre-damaged weight was $W^{\text{ideal}}_{ij} = -1$. This level of degradation is far worse than is expected even from novel binary memory devices (Xia and Yang 2019), and presumably also for biology. We used the same set of hypervectors and sequence of inputs as in Figure 2, but this time using the degraded weights matrix $W^{\text{noisy}}$, to test the network's robustness. The results are shown in Figure 3 for weight degradation values of $\sigma_{\text{noise}} = 2$ and $\sigma_{\text{noise}} = 5$, corresponding to signal-to-noise ratios (SNRs), $20\log_{10}(2/\sigma_{\text{noise}})$, of 0 dB and approximately −8 dB respectively. We see that for $\sigma_{\text{noise}} = 2$ the attractor network performs the walk just as well as in Figure 2, which used the ideal weights matrix, despite the fact that here the binary weight distributions overlap each other considerably. Furthermore, we have that $d(\mathbf{z}_t, \mathbf{x}^\nu) \approx 1$ where $\mathbf{x}^\nu$ is the attractor that the network should be inhabiting at any time, indicating that the attractor stability and recall accuracy are unaffected by the non-idealities. For $\sigma_{\text{noise}} = 5$, a scenario where the realised weight carries very little information about the ideal weight's value, we see that the network nonetheless continues to function, performing the correct walk between attractor states. However, there is a degradation in the recall of stored attractor states, with the network state no longer converging to a similarity of 1 with the stored attractor states.

Fig. 2: ... The similarity of the network state $\mathbf{z}_t$ to stored node attractor states $\mathbf{x} \in X_{AN}$ and stored edge states $\mathbf{e}$ respectively, computed via the inner product (Equation 3). d) The similarity of the network state $\mathbf{z}_t$ to the sparse output states $\mathbf{r} \in R_{AN}$. All similarities have been labelled with the state they represent and the colours are purely illustrative. The attractor transitions shown here are explicitly state-dependent, as can be seen from the repeated input of the stimulus "father is", which results in a transition to state "Kronos" when in "Hades", but to "Uranus" when in "Kronos". Additionally, the network is unaffected by nonsense input that does not correspond to a stored edge, as the network remains in the attractor "Uranus" when presented with the stimulus "father is".
For greater values of σ noise , the network ceases to perform the correct walk, and indeed does not converge on any stored attractor state (not shown).
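The weight degradation and the figures quoted for it can be reproduced in a few lines. The sketch below assumes the noisy weights are formed as sgn(W_ideal) + σ_noise·χ with standard Gaussian χ, computes the SNR as 20·log10(2/σ_noise) (taking the separation of 2 between the two weight states as the signal), and evaluates the posterior probability that an observed weight of +1 came from an ideal weight of −1. A small plain Hopfield memory (N = 1000, P = 5) stands in for W_ideal; that choice and all function names are illustrative.

```python
import numpy as np
from math import exp, log10

rng = np.random.default_rng(4)

def degrade(W_ideal, sigma_noise):
    """Binarise weights to +-1, then add i.i.d. Gaussian noise of std sigma_noise."""
    binary = np.where(W_ideal >= 0, 1.0, -1.0)
    return binary + sigma_noise * rng.standard_normal(W_ideal.shape)

def snr_db(sigma_noise, separation=2.0):
    """SNR in dB, with the distance between the two weight states as signal."""
    return 20 * log10(separation / sigma_noise)

def p_ideal_negative(sigma_noise, observed=1.0):
    """P(ideal = -1 | noisy = observed), for equally likely +-1 ideal weights."""
    like_pos = exp(-(observed - 1.0) ** 2 / (2 * sigma_noise ** 2))
    like_neg = exp(-(observed + 1.0) ** 2 / (2 * sigma_noise ** 2))
    return like_neg / (like_pos + like_neg)

# Stand-in for W_ideal: a plain Hopfield memory with a few stored patterns.
N, P = 1000, 5
X = rng.choice(np.array([-1.0, 1.0]), size=(P, N))
W_noisy = degrade(X.T @ X / N, sigma_noise=2.0)

z = X[0].copy()
z[rng.choice(N, size=N // 20, replace=False)] *= -1   # corrupt 5% of the cue
for _ in range(10):
    z = np.where(W_noisy @ z >= 0, 1.0, -1.0)         # synchronous sign updates
overlap = float(z @ X[0]) / N
```

Under these assumptions `snr_db(2.0)` is 0 dB, `snr_db(5.0)` is about −8 dB, and `p_ideal_negative(2.0)` is about 0.38, while the stand-in memory still recalls its stored pattern from the degraded weights.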
Fig. 3: The attractor network performing a walk as in Figure 2, but using the damaged weights matrix $W^{\text{noisy}}$, whose entries have been binarised and then perturbed by independent additive noise, as per Equation 16. a) The distribution of weights after damage with noise of magnitude $\sigma_{\text{noise}} = 2$, corresponding to an SNR of 0 dB. Weights whose ideal values were positive or negative have been plotted separately. b) The similarity of the network state $z_t$ to stored node hypervectors, with the network using the weights from a). Shown above is the sequence of inputs given to the network, identical to that in Figure 2. c) The distribution of weights damaged with $\sigma_{\text{noise}} = 5$, corresponding to an SNR of −8 dB. d) The similarity of the network state to stored node hypervectors, but with the network using the damaged weights from c). The network transitions are thus highly robust to unreliable weights, and show a gradual degradation in performance, even when the network's weights are highly imprecise and noisy. For both b) and d) the edge state and output similarity plots have been omitted for visual clarity.
Fig. 4: The attractor network performing a walk as in Figure 2, but using a sparse ternary weights matrix $W^{\text{sparse}} \in \{-1, 0, 1\}^{N \times N}$, generated via Equation 29. The weights matrices for a) and b) are 98% and 99% sparse respectively. Shown are the similarities of the network state $z_t$ with stored node hypervectors $x \in X_{\text{AN}}$, with the applied stimulus hypervector at any time shown above. We see that even when 98% of the entries in $W$ are zeroes, the network continues to function with negligible loss in stability, as the correct walk between attractor states is performed, and the network converges on stored attractors with similarity $d(z_t, x) \approx 1$. At 99% sparsity there is a degradation in the accuracy of stored attractors, with the network converging on states with $d(z_t, x) < 1$, but with the correct walk still being performed. Beyond 99% sparsity the attractor dynamics break down (not shown). Thus, although a large number of neurons $N$ is required to enforce state pseudo-orthogonality, the network requires far fewer than $N^2$ nonzero weights to function robustly.
A further test of robustness was to restrict the weights matrix to be sparse, as dense all-to-all connectivity may not be feasible in biology, where synaptic connections are spatially constrained and have an associated chemical cost. Similar to the previous test, the sparse weights matrix was generated via
$$W^{\text{sparse}}_{ij} = \operatorname{sgn}\!\big(W^{\text{ideal}}_{ij}\big)\, H\!\big(|W^{\text{ideal}}_{ij}| - \theta\big) \qquad (29)$$
where $H(\cdot)$ is the Heaviside function and $\theta$ is a threshold set such that $W^{\text{sparse}} \in \{-1, 0, 1\}^{N \times N}$ has the desired sparsity. Through this procedure, only the most extreme weight values are allowed to be nonzero. Since the terms inside $W^{\text{ideal}}$ are symmetrically distributed around 0, there are approximately as many +1 entries in $W^{\text{sparse}}$ as −1 entries. Using the same hypervectors and sequence of inputs as before, an attractor network performing a walk using the sparse weights matrix $W^{\text{sparse}}$ is shown in Figure 4, with sparsities of 98% and 99%. We see that for the 98% sparse case, there is again very little difference from the ideal case shown in Figure 2, with the network still having a similarity of $d(z_t, x) \approx 1$ with stored attractor states, and performing the correct walk. When the sparsity is pushed further to 99%, however, we see that although the network still performs the correct walk, the attractor states are again slightly degraded, with the network converging on states with $d(z_t, x^\nu) < 1$ with stored attractor states $x^\nu$. For greater sparsities, the network ceases to perform the correct walk, and again does not converge on any stored attractor state (not shown). These two tests thus highlight the extreme robustness of the model to imprecise and unreliable weights. The network may be implemented with 1-bit precision weights whose distributions are entirely overlapping, or have 98% of its weights set to 0, and still continue to function without any discernible loss in performance. The extent to which the weights matrix may be degraded while the network remains stable is of course a function not only of the level of degradation, but also of the size of the network $N$, as well as the number of FSM states $N_Z$ and edges $N_E$ stored within the network. For conventional Hopfield models with Hebbian learning, these two factors are normally treated alike in theory, as contributing an effective noise to the postsynaptic sum as in Equation 19, and so the magnitude of withstandable synaptic noise increases with increasing $N$ (Amit 1989; Sompolinsky 1987). Although a thorough mathematical investigation into the scaling of weight degradation limits is justified, as a first result we have here given numerical data showing stability even in the most extreme cases of non-ideal weights, and expect that any implementation of the network with novel devices would be far away from such extremities.
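The two weight degradations discussed in this subsection can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the procedure used for the figures: the function names are hypothetical, the noise is assumed to be Gaussian, the stand-in for the ideal matrix contains only Hebbian attractor terms (no transition terms), and the threshold is picked as a quantile of the weight magnitudes.

```python
import numpy as np

def noisy_binary_weights(w_ideal, sigma_noise, rng):
    """Binarise weights to +/-1, then add independent Gaussian noise (assumed)."""
    return np.sign(w_ideal) + sigma_noise * rng.standard_normal(w_ideal.shape)

def sparse_ternary_weights(w_ideal, sparsity):
    """Keep only the signs of the largest-magnitude weights; zero the rest."""
    theta = np.quantile(np.abs(w_ideal), sparsity)  # threshold for target sparsity
    # Ties at theta are dropped, so the realised sparsity may exceed the target.
    return np.sign(w_ideal) * (np.abs(w_ideal) > theta)

def overlap_after_one_step(w, x):
    """Normalised overlap between x and the state after one synchronous update."""
    return float(np.sign(w @ x) @ x) / x.size

rng = np.random.default_rng(0)
n, p = 2000, 25
patterns = rng.choice([-1.0, 1.0], size=(p, n))
w_ideal = patterns.T @ patterns / n  # stand-in: Hebbian attractor terms only

w_noisy = noisy_binary_weights(w_ideal, sigma_noise=2.0, rng=rng)
w_sparse = sparse_ternary_weights(w_ideal, sparsity=0.98)
overlap_noisy = overlap_after_one_step(w_noisy, patterns[0])
overlap_sparse = overlap_after_one_step(w_sparse, patterns[0])
```

In this toy setting a stored pattern should remain very close to a fixed point under either degraded matrix, which mirrors the qualitative behaviour reported above without reproducing the FSM walk itself.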

C. Asynchronous updates
Another useful property of Hopfield networks is the ability to function robustly even with asynchronously updating neurons, wherein not every neuron experiences a simultaneous state update. This property is especially important for any architecture claiming to be biologically plausible, as biological neurons update asynchronously and largely independently of each other, without the need for global clock signals. To this end, we ran a similar experiment to that in Figure 2, using the undamaged weights matrix $W^{\text{ideal}}$, but with an asynchronous neuron update rule, wherein on each time step every neuron has only a 10% chance of updating its state. The remaining 90% of the time, the neuron retains its state from the previous time step, regardless of its postsynaptic sum. There is thus no fixed order of neuron updates, and indeed it is not even guaranteed that a given neuron will update within any finite time. To account for the slower dynamics of the network state, the time for which inputs were presented to the network, as well as the periods without any input, was increased from 10 to 40 time steps. To make the gradual state transition easy to view, three of the node hypervectors were chosen to be columns of the $N$-dimensional Hadamard matrix, rather than being randomly generated. The results are shown in Figure 5, for a shorter sequence of stimulus inputs. We see that the network functions as intended, but now converges on the correct attractors gradually over many time steps rather than in a single update. The model proposed here is thus not reliant on synchronous dynamics, which is important not only for biological plausibility, but also when considering possible implementations on asynchronous neuromorphic hardware (Davies et al. 2018; Liu et al. 2014).
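The stochastic update rule described above can be sketched as follows. This is a minimal illustration on a plain Hebbian Hopfield network, not the FSM network of this article; the function name `async_step`, the network size, and the corrupted starting state are assumptions made for the example.

```python
import numpy as np

def async_step(w, z, p_update, rng):
    """Each neuron updates with probability p_update; the rest keep their state."""
    proposed = np.where(w @ z >= 0, 1.0, -1.0)
    updating = rng.random(z.shape[0]) < p_update
    return np.where(updating, proposed, z)

rng = np.random.default_rng(1)
n, p = 1000, 5
patterns = rng.choice([-1.0, 1.0], size=(p, n))
w = patterns.T @ patterns / n  # plain Hebbian weights (assumption)

# Start from a corrupted copy of pattern 0 with 30% of its entries flipped.
z = patterns[0].copy()
flipped = rng.choice(n, size=300, replace=False)
z[flipped] *= -1

overlaps = []
for _ in range(40):
    z = async_step(w, z, p_update=0.1, rng=rng)
    overlaps.append(float(z @ patterns[0]) / n)
```

The recorded overlaps should rise gradually towards 1 over the 40 steps, rather than jumping there in one update as in the synchronous case.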

D. Storage capacity
It is well known that the storage capacity of a Hopfield network, the number of patterns $P$ that can be stored and reliably retrieved, is proportional to the size of the network, via $P < 0.14N$ (Amit 1989; Hopfield 1982). When one tries to store more than $P$ attractors within the network, the so-called memory blackout occurs, after which no pattern can be retrieved. We thus perform numerical simulations for a large range of attractor network and FSM sizes, to see if an analogous relationship exists. In other words, for an attractor network of finite size $N$, what sizes of FSM can the network successfully emulate?
For a given $N$, number of FSM states $N_Z$ and edges $N_E$, a random FSM was generated and an attractor network constructed to represent it as described in Section III. To ensure a reasonable FSM was generated, the FSM's graph was first generated to have all nodes connected in a sequential ring structure, i.e. every state $\chi_\nu \in X_{\text{FSM}}$ connects to $\chi_{(\nu+1) \bmod N_Z}$. The remaining edges between nodes were selected at random, until the desired number of edges $N_E$ was reached. For each edge an associated stimulus is then required. Although one option would be to allocate as few unique stimuli as possible, so that the state transitions are maximally state-dependent, this results in some advantageous cancellation effects between the $E^\eta$ transition terms and the stored attractors $x^\nu x^{\nu\top}$. To instead probe a worst-case scenario, each edge was assigned a unique stimulus.
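The FSM generation procedure above can be sketched as follows. This is an illustrative reading of the description, not the authors' code: the function name `random_fsm` is hypothetical, and allowing self-loops while disallowing duplicate (source, target) pairs are assumptions the text does not settle.

```python
import random

def random_fsm(n_states, n_edges, seed=0):
    """Ring of n_states nodes plus random extra edges, one unique stimulus per edge."""
    if not n_states <= n_edges <= n_states * n_states:
        raise ValueError("need n_states <= n_edges <= n_states**2")
    rng = random.Random(seed)
    # Sequential ring: state v connects to state (v + 1) mod n_states.
    edges = {(v, (v + 1) % n_states) for v in range(n_states)}
    # Remaining edges are drawn at random until the target count is reached.
    while len(edges) < n_edges:
        edges.add((rng.randrange(n_states), rng.randrange(n_states)))
    # Worst case from the text: every edge gets its own stimulus label.
    return [(src, stim, dst) for stim, (src, dst) in enumerate(sorted(edges))]

fsm = random_fsm(n_states=8, n_edges=20)
```

Each returned triple is (source state, stimulus index, target state); a random valid walk can then be drawn by repeatedly following an outgoing edge of the current state.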
With the FSM now generated, an attractor network with $N$ neurons was constructed as previously described. An initial attractor state was chosen at random, and then a random valid walk between states was chosen to be performed (chosen arbitrarily to be of length 6, corresponding to each run taking 180 time steps). The corresponding sequence of stimuli was input to the attractor network via the same procedure as in Figure 2, each masking the network state in turn. Each run was then evaluated to have either passed or failed, with a pass meaning that the network state inhabited the correct attractor state with overlap $d(z_t, x^\nu) > 0.5$ in the middle of all intervals when it should be in a certain node attractor state. This 0.5-criterion was chosen since, for a set of orthogonal hypervectors, at most one hypervector may satisfy the criterion at once. A pass thus corresponds to the network performing the correct walk between attractor states. The results are shown in Figure 6. We see that for a given $N$, there is a linear relationship between the number of nodes $N_Z$ and the number of edges $N_E$ in the FSM that can be implemented before failure. That this trade-off exists is not surprising, since both contribute additively to the SNR within the attractor network (Equation 19). For each $N$, a linear Support Vector Machine (SVM) was fitted to the data, to find the separating boundary at which failure and success of the walk are approximately equiprobable. The boundary is given by $N_Z + \beta N_E = c(N)$, where $\beta$ represents the cost of adding an edge relative to that of adding a node, and $c(N)$ is an offset. For all of the fitted boundaries, the value of $\beta$ was found to be approximately constant, with $\beta = 2.2 \pm 0.1$, and so is assumed to be independent of $N$. For every value of $N$, we define the capacity $C$ to be the maximum size of FSM which can be implemented before failure, for which $N_Z = N_E = C$; this is also plotted in Figure 6. A linear fit reveals an approximate proportionality relationship of $C(N) \approx 0.029N$. Combining these two results, the boundary which limits the size of FSM which can be emulated is then given by
$$N_Z + 2.2\, N_E < 0.10\, N.$$
It is expected that additional edges consume more of the network's storage capacity than additional nodes, since for every edge, 5 additional terms are added to $W$ (Equation 25), contributing 3× as much cross-talk noise as adding a node would (Equation 19). We can compare this storage capacity relation with that of the standard Hopfield model by considering the case $N_E = 0$, i.e. there are no transition terms in the network, and so the network is identical to a standard Hopfield network. In this case, our failure boundary would become $N_Z < 0.10N$, in comparison to Hopfield's $P < 0.14N$.
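As a quick worked example of the fitted relations quoted above ($\beta \approx 2.2$, $C(N) \approx 0.029N$, and a boundary that reduces to $N_Z < 0.10N$ when $N_E = 0$), the following sketch checks whether a given FSM size falls on the passing side of the empirical boundary. The function names are hypothetical, and the constants are the rounded fit values from the text, so results near the boundary should be read as approximate.

```python
def fits_dense_network(n_neurons, n_states, n_edges, beta=2.2, slope=0.10):
    """Empirical boundary N_Z + beta * N_E < slope * N for dense bipolar states."""
    return n_states + beta * n_edges < slope * n_neurons

def dense_capacity(n_neurons, per_neuron=0.029):
    """Largest FSM with N_Z = N_E = C, using the fitted C(N) ~ 0.029 N."""
    return round(per_neuron * n_neurons)

n = 10_000
small_ok = fits_dense_network(n, n_states=300, n_edges=300)   # 300 + 660 < 1000
too_big = fits_dense_network(n, n_states=400, n_edges=300)    # 400 + 660 > 1000
hopfield_like = fits_dense_network(n, n_states=999, n_edges=0)
capacity = dense_capacity(n)
```

With no edges the check reduces to the Hopfield-like bound on the number of stored node attractors, which makes the comparison with $P < 0.14N$ explicit.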

E. Storage capacity with sparse states
The same FSM as shown in Figure 1 was embedded into an attractor network via the construction scheme described in Section III-C, with $N = 10{,}000$ neurons and coding level $f = 0.1$. To enforce the correct sparsity in the neural state, the $\operatorname{sgn}(\cdot)$ activation function (Equation 10) was replaced with a top-k activation function (also known as "k-Winners-Take-All")
$$z_{t+1,\text{sp}} = H\big(W z_{t,\text{sp}} - \theta\big)$$
where $H(\cdot)$ is a component-wise Heaviside function, and $\theta$ is chosen to be the $(Nf)$-th largest value of $W z_{t,\text{sp}}$, to enforce that $z_{t+1,\text{sp}}$ is $f$-sparse. While a stimulus hypervector $s \in \{-1, 1\}^N$ is being applied as a mask to the network, the activation function is modified analogously, with $\theta$ being chosen in the same manner. Note that although the introduction of this adaptive threshold mechanism may seem to be somewhat biologically implausible, or at least a tall order for any possible neural implementation, it may easily be implemented using a suitably connected population of inhibitory feedback neurons, which silence all attractor neurons except those that receive the greatest input (Amari 1989; Lin et al. 2014). The sparse attractor network is shown performing a walk between the correct attractor states in Figure 7, as a sequence of stimuli is applied as input to the network. In contrast to the dense bipolar case, the maximum overlap between the network state $z_{t,\text{sp}}$ and a stored attractor state $x^\nu_{\text{sp}}$ is now $d(z_{t,\text{sp}}, x^\nu_{\text{sp}}) = f = 0.1$, while the expected overlap between unrelated states is $f^2 = 0.01$ rather than 0.
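A top-k activation of the kind described above can be sketched as follows. This is a minimal illustration, not the article's implementation: the function name `top_k_activation` is hypothetical, ties are broken arbitrarily so that exactly k units are active, and the weights use a covariance-style rule on random sparse patterns as an assumed stand-in for the construction of Section III-C.

```python
import numpy as np

def top_k_activation(h, k):
    """Binary vector that is 1 at the k largest entries of h and 0 elsewhere."""
    z = np.zeros(h.shape[0])
    z[np.argpartition(h, -k)[-k:]] = 1.0
    return z

rng = np.random.default_rng(2)
n, f, p = 1000, 0.1, 5
k = round(n * f)

# Random f-sparse binary patterns with exactly k active units each.
patterns = np.zeros((p, n))
for x in patterns:
    x[rng.choice(n, size=k, replace=False)] = 1.0

# Covariance-style attractor weights (assumed stand-in, no transition terms).
w = (patterns - f).T @ (patterns - f)

# Cue: pattern 0 with half of its active units silenced.
cue = patterns[0].copy()
cue[np.flatnonzero(cue)[: k // 2]] = 0.0
z_next = top_k_activation(w @ cue, k)
```

Selecting the k largest inputs directly is equivalent to thresholding at the k-th largest value whenever there are no ties, and it keeps the state exactly f-sparse after every update.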
Fig. 7 (caption, in part): The overlap between the network state $z_{t,\text{sp}}$ and stored node attractor states $x_{\text{sp}}$ and stored edge attractor states $e_{\text{sp}}$ respectively, computed via the inner product (Equation 3). Note that since the network and attractor states are now sparse binary, the maximum possible overlap value is $f = 0.1$, while independently generated states have an expected overlap of $f^2 = 0.01$.
We now apply the same procedure as in the dense case for determining the memory capacity of the sparse-activity attractor network. For direct comparison with the dense case, we define the memory capacity $C(N)$ to be the largest FSM with $N_E = N_Z$ for which walk success and failure are equiprobable. For every tested $(N, f, N_Z)$ tuple we generate a corresponding set of hypervectors and weights matrix as discussed in Section III-C, and then randomly choose a walk between 6 node attractor states to be completed. The chosen walk then determines the sequence of stimuli to be input, and each stimulus is then applied for 10 time steps. Each $(N, f, N_Z)$ tuple was then determined to have passed or failed, with a success criterion that $d(z_{t,\text{sp}}, x^\nu_{\text{sp}}) > \tfrac{1}{2}(f + f^2)$ in the middle of all intervals when the network should be in a certain node attractor state. This criterion was chosen as it is the sparse analogue of that used in the dense case: at most one attractor state may satisfy it at any time.
The results are shown in Figure 8. We see that for a fixed number of neurons $N$, the size of FSM that may be stored initially increases as $f$ is decreased, but below a certain $f$ value drops off rapidly. To estimate the optimal coding level $f$ and maximum FSM size $N_Z$ for an attractor network of size $N$, we apply a 2D Gaussian convolutional filter with standard deviation 3 over the grid of successes/failures for each $N$ value separately, in order to obtain a kernel density estimate (KDE) $p_{\text{KDE}}$ of the walk success probability. The capacity $C(N)$ was then obtained by taking the maximum $N_Z$ value for which $p_{\text{KDE}} \geq 0.5$. This procedure was chosen in order to be comparable to that performed in the dense bipolar case (Figure 6), where a linear separation boundary between success and failure was used instead. Plotting capacity $C$ against $N$ and applying a linear fit in the log-log domain reveals a scaling relation of $C \sim N^{1.90}$. This approximately quadratic scaling in the sparse case is a vast improvement over the linear scaling shown in the dense case (Figure 6), and is in keeping with the theoretical scaling estimates of $P_{\max} \sim N^2/(\log N)^2$ for sparsely-coded binary attractor networks (Amari 1989). The optimal coding level $f$ is also shown, and a linear fit in the log-log domain implies a scaling relation of the form $f \sim N^{-0.949}$. Again, this is similar to the theoretically optimal $f(N)$ scaling relation for sparse binary attractor networks, where the coding level scales like $f \sim (\log N)/N$ (Amari 1989).
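The log-log fit used to extract a scaling exponent can be sketched as follows. The data here are synthetic and noise-free, generated from an exact power law purely to show the mechanics; they are not the measurements behind Figure 8, and the function name and prefactor are hypothetical.

```python
import numpy as np

def power_law_fit(n_values, capacities):
    """Least-squares line in the log-log domain; returns (exponent, prefactor)."""
    slope, intercept = np.polyfit(np.log(n_values), np.log(capacities), deg=1)
    return slope, np.exp(intercept)

# Synthetic data for illustration only: C = 1e-4 * N**1.9 exactly.
n_values = np.array([1000.0, 2000.0, 4000.0, 8000.0])
capacities = 1e-4 * n_values ** 1.9

exponent, prefactor = power_law_fit(n_values, capacities)
```

On real pass/fail data the capacities would first be estimated per network size (for example from the smoothed success probability), and the fitted exponent would carry an uncertainty that this noise-free example does not show.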

V. RELATION TO OTHER ARCHITECTURES

A. FSM emulation
While there is a large body of work concerning the equivalence between RNNs and FSMs, their implementations broadly fall into a few categories. There are those that require iterative gradient descent methods to mimic an FSM (Das and Mozer 1994; Lee Giles et al. 1995; Pollack 1991; Zeng et al. 1993), which makes them difficult to train for large FSMs, and improbable for use in biology. There are those that require creating a new FSM with an explicitly expanded state set, $Z' := Z \times S$, such that there is a new state for every old state-stimulus pair (Alquézar and Sanfeliu 1995; Minsky 1967), which is unfavourable due to the explosion of (usually one-hot) states needing to be represented, as well as the difficulty of adding new states or stimuli iteratively. There are those that require higher-order weight tensors in order to explicitly provide a weight entry for every unique state-stimulus pair (Forcada and Carrasco 2001; Mali et al. 2020; Omlin et al. 1998) which, as well as being non-distributed, may be more difficult to implement, for example requiring the use of Sigma-Pi units (Groschner et al. 2022; Koch 1998) or a large number of hidden neurons with 2-body synaptic interactions only (Krotov and Hopfield 2021).
In Recanatesi et al. 2017, transitions are triggered by adiabatically modulating a global inhibition parameter, such that the network may transition between similar stored patterns. Lacking, however, is a method to construct a network to perform arbitrary, controllable transitions between states. In Chen and Miller 2020, an in-depth analysis of small populations of rate-based neurons is conducted, wherein synapses with short-term synaptic depression enable a rich behaviour of itinerancy between attractor states, but the approach does not scale to large systems and arbitrary stored memories.
Most closely resembling our approach, however, are earlier works concerned with the related task of creating a sequence of transitions between attractor states in Hopfield-like neural networks. The majority of these efforts rely upon the use of synaptic delays, such that the postsynaptic sum on a time step $t$ depends, for example, also on the network state at time $t-10$, rather than just $t-1$. These delay synapses thus allow attractor cross-terms of the form $x^{\nu+1} x^{\nu\top}$ to become influential only after the network has inhabited an attractor state for a certain amount of time, triggering a walk between attractor states (Kleinfeld 1986; Sompolinsky and Kanter 1986). This then also allowed for the construction of networks with state-dependent input-triggered transitions (Amit 1988; Drossaers 1992; Gutfreund and Mezard 1988). Similar networks were shown to function without the need for synaptic delays, but require fine tuning of network parameters and suffer from extremely low storage capacity (Amit 1989; Buhmann and Schulten 1987). In any case, the need for synaptic delay elements represents a large requirement on any substrate which might implement such a network, and such elements are indeed problematic to implement in neuromorphic systems (Nielsen et al. 2017).
State-dependent computation in spiking neural networks was realised in Neftci et al. 2013 and Liang et al. 2019, where population attractor dynamics were used to achieve robust state representations via sustained spiking activity. Additionally, these works highlight the need for robust-yet-flexible neural state machine primitives, if one is to succeed in designing intelligent end-to-end neuromorphic cognitive systems. These approaches differ from this work, however, in that the state representations are still fundamentally population-based rather than distributed, and so pose difficulties such as the requirement of finding a new population of neurons to represent any new state (Rutishauser and Douglas 2009).
Rigotti et al. 2010 discuss the need for a mechanism to induce flips in the neuron state (i.e. an operation akin to a Hadamard product) in order to directly implement nontrivial switching between different attractor states, but disqualify such a mechanism from plausibly existing using synaptic currents alone. We also reject such a mechanism as a biologically plausible solution, but on the grounds that it would not function robustly in an asynchronous neural system (see Section VII-C). They instead show the necessity of a population of neurons with mixed selectivity, connected to both the input and attractor neurons, in order to achieve the desired attractor itinerancy dynamics. This requirement arose by demanding that the network state switch to resembling the target state immediately upon receiving a stimulus. We instead show that similar results can be achieved without this extra population, if we relax this demand and require only that the network soon evolve to the target state.
The main contribution of this article is thus to introduce a method by which attractor networks may be endowed with state-dependent attractor-switching capabilities, without requiring biologically implausible elements or components which are expensive to implement (e.g. precise synaptic delays), and which can be scaled up efficiently. The extension to arbitrary FSM emulation shows the generality of the method, and that its limitations can be overcome by appropriate modifications, such as introducing the edge state attractors (Section VII-D).

B. VSA embeddings
This work also differs from more conventional methods to implement graphs and FSMs in VSAs (Kleyko, Rachkovskij, et al. 2022; Osipov et al. 2017; Poduval et al. 2022; Teeters et al. 2023; Yerxa et al. 2018), in that the network state does not need to be read by an outsider in order to implement the state transition dynamics. That is, where in previous works a graph is encoded by a hypervector (or an associative memory composed of hypervectors) such that the desired dynamics and outputs may be reliably decoded by external circuitry, we instead encode the graph's connectivity within the attractor network's weights matrix, such that its recurrent neural dynamics realise the desired state machine behaviour.
The use of a Hopfield network as an auto-associative cleanup memory in conjunction with VSAs has been explored in previous works, including theoretical analyses of their capacity to store bundled hypervectors with different representations (Clarkson et al. 2023), and using single attractor states to retrieve knowledge structures from partial cues (Steinberg and Sompolinsky 2022). Further links between VSAs and attractor networks have also been demonstrated with the use of complex phasor hypervectors, rather than binary or bipolar hypervectors, being stored as attractors within phasor neural networks (Frady and Sommer 2019; Kleyko, Rachkovskij, et al. 2022; Noest 1987; Plate 2003). Complex phasor hypervectors are of particular interest in neuromorphic computing, since they may be very naturally implemented with spike-timing phasor codes, wherein the value represented by a neuron is encoded by the precise timing of its spikes with respect to other neurons or a global oscillatory reference signal, and hypervector binding may be implemented by phase addition (Auge et al. 2021; Orchard and Jarvis 2023).
In Osipov et al. 2017 the authors show the usefulness of VSA representations for synthesising state machines from observable data, which might be combined with this work to realise a neural system that can synthesise appropriate attractor itinerancy dynamics to best fit observed data. Similarly, if equally robust attractor-based neural implementations of other primitive computational blocks, such as a stack, could be created, then they might be combined to create more complex VSA-driven cognitive computational structures, such as neural Turing machines (Graves et al. 2014; Grefenstette et al. 2015; Yerxa et al. 2018). Looking further, this combined with the end-to-end trainability of VSA models could pave the way for neural systems which have the explainability, compositionality and robustness thereof, but the flexibility and performance of deep neural networks (Hersche et al. 2023; Schlag et al. 2020).

VI. BIOLOGICAL PLAUSIBILITY
Transitions between discrete neural attractor states are thought to be a crucial mechanism for performing context-dependent decision making in biological neural systems (Daelli and Treves 2010; Mante et al. 2013; Miller 2016; Tajima et al. 2017). Attractor dynamics enable a temporary retention of received information, and ensure that irrelevant inputs do not produce stable deviations in the neural state. Such networks are widely theorised to exist in the brain, for example in the hippocampus for its pattern completion and working memory capabilities (Khona and Fiete 2022; Rolls 2013). As such, we showed that a Hopfield attractor network and its sparse variant can be modified such that they can perform stimulus-triggered state-dependent attractor transitions, without resorting to additional biologically implausible mechanisms and while abiding by the principles of distributed representation. The changes we introduced are a) an altered weights matrix construction with additional asymmetric cross-terms (which does not incur any considerable extra complexity) and b) the ability for a stimulus to mask a subset of neurons within the attractor population. As long as such a masking mechanism exists, the network proposed here could thus map onto brain areas theorised to support attractor dynamics. The masking mechanism could, for example, feasibly be achieved by a population of inhibitory neurons representing the stimuli, which selectively project to neurons within the attractor population.

A. Robustness
The robust functioning of the network despite noisy and unreliable weights is a crucial prerequisite for the model to plausibly be able to exist in biological systems. As we have shown, the network weights may be considerably degraded without affecting the behaviour of the network, and indeed beyond this the network exhibits a so-called graceful degradation in performance. Furthermore, biological synapses are expected to have only a few bits of precision (Baldassi et al. 2016; Bartol et al. 2015; O'Connor et al. 2005), and the network has been shown to function even in the worst case of binary weights. These properties stem from the massive redundancy arising from storing the attractor states across the entire synaptic matrix in a distributed manner, a technique that the brain is expected to utilise (Crawford et al. 2016; Rumelhart and McClelland 1986). Of course, we expect there to be a trade-off between the amount of each non-ideality that the network can withstand before failure. That is, an attractor network with dense noisy weights may withstand a greater degree of synaptic noise than if the weights matrix were also made sparse. Likewise, larger networks storing the same sized FSM should be able to withstand greater non-idealities than smaller networks, as is the case for attractor networks in general (Amit 1989; Sompolinsky 1987).
Since the network is still an attractor network, it retains all of the properties that make them suitable for modelling cognitive function, such as that the network can perform robust pattern completion and correction, i.e. the recovery of a stored prototypical memory given a damaged, incomplete or noisy version, and thereafter function as a stable working memory (Amit 1989;Hopfield 1982).
The robustness of the network to weight non-idealities also makes it a prime candidate for implementation with novel memristive crossbar technologies, which would allow the matrix-vector multiplication required in the neural state update (Equation 14) to be performed efficiently, at high density, and in one operation (Ielmini and Wong 2018; Verleysen and Jespers 1989; Xia and Yang 2019). Akin to the biological synapses they emulate, such devices also often have only a few bits of precision, and suffer from considerable per-device mismatch in the programmed conductance states. The network proposed in this article is thus highly suitable for implementation with such architectures, as we have shown that robust performance is retained even when the network is subjected to a very high degree of such non-idealities.
The continued functionality of the network when its dynamics are asynchronous is another important factor when considering its biological plausibility. In a biological neural system, neurons will produce action potentials whenever their membrane potential happens to exceed the neuron's spiking threshold, rather than all updating synchronously at fixed time intervals. We tested the regime where the timescale of the neuron dynamics is much slower than the timescale of the input, by replacing the synchronous neuron update rule with a stochastic asynchronous variant thereof, and showed that the network is robust to this asynchrony. Similarly, we tested the regime where neuron dynamics are much faster than the input, by considering input which is applied stochastically and asynchronously instead (Section VII-C). The continued robustness of the model in these two extreme asynchronous regimes implies that the network is dependent neither upon the exact timing of inputs to the network, nor on the exact timing of neuron updates within the network, and so would function robustly both in biological neural systems and in asynchronous neuromorphic systems where the exact timing of events cannot be guaranteed (Davies et al. 2018; Liu et al. 2014).

B. Learning
The procedure for generating the weights matrix $W$, as a result of its simplicity, makes the proposed network more biologically plausible than other more complex approaches, e.g. those utilising gradient descent methods. It can be learned in one shot in a fully online fashion, since adding a new node or edge involves only an additive contribution to the weights matrix, which does not require knowledge of irrelevant edges, nodes, their hypervectors, or the weight values themselves. Furthermore, as a result of the entirely distributed representation of states and transitions, new behaviours may be added to the weights matrix at a later date, both without having to allocate new hardware, and without having to recalculate $W$ with all previous data. Both of these factors are critical for continual online learning.
Evaluating the local learnability of $W$ to implement transitions is also necessary to evaluate the biological plausibility of the model. In the original paper by Hopfield, the weights could be learned using the simple Hebbian rule $\delta w_{ij} \propto x^\nu_i x^\nu_j$, where $x^\nu_i$ and $x^\nu_j$ are the activities of the post- and presynaptic neurons respectively, and $\delta w_{ij}$ is the online synaptic efficacy update (Hebb 1949; Hopfield 1982). While the attractor terms within the network can be learned in this manner, the transition cross-terms that we have introduced require an altered version of the learning rule. If we simplify our network construction by removing the edge state attractors, then the local weight update required to learn a transition between states can be written in terms of the hypervectors $y$, $x$ and $s$, as previously defined. In removing the edge states, we disallow FSMs with consecutive edges with the same stimulus (e.g. "father is, father is"), but this is not a problem if completely general FSM construction is not the goal per se (see Section VII-D, Figure 12). This state-transition learning rule is just as local as the original Hopfield learning rule, as the weight update from presynaptic neuron $j$ to postsynaptic neuron $i$ is dependent only upon information that may be made directly accessible in the pre- and postsynaptic neurons, and does not depend on information in other neurons to which the synapse is not connected (Khacef et al. 2022; Zenke and Neftci 2021).
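The one-shot, additive nature of Hebbian attractor learning can be sketched as follows. This covers only the standard attractor terms, in which each weight update is proportional to the product of post- and presynaptic activity; the transition cross-terms of this article are not reproduced here. The function name `hebbian_update` and the learning rate of 1/N are assumptions made for the example.

```python
import numpy as np

def hebbian_update(w, x, lr):
    """Add one attractor in place: dw_ij = lr * x_i * x_j (post times pre)."""
    w += lr * np.outer(x, x)
    return w

rng = np.random.default_rng(3)
n = 500
w = np.zeros((n, n))
stored = []

# Patterns are learned one at a time; no earlier pattern needs to be revisited.
for _ in range(4):
    x = rng.choice([-1.0, 1.0], size=n)
    hebbian_update(w, x, lr=1.0 / n)
    stored.append(x)

recalled = np.sign(w @ stored[0])
```

Because each update touches synapse (i, j) using only the activities of neurons i and j, the same outer-product update could in principle be applied in parallel across all synapses.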
From the hardware perspective, the locality of the learning rule means that if the matrix-vector multiplication step in the neuron state update rule is implemented using novel memristive crossbar circuits (Ielmini and Wong 2018; Xia and Yang 2019; Zidan and Lu 2020), then the weights matrix could be learned online and in-memory via a sequence of parallel conductance updates, rather than by computing the weights matrix offline and then writing the summed values to the devices' conductances. As long as the updates in the memristors' conductances are sufficiently linear and symmetric, then attractors and transitions could be sequentially learned in one shot and in parallel by specifying the two hypervectors in the outer product weight update at the crossbar's inputs and outputs by appropriately shaped voltage pulses (Alibart et al. 2013; Li et al. 2021).

C. Scaling
When the FSM states are represented by dense bipolar hypervectors within the attractor network, we found a linear scaling between the size of the network $N$ and the capacity $C$ in terms of the size of FSM that could be embedded without errors. Although this is in keeping with the results in the Hopfield paper, this is not a favourable result when considering the biological plausibility of the system for large $N$ (Hopfield 1982). Since the attractor network is fully connected, the capacity actually scales sublinearly, $C \sim \sqrt{N_{\text{syn}}}$, with the number of synapses $N_{\text{syn}}$, meaning that an increasing number of synapses are required per attractor and transition to be stored for large $N$, and so the network becomes increasingly inefficient. Additionally, the fact that every neuron is active at any time (or half of them, depending on interpretation of the −1 state) represents an unnecessarily large energy burden for any system utilising this model. This is in contrast to data from neural recordings, where a low per-neuron mean activity is ensured by the sparse coding of information (Barth and Poulet 2012; Olshausen and Field 2004; Rolls and Treves 2011).
We thus tested how the capacity of the network scales with $N$ when the FSM states are instead represented by sparse binary hypervectors with coding level $f$, since it is well known that the number of sparse binary vectors that can be stored in an attractor network scales much more favourably, $P \sim N^2/(\log N)^2$ (Amari 1989). We found indeed that the sparse coding of the FSM states vastly improved the capacity of the network, scaling approximately quadratically with $C \sim N^{1.90}$, and so approximately linearly in the number of synapses. This linear scaling with the number of synapses ensures not only the efficient use of available synaptic resources in biological systems, but is especially important when one considers a possible implementation in neuromorphic hardware, where the number of synapses usually represents the main size constraint, rather than the number of neurons (Davies et al. 2018; Manohar 2022).
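The synapse-count argument for the dense and sparse cases can be written out explicitly. The only assumption added here is that a fully connected network of $N$ neurons has $N_{\text{syn}} = N^2$ synapses (self-connections included); the exponents are the fitted values quoted in the text.

```latex
\begin{align*}
N_{\text{syn}} &= N^2 \\
\text{dense bipolar:}\quad C \sim N &= N_{\text{syn}}^{1/2} \\
\text{sparse binary:}\quad C \sim N^{1.90} &= N_{\text{syn}}^{0.95}
\end{align*}
```

The number of synapses needed per stored state or transition, $N_{\text{syn}}/C$, therefore grows like $N$ in the dense case but only like $N^{0.10}$ in the sparse case.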
The coding level $f$ was found to have an approximately inverse relationship with the attractor network size, $f \sim N^{-0.949}$, which would imply that the number of active neurons $Nf$ in any attractor state grows very slowly, $Nf \sim N^{0.051}$. This is in agreement with the theoretically optimal case, where the coding level for a sparse binary attractor network should scale like $f \sim (\log N)/N$, and so the number of active neurons in any pattern scales like $Nf \sim \log N$ (Amari 1989). Sparsity in the stored hypervectors is especially important when one considers how the weights matrix $W$ could be learned in an online fashion, if the synapses are restricted to have only a few bits of precision. So far we have considered quantisation of the weights only after the summed values have been determined, whereas including weight quantisation while new patterns are being iteratively learned is a much harder problem, and implies attractor capacity relations as poor as $P \sim \log N$. One solution is for the states to be increasingly sparse, in which case the optimal scaling of $P \sim N^2/(\log N)^2$ can be recovered (Amit and Fusi 1994; Brunel et al. 1998).
In short, by letting the FSM states be represented by sparse binary hypervectors rather than dense bipolar hypervectors, we not only move closer to a more biologically realistic model of neural activity, but also benefit from the superior scaling properties of sparse binary attractor networks, which lets the maximum size of FSM that can be embedded scale approximately quadratically with the attractor network size rather than linearly.

VII. CONCLUSION
Attractor neural networks are robust abstract models of human memory, but previous attempts to endow them with complex and controllable attractor-switching capabilities have mostly suffered from being either non-distributed, not scalable, or not robust. We have here introduced a simple procedure by which any arbitrary FSM may be embedded into a large-enough Hopfield-like attractor network, where states and stimuli are represented by high-dimensional random hypervectors, and all information pertaining to FSM transitions is stored in the network's weights matrix in a fully distributed manner. Our method of modelling input to the network as a masking of the network state allows cross-terms between attractors to be stored in the weights matrix such that they are effectively obfuscated until the correct state-stimulus pair is present, in a manner similar to the standard binding-unbinding operation in more conventional VSAs.
We showed that the network retains many of the features of attractor networks which make them suitable for biology, namely that the network is not reliant on synchronous dynamics and is robust to unreliable and imprecise weights, thus also making it highly suitable for implementation with high-density but noisy devices. We presented numerical results showing that the network capacity in terms of implementable FSM size scales linearly with the size of the attractor network for dense bipolar hypervectors, and approximately quadratically for sparse binary hypervectors.
In summary, we introduced an attractor-based neural state machine which overcomes many of the shortcomings that made previous models unsuitable for use in biology, and propose that attractor-based FSMs represent a plausible path by which FSMs may exist as a distributed computational primitive in biological neural networks.

hypervectors $x_\mu$, $x_\phi$, and $s_\kappa$ respectively: where in the third line we have made the same approximations as previously discussed. The postsynaptic sum is thus approximately $x_\phi$ in all indices that are not currently being masked, which drives the network towards that (target) attractor.
In vector form, the above is written as where it is assumed that there exists a stored transition from state $x_\mu$ to $x_\phi$ with stimulus $s_\kappa$, and $\underset{\sim}{\propto}$ denotes approximate proportionality. A similar calculation can be performed in the case that a stimulus is imposed which does not correspond to a valid transition for the current state. In this case, no terms of significant magnitude emerge from the transition summation, and we are left with i.e. the attractor dynamics are largely unaffected. Since we have not distinguished between our above attractor terms being node attractors or edge attractors, or between our stimuli being $s_a$ or $s_b$ stimuli, the above results can be applied to all relevant situations mutatis mutandis.
C. Why model input as masking?
One immediate question might be why we have chosen to model input to the network as a masking of the neural state vector (Equation 14), rather than simply modelling input as a Hadamard product, with a state update rule given by such that a component for which the input stimulus $s_i = -1$ triggers a "flip" in the neuron state $+1 \leftrightarrow -1$. As will be shown, the problem with this construction is that it relies on the synchrony of input to the network, and does not allow for the input to arrive asynchronously and with arbitrary delays. While this would not be a problem for a digital synchronous system, such timing constraints cannot be expected to be met in a network of asynchronously-firing biological neurons. In the synchronous case however, the edge terms $E_\eta$ in the weights matrix construction could be simplified to where as per previous notation, $x$ and $y$ are the source and target attractor states respectively, and $s$ the stimulus to cause the transition. Superficially, this construction would then satisfy our main requirements for achieving the desired attractor itinerancy dynamics during input and rest scenarios, namely
which ensures that while there is no input to the network, the states $x$ are stable attractors of the network dynamics, and which ensures that inputting the stimulus $s$ triggers the desired transition. The resulting dynamics for this network, when input is entirely synchronous, are shown in Figure 9a, and indeed the network performs the desired walk.
We then test the functionality of the attractor network with Hadamard input when the exact simultaneous arrival of input stimuli cannot be guaranteed, i.e. the input to the network is asynchronous. To model this, we consider that the arrival time of the stimulus is component-wise randomly and uniformly spread over 5 time steps, rather than just one. The same attractor network receiving the same sequence of Hadamard-product stimuli, but now asynchronously, is shown in Figure 9c. The network does not perform the correct walk between attractor states, and instead remains localised near the initial attractor state across all time steps. This is because, although the network begins to move away from the initial attractor state when input is applied, these changes are immediately undone by the network's inherent attractor dynamics, since the neural state is still within the initial attractor's basin of attraction. Only when the timescale of the input is far faster than the timescale of the attractor dynamics (e.g. when input is synchronous) can the input accumulate fast enough to escape the initial basin of attraction.
When input to the network is treated as a masking operation however (Equation 14), the attractor itinerancy dynamics are robust to input asynchrony. To model this, the input stimulus is stochastically applied, with each component being delayed randomly and uniformly by up to 20 time steps. The stimulus is then held for 10 time steps, and stochastically removed over 20 time steps in the same manner. The attractor network with asynchronous masking input is shown in Figure 10, and functions as desired, performing the correct walk between attractor states. Modelling input to the network as a masking operation thus allows the network to operate robustly in asynchronous regimes, while modelling input to the network as a Hadamard product does not.
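The asynchronous stimulus protocol described above (random per-component onset, a hold period, then random per-component removal) can be sketched as a schedule of masks. This is a minimal illustration rather than the simulation code behind Figure 10: the names `async_mask_schedule` and `apply_mask` are hypothetical, the absence of a stimulus is represented here by a `+1` entry, and silencing masked neurons by setting them to zero is an assumption about the form of Equation 14.

```python
import numpy as np

def async_mask_schedule(s, max_delay=20, hold=10, rng=None):
    """Return an array of shape (T, N) giving the stimulus seen at each step.

    Each component of the bipolar stimulus `s` switches on after a random
    delay in [0, max_delay), is held, and switches off after a further
    random delay. Entries are +1 wherever no stimulus is present.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = s.shape[0]
    onset = rng.integers(0, max_delay, size=n)                      # switch-on step
    offset = max_delay + hold + rng.integers(0, max_delay, size=n)  # switch-off step
    t = np.arange(2 * max_delay + hold)[:, None]                    # time as a column
    present = (t >= onset) & (t < offset)
    return np.where(present, s, 1)

def apply_mask(z, s_t):
    """Silence every neuron whose current stimulus component is negative
    (assumed form of the masking operation)."""
    return z * (s_t > 0)

rng = np.random.default_rng(0)
s = rng.choice([-1, 1], size=200)      # dense bipolar stimulus hypervector
schedule = async_mask_schedule(s, rng=rng)
z = rng.choice([-1, 1], size=200)      # some network state
z_masked = apply_mask(z, schedule[25])  # state during the hold period
```

With the default parameters the full stimulus is guaranteed to be present for time steps 20 to 29, and the fraction of masked neurons ramps up before and down after that window.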

D. The need for edge states
The need for the edge state attractors arises when one wants to emulate an FSM where there are consecutive edges with the same stimulus. For example, in the FSM implemented throughout this article (Figure 1) there is an incoming edge from "Zeus" to "Kronos" with stimulus "father is" and then immediately an outgoing edge from "Kronos" to "Uranus", also with stimulus "father is". More generally, consider that we wish to embed the transitions $x_1 \xrightarrow{s} x_2 \xrightarrow{s} x_3$. In the fully synchronous case, i.e. when input is applied for one time step only, there is no need for edge states. When the stimulus $s$ is applied, the network will make one transition only. In the asynchronous case however, one cannot ensure that the stimulus is applied for one time step only. Thus, starting from $x_1$, when the stimulus is applied "once" for an arbitrary number of time steps, the network may have the unwanted behaviour of transitioning to $x_2$ on the first time step, and then to $x_3$ on the second, effectively overshooting and skipping $x_2$. In Figure 11 we see the dynamics of the attractor network constructed without any edge states, with inputs which are applied for 10 time steps each, and we indeed see the undesirable skipping behaviour. Similarly, bidirectional edges with the same stimulus (e.g. "consort is") cause an unwanted oscillation between attractor states. The edge states offer a solution to this problem: by adding an intermediate attractor state for every edge, and splitting each edge into two transitions with stimuli $s_a$ and $s_b$, we ensure that there are no consecutive edges with the same stimulus.
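The overshoot problem and its fix can be illustrated at the purely symbolic level, before any hypervectors are involved. The sketch below is not the attractor network itself: it uses a plain dictionary as the transition table, and the helper names (`run`, `add_edge_states`) and the `_a`/`_b` stimulus suffixes are illustrative choices standing in for the stimuli $s_a$ and $s_b$.

```python
def step(transitions, state, stimulus):
    """Apply one stimulus for one time step; stay put if no edge matches."""
    return transitions.get((state, stimulus), state)

def run(transitions, state, stimulus, n_steps):
    """Hold the same stimulus for n_steps consecutive time steps."""
    for _ in range(n_steps):
        state = step(transitions, state, stimulus)
    return state

def add_edge_states(transitions):
    """Split every edge (src --s--> dst) into src --s_a--> e --s_b--> dst."""
    split = {}
    for (src, stim), dst in transitions.items():
        edge = f"e({src}->{dst})"
        split[(src, stim + "_a")] = edge
        split[(edge, stim + "_b")] = dst
    return split

# Two consecutive edges sharing the stimulus "father is".
fsm = {
    ("Zeus", "father is"): "Kronos",
    ("Kronos", "father is"): "Uranus",
}

# Without edge states, holding the stimulus for 10 steps skips "Kronos".
overshoot = run(fsm, "Zeus", "father is", 10)

# With edge states, each half-stimulus triggers exactly one transition,
# however long it is held.
split_fsm = add_edge_states(fsm)
midway = run(split_fsm, "Zeus", "father is_a", 10)
final = run(split_fsm, midway, "father is_b", 10)
```

Because no state has both an incoming and an outgoing edge labelled with the same half-stimulus, holding either half for many time steps cannot trigger more than one transition.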
If we do not need to embed FSMs with consecutive edges sharing the same stimulus, then we can dispense with the edge states, and construct our weights matrix with simpler transition terms like in Equation 34. An attractor network constructed in this way is shown in Figure 12, for a chosen FSM that does not require edge states, but still contains state-dependent transitions. The network performs the correct walk between attractor states as intended, and does not suffer from any of the unwanted skipping or oscillatory phenomena seen in Figure 11. Thus, while the edge states are required to ensure that any FSM can be implemented in a "large enough" attractor network, they are not strictly necessary to achieve state-dependent stimulus-triggered attractor transition dynamics.

E. Sparse stimuli
One shortcoming of the model might be that we used dense bipolar hypervectors $s$ to represent the stimuli, meaning that when $s$ is being input to the network, masking all neurons for which $s_j = -1$, approximately half of all neurons within the network are silenced. This was initially chosen because unbiased bipolar hypervectors are arguably the simplest and most common choice of VSA representation, and highlights the fact that VSA-based methods can be applied to the design of attractor networks with very little required tweaking (Gayler 1998; Kleyko, Rachkovskij, et al. 2022).

Fig. 11: An attractor network receiving a sequence of stimuli to trigger a certain walk, constructed a) without edge states and b) with edge states, with edge state overlaps being shown in c). Due to the consecutive edges in the FSM (Figure 1) with the same stimulus "father is", the edge-state-less network overshoots and skips the "Kronos" state, stopping instead at the "Uranus" state. Similarly, there is an unwanted oscillation between the states "Gaia" and "Uranus" due to the bidirectional edge with stimulus "consort is". The addition of the edge state attractors resolves these issues, and allows the network to function robustly when input stimuli are applied for an arbitrary number of time steps.
From the biological perspective however, it could be seen as somewhat implausible that the number of active neurons should change so drastically (halving) while a stimulus is present. Furthermore, if implemented with spiking neurons, the large changes in the total spiking activity could cause unwanted effects in the spike rate of the non-masked neurons. Also, this means that while the network is being masked, the size of the network (and so its capacity) is reduced to $N/2$, and so the network is especially prone to instability during the transition periods, if the network is nearing its memory capacity limits.
Fig. 12: Embedding an FSM that does not require edge states, since it does not have consecutive edges with the same stimulus. a) The FSM to be embedded, representing a simple decision tree. b) & c) An attractor network constructed to store this FSM, without any edge states, as a sequence of stimuli is input. The network performs the correct walks between attractor states as desired. Note that the second stimulus ("is orange") and its transition are state-dependent, as the target state ("carrot" or "tangerine") depends upon the stimulus given 20 time steps before ("is round" or "is pointy"). This illustrates that the edge states are not strictly necessary to implement state-dependent transitions between attractor states.

For these reasons, it is worth exploring whether the network could be constructed such that during a masking operation, fewer than half of all neurons are masked, i.e. $s$ is biased to contain more $+1$ than $-1$ entries. To keep the notation consistent with the notation used for sparse binary hypervectors, we will denote the coding level of the attractor states as $f_z$ (where previously it was simply $f$) and the coding level of the stimulus hypervectors as $f_s$. We define the coding level of the stimulus hypervectors $f_s$ to be the fraction of components for which $s_j > 0$. A stimulus hypervector with $f_s > 0.5$ thus silences fewer neurons from the network during a masking operation. This is not the only change we need to make however. If we turn to our (sparse) edge terms (Equation 27), they were previously constructed such that they would produce a non-negligible overlap with the network state $z_{sp}$ if and only if the network is in the correct attractor state and is being masked by the correct stimulus. The important condition to be fulfilled (Equation 46) is that the overlap should be negligible if the network is in the correct attractor state, but the stimulus is not present. This condition is satisfied if the components of $s$ are generated according to $\mathbb{P}(s_j = 1) = f_s$ and $\mathbb{P}(s_j = -f_s/(1 - f_s)) = 1 - f_s$, where $s_j$ is the $j$'th component of $s$. This implies that for a stimulus hypervector biased towards having more positive entries (fewer neurons are masked), the negative entries must increase in magnitude to compensate for their infrequency. For the case that only a quarter of neurons are masked by the stimulus ($f_s = 0.75$), the negative 25% of components must have the value $-3$, while for $f_s = 0.5$ this of course collapses to the balanced bipolar hypervectors used throughout this article with $\mathbb{P}(s_j = 1) = \mathbb{P}(s_j = -1) = 0.5$ (Equation 2). We are forced to increase the magnitude of the negative terms, rather than reduce the positive terms, since the magnitude of the positive terms must remain identical to that of the stored attractor terms, in order to ensure that the correct target state is projected out during a transition. We can then construct our weights matrix in the same way as before, but using these biased stimulus hypervectors $s$. An attractor network was generated with coding levels $f_z = 0.1$ (10% of neurons are active in any attractor hypervector) and $f_s = 0.9$ (10% of neurons are masked by stimulus hypervectors), and the results are shown in Figure 13, with the neural state performing the correct walk between attractor states as desired.
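A biased stimulus hypervector of this kind can be sampled in a few lines. The sketch below assumes that positive entries are $+1$ with probability $f_s$ and that negative entries take the single value $-f_s/(1 - f_s)$, which is the zero-mean choice consistent with the two cases quoted in the text ($-3$ for $f_s = 0.75$ and $-1$ for $f_s = 0.5$). The function name `biased_stimulus` is hypothetical.

```python
import numpy as np

def biased_stimulus(n, f_s, rng=None):
    """Sample a stimulus hypervector with a fraction f_s of +1 entries.

    The remaining entries are set to -f_s / (1 - f_s) so that the
    expected value of every component is zero (assumed construction).
    """
    rng = np.random.default_rng() if rng is None else rng
    negative_value = -f_s / (1.0 - f_s)
    positive = rng.random(n) < f_s
    return np.where(positive, 1.0, negative_value)

rng = np.random.default_rng(0)
s = biased_stimulus(10_000, 0.9, rng=rng)

masked_fraction = np.mean(s < 0)  # close to 1 - f_s = 0.1
mean_value = s.mean()             # close to 0 by construction
```

Only the sign of each component decides whether a neuron is masked, so the larger magnitude of the negative entries affects the weights matrix construction but not the number of silenced neurons.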
Fig. 13: An attractor network with both sparse states and sparse stimuli, constructed as described in Section VII-E. The values used here are $N = 10{,}000$, $f_z = 0.1$ (meaning only 10% of neurons are active at any time), and $f_s = 0.9$ (meaning that only 10% of neurons are masked by the stimulus). a) The input hypervectors to the network, each masking a random 10% of neurons within the network. b) The overlap of the sparse network state $z_{sp}$ with stored attractor hypervectors. c) A subset of the neurons within the network, showing the active neurons ($z_j = 1$) in red, as well as which neurons are currently being masked by the input ($s_j < 0$). The network performs the correct walk between attractor states. The balanced bipolar stimulus hypervectors used throughout this paper may thus also be generalised to be sparse.

Note that as we approach $f_s \to 1$, the stimuli become less and less distributed, with the limiting case $f_s = 1 - 1/N$ implying that only one component of $s$ is negative, and so by masking only one neuron, the network will switch between attractor states. This case is obviously a stark departure from the robustness which the more distributed representations afford us, since if that single neuron is faulty or dies, it would be catastrophic for the functioning of the network. Similarly, if another independent stimulus were to, by chance, choose the same component to be negative, this would cause similarly unwanted dynamics. Less catastrophic, but still worth considering, is that the noise added per edge term, as a result of the negative terms becoming very large, has variance that scales like $\mathrm{Var}[s_j] \approx 1/(1 - f_s)$, and so for $f_s \to 1$ contributes an increasing amount of unwanted noise to the system, destabilising the attractor dynamics. Nevertheless, this represents yet another trade-off in the attractor network's design, as needing to mask fewer neurons might be worth the increased noise within the system and the resulting decrease in memory capacity.
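The quoted variance scaling can be checked directly. The derivation below assumes that each component is $+1$ with probability $f_s$ and $-f_s/(1 - f_s)$ otherwise, i.e. the zero-mean construction consistent with the value $-3$ at $f_s = 0.75$; under a different construction the constants would change.

```latex
\mathbb{E}[s_j] = f_s \cdot 1 + (1 - f_s)\cdot\left(-\frac{f_s}{1 - f_s}\right) = 0,
\qquad
\mathrm{Var}[s_j] = \mathbb{E}[s_j^2]
 = f_s + (1 - f_s)\,\frac{f_s^2}{(1 - f_s)^2}
 = \frac{f_s}{1 - f_s}
 \;\xrightarrow{\;f_s \to 1\;}\; \approx \frac{1}{1 - f_s}.
```

For example, this gives a variance of 1 at $f_s = 0.5$, 3 at $f_s = 0.75$, and 9 at $f_s = 0.9$, so masking only a tenth of the neurons comes at the cost of a ninefold increase in per-component stimulus variance relative to the balanced bipolar case.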

Fig. 2: An attractor network transitioning through attractor states in a state-dependent manner, as a sequence of input stimuli is presented to the network. a) The input stimuli to the network, where for each unique stimulus (e.g. "father is") in the FSM to be implemented (Figure 1) a pair of hypervectors $s_a$ and $s_b$ have been generated. No stimulus, a stimulus $s_a$, then a stimulus $s_b$ are input for 10 time steps each in sequence. b) & c) The similarity of the network state $z_t$ to stored node attractor states $x \in X_{AN}$ and stored edge states $e$ respectively, computed via the inner product (Equation 3). d) The similarity of the network state $z_t$ to the sparse output states $r \in R_{AN}$. All similarities have been labelled with the state they represent and the colours are purely illustrative. The attractor transitions shown here are explicitly state-dependent, as can be seen from the repeated input of the stimulus "father is", which results in a transition to state "Kronos" when in "Hades", but to "Uranus" when in "Kronos". Additionally, the network is unaffected by nonsense input that does not correspond to a stored edge, as the network remains in the attractor "Uranus" when presented with the stimulus "father is".

Fig. 5: An attractor network performing a shorter walk than in Figure 2, but where neurons are updated asynchronously, with each neuron having a 10% chance of updating on any time step. a) The similarity of the network state $z_t$ to stored node hypervectors, with the stimulus hypervectors being applied to the network labelled above. b) The evolution of a subset of neurons within the attractor network, where for visual clarity, three of the node hypervectors have been taken from columns of the $N$-dimensional Hadamard matrix, rather than being randomly generated. The network functions largely the same as in the synchronous case, but with transitions between attractor states now taking a finite number of time steps to complete. The model is thus not dependent on the precise timing of neuron updates, and should function robustly in asynchronous systems where timing is unreliable.

Fig. 6: The capacity of the attractor network for varying size $N$, in terms of the size of FSM that can be emulated before failure. For a given $N$, a random FSM was generated with number of nodes $N_Z$ and number of edges $N_E$. An attractor network was then constructed as described in Section III, and a sequence of stimuli input to the network that should trigger a specific walk between attractor states. a) Every coloured square is a successful walk, with no unique $(N_Z, N_E, N)$ triplet being sampled more than once, and lower-$N$ squares occlude higher-$N$ squares. Since only graphs with at least as many edges as nodes were sampled, $N_E - N_Z$ is given on the y-axis rather than $N_E$. The overlain black lines are the SVM-fitted decision boundaries, distinguishing between values that succeeded and values that failed. b) The capacity $C$ for varying Hopfield network sizes $N$, where $C$ is defined to be the maximum size of FSM which can be implemented before failure, for which $N_E = N_Z$. A linear fit is overlain, and shows a linear relationship in the capacity $C$ in terms of $N$ over the range explored. Assuming that the gradients of the linear fit in a) are equal, the boundary at which failure and success are equiprobable is given by $N_Z + 2.2 N_E = 0.10 N$.

Fig. 8: The capacity of the attractor network with sparse binary activity and attractor states, for varying coding level $f$. a) Each coloured square is a successful walk, with no unique $(N, f, N_Z)$ tuple being tested more than once, and lower-$N$ squares occlude higher-$N$ squares for visual clarity. To comply with the definition of the memory capacity $C$ in the dense case, each FSM was generated with an equal number of states as edges, $N_Z = N_E$. The capacity $C$ is taken as the maximum $N_Z$ value for an $N$ at which the walk success probability $p_{\mathrm{KDE}} \geq 50\%$, estimated via a Gaussian KDE and indicated by the black crosses. b) The capacities $C$ obtained by this procedure for varying attractor network sizes $N$, up to $N = 40{,}000$, and c) the coding levels $f$ at these points. Linear fits are overlain for each, implying an approximately quadratic scaling relation for the memory capacity $C \sim N^{1.90}$ and an approximately inverse scaling relation for the coding level $f \sim N^{-0.949}$.

Fig. 9: An attractor network constructed via the simpler weights construction method specified in Section VII-C, with input to the network modelled as Hadamard product binding, rather than component-wise masking. a) The similarity of the network state $z_t$ to stored node hypervectors, when the stimulus hypervector $s$ is applied on one time step for all neurons simultaneously. b) A subset of the stimulus hypervector $s$ at each time step in this synchronous case. c) The attractor overlaps in the asynchronous case, where the stimulus $s$ is applied over multiple time steps randomly. d) A subset of the stimulus hypervector $s$ at each time step in this asynchronous case. For visual clarity, the two stimulus hypervectors shown were manually chosen rather than randomly generated. In the synchronous case, the network performs the correct walk between attractor states as intended. In the asynchronous case however, the stimuli fail to effect the desired transitions, since any changes in the network state caused by the input stimuli are short-lived, as they are quickly reversed on the next time step by the attractor network's pattern-correcting dynamics.

Fig. 10: The attractor network performing a walk as masking input is applied asynchronously over multiple time steps with random delays. a) The similarities between the network state $z_t$ and stored node hypervectors $x \in X_{AN}$. b) A subset of the stimulus hypervector $s$ being applied to the network as a mask at each time step. Indices which are black on any time step have $[s]_i = -1$ and so are being masked by the stimulus. For visual clarity, the two stimulus hypervectors shown were manually chosen, rather than randomly generated. The attractor transition dynamics are thus robust to input asynchrony when the input is modelled as a component-wise masking of the network state.

TABLE I: A comparison of the notation used to represent states, stimuli and outputs in the FSM, and the corresponding hypervectors used to represent the FSM within the attractor network.

TABLE II: Notation and frequently used symbols.