## Abstract

To accommodate structured approaches of neural computation, we propose a class of recurrent neural networks for indexing and storing sequences of symbols or analog data vectors. These networks with randomized input weights and orthogonal recurrent weights implement coding principles previously described in vector symbolic architectures (VSA) and leverage properties of reservoir computing. In general, the storage in reservoir computing is lossy, and crosstalk noise limits the retrieval accuracy and information capacity. A novel theory to optimize memory performance in such networks is presented and compared with simulation experiments. The theory describes linear readout of analog data and readout with winner-take-all error correction of symbolic data as proposed in VSA models. We find that diverse VSA models from the literature have universal performance properties, which are superior to what previous analyses predicted. Further, we propose novel VSA models with the statistically optimal Wiener filter in the readout that exhibit much higher information capacity, in particular for storing analog data.

The theory we present also applies to memory buffers, networks with gradual forgetting, which can operate on infinite data streams without memory overflow. Interestingly, we find that different forgetting mechanisms, such as attenuating recurrent weights or neural nonlinearities, produce very similar behavior if the forgetting time constants are matched. Such models exhibit extensive capacity when their forgetting time constant is optimized for given noise conditions and network size. These results enable the design of new types of VSA models for the online processing of data streams.

## 1 Introduction

An important aspect of information processing is data representation. In order to access and process data, addresses or keys are required to provide a necessary context. To enable flexible contextual structure as required in cognitive reasoning, connectionist models have been proposed that represent data and keys in a high-dimensional vector space. Such models include holographic reduced representations (HRR; Plate, 1991, 2003) and hyperdimensional computing (HDC) (Gayler, 1998; Kanerva, 2009), and will be referred to here by the umbrella term *vector symbolic architectures* (VSA; see Gayler, 2003; section 4.1.1). VSA models have been shown to be able to solve challenging tasks of cognitive reasoning (Rinkus, 2012; Kleyko & Osipov, 2014; Gayler, 2003). VSA principles have been recently incorporated into standard neural networks for advanced machine learning tasks (Eliasmith et al., 2012), inductive reasoning (Rasmussen & Eliasmith, 2011), and processing of temporal structure (Graves, Wayne, & Danihelka, 2014; Graves et al., 2016; Danihelka, Wayne, Uria, Kalchbrenner, & Graves, 2016). Typically, VSA models offer at least two operations: one to produce key-value bindings (also referred to as role-filler pairs) and a superposition operation that forms a working memory state containing the indexed data structures. For example, to represent a time sequence of data in a VSA, individual data points are bound to time-stamp keys and the resulting key-value pairs superposed into a working memory state.

Here, we show that input sequences can be indexed and memorized according to various existing VSA models by recurrent neural networks (RNNs) that have randomized input weights and orthonormal recurrent weights of particular properties. Conversely, this class of networks has a straightforward computational interpretation: in each cycle, a new random key is generated, a key-value pair is formed with the new input, and the indexed input is integrated into the network state. In the VSA literature, this operation has been referred to as *trajectory association* (Plate, 1993). The memory in these networks follows principles previously described in reservoir computing. The idea of reservoir computing is that a neural network with fixed recurrent connectivity can exhibit a rich reservoir of dynamic internal states. An input sequence can selectively evoke these states so that an additional decoder network can extract the input history from the current network state. These models produce and retain neural representations of inputs on the fly, entirely without relying on previous synaptic learning as in standard models of neural memory networks (Caianiello, 1961; Little & Shaw, 1978; Hopfield, 1982; Schwenker, Sommer, & Palm, 1996; Sommer & Dayan, 1998). Models of reservoir computing include state-dependent networks (Buonomano & Merzenich, 1995), echo-state networks (Jaeger, 2002; Lukoševičius & Jaeger, 2009), liquid-state machines (Maass, Natschläger, & Markram, 2002), and related network models of memory (White, Lee, & Sompolinsky, 2004; Ganguli, Huh, & Sompolinsky, 2008; Sussillo & Abbott, 2009). However, it is unclear how such reservoir models create representations that enable the selective readout of past input items. Leveraging the structured approach of VSAs to compute with distributed representations, we offer a novel framework for understanding reservoir computing.

## 2 Results

### 2.1 Indexing and Memorizing Sequences with Recurrent Networks

#### 2.1.1 Properties of the Matrices in the Encoding Network

The analysis simplifies considerably if the input and recurrent matrix satisfy certain conditions. Specifically, we investigate networks in which the input matrix has independent and identically distributed (i.i.d.) random entries and the recurrent weight matrix is orthogonal with mixing properties and long cycle length. The assumed properties of the network weights guarantee the following independence conditions of the indexing keys, which will be essential in our analysis of the network performance:

- •
- •
- •
- •

The class of RNNs (see equation 2.1) in which the weights fulfill properties 2.4 to 2.7 contains the neural network implementations of various VSA models. Data encoding with such networks has a quite intuitive interpretation. For each input , a pseudo-random key vector is computed that indexes both the input dimension and location in the sequence, . Each input is multiplied with this key vector to form a new key-value pair, which is added to the memory vector . Each pseudo-random key defines a spatial pattern for how an input is distributed to the neurons of the network.

#### 2.1.2 Types of Memories under Investigation

*Reset memory versus memory buffer.* In the case for finite input sequence length , the network is reset to the zero vector before the first input arrives, and the iteration is stopped after the th input has been integrated. We refer to these models as *reset memories*. In the VSA literature, the superposition operation (Plate, 1991, 2003; Gallant & Okaywe, 2013) corresponds to a reset memory and, in particular, trajectory-association (Plate, 1993). In reservoir computing, the distributed shift register (DSR; White et al., 2004) can also be related to reset memories. In contrast, a *memory buffer* can track information from the past in a potentially infinite input stream (). Most models for reservoir computing are memory buffers (Jaeger, 2002; White et al., 2004; Ganguli et al., 2008). A memory buffer includes a mechanism for attenuating older information, which replaces the hard external reset in reset memories to avoid overload. The mechanisms of forgetting we will analyze here are contracting recurrent weights or neural nonlinearities. Our analysis links contracting weights () and nonlinear activation functions () to the essential property of a memory buffer, the forgetting time constant, and we show how to optimize memory buffers to obtain extensive capacity.

*Memories for symbols versus analog input sequences.* The analysis considers data vectors that represent either symbolic or analog inputs. The superposition of discrete symbols in VSAs can be described by equation 2.1, where inputs are one-hot or zero vectors. A one-hot vector represents a symbol in an alphabet of size . The readout of discrete symbols involves a nonlinear error correction for producing one-hot vectors as output, the winner-take-all operation . Typical models for reservoir computing (Jaeger, 2002; White et al., 2004) process one-dimensional analog input, and the readout is linear, in equation 2.2. We derive the information capacity for both uniform discrete symbols and gaussian analog inputs.

*Readout by naive regression versus full minimum mean square error regression.* Many models of reservoir computing use full optimal linear regression and set the linear transform in equation 2.2 to the Wiener filter , which produces the minimum mean square error (MMSE) estimate of the stored input data. Here, is the covariance between input and memory state, and is the covariance matrix of the memory state. Obviously this readout requires inverting . In contrast, VSA models use , with a constant, which does not require matrix inversion. Thus, the readout in VSA models is computationally much simpler but can cause reduced readout quality. We show that the MMSE readout matrix can mitigate the crosstalk noise in VSA and improve readout quality in regimes where . This is particularly useful for the retrieval of analog input values, where the memory capacity exceeds many bits per neuron, limited only by neuronal noise.

### 2.2 Analysis of Memory Performance

For symbolic input, we will consider symbols from an alphabet of length , which are represented by one-hot vectors; that is, in each input vector, there is one component with value 1 and all other are 0. In this case, a multivariate threshold operation can be applied after the linear readout for error correction, the winner-take-all function: .

#### 2.2.1 The Accuracy of Retrieving Discrete Inputs

For symbolic inputs, we will analyze the readout of two distinct types of input sequences. In the first type, a symbol is entered in every time step, and retrieval consists in classification to determine which symbol was added at a particular time. The second type of input sequence can contain gaps: some positions in the sequence can be empty. If most inputs in the sequence are empty, this type of input stream has been referred to as a sparse input sequence (Ganguli & Sompolinsky, 2010). The retrieval task is then detection: whether a symbol is present and, if so, reveal its identity.

**Special Cases**

- As a sanity check, consider the classification accuracy, equation 2.12, in the vanishing sensitivity regime, for . The first factor in the integral, the gaussian, then becomes the inner derivative of the second factor, the cumulative gaussian raised to the th power. With , the integral can be solved analytically using the (inverse) chain rule to yield the correct chance value for the classification:
- The case makes sense for detection but not for classification retrieval. This case falls under classical signal detection theory (Peterson, Birdsall, & Fox, 1954), with being the sensitivity index. The detection accuracy, equation 2.14, in this case becomes The threshold trades off miss and false alarm errors. Formulas 2.14 to 2.16 generalize signal detection theory to higher-dimensional signals ().
- Consider classification retrieval, equation 2.12, in the case . Since the (rescaled and translated) random variables and (see Figure 2A) are uncorrelated, one can switch to a new gaussian variable representing their difference: with (see Figure 2B). Thus, for , one can compute equation 2.12 by just the normal cumulative density function (and avoiding the integration): The result, equation 2.19, is the special case of table entry 10,010.8 in Owen's table of normal integrals (Owen, 1980).

In general, for and nonzero sensitvity, the integral cannot be solved analytically, but can be numerically approximated to arbitrary precision (see Figure 14).

#### 2.2.2 Accuracy in the High-Fidelity Regime

Next, we derive steps to approximate the accuracy in the regime of high-fidelity recall, following the rationale of previous analyses of VSA models (Plate, 2003; Gallant & Okaywe, 2013). This work showed that the accuracy of retrieval scales linearly with the number of neurons in the network (). We will compare our analysis results with those of previous analyses and with simulation results.

#### 2.2.3 Information content and memory capacity

*Memory capacity for symbolic input.*The information content (Feinstein, 1954) is defined as the mutual information between the true sequence and the sequence retrieved from the superposition state . The mutual information between the individual item that was stored time steps ago () and the item that was retrieved () is given by where is the Kullback-Leibler divergence (Kullback & Leibler, 1951).

Note that if the accuracy is the same for all items, then , and by setting , one obtains the entire input information: .

*Memory capacity for analog input.*For analog inputs, we can compute the information content if the components of input vectors are independent with gaussian distribution, . In this case, distributions of the readout and the joint between input and readout are also gaussian. Therefore, the correlation between and is sufficient to compute the information (Gel'fand & Yaglom, 1957), with , where is the correlation between the input and output. There is a simple relation between the signal correlation and the SNR (9): , which gives the total information: The information content of a network is in units bits per neuron. The memory capacity is then the maximum of (27, 28) over all parameter settings of the network. The network has extensive memory when is positive, as grows large.

### 2.3 VSA Indexing and Readout of Symbolic Input Sequences

In this section, we analyze the network model, equation 2.1, with linear neurons, and without neuronal noise. After a reset to , the network receives a sequence of discrete inputs. Each input is a one-hot vector, representing one of symbols; we will show examples with alphabet size of , representing the 26 English letters and the space character. The readout, equation 2.2, involves the matrix and the winner-take-all function, with a scaling constant. This setting is important because, as we will show in the next section, it can implement the working memory operation in various VSA models from the literature.

Note that with linear neurons and unitary recurrent matrix, the argument can be dropped because there is no recency effect and all items in the sequence can be retrieved with the same accuracy.

#### 2.3.1 VSA Models from the Literature

Many connectionist models from the literature can be directly mapped to equations 2.1 and 2.2 with the settings described at the beginning of section 2.3. In the following, we will describe various VSA models and the properties of the corresponding encoding matrix and recurrent weight matrix . We will determine the moments of the code vectors required in equation 2.34 to estimate the accuracy with in the general case (for small ). For large values, we will show that all models perform similarly, and the accuracy can predicted by the universal sensitivity formula .

In hyperdimensional computing (HDC; Gayler, 1998; Kanerva, 2009), symbols are represented by -dimensional random i.i.d. bipolar high-dimensional vectors (hypervectors), and referencing is performed by a permutation operation (see section 4.1.1). Thus, network 2.1 implements encoding according to HDC when the components of the encoding matrix are bipolar uniform random i.i.d. variables or ; that is, their distribution is a uniform Bernoulli distribution: , and is a permutation matrix, a special case of a unitary matrix.

With these settings, we can compute the moments of . We have , , , and , which can be inserted in equation 2.34 to compute the retrieval accuracy. For large , the retrieval accuracy can be computed using equation 2.35. We implemented this model and compared multiple simulation experiments to the theory. The theory predicts the simulations precisely for all parameter settings of , , and (see Figures 3A and 3B).

In holographic reduced representation (HRR) (Plate, 1993, 2003), symbols are represented by vectors drawn from a gaussian distribution with variance : . The binding operation is performed by circular convolution, and trajectory association can be implemented by binding each input symbol to successive convolutional powers of a random *key* vector, . According to Plate (1995), the circular convolution operation can be transformed into an equivalent matrix multiply for a fixed vector by forming the circulant matrix from the vector (i.e., ). This matrix has elements (where the subscripts on a are interpreted modulo ). If , the corresponding matrix is unitary. Thus, HRR trajectory association can be implemented by an RNN with a recurrent circulant matrix and encoding matrix with entries drawn from a normal distribution. The analysis described for HDC carries over to HRR, and the error probabilities can be computed through the statistics of , with , giving , and with , giving . We compare simulations of HRR to the theoretical predictions in Figure 3C.

The *Fourier holographic reduced representation* (FHRR) (Plate, 2003) framework uses complex hypervectors as symbols, where components lay on the complex unit circle and have random phases: , with a phase angle drawn from the uniform distribution . The network implementation uses complex vectors of a dimension of . Since each vector component is complex, there are numbers to represent: one for the real part and one for the imaginary part (Danihelka et al., 2016). The first rows of the input matrix act on the real parts, and the second act on the imaginary part. Trajectory association can be performed with a random vector with complex elements acting as the key, raising the key to successive powers and binding this with each input sequentially. In FHRR, both element-wise multiply or circular convolution can be used as the binding operation, and trajectory association can be performed to encode the letter sequence with either mechanism (see section 4.1.2 for further details). These are equivalent to an RNN with the diagonal of as the key vector or as being the circulant matrix of the key vector.

Given that each draw from is a unitary complex number with , the statistics of are given by , , giving . For the variance, let and . Then . Letting , it is easy to see that it also the case that . Therefore, and giving . Again we simulate such networks and compare to the theoretical results (see Figures 3D and 3E).

A random unitary matrix acting as a binding mechanism has also been proposed in the *matrix binding with additive terms* framework (MBAT) (Gallant & Okaywe, 2013). Our theory also applies to equivalent RNNs with random unitary recurrent matrices (created by QR decomposition of random gaussian matrix), with the same (see Figure 3F). Picking an encoding matrix and unitary recurrent matrix at random satisfies the required assumptions 2.4 to 2.7 with high probability when is large.

#### 2.3.2 Sparse Input Sequences

We next analyze detection retrieval of sparse input sequences, in which the input data vector is nonzero only with some probability . The readout must first decide whether an input was present and determine its identity if present. With a random input matrix, linear neurons, and a unitary recurrent matrix, the sensitivity is . The crosstalk noise increments only when the input generates a one-hot vector. The threshold setting trades off hit and correct rejection accuracy (miss and false-positive error). We illustrate this in Figure 4A using equations 2.14 to 2.16 describing retrieval accuracy. The readout performance for sparse data sequences depends only on the product . Thus, it appears possible to recover memory items with high sensitivity even for large sequence lengths if is very small. However, our theoretical result requires assumptions 2.4 to 2.7, which break down for extremely long sequences. The result does not account for this breakdown and is optimistic for this scenario. Previous results that consider such extremely sparse input sequences have used the methods of compressed sensing and sparse inference (Ganguli & Sompolinsky, 2010; Charles, Yap, & Rozell, 2014; Charles, Yin, & Rozell, 2017), and show that recovering sparse input sequences with is possible.

#### 2.3.3 Sparse Codebook

Often neural activity is characterized as sparse, and some VSA models use sparse codebooks. Several studies point to sparsity as an advantageous coding strategy for connectionist models (Rachkovskij, 2001). Sparse codes can be studied within our framework, equation 2.1, with a sparse input matrix—a random matrix in which elements are zeroed out with a probability referred to as the *sparseness factor* (sf) Sparsity in the codebook affects both signal and the noise equally and cancels out to produce the same sensitivity, , as with a nonsparse input matrix. Thus, sparsity essentially has no effect on the capacity of the memory, up to the catastrophic point of sparsity where entire columns in the input matrix become zero (see Figure 4B).

#### 2.3.4 Neuronal Noise

There are other ways to model noise in the network. For the case where there is only white noise added during the retrieval operation, it is easy to see that this noise will be added to the variance of , giving . If the noise was instead like a bit-flip in readout hypervector, with the probability of bit-flip , then this gives . Finally, with these derivations of and equation 2.12, the empirical performance of simulated neural networks is predicted (see Figure 4C).

#### 2.3.5 Memory Capacity of VSAs with Symbolic Input

The original estimate for the capacity of distributed random codes (Plate, 1993) considered a slightly different setup (see section 4.2.2) but follows similar ideas as the FA-CR-LEE high-fidelity approximation, equation 2.24, and we reformulated the Plate (1993) derivation to compare with our analysis. This work first showed that random indexing has linear extensive capacity and that the memory capacity is at least 0.16 bits per neuron. Figure 5A compares the approximations, equations 2.21, 2.23, and 2.24, with the full theory evaluated numerically, equation 2.12. These approximations are good in the high-fidelity regime, where is near 1 but underestimate the performance in the low-fidelity regime.

With the relationship , the information contained in the activity state can be calculated. We compare the total information, equation 2.27, based on the approximations with the true information content determined by numeric evaluation of (see Figure 5B). In the linear scenario with unitary weight matrix, has no dependence on , and so the total information in this case is simply , equation 2.27.

Results for the case of a finite moderate-sized alphabet size () are shown in Figure 5B. The novel and most accurate high-fidelity approximation, equation 2.25, predicts 0.27 bits per neuron; the simpler high-fidelity approximations substantially underestimate the capacity.

Importantly, however, our full theory shows that the true information maximum lies outside the high-fidelity regime. The maximum capacity for is nearly 0.4 bits per neuron (see Figure 5B, black circle). Thus, the achievable capacity is about four times larger than previous analysis suggested.

In a wide range of , our full theory precisely predicts the empirically measured total information in simulations of the random sequence recall task (see Figure 5C). The total information per neuron scales linearly with the number of neurons, and the maximum amount of information per element that can be stored in the network is dependent on . The memory capacity increases with , reaching over 0.5 bits per neuron for large (see Figure 5D, solid line).

*Capacity without superposition.* As the alphabet size, , grows superlinear in (approaching ), one needs to reduce in order to maximize the memory capacity (see Figure 5E, dashed line). The theory breaks down when there is no superposition, that is, when . This case is different because there is no cross-talk, but for very large and randomly generated code vectors, collisions arise, another source of retrieval errors. Collisions are coincidental duplication of code vectors. The theory presented so far can describe effects of crosstalk but not of collisions.

In the presence of superposition, that is, for , the crosstalk noise becomes immediately the limiting factor for memory capacity. This is shown in Figure 5E for a small network of 100 neurons. For , the memory capacity drops to around 0.5 bits per neuron and decreases to about 0.2 bits per neuron for large values (see Figure 5E, black line). The capacity curves for fixed values of (5E, green lines) show the effect of crosstalk noise, which increases as more items are superposed (as increases). For (see Figure 5E, dark green line), equations 2.39 and 2.27 can be evaluated as grows to .

### 2.4 Indexed Memory Buffers with Symbolic Input

With the linear encoding network described in the previous section, there is no recency effect; the readout of the most recent input stored is just as the accurate as readout of the earliest input stored. In contrast, a network with a contracting recurrent weight matrix (see section 2.4.1) or nonlinear neural activation functions (see sections 2.4.2 and 2.4.3) will have a recency effect. In this section, the theory will be developed to describe memory in encoding networks (see equation 2.1) with recency effects. We juxtapose the performance of networks with recency effects in our two memory tasks, reset memory and memory buffer. We show that reset memories yield optimal capacity in the absence of any recency effect (Lim & Goldman, 2011). However, recency effects can avoid catastrophic forgetting when the input sequence is infinite. Through the recency effect, sequence items presented further back in time will be attenuated and eventually forgotten. Thus, the recency effect is a soft substitution of resetting the memory activity before the input of interest is entered. Further, we show that contracting recurrent weights and saturating neural activation functions have very similar behavior if their forgetting time constants are aligned (see section 2.4.4). Finally, we optimize the parameters of memory buffers for high memory capacity and show that extensive capacity (Ganguli et al., 2008) is achievable even in the presence of neuronal noise by keeping the forgetting time constant proportional to .

#### 2.4.1 Linear Neurons with Contracting Recurrent Weights

*Reset memory.*Consider a network of linear neurons in which the attenuation factor contracts the network activity in each time step. After a sequence of input symbols has been applied, the variance of is , and the signal decays exponentially with . The sensitivity for recalling the input that was added time steps ago is Thus, the sensitivity decays exponentially as increases, and the highest retrieval accuracy is from the most recent item stored in memory. The accuracy (see Figure 6A1) and information per item (see Figure 6B1) based on this formula for show the interdependence between the total sequence length () and the look-back distance () in the history.

In equation 2.41, the sensitivity is monotonically increasing as increases, and thus to maximize it for the th element in history given a finite set of stored tokens, we would want to maximize or have the weight matrix remain unitary with .^{1} The memory capacity is maximized as when is finite (see Figure 6C1) and as grows large (see Figure 6D1), and there is no benefit of contracting weights in reset memories.

*Memory buffer.* For an infinite stream of inputs, , the setting results in catastrophic forgetting. However with , the memory can operate even in this case because past signals fade away and make room for storing new inputs. These networks with , , and unitary have been denoted *normal networks* (White et al., 2004).

#### 2.4.2 Neurons with Clipped-Linear Transfer Function

We will analyze the HDC encoding scheme, a network with a bipolar random input matrix and the recurrent weights a permutation matrix. With this, the components of will always assume integer values, and due to the clipping, the components will be confined to . As a consequence, , defined as in equation 2.30, will also assume values limited to . To compute , we need to track the mean and variance of . This requires iterating the Chapman-Kolmogorov equation 2.3. To do so, we introduce a vector with , which tracks the probability distribution of . The probability of each of the integers from is enumerated in the indices of the vector , and is a bijective map from the values of to the indices of with inverse . To compute the sensitivity of a particular recall, we need to track the distribution of with before the item of interest is added, when the item of interest is added, and in the time steps after storing the item of interest. Note that is defined relative to the standard deviation of the codebook, . A simple scaling can generalize the following analysis to account for codebooks with different variance.

*Reset memory.*At initialization , and so . For each time step that an input arrives in the sequence prior to the input of interest, a or will randomly add to up until the bounds induced by , and this can be tracked with the following diffusion of : Once the vector of interest arrives at , all entries in will have added. This causes the probability distribution to skew: The inputs following the input of interest will then again cause the probability distribution to diffuse further based on equation 2.44. Finally, can be computed for this readout operation by calculating the mean and variance with :

*Memory buffer.* For the diffusion equation, 2.44, will reach a uniform equilibrium distribution with the values of uniformly distributed between : . This means, as with contracting recurrent weights, the clipped-linear function bounds the noise variance of the saturated state. Here, the variance bound of is the variance of the uniform distribution, . Thus, information can still be stored in the network even after being exposed to an infinite sequence of inputs. The sensitivity in the saturated state can be calculated with by replacing in equation 2.45 with , and then again using the diffusion equation 2.44 for the items following the item of interest.

Figure 7 illustrates this analysis of the distribution of . When the item of interest is added, the probability distribution is most skewed, and the signal degradation is relatively small. As more items are added later, the distribution diffuses to the uniform equilibrium and the signal decays to 0. The figure compares operation with the initial states corresponding to reset memory and memory buffer: empty (see Figures 7A1 to 7F1) and saturated (see Figures 7A2 to 7F2). Numerical optimization of memory capacity shows how increases with and decreases with , the parameter when (see Figure 7G).