Abstract

To accommodate structured approaches of neural computation, we propose a class of recurrent neural networks for indexing and storing sequences of symbols or analog data vectors. These networks with randomized input weights and orthogonal recurrent weights implement coding principles previously described in vector symbolic architectures (VSA) and leverage properties of reservoir computing. In general, the storage in reservoir computing is lossy, and crosstalk noise limits the retrieval accuracy and information capacity. A novel theory to optimize memory performance in such networks is presented and compared with simulation experiments. The theory describes linear readout of analog data and readout with winner-take-all error correction of symbolic data as proposed in VSA models. We find that diverse VSA models from the literature have universal performance properties, which are superior to what previous analyses predicted. Further, we propose novel VSA models with the statistically optimal Wiener filter in the readout that exhibit much higher information capacity, in particular for storing analog data.

The theory we present also applies to memory buffers, networks with gradual forgetting, which can operate on infinite data streams without memory overflow. Interestingly, we find that different forgetting mechanisms, such as attenuating recurrent weights or neural nonlinearities, produce very similar behavior if the forgetting time constants are matched. Such models exhibit extensive capacity when their forgetting time constant is optimized for given noise conditions and network size. These results enable the design of new types of VSA models for the online processing of data streams.

1  Introduction

An important aspect of information processing is data representation. In order to access and process data, addresses or keys are required to provide a necessary context. To enable flexible contextual structure as required in cognitive reasoning, connectionist models have been proposed that represent data and keys in a high-dimensional vector space. Such models include holographic reduced representations (HRR; Plate, 1991, 2003) and hyperdimensional computing (HDC) (Gayler, 1998; Kanerva, 2009), and will be referred to here by the umbrella term vector symbolic architectures (VSA; see Gayler, 2003; section 4.1.1). VSA models have been shown to be able to solve challenging tasks of cognitive reasoning (Rinkus, 2012; Kleyko & Osipov, 2014; Gayler, 2003). VSA principles have been recently incorporated into standard neural networks for advanced machine learning tasks (Eliasmith et al., 2012), inductive reasoning (Rasmussen & Eliasmith, 2011), and processing of temporal structure (Graves, Wayne, & Danihelka, 2014; Graves et al., 2016; Danihelka, Wayne, Uria, Kalchbrenner, & Graves, 2016). Typically, VSA models offer at least two operations: one to produce key-value bindings (also referred to as role-filler pairs) and a superposition operation that forms a working memory state containing the indexed data structures. For example, to represent a time sequence of data in a VSA, individual data points are bound to time-stamp keys and the resulting key-value pairs superposed into a working memory state.

Here, we show that input sequences can be indexed and memorized according to various existing VSA models by recurrent neural networks (RNNs) that have randomized input weights and orthonormal recurrent weights of particular properties. Conversely, this class of networks has a straightforward computational interpretation: in each cycle, a new random key is generated, a key-value pair is formed with the new input, and the indexed input is integrated into the network state. In the VSA literature, this operation has been referred to as trajectory association (Plate, 1993). The memory in these networks follows principles previously described in reservoir computing. The idea of reservoir computing is that a neural network with fixed recurrent connectivity can exhibit a rich reservoir of dynamic internal states. An input sequence can selectively evoke these states so that an additional decoder network can extract the input history from the current network state. These models produce and retain neural representations of inputs on the fly, entirely without relying on previous synaptic learning as in standard models of neural memory networks (Caianiello, 1961; Little & Shaw, 1978; Hopfield, 1982; Schwenker, Sommer, & Palm, 1996; Sommer & Dayan, 1998). Models of reservoir computing include state-dependent networks (Buonomano & Merzenich, 1995), echo-state networks (Jaeger, 2002; Lukoševičius & Jaeger, 2009), liquid-state machines (Maass, Natschläger, & Markram, 2002), and related network models of memory (White, Lee, & Sompolinsky, 2004; Ganguli, Huh, & Sompolinsky, 2008; Sussillo & Abbott, 2009). However, it is unclear how such reservoir models create representations that enable the selective readout of past input items. Leveraging the structured approach of VSAs to compute with distributed representations, we offer a novel framework for understanding reservoir computing.

2  Results

2.1  Indexing and Memorizing Sequences with Recurrent Networks

We investigate how a sequence of input vectors of dimension can be indexed by pseudo-random vectors and memorized by a recurrent network with neurons (see Figure 1). The data vectors are fed into the network through a randomized, fixed input matrix . In the context of VSA, the input matrix corresponds to the codebook, and the matrix columns contain the set of high-dimensional random vector-symbols (hypervectors) used in the distributed computation scheme. In addition, the neurons might also experience some independent neuronal noise with . Further, feedback is provided through a matrix of recurrent weights where is orthogonal and . The input sequence is encoded into a single network state by the recurrent neural network (RNN),
formula
2.1
with the component-wise neural activation function.
Figure 1:

Network model investigated.

To estimate the input entered steps ago from the network state, the readout is of the form
formula
2.2
where is a linear transform to select the input that occurred time steps in the past (see Figure 1). In some models, the readout includes a nonlinearity to produce the final output.
One iteration of equation 2.1 evolves the probability distribution of the network state as a Markov stochastic process, governed by the Chapman-Kolmogorov equation (Papoulis, 1984),
formula
2.3
with a transition kernel , which depends on all parameters and functions in equation 2.1. Thus, to analyze the memory performance in general, one has to iterate equation 2.3 to obtain the distribution of the network state.

2.1.1  Properties of the Matrices in the Encoding Network

The analysis simplifies considerably if the input and recurrent matrices satisfy certain conditions. Specifically, we investigate networks in which the input matrix has independent and identically distributed (i.i.d.) random entries and the recurrent weight matrix is orthogonal with mixing properties and a long cycle length. The assumed properties of the network weights guarantee the following independence conditions of the indexing keys, which will be essential in our analysis of the network performance:

  • Code vectors are composed of identically distributed components,
    formula
    2.4
    where is the distribution for a single component of a random code vector, and with , being the mean and variance of , as typically defined by , , with an arbitrary function.
  • Components within a code vector and between code vectors are independent:
    formula
    2.5
  • The recurrent weight matrix is orthogonal and thus preserves the mean and variance of every component of a code vector:
    formula
    2.6
  • The recurrent matrix preserves element-wise independence with a large cycle time (around the size of the reservoir):
    formula
    2.7

The class of RNNs (see equation 2.1) in which the weights fulfill properties 2.4 to 2.7 contains the neural network implementations of various VSA models. Data encoding with such networks has a quite intuitive interpretation. For each input , a pseudo-random key vector is computed that indexes both the input dimension and location in the sequence, . Each input is multiplied with this key vector to form a new key-value pair, which is added to the memory vector . Each pseudo-random key defines a spatial pattern for how an input is distributed to the neurons of the network.
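To make this encoding concrete, the following minimal NumPy sketch implements the scheme just described under assumed choices that match the HDC setting analyzed later (a random bipolar codebook as the input matrix and a random permutation as the orthogonal recurrent matrix); the dimensions and the helper name encode are illustrative, not taken from the original implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 27                          # number of neurons and alphabet size (assumed values)

Phi = rng.choice([-1, 1], size=(N, D))   # codebook: one random bipolar column per symbol
perm = rng.permutation(N)                # random permutation acting as the orthogonal recurrent matrix

def encode(symbol_indices):
    """Integrate a symbol sequence into a single memory state: x <- W x + Phi a(t)."""
    x = np.zeros(N)
    for s in symbol_indices:
        x = x[perm] + Phi[:, s]          # permute the old state (recurrent step), add the new key-value pair
    return x

memory = encode([3, 14, 7, 0, 22])       # superposition of five indexed inputs
```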

2.1.2  Types of Memories under Investigation

Reset memory versus memory buffer. In the case of a finite input sequence length , the network is reset to the zero vector before the first input arrives, and the iteration is stopped after the th input has been integrated. We refer to these models as reset memories. In the VSA literature, the superposition operation (Plate, 1991, 2003; Gallant & Okaywe, 2013) corresponds to a reset memory and, in particular, trajectory association (Plate, 1993). In reservoir computing, the distributed shift register (DSR; White et al., 2004) can also be related to reset memories. In contrast, a memory buffer can track information from the past in a potentially infinite input stream (). Most models for reservoir computing are memory buffers (Jaeger, 2002; White et al., 2004; Ganguli et al., 2008). A memory buffer includes a mechanism for attenuating older information, which replaces the hard external reset in reset memories to avoid overload. The mechanisms of forgetting we will analyze here are contracting recurrent weights or neural nonlinearities. Our analysis links contracting weights () and nonlinear activation functions () to the essential property of a memory buffer, the forgetting time constant, and we show how to optimize memory buffers to obtain extensive capacity.

Memories for symbols versus analog input sequences. The analysis considers data vectors that represent either symbolic or analog inputs. The superposition of discrete symbols in VSAs can be described by equation 2.1, where inputs are one-hot or zero vectors. A one-hot vector represents a symbol in an alphabet of size . The readout of discrete symbols involves a nonlinear error correction for producing one-hot vectors as output, the winner-take-all operation . Typical models for reservoir computing (Jaeger, 2002; White et al., 2004) process one-dimensional analog input, and the readout is linear, in equation 2.2. We derive the information capacity for both uniform discrete symbols and gaussian analog inputs.

Readout by naive regression versus full minimum mean square error regression. Many models of reservoir computing use full optimal linear regression and set the linear transform in equation 2.2 to the Wiener filter , which produces the minimum mean square error (MMSE) estimate of the stored input data. Here, is the covariance between input and memory state, and is the covariance matrix of the memory state. Obviously this readout requires inverting . In contrast, VSA models use , with a constant, which does not require matrix inversion. Thus, the readout in VSA models is computationally much simpler but can cause reduced readout quality. We show that the MMSE readout matrix can mitigate the crosstalk noise in VSA and improve readout quality in regimes where . This is particularly useful for the retrieval of analog input values, where the memory capacity exceeds many bits per neuron, limited only by neuronal noise.
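As an illustration of the difference between the two readouts, the sketch below compares the naive projection readout with the analytically computed Wiener filter. It is deliberately simplified to a static superposition of scalar values bound to fixed random keys rather than the full RNN; all names and parameter values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 200, 300                 # neurons and number of stored scalars (an overloaded regime, assumed)
sigma_a, sigma_xi = 1.0, 0.1    # input and neuronal-noise standard deviations

K = rng.choice([-1.0, 1.0], size=(N, M)) / np.sqrt(N)   # one unit-norm random key per stored item
a = rng.normal(0, sigma_a, M)                           # analog inputs
x = K @ a + rng.normal(0, sigma_xi, N)                  # superposed memory state

# naive (VSA-style) readout: projection onto the keys, no matrix inversion
a_naive = K.T @ x

# MMSE (Wiener) readout: V = Cov(a, x) Cov(x, x)^{-1}
C_xx = sigma_a**2 * K @ K.T + sigma_xi**2 * np.eye(N)
V = sigma_a**2 * K.T @ np.linalg.inv(C_xx)
a_mmse = V @ x

for name, est in [("naive", a_naive), ("wiener", a_mmse)]:
    print(f"{name}: mean squared error = {np.mean((est - a) ** 2):.3f}")
```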

2.2  Analysis of Memory Performance

After encoding an input sequence, the memory state contains information indexed with respect to the dimension of the input vectors and with respect to the length dimension of the sequence. The readout of a vector component, , at a particular position of the sequence, , begins with a linear dot product operation,
formula
2.8
where is the th column vector of the decoding matrix .
For readout of analog-valued input vectors, we use linear readout: , where is decoding noise resulting from crosstalk and neuronal noise. The signal-to-noise ratio, , of the linear readout can then be defined as
formula
2.9
where we suppressed the component index and assume that the signal and noise properties are the same for all vector components.

For symbolic input, we will consider symbols from an alphabet of length , which are represented by one-hot vectors; that is, in each input vector, there is one component with value 1 and all others are 0. In this case, a multivariate threshold operation can be applied after the linear readout for error correction, the winner-take-all function: .
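A minimal sketch of this winner-take-all step (the function name wta is illustrative):

```python
import numpy as np

def wta(u):
    """Winner-take-all error correction: one-hot vector marking the largest readout component."""
    y = np.zeros_like(u)
    y[np.argmax(u)] = 1.0
    return y

print(wta(np.array([0.2, 1.3, -0.4, 0.9])))   # -> [0. 1. 0. 0.]
```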

2.2.1  The Accuracy of Retrieving Discrete Inputs

For symbolic inputs, we will analyze the readout of two distinct types of input sequences. In the first type, a symbol is entered in every time step, and retrieval consists in classification to determine which symbol was added at a particular time. The second type of input sequence can contain gaps: some positions in the sequence can be empty. If most inputs in the sequence are empty, this type of input stream has been referred to as a sparse input sequence (Ganguli & Sompolinsky, 2010). The retrieval task is then detection: to determine whether a symbol is present and, if so, to reveal its identity.

For classification, if is the index of the hot component in , then the readout with the winner-take-all operation is correct if in equation 2.8, for all distracters . As we will see, under the independence conditions 2.4 to 2.7 and VSA readout, the readout variables are the true inputs plus gaussian noise. The classification accuracy, , is
formula
2.10
For clarity in the notation of gaussian distributions, the argument variable is added: .
The gaussian variables and in equation 2.10 can be shifted and rescaled to yield
formula
2.11
where is the normal cumulative distribution function.
Further simplification can be made when . The classification accuracy then becomes
formula
2.12
where the sensitivity for detecting the hot component from is defined:
formula
2.13
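The exact expression is not reproduced in this excerpt, but under the stated assumptions (unit-variance gaussian readout variables, the hot component shifted by the sensitivity, and the remaining alphabet entries acting as independent distracters), the classification accuracy can be evaluated numerically along the following lines. The symbols d and D are assumed names for the sensitivity and the alphabet size.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def p_corr(d, D):
    """Probability that a unit-variance gaussian readout with mean d (the hot component)
    exceeds D-1 independent zero-mean unit-variance distracters (assumed form)."""
    integrand = lambda x: norm.pdf(x, loc=d) * norm.cdf(x) ** (D - 1)
    val, _ = integrate.quad(integrand, -np.inf, np.inf)
    return val

print(p_corr(0.0, 27))   # vanishing sensitivity: chance level, approx 1/27
print(p_corr(3.0, 27))   # high-fidelity regime
```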
For detection, the retrieval involves two steps: detecting whether an input item was integrated time steps ago and, if one is detected, identifying which symbol it was. In this case, a rejection threshold, , is required, which governs the trade-off between the two error types: misses and false positives. If none of the components in exceed , then the readout will output that no item was stored. The detection accuracy is given by
formula
2.14
where is the probability that is a nonzero signal. If the distribution of is close to gaussian, the two conditional probabilities of equation 2.14 can be computed as follows. The accuracy, given a nonzero input was applied, can be computed analogously to equation 2.12:
formula
2.15
Note that equation 2.15 is of the same form as equation 2.12 but with different integration bounds. The second conditional probability in equation 2.14, for correctly detecting a zero input, can be computed by
formula
2.16
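Under the same gaussian assumptions, the two conditional probabilities for detection can be sketched as follows. The thresholding convention (the hot component must exceed both the threshold and all distracters; a rejection is produced when no component exceeds the threshold) is an assumption consistent with the description above; the exact bounds of equations 2.15 and 2.16 are not reproduced in this excerpt.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def p_hit(d, D, theta):
    """Detect-and-identify: the hot component (mean d) exceeds the threshold theta
    and all D-1 distracters (all modeled as unit-variance gaussians)."""
    integrand = lambda x: norm.pdf(x, loc=d) * norm.cdf(x) ** (D - 1)
    val, _ = integrate.quad(integrand, theta, np.inf)
    return val

def p_correct_reject(D, theta):
    """No component of a pure-noise readout exceeds the threshold."""
    return norm.cdf(theta) ** D

def p_detect(d, D, theta, p_nonzero):
    """Overall detection accuracy, weighting the two conditional probabilities."""
    return p_nonzero * p_hit(d, D, theta) + (1 - p_nonzero) * p_correct_reject(D, theta)

print(p_detect(d=3.0, D=27, theta=1.5, p_nonzero=0.5))
```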

Special Cases

  1. As a sanity check, consider the classification accuracy, equation 2.12, in the vanishing sensitivity regime, for . The first factor in the integral, the gaussian, then becomes the inner derivative of the second factor, the cumulative gaussian raised to the th power. With , the integral can be solved analytically using the (inverse) chain rule to yield the correct chance value for the classification:
    formula
  2. The case makes sense for detection but not for classification retrieval. This case falls under classical signal detection theory (Peterson, Birdsall, & Fox, 1954), with being the sensitivity index. The detection accuracy, equation 2.14, in this case becomes
    formula
    The threshold trades off miss and false alarm errors. Formulas 2.14 to 2.16 generalize signal detection theory to higher-dimensional signals ().
  3. Consider classification retrieval, equation 2.12, in the case . Since the (rescaled and translated) random variables and (see Figure 2A) are uncorrelated, one can switch to a new gaussian variable representing their difference: with (see Figure 2B). Thus, for , one can compute equation 2.12 by just the normal cumulative distribution function (avoiding the integration):
    formula
    The result, equation 2.19, is the special case of table entry 10,010.8 in Owen's table of normal integrals (Owen, 1980).

Figure 2:

Approximating the retrieval accuracy in the high-fidelity regime. (A) The retrieval is correct when the value drawn from distribution (blue) exceeds the values produced by draws from the distribution (black). In the example, the sensitivity is . (B) When , the two distributions can be transformed into one distribution describing the difference of both quantities, . (C) When , the random variables formed by such differences are correlated. Thus, in general, the multivariate cumulative gaussian integral, equation 2.20, cannot be factorized. The example shows the case , with the integration boundaries displayed by dashed lines. (D) However, for large , that is, in the high-fidelity regime, the factorial approximation, equation 2.21, becomes quite accurate. The panel shows again the example. (E) Linear relationship between the squared sensitivity and the logarithm of . The numerically evaluated full theory (dots) coincides more precisely with the approximated linear theories (lines) when the accuracy is high (the accuracy is indicated by copper-colored lines; see the legend). The simpler linear theory (equation 2.24; dashed lines) matches the slope of the full theory but exhibits a small offset. The more elaborate linear theory (equation 2.25; solid lines) provides a quite accurate fit of the full theory for high accuracy values.

In general, for and nonzero sensitivity, the integral cannot be solved analytically, but it can be numerically approximated to arbitrary precision (see Figure 14).

2.2.2  Accuracy in the High-Fidelity Regime

Next, we derive steps to approximate the accuracy in the regime of high-fidelity recall, following the rationale of previous analyses of VSA models (Plate, 2003; Gallant & Okaywe, 2013). This work showed that the accuracy of retrieval scales linearly with the number of neurons in the network (). We will compare our analysis results with those of previous analyses and with simulation results.

We now try to apply to the case what worked for (see equation 2.19): that is, to eliminate the integral in equation 2.12 by transforming to new variables for each of the distracters with . We can write with
formula
In analogy to the case, equation 2.19, we can rewrite the result of equation 2.12 for by
formula
2.20
with multivariate cumulative gaussian, (see the table entry in Owen's table of normal integrals; Owen, 1980).
The multivariate cumulative distribution, equation 2.20, would factorize only for uncorrelated variables, that is, when the covariance matrix is diagonal. The difficulty with is that the multiple variables are correlated: . The positive uniform off-diagonal entries in the covariance matrix mean that the covariance ellipsoid of the 's is aligned with the vector and thus also with the displacement vector (see Figure 2C). In the high signal-to-noise regime, the integration boundaries are far from the mean, and the exact shape of the distribution should not matter much (see Figure 2D). Thus, the first step takes the factorized approximation (FA) to the multivariate gaussian to approximate in the high signal-to-noise regime:
formula
2.21
Note that for in the low-fidelity regime, this approximation fails; the chance probability is when (see equation 2.17), but equation 2.21 yields , which is much too small for .
For an analytic expression, an approximate formula is needed for the one-dimensional cumulative gaussian, which is related to the complementary error function by . A well-known exponential upper bound on the complementary error function is the Chernoff-Rubin bound (CR; Chernoff, 1952). Later work (Jacobs, 1966; Hellman & Raviv, 1970) produced a tightened version of this bound: . Using , we obtain , which can be inserted into equation 2.21 as the next step to yield an approximation of :
formula
2.22
With a final approximation step, using the local error expansion (LEE) when is near 0, we can set and rewrite
formula
2.23
Solving for provides a simple law relating the sensitivity with the input dimension,
formula
2.24
where .
The approximation, equation 2.24, is quite accurate (see Figure 2E, dashed lines) but not tight. Even if equation 2.21 was tight in the high-fidelity regime, there would still be a discrepancy because the CR bound is not tight. This problem of the CR bound has been noted for a long time, enticing efforts to derive tight bounds, usually involving more complicated multiterm expressions (e.g., Chiani, Dardari, & Simon, 2003). Quite recently, Chang, Cosman, and Milstein (2011) studied one-term bounds of the complementary error function of the form . First, they proved that there exists no parameter setting for tightening the original Chernoff-Rubin upper bound. Second, they reported a parameter range where the one-term expression becomes a lower bound: for . The lower bound becomes the tightest with and . This setting approximates the complementary error function as well as an eight-term expression derived in Chiani et al. (2003). Following Chang et al. (2011), we approximate the cumulative gaussian with the Chang bound (Ch) and follow the same FA and LEE steps to derive a tighter linear fit to the true numerically evaluated integral,
formula
2.25
with . This law fits the full theory in the high-fidelity regime (see Figure 2E, solid lines), but it is not as accurate for smaller sensitivity values.
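The effect of the factorization step can be checked numerically. The sketch below contrasts the full accuracy integral with the factorized approximation applied to the difference variables, which are assumed to be gaussian with mean equal to the sensitivity and variance 2 (the variance of the difference of two independent unit-variance gaussians, as in the two-symbol case above); the approximation is accurate only in the high-fidelity regime.

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm

def p_corr_full(d, D):
    """Numerically evaluated accuracy (assumed form of the full integral)."""
    integrand = lambda x: norm.pdf(x, loc=d) * norm.cdf(x) ** (D - 1)
    return integrate.quad(integrand, -np.inf, np.inf)[0]

def p_corr_factorized(d, D):
    """Factorized approximation: treat the D-1 correlated difference variables
    z_j ~ N(d, 2) as if they were independent."""
    return norm.cdf(d / np.sqrt(2)) ** (D - 1)

for d in [1.0, 2.0, 3.0, 4.0]:
    print(f"d={d}: full={p_corr_full(d, 27):.4f}  factorized={p_corr_factorized(d, 27):.4f}")
```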

2.2.3  Information Content and Memory Capacity

Memory capacity for symbolic input. The information content (Feinstein, 1954) is defined as the mutual information between the true sequence and the sequence retrieved from the superposition state . The mutual information between the individual item that was stored time steps ago () and the item that was retrieved () is given by
formula
where is the Kullback-Leibler divergence (Kullback & Leibler, 1951).
For discrete input sequences, because the sequence items are chosen uniformly at random from the set of symbols, both the probability of a particular symbol as input and the probability of a particular symbol as the retrieved output are the same: . The integral evaluates the conditional probability that the output item is the same as the input item:
formula
To evaluate the terms, is needed to compute the probability of choosing the incorrect symbol given the true input. The probability that the symbol is retrieved incorrectly is , and each of the distracters is equally likely to be the incorrectly retrieved symbol, thus:
formula
Plugging these into the mutual information and simplifying,
formula
2.26
where is the Bernoulli distribution. Note that the mutual information per item can be expressed as the Kullback-Leibler divergence between the actual recall accuracy and the recall accuracy achieved by chance, .
The total mutual information is the sum of the information for each item in the full sequence:
formula
2.27

Note that if the accuracy is the same for all items, then , and by setting , one obtains the entire input information: .
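A short sketch of these information measures for symbolic retrieval, using p_corr for the retrieval accuracy, D for the alphabet size, and M for the sequence length as assumed symbol names; the per-item expression is the KL divergence between the actual recall accuracy and chance-level recall described above.

```python
import numpy as np

def info_per_item(p_corr, D):
    """Mutual information (bits) between a stored and a retrieved symbol, assuming the correct
    symbol is retrieved with probability p_corr and each of the D-1 distracters is equally
    likely otherwise."""
    p_err = 1.0 - p_corr
    info = np.log2(D) + p_corr * np.log2(p_corr)
    if p_err > 0:
        info += p_err * np.log2(p_err / (D - 1))
    return info

def total_info(p_corr, D, M):
    """Total retrieved information for a sequence of M items with equal accuracy."""
    return M * info_per_item(p_corr, D)

print(info_per_item(1.0, 27))     # perfect recall: log2(27) bits per item
print(total_info(0.9, 27, 50))    # lossy recall of a 50-item sequence
```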

Memory capacity for analog input. For analog inputs, we can compute the information content if the components of input vectors are independent with gaussian distribution, . In this case, the distributions of the readout and the joint distribution of input and readout are also gaussian. Therefore, the correlation between and is sufficient to compute the information (Gel'fand & Yaglom, 1957), with , where is the correlation between the input and output. There is a simple relation between the signal correlation and the SNR (see equation 2.9): , which gives the total information:
formula
2.28
The information content of a network is measured in bits per neuron. The memory capacity is then the maximum of equations 2.27 and 2.28 over all parameter settings of the network. The network has extensive memory when is positive, as grows large.
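For the analog case, the per-component information follows the gaussian-channel relation quoted above. A minimal sketch, assuming the standard identity rho^2 = SNR/(1 + SNR):

```python
import numpy as np

def analog_info_per_component(snr):
    """Bits per readout component for gaussian inputs read out linearly."""
    rho2 = snr / (1.0 + snr)            # squared correlation between input and readout
    return -0.5 * np.log2(1.0 - rho2)   # equals 0.5 * log2(1 + SNR)

print(analog_info_per_component(1.0))    # 0.5 bits
print(analog_info_per_component(10.0))   # approx 1.73 bits
```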

2.3  VSA Indexing and Readout of Symbolic Input Sequences

In this section, we analyze the network model, equation 2.1, with linear neurons, and without neuronal noise. After a reset to , the network receives a sequence of discrete inputs. Each input is a one-hot vector, representing one of symbols; we will show examples with alphabet size of , representing the 26 English letters and the space character. The readout, equation 2.2, involves the matrix and the winner-take-all function, with a scaling constant. This setting is important because, as we will show in the next section, it can implement the working memory operation in various VSA models from the literature.

In this case, a sequence of inputs into the RNN, equation 2.1, produces the following memory vector:
formula
2.29
Under the conditions 2.4 to 2.7, the linear part of the readout, equation 2.8, results in a sum of independent random variables:
formula
2.30
Note that under the conditions 2.4 to 2.7, each is independent, and thus is a gaussian by the central limit theorem for large . The mean and variance of are given by and .
The quantity in equation 2.30 can be written as
formula
2.31
Given the conditions 2.4 to 2.7, the moments of can be computed:
formula
2.32
formula
2.33
with , being the mean and variance of , the distribution of a component in the codebook , as defined by equation 2.4.

Note that with linear neurons and unitary recurrent matrix, the argument can be dropped because there is no recency effect and all items in the sequence can be retrieved with the same accuracy.

For networks with large enough, . By inserting and into equation 2.11, the accuracy then becomes
formula
2.34
Analogous to equation 2.12 for large , the expression simplifies further to
formula
2.35
Interestingly, expression 2.35 is independent of the statistical moments of the coding vectors and thus applies to any distribution of coding vectors, equation 2.4. Since is the ratio of to , it is easy to see that this network will have extensive capacity when is held constant:
formula
2.36
However, the relationship between the parameter values that actually maximizes the mutual information is complex. We will explore this in section 2.3.5. But first, in sections 2.3.1 to 2.3.4, we will show that , or a simple rescaling of it, describes the readout quality in many different VSA models.

2.3.1  VSA Models from the Literature

Many connectionist models from the literature can be directly mapped to equations 2.1 and 2.2 with the settings described at the beginning of section 2.3. In the following, we will describe various VSA models and the properties of the corresponding encoding matrix and recurrent weight matrix . We will determine the moments of the code vectors required in equation 2.34 to estimate the accuracy with in the general case (for small ). For large values, we will show that all models perform similarly, and the accuracy can be predicted by the universal sensitivity formula .

In hyperdimensional computing (HDC; Gayler, 1998; Kanerva, 2009), symbols are represented by -dimensional random i.i.d. bipolar high-dimensional vectors (hypervectors), and referencing is performed by a permutation operation (see section 4.1.1). Thus, network 2.1 implements encoding according to HDC when the components of the encoding matrix are bipolar uniform random i.i.d. variables or ; that is, their distribution is a uniform Bernoulli distribution: , and is a permutation matrix, a special case of a unitary matrix.

With these settings, we can compute the moments of . We have , , , and , which can be inserted in equation 2.34 to compute the retrieval accuracy. For large , the retrieval accuracy can be computed using equation 2.35. We implemented this model and compared multiple simulation experiments to the theory. The theory predicts the simulations precisely for all parameter settings of , , and (see Figures 3A and 3B).
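A compact simulation of this HDC recall experiment is sketched below; the dimensions, sequence length, and helper names are assumptions for illustration. The measured accuracy can be compared against the prediction obtained from the numerical accuracy integral together with the sensitivity, assumed here to scale as sqrt(N/M) for N neurons and M stored items, consistent with the ratio described above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, M = 2000, 27, 100                  # neurons, alphabet size, sequence length (assumed values)

Phi = rng.choice([-1, 1], size=(N, D))   # bipolar HDC codebook
perm = rng.permutation(N)                # permutation matrix as the recurrent weights
inv_perm = np.argsort(perm)

seq = rng.integers(0, D, size=M)
x = np.zeros(N)
for s in seq:
    x = x[perm] + Phi[:, s]              # trajectory association: x <- P x + Phi a(t)

def recall(x_mem, k):
    """Recall the symbol entered k steps before the end of the sequence."""
    for _ in range(k):                   # undo the k permutations applied since storage
        x_mem = x_mem[inv_perm]
    return int(np.argmax(Phi.T @ x_mem))

accuracy = np.mean([recall(x, k) == seq[M - 1 - k] for k in range(M)])
print(f"empirical accuracy: {accuracy:.3f}, assumed sensitivity sqrt(N/M) = {np.sqrt(N / M):.2f}")
```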

Figure 3:

Classification retrieval accuracy: Theory and simulation experiments. The theory (solid lines) matches the simulation results (dashed lines) of the sequence recall task for a variety of VSA frameworks. Alphabet length in all panels except panel B is . (A) Accuracy of HDC code as a function of the number of stored items for different dimensions of the hypervector. (B) Accuracy of HDC with different and for constant . (C) Accuracy of HRR code and circular convolution as binding mechanism. (D) Accuracy of FHRR code and circular convolution as the binding mechanism. (E) Accuracy of FHRR using multiply as the binding mechanism. (F) Accuracy achieved with random encoding and random unitary recurrent matrix also performs according to the same theory.

Figure 3:

Classification retrieval accuracy: Theory and simulation experiments. The theory (solid lines) matches the simulation results (dashed lines) of the sequence recall task for a variety of VSA frameworks. Alphabet length in all panels except panel B is . (A) Accuracy of HDC code as a function of the number of stored items for different dimensions of the hypervector. (B) Accuracy of HDC with different and for constant . (C) Accuracy of HRR code and circular convolution as binding mechanism. (D) Accuracy of FHRR code and circular convolution as the binding mechanism. (E) Accuracy of FHRR using multiply as the binding mechanism. (F) Accuracy achieved with random encoding and random unitary recurrent matrix also performs according to the same theory.

In holographic reduced representation (HRR) (Plate, 1993, 2003), symbols are represented by vectors drawn from a gaussian distribution with variance : . The binding operation is performed by circular convolution, and trajectory association can be implemented by binding each input symbol to successive convolutional powers of a random key vector, . According to Plate (1995), the circular convolution operation can be transformed into an equivalent matrix multiply for a fixed vector by forming the circulant matrix from the vector (i.e., ). This matrix has elements (where the subscripts on a are interpreted modulo ). If , the corresponding matrix is unitary. Thus, HRR trajectory association can be implemented by an RNN with a recurrent circulant matrix and encoding matrix with entries drawn from a normal distribution. The analysis described for HDC carries over to HRR, and the error probabilities can be computed through the statistics of , with , giving , and with , giving . We compare simulations of HRR to the theoretical predictions in Figure 3C.
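The following sketch illustrates these two facts numerically: circular convolution with the key equals multiplication by its circulant matrix, and that matrix is unitary when the key's Fourier components all have unit magnitude. The construction of the key through random Fourier phases and the use of scipy.linalg.circulant are assumptions for illustration.

```python
import numpy as np
from scipy.linalg import circulant

rng = np.random.default_rng(3)
N = 256

# HRR key whose Fourier components all have unit magnitude -> circulant matrix is unitary
half_spectrum = np.exp(1j * rng.uniform(-np.pi, np.pi, N // 2 + 1))
half_spectrum[0] = half_spectrum[-1] = 1.0          # DC and Nyquist bins must be real for a real key
key = np.fft.irfft(half_spectrum, n=N)

W = circulant(key)                                  # W @ v equals circular convolution of key with v
v = rng.normal(0, 1 / np.sqrt(N), N)
conv = np.real(np.fft.ifft(np.fft.fft(key) * np.fft.fft(v)))   # reference circular convolution via FFT

print(np.allclose(W @ v, conv))                     # True: binding implemented as a matrix multiply
print(np.allclose(W.T @ W, np.eye(N)))              # True: the recurrent matrix is orthogonal
```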

The Fourier holographic reduced representation (FHRR) (Plate, 2003) framework uses complex hypervectors as symbols, where components lie on the complex unit circle and have random phases: , with a phase angle drawn from the uniform distribution . The network implementation uses complex vectors of a dimension of . Since each vector component is complex, there are numbers to represent: one for the real part and one for the imaginary part (Danihelka et al., 2016). The first rows of the input matrix act on the real parts, and the second act on the imaginary parts. Trajectory association can be performed with a random vector with complex elements acting as the key, raising the key to successive powers and binding this with each input sequentially. In FHRR, either element-wise multiplication or circular convolution can be used as the binding operation, and trajectory association can be performed to encode the letter sequence with either mechanism (see section 4.1.2 for further details). These are equivalent to an RNN with the diagonal of as the key vector or as being the circulant matrix of the key vector.

Given that each draw from is a unitary complex number with , the statistics of are given by , , giving . For the variance, let and . Then . Letting , it is easy to see that it is also the case that . Therefore, and giving . Again, we simulate such networks and compare to the theoretical results (see Figures 3D and 3E).

A random unitary matrix acting as a binding mechanism has also been proposed in the matrix binding with additive terms framework (MBAT) (Gallant & Okaywe, 2013). Our theory also applies to equivalent RNNs with random unitary recurrent matrices (created by QR decomposition of a random gaussian matrix), with the same (see Figure 3F). Picking an encoding matrix and unitary recurrent matrix at random satisfies the required assumptions 2.4 to 2.7 with high probability when is large.
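A random orthogonal recurrent matrix of this kind can be generated as follows; this is a standard construction, and the sign correction that makes the distribution uniform over the orthogonal group is an extra detail not required by the text.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 500

# random orthogonal recurrent matrix via QR decomposition of a gaussian matrix (MBAT-style)
Q, R = np.linalg.qr(rng.normal(size=(N, N)))
W = Q * np.sign(np.diag(R))               # flip column signs for a uniform (Haar) distribution

print(np.allclose(W.T @ W, np.eye(N)))    # True: W is orthogonal
```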

2.3.2  Sparse Input Sequences

We next analyze detection retrieval of sparse input sequences, in which the input data vector is nonzero only with some probability . The readout must first decide whether an input was present and determine its identity if present. With a random input matrix, linear neurons, and a unitary recurrent matrix, the sensitivity is . The crosstalk noise increases only when the input is a nonzero (one-hot) vector. The threshold setting trades off hit and correct rejection accuracy (miss and false-positive errors). We illustrate this in Figure 4A using equations 2.14 to 2.16 describing retrieval accuracy. The readout performance for sparse data sequences depends only on the product . Thus, it appears possible to recover memory items with high sensitivity even for large sequence lengths if is very small. However, our theoretical result requires assumptions 2.4 to 2.7, which break down for extremely long sequences. The result does not account for this breakdown and is optimistic for this scenario. Previous results that consider such extremely sparse input sequences have used the methods of compressed sensing and sparse inference (Ganguli & Sompolinsky, 2010; Charles, Yap, & Rozell, 2014; Charles, Yin, & Rozell, 2017), and show that recovering sparse input sequences with is possible.

Figure 4:

Detection retrieval accuracy, encoding sparsity, and noise. (A) Retrieval of a sparse input sequence (, chance for a zero vector). The hit and correct rejection performance for simulated networks (dashed lines) with different detection thresholds matches the theory (solid lines): a rejection is produced if . (B) Performance is not affected by the level of encoding sparsity until catastrophic failure when all elements are 0. (C) Simulations (dashed lines) match theory (solid lines) for networks corrupted by gaussian noise (top) and random bit-flips (bottom).

2.3.3  Sparse Codebook

Often neural activity is characterized as sparse, and some VSA models use sparse codebooks. Several studies point to sparsity as an advantageous coding strategy for connectionist models (Rachkovskij, 2001). Sparse codes can be studied within our framework, equation 2.1, with a sparse input matrix, that is, a random matrix in which elements are zeroed out with a probability referred to as the sparseness factor (sf). Sparsity in the codebook affects the signal and the noise equally, and the effect cancels out to produce the same sensitivity, , as with a nonsparse input matrix. Thus, sparsity essentially has no effect on the capacity of the memory, up to the catastrophic point of sparsity where entire columns in the input matrix become zero (see Figure 4B).

2.3.4  Neuronal Noise

Here, we consider the case where each neuron experiences i.i.d. gaussian noise in each time step in addition to the data input. The effect of the noise depends on the ratio between noise variance and the variance of a component in the code vectors . The sensitivity with neuronal noise is
formula
2.37
Thus, accumulated neuronal noise merely rescales the sensitivity by a constant factor.

There are other ways to model noise in the network. For the case where there is only white noise added during the retrieval operation, it is easy to see that this noise will be added to the variance of , giving . If the noise were instead a bit-flip in the readout hypervector, with the probability of a bit-flip , then this gives . Finally, with these derivations of and equation 2.12, the empirical performance of simulated neural networks is predicted (see Figure 4C).

2.3.5  Memory Capacity of VSAs with Symbolic Input

The original estimate for the capacity of distributed random codes (Plate, 1993) considered a slightly different setup (see section 4.2.2) but followed ideas similar to the FA-CR-LEE high-fidelity approximation, equation 2.24, and we reformulated the Plate (1993) derivation to compare it with our analysis. This work first showed that random indexing has linear extensive capacity and that the memory capacity is at least 0.16 bits per neuron. Figure 5A compares the approximations, equations 2.21, 2.23, and 2.24, with the full theory evaluated numerically, equation 2.12. These approximations are good in the high-fidelity regime, where is near 1, but they underestimate the performance in the low-fidelity regime.

Figure 5:

Information content and memory capacity. (A) Approximations of retrieval accuracy derived in section 2.2.2 and Plate (1993) are compared to the numerically evaluated accuracy (). The approximations underestimate the accuracy in the low-fidelity regime (, ). (B) Total information content retrieved and memory capacity (solid points). High-fidelity retrieval recovers nearly all of the stored information (thin black line, , equation 2.27), but the true memory capacity is somewhat into the low-fidelity regime. (C) Retrieved information measured in simulations (dashed lines) compared to the predictions of the theory (solid lines). The memory capacity is dependent on . (D) Memory capacity as a function of (solid line) and information of the input sequence at retrieval maximum (, dashed). (E) Maximum information retrieved (solid black line) and total information stored (, dashed), where is a significant fraction of (). The retrieved information for fixed sequence lengths are plotted (green lines of different shades). For , retrieved and stored information come close; with larger , the gap grows.

With the relationship , the information contained in the activity state can be calculated. We compare the total information, equation 2.27, based on the approximations, with the true information content determined by numerical evaluation of (see Figure 5B). In the linear scenario with a unitary weight matrix, has no dependence on , and so the total information in this case is simply , equation 2.27.

In the high-fidelity regime with and small , we can estimate, with equation 2.24,
formula
2.38
We can see that for any fixed, finite , the information per neuron depends on the admitted error and vanishes for . If the alphabet size is growing with and for fixed small error , the asymptotic capacity value is . Our best high-fidelity approximation, equation 2.25, increases the total capacity bound above previous estimates to 0.39.

Results for the case of a finite alphabet of moderate size () are shown in Figure 5B. The novel and most accurate high-fidelity approximation, equation 2.25, predicts 0.27 bits per neuron; the simpler high-fidelity approximations substantially underestimate the capacity.

Importantly, however, our full theory shows that the true information maximum lies outside the high-fidelity regime. The maximum capacity for is nearly 0.4 bits per neuron (see Figure 5B, black circle). Thus, the achievable capacity is about four times larger than previous analysis suggested.

In a wide range of , our full theory precisely predicts the empirically measured total information in simulations of the random sequence recall task (see Figure 5C). The total information per neuron scales linearly with the number of neurons, and the maximum amount of information per element that can be stored in the network is dependent on . The memory capacity increases with , reaching over 0.5 bits per neuron for large (see Figure 5D, solid line).

Capacity without superposition. As the alphabet size, , grows superlinearly in (approaching ), one needs to reduce in order to maximize the memory capacity (see Figure 5E, dashed line). The theory breaks down when there is no superposition, that is, when . This case is different because there is no crosstalk, but for very large and randomly generated code vectors, collisions arise, another source of retrieval errors. Collisions are coincidental duplications of code vectors. The theory presented so far can describe the effects of crosstalk but not of collisions.

For completeness, we briefly address the case of and very large . This crosstalk-free case shows that it is possible to achieve the theoretically optimal capacity of 1 bit per neuron and that crosstalk otherwise immediately limits the achievable capacity. If code vectors are drawn i.i.d. with a uniform Bernoulli distribution , then the probability of accurately identifying the correct code word is
formula
2.39
where is the probability of a vector having collisions with other vectors in the codebook of vectors, which is given by the binomial distribution,
formula
2.40
where is the likelihood of a pair of vectors colliding. The collisions reduce the accuracy to for in the limit . However, this reduction does not affect the asymptotic capacity. It is bits per neuron as , the same as for a codebook without collisions (see section 4.4.3). The effects of collisions at finite sizes can be seen in Figures 5E and 18A.

In the presence of superposition, that is, for , the crosstalk noise immediately becomes the limiting factor for memory capacity. This is shown in Figure 5E for a small network of 100 neurons. For , the memory capacity drops to around 0.5 bits per neuron and decreases to about 0.2 bits per neuron for large values (see Figure 5E, black line). The capacity curves for fixed values of (see Figure 5E, green lines) show the effect of crosstalk noise, which increases as more items are superposed (as increases). For (see Figure 5E, dark green line), equations 2.39 and 2.27 can be evaluated as grows to .

2.4  Indexed Memory Buffers with Symbolic Input

With the linear encoding network described in the previous section, there is no recency effect; the readout of the most recent input stored is just as accurate as the readout of the earliest input stored. In contrast, a network with a contracting recurrent weight matrix (see section 2.4.1) or nonlinear neural activation functions (see sections 2.4.2 and 2.4.3) will have a recency effect. In this section, the theory will be developed to describe memory in encoding networks (see equation 2.1) with recency effects. We juxtapose the performance of networks with recency effects in our two memory tasks, reset memory and memory buffer. We show that reset memories yield optimal capacity in the absence of any recency effect (Lim & Goldman, 2011). However, recency effects can avoid catastrophic forgetting when the input sequence is infinite. Through the recency effect, sequence items presented further back in time will be attenuated and eventually forgotten. Thus, the recency effect is a soft substitute for resetting the memory activity before the input of interest is entered. Further, we show that contracting recurrent weights and saturating neural activation functions have very similar behavior if their forgetting time constants are aligned (see section 2.4.4). Finally, we optimize the parameters of memory buffers for high memory capacity and show that extensive capacity (Ganguli et al., 2008) is achievable even in the presence of neuronal noise by keeping the forgetting time constant proportional to .

2.4.1  Linear Neurons with Contracting Recurrent Weights

Reset memory. Consider a network of linear neurons in which the attenuation factor contracts the network activity in each time step. After a sequence of input symbols has been applied, the variance of is , and the signal decays exponentially with . The sensitivity for recalling the input that was added time steps ago is
formula
2.41
Thus, the sensitivity decays exponentially as increases, and the highest retrieval accuracy is from the most recent item stored in memory. The accuracy (see Figure 6A1) and information per item (see Figure 6B1) based on this formula for show the interdependence between the total sequence length () and the look-back distance () in the history.

In equation 2.41, the sensitivity is monotonically increasing as increases, and thus to maximize it for the th element in history given a finite set of stored tokens, we would want to maximize or have the weight matrix remain unitary with .1 The memory capacity is maximized as when is finite (see Figure 6C1) and as grows large (see Figure 6D1), and there is no benefit of contracting weights in reset memories.

Memory buffer. For an infinite stream of inputs, , the setting results in catastrophic forgetting. However with , the memory can operate even in this case because past signals fade away and make room for storing new inputs. These networks with , , and unitary have been denoted normal networks (White et al., 2004).

The value of affects both the signal and the crosstalk noise. For large , the noise variance is bounded by , and the network reaches its saturated equilibrium state. The sensitivity for the th element back in time from the saturated state is
formula
2.42
The theory (see equations 2.42 and 2.12) predicts the performance of simulated networks with contracting recurrent weights that store a sequence of symbols with (see Figure 6A2, solid lines) for different (see Figure 6A2, dashed lines). The information per item retrieved (see Figure 6B2) and the total information (see Figure 6C2) for different values of show the trade-off between fidelity and duration of storage. There is an ideal value that maximizes the memory capacity with for given and . This ideal value differs depending on the alphabet size () (see Figure 6D2). For larger alphabets (meaning more bits per symbol), the inputs should be forgotten more quickly, and memorizing a shorter history optimizes the memory capacity (see Figure 6E). The values of that maximize the memory capacity were computed numerically; they drop with increasing and (see Figure 6F).
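The following sketch simulates such a memory buffer, again using the bipolar codebook and permutation matrix of the HDC setting, with the attenuation factor written as lam; all parameter values are assumptions. It illustrates the recency effect: recall accuracy decays with the look-back distance into the stream.

```python
import numpy as np

rng = np.random.default_rng(5)
N, D, lam, T = 1000, 27, 0.98, 1500      # neurons, alphabet, attenuation, stream length (assumed)

Phi = rng.choice([-1, 1], size=(N, D))
perm = rng.permutation(N)
inv_perm = np.argsort(perm)

def run_stream():
    """Feed a long random symbol stream into the buffer and return it with the final state."""
    stream = rng.integers(0, D, size=T)
    x = np.zeros(N)
    for s in stream:
        x = lam * x[perm] + Phi[:, s]    # memory buffer: contracting recurrent weights lam * P
    return stream, x

def recall(x_mem, k):
    """Recall the symbol stored k steps before the end of the stream."""
    for _ in range(k):
        x_mem = x_mem[inv_perm]          # undo the permutations applied since storage
    return int(np.argmax(Phi.T @ x_mem))

trials = [run_stream() for _ in range(20)]
for k in [0, 25, 50, 75, 100]:
    acc = np.mean([recall(x, k) == s[T - 1 - k] for s, x in trials])
    print(f"look-back {k:3d}: accuracy {acc:.2f}")
```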
Figure 6:

Linear network with contracting recurrent weights. (A1) Accuracy in networks with . Multiple evaluations of are shown as a function of for sequences of different lengths, . (, ). (B1) The information per item also depends on . (C1) The total retrieved information per neuron for different . The maximum is reached as approaches 1 when is finite (; ). (D1) The retrieved information is maximized as grows large (). (A2) Accuracy in networks with as (; ). (B2) Information per item. (C2) Total information retrieved as a function of the total information stored for different . There is a that maximizes the information content for a given and (). (D2) Total information retrieved as a function of the total information stored for different (). Retrieved information is optimized by a particular combination of and . (E) The total retrieved information per neuron versus the information stored per neuron for different and over a wide range. As increases, the information is maximized by decreasing . (F) Numerically determined values that maximize the information content of the network with for different and .

2.4.2  Neurons with Clipped-Linear Transfer Function

Squashing nonlinear neural activation functions induce a recency effect, similar to contracting recurrent weights. Consider equation 2.1 with and the clipped-linear activation function, , in which the absolute value of the activity of neurons is limited by :
formula
2.43
Clipping functions of this type with specific -values play a role in VSAs that constrain the activation of memory vectors, such as the binary-spatter code (Kanerva, 1996) or the binary sparse-distributed code (Rachkovskij, 2001).

We will analyze the HDC encoding scheme, a network with a bipolar random input matrix and the recurrent weights a permutation matrix. With this, the components of will always assume integer values, and due to the clipping, the components will be confined to . As a consequence, , defined as in equation 2.30, will also assume values limited to . To compute , we need to track the mean and variance of . This requires iterating the Chapman-Kolmogorov equation 2.3. To do so, we introduce a vector with , which tracks the probability distribution of . The probability of each of the integers from is enumerated in the indices of the vector , and is a bijective map from the values of to the indices of with inverse . To compute the sensitivity of a particular recall, we need to track the distribution of with before the item of interest is added, when the item of interest is added, and in the time steps after storing the item of interest. Note that is defined relative to the standard deviation of the codebook, . A simple scaling can generalize the following analysis to account for codebooks with different variance.

Reset memory. At initialization , and so . For each time step that an input arrives in the sequence prior to the input of interest, a or will randomly add to up until the bounds induced by , and this can be tracked with the following diffusion of :
formula
2.44
Once the vector of interest arrives at , all entries in will have added. This causes the probability distribution to skew:
formula
2.45
The inputs following the input of interest will then again cause the probability distribution to diffuse further based on equation 2.44. Finally, can be computed for this readout operation by calculating the mean and variance with :
formula
2.46
formula
2.47

Memory buffer. For , the diffusion equation 2.44 will reach a uniform equilibrium distribution with the values of uniformly distributed between : . This means that, as with contracting recurrent weights, the clipped-linear function bounds the noise variance of the saturated state. Here, the variance bound of is the variance of the uniform distribution, . Thus, information can still be stored in the network even after being exposed to an infinite sequence of inputs. The sensitivity in the saturated state can be calculated with by replacing in equation 2.45 with , and then again using the diffusion equation 2.44 for the items following the item of interest.
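The evolution of this distribution can be sketched as a clipped random walk. The code below assumes the bipolar HDC setting (each stored item shifts a component by plus or minus one), an illustrative clipping bound kappa = 4, and the uniform equilibrium as the starting point of the memory buffer; it reproduces the qualitative behavior described here, a skewed distribution at storage time that diffuses back toward the uniform equilibrium while the variance stays bounded.

```python
import numpy as np

kappa = 4
states = np.arange(-kappa, kappa + 1)       # possible integer values of a clipped component

def diffuse(p):
    """One stored item: each component moves +1 or -1 with probability 1/2, clipped at +/-kappa."""
    q = np.zeros_like(p)
    q[1:]  += 0.5 * p[:-1]                  # upward moves from all states below the top
    q[:-1] += 0.5 * p[1:]                   # downward moves from all states above the bottom
    q[-1]  += 0.5 * p[-1]                   # upward move from the top state is clipped in place
    q[0]   += 0.5 * p[0]                    # downward move from the bottom state is clipped in place
    return q

# memory-buffer case: start from the uniform equilibrium over the clipped range
p = np.ones_like(states, dtype=float) / len(states)

# storing the item of interest shifts every component by +1 (for a +1 code entry), clipped at +kappa
p_signal = np.zeros_like(p)
p_signal[1:] += p[:-1]
p_signal[-1] += p[-1]

# items arriving afterwards diffuse the distribution back toward the uniform equilibrium
for k in range(5):
    mean = np.sum(states * p_signal)
    var = np.sum(states**2 * p_signal) - mean**2
    print(f"look-back {k}: signal mean {mean:.3f}, variance {var:.3f}")
    p_signal = diffuse(p_signal)
```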

Figure 7 illustrates this analysis of the distribution of . When the item of interest is added, the probability distribution is most skewed, and the signal degradation is relatively small. As more items are added later, the distribution diffuses to the uniform equilibrium and the signal decays to 0. The figure compares operation with the initial states corresponding to reset memory and memory buffer: empty (see Figures 7A1 to 7F1) and saturated (see Figures 7A2 to 7F2). Numerical optimization of memory capacity shows how increases with and decreases with , the parameter when (see Figure 7G).

Figure 7:

Capacity for neurons with clipped-linear transfer function. (A1) The probability distribution of for retrieval of the first sequence item, as the sequence length is increased. The distribution evolves according to equations 2.44 and 2.45; it begins at a delta function (dark blue) and diffuses to the uniform equilibrium distribution when is large (light blue). (B1) The clipped-linear function causes the signal to degrade as more items are stored (; ; ). (C1) The variance of the distribution grows as more items are stored but is bounded. (D1) The accuracy theory fits empirical simulations, decoding the first input as more input symbols are stored (dashed lines; ; ; ). (A2) The probability distribution of for the memory buffer, that is, when . The most recent symbol encoded (dark blue) has the highest skew, and the distribution diffuses to the uniform equilibrium for readout further in the past (light blue). (B2) The signal is degraded from crosstalk and decays as a function of the look-back. (C2) The noise variance is already saturated and stays nearly constant. (D2) The accuracy exhibits a trade-off between fidelity and memory duration governed by . (E1) With reset memory, the information that can be decoded from the network reaches a maximum when is large (). (F1) The capacity increases with (). (E2) When , there is a trade-off between fidelity and memory duration; a particular value maximizes the retrieved information for a given and (). (F2) For a given memory duration (), an intermediate value maximizes the retrieved information. (G) The memory duration that maximizes the retrieved information.

2.4.3  Neurons with Squashing Nonlinear Transfer Function

The case when the neural transfer function is a saturating or squashing function with bounded by a constant also implies is bounded with . For any finite fixed error, one can choose an large enough so that the distribution can be approximated by discretizing the state space into equal bins in . As with the clipped-linear transfer function, one can construct a bijective map from values to indices and track