The ability to encode and manipulate data structures with distributed neural representations could qualitatively enhance the capabilities of traditional neural networks by supporting rule-based symbolic reasoning, a central property of cognition. Here we show how this may be accomplished within the framework of Vector Symbolic Architectures (VSAs) (Plate, 1991; Gayler, 1998; Kanerva, 1996), whereby data structures are encoded by combining high-dimensional vectors with operations that together form an algebra on the space of distributed representations. In particular, we propose an efficient solution to a hard combinatorial search problem that arises when decoding elements of a VSA data structure: the factorization of products of multiple codevectors. Our proposed algorithm, called a resonator network, is a new type of recurrent neural network that interleaves VSA multiplication operations and pattern completion. We show in two examples (parsing of a tree-like data structure and parsing of a visual scene) how the factorization problem arises and how the resonator network can solve it. More broadly, resonator networks open the possibility of applying VSAs to myriad artificial intelligence problems in real-world domains. The companion article in this issue (Kent, Frady, Sommer, & Olshausen, 2020) presents a rigorous analysis and evaluation of the performance of resonator networks, showing that they outperform alternative approaches.
Cognition requires making use of learned knowledge in contexts never before encountered, a facility that requires information to be represented in terms of components that may be flexibly recombined. A long-standing goal for neuroscience and psychology has been to understand how such capacities are expressed by neural networks in the brain. Early artificial intelligence researchers developed frameworks of symbol manipulation to emulate cognition, but they were implemented with local data representations (where the meaning of a bit is tied to its location) that are brittle and nonadaptive (Kanerva, 1997). Connectionism, a movement started in psychology (McClelland, Rumelhart, & PDP Research Group, 1986), based itself on the premise that internal representations of knowledge must be highly distributed and be able to adapt to the statistics of the data so as to learn by example. Along the way, however, connectionism also gave up many of the rich capabilities offered by symbolic computation (Jackendoff, 2002). In recent years, it has become clear that a unification of the ideas behind each approach—distributed representation, adaptivity, and symbolic manipulation—will be required for reproducing the brain's ability to learn from few examples, to deal with novel situations, or to change behaviors when driven by internal information processing rather than purely by external events (Plate, 2003; Gayler, 2003; Kanerva, 2009; Lake, Ullman, Tenenbaum, & Gershman, 2017).
Digital computers owe their power and ubiquity to the abstraction of data structures, which support decomposing information into parts, referencing each part individually, and composing these parts with other data structures. Examples include trees, records with fields, and linked lists. Connectionist theories have long been criticized because it is hard to imagine how compound, hierarchical data structures could be represented and manipulated by neural networks (Hinton, 1990). Cognitive scientists have argued that at the very least, cognitive data structures should support three patterns of combination, which are familiar to any computer programmer (Fodor & Pylyshyn, 1988):
Key-value pairs: A key or variable is a placeholder for information to which a value can be assigned in a particular instance. This association, variable binding, generates what is called the systematicity of cognition (Fodor, 1975; Plate, 2003).
Sequential structures: A sequence is an ordered pattern of organization and computation required by many reasoning tasks.
Hierarchy: The notion that some aspects of knowledge can be decomposed recursively into a set of successively more fundamental parts.
Variable binding, sequence, and hierarchy are critical structures of cognition, and a comprehensive theory of intelligence must take these into account.
A family of models called Vector Symbolic Architectures (VSAs) encodes these structures into distributed representations, providing a framework that can reconcile the symbolic and connectionist perspectives (Plate, 2003; Gayler, 2003; Kanerva, 2009). Building on the concept of reduced representations (Hinton, 1990), VSAs allow one to express data structures holographically in a vector space of high but fixed dimensionality. The atoms of representation are random high-dimensional vectors, and data structures built from these atoms are vectors with the same dimension. Three operations are used to form and manipulate data structures—addition, multiplication, and permutation—which together form an algebra over the space of high-dimensional vectors. These operations enable building representations of sets, ordered lists (sequences), n-tuples, trees, key-value bindings, and records containing role-filler relationships which can be composed into hierarchies, as described in Plate (1995); Kanerva (1996, 1997); Joshi, Halseth, and Kanerva (2016); Frady, Kleyko, and Sommer (2018), and below.
In order to read out or access the components of a VSA-encoded data structure, the high-dimensional vector representing it must be decomposed into the primitives or atomic vectors from which it is built. This is the problem of decoding. For example, if the primitives are combined by addition only, the distributed representation can be decoded by a nearest-neighbor look-up or an autoassociative memory. However, hierarchical or compound data structures, such as a multilevel tree or an object with multiple attributes bound together, are built from combinations of addition, multiplication, and permutation operations on the primitives. In this case, decoding via a simple nearest-neighbor look-up would require storing every possible combination of the primitives (e.g., all possible paths in a tree or all the possible attribute combinations) essentially amounting to a combinatoric search problem. Past applications of VSAs have largely sidestepped this problem by limiting the depth of the data structures or using a brute force approach to consider all possible combinations when necessary (Plate, 2000a; Cox, Kachergis, Recchia, & Jones, 2011). As a result, the application of VSAs to real-world problems has been rather limited, since up to now, there has not been a solution for efficiently accessing elements of such compound data structures containing a product of multiple components.
The solution to this dilemma is to factorize the high-dimensional vector representing a compound data structure into the primitives from which it is composed. That is, given a high-dimensional vector formed from an element-wise product of two or more vectors, we must find its factors. This way, a nearest-neighbor look-up need only search over the alternatives for each factor individually rather than all possible combinations. Obviously, though, factorization poses a difficult computational problem in its own right.
Here, we propose an efficient algorithm for factorizing high-dimensional vectors that may be interpreted as a type of recurrent neural network, which we call a resonator network. The resonator network relies on the VSA principle of superposition to search through the combinatoric solution space without directly enumerating all possible factorizations. Given a high-dimensional vector as input, the network iteratively searches through many potential factorizations in parallel until a set of factors is found that agrees with the input. Solutions emerge as stable fixed points in the network dynamics.
In this article, part 1 of a two-part series in this issue, we first briefly introduce the VSA framework and the problem of factoring high-dimensional VSA representations. We then show, using two examples (searching a binary tree and querying the contents of a visual scene), how VSAs may be used to build distributed representations of compound data structures and how resonator networks are used to decompose these data structures and solve the problem. The companion article in this issue (Kent, Frady, Sommer, & Olshausen, 2020) provides rigorous mathematical and simulation analysis of resonator networks and compares their performance with alternative approaches for solving high-dimensional vector factorization problems.
2 VSA Preliminaries
All entities in a VSA are represented as high-dimensional vectors in the same space, with vector dimension $N$ typically in the range of 1000 to 10,000. In this article, we focus on the VSA framework called Multiply-Add-Permute (MAP) (Gayler, 1998, 2003). The atomic primitives are bipolar vectors whose components are $\pm 1$, chosen randomly. These vectors are used as symbols to represent concepts. The set of atomic vectors representing specific items is stored in a codebook, which is a matrix of dimension $N \times D$, where $D$ is the number of atoms.
The use of high-dimensional vectors is an important aspect of the VSA framework, as it relies on the concentration of measure phenomenon (Ledoux, 2001) that independently chosen random vectors are very close to orthogonal, a property we refer to as quasi-orthogonality. This property allows vectors to act symbolically, as the similarity (inner product) between two different atomic vectors is small compared to their self-similarity (the squared L2 norm). Furthermore, a much larger set of quasi-orthogonal vectors exists than orthogonal vectors, which may be exploited for combinatoric search.
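Quasi-orthogonality is easy to check numerically. The following is a minimal sketch, assuming NumPy and illustrative values for the vector dimension and the number of atoms (the names `N`, `D`, and `X` are chosen here for illustration): it draws a random bipolar codebook and compares the self-similarity of the atoms with their pairwise similarities.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 50          # vector dimension and number of atoms (illustrative values)

# Codebook: each column is a random bipolar atom with components +1 or -1.
X = rng.choice([-1, 1], size=(N, D))

# Normalized inner products between all pairs of atoms.
G = (X.T @ X) / N
self_sim = np.diag(G)                       # similarity of each atom to itself
cross_sim = G[~np.eye(D, dtype=bool)]       # similarity between different atoms

print(self_sim.min(), self_sim.max())       # exactly 1 for bipolar vectors
print(np.abs(cross_sim).max())              # small, on the order of 1/sqrt(N)
```

The off-diagonal similarities shrink roughly as $1/\sqrt{N}$, which is why larger dimensions allow more atoms to coexist with little interference.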
Data structures are composed and computations are carried out via an algebra consisting of three vector operations: addition, multiplication, and permutation. The elements of a data structure are then read out (decoded) using the conventional vector dot product as a similarity measure to compare to items stored in the codebook. The VSA operations of addition, multiplication, and permutation act to manipulate the vector symbols in ways that preserve or destroy their similarity.
Formally, the VSA operations are defined as follows:
- Dot product ($\cdot$) is the conventional vector inner product, $\mathbf{x} \cdot \mathbf{y} = \sum_i x_i y_i$, which is used to measure the similarity between vectors. This is used to decode the result of a VSA computation by comparing the vector to the set of vectors in the codebook: $\mathbf{a} = \mathbf{X}^\top \mathbf{s}$.
Here, $\mathbf{X}$ is the codebook of atomic vectors, and $\mathbf{s}$ is a high-dimensional vector resulting from a VSA computation. The result of a VSA computation can be a single symbol indicated by the largest component of $\mathbf{a}$. Alternatively, the coefficients $\mathbf{a}$ can be considered as a weighted sum, where each entry indicates a confidence level, probability, or intensity value.
- Addition ($+$) is used to superpose items together, like forming a set. It is defined by regular vector addition, the element-wise sum:
$\mathbf{z} = \mathbf{x} + \mathbf{y}$, or $z_i = x_i + y_i$. Depending on the circumstances, the sum may be kept as is or subsequently thresholded so that each $z_i$ is $\pm 1$. In either case, the addition operation results in a vector that is similar to each of its superposed components; one can determine the members of the sum by similarity to the atomic vectors. Superposition is possible because of the quasi-orthogonal property. However, superposition produces a small amount of cross-talk noise, which increases with the number of items in the sum and is diminished with large vector dimensionality (see Frady et al., 2018, for a detailed characterization of superposition).
- Multiplication ($\odot$) is used to bind items together to form a conjunction, such as in assigning a value to a variable. It is defined by the Hadamard product between vectors, that is, the element-wise multiplication of vector components: $\mathbf{z} = \mathbf{x} \odot \mathbf{y}$, or $z_i = x_i y_i$. This multiplication operation is invertible ($\mathbf{x} \odot \mathbf{x}^{-1} = \mathbf{1}$), and it distributes over addition, $\mathbf{x} \odot (\mathbf{y} + \mathbf{z}) = \mathbf{x} \odot \mathbf{y} + \mathbf{x} \odot \mathbf{z}$. Note that in the MAP VSA, the bipolar primitive vectors are their own inverses ($\mathbf{x} \odot \mathbf{x} = \mathbf{1}$). In contrast to addition, multiplication generates a vector that is dissimilar to each of its inputs (Kanerva, 2009).
- Permutation ($\rho$) is used to “protect” or “order” items. It operates on a single input vector. In principle, it can be any random permutation, but is typically a simple cyclic shift: $\mathbf{z} = \rho(\mathbf{x})$, or $z_i = x_{(i-1) \bmod N}$. Permutation distributes over both addition, $\rho(\mathbf{x} + \mathbf{y}) = \rho(\mathbf{x}) + \rho(\mathbf{y})$, and multiplication, $\rho(\mathbf{x} \odot \mathbf{y}) = \rho(\mathbf{x}) \odot \rho(\mathbf{y})$, and its function is complementary to addition and multiplication. Permutations are used to protect the components of a data structure built with these other operations, based on the fact that permutation and binding are noncommutative, $\rho(\mathbf{x}) \odot \mathbf{y} \neq \mathbf{x} \odot \rho(\mathbf{y})$. In essence, permutation rotates vectors into dimensions of the space that are almost orthogonal to the dimensions used by the original vectors. Information is thus protected when combined with other items, because vector components will not appear similar to or interfere with those other items. Permutations can also be used to index sequences (Frady et al., 2018), or levels in a hierarchy, by successive application of the permutation operation. For example, the sequence $a, b, c, \ldots$ can be represented in a vector $\mathbf{s} = \mathbf{a} + \rho(\mathbf{b}) + \rho^2(\mathbf{c}) + \cdots$, with $\rho^2(\mathbf{x}) = \rho(\rho(\mathbf{x}))$.
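The four operations above can be sketched in a few lines. The following is a minimal illustration, assuming NumPy, bipolar atoms, and a cyclic shift as the permutation; the helper names `decode` and `rho` and the chosen sizes are hypothetical, not part of any library.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 10_000, 20                         # illustrative sizes
X = rng.choice([-1, 1], size=(N, D))      # codebook, one atom per column

def decode(s, codebook):
    """Similarity of s to every atom in the codebook (a = X^T s)."""
    return codebook.T @ s

def rho(x, k=1):
    """Permutation implemented as a cyclic shift by k positions (an assumption)."""
    return np.roll(x, k)

a, b, c = X[:, 0], X[:, 1], X[:, 2]

# Addition: the sum stays similar to each superposed atom.
s_sum = a + b + c
members = np.argsort(decode(s_sum, X))[-3:]          # three largest similarities

# Multiplication: the product is dissimilar to every atom, including its inputs,
# and binding again with b (its own inverse) recovers a exactly.
s_bind = a * b
bind_sim = np.abs(decode(s_bind, X)).max() / N
recovered = s_bind * b

# Permutation: successive shifts tag the position of each item in a sequence.
seq = rho(a, 0) + rho(b, 1) + rho(c, 2)
second = int(np.argmax(decode(rho(seq, -1), X)))     # undo one shift, then decode

print(sorted(members.tolist()), bind_sim, second)
```

Decoding the second sequence element works because undoing one shift leaves only that item aligned with the codebook; the other items remain permuted and therefore look like noise.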
VSAs combine these operations to form data structures and to compute with them. The combination of atomic vectors into composite data structures is rather straightforward. But as we shall see, querying composite data structures often results in the problem of decoding terms composed of two (or perhaps many more) atomic vectors that are multiplied together. In order to decode such composite vectors, one must search through many combinations of atoms. In general, this is a hard combinatorial search problem, which typically requires directly testing every combination of factors. The resonator network can efficiently solve these problems without needing to directly test every combination of factors.
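To make the cost of exhaustive search concrete, here is a minimal sketch of the brute-force baseline, assuming NumPy and three hypothetical codebooks `X`, `Y`, `Z` with `D` atoms each. It tests every combination of atoms, so the number of candidates grows as `D` raised to the number of factors.

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
N, D = 1_000, 8                           # illustrative sizes
# Three hypothetical codebooks, one per factor.
X, Y, Z = (rng.choice([-1, 1], size=(N, D)) for _ in range(3))

true = (3, 5, 1)
s = X[:, true[0]] * Y[:, true[1]] * Z[:, true[2]]    # composite vector to decode

def brute_force(s, codebooks):
    """Score every combination of atoms against s and keep the best one."""
    best, best_score, tested = None, -np.inf, 0
    index_ranges = (range(cb.shape[1]) for cb in codebooks)
    for idx in itertools.product(*index_ranges):
        candidate = np.prod([cb[:, i] for cb, i in zip(codebooks, idx)], axis=0)
        score = candidate @ s
        tested += 1
        if score > best_score:
            best, best_score = idx, score
    return best, tested

found, tested = brute_force(s, [X, Y, Z])
print(found, tested)                      # tested equals D**3
```

Even in this small case, 512 candidates are scored to recover three indices; with larger codebooks or more factors the count quickly becomes impractical.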
3 Factorization via Search in Superposition
The resonator network is an iterative approach to solve this problem without exhaustively searching through each possible combination of the factors. A key motivating idea behind resonator networks is the VSA principle of superposition. In VSAs, multiple symbols can be expressed simultaneously in a single high-dimensional vector via vector addition. Randomized atomic vectors are highly likely to be close to orthogonal in high-dimensional space, meaning that they can be superposed without much interference. However, there is some cross-talk noise between the superposed symbols, and “clean-up memory” (such as a Hopfield network) is thus utilized to reduce the cross-talk noise.
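A resonator network combines the strategy of superposition and clean-up memory to efficiently search over the combinatorially large space of possible factorizations. Consider a composite vector $\mathbf{s} = \mathbf{x} \odot \mathbf{y} \odot \mathbf{z}$ whose factors are drawn from the codebooks $\mathbf{X}$, $\mathbf{Y}$, and $\mathbf{Z}$. The vectors $\hat{\mathbf{x}}$, $\hat{\mathbf{y}}$, and $\hat{\mathbf{z}}$ represent the current estimate for each factor. These vectors can be initialized to the superposition of all possible factors, for example, $\hat{\mathbf{y}}(0) = \sum_j \mathbf{y}_j$, $\hat{\mathbf{z}}(0) = \sum_k \mathbf{z}_k$. A particular factor can then be inferred from $\mathbf{s}$ based on the estimates for the other two, for example, $\hat{\mathbf{x}}(1) = \mathbf{s} \odot \hat{\mathbf{y}}(0) \odot \hat{\mathbf{z}}(0)$. Since binding distributes over addition, the product $\hat{\mathbf{y}}(0) \odot \hat{\mathbf{z}}(0)$ expresses every combination of factors in superposition because $\sum_j \mathbf{y}_j \odot \sum_k \mathbf{z}_k = \sum_{j,k} \mathbf{y}_j \odot \mathbf{z}_k$. For instance, if each codebook contains $D$ atoms, then this initial guess represents $D^2$ combinations in superposition. Thus, many potential combinations of the pair of factors may be considered at once when inferring the third factor.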
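The idea of searching in superposition can be checked directly. The sketch below assumes NumPy, three hypothetical codebooks `X`, `Y`, `Z`, and unthresholded sums as the initial estimates. It verifies that the product of two superposed estimates equals the sum over all pairwise products, and that unbinding the input with this guess already points toward the correct third factor.

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 10_000, 10                         # illustrative sizes
X, Y, Z = (rng.choice([-1, 1], size=(N, D)) for _ in range(3))
s = X[:, 4] * Y[:, 7] * Z[:, 2]           # composite vector with hidden factors

# Initial estimates: unthresholded superposition of every atom in a codebook.
y_hat = Y.sum(axis=1)
z_hat = Z.sum(axis=1)

# Binding distributes over addition, so the product of the two sums equals
# the sum of all D**2 pairwise products.
all_pairs = sum(Y[:, j] * Z[:, k] for j in range(D) for k in range(D))
same = np.array_equal(y_hat * z_hat, all_pairs)

# Unbinding s with the superposed guess gives a noisy estimate of the x factor,
# whose similarity to the codebook X peaks at the true atom.
x_noisy = s * y_hat * z_hat
a = X.T @ x_noisy
best = int(np.argmax(a))
print(same, best)
```

The estimate is noisy because all the incorrect pairs contribute cross-talk, which is what the clean-up stage is meant to suppress.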
The inference process, however, is noisy if many guesses are tested simultaneously. This noise results from cross talk of many quasi-orthogonal vectors and can be reduced through a clean-up memory. This is built from the codebooks, which contain all the vectors that are possible factors of the input $\mathbf{s}$. Each clean-up memory projects the initial noisy estimate onto the span of the codebook. This computes a measure of confidence for whether each element in the codebook is a factor.
The result of the inference and clean-up leads to a new estimate for each factor. The new estimate is formed by a sum of dictionary items weighted by the confidence levels. This produces a better guess for each one of the factors. The inference can then be repeated with better guesses, which reduces cross-talk noise even further. By iteratively applying this procedure, the inference and clean-up stages cooperate to successively reduce cross-talk noise until the solution is found.
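For concreteness, the inference and clean-up steps described above can be written as one update per factor. The following is a sketch for three factors, assuming codebooks $\mathbf{X}$, $\mathbf{Y}$, $\mathbf{Z}$, an input $\mathbf{s}$, and a thresholding function $g$ taken to be the sign function; the symbol names are chosen here for illustration.

```latex
\begin{aligned}
\hat{\mathbf{x}}(t+1) &= g\!\left(\mathbf{X}\mathbf{X}^\top \left(\mathbf{s} \odot \hat{\mathbf{y}}(t) \odot \hat{\mathbf{z}}(t)\right)\right) \\
\hat{\mathbf{y}}(t+1) &= g\!\left(\mathbf{Y}\mathbf{Y}^\top \left(\mathbf{s} \odot \hat{\mathbf{x}}(t) \odot \hat{\mathbf{z}}(t)\right)\right) \\
\hat{\mathbf{z}}(t+1) &= g\!\left(\mathbf{Z}\mathbf{Z}^\top \left(\mathbf{s} \odot \hat{\mathbf{x}}(t) \odot \hat{\mathbf{y}}(t)\right)\right)
\end{aligned}
```

In each line, the inner product $\mathbf{X}^\top(\cdot)$ yields the confidence weights, and multiplying by $\mathbf{X}$ forms the weighted sum of codebook items that becomes the new estimate after thresholding.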
If we examine the clean-up memory for $\hat{\mathbf{x}}$, which contains a matrix multiplication with $\mathbf{X}\mathbf{X}^\top$ and thresholding function $g$, then we see this operation is nearly identical to a Hopfield network with outer-product Hebbian learning (Hopfield, 1982). Except here, rather than directly feeding back into itself, the result of the clean-up is sent to other parts of the network.
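Putting inference and clean-up together, a resonator network can be sketched as a short loop. This is a minimal illustration under several assumptions, not a definitive implementation: it uses NumPy, takes the threshold to be the sign function with zeros mapped to +1, updates the factors one after another within each iteration, and stops at a fixed point. The function names `bipolarize` and `resonator` are hypothetical.

```python
import numpy as np

def bipolarize(v):
    """Threshold: sign function, with ties at zero mapped to +1 (an assumption)."""
    return np.where(v >= 0, 1, -1)

def resonator(s, codebooks, max_iter=100):
    """Factor s into one atom per codebook; returns the decoded atom indices."""
    # Start every estimate as the thresholded superposition of its codebook.
    est = [bipolarize(cb.sum(axis=1)) for cb in codebooks]
    for _ in range(max_iter):
        previous = [e.copy() for e in est]
        for f, cb in enumerate(codebooks):
            # Inference: unbind the current estimates of all other factors.
            others = np.prod([e for j, e in enumerate(est) if j != f], axis=0)
            noisy = s * others
            # Clean-up: weight the codebook atoms by similarity, sum, threshold.
            est[f] = bipolarize(cb @ (cb.T @ noisy))
        if all(np.array_equal(p, e) for p, e in zip(previous, est)):
            break                                   # reached a fixed point
    return [int(np.argmax(cb.T @ e)) for cb, e in zip(codebooks, est)]

rng = np.random.default_rng(4)
N, D = 2_000, 10                          # illustrative sizes: 1,000 combinations
codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(3)]
true = [6, 0, 9]
s = np.prod([cb[:, i] for cb, i in zip(codebooks, true)], axis=0)

decoded = resonator(s, codebooks)
print(decoded)
```

Each iteration costs only a few matrix-vector products per factor, whereas exhaustive search would score every one of the `D**3` combinations. Convergence is not guaranteed in general; this example is chosen to lie well within the regime where the network is expected to succeed.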
The set of equations in (3.2) defines a nonlinear dynamical system that has interesting empirical and theoretical properties, which we thoroughly examine through simulation experiments in Kent et al. (2020), the companion article in this issue. Empirically, the system bounces around in state space until the correct solution appears to resonate with the network dynamics, popping out as if in a moment of insight. We find that while there is no Lyapunov function governing these dynamics and no guarantee for convergence, the resonator network empirically converges to the correct solution with high probability as long as the number of product combinations to be searched is within the network's operational capacity. We show that the operational capacity is given by a quadratic function of the vector dimension $N$. Compared to numerous alternative optimization methods that we considered, this capacity for resonator networks is higher by almost two orders of magnitude.
4 Decoding Data Structures with Resonator Networks
We now turn to two examples that illustrate how VSA operations can be combined to build distributed representations of data structures, how the factorization problem arises when parsing these representations, and how resonator networks can be designed to solve this problem.
4.1 Searching a Tree Data Structure
To set up the network for this problem, we first establish a maximum depth to search through; the maximum depth determines the number of factors that need to be estimated. For the tree shown in Figure 2, we need five estimators, because this is the depth of the deepest leaves, f and g.
Each factor estimate will determine whether to go left, right, or stop, for each level down the tree. To indicate stop, a special vector is used, the identity vector $\mathbf{1}$ (a vector of all ones). By using the appropriate number of these identity vectors, each location in the tree can be thought of as a composition with the same depth (the maximum depth), even if the location is only partially down the tree. For instance, the position of leaf c in Figure 2 can also be written as a product of five factors by binding its path with identity vectors at the remaining levels. This way, we can set up a resonator network for five factors and have it decode locations anywhere in the tree.
The process is demonstrated in Figure 2. The input vector to be factorized, $\mathbf{s}$, is first formed from the tree data structure and the query. For instance, to find the location of label c, $\mathbf{s} = \mathbf{c} \odot \mathbf{tree}$ is the input to the resonator network, where $\mathbf{tree}$ denotes the vector encoding the whole tree. Different leaves in the tree can be found by unbinding the leaf representation from the tree vector and using this result as the input.
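The query procedure can be sketched end to end on a toy tree. The example below is a minimal illustration under stated assumptions: it uses NumPy, a hypothetical three-leaf tree of depth 3 (not the tree of Figure 2), a path encoding that binds one move per level with the level marked by repeated cyclic shifts, and a sign threshold with zeros mapped to +1. All names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
N, DEPTH = 2_000, 3                        # illustrative dimension and maximum depth
left, right = rng.choice([-1, 1], size=(2, N))
identity = np.ones(N, dtype=int)           # the "stop" vector (all ones)
labels = {name: rng.choice([-1, 1], size=N) for name in "abc"}

def rho(x, k):
    return np.roll(x, k)                   # cyclic shift marks the tree level

# The level-k codebook holds the permuted moves plus the identity (stop).
MOVES = ["left", "right", "stop"]
codebooks = [np.stack([rho(left, k), rho(right, k), identity], axis=1)
             for k in range(DEPTH)]

def path_vector(path):
    """Bind one move per level; shallow paths are padded with 'stop'."""
    padded = list(path) + ["stop"] * (DEPTH - len(path))
    return np.prod([codebooks[k][:, MOVES.index(m)]
                    for k, m in enumerate(padded)], axis=0)

# Hypothetical tree: leaves a and b at depth 3, leaf c at depth 2.
paths = {"a": ["left", "left", "left"],
         "b": ["left", "right", "left"],
         "c": ["right", "left"]}
tree = sum(labels[n] * path_vector(p) for n, p in paths.items())

def bipolarize(v):
    return np.where(v >= 0, 1, -1)         # ties at zero mapped to +1 (assumption)

def resonator(s, codebooks, max_iter=50):
    est = [bipolarize(cb.sum(axis=1)) for cb in codebooks]
    for _ in range(max_iter):
        for f, cb in enumerate(codebooks):
            others = np.prod([e for j, e in enumerate(est) if j != f], axis=0)
            est[f] = bipolarize(cb @ (cb.T @ (s * others)))
    return [int(np.argmax(cb.T @ e)) for cb, e in zip(codebooks, est)]

# Query: unbind label c from the tree, then factor the result level by level.
s = tree * labels["c"]
decoded = [MOVES[i] for i in resonator(s, codebooks)]
print(decoded)
```

Unbinding the label leaves the path of c plus cross-talk from the other leaves, and the resonator network recovers one move per level, with `stop` filling the levels below the leaf.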
We visualize the network dynamics by displaying the similarity of each factor estimate to the atoms stored in its corresponding codebook. The evolution of these similarity weights over time is shown as a heat map (see Figure 2, right). The heat maps show that the system initially jumps around chaotically, with the weighting of each estimate changing drastically with each iteration. But then there is a quite sudden transition to a stable equilibrium, where each estimate converges nearly simultaneously, and at this point, the output for each factor is essentially the codebook element with highest weight.
4.2 Visual Scene Analysis as a Factorization Problem
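Any given scene can have between one and three of these objects (colored hand-written digits), which are allowed to partially occlude one another. We generate symbolic vectors $\mathbf{c}_1, \ldots, \mathbf{c}_7$ to encode color; $\mathbf{d}_0, \ldots, \mathbf{d}_9$ to encode shape; $\mathbf{v}_{\text{top}}$, $\mathbf{v}_{\text{middle}}$, $\mathbf{v}_{\text{bottom}}$ to encode vertical position; and $\mathbf{h}_{\text{left}}$, $\mathbf{h}_{\text{middle}}$, $\mathbf{h}_{\text{right}}$ to encode horizontal position, which are stored in respective codebooks, $\mathbf{C}$, $\mathbf{D}$, $\mathbf{V}$, $\mathbf{H}$.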
The example scene (see Figure 3) contains a cyan 7 at position top, left; a pink 3 at position top, right; and a red 8 at position middle, left. While this is a highly simplified type of visual scene, it illustrates the combinatorial challenge of representing and interpreting visual scenes. There are only 23 distinct atomic parameters (10 for digit identity, 7 for color, 3 each for vertical and horizontal position), and yet these combine to describe $10 \times 7 \times 3 \times 3 = 630$ individual objects, and roughly 250 million possible scenes with 1, 2, or 3 objects. This number of combinations still does not include the variability among exemplars for each shape, of which there are 50,000 in the MNIST data set.
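The counts in this paragraph follow from simple arithmetic, sketched below. The one assumption is how scenes are counted: here a three-object scene is treated as an ordered triple of objects, which gives a figure of about 250 million; counting unordered sets would give a smaller number.

```python
n_shapes, n_colors, n_vertical, n_horizontal = 10, 7, 3, 3

# Atoms add up, while the objects they describe multiply.
atoms = n_shapes + n_colors + n_vertical + n_horizontal
objects = n_shapes * n_colors * n_vertical * n_horizontal

# Assumption: count a three-object scene as an ordered triple of objects.
three_object_scenes = objects ** 3

print(atoms, objects, three_object_scenes)   # prints 23 630 250047000
```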
The VSA approach to represent a scene like this is to form the conjunction of each of the four factors with the binding operation and superposing multiple objects together to form a single high-dimensional vector that constitutes a distributed representation of the entire scene. This encoding is depicted in Figure 3, and as in the previous examples, the encoding provides a flexible data structure such that aspects of the scene can be individually queried. One attractive property of this representation is that its dimensionality does not grow with the number of objects in the scene, nor does it impose any particular ordering on the objects.
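The encoding described above can be sketched as follows, assuming NumPy and random bipolar atoms. The scene matches the example in the text, but the helper names and the atom names not mentioned in the text (four of the colors and the horizontal position `middle`) are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
N = 500                                    # dimensionality used in the text

def codebook(names):
    """Map each name to a random bipolar atom (hypothetical helper)."""
    return {n: rng.choice([-1, 1], size=N) for n in names}

# Only cyan, pink, and red appear in the text; the other color names are assumed.
color = codebook(["red", "cyan", "pink", "green", "blue", "yellow", "white"])
shape = codebook([str(d) for d in range(10)])
vert = codebook(["top", "middle", "bottom"])
horiz = codebook(["left", "middle", "right"])

def encode_object(c, d, v, h):
    """Conjunction of the four factors via binding."""
    return color[c] * shape[d] * vert[v] * horiz[h]

def encode_scene(objects):
    """Superpose the bound objects; the result is still N-dimensional."""
    return sum(encode_object(*o) for o in objects)

objects = [("cyan", "7", "top", "left"),
           ("pink", "3", "top", "right"),
           ("red", "8", "middle", "left")]
scene = encode_scene(objects)

def contains(scene, obj):
    """Normalized similarity between the scene and one candidate object."""
    return (scene @ encode_object(*obj)) / N

present = [contains(scene, o) for o in objects]          # each close to 1
absent = contains(scene, ("cyan", "3", "top", "left"))   # close to 0
reordered = encode_scene(objects[::-1])                  # same vector, any order
print(present, absent, np.array_equal(scene, reordered))
```

The absent object shares three of its four factors with objects in the scene, yet its similarity stays near zero, which illustrates that binding represents conjunctions rather than loose collections of features.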
To convert a new input image into a structured VSA representation, one challenge is to deal with the variability and correlations between the shapes of different hand-written digits. VSAs are designed for symbolic processing in neural networks. However, when dealing with sensor data streams, one must solve the encoding problem, which is how to map the input data into the symbolic space (Räsänen, 2015; Kleyko, Rahimi, Rachkovskij, Osipov, & Rabaey, 2018). We train a simple feedforward neural network with two fully connected hidden layers to produce the desired VSA encoding of the scene. The feedforward network was trained on a (uniformly) random sample of these scenes, with the MNIST digits chosen from an exclusive training set. A generative model creates the image of the scene from a random sample of factors for each object. From the chosen factors, the VSA representation of the scene is also generated through binding of VSA vectors for each factor and superposition for each object (see Figure 3). Supervised learning via backpropagation is used to train the network to output the VSA representation of the entire scene from the image pixels as input.
After training on images, we used the network to produce symbolic vectors for a held-out test set of images. The vector dimensionality $N$ is a free parameter, which we chose to be 500. If the exact ground-truth vector is provided to a resonator network, it will infer the factors accurately provided $N$ is large enough, a fact we establish in the companion article in this issue. For this small visual scene example, it turns out $N = 500$ more than suffices for the number of possible factorizations to be searched. Note that $N$ is lower than the total number of combinations of all the factors, which is 630.
A major quest for modern artificial intelligence is to build computational models that combine the abilities of neural networks with the abilities of rule-based reasoning. Vector Symbolic Architectures, a family of connectionist models, enable the formation of distributed representations of data structures and structured computation on these representations, and have provided valuable conceptual insights for cognition and computation. However, so far, VSA models have not been able to solve challenging artificial intelligence problems in real-world domains due to the combinatorial factorization problem that arises when processing complex, hierarchical data structures. Our contribution here has been to provide an efficient solution to the factorization problem, the resonator network, which we show in the companion article in this issue (Kent et al., 2020) vastly outperforms standard optimization methods.
The two applications we showed here—parsing a tree-like data structure and decomposing a visual scene—are intended as illustrative examples to show how factorization of multiterm products arises in querying a VSA data structure, and they show how to design resonator networks to solve such problems. Having a solution to the factorization problem now makes it possible to apply VSAs to myriad problems in computational neuroscience, cognitive science, and artificial intelligence, from visual scene analysis to natural language understanding and analogical reasoning.
5.1 Implications for Neuroscience
The ability to solve factorization problems is fundamental to both perception and cognition. In vision, for example, the signal measured by a photoreceptor contains a combination of illumination, surface reflectance, surface orientation, and atmospheric properties that essentially need to be “demultiplied” by the visual system in order to recover a representation of the underlying causes in a scene (Barrow & Tenenbaum, 1978; Adelson & Pentland, 1996; Barron & Malik, 2014). The problem of separating form and motion may also be posed as a factorization problem (Cadieu & Olshausen, 2012; Memisevic & Hinton, 2010; Anderson et al., 2020). In the domain of language, it has been argued that a factorization of sentence structure into “roles” and “fillers” is required for robust and flexible processing (Smolensky, 1990; Jackendoff, 2002). Many cognitive tasks, such as analogical reasoning, also require a form of factorization (Hummel & Holyoak, 1997; Kanerva, 1998; Plate, 2000a; Gayler & Levy, 2009). However, to date, it has been unclear how these factorization problems could be represented and solved efficiently by neural circuits in the brain. VSAs and resonator networks are a potential neural solution to these problems, and indeed developing more neurobiologically plausible models along these lines is a goal of ongoing work.
In the context of neuroscience and psychology, binding is widely theorized to be an important process by which the brain properly associates features belonging to the same physical object. However, how the brain may accomplish this is a hotly debated subject. Various solutions to this problem, also known as the neural binding problem, have been proposed based on attentional mechanisms or neural synchrony (Treisman & Gelade, 1980; von der Malsburg, 1999; Wolfe & Cave, 1999). Note that in these proposals, the binding information required to properly describe sets of compound objects has to be added to the individual feature representations, thus increasing the dimension for representing a compound object (or expanding the representation in time).
VSAs provide a general solution to the binding problem that early visual stages could employ. By using the VSA operations to represent and form data structures, the binding of features is easily expressed. Further, the dimension of the compound representation is not increased. The main computational challenge then becomes the factorization of VSA data structures formed in early sensory pathways, for which the resonator network provides a neurally plausible solution. Interestingly, Feldman's (2013) earlier discussion of the binding problem has already pointed out that the more fundamental problem of sensory processing is actually one of unbinding. Feldman argued that the raw sensory signals themselves can be thought of as being composites, containing multiple attributes that require factorization, such as in the examples described here.
In terms of modeling computation in biological neural circuits, resonator networks are clearly an abstraction. In particular, the implementation of VSAs presented here assumes that information is encoded by dense bipolar vectors (each element is nonzero), and the binding operation is performed by element-wise multiplication of vectors. At first glance, these types of representations and operations may not seem very biologically plausible. However, other variants of VSAs that utilize sparse, rather than dense, representations may help to reconcile this disconnect (Rachkovskij & Kussul, 2001; Laiho, Poikonen, Kanerva, & Lehtonen, 2015). Recently, we have shown that compound objects can be efficiently represented by sparse vectors with the same dimension as the atomic representations (Frady, Kleyko, & Sommer, 2020). The binding operation in this context relies on sigma-pi type operations (Mel & Koch, 1990; Plate, 2000b) that are potentially compatible with active nonlinearities found in dendritic trees. Complex-valued variations of VSAs (Plate, 2003) can also be linked to spike-timing codes (Frady & Sommer, 2019), which could further increase links to biology.
5.2 Implications for Machine Learning
In conventional deep learning approaches, given enough labeled data, a multilayer network can be trained end-to-end without worrying about understanding or parsing the representations formed by the intermediate layers. Users typically consider the interior of a deep network as a black box. However, this conceptual convenience becomes a disadvantage when it comes to improving the deficiencies of deep learning methods: susceptibility to adversarial attacks, the need for large amounts of labeled data, and a lack of generalization to novel situations. Moreover, while most machine-learning algorithms are focused on problems of pattern matching or learning a mapping from inputs to outputs, most problems in perception and cognitive reasoning require more than just pattern matching; they also require the ability to form and manipulate data structures.
VSAs offer a transparent approach to forming distributed representations of potentially complex data structures that may be flexibly recombined to deal with novel situations. For any desired computation, the relevant elements in the data structure can be exposed, or decoded, and combined with other information to calculate a result. Here we have shown how these data structures can be formed and manipulated to solve challenging computational problems such as tree search or visual scene analysis. The key to solving these problems relies on the ability to factorize high-dimensional vectors, which can now be done by resonator networks. Given that the problem of factorization arises in many other machine-learning settings, such as simultaneous inference of multiple variables, inverse graphics, and language, it seems likely that resonator networks could provide an efficient solution in these domains as well.
One can potentially combine VSAs with deep learning to get the best of both worlds. An example of this may be seen in our solution to parsing a visual scene (see section 4.2). Rather than training a network to simply map images to class labels, our approach trains the network to map the image to a symbolic description that captures the compositional structure of a scene (that is, multiple objects combined with their properties), which can be used by downstream processes to reason about the scene. Importantly, because multiple object-property bindings can be superposed in the same space, the VSA encoding can handle the very large combinatoric space of possible scenes (in this case, 250 million) with a single vector of fixed dimensionality (500). The VSA representation does have a limited capacity and will begin to break down for more than a few objects. However, it is worth noting that human working memory has similar limitations (Miller, 1956).
While there are undoubtedly alternative deep learning approaches for performing analysis of simple scenes, our goal here was to show how analysis of visual scenes could be approached by expressing the problem as a problem of factorization. Incorporating factorization into problems like scene analysis may enable reasoning in much more complex spaces, as such a system can utilize factorization to handle a very large combinatoric space. However, the simple hybrid approach presented here still has some shortcomings, such as requiring a large amount of training data to learn the encoding.
We believe that multilayer neural networks could be improved profoundly by enabling all layers to explicitly represent, learn, and factorize data structures. Some recent model innovations follow this direction, particularly the “transformer” neural network architecture, which encodes key-value pairs for modeling language and other types of data (Vaswani et al., 2017; Devlin, Chang, Lee, & Toutanova, 2018). Other model proposals enable the encoding of multiplicative relationships between features using the tensor product (Nickel, Tresp, & Kriegel, 2011; Socher, Chen, Manning, & Ng, 2013). VSAs could enable these models to represent and manipulate increasingly complex data structures, but this requires solving factorization problems. Resonator networks could thus serve as a critical component for building trainable neural networks that form, query, and manipulate large hierarchical data structures.
We thank members of the Redwood Center for Theoretical Neuroscience for helpful discussions, in particular Pentti Kanerva, whose work on Vector Symbolic Architectures originally motivated this project. This work was generously supported by NIH grant 1R01EB026955-01, NSF grants IIS1718991 and DGE1752814, the Intel Neuromorphic Research Community, Berkeley Deep-Drive, the Semiconductor Research Corporation and NSF under E2CDA-NRI, DARPA's Virtual Intelligence Processing program, and AFOSR FA9550-19-1-0241.