Abstract
We develop theoretical foundations of resonator networks, a new type of recurrent neural network introduced in Frady, Kent, Olshausen, and Sommer (2020), a companion article in this issue, to solve a high-dimensional vector factorization problem arising in Vector Symbolic Architectures. Given a composite vector formed by the Hadamard product between a discrete set of high-dimensional vectors, a resonator network can efficiently decompose the composite into these factors. We compare the performance of resonator networks against optimization-based methods, including Alternating Least Squares and several gradient-based algorithms, showing that resonator networks are superior in several important ways. This advantage is achieved by leveraging a combination of nonlinear dynamics and searching in superposition, by which estimates of the correct solution are formed from a weighted superposition of all possible solutions. While the alternative methods also search in superposition, the dynamics of resonator networks allow them to strike a more effective balance between exploring the solution space and exploiting local information to drive the network toward probable solutions. Resonator networks are not guaranteed to converge, but within a particular regime they almost always do. In exchange for relaxing the guarantee of global convergence, resonator networks are dramatically more effective at finding factorizations than all alternative approaches considered.
1 Introduction
This article is the second in a two-part series on resonator networks. “Resonator Networks, 1” shows how distributed representations of data structures may be formed using the algebra of Vector Symbolic Architectures and that decoding these representations often requires solving a vector factorization problem. We introduced resonator networks as a neural solution to this problem and demonstrated their use with two examples. Here, our primary objective is to establish the theoretical foundations of resonator networks and to perform a more comprehensive analysis of their convergence and capacity properties in comparison to optimization-based methods.
We limit our analysis to a particular definition of the factorization problem, which may seem somewhat abstract but in fact applies to practical usage of Vector Symbolic Architectures (VSAs). We consider bipolar vectors, whose elements are each either $+1$ or $-1$, used in the popular “Multiply, Add, Permute (MAP)” VSA (Gayler, 1998, 2004). These ideas extend to other VSAs, although we leave a detailed analysis to future work. Part 1 included commentary on the historical context and representational power of VSAs, which we will not cover here. For the purposes of this article, it is sufficient to stipulate that wherever VSAs are used to encode complex hierarchical data structures, a factorization problem must be solved. By solving this problem, resonator networks make the VSA framework scalable to a larger range of problems.
The core challenge of factorization is that inferring the factors of a composite object amounts to searching through an enormous space of possible solutions. Resonator networks do this in part by “searching in superposition,” a notion that we make precise in section 3. There are in fact many ways to search in superposition, and we introduce a number of them in section 5, both as benchmarks for our model and to understand what makes our approach different. A resonator network is simply a nonlinear dynamical system designed to solve a particular factorization problem. It is defined by equations 4.1 and 4.2, which specify its two variants. The system is named for the way in which correct factorizations seemingly resonate out of what is initially an uninformative network state. The size of the factorization problem that can be reliably solved, as well as the speed with which solutions are found, characterizes the performance of all the approaches we introduce; in these terms, resonator networks are by far the most effective.
The main results are as follows:
We characterize stability at the correct solution, showing that one variant of resonator networks is always stable, while the other has stability properties related to classical Hopfield networks. We show that the latter variant can be slightly less stable than a Hopfield network because of a phenomenon we refer to as percolated noise (see section 6.1).
We define “operational capacity” as a metric of factorization performance and use it to compare resonator networks against six benchmark algorithms. We find that resonator networks have dramatically higher operational capacity (section 6.2).
Through simulation, we determine that operational capacity scales as a quadratic function of the vector dimensionality $N$, which is proportional to the number of idealized neurons in a resonator network (also section 6.2).
We propose a theory for why resonator networks perform well on this problem (see section 6.6).
2 Statement of the Problem
Suppose we were to solve problem 2.1 using a brute-force strategy. We might form all possible composite vectors from the codebooks, one at a time, until we reproduce the composite vector we were given, which would indicate the appropriate factorization. Assuming no additional information is available, the number of trials taken to find the correct factorization is uniformly distributed over $\{1, \ldots, M\}$, so the expected number of trials is $(M+1)/2$. If instead we could easily store all of the composite vectors ahead of time, we could compare them to any new composite vector via a single matrix-vector inner product, which, given our uniqueness assumption, will yield a value of $N$ for the correct factorization and values strictly less than $N$ for all other factorizations. The matrix containing all possible composite vectors requires $MN$ bits to store. The core issue is that $M = D^F$ scales very poorly with the number of factors $F$ and the number of possible codevectors $D$ to be entertained. If $F = 4$ (4 factors) and $D = 100$ (100 possible codevectors for each factor), then $M = 10^8$. In the context of Vector Symbolic Architectures, it is common to have $N$ on the order of $10^4$. Therefore, the matrix with all possible composite vectors would require approximately $10^{12}$ bits (around 125 gigabytes) to store. We aspire to solve problems of this size (and much larger), which are clearly out of reach for brute-force approaches. Fortunately, they are solvable using resonator networks.
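To make the scaling concrete, here is a minimal back-of-the-envelope sketch in Python (the language of the accompanying implementation); the variable names are ours and the numbers simply restate the example above.

```python
D, F, N = 100, 4, 10_000        # codevectors per factor, number of factors, vector dimension

M = D ** F                      # number of possible composite vectors: 10^8
expected_trials = (M + 1) / 2   # mean number of trials for sequential brute-force search
storage_bits = M * N            # storing every composite vector: 10^12 bits
print(M, expected_trials, storage_bits / 8 / 1e9, "GB")   # ~125 GB
```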
3 Factoring by Search in Superposition
4 Resonator Networks
If, contrary to what we have defined in equations 4.1 and 4.2, the input to each subpopulation of neurons were its own previous state, then one would in fact have a (“bipolar”) Hopfield network. In our case, however, rather than being autoassociative, in which each factor's next state is a direct function of its own previous state, our dynamics are heteroassociative, basing updates on the states of the other factors. This change has a dramatic effect on the network's convergence properties and is also in some sense what makes resonator networks useful in solving the factorization problem, a fact that we elaborate on in the following sections. We imagine $F$ separate subpopulations of neurons that evolve together in time, each one responsible for estimating a different factor of the composite vector. For now, we have specified this as a discrete-time network in which updates are made one at a time, but it can be extended to a continuous-valued, continuous-time dynamical system along the same lines as was done for Hopfield networks (Hopfield, 1984). In that case, we can think about these subpopulations of neurons evolving in a truly parallel way. In discrete time, one has the choice of making asynchronous or synchronous updates to the factors, in a sense analogous to Hopfield networks. Our formulation in equation 3.5 follows the asynchronous convention, which we find to converge faster. The formulation given in the companion article in this issue employed the synchronous convention for pedagogical reasons, but the distinction between the two vanishes in continuous time, where updates are instantaneous.
In practice, we have to choose an initial state using no knowledge of the correct codevector other than the fact that it is one of the elements of its codebook. Therefore, we set each initial estimate to the (bipolarized) superposition of all the codevectors in its codebook, which, as we have said, has approximately equal cosine similarity to each term in the sum.
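The sketch below shows one way the outer product variant and this initialization might be written in NumPy. It is a minimal illustration rather than the reference implementation: the function names, loop structure, and the rule for breaking ties at exact zeros are our own choices, and the codebooks are assumed to be stored column-wise as bipolar matrices.

```python
import numpy as np

def sgn(v):
    # bipolarize; break exact zeros toward +1
    return np.where(v >= 0, 1, -1)

def resonator_op(c, codebooks, max_iters=200):
    """Sketch of a resonator network with outer product (OP) weights.

    c         : (N,) bipolar composite vector
    codebooks : list of F arrays X_f of shape (N, D_f) with bipolar codevectors as columns
    """
    # initialize each estimate to the bipolarized superposition of its codebook
    x_hat = [sgn(X.sum(axis=1)) for X in codebooks]
    for _ in range(max_iters):
        prev = [x.copy() for x in x_hat]
        for f, X in enumerate(codebooks):
            # "explain away" the other factors using their current estimates
            others = np.ones_like(c)
            for g, x in enumerate(x_hat):
                if g != f:
                    others = others * x
            # OP weights X X^T applied to the unexplained input, then bipolarized
            x_hat[f] = sgn(X @ (X.T @ (c * others)))
        if all(np.array_equal(a, b) for a, b in zip(x_hat, prev)):
            break  # a fixed point was reached
    return x_hat
```

The asynchronous convention appears here as the inner loop over factors, which uses already-updated estimates for the factors that come earlier in the sweep.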
4.1 Difference between OP Weights and OLS Weights
The difference between outer product weights and OLS weights is the factor $(X_f^\top X_f)^{-1}$, the inverse of the so-called Gram matrix for the codebook $X_f$, which contains the inner products between each pair of codevectors. If the codevectors are orthogonal, the Gram matrix is $N I$, with $I$ the identity matrix. When $N$ is large (roughly speaking, above 5000) and the codevectors are chosen randomly, independent and identically distributed (i.i.d.) from $\{-1,+1\}^N$, then they will be very nearly orthogonal, making the outer product weights a close approximation to (a scaled version of) the OLS weights. Clearly, in this setting, the two variants of resonator networks produce nearly the same dynamics. In section 6.2, we define and measure a performance metric called operational capacity in a setting, namely randomly chosen and therefore nearly orthogonal codevectors, that does not particularly highlight the difference between the two dynamics. In general, however, the dynamics are clearly different. In our experience, applications that contain correlations between codevectors may enjoy higher operational capacity under OLS weights, but it is hard to say whether this applies in every setting.
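A small numerical check of this near-orthogonality claim is sketched below; the sizes are arbitrary, and forming the full weight matrices explicitly is done here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 2000, 25
X = rng.choice([-1, 1], size=(N, D)).astype(float)   # random bipolar codebook

G = X.T @ X                                 # Gram matrix; approximately N * I for large N
W_op = X @ X.T                              # outer product weights
W_ols = X @ np.linalg.pinv(X)               # OLS weights: orthogonal projector onto span(X)

off_diag = G - np.diag(np.diag(G))
print(np.abs(off_diag).max() / N)           # small relative to 1: nearly orthogonal codevectors
print(np.abs(W_op / N - W_ols).max())       # W_op / N approximates W_ols when G is close to N * I
```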
One application-relevant consideration is that because $X_f$ consists of entries that are $+1$ and $-1$, the outer product variant of a resonator network has an integer-valued weight matrix and can be implemented without any floating-point computation; hardware with large binary and integer arithmetic circuits can simulate this model very quickly. Coupled with the noise tolerance properties we establish in section 6.5, this makes resonator networks (and, more generally, VSAs) a good fit for emerging device nanotechnologies (Rahimi et al., 2017).
5 The Optimization Approach
An alternative strategy for solving the factorization problem is to define a loss function that compares the current estimate with the composite vector that is to be factored, choosing the loss function and a corresponding constraint set so that the global minimizer of this loss over the constraints yields the correct solution to problem 2.1. One can then design an algorithm that finds the solution by minimizing this loss. This is the approach taken by optimization theory. Here we consider algorithms that search in superposition, expressing each factor estimate as a weighted superposition of its codevectors just as resonator networks do, but that instead take the optimization approach.
| Parameters of Quadratic Fit |
|---|
| 0.293 |
| 2.002 |
| 1.404 |
| 1.024 |
| 0.874 |
| 0.669 |
5.1 Benchmark Algorithms
A common thread among the benchmark algorithms is that they take the activation function to be the identity, making the estimated composite a multilinear function of the coefficients, as we discussed in section 3. We experimented with other activation functions but found none for which the optimization approach performed better. We consider two straightforward loss functions for comparing the composite and its current estimate. The first is one-half the squared Euclidean norm of the error, which we call the squared error for short, and the second is the negative inner product. The squared error is minimized when the estimate equals the composite exactly, which is also true for the negative inner product when the estimate is constrained appropriately. Both of these loss functions are convex, meaning that the loss is a convex function of each coefficient vector separately. Some of the benchmark algorithms constrain the coefficients directly, and when that is the case, our focus is on three different convex sets: the simplex (nonnegative coefficients summing to 1), the unit ball, and the closed zero-one hypercube $[0,1]^D$. Therefore, solving problem 5.1 with respect to each coefficient vector separately is a convex optimization problem. In the case of the negative inner product loss and simplex constraints, it is a bona fide linear program. The correct factorization is given by coefficient vectors with a single entry equal to 1 and the rest 0; these are the standard basis vectors $\mathbf{e}_d$ (whose $i$th entry is 1 if $i = d$ and 0 otherwise). The initial states must be set with no prior knowledge of the correct factorization, so, similar to how we do for resonator networks, we set each element of the initial coefficient vectors to the same value (which in general depends on the constraint set).
5.1.1 Alternating Least Squares
5.1.2 Gradient-Following Algorithms
Another natural strategy for solving problem 5.1 is to make updates that incorporate the gradient of the loss with respect to the coefficients; each of the next five algorithms does this in a particular way (we write out the gradients for both loss functions in appendix E). The squared error loss is globally minimized by the correct coefficients, so one might be tempted to start from some initial values for the coefficients and simply take fixed-step gradient descent updates. In section E.1 we discuss why this does not work well. The difficulty is in being able to guarantee that the loss function is smooth enough that gradient descent iterates with a fixed step size will converge. Instead, the algorithms we apply to the squared error loss utilize a dynamic step size:
Iterative Soft Thresholding: The global minimizers of equation 5.1 are maximally sparse, with a single nonzero coefficient per factor. If one aims to minimize the squared error loss while loosely constraining the solutions to be sparse, it may make sense to solve the problem with Iterative Soft Thresholding (ISTA). The dynamics for ISTA are given by equation C.1 in Table 2.
Fast Iterative Soft Thresholding: We also considered fast iterative soft thresholding (FISTA), an enhancement due to Beck and Teboulle (2009), which utilizes Nesterov's momentum for accelerating first-order methods in order to alleviate the sometimes slow convergence of ISTA (Bredies & Lorenz, 2008). Dynamics for FISTA are given in equation C.2.
Projected Gradient Descent: Another benchmark algorithm we considered was Projected Gradient Descent on the negative inner product loss, where updates were projected onto either the simplex or unit ball (see equation C.3). A detailed discussion of this approach can be found in appendix G.
Multiplicative Weights: This is an algorithm that can be applied to either loss function, although we found it worked best on the negative inner product. It elegantly enforces a simplex constraint on the coefficients by maintaining a set of auxiliary variables, the weights, which are used to set the coefficients at each iteration. See equation C.5 for the dynamics of Multiplicative Weights, as well as appendix H; a schematic sketch of this style of update appears just after this list.
Map-seeking Circuits: The final algorithm that we considered is map-seeking circuits, neural networks designed to solve invariant pattern recognition problems using the principle of superposition. Their dynamics are based on the gradient, but are different from what we have introduced so far (see equation C.5 and appendix I).
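As a concrete illustration of one benchmark, here is a sketch of a Multiplicative Weights (exponentiated gradient) update on the negative inner product loss. It is our own schematic rendering, not the exact dynamics of the equation referenced above: the step size, loop structure, and log-domain normalization are choices we make here for numerical stability.

```python
import numpy as np

def multiplicative_weights(c, codebooks, eta=0.1, iters=500):
    """Coefficients a_f live on the simplex; the composite estimate is the
    Hadamard product of the per-factor superpositions X_f a_f."""
    log_w = [np.zeros(X.shape[1]) for X in codebooks]             # auxiliary weights, kept in log domain
    a = [np.full(X.shape[1], 1.0 / X.shape[1]) for X in codebooks]
    for _ in range(iters):
        for f, X in enumerate(codebooks):
            p = np.ones_like(c, dtype=float)                      # product of the other factors' superpositions
            for g, Xg in enumerate(codebooks):
                if g != f:
                    p *= Xg @ a[g]
            grad = -(X.T @ (c * p))                               # gradient of -<c, c_hat> with respect to a_f
            log_w[f] -= eta * grad                                # multiplicative (exponentiated-gradient) step
            z = log_w[f] - log_w[f].max()                         # stable softmax back onto the simplex
            a[f] = np.exp(z) / np.exp(z).sum()
    return a
```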
5.2 Contrasting Resonator Networks with the Benchmarks
5.2.1 Convergence of the Benchmarks
A remarkable fact about the benchmark algorithms is that each one converges for all initial conditions, which we directly prove, or refer to results proving, in appendixes D through I. That is, given any starting coefficients, their dynamics reach fixed points that are local minimizers of the loss function. In some sense, this property is an immediate consequence of treating factorization as an optimization problem: the algorithms we chose as the benchmarks were designed this way. Convergence to a local minimizer is a desirable property, but unfortunately the fundamental nonconvexity of the optimization problem implies that this may not guarantee good local minima in practice. In section 6, we establish a standardized setting where we measure how likely it is that these local minima are actually global minima. We find that as long as $M$, the size of the search space, is small enough, each of these algorithms can find the global minimizers reliably. The point at which the problem becomes too large to reliably solve is what we call the operational capacity of the algorithm, and it is a main point of comparison with resonator networks.
5.2.2 An Algorithmic Interpretation of Resonator Networks
The benchmark algorithms generate estimates for the factors that move through the interior of the hypercube $[-1,1]^N$. Resonator networks, on the other hand, do not. Their sign-function activation "bipolarizes" inputs to the nearest vertex of the hypercube, and this highly nonlinear function, which changes not only the length but also the angle of an input vector, is key. We know the solutions exist at vertices of the hypercube, and these points are very special geometrically in the sense that, in high dimensions, most of the mass of $[-1,1]^N$ is concentrated relatively far from the vertices, a fact we will not prove here but that is based on standard results from the study of concentration inequalities (Boucheron, Lugosi, & Massart, 2013). Our motivation for using this activation function is that moving through the interior of the hypercube while searching for a factorization is unwise, a conjecture for which we will provide some empirical support in section 6.
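A quick Monte Carlo check of this geometric claim, under the assumption of uniform sampling of the hypercube interior:

```python
import numpy as np

rng = np.random.default_rng(0)
N, trials = 10_000, 200
u = rng.uniform(-1.0, 1.0, size=(trials, N))      # random points in the interior of [-1, 1]^N
v = np.sign(u)                                    # nearest vertex of {-1, +1}^N

dist = np.linalg.norm(u - v, axis=1) / np.sqrt(N)                       # per-dimension distance to nearest vertex
cos = (u * v).sum(axis=1) / (np.linalg.norm(u, axis=1) * np.sqrt(N))    # cosine similarity to nearest vertex
print(dist.mean(), dist.std())    # concentrates near sqrt(1/3) ~= 0.577
print(cos.mean(), cos.std())      # concentrates near sqrt(3)/2 ~= 0.866 (an angle of roughly 30 degrees)
```

In other words, a typical interior point sits a fixed, sizable angle away from even its nearest vertex, and this concentration only sharpens as N grows.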
Resonator networks have a large and practical regime of operation, where $M$ (the problem size) is small enough, in which nonconverging trajectories are extremely rare. It is simple to deal with these events, making the model still useful in practice despite the lack of a convergence guarantee. It has also been argued in several places (see Van Vreeswijk & Sompolinsky, 1996, for example) that cyclic or chaotic trajectories may be useful to a neural system, including in cases where there are multiple plausible states to entertain. This is just to say that we feel the lack of a convergence guarantee is not a critical weakness of our model but rather an interesting and potentially useful characteristic. We attempted many different modifications to the model's dynamics that would provably cause it to converge, but these changes always hindered its ability to solve the factorization problem. We emphasize that unlike all of the models in section 5.1, a resonator network is not descending a loss function. Rather, it relies on the following facts:
Each iteration is a bipolarized ALS update. It approximately moves the state toward the least squares solution for each factor.
The correct solution is a fixed point (guaranteed for OLS weights, highly likely for OP weights).
There may be a sizable basin of attraction around this fixed point, which the iterates help us descend.
The number of spurious fixed points (which do not give the correct factorization) is relatively small.
This last point is really what distinguishes resonator networks from the benchmarks, which we establish in section 6.6.
6 Results
We present a characterization of resonator networks along three main directions. The first direction is the stability of the correct solutions, which we relate to the stability of classical Hopfield networks. The second is a fundamental measure of factorization capability we call the “operational capacity.” The third is the speed with which factorizations are found. We argue that the marked difference in factorization performance between our model and the benchmark algorithms lies in the relative scarcity of spurious fixed points enjoyed by resonator network dynamics. We summarize the main results in bold throughout this section.
In each of the simulations, we choose codevectors randomly i.i.d. from the discrete uniform distribution over the vertices of the hypercube; each element of each codevector is a Rademacher random variable (assuming the value $+1$ with probability 0.5 and $-1$ with probability 0.5). We generate the composite by choosing one vector at random from each of the codebooks and then computing the Hadamard product among these vectors. We choose vectors randomly because it makes the analysis of performance somewhat easier and more standardized, and it is the setting in which most of the well-known results on Hopfield network capacity apply; we will make a few connections to these results. It is also the setting in which we typically use the Multiply, Add, Permute VSA architecture (Gayler, 2004), and therefore these results on random vectors are immediately applicable to a variety of existing works.
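For reference, the simulated problems can be generated along the following lines (a sketch; the helper name and RNG choices are ours):

```python
import numpy as np

def random_problem(N, D, F, rng):
    """Draw F codebooks of D Rademacher codevectors each and a composite formed
    from one randomly chosen codevector per factor."""
    codebooks = [rng.choice([-1, 1], size=(N, D)) for _ in range(F)]
    true_idx = [rng.integers(D) for _ in range(F)]
    c = np.ones(N, dtype=int)
    for X, d in zip(codebooks, true_idx):
        c = c * X[:, d]                       # Hadamard product of the chosen codevectors
    return codebooks, true_idx, c

codebooks, true_idx, c = random_problem(N=1500, D=30, F=3, rng=np.random.default_rng(0))
```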
6.1 Stable-Solution Capacity with Outer Product Weights
Suppose each factor's state is set to its correct codevector (we initialize it to the correct factorization; the analysis also applies to any correct state that the algorithm arrives at on its own). What is the probability that the state stays there, that is, that the correct factorization is a fixed point of the dynamics? This is the basis of what researchers have called the “capacity” of Hopfield networks, where the states in question are patterns that the network has been trained to store. We choose to call it the “stable-solution capacity” in order to distinguish it from operational capacity, which we define in section 6.2.
We first note that this analysis is necessary only for resonator networks with outer product weights; OLS weights guarantee that the correct solutions are stable, and this is one of that variant's desirable properties. If every factor estimate equals its correct codevector, then factor 1 in a resonator network “sees” as input exactly its own correct codevector, because the other factors cancel out of the composite. For OLS weights, projecting a codevector onto the span of its codebook returns that vector exactly, by the definition of orthogonal projection. Because the same holds for all subsequent factors, the correct factorization is always a fixed point under OLS weights.
For a resonator network with outer product weights, we must examine the preactivation input to each factor, and changing from the pseudoinverse to the transpose makes the situation significantly more complicated. At issue is the probability that a component of this input has a sign different from the corresponding component of the correct codevector, that is, that there is a bit flip in any particular component of the updated state. In general, one may not care whether the state is completely stable; it may be tolerable that the dynamics flip some small fraction of the bits so long as this does not move the state too far away from the correct factorization. Amit, Gutfreund, and Sompolinsky (1985, 1987) established that in Hopfield networks, an avalanche phenomenon occurs where bit flips accumulate and the network becomes essentially useless once the number of stored patterns exceeds $0.138N$, at which point the approximate bit flip probability is 0.0036. While we do not attempt any of this complicated analysis on resonator networks, we do derive an expression for the bit flip probability of any particular factor that accounts for bit flips that “percolate” from factor to factor through the shared input.
The main analytical result in this section is the sequence of equations 6.6 to 6.9, which allow one to compute the bit flip probabilities for each factor in an outer product resonator network. The fact that the probability must in general be split between two conditional probabilities, and that it depends on the bit flip probabilities of the other factors, is what makes it different, for all but the first factor, from the bit flip probability for a Hopfield network (compare equations 6.8 and 6.9 against equation 6.1). But how much different? We are interested in the size of this difference.
Here is a simple intuition for what this is capturing. Suppose there are $F$ Hopfield networks all evolving under their own dynamics; they are running simultaneously but not interacting in any way. At any given time, the bit flip probabilities for the networks are all the same; there is nothing special about any particular one. A resonator network, however, is like a set of Hopfield networks that have been wired up so that each receives an input reflecting the states of the other factors. The networks are no longer independent. In particular, a bit flip in factor 1 gets passed on to factors 2, 3, and so on. This affects the bit flip probability of these other factors, and the magnitude of this effect, which we call percolated noise, is the excess bit flip probability we measure below.
Let us first note that for a Hopfield network with self-connections, the maximum bit flip probability is 0.02275, which occurs when the number of stored patterns equals $N$. The ratio of stored patterns to $N$ is what determines the bit flip probability (see appendix J for an explanation). Percolated noise is measured by the difference between a factor's bit flip probability and this Hopfield baseline, which we plot in Figure 1. Panel (a) shows just five factors, illustrating that the difference is small but nonzero in general. To see whether there is some limiting behavior, we also simulated 100 factors and more; those differences are also shown in Figure 1. In the limit of large $F$, there appears to be a phase change in the residual bit flip probability that occurs at a codevector-to-dimension ratio of about 0.056. In the Hopfield network literature, this is a very important number. It gives the point at which the stored codevectors transition away from being global minimizers of the Hopfield network energy function. When the ratio falls between 0.056 and 0.138, the codevectors are only local minimizers, and there exist spin-glass states that have lower energy. We do not further explore this phase-change phenomenon and leave the (in all likelihood, highly technical) analysis to future work.
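For orientation, the Hopfield baselines quoted above follow from the standard signal-to-noise (Gaussian) approximation. The sketch below assumes that approximation, which is our own shorthand and not a substitute for the full expressions in equations 6.6 to 6.9 and appendix J.

```python
from math import erfc, sqrt

def Q(x):
    # Gaussian tail probability
    return 0.5 * erfc(x / sqrt(2.0))

def hopfield_bitflip_self(n_neurons, n_patterns):
    """Approximate bit flip probability for a Hopfield network *with* self-connections:
    signal ~ n_neurons + n_patterns, crosstalk roughly Normal(0, sqrt(n_neurons * n_patterns))."""
    return Q((n_neurons + n_patterns) / sqrt(n_neurons * n_patterns))

print(hopfield_bitflip_self(1000, 1000))   # ~0.02275: the maximum, attained when patterns = neurons
print(Q(sqrt(1 / 0.138)))                  # ~0.0036: no self-connections, at the classical 0.138 N load
```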
Figure 1: Extra bit flip probability due to percolated noise. In the limit of large $F$, there appears to be a phase change at a codevector-to-dimension ratio of about 0.056. Below this value, resonator networks are just as stable as Hopfield networks, but above this value, they are strictly less stable (by the amount shown).
In conclusion, the second major result of the section is that we have shown, via simulation, that when the ratio of codevectors per factor to vector dimension is below roughly 0.056, the stability of a resonator network with outer product weights is the same as the stability of a Hopfield network. Above this value, percolated noise between the factors causes the resonator network to be strictly less stable than a Hopfield network.
6.2 Operational Capacity
We now define a new notion of capacity that is more appropriate to the factorization problem. This performance measure, called the operational capacity, gives an expression for the maximum size of factorization problem that can be solved with high probability. This maximum problem size, which we denote by $M_{\max}$, varies as a function of the number of elements $N$ in each vector and the number of factors $F$. It gives a very practical characterization of performance and will form the basis of our comparison between resonator networks and the benchmark algorithms we introduced in section 5.1. When the problem size is below the operational capacity of the algorithm, one can be quite sure that the correct factorization will be efficiently found.
The operational capacity of a factorization algorithm that solves problem 2.1 is the largest search space size $M$ such that the algorithm, when limited to a maximum number of iterations, gives a total accuracy at or above a chosen threshold.
We now define what we mean by total accuracy. Each algorithm we have introduced attempts to solve the factorization problem 2.1 by initializing the state and letting the dynamics evolve until some termination criterion is met. It is possible that the final state may not equal the correct factors at every component, but we can “decode” each factor estimate by looking for its nearest neighbor (with respect to Hamming distance or cosine similarity) among the vectors in its respective codebook. This distance computation involves only $DF$ vectors, rather than the $D^F$ we encountered in one of the brute-force strategies of section 2. Compared to the other computations involved in finding the correct factorization out of $M$ total possibilities, this last step of decoding has a very small cost, and we always “clean up” the final state using its nearest neighbor in the codebook. We define the total accuracy to be the sum of accuracies for inferring each factor, where the accuracy for a factor is $1/F$ if the nearest neighbor to its final estimate is the correct codevector and 0 otherwise. For instance, correctly inferring one of three total factors gives a total accuracy of $1/3$, two of three is $2/3$, and three of three is 1.
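The cleanup and scoring step is simple; here is a sketch (function names are ours, reusing the codebook layout from the earlier sketches):

```python
import numpy as np

def cleanup_and_score(x_hat, codebooks, true_idx):
    """Nearest-neighbor cleanup of each factor estimate, followed by the total accuracy."""
    correct = 0
    for x, X, d_true in zip(x_hat, codebooks, true_idx):
        d_hat = int(np.argmax(X.T @ x))      # max inner product = min Hamming distance for bipolar vectors
        correct += (d_hat == d_true)
    return correct / len(codebooks)          # one of 0, 1/F, 2/F, ..., 1
```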
Analytically deriving the expected total accuracy appears to be quite challenging, especially for a resonator network, because it requires that we essentially predict how the nonlinear dynamics will evolve over time. There may be a region around each correct factorization such that states in this region rapidly converge to it, the so-called basin of attraction, but our initial estimate is likely not in the basin of attraction, and it is hard to predict when, if ever, the dynamics will enter this region. Even for Hopfield networks, which obey much simpler dynamics than a resonator network, it is known that so-called “frozen noise” builds up in the network state, making the shapes of the basins highly anisotropic and difficult to analyze (Amari & Maginu, 1988). Essentially all of the analytical results on Hopfield networks consider only the stability of the stored patterns as a (very poor) proxy for how the model behaves when it is initialized to other states. This less useful notion of capacity, the stable-solution capacity, was what we examined in the previous section.
We can, however, estimate the total accuracy by simulating many factorization problems, recording the fraction of factors that were correctly inferred over many, many trials. We remind readers that our results in this article pertain to the factorization of randomly drawn vectors that bear no particular correlational structure, but that the notions of total accuracy and operational capacity would be relevant, and specific, to the factorization of nonrandom vectors. We first note that for fixed vector dimensionality $N$, the empirical mean of the total accuracy depends strongly on $M$, the search space size. We can see this clearly in Figure 2. We show this phenomenon for a resonator network with outer product weights, but this general behavior is true for all of the algorithms under consideration. One can always make the search space large enough that expected total accuracy goes to zero.
Figure 2: Accuracy as a function of $M$ for a resonator network with outer product weights. Three factors ($F = 3$); average over 5000 random trials.
Our notion of operational capacity is concerned with the value of $M$ that causes the expected total accuracy to drop below some threshold. We see here a range of $M$ for which the expected total accuracy is 1.0, beyond which this ceases to be the case. For all values of $M$ within this range, the algorithm essentially always solves the factorization problem.
In this article, we estimate operational capacity with an accuracy threshold of 0.99 (99% or more of factors inferred correctly) and with a maximum number of iterations that limits the model to searching over only a small fraction of the entire search space. These choices are largely practical: 99% or higher accuracy makes the model very reliable in practice, and this operating point can be estimated from a reasonable number (3000 to 5000) of random trials. The iteration limit allows the number of iterations to scale with the size of the problem but restricts the algorithm to considering only a small fraction of the possible factorizations. While a resonator network has no guarantee of convergence, it almost always converges in far fewer iterations than this limit, so long as we stay in this high-accuracy regime. Operational capacity is in general a function of the accuracy threshold and the iteration limit, which we will discuss shortly.
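Our capacity estimates come from sweeps of the following general form. This is a schematic sketch that reuses the helpers sketched earlier (random_problem and cleanup_and_score); the grid of problem sizes, trial counts, and solver interface are placeholders rather than the exact experimental protocol.

```python
import numpy as np

def estimate_capacity(solver, N, F, D_grid, trials, acc_threshold=0.99, seed=0):
    """Empirical operational capacity: the largest M = D^F whose mean total accuracy
    stays at or above the threshold."""
    rng = np.random.default_rng(seed)
    M_max = 0
    for D in D_grid:                                       # sweep increasing problem sizes
        accs = []
        for _ in range(trials):
            # random_problem and cleanup_and_score are the helpers sketched earlier
            codebooks, true_idx, c = random_problem(N, D, F, rng)
            x_hat = solver(c, codebooks)
            accs.append(cleanup_and_score(x_hat, codebooks, true_idx))
        if np.mean(accs) < acc_threshold:
            break                                          # capacity exceeded at this D
        M_max = D ** F
    return M_max
```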
6.2.1 Resonator Networks Have Superior Operational Capacity
We estimated the operational capacity of the benchmark algorithms in addition to the two variants of resonator networks. Figure 3 shows the operational capacity estimated from several thousand random trials, where we display $M_{\max}$ as a function of $N$ for problems with three factors. One can see that the operational capacity of resonator networks is roughly two orders of magnitude greater than the operational capacity of the other algorithms. Each of the benchmark algorithms has a slightly different operational capacity (due to the fact that they each have different dynamics and will, in general, find different solutions), but they are all similarly poor compared to the two variants of resonator networks. (See a similar plot for a different number of factors in appendix B.)
Figure 3: Operational capacity is dramatically higher for resonator networks (blue and red above) than for any of the benchmark algorithms. These points represent the size of factorization problem that can be solved reliably. Shown is operational capacity for $F = 3$ factors; the gap is similarly large for other values of $F$ (see the corresponding plot in appendix B).
As $N$ increases, the performance difference between the two variants of resonator networks starts to disappear, ostensibly due to the fact that randomly chosen codevectors become nearly orthogonal, so the outer product weights approach the OLS weights. The two variants are different in general, but the simulations in this article do not particularly highlight the difference between them. Except for ALS, each of the benchmark algorithms has at least one hyperparameter that must be chosen. We simulated many thousand random trials with a variety of hyperparameter settings for each algorithm and chose the hyperparameter values that performed best on average. (We list these values for each of the algorithms in the appendix.) All of the benchmark algorithms converge on their own, and the tunable step sizes make a comparison of the number of iterations nonstandardized, so we did not impose a maximum number of iterations on these algorithms. The points shown represent the best the benchmark algorithms can do, even when not restricted to a maximum number of iterations.
6.2.2 Operational Capacity Scales Quadratically in $N$
We carefully measured the operational capacity of resonator networks in search of a relationship between $M_{\max}$ and $N$. We focused on resonator networks with outer product weights. For large enough $N$, randomly chosen codevectors are nearly orthogonal, and capacity is approximately the same for OLS weights. We reiterate that operational capacity is specific to its two parameters: the threshold for total accuracy and the maximum number of iterations the algorithm is allowed to take (refer to definition 1). Here we report operational capacity for an accuracy threshold of 0.99 and a limited iteration budget on randomly sampled codevectors. The operational capacity is specific to these choices, which are practical for Vector Symbolic Architectures.
Our simulations revealed that, empirically, resonator network operational capacity scales as a quadratic function of $N$, which we illustrate in Figure 4. The points in this figure are estimated from many thousands of random trials, over a range of values for $N$ and $F$. In panel (a), we show operational capacity separately for each $F$ from 2 to 7, with the drawn curves indicating the least-squares quadratic fit to the measured points. In panel (b), we put these points on the same plot, applying a logarithmic transformation to each axis in order to illustrate that capacity also varies as a function of $F$. Appendix B provides some additional commentary on this topic, including some speculation on a scaling law that combines $N$ and $F$. The parameters of this particular combined scaling are estimated from simulation and not derived analytically; therefore, they may deserve additional scrutiny, and we do not focus on them here. The main message of this section is that capacity scales quadratically in $N$, regardless of how many factors are used.
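The quadratic fits themselves are ordinary least squares. A sketch is below; the measured $(N, M_{\max})$ pairs are distributed with the accompanying software rather than reproduced here.

```python
import numpy as np

def fit_capacity_curve(N_values, M_max_values):
    """Least-squares quadratic fit M_max(N) ~ a*N^2 + b*N + c for a fixed F."""
    a, b, c = np.polyfit(np.asarray(N_values, float), np.asarray(M_max_values, float), deg=2)
    return a, b, c
```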
Figure 4: Operational capacity of resonator networks with OP weights. (a) $M_{\max}$ scales quadratically in $N$. Red points are measured from simulation; black curves are the least-squares quadratic fits, whose parameters are included in appendix B. (b) $M_{\max}$ varies as a function of both $N$ and $F$. Over the measured range for $N$, capacity is highest for $F = 3$ and $F = 4$. Data for $F = 2$ were omitted to better convey the trend for $F = 3$ and higher, but see appendix B for the full picture.
The curves in Figure 4 are constructive in the following sense: given a fixed $N$, they indicate the largest factorization problem that can be solved reliably. Conversely, and this is often the case in VSAs, the problem size is predetermined, while $N$ is variable. In this case, the curves tell us how large one must make $N$. We include in the official software implementation that accompanies this article a text file with all of the measured operational capacities.
Quadratic scaling means that one can aspire to solve very large factorization problems, so long as one can build a resonator network with a big enough $N$. We attempted to estimate capacity for even larger values of $N$ than we report in Figure 4, but this was beyond the capability of our current computational resources. A useful contribution of follow-on work would be to leverage high-performance computing to measure some of these values. Applications of Vector Symbolic Architectures typically use $N$ on the order of $10^4$, but there are other reasons one might attempt to push resonator networks further. Early work on Hopfield networks proposed a technique for storing solutions to the traveling salesman problem as fixed points of the model's dynamics (Hopfield & Tank, 1985), and this became part of a larger approach using nonlinear dynamical systems to solve hard search problems. We do not claim that any particular search problem, other than the factorization we have defined (see problem 2.1), can be solved by resonator networks. Supposing, however, that some other hard problem can be cast in the form of equation 2.1, the quadratic scaling of operational capacity makes this a potentially powerful tool.
Our result, measured operational capacity indicating an approximately quadratic relationship between $M_{\max}$ and $N$, is an important characterization of resonator networks. It suggests that our framework scales to very large factorization problems and serves as a guideline for implementation. Our attempts to analytically derive this result were stymied by the limitations of the toolbox of nonlinear dynamical systems theory. Operational capacity involves the probability that this system, when initialized to an effectively random state, converges to a particular set of fixed points. No results from the study of nonlinear dynamical systems that we are aware of allow us to derive such a strong statement about resonator networks. Still, the scaling in Figure 4 is fairly suggestive of some underlying law, and we are hopeful that a theoretical explanation exists, waiting to be discovered.
6.3 Search Speed
If a resonator network is not consistently descending an energy function, is it just aimlessly wandering around the space, trying every possible factorization until it finds the correct one? Figure 5 shows that it is not. We plot the mean number of iterations over 5000 random trials as a fraction of $M$, the search space size. This particular plot is based on a resonator network with outer product weights. In the high-performance regime where $M$ is below the operational capacity, the number of iterations is far fewer than the cutoff we used in the simulations of section 6.2; the algorithm only ever considers a tiny fraction of the possible factorizations before it finds the solution.
Figure 5: Iterations until convergence for a resonator network with outer product weights. The number of iterations is very small compared to the size of the search space.
Section 6.2.1 compared the operational capacity of different algorithms and showed that compared to the benchmarks, resonator networks can solve much larger factorization problems. This is in the sense that the dynamics eventually converge (with high probability) on the correct factorization, while the dynamics of the other algorithms converge on spurious factorizations. This result, however, does not directly demonstrate the relative speed with which factorizations are found in terms of either the number of iterations or the amount of time to convergence. We set up a benchmark to determine the relative speed of resonator networks, and our main finding is depicted in Figure 6.
Figure 6: Our benchmark of factorization speed. Implementation in Python with NumPy, run on a machine with an Intel Core i7-6850K processor and 32 GB RAM. We generated 5000 random instantiations of the factorization problem (fixed $N$, $D$, and $F$), running each of the four algorithms in turn. (a) Convergence traces for 100 randomly drawn factorization problems (out of the total 5000); each line is the cosine similarity between the composite and its current estimate over iterations of the algorithm. Each of the four algorithms is run on the same 100 factorization problems. All of the instances are solved by the resonator network, whereas a sizable fraction of the instances are not solved by the benchmark algorithms, at least within 100 iterations. (b) Average cosine similarity versus iteration number (only trials with accuracy 1.0). (c) Average cosine similarity versus wall-clock time (only trials with accuracy 1.0).
Measured in number of iterations, resonator networks are comparable to the benchmark algorithms. We noted that ALS is the greediest of the benchmarks, and one can see from Figure 6 that it is the fastest in this sense. We are considering only trials that ultimately found the correct factorization, which in this simulation excluded a sizable fraction of trials for each of the benchmarks. In contrast, resonator networks always eventually found the correct factorization. Measured in terms of wall-clock time, resonator networks are significantly faster than the benchmarks. This can be attributed to their substantially lower per-iteration cost. Resonator networks with outer product weights utilize very simple arithmetic operations, and this explains the difference between Figures 6b and 6c.
6.4 Dynamics That Do Not Converge
One must be prepared for the possibility that the dynamics of a resonator network will not converge. Fortunately, for $M$ below the operational capacity, such events are exceedingly rare. From simulation, we identified three major regimes of different convergence behavior, which are depicted in Figure 7:
For $M$ small enough, almost all trajectories converge. Moreover, they converge to a state that yields the correct factorization. Limit cycles are possible but rare, and often still yield the correct factorization. There appear to be few if any spurious fixed points (those yielding an incorrect factorization). If the trajectory converges to a point attractor or limit cycle, one can be confident this state indicates the correct factorization.
As $M$ increases, nonconverging trajectories appear in greater proportion and yield incorrect factorizations. Any trajectories that converge on their own continue to yield the correct factorization, but these become less common.
Beyond some saturation value of $M$ (roughly depicted as the transition from red to blue in the figure), both limit cycles and point attractors reemerge, but they yield incorrect factorizations.
Figure 7: Regimes of different convergence behavior. Curves show measurements from simulation of an outer product resonator network with three factors. This is also meant as a diagram of convergence behavior for resonator networks in general. Shown in black is the average decoding accuracy, and shown in gray is the median number of iterations taken by the network. For low enough $M$, the network always finds a fixed point yielding full accuracy. The network will not converge to spurious fixed points in this regime (green). As $M$ is increased, more trajectories wander, not converging in any reasonable time (red). Those that are forcibly terminated yield incorrect factorizations. For large enough $M$, the network is completely saturated, and most states are fixed points, regardless of whether they yield the correct factorization (blue). Resonator networks with OLS weights are always stable at the correct factorization, but OP weights give a bit flip probability that is zero only asymptotically in $N$ (see section 6.1 and appendix J).
In theory, limit cycles of any length may appear, although in practice they tend to be skewed toward small cycle lengths. Networks with two factors are the most likely to find limit cycles, and this likelihood appears to decrease with increasing numbers of factors. Our intuition about what happens in the middle section of Figure 7 is that the basins of attraction become very narrow and hard for the resonator network dynamics to find. The algorithm will wander, since it has so few spurious fixed points (see section 6.6), but it may not be able to find any basin of attraction.
6.5 Factoring a “Noisy” Composite Vector
Our assumption has been that one combination of codevectors from our codebooks generates the composite vector exactly. What if this is not the case? Perhaps the vector we are given for factorization has had some proportion of its components flipped; that is, we are given a corrupted vector that differs from the true composite in some number of places. The true composite has a factorization based on our codebooks, but the corrupted vector does not. We should hope that a resonator network will return the factors of the true composite so long as the corruption is not too severe. This is an especially important capability in the context of Vector Symbolic Architectures, where the vector to be factored will often be the result of algebraic manipulations that generate noise and corrupt the original to some degree. We show in Figure 8 that a resonator network can still produce the correct factorization even after a significant number of bits have been flipped. This robustness is more pronounced when the search space size is well below the operational capacity, in which case the model can often still recover the correct factorization even when a large fraction of the bits have been flipped.
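Testing this robustness amounts to flipping a random subset of bits before factorization, as in the sketch below. It reuses the helpers sketched earlier (random_problem, resonator_op, cleanup_and_score), and the corruption level shown is arbitrary.

```python
import numpy as np

def corrupt(c, flip_fraction, rng):
    """Flip a random fraction of the bits of the composite vector."""
    flips = rng.random(c.shape[0]) < flip_fraction
    return np.where(flips, -c, c)

rng = np.random.default_rng(1)
codebooks, true_idx, c = random_problem(N=1500, D=30, F=3, rng=rng)
x_hat = resonator_op(corrupt(c, 0.25, rng), codebooks)   # factor a composite with 25% of bits flipped
print(cleanup_and_score(x_hat, codebooks, true_idx))
```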
Figure 8: Factoring a corrupted composite vector. For $M$ well below capacity (lighter curves above), one can sustain heavy corruption to the composite and still find the correct factorization.
6.6 A Theory for Differences in Operational Capacity
The failure mode of each benchmark algorithm is getting stuck at a spurious fixed point of the dynamics. This section develops a simple comparison between the spurious fixed points of resonator networks and the benchmarks as an explanation for why resonator networks enjoy relatively higher operational capacity. From among the benchmarks we focus on Projected Gradient Descent (PGD; applied to the negative inner product with the simplex constraint) to illustrate this point. We will show that the correct factorization is always stable under PGD (as it is with the OLS variant of resonator networks), but that incorrect factorizations are much more likely to be fixed points under PGD. The definition of PGD can be found in Table 2, with some comments in appendix G.
6.6.1 Stability of the Correct Factorization
The correct factorization is always stable under PGD. This matches the stability of OLS resonator networks, which are, by construction, always stable at the correct factorization. We showed in section 6.1 that OP weights induce some instability and that percolated noise makes the model marginally less stable than Hopfield networks, but there is still a large range of factorization problem sizes where the network is stable with overwhelming probability. What distinguishes the benchmarks from resonator networks is what we cover next: the stability of incorrect factorizations.
6.6.2 Stability of Incorrect Factorizations
Suppose initialization is done with a random combination of codevectors that do not produce the composite. The input each factor then sees is effectively a completely random bipolar vector. So long as $D$ is significantly smaller than $N$, which it always is in our applications, this input will be nearly orthogonal to every vector in the codebook, and its projection onto the span of the codebook will be small, with each component equally likely to be positive or negative. Therefore, under the dynamics of a resonator network with OLS weights, each component will flip its sign compared to the initial state with probability $1/2$, and the state for this factor will remain unchanged with the minuscule probability $2^{-N}$. The total probability of this incorrect factorization being stable, accounting for each factor, is therefore $2^{-NF}$. Suboptimal factorizations are very unlikely to be fixed points. The same is true for a resonator network with OP weights because each element of the preactivation vector is approximately gaussian with mean zero (see section 6.1 and appendix J).
Contrast this against PGD. We recall from equation 6.13 that the requirement for a combination of codevectors to be a fixed point is that the corresponding component of the gradient at this point be the largest by a margin of 1 or more. This is a much looser stability condition than we had for resonator networks. Such a scenario will actually occur with probability roughly $1/D$ for each factor, so the total probability is roughly $D^{-F}$. While still a relatively small probability, in typical VSA settings $D^{-F}$ is much larger than $2^{-NF}$, meaning that compared to resonator networks, PGD is much more stable at incorrect factorizations. Empirically, the failure mode of PGD involves it settling on one of these spurious fixed points.
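To put rough numbers on this comparison, using illustrative problem sizes and the approximate stability probabilities argued above:

```python
from math import log10

N, D, F = 1500, 100, 3                 # illustrative VSA-scale problem sizes
log_p_resonator = N * F * log10(0.5)   # log10 of ~(1/2)^(N*F), the resonator (OLS) estimate
log_p_pgd = -F * log10(D)              # log10 of ~(1/D)^F, the PGD estimate
print(log_p_resonator, log_p_pgd)      # roughly -1355 versus -6: PGD admits vastly more spurious fixed points
```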
6.6.3 Stability in General
The cases of correct and incorrect factorizations drawn from the codebooks are two extremes along a continuum of possible states the algorithm can be in. For PGD, any state will be stable with a probability lying in an interval whose upper end is 1 and whose lower end is roughly $D^{-F}$, while for resonator networks (with OLS weights) the corresponding lower end is roughly $2^{-NF}$. In practical settings for VSAs, the first lower end is, in a relative sense, much larger than the second. Vectors drawn uniformly from the respective constraint sets concentrate near the lower ends of these intervals, suggesting that, on average, PGD has many more spurious fixed points.
This statement is not fully complete in the sense that dynamics steer the state along specific trajectories, visiting states in a potentially nonuniform way, but it does suggest that PGD is much more susceptible to spurious fixed points. The next section shows that these trajectories do in fact converge on spurious fixed points as the factorization problem size grows.
6.6.4 Basins of Attraction for Benchmark Algorithms
We ran a simulation at a problem size for which PGD and Multiplicative Weights have total accuracies of 0.625 and 0.525, respectively. We created 5000 random factorization problems, initializing the state according to equation 6.14 and allowing the dynamics to run until convergence. We did this first with a nudge toward the correct factorization (squares in Figure 9) and then with a nudge away from it, toward a randomly chosen spurious factorization (triangles in Figure 9).
Figure 9: States in the hypercube interior get pulled into spurious basins of attraction. PGD is shown in green and Multiplicative Weights in orange. The network is initialized at a fixed distance from the center of the simplex (see equation 6.14) and allowed to converge. The vertical axis is the accuracy of the factorization implied by the converged state. Triangles indicate initialization slightly away from the correct vertex, toward any of the other simplex vertices, which is most directions in the space. These initial states quickly get pulled into a spurious basin of attraction.
What Figure 9 shows is that by moving just a small distance toward the correct vertex, we very quickly fall into its basin of attraction. However, moving toward any of the other vertices is actually somewhat likely to take us into a spurious basin of attraction (where the converged state is decoded into an incorrect factorization). The space is full of these bad directions. It would be very lucky indeed to start from the center of the simplex and move immediately toward the solution. It is far more likely that initial updates take us somewhere else in the space, toward one of the other vertices, and this plot shows that these trajectories often get pulled toward a spurious fixed point. What we are demonstrating here is that empirically, the interior of the hypercube is somewhat treacherous from an optimization perspective, and this lies at the heart of why the benchmark algorithms fail.
From among the benchmarks, we restricted our analysis of spurious fixed points to PGD and, in Figure 9, Multiplicative Weights. This choice was made for clarity, and similar arguments apply for all of the benchmarks. While the details may differ slightly (e.g., spurious fixed points of ALS appear near the simplex center, not at a vertex), the failure mode of the benchmarks is strikingly consistent. They all become overwhelmed by spurious fixed points long before this effect is felt by resonator networks. We have shown that, in expectation, PGD has many more spurious fixed points than resonator networks. We have also shown that trajectories moving through the interior of the hypercube are easily pulled into these spurious basins of attraction.
7 Discussion
We studied a vector factorization problem that arises in the use of Vector Symbolic Architectures (as introduced in the companion article in this issue), showing that resonator networks solve this problem remarkably well. Their performance comes from a particular form of nonlinear dynamics, coupled with the idea of searching in superposition. Solutions to the factorization problem lie in a small sliver of the space (the corners of the bipolar hypercube $\{-1,+1\}^N$), and the highly nonlinear activation function of resonator networks serves to constrain the search to this subspace. We drew connections between resonator networks and a number of benchmark algorithms that cast factorization as a problem of optimization. This intuitively satisfying formulation appears to come at a steep cost. None of the benchmarks were competitive with resonator networks in terms of the key metrics that characterize factorization performance. One explanation for this is that the benchmarks have comparatively many more spurious fixed points in their dynamics and that the loss function landscape in the interior of the hypercube induces trajectories that approach these spurious fixed points.
Unlike the benchmarks, resonator networks do not have a global convergence guarantee, and in some respects we see this as a beneficial characteristic of the model. Requiring global convergence appears to unnecessarily constrain the search for factorizations, leading to lower capacity. Besides, operational capacity (defined in this article) specifies a regime where the lack of a convergence guarantee can be practically ignored. Resonator networks almost always converge in this setting, and the fixed points yield the correct solution. The benchmarks are, by steadfastly descending a loss function, in some sense greedier than resonator networks. It appears that resonator networks strike a more natural balance between making updates based on the best-available local information and still exploring the solution space while not getting stuck. Our approach follows a kind of “Goldilocks principle” on this trade-off: not too much, not too little, but just right.
We are not the first to consider eschewing convergence guarantees to better solve hard search problems. For instance, randomized search algorithms utilize some explicit form of randomness to find better solutions, typically converging only if this randomness is reduced over time (Spall, 2005). In contrast, our model is completely deterministic, and the searching behavior comes from nonlinear heteroassociative dynamics. Another example is the proposal to add small amounts of random asymmetry to the (symmetric) weight matrix of Hopfield networks (Hertz et al., 1986). This modification removes the guaranteed absence of cyclic and chaotic trajectories that holds for the traditional Hopfield model. But at the same time, and without significantly harming the attraction of memory states, adding asymmetry to the weights can improve associative memory recall by shrinking the basins of attraction associated with spurious fixed points (Singh et al., 1995; Chengxiang et al., 2000).
We emphasize that while resonator networks appear to be better than alternatives for the particular vector factorization problem 2.1, this is not a claim they are appropriate for other hard search problems. Rather, resonator networks are specifically designed for the vector factorization problem at hand. There exist several prior works involving some aspect of factorization that we mention here, but we emphasize that each one of them deals with a problem or approach that is distinct from what we have introduced in this article.
Tensor decomposition is a family of problems that bear some resemblance to the factorization problem we have introduced, problem 2.1. Key differences include the object to be factored, which is a higher-order tensor, not a vector, and the constraints on the allowable factors. We explain in appendix D how our factorization problem differs from traditional tensor decompositions. Our benchmarks actually included the standard tensor decomposition algorithm, Alternating Least Squares, reexpressed for problem 2.1, and we found that it is not well matched to this factorization problem. Bidirectional Associative Memory, proposed by Kosko (1988), is an extension of Hopfield networks that stores pairs of factors in a matrix using the outer product learning rule. The composite object is a matrix, rather than a vector, and the model is much closer to a particular type of tensor decomposition called the CP decomposition, which we elaborate on in appendix D. Besides the fact that this model applies only to two-factor problems, its dynamics are different from ours and its capacity is relatively low (Kobayashi, Hattori, & Yamazaki, 2002). Subsequent efforts to extend this model to factorizations with three or more factors (Huang & Hagiwara, 1999; Kobayashi, Hattori, & Yamazaki, 2002) have had very limited success and still rely on matrices that connect pairs of factors rather than the single multilinear product that we have in our model. The bilinear models of style and content of Tenenbaum and Freeman (2000) were an inspiration for us in deciding to work on factorization problems. That article applies a different type of tensor decomposition, a Tucker decomposition (again see appendix D), to a variety of real-valued data sets, using in one case what appears to be a closed-form solution based on the singular value decomposition and in the other case a variant of ALS. In that sense, their method is different from ours, the factorization problem is itself different, and they consider only pairs of factors. Memisevic and Hinton (2010) revisit the Tucker decomposition problem but factor the core tensor representing interactions between factors in order to make estimation more tractable. They propose a Boltzmann machine that computes the factorization and show some results on modeling image transformations. Finally, there is a large body of work on matrix factorization of the form $\mathbf{X} \approx \mathbf{A}\mathbf{B}$, the best known of which is probably nonnegative matrix factorization (Lee & Seung, 2001). The product $\mathbf{A}\mathbf{B}$ can be thought of as a sum of outer products, so this is really a type of CP decomposition with an additional constraint on the sign of the factors. Different still is the fact that $\mathbf{A}$ is often interpreted as a basis for the columns of $\mathbf{X}$, with $\mathbf{B}$ containing the coefficients of each column with respect to this basis. In this sense, vectors are being added to explain $\mathbf{X}$ rather than combined multiplicatively; nonnegative matrix factorization is much closer to sparse coding (Hoyer, 2004).
The companion article in this issue illustrates how distributed representations of data structures can be built with the algebra of Vector Symbolic Architectures, as well as how resonator networks can decompose these data structures. VSAs are a powerful way to think about structured connectionist representations, and resonator networks make the framework much more scalable. Extending the examples found in our companion article to more realistic data (e.g., complex three-dimensional visual scenes) could be a useful application of resonator networks. This will likely require learning a transform from pixels into the space of high-dimensional symbolic vectors, and this learning should ideally occur in the context of the factorization dynamics, an exciting avenue for future study. Here we have not shown resonator circuits for anything other than bipolar vectors. However, a version of the model wherein vector elements are unit-magnitude complex phasors is a natural next extension and relevant to holographic reduced representations, a VSA developed by Plate (2003). A recent theory of sparse phasor associative memories (Frady & Sommer, 2019) may allow one to perform this factorization with a network of spiking neurons.
Resonator networks are an abstract neural model of factorization, introduced for the first time in this two-part series. We believe that as the theory and applications of resonator networks are further developed, they may help us understand factorization in the brain, which remains an important mystery.
Appendix A: Implementation Details
This appendix includes a few comments relevant to the implementation of resonator networks. Algorithm 1 gives pseudocode for ordinary least squares weights---the only change for outer product weights is to use the transpose $\mathbf{X}_f^\top$ instead of the pseudoinverse $\mathbf{X}_f^\dagger$. So long as the number of codevectors is small relative to $N$, applying $\mathbf{X}_f$ and $\mathbf{X}_f^\dagger$ as two separate matrix-vector products has lower computational complexity than actually forming a single synaptic matrix and then multiplying by it in each iteration---it is faster to keep the matrices $\mathbf{X}_f$ and $\mathbf{X}_f^\dagger$ separate. This of course assumes that the implementation is on a conventional computer. If one can use specialized analog computation, such as large mesh circuits that directly implement matrix-vector multiplication in linear time (Cannon, 1969), then it would be preferable to store the synaptic matrix directly.
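As a quick illustration of this point (a toy timing check, not our benchmarking code), the following compares the two orders of multiplication for a single codebook, using outer product weights for simplicity; applying the codebook matrices separately costs roughly $2ND$ multiply-adds per iteration rather than $N^2$. The sizes chosen here are arbitrary.

```python
import numpy as np
from timeit import timeit

N, D = 4096, 64
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(N, D))   # codebook, one codevector per column
v = rng.choice([-1.0, 1.0], size=N)

W = X @ X.T                                 # N x N synaptic matrix (outer product weights)

t_separate = timeit(lambda: X @ (X.T @ v), number=100)   # ~2*N*D multiply-adds per call
t_together = timeit(lambda: W @ v, number=100)           # ~N*N multiply-adds per call
print(t_separate < t_together)              # True whenever D is much smaller than N
```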
Lines 11 to 13 in algorithm 1 clean up using the nearest neighbor in the codebook and also resolve a sign ambiguity inherent to the factorization problem. The sign ambiguity is simply this: while $\mathbf{c} = \mathbf{x}_1 \odot \mathbf{x}_2 \odot \cdots \odot \mathbf{x}_F$ is the factorization we are searching for, we also have $\mathbf{c} = (-\mathbf{x}_1) \odot (-\mathbf{x}_2) \odot \mathbf{x}_3 \odot \cdots \odot \mathbf{x}_F$, and, more generally, any even number of factors can have their signs flipped and still generate the correct $\mathbf{c}$. Resonator networks will sometimes find these solutions. We clean up using the codevector with the largest unsigned similarity to the converged $\hat{\mathbf{x}}_f$, which remedies this issue. Note that we have written algorithm 1 to update factors in order from 1 to $F$. This is completely arbitrary, and any ordering is fine. We have experimented with choosing a random update order during each iteration, but this did not seem to significantly affect performance.
Computing $\hat{\mathbf{x}}_f(t+1)$ with the most recently updated values for factors 1 to $f-1$ (see equation 3.5) is a convention we call asynchronous updates, in rough analogy to the same term used in the context of Hopfield networks. An alternative convention is, when computing $\hat{\mathbf{x}}_f(t+1)$, to use not the freshly updated values for factors 1 to $f-1$ but rather their values before the update. This treats each factor as if it were being updated simultaneously, a convention we call synchronous updates. The distinction is an idiosyncrasy of modeling resonator networks in discrete time; the difference between the two conventions disappears in continuous time. Throughout this article, our analysis and simulations have used asynchronous updates, which we find to converge significantly faster.
Not shown in algorithm 1 is the fact that, in practice, we record a buffer of past states, allowing us to detect when the dynamics fall into a limit cycle and to terminate early.
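To make appendix A concrete, here is a minimal NumPy sketch of a resonator network iteration with outer product weights, asynchronous updates, the unsigned-similarity cleanup, and a small buffer for limit-cycle detection. The function and variable names, the initialization, and the stopping criterion are our own illustrative choices rather than a transcription of algorithm 1.

```python
import numpy as np

def resonator_outer_product(c, codebooks, max_iters=200, history=20):
    """Minimal sketch of a resonator network with outer product weights.

    c         : composite vector in {-1, +1}^N
    codebooks : list of F matrices, each of shape (N, D_f), with bipolar columns
    Returns one decoded codevector index per factor.
    """
    N, F = c.shape[0], len(codebooks)
    # Initialize each estimate to the (signed) superposition of its codevectors.
    x_hat = [np.sign(X.sum(axis=1) + 0.5) for X in codebooks]
    past_states = []

    for _ in range(max_iters):
        for f in range(F):  # asynchronous: factor f sees the freshest estimates
            others = np.ones(N)
            for g in range(F):
                if g != f:
                    others = others * x_hat[g]
            suggestion = c * others                       # "explain away" the other factors
            X = codebooks[f]
            # Outer product weights applied as two matrix-vector products
            # (cheaper than forming the N x N matrix X @ X.T explicitly).
            x_hat[f] = np.sign(X @ (X.T @ suggestion) + 0.5)   # +0.5 breaks sign ties

        state = np.concatenate(x_hat)
        # A small buffer of past states detects fixed points and short limit cycles.
        if any(np.array_equal(state, s) for s in past_states):
            break
        past_states = (past_states + [state])[-history:]

    # Clean up with the largest *unsigned* similarity, which also resolves
    # the sign ambiguity discussed above.
    return [int(np.argmax(np.abs(X.T @ x))) for X, x in zip(codebooks, x_hat)]
```

With random bipolar codebooks and a composite formed from one codevector per factor, a loop of this form typically recovers the correct indices when the problem size lies within the operational capacity regime.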
Appendix B: Operational Capacity
The main text introduced our definition of operational capacity and highlighted our two main results: that resonator networks have superior operational capacity compared to the benchmark algorithms, and that resonator network capacity scales as a quadratic function of $N$. This appendix provides some additional support and commentary on these findings.
Figure 10 compares operational capacity among all of the considered algorithms when $F$, the number of factors, is four. We previously showed this type of plot in Figure 3 of the main text. Resonator networks have an advantage of between two and three orders of magnitude over all of our benchmarks; the general size of this gap was consistent in all of our simulations.
Figure 10: Comparing operational capacity against the benchmarks for $F = 4$ (four factors).
We concluded in section 6.2 that the operational capacity of resonator networks scales quadratically in $N$, which was shown in Figure 4. In Table 1 we provide the parameters of the least-squares quadratic fits shown in that plot. One can see from Figure 4b that capacity differs depending on the number of factors involved, and in the limit of large $N$, this difference is determined by the leading quadratic coefficient. This coefficient first rises from two to three factors and then falls with increasing $F$. This implies that factorization is easiest for resonator networks when the decomposition is into three factors, an interesting phenomenon for which we do not have an explanation at this time.
The leading parameter of the quadratic scaling depends on $F$; we find that it may follow an inverse power law in $F$.
Appendix C: Table of Benchmark Algorithms
Algorithm | Dynamics for Updating Each Factor
---|---
Alternating Least Squares | equation 6.2
Iterative Soft Thresholding | equation C.1
Fast Iterative Soft Thresholding | equation C.2
Projected Gradient Descent | equation C.3
Multiplicative Weights | equation C.4
Map Seeking Circuits | equation C.5
Note: See appendixes D to I for discussion of each algorithm, including its hyperparameters (step sizes and thresholds) as well as initial conditions.
Appendix D: Tensor Decompositions and Alternating Least Squares
Tensors are multidimensional arrays that generalize vectors and matrices. An $F$th-order tensor has elements that can be indexed by $F$ separate indexes; a vector is a tensor of order 1 and a matrix is a tensor of order 2. As devices for measuring multivariate time series have become more prevalent, the fact that these data can be expressed as a tensor has made the study of tensor decomposition a popular subfield of applied mathematics. Hitchcock (1927) is often credited with originally formulating tensor decompositions, but modern tensor decomposition was popularized in the field of psychometrics by the work of Tucker (1966), Carroll and Chang (1970), and Harshman (1970). This section highlights the substantial difference between tensor decomposition and the factorization problem solved by resonator networks.
This fact of decomposition uniqueness illustrates one way that basic results for matrices fail to generalize to higher-order tensors (by higher-order, we simply mean that the order is 3 or more). Low-rank CP decomposition for matrices (tensors of order 2) may be computed with the truncated singular value decomposition (SVD). However, if $\mathbf{M}$ is a matrix with truncated SVD $\mathbf{M} \approx \mathbf{U}\boldsymbol{\Sigma}\mathbf{V}^\top$, then any nonsingular matrix $\mathbf{Q}$ generates an equally good CP decomposition $(\mathbf{U}\boldsymbol{\Sigma}\mathbf{Q})(\mathbf{Q}^{-1}\mathbf{V}^\top)$. The decomposition is highly nonunique. All matrices have an SVD, whereas generic higher-order tensors are not guaranteed to have a CP decomposition. And yet, if a CP decomposition exists, then under the mild condition of equation D.2 it is unique. This is a somewhat miraculous fact, suggesting that in this sense, CP decomposition of higher-order tensors is easier than for matrices. The higher order of the composite object imposes many more constraints, which make the decomposition unique.
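As a toy numerical illustration of this non-uniqueness (the notation and the random test matrix here are purely illustrative), the following snippet perturbs a truncated SVD by an arbitrary invertible matrix and confirms that the reconstruction error is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 40))
R = 5                                        # target rank of the decomposition

U, s, Vt = np.linalg.svd(M, full_matrices=False)
A, B = U[:, :R] * s[:R], Vt[:R, :]           # truncated SVD: M ~ A @ B

Q = rng.standard_normal((R, R))              # a generic (hence invertible) R x R matrix
A2, B2 = A @ Q, np.linalg.solve(Q, B)        # an equally good rank-R factorization

err_svd = np.linalg.norm(M - A @ B)
err_q = np.linalg.norm(M - A2 @ B2)
print(np.isclose(err_svd, err_q))            # True: matrix CP decomposition is non-unique
```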
Another interesting way that higher-order tensors differ from matrices is that computing matrix rank is easy, whereas in general, computing tensor rank is NP-hard, along with many other important tensor problems (Hillar & Lim, 2013). Our intuition about matrices largely fails us when dealing with higher-order tensors. In some ways the problems are easier and in some ways they are harder. See Sidiropoulos et al. (2017) for a more comprehensive comparison.
The vector factorization problem defined by equation 2.1 differs from CP decomposition in three key ways:
The composite object to be factored is a vector, not a higher-order tensor. This is an even more extreme difference than between matrices and higher-order tensors. In CP decomposition, the arrangement and numerosity of tensor elements constitute many constraints on what the factorization can be, so much so that it resolves the uniqueness issue we outlined above. In this sense, tensors contain much more information about the valid factorization, making the problem significantly easier. The size and form of these tensors may make finding CP decompositions a computational challenge, but CP decomposition is analytically easier than our vector factorization problem.
Search is conducted over a discrete set of possible factors. This differs from the standard formulation of CP decomposition, which makes no restriction to a discrete set of factors. It is however worth noting that a specialization of CP decomposition, called CANonical DEcomposition with LINear Constraints (CANDELINC) (Carroll, Pruzansky, & Kruskal, 1980), does in fact impose the additional requirement that factors are formed from a linear combination of some basis factors. In our setup the solutions are “one-hot” linear combinations.
The factors are constrained to $\{-1,+1\}^N$, a small sliver of $\mathbb{R}^N$. This difference should not be underestimated. We showed in section 6.6 that the interior of this hypercube is treacherous from an optimization perspective and that resonator networks avoid it by using a highly nonlinear activation function. This would not make sense in the context of standard CP decomposition.
Perhaps the most convincing demonstration that equation 2.1 is not CP decomposition comes from the fact that we applied Alternating Least Squares to it and found that its performance was relatively poor. ALS is in fact the workhorse algorithm of CP decomposition (Kolda & Bader, 2009), but it cannot compete with resonator networks on our different factorization problem (2.1). The excellent review of Kolda and Bader (2009) covers CP decomposition and ALS in significant depth, including the fact that ALS always converges to a local minimum of the squared error reconstruction loss. See, in particular, section 3.4 of their paper for more detail.
One special case of CP decomposition involves rank-1 components that are symmetric and orthogonal. For this problem, a special case of ALS called the tensor power method can be used to iteratively find the best low-rank CP decomposition through what is known as deflation, which is identical to the explaining away we introduced in the companion article in this issue. The tensor power method directly generalizes the matrix power method, and in this special case of symmetric, orthogonal tensors it is effective at finding the CP decomposition. A good initial reference for the tensor power method is De Lathauwer, De Moor, and Vandewalle (2000b). A discussion of applying tensor decompositions to statistical learning problems is covered by Anandkumar, Ge, Hsu, Kakade, and Telgarsky (2014), which develops a robust version of the tensor power method and contains several important probabilistic results for applying tensor decompositions to noisy data. The tensor power method differs from resonator networks in the same key ways as ALS: composite objects are higher-order tensors, not vectors; search is not necessarily over a discrete set; the vectors are not constrained to $\{-1,+1\}^N$; and the dynamics make linear least squares updates to each factor.
The summary of this section is that vector factorization problem 2.1 is not tensor decomposition. In some sense it is more challenging. Perhaps not surprisingly, the standard algorithm for tensor decompositions, ALS, is not particularly competitive on this problem when compared to resonator networks. It is interesting to consider whether tensor decomposition might be cast into a form amenable to solution by resonator networks. Given the importance of tensor decomposition as a tool of data analysis, we believe this warrants a closer look.
Appendix E: General Notes on Gradient-Based Algorithms
E.1 Fixed-Step-Size Gradient Descent on the Squared Error
Appendix F: Iterative Soft Thresholding and Fast Iterative Soft Thresholding
Convergence analysis of ISTA is beyond the scope of this article, but it has been shown in various places (Bredies & Lorenz, 2008, for instance) that ISTA converges at a rate of $O(1/t)$. ISTA works well in practice, although for four or more factors, we find that it is not quite as effective as the algorithms that do constrained descent on the negative inner product loss. By virtue of not directly constraining the coefficients, ISTA allows them to leave the simplex. This may make it easier to find the true minimizers, but it may also lead the method to encounter more suboptimal local minimizers, which we found to be the case in practice.
One common criticism of ISTA is that it can get trapped in shallow parts of the loss surface and thus suffers from slow convergence (Bredies & Lorenz, 2008). A straightforward improvement, based on Nesterov's momentum for accelerating first-order methods, was proposed by Beck and Teboulle (2009), which they call Fast Iterative Soft Thresholding (FISTA). The dynamics of FISTA are written in equation C.2 and converge at the significantly better rate of $O(1/t^2)$, a result proven in Beck and Teboulle (2009). Despite this difference in worst-case convergence rate, we find that the average-case convergence rate on our particular factorization problem does not differ significantly. Initial coefficients are set as for ISTA, the auxiliary momentum variable is initialized to 1, and for all experiments the step size was set to 0.01, the same as for ISTA.
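For concreteness, here is a generic sketch of the ISTA and FISTA iterations for a least-squares data term with an $\ell_1$ penalty; the loss, variable names, and hyperparameters below are illustrative stand-ins for the quantities defined in equations C.1 and C.2 rather than our exact experimental setup.

```python
import numpy as np

def soft_threshold(v, tau):
    """Elementwise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def ista(A, y, lam, step, iters=500):
    """ISTA for (1/2)||y - A x||^2 + lam ||x||_1; objective gap decays as O(1/t)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - y)                           # gradient of the smooth part
        x = soft_threshold(x - step * grad, step * lam)
    return x

def fista(A, y, lam, step, iters=500):
    """FISTA: ISTA plus Nesterov momentum; objective gap decays as O(1/t^2)."""
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(iters):
        grad = A.T @ (A @ z - y)
        x_next = soft_threshold(z - step * grad, step * lam)
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        z = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum extrapolation
        x, t = x_next, t_next
    return x
```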
Appendix G: Projected Gradient Descent
There exist algorithms for efficiently computing projections onto both the simplex and the $\ell_1$ ball (see Held, Wolfe, & Crowder, 1974; Duchi, Shalev-Shwartz, Singer, & Chandra, 2008; Condat, 2016). We use a variant summarized in Duchi et al. (2008) that has computational complexity $O(D \log D)$; recall that each coefficient vector has $D$ components, so this is the dimensionality of the simplex or the ball being projected onto. When constraining to the simplex, we set the initial coefficients to $\frac{1}{D}\mathbf{1}$, the center of the simplex. When constraining to the unit ball, we initialize so that all coefficients are equal and the vector lies in the interior of the ball. The only hyperparameter is the step size, which in all experiments was set to 0.01. Recall that we defined the null space of the projection operation with equation 6.11 in section 6.6, and the special case for the simplex constraint in equations 6.12 and 6.13.
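Below is a minimal sorting-based projection onto the probability simplex in the spirit of Held et al. (1974) and Duchi et al. (2008), running in $O(D \log D)$ time. The routine used in our experiments may differ in minor details, so treat this as an illustrative sketch.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {a : a_d >= 0, sum_d a_d = 1}."""
    D = v.shape[0]
    u = np.sort(v)[::-1]                                  # sort descending: O(D log D)
    cumsum = np.cumsum(u)
    # Largest (0-based) index rho with u_rho > (cumsum_rho - 1) / (rho + 1).
    rho = np.nonzero(u * np.arange(1, D + 1) > cumsum - 1.0)[0][-1]
    theta = (cumsum[rho] - 1.0) / (rho + 1.0)             # shift that enforces sum = 1
    return np.maximum(v - theta, 0.0)

# One Projected Gradient Descent step on a loss with gradient g and step size eta:
# a = project_simplex(a - eta * g)
```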
Taking projected gradient steps on the negative inner product loss works well and is guaranteed to converge, whether we use the simplex or the ball constraint. Convergence is guaranteed due to this intuitive fact: any part of the negative gradient not in the null space of the projection induces a change in the coefficients, and this change must make an acute angle with the negative gradient. This follows from the definition of orthogonal projection, and it is a sufficient condition for showing that each update decreases the value of the loss function. Projected gradient descent iterates always reduce the value of the negative inner product loss or leave it unchanged; the function is bounded below on the simplex and the ball, so this algorithm is guaranteed to converge.
Applying Projected Gradient Descent on the squared error did not work, which is related to the smoothness issue we discussed in section E.1, although the behavior was not as dramatic as with unconstrained gradient descent. We observed in practice that Projected Gradient Descent on the squared error loss easily falls into limit cycles of the dynamics. It was for this reason that we restricted our attention with Projected Gradient Descent to the negative inner product loss.
Appendix H: Multiplicative Weights
When we have simplex constraints on the coefficients, the Multiplicative Weights algorithm is an elegant way to perform the superposition search. It naturally enforces the simplex constraint by maintaining a set of auxiliary variables, the “weights,” which define the choice of coefficients at each iteration. (See equation C.4 for the dynamics of Multiplicative Weights.) We choose a fixed step size, the same value in all experiments in this article, and initialize all of the weights to one. A normalizing variable scales the update term so that each of its elements lies in the interval $[-1, 1]$.
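A minimal sketch of the Multiplicative Weights bookkeeping for one factor is given below. The exponential update and renormalization are the standard form of the algorithm; how the payoff vector is derived from the loss, and the particular function and variable names, are illustrative assumptions standing in for equation C.4.

```python
import numpy as np

def mw_step(weights, payoff, eta):
    """One Multiplicative Weights update for a single factor.

    weights : current positive weights, one per codevector (initialized to ones)
    payoff  : per-codevector payoff, assumed normalized to lie in [-1, 1]
              (e.g., a scaled negative gradient of the loss)
    eta     : fixed step size
    """
    weights = weights * np.exp(eta * payoff)   # multiplicative (exponential) update
    coeffs = weights / weights.sum()           # coefficients always lie on the simplex
    return weights, coeffs

# Example bookkeeping for a codebook with D columns:
# w = np.ones(D)
# w, a = mw_step(w, payoff, eta)
```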
Multiplicative Weights is an algorithm primarily associated with game theory and online optimization, although it has been independently discovered in a wide variety of fields (Arora, Hazan, & Kale, 2012). See the excellent review by Arora, Hazan, and Kale (2012) for a discussion of the fascinating historical and analytical details of this algorithm. Multiplicative Weights is often presented as a decision policy for discrete-time games. However, through a straightforward generalization of discrete actions into directions in a continuous vector space, one can apply Multiplicative Weights to problems of online convex optimization, which is discussed at length in Arora, Hazan, and Kale (2012) and Hazan (2016). We can think of solving our problem 5.1 as an online convex optimization problem in which we update each factor according to its own multiplicative weights update, one at a time. The loss is convex with respect to the coefficients of the factor being updated but changes at each iteration due to the updates for the other factors. It is in this sense that we are treating problem 5.1 as an online convex optimization problem.
H.1 Multiplicative Weights Is a Descent Method
Appendix I: Map-Seeking Circuits
Map-seeking circuits (MSCs) are neural networks designed to solve invariant pattern recognition problems. Their theory and applications have been gradually developed by Arathorn and colleagues over the past 18 years (see, e.g., Arathorn, 2001, 2002; Gedeon & Arathorn, 2007; Harker et al., 2007) but remain largely unknown outside of a small community of vision researchers. In their original conception, they solve a “correspondence maximization” or “transformation discovery” problem in which the network is given a visually transformed instance of some template object and has to recover the identity of the object as well as a set of transformations that explain its current appearance. The approach taken in MSC is to superimpose the possible transformations in the same spirit as we have outlined for solving the factorization problem. We cannot give the topic a full treatment here but simply note that the original formulation of MSC can be directly translated to our factorization problem, wherein each type of transformation (e.g., translation, rotation, scale) is one of the factors and the particular values of the transformation are vectors in the codebooks. The loss function is the negative inner product loss, and the coefficients are constrained to be nonnegative (both by convention in MSC). The dynamics of MSC are given in equation C.5, with all coefficients for each factor initialized to one. The small threshold below which coefficients are zeroed is a hyperparameter, which, along with the step size, we set to a small fixed value in our experiments. Gedeon and Arathorn (2007) and Harker et al. (2007) proved (with some minor technicalities we will not detail here) that MSCs always converge to either a scalar multiple of a canonical basis vector or the zero vector. That is, the converged coefficient vector is either $\mathbf{0}$ or $\beta\,\mathbf{e}_j$, where $[\mathbf{e}_j]_i = 1$ if $i = j$ and 0 otherwise, and $\beta$ is a positive scalar.
Due to the normalizing term, the updates to the coefficients can never be positive. Among the components of the gradient that are negative, the one with the largest magnitude corresponds to a coefficient that receives an update of 0. All other coefficients are decreased by an amount proportional to the gradient. We noted in the comments on equation E.1 that the smallest element of the gradient corresponds to the codevector that best matches the current “suggestion” for this factor, which is based on the current states of the other factors. The dynamics of MSC thus preserve the weight of the codevector that matches best and decrease the weights of the other codevectors by an amount proportional to their own match. Once the weight on a codevector drops below the threshold, it is set to zero and no longer participates in the search. The phenomenon wherein the correct coefficient drops out of the search is called “sustained collusion” by Arathorn (2002) and is a failure mode of MSC.
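The following is a schematic rendering of the coefficient update just described: the best-matching codevector receives an update of zero, every other coefficient is decreased in proportion to its relative match, and coefficients falling below the threshold are removed from the search. The shift-by-the-minimum normalization here is our illustrative reading of equation C.5, not a verbatim transcription.

```python
import numpy as np

def msc_step(a, grad, eta, thresh):
    """One map-seeking-circuit-style update of the coefficient vector a.

    grad   : gradient of the negative inner product loss with respect to a;
             its most negative element marks the best-matching codevector
    eta    : step size
    thresh : small threshold below which coefficients are zeroed
    """
    delta = grad - grad.min()        # >= 0 elementwise, exactly 0 at the best match
    a = a - eta * delta              # so updates are never positive
    a[a < thresh] = 0.0              # these codevectors drop out of the search
    return a
```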
Appendix J: Percolated Noise in Outer Product Resonator Networks
A resonator network with outer product weights that is initialized to the correct factorization is not guaranteed to remain there, just as a Hopfield network with outer product weights initialized to one of the “memories” is not guaranteed to remain there. This is in contrast to a resonator network (and a Hopfield network) with ordinary least squares weights, for which each of the codevectors is always a fixed point. In this section, when we refer simply to a resonator network or a Hopfield network, we are referring to the variants of these models that use outer product weights.
The bit flip probability for the $f$th factor of a resonator network is defined in equation 6.2. Section J.1 derives the bit flip probability for the first factor, which is equal to the bit flip probability for a Hopfield network, introduced in equation 6.1 in the main text. Section J.2 derives the probability for the second factor, and section J.3 then collects all of the ingredients to express the probability for a general factor.
J.1 First Factor
We care about the ratio of the number of codevectors to the vector dimension $N$ and how the bit flip probability scales with this number. We have called this probability the Hopfield bit flip probability, but it is also the bit flip probability for the first factor of a resonator network. We will see that for the second, third, fourth, and subsequent factors, the bit flip probability will not equal this value, which is what we mean by percolated noise, the focus of section 6.1. If we eliminate all “self-connection” terms by setting each element on the diagonal of the synaptic matrix to zero, then the second term in equation J.1 is eliminated and the bit flip probability takes a simpler form. This is actually significantly different from equation J.2, as we can see in Figure 13. With self-connections, the bit flip probability is maximized at an intermediate value of this ratio (readers can verify this via simple algebra), and its maximum value is approximately 0.023. Without self-connections, the bit flip probability asymptotes at 0.5. The actual useful operating regime of both of these networks is where the ratio is significantly smaller, which we zoom in on in Figure 13b. A mean-field analysis of Hopfield networks developed by Amit et al. showed that when the ratio of stored patterns to neurons exceeds approximately 0.138, a phase-change phenomenon occurs in which a small number of initial bit flips (when the probability is 0.0036 according to the above approximation) build up over subsequent iterations and the network almost always moves far away from the stored memory, making it essentially useless. We can see that the same bit flip probability is suffered at a significantly higher value of the ratio when we have self-connections; the stored vector is significantly more stable in this sense. We also found that a resonator network has higher operational capacity (see section 6.2) when we leave in the self-connections. As a third point of interest, computing the update is often much faster when we keep each codebook matrix separate (instead of forming the synaptic matrix directly), in which case removing the self-connection terms requires extra computation in each iteration of the algorithm. For all of these reasons, we choose to keep the self-connection terms in the resonator network.
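As a sanity check on the numbers quoted above, the following snippet evaluates the textbook Gaussian (signal-to-noise) approximation of the bit flip probability without self-connections, $P(\text{flip}) \approx Q(1/\sqrt{\text{ratio}})$, where the ratio plays the role of stored patterns per neuron (codevectors per dimension here). This closed form is our assumption of what “the above approximation” refers to, used only to confirm that it reproduces the 0.0036 figure near a ratio of 0.138 and approaches 0.5 as the ratio grows.

```python
import math

def q_tail(x):
    """Gaussian tail probability Q(x) = P(Z > x) for a standard normal Z."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def flip_prob_no_self_connections(ratio):
    """Signal-to-noise approximation: the crosstalk is roughly Gaussian with
    variance equal to the ratio of stored patterns to dimension."""
    return q_tail(1.0 / math.sqrt(ratio))

print(flip_prob_no_self_connections(0.138))   # ~0.0036, matching the value quoted above
print(flip_prob_no_self_connections(100.0))   # approaches 0.5 as the ratio grows
```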
J.2 Second Factor
J.3 All Other Factors
What we have derived here in appendix J are equations 6.1 to 6.9. The result agrees very well with data generated in experiments in which one actually counts the bit flips in a randomly instantiated resonator network. In Figure 14 we show the sampling distribution of the measured bit flip rate from these experiments compared to the analytical expression. Dots indicate the mean value, and the shaded region indicates one standard deviation about the mean (the standard error of this sampling distribution). We generated this plot with 250 i.i.d. random trials for each point. Solid lines are the analytical values, which one can see are in very close agreement with the sampling distribution.
Figure 14: Agreement between simulation and theory for the bit flip probability. Shades indicate factors 1 to 5 (light to dark).
Notes
This is through the composition of an affine function with a convex function.
Acknowledgments
We thank members of the Redwood Center for Theoretical Neuroscience for helpful discussions, in particular Pentti Kanerva, whose work on Vector Symbolic Architectures originally motivated this project. This work was generously supported by the National Science Foundation under graduate research fellowship DGE1752814 and research grant IIS1718991, the National Institutes of Health under grant 1R01EB026955-01, the Semiconductor Research Corporation under E2CDA-NRI, DARPA's Virtual Intelligence Processing program, and AFOSR FA9550-19-1-0241.