## Abstract

The connection density of nearby neurons in the cortex has been observed to be around 0.1, whereas the longer-range connections are present with much sparser density (Kalisman, Silberberg, & Markram, 2005). We propose a memory association model that qualitatively explains these empirical observations. The model we consider is a multiassociative, sparse, Willshaw-like model consisting of binary threshold neurons and binary synapses. It uses recurrent synapses for iterative retrieval of stored memories. We quantify the usefulness of recurrent synapses by simulating the model for small network sizes and by doing a precise mathematical analysis for large network sizes. Given the network parameters, we can determine the precise values of recurrent and afferent synapse densities that optimize the storage capacity of the network. If the network size is like that of a cortical column, then the predicted optimal recurrent density lies in a range that is compatible with biological measurements. Furthermore, we show that our model is able to surpass the standard Willshaw model in the multiassociative case if the information capacity is normalized per strong synapse or per bits required to store the model, as considered in Knoblauch, Palm, and Sommer (2010).

## 1 Introduction

Neural associative memories have been the subject of continuous research over the past 50 years. In technical terms, they are computing architectures that unify computation and data storage (unlike the standard von Neumann computer architecture; Knoblauch, Palm, & Sommer, 2010). These systems store associations between pattern pairs of binary vectors. Typically, learning in such systems is fast and one-shot, and it uses simple Hebbian-inspired learning rules (Knoblauch, 2010). Associative memories are closely related to Hebbian cell assemblies (Hebb, 1949) and are used in neuroscience as models for various brain structures such as the neocortex (Braitenberg, 1978; Palm, 1990, 2012; Lansner, 2009; Fransén & Lansner, 1998) and the hippocampus (Marr, Willshaw, & McNaughton, 1991; Rolls, 1996).

The two most widely studied models of associative memories are the Willshaw (Willshaw, Buneman, & Longuet-Higgins, 1969; Steinbuch, 1961) and Hopfield networks (Hopfield, 1982). Other proposed associative memory models are the Amit-Fusi model (Amit & Fusi, 1994) and a mixture of the Amit-Fusi model and the Willshaw model in Einarsson, Lengler, and Steger (2014). The standard version of the Willshaw model consists of a bipartite graph with two layers and of size , which stores pattern pairs , where and are random subsets of size of and , respectively. Storage is done by adding all the edges between the pair and for to the graph. Retrieval is done by activating one subset and checking which vertices in layer have at least edges to the vertices in . Retrieval is called successful if besides the vertices in not many other vertices have this property. This simple model achieves maximal information capacity if about half of the possible edges are present in the graph. The information capacity is , but only when the pattern size is (Willshaw et al., 1969). There is a natural neural interpretation of this model: and correspond to two neural populations, the sets and correspond to Hebbian cell assemblies, and the edges correspond to strong synapses between neurons (Braitenberg, 1978).

Standard weaknesses of the Willshaw model are its high retrieval errors in the optimal regime, that capacity is optimal only for , and that it uses a complete underlying graph (Knoblauch et al., 2010). For the first problem, several extensions of the described model have been proposed. These include iterative and bidirectional retrieval schemes (Schwenker, Sommer, & Palm, 1996; Sommer & Palm, 1999) and improved threshold strategies (Buckingham & Willshaw, 1993; Graham & Willshaw, 1995). More recently, the second problem has been tackled by an alternative learning scheme for the Willshaw model, the so-called zip-net, which significantly improves capacity in the large regime (Knoblauch, 2016; Knoblauch & Sommer, 2016). The second, and especially the third, problems show discrepancies between the Willshaw model and biological networks, as the assemblies would be too small to be stable (Latham & Nirenberg, 2004) and the network much denser than observed in biology (Tomm, Avermann, Petersen, Gerstner, & Vogels, 2014). Proposed solutions to these problems improve capacity for the nonoptimal regime by considering a sparse network that obtains reasonable capacity per synapse (between 0.24 and 0.69 depending on the level of connectivity and threshold strategy) for larger than (Graham & Willshaw, 1997; Bosch & Kurfess, 1998).

It has been observed that the brain shows a particular kind of structural uniformity that locally resembles a sparse random graph of neurons interconnected by synapses (Mountcastle, 1997; Perin, Berger, & Markram, 2011; Tomm et al., 2014). Moreover, it is widely believed that information is stored in groups of neurons that spike together (Georgopoulos, Kettner, & Schwartz, 1988; Brown, Kass, & Mitra, 2004). Furthermore, there is evidence that synapses are bistable (Petersen, Malenka, Nicoll, & Hopfield, 1998; Montgomery & Madison, 2004; O’Connor, Wittenberg, & Wang, 2005), at least on longer time scales. Thus, a sparse network with binary synapses between and as in Graham and Willshaw (1997) and Bosch and Kurfess (1998) can be regarded as a simplified model for how information is associated between two different cortical columns in the brain.

Still, one crucial problem comes into play. It is well known that in the neocortex, local connections are present and in fact have a much higher density than connections between areas separated by a longer distance (Kalisman et al., 2005; Lefort, Tomm, Sarria, & Petersen, 2009; Binzegger, Douglas, & Martin, 2004). As such, the density inside the patterns should be higher than the density between the patterns. So far, the effect of such local connections has been neglected in the context of memory association studies.

In this work we consider a sparse version of the Willshaw model similar to the one found in Graham and Willshaw (1997) and Bosch and Kurfess (1998), and more important, we add recurrent synapses inside the assemblies in . We investigate the usefulness of these recurrent connections with regard to the model storage capacity.

### 1.1 Our Contribution

We show that in the classical Willshaw model, recurrent synapses decrease storage capacity, but that if we assume that patterns in are associated with the same one in (multiassociation with factor ), the storage capacity is optimized (under the specified learning rule) for a recurrent density that is significantly higher than zero. Figure 1 illustrates that the storage capacity is maximized for a certain ratio of recurrent and afferent density. If is at least 5, then in the optimal case, the recurrent density is larger than the afferent density. These results are robust and hold for almost any choice of pattern size and activation threshold. Moreover, they hold over different capacity measures.

There are well-known reports of a many-to-one map in many regions of the cortex. For example, in the case of multisensory integration, many different types of sensory input project to the same population of neurons (Stein, Stanford, & Rowland, 2009; Lemus, Hernández, Luna, Zainos, & Romo, 2010). In the medial temporal lobe, there are reports of neurons that respond selectively to visual stimuli showing a certain individual, landmark, or object (Quiroga, Reddy, Kreiman, Koch, & Fried, 2005). For example, a neuron was found that fires when any picture of the actress Jennifer Aniston, or her name written out in letters, is shown to the subject. This observation suggests that many different representations in the early visual pathway are mapped to one representation of a certain concept in the medial temporal lobe. If information in the brain is represented by Hebbian cell assemblies, then one of the simplest ways of implementing a many-to-one map is to project multiple cell assemblies to one, which is exactly the multiassociation task. Many-to-one maps also appear in state-of-the-art deep neural networks for image processing. These networks process information in a way that is invariant to translations of the visual input by using convolutional layers that are followed by max-pooling layers (Fukushima, 1980; Krizhevsky, Sutskever, & Hinton, 2012). If activations of single neurons in these models relate to activations of cell assemblies in real neural networks, then max pooling is analogous to the model of multiassociation presented here.

Theoretically, associating one more input pattern to an already existing output pattern is easier than associating a new input to a new output pattern. This, however, cannot be captured by feedforward networks like the Willshaw one because they are unable to store correlations between output patterns. Moreover, multiassociation in our model is comparable to allowing miss query noise with factor (Knoblauch et al., 2010), that is, if only a fraction of the patterns in layer is activated during retrieval (see section 2.4 for more detail).

In the multiassociation task with factor , the asymptotic Willshaw information capacity is reduced to . Our model surpasses the Willshaw model with respect to standard capacity measures (Knoblauch et al., 2010) that normalize the stored information by the number of strong synapses or the number of physical memory bits required.

The fact that recurrent synapses might be useful for storing memory association was already observed in Einarsson et al. (2014) for a palimpsest Willshaw model. In this letter, we consider the standard sparse Willshaw model and extend it to the multiassociation task. We provide a precise mathematical analysis of the model for large network sizes (see the appendix). This gives deep insight into the model parameter space and allows finding the optimal parameter values. The asymptotic analysis shows that the observations that we made for small network sizes also hold for large network sizes—most important, that storage capacity is optimized for some recurrent density strictly larger than 0 if is large enough.

## 2 Model

### 2.1 Model Description

Let and be two layers of neurons (vertices) each. We assume that a neuron in is connected to a neuron in by an afferent synapse (afferent edge) with some probability and that two neurons in are connected by a recurrent synapse (recurrent edge) with probability . Initially, all the synapses are weak (synaptic weight is 0), and memories can be stored in the network by turning synapses strong (synaptic weight is 1).

The goal is to store multiassociation patterns in the network, meaning that for each , patterns in are associated with one pattern in . More precisely, a multiassociation pattern consists of patterns and one pattern , where the and are subsets of and of size chosen uniformly at random. The model should perform the following task. Whenever one of the patterns is activated (while is inactive), then almost all neurons in should be activated, whereas almost all neurons in should stay inactive. Note that for the special case , this is the standard memory association task (Willshaw et al., 1969).

We consider the following binary Hebbian storage procedure. We make all afferent synapses (these are present with probability ) between and strong, and we also make all the recurrent synapses in (which are present with probability ) strong. The rule can be summarized as “fire together, wire together” (Hebb, 1949).
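The storage rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `n` (layer size), `k` (pattern size), `r` (multiassociation factor), `p_aff`, and `p_rec` are labels chosen here, and recurrent synapses are assumed to be directed and free of self-connections.

```python
import random


def sample_synapses(n, p_aff, p_rec, rng):
    """Decide by independent coin tosses which synapses exist; all start weak."""
    aff = {(a, b) for a in range(n) for b in range(n) if rng.random() < p_aff}
    # Assumption: recurrent synapses are directed and there are no self-connections.
    rec = {(u, v) for u in range(n) for v in range(n)
           if u != v and rng.random() < p_rec}
    return aff, rec


def sample_multiassociation(n, k, r, rng):
    """Draw r input patterns and one output pattern, each a uniform k-subset."""
    inputs = [frozenset(rng.sample(range(n), k)) for _ in range(r)]
    output = frozenset(rng.sample(range(n), k))
    return inputs, output


def store(inputs, output, aff, rec, strong_aff, strong_rec):
    """Turn every existing synapse between co-active neurons strong (weight 1)."""
    for pattern in inputs:
        strong_aff |= {(a, b) for a in pattern for b in output if (a, b) in aff}
    strong_rec |= {(u, v) for u in output for v in output if (u, v) in rec}


# Hypothetical parameter values, chosen only to keep the example small.
rng = random.Random(0)
aff, rec = sample_synapses(n=50, p_aff=0.3, p_rec=0.3, rng=rng)
strong_aff, strong_rec = set(), set()
inputs, output = sample_multiassociation(n=50, k=8, r=3, rng=rng)
store(inputs, output, aff, rec, strong_aff, strong_rec)
print(len(strong_aff), "strong afferent and", len(strong_rec), "strong recurrent synapses")
```

Because strong synapses are kept in sets, storing further patterns that reuse a synapse leaves it strong without counting it twice, which mirrors the binary nature of the synapses.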

The retrieval of the patterns is done in a bootstrap percolation fashion using McCulloch-Pitts neurons (McCulloch & Pitts, 1943). If we want to recall the association , we start by activating , and then activity proceeds in rounds. Let be the activation threshold for all the neurons. In every round, all the neurons that have at least active neighbors are activated (or, equivalently, those that have excitatory input at least ). Neurons that are active stay active for the whole retrieval process, which ends as soon as no new neurons are activated in a round.
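The retrieval procedure can be illustrated with a short Python sketch. It is a simplified reading of the description above, with assumptions made for illustration: strong synapses are given as sets of directed `(pre, post)` pairs, output-layer neurons are indexed `0..n-1`, and the function name `retrieve` is hypothetical.

```python
def retrieve(cue, strong_aff, strong_rec, n, threshold):
    """Iteratively activate output-layer neurons whose input reaches the threshold.

    cue: set of active input-layer neurons (they stay active throughout).
    strong_aff, strong_rec: sets of (pre, post) pairs with synaptic weight 1.
    Returns the set of active output-layer neurons once activity stops spreading.
    """
    # The cue never changes, so the afferent input of each neuron is fixed.
    drive = [0] * n
    for pre, post in strong_aff:
        if pre in cue:
            drive[post] += 1
    active = {b for b in range(n) if drive[b] >= threshold}
    while True:
        # Active neurons stay active, so recurrent input can only grow.
        total = list(drive)
        for pre, post in strong_rec:
            if pre in active:
                total[post] += 1
        newly = {b for b in range(n) if total[b] >= threshold} - active
        if not newly:
            return active
        active |= newly


# Toy example: neuron 12 gets only one afferent input and needs recurrent help.
strong_aff = {(0, 10), (1, 10), (0, 11), (1, 11), (0, 12)}
strong_rec = {(10, 12)}
print(sorted(retrieve({0, 1}, strong_aff, strong_rec, n=15, threshold=2)))
# prints [10, 11, 12]
```

In the toy example, neurons 10 and 11 are activated by afferent input alone in the first round, and neuron 12 follows in the second round once neuron 10 contributes recurrent input.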

### 2.2 Fidelity Measure

Assume that multiassociation patterns are stored in the network, and let and be positive real numbers. Intuitively, a model with fidelity parameters should satisfy the following for every stored association : if is activated, then at least neurons in are activated, whereas at most neurons in are activated. However, as the underlying network of the model is random, it cannot be avoided that some associations cannot be learned at all. Therefore, the expected number of activated neurons is used in the formal definitions. Formally, we say that the model achieves fidelity with parameters if once a random pattern is activated, then in expectation:

I. At least neurons in are activated.

II. At most neurons in are activated.

The two fidelity properties defined above are standard in the context of associative memories (Knoblauch, 2005). The expectation is taken over all random events in the model: the randomness of the underlying graph, the random choice of the patterns, and the random choice of which pattern is retrieved.

Given a fixed recurrent density , property I determines the optimal afferent density , and, given both densities, property II determines the maximal number of patterns that can be stored in the network. Indeed, assume for now that is fixed, and we want to retrieve pattern by activating . Neurons in receive more input the larger is. Thus, the larger is, the more neurons in turn active. Hence, property I demands that is at least some minimal value. Now suppose the number of patterns stored in the network increases. Then the number of strong synapses between the active pattern and the rest of layer increases. It follows that larger leads to higher activity in the whole graph, and therefore property II sets an upper bound on . Similarly, increasing the density of afferent synapses increases the final number of strong synapses. Therefore, it is best to choose exactly such that property I is fulfilled but not higher. Thus, a given determines the optimal value of . By the same argument, the optimal values of and are determined by property I if the fraction is fixed. Then the maximal number of patterns is again determined by property II.

### 2.3 Capacity Measures

We consider the following three capacity measures, which are similar to the standard measures found in the literature (Knoblauch et al., 2010):

the maximal number of associations that can be stored in the network, by

the maximal number of associations per synapse, and by

the maximal number of associations per strong synapse,

where in the last line, and stand for the density of strong afferent and recurrent synapses (see the appendix).

Maximizing for determines how many associations can be stored in a network. The number of synapses corresponds to the total number of bits required to store the model if for each synapse, 1 bit of memory is used to store whether the synapse is strong or weak and no further compression scheme is applied. Thus, is a normalized version of . Finally, the strong synapses are responsible for a significant proportion of metabolic cost in the brain (Attwell & Laughlin, 2001; Laughlin & Sejnowski, 2003; Lennie, 2003). As such, corresponds to the energy efficiency of the model.
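As a small worked illustration, the three measures can be computed directly from counts. The function below is a sketch under the assumption that the caller supplies the number of stored associations, the total number of synapses present, and the number of strong synapses; the function and key names are hypothetical.

```python
def capacity_measures(num_associations, num_synapses, num_strong_synapses):
    """Normalize the number of stored associations in the three ways described above."""
    return {
        "associations": num_associations,
        # One bit per synapse suffices to record whether it is strong or weak.
        "per_synapse": num_associations / num_synapses,
        # Strong synapses are the ones associated with metabolic cost.
        "per_strong_synapse": num_associations / num_strong_synapses,
    }


# Hypothetical counts, chosen only for illustration.
print(capacity_measures(num_associations=10, num_synapses=100, num_strong_synapses=20))
```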

In section 3.5, we compare our model to the Willshaw model using capacity measures that are motivated by information theory. Following Knoblauch et al. (2010), we study the following:

, the information stored per synapse

, the information stored per bit of memory

, the information stored per strong synapse

Formally, is the total mutual information or transinformation (Shannon & Weaver, 1949; Cover & Thomas, 1991) that is stored between layers and . It is the information that is gained about the patterns during retrieval. For every pattern , the gained information is which of the neurons forms the pattern . In a model without retrieval errors, this corresponds to bits of information. Therefore, . Note, however, that a precise computation of needs to take into account the false activations in layer when retrieving a pattern (see Knoblauch et al., 2010, or equation 4.2). Observe that if and are fixed, then maximizing is equivalent to maximizing .
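The error-free case can be made concrete with a short calculation. The sketch below assumes that each output pattern is a uniformly random subset of `k` out of `n` neurons, so that identifying it takes the base-2 logarithm of the binomial coefficient; the names `n`, `k`, and `num_patterns` are labels chosen here, and false activations are ignored.

```python
import math


def pattern_information_bits(n, k):
    """Bits needed to specify which k of n neurons form a pattern: log2 C(n, k)."""
    log_binom = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_binom / math.log(2)


def error_free_information_bits(num_patterns, n, k):
    """Total information when every distinct output pattern is retrieved perfectly."""
    return num_patterns * pattern_information_bits(n, k)


# Hypothetical sizes, for illustration only.
print(pattern_information_bits(n=1000, k=30))
print(error_free_information_bits(num_patterns=500, n=1000, k=30))
```

Using `math.lgamma` avoids forming very large binomial coefficients, which keeps the computation practical for large layers.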

### 2.4 The Multiassociation Task

Multiassociation with factor causes the Willshaw information capacity to drop from to . This drop can be explained intuitively. The information capacity is defined as the amount of information that is gained about the patterns in during retrieval. In the single association task, each retrieved association adds bits of information, whereas in the multiassociation task, the same output pattern is retrieved times, and therefore only bits of information are added per multiassociation pattern. The fact that storing single associations results in times more information than storing one multiassociation suggests that the latter should be an easier task. However, this is not true for feedforward networks (which, by definition, have no recurrent connections), since they cannot store any correlation between the output patterns. A feedforward network requires the same number of afferent synapses to store one multiassociation pattern as to store single associations. This explains why the information capacity of feedforward networks is times smaller for the multiassociation task than for the single association task. Correspondingly, the capacity values that the Willshaw model achieves asymptotically (, and ; see Knoblauch et al., 2010) are reduced to , and for the multiassociation task. Rigorous information theoretical arguments show for any model that , , and hold, and thus the classical Willshaw model is to some extent optimal. Our model achieves values higher than and values higher than (see section 3.5).

We remark that multiassociation is equivalent to allowing query noise . For a multiassociation pattern , the union of the patterns has expected size if is much smaller than . Requiring that activating some should activate in expectation neurons in is the same as requiring that activating some fraction of should activate in expectation neurons in . When regarding as one pattern, that is, associating patterns of size in with patterns of size in , then the latter requirement is known as allowing miss query noise with factor . Query noise was analyzed for the classical Willshaw model in Knoblauch et al. (2010) and causes the Willshaw capacity to drop from to as well.
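The size of the union can be checked with a short calculation. The sketch below assumes `r` independent patterns, each a uniformly random subset of `k` out of `n` neurons (labels chosen here). A fixed neuron then lies in a given pattern with probability `k / n`, so it is missed by all patterns with probability `(1 - k / n) ** r`.

```python
def expected_union_size(n, k, r):
    """Expected number of distinct neurons in the union of r uniform k-subsets."""
    return n * (1.0 - (1.0 - k / n) ** r)


# Hypothetical sizes: when r * k is much smaller than n, the union has size close to r * k.
for n, k, r in [(1000, 10, 5), (100000, 10, 5), (100, 10, 5)]:
    print(n, k, r, round(expected_union_size(n, k, r), 2), "vs r * k =", r * k)
```

The gap between the exact expectation and `r * k` comes from overlaps between patterns, which become negligible as the layer grows relative to `r * k`.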

## 3 Results

### 3.1 Optimal Densities

We determine the best recurrent and afferent densities when optimizing for each of the measures introduced in section 2.3. In general, we fix layer size , pattern size , and activation threshold and simulate the model to determine the optimal afferent and recurrent densities and , respectively. If the multiassociation factor is small, the best capacity is achieved when no recurrent synapses are present. For medium-sized (between 3 and 5), the optimal recurrent density increases up to some value and then stays essentially constant even if is increased further (see Figure 2). Furthermore, this optimal recurrent density is larger than the optimal afferent density when is not too small (asymptotically ) and is close to optimal (asymptotically ).

Figure 1 illustrates the effect of the relative density of afferent and recurrent synapses, , on the capacities and . If is large enough, the capacities increase until reaching a peak for some between 0.5 and 1, and then decrease as continues to grow. Both the increase and the decrease become steeper the larger we choose .

Figure 1d shows that the afferent density needed to fulfill fidelity property I decreases as increases. Therefore, recurrent synapses can replace afferent synapses effectively. However, note that the afferent density cannot go below a certain value, as afferent synapses need to invoke some activity in layer . If is small, replacing afferent synapses by recurrent ones hampers the capacity of the model. This behavior changes sharply if is large. As afferent synapses are then more likely to become strong than recurrent ones, a higher recurrent density results in a smaller number of strong synapses. This shifts the optimal regime from one where afferent synapses activate the whole pattern to one where the afferent synapses activate only a fraction of the pattern in the second layer.

### 3.2 Robustness of the Results

The observation that the optimal recurrent density is strictly positive is true for a very large range of pattern size values and activation threshold values as shown in Figures 3a and 3c. Figures 3b and 3d show that the model with recurrent synapses achieves higher information capacity values than the model without recurrent synapses (sparse Willshaw model).

Note that the optimal recurrent density is higher than the optimal afferent density for the optimal choice of (see Figures 3c and 3d). For very large , recurrent synapses become less useful for the following reason. The afferent input of a neuron in pattern is distributed. In the case with recurrent synapses, afferent input needs to activate only a small part of pattern , and therefore its expected value, , can be below . In the case without recurrent synapses, most of pattern has to be activated by afferent input, and therefore has to be higher than . The larger is, the more concentrated the binomial distribution is, and thus in both cases, is very close to . It follows that only very few afferent synapses can be replaced by increasing the recurrent density.

### 3.3 Features of the Model

The afferent input needed to activate a pattern is significantly reduced when the recurrent density is increased. The magnitude of this reduction is determined by the binomial distribution , which represents the afferent input that a neuron in receives. In the absence of recurrent connections, almost all the probability mass of the distribution has to lie above , while in the presence of recurrent connections, only a fraction of the probability mass has to lie above . Thus, it is the standard deviation of the distribution that crucially determines how much can be decreased. This decrease of is shown in Figure 1d.
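The role of the binomial distribution can be illustrated numerically. The sketch below assumes that a neuron in the target pattern receives input from `k` active cue neurons, each through an afferent synapse that is present (and, after storage, strong) independently with probability `p_aff`, so that its afferent input is binomially distributed. The names and the parameter values are hypothetical.

```python
import math


def binomial_tail(num_inputs, p, threshold):
    """P[X >= threshold] for X ~ Bin(num_inputs, p)."""
    return sum(math.comb(num_inputs, j) * p ** j * (1 - p) ** (num_inputs - j)
               for j in range(threshold, num_inputs + 1))


# Fraction of the target pattern that afferent input alone activates, for a
# hypothetical pattern size k = 50 and activation threshold 10.
k, threshold = 50, 10
for p_aff in (0.15, 0.20, 0.25, 0.30):
    print(p_aff, round(binomial_tail(k, p_aff, threshold), 3))
```

Without recurrent synapses, `p_aff` must be large enough to push this tail probability close to 1. With recurrent synapses, a smaller tail probability can suffice because the neurons activated by afferent input recruit the rest of the pattern.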

The presence of recurrent connections causes an all-or-none effect both inside a pattern and in the whole layer . As increases from 0 to 1, the activity in after retrieval transitions from no activity to complete activity. This transition is very sharp for high recurrent densities and very gradual for low recurrent densities, as illustrated in Figure 4a. Recurrent synapses ensure in general that either nearly no neurons or nearly all neurons turn active. For the activity in layer , this means that activating only some neurons outside a pattern likely causes the complete layer to turn active. Figure 4b shows that the probability that at least neurons in turn active is only slightly smaller than the probability that at least 20 neurons outside the retrieved pattern turn active. Thus, by avoiding an activity of neurons, the model automatically fulfills fidelity property II, unless is set very small.

Another feature of our model is that the number of rounds required to activate some can be tuned by adjusting the density values, which is illustrated in Figure 4c. For the parameter values yielding the peak in Figure 1d, activation happens within seven rounds. Increasing the densities by decreases the number of required rounds to four. In contrast, decreasing the densities by activates the pattern only partially in most cases.

### 3.4 Noise Tolerance

In this section, we present simulations illustrating the tolerance of our model to add noise in the input (Knoblauch et al., 2010). Given parameter , we define the add noise of a pattern as an additional random vertices in layer , which turn active (together with ) during our recall procedure. The model then has the same task as defined before, that is, to activate the corresponding pattern in layer and avoid activating too many vertices outside . Note that the Willshaw model is known to be add noise tolerant (Palm & Sommer, 1996; Sommer & Palm, 1999; Knoblauch, 2004). Figure 5a shows that our model has a similar tolerance to add noise as the standard sparse Willshaw model. Moreover, we compare our retrieval scheme to the voltage reset retrieval scheme, where the membrane voltages of the neurons are reset to 0 in every round, meaning that the input from earlier rounds is forgotten. More precisely, the active set in round consists of the neurons in that have active neighbors in . Note that in this retrieval scheme, is not necessarily a superset of and that neurons do not have self-connections. The standard retrieval schemes used in the literature for recurrent networks (Schwenker et al., 1996; Sommer & Palm, 1999) usually reset the voltage. Note that for our retrieval scheme, keeping the voltage outperforms resetting the voltage. We discuss this decision to keep the voltage of neurons after every round in section 5.

### 3.5 Information Theoretical Capacity Measures

In this section, we analyze what our model can achieve in terms of the information capacities , and . Recurrent connections improve the compressed information capacity, , and the information capacity per strong synapse, , in the regime that is optimal for the Willshaw model (). Additionally, if is small and is large, then our model obtains values for and that are higher than the asymptotic Willshaw capacities and (see Figure 6). Figure 7 shows that these observations are true for a wide range of pattern size values. Note that larger values of seem to be the more interesting parameter regime for biological networks as they achieve much higher values. Our model also surpasses the and values of the Willshaw model in that regime. The reason that our model outperforms the Willshaw model for measures and , whereas it does not for the measure , can be explained intuitively. Many recurrent synapses have to be added in order to compensate for a smaller afferent density. Then, if the multiassociation factor, , is large, only a small fraction of the recurrent synapses are strong, and therefore the entropy of the strong recurrent synapses is far from the optimal case, where half the recurrent synapses are strong. The measures and allow the model to use the synapses in layer more efficiently since they are normalized by the number of strong synapses and by the number of required bits, respectively. In the brain, this can be interpreted in terms of the pruning of synapses and energy efficiency (as the strong synapses require more energy to be maintained). We discuss ideas that could lead to an improvement of in section 5.

### 3.6 Asymptotic Behavior

To further support the correctness of the asymptotic approximations made in the analysis, Figure 8 plots the asymptotic against the from a simulation with parameters , , , and . For comparison, we pick and equal to what is obtained in the simulations. The plot then shows that the asymptotic calculations are a very good predictor of how many patterns can be stored in the network in the limit. For further details on how the plot was built, see section 4.

Note that the model achieves constant information capacity since the information per multipattern is , and, by equation A.2, the critical densities and are of order . We compute the information capacity of the model more precisely by considering false activations in (for details see section 4) and plot it against for in Figure 9b. Note, in particular, that our model also achieves a significantly better information capacity than the sparse Willshaw model and the classical (complete) Willshaw model with .

## 4 Methods

We simulate the model using standard computer hardware. Two layers of neurons are created, and independent coin tosses decide which synapses are present (a recurrent synapse is present with probability , an afferent synapse with probability ). The patterns are always chosen uniformly at random.

In the following, we describe the method used in the simulations to ensure that properties I and II are satisfied. These are then used to obtain the presented data.

Given a fixed or a fixed , we search for the smallest possible that satisfies property I. For this purpose, we simulate for every (up to a precision of ) the spread of activity in pattern when is activated, with afferent connection probability and recurrent connection probability . This is repeated 10,000 times for randomly chosen connections, and we pick the smallest value for which, on average, at least () of the pattern was active (when is activated).
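The search for the smallest afferent density can be sketched as follows. This is a simplified illustration rather than the code used for the figures: only one cue pattern and its target pattern are simulated, recurrent synapses are assumed to be directed, the candidate densities are passed in as an explicit `grid`, and all names and parameter values are hypothetical. The number of trials is kept far below 10,000 so that the example runs quickly.

```python
import random


def mean_active_fraction(k, p_aff, p_rec, threshold, trials, rng):
    """Average fraction of a target pattern that turns active when its cue is on."""
    total_active = 0
    for _ in range(trials):
        # Afferent input of each target neuron: present synapses from the k cue neurons.
        drive = [sum(rng.random() < p_aff for _cue in range(k)) for _post in range(k)]
        # Recurrent synapses inside the target pattern (assumed directed, no self-loops).
        incoming = [[u for u in range(k) if u != v and rng.random() < p_rec]
                    for v in range(k)]
        active = {v for v in range(k) if drive[v] >= threshold}
        while True:
            newly = {v for v in range(k) if v not in active
                     and drive[v] + sum(u in active for u in incoming[v]) >= threshold}
            if not newly:
                break
            active |= newly
        total_active += len(active)
    return total_active / (trials * k)


def smallest_afferent_density(grid, k, p_rec, threshold, target, trials, rng):
    """Return the smallest density in grid whose mean active fraction reaches target."""
    for p_aff in sorted(grid):
        if mean_active_fraction(k, p_aff, p_rec, threshold, trials, rng) >= target:
            return p_aff
    return None


rng = random.Random(0)
grid = [i / 20 for i in range(1, 21)]  # 0.05, 0.10, ..., 1.00
for p_rec in (0.0, 0.3):
    p_aff = smallest_afferent_density(grid, k=15, p_rec=p_rec, threshold=5,
                                      target=0.9, trials=50, rng=rng)
    print(f"p_rec={p_rec}: smallest p_aff on the grid = {p_aff}")
```

The linear scan returns the first grid value that passes, so with few trials the result is subject to Monte Carlo noise; more trials or a finer grid trade running time for precision.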

With the values of and at hand, we search for the maximal , , or such that fidelity property II is satisfied. To this end, we insert one multiassociation pattern after the other and test in each step whether property II is violated for . We test this by activating a random pattern 100 times and then counting the number of active neurons in . If the average activity in is above , then property II is violated. The values of , , and are averaged over 30 repetitions of the entire simulation.
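The incremental test of property II can be sketched as below. The sketch makes several assumptions that the text does not fix: probes are drawn uniformly from all input patterns stored so far, a violation is declared when the average number of active neurons outside the target pattern exceeds `max_outside`, recurrent synapses are directed, and a hypothetical `cap` bounds the number of inserted patterns. All names are chosen for illustration.

```python
import random


def retrieve(cue, strong_aff_in, strong_rec_in, threshold):
    """Iterative retrieval; active neurons stay active until nothing new fires."""
    n = len(strong_aff_in)
    drive = [len(strong_aff_in[b] & cue) for b in range(n)]
    active = {b for b in range(n) if drive[b] >= threshold}
    while True:
        newly = {b for b in range(n) if b not in active
                 and drive[b] + len(strong_rec_in[b] & active) >= threshold}
        if not newly:
            return active
        active |= newly


def max_patterns(n, k, r, p_aff, p_rec, threshold, max_outside, probes, rng, cap=200):
    """Insert multiassociation patterns until the outside activity gets too high."""
    aff = [[rng.random() < p_aff for _ in range(n)] for _ in range(n)]  # aff[a][b]
    rec = [[u != v and rng.random() < p_rec for v in range(n)] for u in range(n)]
    strong_aff_in = [set() for _ in range(n)]  # strong afferent sources per neuron
    strong_rec_in = [set() for _ in range(n)]  # strong recurrent sources per neuron
    stored = []  # (input pattern, output pattern) pairs
    for num_stored in range(cap):
        out = frozenset(rng.sample(range(n), k))
        for _ in range(r):
            inp = frozenset(rng.sample(range(n), k))
            stored.append((inp, out))
            for b in out:
                strong_aff_in[b].update(a for a in inp if aff[a][b])
        for v in out:
            strong_rec_in[v].update(u for u in out if rec[u][v])
        outside = 0
        for _ in range(probes):
            inp, target = rng.choice(stored)
            outside += len(retrieve(inp, strong_aff_in, strong_rec_in, threshold) - target)
        if outside / probes > max_outside:
            return num_stored  # the pattern just inserted violated property II
    return cap


m_max = max_patterns(n=40, k=5, r=2, p_aff=0.5, p_rec=0.5, threshold=3,
                     max_outside=1.0, probes=10, rng=random.Random(0), cap=20)
print("multiassociation patterns stored before violation:", m_max)
```

A single run is noisy, which is consistent with averaging the resulting values over repeated simulations.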

#### Figures 1, 2, and 3

In Figures 1a to 1c, the ratio is fixed, and we search for the smallest possible that satisfies property I. With and at hand, we search for the maximal , , and that satisfy property II, and these values are plotted over . The parameter values are given in the figure.

In Figures 1d, 2, and 3, for each value of (up to a precision of ), we search for the smallest that satisfies property I. With these values at hand (for each , the corresponding ), we search for the maximal , , or that satisfies property II. The optimal values of and are the ones for which the largest , , or are obtained (which of the three measures is taken is stated in the figure). Again, parameter values are given in the figure.

#### Figure 4

In Figure 4a, we plot the average activity in pattern , which is obtained by simulating bootstrap percolation 1000 times between two patterns and for the given (-axis) and (legend) with and .

The data points of Figure 4b are obtained by simulations. As is fixed, we first search for the smallest such that property I is satisfied if . Then we store one multiassociation pattern after the other and test in each step for 100 random patterns whether (1) at least 20 neurons in turn active and (2) at least neurons in turn active. We repeat the whole procedure 30 times and plot for the proportion of successful trials ( trials in total). Parameter values are given in the figure. To compute the strong density, we count the number of strong synapses and divide by the total number of synapses.

In order to compute the mean and standard deviation for Figure 4c, we simulate bootstrap percolation 10,000 times between two patterns and for the given and with and .