Abstract

Neural associative memories (NAMs) are perceptron-like single-layer networks with fast synaptic learning, typically storing discrete associations between pairs of neural activity patterns. Gripon and Berrou (2011) investigated NAMs employing block coding, a particular sparse coding method, and reported a significant increase in storage capacity. Here we verify and extend their results for both heteroassociative and recurrent autoassociative networks. For this we provide a new analysis of iterative retrieval in finite autoassociative and heteroassociative networks that allows estimating storage capacity for random and block patterns. Furthermore, we have implemented various retrieval algorithms for block coding and compared them in simulations to our theoretical results and previous simulation data. In good agreement between theory and experiments, we find that finite networks employing block coding can store significantly more memory patterns. However, due to the reduced information per block pattern, it is not possible to significantly increase the stored information per synapse. Asymptotically, the information retrieval capacity converges to the known limits $C=\ln 2\approx 0.69$ and $C=(\ln 2)/4\approx 0.17$ also for block coding. We have also implemented very large recurrent networks of up to $n=2\cdot 10^6$ neurons, showing that the maximal capacity $C\approx 0.2$ bit per synapse occurs for finite networks of size $n\approx 10^5$, similar to cortical macrocolumns.

1  Introduction

Neural associative memory (NAM), a simple artificial neural network that works as an associative memory, can be understood as a module of memory and learning by synaptic plasticity (for a review, see Palm, 2013). The retrieval of information from such a memory is typically not achieved by looking up content under a fixed address, but rather by associating a meaningful output pattern with a meaningful input pattern. During the learning phase, associations are stored or learned by a form of local synaptic plasticity (the change of connection strength of one synapse depends only on the activity of the pre- and the postsynaptic neuron in the two patterns that are presented to be associated), typically a variant of Hebbian plasticity (Hebb, 1949). In classical associative memory, one distinguishes between heteroassociation and autoassociation, depending on whether the output patterns are identical to the input patterns. Heteroassociation is more similar to the technical address $\to$ content scheme, but it is different because the input patterns are also considered to be meaningful, and therefore their similarity (in terms of some vector distance, such as the Hamming distance or overlap for binary patterns) should reflect similarity of content and should be roughly preserved by the association mapping (similar inputs to similar outputs). Autoassociation is typically used for pattern completion or pattern correction.

NAMs originated in the 1960s; Steinbuch's (1961) “Lernmatrix” and Gabor's (1969) “holographic memory” were probably the first concrete examples. They have been analyzed by mathematicians, engineers, physicists, and computer scientists, mostly in terms of their capacity to store a large number of memory patterns. In spite of the apparent simplicity of this measure, which roughly counts the number of patterns that can be stored in and (more or less completely) retrieved from the memory, and which is usually called memory capacity, or just capacity, by various authors, there are several subtle differences in their definitions that can result in large differences in the achievable values. So we are faced with a zoo of different capacity definitions, which we try to sort out in this letter.

The first analysis of NAM was provided by Willshaw, Buneman, and Longuet-Higgins (1969), showing a “memory capacity” of $\ln 2\approx 0.69$ bit per synapse in the limit of large networks. Palm (1980) used an information-theoretic criterion to optimize the parameters of a NAM for finite network size leading to an “information capacity.” He found that optimal capacity values are obtained for sparse patterns (binary patterns with only very few 1s). He also distinguished heteroassociation and autoassociation with an asymptotic information capacity of $\ln 2$ and $\frac{1}{2}\ln 2$, respectively. The autoassociative NAM network was also considered as a dynamical system in theoretical physics of so-called spin glasses, notably by Hopfield (1982), who realized that overloading the network with memory patterns leads to a catastrophic breakdown of fixed-point retrieval (i.e., retrieving the stored patterns as fixed points of the autoassociative NAM dynamics). He defined capacity as the critical number of retrievable memories (normalized by network size), here called the critical pattern capacity, and found a value of about 0.14 for suboptimal nonsparse patterns. Later, the importance of sparseness was also recognized in the (spin-glass) physics community (Gardner, 1987, 1988; Amari, 1989); in particular Tsodyks and Feigel'man (1988) found a critical pattern capacity of $\frac{1}{2}\ln 2$ for sparse patterns. For this result, the critical number of memories was further normalized by the information content $I(p)$ of one output bit, which is smaller than one for sparse patterns. This correction had previously been introduced by Willshaw et al. (1969) and Gardner (1987, 1988).

These results are not so relevant for practical applications, because in fixed-point retrieval (using already perfect input patterns) close to the critical pattern capacity limit, the memory can practically be used only as recognition memory (Palm & Sommer, 1992), not for retrieving a stored pattern from a similar but not identical input pattern (reconstruction memory). Thus, the information that can actually be retrieved from NAM at critical pattern capacity is much smaller, even close to zero in many cases. Correspondingly, the information capacity is smaller for autoassociative fixed-point retrieval, namely, $\le\frac{1}{2}\ln 2$ (Palm, 1980; Palm & Sommer, 1992). In practical applications, one should consider sparse coding methods that create memory patterns in the near-optimal parameter range (Palm, 1987b; Bentz, Hagstroem, & Palm, 1989; Palm, Schwenker, & Sommer, 1994; Knoblauch, Palm, & Sommer, 2010), and in the case of autoassociation, one should also consider effective iterative retrieval methods (Schwenker, Sommer, & Palm, 1996) that allow retrieving the stored patterns from arbitrary parts, usually variations of dynamical fixed-point retrieval.

More recently, Gripon and Berrou (2011) rediscovered the inefficiency of the Hopfield model and the advantages of sparse coding. In fact, they introduced a particular sparse coding method that was claimed to improve on previous results (see also Gripon & Rabbat, 2013; Aliabadi, Berrou, Gripon, & Jiang, 2014; Aboudib, Gripon, & Jiang, 2014; Ferro, Gripon, & Jiang, 2016): block coding (the multiple 1-out-of-n code; see Palm, 1987c), also known as the Potts model in spin-glass physics (e.g., Wu, 1982; Kanter, 1988) and analyzed by Kryzhanovsky, Litinskii, and Mikaelian (2004), Kryzhanovsky, Kryzhanovsky, and Fonarev (2008), and Kryzhanovsky and Kryzhanovsky (2008), more recently also called neural cliques or the Gripon-Berrou neural network (GBNN). They also invented a truly new iterative retrieval method for block coding yielding comparatively high information capacity, which for finite $n$ even exceeds the asymptotic theoretical value, similar to other iterative retrieval methods introduced much earlier by Schwenker et al. (1996). Unfortunately, their methods are described in an unusual terminology and also refer to the critical pattern capacity (calling it “diversity”), so we have tried to translate their work into the usual NAM terminology and use information-theoretic capacity measures for a direct quantitative comparison. To this end, we had to extend previous information capacity definitions a bit. Previously, information capacity was defined as the maximum information contained in the storage matrix (network connectivity matrix) about the set of patterns to be stored. Usually in information theory, one tries to find the maximum over all possible input pattern distributions. In practice, one often restricts the class of distributions (e.g., to independently generated patterns) and also tries to use suitable retrieval methods to estimate the amount of information that can maximally be extracted.
If we want to compare different coding and retrieval methods, we have to include an explicit restriction of both the memory pattern distribution and the retrieval method into the capacity definition. Following this strategy, we found that in terms of information capacity, the only new and really interesting improvement found by Berrou and colleagues is the iterative retrieval method mentioned before: the so-called sum-of-max retrieval (Gripon & Berrou, 2012; Yao, Gripon, & Rabbat, 2014). Otherwise their results are well in the ballpark of other similar methods. In particular, their results on block coding do not affect asymptotic information capacities.

This letter is organized as follows. In section 2, we introduce the basic concepts and our mathematical terminology and distinguish the different capacity concepts in more detail. Then, in section 3, we describe the retrieval strategies for autoassociation and bidirectional heteroassociation, including retrieval strategies for block coding. In section 4, we analyze these methods in terms of information capacity, first for fixed network size $n$, then asymptotically for $n→∞$. In section 5, we present numerical experiments with randomly generated patterns used as a standard benchmark to compare various methods. In section 6, we discuss our asymptotic and numerical results and conclude the letter.

2  Basic Concepts and Research Questions

The learning task of NAM is to store associations between pairs of memory patterns $u^\mu$ and $v^\mu$ for $\mu=1,2,\ldots,M$ that may be interpreted as neural activation vectors or patterns of synaptically linked neuron populations $u$ and $v$. In the case of heteroassociation, $u$ and $v$ are two different populations, whereas $u$ and $v$ are identical for autoassociation. Most models employ a local immediate learning rule $R$ to determine the synaptic weight $w_{ij}$ from neuron $u_i$ to $v_j$,
$w_{ij}^{\mu}=R(w_{ij}^{\mu-1},u_i^{\mu},v_j^{\mu}),$
(2.1)
after the $\mu$th learning step, where synapses are initially silent ($w_{ij}^{0}=0$). For example, for the NAM with clipped Hebbian learning suggested by Steinbuch (1961) and Willshaw et al. (1969) for binary patterns, it is $R(w,u,v)=\max(w,u\cdot v)$ such that after storing $M$ memory associations, the binary synaptic weights are
$w_{ij}=\max_{\mu=1}^{M}u_i^{\mu}v_j^{\mu}=\begin{cases}1,&\sum_{\mu=1}^{M}u_i^{\mu}v_j^{\mu}>0\\0,&\text{otherwise}\end{cases}\in\{0,1\}.$
(2.2)
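As a concrete illustration, the clipped Hebbian rule, equation 2.2, can be sketched in a few lines of Python. This is a minimal sketch of our own (not the implementation used in the experiments), assuming patterns are represented as sets of their active unit indices:

```python
# Minimal sketch of Willshaw storage (equation 2.2); patterns are given as
# sets of active unit indices, an assumption of this example.

def store(pairs, n_u, n_v):
    """Clipped Hebbian learning: w_ij = 1 iff some stored pair (u, v)
    activates both input unit i and output unit j."""
    W = [[0] * n_v for _ in range(n_u)]
    for u, v in pairs:
        for i in u:
            for j in v:
                W[i][j] = 1  # max(w, u_i * v_j): once set, a synapse stays at 1
    return W

# store two toy associations in a 6 x 6 network
W = store([({0, 1}, {2, 3}), ({1, 4}, {3, 5})], 6, 6)
print(W[0][2], W[0][5])  # 1 0
```

Each stored pair simply sets the biclique of synapses between its active input and output units, which is why the weight matrix stays binary no matter how many associations are stored.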
For heteroassociation we can interpret the two-layer neural network as a bipartite graph such that each pattern association $u^\mu\mapsto v^\mu$ corresponds to a biclique comprising the active units in the patterns $u^\mu$ and $v^\mu$ as nodes; that is, every active node in $u^\mu$ is connected to every active node in $v^\mu$. For autoassociation, each pattern $u^\mu$ corresponds to a clique of neurons in a recurrently connected population $u$. In the following we concentrate on analyzing this so-called Willshaw model for binary patterns $u_i^\mu,v_j^\mu\in\{0,1\}$, which seems most efficient if implemented in digital hardware. Other prominent models employ additive or linear learning rules $R(w,u,v)=w+r(u,v)$—for example, $r(u,v)=uv$ for integer-valued synapses corresponding to the Hebb rule for binary patterns $u_i,v_i\in\{0,1\}$ and to the Hopfield rule for $u_i,v_i\in\{-1,1\}$. The optimal additive rule (to maximize storage capacity) is the covariance rule, for example, $r(0,0)=p^2$, $r(0,1)=r(1,0)=-p(1-p)$, $r(1,1)=(1-p)^2$ (Sejnowski, 1977a, 1977b; Dayan & Willshaw, 1991; Palm & Sommer, 1996), that depends on the fraction $p:=\Pr[u_i>0]$ of active units per pattern vector and results in real-valued synapses. In the sparse limit, that is, for $p\to 0$, this rule turns into the Hebb rule or even the Willshaw model, and for $p=\frac{1}{2}$, it turns into the Hopfield rule (Palm, 1991; Knoblauch, 2011). Similar to the optimal Bayesian learning rule (Lansner & Ekeberg, 1989; Knoblauch, 2010a, 2011), the resulting synapses have real-valued weights leading to relatively expensive implementations in digital hardware unless weights are discretized properly (Knoblauch, 2010b, 2016).
There are several methods for retrieving the stored patterns from the memory. For heteroassociation, the most natural use is to retrieve the output $v^\mu$ from input $u^\mu$ (pattern mapping). Typically, the starting point is a noisy version $\tilde u^\mu$ of the original input, and the retrieval output $\hat v^\mu$ may not always be identical to the original output patterns. In an iterative fashion, one may also reconstruct $\hat u^\mu$ in addition to $\hat v^\mu$ (bidirectional retrieval; Kosko, 1988; Sommer & Palm, 1999). For autoassociation the most natural retrieval method is to consider the feedback network with connectivity matrix $W$ and the dynamical neural network system (Gibson & Robinson, 1992; Schwenker et al., 1996),
$\hat u(t+1)=H(\hat u(t)W-\Theta(t)),$
(2.3)
and to retrieve the patterns $u^\mu$ as fixed points of equation 2.3 starting with $\hat u(0):=\tilde u$, where $H$ is the Heaviside function with $H(x)=1$ for $x\ge 0$ and $H(x)=0$ for $x<0$. The retrieval methods may differ in the choice of thresholds $\Theta(t)$ (threshold regulation; see section 3 and Palm, 1982) and perhaps in special criteria for stopping the iteration. There are two basically different approaches concerning the starting patterns: either one wants to verify that a pattern $u$ is indeed a fixed point, in which case $u^\mu$, or a pattern very close to $u^\mu$, is used as starting point (fixed-point retrieval or recognition of stored patterns), or one wants to find the next correct pattern $u^\mu$ from a substantially different starting pattern (pattern correction or pattern completion).
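To make the dynamics concrete, the following Python sketch (our own illustration, with a fixed threshold $\Theta$ and the convention that the unit fires when its potential reaches the threshold) iterates equation 2.3 until a fixed point is reached:

```python
# Sketch of autoassociative fixed-point retrieval (equation 2.3).
# Patterns are 0/1 lists; the ">= theta" test realizes the thresholding.

def step(W, u, theta):
    n = len(u)
    x = [sum(W[i][j] * u[i] for i in range(n)) for j in range(n)]  # dendritic potentials
    return [1 if x[j] >= theta else 0 for j in range(n)]

def retrieve(W, u0, theta, t_max=10):
    u = u0
    for _ in range(t_max):
        u_next = step(W, u, theta)
        if u_next == u:      # fixed point of the dynamics
            break
        u = u_next
    return u

# one stored clique {0, 1, 2} in n = 5; a half pattern completes to the full clique
n, clique = 5, {0, 1, 2}
W = [[1 if i in clique and j in clique else 0 for j in range(n)] for i in range(n)]
print(retrieve(W, [1, 1, 0, 0, 0], theta=2))  # [1, 1, 1, 0, 0]
```

With the query containing two of the three stored ones and $\Theta=2$, one update activates the missing unit, and the completed clique is then a fixed point.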
Retrieval quality can be judged from the output noise $\hat\varepsilon$, defined as the expected $L_1$-norm or Hamming distance $\|\cdot\|_1$ between retrieval outputs and the original patterns. For example, for heteroassociative pattern mapping, we define
$\hat\varepsilon:=E(\varepsilon^\mu)\quad\text{for}\quad\varepsilon^\mu:=\frac{\|\hat v-v^\mu\|_1}{\|v^\mu\|_1}=\frac{\sum_j|\hat v_j-v_j^\mu|}{k}.$
(2.4)
A similar definition applies for autoassociation ($v=u$), and for bidirectional retrieval, we compute a (weighted) average of the noise in $u$ and $v$. It is important to note that zero output noise $\varepsilon=0$ is not viable, and even the limit $\varepsilon\to 0$ for large networks must be carefully defined to avoid a strongly diminished storage capacity (see Knoblauch et al., 2010, pp. 298, 307).
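The output-noise measure of equation 2.4 amounts to a normalized Hamming distance, as in this small sketch (our own illustration, for a single retrieved pattern):

```python
# Output noise (equation 2.4) for one retrieved pattern: the Hamming
# distance to the original, normalized by the original's k one-entries.

def output_noise(v_hat, v):
    k = sum(v)  # number of one-entries in the original pattern
    return sum(abs(a - b) for a, b in zip(v_hat, v)) / k

print(output_noise([1, 1, 0, 1], [1, 1, 0, 0]))  # 0.5: one extra 1, k = 2
```

Averaging this quantity over many stored patterns estimates $\hat\varepsilon = E(\varepsilon^\mu)$.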

We now give a general information-theoretic storage capacity definition that is independent from particular retrieval methods. Then we examine these retrieval methods in more detail to define the corresponding retrieval capacities.

2.1  General Information Storage Capacity

In order to evaluate and compare different versions of associative memories of a given size $n$ and to optimize their parameters, one needs a numerical performance criterion. The most natural approach to a capacity definition is based on information theory and the classical channel capacity (Shannon & Weaver, 1949; Cover & Thomas, 1991; Palm, 2012), which is the maximal transinformation $T(\mathcal{M},W)$ between the input and the output of a channel. For a fixed associative storage method, this “channel” is the mapping from the set of patterns to be stored—for example, $\mathcal{M}:=\{(u^\mu,v^\mu):\mu=1,\ldots,M\}$ for heteroassociation or $\mathcal{M}:=\{u^\mu:\mu=1,\ldots,M\}$ for autoassociation—to the weight matrix $W$ that is generated by it. In this letter, we focus on the Willshaw model, equation 2.2, and we define the information storage capacity,
$C:=\frac{\max T(\mathcal{M},W)}{\#\text{connections}},$
(2.5)
where the maximum has to be taken over all pattern distributions (i.e., all distributions on $\mathcal{M}$) in principle. In practice, it has turned out that one can assume independently generated input and output patterns $u^\mu$ and $v^\mu$ ($\mu=1,\ldots,M$) and that the maximum is attained for sparse patterns (Willshaw et al., 1969; Palm, 1980, 1987b). Later, it was found that the maximum requires sparse patterns with a fixed number $k$ of ones (Knoblauch et al., 2010). Willshaw et al. (1969) and Palm (1980) had already shown that $C=\ln 2$ for heteroassociation, and this value can be reached by simple pattern mapping. Because of the symmetry of the matrix $W$, the obvious corresponding result for autoassociation should be $C=(\ln 2)/2$. Arguments for this were first given by Palm (1980).1 A simpler argument can be given by reduction of autoassociation to heteroassociation.2

This general definition of information capacity can be further restricted if one considers particular methods of pattern retrieval, which may introduce additional parameters that then have to be optimized. Of course, the more restricted optimization will tend to result in smaller capacity values in general. Here we distinguish four subforms of information storage capacity that can be expressed in terms of the transinformation between the stored patterns and the retrieved patterns. All following definitions of information storage capacity may be extended in a hardware-specific way in order to account for the minimal physical resources necessary to realize the network—for example, in the main memory (RAM) of digital hardware or in a synaptic network of the brain (Knoblauch et al., 2010; Knoblauch, Körner, Körner, & Sommer, 2014; Knoblauch & Sommer, 2016).

2.2  Mapping Capacity

For heteroassociation and pattern mapping, we simply consider the transinformation
$T(V;\hat V):=I(V)-I(V|\hat V)$
(2.6)
between the sets of stored output patterns $V:=(v^1,\ldots,v^M)$ and retrieved output patterns $\hat V:=(\hat v^1,\ldots,\hat v^M)$ for heteroassociation, where $I(V)$ is the Shannon information of the pattern set $V$ and $I(V|\hat V)$ is the conditional information of $V$ given $\hat V$. Normalizing by the number of synaptic connections yields the mapping capacity,
$C_v:=\frac{T(V;\hat V)}{n^2},$
(2.7)
assuming that both neuron populations have size $n$ and the network is completely connected.3 Palm (1980) provided the first complete analysis of pattern mapping, showing a mapping capacity of $\ln 2$.

2.3  Completion Capacity

For autoassociation, part of the information of the retrieval outputs $\hat U:=(\hat u^1,\ldots,\hat u^M)$ about the original patterns $U:=(u^1,\ldots,u^M)$ is already present in the initial set of noisy inputs $\tilde U:=(\tilde u^1,\ldots,\tilde u^M)$ and must therefore be subtracted,
$C_u:=\frac{T(U;\hat U)-T(U;\tilde U)}{n^2},$
(2.8)
in order to get a fair measure for the completion capacity, where, to get positive values, the output noise level $\varepsilon$ should be lower than the input noise. Equations 2.7 and 2.8 can be further simplified for independently generated patterns $u^\mu$, $v^\mu$, $\tilde u^\mu$ (see section A.1 in appendix A for details).

A complete analysis of the completion capacity is mathematically demanding. For one-step and two-step retrieval, it has been done by Schwenker et al. (1996). The optimal input pattern contains half of the 1s of a stored pattern, and the remaining 1s can be retrieved with only a very low probability of additional wrong 1s in the retrieved pattern. This yields a capacity of $C_u=\frac{\ln 2}{4}$, which is also the asymptotic value for the completion capacity (Sommer, 1993). Interestingly, this value can be exceeded for finite $n$ (Schwenker et al., 1996). In section 2.4, we will see that the more restricted patterns used in block coding yield the same asymptotic capacity.

2.4  Bidirectional Capacity

For bidirectional iterative heteroassociation, both the input and output patterns are reconstructed such that an adequate measure of the retrieved information per synapse is the bidirectional capacity,
$C_{uv}:=C_u+C_v,$
(2.9)
if we assume that the patterns $uμ$ and $vμ$ are independently generated.

The iterative bidirectional retrieval method was first analyzed by Sommer and Palm (1998, 1999), surprisingly yielding the same asymptotic capacity of $\ln 2$ as simple pattern mapping alone.

2.5  Critical Pattern Capacity and the Sparse Limit

Perhaps the simplest capacity measure that does not rely on the idea of channel capacity is the (critical) pattern capacity $M_\varepsilon$, determining the maximum number of pattern associations $M$ that can be stored without $\hat\varepsilon(M)$ from equation 2.4 exceeding a tolerated output noise level $\varepsilon$,
$M_\varepsilon:=\max\{M:\hat\varepsilon(M)\le\varepsilon\},$
(2.10)
where in the limit of large networks ($n\to\infty$), the noise level $\varepsilon$ becomes irrelevant (unless $\varepsilon\to 0$ too fast) and $M_\varepsilon$ corresponds to the critical pattern capacity mentioned in section 1. Note that we can similarly constrain the information storage capacity by a maximal noise level $\varepsilon$—for example, writing $C_\varepsilon$ instead of $C$.

In general, a comparison by $M_\varepsilon$ is not meaningful as soon as the compared models store different types of pattern vectors that have, for example, different numbers of active units $k$. This means that one has to introduce an appropriate normalization; it is not sufficient just to divide $M_\varepsilon$ by $n$ as introduced by Hopfield (1982).

For the Willshaw model, it turned out that sparse patterns maximize both the information capacity and the critical pattern capacity. It is therefore useful to define the sparse limit more formally. For each $n$, we assume that the patterns $u^\mu$ (or $v^\mu$) are drawn randomly and independently from the $\binom{n}{k}$ possible patterns with exactly $k$ 1s (and $n-k$ 0s) and that $k\sim\log n$.

For autoassociation, it is known that the number $M$ of patterns that can be retrieved as fixed points grows as $M\sim\frac{n^2}{k\log n}$, faster than proportional to $n$. In order to evaluate this growth numerically, sometimes the coefficient $\alpha$ with $M=\alpha\frac{n^2}{k\log n}$ is considered (Gardner, 1987, 1988; Gripon, Heusel, Löwe, & Vermet, 2016), which also reflects the reduced information contained in sparse patterns, which is $\sim k\log n$ instead of $n$ bit. Still, $\alpha$ is just the total information contained in all $M$ fixed points (normalized by $n^2$), not the information that can be retrieved about these $M$ patterns. The numerical value of this normalized pattern capacity, or, rather, “growth factor,”4
$\alpha:=\frac{M_\varepsilon I(u^\mu)}{n^2}=\frac{M_\varepsilon k\,\mathrm{ld}\,n}{n^2},$
(2.11)
strongly depends on the quality criteria for fixed-point retrieval. The usual value for autoassociation is obtained for an asymptotically vanishing (expected) number of errors in both the input pattern and the fixed point ($\lambda\to 1$, $\kappa\to 0$, $\varepsilon k\to 0$; see section 2.7). For the additive correlation or Hebb rule, this has been analyzed in the sparse limit by Tsodyks and Feigel'man (1988), yielding a capacity of $\alpha=\frac{1}{2}\ln 2$. Willshaw et al. (1969) had already used an equivalent quality criterion, yielding $\alpha=\ln 2$. Interestingly, these values agree with the corresponding information-theoretic mapping capacity values (Palm, 1991) for nonbinary and binary memory matrices.

2.6  Recognition Capacity

For recognition memory, the task is to decide whether an input pattern $\tilde u$ has already been stored previously. This two-class problem can be solved by employing an autoassociative network in the following way. First, the dendritic potentials $x_j:=\sum_{i=1}^{n}w_{ij}\tilde u_i$ are computed as in one-step retrieval. Second, the sum $S:=\sum_j\tilde u_j x_j$ over all dendritic potentials of active input units is computed. Third, the new input $\tilde u$ is classified as familiar (i.e., already stored) if $S$ exceeds some threshold $\Theta_C$; otherwise $\tilde u$ is classified as new. For example, for the Willshaw model, equation 2.2, with all original patterns having the same activity $k=\|u^\mu\|_1$, we can simply choose $\Theta_C=k^2$ because any previously stored pattern $\tilde u=u^\mu$ is represented in the weight matrix by a clique of size $k$ having $k^2$ connections. Equivalently, one may check whether $\tilde u=u^\mu$ (or a superset thereof) is a fixed point of the dynamics (see equation 2.3) using $\Theta(t)=k$. For correctly computing the recognition capacity $C_{u,\mathrm{rcg}}$, it is important to see that for recognition tasks, the completion capacity, equation 2.8, is zero (because for familiar inputs $\tilde u^\mu=u^\mu$, completion is not necessary). Instead, $C_{u,\mathrm{rcg}}$ follows from the information given by the binary class label associated with each potential input pattern $\tilde u$ (for more details, see appendix C). So a simple capacity definition would consider just the maximal number $M$ of patterns that after storing become fixed points of the autoassociative threshold dynamics, equation 2.3, that is,
$M_f:=\max\{M=|\mathcal{M}|:u^\mu\text{ is a fixed point }\forall u^\mu\in\mathcal{M}\}.$
(2.12)
It turns out that in the large-$n$ limit, $M_f/M_\varepsilon\to 1$; that is, $M_f$ is essentially the same as the critical pattern capacity $M_\varepsilon$ defined above. The information-theoretic version of the idea is quite different again. From the matrix $W$, we generate the set $\mathcal{F}$ of all fixed points of the dynamics, equation 2.3. Thus, we can define the recognition capacity,
$C_{u,\mathrm{rcg}}:=\frac{\max T(\mathcal{M},\mathcal{F})}{n^2}.$
(2.13)
In the literature, it has been noticed that often $\mathcal{F}$ is much larger than $\mathcal{M}$, which is known as the problem of spurious states (Bruck & Roychowdhury, 1990; Hertz, Krogh, & Palmer, 1991), where spurious states are just the elements of $\mathcal{F}\setminus\mathcal{M}$. When we try to optimize $T(\mathcal{M},\mathcal{F})$, this leads to a kind of quality criterion restricting the number of spurious states. Indeed,
$T(\mathcal{M},\mathcal{F})=I(\mathcal{M})-I(\mathcal{M}|\mathcal{F})=\binom{n}{k}\cdot I\!\left(\frac{M}{\binom{n}{k}}\right)-F\cdot I\!\left(\frac{M}{F}\right),$
(2.14)
where $F$ is the total number of fixed points in $\mathcal{F}$. If we assume that $M$ is small compared to $F$ (see appendix C), we obtain
$T(\mathcal{M},\mathcal{F})=M\,\mathrm{ld}\frac{\binom{n}{k}}{M}-M\,\mathrm{ld}\frac{F}{M}=M\,\mathrm{ld}\frac{\binom{n}{k}}{F}=-M\,\mathrm{ld}\,p_F,$
(2.15)
where $p_F$ is the probability that a random pattern is a fixed point of equation 2.3. Now $F$ or $p_F$ has to be estimated. This can be done as in Palm and Sommer (1992) or as in the more recent paper by Gripon et al. (2016). If we consider their analysis, we can also see the difference between the different quality criteria for fixed-point capacity and the information-based recognition capacity. In fact, in their theorems 3.1 and 3.4, Gripon et al. (2016) consider the limit
$M\sim\left(\frac{n}{\log n}\right)^2\log\log n,$
(2.16)
which is larger than the usual sparse limit but ensures that $p_F$ vanishes asymptotically given an appropriate proportionality factor in equation 2.16. However, it turns out that then $F$ still increases too fast, resulting in an asymptotically vanishing recognition capacity. In other words, if $M$ grows as in equation 2.16, the information that can be retrieved about $\mathcal{M}$ is almost zero for large networks. Then they consider $M\sim(n/\log n)^2$ as in the sparse limit, which allows nonvanishing asymptotic completion and critical pattern capacities as discussed before and also yields a nonvanishing recognition capacity. Their argument can also be used to show that $C_{u,\mathrm{rcg}}=\frac{1}{n^2}T(\mathcal{M};\mathcal{F})=-\frac{M}{n^2}\,\mathrm{ld}\,p_F\le\frac{\ln 2}{2}$ and that the same result also holds for block coding.

Thus, the recognition capacity apparently reaches the information storage capacity for autoassociation (see note 2), $C=\frac{\ln 2}{2}$, which corresponds to half the value of the critical pattern capacity and twice the value of the practically relevant completion capacity (see sections 2.3 and 2.5). For further details on how to compute $C_{u,\mathrm{rcg}}$, see appendix C. There we also show that it may actually be possible to exceed equation 2.16, but only for patterns that have an even sparser activity than in the sparse limit $k\sim\log n$. Specifically, we show $C\to\frac{\ln 2}{2}$ for $M\sim\frac{n^2}{\ln\ln\cdots\ln n}$ and $k$ almost constant, assuming asymptotic independence of the synaptic weights (Knoblauch, 2008c).
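The familiarity test described at the beginning of this section can be sketched as follows (a minimal illustration of our own; the helper name and data layout are assumptions of this example):

```python
# Recognition sketch: compute dendritic potentials x_j, sum them over the
# active query units, and compare S against Theta_C = k^2 (Willshaw model,
# autoassociative storage of patterns with exactly k active units).

def is_familiar(W, query, k):
    n = len(W)
    x = [sum(W[i][j] for i in query) for j in range(n)]  # dendritic potentials
    S = sum(x[j] for j in query)     # sum only over active input units
    return S >= k * k                # a stored clique of size k has k^2 connections

# n = 6 with one stored pattern {0, 1, 2}
n, clique = 6, {0, 1, 2}
W = [[1 if i in clique and j in clique else 0 for j in range(n)] for i in range(n)]
print(is_familiar(W, {0, 1, 2}, 3), is_familiar(W, {0, 1, 4}, 3))  # True False
```

The stored pattern reaches the full score $S=k^2=9$, whereas the unfamiliar pattern, despite sharing two units, scores only 4 and is rejected.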

2.7  Random and Block Patterns for Maximal Capacity

Note that each subform of the general information storage capacity, equation 2.5, will result in a lower capacity value, because both the retrieval and the assumed pattern distribution are restricted to technically reasonable assumptions, whereas the information storage capacity is simply defined as the maximal transinformation, independently of the computability or practicality of retrieval. In the following, we describe, analyze, and simulate a number of different retrieval procedures for the Willshaw model. Moreover, some retrieval methods rely on a particular coding of the memory patterns. For example, some procedures require that all stored patterns have an identical number $k$ of active units; others also require block coding to represent integer-type vectors, where each binary pattern consists of $k$ blocks of size $N$, as illustrated in Figure 1A. Obviously, it would be necessary to compute the storage capacity for each combination of learning procedure, retrieval procedure, and distribution over the stored patterns. Because “channel capacity” commonly refers to the maximum transinformation, we focus on maximum entropy distributions, that is, random patterns subject to the constraints required by the retrieval procedures. Specifically, for the classical retrieval methods, we choose random patterns $u^\mu,v^\mu$ uniformly from the set of $\binom{n}{k}$ patterns of size $n$ with activity $k$. Similarly, for the block coding methods, we choose $u^\mu,v^\mu$ uniformly from the set of $N^k$ possible block patterns (with $n=Nk$). To generate a noisy input $\tilde u$, we select in each case a fraction $\lambda k$ of the $k$ one-entries from one of the original input patterns $u^\mu$ at random ($0<\lambda\le 1$), and, in general, we add $\kappa k$ false one-entries at random ($\kappa\ge 0$).5 (For more details on how the underlying pattern distributions affect the stored information, see section A.1.)
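The two pattern distributions and the query-noise model ($\lambda$, $\kappa$) can be sketched as follows (a minimal illustration of our own; the function names are assumptions of this example):

```python
import random

# Random k-sparse patterns (k ones out of n), block patterns (one 1 per
# block of size N, so n = N*k), and noisy queries that keep a fraction
# lambda of the k ones and add kappa*k false ones.

def random_pattern(n, k):
    return set(random.sample(range(n), k))

def block_pattern(k, N):
    return {b * N + random.randrange(N) for b in range(k)}

def noisy_query(u, n, lam, kappa):
    kept = set(random.sample(sorted(u), round(lam * len(u))))
    false = random.sample(sorted(set(range(n)) - u), round(kappa * len(u)))
    return kept | set(false)

random.seed(0)
u = block_pattern(k=5, N=4)              # a block pattern with n = 20
q = noisy_query(u, 20, lam=0.6, kappa=0.0)
print(len(u), len(q), q <= u)            # 5 3 True
```

With $\kappa=0$ the query is a pattern part in the sense used below: a subset containing $\lambda k$ of the original one-entries.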

Figure 1:

(A) Example of block coding an integer vector (2,1,4,2,3) by a binary pattern vector $u$ consisting of $k=5$ blocks, each having $N=4$ components. Correspondingly, the binary pattern $u$ has length $n=kN=20$ and $k=5$ one-entries. (B) Illustration of iterative “core-and-halo” retrieval with block coding constraint (IRB). If the initial query $\tilde u$ is a subset of the original input pattern $u^\mu$, then each subsequent one-step-retrieval operation with block coding constraint (R1B) also activates a subset of the stored memory ($\hat u(t)\subseteq u^\mu$, $\hat v(t)\subseteq v^\mu$), whereas simple one-step retrieval without block constraint (R1) activates a superset of the stored memory ($\mathrm{R1}(\hat u(t))\supseteq v^\mu$, $\mathrm{R1}^T(\hat v(t))\supseteq u^\mu$). We call the intermediary retrieval outputs $\hat u(t),\hat v(t)$ after step $t$ cores of the stored patterns, whereas we call the additional active neurons in supersets halos. During iterative retrieval, cores grow, whereas halos shrink with each iteration $t$.


3  Retrieval Procedures for Block Patterns and Random Patterns

Retrieval means finding a maximal biclique (or clique) that has maximal overlap with a given query pattern $\tilde u$ presented to the input layer. Here we assume pattern part retrieval where the query pattern $\tilde u$ contains a subset of $\lambda k$ of the $k$ one-entries of an original input pattern $u^\mu$ (where $0<\lambda\le 1$). There are several possibilities for computing the retrieval output $\hat v$:

One-step retrieval (R1). An output unit $v_j$ gets activated iff it is connected to at least $\Theta$ active input units $u_i$ with $\tilde u_i=1$, that is,
$\hat v_j=\begin{cases}1,&x_j:=\sum_{i=1}^{n}w_{ij}\tilde u_i\ge\Theta\\0,&\text{otherwise}.\end{cases}$
(3.1)
As we assume $\tilde u\subseteq u^\mu$ and full network connectivity, we can use the maximal threshold $\Theta=\lambda k$. Note that this will activate a superset $\hat v\supseteq v^\mu$ of the active units in the original output pattern, because learning by equation 2.2 implies that each output unit $v_j$ with $v_j^\mu=1$ will be connected to all of the active input units. One-step retrieval is the standard retrieval procedure of many previous works. We use it as the baseline for comparison with the more refined retrieval strategies of block coding.
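One-step retrieval with the maximal threshold $\Theta=\lambda k$ can be sketched as follows (our own illustration; patterns are sets of active units, and the query is assumed to be a subset of a stored input):

```python
# One-step retrieval R1 (equation 3.1) with maximal threshold Theta = |query|:
# an output unit fires iff it is connected to every active query unit.

def R1(W, query):
    theta = len(query)   # Theta = lambda*k for a pattern-part query
    return {j for j in range(len(W[0]))
            if sum(W[i][j] for i in query) >= theta}

# weights from storing ({0,1} -> {2,3}) and ({1,4} -> {3,5}) in a 6 x 6 net
W = [[0] * 6 for _ in range(6)]
for u, v in [({0, 1}, {2, 3}), ({1, 4}, {3, 5})]:
    for i in u:
        for j in v:
            W[i][j] = 1
print(sorted(R1(W, {0, 1})))  # [2, 3]
print(sorted(R1(W, {1})))     # [2, 3, 5]: a superset ("halo") of the stored output
```

The second call illustrates the superset property: a smaller query lowers the threshold, so units of overlapping memories can also become active.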

One-step retrieval with block constraint (R1B). As the output patterns are block codes, we can exploit that there is only one active unit per block. Thus, given the R1 result $\hat v$, we can conclude for each one-entry $\hat v_j=1$ that it is correct if there is no second one-entry in the same block. By erasing all one-entries in ambiguous blocks of $\hat v$ that have multiple active units, we effectively decrease output noise, equation 2.4, as soon as there are more than two active units in a block. Note also that by exploiting the block constraint, the retrieval output $\hat v$ again becomes a subset of the original output pattern $v^\mu$.
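The block constraint can be applied to an R1 output as a simple post-processing step; a sketch (function name and block geometry `k`, `N` as in Figure 1A are illustrative):

```python
import numpy as np

def apply_block_constraint(v_hat, k, N):
    """R1B post-processing: keep a one-entry only if it is the unique
    active unit of its block; erase all ambiguous blocks. The result is
    a subset ('core') of the stored block pattern."""
    blocks = v_hat.reshape(k, N)
    unique = (blocks.sum(axis=1) == 1)       # blocks with exactly one winner
    return (blocks * unique[:, None].astype(blocks.dtype)).reshape(-1)

# Block 1 has a unique winner (kept); block 2 is ambiguous (erased).
v_hat = np.array([0, 1, 0, 0,  1, 1, 0, 0,  0, 0, 0, 0], dtype=np.uint8)
v_core = apply_block_constraint(v_hat, k=3, N=4)
```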

Simple iterative retrieval with block constraint (sIRB). As the output $\hat v$ of R1B is a subset of $v^\mu$, we can repeat the same procedure with the input and output layers exchanging their roles, and so on, leading to the following algorithm:

1. Let $\hat u^{(0)}:=\tilde u$ be the original input query and set $t:=0$.

2. Increase $t:=t+1$.

3. Compute the next output estimate $\hat v^{(t)}:=R1B(\hat u^{(t-1)})$ by applying the R1B procedure described above to the previous estimate of the input pattern, $\hat u^{(t-1)}$.

4. Compute the next input estimate $\hat u^{(t)}:=R1B^T(\hat v^{(t)})$ by applying the “transposed” R1B procedure to the current output estimate $\hat v^{(t)}$ (from layer $v$ to layer $u$ with the transposed weight matrix $W^T$).

5. As long as a stopping criterion (e.g., $\hat u^{(t)}=\hat u^{(t-1)}$ or exceeding a maximal number of retrieval steps) is not reached, go to step 2.

Note that all estimates of inputs or outputs are subsets of the original patterns.

Iterative retrieval with block constraint (IRB). Following the last remark, it is obvious that the sIRB algorithm can be improved by OR-ing new estimates of output and input patterns with the previous estimates. Thus, after initializing $\hat v^{(0)}:=\emptyset$ in step 1, we replace the operations of steps 3 and 4 by
$\hat v^{(t)}:=\hat v^{(t-1)}\cup R1B(\hat u^{(t-1)}),$
(3.2)
$\hat u^{(t)}:=\hat u^{(t-1)}\cup R1B^T(\hat v^{(t)}),$
(3.3)
where we interpret patterns as sets of their active units for convenience. Note that the patterns grow monotonically with time $t$, that is, $\hat v^{(t)}\supseteq\hat v^{(t-1)}$ and $\hat u^{(t)}\supseteq\hat u^{(t-1)}$ for all $t=1,2,\ldots$, as illustrated in Figure 1B. Thus, there is a unique stopping criterion $\hat u^{(t)}=\hat u^{(t-1)}$ for step 5. As block coding limits the number of active units (in the stored patterns) to $k$, the algorithm terminates after at most $t_{\max}\le(1-\lambda)k$ iterative retrieval steps. Note that for the special case of autoassociation, an explicit implementation of operations 3.2 and 3.3 is not necessary, because sIRB implicitly realizes the OR-ing.$^6$ We sometimes use a variant IRB-R1 where IRB is followed by an additional R1 step in order to increase the retrieved information by constructing a superset (“halo”) of the original output pattern.
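The IRB loop of equations 3.2 and 3.3 can be sketched as follows (a simplified illustration assuming both layers share the same block geometry $k$, $N$; `r1b` is a hypothetical helper combining thresholding and the block constraint):

```python
import numpy as np

def r1b(W, query, theta, k, N):
    """One-step retrieval with block constraint (R1B): threshold the
    dendritic potentials, then erase all ambiguous blocks."""
    v = (query @ W >= theta).astype(np.uint8)
    blocks = v.reshape(k, N)
    unique = (blocks.sum(axis=1) == 1)
    return (blocks * unique[:, None].astype(np.uint8)).reshape(-1)

def irb(W, u_query, k, N, max_steps=10):
    """Iterative retrieval with block constraint and OR-ing (IRB),
    equations 3.2 and 3.3. Estimates grow monotonically, so the loop
    stops once the input estimate no longer changes."""
    u_hat = u_query.copy()
    v_hat = np.zeros(W.shape[1], dtype=np.uint8)
    for _ in range(max_steps):
        v_hat = v_hat | r1b(W, u_hat, int(u_hat.sum()), k, N)
        u_new = u_hat | r1b(W.T, v_hat, int(v_hat.sum()), k, N)
        if np.array_equal(u_new, u_hat):
            break
        u_hat = u_new
    return u_hat, v_hat

# Toy example: one stored pair of block patterns (k = 2 blocks of size N = 2);
# a half query (lambda = 0.5) is completed to the full pair.
u = np.array([1, 0, 0, 1], dtype=np.uint8)
v = np.array([0, 1, 1, 0], dtype=np.uint8)
W = np.outer(u, v)
u_hat, v_hat = irb(W, np.array([1, 0, 0, 0], dtype=np.uint8), k=2, N=2)
```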
Iterative retrieval with block constraint and sum-of-max strategy (IRB-SMX). In a valid block pattern, there is exactly one active neuron per block. This constraint suggests an interesting retrieval strategy where each neuron can receive at most one synaptic input per block (Gripon & Berrou, 2012; Yao et al., 2014). For that, equation 3.1 must be replaced by the following R1B-SMX procedure:
$\hat v_j=\begin{cases}1, & x_j:=\sum_{b=1}^{k}\max_{i=N(b-1)+1}^{Nb} w_{ij}\tilde u_i\ge\Theta\\ 0, & \text{otherwise.}\end{cases}$
(3.4)
Obviously, if the input pattern $\tilde u$ is a superset of the original pattern $u^\mu$, then a threshold $\Theta=k$ equaling the number of blocks yields a retrieval output $\hat v$ that is likewise a superset of the original output pattern $v^\mu$. In that case, R1B-SMX can consistently be iterated for both autoassociation and heteroassociation. In the case $\tilde u\subset u^\mu$, one may activate all empty blocks of $\tilde u$ before starting iterative retrieval, as suggested for the original algorithm (Gripon & Berrou, 2012; Yao et al., 2014). However, this may activate a vast number of neurons and slow down the retrieval procedure considerably. For that reason, our version of IRB-SMX starts with R1 in the first retrieval step and then continues with R1B-SMX (and R1B-SMX$^T$) in the remaining iterations:
1. $\hat u^{(0)}:=\tilde u$;

2. $\hat v^{(1)}:=R1(\tilde u)$;

3. $\hat u^{(1)}:=\text{R1B-SMX}^T(\hat v^{(1)})$;

4. $t:=1$;

5. $t:=t+1$;

6. $\hat v^{(t)}:=\hat v^{(t-1)}\cap\text{R1B-SMX}(\hat u^{(t-1)})$;

7. $\hat u^{(t)}:=\hat u^{(t-1)}\cap\text{R1B-SMX}^T(\hat v^{(t)})$;

8. IF $\hat v^{(t)}$ or $\hat u^{(t)}$ have changed (and $t$ has not reached its maximum value) THEN goto step 5.

For autoassociation, we can set $\hat v^{(t)}:=\hat u^{(t)}$ and skip steps 3 and 7. By simulation experiments, we have verified that our version performs equivalently to the original algorithm, which initializes by fully activating empty blocks (data not shown). In case $\tilde u$ is already a superset of an original pattern, the initialization steps 2 and 3 must be skipped.
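The sum-of-max potential of equation 3.4 replaces the plain sum of synaptic inputs by a sum of per-block maxima; a sketch (hypothetical helper, same block geometry assumed on the input side):

```python
import numpy as np

def r1b_smx(W, u_query, theta, k, N):
    """R1B-SMX, equation 3.4: each of the k input blocks contributes at
    most one synaptic input, x_j = sum_b max_{i in block b} W[i,j]*u_i."""
    contrib = (W * u_query[:, None]).reshape(k, N, W.shape[1])
    x = contrib.max(axis=1).sum(axis=0)   # sum of per-block maxima
    return (x >= theta).astype(np.uint8)

# With a halo query (superset of the stored u) and Theta = k, the output
# is again a superset (halo) of the stored v.
u = np.array([1, 0, 0, 1], dtype=np.uint8)
v = np.array([0, 1, 1, 0], dtype=np.uint8)
W = np.outer(u, v)
halo_query = np.array([1, 1, 0, 1], dtype=np.uint8)
v_halo = r1b_smx(W, halo_query, theta=2, k=2, N=2)
```

The per-block maximum caps each block's contribution at 1, so a block flooded with wrongly active units cannot push a wrong output unit above threshold by more than one vote.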

Iterative retrieval of core and halo patterns. Note that combining, for example, an iteration of R1 and R1B enables a retrieval scheme where both supersets and subsets of the original patterns can be retrieved, as illustrated by Figure 1B. Here we call a subset of an original pattern a “core” and a superset a “halo.” Obviously, unions of cores are cores, and intersections of halos are halos. Thereby, we may combine different strategies for finding cores and halos to improve the retrieval result. For example, different cores can be obtained for block coding by:

• Applying R1B to a core (with threshold $|\mathrm{core}|$)

• Applying R1B to a halo (with threshold $k$)

• R1B-cSMX: Applying R1B-SMX to a halo followed by deactivating blocks with more than one active unit.

Similarly, different halos can be obtained for block coding by:

• Applying R1 to a core (with threshold $|\mathrm{core}|$)

• Applying R1 to a halo (with threshold $k$)

• Applying R1B-SMX to a halo

By combining the different core and halo procedures, it is in principle possible to improve iterative retrieval outputs at the cost of increased computing time. However, preliminary simulations have shown that such combinations yield only minor improvements (data not shown). At least, the variant IRB-cSMX of iterating R1B-cSMX has the property of minimizing output noise $\hat\varepsilon$, because in each silenced block at least one, or even more, wrongly active neurons are eliminated.

We have compared the retrieval algorithms for block patterns to some standard variants of iterative retrieval (IR) for random patterns. For this, we have tested two implementations of IR:

• IR-KWTA: This is a $k$-winners-take-all (KWTA) strategy, setting the threshold to the largest possible value such that at least $k$ neurons get active.

• IR-LK+: This is another variant based on the LK+ strategy introduced by Schwenker et al. (1996) for autoassociative networks. Here, the idea is to combine R1 in the first iteration (setting the threshold $\Theta=c:=\lambda k$ to the number of “correct” units in the input pattern) with AND-ing the outputs in further retrieval steps using a threshold $\Theta=k$ equal to the cell assembly size. This obviously yields a sequence of halo patterns, as each retrieval step yields a superset of the original memory. This idea generalizes to bidirectional retrieval for heteroassociation in an obvious way.
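The KWTA threshold choice above can be sketched in a few lines (hypothetical dendritic potentials `x`):

```python
import numpy as np

def kwta(x, k):
    """k-winners-take-all: use the largest threshold for which at least
    k units become active, i.e., the k-th largest dendritic potential."""
    theta = np.sort(x)[-k]
    return (x >= theta).astype(np.uint8)

x = np.array([3, 7, 7, 1, 5])   # hypothetical dendritic potentials
winners = kwta(x, k=3)          # threshold 5 activates three units
```

Ties at the threshold may activate more than $k$ units, which is why the text says "at least $k$ neurons get active."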

In all implementations of the described algorithms, we have included the following optimization steps: First, for autoassociation, we have optimized the iterative algorithms by AND-ing or OR-ing at the earliest possible time. Moreover, we limited the number of active units during each retrieval step to a maximum value (that depended on pattern size $k$ and was at least $\max(2k,1000)$) in order to prevent uncontrolled spreading of activity, which would otherwise result in a strong slowdown of the simulations (as the time required for each retrieval step grows in proportion to the number of active units). In case activity exceeded the maximum value, iterative retrieval was aborted, and the result of the previous iteration was returned as the final retrieval output.

All simulations were performed using the PyFelix++ simulation tool (see Knoblauch, 2003b, appendix C) on multicore compute clusters (BWUniCluster at the Steinbuch Centre for Computing at the KIT, using a maximum of 32 cores and 64 GB RAM per simulation; and a custom Intel Xeon 2 GHz installation at University of Ulm with 70 cores and 1.5 TB RAM for large networks with up to $n=2\cdot 10^6$ neurons).

4  Analysis of Output Noise and Information Storage Capacity for Block Coding

For a detailed analysis of one-step retrieval for random activity patterns, see Knoblauch et al. (2010). The following section develops a similar analysis for block patterns (see Figure 1A), where the active unit of a block is selected uniformly and independently of the other blocks. Thus, the probability of a pattern unit being active is $\mathrm{pr}[u_i^\mu=1]=\mathrm{pr}[v_j^\mu=1]=1/N=:p$ independently for all blocks and, as in the previous analysis, we assume that the dendritic potentials follow a binomial distribution (Knoblauch, 2008c).

4.1  One-Step Retrieval with Block Coding (R1B)

After storing $M$ pattern associations, the matrix load $p_1:=\mathrm{pr}[W_{ij}=1]$ is$^7$
$p_1=1-\left(1-\frac{1}{N^2}\right)^M.$
(4.1)
Thus, for one-step pattern part retrieval using a query with a fraction $\lambda$ of correct one-entries, the chance of a wrong activation in the output is$^8$
$p_{01}:=\mathrm{pr}[\hat v_j=1\,|\,v_j^\mu=0]=p_1^{\lambda k}.$
(4.2)
Therefore, the probability that after R1 a missing block is reconstructed by having a unique active neuron is
$p_{br}:=\mathrm{pr}[\text{block reconstructed by unique active neuron}]=(1-p_{01})^{N-1}.$
(4.3)
For heteroassociation, the expected number of blocks reconstructed with unique neurons is therefore $E(\text{unique blocks})=k(1-p_{01})^{N-1}$, which equals the expected number of one-entries in the output pattern of the R1B procedure. Therefore, the output pattern completeness $\hat\lambda$, defined as the average fraction of correct one-entries in the output pattern, is
$\hat\lambda=p_{br}=(1-p_1^{\lambda k})^{N-1},$
(4.4)
and there will be no false one-entries. We may allow a small fraction $\hat\varepsilon=1-\hat\lambda$ of missing one-entries (e.g., $\hat\varepsilon\le\varepsilon=0.01$). Solving this with equation 4.4 yields an upper bound $p_{1\varepsilon}$ for the matrix load $p_1$. Corresponding bounds $M_\varepsilon$ and $C_\varepsilon$ for memory number and stored information then follow from equation 4.1, analogous to the analysis in Knoblauch et al. (2010).
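Equations 4.1 to 4.4 are easily evaluated numerically; a sketch (the parameter values are borrowed from Figure 2A, the function names are illustrative):

```python
import math

def matrix_load(M, N):
    """Equation 4.1: p1 = 1 - (1 - 1/N^2)^M after storing M associations."""
    return 1.0 - (1.0 - 1.0 / N**2) ** M

def output_completeness(p1, lam, k, N):
    """Equations 4.2 to 4.4: probability that a block is reconstructed by
    a unique active neuron after one R1B step."""
    p01 = p1 ** (lam * k)            # false-activation probability, eq. 4.2
    return (1.0 - p01) ** (N - 1)    # unique-winner probability, eq. 4.4

# Figure 2A parameters: n = 4096, k = 16, N = 256, p1 = 0.45
lam_hat = output_completeness(0.45, lam=1.0, k=16, N=256)
```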

4.2  Iterative Retrieval for Block Coding

The output of the first R1B step is used as the input pattern for the next sIRB or IRB step. For the simple variant sIRB without OR-ing, this defines an iteration with fixed-point equation $\hat\lambda(\lambda)=\lambda$, as illustrated in Figure 2. To solve it, we can linearize $\hat\lambda(\lambda)\approx\hat\lambda(1)-(1-\lambda)\hat\lambda'(1)$ around $\lambda=1$ by using the derivative of equation 4.4:
$\hat\lambda'(\lambda)=-(N-1)(1-p_1^{\lambda k})^{N-2}\,k\,p_1^{\lambda k}\ln p_1.$
(4.5)
Intersecting $\hat\lambda(\lambda)$ with the line $\lambda\mapsto\lambda$ then yields an approximation for the maximal output completeness $\hat\lambda_{\max}$ that can be achieved by sIRB after a sufficient number of iterations,$^9$
$\hat\lambda_{\max}\approx\frac{\hat\lambda(1)-\hat\lambda'(1)}{1-\hat\lambda'(1)}\quad\text{for}\quad\hat\lambda(1)=(1-p_1^{k})^{N-1}\quad\text{and}\quad\hat\lambda'(1)=-k(N-1)(1-p_1^{k})^{N-2}p_1^{k}\ln(p_1),$
(4.6)
if $\hat\lambda'(1)<1$. Our next goal is to estimate the pattern capacity $M_{\max}$, that is, the maximal number of memory patterns such that we can expect a successful iterative retrieval procedure. This is the case if $\lambda$ is larger than the repelling fixedpoint, that is, if either $\hat\lambda>\lambda$ or $\hat\lambda\ge\hat\lambda_{\max}$. In the former case, we may require that the output completeness $\hat\lambda$ exceeds the input completeness $\lambda$ by some margin, for example,
$\hat\lambda\overset{!}{\ge}\min\left(\lambda+\frac{r}{k},\,\hat\lambda_{\max}\right),$
(4.7)
to require an average improvement of at least $r\ge 1$ reconstructed blocks in the first retrieval step.$^{10}$ Equivalently, the block reconstruction probability $p_{br}$ must exceed a corresponding threshold $L$,
$p_{br}=(1-p_1^{\lambda k})^{N-1}\overset{!}{\ge}L:=\min\left(\lambda+\frac{r}{k},\,1-\varepsilon_{\min}\right),$
(4.8)
where $\varepsilon_{\min}$ equals either $1-\hat\lambda_{\max}$ as before or a small, positive tolerance value $\varepsilon\gtrsim 0$.$^{11}$ Solving for the matrix load $p_1$ gives the upper bound
$p_1\overset{!}{\le}p_{1,\max}:=\left(1-L^{\frac{1}{N-1}}\right)^{\frac{1}{\lambda k}}\quad\Leftrightarrow\quad k\approx\frac{\mathrm{ld}\left(1-L^{\frac{1}{N-1}}\right)}{\lambda\,\mathrm{ld}\,p_{1,\max}}.$
(4.9)
As $p_1$ is monotonically increasing in $M$, we can estimate the maximal possible number of pattern associations by solving equation 4.1 for $M$, which yields the pattern capacity
$M_{\max}=\frac{\ln(1-p_{1,\max})}{\ln\left(1-\frac{1}{N^2}\right)}\approx-\frac{n^2\ln(1-p_{1,\max})}{k^2}\approx-\frac{\lambda^2 n^2\,(\mathrm{ld}\,p_{1,\max})^2\ln(1-p_{1,\max})}{\left[\mathrm{ld}(1-L^{p})\right]^2},$
(4.10)
where the approximations hold for $N=n/k\to\infty$ corresponding to sparse patterns with $p=1/N=k/n\to 0$, where the Willshaw model is known to be most efficient. Note that this analysis holds for both heteroassociation and autoassociation without OR-ing (sIRB). We argue that the described analysis for sIRB applies as well to heteroassociative IRB with OR-ing, because the OR-ing becomes effective only after the second iteration of R1B (see equations 3.2 and 3.3), whereas the decision about a successful retrieval happens within the first two R1B steps, due to the exponential decrease of errors (see equation 4.2).
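The quantities of equations 4.5 to 4.10 can be computed directly; a sketch (function names and the example parameters are illustrative):

```python
import math

def lambda_hat_max(p1, k, N):
    """Equation 4.6: linearized attracting fixed point of sIRB."""
    lh1 = (1.0 - p1**k) ** (N - 1)                                # lambda_hat(1)
    d = -k * (N - 1) * (1.0 - p1**k) ** (N - 2) * p1**k * math.log(p1)
    return (lh1 - d) / (1.0 - d)

def pattern_capacity(lam, k, N, L):
    """Equations 4.9 and 4.10: maximal matrix load p_{1,max} and the
    corresponding pattern capacity M_max."""
    p1_max = (1.0 - L ** (1.0 / (N - 1))) ** (1.0 / (lam * k))
    M_max = math.log(1.0 - p1_max) / math.log(1.0 - 1.0 / N**2)
    return p1_max, M_max

# Example: lambda = 0.5 and r = 1 give L = lambda + r/k; here k = 16, N = 256.
lam, k, N = 0.5, 16, 256
p1_max, M_max = pattern_capacity(lam, k, N, L=lam + 1.0 / k)
```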
Figure 2:

(A) Illustration of average output completeness $\hat\lambda$ as a function of input completeness $\lambda$ for heteroassociation with $n=4096$, $k=16$, $N=256$, and matrix load $p_1=0.45$. Note that $\hat\lambda(\lambda)=\lambda$ defines a fixedpoint equation. There are three fixedpoints, illustrated by red circles. Dashed lines indicate the (average) time dynamics $\hat\lambda^{(t)}$ of a retrieval with initial completeness $\lambda=\hat\lambda^{(0)}=0.5$. The attracting fixedpoint at $\lambda=\hat\lambda_{\max}\approx 1$ determines the retrieval quality of the original memory pattern. The repelling fixedpoint at $\lambda\approx 0.46$ defines the size of the basin of attraction. (B) Curves similar to panel A for different matrix loads $p_1=0.1,0.2,0.3,0.4,0.5,0.57,0.6,0.7,0.8$ as indicated. With increasing load, the two nonzero fixedpoints move toward each other until they merge into one (here for $p_1$ between 0.57 and 0.6). Beyond this limit, there remains only a single fixedpoint at $\lambda=0$ and, without the OR-ing mechanism (sIRB), output activity will always die out. Green dashed lines correspond to autoassociation with OR-ing. (C) Similar to panel B but for larger networks: $k=24$, $N=4096$, $n=98{,}304$. (D) Similar to panel B, but for very large networks: $k=100$, $N=2^{k/2}$, $n=kN$.


For autoassociative IRB with OR-ing, the previous analysis of heteroassociative sIRB and IRB must be slightly adapted to account for the identical neuron populations $u$ and $v$, where the OR-ing becomes effective already after the first R1B step. In particular, there will be a larger fraction of correct one-entries in the output pattern after the first R1B step, because the $\lambda k$ correct neurons are already known from the input $\tilde u$. It turns out that $\hat\lambda_{\max}$, $L$, $p_{1,\max}$, $M_{\max}$ have to be substituted by related quantities $\hat\lambda_{\max,AA}$, $L_{AA}$, $p_{1,\max,AA}$, $M_{\max,AA}$. (For details, see section A.2.)

4.3  Information-Theoretic Capacity Measures

Finally, we can compute the stored Shannon information. As each pattern corresponds to $k$ integers from $\{1,\ldots,N\}$, the Shannon information of a single random block pattern is $k\,\mathrm{ld}\,N$. Thus, the information of the stored pattern set is $Mk\,\mathrm{ld}\,N$, and after normalizing to the number of synapses at the capacity limit $M=M_{\max}$ from equation 4.10, the maximal pattern information normalized to synapse number is
$I_{\max}:=\frac{M_{\max}\,k\,\mathrm{ld}\,N}{n^2}\approx-\frac{\ln(1-p_{1,\max})\,\mathrm{ld}\,N}{k}\approx-\lambda\,\mathrm{ld}(p_{1,\max})\ln(1-p_{1,\max})\frac{\ln N}{\ln(1-L^p)}.$
(4.11)
From this, we can compute various measures that evaluate storage capacity in bit per synapse. For heteroassociation, the mapping capacity is
$C_v:=\hat\lambda_{\max}I_{\max}.$
(4.12)
However, as bidirectional retrieval will reconstruct both input and output patterns, the bidirectional capacity is$^{12}$
$C_{uv}:=(2\hat\lambda_{\max}-\lambda)I_{\max},$
(4.13)
where a fraction $\lambda$ of the $k$ integers is already given in the input query $\tilde u$. Correspondingly, for autoassociation (without OR-ing), the completion capacity within the input population $u$ is the difference between $C_{uv}$ and $C_v$:
$C_u:=(\hat\lambda_{\max}-\lambda)I_{\max}.$
(4.14)
For autoassociation with OR-ing, we have to replace again $\hat\lambda_{\max}$, $L$, $p_{1,\max}$, $M_{\max}$ by $\hat\lambda_{\max,AA}$, $L_{AA}$, $p_{1,\max,AA}$, $M_{\max,AA}$ in the above formulas.
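Equations 4.11 to 4.14 then turn a pattern capacity into bit-per-synapse measures; a sketch (with illustrative example numbers; note that $C_{uv}=C_v+C_u$ holds by construction):

```python
import math

def capacities(M_max, k, N, lam, lam_hat_max):
    """Equations 4.11 to 4.14: stored information per synapse I_max and
    the mapping, bidirectional, and completion capacities (bit/synapse)."""
    n = k * N
    I_max = M_max * k * math.log2(N) / n**2    # eq. 4.11
    C_v = lam_hat_max * I_max                  # eq. 4.12, mapping
    C_uv = (2.0 * lam_hat_max - lam) * I_max   # eq. 4.13, bidirectional
    C_u = (lam_hat_max - lam) * I_max          # eq. 4.14, completion
    return I_max, C_v, C_uv, C_u

I_max, C_v, C_uv, C_u = capacities(40000, 16, 256, lam=0.5, lam_hat_max=0.99)
```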

4.4  The Limit of Large Networks

The Willshaw model is known to have three regimes of operation (Knoblauch et al., 2010, sec. 3.4). In the regime of balanced potentiation, the matrix load, equation 4.1, converges toward a value between zero and one, $0<p_1<1$, for large networks with $n\to\infty$. By equation B.1, it corresponds to the sparse limit of section 2.5 with $k\sim\mathrm{ld}\,n$, where the basic Willshaw model can store a positive amount $0<C\le\ln 2$ bit of information per synapse $W_{ij}$. In the other two regimes, $C\to 0$, as the weight matrix contains either (almost) only zeros or ones such that the entropy of a synaptic weight $W_{ij}$ approaches zero. These so-called sparse and dense potentiation regimes are actually very interesting if the network employs additional mechanisms like compression of the weight matrix or structural synaptic plasticity (Knoblauch, Körner, Körner, & Sommer, 2014; Knoblauch, 2017). However, here we will focus only on the basic Willshaw model with balanced potentiation, evaluating the analyses of sections 4.1 to 4.3 for the sparse limit with $n,k,N\to\infty$, $p=k/n=1/N\to 0$, $L=\lambda+r/k$, $0<L<1$, $p\ln L\to 0$, and $k\sim\log n$ corresponding to fixed $0<p_{1,\max}<1$ for constant $\lambda,\varepsilon$. Typically, $r$ is constant or $r\sim k$.

First, the linearizations $\ln(x+\Delta x)=\ln(x)+O\left(\frac{\Delta x}{x}\right)$ for $\Delta x\to 0$ and $L^p=\exp(p\ln L)=1+p\ln L+O((p\ln L)^2)$ for $p\ln L\to 0$ imply
$1-L^p=-p\ln L+O((p\ln L)^2)\to 0\quad\text{and}$
(4.15)
$\ln(1-L^p)=\ln(-p\ln L)+O(p\ln L)\sim-\ln N.$
(4.16)
Inserting equation 4.16 in 4.9 reveals that $k$ must grow logarithmically with the neuron number $n=kN$:
$k\approx\frac{\mathrm{ld}\,N}{-\lambda\,\mathrm{ld}\,p_{1,\max}}.$
(4.17)
This implies that $M_{\max}$ as computed in equation 4.10 defines a sharp border between perfect retrieval and loss of all memories, a phenomenon often referred to as catastrophic forgetting (Robins & McCallum, 1998; French, 1999; Knoblauch et al., 2014; Knoblauch, 2017),
$\hat\lambda\!\left(\frac{M}{M_{\max}}\right)\to 1-H\!\left(\frac{M}{M_{\max}}-1\right),$
(4.18)
where $H$ is a variant of the Heaviside function with a special value at zero (for example, $H(0):=\lambda$ for constant $r$). This means that for large networks, $\hat\lambda_{\max}\to 1$ as long as $\lim_{n\to\infty}\frac{M}{M_{\max}}<1$. For a detailed analysis, see section A.3. There it is also shown that equations 4.15 to 4.18 basically hold as well for autoassociation with OR-ing (replacing $L$ by $L_{AA}$ and using a slightly modified threshold function $H$). As a consequence, large networks doing autoassociation with OR-ing can reach only the same maximal matrix load $p_{1,\max}$ and critical pattern capacity $M_{\max}$ as autoassociation without OR-ing and heteroassociation.
Finally, with $\hat\lambda_{\max}\to 1$ and $\mathrm{ld}(1-L^p)\simeq\mathrm{ld}(1-L_{AA}^p)\simeq-\mathrm{ld}\,N\simeq-\mathrm{ld}\,n$, the asymptotic storage capacities follow from equations 2.11 and 4.10 to 4.14:
$M_{\max,\infty}=-\lambda^2(\mathrm{ld}\,p_{1,\max})^2\ln(1-p_{1,\max})\frac{n^2}{\mathrm{ld}^2 n}\le 1.219\frac{n^2}{\mathrm{ld}^2 n},$
(4.19)
$\alpha_{\max,\infty}:=\frac{M_{\max,\infty}\,k\,\mathrm{ld}\,n}{n^2}=\lambda\,\mathrm{ld}\,p_{1,\max}\ln(1-p_{1,\max})\le\ln 2\approx 0.69,$
(4.20)
$C_{v,\infty}=\lambda\,\mathrm{ld}\,p_{1,\max}\ln(1-p_{1,\max})\le\ln 2\approx 0.69,$
(4.21)
$C_{uv,\infty}=\lambda(2-\lambda)\,\mathrm{ld}\,p_{1,\max}\ln(1-p_{1,\max})\le\ln 2\approx 0.69,$
(4.22)
$C_{u,\infty}=\lambda(1-\lambda)\,\mathrm{ld}\,p_{1,\max}\ln(1-p_{1,\max})\le\frac{\ln 2}{4}\approx 0.17.$
(4.23)
All normalized capacity measures (see equations 4.20 to 4.23) are maximal for maximum entropy of synaptic weights with $p_{1,\max}=0.5$, where equation 4.17 yields the corresponding optimal pattern activity $k=\frac{1}{\lambda}\mathrm{ld}\,N$. Mapping capacity $C_v$ and bidirectional capacity $C_{uv}$ reach the upper bound $\ln 2\approx 0.69$ for zero query noise with $\lambda=1$, whereas completion capacity $C_u$ reaches the bound $\frac{\ln 2}{4}\approx 0.17$ for $\lambda=0.5$ and $k=2\,\mathrm{ld}\,N$. Critical pattern capacity $M_{\max}$ gets maximal for $p_{1,\max}\,\mathrm{ld}\,p_{1,\max}=2(1-p_{1,\max})\,\mathrm{ld}(1-p_{1,\max})$ or $p_{1,\max}\approx 0.1603653$, $\lambda=1$, and even sparser $k\approx\frac{1}{2.64\lambda}\mathrm{ld}\,N$. Thus, due to the Heaviside-type shape of the asymptotic output completeness, equation 4.18, all capacity measures for IRB are identical to the corresponding capacities of one-step retrieval (R1) in the original Willshaw et al. (1969) model without block coding (e.g., see equation B.4 in appendix B). Correspondingly, inserting equation 4.15 in equation B.4 yields for $p\to 0$
$g_{p_1}:=\frac{\max p_1\ \text{for IRB}}{\max p_1\ \text{for R1}}\approx\left(\frac{-\ln\lambda}{\varepsilon}\right)^{1/(\lambda k)}\to 1,$
(4.24)
such that all gain factors, equations B.4 to B.9, approach 1. Thus, asymptotically, the values of one-step retrieval cannot be exceeded by block coding or by bidirectional or iterative retrieval strategies.$^{13}$
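The common factor $\lambda\,\mathrm{ld}(p_{1,\max})\ln(1-p_{1,\max})$ of equations 4.20 to 4.23 can be checked numerically to peak at $p_{1,\max}=0.5$ with value $\ln 2$:

```python
import math

def capacity_factor(p1, lam=1.0):
    """Common factor of equations 4.20 to 4.23 (positive, since both
    logarithms are negative for 0 < p1 < 1)."""
    return lam * math.log2(p1) * math.log(1.0 - p1)

# Maximum-entropy weights p1 = 0.5 give exactly ln 2 = 0.693... bit/synapse;
# the completion capacity carries an extra factor lam*(1 - lam) <= 1/4.
c_max = capacity_factor(0.5)              # = ln 2
c_u_max = 0.5 * (1.0 - 0.5) * c_max       # = (ln 2)/4, at lam = 0.5
```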

5  Simulation Experiments

This section evaluates and compares the mapping, completion, bidirectional, and pattern capacities of various model variants by means of storage and retrieval of randomly generated patterns. As in section 4, we assume that each stored pattern of size $n$ has exactly $k$ active units. For the block pattern algorithms, we use block patterns where each block of size $N=n/k$ has exactly one randomly drawn active unit (see Figure 1A). For the other, unconstrained algorithms, we use random patterns where the $k$ active units are drawn randomly from the $n$ neurons. All patterns are generated independently of each other. It is generally believed that patterns generated at random maximize stored information, but note that the information storage and completion capacities introduced in section 2 are practically not accessible to simulation studies (because all combinatorially possible patterns would have to be considered) and are also less relevant for practical applications. Unless otherwise specified, we use input patterns that have a subset $\lambda k$ of the active units of the original patterns but no additional active units ($\kappa k=0$; see section 2.7).$^{14}$ All theoretical estimates use a conservative value $r=1$ for the average improvement in the first retrieval step (see equation 4.7). Iterative retrieval was limited to a maximum of 10 iterations but could be stopped earlier if the output pattern was identical to that of the previous iteration or if an activity explosion was detected (number of active units becoming larger than $\max(1000,2k)$), which may occur in some model variants (e.g., IR-LK+) when the critical pattern capacity is exceeded. Each data point is obtained from averaging over 50,000 retrievals in 10 different networks (5000 per network).

5.1  Validation of Network Implementations

To validate our network implementation, we first tried to reproduce some reference results of previous work (Memis, 2015; Schwenker et al., 1996; Gripon & Berrou, 2011). Figure 3 shows output noise and storage capacity as a function of the number of stored memories $M$ for various model variants of a relatively small autoassociative network of $n=4096$ neurons and $k=16$ blocks or active units per pattern. We tested retrieval with half-input patterns having $k/2$ active units from the stored memory patterns ($\lambda=0.5$) to maximize completion capacity (see equation 4.23).

Figure 3:

Comparison of different learning and retrieval models for networks of size $n=4096$ where each activity pattern has $k=16$ active units. (A) Output noise $\hat\varepsilon$ as a function of stored memories $M$. Curves are shown for R1, IR-LK+, IR-KWTA, sIRB, IRB, IRB-R1, IRB-SMX, and IRB-cSMX as indicated by the key. Our data closely match the reference data REF_LK+ and REF_BLK from independent implementations of IR-LK+ and of block coding similar to IRB-R1 (Memis, 2015, tables 7.23, 7.24). (B) Stored information per synapse $C_u$ as a function of stored memories $M$. Stored information was computed bit-wise from equation A.4 for random patterns (R1, IR-LK+, IR-KWTA) and block-wise from equation A.11 for block patterns (sIRB, IRB, IRB-R1, IRB-SMX, IRB-cSMX). Neglecting the decreased information of block patterns (compared to random patterns) may significantly overestimate storage capacity (as demonstrated here by REF_BLK; see Figure 11).


Our data for IR-LK+ (red dashed line) and IRB-R1 (blue dash-dotted line) tightly reproduce the output noise data from the earlier works (big circle and triangle markers), thus validating our implementations of learning and retrieval. As reported previously, the block coding algorithms (sIRB, IRB, IRB-R1, IRB-SMX, IRB-cSMX) significantly reduce output noise compared to the models without block coding (R1, IR-LK+, IR-KWTA). Correspondingly, block coding can significantly increase the pattern capacity $M_\varepsilon$, defined as the maximum number of memories that can be stored at a tolerated noise level $\varepsilon$. Yet, despite this increase in pattern capacity, we could not observe a corresponding increase in maximum storage capacity $C_u$ for most block coding models. That is, IR-LK+ and IR-KWTA can store more information per synapse than sIRB, IRB, IRB-R1, and IRB-cSMX. Only the halo-type sum-of-max strategy IRB-SMX can slightly exceed the maximum capacity of the standard models, where peak capacities occur at relatively high output noise levels (Buckingham & Willshaw, 1992). The reason for this discrepancy is that each block pattern bears significantly less information than a random pattern without block coding (see Figure 11), which compensates for the increase in pattern capacity. Neglecting this fact (e.g., by computing $C_u$ using equation A.4 instead of equation A.11) can significantly overestimate the information storage capacity for block coding (blue triangles in Figure 3; compare to the blue dash-dotted curve of IRB-R1). We have also simulated the further block coding variants mentioned in the model section (e.g., combining various core and halo strategies), but they turned out to provide only minor improvements over IRB and IRB-R1 and could not exceed the performance of IRB-SMX and IRB-cSMX (data not shown).

5.2  Testing the Quality of Our Theory

Next, we tested the quality of our theory developed for the block coding model variants sIRB and IRB in section 4 and for the standard iterative or bidirectional retrieval procedures IR-LK+ and IR-KWTA in appendix B. To this end, Figure 4 shows output noise $\hat\varepsilon$ as a function of the stored memory number $M$ for various network sizes $n=kN$ using pattern activity $k=2\,\mathrm{ld}(N)$ and block size $N=2^{k/2}$ that are chosen optimally to maximize information storage capacity for half-input patterns ($\lambda=0.5$, $\kappa=0$). For each network size $n$, our theory (equations 4.10, A.24, B.8, and B.2) provides the maximal pattern number $M_{\max,th}$ that can be stored at a tolerated noise level $\varepsilon=0.01$ (see the legends). This value can be compared to the actual pattern capacity $M_{\max}$ at which the simulated output noise reaches level $\varepsilon$. To allow comparison over different network sizes, we have normalized the pattern number $M$ to the theoretical maximum $M_{\max,th}$.

Figure 4:

Output noise $\hat\varepsilon$ as a function of the normalized number of stored patterns $M/M_{\max,th}$ for IRB (top panels A, B), sIRB (middle panels C, D), and IR-KWTA (bottom panels E, F). Panels A, C, and E show results for heteroassociation (solid, $+$) as a reference and autoassociation for comparison (dotted, $\times$). Similarly, panels B, D, and F show results for autoassociation (solid, $+$) as reference and heteroassociation for comparison (dotted, $\times$). Reference pattern capacity $M_{\max,th}$ at noise level $\varepsilon=0.01$ is estimated by our theory (equations 4.10 and A.24 for sIRB/IRB and equations B.8 and B.2 for IR-KWTA). Other parameters are $r=1$, $k=2\,\mathrm{ld}(N)$, $N=2^{k/2}$, $\lambda=0.5$, $\kappa=0$.


It can be seen that the curves of $\hat\varepsilon$ as a function of $M/M_{\max,th}$ converge to the Heaviside function predicted by our theory, equation 4.18. Note that for $M\to\infty$, IRB and sIRB have limited output noise $\hat\varepsilon\to\varepsilon_\infty\le 1$, because retrieval outputs are always subsets (“cores”) of the original memory patterns, with OR-ing employed in each iteration step ($\varepsilon_\infty=1$ for heteroassociative sIRB; $\varepsilon_\infty=0.75$ for heteroassociative IRB; $\varepsilon_\infty=0.5$ for both autoassociative IRB and sIRB due to the implicit OR-ing mentioned in section 3). By contrast, $\hat\varepsilon\to\infty$ for IR-KWTA.

The accuracy $M_{\max}/M_{\max,th}$ of our theory generally improves with network size $n$. For example, for heteroassociative sIRB and $k=10,12,16,20,24,28$, corresponding to network sizes $n=kN=k2^{k/2}$, the estimated accuracy increases as $71\%, 80\%, 89\%, 92\%, 94\%$, and $95\%$. Similarly, for autoassociative sIRB and $k=10,\ldots,28$, accuracy increases as $57\%,\ldots,93\%$. Interestingly, the pattern capacity for autoassociation is significantly higher than for heteroassociation. It can be seen that our theory provides lower bounds of $M_{\max}$ for both heteroassociation and autoassociation and for the relevant network sizes. Additionally, for heteroassociation, we can use the theory for autoassociation to get upper bounds of $M_{\max}$. For heteroassociative IR-KWTA, the theory of appendix B slightly underestimates the true storage capacity for large networks but still converges in the limit $n\to\infty$.

5.3  Comparing Capacities for Different Model Variants

Next, we performed a direct comparison of critical pattern capacity $M_\varepsilon$ and information storage capacity $C_\varepsilon$ (in bit/synapse) for different model variants at output noise level $\varepsilon=0.01$. We considered autoassociation and heteroassociation in both small and large networks ($n=16\cdot 256=4096$ versus $n=22\cdot 2048=45{,}056$) and tested over the whole range of relevant pattern activities $k$ between 4 and $n/2$ that allowed a division of the $n$ neurons into $k$ blocks of size $N=n/k$ (specifically, we tested $k=4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 5632, 11{,}264, 22{,}528$). The results are summarized in Figures 5 ($M_\varepsilon$) and 6 ($C_\varepsilon$). Simulations (markers) are again compared to theory (lines). Theoretical values for maximal $M_\varepsilon$ were computed as described in the previous section (using equations B.1 and B.2 for R1). $C_\varepsilon$ was computed from $M_\varepsilon$ using equation A.4 for random coding (R1, IR-KWTA, IR-LK+) and equations A.13 to A.15 for block coding (IRB, IRB-SMX).

Figure 5:

Critical pattern capacity $M_{\max}$ (or $M_\varepsilon$) as a function of the number of active neurons $k$ for different retrieval algorithms for block patterns (IRB, IRB-SMX) and random patterns (R1, IR-KWTA) at output noise level $\varepsilon=0.01$. Left panels correspond to small networks ($n=4096$) and right panels to large networks ($n=45{,}056$). Top panels show data for autoassociation using half-input patterns ($\lambda=0.5$, $\kappa=0$). Middle and bottom panels correspond to heteroassociation with $\lambda=0.5$ and $\lambda=1$, respectively. Lines correspond to theory, whereas markers correspond to simulations (see the keys).

Figure 6:

Information storage capacity $Cε$ as a function of active neurons $k$ per pattern for small ($n=4096$, left panels) and large networks ($n=45,056$, right panels) at noise level $ε=0.01$. Data correspond to the same experiments as Figure 5. For autoassociation (top), $Cε$ corresponds to completion capacity $Cu$. For heteroassociation (middle, $λ=0.5$; bottom, $λ=1$), $Cε$ corresponds to either mapping capacity $Cv$ (R1) or bidirectional capacity $Cuv$ (IR-KWTA,IRB,IRB-SMX).

In general, theory again fits the simulations well. In particular, the theory for one-step retrieval (R1) precisely predicts the true capacities already for relatively small networks unless pattern activities $k$ become large (Knoblauch, 2008c; Knoblauch et al., 2010). For small $k$, the theory slightly underestimates the true values, whereas for large $k$, it significantly overestimates both $Mε$ and $Cε$. Compared to R1, iterative retrieval procedures like IR-KWTA and IRB can significantly increase storage capacity if the initial inputs are incomplete ($λ<0.5$; top and middle panels). For very sparse coding with $k≤ldn$, this increase can be more than an order of magnitude. For large $k→n$, storage capacity becomes very small, and all models tend to identical pattern capacity $Mε$. Still, information storage capacity $Cε$ is larger for models employing random patterns (IR-LK+, IR-KWTA, R1) because the information per block pattern is significantly decreased compared to random patterns. For medium $k$, the block pattern models IRB and IRB-SMX can store significantly more patterns but less information than IR-KWTA. For extremely small activity $k$, IRB can store fewer patterns than IR-KWTA, whereas IRB-SMX has a slightly larger pattern capacity. Among the block pattern models, IRB-SMX has significantly higher storage capacity than IRB only for small and medium $k$, not for large $k$.

As expected, autoassociation can store more patterns than heteroassociation for most $k$. However, surprisingly, for very sparse patterns (e.g., $k=4$), heteroassociation performs better than autoassociation. For example, for IR-KWTA with $n=45,056$ and $k=4$, heteroassociation can store $Mmax≈1.45·10^6$ patterns, whereas autoassociation can store only $0.780·10^6$ patterns. Similarly, heteroassociative IRB and IRB-SMX can store $Mmax≈0.445·10^6$ and $Mmax≈1.49·10^6$ patterns, whereas autoassociative IRB and IRB-SMX can store only $Mmax≈0.437·10^6$ and $Mmax≈0.878·10^6$ patterns, respectively.

Note that the fit of theory to data depends on the choice of parameter $r$, defined as the average improvement of the first retrieval step (see equations 4.7, 4.8, and A.22). We have chosen $r=1$, which obviously implies upper bounds for storage capacity. For larger $r$, our theory predicts lower capacities. For example, for $r=0.25k$, the theory curves are shifted toward smaller values, and we get better fits, in particular for larger $k$ (data not shown). This is consistent with the finding that all models become equivalent to R1 for large $k$, because then the first retrieval step will typically complete the pattern almost perfectly ($r=(1-λ)k$).

Another issue is how to measure output noise $ε^$ for heteroassociative bidirectional retrieval, that is, how to weigh retrieval errors in the two populations $u$ and $v$. For example, for IRB, $u$ will generally have less noise than $v$ because the $λk$ correct one-entries in the input pattern are preserved over iterations by the OR-ing. Therefore we have used a weighted average,
$ε^weighted=\frac{(1-λ)ε^u+ε^v}{2-λ},$
(5.1)
of the noise contributions $ε^u$ and $ε^v$ in the input and output populations, taking into account only unresolved input blocks. For IR-KWTA and IR-LK+, noise typically distributes much more uniformly among populations $u$ and $v$ such that simple averaging by $ε^simple=\frac{ε^u+ε^v}{2}$ yields equivalent results (although it may occur that very small and very large patterns alternate during iterations). These different methods to measure output noise have only a minor effect for large networks, large input noise (e.g., $λ=0.5$), and large activity $k$, but they can be significant for small networks, low input noise (e.g., $λ=1$), and low activity. For example, for $n=4096$, $k=4$, $λ=0.5$, we obtained pattern capacities $Mmax≈19,152$, $10,177$, and $14,228$ for IR-KWTA, IRB, and IRB-SMX with simple averaging, but $Mmax≈17,264$, $9224$, and $12,668$ with weighted averaging. For larger $k$ or larger $n$, the differences were much smaller. However, for larger $λ$, the differences can be more significant.
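The two averaging schemes can be sketched in a few lines (a minimal illustration with our own variable and function names, not the original implementation):

```python
def weighted_output_noise(eps_u: float, eps_v: float, lam: float) -> float:
    """Weighted average of output noise as in equation 5.1: the input
    population u gets weight (1 - lam) because the lam*k correct
    one-entries are preserved by the OR-ing anyway."""
    return ((1.0 - lam) * eps_u + eps_v) / (2.0 - lam)


def simple_output_noise(eps_u: float, eps_v: float) -> float:
    """Unweighted average of the noise in populations u and v."""
    return (eps_u + eps_v) / 2.0
```

For perfect inputs ($λ=1$, $ε^u=0$), the weighted average reduces to $ε^v$, whereas simple averaging halves it to $ε^v/2$, which explains the seemingly doubled noise tolerance of IRB discussed below.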

For complete input patterns with $λ=1$ (bottom panels), block coding and iterative retrieval methods cannot improve over one-step retrieval R1. In fact, without stabilizing the input activity in population $u$ (e.g., by OR-ing or AND-ing), iterative retrieval will typically deteriorate retrieval outputs at the capacity limit, because fixed points occur at nonzero noise (see Figure 2). For a small output noise level $ε=0.01$, this deterioration is negligible, as can be seen in our data. Here, all models perform almost identically to one-step retrieval. Only when measuring output noise by simple (instead of weighted) averaging may the storage capacity seem to differ significantly for the IRB-type models retrieving core patterns. For example, for large $n=45,056$, at $k=4$ weighted averaging yields $Mmax≈4.01·10^6$, $4.01·10^6$, $3.93·10^6$, and $3.96·10^6$ for R1, IR-KWTA, IRB, and IRB-SMX, whereas simple averaging would predict $Mmax≈4.01·10^6$, $3.97·10^6$, $4.70·10^6$, and $3.96·10^6$ instead (data not shown). Here the seemingly large value for IRB results from simple averaging $ε^=\frac{ε^u+ε^v}{2}=\frac{ε^v}{2}$ because OR-ing preserves perfect inputs ($ε^u=0$) and thus tolerates double the noise $ε^v≤2ε=0.02$ in the output population compared to R1 (and the other models with balanced noise).

5.4  Gains in $M$⁠, $C$⁠, and $p1$ for Block Coding and Iterative Retrieval

The previous figures show capacities $M$ and $C$ on a logarithmic scale. To get more precise quantitative judgments of the improvements of block coding and iterative or bidirectional retrieval over one-step retrieval, Figures 7 and 8 illustrate the gain factors $gM$, $gC$, and $gp1$ defined in equations B.4 to B.9 and compare theoretical results (lines) to simulation data (markers). Specifically, we compare block coding (IRB), iterative retrieval for random coding (IR-KWTA), and one-step retrieval for random coding (R1) by taking the quotients of the relevant quantities pattern capacity $M$, information storage capacity $C$, and matrix load $p1$ (= fraction of one-entries in the weight matrix). For example, for the comparison IRB versus R1 (top panels), $gM=Mmax,IRB/Mmax,R1$ is the quotient of the maximal pattern capacity for IRB divided by that of R1. Similarly, we have compared IRB versus IR-KWTA (bottom panels).

Figure 7:

Gain factors of autoassociation as a function of active neurons $k$ per pattern for small networks ($n=4096$, left panels) and large networks ($n=45,056$, right panels) using half-input patterns ($λ=0.5$, $κ=0$). Gain factors compare pattern capacity ($gM$), information storage capacity ($gC$), and matrix load ($gp1$). Top panels compare IRB versus R1 according to the definitions in equations B.4 to B.6 and compare theoretical results (lines) to simulations (markers). Similarly, bottom panels compare IRB versus IR-KWTA using equations B.7 to B.9.

Figure 8:

Gain factors for heteroassociation as a function of active neurons $k$ per pattern (otherwise, settings are the same as in Figure 7).

Figure 7 illustrates the gain factors for autoassociation in small and large networks. The most significant increase of IRB over R1 (top panels) occurs for small $k$ (i.e., for sparse and balanced potentiation with $p1≤0.5$), where $M$ and $C$ for IRB may be more than double the values for R1, whereas the gains approach 1 for large $k$ (corresponding to dense potentiation with $p1→1$) or become even smaller (in case of $gC$ due to the smaller information content of block patterns).

It is also visible that the increase in $M$ and $C$ for sparse potentiation ($p1<0.5$, $k<ldn$) implies an increase in the matrix load $p1$. We have argued previously that such sparsely potentiated networks are efficient only for structural compression of the weight matrix, for example, by Huffman/Golomb coding or structural plasticity (Knoblauch, 2003a; Knoblauch et al., 2010). This means that the increase in $M$ will be counteracted by the weight matrix becoming less compressible. For dense potentiation ($p1>0.5$, $k>ldn$), both effects work in the same direction. The (modest) increase in $M$ will further increase the matrix load $p1→1$ such that the network will become even more compressible, rendering networks that can store more memory patterns while requiring less physical memory to represent the memories.
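The interplay of pattern number, matrix load, and compressibility can be sketched using the standard Willshaw approximation $p1=1-(1-k^2/n^2)^M$ for the matrix load after storing $M$ random $k$-out-of-$n$ patterns (autoassociation) and the Shannon entropy $I(p1)$ as a bound on the compressibility of the binary weight matrix; the function names are ours:

```python
import math


def matrix_load(M: int, k: int, n: int) -> float:
    """Expected fraction p1 of potentiated (1-)entries in the Willshaw
    weight matrix after storing M random k-out-of-n patterns, using the
    standard approximation p1 = 1 - (1 - k^2/n^2)^M."""
    return 1.0 - (1.0 - (k / n) ** 2) ** M


def entropy_per_synapse(p1: float) -> float:
    """Shannon entropy I(p1) in bit per synapse: an upper bound on how
    well the binary weight matrix can be compressed (e.g., Huffman/Golomb)."""
    if p1 <= 0.0 or p1 >= 1.0:
        return 0.0
    return -p1 * math.log2(p1) - (1.0 - p1) * math.log2(1.0 - p1)
```

Past $p1=0.5$, storing more patterns drives $p1$ toward 1, so the entropy per synapse decreases again and the matrix becomes more compressible, as argued above for dense potentiation.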

Here, our main interest is in quantifying the gains of block coding. Comparing the two iterative procedures IRB and IR-KWTA (middle panels), we observe that IRB can store more patterns $M$ than IR-KWTA only for balanced or dense potentiation $p1≥0.5$. For sparse potentiation $p1≪0.5$, IRB does not improve $M$ over IR-KWTA. It is again visible that IRB cannot improve $C$ over IR-KWTA due to the reduced information content of block patterns (see Figure 11). Comparing IR-KWTA versus R1 (bottom panels) yields similar gains as IRB versus R1. This indicates that a significant part of the gain of IRB over R1 can be credited to the iterative retrieval procedure of IRB rather than to block coding.

Figure 8 shows corresponding data for heteroassociation, largely confirming the results for autoassociation. The theory for heteroassociation is even more precise than for autoassociation. The fit of theory to simulations could be further improved by selecting more appropriate values than $r=1$ for average improvement in the first retrieval step (see equation 4.7; data not shown).

5.5  Asymptotic and Maximal Information Storage Capacity

Asymptotic information storage capacities have long been known to be $Cv=ln2≈0.69$ bit per synapse for heteroassociation and $Cu=(ln2)/4≈0.173$ bit per synapse for autoassociative pattern completion (see equations 4.21 to 4.23). While it had earlier been assumed that the maximal storage capacity would be identical to the asymptotic capacity (e.g., Willshaw et al., 1969; Palm, 1980), later studies observed in simulation experiments that the completion capacity of finite autoassociative networks can actually exceed the asymptotic limit and continues to increase for viable network sizes (Schwenker et al., 1996). Therefore, one may question the theoretical bounds or even assume that the asymptotic limit for “optimal retrieval” in autoassociative networks may be larger than $(ln2)/4$. To clarify these questions and find the maximum capacity, we simulated iterative pattern completion in very large Willshaw networks having up to $n≈2.1·10^6$ neurons and up to $n^2≈4.4·10^{12}$ synapses.

Some results are displayed in Figure 9, showing capacity data for autoassociative networks with input noise and pattern activity chosen optimally to maximize completion capacity $Cu$, that is, $λ=0.5$ and $k=2\,ld(n/k)$ (see equation 4.23). For block coding, this choice corresponds to block size $N=2^{k/2}$ and network size $n=kN=k·2^{k/2}$. Each curve shows $Cu$ as a function of $M$ normalized to the maximum pattern number (at noise level $ε=0.01$) estimated by our theory, similar to Figure 4. The insets show enlarged plots of the maximum region for each model variant. For example, IR-LK+ (panel C) reaches maximum capacity $Cu≈0.191$ bit per synapse for a network size of $n=98,304$ (and $k=24$, $N=4096$). The next smaller simulated network with $n=45,056$ (and $k=22$, $N=2048$) achieves almost the same value, such that the true maximum likely lies between 50,000 and 100,000 neurons. Although the maximum seems rather flat, it is nevertheless remarkable that it occurs at about the same size as a cortical macrocolumn of volume 1 mm$^3$ (about $n=10^5$ neurons; Braitenberg & Schüz, 1991), for which Willshaw networks are often used as generic models (Palm, 1982; Palm, Knoblauch, Hauser, & Schüz, 2014; Knoblauch & Sommer, 2016). For larger $n>10^5$, information capacity $Cu$ decreases again toward the asymptotic value $Cu→(ln2)/4≈0.173$ bps. Similar results are visible for model variant IR-KWTA (panel B), which achieves its maximum $Cu≈0.200$ more unequivocally at the larger network size $n=98,304$.
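The optimal scaling used here ($k=2\,ld(n/k)$, i.e., $N=2^{k/2}$ and $n=k·2^{k/2}$) reproduces exactly the simulated network sizes; a quick sketch (the helper name is ours):

```python
def optimal_network_size(k: int) -> tuple:
    """For even pattern activity k, return the block size N = 2**(k/2)
    and the network size n = k*N, so that k = 2*ld(n/k) holds."""
    assert k % 2 == 0, "k must be even so that N = 2**(k/2) is an integer"
    N = 2 ** (k // 2)
    return N, k * N


# k = 22 -> n = 45,056;  k = 24 -> n = 98,304;
# k = 30 -> n = 983,040; k = 32 -> n = 2,097,152
```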

Figure 9:

Asymptotic completion capacity $Cu$ in bit per synapse (panels A–D). Each curve shows $Cu$ as a function of the normalized number of stored patterns $M/Mmax,th$, where $Mmax,th$ is the theoretical pattern capacity at output noise level $ε=0.01$. Different curves correspond to different network sizes $n=kN$, as indicated in the key. Pattern activity $k=2\,ld(N)$ and block size $N=2^{k/2}$ are chosen optimally to maximize information storage capacity for half-input patterns ($λ=0.5$, $κ=0$). Theoretical pattern capacity $Mmax,th$ is given in the key for each network size. Different panels correspond to IRB (A), IR-KWTA (B), IR-LK+ (C; using a maximal threshold to activate all units of a random memory pattern, without block coding), and R1 (D). Panels E and F show the corresponding output noise $ε^$ for IR-LK+ and R1 (for IRB and IR-KWTA, see Figure 5).

Block coding with IRB (panel A) seems to reach its maximum capacity $Cu≈0.177$ at a larger network size around $n=983,040$ (and $k=30$, $N=32,768$). As the experimental results were very close ($Cu=0.1767$ for $n=30·32,768=983,040$; $Cu=0.1763$ for $n=32·65,536=2,097,152$) and we could not simulate networks larger than $n=2,097,152$ due to hardware limitations, it may also be that the maximum occurs at slightly larger $n$. For one-step retrieval R1 (panel D), we could not observe the maximum capacity for viable network sizes, but the performance of R1 will obviously be bounded by that of IR-KWTA and IR-LK+.

Figure 10 shows maximum capacity as a function of network size for the various models (data correspond to Figure 9). The data support the conclusion that all models have the asymptotic capacity $Cu=(ln2)/4$, including IRB and IRB-SMX, where maximum capacity occurs at a finite network size comparable to a cortical macrocolumn (Braitenberg & Schüz, 1991). The maximum capacity $Cu≈0.200$ bps for standard iterative retrieval is obtained for the $k$-winners-take-all strategy (IR-KWTA) and occurs around $n≈10^5$. This value is slightly exceeded by block-coding retrieval with the sum-of-max strategy (IRB-SMX), reaching $Cu≈0.204$ bps for $n$ between 50,000 and 100,000.

Figure 10:

Information storage capacity $C$ as a function of network size $n$ for different network and retrieval models. (A) Results for autoassociation assuming optimal noise $λ=0.5$ and corresponding pattern activity with $n=k·2^{k/2}$. Lines correspond to theoretical curves of R1 (black), IR (red), and IRB (blue) assuming $r=1$. Markers correspond to simulation experiments. (B) Corresponding results for heteroassociation assuming optimal $λ=1.0$ (i.e., full input patterns) and corresponding pattern activity with $n=k·2^k$ (solid lines). For comparison, dotted lines show results for heteroassociation optimized for half-input patterns ($λ=0.5$, $n=k·2^{k/2}$).

6  Discussion and Conclusion

Motivated by previous promising results (Gripon & Berrou, 2011; Gripon & Rabbat, 2013; Aliabadi et al., 2014; Aboudib et al., 2014; Ferro et al., 2016; Memis, 2015), we have investigated how block coding can improve retrieval quality and memory capacity for Willshaw-type associative networks employing iterative or bidirectional retrieval (Willshaw et al., 1969; Palm, 1980; Schwenker et al., 1996; Sommer & Palm, 1998, 1999). To this end we have analyzed a number of different network and retrieval models and validated our theory by simulation experiments.

For many practical applications of NAMs, the asymptotic results (which actually are unaffected by the recent developments) have to be complemented by concrete optimizations of network parameters and retrieval procedures for large, finite memories. For this purpose, the use of randomly generated patterns as a benchmark is well established for several reasons. First, independent random patterns seem the simplest and most generic assumption, allowing comparison to a large body of previous analyses and simulation experiments of various NAM models. Second, random patterns are thought to be optimal to maximize pattern and information capacity defined in section 2, thereby providing an upper bound for real-world applications. Third, there are various recognition architectures that actually employ NAM mappings with random patterns (e.g., Palm, 1982; Kanerva, 1988; Knoblauch, 2012). Fourth, it is known that activity and connectivity patterns of various brain structures that are thought to work as NAM have random character (Braitenberg, 1978; Braitenberg & Schüz, 1991; Rolls, 1996; Albus, 1971; Bogacz, Brown, & Giraud-Carrier, 2001; Laurent, 2002; Pulvermüller, 2003; Lansner, 2009). In many real applications, the patterns to be stored will of course not be random, and they have to be coded into sparse binary patterns fitting the parameters of optimal or near-optimal NAM configurations (Palm, 1987a; Austin & Stonham, 1987; Krikelis & Weems, 1997; Hodge & Austin, 2003; Palm, 2013; Sa-Couto & Wichert, in press).

Here we adapted a previous finite-size analysis of one-step retrieval (Knoblauch et al., 2010) to iterative and bidirectional retrieval with and without block coding (see section 4 and appendix B). In contrast to the previous analyses, our theory allows not only estimating retrieval errors for a given network size and memory number, but also directly computing the pattern capacity $Mmax$ and the information storage capacity $Cmax$ (in bit/synapse) for a given network and tolerated noise level. Although the theory becomes exact only in the limit of large networks, it already provides reasonable approximations for finite networks and captures most effects that can be seen in simulation experiments when comparing the different network and retrieval models.

The most important finding is that block coding can significantly increase $M$ but not $C$. As the error-correcting capability of block coding reduces output noise, it is possible to store more memories $M$ at a maximal tolerated noise level $ε$. This increase is actually strongest for pattern activities where the Willshaw model is most efficient, that is, for patterns having about $k∼ldn$ active units, and is in the range between 10% and 20% for relevant network sizes (see Figures 7C, 7D, 8C, and 8D). By contrast, it is not possible to significantly increase the information $C$ that a synapse can store. The main reason is that a block pattern contains less information than a random pattern (see Figure 11 and section A.1). Thus, unfortunately, the increase in pattern number $M$ is mostly compensated by the decrease in pattern information such that the resulting $C$ typically decreases by about 10% to 20% for block coding (see Figures 7 and 8). Only the optimal “sum-of-max” block-coding strategy IRB-SMX can increase $C$ by a few percent (see Figures 3 and 10).
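The reduced information per block pattern can be checked directly: a noiseless block pattern carries $k\,ld(N)$ bits (one of $N$ units per block, $k$ blocks), whereas a $k$-out-of-$n$ random pattern carries $ld\binom{n}{k}$ bits (cf. section A.1). A small sketch with our own function names:

```python
import math


def info_block(k: int, N: int) -> float:
    """Information of a noiseless block pattern: k blocks, one of N units each."""
    return k * math.log2(N)


def info_krnd(n: int, k: int) -> float:
    """Information of a random pattern with exactly k of n units active."""
    return math.log2(math.comb(n, k))
```

For $n=4096$ and $k=16$ (so $N=256$), a block pattern carries $16·8=128$ bits versus roughly 147.7 bits for a random pattern, a reduction on the order of the 10% to 20% cited above.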

Figure 11:

Comparison of the (trans-)information per memory pattern for different types of random patterns $u$ (pRND = independent components with $pr[ui=1]=p:=k/n$; kRND = $k$ out of $n$ active units; BLK = block coding with $k$ blocks). (A) Information ratios I(kRND)/I(pRND) (red) and I(BLK)/I(pRND) (blue) for original patterns without noise ($λ=1$, $κ=0$) as a function of pattern activity $k$ for different network sizes $n=4096$ (solid line), $n=45,056$ (dashed), and $n=458,752$ (dotted). (B) Ratios T(kRND)/T(pRND) (red) and T(BLK)/T(pRND) (blue) of the transinformation between original patterns $u$ and noisy patterns $u˜$ with completeness $λ=1$ and add noise $κ=1$ (meaning each pattern has $λk=k$ correct and $κk=k$ false one-entries). (C) Similar to panel B, but for $λ=0.5$, $κ=0$. (D) Similar to panels B and C, but for $λ=0.5$, $κ=0.5$. Data were computed as follows: T(pRND) was computed from equation A.4 for $p01=κk/(n-k)$, $p10=1-λ$; T(kRND) $=ld\binom{n}{k}-ld\binom{n-z}{k-c}-ld\binom{z}{f}$ for $c:=λk$, $f:=κk$, $z:=c+f$; T(BLK_ufn) $=c(ld(N)-ld(1+f/k))+(k-c)(ld(N)-ld(N-f/k))$; T(BLK_noh) $=c(ld(N)-ld(1+f/c))$; T(BLK_nom) $=c\,ld(N)+(k-c)(ld(N)-ld(N-f/(k-c)))$. The three variants of BLK refer to “uniform noise” (ufn; the $f$ false units are assumed to distribute uniformly among block hits and misses), “noise on hits” (noh; $f$ false units only on block hits, where the correct block neuron is active), and “noise on misses” (nom; $f$ false units only on block misses, where the correct block neuron is inactive). As the true distribution of correct and false units per block is difficult to compute (and depends on the retrieval algorithm), these three variants can be seen as extreme cases. In panels A to C, the three variants produce identical results. They differ significantly only for mixed noise with both $λ<1$ and $κ>0$.

While our theory is valid for a standard iterative retrieval procedure for block coding (IRB), we have simulated a number of further optimized retrieval variants (see section 3). Although these variants can further increase pattern capacity by 10% to 20% compared to IRB, they can rarely exceed the traditional sparse coding models in terms of stored information per synapse. Other drawbacks are the relatively complicated implementations that are difficult to interpret neurobiologically and consume more computing time. Still, they may offer useful applications for fast, approximate nearest-neighbor search as suggested previously for associative networks (Palm, 1987b; Bentz et al., 1989; Hodge & Austin, 2005; Knoblauch, 2005, 2007, 2008a; Knoblauch et al., 2010; Knoblauch, 2012; Sacramento, Burnay, & Wichert, 2012; Ferro et al., 2016; Gripon, Löwe, & Vermet, 2018).

Another result of this study is a better understanding of iterative retrieval at finite network size. For example, although the asymptotic capacity of one-step retrieval has long been known to be $ln2≈0.69$ and $(ln2)/4≈0.173$ for heteroassociation and autoassociation (Willshaw et al., 1969; Palm, 1980; Palm & Sommer, 1992, 1996), simulations of iterative retrieval revealed that the storage capacity of finite networks increases beyond these values (Schwenker et al., 1996). Therefore, one might question these asymptotic values, in particular for iterative retrieval in autoassociation, where the theoretical analysis is still incomplete and it might be possible to get close to the upper bound of $(ln2)/2$ derived by Palm (1980) and in note 2. Here we explain at least the finite-size effects of iterated one-step retrieval. In simulations of very large networks (with up to $n>2·10^6$ cells), we have shown that the autoassociative completion capacity has its global maximum at around $C≈0.20$ bit per synapse for a network size between $n=50,000$ and $n=100,000$. Remarkably, this coincides with the generic size of a cortical macrocolumn, for which the Willshaw model has often been used as a model (Braitenberg & Schüz, 1991; Palm, 1982; Palm et al., 2014; Knoblauch & Sommer, 2016). Together with the well-known empirical fact that small networks ($n<1000$) have a much smaller capacity, our theory can easily explain the phenomenon of a unique global maximum. It predicts for noisy input patterns (with $λ<1$) that maximal completion capacity $Cmax(n)$ is a decreasing function for large network size $n→∞$, where the theory becomes exact. By contrast, in networks with close to zero input noise ($λ≈1$) that would be optimal for heteroassociation, the asymptotic capacity is approached from below.

Networks employing simple block coding (IRB) can store more patterns, although they have a slightly lower maximal capacity $C≈0.18$ for larger networks around $n=10^6$. Otherwise, they seem to behave qualitatively similarly to IR-KWTA and IR-LK+. For optimized retrieval with the sum-of-max strategy (IRB-SMX), the maximum capacity 0.204 is even slightly larger than for IR-KWTA and occurs at a slightly smaller network size (around $n=50,000$). We still cannot exclude that there exist even better retrieval algorithms exceeding this maximum (Palm, 1980). For example, in some modified memory tasks like familiarity detection, it is known that autoassociative networks can be used in a way to achieve up to $(ln2)/2≈0.347$ bit/synapse (Bogacz et al., 2001; Bogacz & Brown, 2003; Palm & Sommer, 1996). At least for the pattern completion task, this seems impossible, and if such retrieval algorithms existed, we believe they would be computationally very inefficient compared to one-step or iterative retrieval (Knoblauch et al., 2010).

We conclude that block coding in finite networks can, in some parameter ranges, modestly increase pattern storage capacity, whereas the improvement is often negligible (or even absent) when measured as stored information per synapse. Asymptotically, block coding has the same limits as random coding. The higher pattern storage capacity and error-correcting capabilities may render block coding networks (in particular, IRB-SMX) better suited for applications such as a fast nearest-neighbor search for object classification than previous approaches (Ferro et al., 2016; Knoblauch et al., 2010).

In this study we have focused on balanced potentiation, where the fraction of potentiated synapses $p1≈0.5$ maximizes the entropy of the weight matrix. In the future, it may be interesting to investigate the minimal entropy regimes of sparse potentiation with $p1→0$ and dense potentiation with $p1→1$. In the minimal entropy regimes, the weight matrix is compressible such that very efficient network implementations are possible (Knoblauch et al., 2010; Bentz et al., 1989). In particular, dense potentiation has previously been identified as most promising for applications if implemented with inhibitory networks (Knoblauch, 2007, 2008a, 2012). Here the unpotentiated “silent” synapses with weight 0 are replaced by inhibitory synapses with weight $-1$, whereas the potentiated synapses with weight 1 can be pruned. Then block coding could further boost information efficiency because even a small increase of $p1$ toward 1 may significantly decrease the number of remaining “silent” synapses that must be represented in an inhibitory network. Another interesting question that should be addressed in future work is whether our observation that maximal capacity occurs at the size of cortical macrocolumns holds as well for neuroanatomically more realistic conditions such as sparsely connected networks and the involvement of structural plasticity (Knoblauch & Sommer, 2016; Knoblauch, 2017).

Appendix A:  Analysis of Block Coding

A.1  Transinformation and Channel Capacity for Block Coding

We can interpret storage and retrieval in associative networks as sending pattern vectors $vμ$ over a memory channel and receiving output patterns $v^μ$ (Shannon & Weaver, 1949; Cover & Thomas, 1991).

For random coding with independent pattern components (pRND), this corresponds to a bit-wise transmission over the channel, and it is therefore sufficient to consider binary random variables $X=v_i^\mu\in\{0,1\}$ and $Y=\hat v_i^\mu\in\{0,1\}$. For $p:=\mathrm{pr}[X=1]$, the information $I(X)$ equals (Shannon & Weaver, 1949)
$I(p):=-p\cdot\mathrm{ld}\,p-(1-p)\cdot\mathrm{ld}(1-p)\approx\begin{cases}-p\cdot\mathrm{ld}\,p, & p\ll 0.5\\ -(1-p)\cdot\mathrm{ld}(1-p), & 1-p\ll 0.5.\end{cases}$
(A.1)
If transmission errors occur independently, the binary channel is determined by the two error probabilities $p01:=pr[Y=1|X=0]$ and $p10:=pr[Y=0|X=1]$, and we can write
$I(Y)=I_Y(p,p_{01},p_{10}):=I\big(p(1-p_{10})+(1-p)p_{01}\big),$
(A.2)
$I(Y|X)=I_{Y|X}(p,p_{01},p_{10}):=p\cdot I(p_{10})+(1-p)\cdot I(p_{01}),$
(A.3)
$T(X;Y)=T(p,p_{01},p_{10}):=I_Y(p,p_{01},p_{10})-I_{Y|X}(p,p_{01},p_{10}),$
(A.4)
where $T(X;Y)$ is the transinformation or mutual information, that is, the information about $X$ contained in $Y$ (and vice versa). The transinformation for one pattern vector is then simply $nT(X;Y)$, and the stored information of the whole pattern set is $MnT(X;Y)$. For low output noise $\hat\varepsilon\ll 1$, it is $T(X;Y)\approx I(p)$. (For further details, see Knoblauch, 2009, appendix E.) For a random pattern $u$ having exactly $k$ out of $n$ active units (kRND), the transinformation to a noisy pattern $\tilde u$ with $c$ correct active units from $u$ and $f$ additional false active units would be
$T(n,k,c,f):=\mathrm{ld}\binom{n}{k}-\mathrm{ld}\binom{n-c-f}{k-c}-\mathrm{ld}\binom{c+f}{f}.$
(A.5)
For simplicity, we have generally estimated storage capacity in simulations with random coding by equation A.4 instead of equation A.5. This is justified, as Figure 11 shows that pRND and kRND yield virtually identical results.
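The binary channel quantities of equations A.1 to A.4 are straightforward to evaluate numerically. The following Python sketch (function names are our own, not part of the simulations described here) computes $I(p)$ and $T(p,p_{01},p_{10})$:

```python
import math

def entropy(p):
    """Binary entropy I(p) in bit (equation A.1); ld denotes log2."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def transinformation(p, p01, p10):
    """Transinformation T(p, p01, p10) of the binary channel (equation A.4)."""
    q = p * (1 - p10) + (1 - p) * p01                        # pr[Y=1], equation A.2
    i_y = entropy(q)                                          # I(Y)
    i_y_given_x = p * entropy(p10) + (1 - p) * entropy(p01)   # I(Y|X), equation A.3
    return i_y - i_y_given_x
```

For a noiseless channel ($p_{01}=p_{10}=0$), `transinformation(p, 0, 0)` equals `entropy(p)`, consistent with $T(X;Y)=I(p)$ for low output noise.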
For block coding, individual bits of a pattern vector $v^\mu$ are not independent. It is therefore more adequate to consider the transmission of blocks $X,Y\in\{0,1\}^N$. At the input side, each block $X$ has a single one-entry; therefore, $I(X)=\mathrm{ld}\,N$, assuming a uniform distribution. On the output side, $Y$ may have an arbitrary number of one-entries $|Y|\in\{0,1,2,\ldots,N\}$. However, by construction of the pattern part retrieval algorithms R1B and IRB, either $|Y|=0$ or $Y$ will be a superset of $X$. Thus, reconstructing $X$ from $Y$ requires selecting one of the $|Y|$ one-entries of $Y$, and the conditional information of $X$ given $Y$ for fixed $|Y|$ is
$I(X\mid Y,|Y|)=\begin{cases}\mathrm{ld}\,|Y|, & 1\le|Y|\le N\\ \mathrm{ld}\,N, & |Y|=0.\end{cases}$
(A.6)
For given $|Y|$, the conditional transinformation $T(X;Y\mid |Y|):=I(X)-I(X\mid Y,|Y|)$ is therefore $\mathrm{ld}(N)-\mathrm{ld}\,|Y|$ for $1\le|Y|\le N$, and 0 for $|Y|=0$. As the channel is defined by the distribution of output block activity $|Y|$,
$p_b(z):=\mathrm{pr}[|Y|=z],$
(A.7)
the transinformation for transmitting one block over the channel is thus obtained by averaging over all $|Y|$:
$T(X;Y)=I(X)-I(X|Y)=\sum_{z=1}^{N}p_b(z)\big(\mathrm{ld}(N)-\mathrm{ld}(z)\big)\ge p_b(1)\,\mathrm{ld}\,N.$
(A.8)
The transinformation for one pattern vector is then $kT(X;Y)$, and the stored information of the whole pattern set is $MkT(X;Y)$. For low output noise with $\hat\varepsilon\ll 1$ and $p_b(1)\approx 1$, it is $T(X;Y)\approx\mathrm{ld}\,N$.
For general retrieval scenarios (like winners-take-all threshold strategies), the true one-entry of a block $Y$ may be erased with probability $p_{10}$, and spurious one-entries may appear independently with probability $p_{01}$. Then the number of false one-entries $f$ in an output block has a binomial distribution:
$p(f)=p_B(f;N-1,p_{01}):=\binom{N-1}{f}p_{01}^{f}(1-p_{01})^{N-1-f}.$
(A.9)
For the cases where the correct one-entry is preserved (fraction $1-p_{10}$), one has to select the correct one among the $f+1$ one-entries. Otherwise, in the cases where the correct one-entry is erased (fraction $p_{10}$), one has to choose the correct one among the $N-f$ zero-entries of the block $Y$. Therefore, the conditional information is
$I(X|Y)=(1-p_{10})\sum_{f=0}^{N-1}p_B(f;N-1,p_{01})\,\mathrm{ld}(1+f)+p_{10}\sum_{f=0}^{N-1}p_B(f;N-1,p_{01})\,\mathrm{ld}(N-f),$
(A.10)
and therefore the transinformation for transmitting one block over the channel is
$T(X;Y)=I(X)-I(X|Y)=\mathrm{ld}\,N-I(X|Y)=(1-p_{10})\sum_{f=0}^{N-1}p_B(f;N-1,p_{01})\,\mathrm{ld}\frac{N}{1+f}+p_{10}\sum_{f=0}^{N-1}p_B(f;N-1,p_{01})\,\mathrm{ld}\frac{N}{N-f}=(1-p_{10})\,E_f\!\left(\mathrm{ld}\frac{N}{1+f}\right)+p_{10}\,E_f\!\left(\mathrm{ld}\frac{N}{N-f}\right)=E_f\!\left((1-p_{10})\,\mathrm{ld}\frac{N}{1+f}+p_{10}\,\mathrm{ld}\frac{N}{N-f}\right),$
(A.11)
where the expectations $E_f(.)$ can easily be evaluated in network simulations by averaging the weighted logarithms $\mathrm{ld}\frac{N}{1+f}$ and $\mathrm{ld}\frac{N}{N-f}$ over the measured numbers $f\in\{0,1,\ldots,N-1\}$ of false one-entries per output block. Note that for $p_{10}=0$, this general result, equation A.11, is consistent with the previous result, equation A.8, for pattern part retrieval.
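This averaging procedure can be sketched in a few lines of Python. The code below (our own illustrative names, not the actual simulation code) estimates equation A.11 from a list of measured false one-entry counts:

```python
import math

def block_transinformation(N, f_counts, p10):
    """Estimate of equation A.11: average the weighted logarithms
    ld(N/(1+f)) and ld(N/(N-f)) over measured false one-entries f
    (one count per output block), for given erasure probability p10."""
    total = 0.0
    for f in f_counts:
        total += (1 - p10) * math.log2(N / (1 + f)) \
                 + p10 * math.log2(N / (N - f))
    return total / len(f_counts)
```

With $p_{10}=0$ and no false one-entries at all, the estimate reduces to $\mathrm{ld}\,N$, as in equation A.8 for pattern part retrieval.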
Now we can compute the transinformation between pattern sets, for example, between the original input patterns $U:=\{u^1,\ldots,u^M\}$ and noisy queries $\tilde U:=\{\tilde u^1,\ldots,\tilde u^M\}$. Assuming that queries $\tilde u^\mu$ contain $\tilde\lambda k$ correct one-entries of $u^\mu$ and additionally $\tilde\kappa k$ false one-entries, the component error probabilities are
$p_{10}=\frac{k-\tilde\lambda k}{k}=1-\tilde\lambda\quad\mathrm{and}\quad p_{01}=\frac{\tilde\kappa k}{n-k}=\frac{\tilde\kappa}{N-1},$
(A.12)
and, correspondingly, the transinformation between queries and original inputs is
$T(U;\tilde U)=Mk\,E_{\tilde f}\!\left(\tilde\lambda\,\mathrm{ld}\frac{N}{1+\tilde f}+(1-\tilde\lambda)\,\mathrm{ld}\frac{N}{N-\tilde f}\right).$
(A.13)
Similarly, one obtains the transinformation $T(U;\hat U)$ between reconstructed inputs $\hat U:=\{\hat u^1,\ldots,\hat u^M\}$ and original inputs $U:=\{u^1,\ldots,u^M\}$, and the transinformation $T(V;\hat V)$ between reconstructed outputs $\hat V:=\{\hat v^1,\ldots,\hat v^M\}$ and original output patterns $V:=\{v^1,\ldots,v^M\}$:
$T(U;\hat U)=Mk\,E_{\hat f_u}\!\left(\hat\lambda_u\,\mathrm{ld}\frac{N}{1+\hat f_u}+(1-\hat\lambda_u)\,\mathrm{ld}\frac{N}{N-\hat f_u}\right),$
(A.14)
$T(V;\hat V)=Mk\,E_{\hat f_v}\!\left(\hat\lambda_v\,\mathrm{ld}\frac{N}{1+\hat f_v}+(1-\hat\lambda_v)\,\mathrm{ld}\frac{N}{N-\hat f_v}\right),$
(A.15)
where $\hat\lambda_u$ and $\hat\lambda_v$ are the average fractions of correct one-entries in reconstructed input and output patterns, and the expectations are over the numbers of false one-entries $\hat f_u$ and $\hat f_v$ in a reconstructed input and output pattern, respectively.
Thus, we can compute storage capacities for our network simulations by using the definitions
$C_v:=\frac{T(V;\hat V)}{n^2},$
(A.16)
$C_u:=\frac{T(U;\hat U)-T(U;\tilde U)}{n^2},$
(A.17)
$C_{uv}:=C_v+C_u.$
(A.18)
For pattern part retrieval (without any false one-entries in the queries, $\tilde\kappa=0$), these general results can be compared against the theoretical estimates, equations 4.12 to 4.14.
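For illustration, the capacity definitions A.16 to A.18 can be evaluated directly from measured completeness values and false one-entry counts. The following Python sketch assumes such measurements are available; all names are our own:

```python
import math

def pattern_set_transinformation(M, k, N, lam, f_samples):
    """Equations A.13/A.14/A.15: M*k times the expectation of
    lam*ld(N/(1+f)) + (1-lam)*ld(N/(N-f)) over false one-entries f."""
    e = sum(lam * math.log2(N / (1 + f)) + (1 - lam) * math.log2(N / (N - f))
            for f in f_samples) / len(f_samples)
    return M * k * e

def capacities(n, M, k, N, lam_v, f_v, lam_u_hat, f_u_hat, lam_u_tilde, f_u_tilde):
    """Output, input, and total capacity per synapse (equations A.16-A.18)."""
    Cv = pattern_set_transinformation(M, k, N, lam_v, f_v) / n**2
    Cu = (pattern_set_transinformation(M, k, N, lam_u_hat, f_u_hat)
          - pattern_set_transinformation(M, k, N, lam_u_tilde, f_u_tilde)) / n**2
    return Cv, Cu, Cv + Cu
```

For example, with perfect retrieval ($\hat\lambda=1$, no false one-entries) from half queries ($\tilde\lambda=0.5$), $C_u$ counts only the information added beyond the query.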

Figure 11 illustrates some differences in information content for the different types of random memory patterns. We refer to pRND for independently generated pattern components, kRND for patterns having exactly $k$ out of $n$ active units, and BLK for different variants of block patterns (see the figure keys for details). It can be seen that pRND and kRND are almost equivalent. Only for extremely sparse patterns ($k<10$) do their information contents differ by a few percent. For that reason, we have computed information storage capacities simply from equation A.4 in all our simulations of networks storing kRND patterns. By contrast, block patterns typically have a significantly lower information content than pRND and kRND, in particular for large $k$. One exception to this rule is the case of noisy patterns with $\lambda<1$ and $\kappa>0$. Here, for small $k$, the (trans-)information is only slightly lower than for pRND and, surprisingly, for large $k$, the block patterns can have a much larger transinformation than pRND and kRND, where the transinformation ratio diverges for $k\to n/2$. For this reason we have computed information storage capacities from equations A.13 to A.18 in all our simulations of block-coded memory patterns.

A.2  Iterative Retrieval for Autoassociation with OR-ing (IRB)

The analysis in section 4.2 focuses on sIRB without OR-ing and heteroassociative IRB. For autoassociative IRB with OR-ing, we have to slightly adapt the previous analysis, as the OR-ing becomes effective after the first R1B step. This implies a larger fraction of correct one-entries in the output pattern because the $\lambda k$ correct neurons are already known from the input $\tilde u$. As equation 4.3 still holds for any of the remaining $(1-\lambda)k$ inactive blocks, equation 4.4 increases to
$\hat\lambda=\lambda+(1-\lambda)\big(1-p_1^{\lambda k}\big)^{N-1}.$
(A.19)
Correspondingly, the derivative of $λ^(λ)$ becomes
$\hat\lambda'(\lambda)=1-\big(1-p_1^{\lambda k}\big)^{N-1}-(1-\lambda)(N-1)\big(1-p_1^{\lambda k}\big)^{N-2}k\,p_1^{\lambda k}\ln p_1,$
(A.20)
and because of $\hat\lambda(1)=1$, linearizing around $\lambda=1$ as in equation 4.6 always yields
$\hat\lambda_{\max,AA}\approx\frac{\hat\lambda(1)-\hat\lambda'(1)}{1-\hat\lambda'(1)}=1.$
(A.21)
Thus, requiring $\hat\lambda\stackrel{!}{\ge}\min(\lambda+\frac{r}{k},1)$ as in equation 4.7 yields with equation A.19 the corresponding condition,
$p_{br}=\big(1-p_1^{\lambda k}\big)^{N-1}\stackrel{!}{\ge}L_{AA}:=\min\!\left(\frac{r/k}{1-\lambda},\,1-\varepsilon_{\min}\right),$
(A.22)
and, after replacing $L$ by $L_{AA}$, the same formulas for maximal matrix load $p_{1,\max,AA}$ and pattern capacity $M_{\max,AA}$ as in equations 4.9 and 4.10. Due to the lower threshold $L_{AA}<L$, autoassociation with OR-ing achieves a factor $\gamma_{p_1}$ higher matrix load and, correspondingly, can store a factor $\gamma_M$ more memories,
$\gamma_{p_1}:=\frac{p_{1,\max,AA}}{p_{1,\max}}=\left(\frac{1-L_{AA}^{1/(N-1)}}{1-L^{1/(N-1)}}\right)^{\frac{1}{\lambda k}},$
(A.23)
$\gamma_M:=\frac{M_{\max,AA}}{M_{\max}}=\frac{\ln(1-p_{1,\max,AA})}{\ln(1-p_{1,\max})}=\frac{\ln\!\Big(1-\big(1-L_{AA}^{1/(N-1)}\big)^{1/(\lambda k)}\Big)}{\ln\!\Big(1-\big(1-L^{1/(N-1)}\big)^{1/(\lambda k)}\Big)},$
(A.24)
than heteroassociation or autoassociation without OR-ing.
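The completeness map of equation A.19 can be iterated numerically to inspect the fixed points discussed above. Below is a small Python sketch under the mean-value approximation (ignoring the variance caveat of note 9; function names are our own):

```python
def lam_hat_oring(lam, p1, k, N):
    """Equation A.19: expected output completeness for autoassociative
    IRB with OR-ing, given input completeness lam, matrix load p1,
    pattern activity k, and block size N."""
    return lam + (1 - lam) * (1 - p1 ** (lam * k)) ** (N - 1)

def iterate_retrieval(lam0, p1, k, N, steps=50):
    """Iterate the completeness map; with OR-ing, lam never decreases."""
    lam = lam0
    for _ in range(steps):
        lam = lam_hat_oring(lam, p1, k, N)
    return lam
```

For moderate matrix loads, the iteration converges toward $\hat\lambda=1$ (complete pattern retrieval), consistent with equation A.21.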

A.3  The Limit of Large Networks for Balanced Potentiation

As described in section 4.4, we consider the limit $n,k,N\to\infty$, $p=\frac{k}{n}\to 0$, $L=\lambda+\frac{r}{k}$ with $0<L<1$ and $p\ln L\to 0$, for constant $\lambda,\varepsilon$. Moreover, we assume that $r$ is typically constant, or $r\sim k$.

First, we show that after replacing $L$ by $L_{AA}$, equations 4.15 and 4.16 hold as well for autoassociation with OR-ing. Using again $\ln(x+\Delta x)=\ln(x)+O(\frac{\Delta x}{x})$ and $L_{AA}=\frac{r/k}{1-\lambda}$ from equation A.22, it is $\ln\big(1-L_{AA}^{p}\big)=\ln(-p\ln L_{AA})+O(p\ln L_{AA})\sim-\ln(N)+\ln\ln\frac{k}{r}\sim-\ln(N)$ because $\frac{\ln\ln k}{\ln(n/k)}\to 0$ for any sublinear $k=O(n^d)$ with $d<1$.

Second, as a consequence, we show for equations A.23 and A.24,
$\gamma_{p_1}\to 1\quad\text{and thus also}\quad\gamma_M\to 1,$
(A.25)
that is, large networks doing autoassociation with OR-ing cannot exceed matrix load $p_{1,\max}$ and memory number $M_{\max}$ of heteroassociation. To prove this, note $\ln\frac{\ln L_{AA}}{\ln L}=\ln\frac{\ln\frac{r/k}{1-\lambda}}{\ln(\lambda+\frac{r}{k})}=O(\ln\ln k)$. With this, equation A.23 with 4.16 implies $\ln\gamma_{p_1}=\frac{\ln(1-L_{AA}^{p})-\ln(1-L^{p})}{\lambda k}=\frac{\ln(-p\ln L_{AA})-\ln(-p\ln L)+O(-p\ln L_{AA})+O(-p\ln L)}{\lambda k}=\frac{\ln\frac{\ln L_{AA}}{\ln L}+O(-p\ln L_{AA})}{\lambda k}=\frac{O(\ln\ln k)}{\lambda k}\to 0$ and therefore $\gamma_{p_1}\to 1$.
Third, we prove equation 4.18 in the limit of fixed $0<L<1$ where $k\sim\ln n$ grows logarithmically in $n$, as shown by equation 4.17. Let $0<p_1<1$ also be fixed. Then equation 4.2 implies $p_{01}=p_1^{\lambda k}\to 0$, and therefore $\ln\hat\lambda=(N-1)\ln(1-p_{01})\approx-(N-1)p_{01}$ from equation 4.4. Inserting equation 4.9 yields $p_{01}=p_1^{\frac{\log_{p_1}(1-L^{1/(N-1)})}{\log_{p_1}p_{1,\max}}}=\big(1-L^{1/(N-1)}\big)^{\frac{\mathrm{ld}(p_1)}{\mathrm{ld}(p_{1,\max})}}$, and inserting equation 4.15 yields $p_{01}\approx(-p\ln L)^{\mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})}$. Therefore, for $L_\infty:=\lim_{n\to\infty}L=\lim_{n\to\infty}(\lambda+\frac{r}{k})$, we get $-\ln(\hat\lambda)\approx Np_{01}=N^{1-\frac{\mathrm{ld}(p_1)}{\mathrm{ld}(p_{1,\max})}}\,(-\ln L)^{\frac{\mathrm{ld}(p_1)}{\mathrm{ld}(p_{1,\max})}}\to\begin{cases}0, & \mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})>1\\ \infty, & \mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})<1\\ -\ln L_\infty, & \mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})=1\end{cases}$ or equivalently,
$\hat\lambda\to\begin{cases}1, & p_1<p_{1,\max}\\ L_\infty, & p_1=p_{1,\max}\\ 0, & p_1>p_{1,\max}\end{cases}\quad\text{and thus}\quad\hat\lambda\to\begin{cases}1, & M/M_{\max}<1\\ L_\infty, & M/M_{\max}=1\\ 0, & M/M_{\max}>1.\end{cases}$
(A.26)
where typically $L∞=λ$ for constant $r$.
Fourth, we show that almost the same results hold for autoassociation with OR-ing. The proof is identical to that for equation A.26 except that $\hat\lambda=\lambda+(1-\lambda)(1-p_1^{\lambda k})^{N-1}$ and $L_{AA}=\frac{r/k}{1-\lambda}$ from equations A.19 and A.22. The former yields $\ln\frac{\hat\lambda-\lambda}{1-\lambda}=(N-1)\ln\big(1-p_1^{\lambda k}\big)\approx-N(-p\ln L_{AA})^{\mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})}=-N^{1-\frac{\mathrm{ld}(p_1)}{\mathrm{ld}(p_{1,\max})}}(-\ln L_{AA})^{\frac{\mathrm{ld}(p_1)}{\mathrm{ld}(p_{1,\max})}}$ and therefore again,
$-\ln\frac{\hat\lambda-\lambda}{1-\lambda}\to\begin{cases}0, & \mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})>1\\ \infty, & \mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})<1\\ -\ln L_{AA,\infty}, & \mathrm{ld}(p_1)/\mathrm{ld}(p_{1,\max})=1\end{cases}\quad\text{or}\quad\hat\lambda\to\begin{cases}1, & p_1<p_{1,\max}\\ \lambda, & p_1>p_{1,\max}\\ \lambda+(1-\lambda)L_{AA,\infty}, & p_1=p_{1,\max},\end{cases}$
where $L_{AA,\infty}:=\lim_{n\to\infty}\frac{r/k}{1-\lambda}$. Thus, the result is again a sharp transition from maximal output completeness $\hat\lambda=1$ to minimal $\hat\lambda=\lambda$.

Appendix B:  Analysis of One-Step Retrieval (R1) and Iterative Retrieval (IR)

A similar analysis as in section 4 for heteroassociative one-step retrieval R1 (without exploiting block coding) yields the following results (see Knoblauch et al., 2010, equations 3.2, 3.7-3.11),
$p_{1\varepsilon}\approx\left(\frac{\varepsilon k}{n-k}\right)^{1/(\lambda k)}\approx(\varepsilon p)^{1/(\lambda k)}\;\Leftrightarrow\;k\approx\frac{\mathrm{ld}(\varepsilon p)}{\lambda\,\mathrm{ld}\,p_{1\varepsilon}},$
(B.1)
$M_\varepsilon=\frac{\ln(1-p_{1\varepsilon})}{\ln(1-p^2)}\approx-\frac{n^2\ln(1-p_{1\varepsilon})}{k^2}\approx-\frac{\lambda^2 n^2(\mathrm{ld}\,p_{1\varepsilon})^2\ln(1-p_{1\varepsilon})}{[\mathrm{ld}(\varepsilon p)]^2},$
(B.2)
$C_\varepsilon:=\frac{M_\varepsilon\left(\mathrm{ld}\binom{n}{k}-\mathrm{ld}\binom{(1+\varepsilon)k}{\varepsilon k}\right)}{n^2}\approx\frac{M_\varepsilon\,n\,T(p,\varepsilon p,0)}{n^2}\approx\frac{(1-\varepsilon_T)\,\lambda\,\mathrm{ld}(p_{1\varepsilon})\ln(1-p_{1\varepsilon})}{1+\frac{\ln\varepsilon}{\ln p}},$
(B.3)
where $\varepsilon$ is the maximal amount of output noise $\hat\varepsilon$ allowed in the retrieval output $\hat v$, and $p_{1\varepsilon}$, $M_\varepsilon$, and $C_\varepsilon$ are maximal matrix load, pattern capacity, and output capacity at output noise level $\varepsilon$ that may be compared to equations 4.9, 4.10, and 4.12, for example. Further, $T(.)$ is the transinformation of one pattern component as defined in equation A.4, $\varepsilon_T:=1-T(.)/I(p)$ with $I$ as in equation A.1 is the relative output loss of information, and the last approximation uses $\mathrm{ld}\binom{n}{k}\approx k\,\mathrm{ld}\,n$. While equations B.1 and B.2 hold as well for autoassociation, equation B.3 must be corrected by subtracting the transinformation between input patterns and original patterns as in equation A.17 or by multiplying by a factor $\big(1-\frac{\lambda}{1-\varepsilon_T}\big)$ or $(1-\lambda)$, similarly as in equations 4.14 and 4.23.
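Equations B.1 and B.2 are easy to evaluate numerically. The following Python sketch (a rough estimate under the stated approximations; function names are our own) computes the maximal matrix load and pattern capacity of R1 for given $n$, $k$, $\lambda$, and $\varepsilon$:

```python
import math

def r1_capacity(n, k, lam, eps):
    """Approximate maximal matrix load p_1eps (equation B.1) and pattern
    capacity M_eps (equation B.2) for heteroassociative one-step retrieval
    R1 at output noise level eps."""
    p = k / n
    p1_eps = (eps * p) ** (1.0 / (lam * k))          # equation B.1
    m_eps = -n**2 * math.log(1 - p1_eps) / k**2      # equation B.2
    return p1_eps, m_eps
```

For example, $n=1000$, $k=10$, $\lambda=1$, $\varepsilon=0.01$ yields $p_{1\varepsilon}\approx 0.4$ and a pattern capacity on the order of several thousand memories.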
To evaluate the gain in storage capacity of IRB over R1, we can compute the relative increases of maximal $p1$, $M$, and $C$:
$g_{p_1}:=\frac{p_{1,\max}}{p_{1\varepsilon}}\approx\left(\frac{1-L^{p}}{\varepsilon p}\right)^{1/(\lambda k)},$
(B.4)
$g_M:=\frac{M_{\max}}{M_\varepsilon}=\frac{\ln(1-p_{1,\max})}{\ln(1-p_{1\varepsilon})},$
(B.5)
$g_C:=\frac{C_v}{C_\varepsilon}=\frac{M_{\max}}{M_\varepsilon}\cdot\frac{\lambda_{\max}k\,\mathrm{ld}\,N}{(1-\varepsilon_T)\,\mathrm{ld}\binom{n}{k}}\approx\left(1-\frac{\ln k}{\ln n}\right)g_M.$
(B.6)
These expressions can then be used for two types of comparisons. First, for estimating the gain of block coding over random coding, we may assume $ε=1-L$, which corresponds to a similar “output noise after first retrieval step must be smaller than input noise” criterion as used in section 4.2 for the analysis of IRB. However, note that without block coding, input and output noise are asymmetric as output noise corresponds to additional spurious one-entries in the output pattern, whereas the input query $u˜$ has only missing one-entries. We still consider this a reasonable conservative criterion; at least for very sparse patterns, $k=O(lnn)$, add-noise seems less detrimental to retrieval quality than miss-noise (Knoblauch et al., 2010, Fig. 8a). Second, for estimating the gain of iterative retrieval (IRB) over one-step retrieval, a fair comparison requires that R1 achieves low-output noise comparable to IRB—for example, $ε=1-λmax$ or $ε≤0.01$.
We can similarly compare one-step and iterative retrieval IR without any block coding (assuming sparse coding $k≤log(n)$ and a reasonable threshold strategy for retrieval steps 2, 3, …, for example, $k$-WTA selecting the $k$ most activated units) by using the gains (e.g., again for $ε≤0.01$)
$g^*_{p_1}:=\frac{p_{1(1-L)}}{p_{1\varepsilon}}\approx\left(\frac{1-L}{\varepsilon}\right)^{1/(\lambda k)},$
(B.7)
$g^*_M:=\frac{M_{1-L}}{M_\varepsilon}=\frac{\ln\big(1-p_{1(1-L)}\big)}{\ln(1-p_{1\varepsilon})},$
(B.8)
$g^*_C:=\frac{C_{1-L}}{C_\varepsilon}\approx g^*_M,$
(B.9)
where $L:=\min(\lambda+\frac{r}{k},\,1-\hat\varepsilon_{\min})$ is defined in analogy to equation 4.8 (see note 16).

Appendix C:  Recognition Capacity of the Willshaw Model

As explained in section 2, the task of recognition memory is to determine whether a pattern $\tilde u$ corresponds to one of the $M$ familiar patterns $u^\mu$ that have previously been stored in the weight matrix. To do a capacity analysis similar to that in section 4, we can think of recognition memory as associating a set of $M$ familiar patterns $u^\mu$ with labels $v^\mu=1$ and the remaining $M^*$ possible unseen patterns with labels $v^{\mu'}=0$. In the autoassociative Willshaw model, equation 2.2, with dendritic potentials $x_j:=\sum_{i=1}^{n}W_{ij}\tilde u_i$ as in equation 3.1, recognition memory can be realized by the one-step decision
$\hat v=\begin{cases}1, & \sum_{i=1}^{n}\sum_{j=1}^{n}W_{ij}\tilde u_i\tilde u_j\ge\Theta\\ 0, & \text{otherwise},\end{cases}$
(C.1)
where the threshold $\Theta=\|\tilde u\|^2$ can be chosen maximally if the familiar inputs $\tilde u$ are subsets of the original patterns $u^\mu$. We further assume again that all patterns have identical activity $k=\|u^\mu\|$ and $\|\tilde u\|=\lambda k$ for $0<\lambda\le 1$ such that the number of unseen patterns is $M^*=\binom{n}{k}-M$. Then equation C.1 means that for a positive decision $\hat v=1$, all "clique synapses" of the pattern $\tilde u$ have to be potentiated, $w_{ij}=1$ for $i,j\in\tilde u$. Obviously, this will be true if $\tilde u$ is actually a subset of a familiar pattern that has previously been stored. If $\tilde u$ corresponds to one of the $M^*$ unseen patterns, there is a chance that all "clique synapses" have been potentiated accidentally by several of the previously stored patterns, implying a false-positive (FP) response corresponding to a spurious state in the autoassociative dynamics. Obviously, the chance of an FP is
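The one-step decision of equation C.1 amounts to checking that all clique synapses of the query are potentiated. Below is a minimal Python sketch with NumPy (names are our own; `W` stands for the binary autoassociative weight matrix):

```python
import numpy as np

def recognize(W, u_query):
    """One-step familiarity decision in the spirit of equation C.1:
    accept iff all 'clique synapses' among the active query units are
    potentiated, i.e. their summed weights reach Theta = ||u_query||^2."""
    active = np.flatnonzero(u_query)          # indices of active units
    theta = len(active) ** 2                  # Theta = ||u_query||^2
    return bool(W[np.ix_(active, active)].sum() >= theta)
```

A single unpotentiated synapse within the query clique already drops the sum below $\Theta$, so the decision rejects the pattern.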
$p_F:=q_{01}:=\mathrm{pr}[\forall i,j\in\tilde u\ \exists\mu:i,j\in u^\mu]\stackrel{>}{\approx}p_1^{\lambda k(\lambda k-1)/2},$
(C.2)
where $p_1=1-\big(1-\frac{k^2}{n^2}\big)^M$ is the matrix load (see equation 4.1), assuming that each of the $\binom{\lambda k}{2}=\frac{\lambda k(\lambda k-1)}{2}$ relevant "upper" matrix entries is potentiated independently (note that the autoassociative weight matrix is symmetric, and the chance that a diagonal element $w_{ii}$ is potentiated is close to one). Similar to equation 2.4, we can define output noise,
$\hat\varepsilon=q_{01}\frac{M^*}{M}\approx p_1^{\lambda k(\lambda k-1)/2}\frac{M^*}{M},$
(C.3)
as the number of FPs normalized to the number of actual familiar patterns. Imposing a quality criterion $\hat\varepsilon\stackrel{!}{\le}\varepsilon$ as in appendix B then requires for $q:=\frac{M}{M^*+M}\approx\frac{M}{M^*}$ that $p_1^{\lambda k(\lambda k-1)/2}\le\varepsilon q$, which yields the maximal matrix load at error level $\varepsilon$ (see note 17),
$p_1\le p_{1\varepsilon}:=(\varepsilon q)^{\frac{2}{(\lambda k)^2-\lambda k}}\;\Leftrightarrow\;k=\frac{1+\sqrt{1+\frac{8\,\mathrm{ld}(\varepsilon q)}{\mathrm{ld}\,p_{1\varepsilon}}}}{2\lambda},$
(C.4)
and by solving $p_1=1-\big(1-\frac{k^2}{n^2}\big)^M$ for $M$ the pattern capacity,
$M_\varepsilon=\frac{\ln(1-p_{1\varepsilon})}{\ln\big(1-\frac{k^2}{n^2}\big)}\approx-\frac{n^2\ln(1-p_{1\varepsilon})}{k^2}=-\frac{4\lambda^2 n^2\ln(1-p_{1\varepsilon})}{\Big(1+\sqrt{1+\frac{8\,\mathrm{ld}(\varepsilon q)}{\mathrm{ld}\,p_{1\varepsilon}}}\Big)^2}$
(C.5)
$\stackrel{<}{\approx}\frac{\lambda^2}{2}\,\frac{\mathrm{ld}(p_{1\varepsilon})\ln(1-p_{1\varepsilon})}{-\mathrm{ld}(\varepsilon q)}\,n^2,$
(C.6)
where the upper bound becomes tight for vanishing $k/n\to 0$ and diverging $\mathrm{ld}(\varepsilon q)/\mathrm{ld}(p_{1\varepsilon})\to\infty$, for example, constant $p_{1\varepsilon}$ and vanishing $\varepsilon q\to 0$.
The stored information $(M+M^*)\,T(q,q_{01},0)$ corresponds to the transinformation (see equation A.4) between the true class labels $v^{\mu'}$ and the NAM's decisions $\hat v^{\mu'}$ for $\mu'=1,\ldots,M+M^*$. Normalizing to the number of synapses yields the recognition capacity,
$C_{u,rcg,\varepsilon}:=\frac{M_\varepsilon\big(1+\frac{1}{q}\big)T(q,\varepsilon q,0)}{n^2}\stackrel{<}{\approx}\frac{\lambda^2}{2}\,\mathrm{ld}(p_{1\varepsilon})\ln(1-p_{1\varepsilon})\,\frac{T(q,\varepsilon q,0)}{-q\,\mathrm{ld}(\varepsilon q)},$
(C.7)
where the approximation becomes tight again for $ld(εq)/ld(p1ε)→∞$. To maximize $Cu,rcg,ε$, we can choose $q$ arbitrarily to maximize
$f(\varepsilon,q):=\frac{T(q,\varepsilon q,0)}{-q\,\mathrm{ld}(\varepsilon q)}.$
As we have shown elsewhere that $T(q,q_{01},0)\le-q\,\mathrm{ld}\,q_{01}$ and $T(q,q_{01},0)\approx-q\,\mathrm{ld}\,q_{01}$ for $q/q_{01}\to 0$ (see equation A.6 in Knoblauch et al., 2010), we obtain the maximum $f(\varepsilon,q)\to 1$ for $q/q_{01}=\frac{1}{\varepsilon}\to 0$. Thus, in a high-noise limit $n\to\infty$ with diverging $\varepsilon\to\infty$, implying many more "spurious states" than true memories, we obtain maximal capacity,
$C_{u,rcg,\varepsilon}\to\frac{\lambda^2}{2}\,\mathrm{ld}(p_{1\varepsilon})\ln(1-p_{1\varepsilon})\le\frac{\ln 2}{2}\approx 0.35,$
(C.8)
where the upper bound is obtained for $p1ε=0.5$, $λ=1$, $q→0$, $ε→∞$, and $q01=εq→0$. This means that for maximum capacity, the asymptotically optimal pattern activity, equation C.4, must grow as
$k\approx\frac{1}{\lambda}\sqrt{\frac{2\,\mathrm{ld}(\varepsilon q)}{\mathrm{ld}\,p_{1\varepsilon}}}\sim\sqrt{-\mathrm{ld}(\varepsilon q)}.$
(C.9)
This allows $k\to\infty$ to grow arbitrarily slowly, as $\varepsilon q\to 0$ may vanish arbitrarily slowly, where $\varepsilon<\frac{1}{q}=\frac{\binom{n}{k}}{M}\to\infty$ may grow very fast (see note 18). On the other end of the optimal range, if $\varepsilon$ is almost constant, we obtain $-\mathrm{ld}(\varepsilon q)\approx-\mathrm{ld}\,q=\mathrm{ld}\frac{\binom{n}{k}}{(n/k)^2}\approx k\,\mathrm{ld}\,n$ and thus (see note 19) $k=\frac{2\,\mathrm{ld}\,n}{\lambda^2}$. Thus, for recognition memory it is possible to store up to $\frac{\ln 2}{2}\approx 0.35$ bit per synapse, but only for high noise levels where there are many times more spurious states than stored memories. The resulting device can still be useful for applications as typically "new" inputs occur relatively seldom during operation and the chance $q_{01}$ that a new input evokes a false-positive response is very low. Interestingly, this bound can be reached for any diverging pattern activity with $k\le\frac{2\,\mathrm{ld}\,n}{\lambda^2}$, whereas reconstruction memory is optimal only for logarithmic $k\approx\frac{1}{\lambda}\,\mathrm{ld}\,n$ (see equations B.1 and 4.9).

A limitation of our analysis is the binomial approximation of the error probability, equation C.2, which assumes that all one-entries in the weight matrix would have been generated independently of each other. For reconstruction memory we have shown the convergence of this approximation (Knoblauch, 2008b, 2008c), which is relatively easy to show because the probability of a single output component error depends only on the synaptic weights of the output unit (see equation 4.2). By contrast, for recognition memory, the error probability, equation C.2, depends on the synaptic weights of all active neurons, which may include subtle dependencies that are difficult to analyze precisely. Also, a quantitative verification of our theory by simulation experiments is here much more difficult because the error probabilities (see equation C.2) become extremely small, $q_{01}\sim 0.5^{k^2}$, and thus are difficult to test with sufficient precision (see note 20).

Notes

1

In principle, these early results did not really consider our very general definition of $C$, so they only show that $C≥ln2$ for heteroassociation and $C≥ln2/2$ for autoassociation. The equality seems very plausible after more than 40 years of associative memory research, but to our knowledge, it is still a conjecture.

2

Split the patterns $u^\mu$ into two parts of length $(1-r)n$ and $rn$. Then the matrix $W$ contains a submatrix $W^{(11)}$ for autoassociation of the first pattern parts and a matrix $W^{(12)}$ for heteroassociation from the first to the second parts. The parameters for these submatrices are in the close-to-optimal range, so their general information capacities converge to the limits $C_A:=C$ and $C_H\ge\ln 2$ for autoassociation and heteroassociation, respectively. Similarly, the capacity of a third submatrix $W^{(22)}$ for autoassociation of the second pattern parts converges also to $C_A$. From this, we get $(1-r)^2C_A+r(1-r)C_H\le C_A\le(1-r)^2C_A+r(1-r)C_H+r^2C_A$ for any $0<r<1$. Solving for $C_A$ yields equivalently $\frac{1-r}{2-r}C_H\le C_A\le\frac{C_H}{2}$. Thus, $C_A=\frac{C_H}{2}$ for $r\to 0$.

3

This means there is exactly one synapse $wij$ between any neuron pair $ui,vj$. This assumes autapses for autoassociation and bidirectional synapses for heteroassociation.

4

We use $\mathrm{ld}\,n:=\log_2(n)$ as an abbreviation for the logarithmus dualis.

5

In most of our analyses and simulation experiments, we assume $λ=0.5$ or $λ=1$ but no additional false one-entries ($κ=0$) because this is known to maximize information capacity. Further simulations using different values for $λ$ and $κ>0$ have confirmed our results (data not shown).

6

For autoassociation, there are no synaptic links within a block except self-connections. Therefore, a silent neuron never gets input from an active neuron within the same block and therefore cannot reach the firing threshold (equaling the number of all active neurons). As a consequence, sIRB and IRB are equivalent for autoassociative pattern part retrieval.

7

The chance that a synapse is potentiated after learning one pattern association is $(1/N)^2$. Correspondingly, the chance that a synapse is not potentiated after learning $M$ associations is $p_0:=(1-\frac{1}{N^2})^M$, and $p_1:=1-p_0$. Note that for autoassociation, within a block only autapses $W_{ii}$ may be potentiated.

8

We assume here that the one-entries in a column of the weight matrix would be placed independently, corresponding to a binomial distribution of dendritic potentials. It has been shown that this so-called binomial approximation of $p_{01}$ becomes exact in the limit of large networks with sufficiently sparse memory patterns, including the case $k=O(n/\log^2 n)$ (Knoblauch, 2008c).

9

Obviously, solving this fixed-point equation yields only an upper bound of $\lambda_{\max}$ for two reasons. First, because equation 4.4 gives only a mean value, due to a positive variance, about half of the retrievals will result in worse-than-average values, such that iterative retrieval may get stuck in a spurious fixed point. Second, the linear approximation of $\hat\lambda(\lambda)$ also yields only an upper bound for $\lambda_{\max}$ due to the concave form of the curve for relevant $\lambda$ (see Figure 2). A quadratic approximation would obviously yield better results, but to preserve the upper bound and simplify the following capacity analyses, we keep to the linearization. The simulations in section 5 show that our analysis still provides useful results.

10

As output errors (see equation 4.2) decrease exponentially with $λ$, the first retrieval step is most critical for convergence of activity toward the original pattern. Therefore, the resulting approximations should be sufficiently good in spite of not analyzing further retrieval steps.

11

In the following capacity analysis of the maximal matrix load $p1$ satisfying equation 4.8, we do not know $λmax$ beforehand (as it depends itself on $p1$). To avoid an iteration between $L$ and maximal $p1$, we can assume a fixed tolerance value for convenience, for example, $εmin:=ε=0.01$, similar to previous analyses (Knoblauch et al., 2010). Note that for $λ<1$ and large networks with $r/k→0$, the value of $εmin$ becomes irrelevant anyhow.

12

In general, $λ^max$ is the mean completeness averaged over both areas $u$ and $v$. For example, for IRB with OR-ing, the maximal completeness in area $u$ is typically (slightly) larger than in $v$.

13

Strictly speaking, we cannot definitely exclude that there exist better retrieval strategies exceeding the classical limits. For example, for block coding, IRB-SMX currently seems to be the optimal retrieval strategy (see the simulation results in section 5). Although an analysis of IRB-SMX is more challenging, our simulations suggest that IRB-SMX also cannot exceed the asymptotic values of R1. First, our simulations show that maximum completion capacity (for $n$ in the range between $10^4$ and $10^5$) is only slightly larger than for classical IR models and seems to decrease to the classical value for larger $n>10^5$. Second, some further analysis (not shown) reveals that in order to exceed the classical values, any iterative retrieval method must be able to eliminate a number of noisy neuron activities (e.g., after the first R1 step) that scales larger than any power of $k$. However, our simulations of very large IRB-SMX networks show that this number, at maximal capacity, seems to scale at most with $k^2$ (see note 15).

14

Most experiments use $λ=0.5$ or $λ=1$ because this is optimal for autoassociation and heteroassociation, respectively. But further simulation experiments have confirmed our results also for various other values of $λ$ and $κ>0$ (data not shown).

15

Although we cannot strictly exclude that the asymptotic capacity of the optimal IRB-SMX method is above the classical asymptotic bounds, our simulation data do not support such a hypothesis. In fact, we have measured output noise $\hat\varepsilon$ after the first R1 step for IRB-SMX at maximum capacity (and also after convergence). For $k=20,22,24,26,28$ we obtained $\hat\varepsilon=1.69,1.86,1.98,2.16,2.31$ after the first R1 step (corresponding to $\hat\varepsilon=0.136,0.117,0.087,0.098,0.113$ after the last iteration). These data support only a linear increase of output noise $\hat\varepsilon\sim k$ (or a quadratic increase of the number of wrongly activated neurons with $k$). On the other hand, we have argued in section 4.4 (see note 13) that exceeding the classical asymptotic bounds would require that the output noise $\hat\varepsilon$ grow faster than any power of $k$. Thus, our data support the hypothesis that IRB-SMX also has the same asymptotic bounds as the classical retrieval methods.

16

In equation 4.8, we defined $L$ as a lower bound on the output completeness $\hat\lambda$ being a function of input completeness $\lambda$. Expressing the same relation in terms of input noise $\tilde\varepsilon:=1-\lambda$ and output noise $\hat\varepsilon=1-\hat\lambda$, condition 4.8 becomes equivalently $\hat\varepsilon\stackrel{!}{\le}1-L=1-\min(\lambda+\frac{r}{k},\lambda_{\max})=\max(1-\lambda-\frac{r}{k},1-\lambda_{\max})=\max(\tilde\varepsilon-\frac{r}{k},\hat\varepsilon_{\min})$ with the minimally possible output noise $\hat\varepsilon_{\min}:=1-\hat\lambda_{\max}$. Transferring this result from IRB to IR then simply means requiring that the output noise after the first R1 step must be bounded by $\varepsilon:=1-L$ as implied by equations B.7 to B.9. Here we may use the minimally possible output noise $\hat\varepsilon_{\min}:=\frac{(n-k)p_1^{k}}{k}$ that follows from the minimal output component error probability $p_{01,\min}=p_1^{k}$ for perfect inputs with $\lambda=1$. However, to avoid again iterated computations (as $\hat\varepsilon_{\min}$ depends on $p_1$), we simply choose fixed $\hat\varepsilon_{\min}=\varepsilon$ as for R1.

17

The right part of equation C.4 follows from solving $\ln p_{1\varepsilon}=\frac{2}{(\lambda k)^2-\lambda k}\ln(\varepsilon q)$ or the quadratic equation $\lambda k^2-k-\frac{2\ln(\varepsilon q)}{\lambda\ln p_{1\varepsilon}}=0$.

18

For example, for $-\ln(\varepsilon q)\sim\ln\ln\cdots\ln n\to\infty$ it follows from equation C.6 that $M\sim\frac{n^2}{\ln\ln\cdots\ln n}$ scales almost with the number of synapses. Correspondingly, the normalized critical pattern capacity diverges as $\alpha:=\frac{M}{n^2/\ln^2 n}\sim\frac{\ln^2 n}{\ln\ln\cdots\ln n}\to\infty$.

19

Inserting $-\mathrm{ld}(\varepsilon q)\approx k\,\mathrm{ld}\,n$ in equation C.4 yields $k\approx\frac{\sqrt{8k\,\mathrm{ld}\,n}}{2\lambda}$, and solving for $k$ yields $k=\frac{2\,\mathrm{ld}\,n}{\lambda^2}$.

20

In our current approach (data not shown), we have used randomly selected unseen patterns to estimate $q_{01}$, but this allows precise estimation only for small networks with relatively large error probabilities, for example, $q_{01}\ge 10^{-5}$. A possible future approach may try a maximum clique approach to specifically search for cliques of size $\lambda k$ in the graph of the synaptic weight matrix and, by that, estimate the probability that a randomly selected clique corresponds to either a spurious state or a familiar memory.

Acknowledgments

We are grateful to Friedhelm Schwenker and Fritz Sommer for valuable discussions. We are also very grateful to Hans Kestler for letting us use his computing infrastructure to simulate large neural networks. We acknowledge support by the state of Baden-Württemberg through bwHPC.

References

Aboudib
,
A.
,
Gripon
,
V.
, &
Jiang
,
X.
(
2014
).
A study of retrieval algorithms of sparse messages in networks of neural cliques
. In
Proceedings of the 6th International Conference on Advanced Cognitive Technologies and Applications
(pp.
140
146
). https://www.iaria.org/conferences2014/COGNITIVE14.html
Albus
,
J.
(
1971
).
A theory of cerebellar function
.
Mathematical Biosciences
,
10
,
25
61
.
,
B. K.
,
Berrou
,
C.
,
Gripon
,
V.
, &
Jiang
,
X.
(
2014
).
Storing sparse messages in networks of neural cliques
.
IEEE Transactions on Neural Networks and Learning Systems
,
25
,
980
989
.
Amari
,
S.-I.
(
1989
).
Characteristics of sparsely encoded associative memory
.
Neural Networks
,
2
,
451
457
.
Austin
,
J.
, &
Stonham
,
T.
(
1987
).
Distributed associative memory for use in scene analysis
.
Image and Vision Computing
,
5
(
4
),
251
260
.
Bentz
,
H.
,
Hagstroem
,
M.
, &
Palm
,
G.
(
1989
).
Information storage and effective data retrieval in sparse matrices
.
Neural Networks
,
2
,
289
293
.
Bogacz
,
R.
, &
Brown
,
M.
(
2003
).
Comparison of computational models of familiarity discrimination in the perirhinal cortex
.
Hippocampus
,
13
,
494
524
.
Bogacz
,
R.
,
Brown
,
M.
, &
Giraud-Carrier
,
C.
(
2001
).
Model of familiarity discrimination in the perirhinal cortex
.
Journal of Computational Neuroscience
,
10
,
5
23
.
Braitenberg
,
V.
(
1978
). Cell assemblies in the cerebral cortex. In
R.
Heim
&
G.
Palm
(Eds.),
Lecture Notes in Biomathematics: Vol. 21. Theoretical Approaches to Complex Systems
(pp.
171
188
).
Berlin
:
Springer-Verlag
.
Braitenberg
,
V.
, &
Schüz
,
A.
(
1991
).
Anatomy of the cortex: Statistics and geometry.
Berlin
:
Springer-Verlag
.
Bruck
,
J.
, &
Roychowdhury
,
V.
(
1990
).
On the number of spurious memories in the Hopfield model
.
IEEE Transactions on Information Theory
,
36
(
2
),
393
397
.
Buckingham
,
J.
, &
Willshaw
,
D.
(
1992
).
Performance characteristics of the associative net
.
Network: Computation in Neural Systems
,
3
,
407
414
.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Willshaw, D. (1991). Optimising synaptic learning rules in linear associative memory. Biological Cybernetics, 65, 253–265.
Ferro, D., Gripon, V., & Jiang, X. (2016). Nearest neighbour search using binary neural networks. In Proceedings of the International Joint Conference on Neural Networks. Piscataway, NJ: IEEE.
French, R. (1999). Catastrophic forgetting in connectionist networks: Causes, consequences and solutions. Trends in Cognitive Sciences, 3(4), 128–135.
Gabor, D. (1969). Associative holographic memories. IBM Journal of Research and Development, 13(2), 156–159.
Gardner, E. (1987). Maximum storage capacity in neural networks. Europhysics Letters, 4, 481–485.
Gardner, E. (1988). The space of interactions in neural network models. Journal of Physics A: Mathematical and General, 21, 257–270.
Gibson, W., & Robinson, J. (1992). Statistical analysis of the dynamics of a sparse associative memory. Neural Networks, 5, 645–662.
Gripon, V., & Berrou, C. (2011). Sparse neural networks with large learning diversity. IEEE Transactions on Neural Networks, 22(7), 1087–1096.
Gripon, V., & Berrou, C. (2012). Nearly-optimal associative memories based on distributed constant weight codes. In Proceedings of the IEEE Information Theory and Applications Workshop (pp. 269–273). Piscataway, NJ: IEEE.
Gripon, V., Heusel, J., Löwe, M., & Vermet, F. (2016). A comparative study of sparse associative memories. Journal of Statistical Physics, 164(1), 105–129.
Gripon, V., Löwe, M., & Vermet, F. (2018). Associative memories to accelerate approximate nearest neighbor search. Applied Sciences, 8(9), 1676.
Gripon, V., & Rabbat, M. (2013). Maximum likelihood associative memories. In Proceedings of the IEEE Information Theory Workshop (pp. 1–5). Piscataway, NJ: IEEE.
Hebb, D. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hodge, V., & Austin, J. (2003). A comparison of standard spell checking algorithms and a novel binary neural approach. IEEE Transactions on Knowledge and Data Engineering, 15(5), 1073–1081.
Hodge, V., & Austin, J. (2005). A binary neural $k$-nearest neighbour technique. Knowledge and Information Systems, 8, 276–291.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Kanerva, P. (1988). Sparse distributed memory. Cambridge, MA: MIT Press.
Kanter, I. (1988). Potts-glass models of neural networks. Physical Review A, 37(7), 2739–2742.
Knoblauch, A. (2003a). Optimal matrix compression yields storage capacity 1 for binary Willshaw associative memory. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Lecture Notes in Computer Science: Vol. 2714. Artificial Neural Networks and Neural Information Processing (pp. 325–332). Berlin: Springer-Verlag.
Knoblauch, A. (2003b). Synchronization and pattern separation in spiking associative memory and visual cortical areas. PhD dissertation, University of Ulm, Germany.
Knoblauch, A. (2005). Neural associative memory for brain modeling and information retrieval. Information Processing Letters, 95, 537–544.
Knoblauch, A. (2007). On the computational benefits of inhibitory neural associative networks (HRI-EU Report 07-05). Offenbach/Main, Germany: Honda Research Institute Europe.
Knoblauch, A. (2008a). Best-match hashing with inhibitory associative networks for real-world object recognition (HRI-EU Report 08-05). Offenbach/Main, Germany: Honda Research Institute Europe.
Knoblauch, A. (2008b). Closed-form expressions for the moments of the binomial probability distribution. SIAM Journal on Applied Mathematics, 69(1), 197–204.
Knoblauch, A. (2008c). Neural associative memory and the Willshaw-Palm probability distribution. SIAM Journal on Applied Mathematics, 69(1), 169–196.
Knoblauch, A. (2009). Neural associative networks with optimal Bayesian learning (HRI-EU Report 09-02). Offenbach/Main, Germany: Honda Research Institute Europe.
Knoblauch, A. (2010a). Optimal synaptic learning in non-linear associative memory. In Proceedings of the International Joint Conference on Neural Networks (pp. 3205–3211). Piscataway, NJ: IEEE.
Knoblauch, A. (2010b). Zip nets: Efficient associative computation with binary synapses. In Proceedings of the International Joint Conference on Neural Networks (pp. 4271–4278). Piscataway, NJ: IEEE.
Knoblauch, A. (2011). Neural associative memory with optimal Bayesian learning. Neural Computation, 23(6), 1393–1451.
Knoblauch, A. (2012). Method and device for realizing an associative memory based on inhibitory neural networks. U.S. Patent No. 8,335,752.
Knoblauch, A. (2016). Efficient associative computation with discrete synapses. Neural Computation, 28(1), 118–186.
Knoblauch, A. (2017). Impact of structural plasticity on memory formation and decline. In A. van Ooyen & M. Butz (Eds.), Rewiring the brain: A computational approach to structural plasticity in the adult brain (pp. 361–386). London: Academic Press.
Knoblauch, A., Körner, E., Körner, U., & Sommer, F. (2014). Structural plasticity has high memory capacity and can explain graded amnesia, catastrophic forgetting, and the spacing effect. PLOS One, 9(5), e96485, 1–19.
Knoblauch, A., Palm, G., & Sommer, F. (2010). Memory capacities for synaptic and structural plasticity. Neural Computation, 22(2), 289–341.
Knoblauch, A., & Sommer, F. (2016). Structural plasticity, effectual connectivity, and memory in cortex. Frontiers in Neuroanatomy, 10(63), 1–20.
Kosko, B. (1988). Bidirectional associative memories. IEEE Transactions on Systems, Man, and Cybernetics, 18, 49–60.
Krikelis, A., & Weems, C. (1997). Associative processing and processors. Piscataway, NJ: IEEE Press.
Kryzhanovsky, B., & Kryzhanovsky, V. (2008). A binary pattern classification using Potts model. Optical Memory and Neural Networks (Information Optics), 17(4), 308–316.
Kryzhanovsky, B., Litinskii, L., & Mikaelian, A. (2004). Vector-neuron models of associative memory. In Proceedings of the 2004 IEEE International Joint Conference on Neural Networks (pp. 909–914). Piscataway, NJ: IEEE.
Kryzhanovsky, V., Kryzhanovsky, B., & Fonarev, A. (2008). Application of Potts-model perceptron for binary patterns identification. In V. Kurkova-Pohlova, R. Neruda, & J. Koutnik (Eds.), Lecture Notes in Computer Science: Vol. 5163. Proceedings of the 18th International Conference on Artificial Neural Networks (pp. 553–561). Berlin: Springer-Verlag.
Lansner, A. (2009). Associative memory models: From the cell-assembly theory to biophysically detailed cortex simulations. Trends in Neurosciences, 32(3), 178–186.
Lansner, A., & Ekeberg, O. (1989). A one-layer feedback artificial neural network with a Bayesian learning rule. International Journal of Neural Systems, 1(1), 77–87.
Laurent, G. (2002). Olfactory network dynamics and the coding of multidimensional signals. Nature Reviews Neuroscience, 3, 884–895.
Memis, I. (2015). Vergleich von Kodierungsstrategien im Willshaw-Modell [Comparison of coding strategies in the Willshaw model]. Bachelor's thesis, University of Ulm, Germany.
Palm, G. (1980). On associative memories. Biological Cybernetics, 36, 19–31.
Palm, G. (1982). Neural assemblies: An alternative approach to artificial intelligence. Berlin: Springer.
Palm, G. (1987a). Associative memory and threshold control in neural networks. In J. Casti & A. Karlqvist (Eds.), Real brains–artificial minds. Amsterdam: North-Holland.
Palm, G. (1987b). Computing with neural networks. Science, 235, 1227–1228.
Palm, G. (1987c). On associative memories. In E. Caianiello (Ed.), Physics of cognitive processes (pp. 380–422). Singapore: World Scientific.
Palm, G. (1991). Memory capacities of local rules for synaptic modification: A comparative review. Concepts in Neuroscience, 2, 97–128.
Palm, G. (2012). Novelty, information and surprise. Berlin: Springer.
Palm, G. (2013). Neural associative memories and sparse coding. Neural Networks, 37, 165–171.
Palm, G., Knoblauch, A., Hauser, F., & Schüz, A. (2014). Cell assemblies in the cerebral cortex. Biological Cybernetics, 108(5), 559–572.
Palm, G., Schwenker, F., & Sommer, F. (1994). Associative memory networks and sparse similarity preserving codes. In V. Cherkassky, J. Friedman, & H. Wechsler (Eds.), From statistics to neural networks: Theory and pattern recognition applications (pp. 283–302). Berlin: Springer-Verlag.
Palm, G., & Sommer, F. (1992). Information capacity in recurrent McCulloch-Pitts networks with sparsely coded memory states. Network, 3, 177–186.
Palm, G., & Sommer, F. (1996). Associative data storage and retrieval in neural nets. In E. Domany, J. van Hemmen, & K. Schulten (Eds.), Models of neural networks (Vol. 3, pp. 79–118). New York: Springer-Verlag.
Pulvermüller, F. (2003). The neuroscience of language: On brain circuits of words and serial order. Cambridge: Cambridge University Press.
Robins, A., & McCallum, S. (1998). Catastrophic forgetting and the pseudorehearsal solution in Hopfield type networks. Connection Science, 7, 121–135.
Rolls, E. (1996). A theory of hippocampal function in memory. Hippocampus, 6, 601–620.
Sa-Couto, L., & Wichert, A. (in press). Storing object dependent sparse codes in a Willshaw associative network. Neural Computation.
Sacramento, J., Burnay, F., & Wichert, A. (2012). Regarding the temporal requirements of a hierarchical Willshaw network. Neural Networks, 25, 84–93.
Schwenker, F., Sommer, F., & Palm, G. (1996). Iterative retrieval of sparsely coded associative memory patterns. Neural Networks, 9, 445–455.
Sejnowski, T. (1977a). Statistical constraints on synaptic plasticity. Journal of Theoretical Biology, 69, 385–389.
Sejnowski, T. (1977b). Storing covariance with nonlinearly interacting neurons. Journal of Mathematical Biology, 4, 303–321.
Shannon, C., & Weaver, W. (1949). The mathematical theory of communication. Urbana, IL: University of Illinois Press.
Sommer, F. (1993). Theorie neuronaler Assoziativspeicher: Lokales Lernen und iteratives Retrieval von Information [Theory of neural associative memories: Local learning and iterative retrieval of information]. Hänsel-Hohenhausen.
Sommer, F., & Palm, G. (1998). Bidirectional retrieval from associative memory. In M. I. Jordan, M. J. Kearns, & S. A. Solla (Eds.), Advances in neural information processing systems, 10 (pp. 675–681). Cambridge, MA: MIT Press.
Sommer, F., & Palm, G. (1999). Improved bidirectional retrieval of sparse patterns stored by Hebbian learning. Neural Networks, 12, 281–297.
Steinbuch, K. (1961). Die Lernmatrix [The learning matrix]. Kybernetik, 1, 36–45.
Tsodyks, M., & Feigel'man, M. (1988). The enhanced storage capacity in neural networks with low activity level. Europhysics Letters, 6, 101–105.
Willshaw, D., Buneman, O., & Longuet-Higgins, H. (1969). Non-holographic associative memory. Nature, 222, 960–962.
Wu, F. (1982). The Potts model. Reviews of Modern Physics, 54, 235–268.
Yao, Z., Gripon, V., & Rabbat, M. (2014). A GPU-based associative memory using sparse neural networks. In Proceedings of the IEEE International Conference on High Performance Computing and Simulation (pp. 688–692). Piscataway, NJ: IEEE.