## Abstract

Neural associative memories (NAM) are perceptron-like single-layer networks with fast synaptic learning typically storing discrete associations between pairs of neural activity patterns. Gripon and Berrou (2011) investigated NAM employing block coding, a particular sparse coding method, and reported a significant increase in storage capacity. Here we verify and extend their results for both heteroassociative and recurrent autoassociative networks. For this, we provide a new analysis of iterative retrieval in finite autoassociative and heteroassociative networks that allows estimating storage capacity for random and block patterns. Furthermore, we have implemented various retrieval algorithms for block coding and compared them in simulations to our theoretical results and previous simulation data. In good agreement between theory and experiments, we find that finite networks employing block coding can store significantly more memory patterns. However, due to the reduced information per block pattern, it is not possible to significantly increase the stored information per synapse. Asymptotically, the information retrieval capacity converges to the known limits $C=\ln 2\approx 0.69$ and $C=(\ln 2)/4\approx 0.17$ also for block coding. We have also implemented very large recurrent networks with up to $n=2\cdot 10^6$ neurons, showing that the maximal capacity $C\approx 0.2$ bit per synapse occurs for finite networks having a size $n\approx 10^5$ similar to cortical macrocolumns.

## 1 Introduction

Neural associative memory (NAM), a simple artificial neural network that works as an associative memory, can be understood as a module of memory and learning by synaptic plasticity (for a review, see Palm, 2013). The retrieval of information from such a memory is typically not achieved by looking up a content under a final address, but rather by associating a meaningful output pattern with a meaningful input pattern. During the learning phase, associations are stored or learned by a form of local synaptic plasticity (the change of connection strength of one synapse depends only on the activity of the pre- and the postsynaptic neuron in the two patterns that are presented to be associated), typically a variant of Hebbian plasticity (Hebb, 1949). In classical associative memory, one distinguishes between heteroassociation and autoassociation concerning the identity of the output patterns to the input patterns. Heteroassociation is more similar to the technical address $\rightarrow$ content scheme, but it is different because the input patterns are also considered to be meaningful, and therefore their similarity (in terms of some vector distance, such as the Hamming distance or overlap for binary patterns) should reflect similarity of content and should be roughly preserved by the association mapping (similar inputs to similar outputs). Autoassociation is typically used for pattern completion or pattern correction.

NAMs originated in the 1960s, when Steinbuch's (1961) “Lernmatrix” and Gabor's (1969) “holographic memory” were probably the first concrete examples. They have been analyzed by mathematicians, engineers, physicists, and computer scientists, mostly in terms of their capacity to store a large number of memory patterns. In spite of the apparent simplicity of this measure, which roughly counts the number of patterns that can be stored in and (more or less completely) retrieved from the memory, and which is usually called memory capacity, or just capacity, by various authors, there are several subtle differences in their definitions that can result in large differences in the achievable values. So we are faced with a zoo of different capacity definitions, which we try to sort out in this letter.

The first analysis of NAM was provided by Willshaw, Buneman, and Longuet-Higgins (1969), showing a “memory capacity” of $\ln 2\approx 0.69$ bit per synapse in the limit of large networks. Palm (1980) used an information-theoretic criterion to optimize the parameters of a NAM for finite network size leading to an “information capacity.” He found that optimal capacity values are obtained for sparse patterns (binary patterns with only very few 1s). He also distinguished heteroassociation and autoassociation with asymptotic information capacities of $\ln 2$ and $\frac{1}{2}\ln 2$, respectively. The autoassociative NAM network was also considered as a dynamical system in theoretical physics of so-called spin glasses, notably by Hopfield (1982), who realized that overloading the network with memory patterns leads to a catastrophic breakdown of fixed-point retrieval (i.e., retrieving the stored patterns as fixed points of the autoassociative NAM dynamics). He defined capacity as the critical number of retrievable memories (normalized by network size), here called the critical pattern capacity, and found a value of about 0.14 for suboptimal nonsparse patterns. Later, the importance of sparseness was also recognized in the (spin-glass) physics community (Gardner, 1987, 1988; Amari, 1989); in particular, Tsodyks and Feigel'man (1988) found a critical pattern capacity of $\frac{1}{2}\ln 2$ for sparse patterns. For this result, the critical number of memories was further normalized by the information content $I(p)$ of one output bit, which is smaller than one for sparse patterns. This correction had previously been introduced by Willshaw et al. (1969) and Gardner (1987, 1988).

These results are not so relevant for practical applications, because in fixed-point retrieval (using already perfect input patterns) close to the critical pattern capacity limit, the memory can practically be used only as recognition memory (Palm & Sommer, 1992), not for retrieving a stored pattern from a similar but not identical input pattern (reconstruction memory). Thus, the information that can actually be retrieved from NAM at critical pattern capacity is much smaller, even close to zero in many cases. Correspondingly, the information capacity is smaller for autoassociative fixed-point retrieval, namely, $\leq\frac{1}{2}\ln 2$ (Palm, 1980; Palm & Sommer, 1992). In practical applications, one should consider sparse coding methods that create memory patterns in the near-optimal parameter range (Palm, 1987b; Bentz, Hagstroem, & Palm, 1989; Palm, Schwenker, & Sommer, 1994; Knoblauch, Palm, & Sommer, 2010), and in the case of autoassociation, one should also consider effective iterative retrieval methods (Schwenker, Sommer, & Palm, 1996) that allow retrieving the stored patterns from arbitrary parts, usually variations of dynamical fixed-point retrieval.

More recently, Gripon and Berrou (2011) rediscovered the inefficiency of the Hopfield model and the advantages of sparse coding. In fact, they introduced a particular sparse-coding method that was claimed to be better than previous results (see also Gripon & Rabbat, 2013; Aliabadi, Berrou, Gripon, & Jiang, 2014; Aboudib, Gripon, & Jiang, 2014; Ferro, Gripon, & Jiang, 2016), namely, block coding (the multiple 1-out-of-n code; see Palm, 1987c), also known as the Potts model in spin-glass physics (e.g., Wu, 1982; Kanter, 1988) and analyzed by Kryzhanovsky, Litinskii, and Mikaelian (2004), Kryzhanovsky, Kryzhanovsky, and Fonarev (2008), and Kryzhanovsky and Kryzhanovsky (2008), which is more recently called neural cliques or the Gripon-Berrou neural network (GBNN). They also invented a truly new iterative retrieval method for block coding yielding comparatively high information capacity, which even exceeds the asymptotic theoretical value for finite *n*, similar to other iterative retrieval methods introduced much earlier by Schwenker et al. (1996). Unfortunately, their methods are described in an unusual terminology and also refer to the critical pattern capacity (calling it “diversity”), so we have tried to translate their work into the usual NAM terminology and use information-theoretic capacity measures for a direct quantitative comparison. To this end, we had to extend previous information capacity definitions a bit. Previously, the information capacity was defined as the maximum information contained in the storage matrix (network connectivity matrix) about the set of patterns to be stored. Usually in information theory, one tries to find the maximum over all possible input pattern distributions. Practically, one often restricts the class of distributions (e.g., to independently generated patterns) and also tries to use some suitable retrieval methods to estimate the amount of information that can maximally be extracted.
If we want to compare different coding and retrieval methods, we have to include an explicit restriction of both the memory pattern distribution and the retrieval method into the capacity definition. Following this strategy, we found that in terms of information capacity, the only new and really interesting improvement found by Berrou and colleagues is the iterative retrieval method mentioned before: the so-called sum-of-max retrieval (Gripon & Berrou, 2012; Yao, Gripon, & Rabbat, 2014). Otherwise their results are well in the ballpark of other similar methods. In particular, their results on block coding do not affect asymptotic information capacities.

This letter is organized as follows. In section 2, we introduce the basic concepts and our mathematical terminology and distinguish the different capacity concepts in more detail. Then, in section 3, we describe the retrieval strategies for autoassociation and bidirectional heteroassociation, including retrieval strategies for block coding. In section 4, we analyze these methods in terms of information capacity, first for fixed network size $n$, then asymptotically for $n\rightarrow\infty$. In section 5, we present numerical experiments with randomly generated patterns used as a standard benchmark to compare various methods. In section 6, we discuss our asymptotic and numerical results and conclude the letter.

## 2 Basic Concepts and Research Questions

The *learning task* of NAM is to store associations between pairs of memory patterns $u^\mu$ and $v^\mu$ for $\mu=1,2,\ldots,M$ that may be interpreted as neural activation vectors or patterns of synaptically linked neuron populations $u$ and $v$. In the case of heteroassociation, $u$ and $v$ are two different populations, whereas $u$ and $v$ are identical for autoassociation. Most models employ a local immediate learning rule $R$ to determine the synaptic weight $w_{ij}$ from neuron $u_i$ to $v_j$,

*retrieving the stored patterns* from the memory. For heteroassociation, the most natural use is to retrieve the output $v^\mu$ from input $u^\mu$ (*pattern mapping*). Typically, the starting point is a noisy version $\tilde{u}^\mu$ of the original input, and the retrieval output $\hat{v}^\mu$ may not always be identical to the original output patterns. In an iterative fashion, one may also reconstruct $\hat{u}^\mu$ in addition to $\hat{v}^\mu$ (*bidirectional retrieval*; Kosko, 1988; Sommer & Palm, 1999). For autoassociation, the most natural retrieval method is to consider the feedback network with connectivity matrix $W$ and the dynamical neural network system (Gibson & Robinson, 1992; Schwenker et al., 1996),

*threshold regulation*; see section 3 and Palm, 1982) and perhaps in special criteria for stopping the iteration. There are two basically different approaches concerning the starting patterns: either one wants to verify that a pattern $u$ is indeed a fixed point, in which case $u^\mu$, or a pattern very close to $u^\mu$, is used as starting point (*fixed-point retrieval* or *recognition of stored patterns*), or one wants to find the next correct pattern $u^\mu$ from a substantially different starting pattern (*pattern correction* or *pattern completion*).

*output noise* $\hat{\epsilon}$, defined as the expected $L_1$-norm or Hamming distance $\Vert\cdot\Vert_1$ between retrieval outputs and the original patterns. For example, for heteroassociative pattern mapping, we define

We now give a general information-theoretic storage capacity definition that is independent of particular retrieval methods. Then we examine these retrieval methods in more detail to define the corresponding retrieval capacities.

### 2.1 General Information Storage Capacity

*information storage capacity*,

^{1}A simpler argument can be given by reduction of autoassociation to heteroassociation.

^{2}

This general definition of information capacity can be further restricted if one considers particular methods of pattern retrieval, which may introduce additional parameters that then have to be optimized. Of course, the more restricted optimization will tend to result in smaller capacity values in general. Here we distinguish four subforms of information storage capacity that can be expressed in terms of the transinformation between the stored patterns and the retrieved patterns. All following definitions of information storage capacity may be extended in a hardware-specific way in order to account for the minimal physical resources necessary to realize the network—for example, in the main memory (RAM) of digital hardware or in a synaptic network of the brain (Knoblauch et al., 2010; Knoblauch, Körner, Körner, & Sommer, 2014; Knoblauch & Sommer, 2016).

### 2.2 Mapping Capacity

*mapping capacity*,

^{3}Palm (1980) provided the first complete analysis of pattern mapping, showing a mapping capacity of $\ln 2$.

### 2.3 Completion Capacity

A complete analysis of the *completion capacity* is mathematically demanding. For one-step and two-step retrieval, it has been done by Schwenker et al. (1996). The optimal input pattern contains half of the 1s of a stored pattern, and the remaining 1s can be retrieved with a very low probability for additional wrong 1s in the retrieved pattern. This yields a capacity of $C_u=\frac{\ln 2}{4}$, which also is the asymptotic value for the completion capacity (Sommer, 1993). Interestingly, this value can be exceeded for finite $n$ (Schwenker et al., 1996). In section 2.4, we will see that the more restricted patterns used in block coding yield the same asymptotic capacity.

### 2.4 Bidirectional Capacity

*bidirectional capacity*,

### 2.5 Critical Pattern Capacity and the Sparse Limit

*(critical) pattern capacity* $M_\epsilon$ determining the maximum number of pattern associations $M$ that can be stored without $\hat{\epsilon}(M)$ from equation 2.4 exceeding a tolerated output noise level $\epsilon$,

In general, a comparison by $M_\epsilon$ is not meaningful as soon as the compared models store different types of pattern vectors that have, for example, different numbers of active units $k$. This means that one has to introduce an appropriate normalization; it is not sufficient to just divide $M_\epsilon$ by $n$ as introduced by Hopfield (1982).

For the Willshaw model, it turned out that sparse patterns maximize both the information capacity and the critical pattern capacity. It is therefore useful to define more formally the *sparse limit*. For each $n$, we assume that the patterns $u^\mu$ (or $v^\mu$) are drawn randomly and independently from the $\binom{n}{k}$ possible patterns with exactly $k$ 1s (and $n-k$ 0s), and that $k\sim\log n$.

^{4}

### 2.6 Recognition Capacity

In *recognition memory*, the task is to decide if an input pattern $\tilde{u}$ has already been stored previously. This two-class problem can be solved by employing an autoassociative network in the following way. First, the dendritic potentials $x_j:=\sum_{i=1}^{n} w_{ij}\tilde{u}_i$ are computed as in one-step retrieval. Second, the sum $S:=\sum_j \tilde{u}_j x_j$ over all dendritic potentials of active input units is computed. Third, the new input $\tilde{u}$ is classified as familiar (i.e., already stored) if $S$ exceeds some threshold $\Theta_C$; otherwise $\tilde{u}$ is classified as new. For example, for the Willshaw model, equation 2.2, with all original patterns having the same activity $k=\Vert u^\mu\Vert$, we can simply choose $\Theta_C=k^2$ because any previously stored pattern $\tilde{u}=u^\mu$ is represented in the weight matrix by a clique of size $k$ having $k^2$ connections. Equivalently, one may check if $\tilde{u}=u^\mu$ (or a superset thereof) is a fixed point for the dynamics (see equation 2.3) using $\Theta(t)=k$. For correctly computing the *recognition capacity* $C_{u,\mathrm{rcg}}$, it is important to see that for the recognition task, the completion capacity, equation 2.8, is zero (because for familiar inputs $\tilde{u}^\mu=u^\mu$, completion is not necessary). Instead, $C_{u,\mathrm{rcg}}$ follows from the information given by the binary class label associated with each potential input pattern $\tilde{u}$ (for more details, see appendix C). So a simple capacity definition would consider just the maximal number $M$ of patterns that after storing become fixed points of the autoassociative threshold dynamics, equation 2.3, that is,

*spurious states* are just the elements of $F\setminus M$. When we try to optimize $T(M,F)$, this leads to a kind of quality criterion restricting the number of spurious states. Indeed,

Thus, the recognition capacity apparently reaches the information storage capacity for autoassociation (see note 2), $C=\frac{\ln 2}{2}$, which corresponds to half the value of the critical pattern capacity and twice the value of the practically relevant completion capacity (see sections 2.3 and 2.5). For further details on how to compute $C_{u,\mathrm{rcg}}$, see appendix C. There we also show that it may actually be possible to exceed equation 2.16, but only for patterns that have an even sparser activity than in the sparse limit $k\sim\log n$. Specifically, we show $C\rightarrow\frac{\ln 2}{2}$ for $M\sim n^2/(\ln\ln\cdots\ln n)$ and $k$ being almost constant, assuming asymptotic independence of the synaptic weights (Knoblauch, 2008c).
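The three-step recognition procedure described above can be sketched in a few lines of NumPy (a minimal sketch assuming a binary autoassociative weight matrix; the function name `is_familiar` is our own):

```python
import numpy as np

def is_familiar(W, u_query, k):
    """Recognition sketch: a stored pattern with k active units forms a clique
    with k^2 connections in the autoassociative weight matrix W, so the summed
    dendritic potential S of its active units reaches Theta_C = k^2."""
    x = u_query.astype(np.int64) @ W   # dendritic potentials (as in one-step retrieval)
    S = int((u_query * x).sum())       # sum over active input units only
    return S >= k * k                  # compare with Theta_C = k^2
```

For a stored pattern, every one of its $k$ active units receives input from all $k$ active units, so $S=k^2$ exactly; random unfamiliar patterns typically fall well below this threshold.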

### 2.7 Random and Block Patterns for Maximal Capacity

Note that each subform of the general information storage capacity, equation 2.5, will result in a lower capacity value, because both the retrieval and the assumed pattern distribution are specifically restricted to technically reasonable assumptions, whereas the information storage capacity is simply defined as the maximal transinformation independent of the computability or practicality of its retrieval. In the following, we describe, analyze, and simulate a number of different retrieval procedures for the Willshaw model. Moreover, some retrieval methods rely on a particular coding of the memory patterns. For example, some procedures require that all stored patterns have an identical number $k$ of active units; others also require *block coding* to represent integer-type vectors where each binary pattern consists of $k$ blocks of size $N$, as illustrated in Figure 1A. Obviously, it would be necessary to compute the storage capacity for each combination of learning procedure, retrieval procedure, and distribution over the stored patterns. Because “channel capacity” commonly refers to the maximum transinformation, we focus on *maximum entropy distributions*, that is, random patterns subject to some constraints as required by the retrieval procedures. Specifically, for the classical retrieval methods, we choose *random patterns* $u^\mu,v^\mu$ uniformly from the set of $\binom{n}{k}$ patterns of size $n$ with activity $k$. Similarly, for the block coding methods, we choose $u^\mu,v^\mu$ uniformly from the set of $N^k$ possible *block patterns* (with $n=Nk$). To generate a noisy input $\tilde{u}$, we select in each case a fraction $\lambda k$ of the $k$ one-entries from one of the original input patterns $u^\mu$ at random ($0<\lambda\leq 1$), and, in general, we add $\kappa k$ false one-entries at random ($\kappa\geq 0$).^{5} (For more details on how the underlying pattern distributions affect stored information, see section A.1.)
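To make the pattern constructions concrete, here is a minimal NumPy sketch (the function names and the rounding of $\lambda k$ and $\kappa k$ to integers are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def block_pattern(k, N):
    """Binary block pattern: k blocks of size N, exactly one active unit per block."""
    u = np.zeros(k * N, dtype=np.uint8)
    u[np.arange(k) * N + rng.integers(0, N, size=k)] = 1
    return u

def random_pattern(n, k):
    """Binary pattern with exactly k active units drawn at random from n units."""
    u = np.zeros(n, dtype=np.uint8)
    u[rng.choice(n, size=k, replace=False)] = 1
    return u

def noisy_query(u, lam=0.5, kappa=0.0):
    """Keep a fraction lam of the k one-entries of u; add kappa*k false one-entries."""
    ones = np.flatnonzero(u)
    zeros = np.flatnonzero(u == 0)
    keep = rng.choice(ones, size=int(round(lam * len(ones))), replace=False)
    add = rng.choice(zeros, size=int(round(kappa * len(ones))), replace=False)
    q = np.zeros_like(u)
    q[keep] = 1
    q[add] = 1
    return q
```

For $\kappa=0$ the query is a pure subset (“pattern part”) of the original pattern, the setting used in most of the simulations below.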

## 3 Retrieval Procedures for Block Patterns and Random Patterns

*Retrieval* means finding a maximal biclique (or clique) that has maximal overlap with a given query pattern $\tilde{u}$ presented to the input layer. Here we assume pattern part retrieval where the query pattern $\tilde{u}$ contains a subset of $\lambda k$ of the $k$ one-entries of an original input pattern $u^\mu$ (where $0<\lambda\leq 1$). There are several possibilities for computing the retrieval output $\hat{v}$:

• *One-step retrieval (R1).* An output unit $v_j$ gets activated iff it is connected to at least $\Theta$ active input units $u_i$ with $\tilde{u}_i=1$, that is,

• *One-step retrieval with block constraint (R1B).* As the output patterns are block codes, we can exploit that there is only one active unit per block. Thus, given the R1 result $\hat{v}$, we can conclude for each one-entry $\hat{v}_j=1$ that it is correct if there is no second one-entry in the same block. By erasing all one-entries in ambiguous blocks of $\hat{v}$ that have multiple active units, we will effectively decrease output noise, equation 2.4, as soon as there are more than two active units in a block. Note also that by exploiting the block constraint, the retrieval output $\hat{v}$ again becomes a subset of the original output pattern $v^\mu$.
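For concreteness, R1 and R1B can be sketched as follows (a minimal NumPy version; `store` implements the binary Willshaw learning rule, and the threshold $\Theta$ is passed explicitly):

```python
import numpy as np

def store(U, V):
    """Willshaw learning: w_ij = 1 iff u_i^mu = v_j^mu = 1 for some mu
    (U, V are M x n binary pattern matrices)."""
    return (U.T @ V > 0).astype(np.uint8)

def R1(W, u_query, theta):
    """One-step retrieval: activate output units receiving at least theta
    inputs from active query units."""
    return (u_query.astype(np.int64) @ W >= theta).astype(np.uint8)

def R1B(W, u_query, theta, N):
    """R1 followed by the block constraint: erase all ambiguous output
    blocks containing more than one active unit."""
    v = R1(W, u_query, theta)
    b = v.reshape(-1, N)          # one row per block of size N
    b[b.sum(axis=1) > 1, :] = 0   # silence ambiguous blocks
    return b.reshape(-1)
```

In the toy example below (two stored associations, $k=2$ blocks of size $N=3$), a half query activates units from both memories; R1B silences the ambiguous block and returns a clean subset of the original output.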

• *Simple iterative retrieval with block constraint (sIRB).* As after doing R1B the output $\hat{v}$ is a subset of $v^\mu$, we can repeat the same procedure again, where input and output layers have changed their roles, and so on, leading to the following algorithm:

1. Let $\hat{u}^{(0)}:=\tilde{u}$ be the original input query and set $t:=0$.
2. Increase $t:=t+1$.
3. Compute the next output estimate $\hat{v}^{(t)}:=\mathrm{R1B}(\hat{u}^{(t-1)})$ by employing the R1B procedure already described on the previous estimate of the input pattern $\hat{u}^{(t-1)}$.
4. Compute the next input estimate $\hat{u}^{(t)}:=\mathrm{R1B}^T(\hat{v}^{(t)})$ by employing the “transposed” R1B procedure on the current output estimate $\hat{v}^{(t)}$ (from layer $v$ to layer $u$ with the transposed weight matrix $W^T$).
5. As long as a stopping criterion (e.g., $\hat{u}^{(t)}=\hat{u}^{(t-1)}$ or exceeding a maximal number of retrieval steps) is not reached, go to step 2.

Note that all estimates of inputs or outputs are subsets of the original patterns.
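The bidirectional iteration can be sketched in NumPy as follows (our own minimal implementation; exploiting the remark above, the R1B threshold in each step is set to the number of currently active units, since all estimates are subsets of the original patterns):

```python
import numpy as np

def r1b(W, query, theta, N):
    """One-step retrieval with block constraint (R1B), as described above."""
    v = (query.astype(np.int64) @ W >= theta).astype(np.uint8)
    b = v.reshape(-1, N)
    b[b.sum(axis=1) > 1, :] = 0   # erase ambiguous blocks
    return b.reshape(-1)

def sirb(W, u_query, N_u, N_v, max_steps=10):
    """Simple iterative retrieval with block constraint (sIRB):
    alternate R1B (u -> v) and its transpose (v -> u) until convergence."""
    u_hat, t = u_query.copy(), 0
    v_hat = np.zeros(W.shape[1], dtype=np.uint8)
    while t < max_steps:
        t += 1
        v_hat = r1b(W, u_hat, int(u_hat.sum()), N_v)        # step 3
        u_new = r1b(W.T, v_hat, int(v_hat.sum()), N_u)      # step 4
        if np.array_equal(u_new, u_hat):                    # step 5
            break
        u_hat = u_new
    return u_hat, v_hat
```

In the example below the two stored associations are disambiguatable, so a half query of the first input pattern recovers both the complete input and output patterns.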

• *Iterative retrieval with block constraint (IRB).* Following the last remark, it is obvious that the sIRB algorithm can be improved by OR-ing new estimates of output and input patterns with the previous estimates. Thus, after initializing $\hat{v}^{(0)}:=\{\}$ in step 1, we replace the operations of steps 3 and 4 by

^{6}We sometimes use a variant IRB-R1 where IRB is followed by an additional R1 step in order to increase retrieved information by constructing a superset (“halo”) of the original output pattern.

• *Iterative retrieval with block constraint and sum-of-max strategy (IRB-SMX).* In a valid block pattern, there is exactly one active neuron per block. This constraint suggests an interesting retrieval strategy where each neuron can receive at most one synaptic input per block (Gripon & Berrou, 2012; Yao et al., 2014). For that, equation 3.1 must be replaced by the following *R1B-SMX* procedure:

*IRB-SMX* starts with R1 in the first retrieval step and then continues with R1B-SMX (and R1B-SMX$^T$) in the remaining iterations:

1. $\hat{u}^{(0)}:=\tilde{u}$;
2. $\hat{v}^{(1)}:=\mathrm{R1}(\tilde{u})$;
3. $\hat{u}^{(1)}:=\text{R1B-SMX}^T(\hat{v}^{(1)})$;
4. $t:=1$;
5. $t:=t+1$;
6. $\hat{v}^{(t)}:=\hat{v}^{(t-1)}\cap\text{R1B-SMX}(\hat{u}^{(t-1)})$;
7. $\hat{u}^{(t)}:=\hat{u}^{(t-1)}\cap\text{R1B-SMX}^T(\hat{v}^{(t)})$;
8. IF $\hat{v}^{(t)}$ or $\hat{u}^{(t)}$ have changed (and $t$ has not reached its maximum value) THEN go to step 5.

For autoassociation, we can set $\hat{v}^{(t)}:=\hat{u}^{(t)}$ and skip steps 3 and 7. By simulation experiments, we have verified that our version performs equivalently to the original algorithm, which initializes by fully activating empty blocks (data not shown). In case $\tilde{u}$ is already a superset of an original pattern, initialization steps 2 and 3 must be skipped.
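The sum-of-max potential itself can be sketched as follows (our own simplified version of a single R1B-SMX step, assuming the query occupies all $k$ input blocks and using the maximal threshold $k$; the full IRB-SMX additionally iterates and intersects estimates as listed above):

```python
import numpy as np

def r1b_smx(W, query, N_in, k):
    """R1B-SMX sketch: the dendritic potential of each output unit counts at
    most one active synapse per input block (max within a block), summed over
    all k blocks; units reaching the maximal potential k remain active."""
    act = W * query[:, None]   # keep only synapses from active query units
    x = act.reshape(k, N_in, W.shape[1]).max(axis=1).sum(axis=0)
    return (x >= k).astype(np.uint8)
```

The benefit over plain R1 shows up for “halo” queries: if a block contains a wrongly active unit in addition to the correct one, the per-block max caps its contribution at 1, so spurious units cannot accumulate extra votes.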

• *Iterative retrieval of core and halo patterns.* Note that combining, for example, an iteration of R1 and R1B enables a retrieval scheme where both supersets and subsets of the original patterns can be retrieved, as illustrated by Figure 1B. Here we call a subset of an original pattern “core” and a superset “halo.” Obviously unions of cores are cores and intersections of halos are halos. By that, we may combine different strategies to find cores and halos to improve the retrieval result. For example, different cores can be obtained for block coding by:

Applying R1B to a core (with threshold $|$core$|$)

Applying R1B to a halo (with threshold $k$)

*R1B-cSMX*: Applying R1B-SMX to a halo followed by deactivating blocks with more than one active unit.

Similarly, different halos can be obtained for block coding by:

Applying R1 to a core (with threshold $|$core$|$)

Applying R1 to a halo (with threshold $k$)

Applying R1B-SMX to a halo

By combining the different core and halo procedures, it is in principle possible to improve iterative retrieval outputs at the cost of increased computing time. However, preliminary simulations have shown that such combinations yield only minor improvements (data not shown). At least, the variant *IRB-cSMX* of iterating *R1B-cSMX* has the property of minimizing output noise $\epsilon ^$ because in each silenced block, at least one or even more wrongly active neurons are eliminated.
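The core/halo bookkeeping behind such combinations is simple set algebra on binary vectors; a minimal sketch of the two combination rules stated above (unions of cores are cores, intersections of halos are halos):

```python
import numpy as np

def combine_cores(cores):
    """Union of cores (subsets of the original pattern) is again a core."""
    out = np.zeros_like(cores[0])
    for c in cores:
        out |= c
    return out

def combine_halos(halos):
    """Intersection of halos (supersets of the original pattern) is again a halo."""
    out = np.ones_like(halos[0])
    for h in halos:
        out &= h
    return out
```

Each additional core can only add correct units, and each additional halo can only remove wrong ones, which is why combining several core and halo procedures can in principle sharpen the retrieval result.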

We have compared the retrieval algorithms for block patterns to some *standard variants of iterative retrieval (IR) for random patterns*. For this, we have tested two implementations of IR:

• *IR-KWTA*: This is a $k$-winners-take-all (*KWTA*) strategy, setting the threshold to the largest possible value such that at least $k$ neurons get active.

• *IR-LK+*: This is another variant based on the *LK+* strategy introduced by Schwenker et al. (1996) for autoassociative networks. Here, the idea is to combine R1 in the first iteration (setting the threshold $\Theta=c:=\lambda k$ to the number of “correct” units in the input pattern) with AND-ing the outputs in further retrieval steps using a threshold $\Theta=k$ equal to the cell assembly size. This obviously yields a sequence of halo patterns, as each retrieval step yields a superset of the original memory. This idea generalizes to bidirectional retrieval for heteroassociation in an obvious way.
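The KWTA threshold choice can be sketched in two lines (our own helper; note that ties may activate more than $k$ units, which is why the criterion is “at least $k$ neurons get active”):

```python
import numpy as np

def kwta(x, k):
    """k-winners-take-all: choose the highest threshold such that at least
    k units fire, i.e., the k-th largest dendritic potential."""
    theta = np.sort(x)[-k]                 # k-th largest potential
    return (x >= theta).astype(np.uint8)   # ties can activate more than k units
```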

In all implementations of the described algorithms, we have included the following optimization steps: First, for autoassociation, we have optimized the iterative algorithms by AND-ing or OR-ing at the earliest possible time. Moreover, we limited the number of active units during each retrieval step to a maximum value (that depended on pattern size $k$ and was at least $\max(2k,1000)$) in order to prevent uncontrolled spreading of activity, which would otherwise result in a strong slowdown of simulations (as the time required for each retrieval step grows in proportion to the number of active units). In case activity exceeded the maximum value, iterative retrieval was aborted, and the result of the previous iteration was returned as the final retrieval output.

All simulations were performed using the PyFelix++ simulation tool (see Knoblauch, 2003b, appendix C) on multicore compute clusters (BWUniCluster at the Steinbuch Centre for Computing at the KIT, using a maximum of 32 cores and 64 GB RAM per simulation; and a custom Intel Xeon 2 GHz installation at University of Ulm with 70 cores and 1.5 TB RAM for large networks with up to $n=2\cdot 10^6$ neurons).

## 4 Analysis of Output Noise and Information Storage Capacity for Block Coding

For a detailed analysis of one-step retrieval for random activity patterns, see Knoblauch et al. (2010). The following section develops a similar analysis for block patterns (see Figure 1A), where the active unit of a block is selected uniformly and independently of the other blocks. Thus, the probability of a pattern unit being active is $\Pr[u_i^\mu=1]=\Pr[v_j^\mu=1]=1/N=:p$ independently for all blocks, and, as in the previous analysis, we assume that the dendritic potentials follow a binomial distribution (Knoblauch, 2008c).

### 4.1 One-Step Retrieval with Block Coding (R1B)

^{7}

^{8}

### 4.2 Iterative Retrieval for Block Coding

^{9}

^{10}Equivalently, the block reconstruction probability $p_{br}$ must exceed a corresponding threshold $L$,

^{11}Solving for the matrix load $p_1$ gives the upper bound:

*with* OR-ing, because the OR-ing becomes effective only after the second iteration of R1B (see equations 3.2 and 3.3), whereas the decision of a successful retrieval happens within the first two R1B steps, due to the exponential decrease of errors (see equation 4.2).

For autoassociative IRB with OR-ing, the previous analysis of heteroassociative sIRB and IRB must be slightly adapted to account for the identical neuron populations $u$ and $v$, where the OR-ing becomes effective after the first R1B step. In particular, there will be a larger fraction of correct one-entries in the output pattern after the first R1B step, because $\lambda k$ correct neurons are already known from the input $\tilde{u}$. It turns out that $\hat{\lambda}_{\max}$, $L$, $p_{1,\max}$, $M_{1,\max}$ have to be substituted by some related quantities $\hat{\lambda}_{\max,AA}$, $L_{AA}$, $p_{1,\max,AA}$, $M_{\max,AA}$. (For details, see section A.2.)

### 4.3 Information-Theoretic Capacity Measures

^{12}

### 4.4 The Limit of Large Networks

The Willshaw model is known to have three regimes of operation (Knoblauch et al., 2010, sec. 3.4). In the regime of balanced potentiation, the matrix load, equation 4.1, converges toward a value between zero and one, $0<p_1<1$, for large networks with $n\rightarrow\infty$. By equation B.1, it corresponds to the sparse limit of section 2.5 with $k\sim\mathrm{ld}\,n$, where the basic Willshaw model can store a positive amount $0<C\leq\ln 2\approx 0.69$ bit of information per synapse $W_{ij}$. In the other two regimes, $C\rightarrow 0$, as the weight matrix contains either (almost) only zeros or only ones, such that the entropy of a synaptic weight $W_{ij}$ approaches zero. These so-called sparse and dense potentiation regimes are actually very interesting if the network employs additional mechanisms like compression of the weight matrix or structural synaptic plasticity (Knoblauch, Körner, Körner, & Sommer, 2014; Knoblauch, 2017). However, here we will focus only on the basic Willshaw model with balanced potentiation, evaluating the analyses of sections 4.1 to 4.3 for the *sparse limit* with $n,k,N\rightarrow\infty$, $p=\frac{k}{n}=\frac{1}{N}\rightarrow 0$, $L=\lambda+\frac{r}{k}$, $0<r<k$, $p\ln L\rightarrow 0$, and $k\sim\log n$ corresponding to fixed $0<p_{1,\max}<1$ for constant $\lambda,\epsilon$. Typically, $r$ is constant or $r\sim k$.

^{13}

## 5 Simulation Experiments

This section evaluates and compares the mapping, completion, bidirectional, and pattern capacities of various model variants by means of storage and retrieval of randomly generated patterns. As in section 4, we assume that each stored pattern of size $n$ has exactly $k$ active units. For the block pattern algorithms, we use block patterns where each block of size $N=n/k$ has exactly one randomly drawn active unit (see Figure 1A). For the other unconstrained algorithms, we use random patterns where the $k$ active units are drawn randomly from the $n$ neurons. All patterns are generated independently of each other. It is generally believed that patterns generated at random maximize stored information, but note that the information storage and the completion capacity introduced in section 2 are practically not accessible to simulation studies (because all combinatorially possible patterns have to be considered) and also less relevant for practical applications. Unless otherwise specified, we use input patterns that have a subset $\lambda k$ of the active units of the original patterns, but no additional active units $\kappa k=0$ (see section 2.7).^{14} All theoretical estimates use a conservative value $r=1$ for the average improvement in the first retrieval step (see equation 4.7). Iterative retrieval was limited to a maximum of 10 iterations but could be stopped before the tenth iteration if the output pattern was identical to the previous iteration or if an activity explosion was detected (number of active units becoming larger than $\max(1000,2k)$), which may occur in some model variants (e.g., IR-LK+) if exceeding the critical pattern capacity. Each data point is obtained from averaging over 50,000 retrievals in 10 different networks (5000 per network).

### 5.1 Validation of Network Implementations

To validate our network implementation, we first tried to reproduce some reference results of previous work (Memis, 2015; Schwenker et al., 1996; Gripon & Berrou, 2011). Figure 3 shows output noise and storage capacity as functions of the number of stored memories $M$ for various model variants of a relatively small autoassociative network of $n=4096$ neurons and $k=16$ blocks or active units per pattern. We tested retrieval with half-input patterns having $k/2$ active units from the stored memory patterns ($\lambda=0.5$) to maximize completion capacity (see equation 4.23).

Our data for IR-LK+ (red dashed line) and IRB-R1 (blue dash-dotted line) tightly reproduce the output noise data from the earlier works (big circle and triangle markers), thus validating our implementations of learning and retrieval. As reported previously, the block coding algorithms (sIRB, IRB, IRB-R1, IRB-SMX, IRB-cSMX) significantly reduce output noise compared to the models without block coding (R1, IR-LK+, IR-KWTA). Correspondingly, block coding can significantly increase pattern capacity $M_\epsilon$, defined as the maximum number of memories that can be stored at a tolerated noise level $\epsilon$. Yet, despite this increase in pattern capacity, we could not observe a corresponding increase in maximum storage capacity $C_u$ for most block coding models. That is, IR-LK+ and IR-KWTA can store more information per synapse than sIRB, IRB, IRB-R1, and IRB-cSMX. Only the halo-type sum-of-max strategy IRB-SMX can slightly exceed the maximum capacity of the standard models, where peak capacities occur at relatively high output noise levels (Buckingham & Willshaw, 1992). The reason for this discrepancy is that each block pattern bears significantly less information than a random pattern without block coding (see Figure 11), which compensates for the increase in pattern capacity. Neglecting this fact (e.g., by computing $C_u$ using equation A.4 instead of equation A.11) can significantly overestimate the information storage capacity for block coding (blue triangles in Figure 3; compare to the blue dash-dotted curve of IRB-R1). We have also simulated the further block coding variants mentioned in the model section (e.g., combining various core and halo strategies), but they turned out to provide only minor improvements over IRB and IRB-R1 and could not exceed the performance of IRB-SMX and IRB-cSMX (data not shown).
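
To make the algorithmic difference concrete, the following Python sketch contrasts the two threshold strategies for a single retrieval step in a binary Willshaw network: $k$-winners-take-all over all $n$ units versus a per-block maximum. This is an illustrative simplification (the function names are ours), not the exact IR-KWTA or IRB implementations with their OR-ing and stopping rules:

```python
import numpy as np

def kwta_step(W, u, k):
    """One retrieval step with k-winners-take-all: activate the k
    units with the highest dendritic potentials x = W u (ties are
    broken by unit index via the stable sort)."""
    x = W @ u
    v = np.zeros_like(u)
    v[np.argsort(-x, kind="stable")[:k]] = 1
    return v

def block_max_step(W, u, k, N):
    """One retrieval step for block coding: in each of the k blocks
    of size N, activate every unit attaining the block maximum of
    the dendritic potential (an all-zero block stays ambiguous and
    activates the whole block in this simplified sketch)."""
    x = W @ u
    v = np.zeros_like(u)
    for b in range(k):
        blk = x[b * N:(b + 1) * N]
        v[b * N:(b + 1) * N] = (blk == blk.max()).astype(u.dtype)
    return v
```

Note that the per-block maximum activates all units attaining the block maximum, so an ambiguous block yields several candidate units rather than a unique winner.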

### 5.2 Testing the Quality of Our Theory

Next, we tested the quality of our theory developed for the block coding model variants sIRB and IRB in section 4 and for the standard iterative or bidirectional retrieval procedures IR-LK+ and IR-KWTA in appendix B. To this end, Figure 4 shows output noise $\hat\epsilon$ as a function of the number of stored memories $M$ for various network sizes $n=kN$ using pattern activity $k=2\,\mathrm{ld}(N)$ and block size $N=2^{k/2}$, chosen optimally to maximize information storage capacity for half-input patterns ($\lambda=0.5$, $\kappa=0$). For each network size $n$, our theory (equations 30, 69, 73, and 79) provides the maximal pattern number $M_{\max,\mathrm{th}}$ that can be stored at a tolerated noise level $\epsilon=0.01$ (see the legends). This value can be compared to the actual pattern capacity $M_{\max}$ at which the simulated output noise reaches level $\epsilon$. To allow comparison over different network sizes, we have normalized the pattern number $M$ to the theoretical maximum $M_{\max,\mathrm{th}}$.

It can be seen that the curves of $\hat\epsilon$ as a function of $M/M_{\max,\mathrm{th}}$ converge to the Heaviside function as predicted by our theory, equation 4.18. Note that for $M\to\infty$, IRB and sIRB have limited output noise $\hat\epsilon\to\epsilon_\infty\le 1$ because retrieval outputs are always subsets (“cores”) of the original memory patterns or employ OR-ing in each iteration step ($\epsilon_\infty=1$ for heteroassociative sIRB; $\epsilon_\infty=0.75$ for heteroassociative IRB; $\epsilon_\infty=0.5$ for both autoassociative IRB and sIRB due to the implicit OR-ing mentioned in section 3). By contrast, $\hat\epsilon\to\infty$ for IR-KWTA.

The accuracy $M_{\max}/M_{\max,\mathrm{th}}$ of our theory generally improves with network size $n$. For example, for heteroassociative sIRB and $k=10,12,16,20,24,28$, corresponding to network sizes $n=kN=k\cdot 2^{k/2}$, the estimated accuracy increases as 71%, 80%, 89%, 92%, 94%, and 95%. Similarly, for autoassociative sIRB and $k=10,\dots,28$, accuracy increases as 57%, …, 93%. Interestingly, the pattern capacity for autoassociation is significantly higher than for heteroassociation. It can be seen that our theory provides lower bounds of $M_{\max}$ for both heteroassociation and autoassociation and relevant network sizes. Additionally, for heteroassociation, we can use the theory for autoassociation to obtain upper bounds of $M_{\max}$. For heteroassociative IR-KWTA, the theory of appendix B slightly underestimates the true storage capacity for large networks but still converges in the limit $n\to\infty$.

### 5.3 Comparing Capacities for Different Model Variants

Next, we performed a direct comparison of the critical pattern capacity $M_\epsilon$ and the information storage capacity $C_\epsilon$ (in bit/synapse) for different model variants at output noise level $\epsilon=0.01$. We considered autoassociation and heteroassociation in both small and large networks ($n=16\cdot 256=4096$ versus $n=22\cdot 2048=45{,}056$) and tested over the whole range of relevant pattern activities $k$ between 4 and $n/2$ that allowed a division of the $n$ neurons into $k$ blocks of size $N=n/k$ (specifically, we tested $k=4$, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 5632, 11,264, and 22,528). The results are summarized in Figures 5 ($M_\epsilon$) and 6 ($C_\epsilon$). Simulations (markers) are again compared to theory (lines). Theoretical values for maximal $M_\epsilon$ were computed as described in the previous section (using equations B.1 and B.2 for R1). $C_\epsilon$ was computed from $M_\epsilon$ using equation A.4 for random coding (R1, IR-KWTA, IR-LK+) and equations A.13 to A.15 for block coding (IRB, IRB-SMX).
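
The conversion from pattern capacity to information storage capacity can be sketched as follows. This simplified Python snippet (our own naming; it ignores output noise and therefore only mimics the noiseless core of equations A.4 and A.13 to A.15) divides the total pattern information by the $n^2$ synapses:

```python
import math

def info_random(n, k):
    """Information (bit) of one random k-of-n pattern: ld C(n, k)."""
    return math.log2(math.comb(n, k))

def info_block(k, N):
    """Information (bit) of one block pattern: each of the k blocks
    of size N contributes ld N bit."""
    return k * math.log2(N)

def capacity(M, n, k, block=False):
    """Stored information per synapse (bit) in an n x n network,
    assuming all M patterns are retrieved perfectly (output noise
    is ignored, so this is only an upper-bound sketch)."""
    T = info_block(k, n // k) if block else info_random(n, k)
    return M * T / (n * n)
```

For equal $M$, the block-coded capacity is smaller because a block pattern carries less information than a random $k$-of-$n$ pattern.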

In general, the theory again fits the simulations well. In particular, the theory for one-step retrieval (R1) precisely predicts the true capacities already for relatively small networks unless pattern activities $k$ become large (Knoblauch, 2008c; Knoblauch et al., 2010). For small $k$, the theory slightly underestimates the true values, whereas for large $k$, it significantly overestimates both $M_\epsilon$ and $C_\epsilon$. Compared to R1, iterative retrieval procedures like IR-KWTA and IRB can significantly increase storage capacity if the initial inputs are incomplete ($\lambda<0.5$; top and middle panels). For very sparse coding with $k\le\mathrm{ld}\,n$, this increase can be more than an order of magnitude. For large $k\to n$, storage capacity becomes very small, and all models tend toward identical pattern capacity $M_\epsilon$. Still, the information storage capacity $C_\epsilon$ is larger for models employing random patterns (IR-LK+, IR-KWTA, R1) because the information per block pattern is significantly decreased compared to random patterns. For medium $k$, the block pattern models IRB and IRB-SMX can store significantly more patterns but less information than IR. For extremely small activity $k$, IRB can store fewer patterns than IR-KWTA, whereas IRB-SMX has a slightly larger pattern capacity. Among the block pattern models, IRB-SMX has significantly higher storage capacity than IRB only for small and medium $k$, not for large $k$.

As expected, autoassociation can store more patterns than heteroassociation for most $k$. However, surprisingly, for very sparse patterns (e.g., $k=4$), heteroassociation performs better than autoassociation. For example, for IR-KWTA with $n=45{,}056$ and $k=4$, heteroassociation can store $M_{\max}\approx 1.45\cdot 10^6$ patterns, whereas autoassociation can store only $0.780\cdot 10^6$ patterns. Similarly, heteroassociative IRB and IRB-SMX can store $M_{\max}\approx 0.445\cdot 10^6$ and $M_{\max}\approx 1.49\cdot 10^6$ patterns, whereas autoassociative IRB and IRB-SMX can store only $M_{\max}\approx 0.437\cdot 10^6$ and $M_{\max}\approx 0.878\cdot 10^6$ patterns, respectively.

Note that the fit of theory to data depends on the choice of the parameter $r$, defined as the average improvement of the first retrieval step (see equations 4.7, 4.8, and A.22). We have chosen $r=1$, which obviously implies upper bounds for storage capacity. For larger $r$, our theory predicts lower capacities. For example, for $r=0.25k$, the theory curves are shifted toward smaller values, and we get better fits, in particular for larger $k$ (data not shown). This is consistent with the finding that all models become equivalent to R1 for large $k$; that is, the first retrieval step will then typically complete the pattern almost perfectly ($r=(1-\lambda)k$).

For complete input patterns with $\lambda=1$ (bottom panels), block coding and iterative retrieval methods cannot improve over one-step retrieval R1. In fact, without stabilizing the input activity in population $u$ (e.g., by OR-ing or AND-ing), iterative retrieval will typically deteriorate retrieval outputs at the capacity limit because fixed points occur at nonzero noise (see Figure 2). For a small output noise level $\epsilon=0.01$, this deterioration is negligible, as can be seen in our data. Here, all models perform almost identically to one-step retrieval. Only when measuring output noise by simple (instead of weighted) averaging may the storage capacity seem to differ significantly for the IRB-type models retrieving core patterns. For example, for large $n=45{,}056$ at $k=4$, weighted averaging yields $M_{\max}\approx 4.01\cdot 10^6$, $4.01\cdot 10^6$, $3.93\cdot 10^6$, and $3.96\cdot 10^6$ for R1, IR-KWTA, IRB, and IRB-SMX, whereas simple averaging would predict $M_{\max}\approx 4.01\cdot 10^6$, $3.97\cdot 10^6$, $4.70\cdot 10^6$, and $3.96\cdot 10^6$ instead (data not shown). Here, the seemingly large value for IRB results from simple averaging $\hat\epsilon=\frac{\hat\epsilon_u+\hat\epsilon_v}{2}=\frac{\hat\epsilon_v}{2}$ because OR-ing preserves perfect inputs ($\hat\epsilon_u=0$) and thus tolerates double the noise $\hat\epsilon_v\le 2\epsilon=0.02$ in the output population compared to R1 (and the other models with balanced noise).

### 5.4 Gains in $M$, $C$, and $p_1$ for Block Coding and Iterative Retrieval

The previous figures show the capacities $M$ and $C$ on a logarithmic scale. To get more precise quantitative judgments of the improvements of block coding and iterative or bidirectional retrieval over one-step retrieval, Figures 7 and 8 illustrate the gain factors $g_M$, $g_C$, and $g_{p_1}$ defined in equations B.4 to B.9 and compare theoretical results (lines) to simulation data (markers). Specifically, we compare block coding (IRB), iterative retrieval for random coding (IR-KWTA), and one-step retrieval for random coding (R1) by taking the quotients of the relevant quantities of pattern capacity $M$, information storage capacity $C$, and matrix load $p_1$ (the fraction of one-entries in the weight matrix). For example, for the comparison IRB versus R1 (top panels), $g_M=\frac{M_{\max,\mathrm{IRB}}}{M_{\max,\mathrm{R1}}}$ is the quotient of the maximal pattern capacities of IRB and R1. Similarly, we have compared IRB versus IR-KWTA (middle panels) and IR-KWTA versus R1 (bottom panels).

Figure 7 illustrates the gain factors for autoassociation in small and large networks. The most significant increase of IRB over R1 (top panels) occurs for small $k$ (i.e., for sparse and balanced potentiation with $p_1\le 0.5$), where $M$ and $C$ for IRB may be more than double the values for R1, whereas the gains approach 1 for large $k$ (corresponding to dense potentiation with $p_1\to 1$) or become even smaller (in the case of $g_C$, due to the smaller information content of block patterns).

It is also visible that the increase in $M$ and $C$ for sparse potentiation ($p_1<0.5$, $k<\mathrm{ld}\,n$) implies an increase in the matrix load $p_1$. We have argued previously that such sparsely potentiated networks are efficient only with structural compression of the weight matrix, for example, by Huffman/Golomb coding or structural plasticity (Knoblauch, 2003a; Knoblauch et al., 2010). This means that the increase in $M$ will be counteracted by the weight matrix becoming less compressible. For dense potentiation ($p_1>0.5$, $k>\mathrm{ld}\,n$), both effects work in the same direction. The (modest) increase in $M$ will further increase the matrix load $p_1\to 1$ such that the network will become even more compressible, rendering networks that can store more memory patterns while requiring less physical memory to represent them.

Here, our main interest is in quantifying the gains of block coding. Comparing the two iterative procedures IRB and IR-KWTA (middle panels), we observe that IRB can store more patterns $M$ than IR-KWTA only for balanced or dense potentiation, $p_1\ge 0.5$. For sparse potentiation, $p_1\ll 0.5$, IRB does not improve $M$ over IR-KWTA. It is again visible that IRB cannot improve $C$ over IR-KWTA due to the reduced information content of block patterns (see Figure 11). Comparing IR-KWTA versus R1 (bottom panels) yields similar gains as IRB versus R1. This indicates that a significant part of the gain of IRB over R1 can be credited to the iterative retrieval procedure of IRB rather than to block coding.

Figure 8 shows corresponding data for heteroassociation, largely confirming the results for autoassociation. The theory for heteroassociation is even more precise than that for autoassociation. The fit of theory to simulations could be further improved by selecting more appropriate values than $r=1$ for the average improvement in the first retrieval step (see equation 4.7; data not shown).

### 5.5 Asymptotic and Maximal Information Storage Capacity

Asymptotic information storage capacities have long been known to be $C_v=\ln 2\approx 0.69$ bit per synapse for heteroassociation and $C_u=\ln 2/4\approx 0.173$ bit per synapse for autoassociative pattern completion (see equations 4.21 to 4.23). While it had earlier been assumed that the maximal storage capacity would be identical to the asymptotic capacity (e.g., Willshaw et al., 1969; Palm, 1980), later studies observed in simulation experiments that the completion capacity of finite autoassociative networks can actually exceed the asymptotic limit and continues to increase for viable network sizes (Schwenker et al., 1996). Therefore, one may question the theoretical bounds or even assume that the asymptotic limit for “optimal retrieval” in autoassociative networks may be larger than $\ln 2/4$. To clarify these questions and find the maximum capacity, we simulated iterative pattern completion in very large Willshaw networks having up to $n\approx 2.1\cdot 10^6$ neurons and up to $n^2\approx 4.4\cdot 10^{12}$ synapses.

Some results are displayed in Figure 9, showing capacity data for autoassociative networks with input noise and pattern activity chosen optimally to maximize the completion capacity $C_u$, that is, $\lambda=0.5$ and $k=2\,\mathrm{ld}\frac{n}{k}$ (see equation 4.23). For block coding, this choice corresponds to block size $N=2^{k/2}$ and network size $n=kN=k\cdot 2^{k/2}$. Each curve shows $C_u$ as a function of $M$ normalized to the maximum pattern number (at noise level $\epsilon=0.01$) estimated by our theory, similar to Figure 4. The insets show enlarged plots of the maximum region for each model variant. For example, IR-LK+ (panel C) reaches maximum capacity $C_u\approx 0.191$ bit per synapse for a network size of $n=98{,}304$ (and $k=24$, $N=4096$). The next smaller simulated network with $n=45{,}056$ (and $k=22$, $N=2048$) achieves almost the same value, such that the true maximum likely lies between 50,000 and 100,000 neurons. Although the maximum seems rather flat, it is nevertheless remarkable that it occurs at about the same size as a cortical macrocolumn of 1 mm$^3$ (about $n=10^5$ neurons; Braitenberg & Schüz, 1991), for which Willshaw networks are often used as generic models (Palm, 1982; Palm, Knoblauch, Hauser, & Schüz, 2014; Knoblauch & Sommer, 2016). For larger $n>10^5$, the information capacity $C_u$ decreases again toward the asymptotic value $C_u\to\frac{\ln 2}{4}\approx 0.173$ bit per synapse. Similar results are visible for model variant IR-KWTA (panel B), which achieves its maximum $C_u\approx 0.200$ more unequivocally at the larger network size $n=98{,}304$.

Block coding with IRB (panel A) seems to reach its maximum capacity $C_u\approx 0.177$ at a larger network size around $n=983{,}040$ (and $k=30$, $N=32{,}768$). As the experimental results were very close ($C_u=0.1767$ for $n=30\cdot 32{,}768=983{,}040$; $C_u=0.1763$ for $n=32\cdot 65{,}536=2{,}097{,}152$) and we could not simulate networks larger than $n=2{,}097{,}152$ due to hardware limitations, it may also be that the maximum occurs at slightly larger $n$. For one-step retrieval R1 (panel D), we could not observe the maximum capacity for viable network sizes, but the performance of R1 will obviously be bounded by that of IR-KWTA and IR-LK+.

Figure 10 shows maximum capacity as a function of network size for the various models (data correspond to Figure 9). The data support the conclusion that all models, including IRB and IRB-SMX, have an asymptotic capacity $C_u=\frac{\ln 2}{4}$, where maximum capacity occurs at a finite network size comparable to a cortical macrocolumn (Braitenberg & Schüz, 1991). The maximum capacity $C_u\approx 0.200$ bit per synapse for standard iterative retrieval is obtained for the $k$-winners-take-all strategy (IR-KWTA) and occurs around $n\approx 10^5$. This value is slightly exceeded by block-coding retrieval with the sum-of-max strategy (IRB-SMX), which reaches $C_u\approx 0.204$ bit per synapse for $n$ between 50,000 and 100,000.^{15}

## 6 Discussion and Conclusion

Motivated by previous promising results (Gripon & Berrou, 2011; Gripon & Rabbat, 2013; Aliabadi et al., 2014; Aboudib et al., 2014; Ferro et al., 2016; Memis, 2015), we have investigated how block coding can improve retrieval quality and memory capacity for Willshaw-type associative networks employing iterative or bidirectional retrieval (Willshaw et al., 1969; Palm, 1980; Schwenker et al., 1996; Sommer & Palm, 1998, 1999). To this end we have analyzed a number of different network and retrieval models and validated our theory by simulation experiments.

For many practical applications of NAMs, the asymptotic results (which actually are unaffected by the recent developments) have to be complemented by concrete optimizations of network parameters and retrieval procedures for large, finite memories. For this purpose, the use of randomly generated patterns as a benchmark is well established for several reasons. First, independent random patterns seem the simplest and most generic assumption, allowing comparison to a large body of previous analyses and simulation experiments of various NAM models. Second, random patterns are thought to be optimal to maximize pattern and information capacity defined in section 2, thereby providing an upper bound for real-world applications. Third, there are various recognition architectures that actually employ NAM mappings with random patterns (e.g., Palm, 1982; Kanerva, 1988; Knoblauch, 2012). Fourth, it is known that activity and connectivity patterns of various brain structures that are thought to work as NAM have random character (Braitenberg, 1978; Braitenberg & Schüz, 1991; Rolls, 1996; Albus, 1971; Bogacz, Brown, & Giraud-Carrier, 2001; Laurent, 2002; Pulvermüller, 2003; Lansner, 2009). In many real applications, the patterns to be stored will of course not be random, and they have to be coded into sparse binary patterns fitting the parameters of optimal or near-optimal NAM configurations (Palm, 1987a; Austin & Stonham, 1987; Krikelis & Weems, 1997; Hodge & Austin, 2003; Palm, 2013; Sa-Couto & Wichert, in press).

Here we adapted a previous finite-size analysis of one-step retrieval (Knoblauch et al., 2010) to iterative and bidirectional retrieval with and without block coding (see section 4 and appendix B). In contrast to the previous analyses, our theory allows not only estimating retrieval errors for a given network size and memory number, but also directly computing the pattern capacity $Mmax$ and the information storage capacity $Cmax$ (in bit/synapse) for a given network and tolerated noise level. Although the theory becomes exact only in the limit of large networks, it already provides reasonable approximations for finite networks and captures most effects that can be seen in simulation experiments when comparing the different network and retrieval models.

The most important finding is that block coding can significantly increase $M$ but not $C$. As the error-correcting capability of block coding reduces output noise, it is possible to store more memories $M$ at a maximal tolerated noise level $\epsilon$. This increase is actually strongest for pattern activities where the Willshaw model is most efficient, that is, for patterns having about $k\sim\mathrm{ld}\,n$ active units, and is in the range between 10% and 20% for relevant network sizes (see Figures 7C, 7D, 8C, and 8D). By contrast, it is not possible to significantly increase the information $C$ that a synapse can store. The main reason is that a block pattern contains less information than a random pattern (see Figure 11 and section A.1). Thus, unfortunately, the increase in pattern number $M$ is mostly compensated by the decrease in pattern information such that the resulting $C$ typically decreases by about 10% to 20% for block coding (see Figures 7 and 8). Only the optimal “sum-of-max” block-coding strategy IRB-SMX can increase $C$ by a few percent (see Figures 3 and 10).

While our theory is valid for a standard iterative retrieval procedure for block coding (IRB), we have simulated a number of further optimized retrieval variants (see section 3). Although these variants can further increase pattern capacity by 10% to 20% compared to IRB, they can rarely exceed the traditional sparse coding models in terms of stored information per synapse. Other drawbacks are the relatively complicated implementations that are difficult to interpret neurobiologically and consume more computing time. Still, they may offer useful applications for fast, approximate nearest-neighbor search as suggested previously for associative networks (Palm, 1987b; Bentz et al., 1989; Hodge & Austin, 2005; Knoblauch, 2005, 2007, 2008a; Knoblauch et al., 2010; Knoblauch, 2012; Sacramento, Burnay, & Wichert, 2012; Ferro et al., 2016; Gripon, Löwe, & Vermet, 2018).

Another result of this study is a better understanding of iterative retrieval at finite network size. For example, although the asymptotic capacity of one-step retrieval has long been known to be $\ln 2\approx 0.69$ for heteroassociation and $\ln 2/4\approx 0.173$ for autoassociation (Willshaw et al., 1969; Palm, 1980; Palm & Sommer, 1992, 1996), simulations of iterative retrieval revealed that the storage capacity of finite networks increases beyond these values (Schwenker et al., 1996). Therefore, one might question these asymptotic values, in particular for iterative retrieval in autoassociation, where the theoretical analysis is still incomplete and it might be possible to get close to the upper bound of $(\ln 2)/2$ derived by Palm (1980) and in note 2. Here we explain at least the finite-size effects of iterated one-step retrieval. In simulations of very large networks (with up to $n>2\cdot 10^6$ cells), we have shown that the autoassociative completion capacity has its global maximum of about $C\approx 0.20$ bit per synapse at a network size between $n=50{,}000$ and $n=100{,}000$. Remarkably, this coincides with the generic size of a cortical macrocolumn, for which the Willshaw model has often been used as a model (Braitenberg & Schüz, 1991; Palm, 1982; Palm et al., 2014; Knoblauch & Sommer, 2016). Together with the well-known empirical fact that small networks ($n<1000$) have a much smaller capacity, our theory can easily explain the phenomenon of a unique global maximum: it predicts for noisy input patterns (with $\lambda<1$) that the maximal completion capacity $C_{\max}(n)$ is a decreasing function for large network size $n\to\infty$, where the theory becomes exact. By contrast, in networks with close to zero input noise ($\lambda\approx 1$), which would be optimal for heteroassociation, the asymptotic capacity is approached from below.

Networks employing simple block coding (IRB) can store more patterns, although they have a slightly lower maximal capacity $C\approx 0.18$ occurring for larger networks around $n=10^6$. Otherwise, they seem to behave qualitatively similarly to IR-KWTA and IR-LK+. For optimized retrieval with the sum-of-max strategy (IRB-SMX), the maximum capacity 0.204 is even slightly larger than for IR-KWTA and occurs at a slightly smaller network size (around $n=50{,}000$). We still cannot exclude that there may exist even better retrieval algorithms exceeding this maximum (Palm, 1980). For example, in some modified memory tasks like familiarity detection, it is known that autoassociative networks can be used in a way that achieves up to $(\ln 2)/2\approx 0.347$ bit/synapse (Bogacz et al., 2001; Bogacz & Brown, 2003; Palm & Sommer, 1996). At least for the pattern completion task, this seems impossible, and if such retrieval algorithms existed, we believe they would be computationally very inefficient compared to one-step or iterative retrieval (Knoblauch et al., 2010).

We conclude that block coding in finite networks can, in some parameter ranges, modestly increase pattern storage capacity, whereas the improvement is often negligible (or even absent) when measured as stored information per synapse. Asymptotically, block coding has the same limits as random coding. The higher pattern storage capacity and error-correcting capabilities may render block coding networks (in particular, IRB-SMX) better suited for applications such as a fast nearest-neighbor search for object classification than previous approaches (Ferro et al., 2016; Knoblauch et al., 2010).

In this study, we have focused on balanced potentiation, where the fraction of potentiated synapses $p_1\approx 0.5$ maximizes the entropy of the weight matrix. In the future, it may be interesting to investigate the minimal entropy regimes of sparse potentiation with $p_1\to 0$ and dense potentiation with $p_1\to 1$. In the minimal entropy regimes, the weight matrix is compressible such that very efficient network implementations are possible (Knoblauch et al., 2010; Bentz et al., 1989). In particular, dense potentiation has previously been identified as most promising for applications if implemented with inhibitory networks (Knoblauch, 2007, 2008a, 2012). Here, the unpotentiated “silent” synapses with weight 0 are replaced by inhibitory synapses with weight $-1$, whereas the potentiated synapses with weight 1 can be pruned. Then block coding could further boost information efficiency because even a small increase of $p_1$ toward 1 may significantly decrease the number of remaining “silent” synapses that must be represented in an inhibitory network. Another interesting question that should be addressed in future work is whether our observation that maximal capacity occurs at the size of cortical macrocolumns holds as well under neuroanatomically more realistic conditions such as sparsely connected networks and the involvement of structural plasticity (Knoblauch & Sommer, 2016; Knoblauch, 2017).

## Appendix A: Analysis of Block Coding

### A.1 Transinformation and Channel Capacity for Block Coding

We can interpret storage and retrieval in associative networks as sending pattern vectors $v^\mu$ over a memory channel and receiving output patterns $\hat v^\mu$ (Shannon & Weaver, 1949; Cover & Thomas, 1991).

For *random coding* with independent pattern components (pRND), this corresponds to a bit-wise transmission over the channel, and it is therefore sufficient to consider binary random variables $X=v_i^\mu\in\{0,1\}$ and $Y=\hat v_i^\mu\in\{0,1\}$. For $p:=\mathrm{pr}[X=1]$, the information $I(X)$ equals the binary entropy (Shannon & Weaver, 1949), $I(X)=-p\,\mathrm{ld}\,p-(1-p)\,\mathrm{ld}(1-p)$.

For *block coding*, the individual bits of a pattern vector $v^\mu$ are not independent. It is therefore more adequate to consider the transmission of blocks $X,Y\in\{0,1\}^N$. At the input side, each block $X$ has a single one-entry; therefore, $I(X)=\mathrm{ld}\,N$, assuming a uniform distribution. On the output side, $Y$ may have an arbitrary number of one-entries, $|Y|\in\{0,1,2,\dots,N\}$. However, by construction of the *pattern part retrieval algorithms R1B and IRB*, either $|Y|=0$ or $Y$ will be a superset of $X$. Thus, reconstructing $X$ from $Y$ requires selecting one of the $|Y|$ one-entries of $Y$, and the conditional information of $X$ given $Y$ for fixed $|Y|$ is $\mathrm{ld}\,|Y|$.

For more *general retrieval scenarios* (like winners-take-all threshold strategies), the true one-entry of a block $Y$ may be erased with probability $p_{10}$, and spurious one-entries may appear independently with probability $p_{01}$. Then the number $f$ of false one-entries in an output block has a binomial distribution, $\mathrm{pr}[f]=\binom{N-1}{f}p_{01}^{f}(1-p_{01})^{N-1-f}$.

We can also compute the *transinformation between pattern sets*, for example, between the original input patterns $U:=\{u^1,\dots,u^M\}$ and noisy queries $\tilde U:=\{\tilde u^1,\dots,\tilde u^M\}$. Assuming that queries $\tilde u^\mu$ consist of $\tilde\lambda k$ correct one-entries of $u^\mu$ and additionally a fraction $\tilde\kappa k$ of false one-entries, the component error probabilities are $p_{10}=1-\tilde\lambda$ and $p_{01}=\tilde\kappa k/(n-k)$.

Finally, we can estimate *storage capacities* for our network simulations by using the definitions given in equations A.4 and A.13 to A.18.

Figure 11 illustrates some differences in information content for the different types of random memory patterns. We refer to pRND for independently generated pattern components, kRND for patterns having exactly $k$ out of $n$ active units, and BLK for different variants of block patterns (see the figure keys for details). It can be seen that pRND and kRND are almost equivalent; only for extremely sparse patterns ($k<10$) do their information contents differ by a few percent. For that reason, we have computed information storage capacities simply from equation A.4 in all our simulations of networks storing kRND patterns. By contrast, block patterns typically have a significantly lower information content than pRND and kRND, in particular for large $k$. One exception to this rule is the case of noisy patterns with $\lambda<1$ and $\kappa>0$. Here, for small $k$, the (trans-)information is only slightly lower than for pRND and, surprisingly, for large $k$, the block patterns can have a much larger transinformation than pRND and kRND, where the transinformation ratio diverges for $k\to n/2$. For this reason, we have computed information storage capacities from equations A.13 to A.18 in all our simulations of block-coded memory patterns.
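
The near equivalence of pRND and kRND noted above can be checked directly. The following Python sketch (function names are ours) computes the information content of a pRND pattern, $n$ times the binary component entropy, and of a kRND pattern, $\mathrm{ld}\binom{n}{k}$:

```python
import math

def binary_entropy(p):
    """Shannon information of one binary component in bit."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def info_pRND(n, k):
    """n independent components, each active with probability k/n."""
    return n * binary_entropy(k / n)

def info_kRND(n, k):
    """Exactly k of n units active: ld of the number of such patterns."""
    return math.log2(math.comb(n, k))
```

For example, with $n=4096$ and $k=64$, the two values differ by less than 1%, while for very sparse patterns the gap grows to a few percent, in line with Figure 11.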

### A.2 Iterative Retrieval for Autoassociation with OR-ing (IRB)

### A.3 The Limit of Large Networks for Balanced Potentiation

As described in section 4.4, we consider the limit $n,k,N\to\infty$, $p=k/n\to 0$, $L=\lambda+\frac{r}{k}$, $0<r<k$, $p\ln L\to 0$, and fixed $0<p_{1,\max}<1$ for constant $\lambda,\epsilon$. Moreover, we assume that $r$ is typically constant or $r\sim k$.

First, we show that after replacing $L$ by $L_{AA}$, equations 4.15 and 4.16 hold as well for autoassociation with OR-ing. Using again $\ln(x+\Delta x)=\ln(x)+O(\frac{\Delta x}{x})$ and $L_{AA}=\frac{r}{k(1-\lambda)}$ from equation A.22, it is $\ln(1-L_{AA}^{\,p})=\ln(-p\ln L_{AA})+O(p\ln L_{AA})\sim-\ln(N)+\ln\ln\frac{k}{r}\sim-\ln(N)$ because $\frac{\ln\ln k}{\ln(n/k)}\to 0$ for any sublinear $k=O(n^d)$ with $d<1$.

## Appendix B: Analysis of One-Step Retrieval (R1) and Iterative Retrieval (IR)

*One-step retrieval R1* (without exploiting block coding) yields the following results (see Knoblauch et al., 2010, equations 3.2, 3.7–3.11),

We can analyze *iterative retrieval IR* without any block coding (assuming sparse coding $k\le\log(n)$ and a reasonable threshold strategy for retrieval steps 2, 3, …, for example, $k$-WTA selecting the $k$ most activated units) by using the gains (e.g., again for $\epsilon\le 0.01$)
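
The flavor of these estimates can be illustrated with a rough approximation of the R1 output noise. The following Python sketch is our own simplification (not equations B.1 to B.9): it combines the matrix load with the binomial approximation that a wrong unit becomes active when all $\lambda k$ query units project a one-entry onto it:

```python
def r1_output_noise(M, n, k, lam):
    """Rough estimate of the relative output noise of one-step
    retrieval R1 (cf. Knoblauch et al., 2010): a unit outside the
    stored pattern fires if all lam*k active query units project a
    one-entry onto it, which has probability p1^(lam*k) under the
    binomial approximation; the expected number of such false units
    among the n-k wrong ones is normalized by the k correct units."""
    p1 = 1.0 - (1.0 - (k / n) ** 2) ** M   # matrix load
    p01 = p1 ** (lam * k)                  # false-one probability
    return (n - k) * p01 / k
```

As expected, the estimate stays negligible well below the capacity limit and grows steeply once the matrix load becomes large.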


## Appendix C: Recognition Capacity of the Willshaw Model

*maximal matrix load* at error level $\epsilon$,


*pattern capacity*,

*recognition capacity*,

^{18}On the other end of the optimal range, if $\epsilon\to\infty$ slowly enough that $\mathrm{ld}\,\epsilon$ remains negligible, we obtain $-\mathrm{ld}(\epsilon q)\approx-\mathrm{ld}\,q=\mathrm{ld}\frac{\binom{n}{k}}{(n/k)^{2}}\approx k\,\mathrm{ld}\,n$ and thus

^{19}$k=\frac{\operatorname{ld} n}{2\lambda^2}$. Thus, for recognition memory it is possible to store up to $\frac{\ln 2}{2}\approx 0.35$ bit per synapse, but only for high noise levels where there are many times more spurious states than stored memories. The resulting device can still be useful for applications, as typically “new” inputs occur relatively seldom during operation and the chance $q_{01}$ of a new input evoking a false-positive response is very low. Interestingly, this bound can be reached for any diverging pattern activity with $k\le\frac{\operatorname{ld} n}{2\lambda^2}$, whereas reconstruction memory is optimal only for logarithmic $k\approx\frac{1}{\lambda}\operatorname{ld} n$ (see equations B.1 and 4.9).

A limitation of our analysis is the binomial approximation of the error probability, equation C.2, which assumes that all one-entries in the weight matrix were generated independently of each other. For reconstruction memory we have shown the convergence of this approximation (Knoblauch, 2008b, 2008c), which is relatively easy to do because the probability of a single output component error depends only on the synaptic weights of the output unit (see equation 4.2). By contrast, for recognition memory, the error probability, equation C.2, depends on the synaptic weights of all active neurons, which may include subtle dependencies that are difficult to analyze precisely. Also, a quantitative verification of our theory by simulation experiments is much more difficult here because the error probabilities (see equation C.2) become extremely small, $q_{01}\sim 0.5^{k^2}$, and thus are difficult to test with sufficient precision.^{20}

## Notes

^{1}

In principle, these early results did not really consider our very general definition of $C$, so they only show that $C\u2265ln2$ for heteroassociation and $C\u2265ln2/2$ for autoassociation. The equality seems very plausible after more than 40 years of associative memory research, but to our knowledge, it is still a conjecture.

^{2}

Split the patterns $u^\mu$ into two parts of length $(1-r)n$ and $rn$. Then the matrix $W$ contains a submatrix $W^{(11)}$ for autoassociation of the first pattern parts and a matrix $W^{(12)}$ for heteroassociation from the first to the second parts. The parameters for these submatrices are in the close-to-optimal range, so their general information capacities converge to the limits $C_A:=C$ and $C_H\ge\ln 2$ for autoassociation and heteroassociation, respectively. Similarly, the capacity of a third submatrix $W^{(22)}$ for autoassociation of the second pattern parts also converges to $C_A$. From this, we get $(1-r)^2C_A+r(1-r)C_H\le C_A\le(1-r)^2C_A+r(1-r)C_H+r^2C_A$ for any $0<r<1$. Solving for $C_A$ yields equivalently $\frac{1-r}{2-r}C_H\le C_A\le\frac{C_H}{2}$. Thus, $C_A=\frac{C_H}{2}$ for $r\to 0$.
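The algebra in this note is easy to verify numerically: at the two claimed boundaries, $C_A=\frac{1-r}{2-r}C_H$ and $C_A=\frac{C_H}{2}$, the left and right inequalities hold with equality. A minimal check (the value $C_H=\ln 2$ is just an example):

```python
from math import isclose, log

CH = log(2)  # example value for the heteroassociative capacity limit
for r in (0.1, 0.3, 0.5, 0.9):
    # lower boundary: (1-r)^2*CA + r*(1-r)*CH == CA at CA = (1-r)/(2-r)*CH
    CA = (1 - r) / (2 - r) * CH
    assert isclose((1 - r) ** 2 * CA + r * (1 - r) * CH, CA)
    # upper boundary: (1-r)^2*CA + r*(1-r)*CH + r^2*CA == CA at CA = CH/2
    CA = CH / 2
    assert isclose((1 - r) ** 2 * CA + r * (1 - r) * CH + r ** 2 * CA, CA)
print("bounds consistent")
```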

^{3}

This means there is exactly one synapse $w_{ij}$ between any neuron pair $u_i,v_j$. This assumes autapses for autoassociation and bidirectional synapses for heteroassociation.

^{4}

We use $\operatorname{ld} n:=\log_2(n)$ as an abbreviation for the logarithmus dualis.

^{5}

In most of our analyses and simulation experiments, we assume $\lambda =0.5$ or $\lambda =1$ but no additional false one-entries ($\kappa =0$) because this is known to maximize information capacity. Further simulations using different values for $\lambda $ and $\kappa >0$ have confirmed our results (data not shown).

^{6}

For autoassociation, there are no synaptic links within a block except self-connections. Hence, a silent neuron never gets input from an active neuron within the same block and therefore cannot reach the firing threshold (which equals the number of *all* active neurons). As a consequence, sIRB and IRB are equivalent for autoassociative pattern part retrieval.

^{7}

The chance that a synapse is potentiated after learning one pattern association is $(1/N)^2$. Correspondingly, the chance that a synapse is not potentiated after learning $M$ associations is $p_0:=(1-1/N^2)^M$, and $p_1:=1-p_0$. Note that for autoassociation, within a block only autapses $W_{ii}$ may be potentiated.
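The matrix load formula can be checked with a small simulation. The sketch below assumes heteroassociation between two block-coded populations, each with $k$ blocks of $N$ units (all sizes are arbitrary toy values), and compares the empirical fraction of potentiated synapses with $p_1=1-(1-1/N^2)^M$:

```python
import random

def block_pattern(k, N):
    """Active unit indices: one randomly chosen unit per block."""
    return [b * N + random.randrange(N) for b in range(k)]

def simulate_p1(k, N, M, seed=0):
    """Empirical matrix load after Willshaw learning of M heteroassociations
    between random block patterns (binary clipped Hebbian learning)."""
    random.seed(seed)
    n = k * N
    W = [[0] * n for _ in range(n)]
    for _ in range(M):
        u, v = block_pattern(k, N), block_pattern(k, N)
        for i in u:
            for j in v:
                W[i][j] = 1
    return sum(map(sum, W)) / (n * n)

k, N, M = 8, 16, 200
p1_theory = 1 - (1 - 1 / N ** 2) ** M
print(round(p1_theory, 3), round(simulate_p1(k, N, M), 3))
```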

^{8}

We assume here that the one-entries in a column of the weight matrix are placed independently, corresponding to a binomial distribution of the dendritic potentials. It has been shown that this so-called binomial approximation of $p_{01}$ becomes exact in the limit of large networks with sufficiently sparse memory patterns, including the case $k=O(n/\log^2 n)$ (Knoblauch, 2008c).

^{9}

Obviously, solving this fixed-point equation yields only an upper bound on $\lambda_{\max}$, for two reasons. First, because equation 4.4 gives only a mean value, due to a positive variance about half of the retrievals will result in worse-than-average values, such that iterative retrieval may get stuck in a spurious fixed point. Second, the linear approximation of $\hat\lambda(\lambda)$ also yields only an upper bound on $\lambda_{\max}$ due to the concave form of the curve for relevant $\lambda$ (see Figure 2). A quadratic approximation would obviously yield better results, but to preserve the upper bound and simplify the following capacity analyses, we keep the linearization. The simulations in section 5 show that our analysis still provides useful results.

^{10}

As output errors (see equation 4.2) decrease exponentially with $\lambda $, the first retrieval step is most critical for convergence of activity toward the original pattern. Therefore, the resulting approximations should be sufficiently good in spite of not analyzing further retrieval steps.

^{11}

In the following capacity analysis of the maximal matrix load $p_1$ satisfying equation 4.8, we do not know $\lambda_{\max}$ beforehand (as it itself depends on $p_1$). To avoid an iteration between $L$ and maximal $p_1$, we can assume a fixed tolerance value for convenience, for example, $\epsilon_{\min}:=\epsilon=0.01$, similar to previous analyses (Knoblauch et al., 2010). Note that for $\lambda<1$ and large networks with $r/k\to 0$, the value of $\epsilon_{\min}$ becomes irrelevant anyhow.

^{12}

In general, $\hat\lambda_{\max}$ is the mean completeness averaged over both areas $u$ and $v$. For example, for IRB with OR-ing, the maximal completeness in area $u$ is typically (slightly) larger than in $v$.

^{13}

Strictly speaking, we cannot definitely exclude that there exist better retrieval strategies exceeding the classical limits. For example, for block coding, IRB-SMX currently seems to be the optimal retrieval strategy (see the simulation results in section 5). Although an analysis of IRB-SMX is more challenging, our simulations suggest that IRB-SMX also cannot exceed the asymptotic values of R1. First, our simulations show that the maximum completion capacity (for $n$ in the range between $10^4$ and $10^5$) is only slightly larger than for classical IR models and seems to decrease to the classical value for larger $n>10^5$. Second, some further analysis (not shown) reveals that in order to exceed the classical values, any iterative retrieval method must be able to eliminate a number of noisy neuron activities (e.g., after the first R1 step) that scales faster than any power of $k$. However, our simulations of very large IRB-SMX networks show that this number, at maximal capacity, seems to scale at most with $k^2$ (see note 15).

^{14}

Most experiments use $\lambda =0.5$ or $\lambda =1$ because this is optimal for autoassociation and heteroassociation, respectively. But further simulation experiments have confirmed our results also for various other values of $\lambda $ and $\kappa >0$ (data not shown).

^{15}

Although we cannot strictly exclude that the asymptotic capacity of the optimal IRB-SMX method lies above the classical asymptotic bounds, our simulation data do not support such a hypothesis. In fact, we have measured the output noise $\hat\epsilon$ after the first R1 step for IRB-SMX at maximum capacity (and also after convergence). For $k=20,22,24,26,28$ we obtained $\hat\epsilon=1.69,1.86,1.98,2.16,2.31$ after the first R1 step (corresponding to $\hat\epsilon=0.136,0.117,0.087,0.098,0.113$ after the last iteration). These data support only a linear increase of output noise, $\hat\epsilon\sim k$ (or a quadratic increase of the number of wrongly activated neurons with $k$). On the other hand, we have argued in section 4.4 (see note 13) that exceeding the classical asymptotic bounds would require the output noise $\hat\epsilon$ to grow faster than any power of $k$. Thus, our data support the hypothesis that IRB-SMX has the same asymptotic bounds as the classical retrieval methods.

^{16}

In equation 4.8, we defined $L$ as a lower bound on the output completeness $\hat\lambda$ as a function of the input completeness $\lambda$. Expressing the same relation in terms of input noise $\tilde\epsilon:=1-\lambda$ and output noise $\hat\epsilon=1-\hat\lambda$, condition 4.8 becomes equivalently $\hat\epsilon\overset{!}{\le}1-L=1-\min(\lambda+r/k,\lambda_{\max})=\max(1-\lambda-r/k,1-\lambda_{\max})=\max(\tilde\epsilon-r/k,\hat\epsilon_{\min})$ with the minimally possible output noise $\hat\epsilon_{\min}:=1-\hat\lambda_{\max}$. Transferring this result from IRB to IR then simply means requiring that the output noise after the first R1 step be bounded by $\epsilon:=1-L$, as implied by equations B.7 to B.9. Here we may use the minimally possible output noise $\hat\epsilon_{\min}:=(n-k)p_1^k/k$, which follows from the minimal output component error probability $p_{01,\min}=p_1^k$ for perfect inputs with $\lambda=1$. However, to avoid again iterated computations (as $\hat\epsilon_{\min}$ depends on $p_1$), we simply choose fixed $\hat\epsilon_{\min}=\epsilon$ as for R1.

^{17}

The right part of equation C.4 follows from solving $\ln p_{1,\epsilon}=\frac{2}{(\lambda k)^2-\lambda k}\ln\frac{\epsilon}{\alpha}$ or the quadratic equation $\lambda k^2-k-\frac{2\ln\frac{\epsilon}{\alpha}}{\lambda\ln p_{1,\epsilon}}=0$.

^{18}

For example, for $-\ln(\epsilon q)\sim\ln\ln\cdots\ln n\to\infty$ it follows from equation C.6 that $M\sim\frac{n^2}{\ln\ln\cdots\ln n}$ scales almost with the number of synapses. Correspondingly, the normalized critical pattern capacity diverges as $\alpha:=\frac{M}{n^2/\ln^2 n}\sim\frac{\ln^2 n}{\ln\ln\cdots\ln n}\to\infty$.

^{19}

Inserting $-\operatorname{ld}(\epsilon q)\approx k\,\operatorname{ld} n$ in equation C.4 yields $k\approx\sqrt{2k\,\operatorname{ld} n}/(2\lambda)$, and solving for $k$ yields $k=\frac{\operatorname{ld} n}{2\lambda^2}$.
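The algebraic step in this note can be checked directly: $k=\operatorname{ld}(n)/(2\lambda^2)$ is indeed the nonzero fixed point of $k\mapsto\sqrt{2k\,\operatorname{ld} n}/(2\lambda)$ (the values of $n$ and $\lambda$ below are arbitrary examples):

```python
from math import isclose, log2, sqrt

def rhs(k, n, lam):
    """Right-hand side sqrt(2*k*ld(n)) / (2*lambda) from note 19."""
    return sqrt(2 * k * log2(n)) / (2 * lam)

for n, lam in ((10 ** 4, 0.5), (10 ** 6, 1.0)):
    k = log2(n) / (2 * lam ** 2)  # claimed solution
    assert isclose(rhs(k, n, lam), k)
print("solution verified")
```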

^{20}

In our current approach (data not shown), we have used randomly selected unseen patterns to estimate $q_{01}$, but this allows precise estimation only for small networks with relatively large error probabilities, for example, $q_{01}\ge 10^{-5}$. A possible future approach may use a maximum clique algorithm to specifically search for cliques of size $\lambda k$ in the graph of the synaptic weight matrix and, by that, estimate the probability that a randomly selected clique corresponds to either a spurious state or a familiar memory.
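The sampling approach described here can be sketched as follows, for a deliberately small autoassociative network with a high error rate (all parameters are toy values chosen so that $q_{01}$ is large enough to measure; the recognition rule, accepting a pattern iff its active units form a clique in the weight matrix, follows the clique picture used in this note):

```python
import random

def block_pattern(k, N):
    """One active unit per block: k blocks of size N, n = k*N neurons."""
    return tuple(b * N + random.randrange(N) for b in range(k))

def learn_auto(patterns, n):
    """Willshaw autoassociative (clique) storage in a binary weight matrix."""
    W = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in p:
            for j in p:
                if i != j:
                    W[i][j] = 1
    return W

def accepted(W, p):
    """Recognize p as familiar iff its active units form a clique in W."""
    return all(W[i][j] for i in p for j in p if i != j)

random.seed(1)
k, N, M, trials = 4, 8, 100, 2000
n = k * N
stored = [block_pattern(k, N) for _ in range(M)]
W = learn_auto(stored, n)
seen = set(stored)

hits = 0
for _ in range(trials):
    p = block_pattern(k, N)
    if p not in seen and accepted(W, p):
        hits += 1
print("estimated q01 =", hits / trials)
```

For realistically large, sparse networks the same loop would need astronomically many trials to observe a single false positive, which is exactly the limitation described above.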

## Acknowledgments

We are grateful to Friedhelm Schwenker and Fritz Sommer for valuable discussions. We are also very grateful to Hans Kestler for letting us use his computing infrastructure to simulate large neural networks. We acknowledge support by the state of Baden-Württemberg through bwHPC.