## Abstract

Neural associative memories are perceptron-like single-layer networks with fast synaptic learning typically storing discrete associations between pairs of neural activity patterns. Previous work optimized the memory capacity for various models of synaptic learning: linear Hopfield-type rules, the Willshaw model employing binary synapses, or the BCPNN rule of Lansner and Ekeberg, for example. Here I show that all of these previous models are limit cases of a general optimal model where synaptic learning is determined by probabilistic Bayesian considerations. Asymptotically, for large networks and very sparse neuron activity, the Bayesian model becomes identical to an inhibitory implementation of the Willshaw and BCPNN-type models. For less sparse patterns, the Bayesian model becomes identical to Hopfield-type networks employing the covariance rule. For intermediate sparseness or finite networks, the optimal Bayesian learning rule differs from the previous models and can significantly improve memory performance. I also provide a unified analytical framework to determine memory capacity at a given output noise level that links approaches based on mutual information, Hamming distance, and signal-to-noise ratio.

## 1. Introduction

An associative memory is an alternative computing architecture in which, unlike the classical von Neumann machine, computation and data storage are not separated. For example, as illustrated by Figure 1, an associative memory can store a set of associations between pairs of pattern vectors {(**u**^{μ} → **v**^{μ}):μ = 1, …, *M*}. Similar to random access memory, a query pattern **u**^{μ} entered in associative memory can serve as an address for accessing the associated content pattern **v**^{μ}. However, unlike random access memory, an associative memory accepts arbitrary query patterns, and the computation of any particular output involves all stored data records rather than a single one. Specifically, the associative memory task consists of comparing a query with all stored addresses and returning an output pattern equal (or similar) to the pattern **v**^{μ} associated with the address **u**^{μ} most similar to the query. Thus, the associative memory task includes the random access task but is not restricted to it. It also includes computations such as pattern completion, denoising, or data retrieval using incomplete cues. Moreover, neural implementations of associative memory are closely related to Hebbian cell assemblies and play an important role in neuroscience as models of neural computation for various brain structures, for example, neocortex, hippocampus, cerebellum, mushroom body (Hebb, 1949; Braitenberg, 1978; Palm, 1991; Fransen & Lansner, 1998; Pulvermüller, 2003; Johansson & Lansner, 2007; Lansner, 2009; Gardner-Medwin, 1976; Rolls, 1996; Bogacz, Brown, & Giraud-Carrier, 2001; Marr, 1969, 1971; Albus, 1971; Kanerva, 1988; Laurent, 2002).

In their simplest form, neural associative memories are single-layer perceptrons with fast, typically one-shot, synaptic learning realizing the storage of *M* discrete associations between binary address and content patterns **u**^{μ} and **v**^{μ}. The one-shot constraint favors local learning rules where a synaptic weight *w*_{ij} depends on only *u*^{μ}_{i} and *v*^{μ}_{j}. Alternative nonlocal learning methods are typically time-consuming and require gradient descent (such as error backpropagation) based on global error signals obtained from repeated training of the entire pattern set. Instead, associative memories use simple Hebbian-type learning rules where synaptic weights increase if both the presynaptic and postsynaptic neurons are active during presentation of a pattern pair.

The performance of neural associative memory models can be evaluated by storage capacity, which can be defined, for example, by the number of memories *M* a network of a given size can store or by the Shannon information *C* that a synapse can store. More recent work considers also structural compression of synaptic networks and the energy or time requirements per retrieval (Poirazi & Mel, 2001; Stepanyants, Hof, & Chklovskii, 2002; Lennie, 2003; Knoblauch, 2003, 2005, 2009b; Knoblauch, Palm, & Sommer, 2010).

The simplest one-shot learning model is the so-called Steinbuch or Willshaw model with binary synapses and clipped Hebbian learning (Willshaw, Buneman, & Longuet-Higgins, 1969; Steinbuch, 1961; Palm, 1980, 1991; Golomb, Rubin, & Sompolinsky, 1990; Nadal, 1991; Sommer & Dayan, 1998; Sommer & Palm, 1999; Knoblauch et al., 2010). Here a single coincidence of presynaptic and postsynaptic activity is sufficient to increase the synaptic weight from 0 to 1, while further coincidences do not cause further changes.

An alternative model is the linear associative memory, where contributions of different pattern pairs add linearly (Kohonen, 1972; Kohonen & Oja, 1976; Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1982; Palm, 1988a, 1988b; Tsodyks & Feigel'man, 1988; Willshaw & Dayan, 1990; Dayan & Willshaw, 1991; Palm & Sommer, 1992, 1996; Chechik, Meilijson, & Ruppin, 2001; Sterratt & Willshaw, 2008). For example, for binary memory patterns *u*^{μ}_{i}, *v*^{μ}_{j} ∈ {0, 1} the general linear learning rule can be described by four values specifying the weight increments for the possible combinations of presynaptic and postsynaptic activity.

Surprisingly, the maximal storage capacity *C* in bits per synapse is almost identical for the two models: the Willshaw model can achieve up to 0.69 bits per synapse (bps), whereas the linear models achieve an only slightly higher capacity of 0.72 bps in spite of employing real-valued synaptic weights. However, closer investigation reveals that the Willshaw model can achieve nonzero capacity only for extremely sparse activity, where the number of active units per pattern vector scales logarithmically with the vector size. In contrast, the linear model achieves the maximum *C* = 0.72 bps for a much larger range of moderately sparse patterns. Only for a nonvanishing fraction of active units per pattern vector does the performance drop from 0.72 bps to the capacity of the original (nonsparse) Hopfield network (e.g., *C* = 0.14 bps in Hopfield, 1982; Hertz, Krogh, & Palmer, 1991; Palm & Sommer, 1996, or, as we will see below, *C* = 0.33 bps for the hetero-associative feedforward networks considered here). The linear learning model achieves maximal storage capacity only for the optimal covariance learning rule (e.g., Sejnowski, 1977a, 1977b; Dayan & Willshaw, 1991; Dayan & Sejnowski, 1993; Palm & Sommer, 1996), which becomes equal to the Hebb rule for very sparse patterns and equal to the Hopfield rule for nonsparse patterns. Moreover, simulation experiments show that the capacity of the optimal linear model remains well below the capacity of the Willshaw model for any reasonable finite network size (e.g., *C* = 0.2 bps versus *C* = 0.5 bps for *n* = 10^{5} neurons; see Knoblauch, 2009a; Palm & Sommer, 1992). This suggests that the linear covariance rule is not always optimal, in particular not for finite networks and sparse memory representations as found in the brain (Waydo, Kraskov, Quiroga, Fried, & Koch, 2006).

A third model class is based on the Bayesian confidence propagation neural network (BCPNN) rule (Lansner & Ekeberg, 1987, 1989; Kononenko, 1989, 1994; Lansner & Holst, 1996; Sandberg, Lansner, Petersson, & Ekeberg, 2000; Lansner, 2009). This model employs Bayesian maximum-likelihood heuristics for synaptic learning and retrieval (see also a related approach based on maximizing the entropy of synaptic weights: MacKay, 1991). Therefore, it has been suspected that the BCPNN model could achieve optimal performance, or at least exceed the performance of Willshaw and linear models. These conjectures have been supported by some numerical investigations; however, theoretical analyses of the BCPNN model have been lacking so far. As we will see, the BCPNN model becomes optimal only for a limited range of very sparse memory patterns.

This article (see also Knoblauch, 2009a, 2010a) develops the generally optimal associative memory that minimizes output noise and maximizes storage capacity by activating neurons based on Bayesian maximum likelihood decisions. The corresponding neural interpretation of this Bayesian associative memory corresponds in general to a novel nonlinear learning rule resembling the BCPNN rule. Specifically, a theoretical analysis including query noise shows that the previous learning models are only special limit cases of the generally optimal Bayesian model. Asymptotically, for large networks and extremely sparse memory patterns, the Bayesian model becomes essentially identical to the binary Willshaw model (but implemented with inhibitory rather than excitatory synapses; see Knoblauch, 2007). Similarly, the BCPNN model is optimal for a less restricted range of sparse memory patterns where the fraction of active units per memory vector still vanishes. For less sparse and nonsparse patterns, the Bayesian model becomes identical to the linear model employing the covariance rule. For a large range of intermediate sparseness and finite networks, the Bayesian learning rule is shown to perform significantly better than previous models. As a by-product, this work also provides a unified analytical framework to determine memory capacities at a given output noise level that links approaches based on mutual information, Hamming distance, and signal-to-noise ratio.

The organization of the paper is as follows. Section 2 describes the model of neural associative memory with optimal Bayesian learning and analyzes signal-to-noise ratio and storage capacity. Section 3 compares the Bayesian associative memory to previous models in the literature, including inhibitory implementations of the Willshaw network, linear learning models with the covariance rule, and BCPNN-type models, and determines asymptotic conditions when the respective models become equivalent to optimal Bayesian learning. Section 4 presents results from numerical simulation experiments verifying the theoretical results concerning signal-to-noise-ratio, output noise, and storage capacity. Further experiments compare the performance of various learning models for finite network sizes. Section 5 summarizes and discusses the main results of this work. The appendixes include a description for appropriate implementations of Bayesian associative memory (appendix A), an analysis for computing optimal firing thresholds (appendix D), an analysis of the relationship between signal-to-noise ratio and Hamming-distance-based measures for output noise and storage capacity (appendix E), and signal-to-noise ratio analyses for the linear and BCPNN-type models (appendixes G, H).

## 2. Model of Bayesian Associative Memory

### 2.1. Memory Storage in Neural and Synaptic Countervariables.

The task of the memory is to store *M* associations between address patterns **u**^{μ} and content patterns **v**^{μ}, where μ = 1, …, *M*. Here **u**^{μ} and **v**^{μ} are binary vectors of size *m* and *n*, respectively. Memory associations are stored in first-order (neural) and second-order (synaptic) countervariables. In particular, each address neuron *i* and each content neuron *j* can memorize its unit usage, *M*_{1}(*i*) ≔ ∑^{M}_{μ=1} *u*^{μ}_{i} and *M*′_{1}(*j*) ≔ ∑^{M}_{μ=1} *v*^{μ}_{j}. Similarly, each synapse *ij* can memorize its synapse usage, in particular the coincidence counter *M*_{11}(*ij*) ≔ ∑^{M}_{μ=1} *u*^{μ}_{i}*v*^{μ}_{j}, where *i* = 1, …, *m* and *j* = 1, …, *n*. The complementary counters follow as *M*_{0}(*i*) = *M* − *M*_{1}(*i*), *M*_{10}(*ij*) = *M*_{1}(*i*) − *M*_{11}(*ij*), *M*_{01}(*ij*) = *M*′_{1}(*j*) − *M*_{11}(*ij*), and *M*_{00}(*ij*) = *M* − *M*_{1}(*i*) − *M*′_{1}(*j*) + *M*_{11}(*ij*). Note that it is therefore sufficient to memorize *M*, *M*_{1}, *M*′_{1}, and *M*_{11}. Thus, an implementation on a digital computer requires about (*mn* + *m* + *n* + 1)ld *M* memory bits. The following analyses consider optimal Bayesian retrieval, assuming that each output unit *j* = 1, …, *n* has access to these countervariables. The following analyses will show that the mean values of the coincidence counters and unit usages have a major role in determining the regime of operation for Bayesian associative memory (see Table 2).
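To make the bookkeeping concrete, the following is a minimal sketch (the function names and array layout are mine, not the article's) of how the countervariables of this section can be accumulated from binary pattern arrays, and how the remaining counters follow from *M*, *M*_{1}(*i*), *M*′_{1}(*j*), and *M*_{11}(*ij*):

```python
import numpy as np

def store_counters(U, V):
    """Accumulate the neural and synaptic countervariables for M stored
    pattern pairs (u^mu -> v^mu), where U is an (M, m) binary array of
    address patterns and V an (M, n) binary array of content patterns."""
    M = U.shape[0]
    M1_addr = U.sum(axis=0)        # unit usage M_1(i) of each address neuron
    M1_cont = V.sum(axis=0)        # unit usage M'_1(j) of each content neuron
    M11 = U.T @ V                  # synapse usage (coincidence counter) M_11(ij)
    return M, M1_addr, M1_cont, M11

def derived_counters(M, M1_addr, M1_cont, M11):
    """Recover the remaining second-order counters from M, M_1(i), M'_1(j), M_11."""
    M10 = M1_addr[:, None] - M11                              # u=1, v=0
    M01 = M1_cont[None, :] - M11                              # u=0, v=1
    M00 = M - M1_addr[:, None] - M1_cont[None, :] + M11       # u=0, v=0
    return M00, M01, M10
```

Note that, consistent with the text, only *mn* + *m* + *n* + 1 counters are actually stored; the four second-order counters per synapse are recovered on demand.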

### 2.2. Neural Formulation of Optimal Bayesian Retrieval.

A retrieval starts with a query pattern **ũ**, which is used as an address to access a stored pattern **u**^{μ} and should return a reconstruction of the associated content **v**^{μ}. In general, the query **ũ** is a noisy version of **u**^{μ}, assuming component transition probabilities given the activity of a content neuron, *p*_{01|a} ≔ pr[*ũ*_{i} = 1 | *u*^{μ}_{i} = 0, *v*^{μ}_{j} = *a*] and *p*_{10|a} ≔ pr[*ũ*_{i} = 0 | *u*^{μ}_{i} = 1, *v*^{μ}_{j} = *a*] for *a* ∈ {0, 1}. Now the content neurons *j* have to decide independently of each other whether to be activated or remain silent. Given the query **ũ**, the optimal maximum likelihood decision is based on the odds ratio pr[*v*^{μ}_{j} = 1 | **ũ**]/pr[*v*^{μ}_{j} = 0 | **ũ**], which minimizes the expected Hamming distance between original and reconstructed content. If the query pattern components are conditionally independent given the activity of content neuron *j* (e.g., assuming independently generated address and query components), we have pr[**ũ** | *v*^{μ}_{j} = *a*] = ∏^{m}_{i=1} pr[*ũ*_{i} | *v*^{μ}_{j} = *a*] for *a* ∈ {0, 1}. With the Bayes formula pr[*v*^{μ}_{j} | **ũ**] = pr[**ũ** | *v*^{μ}_{j}] pr[*v*^{μ}_{j}]/pr[**ũ**], the odds ratio is

pr[*v*^{μ}_{j} = 1 | **ũ**]/pr[*v*^{μ}_{j} = 0 | **ũ**] = (*M*_{1}(*j*)/*M*_{0}(*j*)) ∏^{m}_{i=1} pr[*ũ*_{i} | *v*^{μ}_{j} = 1]/pr[*ũ*_{i} | *v*^{μ}_{j} = 0], (2.14)

where the likelihoods can be estimated from the countervariables, for example, pr[*ũ*_{i} = 1 | *v*^{μ}_{j} = 1] ≈ (*M*_{11}(1 − *p*_{10|1}) + *M*_{01}*p*_{01|1})/*M*_{1}(*j*). For a more plausible neural formulation, we can take logarithms of the probabilities and obtain dendritic potentials *x*_{j}. With *a*_{i} being the *i*th factor in the product of equation 2.14, it is *x*_{j} = log(*M*_{1}(*j*)/*M*_{0}(*j*)) + ∑^{m}_{i=1} log *a*_{i}. Thus, synaptic weights *w*_{ij}, dendritic potentials *x*_{j}, and retrieval output are finally

*w*_{ij} ≔ log [(*M*_{11}(1 − *p*_{10|1}) + *M*_{01}*p*_{01|1})(*M*_{00}(1 − *p*_{01|0}) + *M*_{10}*p*_{10|0})/((*M*_{10}(1 − *p*_{10|0}) + *M*_{00}*p*_{01|0})(*M*_{01}(1 − *p*_{01|1}) + *M*_{11}*p*_{10|1}))], (2.15)

*x*_{j} ≔ log(*M*_{1}(*j*)/*M*_{0}(*j*)) + ∑^{m}_{i=1} log (pr[*ũ*_{i} = 0 | *v*^{μ}_{j} = 1]/pr[*ũ*_{i} = 0 | *v*^{μ}_{j} = 0]) + ∑^{m}_{i=1} *ũ*_{i}*w*_{ij}, (2.16)

such that pr[*v*^{μ}_{j} = 1 | **ũ**] writes as a sigmoid function of *x*_{j}, and a content neuron fires, *v̂*_{j} = 1, iff the dendritic potential is nonnegative, *x*_{j} ⩾ 0. Note that indices of *M*_{0}(*j*), *M*_{1}(*j*), *M*_{00}(*ij*), *M*_{01}(*ij*), *M*_{10}(*ij*), and *M*_{11}(*ij*) are skipped for readability. Also note that optimal Bayesian learning is nonlinear and, for autoassociation with **u**^{μ} = **v**^{μ} and nonzero query noise, asymmetric with *w*_{ij} ≠ *w*_{ji}. Note further that synaptic weights and dendritic potentials may be infinite, such that accurate implementations require two values per variable for finite and infinite components, respectively (see appendix A).

The formulation of equation 2.16 allows an efficient implementation, in particular for sparse queries having only a small number of active components with *ũ*_{i} = 1. However, the synaptic weights of equation 2.15 may not yet satisfy Dale's law that a neuron is either excitatory or inhibitory. To be more consistent with biology, we may add a sufficiently large constant *w*_{0} ≔ −min_{ij} *w*_{ij} to each weight. Then all synapses have nonnegative weights *w*′_{ij} ≔ *w*_{ij} + *w*_{0}, and the dendritic potentials remain unchanged if we replace the last sum in equation 2.16 by ∑^{m}_{i=1} *ũ*_{i}*w*′_{ij} − *w*_{0} ∑^{m}_{i=1} *ũ*_{i}. Here the negative sum could be realized, for example, by feedforward inhibition with a strength proportional to the query pattern activity, as suggested by Knoblauch and Palm (2001) and Knoblauch (2005), for example.

In practice, the noise transition probabilities may not be known in advance. If each content neuron *v*_{j} has been queried repeatedly by noisy address queries, the transition probabilities *p*_{01|a} and *p*_{10|a} could be estimated online, which requires four countervariables per synapse in addition to *M*_{11}. To reduce storage costs, one may assume query noise independent of *j*, that is, *p*_{01|0} = *p*_{01|1} and *p*_{10|0} = *p*_{10|1}, as do most of the following analyses and experiments for the sake of simplicity, although this assumption may reduce the number of discovered rules (corresponding to infinite *w*_{ij}) describing deterministic relationships between *u*_{i} and *v*_{j}.
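The retrieval decision of this section can be sketched as a naive-Bayes log-odds computation. The following is an illustrative implementation under simplifying assumptions (content-independent query noise; a small regularizer `eps` standing in for the possibly infinite weights mentioned above; all function names are mine):

```python
import numpy as np

def bayesian_retrieve(query, M, M1_addr, M1_cont, M11, p01=0.0, p10=0.0, eps=1e-12):
    """Minimal sketch of Bayesian retrieval: compute dendritic potentials as
    log-odds and fire iff they are nonnegative.  p01 is the add-noise and p10
    the miss-noise transition probability, assumed independent of the content."""
    M10 = M1_addr[:, None] - M11
    M01 = M1_cont[None, :] - M11
    M00 = M - M1_addr[:, None] - M1_cont[None, :] + M11
    M0_cont = M - M1_cont
    # estimated likelihoods of an active query component given v_j = 1 / v_j = 0
    on_hi = (M11 * (1 - p10) + M01 * p01) / np.maximum(M1_cont, 1)[None, :]
    on_lo = (M10 * (1 - p10) + M00 * p01) / np.maximum(M0_cont, 1)[None, :]
    w_on = np.log((on_hi + eps) / (on_lo + eps))           # weight of an active input
    w_off = np.log((1 - on_hi + eps) / (1 - on_lo + eps))  # weight of a silent input
    prior = np.log((M1_cont + eps) / (M0_cont + eps))      # log prior odds of v_j = 1
    x = prior + query @ w_on + (1 - query) @ w_off         # dendritic potentials x_j
    return (x >= 0).astype(int), x
```

The counters can come from the bookkeeping of section 2.1; for nonoverlapping sparse patterns and noiseless queries, this sketch reproduces the stored contents exactly.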

### 2.3. Analysis of the Signal-to-Noise Ratio.

Retrieval quality can be measured by the expected Hamming distance between the original content **v**^{μ} and the reconstruction, where *d*_{H} is as defined below equation 2.12, and *q*(*j*) ≔ pr[*v*^{μ}_{j} = 1] is the prior probability of an active content unit. Thus, retrieval quality is determined by the component output error probabilities, *q*_{01}(*j*) ≔ pr[*v̂*_{j} = 1 | *v*^{μ}_{j} = 0] and *q*_{10}(*j*) ≔ pr[*v̂*_{j} = 0 | *v*^{μ}_{j} = 1], where the Θ_{j} are firing thresholds (e.g., Θ_{j} = 0 for dendritic potentials *x*_{j} as in equation 2.16). Intuitively, retrieval quality will be high if the high-potential distribution pr[*x*_{j} | *v*^{μ}_{j} = 1] and the low-potential distribution pr[*x*_{j} | *v*^{μ}_{j} = 0] are well separated, that is, if the signal-to-noise ratio (SNR), *R*(*j*) ≔ (μ_{hi} − μ_{lo})/σ_{lo}, is large for each content neuron *j* (Amari, 1977; Palm, 1988a, 1988b; Dayan & Willshaw, 1991; Palm & Sommer, 1996). Here μ_{lo} ≔ *E*(*x*_{j} | *v*^{μ}_{j} = 0) and σ^{2}_{lo} ≔ Var(*x*_{j} | *v*^{μ}_{j} = 0) are the expectation and variance of the low-potential distribution, and μ_{hi} ≔ *E*(*x*_{j} | *v*^{μ}_{j} = 1) and σ^{2}_{hi} ≔ Var(*x*_{j} | *v*^{μ}_{j} = 1) are the expectation and variance of the high-potential distribution. Appendix E shows that under some conditions, the SNR and the Hamming distance are equivalent measures of retrieval quality.

The following computes the SNR *R* ≔ *R*(*j*) for a particular content neuron *j* with *q* ≔ *M*_{1}(*j*)/*M* using the following simplifications:

1. The activation of an address unit *i* does not depend on other units, and all address units *i* have the same prior probability *p* ≔ *p*(*i*) ≔ pr[*u*^{μ}_{i} = 1] of being active. Thus, on average, an address pattern has *mp* active units.
2. Query noise for an address unit *i* does not depend on other units, and all query components *i* have the same noise transition probabilities *p*_{01} and *p*_{10}. Thus, on average, a query will have *c* correct and *f* false one-entries, where the fractions of average miss noise and add noise are normalized to the mean address pattern activity *mp*.
3. Retrieval involves a particular query pattern **ũ** being a noisy version of an address pattern **u**^{μ} that has exactly *k* one-entries, where the query has *c* out of *k* correct one-entries and, additionally, *f* false one-entries. Without loss of generality, we can assume a setting as illustrated by Figure 2 (left); that is, the address pattern has one-entries *u*^{μ}_{i} = 1 at components *i* = 1, 2, …, *k* and zero-entries *u*^{μ}_{i} = 0 at *i* = *k* + 1, *k* + 2, …, *m*, whereas the query has false entries at *i* = *c* + 1, *c* + 2, …, *k* + *f*.
4. The average values of the synaptic coincidence counters diverge, *Mpq* → ∞. Note that this assumption also implies diverging unit usages, *Mp* → ∞ and *Mq* → ∞. For reasons that will become apparent in section 3, the condition *Mpq* → ∞ is also referred to as the linear learning regime, whereas finite *Mpq* will be called the nonlinear learning regime.

Further, large networks with *m* → ∞ are assumed such that *k* ≈ *mp*, and, for consistent error estimates, *c* and *f* are close to their expectations. Then we obtain from equation B.6 the mean difference Δμ ≔ μ_{hi} − μ_{lo} between high potentials and low potentials. Similarly, we obtain from equation B.8 the potential variance σ^{2}. In order to include randomly diluted networks with connectivity *P* ∈ (0; 1], where a content neuron *v*_{j} receives synapses from only a fraction *P* of the *m* address neurons, we can simply replace *m* by *Pm*. With *M*_{1} ≈ *Mq* and *M*_{0} ≈ *M*(1 − *q*), this yields the asymptotic SNR *R* = Δμ/σ given in equation 2.28, which includes a query noise factor ρ ⩽ 1. Thus, for zero query noise, ρ = 1, the SNR for optimal Bayesian learning is identical to the asymptotic SNR of linear learning with the optimal covariance rule (e.g., see ρ^{Covariance}_{3} in Dayan & Willshaw, 1991, p. 259, or equation 3.28 in Palm & Sommer, 1996, p. 95; see also section 3.2). Nonzero miss noise or add noise decreases the SNR *R* by a factor ρ < 1. Note that ρ characterizes the basin of attraction, defined as the set of queries that get mapped to a stored memory **v**^{μ}. For example, we can evaluate which combinations of miss noise and add noise achieve a fixed desired ρ (and thus *R*). It turns out that for sparse address patterns, *p* < 0.5, miss noise impairs network performance more severely than add noise (see Figure 2, right). As a consequence, the basins of attraction for neural associative memories employing sparse address patterns are not necessarily spheres, but they can be heavily distorted, enlarging toward queries with add noise and shrinking toward queries with miss noise. This implies that the similarity metrics employed by associative networks can strongly deviate from commonly used Hamming or Euclidean metrics. Instead, associative networks appear to follow an information-theoretic metric based on mutual information or transinformation (Cover & Thomas, 1991). This is true at least for random address patterns **u**^{μ} storing a sufficiently large number of memories such that the synapse usages, in particular *M*_{11}, are almost never zero. Numerical simulations discussed in section 4 reveal that basins of attraction can behave quite differently if these assumptions are not fulfilled.
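The SNR and its gaussian link to output errors can be checked numerically. The sketch below (helper names are mine; it assumes the gaussian approximation discussed above) estimates *R* from samples of high and low potentials and derives the error probabilities for a given firing threshold:

```python
import numpy as np
from math import erf, sqrt

def empirical_snr(x_hi, x_lo):
    """Empirical signal-to-noise ratio R = (mu_hi - mu_lo) / sigma_lo between
    samples of the high- and low-potential distributions."""
    return (np.mean(x_hi) - np.mean(x_lo)) / np.std(x_lo)

def gaussian_error_probs(mu_hi, mu_lo, sigma, theta):
    """Output error probabilities under the gaussian approximation of the
    dendritic potentials: q01 = pr[x >= theta | v=0], q10 = pr[x < theta | v=1],
    assuming equal standard deviation sigma for both distributions."""
    Phi = lambda z: 0.5 * (1 + erf(z / sqrt(2)))   # standard normal CDF
    q01 = 1 - Phi((theta - mu_lo) / sigma)         # add noise (false one)
    q10 = Phi((theta - mu_hi) / sigma)             # miss noise (false zero)
    return q01, q10
```

With a threshold halfway between the means, both error probabilities coincide; moving the threshold trades add noise against miss noise, which is the noise balance discussed in section 2.4.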

### 2.4. Analysis of Storage Capacity.

Without loss of generality, we skip the index *j* in *q* ≔ *q*(*j*), *q*_{01} ≔ *q*_{01}(*j*), and *q*_{10} ≔ *q*_{10}(*j*) (or consider only a single output unit *j*). Then an expected content reconstruction has *nq*·*q*_{10} miss noise and *n*(1 − *q*)·*q*_{01} add noise components, and the output noise ϵ is their sum normalized to the mean content pattern activity *nq* (see appendix E). The weighing between miss noise and add noise can be expressed by the output noise balance ξ. For any given distribution of dendritic potentials, there exists a unique optimal firing threshold (see appendix D) and, hence, a corresponding optimal noise balance (see equation E.10) that minimize the output noise ϵ. This minimal output noise is an increasing function of the number *M* of stored memories (see equation E.6). Therefore, we can define the pattern capacity *M*_{ϵ} as the maximal number of memory patterns that can be stored such that the output noise does not exceed a given value ϵ. Assuming that the dendritic potentials follow approximately a gaussian distribution (which is not always true; e.g., see Henkel & Opper, 1990; Knoblauch, 2008), we can apply the results of appendix E and obtain *M*_{ϵ} from the SNR, equation 2.24, by solving the equation *R* = *R*_{ϵ} for *M*. Here *R* is approximately equal to equation 2.28, and *R*_{ϵ} is the minimal SNR required for output noise level ϵ and can be computed from solving equation E.6 for *R* (or, more conveniently, by iterating equations E.9 and E.10). The resulting approximation for *M*_{ϵ} becomes exact for large networks in the limit *Mpq* → ∞.

Normalizing the maximal storable information to the number *Pmn* of synapses employed in a given network yields the network capacity, *C*_{ϵ} ≔ *M*_{ϵ}*nT*/(*Pmn*) = *M*_{ϵ}*T*/(*Pm*), where *T* is the transinformation equation F.4 with error probabilities *q*_{01}, *q*_{10} as in equations E.4 and E.5 using *R* = *R*_{ϵ}. We can refine these results for two important cases using the results of appendixes E and F.
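Under the assumption that *T* is the mutual information of the binary channel between a stored content bit (prior *q*) and its reconstruction, with error probabilities *q*_{01} and *q*_{10}, the network capacity can be evaluated as follows (a sketch with my own helper names):

```python
import numpy as np

def h2(p):
    """Binary Shannon entropy in bits."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def transinformation(q, q01, q10):
    """Mutual information T (bits) between a content bit with prior q and its
    reconstruction, for add-error prob q01 = pr[vhat=1 | v=0] and miss-error
    prob q10 = pr[vhat=0 | v=1] (binary asymmetric channel)."""
    p_out = q * (1 - q10) + (1 - q) * q01          # pr[vhat = 1]
    return h2(p_out) - q * h2(q10) - (1 - q) * h2(q01)

def network_capacity(M, m, P, q, q01, q10):
    """Network capacity C = M*n*T / (P*m*n) = M*T / (P*m) in bits per synapse,
    i.e., the stored information normalized to the P*m*n synapses."""
    return M * transinformation(q, q01, q10) / (P * m)
```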

As can be seen in Figures 3a and 3b, the upper bound of *C*_{ϵ} is achieved for zero query noise () and low fidelity with ϵ → 1, while *C*_{ϵ} → 0 for high fidelity with ϵ → 0.

For sparse content patterns with *q* → 0 and any fixed ϵ, the network capacity is bounded by *C*_{ϵ} ⩽ 1/(2 ln 2) ≈ 0.72 bps, where the upper bound can be reached for zero query noise and high fidelity with ϵ → 0. Not surprisingly, this upper bound equals the one found for the linear covariance rule (Palm & Sommer, 1996) as well as the general capacity bound for neural networks (Gardner, 1988). Numerical evaluations (see Figures 3c to 3f) show that a network capacity close to *C*_{ϵ} ≈ 0.72 requires extremely sparse content memories and very large networks. In fact, finite networks of practical size can reach less than half of the asymptotic value (see Figure 3f). Note that *M*_{ϵ} and *C*_{ϵ} are defined only for ϵ < 1, assuming optimal firing thresholds to minimize output noise corresponding to an optimal noise balance as in equation E.10, where output errors are dominated by miss noise (see equation D.10). For generalized definitions of pattern capacity *M*_{ϵξ} and network capacity *C*_{ϵξ} at a given output noise balance ξ, we can replace *R*_{ϵ} by *R*_{ϵξ} as given by equation E.9. Here finite networks achieve maximal capacity at low fidelity ϵ ≫ 1 and ξ → 1, where output errors are dominated by add noise.

For self-consistency, the analyses so far are valid only for diverging . Thus, the results are not reliable for extremely sparse memory patterns, for example, *mp* = *O*(log *n*), where at least the binomially distributed synaptic countervariables *M*_{11} ∼ *B*_{M,pq} are small and cannot be approximated by gaussians (where *B*_{N,P} is defined below equation B.1). In particular for queries without any add noise, , small *M*_{11} implies very large or even infinite synaptic weights (see equation 2.15) that would also violate the gaussian assumption for the distribution of dendritic potentials. As will be shown below, the Bayesian associative memory becomes equivalent to the Willshaw model with a decreased maximal network capacity *C*_{ϵ} ⩽ ln 2 ≈ 0.69 (or, rather, *C*_{ϵ} ⩽ 1/(*e*ln 2) ≈ 0.53 for independently generated address pattern components *u*^{μ}_{i} with binomially distributed pattern activities, *k*^{μ} ≔ ∑^{m}_{i=1}*u*^{μ}_{i} ∼ *B*(*m*, *p*), as assumed here; see Willshaw et al., 1969; Knoblauch et al., 2010, appendix D). The following section investigates more closely the relationships to the Willshaw net, linear Hopfield–type learning rules, and the BCPNN model.

## 3. Relationships to Previous Models

### 3.1. Willshaw Model and Inhibitory Networks.

The Willshaw model works particularly well for “pattern part retrieval” with zero add noise. Then the active units of a query are a subset of an address pattern **u**^{μ}, and the optimal threshold is maximal, that is, equal to the query pattern activity, Θ = ∑^{m}_{i=1} *ũ*_{i}. Thus, a single active query component *ũ*_{i} = 1 meeting a synapse with *w*_{ij} = 0 excludes activation of content neuron *j*. Based on this observation, it has been suggested that the Willshaw model should be interpreted as an essentially inhibitory network where zero weights become negative, positive weights become zero, and the optimal firing threshold becomes zero (Knoblauch, 2007). Such inhibitory implementations of the Willshaw network are very simple and efficient for a wide parameter range of moderately sparse memory patterns with *p* ≫ log(*n*)/*n*, where a small number of inhibitory synapses can store a large amount of information, reaching a synaptic capacity *C*^{S} ∼ log *n* bps, even for diluted networks with low connectivity *P* < 1. Moreover, the inhibitory interpretation offers novel functional hypotheses for strongly inhibitory circuits in the brain, for example, involving basket or chandelier cells (Markram et al., 2004). By contrast, the common excitatory interpretation is efficient only for very sparse memory patterns and cannot implement optimal threshold control in a simple and biologically plausible way (Buckingham & Willshaw, 1993; Graham & Willshaw, 1995).
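The exact equivalence of the excitatory and inhibitory formulations for pattern part retrieval can be demonstrated directly. The following sketch (function names are mine) stores clipped Hebbian weights and retrieves with both threshold conventions:

```python
import numpy as np

def willshaw_weights(U, V):
    """Clipped Hebbian learning: w_ij = 1 iff pre- and postsynaptic neurons
    were coactive in at least one stored pattern pair (M_11(ij) > 0)."""
    return ((U.T @ V) > 0).astype(int)

def retrieve_excitatory(W, query):
    """Classic Willshaw pattern-part retrieval: the optimal firing threshold
    equals the query pattern activity."""
    return (query @ W >= query.sum()).astype(int)

def retrieve_inhibitory(W, query):
    """Equivalent inhibitory formulation: zero weights become -1, one-weights
    become 0, and the firing threshold becomes 0."""
    return (query @ (W - 1) >= 0).astype(int)
```

The equivalence is an identity: per output unit, query @ (W − 1) = query @ W − Σ_i ũ_i, so comparing against 0 is the same as comparing query @ W against the query activity.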

Indeed, for pattern part retrieval with zero add noise, *p*_{01|0} = *p*_{01|1} = 0, the Bayesian weights *w*_{ij} of equation 2.15 reduce to the Willshaw-type weights of equation 3.2, where the approximation is valid if query noise is independent of the content, *p*_{10|0} = *p*_{10|1}, and address patterns have sparse activity, *p* ≪ 1, such that *M*_{00} ≫ *M*_{10} and *M*_{01} ≫ *M*_{11}. In case *p*_{10|0} ≠ *p*_{10|1}, the approximation is still valid up to an additive offset, *w*_{0} ≔ log((1 − *p*_{10|1})/(*p*_{10|0})), where optimal retrieval can be implemented as described for equation 2.18.

In particular, the Bayesian synapses become strongly inhibitory, *w*_{ij} = −∞, for *M*_{11} = 0, when the original Willshaw network would have zero weights. For sufficiently small *Mpq*, the fraction of synapses with zero coincidence counters will be significant, *p*_{0} ≔ pr[*M*_{11} = 0] = (1 − *pq*)^{M} ≈ exp(−*Mpq*) ≫ 0, and, thus, the dendritic potentials will be dominated by the strongly inhibitory inputs. For still diverging unit usages, *Mp* → ∞ and *Mq* → ∞, the remaining synaptic countervariables will be large and close to their mean values, *M*_{00} ≈ *M*(1 − *p*)(1 − *q*), *M*_{01} ≈ *M*(1 − *p*)*q*, *M*_{10} ≈ *Mp*(1 − *q*), and therefore approximately equal for all synapses. Thus, up to an additive constant, the synaptic weights become *w*_{ij} ≈ log *M*_{11}(*ij*), corresponding to a nonlinear incremental Hebbian learning rule. At least for large *p*_{0} → 1, this rule will degenerate to the clipped Hebbian rule of the inhibitory Willshaw model, where *w*_{ij} = −∞ with probability pr[*M*_{11} = 0] = *p*_{0} and *w*_{ij} = 0 with probability pr[*M*_{11} = 1] ≈ 1 − *p*_{0}, whereas pr[*M*_{11} > 1] ≈ 0 becomes negligible. Since *p*_{0} → 1 is equivalent to *Mpq* → 0, this means that the Willshaw model becomes equivalent to Bayesian learning at least for max(1/*p*, 1/*q*) ≪ *M* ≪ 1/(*pq*) (see Figure 6, left panels). Numerical experiments suggest that the Willshaw model may be optimal even for smaller *p*_{0} → 0.5 corresponding to logarithmic pattern activity, *mp* ∼ log *n*, where the Willshaw capacity becomes maximal, *C*_{ϵ} → ln 2 ≈ 0.69 bps, given that individual address pattern activities *k*^{μ} are narrowly distributed around *mp* (see Figure 7b; see also Knoblauch et al., 2010). For even smaller *p*_{0} < 0.5, corresponding to *Mpq* > ln 2, the Willshaw model cannot be optimal because then its capacity decreases again, whereas the capacity of the optimal Bayesian model increases toward 1/(2 ln 2) ≈ 0.72 bps.

### 3.2. Linear Learning Models and the Covariance Rule.

Linear learning models superimpose the contributions of all pattern pairs, leading to synaptic weights *w*_{ij} = *w*_{0} + ∑_{u,v ∈ {0,1}} *M*_{uv}(*ij*)*r*_{uv} as in equation 3.4, with offset *w*_{0} and learning increments *r*_{uv} specifying the change of synaptic weight when the presynaptic and postsynaptic neurons have activity *u* ∈ {0, 1} and *v* ∈ {0, 1}, respectively. In fact, for diverging unit usages, *M*_{1}, *M*_{0} → ∞, the synapse usages will be close to expectation: *M*_{11} ≈ *M*_{1}*p*, *M*_{01} ≈ *M*_{1}(1 − *p*), *M*_{10} ≈ *M*_{0}*p*, and *M*_{00} ≈ *M*_{0}(1 − *p*). These approximations make only a negligible relative error if the standard deviations are small compared to the expectations. The most critical variable is the coincidence counter *M*_{11}, having expectation *M*_{1}*p* and standard deviation (*M*_{1}*p*(1 − *p*))^{1/2}. Thus, the approximations are valid for large values of the coincidence counter, that is, for *M*_{1}*p* = *Mpq* → ∞ with *q* ≔ *M*_{1}/*M*. Then the argument of the logarithm in equation 2.15 will be close to *a*_{0} ≔ (*d**_{1}*d**_{2})/(*d**_{3}*d**_{4}), where *d**_{1} ≔ *p*(1 − *p*_{10|1}) + (1 − *p*)*p*_{01|1}, *d**_{2} ≔ (1 − *p*)(1 − *p*_{01|0}) + *pp*_{10|0}, *d**_{3} ≔ *p*(1 − *p*_{10|0}) + (1 − *p*)*p*_{01|0}, and *d**_{4} ≔ (1 − *p*)(1 − *p*_{01|1}) + *pp*_{10|1}. Linearizing the logarithm around *a*_{0} yields *w*_{ij} ≈ log *a*_{0} + ((*d*_{1}*d*_{2})/(*d*_{3}*d*_{4}) − *a*_{0})/*a*_{0}, where *d*_{1} ≔ *M*_{11}(1 − *p*_{10|1}) + *M*_{01}*p*_{01|1}, *d*_{2} ≔ *M*_{00}(1 − *p*_{01|0}) + *M*_{10}*p*_{10|0}, *d*_{3} ≔ *M*_{10}(1 − *p*_{10|0}) + *M*_{00}*p*_{01|0}, and *d*_{4} ≔ *M*_{01}(1 − *p*_{01|1}) + *M*_{11}*p*_{10|1} for brevity. Similarly, the resulting function can be linearized around the expectations of the synapse usages. This gives a learning rule of the form of equation 3.4 with offset *w*_{0} = log *a*_{0} and increments η_{uv} that are constants depending on *p* and the noise parameters. If the query noise is independent of the contents, *p*_{01} = *p*_{01|0} = *p*_{01|1} and *p*_{10} = *p*_{10|0} = *p*_{10|1}, then the four constants become identical, η ≔ η_{11} = η_{10} = η_{01} = η_{00}, the offset becomes zero, *w*_{0} = 0, and the synaptic weight becomes proportional to the covariance of pre- and postsynaptic activity, *w*_{ij} = η ∑^{M}_{μ=1} (*u*^{μ}_{i} − *p*)(*v*^{μ}_{j} − *q*). This is essentially (up to factor η) the linear covariance rule as discussed in much previous work (e.g., Sejnowski, 1977a, 1977b; Hopfield, 1982; Palm, 1988a, 1988b; Tsodyks & Feigel'man, 1988; Willshaw & Dayan, 1990; Dayan & Willshaw, 1991; Palm & Sommer, 1992, 1996; Dayan & Sejnowski, 1993; Chechik et al., 2001; Sterratt & Willshaw, 2008). Thus, together with the results of section 2.3, this shows that, in the asymptotic limit with query noise being independent of contents, optimal Bayesian learning becomes equivalent to linear learning models employing the covariance rule. If query noise depends on contents, Bayesian learning differs from the covariance rule, but up to an additive offset, it still follows a linear learning rule.
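The covariance rule can equivalently be written in the countervariables of section 2.1, since ∑_{μ}(*u*^{μ}_{i} − *p*)(*v*^{μ}_{j} − *q*) = *M*_{11} − *qM*_{1}(*i*) − *pM*′_{1}(*j*) + *Mpq*. This identity is easy to verify numerically (a sketch with my own function names, assuming given activity levels `p` and `q`):

```python
import numpy as np

def covariance_weights(U, V, p, q):
    """Linear covariance rule in pattern-sum form:
    w_ij = sum_mu (u_i^mu - p)(v_j^mu - q)."""
    return (U - p).T @ (V - q)

def covariance_from_counters(M, M1_addr, M1_cont, M11, p, q):
    """The same weights expressed in the countervariables of section 2.1:
    w_ij = M_11 - q*M_1(i) - p*M'_1(j) + M*p*q."""
    return M11 - q * M1_addr[:, None] - p * M1_cont[None, :] + M * p * q
```

The counter form shows that linear covariance learning needs no information beyond the counters already maintained for Bayesian learning.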

### 3.3. BCPNN-Type Models.

The original BCPNN rule employs synaptic weights of the form *w*_{ij} = log(*p̂*_{ij}/(*p̂*_{i}*p̂*_{j})) as in equation 3.14, with probabilities estimated from the countervariables, *p̂*_{i} ≈ *M*_{1}(*i*)/*M*, *p̂*_{j} ≈ *M*′_{1}(*j*)/*M*, and *p̂*_{ij} ≈ *M*_{11}(*ij*)/*M*. Here *w*_{ij} is the synaptic weight and, given a query **ũ**, an output neuron will be activated, *v̂*_{j} = 1, if the dendritic potential exceeds the firing threshold Θ_{j} (see Lansner & Ekeberg, 1989, p. 79).

This heuristic differs from optimal Bayesian retrieval in two respects: it ignores inactive query components, and it approximates the query probability by a product of marginals (see equation 3.16). To see that the latter approximation can fail, consider a toy network with two address units, *m* = 2, and a single output unit, *n* = 1, where, for brevity, the indices are skipped for the output unit. After storing *M* memories with correlated address components, the two sides of equation 3.16 can differ substantially even for zero query noise. Note that the optimal Bayesian model avoids this problem by computing the odds ratio such that the query probability pr[**ũ**] cancels.

Appendix H generalizes the BCPNN rule for noisy queries and describes two improved BCPNN-type rules, each of them fixing one of the two problems described: the BCPNN2 rule (see equation H.9) includes inactive query components but still uses an approximation similar to equation 3.16, and the BCPNN3 rule (see equation H.12) does not employ approximation equation 3.16 but still ignores inactive query components. For the latter, it is possible to compute the SNR in analogy to section 2.3. It turns out that in the linear learning regime, *Mpq* → ∞, the squared SNR *R*^{2} (and thus also the pattern capacity *M*_{ϵ} and network capacity *C*_{ϵ}) remains a certain factor below the optimal value of equation 2.28. This implies also that the original BCPNN rule performs at least this factor worse than the optimal Bayesian rule and thus, at most, is equivalent to the suboptimal linear homosynaptic rule (e.g., see rule R3 in Dayan & Willshaw, 1991). In the complementary nonlinear regime corresponding to very sparse patterns, similar arguments as in section 3.1 show that the BCPNN model becomes equivalent to optimal Bayesian learning and the Willshaw model.
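For reference, the classic BCPNN weight estimate can be sketched as follows (function names and the `eps` regularizer are mine; `eps` stands in for the −∞ weights of never-coactivated unit pairs):

```python
import numpy as np

def bcpnn_weights(U, V, eps=1e-12):
    """Sketch of the classic BCPNN weight
    w_ij = log(P(u_i=1, v_j=1) / (P(u_i=1) * P(v_j=1))),
    with probabilities estimated from the stored pattern statistics."""
    M = U.shape[0]
    pi = U.mean(axis=0)            # estimate of pr[u_i = 1] = M_1(i)/M
    pj = V.mean(axis=0)            # estimate of pr[v_j = 1] = M'_1(j)/M
    pij = (U.T @ V) / M            # estimate of pr[u_i=1, v_j=1] = M_11(ij)/M
    return np.log((pij + eps) / (pi[:, None] * pj[None, :] + eps))
```

Positively correlated unit pairs get positive weights, independent pairs weights near zero, and never-coactivated pairs strongly negative weights, which is the nonlinear, Willshaw-like behavior discussed above for the very sparse regime.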

## 4. Results from Simulation Experiments

This section has two purposes: to verify the theoretical results and to compare the performances of the different learning models. To this end, I have implemented associative memory networks with optimal Bayesian learning (see section 2.2), BCPNN-type learning (see appendix H and section 3.3), linear learning (see appendix G and section 3.2), and Willshaw-type clipped Hebbian learning (see section 3.1). All experiments assume full network connectivity (*P* = 1).

### 4.1. Verification of SNR *R*.

A first series of experiments illustrated by Figure 4 implemented networks of size *m* = *n* = 1000 and compared experimental SNR *R* of dendritic potentials (black curves; see equation 2.24) to the theoretical values (gray curves). Here the theoretical values have been computed from equation 2.28 (Bayes), equations G.7 to G.9 (linear), and equation H.21 (BCPNN3). Data correspond to four experimental conditions testing sparse versus nonsparse memory patterns and queries having miss noise versus add noise. For each condition, the corresponding plot shows SNR *R* as a function of stored memories *M*. All experiments assumed ideal conditions where each query pattern was generated from an address pattern **u**^{μ} having *k* = *pm* one-entries, where contained correct one-entries and false one-entries (see Figure 2, left). Furthermore, all tested content neurons had unit usages *M*_{1} = *Mq*.

For most conditions and models, the theoretical predictions match the experimental SNR very well. This is true in particular for the three tested linear models (Hebb rule, homosynaptic rule, and covariance rule), but also for the Bayesian and BCPNN-type rules if the mean value of the coincidence counter is sufficiently large, , as presumed at the beginning of section 2.3. For example, for nonsparse patterns, the theoretical results become virtually exact for *M*>70 or . For fewer coincidences, , the SNR curves of the Bayesian and BCPNN-type models are similar to those of the Willshaw model. Here the SNR is not a good predictor of retrieval quality and cannot easily be compared to the regime with , for the following reasons. First, the variances of the dendritic potentials of high and low units become significantly different, (cf. equation 2.26).^{6} Second, the distributions of dendritic potentials become nongaussian (Knoblauch, 2008; cf. appendix E). Third, in particular for very small , dendritic potentials may be contaminated by infinite synaptic inputs (see equations 2.15, 3.2, and 3.14). This reasoning also explains the nonmonotonicity of the SNR curves visible in Figure 4 for the Bayesian and BCPNN-type models as a transition from a nonlinear Willshaw-type to a linear covariance-type regime of operation.
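The experimental SNR can be estimated directly from the measured dendritic potentials of high and low units. The sketch below is a minimal version of such a measurement; it assumes the normalization *R* = (μ_{hi} − μ_{lo})/σ_{lo}, which may differ in detail from equation 2.24 (not reproduced here), and all names are hypothetical.

```python
import numpy as np

def empirical_snr(x, is_high):
    """Empirical SNR of dendritic potentials.

    x: dendritic potentials of the tested content neurons,
    is_high: boolean mask marking 'high' units (v^mu_j = 1).
    Assumes R = (mu_hi - mu_lo) / sigma_lo; the exact normalization
    used in the text (equation 2.24) may differ, e.g., a pooled sigma.
    """
    hi, lo = x[is_high], x[~is_high]
    return (hi.mean() - lo.mean()) / lo.std(ddof=1)

# Synthetic gaussian potentials with a true separation of 3 sigma:
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0.0, 1.0, 500),   # low units
                    rng.normal(3.0, 1.0, 500)])  # high units
mask = np.arange(1000) >= 500
print(round(empirical_snr(x, mask), 2))  # close to 3
```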

### 4.2. Verification of Output Noise.

In a second step, I verified the theory for output noise (see equation 2.30) as described in appendix E using the same network implementations as described before. In fact, appendix E shows that there is a bijective relation between the SNR *R* and (minimal) output noise if the dendritic potentials are gaussian and the high and low potentials have identical variances. Thus, given that the theory of SNR is correct, here it is tested whether these two conditions hold true.

Figure 5 shows output noise as a function of stored memories *M*, assuming the same conditions as described for Figure 4. As before, for most conditions and models, the theoretical predictions match the experimental values very well. In fact, the match is good for the Bayesian and BCPNN-type rules even when assuming relatively small , where the theoretical estimates of the SNR are still inaccurate. The theory is inaccurate only for the Bayesian and BCPNN-type models under the condition of sparse memories and miss noise. Here the theory basically suggests equivalence to the linear covariance rule, whereas the Bayesian and BCPNN-type models perform much better due to the infinitely negative synaptic weights caused by the *M*_{11} = 0 events, which allow rejecting a neuron activation based on a single presynaptic input.

### 4.3. Verification of Storage Capacity *M*_{ϵ}.


A further series of experiments illustrated by Figure 6 tested the theory of storage capacity (see equations 2.32 and 2.33) for different network sizes *m* = *n* = 100, 1000, 10,000, a larger range of pattern activities *mp* (=*nq*), and relaxing the restrictive assumption of having fixed *k*, *c*, *f*, *M*_{1}. This means that a query pattern was generated by randomly selecting one of the *M* address patterns **u**^{μ} and applying query noise according to parameters and . Similarly, all content neurons were included in the analysis. Thus, the previously fixed parameters became binomials, *k* ∼ *B*_{m,p}, , , *M*_{1} ∼ *B*_{M,q}, where *B*_{N,P} is as explained below equation B.1.

Each plot shows output noise as a function of mean pattern activity *mp*. For each value of *mp*, the number of stored patterns, , was computed from equation 2.33 for the optimal Bayesian rule and a low output noise level ϵ = 0.01 (see parameter sets 1–6 in Table 1). For small networks (*m* = *n* = 100; upper panels), the theory is generally inaccurate. For example, for the optimal Bayesian learning rule, the theory strongly overestimates storage capacity for sparse memory patterns and underestimates capacity for nonsparse patterns. For larger networks (middle and bottom panels), there is a large range of *mp* where the theory precisely predicts storage capacity. Only for very sparse memory patterns (with small ) does the theory remain inaccurate. For queries containing add noise, the theory generally overestimates the true capacity. For queries containing only miss noise, the theory overestimates capacity for extremely sparse patterns but underestimates capacity for patterns with intermediate sparseness.

For larger networks and , the theory becomes very precise for the optimal Bayes rule, the BCPNN3 rule, and the linear covariance rule.

In contrast, even for *m* = *n* = 10,000 and *pm*>1000, the theory for the linear homosynaptic rule underestimates output noise by about a factor of two. The underestimation is even worse for the linear Hebbian rule. The reason is that, in contrast to the covariance and homosynaptic rules, the mean synaptic weight is nonzero for the Hebbian rule. Therefore, inhomogeneities in *c*, *f*, and *k* can cause a much larger variance in dendritic potentials than predicted by the theory, which assumes fixed values for *c*, *f*, and *k*.

### 4.4. Comparison of the Different Learning Models.

The simulation experiments confirm that the Bayesian learning rule is the general optimum, leading to maximal SNR, minimal output noise, and highest storage capacity. Nevertheless, the simulations also show that for particular parameter ranges, some of the previous learning models can become optimal as well.

The linear covariance rule becomes optimal in the linear learning regime, , which, for given output noise level , corresponds to moderately sparse or nonsparse memory patterns with *mp*/ln *q* → ∞ (see equations 2.35 and 2.37). However, for sparse memory patterns of finite size, the linear rules can perform much worse than the optimal Bayesian model—even worse than the Willshaw model.

Similarly, the BCPNN-type models become optimal in the limit of sparse query activity, . For finite size or nonsparse query patterns, the storage capacity can be significantly (factor ) below the optimal value.

Finally, the Willshaw model becomes optimal only for pattern part retrieval () and few coincidence counts, corresponding to very sparse memory patterns with *mp* = *O*(ln *q*). For finite networks, the Willshaw model achieves the performance of the Bayesian model only if the output noise level is low and the address pattern activities *k*^{μ} are constant or narrowly distributed around *mp*. In all other cases, the Willshaw model performs much worse than the optimal Bayesian rule.

### 4.5. Further Results Concerning Memory Statistics and Retrieval Methods.

A further series of experiments (see Figure 7) again used networks of size *m* = *n* = 1000 and pattern part retrieval with , . Since the Bayesian theory can strongly overestimate pattern capacity for very sparse memory patterns (see equation 2.37), memories were stored at the much lower capacity limit of the Willshaw model assuming a fixed pattern activity *k*^{μ} = *mp* for all memories (see equation 57 in Knoblauch et al., 2010; see parameter set 7 in Table 1). Then testing the networks again with random patterns having independent components (and binomial activity *k*^{μ} ∼ *B*_{m,p}) yields qualitatively similar results as before (compare the top left panel of Figure 7 to the middle left panel of Figure 6). Further simulations suggest that the Bayesian and BCPNN-type models have a high-fidelity capacity for very sparse patterns that is almost as low as reported for the Willshaw model (basically for ϵ ≪ 1 and *k*/log *n* → 0; see appendix D in Knoblauch et al., 2010).

Table 1: Storage capacity *M*_{ϵ} at output noise level ϵ = 0.01 for parameter sets 1 to 7. Sets 1 to 6: Bayes (1–2: *m* = *n* = 100; 3–4: *m* = *n* = 1000; 5–6: *m* = *n* = 10,000); set 7: Willshaw (*m* = *n* = 1000).

| *mp* = *nq* | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| 2 | 63 | 85 | 5371 | 7161 | 468,070 | 624,093 | 6 |
| 4 | 34 | 45 | 2815 | 3753 | 243,308 | 324,411 | 315 |
| 6 | 23 | 31 | 1932 | 2577 | 166,089 | 221,453 | 988 |
| 10 | 15 | 20 | 1206 | 1608 | 102,781 | 137,042 | 1578 |
| 20 | 8 | 11 | 639 | 853 | 53,710 | 71,613 | 1252 |
| 30 | 6 | 8 | 443 | 591 | 36,794 | 49,059 | 851 |
| 50 | 5 | 5 | 281 | 374 | 22,886 | 30,514 | 448 |
| 100 | | | 154 | 205 | 12,063 | 16,084 | 156 |
| 200 | | | 88 | 116 | 6399 | 8531 | 47 |
| 300 | | | 66 | 84 | 4435 | 5912 | 22 |
| 500 | | | 50 | 50 | 2813 | 3749 | 9 |
| 1000 | | | | | 1546 | 2056 | |
| 2000 | | | | | 886 | 1163 | |
| 3000 | | | | | 664 | 845 | |
| 5000 | | | | | 502 | 502 | |


Notes: Data assume various network sizes *m* = *n*, mean pattern activities *mp* = *nq*, and query noise parameters , . Capacities for the Bayesian model have been computed from equation 2.33 (assuming independent pattern components). Capacities for the Willshaw model have been computed from Knoblauch et al. (2010, eq. 57) and are exact for fixed pattern activities *k* = *mp* (whereas independent memory components would imply for a large range of sparse memory patterns; cf., Knoblauch et al., 2010, eq. 65).

In contrast, for random patterns with fixed activity *k*^{μ} = *mp*, the Bayesian and BCPNN-type models perform equivalently to the Willshaw model for a large range of sparse patterns (see Figure 7, top right panel). Moreover, for less sparse patterns, BCPNN2 becomes equivalent to the BCPNN rule, and BCPNN3 becomes equivalent to optimal Bayesian learning. There is also a strong improvement for the linear homosynaptic and Hebb rules, which now closely match the theoretical values (for independent pattern components and binomial *k*^{μ}); here the homosynaptic rule becomes equivalent to the covariance rule.

So far, retrieval used fixed firing thresholds to minimize output noise (see appendix D). A simple alternative is *l*-winners-take-all (WTA) retrieval, activating the *l* neurons with the largest dendritic potentials *x*_{j} (as may be implemented in the brain by recurrent inhibition, for example).^{7} Figure 7 (bottom left panel) shows simulation results for *l*-WTA and memory patterns with independent components and binomial *k*^{μ} ∼ *B*_{m,p}. Surprisingly, all of the various learning models show almost identical performance at relatively high levels of output noise. There are two reasons that can partly explain this result. First, *l*-WTA cannot achieve high fidelity because the content patterns **v**^{μ} have a distributed pattern activity *l*^{μ} ∼ *B*_{n,q} that is unknown beforehand. Thus, activating the *l* most excited units causes a positive baseline level of output noise. Second, storing patterns at the relatively low capacity limit of the Willshaw model implies, for fixed thresholds, low output noise for all models. Therefore, the actual output noise for *l*-WTA will be dominated by the baseline errors described. Nevertheless, further simulations confirmed that even for a larger number of stored patterns, the performances of the different models are much more similar than for fixed firing thresholds.

For *l*-WTA and fixed pattern activity *l*^{μ} = *nq*, the performance generally improves (Figure 7, bottom right panel). As before, *l*-WTA seems to even out the performance differences between the various synaptic learning models: surprisingly, the linear Hebbian, homosynaptic, and covariance rules now show identical high performance, precisely matching the theoretical values for the covariance rule. The Bayesian and BCPNN-type rules also show identical performance. Further simulations show that for queries including add noise (), *l*-WTA retrieval becomes identical even between the Bayesian-type and linear model groups. These results support the view that homeostatic mechanisms, such as regulating the total activity level, may play a role as important as tuning the synaptic learning parameters (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Van Welie, Van Hooft, & Wadman, 2004; Chechik et al., 2001; Knoblauch, 2009c).
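A minimal sketch of *l*-WTA retrieval, assuming only what the text states (activate the *l* units with the largest dendritic potentials); the function name is hypothetical.

```python
import numpy as np

def l_wta(x, l):
    """l-winners-take-all: activate the l neurons with the largest
    dendritic potentials x (ties broken by index order via argsort)."""
    out = np.zeros_like(x, dtype=int)
    out[np.argsort(x)[-l:]] = 1
    return out

x = np.array([0.2, 1.5, -0.3, 0.9, 2.1])
print(l_wta(x, 2))  # activates units 1 and 4 -> [0 1 0 0 1]
```

Note that, unlike a fixed firing threshold, this rule always activates exactly *l* units, which is why a mismatch between *l* and the true (binomially distributed) content activity *l*^{μ} produces the baseline output noise discussed above.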

## 5. Summary and Discussion

Neural associative memories are promising models for computations in the brain (Hebb, 1949; Anderson, 1968; Willshaw et al., 1969; Marr, 1969, 1971; Little, 1974; Gardner-Medwin, 1976; Braitenberg, 1978; Hopfield, 1982; Amari, 1989; Palm, 1990; Lansner, 2009), and they are potentially useful in technical applications such as cluster analysis, speech and object recognition, or information retrieval in large databases (Kohonen, 1977; Bentz, Hagstroem, & Palm, 1989; Prager & Fallside, 1989; Greene, Parnas, & Yao, 1994; Huyck & Orengo, 2005; Knoblauch, 2005; Mu, Artiklar, Watta, & Hassoun, 2006; Wichert, 2006; Rehn & Sommer, 2006).

In this paper, I have developed and analyzed the generally optimal neural associative memory that minimizes the Hamming-distance-based output noise and maximizes pattern capacity and network storage capacity *C*_{ϵ} based on Bayesian maximum likelihood considerations. In general, the resulting optimal synaptic learning rule, equation 2.15, is nonlinear and asymmetric, and it differs from previously investigated linear learning models of the Hopfield type, simple nonlinear learning models of the Willshaw type, and BCPNN-type Bayesian learning heuristics. As revealed by detailed theoretical and experimental comparisons, the previous models are special cases of Bayesian learning that become optimal only in the asymptotic limit of large networks and for particular ranges of pattern activity *p*, *q* and query noise (see Table 2).

Table 2: Conditions under which the previous learning rules become equivalent to optimal Bayesian learning.

| Learning Rule | General Conditions for Optimality | Conditions at Capacity Limit |
|---|---|---|
| Optimal Bayesian | None | None |
| BCPNN type | p → 0 | p → 0 |
| Linear covariance | | (mp)/log m → ∞ |
| Linear homosynaptic | and p → 0 | (mp)/log m → ∞ and p → 0 |
| Linear heterosynaptic | and q → 0 | (mp)/log m → ∞ and q → 0 |
| Linear Hebb | and p, q → 0 | (mp)/log m → ∞ and p, q → 0 |
| Linear Hopfield | and p, q → 0.5 | p, q → 0.5 |
| Willshaw | and and | mp ∼ log m and |


Notes: The constraints depend on the fraction of active units in an address pattern (*p* ≔ pr[*u*^{μ}_{i} = 1]) or content pattern (*q* ≔ pr[*v*^{μ}_{j} = 1]), the size of the address population (*m*), the mean value of the synaptic coincidence counter (, where *M* is the number of stored memories), the mean unit usages (, ), and the fraction of add noise in the query pattern (). The right column reexpresses the general conditions of the middle column for the case when *M* equals the pattern capacity .

For example, the Willshaw model becomes optimal only in the limit of small coincidence counters, , for queries without any add noise, . Maximal at the capacity limit can be achieved only for extremely sparse memory patterns where the number of active units per memory vector typically scales logarithmically in the population size, for example, *p*, *q* ∼ log *n*/*n* (Knoblauch et al., 2010). Nevertheless, one may be surprised that such a simple model employing binary synapses can perform optimal Bayesian retrieval at all. The reason is that a low value of guarantees that a large fraction *p*_{0} ≔ (1 − *pq*)^{M} of synaptic weights remains zero in the Willshaw model, or minus infinity in the corresponding Bayesian interpretation (see equation 3.2). Retrieval is then dominated by rejecting activations of postsynaptic neurons based on single but strongly inhibitory inputs. In particular, for small but nonvanishing *p*_{0}, the inhibitory Willshaw network becomes very efficient, storing large amounts of information with a small number of synapses (Knoblauch, 2007). Such an inhibitory interpretation of associative memory may also offer novel functional hypotheses for strongly inhibitory cortical circuits, for example, involving chandelier or basket cells (Markram et al., 2004), and also for inhibition-dominated brain structures such as the cerebellum and basal ganglia (Marr, 1969; Albus, 1971; Kanerva, 1988; Wilson, 2004).
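The veto mechanism described above can be sketched as follows. The code assumes the standard Willshaw storage rule (binary weights clipped at 1) and reads a zero weight as an infinitely inhibitory synapse, so a single active query input arriving through a zero weight rejects the output neuron; function names are hypothetical.

```python
import numpy as np

def willshaw_store(U, V):
    """Binary Willshaw weights: w_ij = 1 iff units i and j were coactive
    in at least one stored memory (clipped Hebbian learning)."""
    return (U.T @ V > 0).astype(int)

def willshaw_retrieve(W, query):
    """Inhibitory reading: an output neuron is vetoed as soon as a single
    active query input arrives via a zero (i.e., minus-infinity) weight;
    otherwise it fires. Equivalent to a threshold equal to the number of
    active query inputs in the classical excitatory reading."""
    active = query.astype(bool)
    vetoed = (W[active] == 0).any(axis=0)
    return (~vetoed).astype(int)

U = np.array([[1, 1, 0, 0], [0, 0, 1, 1]])  # two address patterns
V = np.array([[1, 0, 0], [0, 1, 1]])        # associated contents
W = willshaw_store(U, V)
print(willshaw_retrieve(W, np.array([1, 1, 0, 0])))  # recovers [1 0 0]
```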

In contrast to the Willshaw model, the linear covariance rule becomes optimal in the linear learning regime where the synaptic coincidence counters diverge, . Then linearization of the optimal Bayesian rule yields the covariance rule, and the two rules have the same asymptotic SNR. Correspondingly, the fraction of synapses with infinite weights vanishes, *p*_{0} → 0, which, at the capacity limit (see equation 2.33), corresponds to moderately sparse or nonsparse memory patterns with typically *p*, *q* ≫ log *n*/*n*. Numerical experiments indicate that in reasonably large but finite networks, the optimal Bayesian model still performs significantly better than the linear covariance rule for a large range of pattern activities *p* ≪ 0.5. Furthermore, the SNR analysis allows a characterization of basins of attraction in terms of miss noise and add noise (see equation 2.29 and Figure 2, right). It turns out that in the linear learning regime, , the network is more vulnerable to miss noise () than to add noise (). This contrasts with the nonlinear learning regime, , where the network is more vulnerable to add noise, mainly because add noise destroys the network's ability to reject postsynaptic activations by single strongly inhibitory synaptic inputs. Alternative linear learning models such as the Hebb, homosynaptic, and heterosynaptic rules behave similarly to the covariance rule but have a lower signal-to-noise ratio unless *p* → 0 and/or *q* → 0 (Dayan & Willshaw, 1991).

The original BCPNN model of Lansner and Ekeberg has a formulation similar to that of the optimal Bayesian model but neglects inactive query neurons and employs an inaccurate approximation (see equation 3.16). More recent hypercolumnar variants of the BCPNN model for discrete-valued memories remedy the first problem by employing extra neurons to represent inactivity (Lansner & Holst, 1996; Johansson, Sandberg, & Lansner, 2002), but require (at least) double the network size of the optimal Bayesian model. For comparison, I have extended the original BCPNN model to include query noise and derived two improved BCPNN-type rules: the BCPNN2 rule also considers the inactive query neurons, whereas the BCPNN3 rule does not make use of the inaccurate approximation. Similar to the Willshaw model, the BCPNN-type rules become optimal at least in the nonlinear learning regime, , corresponding to very sparse patterns where active units dominate the total information contained in a query pattern. Moreover, for the linear learning regime, , I have analyzed the SNR of the BCPNN3 rule, which is an upper bound for the original BCPNN rule. The analysis revealed that the SNR of the BCPNN3 model is equivalent to that of the linear homosynaptic rule, that is, a factor worse than for optimal Bayesian learning (see also Dayan & Willshaw, 1991). Thus, the original BCPNN rule achieves at most the capacity of the homosynaptic rule and becomes optimal only for sparse address patterns with *p* → 0 or low query activity with small . Even for sparse address patterns with *p* → 0, the BCPNN-type models have reduced basins of attraction in the sense that they are more vulnerable to add noise with large than the optimal Bayesian model.

MacKay (1991) has suggested a learning model based on maximizing the entropy of synaptic weights that is closely related to optimal Bayesian associative memory. In particular, he arrived at a similar learning rule and also discussed the convergence to the covariance rule as well as the necessity of infinite synaptic weights. The current approach goes beyond these previous results by generalizing the learning rule for query noise and providing an SNR analysis for Bayesian learning. The latter, in connection with the results of appendix E, rigorously proves the equivalence of Bayesian learning and the covariance rule in the limit (whereas Taylor expansion of the BCPNN rule, for example, also leads to the covariance rule in spite of BCPNN being suboptimal; see section H.4). Moreover, this analysis also discusses convergence of the Bayesian learning rule to linear learning rules other than the covariance rule when the query noise is not independent of the stored contents (as can be expected for any real-world data).

As with most previous approaches, the "optimal" Bayesian memory model still makes the naive assumption that address attributes are independent of each other. Although this assumption is almost never fulfilled in real-world data, experiments reveal that naive Bayesian classifiers perform surprisingly well or even optimally in many domains that contain clear attribute dependencies (Zhang, 2004; Domingos & Pazzani, 1997). Moreover, it may be possible to extend the model by semi-naive approaches including higher-order dependencies, for example, as suggested by Kononenko (1991, 1994).
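The naive independence assumption can be made concrete with a generic naive Bayes decision sketch: the log-odds for one output unit is the prior log-odds plus a sum of per-attribute log-likelihood ratios. The function and its parameters are hypothetical illustrations, not the paper's equation 2.15.

```python
import math

def naive_bayes_log_odds(query, p1, p0, q):
    """Naive Bayes log-odds for one output unit under attribute
    independence. p1[i] = pr[u_i = 1 | v = 1], p0[i] = pr[u_i = 1 | v = 0],
    q = prior pr[v = 1]. Activate the unit if the log-odds is >= 0."""
    odds = math.log(q / (1 - q))
    for ui, a, b in zip(query, p1, p0):
        odds += math.log(a / b) if ui else math.log((1 - a) / (1 - b))
    return odds

# An attribute far more likely under v = 1 pushes the odds up:
print(naive_bayes_log_odds([1, 0], [0.9, 0.5], [0.1, 0.5], 0.5) > 0)  # True
```

The additive per-attribute structure is exactly what makes a single-layer implementation with local synaptic weights possible; correlated attributes would require the higher-order (semi-naive) terms mentioned above.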

At least for independent address attributes, the Bayesian neural associative memory presented in this work is, by definition, the optimal local learning model maximizing and *C*_{ϵ}. On the other hand, there exist general bounds on the storage capacity of neural networks that do not refer to any particular learning algorithm (Gardner, 1988; Gardner & Derrida, 1988). Like the linear covariance rule, the optimal Bayesian model reaches the Gardner bound for sparse memory patterns *p*, *q* → 0 in the limit *Mpq* → ∞, corresponding to moderately sparse patterns with *mp* ≫ log(*n*), where the network can store *C*_{ϵ} = 1/(2 ln 2) ≈ 0.72 bps (compare equation 2.37 to equation 40 in Gardner, 1988). However, for logarithmically sparse memory patterns with *mp* ∼ log *n*, the storage capacity of the optimal Bayesian rule is below the Gardner bound and cannot exceed the maximal capacity of the Willshaw model, which is at *C*_{ϵ} = ln 2 ≈ 0.69 bps (or, rather, *C*_{ϵ} = 1/(*e* ln 2) ≈ 0.53 bps for distributed pattern activities; see Knoblauch et al., 2010, appendix D). For even sparser memory patterns with *mp*/log *n* → 0, the storage capacity vanishes, *C*_{ϵ} → 0. Also, for nonsparse patterns where *p* → 0.5, the Gardner bound of 2 bps cannot be reached. Here the optimal Bayesian rule achieves at most *C*_{ϵ} ≈ 0.33 bps for very low-fidelity retrieval with ϵ → 1, and only *C*_{ϵ} → 0 for high-fidelity retrieval with vanishing output noise ϵ → 0 (see Figure 3). Thus, as noted by Sommer and Dayan (1998), at least for nonsparse address patterns with *p* → 0.5, local learning is insufficient, and the optimal synaptic weights must be found by more sophisticated algorithms including nonlocal information.

Even if the Bayesian associative memory could reach the Gardner bound, the resulting storage capacity of at most 2 bits per synapse would be low compared to the physical memory actually required to represent real-valued synaptic weights (or, alternatively, the countervariables described in section 2.1). Even worse, an accurate neural implementation of the Bayesian associative memory requires two numbers per synaptic weight: a real-valued variable for the finite contributions and an integer variable for the infinite contributions (see appendix A). In fact, if we take into account the computational resources required to represent the resulting network, the Willshaw model outperforms all other models due to its binary weights (Knoblauch et al., 2010): For implementations on digital hardware, the Willshaw model can reach the theoretical maximum of *C*^{I} = 1 bit per computer bit (Knoblauch, 2003). Correspondingly, parallel hardware implementations of structurally plastic Willshaw networks can reach the theoretical maximum of *C*^{S} = log *n* bits per synapse (Knoblauch, 2009b). However, these high capacities (per synapse) are achieved only for a relatively low absolute number of stored memories, *M*, far below the Gardner bound, equation 2.37. Some preliminary work (Knoblauch, 2009c, 2010b) indicates that the Bayesian associative memory can be efficiently discretized such that structurally compressed network implementations can store *C*^{I} → 1 bit per computer bit or *C*^{S} → log *n* bits per synapse, whereas *M* (and *C*) can still be close to the Gardner bound. Another future direction will be to investigate more closely the biological relevance of Bayesian learning by implementing more realistic network models that include spikes, forgetful synapses, and inhibitory circuits (Sandberg et al., 2000; Fusi, Drew, & Abbott, 2005; Markram et al., 2004).

## Appendix A: Implementation of Infinite Weights and Thresholds

With the definitions *d*_{1} ≔ *M*_{11}(1 − *p*_{10|1}) + *M*_{01}*p*_{01|1}, *d*_{2} ≔ *M*_{00}(1 − *p*_{01|0}) + *M*_{10}*p*_{10|0}, *d*_{3} ≔ *M*_{10}(1 − *p*_{10|0}) + *M*_{00}*p*_{01|0}, and *d*_{4} ≔ *M*_{01}(1 − *p*_{01|1}) + *M*_{11}*p*_{10|1}, the synaptic weight, equation 2.15, can be expressed by with the gating functions for and for , and for and for . Thus, *w*_{ij} represents the finite weight, neglecting infinite components, whereas *w*^{∞}_{ij} counts the number of contributions toward plus and minus infinity. Similarly, the finite and infinite components of the firing thresholds (corresponding to the "bias" in equation 2.16) can be written as Then the finite and infinite components of the dendritic potentials are and , such that a postsynaptic neuron *j* gets activated if either *x*^{∞}_{j} > Θ^{∞}_{j}, or *x*^{∞}_{j} = Θ^{∞}_{j} and *x*_{j} ⩾ Θ_{j}.
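Assuming only the activation condition just stated, the two-component weights and potentials of appendix A can be represented as (infinite-count, finite) pairs that compare lexicographically; the class and function names below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Potential:
    """Dendritic potential split into an integer count of +/- infinity
    contributions (inf) and a finite part (fin), as in appendix A."""
    inf: int
    fin: float

def fires(x: Potential, theta: Potential) -> bool:
    """Neuron j is activated iff x_inf > Theta_inf, or
    x_inf == Theta_inf and x_fin >= Theta_fin (lexicographic order)."""
    return (x.inf, x.fin) >= (theta.inf, theta.fin)

# A single -infinity contribution (inf = -1) vetoes activation even for
# a large finite potential:
print(fires(Potential(-1, 99.0), Potential(0, 0.0)))  # False
print(fires(Potential(0, 0.5), Potential(0, 0.0)))    # True
```

Python's tuple comparison implements exactly the required order: the integer infinity counts are compared first, and the finite parts only break ties.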

## Appendix B: Analysis of the SNR for Optimal Bayesian Retrieval

The following computes the SNR (see equation 2.24) for neural associative memory with optimal Bayesian learning (section 2.2) making the same definitions and simplifications as detailed at the beginning of section 2.3. Section B.1 computes the mean difference Δμ ≔ μ_{hi} − μ_{lo} between the dendritic potential of a high and a low unit, and section B.2 computes the variances σ^{2}_{hi} and σ^{2}_{lo} for the corresponding distributions of dendritic potentials.

### B.1. Mean Values of Dendritic Potentials.

Approximating *m* − 1 by *m* and skipping indices *i*, *j* for brevity, a content neuron *j* will be activated if the dendritic potential *x*_{j} exceeds the threshold Θ_{j} ≔ log(*M*_{0}/*M*_{1}) (instead of Θ_{j} = 0), where Given *M*_{1}, *M*_{0}, the remaining variables are binomially distributed, and , where . For large *NP*(1 − *P*), the binomial *B*_{N,P} can be approximated by a gaussian *G*_{μ,σ} with mean μ = *NP* and variance σ^{2} = *NP*(1 − *P*). Given *u*^{μ}_{i} and *v*^{μ}_{j}, we then have From this, we can approximate the distribution of the dendritic potential *x*_{j} for low units and high units, respectively. For large *k* and *m* − *k*, the sums of logarithms in equation B.1 are approximately gaussian distributed. In principle, the mean potentials μ_{lo} and μ_{hi} for low units and high units can be computed exactly from equation B.12. Fortunately, it turns out that the mean potential difference Δμ ≔ μ_{hi} − μ_{lo} required for the SNR can be well approximated by using only the first-order term in equation B.12 (while all higher-order terms become virtually identical for μ_{hi} and μ_{lo}; for more details, see Knoblauch, 2009a, appendixes D, F). These first-order approximations μ′_{lo}, μ′_{hi} of μ_{lo}, μ_{hi} are where the approximations are valid for large *M*_{0}*p*, *M*_{1}*p* → ∞ and sufficiently small *p*_{01}, *p*_{10}. Therefore, the mean difference Δμ ≔ μ_{hi} − μ_{lo} between the high and low distributions is

### B.2. Variance of Dendritic Potentials.

The following computes the variances σ^{2}_{lo} and σ^{2}_{hi} for *x*_{j} in equation B.1. Given the unit usages *M*_{1}(*j*), the random variables *M*_{00}(*i*, *j*) and *M*_{11}(*i*, *j*) are independent, and thus the variances simply add. Because each variance summand is positive, for large *M*_{1}*p*, *M*_{0}*p* → ∞, we can simply assume and in all cases (cf. equations B.2 and B.3). With equation B.13 we get Thus, the variances Var(*x*_{j}) for the potentials of both low units and high units are approximately

### B.3. Lemmas for Computing Dendritic Potential Distributions.

Let *X* be a random variable with normal distribution, *X* ∼ *G*_{0,σ}; that is, *X* is a gaussian with zero mean and variance σ^{2}. Then the *d*th moment is E[*X*^{d}] = 1 · 3 · 5 ⋯ (*d* − 1) · σ^{d} for even *d*, and E[*X*^{d}] = 0 for odd *d*. Proofs can be found in standard textbooks of statistics and probability theory (e.g., see equation 5.44 in Papoulis, 1991).
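The standard gaussian moment formula cited from Papoulis can be checked with a few lines of Python; the function name is hypothetical.

```python
import math

def gaussian_moment(d, sigma):
    """d-th moment of a zero-mean gaussian with variance sigma**2:
    E[X^d] = 1 * 3 * ... * (d - 1) * sigma**d for even d, and 0 for
    odd d (cf. Papoulis, 1991, equation 5.44)."""
    if d % 2 == 1:
        return 0.0
    double_factorial = math.prod(range(1, d, 2))  # 1 * 3 * ... * (d - 1)
    return double_factorial * sigma ** d

print(gaussian_moment(2, 2.0))  # variance of G_{0,2}: 4.0
print(gaussian_moment(4, 1.0))  # fourth moment of G_{0,1}: 3.0
```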

## Appendix C: Gaussian Tail Integrals

Let *g*(*x*) ≔ e^{−*x*^{2}/2}/√(2π) be the gaussian probability density. Then the complementary gaussian distribution function is the right tail integral: The first bound is true for any *x*>0, and the corresponding approximation error becomes smaller than 1% for *x*>10. The second bound is true for any *x*>0. Inverting *G*^{c} yields The two approximations correspond to those of equation C.2. In the first approximation, the term *G*^{c}^{−1}(*x*) can be replaced, for example, by the second approximation.
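A sketch of the tail integral via the complementary error function, together with the classical Mills-ratio bound *G*^{c}(*x*) ⩽ *g*(*x*)/*x*; this standard bound may or may not be the exact one intended in equation C.2, and the function names are hypothetical.

```python
import math

def gauss_density(x):
    """Standard gaussian density g(x) = exp(-x**2/2) / sqrt(2*pi)."""
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def gauss_tail(x):
    """Complementary distribution function G^c(x) = integral of g from
    x to infinity, computed via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

# The Mills-ratio bound G^c(x) <= g(x)/x holds for all x > 0 and
# becomes tight for large x:
for x in (1.0, 3.0, 10.0):
    assert gauss_tail(x) <= gauss_density(x) / x
print(round(gauss_tail(10.0) / (gauss_density(10.0) / 10.0), 2))  # close to 1
```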

## Appendix D: Optimal Firing Thresholds

Given a query **u**^{μ}, our goal is to minimize the expected Hamming distance between the corresponding content **v**^{μ} and the retrieval output (see equation 2.21). To this end, each content neuron *v*_{j} has to adjust its firing threshold Θ in order to minimize

*H*(Θ) ≔ (1 − *q*)*q*_{01} + *q* *q*_{10},

where *q* ≔ pr[*v*^{μ}_{j} = 1] is the prior and *q*_{01}, *q*_{10} are the probabilities of making an output error (e.g., equations 2.22 and 2.23) assuming a given low distribution *g*_{lo}(*x*) ≔ pr[*x*_{j} = *x* | *v*^{μ}_{j} = 0] and high distribution *g*_{hi}(*x*) ≔ pr[*x*_{j} = *x* | *v*^{μ}_{j} = 1] for the dendritic potential *x*_{j} (e.g., see equation 2.16). Minimizing *H*(Θ) requires *dH*/*d*Θ = 0 or, equivalently,

(1 − *q*)*g*_{lo}(Θ) = *q* *g*_{hi}(Θ),

as illustrated by Figure 8 (left). The optimal threshold Θ_{opt} can be obtained by solving equation D.3, which is easy if the distributions of dendritic potentials are gaussians. Then equation D.3 can be rewritten in terms of the gaussian density *g* of equation C.1, where μ_{lo}, μ_{hi}, σ_{lo}, σ_{hi} are the means and standard deviations of the low and high dendritic potentials, similar as defined below equation 2.24. Taking logarithms yields a quadratic equation in Θ, where the optimal threshold is one of the two solutions Θ_{1} or Θ_{2}. If the standard deviations are equal, σ ≔ σ_{lo} = σ_{hi}, then *A* = 0, and equation D.4 has the unique solution

Θ_{opt} = (μ_{lo} + μ_{hi})/2 + (σ^{2}/Δμ) ln((1 − *q*)/*q*).

The following lemma characterizes the weighing of add noise (*v*^{μ}_{j} = 0, but the output is 1) versus miss noise (*v*^{μ}_{j} = 1, but the output is 0) in the retrieval result when choosing the optimal firing threshold: If we assume a given constant output noise (cf. equation 2.30), gaussian potentials with equal standard deviations σ_{lo} = σ_{hi}, and optimal firing threshold Θ = Θ_{opt} as in equation D.9, then, for *q* → 0, the add noise term (1 − *q*)*q*_{01} becomes negligible compared to the miss noise term *q* *q*_{10} (equation D.10); that is, for sparse content patterns, the output errors are dominated by miss noise (see equation 2.31). A formal proof of the lemma can be found in Knoblauch (2009a, appendix A, equation 74). Figure 8 (left) gives an intuition as to why the lemma is true. Here *H*(Θ_{opt}) is the intersection area of the high and low distributions, where the left and right parts of the area correspond to miss noise *q* *q*_{10} and add noise (1 − *q*)*q*_{01}, respectively (see the arrows). Requiring constant *H*(Θ_{opt})/*q* implies that the intersection area *H*(Θ_{opt}) must be a constant fraction of the area below *qg*_{hi}(*x*). Thus, *q* → 0 implies for σ_{lo} = σ_{hi} that the decrease of (1 − *q*)*g*_{lo}(*x*) with *x* becomes very steep compared to the increase of *qg*_{hi}(*x*) and finally approaches the dashed line corresponding to Θ_{opt}.
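
The equal-variance threshold of equation D.9 can be checked numerically. The sketch below (Python, with hypothetical parameter values μ_{lo} = 0, μ_{hi} = 3, σ = 1, *q* = 0.1 that are not from the text) verifies both the balance condition (1 − *q*)*g*_{lo}(Θ) = *q* *g*_{hi}(Θ) and that the threshold minimizes the expected output errors:

```python
import math

def Gc(z):
    """Gaussian tail integral via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def gauss(x, mu, sigma):
    """Gaussian density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Hypothetical example parameters (illustration only):
q, mu_lo, mu_hi, sigma = 0.1, 0.0, 3.0, 1.0
dmu = mu_hi - mu_lo

# Equal-variance optimum (equation D.9, as reconstructed):
theta_opt = (mu_lo + mu_hi) / 2 + sigma ** 2 / dmu * math.log((1 - q) / q)

def H(theta):
    """Expected output errors per neuron: (1-q)*q01 + q*q10."""
    q01 = Gc((theta - mu_lo) / sigma)  # add-noise probability
    q10 = Gc((mu_hi - theta) / sigma)  # miss-noise probability
    return (1 - q) * q01 + q * q10

# theta_opt satisfies the balance condition (1-q)*g_lo = q*g_hi ...
assert abs((1 - q) * gauss(theta_opt, mu_lo, sigma)
           - q * gauss(theta_opt, mu_hi, sigma)) < 1e-9
# ... and minimizes H on a fine grid:
grid = [mu_lo + i * 0.003 for i in range(-500, 1501)]
assert abs(min(grid, key=H) - theta_opt) < 0.003
# For q < 0.5 the threshold shifts toward the high potentials:
assert theta_opt > (mu_lo + mu_hi) / 2
```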

## Appendix E: The Relation Between SNR *R* and Output Noise ϵ̂

We can use two different measures to evaluate retrieval quality: section 2.3 uses the SNR *R* (see equation 2.24), whereas section 2.4 uses the output noise ϵ̂, which is based on the Hamming distance (see equation 2.30). This appendix shows that the two measures are actually equivalent if we assume that (1) all content neurons *j* have the same prior *q* ≔ pr[*v*^{μ}_{j} = 1] and the same distributions for high and low dendritic potentials; (2) all dendritic potentials follow a gaussian distribution; (3) each content neuron optimally adjusts its firing threshold in order to minimize output noise (see appendix D); and (4) the distributions of high and low dendritic potentials have the same standard deviation, σ ≔ σ_{lo} = σ_{hi}. Note that all assumptions are fulfilled at least in the limit *Mpq* → ∞ for reasons discussed in section 2.3.

We first express the output noise as a function of the SNR *R*: Due to assumption 1, we can write the output noise, equation 2.30, in terms of the output error probabilities, equations 2.22 and 2.23,

ϵ̂ = ((1 − *q*)*q*_{01} + *q* *q*_{10})/*q*.

Due to assumption 2, the output error probabilities write

*q*_{01} = *G*^{c}((Θ − μ_{lo})/σ_{lo}), *q*_{10} = *G*^{c}((μ_{hi} − Θ)/σ_{hi}),

where *G*^{c}(*x*) is the tail integral of a gaussian (see equation C.2), and, due to assumption 3, Θ = Θ_{opt} is the optimal firing threshold as explained in appendix D. Due to assumption 4, Θ_{opt} is as in equation D.9, with Θ_{opt} ⩾ (μ_{lo} + μ_{hi})/2 for *q* ⩽ 0.5. The last bound implies that the optimal threshold shifts toward the high potentials for sparse patterns with *q* < 0.5 and centers only for *q* = 0.5. Thus, the error probabilities at the optimal threshold are

*q*_{01} = *G*^{c}(*R*/2 + ln((1 − *q*)/*q*)/*R*), *q*_{10} = *G*^{c}(*R*/2 − ln((1 − *q*)/*q*)/*R*),

and thus the minimal output noise level that can be achieved with SNR *R* equals

ϵ̂(*R*) = ((1 − *q*)/*q*) *G*^{c}(*R*/2 + ln((1 − *q*)/*q*)/*R*) + *G*^{c}(*R*/2 − ln((1 − *q*)/*q*)/*R*),

where *G*^{c} can be evaluated with equation C.2.

It remains to invert equation E.6 in order to obtain the minimal SNR *R* required to achieve output noise ϵ̂. We can do this easily for two special cases. First, for nonsparse content patterns with *q* = 0.5, we have *q*_{01} = *q*_{10} = *G*^{c}(*R*/2) and ϵ̂ = 2*G*^{c}(*R*/2), and thus

*R* = 2*G*^{c−1}(ϵ̂/2),

where *G*^{c−1} is as in equation C.3. Second, for sparse content patterns with *q* → 0, miss noise will dominate the output errors according to equation D.10. Correspondingly, the output noise, equation E.6, is dominated by the second summand. Therefore, *q* → 0 implies

*R* ≈ *G*^{c−1}(ϵ̂) + √((*G*^{c−1}(ϵ̂))^{2} + 2 ln((1 − *q*)/*q*)).
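
These relations can be checked numerically. The Python sketch below assumes the reconstructed forms of equation E.6 and of the sparse inversion, and verifies the exact *q* = 0.5 inversion as well as the dominance of miss noise at a fixed noise target for sparse patterns:

```python
import math

def Gc(z):
    """Gaussian tail integral via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def Gc_inv(eps):
    """Invert the monotonically decreasing tail integral by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if Gc(mid) > eps:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def output_noise(R, q):
    """Output noise at SNR R with optimal threshold (reconstructed equation E.6)."""
    shift = math.log((1 - q) / q) / R
    return (1 - q) / q * Gc(R / 2 + shift) + Gc(R / 2 - shift)

# q = 0.5: the relation inverts exactly to R = 2 * Gc_inv(eps/2).
R = 5.0
assert abs(2 * Gc_inv(output_noise(R, 0.5) / 2) - R) < 1e-6

# Sparse q at a fixed noise target: miss noise dominates add noise.
q, target = 1e-6, 0.01
L = math.log((1 - q) / q)
a = Gc_inv(target)
R2 = a + math.sqrt(a * a + 2 * L)   # reconstructed sparse inversion
miss = Gc(R2 / 2 - L / R2)          # second summand of the output noise
add = (1 - q) / q * Gc(R2 / 2 + L / R2)
assert abs(miss - target) < 1e-6 and add < miss
```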

## Appendix F: Binary Channels

For a binary random variable *X* ∈ {0, 1} with *q* ≔ pr[*X* = 1], the information *I*(*X*) equals (Shannon & Weaver, 1949)

*I*(*q*) ≔ −*q* log_{2} *q* − (1 − *q*) log_{2}(1 − *q*).

It is *I*(*q*) = *I*(1 − *q*) and *I*(*q*) → 0 for *q* → 0. A binary memoryless channel is determined by the two error probabilities *q*_{01} for add noise and *q*_{10} for miss noise. For two binary random variables *X* and *Y*, where *Y* is the result of transmitting *X* over the binary channel, we can write the transinformation *T* in terms of *I*, *q*, *q*_{01}, and *q*_{10}. For the analysis of the storage capacity of associative networks at noise level ϵ (see section 2.4), we are interested in fulfilling the high-fidelity criterion, equation E.1, with a "noise balance" parameter ξ weighing between add noise and miss noise, such that

(1 − *q*)*q*_{01} = ξϵ*q* and *q* *q*_{10} = (1 − ξ)ϵ*q*.

Thus, we can compute the component transinformation for several interesting cases; for details, see Knoblauch (2009a, appendix E). Three approximations are of particular interest. For *q* = 0.5 and ξ = 0.5, we have *T* ≈ 1 − *I*(ϵ/2). For *q* → 0, constant ϵ, and dominating miss noise with ξ → 0, we have *T* ≈ *I*(*q*)(1 − ϵ). For *q* → 0, constant ϵ, and dominating add noise with ξ → 1, we have *T* ≈ *I*(*q*).
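
A small numerical sketch of these quantities (binary information and channel transinformation; the function names are mine) confirms the symmetry of *I*, its sparse limit, and the first approximation for *q* = 0.5, ξ = 0.5:

```python
import math

def I(q):
    """Information of a binary random variable with pr[X = 1] = q, in bits."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1 - q) * math.log2(1 - q)

def T(q, q01, q10):
    """Transinformation I(Y) - H(Y|X) of a binary memoryless channel
    with add-noise probability q01 and miss-noise probability q10."""
    p1 = q * (1 - q10) + (1 - q) * q01               # pr[Y = 1]
    return I(p1) - (q * I(q10) + (1 - q) * I(q01))   # subtract conditional entropy

# Symmetry and sparse limit of the binary information:
assert abs(I(0.3) - I(0.7)) < 1e-12
assert I(1e-9) < 1e-7

# q = 0.5, balanced noise xi = 0.5 at level eps: T matches 1 - I(eps/2).
eps, q, xi = 0.01, 0.5, 0.5
q01 = xi * eps * q / (1 - q)    # from (1 - q) * q01 = xi * eps * q
q10 = (1 - xi) * eps            # from q * q10 = (1 - xi) * eps * q
assert abs(T(q, q01, q10) - (1 - I(eps / 2))) < 1e-12
```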

## Appendix G: Analysis of the SNR for Linear Learning Rules

Assume the query is a noisy version of the *M*th address pattern and, similarly as illustrated by Figure 2 (left), contains *c* correct one-entries and *f* false one-entries. The synaptic weight writes as the linear sum of the learning increments *r*_{uv} due to individual memory associations with presynaptic activity *u* ∈ {0, 1} and postsynaptic activity *v* ∈ {0, 1}, where, without loss of generality, for a high unit (*v*^{M}_{j} = 1), we assume that *v*^{μ}_{j} = 1 for μ = *M*_{0} + 1, …, *M*, and for a low unit (*v*^{M}_{j} = 0), we assume that *v*^{μ}_{j} = 1 for μ = 1, …, *M*_{1}. With *F*(1) = 1 and *F*(0) = *a*, we can then compute the dendritic potential and, from it, the mean dendritic potentials μ_{hi} and μ_{lo} for high and low units. Similarly, we can compute the variances of the dendritic potentials by replacing *a* by *a*^{2} and *E* by Var and leaving out constant terms. Then the mean potential difference is Δμ ≔ μ_{hi} − μ_{lo}. With this, we can compute the SNR *R* ≔ Δμ/max(σ_{hi}, σ_{lo}) (see equation 2.24), optimal firing thresholds (see appendix D), and storage capacity (see section 2.4). It is well known that the optimal linear rule (maximizing *R*) is the so-called covariance rule

*r*_{00} = *pq*, *r*_{01} = −*p*(1 − *q*), *r*_{10} = −(1 − *p*)*q*, *r*_{11} = (1 − *p*)(1 − *q*),

where *p* ≔ pr[*u*^{μ}_{i} = 1] and *q* ≔ pr[*v*^{μ}_{j} = 1] (see Dayan & Willshaw, 1991; Palm & Sommer, 1996). Further rules of interest are, for example, the Hebbian rule *r*_{11} = 1, *r*_{00} = *r*_{01} = *r*_{10} = *a* = 0; the homosynaptic rule *r*_{11} = 1 − *q*, *r*_{10} = −*q*, *r*_{00} = *r*_{01} = *a* = 0; and the heterosynaptic rule *r*_{11} = 1 − *p*, *r*_{01} = −*p*, *r*_{00} = *r*_{10} = *a* = 0.
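
The covariance rule can be written compactly as *r*_{uv} = (*u* − *p*)(*v* − *q*) (cf. Dayan & Willshaw, 1991). The short Python sketch below (with hypothetical priors *p*, *q*) verifies that this compact form reproduces the four increments above and that the expected increment vanishes over both the pre- and the postsynaptic activity, so the stored weights remain zero-mean for any number of memories:

```python
# Hypothetical activity priors for illustration:
p, q = 0.1, 0.05

# Covariance-rule increments in compact form r_uv = (u - p)(v - q):
r = {(u, v): (u - p) * (v - q) for u in (0, 1) for v in (0, 1)}

# Reproduces the four increments listed in the text:
assert abs(r[0, 0] - p * q) < 1e-12
assert abs(r[0, 1] + p * (1 - q)) < 1e-12
assert abs(r[1, 0] + (1 - p) * q) < 1e-12
assert abs(r[1, 1] - (1 - p) * (1 - q)) < 1e-12

# Expected increment vanishes over presynaptic activity (for each v) and
# over postsynaptic activity (for each u), so weights stay zero-mean:
for v in (0, 1):
    assert abs((1 - p) * r[0, v] + p * r[1, v]) < 1e-12
for u in (0, 1):
    assert abs((1 - q) * r[u, 0] + q * r[u, 1]) < 1e-12
```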

## Appendix H: Generalized BCPNN-Type Learning Rules

### H.1. Generalizing the BCPNN Rule for Query Noise.

For each content neuron *j*, the Bayesian retrieval probability can be written in a form where the number of one-entries in the query vector appears explicitly. Thus, taking logarithms yields synaptic weights *w*_{ij} and firing thresholds Θ_{j}, where we have again skipped the indices *i*, *j* for brevity. Transition probabilities can again be estimated as in equations 2.19 and 2.20.

### H.2. The BCPNN2 Rule: Including Inactive Query Components.

### H.3. The BCPNN3 Rule: Eliminating .

### H.4. The SNR of the BCPNN3 Rule.

The first-order approximations of the BCPNN3 synaptic weights equal those of the optimal Bayesian weights for *Mpq* → ∞. However, asymptotically identical first-order terms of single synaptic weights are not a sufficient condition for identical network performance, since *Mpq* → ∞ implies a diverging synapse number. In fact, the following analysis shows that the BCPNN3 rule has a lower SNR than the optimal Bayes rule, which also excludes the optimality of the BCPNN model. We can easily adapt the SNR analysis of section 2.3 to the BCPNN3 rule simply by skipping all terms relating to inactive query components. Equivalently to equations H.11 and H.12, we obtain the biological formulation of the BCPNN3 model. In analogy to equation B.1, we can then write the potential *x*_{j} of content neuron *j*; in analogy to equations B.4 and B.5, the first-order approximations of the mean low and high potentials; in analogy to equation B.6, the mean difference Δμ ≔ μ_{hi} − μ_{lo} between the high and low distributions; and, in analogy to equation B.8, the variances of the dendritic potentials. Thus, asymptotically, assuming large networks and consistent error estimation such that *k* = *pm*, we obtain, in analogy to equations 2.25 and 2.26, the asymptotic means and variances. Therefore, similar to equation 2.28, for large *M*_{1} ≈ *Mq* and including network connectivity *P*, we can compute the SNR *R* = Δμ/σ. Thus, asymptotically for *Mpq* → ∞, the squared SNR of the BCPNN3 rule is worse than that of the optimal Bayesian model by a multiplicative factor.

## Acknowledgments

I am grateful to Julian Eggert, Marc-Oliver Gewaltig, Helmut Glünder, Edgar Körner, Ursula Körner, Anders Lansner, Günther Palm, Friedrich Sommer, and the two anonymous reviewers for helpful discussions and comments.

## Notes

^{1}

Evaluating equation 2.14 during retrieval requires about 5*m* multiplications and 2*m* additions even for sparse query activity. By contrast, evaluating equation 2.16 requires only a few multiplications and *m* additions, as the "bias" (the first and second summands) of *x*_{j} is independent of the query and therefore can be computed in advance.

^{2}

For this, the offset *w*_{0} should not depend on *i*.

^{3}

Note that *p*_{0} → 0.5 corresponds to . The same argumentation for independently generated address pattern components with binomially distributed *k*^{μ} ∼ *B*_{m,p} would even suggest optimality until *p*_{0} → 1/*e* ≈ 0.37 and where the Willshaw model achieves the maximal capacity (see Knoblauch et al., 2010, eq. D.12).

^{4}

Without loss of generality, *p* ≔ pr[*u*^{μ}_{i} = 1] ⩽ 0.5 (otherwise, invert the address pattern components).

^{5}

For example, if address "feature" *u*_{i} = 1 is positively correlated with content *v*_{j} = 1, then it typically occurs that *p*_{10|1}(*ij*) < *p*_{10|0}(*ij*) and *p*_{01|1}(*ij*) > *p*_{01|0}(*ij*), such that the optimal coincidence increment, *r*_{11}(*ij*), is smaller than expected from the covariance rule, η_{11}/η_{00} < 1, whereas the offset is positive, *w*_{0}(*ij*) > 0. The deviation from the covariance rule can be significant; for example, *p* = *q* = 0.1 (corresponding to *p*_{10} = 0.25, *p*_{01} = 0.025), *p*_{10|1} = 0.1 *p*_{10}, and *p*_{01|1} = 10 *p*_{01} yields η_{11}/η_{00} ≈ 0.3 and *w*_{0} ≈ 1.8.

^{6}

For example, σ_{hi} = 0 for pattern part retrieval in the Willshaw model (see section 3.1).

^{7}

Although l-WTA retrieval is simple to implement, it is much more difficult to analyze.
