Abstract

Neural associative memories are perceptron-like single-layer networks with fast synaptic learning typically storing discrete associations between pairs of neural activity patterns. Previous work optimized the memory capacity for various models of synaptic learning: linear Hopfield-type rules, the Willshaw model employing binary synapses, or the BCPNN rule of Lansner and Ekeberg, for example. Here I show that all of these previous models are limit cases of a general optimal model where synaptic learning is determined by probabilistic Bayesian considerations. Asymptotically, for large networks and very sparse neuron activity, the Bayesian model becomes identical to an inhibitory implementation of the Willshaw and BCPNN-type models. For less sparse patterns, the Bayesian model becomes identical to Hopfield-type networks employing the covariance rule. For intermediate sparseness or finite networks, the optimal Bayesian learning rule differs from the previous models and can significantly improve memory performance. I also provide a unified analytical framework to determine memory capacity at a given output noise level that links approaches based on mutual information, Hamming distance, and signal-to-noise ratio.

1.  Introduction

An associative memory is an alternative computing architecture in which, unlike the classical von Neumann machine, computation and data storage are not separated. For example, as illustrated by Figure 1, an associative memory can store a set of associations between pairs of pattern vectors {(uμ, vμ) : μ = 1, …, M}. Similar to random access memory, a query pattern uμ entered into an associative memory can serve as an address for accessing the associated content pattern vμ. However, unlike random access memory, an associative memory accepts arbitrary query patterns, and the computation of any particular output involves all stored data records rather than a single one. Specifically, the associative memory task consists of comparing a query with all stored addresses and returning an output pattern equal (or similar) to the pattern vμ associated with the address uμ most similar to the query. Thus, the associative memory task includes the random access task but is not restricted to it. It also includes computations such as pattern completion, denoising, or data retrieval using incomplete cues. Moreover, neural implementations of associative memory are closely related to Hebbian cell assemblies and play an important role in neuroscience as models of neural computation for various brain structures, for example, neocortex, hippocampus, cerebellum, mushroom body (Hebb, 1949; Braitenberg, 1978; Palm, 1991; Fransen & Lansner, 1998; Pulvermüller, 2003; Johansson & Lansner, 2007; Lansner, 2009; Gardner-Medwin, 1976; Rolls, 1996; Bogacz, Brown, & Giraud-Carrier, 2001; Marr, 1969, 1971; Albus, 1971; Kanerva, 1988; Laurent, 2002).

Figure 1:

Pattern storage and retrieval in an associative memory. The task is to store M associations between address patterns uμ and content patterns vμ. Address patterns uμ are binary vectors of size m with an average number of active units. Similarly, the content patterns vμ have size n and mean activity . During retrieval, the memories are addressed by a query pattern being a noisy version of one of the address patterns with component transition probabilities and corresponding to miss noise and add noise, respectively. Thus, the query contains, on average, a fraction of correctly active units and another fraction of falsely active units. The total fraction of wrong query components is called query noise . Similarly, the output noise is the fraction of wrong components in the retrieval output pattern where and are the transition probabilities of the corresponding memory channel.

In their simplest form, neural associative memories are single-layer perceptrons with fast, typically one-shot, synaptic learning realizing the storage of M discrete associations between binary address and content patterns uμ and vμ. The one-shot constraint favors local learning rules where a synaptic weight wij depends only on uμi and vμj. Alternative nonlocal learning methods are typically time-consuming and require gradient descent (such as error backpropagation) based on global error signals obtained from repeated training of the entire pattern set. Instead, associative memories use simple Hebbian-type learning rules where synaptic weights increase if both the presynaptic and postsynaptic neurons are active during presentation of a pattern pair.

The performance of neural associative memory models can be evaluated by storage capacity, which can be defined, for example, by the number of memories M a network of a given size can store or by the Shannon information C that a synapse can store. More recent work considers also structural compression of synaptic networks and the energy or time requirements per retrieval (Poirazi & Mel, 2001; Stepanyants, Hof, & Chklovskii, 2002; Lennie, 2003; Knoblauch, 2003, 2005, 2009b; Knoblauch, Palm, & Sommer, 2010).

The simplest one-shot learning model is the so-called Steinbuch or Willshaw model with binary synapses and clipped Hebbian learning (Willshaw, Buneman, & Longuet-Higgins, 1969; Steinbuch, 1961; Palm, 1980, 1991; Golomb, Rubin, & Sompolinsky, 1990; Nadal, 1991; Sommer & Dayan, 1998; Sommer & Palm, 1999; Knoblauch et al., 2010). Here a single coincidence of presynaptic and postsynaptic activity is sufficient to increase the synaptic weight from 0 to 1, while further coincidences do not cause further changes.

An alternative model is the linear associative memory, where contributions of different pattern pairs add linearly (Kohonen, 1972; Kohonen & Oja, 1976; Anderson, Silverstein, Ritz, & Jones, 1977; Hopfield, 1982; Palm, 1988a, 1988b; Tsodyks & Feigel'man, 1988; Willshaw & Dayan, 1990; Dayan & Willshaw, 1991; Palm & Sommer, 1992, 1996; Chechik, Meilijson, & Ruppin, 2001; Sterratt & Willshaw, 2008). For example, for binary memory patterns uμi, vμj ∈ {0, 1} the general linear learning rule can be described by four values specifying the weight increments for the possible combinations of presynaptic and postsynaptic activity.
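For concreteness, the following sketch accumulates such a four-increment local rule over a stored pattern set. It is a minimal illustration under stated assumptions: the array names U, V and the example increment values are assumptions for illustration, not definitions from the text.

```python
import numpy as np

def linear_weights(U, V, r11, r10, r01, r00, w0=0.0):
    """Local linear learning: every stored pair adds the increment r_uv to w_ij,
    depending only on the pre- and postsynaptic activities u = U[mu, i], v = V[mu, j]."""
    U = U.astype(float)                 # M x m binary address patterns
    V = V.astype(float)                 # M x n binary content patterns
    M11 = U.T @ V                       # number of (1,1) coincidences per synapse
    M10 = U.T @ (1 - V)
    M01 = (1 - U).T @ V
    M00 = (1 - U).T @ (1 - V)
    return w0 + r11 * M11 + r10 * M10 + r01 * M01 + r00 * M00

# Example: Hebb rule (only coincidences count) versus Hopfield-type +/-1 increments
rng = np.random.default_rng(0)
U = (rng.random((500, 100)) < 0.1).astype(int)
V = (rng.random((500, 80)) < 0.1).astype(int)
W_hebb = linear_weights(U, V, 1, 0, 0, 0)
W_hopfield = linear_weights(U, V, 1, -1, -1, 1)
```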

Surprisingly, the maximal storage capacity C in bits per synapse is almost identical for the two models: the Willshaw model can achieve up to 0.69 bits per synapse (bps), whereas the linear models achieve only a slightly higher capacity of 0.72 bps in spite of employing real-valued synaptic weights. However, closer investigation reveals that the Willshaw model can achieve nonzero capacity only for extremely sparse activity, where the number of active units per pattern vector scales logarithmically with the vector size. In contrast, the linear model achieves the maximum C = 0.72 bps for a much larger range of moderately sparse patterns. Only for a nonvanishing fraction of active units per pattern vector does the performance drop from 0.72 bps to the capacity of the original (nonsparse) Hopfield network (e.g., C = 0.14 bps in Hopfield, 1982; Hertz, Krogh, & Palmer, 1991; Palm & Sommer, 1996, or, as we will see below, C = 0.33 bps for the hetero-associative feedforward networks considered here). The linear learning model achieves maximal storage capacity only for the optimal covariance learning rule (e.g., Sejnowski, 1977a, 1977b; Dayan & Willshaw, 1991; Dayan & Sejnowski, 1993; Palm & Sommer, 1996), which becomes equal to the Hebb rule for very sparse patterns and equal to the Hopfield rule for nonsparse patterns. Moreover, simulation experiments show that the capacity of the optimal linear model remains well below the capacity of the Willshaw model for any reasonable finite network size (e.g., C = 0.2 bps versus C = 0.5 bps for n = 10^5 neurons; see Knoblauch, 2009a; Palm & Sommer, 1992). This suggests that the linear covariance rule is not always optimal, in particular not for finite networks and sparse memory representations as found in the brain (Waydo, Kraskov, Quiroga, Fried, & Koch, 2006).

A third model class is based on the Bayesian confidence propagation neural network (BCPNN) rule (Lansner & Ekeberg, 1987, 1989; Kononenko, 1989, 1994; Lansner & Holst, 1996; Sandberg, Lansner, Petersson, & Ekeberg, 2000; Lansner, 2009). This model employs Bayesian maximum-likelihood heuristics for synaptic learning and retrieval (see also a related approach based on maximizing the entropy of synaptic weights: MacKay, 1991). Therefore, it has been suspected that the BCPNN model could achieve optimal performance, or at least exceed the performance of Willshaw and linear models. These conjectures have been supported by some numerical investigations; however, theoretical analyses of the BCPNN model have been lacking so far. As we will see, the BCPNN model becomes optimal only for a limited range of very sparse memory patterns.

This article (see also Knoblauch, 2009a, 2010a) develops the generally optimal associative memory that minimizes output noise and maximizes storage capacity by activating neurons based on Bayesian maximum likelihood decisions. The corresponding neural interpretation of this Bayesian associative memory corresponds in general to a novel nonlinear learning rule resembling the BCPNN rule. Specifically, a theoretical analysis including query noise shows that the previous learning models are only special limit cases of the generally optimal Bayesian model. Asymptotically, for large networks and extremely sparse memory patterns, the Bayesian model becomes essentially identical to the binary Willshaw model (but implemented with inhibitory rather than excitatory synapses; see Knoblauch, 2007). Similarly, the BCPNN model is optimal for a less restricted range of sparse memory patterns where the fraction of active units per memory vector still vanishes. For less sparse and nonsparse patterns, the Bayesian model becomes identical to the linear model employing the covariance rule. For a large range of intermediate sparseness and finite networks, the Bayesian learning rule is shown to perform significantly better than previous models. As a by-product, this work also provides a unified analytical framework to determine memory capacities at a given output noise level that links approaches based on mutual information, Hamming distance, and signal-to-noise ratio.

The organization of the paper is as follows. Section 2 describes the model of neural associative memory with optimal Bayesian learning and analyzes signal-to-noise ratio and storage capacity. Section 3 compares the Bayesian associative memory to previous models in the literature, including inhibitory implementations of the Willshaw network, linear learning models with the covariance rule, and BCPNN-type models, and determines asymptotic conditions when the respective models become equivalent to optimal Bayesian learning. Section 4 presents results from numerical simulation experiments verifying the theoretical results concerning signal-to-noise-ratio, output noise, and storage capacity. Further experiments compare the performance of various learning models for finite network sizes. Section 5 summarizes and discusses the main results of this work. The appendixes include a description for appropriate implementations of Bayesian associative memory (appendix A), an analysis for computing optimal firing thresholds (appendix D), an analysis of the relationship between signal-to-noise ratio and Hamming-distance-based measures for output noise and storage capacity (appendix E), and signal-to-noise ratio analyses for the linear and BCPNN-type models (appendixes G, H).

2.  Model of Bayesian Associative Memory

2.1.  Memory Storage in Neural and Synaptic Countervariables.

The task is to store M associations between address patterns uμ and content patterns vμ where μ = 1, …, M. Here uμ and vμ are binary vectors of size m and n, respectively. Memory associations are stored in first-order (neural) and second-order (synaptic) countervariables. In particular, each address neuron i and each content neuron j can memorize its unit usage:
M_1(i) := \sum_{\mu=1}^{M} u_i^\mu    (2.1)
M_0(i) := M - M_1(i)    (2.2)
M_1(j) := \sum_{\mu=1}^{M} v_j^\mu    (2.3)
M_0(j) := M - M_1(j)    (2.4)
Similarly, each synapse ij can memorize its synapse usage:
M_{11}(ij) := \sum_{\mu=1}^{M} u_i^\mu v_j^\mu    (2.5)
M_{10}(ij) := \sum_{\mu=1}^{M} u_i^\mu (1 - v_j^\mu) = M_1(i) - M_{11}(ij)    (2.6)
M_{01}(ij) := \sum_{\mu=1}^{M} (1 - u_i^\mu) v_j^\mu = M_1(j) - M_{11}(ij)    (2.7)
M_{00}(ij) := \sum_{\mu=1}^{M} (1 - u_i^\mu)(1 - v_j^\mu) = M - M_1(i) - M_1(j) + M_{11}(ij)    (2.8)
where i = 1, …, m and j = 1, …, n. Note that it is sufficient to memorize M, M1(i), M1(j), and M11(ij), since the remaining counters follow from them. Thus, an implementation on a digital computer requires about (mn + m + n + 1) ld M memory bits. The following analyses consider optimal Bayesian retrieval, assuming that each output unit j = 1, …, n has access to the variables in the set
formula
2.9
The following analyses will show that the mean values of the coincidence counters and unit usages, Mpq, Mp, and Mq, have a major role in determining the regime of operation for Bayesian associative memory (see Table 2).
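The counters above are cheap to compute from the stored pattern set. The following sketch (with illustrative variable names) collects the unit and synapse usages and derives the remaining counters from M, M1(i), M1(j), and M11(ij):

```python
import numpy as np

def counters(U, V):
    """Unit and synapse usages of section 2.1 for binary pattern matrices
    U (M x m addresses) and V (M x n contents); a minimal sketch."""
    M = U.shape[0]
    M1_i = U.sum(axis=0)                     # unit usage of address neurons
    M1_j = V.sum(axis=0)                     # unit usage of content neurons
    M11 = U.T.astype(int) @ V.astype(int)    # coincidence counter per synapse
    # The remaining counters follow from M, M1_i, M1_j, and M11:
    M10 = M1_i[:, None] - M11
    M01 = M1_j[None, :] - M11
    M00 = M - M1_i[:, None] - M1_j[None, :] + M11
    return M, M1_i, M1_j, M11, M10, M01, M00
```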

2.2.  Neural Formulation of Optimal Bayesian Retrieval.

Given a query pattern ũ and the countervariables of section 2.1, the memory task is to find the most similar address pattern uμ and return a reconstruction of the associated content vμ. In general, the query ũ is a noisy version of uμ, assuming component transition probabilities given the activity of a content neuron:
p_{01|v} := pr[\tilde{u}_i = 1 \mid u_i^\mu = 0, v_j^\mu = v]    (2.10)
p_{10|v} := pr[\tilde{u}_i = 0 \mid u_i^\mu = 1, v_j^\mu = v]    (2.11)
for v ∈ {0, 1}.
Now the content neurons j have to decide independently of each other whether to be activated or remain silent. Given the query , the optimal maximum likelihood decision is based on the odds ratio ,
formula
2.12
which minimizes the expected Hamming distance between original and reconstructed content. If the query pattern components are conditionally independent given the activity of content neuron j (e.g., assuming independently generated address and query components), we have for
formula
2.13
With the Bayes formula , the odds ratio is
formula
2.14
For a more plausible neural formulation, we can take logarithms of the probabilities and obtain dendritic potentials . With being the ith factor in the product of equation 2.14, it is
formula
Thus, synaptic weights wij, dendritic potentials xj, and retrieval output are finally
formula
2.15
formula
2.16
formula
2.17
such that the posterior probability pr[vj = 1 | ũ] writes as a sigmoid function of xj, and a content neuron fires iff the dendritic potential is nonnegative. Note that indices of M0(j), M1(j), M00(ij), M01(ij), M10(ij), and M11(ij) are skipped for readability. Also note that optimal Bayesian learning is nonlinear and, for autoassociation with uμ = vμ and nonzero query noise, asymmetric with wij ≠ wji. Note further that synaptic weights and dendritic potentials may be infinite, such that accurate implementations require two values per variable for finite and infinite components, respectively (see appendix A).
Nevertheless, evaluating equation 2.16 is much cheaper than equation 2.14, in particular for sparse queries having only a small number of active components with ũi = 1. However, the synaptic weights of equation 2.15 may not yet satisfy Dale's law that a neuron is either excitatory or inhibitory. To be more consistent with biology, we may add a sufficiently large constant w0 ≔ −minijwij to each weight. Then all synapses have nonnegative weights wij + w0 and the dendritic potentials remain unchanged if we replace the last sum in equation 2.16 by
formula
2.18
Here the negative sum could be realized, for example, by feedforward inhibition with a strength proportional to the query pattern activity, as suggested by Knoblauch and Palm (2001) and Knoblauch (2005), for example.
The transition probabilities, equations 2.10 and 2.11, can be estimated by maintaining countervariables similar to those in section 2.1. For example, if the μth memory vμ has been queried by a number of address queries, then we could estimate
formula
2.19
which requires four countervariables per synapse in addition to M11. To reduce storage costs, one may assume
formula
2.20
independent of j, as do most of the following analyses and experiments for the sake of simplicity, although this assumption may reduce the number of discovered rules (corresponding to infinite wij) describing deterministic relationships between ui and vj.
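As an illustration of this section, the following sketch performs one retrieval from the counters. Since the displayed equations 2.15 and 2.16 are not reproduced in this extract, the decomposition below into a prior term plus per-synapse contributions of active and inactive query components is an assumption, chosen to be consistent with the quantities d1 to d4 defined in section 3.2; function and parameter names are illustrative.

```python
import numpy as np

def bayesian_retrieve(u_query, cnt, p01_0, p01_1, p10_0, p10_1):
    """Sketch of Bayesian retrieval for one query; cnt is the counter tuple
    from the sketch in section 2.1."""
    M, M1_i, M1_j, M11, M10, M01, M00 = cnt
    d1 = M11 * (1 - p10_1) + M01 * p01_1     # ~ M1(j) * pr[query_i = 1 | v_j = 1]
    d3 = M10 * (1 - p10_0) + M00 * p01_0     # ~ M0(j) * pr[query_i = 1 | v_j = 0]
    d4 = M01 * (1 - p01_1) + M11 * p10_1     # ~ M1(j) * pr[query_i = 0 | v_j = 1]
    d2 = M00 * (1 - p01_0) + M10 * p10_0     # ~ M0(j) * pr[query_i = 0 | v_j = 0]
    M0_j = M - M1_j
    with np.errstate(divide='ignore', invalid='ignore'):
        log_on  = np.log(d1 * M0_j) - np.log(d3 * M1_j)   # active query inputs
        log_off = np.log(d4 * M0_j) - np.log(d2 * M1_j)   # inactive query inputs
        prior   = np.log(M1_j / M0_j)
    on = u_query.astype(bool)
    x = prior + log_on[on].sum(axis=0) + log_off[~on].sum(axis=0)  # dendritic potentials
    return (x >= 0).astype(int)    # fire iff the posterior odds favor v_j = 1
```

Note that synapses with M11 = 0 contribute infinitely negative weights, consistent with the remark after equation 2.17 that accurate implementations must handle infinite components explicitly.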

2.3.  Analysis of the Signal-to-Noise Ratio.

We would like to build a memory system with high retrieval quality, for example, where the expected Hamming distance,
formula
2.21
is small. Here, dH is as defined below equation 2.12, and q(j) ≔ pr[vμj = 1] is the prior probability of an active content unit. Thus, retrieval quality is determined by the component output error probabilities,
formula
2.22
formula
2.23
where the Θj are firing thresholds (e.g., Θj = 0 for dendritic potentials xj as in equation 2.16). Intuitively, retrieval quality will be high if the high-potential distribution pr[xj|vμj = 1] and the low-potential distribution pr[xj|vμj = 0] are well separated, that is, if the signal-to-noise ratio (SNR),
formula
2.24
is large for each content neuron j (Amari, 1977; Palm, 1988a, 1988b; Dayan & Willshaw, 1991; Palm & Sommer, 1996). Here μlo ≔ E(xj|vμj = 0) and σ2lo ≔ Var(xj|vμj = 0) are the expectation and variance of the low-potential distribution, and μhi ≔ E(xj|vμj = 1) and σ2hi ≔ Var(xj|vμj = 1) are the expectation and variance of the high-potential distribution. Appendix E shows that under some conditions, the SNR and the Hamming distance are equivalent measures of retrieval quality.
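In simulations, the SNR of a content neuron can be estimated directly from samples of its dendritic potential, as in the following sketch. Normalizing by the low-potential standard deviation is an assumption made here for concreteness; asymptotically the two variances coincide.

```python
import numpy as np

def empirical_snr(x, v_true):
    """Empirical SNR of one content neuron (equation 2.24): separation of the
    high- and low-potential distributions over many retrievals."""
    hi = x[v_true == 1]     # potentials when the neuron should be active
    lo = x[v_true == 0]     # potentials when it should stay silent
    return (hi.mean() - lo.mean()) / lo.std()
```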
Appendix B computes the SNR R ≔ R(j) for a particular content neuron j with q ≔ M1(j)/M using the following simplifications:
  1. The activation of an address unit i does not depend on other units, and all address units i have the same prior probability pp(i) ≔ pr[uμi = 1] of being active. Thus, on average, an address pattern has active units.

  2. Query noise for an address unit i does not depend on other units, and all query components i have the same noise transition probabilities and . Thus, on average, a query will have correct and false one-entries, where and define fractions of average miss noise and add noise, respectively, normalized to the mean address pattern activity .

  3. Retrieval involves a particular query pattern being a noisy version of an address pattern uμ that has exactly k one-entries, where the query has c out of k correct one-entries and, additionally, f false one-entries. Without loss of generality, we can assume a setting as illustrated by Figure 2 (left), that is, the address pattern has one-entries uμi = 1 at components i = 1, 2, …, k and zero-entries uμi = 0 at i = k + 1, k + 2, …, m whereas the query has false entries at i = c + 1, c + 2, …, k + f.

  4. The average values of the synaptic coincidence counters diverge: Mpq → ∞. Note that this assumption also implies diverging unit usages, Mp → ∞ and Mq → ∞. For reasons that will become apparent in section 3, the condition Mpq → ∞ is also referred to as the linear learning regime, whereas finite Mpq will be called the nonlinear learning regime.

Figure 2:

(Left) For the analysis of SNR, we assume that the query pattern corresponds to one of the original address patterns uμ that has k one-entries (and m − k zero-entries). Due to query noise, the query has only c correct one-entries overlapping with uμ and, in addition, f false one-entries. Without loss of generality, the analysis assumes the setting as illustrated. (Right) Contour plot of the relative SNR ρ2 for sparse address activity with p → 0 (see equation 2.29) as a function of miss noise and add noise contained in the query used for memory retrieval.

From the results of appendix B, we obtain the SNR, equation 2.24, in the asymptotic limit of large Mpq where all variables will be close to their expectations due to the law of large numbers. In particular, we can assume k ≈ mp and, for consistent error estimates, c and f close to their expectations. Then we obtain from equation B.6 the mean difference Δμ ≔ μhi − μlo between high potentials and low potentials:
formula
2.25
Similarly, we obtain from equation B.8 for the potential variance:
formula
2.26
In order to include randomly diluted networks with connectivity P ∈ (0; 1] where a content neuron vj receives synapses from only a fraction P of the m address neurons, we can simply replace m by Pm. With M1 ≈ Mq and M0 ≈ M(1 − q), the asymptotic SNR R = Δμ/σ is
formula
2.27
formula
2.28
with
formula
2.29
Thus, for zero query noise, , , the SNR for optimal Bayesian learning is identical to the asymptotic SNR of linear learning with the optimal covariance rule (e.g., see ρCovariance3 in Dayan & Willshaw, 1991, p. 259, or equation 3.28 in Palm & Sommer, 1996, p. 95; see also section 3.2). Nonzero query noise according to or decreases the SNR R by a factor ρ < 1. Note that ρ characterizes the basin of attraction, defined as the set of queries that get mapped to a stored memory vμ. For example, we can evaluate which combinations of and achieve a fixed desired ρ (and thus R). It turns out that for sparse address patterns, p < 0.5, miss noise impairs network performance more severely than add noise (see Figure 2, right). As a consequence, the basins of attraction for neural associative memories employing sparse address patterns are not necessarily spheres, but they can be heavily distorted, enlarging toward queries with add noise and shrinking toward queries with miss noise. This implies that the similarity metrics employed by associative networks can strongly deviate from commonly used Hamming or Euclidean metrics. Instead, associative networks appear to follow an information-theoretic metric based on mutual information or transinformation (Cover & Thomas, 1991). This is true at least for random address patterns uμ storing a sufficiently large number of memories such that the synapse usages, in particular M11, are almost never zero. Numerical simulations discussed in section 4 reveal that basins of attraction can behave quite differently if these assumptions are not fulfilled.

2.4.  Analysis of Storage Capacity.

Let us determine the maximal number of memories that can be stored in an associative network or, equivalently, the maximal amount of information that a synapse can store. To this end, we define storage capacity at a given level of output noise,
formula
2.30
being the expected Hamming distance, equation 2.21, normalized to the mean content pattern activity nq. As for query noise, we can write output noise as a sum of miss noise and add noise. For ergodic q ≔ q(j), q01 ≔ q01(j), q10 ≔ q10(j) (or considering only a single output unit j), we have miss noise q10 and add noise (1 − q)q01/q. The weighting between miss noise and add noise can be expressed by the output noise balance,
formula
2.31
For any given distribution of dendritic potentials, there exists a unique optimal firing threshold (see appendix D) and, hence, a corresponding optimal noise balance (see equation E.10) that minimize the output noise . This minimal output noise is an increasing function of the number M of stored memories (see equation E.6). Therefore, we can define the pattern capacity
formula
2.32
as the maximal number of memory patterns that can be stored such that the output noise does not exceed a given value ϵ. Assuming that the dendritic potentials follow approximately a gaussian distribution (which is not always true; e.g., see Henkel & Opper, 1990; Knoblauch, 2008), we can apply the results of appendix E and obtain Mϵ from the SNR, equation 2.24, by solving the equation R = Rϵ for M. Here R is approximately equal to equation 2.28, and Rϵ is the minimal SNR required for output noise level ϵ and can be computed from solving equation E.6 for R (or, more conveniently, by iterating equations E.9 and E.10). Thus,
formula
2.33
where the approximation becomes exact for large networks in the limit Mpq → ∞.
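The link between SNR and output noise used here can be made explicit under the gaussian assumption: for a given firing threshold, miss and add error probabilities follow from the two potential distributions, and the optimal threshold minimizes their weighted sum. The following sketch is a numerical stand-in for the analytical treatment of appendixes D and E; function and parameter names are illustrative.

```python
import numpy as np
from scipy.stats import norm

def min_output_noise(mu_lo, mu_hi, sigma, q, n_grid=2001):
    """For a grid of firing thresholds, compute miss noise q10, add noise q01,
    and the resulting output noise, and return its minimum (a simplification:
    gaussian potentials with equal variance, grid search over thresholds)."""
    thetas = np.linspace(mu_lo - 5 * sigma, mu_hi + 5 * sigma, n_grid)
    q10 = norm.cdf((thetas - mu_hi) / sigma)     # pr[x < theta | high]
    q01 = norm.sf((thetas - mu_lo) / sigma)      # pr[x >= theta | low]
    eps = q10 + (1 - q) / q * q01                # expected Hamming distance / (n q)
    k = int(np.argmin(eps))
    return eps[k], thetas[k]

# Example: output noise drops quickly with the SNR R = (mu_hi - mu_lo) / sigma
for R in (2.0, 3.0, 4.0):
    print(R, min_output_noise(0.0, R, 1.0, q=0.01))
```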
An alternative capacity measure normalizes the stored Shannon information (of the content memories) to the number Pmn of synapses employed in a given network. This is the network capacity
formula
2.34
where T is the transinformation equation F.4 with error probabilities q01, q10 as in equations E.4 and E.5 using . We can refine these results for two important cases using the results of appendixes E and F.
First, for nonsparse content patterns with q = 0.5, it is
formula
2.35
formula
2.36

As can be seen in Figures 3a and 3b, the upper bound of Cϵ is achieved for zero query noise () and low fidelity with ϵ → 1, while Cϵ → 0 for high fidelity with ϵ → 0.

Figure 3:

Generalized asymptotic network capacity Cϵξ for zero query noise (, , ρ = 1) displayed as a function of output noise parameter ϵ, noise balance parameter ξ, and content pattern activity q (see text below equation 2.38). Here Cϵ of equation 2.34 is a special case for optimal noise balance as in equation E.10 minimizing output noise (see equation 2.30). (a) Cϵ as function of output noise ϵ for nonsparse memories, q = 0.5, and optimal noise balance . (b) Contour plot of general Cϵξ as function of parameters ϵ and ξ for nonsparse q = 0.5. (c) Contour plot of Cϵ as function of q and ϵ for optimal . (d) Optimal corresponding to panel c. For q → 0 miss noise is dominating with (cf. equation D.10). (e) Contour plot of Cϵξ similar to panel b, but for sparse content memories with q = 0.1. (f) Similar to panel e, but for extremely sparse q = 0.0000001. Note that q → 0 implies that the maximum of Cϵξ occurs for low fidelity ϵ>1 and dominating add-noise with ξ>0.5. Thus, although minimizing the Hamming-distance-based output noise, the “optimal” firing threshold Θopt of appendix E does not necessarily maximize Cϵξ unless ϵ → 0.

Second, for sparse content patterns with q → 0 and any fixed ϵ, it is
formula
2.37
formula
2.38
where the upper bound of Cϵ can be reached for zero query noise and high fidelity with ϵ → 0. Not surprisingly, this upper bound equals the one found for the linear covariance rule (Palm & Sommer, 1996) as well as the general capacity bound for neural networks (Gardner, 1988). Numerical evaluations (see Figures 3c to 3f) show that a network capacity close to Cϵ ≈ 0.72 requires extremely sparse content memories and very large networks. In fact, finite networks of practical size can reach less than half of the asymptotic value (see Figure 3f). Note that Mϵ and Cϵ are defined only for ϵ < 1 assuming optimal firing thresholds to minimize output noise corresponding to an optimal noise balance as in equation E.10, where output errors are dominated by miss noise (see equation D.10). For generalized definitions of pattern capacity Mϵξ and network capacity Cϵξ at a given output noise balance ξ, we can replace the minimal SNR by the value given by equation E.9. Here finite networks achieve maximal capacity at low fidelity ϵ ≫ 1 and ξ → 1 where output errors are dominated by add noise.

For self-consistency, the analyses so far are valid only for diverging Mpq. Thus, the results are not reliable for extremely sparse memory patterns, for example, mp = O(log n), where at least the binomially distributed synaptic countervariables M11 ∼ BM,pq are small and cannot be approximated by gaussians (where BN,P is defined below equation B.1). In particular for queries without any add noise, small M11 implies very large or even infinite synaptic weights (see equation 2.15) that would also violate the gaussian assumption for the distribution of dendritic potentials. As will be shown below, the Bayesian associative memory becomes equivalent to the Willshaw model with a decreased maximal network capacity Cϵ ⩽ ln 2 ≈ 0.69 (or, rather, Cϵ ⩽ 1/(e ln 2) ≈ 0.53 for independently generated address pattern components uμi with binomially distributed pattern activities, kμ ≔ ∑mi=1uμi ∼ B(m, p), as assumed here; see Willshaw et al., 1969; Knoblauch et al., 2010, appendix D). The following section investigates more closely the relationships to the Willshaw net, linear Hopfield–type learning rules, and the BCPNN model.

3.  Relationships to Previous Models

3.1.  Willshaw Model and Inhibitory Networks.

The Willshaw or Steinbuch model is one of the simplest models for distributed associative memory employing synapses with binary weights:
w_{ij} = \min(1, M_{11}(ij)) \in \{0, 1\}    (3.1)
The dendritic potentials of the content neurons are then simply the sums of the synaptic weights over the active query components. Exact potential distributions are well known and can be used to compute optimal firing thresholds (Palm, 1980; Buckingham & Willshaw, 1992, 1993; Knoblauch, 2008; Knoblauch et al., 2010).

The Willshaw model works particularly well for “pattern part retrieval” with zero add noise. Then the active units of a query are a subset of an address pattern uμ and the optimal threshold is maximal, that is, equal to the query pattern activity. Thus, a single active query input onto a synapse with wij = 0 excludes activation of content neuron j. Based on this observation, it has been suggested that the Willshaw model should be interpreted as an essentially inhibitory network where zero weights become negative, positive weights become zero, and the optimal firing threshold becomes zero (Knoblauch, 2007). Such inhibitory implementations of the Willshaw network are very simple and efficient for a wide parameter range of moderately sparse memory patterns with p ≫ log(n)/n where a small number of inhibitory synapses can store a large amount of information, CS ∼ log n bps, even for diluted networks with low connectivity P < 1. Moreover, the inhibitory interpretation offers novel functional hypotheses for strongly inhibitory circuits in the brain, for example, involving basket or chandelier cells (Markram et al., 2004). By contrast, the common excitatory interpretation is efficient only for very sparse memory patterns and cannot implement optimal threshold control in a simple and biologically plausible way (Buckingham & Willshaw, 1993; Graham & Willshaw, 1995).
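A minimal sketch of Willshaw storage and of the two equivalent retrieval readings discussed above (excitatory with maximal threshold versus inhibitory with zero threshold); variable names are illustrative:

```python
import numpy as np

def willshaw_store(U, V):
    """Clipped Hebbian learning (equation 3.1): a synapse switches from 0 to 1
    at the first pre/post coincidence and never changes again."""
    return (U.T.astype(int) @ V.astype(int) > 0).astype(int)

def willshaw_retrieve(W, u_query):
    """Excitatory reading for pattern part retrieval: fire iff all active
    query inputs hit 1-synapses (threshold = query activity)."""
    return (u_query @ W >= u_query.sum()).astype(int)

def willshaw_retrieve_inhibitory(W, u_query):
    """Equivalent inhibitory reading: 0-weights become negative, 1-weights
    become zero, and the firing threshold becomes zero."""
    return (u_query @ (W - 1) >= 0).astype(int)
```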

The following arguments show that the inhibitory Willshaw network is actually a limit case of the optimal Bayesian associative memory in the nonlinear learning limit when synaptic coincidence counters are small (finite Mpq) but unit usages are still large, Mp → ∞ and Mq → ∞. For pattern part retrieval with queries containing miss noise only, the optimal Bayesian synaptic weights wij of equation 2.15 become
w_{ij} \approx \log \frac{M_{11} M_{00}}{M_{10} M_{01}}    (3.2)
where the approximation is valid if query noise is independent of the content, p10|0 = p10|1, and address patterns have sparse activity, p ≪ 1, such that M00 ≫ M10 and M01 ≫ M11. In case p10|0 ≠ p10|1, the approximation is still valid up to an additive offset, w0 ≔ log((1 − p10|1)/(p10|0)), where optimal retrieval can be implemented as described for equation 2.18.
Thus, the optimal Bayesian model has strongly inhibitory weights, wij = −∞, for M11 = 0 when the original Willshaw network would have zero weights. For sufficiently small Mpq, the fraction of synapses with zero coincidence counters will be significant, p0 ≔ pr[M11 = 0] = (1 − pq)^M ≈ exp(−Mpq) ≫ 0, and, thus, the dendritic potentials will be dominated by the strongly inhibitory inputs. For still diverging unit usages, Mp → ∞ and Mq → ∞, the remaining synaptic countervariables will be large and close to their mean values, M00 ≈ M(1 − p)(1 − q), M01 ≈ M(1 − p)q, M10 ≈ Mp(1 − q), and therefore approximately equal for all synapses. Thus, up to an additive constant, the synaptic weights become
w_{ij} \approx \log M_{11}(ij)    (3.3)
corresponding to a nonlinear incremental Hebbian learning rule. At least for large p0 → 1, this rule will degenerate to the clipped Hebbian rule of the inhibitory Willshaw model where wij = −∞ with probability pr[M11 = 0] = p0 and wij = 0 with probability pr[M11 = 1] ≈ 1 − p0 whereas pr[M11>1] ≈ 0 becomes negligible. Since p0 → 1 is equivalent to Mpq → 0, this means that the Willshaw model becomes equivalent to Bayesian learning at least for max(1/p, 1/q) ≪ M ≪ 1/(pq) (see Figure 6, left panels). Numerical experiments suggest that the Willshaw model may be optimal even for smaller p0 → 0.5 corresponding to logarithmic pattern activity, where the Willshaw capacity becomes maximal, Cϵ → ln 2 ≈ 0.69 bps, given that individual address pattern activities kμ are narrowly distributed around mp (see Figure 7b; see also Knoblauch et al., 2010). For even smaller p0 < 0.5, the Willshaw model cannot be optimal, whereas the capacity of the optimal Bayesian model increases toward 0.72 bps.

3.2.  Linear Learning Models and the Covariance Rule.

In general, the synaptic weights of the Bayesian associative network (see equation 2.15) are a nonlinear function of presynaptic and postsynaptic activity. This section shows that in the limit Mpq → ∞, the optimal Bayesian rule, equation 2.15, can be approximated by a linear learning rule,
w_{ij} = w_0 + \sum_{\mu=1}^{M} r_{u_i^\mu v_j^\mu} = w_0 + r_{11} M_{11} + r_{10} M_{10} + r_{01} M_{01} + r_{00} M_{00}    (3.4)
with offset w0 and learning increments ruv specifying the change of synaptic weight when the presynaptic and postsynaptic neurons have activity u ∈ {0, 1} and v ∈ {0, 1}, respectively. In fact, for diverging unit usages, M1, M0 → ∞, the synapse usages will be close to expectation: M11 ≈ M1p, M01 ≈ M1(1 − p), M10 ≈ M0p, and M00 ≈ M0(1 − p). These approximations make only a negligible relative error if the standard deviations are small compared to the expectations. The most critical variable is the coincidence counter M11 having expectation M1p and standard deviation √(M1p(1 − p)). Thus, the approximations are valid for large values of the coincidence counter, that is, for M1p = Mpq → ∞ with q ≔ M1/M. Then the argument of the logarithm in equation 2.15 will be close to
formula
3.5
where   d*1: =  p(1 − p10|1) + (1 − p)p01|1,    d*2 ≔ (1 − p)(1 − p01|0) +  pp10|0, d*3p(1 − p10|0) + (1 − p)p01|0,   and   d*4: = (1 − p)(1 − p01|1) + pp10|1. Linearizing the logarithm around a0 yields
formula
3.6
where d1M11(1 − p10|1) + M01p01|1, d2M00(1 − p01|0) + M10p10|0, d3M10(1 − p10|0) + M00p01|0, and d4M01(1 − p01|1) + M11p10|1 for brevity. Similarly, the resulting function f can be linearized around the expectations of the synapse usages. This gives a learning rule of the form of equation 3.4 with offset w0 = log a0 and
formula
3.7
formula
3.8
formula
3.9
formula
3.10
where
formula
3.11
If the query noise is independent of the contents, p01 = p01|0 = p01|1 and p10 = p10|0 = p10|1, then the four constants become identical, η ≔ η11 = η10 = η01 = η00, the offset becomes zero, w0 = 0, and the synaptic weight becomes
formula
3.12
This is essentially (up to factor η) the linear covariance rule as discussed in much previous work (e.g., Sejnowski, 1977a, 1977b; Hopfield, 1982; Palm, 1988a, 1988b; Tsodyks & Feigel'man, 1988; Willshaw and Dayan, 1990; Dayan & Willshaw, 1991; Palm & Sommer, 1992, 1996; Dayan & Sejnowski, 1993; Chechik et al., 2001; Sterratt & Willshaw, 2008). Thus, together with the results of section 2.3, this shows that, in the asymptotic limit with query noise being independent of contents, optimal Bayesian learning becomes equivalent to linear learning models employing the covariance rule. If query noise depends on contents, Bayesian learning differs from the covariance rule, but up to an additive offset, it still follows a linear learning rule.5
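A minimal sketch of the covariance rule in this form; the factor η and the activity levels p, q are passed in as known parameters, which is an assumption made for illustration:

```python
import numpy as np

def covariance_weights(U, V, p, q, eta=1.0):
    """Linear covariance rule (up to the factor eta discussed in the text):
    every stored pair contributes (u_i - p)(v_j - q) to w_ij, where p and q
    are the mean activities of address and content patterns."""
    return eta * (U - p).T @ (V - q)
```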

3.3.  BCPNN-Type Models.

The BCPNN rule is an early learning model for neural associative memory employing a Bayesian heuristics (Lansner & Ekeberg, 1987, 1989; Kononenko, 1989). The original rule is
formula
3.13
formula
3.14
where wij is the synaptic weight and, given a query , an output neuron will be activated, , if the dendritic potential exceeds the firing threshold Θj (see Lansner & Ekeberg, 1989, p. 79).
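As an illustration, the following sketch implements BCPNN-type weights and dendritic potentials in the form commonly written for this rule, that is, as log ratios of estimated joint and marginal activation probabilities with a bias term per output unit. Since the displayed equations 3.13 and 3.14 are not reproduced in this extract, the exact form below is an assumption; variable names are illustrative.

```python
import numpy as np

def bcpnn_weights(cnt):
    """Sketch of BCPNN-type estimates from the counters of section 2.1:
    w_ij = log(p_ij / (p_i p_j)) and bias_j = log p_j; zero counts yield
    infinite weights, as for the Bayesian rule."""
    M, M1_i, M1_j, M11, *_ = cnt
    with np.errstate(divide='ignore', invalid='ignore'):
        w = np.log(M11 * M) - np.log(np.outer(M1_i, M1_j))  # log(p_ij / (p_i p_j))
        bias = np.log(M1_j / M)                              # log p_j
    return w, bias

def bcpnn_potential(w, bias, u_query):
    """Dendritic potential using only the active query components, as in the
    maximum likelihood heuristic around equation 3.15."""
    return bias + w[u_query.astype(bool)].sum(axis=0)
```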
The following summarizes the main results of a technical report (Knoblauch, 2010a) comparing the BCPNN rule to the optimal Bayesian rule, equation 2.15. Obviously the two rules are not identical. The reason for this discrepancy is that Lansner and Ekeberg derived the BCPNN rule from the following maximum likelihood decision,
formula
3.15
where is the set of active query components. Thus, there are two main differences to the optimal Bayesian decision, equation 2.12. One is that the BCPNN model considers only active query components and ignores inactive components . In contrast, the optimal Bayesian model considers both active and inactive query components. Second, the BCPNN model needs to compute , which becomes viable only by wrongly assuming that the query components would be independent of each other, that is, by using
formula
3.16
This approximation is inaccurate because the query components given the storage variables depend on each other even for independently generated query components with . For example, consider the following simple network motif of two input units, m = 2, and a single output unit, n = 1. After storing M memories, let
formula
3.17
where, for brevity, the indices are skipped for the output unit. Then, for zero query noise, it is , but . Note that the optimal Bayesian model avoids this problem by computing the odds ratio such that cancels.

Appendix H generalizes the BCPNN rule for noisy queries and describes two improved BCPNN-type rules, each of them fixing one of the two problems described: the BCPNN2 rule (see equation H.9), includes inactive query components but still uses an approximation similar to equation 3.16, and the BCPNN3 rule (see equation H.12) does not employ approximation equation 3.16, but still ignores inactive query components. For the latter, it is possible to compute the SNR in analogy to section 2.3. It turns out that in the linear learning regime, , the squared SNR R2 (and thus also storage capacity and Cϵ) is factor below the optimal value equation 2.28. This implies also that the original BCPNN rule performs at least factor worse than the optimal Bayesian rule and thus, at most, is equivalent to the suboptimal linear homosynaptic rule (e.g., see rule R3 in Dayan & Willshaw, 1991). In the complementary nonlinear regime corresponding to very sparse patterns, similar arguments as in section 3.1 show that the BCPNN model becomes equivalent to optimal Bayesian learning and the Willshaw model.

4.  Results from Simulation Experiments

This section has two purposes: to verify the theoretical results and compare the performances of the different learning models. To this end, I have implemented associative memory networks with optimal Bayesian learning (see section 2.2), BCPNN-type learning (see appendix H and section 3.3), linear learning (see appendix G and section 3.2), and Willshaw-type clipped Hebbian learning (see section 3.1). All experiments assume full network connectivity (P = 1).

4.1.  Verification of SNR R.

A first series of experiments illustrated by Figure 4 implemented networks of size m = n = 1000 and compared experimental SNR R of dendritic potentials (black curves; see equation 2.24) to the theoretical values (gray curves). Here the theoretical values have been computed from equation 2.28 (Bayes), equations G.7 to G.9 (linear), and equation H.21 (BCPNN3). Data correspond to four experimental conditions testing sparse versus nonsparse memory patterns and queries having miss noise versus add noise. For each condition, the corresponding plot shows SNR R as a function of stored memories M. All experiments assumed ideal conditions where each query pattern was generated from an address pattern uμ having k = pm one-entries, such that the query contained c correct one-entries and f false one-entries (see Figure 2, left). Furthermore, all tested content neurons had unit usages M1 = Mq.

Figure 4:

Verification and comparison of SNR R for different learning models (see equation 2.24). Each plot shows SNR R as a function of stored memories M for a network of size m = n = 1000, with data from simulation experiments (black) and theory (gray). Individual curves correspond to the optimal Bayesian model (thick solid; see section 2.2, equation 2.28), linear covariance rule (thick dashed; see appendix G), Willshaw model (thick dash-dotted; see section 3.1), BCPNN rule (medium solid; see section H.1), BCPNN2 rule (medium dashed; see section H.2), BCPNN3 rule (medium dash-dotted; see sections H.3 and H.4, equation H.21), linear homosynaptic rule (thin solid; see appendix G), and the linear Hebb rule (thin dashed; see appendix G). Top panels correspond to pattern part retrieval with miss noise only (, ). Bottom panels correspond to queries including add noise (, ). Left panels correspond to nonsparse memory patterns with p = q = 0.5. Right panels correspond to (moderately) sparse patterns with p = q = 0.1. Each data value averages over 10,000 networks, each tested with a single query under ideal theoretical conditions (see text).

For most conditions and models, the theoretical predictions match the experimental SNR very well. This is true in particular for the three tested linear models (Hebb rule, homosynaptic rule, and covariance rule), but also for the Bayesian and BCPNN-type rules if the mean value of the coincidence counter is sufficiently large, Mpq ≫ 1, as presumed at the beginning of section 2.3. For example, for nonsparse patterns, the theoretical results become virtually exact for M>70. For fewer coincidences, the SNR curves of the Bayesian and BCPNN-type models are similar to those of the Willshaw model. Here the SNR is not a good predictor of retrieval quality and cannot easily be compared to the regime with large Mpq for the following reasons. First, variances of dendritic potentials between high and low units become significantly different (cf. equation 2.26). Second, the distributions of dendritic potentials become nongaussian (Knoblauch, 2008; cf. appendix E). Third, in particular for very small Mpq, dendritic potentials may be contaminated by infinite synaptic inputs (see equations 2.15, 3.2, and 3.14). This reasoning also explains the nonmonotonicity of the SNR curves visible in Figure 4 for the Bayesian and BCPNN-type models as a transition from a nonlinear Willshaw-type to a linear covariance-type regime of operation.

4.2.  Verification of Output Noise .

In a second step, I verified the theory for output noise (see equation 2.30) as described in appendix E using the same network implementations as described before. In fact, appendix E shows that there is a bijective relation between the SNR R and (minimal) output noise if the dendritic potentials are gaussian and the high and low potentials have identical variances. Thus, given that the theory of SNR is correct, here it is tested whether these two conditions hold true.

Figure 5 shows output noise as a function of stored memories M assuming the same conditions as described for Figure 4. As before, for most conditions and models, the theoretical predictions match the experimental values very well. In fact, the match is good even for the Bayesian and BCPNN-type rules when assuming relatively small Mpq, where the theoretical estimates of SNR are still inaccurate. Again, the theory is inaccurate only for the Bayesian and BCPNN-type models for the condition of sparse memories and miss noise only. Here the theory basically suggests equivalence to the linear covariance rule, whereas the Bayesian and BCPNN-type models perform much better due to the infinitely negative synaptic weights caused by the M11 = 0 events, which allow rejecting a neuron activation by a single presynaptic input.

Figure 5:

Verification and comparison of output noise for different learning models (see equation 2.30). Each plot shows as a function of stored memories M for a network of size m = n = 1000, including data from simulation experiments (black) and theory (gray; see equation E.6). Individual curves correspond to the optimal Bayesian model (thick solid; see section 2.2, equation 2.28), linear covariance rule (thick dashed; see appendix G), Willshaw model (thick dash-dotted; see section 3.1), BCPNN rule (medium solid; see section H.1), BCPNN2 rule (medium dashed; see section H.2), BCPNN3 rule (medium dash-dotted; see sections H.3 and H.4 and equation H.21), linear homosynaptic rule (thin solid; see appendix G), and the linear Hebb rule (thin dashed; see appendix G). Top panels correspond to pattern part retrieval with miss noise only (, ). Bottom panels correspond to queries including add noise (, ). Left panels correspond to nonsparse memory patterns with p = q = 0.5. Right panels correspond to (moderately) sparse patterns with p = q = 0.1. Each data value averages over 10,000 networks each tested with a single query under ideal theoretical conditions (see text; same data as in Figure 4).

4.3.  Verification of Storage Capacity Mϵ.

A further series of experiments illustrated by Figure 6 tested the theory of storage capacity (see equations 2.32 and 2.33) for different network sizes m = n = 100, 1000, 10,000, a larger range of pattern activities mp (= nq), and relaxing the restrictive assumption of having fixed k, c, f, M1. This means that a query pattern was generated by randomly selecting one of the M address patterns uμ and applying component-wise query noise. Similarly, all content neurons were included in the analysis. Thus, the previously fixed parameters became binomials, k ∼ Bm,p and M1 ∼ BM,q, with correspondingly distributed c and f, where BN,P is as explained below equation B.1.
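The query generation used in these experiments can be sketched as follows; note that raw component transition probabilities are used here rather than the noise fractions normalized to the mean pattern activity, and all names are illustrative:

```python
import numpy as np

def make_query(u, p_miss, p_add, rng):
    """Generate a noisy query from an address pattern u: delete active
    components with probability p_miss and switch on inactive components
    with probability p_add."""
    u = u.astype(bool)
    keep = rng.random(u.shape) >= p_miss      # survives miss noise
    false = rng.random(u.shape) < p_add       # add noise on zero components
    return ((u & keep) | (~u & false)).astype(int)

rng = np.random.default_rng(1)
u = (rng.random(1000) < 0.05).astype(int)
query = make_query(u, p_miss=0.2, p_add=0.01, rng=rng)
```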

Figure 6:

Verification and comparison of pattern capacity for different learning models (see equation 2.32). Each plot shows output noise as function of mean pattern activity mp = nq when storing memories at the theoretical capacity limit of Bayesian learning for low output noise (equation 2.33, ϵ = 0.01; see parameter sets 1–6 in Table 1). Plots show data from simulation experiments (black; see equation 2.30) and theory (gray; see equation E.6). Individual curves correspond to the optimal Bayesian model (thick solid; see section 2.2, equation 2.28), linear covariance rule (thick dashed; see appendix G), Willshaw model (thick dash-dotted; see section 3.1), BCPNN rule (medium solid; see section H.1), BCPNN2 rule (medium dashed; see section H.2), and BCPNN3 rule (medium dash-dotted; see sections H.3 and H.4), linear homosynaptic rule (thin solid; see appendix G), and the linear Hebb rule (thin dashed; see appendix G). Left panels correspond to pattern part retrieval with miss noise only (, ). Right panels correspond to queries including add noise (, ). Top panels correspond to small networks with m = n = 100. Middle panels correspond to medium networks with m = n = 1000. Bottom panels correspond to larger networks with m = n = 10, 000. Each data value averages over 10,000 retrievals in 100 networks storing random patterns with independent components.

Each plot shows output noise as a function of mean pattern activity mp. For each value of mp, the number of stored patterns, , was computed from equation 2.33 for the optimal Bayesian rule and a low-output noise level ϵ = 0.01 (see parameter sets 1–6 in Table 1). For small networks (m = n = 100; upper panels) the theory is generally inaccurate. For example, for the optimal Bayesian learning rule, the theory strongly overestimates storage capacity for sparse memory patterns and underestimates capacity for nonsparse patterns. For larger networks (middle and bottom panels), there is a large range of mp where the theory precisely predicts storage capacity. Only for very sparse memory patterns (with small ) does the theory remain inaccurate. For queries containing add noise, the theory generally overestimates true capacity. For queries containing only miss noise, the theory overestimates capacity for extremely sparse patterns but underestimates capacity for patterns with intermediate sparseness.

For larger networks and sufficiently large Mpq, the theory becomes very precise for the optimal Bayes rule, the BCPNN3 rule, and the linear covariance rule.

In contrast, even for m = n = 10,000 and pm > 1000, the theory for the linear homosynaptic rule underestimates output noise by about a factor of two. The underestimation of output noise is even worse for the linear Hebbian rule. Here the reasoning is that, in contrast to the covariance and homosynaptic rules, the mean synaptic weight is nonzero for the Hebbian rule. Therefore inhomogeneities in c, f, and k can cause a much larger variance in dendritic potentials than predicted by the theory, which assumes fixed given values for c, f, and k.

4.4.  Comparison of the Different Learning Models.

The simulation experiments confirm that the Bayesian learning rule is the general optimum leading to maximal SNR, minimal output noise, and highest storage capacity. Nevertheless, the simulations show also that for particular parameter ranges, some of the previous learning models can also become optimal.

The linear covariance rule becomes optimal in the linear learning regime, Mpq → ∞, which, for a given output noise level ϵ, corresponds to moderately sparse or nonsparse memory patterns with mp/ln q → ∞ (see equations 2.35 and 2.37). However, for sparse memory patterns of finite size, the linear rules can perform much worse than the optimal Bayesian model—even worse than the Willshaw model.

Similarly, the BCPNN-type models become optimal in the limit of sparse query activity, . For finite size or nonsparse query patterns, the storage capacity can be significantly (factor ) below the optimal value.

Finally, the Willshaw model becomes optimal only for pattern part retrieval () and few coincidence counts, corresponding to very sparse memory patterns with mp = O(ln q). For finite networks, the Willshaw model achieves the performance of the Bayesian model only if the output noise level is low and the address pattern activities kμ are constant or narrowly distributed around mp. In all other cases, the Willshaw model performs much worse than the optimal Bayesian rule.

4.5.  Further Results Concerning Memory Statistics and Retrieval Methods.

Figure 7 shows additional simulation experiments testing the various learning models for different retrieval methods and different ways of generating random patterns (m = n = 1000 and pattern part retrieval with , ). Since the Bayesian theory can strongly overestimate pattern capacity for very sparse memory patterns (see equation 2.37), memories were stored at the much lower capacity limit of the Willshaw model assuming a fixed pattern activity kμ = mp for all memories (see equation 57 in Knoblauch et al., 2010; see parameter set 7 in Table 1). Then testing the networks again with random patterns having independent components (and binomial activity kμBm,p) yields qualitatively similar results as before (compare the top left panel of Figure 7 to the middle left panel of Figure 6). Further simulations suggest that the Bayesian and BCPNN-type models have a high-fidelity capacity for very sparse patterns that is almost as low as reported for the Willshaw model (basically for ϵ ≪ 1 and k/log n → 0; see appendix D in Knoblauch et al., 2010).
Figure 7:

Effect of memory statistics and retrieval method on the performance of different learning models. Each plot shows output noise as a function of mean pattern activity mp = nq when storing memories at the theoretical capacity limit of the Willshaw model (ϵ = 0.01; see parameter set 7 in Table 1) assuming network size m = n = 1000 and queries containing miss noise only (, ). Plots show data from simulation experiments (black; see equation 2.30) and theory (gray; see equation E.6). Individual curves correspond to the optimal Bayesian model (thick solid; see section 2.2, equation 2.28), linear covariance rule (thick dashed; see appendix G), Willshaw model (thick dash-dotted; see section 3.1), BCPNN rule (medium solid; see section H.1), BCPNN2 rule (medium dashed; see section H.2), BCPNN3 rule (medium dash-dotted; see sections H.3, and H.4), linear homosynaptic rule (thin solid; see appendix G), and the linear Hebb rule (thin dashed; see appendix G). Left panels correspond to random memory patterns with independently generated components, that is, kμ ≔ ∑mi=1uμi follows a binomial distribution, kμBm,p. Right panels correspond to random memory patterns with a fixed pattern activity kμ = mp. Top panels correspond to fixed optimal firing thresholds Θj (see appendix D). Bottom panels correspond to l-winners-take-all retrieval activating the lnq neurons having the largest dendritic potentials xj. Each data value averages over 10,000 retrievals in 100 networks.

Table 1:
Theoretical Pattern Capacities at Output Noise Level ϵ = 0.01 for Optimal Bayesian Learning (Parameter Sets 1–6) and the Willshaw Model (Parameter Set 7) as Employed for the Simulation Experiments Illustrated by Figures 4 to 7.
Mϵ at ϵ = 0.01

                        Bayes                  Bayes                  Bayes                      Willshaw
                        m = n = 100            m = n = 1000           m = n = 10000              m = n = 1000
mp = nq                 Set 1      Set 2       Set 3      Set 4       Set 5        Set 6         Set 7
                           63         85        5371       7161       468,070      624,093
                           34         45        2815       3753       243,308      324,411        315
                           23         31        1932       2577       166,089      221,453        988
     10                    15         20        1206       1608       102,781      137,042       1578
     20                               11         639        853        53,710       71,613       1252
     30                                          443        591        36,794       49,059        851
     50                                          281        374        22,886       30,514        448
    100                                          154        205        12,063       16,084        156
    200                                           88        116          6399         8531         47
    300                                           66         84          4435         5912         22
    500                                           50         50          2813         3749
   1000                                                                  1546         2056
   2000                                                                   886         1163
   3000                                                                   664          845
   5000                                                                   502          502

Notes: Data assume various network sizes m = n, mean pattern activities mp = nq, and query noise parameters , . Capacities for the Bayesian model have been computed from equation 2.33 (assuming independent pattern components). Capacities for the Willshaw model have been computed from Knoblauch et al. (2010, eq. 57) and are exact for fixed pattern activities k = mp (whereas independent memory components would imply  for a large range of sparse memory patterns; cf. Knoblauch et al., 2010, eq. 65).

In contrast, for random patterns with fixed activity kμ = mp, the Bayesian and BCPNN-type models perform equivalently to the Willshaw model for a large range of sparse patterns (see Figure 7, top right panel). Moreover, for less sparse patterns, BCPNN2 becomes equivalent to the BCPNN rule, and BCPNN3 becomes equivalent to optimal Bayesian learning. The linear homosynaptic and Hebb rules also improve strongly, now closely matching the theoretical values (computed for independent pattern components and binomial kμ), where the homosynaptic rule becomes equivalent to the covariance rule.

So far, retrieval used fixed firing thresholds to minimize output noise (see appendix D). A simple alternative is l-winners-take-all (l-WTA) retrieval activating the l neurons with the largest dendritic potentials xj (as may be implemented in the brain by recurrent inhibition, for example).7 Figure 7 (bottom left panel) shows simulation results for l-WTA and memory patterns with independent components and binomial kμ ∼ Bm,p. Surprisingly, all of the various learning models show almost identical performance at relatively high levels of output noise . There are two reasons that can partly explain this result. First, l-WTA cannot achieve high fidelity with  because the content patterns vμ have a distributed pattern activity lμ ∼ Bn,q that is unknown beforehand. Thus, activating the l most excited units causes a positive baseline level of output noise. Second, storing patterns at the relatively low capacity limit of the Willshaw model implies, for fixed thresholds, low output noise for all models. Therefore, the actual output noise for l-WTA will be dominated by the baseline errors described above. Nevertheless, further simulations confirmed that even for a larger number of stored patterns, the performances of the different models are much more similar than for fixed firing thresholds.
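
For illustration, the following Python sketch contrasts the two readout schemes compared here, fixed-threshold retrieval and l-winners-take-all retrieval, on a given vector of dendritic potentials. It is a minimal sketch assuming precomputed potentials x and thresholds theta; the function names and parameter values are illustrative and not part of the original model.

    import numpy as np

    def retrieve_fixed_threshold(x, theta):
        # Activate every content neuron whose dendritic potential reaches
        # its (precomputed) firing threshold.
        return (x >= theta).astype(int)

    def retrieve_l_wta(x, l):
        # l-winners-take-all: activate the l neurons with the largest
        # dendritic potentials (ties broken arbitrarily by argsort).
        out = np.zeros(len(x), dtype=int)
        out[np.argsort(x)[-l:]] = 1
        return out

    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)              # stand-in for dendritic potentials
    v_fixed = retrieve_fixed_threshold(x, theta=2.0)
    v_wta = retrieve_l_wta(x, l=50)        # activates exactly l = 50 units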

For l-WTA and fixed pattern activity lμ = nq, the performance generally improves (Figure 7, bottom right panel). As before, l-WTA seems to even out the performance differences among the various synaptic learning models: surprisingly, the linear Hebbian, homosynaptic, and covariance rules now show identical high performance, precisely matching the theoretical values for the covariance rule. The Bayesian and BCPNN-type rules also show identical performance. Further simulations show that for queries including add noise (), l-WTA retrieval performance becomes identical even between the Bayesian-type and linear model groups. These results support the view that homeostatic mechanisms, such as regulating the total activity level, may play a role as important as tuning the synaptic learning parameters (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Van Welie, Van Hooft, & Wadman, 2004; Chechik et al., 2001; Knoblauch, 2009c).

5.  Summary and Discussion

Neural associative memories are promising models for computations in the brain (Hebb, 1949; Anderson, 1968; Willshaw et al., 1969; Marr, 1969, 1971; Little, 1974; Gardner-Medwin, 1976; Braitenberg, 1978; Hopfield, 1982; Amari, 1989; Palm, 1990; Lansner, 2009), and they are also potentially useful in technical applications such as cluster analysis, speech and object recognition, or information retrieval in large databases (Kohonen, 1977; Bentz, Hagstroem, & Palm, 1989; Prager & Fallside, 1989; Greene, Parnas, & Yao, 1994; Huyck & Orengo, 2005; Knoblauch, 2005; Mu, Artiklar, Watta, & Hassoun, 2006; Wichert, 2006; Rehn & Sommer, 2006).

In this paper, I have developed and analyzed the generally optimal neural associative memory that minimizes the Hamming-distance-based output noise and maximizes pattern capacity and network storage capacity Cϵ based on Bayesian maximum likelihood considerations. In general, the resulting optimal synaptic learning rule, equation 2.15, is nonlinear and asymmetric, and it differs from previously investigated linear learning models of the Hopfield type, simple nonlinear learning models of the Willshaw type, and BCPNN-type Bayesian learning heuristics. As revealed by detailed theoretical and experimental comparisons, the previous models are special cases of Bayesian learning that become optimal only in the asymptotic limit of large networks and for particular ranges of pattern activity p, q and query noise (see Table 2).

Table 2:
Asymptotic Conditions When the Various Learning Rules Become Optimal (Equivalent to the Bayesian Rule).
Learning Rule            General Conditions for Optimality     Conditions at Capacity Limit
Optimal Bayesian         None                                  None
BCPNN type               p → 0                                 p → 0
Linear covariance                                              (mp)/log m → ∞
Linear homosynaptic       and p → 0                            (mp)/log m → ∞ and p → 0
Linear heterosynaptic     and q → 0                            (mp)/log m → ∞ and q → 0
Linear Hebb               and p, q → 0                         (mp)/log m → ∞ and p, q → 0
Linear Hopfield           and p, q → 0.5                       p, q → 0.5
Willshaw                  and  and                             mp ∼ log m and

Notes: The constraints depend on the fraction of active units in an address pattern (p ≔ pr[uμi = 1]) or content pattern (q ≔ pr[vμj = 1]), the size of the address population (m), the mean value of the synaptic coincidence counter (, where M is the number of stored memories), the mean unit usages (, ), and the fraction of add noise in the query pattern (). The right column reexpresses the general conditions of the middle column for the case when M equals the pattern capacity .

For example, the Willshaw model becomes optimal only in the limit of small coincidence counters, , for queries without any add noise, . For maximal  at the capacity limit, this condition can be achieved only for extremely sparse memory patterns where the number of active units per memory vector typically scales logarithmically in the population size, for example, p, q ∼ log n/n (Knoblauch et al., 2010). Nevertheless, one may be surprised that a simple model employing binary synapses can perform optimal Bayesian retrieval at all. The reason is that a low value of  guarantees that a large fraction p0 ≔ (1 − pq)M of synaptic weights remains zero in the Willshaw model or minus infinity in the corresponding Bayesian interpretation (see equation 3.2). Retrieval is then dominated by rejecting activations of postsynaptic neurons on the basis of single but strongly inhibitory inputs. In particular, for small but nonvanishing p0, the inhibitory Willshaw network becomes very efficient by storing large amounts of information with a small number of synapses (Knoblauch, 2007). Such an inhibitory interpretation of associative memory may also offer novel functional hypotheses for strongly inhibitory cortical circuits, for example, involving chandelier or basket cells (Markram et al., 2004), and also for inhibition-dominated brain structures such as cerebellum and basal ganglia (Marr, 1969; Albus, 1971; Kanerva, 1988; Wilson, 2004).

In contrast to the Willshaw model, the linear covariance rule becomes optimal in the linear learning regime where the synaptic coincidence counters diverge, . Then linearization of the optimal Bayesian rule yields the covariance rule, and the two rules have the same asymptotic SNR. Correspondingly, the fraction of synapses with infinite weights vanishes, p0 → 0, which, at the capacity limit (see equation 2.33), corresponds to moderately sparse or nonsparse memory patterns with typically p, q ≫ log n/n. Numerical experiments indicate that in reasonably large but finite networks, the optimal Bayesian model still performs significantly better than the linear covariance rule for a large range of pattern activities p ≪ 0.5. Furthermore, the SNR analysis allows a characterization of basins of attraction in terms of miss noise and add noise (see equation 2.29 and Figure 2, right). It turns out that in the linear learning regime, , the network is more vulnerable to miss noise () than to add noise (). This contrasts with the nonlinear learning regime, , where the network is more vulnerable to add noise, mainly because add noise destroys the network's ability to reject postsynaptic activations by single strongly inhibitory synaptic inputs. Alternative linear learning models such as the Hebb, homosynaptic, and heterosynaptic rules behave similarly to the covariance rule but have a lower signal-to-noise ratio unless p → 0 and/or q → 0 (Dayan & Willshaw, 1991).

The original BCPNN model of Lansner and Ekeberg has a formulation similar to the optimal Bayesian model but neglects inactive query neurons and employs an inaccurate approximation (see equation 3.16). More recent hypercolumnar variants of the BCPNN model for discrete-valued memories remedy the first problem by employing extra neurons to represent inactivity (Lansner & Holst, 1996; Johansson, Sandberg, & Lansner, 2002), but they require at least double the network size of the optimal Bayesian model. For comparison, I have extended the original BCPNN model to include query noise and derived two improved BCPNN-type rules: the BCPNN2 rule also considers the inactive query neurons, whereas the BCPNN3 rule does not make use of the inaccurate approximation. Similar to the Willshaw model, the BCPNN-type rules become optimal at least in the nonlinear learning regime, , corresponding to very sparse patterns where active units dominate the total information contained in a query pattern. Moreover, for the linear learning regime , I have analyzed the SNR of the BCPNN3 rule, which provides an upper bound for the original BCPNN rule. The analysis revealed that the SNR of the BCPNN3 model is equivalent to that of the linear homosynaptic rule, that is, a factor  worse than for optimal Bayesian learning (see also Dayan & Willshaw, 1991). Thus, the original BCPNN rule achieves at most the capacity of the homosynaptic rule and becomes optimal only for sparse address patterns with p → 0 or low query activity with small . Even for sparse address patterns with p → 0, the BCPNN-type models have reduced basins of attraction in the sense that they are more vulnerable to add noise with large  than the optimal Bayesian model.

MacKay (1991) has suggested a learning model based on maximizing the entropy of synaptic weights that is closely related to optimal Bayesian associative memory. In particular, he arrived at a similar learning rule and also discussed the convergence to the covariance rule as well as the necessity of infinite synaptic weights. The current approach goes beyond these previous results by generalizing the learning rule to include query noise and by providing an SNR analysis for Bayesian learning. The latter, in connection with the results of appendix E, rigorously proves the equivalence of Bayesian learning and the covariance rule in the limit  (whereas Taylor expansion of the BCPNN rule, for example, also leads to the covariance rule despite BCPNN being suboptimal; see section H.4). Moreover, the analysis covers convergence of the Bayesian learning rule to linear learning rules other than the covariance rule when the query noise is not independent of the stored contents (as can be expected for any real-world data).

As with most previous approaches, the “optimal” Bayesian memory model still makes the naive assumption that address attributes are independent of each other. Although this assumption is almost never fulfilled in real-world data, experiments reveal that naive Bayesian classifiers perform surprisingly well or even optimally in many domains that contain clear attribute dependencies (Zhang, 2004; Domingos & Pazzani, 1997). Moreover, it may be possible to extend the model by semi-naive approaches including higher-order dependencies, for example, as suggested by Kononenko (1991, 1994).

At least for independent address attributes, the Bayesian neural associative memory presented in this work is, by definition, the optimal local learning model maximizing  and Cϵ. On the other hand, there exist general bounds on the storage capacity of neural networks that do not refer to any particular learning algorithm (Gardner, 1988; Gardner & Derrida, 1988). Like the linear covariance rule, the optimal Bayesian model reaches the Gardner bound for sparse memory patterns p, q → 0 in the limit Mpq → ∞ corresponding to moderately sparse patterns with mp ≫ log(n), where the network can store Cϵ = 1/(2 ln 2) ≈ 0.72 bps (compare equation 2.37 to equation 40 in Gardner, 1988). However, for logarithmically sparse memory patterns with mp ∼ log n, the storage capacity of the optimal Bayesian rule is below the Gardner bound and cannot exceed the maximal capacity of the Willshaw model, which is Cϵ = ln 2 ≈ 0.69 bps (or, rather, Cϵ = 1/(e ln 2) ≈ 0.53 bps for distributed pattern activities; see Knoblauch et al., 2010, appendix D). For even sparser memory patterns with mp/log n → 0, the storage capacity vanishes, Cϵ → 0. Also for nonsparse patterns where p → 0.5, the Gardner bound of 2 bps cannot be reached. Here the optimal Bayesian rule achieves at most Cϵ ≈ 0.33 bps for very low-fidelity retrieval with ϵ → 1, and only Cϵ → 0 for high-fidelity retrieval with vanishing output noise ϵ → 0 (see Figure 3). Thus, as noted by Sommer and Dayan (1998), at least for nonsparse address patterns with p → 0.5, local learning is insufficient, and the optimal synaptic weights must be found by more sophisticated algorithms including nonlocal information.

Even if the Bayesian associative memory could reach the Gardner bound, the resulting storage capacity of at most 2 bits per synapse would be low compared to the physical memory actually required to represent real-valued synaptic weights (or, alternatively, the counter variables described in section 2.1). Even worse, an accurate neural implementation of the Bayesian associative memory requires two numbers per synaptic weight: a real-valued variable for the finite contributions and an integer variable for the infinite contributions (see appendix A). In fact, if we take into account the computational resources required to represent the resulting network, the Willshaw model outperforms all other models because of its binary weights (Knoblauch et al., 2010): For implementations on digital hardware, the Willshaw model can reach the theoretical maximum of CI = 1 bit per computer bit (Knoblauch, 2003). Correspondingly, parallel hardware implementations of structurally plastic Willshaw networks can reach the theoretical maximum of CS = log n bits per synapse (Knoblauch, 2009b). However, these high capacities (per synapse) are achieved only for a relatively low absolute number of stored memories, M, far below the Gardner bound, equation 2.37. Some preliminary work (Knoblauch, 2009c, 2010b) indicates that the Bayesian associative memory can be efficiently discretized such that structurally compressed network implementations can store CI → 1 bit per computer bit or CS → log n bits per synapse, whereas M (and C) can still be close to the Gardner bound. Another future direction will be to investigate more closely the biological relevance of Bayesian learning by implementing more realistic network models that include spikes, forgetful synapses, and inhibitory circuits (Sandberg et al., 2000; Fusi, Drew, & Abbott, 2005; Markram et al., 2004).

Appendix A:  Implementation of Infinite Weights and Thresholds

As noted in section 2.2, synaptic weights (see equation 2.15) and dendritic potentials (see equation 2.16) may be plus or minus infinity. Naive neural network implementations lead to suboptimal performance if they neglect that positively and negatively infinite contributions may cancel each other. To obtain accurate results, it is necessary to represent synaptic weights and firing thresholds each with two numbers for the finite and infinite components. For d1 ≔ M11(1 − p10|1) + M01 p01|1, d2 ≔ M00(1 − p01|0) + M10 p10|0, d3 ≔ M10(1 − p10|0) + M00 p01|0, and d4 ≔ M01(1 − p01|1) + M11 p10|1, the synaptic weight, equation 2.15, can be expressed by
formula
A.1
formula
A.2
with the gating functions  for  and  for , and  for  and  for . Thus, one component represents the finite weight (neglecting infinite contributions), whereas the other counts the number of contributions toward plus and minus infinity. Similarly, the finite and infinite components of the firing thresholds (corresponding to the “bias” in equation 2.16) write as
formula
A.3
formula
A.4
Then the finite and infinite components of the dendritic potentials are  and , such that a postsynaptic neuron j gets activated if either its infinite potential component exceeds the infinite threshold component, or the infinite components are equal and the finite potential satisfies xj ⩾ Θj.
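
A minimal sketch of this two-number representation is given below: each synaptic weight and threshold carries a finite part and an integer count of contributions toward plus or minus infinity, and a unit fires according to the lexicographic comparison stated above. The array layout and function names are illustrative assumptions; how the counts are filled in from equations A.1 to A.4 is not reproduced here.

    import numpy as np

    def dendritic_potentials(W_fin, W_inf, query):
        # W_fin: finite weight components (n x m); W_inf: net integer counts of
        # contributions toward +/- infinity (n x m); query: binary vector (m,).
        x_fin = W_fin @ query      # finite component of the potentials
        x_inf = W_inf @ query      # net infinity count of the potentials
        return x_fin, x_inf

    def activate(x_fin, x_inf, theta_fin, theta_inf):
        # A neuron fires if its infinite potential component exceeds the
        # threshold's infinite component, or if the infinite components are
        # equal and the finite potential reaches the finite threshold.
        return ((x_inf > theta_inf) |
                ((x_inf == theta_inf) & (x_fin >= theta_fin))).astype(int)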

Appendix B:  Analysis of the SNR for Optimal Bayesian Retrieval

The following computes the SNR (see equation 2.24) for neural associative memory with optimal Bayesian learning (section 2.2) making the same definitions and simplifications as detailed at the beginning of section 2.3. Section B.1 computes the mean difference Δμ ≔ μhi − μlo between the dendritic potential of a high and a low unit, and section B.2 computes the variances σ2hi and σ2lo for the corresponding distributions of dendritic potentials.

B.1.  Mean Values of Dendritic Potentials.

Equivalent to equation 2.16 (but replacing m − 1 by m and skipping indices i, j for brevity), a content neuron j will be activated if the dendritic potential xj exceeds the threshold Θj ≔ log(M0/M1) (instead of Θj = 0), where
formula
B.1
Given M1, M0, the remaining variables are binomially distributed— and , where . For large NP(1 − P) the binomial BN,P can be approximated by a gaussian Gμ,σ with mean μ = NP and variance σ2 = NP(1 − P). Given uμi and vμj, we then have
formula
B.2
formula
B.3
From this, we can approximate the distribution of the dendritic potential xj for low units and high units, respectively. For large k and mk, the sums of logarithms in equation B.1 are approximately gaussian distributed. In principle, the mean potentials μlo and μhi for low units and high units can be computed exactly from equation B.12. Fortunately, it turns out that the mean potential difference Δμ ≔ μhi − μlo required for the SNR can be well approximated by using only the first-order term in equation B.12 (while all higher-order terms become virtually identical for μhi and μlo; for more details, see Knoblauch, 2009a, appendixes D, F). These first-order approximations μ′lo, μ′hi of μlo, μhi are
formula
B.4
formula
B.5
where the approximations are valid for large M0p, M1p → ∞ and sufficiently small p01, p10. Therefore, the mean difference Δμ ≔ μhi − μlo between the high and low distributions is
formula
B.6

B.2.  Variance of Dendritic Potentials.

In order to get the SNR, equation 2.24, we have to compute the variances σ2lo and σ2hi for xj in equation B.1. Given the unit usages M1(j), the random variables M00(i, j) and M11(i, j) are independent, and thus the variances simply add. Because each variance summand is positive, for large M1p, M0p → ∞, we can simply assume and in all cases (cf. equations B.2 and B.3). With equation B.13 we get
formula
B.7
Thus, the variances Var(xj) for the potentials of both low units and high units are approximately
formula
B.8

B.3.  Lemmas for Computing Dendritic Potential Distributions.

Let X be a random variable with normal distribution, X ∼ G0,σ, that is, X is a gaussian with zero mean and variance σ2. Then the dth moment is
formula
B.9
Proofs can be found in standard textbooks of statistics and probability theory (e.g., see equation 5.44 in Papoulis, 1991).
Then the Taylor expansion of log(x) around μ (also called the Newton-Mercator series) is
formula
B.10
formula
B.11
for −1 < Δ/μ ⩽ 1. Proofs can be found in standard textbooks of analysis (e.g., see Borwein & Bailey, 2003; Weisstein, 1999; Abramowitz & Stegun, 1972).
Now let X be a gaussian random variable, X ∼ Gμ,σ, with mean μ and variance σ2. Then for σ ≪ μ, we have
formula
B.12
formula
B.13
where the approximations are tight for σ/μ → 0 if .
Proof.
We can write X = μ + Δ, where Δ is normal with variance σ2. Then equation B.12 follows from equations B.9 and B.11. Similarly, the variance Var(log X) = E((log X)2) − (E(log X))2 follows from
formula
where in the last equation for σ/μ → 0, the first summand (d1 = d2 = 1) dominates.
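
As a quick numerical check of these lemmas, the following sketch compares Monte Carlo estimates of E(log X) and Var(log X) with the standard second-order expansions log μ − σ2/(2μ2) and σ2/μ2, which are the dominant terms for σ ≪ μ; the exact forms of equations B.12 and B.13 are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma = 100.0, 5.0                  # sigma << mu, as required above
    X = rng.normal(mu, sigma, size=1_000_000)
    X = X[X > 0]                            # log requires positive samples

    logX = np.log(X)
    print("E(log X):  ", logX.mean(), " approx:", np.log(mu) - sigma**2 / (2 * mu**2))
    print("Var(log X):", logX.var(),  " approx:", sigma**2 / mu**2)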

Appendix C:  Gaussian Tail Integrals

Let g(x) be the gaussian probability density:
formula
C.1
Then the complementary gaussian distribution function is the right tail integral:
formula
C.2
The first bound is true for any x > 0, and the corresponding approximation error becomes smaller than 1% for x > 10. The second bound is true for any x > 0. Inverting Gc yields
formula
C.3
The two approximations correspond to those of equation C.2. In the first approximation, the term Gc−1(x) can be replaced, for example, by the second approximation .
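
Numerically, Gc and its inverse can be evaluated via the complementary error function, as in the following sketch (using scipy.special); the well-known asymptotic approximation Gc(x) ≈ g(x)/x for large x, consistent with the 1% error statement above, is also shown. The particular bounds of equations C.2 and C.3 are not reproduced here.

    import numpy as np
    from scipy.special import erfc, erfcinv

    def g(x):
        # standard gaussian density
        return np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

    def Gc(x):
        # right tail integral of the standard gaussian
        return 0.5 * erfc(x / np.sqrt(2))

    def Gc_inv(eps):
        # inverse tail integral: Gc(Gc_inv(eps)) = eps
        return np.sqrt(2) * erfcinv(2 * eps)

    x = 3.0
    print(Gc(x), g(x) / x)     # tail value vs. asymptotic approximation g(x)/x
    print(Gc_inv(Gc(x)))       # recovers x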

Appendix D:  Optimal Firing Thresholds

Given a query pattern resembling one of the original address patterns uμ, our goal is to minimize the expected Hamming distance between the corresponding content vμ and the retrieval output (see equation 2.21). To this end, each content neuron vj has to adjust its firing threshold Θ in order to minimize
formula
D.1
where q ≔ pr[vμj = 1] is the prior and
formula
D.2
are the probabilities of making an output error (e.g., equations 2.22 and 2.23) assuming a given low distribution glo(x) ≔ pr[xj = x|vμj = 0] and high distribution ghi(x) ≔ pr[xj = x|vμj = 1] for the dendritic potential xj (e.g., see equation 2.16). Minimizing H(Θ) requires dH/dΘ = 0 or, equivalently,
formula
D.3
as illustrated by Figure 8 (left). The optimal threshold Θopt can be obtained by solving equation D.3, which is easy if the distributions of dendritic potentials are gaussians. Then equation D.3 is rewritten as
formula
D.4
where g is the Gaussian density, equation C.1, and μlo, μhi, σlo, σhi are means and standard deviations of the low and high dendritic potentials similar as defined below equation 2.24. Taking logarithms yields a quadratic equation in Θ with the solution
formula
D.5
formula
D.6
formula
D.7
formula
D.8
where the optimal threshold is either Θ1 or Θ2. If the standard deviations are equal, σlo = σhi, then A = 0, and equation D.4 has the unique solution
formula
D.9
The following lemma characterizes the weighting of add noise (vμj = 0 but ) versus miss noise (vμj = 1 but ) in the retrieval result when choosing the optimal firing threshold: if we assume a given constant output noise (cf. equation 2.30), gaussian potentials with equal standard deviations σlo = σhi, and the optimal firing threshold Θ = Θopt as in equation D.9, then
formula
D.10
that is, for sparse content patterns, the output errors are dominated by miss noise (see equation 2.31). A formal proof of the lemma can be found in Knoblauch (2009a, appendix A, equation 74). Figure 8 (left) gives an intuition as to why the lemma is true. Here H(Θopt) is the intersection area of the high and low distributions, where the left and right parts of the area correspond to miss noise qq10 and add noise (1 − q)q01, respectively (see the arrows). Requiring constant H(Θopt)/q implies that the intersection area H(Θopt) must be a constant fraction of the area below qghi(x). Thus, q → 0 implies for σlo = σhi that the decrease of (1 − q)glo(x) with x becomes very steep compared to the increase of qghi(x) and finally approaches the dashed line corresponding to Θopt.
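
To make this concrete, the following sketch determines the optimal firing threshold numerically by minimizing the expected per-component error q q10(Θ) + (1 − q) q01(Θ) for gaussian low and high potentials, and compares it, for equal standard deviations, with the closed-form threshold (μhi + μlo)/2 + σ2 ln((1 − q)/q)/(μhi − μlo) obtained by equating (1 − q)glo(Θ) = q ghi(Θ). Since equations D.1 to D.9 are not reproduced above, the error functional and the closed form are stated here as assumptions consistent with the surrounding text, not copied from the original.

    import numpy as np
    from scipy.special import erfc
    from scipy.optimize import minimize_scalar

    def Gc(x):
        return 0.5 * erfc(x / np.sqrt(2))

    def expected_error(theta, q, mu_lo, mu_hi, s_lo, s_hi):
        # per-component expected error: miss noise plus add noise
        q10 = Gc((mu_hi - theta) / s_hi)   # pr[output 0 | target 1]
        q01 = Gc((theta - mu_lo) / s_lo)   # pr[output 1 | target 0]
        return q * q10 + (1 - q) * q01

    q, mu_lo, mu_hi, s = 0.05, 0.0, 3.0, 1.0
    res = minimize_scalar(expected_error, bounds=(mu_lo, mu_hi + 5 * s),
                          method="bounded", args=(q, mu_lo, mu_hi, s, s))

    # closed form for equal standard deviations (cf. equation D.9)
    theta_eq = (mu_hi + mu_lo) / 2 + s**2 * np.log((1 - q) / q) / (mu_hi - mu_lo)
    print(res.x, theta_eq)     # both approximately 2.48 for these parameters
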
Figure 8:

Optimal firing threshold and minimal SNR. (Left) Expected normalized distributions (1 − q)glo(x) and qghi(x) of the dendritic potential x for low units (with vμj = 0) and high units (with vμj = 1), respectively. The optimal firing threshold is at the dendritic potential x = Θopt where the two distributions are equal. For sparse content patterns with q < 0.5, the resulting miss noise qq10 is larger than the add noise (1 − q)q01. In fact, for q → 0 and constant ϵ, the add noise becomes negligible (see equation D.10; see also Knoblauch, 2009a). (Right) Contour plot showing the minimal SNR (see appendix E) required to obtain output noise for content pattern activity q and the optimal firing threshold, equation D.9.

Appendix E:  The Relation Between SNR R and Output Noise

We can use two different measures to evaluate retrieval quality: section 2.3 uses the SNR R (see equation 2.24), whereas section 2.4 uses output noise , which is based on the Hamming distance (see equation 2.30). This appendix shows that the two measures are actually equivalent if we assume that (1) all content neurons j have the same priors q ≔ pr[vμj = 1] and the same distributions for high and low dendritic potentials; (2) all dendritic potentials follow a gaussian distribution; (3) each content neuron optimally adjusts the firing threshold in order to minimize output noise (see appendix D); and (4) the distributions of high and low dendritic potentials have the same standard deviation, σ ≔ σlo = σhi. Note that all assumptions are fulfilled at least in the limit Mpq → ∞ for reasons discussed in section 2.3.

We first write the output noise as a function of the SNR R. Due to assumption 1, we can write the output noise, equation 2.30, in terms of the output error probabilities, equations 2.22 and 2.23:
formula
E.1
Due to assumption 2, the output error probabilities write
formula
E.2
where Gc(x) is the tail integral of a gaussian (see equation C.2), and, due to assumption 3, Θopt is the optimal firing threshold as explained in appendix D. Due to assumption 4, Θopt is as in equation D.9:
formula
E.3
The last bound implies that the optimal threshold shifts toward the high potentials for sparse patterns with q < 0.5 and is centered between the two distributions only for q = 0.5. Thus, the error probabilities at the optimal threshold are
formula
E.4
formula
E.5
and thus the minimal output noise level that can be achieved with SNR R equals
formula
E.6
where Gc can be evaluated with equation C.2.
Vice versa, we obtain the minimal SNR required for an output noise level by solving equation E.6 for R. We can do this easily for two special cases. First, for nonsparse content patterns with q = 0.5, we have and thus
formula
E.7
where Gc−1 is as in equation C.3. Second, for sparse content patterns with q → 0, miss noise will dominate the output errors according to equation D.10. Correspondingly, the output noise, equation E.6, is dominated by the second summand. Therefore, q → 0 implies
formula
E.8
Alternatively, and in particular for , we can compute by iteratively applying the following two equations:
formula
E.9
formula
E.10
starting with , for example. In the first step, equation E.9 computes the minimal SNR required to obtain output noise where, in contrast to assumption 3, firing thresholds are chosen such that a given fraction of the expected output errors is add noise and the remaining fraction is miss noise (here, is the output noise balance, equation 2.31; see also equation E.1 and Figure 8, left). In the second step, we insert from the first step into equation E.10 and compute the optimal noise balance such that the output noise is minimal and assumption 3 is fulfilled again. In practice, a few iterations of this procedure (e.g., fewer than 10) are sufficient to obtain an accurate estimate of , which may be further verified by insertion into equation E.6. For more details, see Knoblauch (2009a, appendix A). Figure 8 (right) shows for relevant parameters and q.
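
Under the four assumptions above, the relation between the SNR and the minimal output noise can be evaluated numerically. The sketch below places the low and high potentials at 0 and R with unit standard deviation, uses the optimal threshold of appendix D, and evaluates the output noise as q10 + ((1 − q)/q) q01, assuming that output errors are normalized by the mean content activity nq (cf. equation E.1, whose exact form is not reproduced here); the inverse mapping from output noise to minimal SNR is obtained by simple bisection rather than by the iteration of equations E.9 and E.10.

    import numpy as np
    from scipy.special import erfc
    from scipy.optimize import brentq

    def Gc(x):
        return 0.5 * erfc(x / np.sqrt(2))

    def output_noise(R, q):
        # equal-variance gaussians: mu_lo = 0, mu_hi = R, sigma = 1, and the
        # optimal threshold Theta = R/2 + ln((1-q)/q)/R from appendix D
        theta = R / 2 + np.log((1 - q) / q) / R
        q10 = Gc(R - theta)                 # miss noise probability
        q01 = Gc(theta)                     # add noise probability
        return q10 + (1 - q) / q * q01      # errors per active content unit

    def min_snr(eps, q):
        # minimal SNR required for output noise eps (numerical inversion)
        return brentq(lambda R: output_noise(R, q) - eps, 1e-3, 100.0)

    print(output_noise(8.0, 0.01))          # output noise reached at SNR R = 8
    print(min_snr(0.01, 0.01))              # SNR required for eps = 0.01 at q = 0.01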

Appendix F:  Binary Channels

For a random variable X ∈ {0, 1} with q ≔ pr[X = 1] the information I(X) equals (Shannon & Weaver, 1949)
formula
F.1
Note that I(q) = I(1 − q) and I(q) → 0 for q → 0. A binary memoryless channel is determined by the two error probabilities q01 for add noise and q10 for miss noise. For two binary random variables X and Y, where Y is the result of transmitting X over the binary channel, we can write
formula
F.2
formula
F.3
formula
F.4
For the analysis of the storage capacity of associative networks at noise level ϵ (see section 2.4), we are interested in fulfilling the high-fidelity criterion, equation E.1, with a “noise balance” parameter ξ weighting between add noise and miss noise,
formula
F.5
such that
formula
F.6
Thus, we can compute the component transinformation for several interesting cases:
formula
F.7
For details, see Knoblauch (2009a, appendix E). Three approximations are of particular interest. For q = 0.5 and ξ = 0.5, we have T ≈ 1 − I(ϵ/2). For q → 0, constant ϵ, and dominating miss noise with ξ → 0, we have T ≈ I(q)(1 − ϵ). For q → 0, constant ϵ, and dominating add noise with ξ → 1, we have T ≈ I(q).
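
For reference, the component transinformation of the binary memoryless channel can be computed directly from the standard definition of mutual information, T = I(pr[Y = 1]) − q I(q10) − (1 − q) I(q01); the sketch below also checks the limiting case T ≈ 1 − I(ϵ/2) for q = 0.5 and balanced noise. The exact expressions of equations F.2 to F.7 are not reproduced here; this is the textbook formula, not necessarily the notation used there.

    import numpy as np

    def I(q):
        # binary entropy in bits, with I(0) = I(1) = 0
        q = np.clip(q, 1e-300, 1 - 1e-16)
        return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

    def T(q, q01, q10):
        # mutual information of a binary memoryless channel, X ~ Bernoulli(q):
        # T = H(Y) - H(Y|X)
        py1 = q * (1 - q10) + (1 - q) * q01    # pr[Y = 1]
        return I(py1) - q * I(q10) - (1 - q) * I(q01)

    eps = 0.01
    # q = 0.5 with balanced noise (xi = 0.5) means q01 = q10 = eps/2,
    # so T should approach 1 - I(eps/2)
    print(T(0.5, eps / 2, eps / 2), 1 - I(eps / 2))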

Appendix G:  Analysis of the SNR for Linear Learning Rules

Here we analyze the SNR for the linear learning rule, equation 3.6, in analogy to the analysis in section 2.3. Without loss of generality, we assume that the query pattern resembles the Mth address pattern and, similar to the illustration in Figure 2 (left), contains c correct one-entries and f false one-entries. The synaptic weight writes as the linear sum of learning increments ruv due to the individual memory associations with presynaptic activity u ∈ {0, 1} and postsynaptic activity v ∈ {0, 1},
formula
G.1
formula
G.2
where, without loss of generality, for a high unit (vMj = 1), we assume that vμj = 1 for μ = M0 + 1, …, M; and for a low unit (vMj = 0), we assume that vμj = 1 for μ = 1, …, M1. Then the dendritic potential with F(1) = 1 and F(0) = a is
formula
G.3
Thus, the mean dendritic potentials for high and low units are
formula
G.4
formula
G.5
using and . Similarly, we can compute the variances of dendritic potentials by replacing a by a2 and E by Var and leaving out constant terms,
formula
G.6
formula
G.7
using and . Then the mean potential difference Δμ ≔ μhi − μlo is
formula
G.8
formula
G.9
With this, we can compute the SNR R ≔ Δμ/max(σhi, σlo) (see equation 2.24), optimal firing thresholds (see appendix D), and storage capacity (see section 2.4). It is well known that the optimal linear rule (maximizing R) is the so-called covariance rule r00 = pq, r01 = −p(1 − q), r10 = −(1 − p)q, r11 = (1 − p)(1 − q), and where p ≔ pr[uμi = 1] and q ≔ pr[vμj = 1] (see Dayan & Willshaw, 1991; Palm & Sommer, 1996). Further rules of interest are, for example, the Hebbian rule r11 = 1, r00 = r01 = r10 = a = 0; the homosynaptic rule r11 = 1 − q, r10 = −q, r00 = r01 = a = 0; and the heterosynaptic rule r11 = 1 − p, r01 = −p, r00 = r10 = a = 0.
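
As a concrete illustration of a linear associative network, the following sketch stores M random pattern pairs with the covariance rule, whose increments can be written compactly as ruv = (u − p)(v − q) and reproduce the four values given above, and then retrieves one memory from a half-deleted query using l-winners-take-all readout. The network sizes, the number of stored memories, and the readout choice are illustrative assumptions, not parameters taken from the experiments in this paper.

    import numpy as np

    rng = np.random.default_rng(2)
    m, n, p, q, M = 1000, 1000, 0.02, 0.02, 50

    U = (rng.random((M, m)) < p).astype(float)   # address patterns u^mu
    V = (rng.random((M, n)) < q).astype(float)   # content patterns v^mu

    # covariance rule: w_ij = sum_mu (u_i^mu - p) * (v_j^mu - q)
    W = (U - p).T @ (V - q)

    # query: address pattern 0 with half of its one-entries deleted (miss noise)
    u = U[0].copy()
    ones = np.flatnonzero(u)
    u[rng.choice(ones, size=len(ones) // 2, replace=False)] = 0

    x = u @ W                                    # dendritic potentials
    l = int(round(n * q))                        # expected content activity
    v_out = np.zeros(n)
    v_out[np.argsort(x)[-l:]] = 1                # l-winners-take-all readout

    print("wrong output components:", int(np.sum(v_out != V[0])))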

Appendix H:  Generalized BCPNN-Type Learning Rules

H.1.  Generalizing the BCPNN Rule for Query Noise.

Section 3.3 discusses the original BCPNN rule of Lansner and Ekeberg (1989). The original BCPNN rule, equation 3.14, does not consider query noise. We can generalize the BCPNN rule to include query noise in the same way as for the optimal Bayesian rule in section 2.2. Defining (for any j), it is
formula
H.1
formula
H.2
formula
H.3
where denotes the number of one-entries in the query vector. Thus, taking logarithms yields synaptic weights wij and firing thresholds Θj,
formula
H.4
formula
H.5
where we have again skipped indices i, j for brevity. Transition probabilities can again be estimated as in equations 2.19 and 2.20.

H.2.  The BCPNN2 Rule: Including Inactive Query Components.

As discussed in section 3.3, we can improve the BCPNN rule by also considering the zero-entries in a query pattern, that is, by computing
formula
H.6
formula
H.7
and thus
formula
H.8
formula
H.9

H.3.  The BCPNN3 Rule: Eliminating .

As discussed in section 3.3, we can improve the BCPNN rule by computing the odds ratio:
formula
H.10
and thus
formula
H.11
formula
H.12

H.4.  The SNR of the BCPNN3 Rule.

One can show that linearizing the BCPNN-type rules also yields the covariance rule, as shown in section 3.2 for the optimal Bayesian rule (Knoblauch, 2010a). From this, one may be tempted to believe that the BCPNN model would also be optimal in the limit Mpq → ∞. However, asymptotically identical first-order terms of single synaptic weights are not a sufficient condition for identical network performance, since Mpq → ∞ implies a diverging synapse number. In fact, the following analysis shows that the BCPNN3 rule has a lower SNR than the optimal Bayesian rule, which also excludes the optimality of the BCPNN model. We can easily adapt the SNR analysis of section 2.3 to the BCPNN3 rule simply by skipping all terms relating to inactive query components . Equivalent to equations H.11 and H.12, the biological formulation of the BCPNN3 model writes as
formula
H.13
In analogy to equation B.1, the potential xj of content neuron j writes as
formula
H.14
In analogy to equations B.4 and B.5, the first-order approximations of mean low and high potentials are
formula
H.15
formula
H.16
In analogy to equation B.6 the mean difference Δμ ≔ μhi − μlo between the high and low distributions is
formula
H.17
In analogy to equation B.8, the variances of dendritic potentials are
formula
H.18
Thus, asymptotically, for and and assuming large networks and consistent error estimation such that k = pm, , , we obtain in analogy to equations 2.25 and 2.26,
formula
H.19
formula
H.20
Therefore, similar to equation 2.28, for large M1 ≈ Mq and including the network connectivity P, the SNR R = Δμ/σ can be obtained from
formula
H.21
Thus, asymptotically for Mpq → ∞, the squared SNR of the BCPNN3 rule is a factor worse than that of the optimal Bayesian model.

Acknowledgments

I am grateful to Julian Eggert, Marc-Oliver Gewaltig, Helmut Glünder, Edgar Körner, Ursula Körner, Anders Lansner, Günther Palm, Friedrich Sommer, and the two anonymous reviewers for helpful discussions and comments.

Notes

1

Evaluating equation 2.14 during retrieval requires about 5m multiplications and 2m additions even for sparse query activity with . By contrast, evaluating equation 2.16 requires only multiplications and m additions, as the “bias” (first and second summands) of xj is independent of and therefore can be computed in advance.

2

For this, the offset w0 should not depend on i.

3

Note that p0 → 0.5 corresponds to . The same argument for independently generated address pattern components with binomially distributed kμ ∼ Bm,p would even suggest optimality until p0 → 1/e ≈ 0.37 and where the Willshaw model achieves the maximal capacity (see Knoblauch et al., 2010, eq. D.12).

4

Without loss of generality, p ≔ pr[uμi = 1] ⩽ 0.5 (otherwise, invert the address pattern components).

5

For example, if address "feature" ui = 1 is positively correlated with content vj = 1, then it typically occurs that p10|1(ij) < p10|0(ij) and p01|1(ij) > p01|0(ij), such that the optimal coincidence increment, r11(ij), is smaller than expected from the covariance rule, η1100 < 1, whereas the offset is positive, w0(ij) > 0. The deviation from the covariance rule can be significant; for example, p = q = 0.1, , (corresponding to p10 = 0.25, p01 = 0.025), p10|1 = 0.1p10, p01|1 = 10p01 yields η1100 ≈ 0.3 and w0 ≈ 1.8.

6

For example, σhi = 0 for pattern part retrieval in the Willshaw model (see section 3.1).

7

Although l-WTA retrieval is simple to implement, it is much more difficult to analyze.


References

Abramowitz, M., & Stegun, I. (1972). Handbook of mathematical functions with formulas, graphs, and mathematical tables. New York: Dover.
Albus, J. (1971). A theory of cerebellar function. Mathematical Biosciences, 10, 25–61.
Amari, S.-I. (1977). Neural theory of association and concept-formation. Biological Cybernetics, 26, 175–185.
Amari, S.-I. (1989). Characteristics of sparsely encoded associative memory. Neural Networks, 2, 451–457.
Anderson, J. (1968). A memory storage model utilizing spatial correlation functions. Kybernetik, 5, 113–119.
Anderson, J., Silverstein, J., Ritz, S., & Jones, R. (1977). Distinctive features, categorical perception, and probability learning: Some applications of a neural model. Psychological Review, 84, 413–451.
Bentz, H., Hagstroem, M., & Palm, G. (1989). Information storage and effective data retrieval in sparse matrices. Neural Networks, 2, 289–293.
Bogacz, R., Brown, M., & Giraud-Carrier, C. (2001). Model of familiarity discrimination in the perirhinal cortex. Journal of Computational Neuroscience, 10, 5–23.
Borwein, J., & Bailey, D. (2003). Mathematics by experiment: Plausible reasoning in the 21st century. Wellesley, MA: AK Peters.
Braitenberg, V. (1978). Cell assemblies in the cerebral cortex. In R. Heim & G. Palm (Eds.), Lecture notes in biomathematics (21). Theoretical approaches to complex systems (pp. 171–188). Berlin: Springer-Verlag.
Buckingham, J., & Willshaw, D. (1992). Performance characteristics of the associative net. Network: Computation in Neural Systems, 3, 407–414.
Buckingham, J., & Willshaw, D. (1993). On setting unit thresholds in an incompletely connected associative net. Network: Computation in Neural Systems, 4, 441–459.
Chechik, G., Meilijson, I., & Ruppin, E. (2001). Effective neuronal learning with ineffective Hebbian learning rules. Neural Computation, 13, 817–840.
Cover, T., & Thomas, J. (1991). Elements of information theory. New York: Wiley.
Dayan, P., & Sejnowski, T. (1993). The variance of covariance rules for associative matrix memories and reinforcement learning. Neural Computation, 5, 205–209.
Dayan, P., & Willshaw, D. (1991). Optimising synaptic learning rules in linear associative memory. Biological Cybernetics, 65, 253–265.
Domingos, P., & Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29, 103–130.
Fransen, E., & Lansner, A. (1998). A model of cortical associative memory based on a horizontal network of connected columns. Network: Computation in Neural Systems, 9, 235–264.
Fusi, S., Drew, P., & Abbott, L. (2005). Cascade models of synaptically stored memories. Neuron, 45, 599–611.
Gardner, E. (1988). The space of interactions in neural network models. J. Phys. A: Math. Gen., 21, 257–270.
Gardner, E., & Derrida, B. (1988). Optimal storage properties of neural network models. J. Phys. A: Math. Gen., 21, 271–284.
Gardner-Medwin, A. (1976). The recall of events through the learning of associations between their parts. Proceedings of the Royal Society of London Series B, 194, 375–402.
Golomb, D., Rubin, N., & Sompolinsky, H. (1990). Willshaw model: Associative memory with sparse coding and low firing rates. Phys. Rev. A, 41, 1843–1854.
Graham, B., & Willshaw, D. (1995). Improving recall from an associative memory. Biological Cybernetics, 72, 337–346.
Greene, D., Parnas, M., & Yao, F. (1994). Multi-index hashing for information retrieval. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (pp. 722–731). Piscataway, NJ: IEEE Press.
Hebb, D. (1949). The organization of behavior: A neuropsychological theory. New York: Wiley.
Henkel, R., & Opper, M. (1990). Distribution of internal fields and dynamics of neural networks. Europhysics Letters, 11(5), 403–408.
Hertz, J., Krogh, A., & Palmer, R. (1991). Introduction to the theory of neural computation. Redwood City, CA: Addison-Wesley.
Hopfield, J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, USA, 79, 2554–2558.
Huyck, C., & Orengo, V. (2005). Information retrieval and categorization using a cell assembly network. Neural Computing and Applications, 14(4), 282–289.
Johansson, C., & Lansner, A. (2007). Imposing biological constraints onto an abstract neocortical attractor network model. Neural Computation, 19(7), 1871–1896.
Johansson, C., Sandberg, A., & Lansner, A. (2002). A neural network with hypercolumns. In J. Dorronsoro (Ed.), Proceedings of the International Conference on Artificial Neural Networks (ICANN) (pp. 192–197). Berlin: Springer-Verlag.
Kanerva, P. (1988). Sparse distributed memory. Cambridge, MA: MIT Press.
Knoblauch, A. (2003). Optimal matrix compression yields storage capacity 1 for binary Willshaw associative memory. In O. Kaynak, E. Alpaydin, E. Oja, & L. Xu (Eds.), Artificial Neural Networks and Neural Information Processing—ICANN/ICONIP 2003 (pp. 325–332). Berlin: Springer-Verlag.
Knoblauch, A. (2005). Neural associative memory for brain modeling and information retrieval. Information Processing Letters, 95, 537–544.