Abstract

Neural associative networks are a promising computational paradigm for both modeling neural circuits of the brain and implementing associative memory and Hebbian cell assemblies in parallel VLSI or nanoscale hardware. Previous work has extensively investigated synaptic learning in linear models of the Hopfield type and simple nonlinear models of the Steinbuch/Willshaw type. Optimized Hopfield networks of size n can store a large number of about memories of size k (or associations between them) but require real-valued synapses, which are expensive to implement and can store at most bits per synapse. Willshaw networks can store a much smaller number of about memories but get along with much cheaper binary synapses. Here I present a learning model employing synapses with discrete synaptic weights. For optimal discretization parameters, this model can store, up to a factor close to one, the same number of memories as for optimized Hopfield-type learning—for example, for binary synapses, for 2 bit (four-state) synapses, for 3 bit (8-state) synapses, and for 4 bit (16-state) synapses. The model also provides the theoretical framework to determine optimal discretization parameters for computer implementations or brainlike parallel hardware including structural plasticity. In particular, as recently shown for the Willshaw network, it is possible to store bit per computer bit and up to bits per nonsilent synapse, whereas the absolute number of stored memories can be much larger than for the Willshaw model.

1  Introduction

Current von Neumann computers are characterized by a segregation between processing and memory, which leads to the well-known von Neumann bottleneck, that is, a limited data transfer rate between CPU and memory, but also the intellectual limitation of “word-at-a-time thinking” (Burks, Goldstine, & von Neumann, 1946; Backus, 1978). In the past, this bottleneck has been compensated for by constructing ever faster and larger processors with higher clock rates, larger caches, and denser integration of electronic elements, hoping for an indefinite continuation of Moore’s law. As such architectures become increasingly expensive in terms of energy, space, and cooling requirements, alternative computational paradigms become more and more attractive.

Associative neurocomputers are one such alternative paradigm in which, unlike the classical von Neumann machine, computation and data storage are not separated (Steinbuch, 1961; Willshaw, Buneman, & Longuet-Higgins, 1969; Palm, 1982; Palm & Palm, 1991; Hammerstrom, 1990; Heittmann & Rückert, 2002; Chicca et al., 2003; Hammerstrom, Gao, Zhu, & Butts, 2006; Laiho et al., 2015; Poikonen, Lehtonen, Laiho, & Knuutila, 2015). For example, they can easily implement associative memory storing a large set of M memories. In the general heteroassociative case, memories are associations between typically high-dimensional pattern vectors and (where ). Similar to random access memory, a query pattern entered in the associative memory can serve as an address for accessing the associated content pattern . However, unlike random access memories, associative memories accept arbitrary query patterns , and the computation of any particular output involves all stored data records rather than a single one. Specifically, the associative memory task consists of comparing a query with all stored addresses and returning an output pattern equal or similar to the pattern associated with the address most similar to the query. Thus, the associative memory task includes the random access task but is not restricted to it. It also includes computations such as pattern completion, denoising, or data retrieval using incomplete cues. Moreover, neural implementations of associative memory are closely related to Hebbian cell assemblies and play an important role in neuroscience as models of neural computation for various brain structures, for example neocortex, hippocampus, cerebellum, mushroom body (Hebb, 1949; Braitenberg, 1978; Palm, 1982; Hopfield, 1982; Fransen & Lansner, 1998; Pulvermüller, 2003; Johansson & Lansner, 2007; Lansner, 2009; Marr, 1969, 1971; Gardner-Medwin, 1976; Rolls, 1996; Bogacz, Brown, & Giraud-Carrier, 2001; Albus, 1971; Kanerva, 1988; Laurent, 2002; Honegger, Campbell, & Turner, 2011).

In its simplest form, such a neurocomputer can be realized as a neural associative network, that is, a single layer of n linear threshold elements or perceptrons, typically employing fast, easy-to-implement synaptic learning that depends on local information only. For heteroassociation as described above, each of the n content neurons vj receives m synaptic inputs Wij from the address neurons ui (see Figure 1, left panel). For the special case of autoassociation, the two neuron layers are identical, and , such that the weight matrix describes a recurrent network and the terms memory, pattern, and cell assembly may be used as synonyms. More complex cognitive architectures can be realized by connecting multiple modules of auto- and heteroassociative networks (e.g., Knoblauch, Markert, & Palm, 2005).

Figure 1:

(Left) Structure of an associative network: m address neurons ui are linked to n content neurons vj via synaptic weights Wij obtained from learning M associations between address patterns and content patterns (where ). For retrieval analysis of the th association, we assume, without loss of generality, that the address pattern contains k one-entries and m − k zero-entries as illustrated. Similarly, we assume that the noisy query used for retrieval contains c correct one-entries and f false one-entries, and the associated content pattern contains l one-entries (the high units) and n − l zero-entries (the low units). Then the weight matrix can be divided into four relevant regions, each corresponding to a different conditional distribution of synaptic potentials (see, e.g., equations 3.10–3.15) and matrix loads , , , (see equations 2.4 and A.1). (Right) Some linear learning rules (upper right table) and specification of general linear learning rules r(a,b) by the four learning increments r00, r01, r10, and r11 to compute synaptic potentials from the synaptic counters Mab (lower right panels; see equations 1.1 and 1.2). Efficient learning rules like the covariance rule depend on the mean presynaptic and postsynaptic activity levels p and q.


For local learning rules, the synaptic weight Wij depends only on information that is locally available at the synapse, in particular the activities of the presynaptic neuron ui and the postsynaptic neuron vj. This excludes, for example, gradient descent methods (e.g., error backpropagation) that are based on global error signals obtained from repeated training of the whole pattern set. Instead, associative memories use simple Hebbian-type learning rules where synaptic weights increase if both the presynaptic and the postsynaptic neuron are active during presentation of a pattern pair.

The performance of neural associative memory models can be evaluated by storage capacity, which can be defined, for example, by the number of memories M a network of given size can store or by the Shannon information C that a synapse can store. More recent work also considers structural compression of synaptic networks and the energy or time requirements per retrieval (Poirazi & Mel, 2001; Stepanyants, Hof, & Chklovskii, 2002; Lennie, 2003; Knoblauch, 2003a, 2005, 2009b; Knoblauch, Palm, & Sommer, 2010).

Much previous work has focused on optimizing synaptic learning in order to maximize storage capacity of associative networks. Most learning rules can be expressed in terms of second-order synaptic counter variables (Knoblauch, 2010a, 2011),
$M_{ab}^{ij} := \#\{\, \mu \in \{1, \ldots, M\} : u_i^\mu = a,\ v_j^\mu = b \,\}, \qquad a, b \in \{0, 1\},$   (1.1)
that count how often synapse ij encountered presynaptic activity a combined with postsynaptic activity b. One well-investigated model class comprises associative networks with linear learning (Hopfield, 1982; Palm, 1988; Tsodyks & Feigel’man, 1988; Dayan & Willshaw, 1991; Dayan & Sejnowski, 1993; Palm & Sommer, 1996; Chechik, Meilijson, & Ruppin, 2001; Sterratt & Willshaw, 2008), where the resulting synaptic weight,
$W_{ij} = \sum_{a=0}^{1} \sum_{b=0}^{1} r_{ab}\, M_{ab}^{ij} = r_{00} M_{00}^{ij} + r_{01} M_{01}^{ij} + r_{10} M_{10}^{ij} + r_{11} M_{11}^{ij},$   (1.2)
is a linear function of the synaptic counters. If the pattern vectors are binary, the general linear learning rule can be described by the four parameters rab with a, b ∈ {0, 1} (see Figure 1, right panels). Optimizing the parameters rab to maximize storage capacity yields the covariance rule (see Figure 1, right upper table), which can store up to bits per synapse (bps) if address and content patterns have only small fractions p and q of active units, respectively. The corresponding maximal number of memories is equal to the Gardner bound, a general upper-capacity limit for arbitrary learning methods (Gardner, 1987, 1988; Gardner & Derrida, 1988).
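To make the counter-based formulation concrete, here is a minimal sketch in Python/NumPy (the array names U, V and the function names are illustrative and not from the original). It computes the four counters Mab for all synapses of a heteroassociative network, evaluates a general linear rule, and instantiates the covariance rule under the standard assumption rab = (a − p)(b − q) with mean activities p and q:

    import numpy as np

    def synaptic_counters(U, V):
        """Counters M_ab for every synapse ij: how often presynaptic activity a coincided with
        postsynaptic activity b over the M stored pairs (U: M x m addresses, V: M x n contents)."""
        M11 = U.T @ V
        M10 = U.T @ (1 - V)
        M01 = (1 - U).T @ V
        M00 = (1 - U).T @ (1 - V)
        return M00, M01, M10, M11

    def linear_weights(U, V, r00, r01, r10, r11):
        """General linear learning rule (equation 1.2): weighted sum of the four counters."""
        M00, M01, M10, M11 = synaptic_counters(U, V)
        return r00 * M00 + r01 * M01 + r10 * M10 + r11 * M11

    def covariance_weights(U, V):
        """Covariance rule, assuming r_ab = (a - p)(b - q) with mean activities p and q."""
        p, q = U.mean(), V.mean()
        return linear_weights(U, V, p * q, -p * (1 - q), -(1 - p) * q, (1 - p) * (1 - q))

In the discrete model developed below, the same computation provides the real-valued synaptic potentials that are subsequently thresholded.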
Previous work on nonlinear learning is scarce and focused mostly on a very simple model employing binary synapses, the so-called Steinbuch or Willshaw model with clipped Hebbian learning (Steinbuch, 1961; Willshaw et al., 1969; Palm, 1980; Golomb, Rubin, & Sompolinsky, 1990; Palm, 1991; Nadal, 1991; Graham & Willshaw, 1997; Sommer & Palm, 1999; Knoblauch, 2005; Knoblauch, Palm, & Sommer, 2010),
$W_{ij} = \min\!\left(1, M_{11}^{ij}\right) = \begin{cases} 1, & M_{11}^{ij} \ge 1 \\ 0, & \text{otherwise.} \end{cases}$   (1.3)
This model can achieve quite a high network storage capacity of up to  bps; however, this holds only for very sparse patterns with , where the number of active units grows only logarithmically with the neuron number n. For more reasonable , the information a synapse can store vanishes, , and the absolute number of memories, , is much smaller than for optimal linear learning. Nevertheless, due to the binary synapses, Willshaw networks have very efficient implementations on both digital computers and brainlike parallel hardware including structural plasticity or synaptic pruning, where performance can more reasonably be evaluated in terms of information capacity CI and synaptic capacity CS (Knoblauch, 2003a; Knoblauch et al., 2010; see section 4). For example, compressed implementations of Willshaw networks can store up to bit per computer bit for almost any nonlogarithmic sparse activity with and (Knoblauch, 2008). Moreover, networks employing structural plasticity (e.g., by pruning of irrelevant silent synapses) can store up to bits per synapse and provide functional interpretations for structural plasticity and hippocampal memory replay in the brain (Knoblauch, Körner, Körner, & Sommer, 2014; Butz, Wörgötter, & van Ooyen, 2009; Holtmaat & Svoboda, 2009; Ji & Wilson, 2007; Sirota, Csicsvari, Buhl, & Buzsaki, 2003).
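The clipped Hebbian rule of equation 1.3 needs only the coincidence counter; a minimal sketch in the same illustrative conventions as above:

    import numpy as np

    def willshaw_weights(U, V):
        """Steinbuch/Willshaw learning (equation 1.3): a synapse becomes 1 as soon as its pre-
        and postsynaptic neurons were both active in at least one stored pattern pair."""
        M11 = U.T @ V
        return (M11 >= 1).astype(np.uint8)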
Other nonlinear learning models include the pseudo-inverse rule (Kohonen & Ruohonen, 1972; Diederich & Opper, 1987) and Bayesian heuristics to implement maximum-likelihood retrieval of memory contents (Lansner & Ekeberg, 1989; Kononenko, 1989, 1994; MacKay, 1991; Lansner & Holst, 1996). In particular, a recent analysis (Knoblauch, 2010a, 2011) revealed that the optimal nonlinear Bayesian learning rule,
formula
1.4
becomes equivalent to the Steinbuch/Willshaw rule in the limit (which corresponds at the capacity limit to ) and equivalent to the linear covariance rule in the limit (which corresponds to ). Although finite networks perform significantly better than previous models (if evaluated by M and C), optimal Bayesian learning cannot exceed the asymptotic limit bps either. As both linear and Bayesian learning require real-valued synaptic weights, they have no efficient implementations on digital computers (e.g., for high computing precision), and it is unclear how to include structural plasticity or synaptic pruning in an efficient way (thus only bps).

Here I develop a novel nonlinear learning procedure for associative networks with discrete synaptic weights that combines the advantages of the previous models: large storage capacity (M and C) and efficient implementations on digital computers and brain-like parallel hardware (CI and CS). Basically, the new model corresponds to an optimal discretization of the synaptic weights obtained from the linear or Bayesian learning rule in the limit . For this, synaptic weights are computed in two processing steps. First, synaptic potentials are computed as in equation 1.2 or 1.4 from the synaptic counter variables Mab and thus correspond to the real-valued synaptic weights of linear or Bayesian learning. Second, the discrete synaptic weights Wij are obtained by applying one or several synaptic thresholds to the synaptic potentials (see Figure 2).

Figure 2:

Discrete synaptic strengths are obtained from nonlinear threshold operations on the synaptic potentials (indices are skipped for brevity, and discretization noise is assumed to be zero; see equation 2.1). Here a fraction of the synapses obtain strength corresponding to synaptic potentials in the range for (assuming ). Synaptic thresholds can be computed from using equation 2.5. Optimal discretization parameters , can be obtained from maximizing the zip factors , , or (e.g., see equations A.25, 3.36, 4.14, 4.20). The latter two variables correspond to compressed network implementations where the synapses with the dominant strength are pruned (here ; typically ). Maximizing , requires large .


This procedure is similar to the general learning rule already proposed by Steinbuch (1961, Figure 2c), where the synaptic potential corresponds to Steinbuch’s “Indiz” (indication). In contrast to Steinbuch’s sigmoid transfer functions, I consider optimal step functions to obtain discrete synaptic weights that maximize the signal-to-noise ratio and storage capacity. The resulting learning rules generalize the original Steinbuch/Willshaw rule (see equation 1.3) to less sparse memory patterns and multistate synapses. The analysis reveals that the resulting storage capacities M and C are almost the same as for the optimal linear and Bayesian learning rules: more precisely, it turns out that the “zip factor” , defined as the relative capacity of discrete synapses compared to continuous synapses, is close to one already for a small number of discrete states, for example, for binary synapses, for 2-bit (four-state) synapses, for 3-bit (eight-state) synapses, and for 4-bit (16-state) synapses. Moreover, the analysis also provides optimal discretization parameters for implementations on digital computers or brainlike parallel hardware including structural plasticity. In particular, for low-entropy settings where most synapses share a single discrete weight (e.g., zero), the network becomes “compressible” such that computer implementations can store up to bit per computer bit, and parallel hardware employing structural plasticity can store up to bits per synapse, similar to what has been shown for the Willshaw network (Knoblauch et al., 2010, 2014). Because of this property, and for the sake of brevity, I will sometimes refer to the model as the zip net model. As a by-product, it is possible to derive precision requirements for synaptic weights that may also include biological constraints such as Dale’s law, sparse synaptic connectivity, and structural plasticity.

The letter is organized as follows. Section 2 describes the learning and retrieval procedure of the model. Section 3 analyzes the signal-to-noise ratio for linear and Bayesian learning of the synaptic potentials and derives a general expression for the zip factor . Section 4 computes various storage capacities (M, C, CI, CS). Section 5 maximizes for various numbers of synaptic states N and implementation constraints and derives the corresponding optimal discretization parameters. Section 6 compares retrieval efficiency in terms of space, time, and energy requirements to that of previous models. Section 7 presents numerical simulations that verify the theory and provide a comparison to the previous models. Section 8 summarizes and discusses the main results of this work and, in particular, points to implications for memory theories that are based on structural plasticity and possible nanoscale hardware implementations. Finally, the appendixes compute the SNR and for general distributions of synaptic potentials (appendix A), give formulas for gaussian tail integrals (appendix B) and optimal firing thresholds (appendix D), discuss the relation between the SNR and the Hamming-distance-based output noise (appendix E), give basic information-theoretic formulas (appendix F), and give recommendations for an efficient implementation of networks of discrete multistate synapses (appendix G). Taxonomy and notations are as in Knoblauch et al. (2010) and Knoblauch (2011) whenever possible.

2  Network Model

2.1  Learning of Discrete Synaptic Weights

This section considers networks of discrete N-state synapses where synaptic weights Wij can assume one out of N discrete values (). Specifically, Wij is obtained from applying multiple synaptic thresholds to synaptic potentials (see Figure 2),
formula
2.1
The synaptic potentials depend on the set of memory patterns and , as well as on the employed learning method. For example, may equal the real-valued synaptic weights of the linear or Bayesian learning methods (see equation 1.2 or 1.4). The model also includes for each synapse independent additive noise variables with zero means, standard deviations , and density functions satisfying
formula
2.2
to account for random hardware variability and other noise effects (see sections 4.5 and A.1).
Let us choose synaptic thresholds such that, on average, a fraction of the synapses have weight (as illustrated by Figure 2), where we call the matrix load for the tth weight value.1 Then let and be the mean and standard deviation and
formula
2.3
the standardized (complementary) distribution function of the synaptic potentials. Finally, we choose synaptic thresholds to obtain the desired matrix loads
formula
2.4
for , that is,
formula
2.5
where is the inverse function of . Thus, the matrix loads are essentially equivalent to the synaptic thresholds and, together with the synaptic strengths , fully specify the discretization procedure of equation 2.1. Actually, the following performance analyses become most concise and general if using matrix loads instead of synaptic thresholds.2
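As an illustration of equations 2.1 to 2.5, the following sketch (illustrative names, not from the original) discretizes given real-valued synaptic potentials into N strengths by placing the synaptic thresholds at empirical quantiles of the (optionally noise-corrupted) potentials, so that each weight value is taken, on average, by the desired matrix load of synapses. Using the empirical quantiles of the whole matrix corresponds to fixed global thresholds; the per-neuron homeostatic variant is sketched in section 2.3.

    import numpy as np

    def discretize(potentials, strengths, loads, noise_std=0.0, rng=None):
        """Map synaptic potentials to discrete strengths. 'strengths' and 'loads' must be given
        in the same order, from the level assigned to the smallest potentials to the level
        assigned to the largest (cf. equations 2.1-2.5)."""
        rng = np.random.default_rng() if rng is None else rng
        a = potentials + (noise_std * rng.standard_normal(potentials.shape) if noise_std else 0.0)
        cum = np.cumsum(loads)[:-1]              # cumulative matrix loads define the thresholds
        thetas = np.quantile(a, cum)             # N-1 synaptic thresholds
        levels = np.digitize(a, thetas)          # level index t in 0..N-1 for every synapse
        return np.asarray(strengths)[levels]

    # Example: ternary discretization with strengths (-1, 0, 1) and symmetric matrix loads
    # W = discretize(A, strengths=[-1.0, 0.0, 1.0], loads=[0.27, 0.46, 0.27])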

2.2  Retrieval

For memory retrieval, we address the zip net with a query ũ, usually a noisy version of one of the previously stored address patterns, as illustrated, for example, in Figure 1 (left). Then the dendritic potential xj of content neuron j is
$x_j = \sum_{i=1}^{m} \tilde u_i\, W_{ij},$   (2.6)
For binary memory patterns (as presumed by Figure 1), one-step retrieval yields the retrieval output simply by applying a vector of firing thresholds to the dendritic potentials , where (see appendix  D)
$\hat v_j = \begin{cases} 1, & x_j \ge \Theta_j \\ 0, & \text{otherwise,} \end{cases}$   (2.7)
We can then evaluate the retrieval quality, for example, by computing the Hamming distance,
$d_{\mathrm{H}}(\hat v, v^\mu) = \sum_{j=1}^{n} \left| \hat v_j - v_j^\mu \right|,$   (2.8)
between retrieval output and the original content pattern . To be independent of network size, a more convenient measure is the output noise,
$\hat\epsilon := \frac{E\!\left(d_{\mathrm{H}}(\hat v, v^\mu)\right)}{l},$   (2.9)
which normalizes the expected Hamming distance to the mean number of active units in a content pattern. Similarly, one can define query noise , where is the mean activity in an address pattern. In recurrent autoassociative networks, iterative retrieval can be realized by feeding back the retrieval output to the input layer repeatedly (Hopfield, 1982; Schwenker, Sommer, & Palm, 1996).
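A minimal sketch of one-step retrieval and of the output-noise measure, assuming binary patterns and a given vector of per-neuron firing thresholds (illustrative names; the optimal thresholds themselves are the subject of equation 2.11 and appendix D):

    import numpy as np

    def retrieve(W, query, thresholds):
        """One-step retrieval (equations 2.6 and 2.7): dendritic potentials followed by
        per-neuron firing thresholds."""
        x = query @ W                            # dendritic potentials x_j
        return (x >= thresholds).astype(np.uint8)

    def output_noise(v_hat, v_true, l):
        """Hamming distance between retrieved and stored content pattern, normalized by the
        mean number l of active units per content pattern (equations 2.8 and 2.9)."""
        return np.count_nonzero(v_hat != v_true) / l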
Similarly to Palm (1988), Dayan and Willshaw (1991), Palm and Sommer (1996), Sterratt and Willshaw (2008), and Knoblauch (2011) we can alternatively evaluate retrieval quality by the signal-to-noise ratio (SNR),
formula
2.10
Here and are expectation and variance of the dendritic potential for a low neuron with . Similarly, and are expectation and variance of the dendritic potential for a high neuron with . Positive SNR means that high units (with ) have larger mean dendritic potentials than low units (with ), that is, . This is so because distributions of synaptic potentials depend on whether address neuron i transmits one of the c correct or one of the f false activations and whether content neuron j is a high or low unit. Correspondingly, we can divide the connection matrix into four relevant regions according to four cases with conditional matrix loads (see Figure 1, left).
For gaussian dendritic potentials and , it is possible to show that the SNR R and output noise are equivalent measures (see appendix  E). Correspondingly, the optimal firing thresholds minimizing expected dH write essentially as a function of the SNR (see appendix  D and equation D.7),
formula
2.11
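In numerical experiments, the SNR can be estimated directly from the dendritic potentials of a retrieval. The sketch below uses a pooled-standard-deviation normalization and should be read as a sample analogue of equation 2.10; the exact normalization of the original definition may differ.

    import numpy as np

    def empirical_snr(x, v_true):
        """Separation of the dendritic potentials of the high units (v=1) from those of the
        low units (v=0), normalized by their pooled standard deviation."""
        hi, lo = x[v_true == 1], x[v_true == 0]
        return (hi.mean() - lo.mean()) / np.sqrt(0.5 * (hi.var() + lo.var()))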

2.3  Control of Synaptic Thresholds

There are several strategies to control the synaptic thresholds of a content neuron j. Here I consider the following two:

  1. Fixed synaptic thresholds. Synaptic thresholds may be precomputed and fixed based on an estimate of the distribution of synaptic potentials using equation 2.5 (as assumed in the SNR analysis of section 3.1).

  2. Homeostatic control. Synaptic thresholds may be adapted during online learning to realize a desired distribution of synaptic strengths .

While the former variant is easy to analyze (as done in section 3), the latter one turns out to minimize output noise (see equation 2.9) by realizing identical “false alarm” error probabilities among low units vj (see the numerical experiments in section 7). In computer implementations, homeostatic control can easily be realized by choosing synaptic thresholds such that at any time, a content neuron j has exactly synapses with strength . For example, we may sort the synaptic potentials for each neuron j such that where i is a corresponding permutation on . Then we may choose for .
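In a computer implementation, this homeostatic variant can be sketched as follows (illustrative names): the potentials of each content neuron are sorted, and the discrete strengths are assigned by rank, so that every column of the weight matrix receives exactly the desired number of synapses per weight value. For the nonrandom-pattern case discussed next, the same ranking can be applied to the normalized potentials of equations 2.12 to 2.14 instead of the raw potentials.

    import numpy as np

    def homeostatic_discretize(A, strengths, loads):
        """Assign discrete strengths by rank of the synaptic potentials A[:, j] of each content
        neuron j (section 2.3, variant 2). 'strengths' and 'loads' are ordered from the level
        assigned to the smallest potentials to the level assigned to the largest."""
        m, n = A.shape
        counts = np.round(np.asarray(loads) * m).astype(int)
        counts[-1] = m - counts[:-1].sum()                 # make the level counts sum to m exactly
        level_of_rank = np.repeat(np.arange(len(strengths)), counts)
        order = np.argsort(A, axis=0)                      # ranks of potentials within each column
        W = np.empty_like(A, dtype=float)
        for j in range(n):
            W[order[:, j], j] = np.asarray(strengths)[level_of_rank]
        return W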

For nonrandom memory patterns, there may also be significant heterogeneity in the distributions of synaptic potentials for different presynaptic neurons ui. Then sorting should be based on one of the following normalized synaptic potentials rather than , for example, on
formula
2.12
formula
2.13
formula
2.14
where (e.g., as in equation B.2 for gaussian ), is a random variable drawn from the distribution of as characterized by , , and , the synaptic counter variables , are as in section 1 (see equation 1.1), and the are estimates of query noise. Here the first normalization variant, see equation 2.12, has a probabilistic interpretation where is uniformly distributed between 0 and 1 because . Since is monotonically increasing, sortings based on of equation 2.13 yield equivalent results as sortings based on . Similarly, equation 2.14 is equivalent to sorting the synaptic weights of the Bayesian learning rule, equation 1.4, including query noise estimates (for details see Knoblauch, 2011, equation 2.15). Note that any of the three possible normalization procedures guarantees that each column j of matrix Wij has exactly and each row i approximately synapses with weight .

There is actually evidence that the brain regulates the number of (potentiated) synapses per neuron in a similar way (Fares & Stepanyants, 2009; Knoblauch et al., 2014). For biological models or implementations on parallel hardware, the homeostatic control could be realized by regulating synaptic thresholds in order to achieve a desired mean neuron activity when stimulating with random inputs with defined activity statistics (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Turrigiano & Nelson, 2004; Turrigiano, 2007; Desai, Rutherford, & Turrigiano, 1999; Van Welie, Van Hooft, & Wadman, 2004). This is particularly obvious for binary synapses because, for inputs with a given mean activity, there is a unique relationship between the number of potentiated synapses and mean output activity. It would then be sufficient to increase (or decrease) synaptic thresholds if mean output activity is above (or below) the target activity (Knoblauch, 2009c, 2010b).3 Similar control mechanisms could also be realized for multistate synapses, possibly requiring a regulation based on state-specific expression factors.

3  Signal-to-Noise Ratio for Linear Learning of Synaptic Potentials

It is possible to analyze the discrete associative network model of the previous section for any given learning procedure that allows deriving the distribution of synaptic potentials (see appendix  A for the general analysis). Specifically, for linear learning (Hopfield, 1982; Palm, 1988; Dayan & Willshaw, 1991; Palm & Sommer, 1996) the synaptic potentials,
formula
3.1
write as a sum of M learning increments. As illustrated by Figure 1 (right), learning of synaptic potentials is then determined by the four learning parameters r00, r01, r10, and r11.

The following subsections compute the SNR, equation 2.10, of a given content neuron j for linear learning, making a number of further assumptions:

  1. Components of address patterns are identically and independently distributed where is the probability of an active component.4

  2. Content neuron j has a given unit usage,
    formula
    being the number of content patterns where neuron j is active. Correspondingly, we define for brevity.
  3. The query pattern is a noisy version of having c correct and f false one entries, similar to that illustrated by Figure 1 (left).

  4. Synaptic potentials follow a gaussian distribution, , where Gc is the complementary gaussian distribution function (see appendix B, equation B.5). By the central limit theorem, this assumption holds at least for a large number of stored memories (where the counters also diverge for constant ).

  5. Let us finally assume the limit of large networks, , to allow , whereas activity parameters p, q, learning parameters r00, r01, r10, r11, and discretization parameters , are assumed to remain constant. (For an analysis of finite networks see appendix  A.)

The SNR analysis consists of three parts. Section 3.1 analyzes conditional and unconditional distributions of synaptic potentials for linear learning. Section 3.2 computes conditional matrix loads and derives distributions of dendritic potentials for low and high units in large networks. Finally, section 3.3 optimizes linear learning parameters to maximize the SNR.

3.1  Distribution of Synaptic Potentials

First, we compute the mean value and variance of the synaptic potentials given the content neuron’s unit usage M1 (see equation 3.2). With , it is
formula
3.3
formula
3.4
where and are expectations and variances of given the value of the content pattern . It is
formula
3.5
formula
3.6
formula
3.7
formula
3.8
Assuming gaussian with (see equation B.5), the synaptic thresholds of equations 2.5 and 2.1 write
formula
3.9
where is the inverse complementary gaussian distribution function (see equation B.7).
Retrieval with a query pattern induces (slightly) differing distributions of synaptic potentials depending on the given properties of neurons i and j. To analyze this dependency, let be a noisy version of the th address pattern having c correct and f false one entries (see Figure 1, left). Then for an active query component i with , the corresponding distribution of depends on whether the activity of neuron i is correct () or false (), as well as on whether neuron j is a high unit () or a low unit (). Thus, we have to discern four conditional distributions of synaptic potentials, where we can compute means and variances similar to equations 3.3 and 3.4,
formula
3.10
formula
3.11
formula
3.12
formula
3.13
and , with
formula
3.14
formula
3.15
where the indices c versus f, and lo versus hi refer to the same regions of the weight matrix as illustrated in Figure 1 (left). Thus, the differences between conditional and unconditional mean values are
formula
3.16
formula
3.17
formula
3.18
formula
3.19
where we use parameter differences,
formula
3.20
for brevity. With the linear approximation , the differences between conditional and unconditional standard deviations write
formula
3.21
formula
3.22
where the error bounds assume constant but diverging . Similar to equations 3.14 and 3.15, we can also use and .

3.2  SNR for Large Networks

The SNR for finite networks with and general distribution of synaptic potentials is analyzed in appendix  A. Here we apply these results to compute SNR for large networks with and linear learning of synaptic potentials as described in section 3.1. Thus, we can assume again a gaussian distribution of synaptic potentials with (see equations 3.1, 2.3, and B.5).

We first compute the conditional matrix loads from equations A.19 and A.18 with equations 3.10 to 3.13 and equation 3.4,
formula
3.23
formula
3.24
formula
3.25
formula
3.26
with constants and (see equation A.16)
formula
3.27
for , where is the gaussian density function (see equation B.1).5 For finite networks, the exact conditional matrix loads can also be obtained from equation A.1 with and equations 3.9 and 3.10 to 3.15. Then, the SNR, equation 2.10, can be computed, for example, from (see equations A.3–A.7 in appendix  A)
formula
3.28
formula
3.29
and
formula
3.30
formula
3.31
where, for , it is (see equation A.10)
formula
3.32
Alternatively, we can compute the SNR R directly from equation A.24 in section A.3. With equations 3.16 to 3.19, the contributions of correct and false activations in the query to R can be computed from
formula
3.33
formula
3.34
respectively. Thus, the squared SNR is
formula
3.35
for
formula
3.36
where is the (average) variance of the discretization noise and is the variance of the discrete strength values. We can generalize this result for large diluted networks by assuming that the content neuron j is connected to a fraction P of the m address neurons. Then the analysis so far remains valid if we replace m by (i.e., the number of address neurons that are connected to neuron j) and generalize the definitions of c and f as numbers of correctly and falsely active query neurons, respectively, that are connected to neuron j. Then defining and as the fractions of correct and false positives in a query pattern, and using
formula
3.37
formula
3.38
the squared SNR for large networks with linear learning of synaptic potentials finally writes
formula
3.39
Here the first term is a function of the discretization parameters and , and the upper bound will be shown in section 4.1. The second term, , depends on the learning parameters and (and also on the fraction q of active neurons in a content pattern ) and will be maximized in the following section showing the upper bound . The third term, , describes the influence of query noise parameters and (and the fraction p of active neurons in an address pattern ), where the maximum is obtained for zero query noise with and . The last term equals the SNR for suboptimal linear learning of synaptic weights employing the homosynaptic rule (see Figure 1; see  Dayan & Willshaw, 1991, p. 259, rule R3) which is, for nonsparse address patterns, factor worse than the maximum obtained for the covariance rule (see Figure 1; see Dayan & Willshaw, 1991, p. 259, rule R1, or Palm & Sommer, 1996, p. 95, equation 3.28).

3.3  Optimal Linear Learning of Synaptic Potentials

Maximizing , equation 3.37 yields optimal learning parameters and that maximize the SNR R, equation 3.39. Writing for some number , we have to maximize
formula
3.40
with respect to . The derivative with respect to is
formula
3.41
The SNR is maximal when the derivative is zero corresponding to linear learning rules satisfying
formula
3.42
Inserting optimal yields the maximum,
formula
3.43
proving the upper bound of equation 3.37.

Note that the covariance rule is optimal because satisfies the optimality criterion (see Figure 1). However, unlike for linear weight learning (Dayan & Willshaw, 1991), the covariance rule is not the unique optimum. Instead, for discrete weights, there is a three-dimensional subspace of optimal learning rules, including biologically more realistic ones than the covariance rule. For example, the homosynaptic rule is also optimal because (see Figure 1). Other well-investigated learning rules, such as the heterosynaptic rule, the Hebbian rule, and the Hopfield rule, do not in general satisfy the optimality criterion (see Figure 1; see also Dayan & Willshaw, 1991, and Palm & Sommer, 1996).

4  Analysis of Storage Capacity

A reasonable performance measure for associative networks is storage capacity. There are actually various definitions of storage capacity that depend on the target platform used for implementing the networks. For example, section 4.1 computes pattern capacity , defined as the absolute number of memories a network of a given size can store at output noise level (see equation 2.9). Similarly, section 4.2 computes network capacity , which is the stored information per synapse for networks with a fixed given structure. Section 4.3 computes information capacity , which applies to compressed implementations on digital computers, measuring the stored information per computer bit. Finally, section 4.4 computes the synaptic capacity , defined as the stored information per synapse in structurally plastic networks. All of these capacity measures can be derived from the SNR R (see equations 2.10 and 3.39), making the same presumptions as described at the beginning of section 3. In addition, we have to assume that dendritic potentials have a gaussian distribution (see equation C.7).

Although we focus on one-step retrieval, the following analyses apply as well to iterative retrieval in recurrent autoassociative (Hopfield-type) networks with , . There, we can expect successful retrieval if output noise is less than query noise . Thus, to approximate the various storage capacities, we have to choose an output noise level . In particular, to approximate the maximal number of stable memory attractors, we can compute for zero query noise (,) and .

4.1  Pattern Capacity

The pattern capacity is the maximal number of content patterns that can be stored and retrieved at a given output noise level (see equation 2.9),
formula
4.1
To obtain output noise level , the SNR must be at least (see appendix  E). With R as computed in section 3.2 and appendix  A being a function of the number M of stored memories, solving for M yields the pattern capacity
formula
4.2
where is the inverse function of . For example, for large networks with linear learning of synaptic potentials, equation 3.39 with from equations E.5 and E.6 yields
formula
4.3
formula
4.4
formula
4.5
We can make three important conclusions.

First, . To see this, note that for , the right-most term in equation 4.4 (as well as the upper bound, equation 4.5) corresponds to the Gardner bound, which defines a general upper bound on for any model of synaptic learning (Gardner, 1988, eq. 40). Since both (see equation 3.38) and (see equations 3.37 and 3.43) can reach their upper bounds, it necessarily follows (whereas would violate the Gardner bound for ).

Second, for diverging network size, any finite discretization procedure with decreases pattern capacity by at least factor compared to optimal learning of real-valued synaptic weights. This follows because equation 4.3 is factor worse than the pattern capacity for optimal Bayesian learning (Knoblauch, 2010a, 2011) or linear covariance learning (Palm & Sommer, 1996) of real-valued synaptic weights.

Third, reaching the full pattern capacity of optimal Bayesian or linear covariance learning requires that the computing precision diverges with the network size, , as . This is the reverse of the second conclusion, assuming that is possible for . The latter assumption is true because for any given network, the discretization error (in the dendritic potentials xj) vanishes for diverging even for trivial discretization parameters (e.g., consider and , where and are the maximal and minimal real-valued synaptic weights of linear learning; see also equation 5.7 in section 5.1). In section 5, we will see that already for small N.

4.2  Network Capacity

An alternative capacity measure normalizes the stored Shannon information (of the content patterns ) to the number of synapses employed in a network with given structure. This is the network capacity,
formula
4.6
where T is the transinformation equation F.5 with error probabilities q01, q10 as in equations D.1 and D.2 (and using, for example, optimal firing thresholds as in equation D.7 and as in appendix  E). For example, for large networks with linear learning of synaptic potentials, equation 4.3 yields
formula
4.7
formula
4.8
formula
4.9
where I is the information of a binary random variable, equation F.1, and T simplifies as described at the end of appendix  F. As for , network capacity for large networks with discrete weights is factor worse than for optimal Bayesian learning or optimal linear covariance learning of real-valued synaptic weights, where quickly converges toward 1 even for small N (see section 5).

4.3  Information Capacity for Compressed Computer Implementations

The synaptic weights Wij can be conceived as random variables taking value with probability where . Thus, for implementations on digital computers, the matrix could be compressed, for example, by Huffman or Golomb coding (see appendix  G; Huffman, 1952; Golomb, 1966; Cover & Thomas, 1991). Assuming optimal compression, the computer memory required to represent the weight matrix is then only bits, where I is the Shannon information of an N-ary random variable (see equation F.2).6 This motivates the definition of information capacity (Knoblauch, 2003a; Knoblauch et al., 2010) as the stored Shannon information per computer memory bit,
formula
4.10
For example, for large networks with linear learning of synaptic potentials, equation 4.3 yields
formula
4.11
formula
4.12
formula
4.13
which is similar to except replacing (see equation 3.36) by
formula
4.14
Section 5 shows that optimal discretizations maximizing can achieve the theoretical bound already for binary discretizations ().
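A small sketch of the quantity in the denominator of the information capacity: the Shannon information per matrix entry of the discrete weight distribution, which bounds the size of a Huffman- or Golomb-compressed representation (illustrative function name; the compression itself is not implemented here):

    import numpy as np

    def weight_entropy_bits(W):
        """Shannon information per synapse (in bits) of the discrete weight matrix W; under
        optimal compression, roughly this many bits per matrix entry are needed to store W."""
        _, counts = np.unique(W, return_counts=True)
        p = counts / counts.sum()
        return float(-(p * np.log2(p)).sum())

    # Example: a binary matrix with a fraction 0.01 of nonzero synapses needs only about
    # 0.08 bits per entry, so even modest stored information per synapse can yield a large CI.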

4.4  Synaptic Capacity for Structural Plasticity

Besides Hebbian-type weight modification, synaptic plasticity in the brain also includes network rewiring by elimination and generation of synapses (Butz et al., 2009; Holtmaat & Svoboda, 2009). A reasonable functional interpretation of such structural plasticity is that useful synapses get consolidated, whereas irrelevant synapses get eliminated or pruned (Knoblauch et al., 2010, 2014). Assuming a homeostatic control of the total number of synapses, such a selection procedure will effectively “move” the synapses to the most useful locations for storing a given set of memories (provided ongoing replay of the memories or a more abstract “consolidation signal” that tags synapses to be either consolidated or eliminated). Here we can assume that weight value defines synapses to be eliminated. This means that a fraction of the synapses can be eliminated without modifying the network function.7 In practice, given the discretization parameters and , it is most effective to eliminate the synapses having the most frequent weight, . This means that a fraction
formula
4.15
of the synapses is sufficient to preserve network function, while a fraction can safely be eliminated.8 These arguments motivate the definition of the synaptic capacity as the stored Shannon information per necessary (e.g., nonsilent) synapse,
formula
4.16
where the general upper bound is given by the amount of information that can be stored by structural plasticity in the location of the synapse within the network (by selecting one out of possible locations within the weight matrix; see Knoblauch et al., 2010, 2014), plus the maximal information that can be stored in a given synapse by weight modification (which cannot exceed the Gardner bound of at most two bits per synapse, or actually only 0.72 bits per synapse for sparse activity with ; see Gardner, 1988; Knoblauch, 2011). For example, for large networks with linear learning of synaptic potentials, equation 4.3 yields
formula
4.17
formula
4.18
formula
4.19
which is similar to except replacing (see equation 3.36) by
formula
4.20
where p1 is as in equation 4.15 or as described in note 8. Section 5 shows that optimal discretizations maximizing can come close to the theoretical bound, , already for binary discretizations ().
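The following sketch illustrates the structural-plasticity view in a computer implementation (illustrative names): all synapses carrying the most frequent weight value are treated as silent and dropped from a sparse representation, leaving only the fraction p1 of necessary synapses. If the pruned value is nonzero, the constant contribution it would have made to the dendritic potentials has to be absorbed into the firing thresholds.

    import numpy as np
    from scipy.sparse import csr_matrix

    def prune_dominant_weight(W):
        """Keep only the synapses whose weight differs from the most frequent weight value."""
        values, counts = np.unique(W, return_counts=True)
        dominant = values[np.argmax(counts)]
        kept = csr_matrix(W - dominant)        # dominant entries become exact zeros and are dropped
        p1 = kept.nnz / W.size                 # fraction of necessary (nonsilent) synapses
        return kept, dominant, p1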

4.5  Effect of Discretization Noise

In particular, for analog hardware implementations, learning or read-out of discrete synaptic weights may be disturbed by additive noise as defined in equation 2.1. If the tth discrete weight value is disturbed by noise with zero mean and standard deviation , then the total discretization noise has standard deviation as defined below equation 3.36, and each of the performance measures decreases by a factor
formula
4.21
where is the standard deviation of the discrete synaptic weight values as defined below equation 3.36. This means that the SNR and storage capacities decrease only moderately for discretization noise with , whereas the effects are strong for . For example, for the storage capacities are decreased by factors , , , , , .
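As a worked example, assume (as suggested by the description of the zip factor below equation 3.36, where the discretization-noise variance adds to the variance of the discrete strength values) that the factor in equation 4.21 takes the form

$\dfrac{\sigma_W^2}{\sigma_W^2 + \sigma_\epsilon^2} = \dfrac{1}{1 + (\sigma_\epsilon / \sigma_W)^2},$

with $\sigma_W$ and $\sigma_\epsilon$ denoting the standard deviations of the discrete weight values and of the discretization noise, respectively. Then noise with half the weight standard deviation would reduce the capacities to 80 percent of their noise-free values, equal standard deviations to 50 percent, and twice the weight standard deviation to 20 percent.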

5  Optimal Discretizations

The previous sections computed the SNR and various storage capacities for given discretization parameters , , and . This section optimizes and in order to maximize the different storage capacities , , , for linear learning of synaptic potentials. First, as a reference for comparison, section 5.1 computes capacities for naive discretization methods. Then section 5.2 maximizes in order to maximize pattern capacity and network capacity , which is most relevant for networks with a fixed given connectivity structure. Section 5.3 maximizes to maximize information capacity CI relevant for compressed implementations on digital computers. Finally, section 5.4 maximizes to maximize synaptic capacity CS relevant for brainlike hardware including structural plasticity. First, note that the SNR, and thus the storage capacities, are invariant to scaling and shifting of synaptic weights.9 Thus, we can assume, without loss of generality, that discrete synaptic strengths are within a given interval, for example, or (where the latter satisfies Dale’s law that all synapses of a given neuron have the same polarity—either excitatory or inhibitory).10

5.1  Naive Discretizations

The first naive approach (referred to in the following as ) to discretize continuous-valued synaptic weights is the following: First, each synaptic strength should occur with the same frequency and, second, strengths should be equally distributed over the target interval (e.g., between 0 and 1),
formula
5.1
formula
5.2
Equation 5.1 corresponds to maximizing the entropy of the resulting synaptic weights, a strategy that is well known to maximize storage capacity in the Willshaw model with binary synapses (where and become maximal if half of the synapses are potentiated, ; see Palm, 1980; Knoblauch et al., 2010). Instead of using equation 5.2, however, it may be more appropriate to adjust synaptic strengths to the actual distribution of synaptic potentials (e.g., a gaussian). Specifically, we may define each synaptic strength as the mean synaptic potential given that (see equation 2.1 and Figure 2). This corresponds to the second naive approach (referred to in the following as ), which yields for a gaussian distribution of synaptic potentials
formula
5.3
formula
5.4
where
formula
5.5
formula
5.6
and is as in equation 3.27. Both approaches maximize entropy with synaptic weights between 0 and 1. As discussed in note 10, it may sometimes be more useful (in particular, for unknown c and f) to shift and scale the weights further such that the mean synaptic weight is zero and the maximal weight is one. The two resulting discretization procedures will be referred to in the following as and .
Interestingly, for the second approach (/), we can simplify equation 3.36. As discussed above, the resulting and SNR R2 are invariant to scaling and shifting synaptic weight values. Therefore, inserting (rather than normalized equation 5.4) into equation 3.36 and exploiting the gaussian mean yields
formula
5.7
where is the variance of the synaptic strengths and is the variance of the discretization noise as defined below equation 3.36. The asymptotic result follows for infinite precision with because samples a gaussian with mean 0 and variance 1, and therefore also . Thus, for zero discretization noise (), it follows indeed , proving that it is possible to reach the upper bound of equation 3.36.
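Under the gaussian assumption, the second naive approach can be sketched as follows (illustrative names; the closed-form bin means of a standard gaussian are assumed here as the explicit counterparts of equations 5.3 to 5.6):

    import numpy as np
    from scipy.stats import norm

    def nv2_parameters(N):
        """Second naive discretization: equal matrix loads 1/N and strengths equal to the
        conditional means of a standard gaussian potential within the corresponding quantile bins."""
        loads = np.full(N, 1.0 / N)
        edges = norm.ppf(np.linspace(0.0, 1.0, N + 1))       # bin edges from -inf to +inf
        strengths = (norm.pdf(edges[:-1]) - norm.pdf(edges[1:])) / loads
        return strengths, loads

For N = 2, for example, this yields equal loads of one-half and strengths of about −0.80 and +0.80.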

5.2  Maximizing to Maximize R, , and

Maximizing the “zip factor” as defined in equation 3.36 maximizes the SNR R (see equation 3.39), pattern capacity (see equation 4.3), and network capacity (see equation 4.7). Without loss of generality, we can assume because the dependence of on discretization noise is considered sufficiently in section 4.5 (assume a given ratio ; see equations 3.36 and 4.21) and because is invariant to scaling and shifting of synaptic weights (see note 9).

5.2.1  Binary Synapses

For binary synapses () we can explicitly compute the optimal discretization parameters. Without loss of generality we can assume and and . Then equation 3.36 writes
formula
5.8
A necessary condition for maximal is . With equation B.9, this yields the condition , which is fulfilled for (see Figure 3a) and, thus,
formula
5.9
Figure 3:

Network storage capacity C for optimal discretization parameters that maximize the zip factor for N-ary synapses (see equations 4.9 and 3.36). (a) Asymptotic C as a function of total matrix load (see equation 4.15) for different N (solid lines; ). For increasing N, network capacity C quickly approaches the Bayesian limit , but requires a large fraction of “nonzero” synapses, (see Knoblauch, 2011). For comparison, the plot also shows corresponding curves for the Willshaw model (dashed) with a constant fraction of active units per address pattern (Willshaw) and binomially distributed (Willshaw). Note that for the Willshaw model, p1 depends on p (typically and for interesting ; see Knoblauch et al., 2010), whereas p1 is a free parameter in the zip net model. (b) Zip factor () as a function of the state number N (logarithmic scale) for optimal discretization (solid line) and the two naive methods NV1 (dashed) and NV2 (dash-dotted). (c) Contour plot of network capacity C for ternary synapses () as a function of matrix loads , and otherwise optimal discretization parameters (as explained in section 5.2). (d) Contour plot of optimal synaptic strength corresponding to panel c, assuming (without loss of generality) discretizations with and . Plots assume zero discretization noise () and large address populations (). Plots for C also assume optimal learning parameters (see equation 3.42), sparse activity (), zero query noise (), and zero output noise ().


This means that binary synapses can already store about two-thirds of the information that gradual synapses can store if employing optimal Bayesian or linear covariance learning (assuming sparse address patterns with ; see the discussion below equation 4.5),
formula
5.10

As for the Willshaw model (see Figure 3a, dashed lines), the maximum is obtained for maximal entropy (or information ; see equation F.1) of synaptic weights when synaptic thresholds are adjusted such that half of the synapses get potentiated.11
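As a worked check, assume that the binary zip factor of equation 5.8 has the form $\zeta(p_1) = g\!\left(G_c^{-1}(p_1)\right)^2 / \left(p_1 (1 - p_1)\right)$ for strengths 0 and 1 and zero discretization noise (a form consistent with the entries of Table 1). The optimum at $p_1 = \frac{1}{2}$ then gives

$\zeta = \dfrac{g(0)^2}{1/4} = \dfrac{4}{2\pi} = \dfrac{2}{\pi} \approx 0.6366,$

in agreement with the first row of Table 1 and with the statement above that binary synapses reach about two-thirds of the capacity of real-valued synapses.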

5.2.2  Ternary Synapses

For multistate synapses with more than two discrete weight values, , we can numerically maximize with respect to and . The following shows that for ternary synapses (), maximization of can be reduced to a two-dimensional optimization problem.

Without loss of generality, we can assume and (see note 9). Then equation 3.36 yields (again assuming )
formula
5.11
The maximum occurs for corresponding to
formula
Here the solution corresponds to the minimum , and solving the remaining linear equation yields the unique maximum at
formula
5.12

Thus, to compute optimal discretization parameters for ternary synapses, it is sufficient to maximize with respect to and , for example, and assuming , , , and as given by equation 5.12. Figure 3 illustrates optimal (see panel d) and the corresponding maximal network capacity (see panel c). As can be seen, is maximal for “symmetric” matrix loads and .

5.2.3  General Multistate Synapses

For general multistate synapses with more than three discrete weight values, , we can numerically maximize with respect to and corresponding to a dimensional optimization problem. Table 1 shows some optimal data obtained from maximizing equation 3.36 using the Matlab function fminsearch with initial discretization parameters from (see equations 5.3 and 5.4), where the resulting discrete strength values are scaled such that and . The resulting discrete weight distributions are bell shaped and resemble gaussians, in particular for large odd N (data not shown). For but not for , random initializations of the discretization parameters always yielded the same maxima as shown in the table. This indicates that there is a unique optimal set of discretization parameters for , whereas the data for may correspond to local maxima. Data for used an alternative initialization method described in a technical report (Knoblauch, 2009c), yielding better results than . Note the symmetry relations and for optimal discretizations. Most importantly, quickly converges toward 1. Networks of 2-bit synapses () can already store more than 88% of the information that gradual synapses can store when employing optimal Bayesian learning. Similarly, 3-bit synapses () reach more than 96% and 4-bit synapses () more than 99% of the maximal capacity.
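The optimization behind Table 1 can be reproduced with a Nelder-Mead search, the analogue of Matlab's fminsearch. The sketch below assumes that, for gaussian potentials and zero discretization noise, the zip factor of equation 3.36 can be written as the squared sum of the strength differences between adjacent levels, weighted by the gaussian density at the synaptic thresholds, divided by the variance of the discrete weights; this form reproduces the tabulated optima (e.g., 2/π ≈ 0.6366 for N = 2 and 0.8098 for N = 3). All names are illustrative.

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import norm

    def zip_factor(loads, strengths, noise_var=0.0):
        """Zip factor for gaussian potentials (assumed form consistent with equation 3.36 and
        Table 1). 'loads' and 'strengths' are ordered from the level assigned to the smallest
        potentials to the level assigned to the largest."""
        loads = np.asarray(loads, dtype=float)
        strengths = np.asarray(strengths, dtype=float)
        thetas = norm.ppf(np.cumsum(loads)[:-1])                  # N-1 synaptic thresholds
        signal = (np.diff(strengths) * norm.pdf(thetas)).sum() ** 2
        var_w = (loads * strengths ** 2).sum() - (loads * strengths).sum() ** 2
        return signal / (var_w + noise_var)

    def optimize_discretization(N):
        """Numerically maximize the zip factor over loads and strengths (cf. section 5.2.3)."""
        def neg_zeta(x):
            loads = np.exp(x[:N]) / np.exp(x[:N]).sum()           # softmax keeps loads positive, summing to 1
            return -zip_factor(loads, x[N:])
        x0 = np.concatenate([np.zeros(N), np.linspace(-1.0, 1.0, N)])
        res = minimize(neg_zeta, x0, method="Nelder-Mead",
                       options={"maxiter": 20000, "xatol": 1e-9, "fatol": 1e-12})
        loads = np.exp(res.x[:N]) / np.exp(res.x[:N]).sum()
        s = res.x[N:]
        s = 2.0 * (s - s.min()) / (s.max() - s.min()) - 1.0       # exploit scale/shift invariance: rescale to [-1, 1]
        return -res.fun, loads, s

For N = 3, this recovers a zip factor close to 0.81 with loads near (0.27, 0.46, 0.27) and strengths near (−1, 0, 1); for larger N, several random restarts are advisable, as noted above, since some tabulated values may correspond to local maxima.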

Table 1:
Optimal Discretization Parameters.

N | zip factor | C [bit/synapse] | matrix loads (one value per weight level) | synaptic strengths (same order)
2 | 0.6366 | 0.4592 | 0.5 0.5 | 1 −1
3 | 0.8098 | 0.5842 | 0.2703 0.4594 0.2703 | 1 0 −1
4 | 0.8825 | 0.6366 | 0.1631 0.3369 0.3369 0.1631 | 1 0.2998 −0.2998 −1
5 | 0.9201 | 0.6637 | 0.1067 0.2444 0.2978 0.2444 0.1067 | 1 0.4435 0 −0.4435 −1
6 | 0.9420 | 0.6795 | 0.0740 0.1810 0.2450 0.2450 0.1810 0.0740 | 1 0.5281 0.1678 −0.1678 −0.5281 −1
7 | 0.9560 | 0.6896 | 0.0536 0.1374 0.1986 0.2208 0.1986 0.1374 0.0536 | 1 0.5843 0.2757 0 −0.2757 −0.5843 −1
8 | 0.9655 | 0.6964 | 0.0402 0.1066 0.1615 0.1917 0.1917 0.1615 0.1066 0.0402 | 1 0.6245 0.3513 0.1139 −0.1139 −0.3513 −0.6245 −1
9 | 0.9721 | 0.7013 | 0.0310 0.0845 0.1323 0.1644 0.1756 0.1644 0.1323 0.0845 0.0310 | 1 0.6548 0.4075 0.1968 0 −0.1968 −0.4075 −0.6548 −1
10 | 0.9771 | 0.7048 | 0.0245 0.0681 0.1095 0.1407 0.1572 0.1572 0.1407 0.1095 0.0681 0.0245 | 1 0.6786 0.4511 0.2601 0.0851 −0.0851 −0.2601 −0.4511 −0.6786 −1
11 | 0.9808 | 0.7075 | 0.0198 0.0558 0.0916 0.1206 0.1393 0.1458 0.1393 0.1206 0.0916 0.0558 0.0198 | 1 0.6978 0.4860 0.3108 0.1515 0 −0.1515 −0.3108 −0.4860 −0.6978 −1
12 | 0.9837 | 0.7096 | 0.0162 0.0463 0.0773 0.1040 0.1231 0.1331 0.1331 0.1231 0.1040 0.0773 0.0463 0.0162 | 1 0.7137 0.5146 0.3509 0.2049 0.0674 −0.0674 −0.2049 −0.3509 −0.5146 −0.7137 −1
13 | 0.9859 | 0.7112 | 0.0134 0.0389 0.0659 0.0901 0.1088 0.1206 0.1246 0.1206 0.1088 0.0901 0.0659 0.0389 0.0134 | 1 0.7271 0.5386 0.3848 0.2488 0.1223 0 −0.1223 −0.2488 −0.3848 −0.5386 −0.7271 −1
14 | 0.9878 | 0.7125 | 0.0113 0.0331 0.0566 0.0783 0.0964 0.1088 0.1155 0.1155 0.1088 0.0964 0.0783 0.0566 0.0331 0.0113 | 1 0.7387 0.5591 0.4137 0.2861 0.1683 0.0556 −0.0556 −0.1683 −0.2861 −0.4137 −0.5591 −0.7387 −1
15 | 0.9893 | 0.7136 | 0.0095 0.0283 0.0490 0.0687 0.0855 0.0983 0.1062 0.1090 0.1062 0.0983 0.0855 0.0687 0.0490 0.0283 0.0095 | 1 0.7484 0.5767 0.4383 0.3175 0.2069 0.1022 0 −0.1022 −0.2069 −0.3175 −0.4383 −0.5767 −0.7484 −1
16 | 0.9904 | 0.7144 | 0.0079 0.0255 0.0385 0.0639 0.0774 0.0965 0.0854 0.1049 0.1049 0.0854 0.0965 0.0774 0.0639 0.0385 0.0255 0.0079 | 1 0.7519 0.5938 0.4616 0.3420 0.2328 0.1380 0.0497 −0.0497 −0.1380 −0.2328 −0.3420 −0.4616 −0.5938 −0.7519 −1
17 | 0.9915 | 0.7152 | 0.0071 0.0213 0.0375 0.0535 0.0681 0.0802 0.0892 0.0948 0.0966 0.0948 0.0892 0.0802 0.0681 0.0535 0.0375 0.0213 0.0071 | 1 0.7649 0.6057 0.4786 0.3688 0.2694 0.1765 0.0874 0 −0.0874 −0.1765 −0.2694 −0.3688 −0.4786 −0.6057 −0.7649 −1
18 | 0.9919 | 0.7155 | 0.0071 0.0214 0.0437 0.0551 0.0385 0.0815 0.0821 0.0839 0.0867 0.0867 0.0839 0.0821 0.0815 0.0385 0.0551 0.0437 0.0214 0.0071 | 1 0.7628 0.5911 0.4598 0.3770 0.2932 0.2013 0.1189 0.0387 −0.0387 −0.1189 −0.2013 −0.2932 −0.3770 −0.4598 −0.5911 −0.7628 −1
19 | 0.9932 | 0.7164 | 0.0054 0.0165 0.0294 0.0429 0.0547 0.0658 0.0749 0.0820 0.0855 0.0858 0.0855 0.0820 0.0749 0.0658 0.0547 0.0429 0.0294 0.0165 0.0054 | 1 0.7779 0.6286 0.5097 0.4082 0.3176 0.2333 0.1531 0.0756 0 −0.0756 −0.1531 −0.2333 −0.3176 −0.4082 −0.5097 −0.6286 −0.7779 −1
20 | 0.9936 | 0.7168 | 0.0050 0.0160 0.0287 0.0391 0.0497 0.0586 0.0681 0.0596 0.0787 0.0965 0.0965 0.0787 0.0596 0.0681 0.0586 0.0497 0.0391 0.0287 0.0160 0.0050 | 1 0.7726 0.6253 0.5106 0.4171 0.3330 0.2539 0.1851 0.1191 0.0417 −0.0417 −0.1191 −0.1851 −0.2539 −0.3330 −0.4171 −0.5106 −0.6253 −0.7726 −1