Neural associative networks are a promising computational paradigm for both modeling neural circuits of the brain and implementing associative memory and Hebbian cell assemblies in parallel VLSI or nanoscale hardware. Previous work has extensively investigated synaptic learning in linear models of the Hopfield type and simple nonlinear models of the Steinbuch/Willshaw type. Optimized Hopfield networks of size n can store a large number of about n²/k memories of size k (or associations between them) but require real-valued synapses, which are expensive to implement and can store at most C ≈ 0.72 bits per synapse. Willshaw networks can store a much smaller number of about n²/k² memories but get along with much cheaper binary synapses. Here I present a learning model employing synapses with discrete synaptic weights. For optimal discretization parameters, this model can store, up to a factor ζ close to one, the same number of memories as for optimized Hopfield-type learning, for example, ζ ≈ 0.64 for binary synapses, ζ ≈ 0.88 for 2-bit (four-state) synapses, ζ ≈ 0.97 for 3-bit (8-state) synapses, and ζ ≈ 0.99 for 4-bit (16-state) synapses. The model also provides the theoretical framework to determine optimal discretization parameters for computer implementations or brainlike parallel hardware including structural plasticity. In particular, as recently shown for the Willshaw network, it is possible to store up to ln 2 ≈ 0.69 bit per computer bit and up to log n bits per nonsilent synapse, whereas the absolute number of stored memories can be much larger than for the Willshaw model.
Current von Neumann computers are characterized by a segregation of processing and memory, which leads to the well-known von Neumann bottleneck: a limited data transfer rate between CPU and memory, but also the intellectual limitation of “word-at-a-time thinking” (Burks, Goldstine, & von Neumann, 1946; Backus, 1978). In the past, this bottleneck has been compensated for by constructing ever faster and larger processors with higher clock rates, larger caches, and denser integration of electronic elements, hoping for an indefinite continuation of Moore’s law. As such architectures become increasingly expensive in terms of energy, space, and cooling requirements, alternative computational paradigms become more and more attractive.
Associative neurocomputers are one such alternative paradigm in which, unlike in the classical von Neumann machine, computation and data storage are not separated (Steinbuch, 1961; Willshaw, Buneman, & Longuet-Higgins, 1969; Palm, 1982; Palm & Palm, 1991; Hammerstrom, 1990; Heittmann & Rückert, 2002; Chicca et al., 2003; Hammerstrom, Gao, Zhu, & Butts, 2006; Laiho et al., 2015; Poikonen, Lehtonen, Laiho, & Knuutila, 2015). For example, they can easily implement an associative memory storing a large set of M memories. In the general heteroassociative case, memories are associations between typically high-dimensional pattern vectors uμ and vμ (where μ = 1, 2, …, M). Similar to random access memory, a query pattern ũ entered into the associative memory can serve as an address for accessing the associated content pattern vμ. However, unlike random access memories, associative memories accept arbitrary query patterns ũ, and the computation of any particular output involves all stored data records rather than a single one. Specifically, the associative memory task consists of comparing a query with all stored addresses and returning an output pattern equal or similar to the content pattern associated with the address most similar to the query. Thus, the associative memory task includes the random access task but is not restricted to it. It also includes computations such as pattern completion, denoising, or data retrieval using incomplete cues.
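The associative memory task described above can be sketched in a few lines of (hypothetical, brute-force) code that simply compares a query with all stored addresses; the neural network model of this letter computes the same kind of mapping in parallel:

```python
import random

def hamming(a, b):
    """Number of differing components of two binary vectors."""
    return sum(x != y for x, y in zip(a, b))

def retrieve(pairs, query):
    """Return the content associated with the stored address closest to the query."""
    _, content = min(pairs, key=lambda pair: hamming(pair[0], query))
    return content

# Store M random address-content pairs (u, v) and query with a noisy address.
random.seed(1)
m, n, M = 24, 8, 5
pairs = [([random.randint(0, 1) for _ in range(m)],
          [random.randint(0, 1) for _ in range(n)]) for _ in range(M)]

u, v = pairs[2]
query = u[:]
query[0] ^= 1                        # noisy query: one flipped address bit
assert retrieve(pairs, query) == v   # pattern completion from an incomplete cue
```

Note that the brute-force version touches every stored pair on every query, which is exactly the cost that parallel associative networks avoid.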
Moreover, neural implementations of associative memory are closely related to Hebbian cell assemblies and play an important role in neuroscience as models of neural computation for various brain structures, for example neocortex, hippocampus, cerebellum, mushroom body (Hebb, 1949; Braitenberg, 1978; Palm, 1982; Hopfield, 1982; Fransen & Lansner, 1998; Pulvermüller, 2003; Johansson & Lansner, 2007; Lansner, 2009; Marr, 1969, 1971; Gardner-Medwin, 1976; Rolls, 1996; Bogacz, Brown, & Giraud-Carrier, 2001; Albus, 1971; Kanerva, 1988; Laurent, 2002; Honegger, Campbell, & Turner, 2011).
In its simplest form, such a neurocomputer can be realized as a neural associative network, that is, a single layer of n linear threshold elements or perceptrons, typically employing fast, easy-to-implement synaptic learning that depends on local information only. For hetero-association as described above, each of the n content neurons vj receives m synaptic inputs Wij from address neurons ui (see Figure 1, left panel). For the special case of autoassociation, the two neuron layers are identical, uμ = vμ and m = n, such that the weight matrix describes a recurrent network and the terms memory, pattern, and cell assembly may be used as synonyms. More complex cognitive architectures can be realized by connecting multiple modules of auto- and heteroassociative networks (e.g., Knoblauch, Markert, & Palm, 2005).
For local learning rules, the synaptic weight Wij depends only on uiμ and vjμ (μ = 1, …, M). This excludes, for example, gradient descent methods (e.g., error backpropagation) that are based on global error signals obtained from repeated training of the whole pattern set. Instead, associative memories use simple Hebbian-type learning rules where synaptic weights increase if both the presynaptic and postsynaptic neurons are active during presentation of a pattern pair.
The performance of neural associative memory models can be evaluated by storage capacity, which can be defined, for example, by the number of memories M a network of given size can store or by the Shannon information C that a synapse can store. More recent work also considers structural compression of synaptic networks and the energy or time requirements per retrieval (Poirazi & Mel, 2001; Stepanyants, Hof, & Chklovskii, 2002; Lennie, 2003; Knoblauch, 2003a, 2005, 2009b; Knoblauch, Palm, & Sommer, 2010).
Here I develop a novel nonlinear learning procedure for associative networks with discrete synaptic weights that combines the advantages of the previous models: large storage capacity (M and C) and efficient implementations on digital computers and brain-like parallel hardware (CI and CS). Basically, the new model corresponds to an optimal discretization of the synaptic weights obtained from the linear or Bayesian learning rule in the limit of large networks. For this, synaptic weights are computed in two processing steps. First, synaptic potentials are computed as in equation 1.2 or 1.4 from the synaptic counter variables Mab and thus correspond to the real-valued synaptic weights of linear or Bayesian learning. Second, the discrete synaptic weights Wij are obtained by applying one or several synaptic thresholds to the synaptic potentials (see Figure 2).
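The two processing steps can be sketched as follows. This is a minimal illustration under assumptions not fixed by the text above: it uses the covariance coefficients r(u, v) = (u − p)(v − q) as one optimal linear rule (see section 3.3), binary weights (N = 2), and a per-neuron synaptic threshold chosen such that a fixed fraction p1 of each neuron's synapses becomes potentiated.

```python
import random

random.seed(0)
m, n, M = 40, 30, 60
p = q = 0.3          # activity of address and content patterns
U = [[int(random.random() < p) for _ in range(m)] for _ in range(M)]
V = [[int(random.random() < q) for _ in range(n)] for _ in range(M)]

# Step 1: real-valued synaptic potentials, here from the linear covariance
# rule r(u, v) = (u - p)(v - q) (one of the optimal rules of section 3.3).
a = [[sum((U[mu][i] - p) * (V[mu][j] - q) for mu in range(M))
      for j in range(n)] for i in range(m)]

# Step 2: discrete weights by applying a synaptic threshold to the potentials;
# the threshold of each content neuron j is set such that a fraction p1 of its
# synapses becomes potentiated (binary case, N = 2).
p1 = 0.5
W = [[0] * n for _ in range(m)]
for j in range(n):
    pots = sorted((a[i][j] for i in range(m)), reverse=True)
    theta = pots[int(p1 * m) - 1]           # synaptic threshold of neuron j
    for i in range(m):
        W[i][j] = int(a[i][j] >= theta)

# Each content neuron now has at least p1*m potentiated (nonzero) synapses.
assert all(sum(W[i][j] for i in range(m)) >= p1 * m for j in range(n))
```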
This procedure is similar to the general learning rule proposed already by Steinbuch (1961, Figure 2c), where the synaptic potential corresponds to Steinbuch’s “Indiz” (indication). In contrast to Steinbuch’s sigmoid transfer functions, I consider optimal step functions to obtain discrete synaptic weights that maximize signal-to-noise ratio and storage capacity. The resulting learning rules generalize the original Steinbuch/Willshaw rule (see equation 1.3) toward less sparse memory patterns and multistate synapses. The analysis reveals that the resulting storage capacities M and C are almost the same as for the optimal linear and Bayesian learning rules: more exactly, it turns out that the “zip factor” ζ, defined as the relative capacity of discrete synapses compared to continuous synapses, is close to one already for a small number of discrete states, for example, ζ ≈ 0.64 for binary synapses, ζ ≈ 0.88 for 2-bit (four-state) synapses, ζ ≈ 0.97 for 3-bit (8-state) synapses, and ζ ≈ 0.99 for 4-bit (16-state) synapses. Moreover, the analysis also provides optimal discretization parameters for implementations on digital computers or brainlike parallel hardware including structural plasticity. In particular, for low-entropy settings where most synapses share a single discrete weight (e.g., zero), the network becomes “compressible” such that computer implementations can store up to ln 2 ≈ 0.69 bit per computer bit, and parallel hardware employing structural plasticity can store up to log n bits per nonsilent synapse, similarly as shown for the Willshaw network (Knoblauch et al., 2010, 2014). For that property and for the sake of brevity, I will sometimes refer to the model as the zip net model. As a by-product, it is possible to derive precision requirements for synaptic weights that may also include biological constraints such as Dale’s law, sparse synaptic connectivity, and structural plasticity.
The letter is organized as follows. Section 2 describes the learning and retrieval procedure of the model. Section 3 analyzes the signal-to-noise ratio (SNR) for linear and Bayesian learning of the synaptic potentials and derives a general expression for the zip factor ζ. Section 4 computes various storage capacities (M, C, CI, CS). Section 5 maximizes ζ for various numbers of synaptic states N and implementation constraints and derives the corresponding optimal discretization parameters. Section 6 compares retrieval efficiency in terms of space, time, and energy requirements to previous models. Section 7 presents numerical simulations that verify the theory and provide a comparison to the previous models. Section 8 summarizes and discusses the main results of this work and, in particular, points to implications for memory theories based on structural plasticity and possible nanoscale hardware implementations. Finally, the appendixes compute the SNR for general distributions of synaptic potentials (appendix A), give formulas for gaussian tail integrals (appendix B) and optimal firing thresholds (appendix D), discuss the relation between the SNR and the Hamming-distance-based output noise (appendix E), and give basic information-theoretic formulas (appendix F) as well as recommendations for an efficient implementation of networks of discrete multistate synapses (appendix G). Taxonomy and notations are as in Knoblauch et al. (2010) and Knoblauch (2011) whenever possible.
2 Network Model
2.1 Learning of Discrete Synaptic Weights
2.3 Control of Synaptic Thresholds
There are several strategies to control the synaptic thresholds of a content neuron j. Here I consider the following two:
Fixed synaptic thresholds. Synaptic thresholds may be precomputed and fixed, based on an estimate of the distribution of synaptic potentials using equation 2.5 (as assumed in the SNR analysis of section 3.1).
Homeostatic control. Synaptic thresholds may be adapted during online learning to realize a desired distribution of synaptic strengths.
While the former variant is easy to analyze (as done in section 3), the latter turns out to minimize output noise (see equation 2.9) by realizing identical “false alarm” error probabilities among the low units vj (see the numerical experiments in section 7). In computer implementations, homeostatic control can easily be realized by choosing synaptic thresholds such that, at any time, each content neuron j has exactly the desired number of synapses at each discrete strength level. For example, we may sort the synaptic potentials of neuron j in decreasing order and then assign the largest discrete strength to the top-ranked synapses, the second-largest strength to the following ranks, and so on, according to the desired weight distribution.
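In code, the rank-based variant of homeostatic control reads as follows (a hypothetical Python sketch; the strength values and desired distribution are taken from the ternary optimum of section 5.2.2, and the last strength level simply absorbs rounding remainders):

```python
import random

def homeostatic_weights(potentials, strengths, fractions):
    """Rank-based homeostatic control: sort synaptic potentials in decreasing
    order and assign the largest strength to the top fractions[0] of synapses,
    the next strength to the following ranks, and so on."""
    m = len(potentials)
    order = sorted(range(m), key=lambda i: -potentials[i])  # descending ranks
    W, k = [None] * m, 0
    for t, s in enumerate(strengths):
        # last strength level takes all remaining synapses
        kk = m if t == len(strengths) - 1 else k + round(fractions[t] * m)
        for i in order[k:kk]:
            W[i] = s
        k = kk
    return W

random.seed(3)
pots = [random.gauss(0.0, 1.0) for _ in range(1000)]
W = homeostatic_weights(pots, [1.0, 0.0, -1.0], [0.2703, 0.4594, 0.2703])
assert (W.count(1.0), W.count(0.0), W.count(-1.0)) == (270, 459, 271)
```

By construction, the realized counts per strength level are independent of the actual potential values, which is exactly what makes the “false alarm” probabilities identical across content neurons.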
There is actually evidence that the brain regulates the number of (potentiated) synapses per neuron in a similar way (Fares & Stepanyants, 2009; Knoblauch et al., 2014). For biological models or implementations on parallel hardware, the homeostatic control could be realized by regulating synaptic thresholds in order to achieve a desired mean neuron activity when stimulating with random inputs with defined activity statistics (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Turrigiano & Nelson, 2004; Turrigiano, 2007; Desai, Rutherford, & Turrigiano, 1999; Van Welie, Van Hooft, & Wadman, 2004). This is particularly obvious for binary synapses because, for inputs with a given mean activity, there is a unique relationship between the number of potentiated synapses and mean output activity. It would then be sufficient to increase (or decrease) synaptic thresholds if mean output activity is above (or below) the target activity (Knoblauch, 2009c, 2010b).3 Similar control mechanisms could also be realized for multistate synapses, possibly requiring a regulation based on state-specific expression factors.
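The activity-based variant can be sketched as a simple control loop for a single content neuron with binary synapses (hypothetical code; the probe statistics, firing threshold, and target activity are arbitrary choices): raise the synaptic threshold while the neuron responds too often to random probe inputs, and lower it otherwise.

```python
import random

random.seed(7)
m = 200
a = [random.gauss(0.0, 1.0) for _ in range(m)]   # potentials of one neuron
probes = [[int(random.random() < 0.2) for _ in range(m)] for _ in range(300)]
Theta = 12.0                                     # fixed firing threshold

def activity(theta):
    """Mean output activity over the probe inputs for synaptic threshold theta."""
    W = [int(x >= theta) for x in a]             # binary synaptic weights
    fires = sum(sum(u * w for u, w in zip(probe, W)) >= Theta
                for probe in probes)
    return fires / len(probes)

# Homeostatic regulation toward target activity 0.5: too active -> raise the
# synaptic threshold (fewer potentiated synapses), too silent -> lower it.
lo, hi = min(a), max(a)
for _ in range(40):
    theta = (lo + hi) / 2
    if activity(theta) > 0.5:
        lo = theta
    else:
        hi = theta
assert abs(activity(theta) - 0.5) < 0.1
```

The loop exploits the unique relationship mentioned above: for inputs with fixed mean activity, output activity decreases monotonically with the synaptic threshold, so a simple bisection suffices.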
3 Signal-to-Noise Ratio for Linear Learning of Synaptic Potentials
The following subsections compute the SNR, equation 2.10, of a given content neuron j for linear learning, making a number of further assumptions:
Components of address patterns are identically and independently distributed, where p is the probability of an active component.4
The query pattern ũ is a noisy version of an address pattern uμ, having c correct and f false one-entries, similar to the situation illustrated in Figure 1 (left).
Synaptic potentials follow a gaussian distribution, where Gc is the complementary gaussian distribution function (see appendix B, equation B.5). This assumption becomes true, due to the central limit theorem, at least for a large memory number M (where the counters Mab also diverge for constant p and q).
Let us finally assume the limit of large networks, m, n → ∞, to allow a diverging memory number M → ∞, whereas the activity parameters p, q, the learning parameters r00, r01, r10, r11, and the discretization parameters are assumed to remain constant. (For an analysis of finite networks, see appendix A.)
The SNR analysis consists of three parts. Section 3.1 analyzes conditional and unconditional distributions of synaptic potentials for linear learning. Section 3.2 computes conditional matrix loads and derives distributions of dendritic potentials for low and high units in large networks. Finally, section 3.3 optimizes linear learning parameters to maximize the SNR.
3.1 Distribution of Synaptic Potentials
3.2 SNR for Large Networks
The SNR for finite networks and general distributions of synaptic potentials is analyzed in appendix A. Here we apply these results to compute the SNR for large networks with m, n → ∞ and linear learning of synaptic potentials as described in section 3.1. Thus, we can again assume a gaussian distribution of synaptic potentials (see equations 3.1, 2.3, and B.5).
3.3 Optimal Linear Learning of Synaptic Potentials
Note that the covariance rule is optimal because it satisfies the optimality criterion (see Figure 1). However, unlike for linear weight learning (Dayan & Willshaw, 1991), the covariance rule is not the unique optimum. Instead, for discrete weights, there is a three-dimensional subspace of optimal learning rules, including biologically more realistic ones than the covariance rule. For example, the homosynaptic rule is also optimal because it satisfies the optimality criterion as well (see Figure 1). Other well-investigated learning rules, such as the heterosynaptic rule, the Hebbian rule, and the Hopfield rule, in general do not satisfy the optimality criterion (see Figure 1; see also Dayan & Willshaw, 1991, and Palm & Sommer, 1996).
4 Analysis of Storage Capacity
A reasonable performance measure for associative networks is storage capacity. There are actually various definitions of storage capacity that depend on the target platform used for implementing the networks. For example, section 4.1 computes the pattern capacity M, defined as the absolute number of memories a network of a given size can store at a given output noise level ε (see equation 2.9). Similarly, section 4.2 computes the network capacity C, the stored information per synapse for networks with a fixed given structure. Section 4.3 computes the information capacity CI, which applies to compressed implementations on digital computers, measuring the stored information per computer bit. Finally, section 4.4 computes the synaptic capacity CS, defined as the stored information per synapse in structurally plastic networks. All of these capacity measures can be derived from the SNR R (see equations 2.10 and 3.39), making the same presumptions as described at the beginning of section 3. In addition, we have to assume that dendritic potentials have a gaussian distribution (see equation C.7).
Although we focus on one-step retrieval, the following analyses apply as well to iterative retrieval in recurrent autoassociative (Hopfield-type) networks with uμ = vμ and m = n. There we can expect successful retrieval if the output noise is smaller than the query noise. Thus, to approximate the various storage capacities, we have to choose an appropriate output noise level ε. In particular, to approximate the maximal number of stable memory attractors, we can compute the pattern capacity for zero query noise and small ε.
4.1 Pattern Capacity
First, ζ ≤ 1. To see this, note that for N → ∞, the rightmost term in equation 4.4 (as well as the upper bound, equation 4.5) corresponds to the Gardner bound, which defines a general upper bound on the pattern capacity for any model of synaptic learning (Gardner, 1988, eq. 40). Since both remaining factors (see equation 3.38; see also equations 3.37 and 3.43) can reach their upper bounds, it necessarily follows that ζ ≤ 1 (whereas ζ > 1 would violate the Gardner bound for N → ∞).
Second, for diverging network size, any finite discretization procedure with N < ∞ decreases the pattern capacity by at least the factor ζ < 1 compared to optimal learning of real-valued synaptic weights. This follows because equation 4.3 is a factor ζ worse than the pattern capacity for optimal Bayesian learning (Knoblauch, 2010a, 2011) or linear covariance learning (Palm & Sommer, 1996) of real-valued synaptic weights.
Third, reaching the full pattern capacity of optimal Bayesian or linear covariance learning requires that the computing precision diverge with the network size, N → ∞ as m, n → ∞. This is the reverse of the second conclusion, assuming that ζ → 1 is possible for N → ∞. The latter assumption is true because, for any given network, the discretization error (in the dendritic potentials xj) vanishes for diverging N even for trivial discretization parameters, for example, equidistant thresholds between the minimal and maximal real-valued synaptic weights of linear learning (see also equation 5.7 in section 5.1). In section 5, we will see that ζ ≈ 1 already for small N.
4.2 Network Capacity
4.3 Information Capacity for Compressed Computer Implementations
4.4 Synaptic Capacity for Structural Plasticity
4.5 Effect of Discretization Noise
5 Optimal Discretizations
The previous sections computed the SNR and various storage capacities for given discretization parameters, that is, for given synaptic thresholds and strengths. This section optimizes these parameters in order to maximize the different storage capacities M, C, CI, and CS for linear learning of synaptic potentials. First, as a reference for comparison, section 5.1 computes capacities for naive discretization methods. Then section 5.2 maximizes the zip factor ζ in order to maximize pattern capacity M and network capacity C, which is most relevant for networks with a fixed given connectivity structure. Section 5.3 maximizes the information capacity CI relevant for compressed implementations on digital computers. Finally, section 5.4 maximizes the synaptic capacity CS relevant for brainlike hardware including structural plasticity. Note that the SNR, and thus the storage capacities, are invariant to scaling and shifting of synaptic weights.9 Thus, we can assume, without loss of generality, that discrete synaptic strengths lie within a given interval, for example, [−1, 1] or [0, 1] (where the latter satisfies Dale’s law that all synapses of a given neuron have the same polarity, either excitatory or inhibitory).10
5.1 Naive Discretizations
5.2 Maximizing ζ to Maximize R, M, and C
Maximizing the “zip factor” ζ as defined in equation 3.36 maximizes the SNR R (see equation 3.39), the pattern capacity M (see equation 4.3), and the network capacity C (see equation 4.7). Without loss of generality, we can neglect discretization noise, because the dependence of the capacities on discretization noise is considered sufficiently in section 4.5 (assume a given noise ratio; see equations 3.36 and 4.21) and because ζ is invariant to scaling and shifting of synaptic weights (see note 9).
5.2.1 Binary Synapses
5.2.2 Ternary Synapses
For multistate synapses with more than two discrete weight values, N > 2, the zip factor ζ can be maximized numerically with respect to the synaptic thresholds and strengths. The following shows that for ternary synapses (N = 3), the maximization of ζ can be reduced to a two-dimensional optimization problem.
Thus, to compute optimal discretization parameters for ternary synapses, it is sufficient to maximize ζ with respect to two free parameters, with the remaining discretization parameters as given by equation 5.12. Figure 3 illustrates the optimal ζ (see panel d) and the corresponding maximal network capacity C (see panel c). As can be seen, ζ is maximal for “symmetric” matrix loads p1 = p3.
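The symmetric ternary optimum can be reproduced with a short numerical sketch. It assumes (consistently with the values in Figure 3 and Table 1) that equation 3.36 reduces to the squared correlation between the gaussian synaptic potential a and its discretized weight W; for thresholds ±θ and strengths (1, 0, −1), this gives ζ(θ) = (2φ(θ))²/(2Q(θ)), with φ the gaussian density and Q the gaussian tail:

```python
from math import erfc, exp, pi, sqrt

phi = lambda x: exp(-x * x / 2) / sqrt(2 * pi)  # standard gaussian density
Q = lambda x: 0.5 * erfc(x / sqrt(2))           # gaussian tail P(a > x)

# Binary check: a single zero threshold with strengths (1, -1) gives
# zeta = (2 phi(0))^2 = 2/pi, the value in the N = 2 row of Table 1.
assert abs((2 * phi(0)) ** 2 - 0.6366) < 1e-4

# Ternary case: thresholds -theta, +theta with strengths (1, 0, -1) give
# zeta(theta) = E[aW]^2 / Var(W) = (2 phi(theta))^2 / (2 Q(theta)).
zeta = lambda theta: 2 * phi(theta) ** 2 / Q(theta)

# One-dimensional scan in place of a closed-form maximization:
best = max((i / 10000 for i in range(1, 30000)), key=zeta)
assert abs(zeta(best) - 0.8098) < 1e-3   # maximal zip factor, cf. Table 1
assert abs(Q(best) - 0.2703) < 1e-3      # optimal matrix load p1 = p3
```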
5.2.3 General Multistate Synapses
For general multistate synapses with more than three discrete weight values, N > 3, we can numerically maximize ζ with respect to the synaptic thresholds and strengths, corresponding to a higher-dimensional optimization problem. Table 1 shows some optimal data obtained from maximizing equation 3.36 using the Matlab function fminsearch, with initial discretization parameters as in equations 5.3 and 5.4, where the resulting discrete strength values are scaled such that s1 = 1 and sN = −1. The resulting discrete weight distributions are bell shaped and resemble gaussians, in particular for large odd N (data not shown). For small but not for large N, random initializations of the discretization parameters always yielded the same maxima as shown in the table. This indicates that there is a unique optimal set of discretization parameters for small N, whereas the data for larger N may correspond to local maxima. Data for larger N used an alternative initialization method described in a technical report (Knoblauch, 2009c), yielding better results than random initialization. Note the symmetry relations pt = pN+1−t and st = −sN+1−t for optimal discretizations. Most important, ζ quickly converges toward 1. Networks of 2-bit synapses (N = 4) can already store more than 88% of the information that gradual synapses can store when employing optimal Bayesian learning. Similarly, 3-bit synapses (N = 8) reach more than 96%, and 4-bit synapses (N = 16) more than 99%, of the maximal capacity.
Table 1: Optimal zip factors ζ, network capacities C, weight distributions pt, and strength values st for N-state synapses.

| N | ζ | C [bit/synapse] | pt for t = 1, …, N | st for t = 1, …, N |
|---|---|---|---|---|
| 2 | 0.6366 | 0.4592 | 0.5 0.5 | 1 −1 |
| 3 | 0.8098 | 0.5842 | 0.2703 0.4594 0.2703 | 1 0 −1 |
| 4 | 0.8825 | 0.6366 | 0.1631 0.3369 0.3369 0.1631 | 1 0.2998 −0.2998 −1 |
| 5 | 0.9201 | 0.6637 | 0.1067 0.2444 0.2978 0.2444 0.1067 | 1 0.4435 0 −0.4435 −1 |
| 6 | 0.9420 | 0.6795 | 0.0740 0.1810 0.2450 0.2450 0.1810 0.0740 | 1 0.5281 0.1678 −0.1678 −0.5281 −1 |
| 7 | 0.9560 | 0.6896 | 0.0536 0.1374 0.1986 0.2208 0.1986 0.1374 0.0536 | 1 0.5843 0.2757 0 −0.2757 −0.5843 −1 |
| 8 | 0.9655 | 0.6964 | 0.0402 0.1066 0.1615 0.1917 0.1917 0.1615 0.1066 0.0402 | 1 0.6245 0.3513 0.1139 −0.1139 −0.3513 −0.6245 −1 |
| 9 | 0.9721 | 0.7013 | 0.0310 0.0845 0.1323 0.1644 0.1756 0.1644 0.1323 0.0845 0.0310 | 1 0.6548 0.4075 0.1968 0 −0.1968 −0.4075 −0.6548 −1 |
| 10 | 0.9771 | 0.7048 | 0.0245 0.0681 0.1095 0.1407 0.1572 0.1572 0.1407 0.1095 0.0681 0.0245 | 1 0.6786 0.4511 0.2601 0.0851 −0.0851 −0.2601 −0.4511 −0.6786 −1 |
| 11 | 0.9808 | 0.7075 | 0.0198 0.0558 0.0916 0.1206 0.1393 0.1458 0.1393 0.1206 0.0916 0.0558 0.0198 | 1 0.6978 0.4860 0.3108 0.1515 0 −0.1515 −0.3108 −0.4860 −0.6978 −1 |
| 12 | 0.9837 | 0.7096 | 0.0162 0.0463 0.0773 0.1040 0.1231 0.1331 0.1331 0.1231 0.1040 0.0773 0.0463 0.0162 | 1 0.7137 0.5146 0.3509 0.2049 0.0674 −0.0674 −0.2049 −0.3509 −0.5146 −0.7137 −1 |
| 13 | 0.9859 | 0.7112 | 0.0134 0.0389 0.0659 0.0901 0.1088 0.1206 0.1246 0.1206 0.1088 0.0901 0.0659 0.0389 0.0134 | 1 0.7271 0.5386 0.3848 0.2488 0.1223 0 −0.1223 −0.2488 −0.3848 −0.5386 −0.7271 −1 |
| 14 | 0.9878 | 0.7125 | 0.0113 0.0331 0.0566 0.0783 0.0964 0.1088 0.1155 0.1155 0.1088 0.0964 0.0783 0.0566 0.0331 0.0113 | 1 0.7387 0.5591 0.4137 0.2861 0.1683 0.0556 −0.0556 −0.1683 −0.2861 −0.4137 −0.5591 −0.7387 −1 |
| 15 | 0.9893 | 0.7136 | 0.0095 0.0283 0.0490 0.0687 0.0855 0.0983 0.1062 0.1090 0.1062 0.0983 0.0855 0.0687 0.0490 0.0283 0.0095 | 1 0.7484 0.5767 0.4383 0.3175 0.2069 0.1022 0 −0.1022 −0.2069 −0.3175 −0.4383 −0.5767 −0.7484 −1 |
| 16 | 0.9904 | 0.7144 | 0.0079 0.0255 0.0385 0.0639 0.0774 0.0965 0.0854 0.1049 0.1049 0.0854 0.0965 0.0774 0.0639 0.0385 0.0255 0.0079 | 1 0.7519 0.5938 0.4616 0.3420 0.2328 0.1380 0.0497 −0.0497 −0.1380 −0.2328 −0.3420 −0.4616 −0.5938 −0.7519 −1 |
| 17 | 0.9915 | 0.7152 | 0.0071 0.0213 0.0375 0.0535 0.0681 0.0802 0.0892 0.0948 0.0966 0.0948 0.0892 0.0802 0.0681 0.0535 0.0375 0.0213 0.0071 | 1 0.7649 0.6057 0.4786 0.3688 0.2694 0.1765 0.0874 0 −0.0874 −0.1765 −0.2694 −0.3688 −0.4786 −0.6057 −0.7649 −1 |
| 18 | 0.9919 | 0.7155 | 0.0071 0.0214 0.0437 0.0551 0.0385 0.0815 0.0821 0.0839 0.0867 0.0867 0.0839 0.0821 0.0815 0.0385 0.0551 0.0437 0.0214 0.0071 | 1 0.7628 0.5911 0.4598 0.3770 0.2932 0.2013 0.1189 0.0387 −0.0387 −0.1189 −0.2013 −0.2932 −0.3770 −0.4598 −0.5911 −0.7628 −1 |
| 19 | 0.9932 | 0.7164 | 0.0054 0.0165 0.0294 0.0429 0.0547 0.0658 0.0749 0.0820 0.0855 0.0858 0.0855 0.0820 0.0749 0.0658 0.0547 0.0429 0.0294 0.0165 0.0054 | 1 0.7779 0.6286 0.5097 0.4082 0.3176 0.2333 0.1531 0.0756 0 −0.0756 −0.1531 −0.2333 −0.3176 −0.4082 −0.5097 −0.6286 −0.7779 −1 |
| 20 | 0.9936 | 0.7168 | 0.0050 0.0160 0.0287 0.0391 0.0497 0.0586 0.0681 0.0596 0.0787 0.0965 0.0965 0.0787 0.0596 0.0681 0.0586 0.0497 0.0391 0.0287 0.0160 0.0050 | 1 0.7726 0.6253 0.5106 0.4171 0.3330 0.2539 0.1851 0.1191 0.0417 −0.0417 −0.1191 −0.1851 −0.2539 −0.3330 −0.4171 −0.5106 −0.6253 −0.7726 −1 |
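The tabulated values can be checked numerically without Matlab. Below is a hypothetical Python sketch for N = 4, again assuming that equation 3.36 reduces to the squared correlation of the gaussian potential and the discretized weight, and exploiting the symmetry relations via thresholds (−b, 0, b) and strengths (1, s, −s, −1); a plain grid search over the two free parameters replaces fminsearch:

```python
from math import erfc, exp, pi, sqrt

phi = lambda x: exp(-x * x / 2) / sqrt(2 * pi)  # standard gaussian density
Q = lambda x: 0.5 * erfc(x / sqrt(2))           # gaussian tail P(a > x)

def zeta4(b, s):
    """Zip factor of a symmetric 4-state discretization of a gaussian
    potential with thresholds (-b, 0, b) and strengths (1, s, -s, -1)."""
    E = 2 * phi(b) + 2 * s * (phi(0) - phi(b))  # E[a W]
    V = 2 * Q(b) + 2 * s * s * (0.5 - Q(b))     # Var(W), since E[W] = 0
    return E * E / V

# Grid search over the two free parameters in place of Matlab's fminsearch:
b, s = max(((i / 1000, j / 1000) for i in range(500, 1500, 2)
            for j in range(100, 600, 2)), key=lambda bs: zeta4(*bs))

assert abs(zeta4(b, s) - 0.8825) < 1e-3  # zeta for N = 4, cf. Table 1
assert abs(s - 0.2998) < 1e-2            # optimal intermediate strength
assert abs(Q(b) - 0.1631) < 3e-3         # optimal matrix load p1 = p4
```

The same parameterization extends to larger even N by adding further threshold and strength pairs, at the cost of a higher-dimensional search.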