## Abstract

Neural associative networks are a promising computational paradigm for both modeling neural circuits of the brain and implementing associative memory and Hebbian cell assemblies in parallel VLSI or nanoscale hardware. Previous work has extensively investigated synaptic learning in linear models of the Hopfield type and simple nonlinear models of the Steinbuch/Willshaw type. Optimized Hopfield networks of size *n* can store a large number of about *n*^{2}/*k* memories of size *k* (or associations between them) but require real-valued synapses, which are expensive to implement and can store at most *C* = 0.72 bits per synapse. Willshaw networks can store a much smaller number of about *n*^{2}/*k*^{2} memories but get along with much cheaper binary synapses. Here I present a learning model employing synapses with discrete synaptic weights. For optimal discretization parameters, this model can store, up to a factor ζ close to one, the same number of memories as for optimized Hopfield-type learning, for example, ζ = 0.64 for binary synapses, ζ = 0.88 for 2-bit (four-state) synapses, ζ = 0.96 for 3-bit (8-state) synapses, and ζ = 0.99 for 4-bit (16-state) synapses. The model also provides the theoretical framework to determine optimal discretization parameters for computer implementations or brainlike parallel hardware including structural plasticity. In particular, as recently shown for the Willshaw network, it is possible to store up to 1 bit per computer bit and up to log *n* bits per nonsilent synapse, whereas the absolute number of stored memories can be much larger than for the Willshaw model.

## 1 Introduction

Current von Neumann computers are characterized by a segregation between processing and memory, which leads to the well-known von Neumann bottleneck, meaning a limited data transfer rate between CPU and memory but also the intellectual limitation of “word-at-a-time thinking” (Burks, Goldstine, & von Neumann, 1946; Backus, 1978). In the past, this bottleneck has been compensated for by constructing increasingly faster and larger processors having higher clock rates, larger caches, and denser integration of electronic elements, hoping for an infinite continuation of Moore’s law. As such architectures become increasingly expensive in terms of energy, space, and cooling requirements, alternative computational paradigms become more and more attractive.

Associative neurocomputers are one such alternative paradigm in which, unlike the classical von Neumann machine, computation and data storage are not separated (Steinbuch, 1961; Willshaw, Buneman, & Longuet-Higgins, 1969; Palm, 1982; Palm & Palm, 1991; Hammerstrom, 1990; Heittmann & Rückert, 2002; Chicca et al., 2003; Hammerstrom, Gao, Zhu, & Butts, 2006; Laiho et al., 2015; Poikonen, Lehtonen, Laiho, & Knuutila, 2015). For example, they can easily implement associative memory storing a large set of *M* memories. In the general heteroassociative case, memories are associations between typically high-dimensional address and content pattern vectors *u*^{μ} and *v*^{μ} (where μ = 1, …, *M*). Similar to random access memory, a query pattern entered in the associative memory can serve as an address for accessing the associated content pattern. However, unlike random access memories, associative memories accept arbitrary query patterns, and the computation of any particular output involves all stored data records rather than a single one. Specifically, the associative memory task consists of comparing a query with all stored addresses and returning an output pattern equal or similar to the pattern associated with the address most similar to the query. Thus, the associative memory task includes the random access task but is not restricted to it. It also includes computations such as pattern completion, denoising, or data retrieval using incomplete cues.
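The associative memory task described here can be illustrated in a few lines (a toy sketch with hypothetical names; a neural associative network realizes the comparison with all stored addresses in parallel rather than by an explicit search):

```python
import numpy as np

def retrieve_best_match(addresses, contents, query):
    """Associative memory task: compare the query with all stored addresses
    and return the content associated with the most similar address."""
    overlaps = addresses @ query                 # similarity of the query to every stored address
    return contents[int(np.argmax(overlaps))]   # content of the best-matching address
```

For example, an incomplete cue that shares only part of a stored address still retrieves the complete associated content, which is exactly the pattern-completion property mentioned above.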
Moreover, neural implementations of associative memory are closely related to Hebbian cell assemblies and play an important role in neuroscience as models of neural computation for various brain structures, for example, neocortex, hippocampus, cerebellum, and mushroom body (Hebb, 1949; Braitenberg, 1978; Palm, 1982; Hopfield, 1982; Fransen & Lansner, 1998; Pulvermüller, 2003; Johansson & Lansner, 2007; Lansner, 2009; Marr, 1969, 1971; Gardner-Medwin, 1976; Rolls, 1996; Bogacz, Brown, & Giraud-Carrier, 2001; Albus, 1971; Kanerva, 1988; Laurent, 2002; Honegger, Campbell, & Turner, 2011).

In its simplest form, such neurocomputers can be realized as a neural associative network, that is, a single layer of *n* linear threshold elements or perceptrons, typically employing fast, easy-to-implement synaptic learning that depends on local information only. For hetero-association as described above, each of the *n* content neurons *v _{j}* receives *m* synaptic inputs *W _{ij}* from the address neurons *u _{i}* (see Figure 1, left panel). For the special case of autoassociation, the two neuron layers are identical, with *u*^{μ} = *v*^{μ} and *m* = *n*, such that the weight matrix describes a recurrent network and the terms *memory*, *pattern*, and *cell assembly* may be used as synonyms. More complex cognitive architectures can be realized by connecting multiple modules of auto- and heteroassociative networks (e.g., Knoblauch, Markert, & Palm, 2005).

For local learning rules, the synaptic weight *W _{ij}* depends only on the activities of the presynaptic neuron *u _{i}* and the postsynaptic neuron *v _{j}*. This excludes, for example, gradient descent methods (e.g., error backpropagation) that are based on global error signals obtained from repeated training of the whole pattern set. Instead, associative memories use simple Hebbian-type learning rules where synaptic weights increase if both the presynaptic and postsynaptic neurons are active during presentation of a pattern pair.

The performance of neural associative memory models can be evaluated by storage capacity, which can be defined, for example, by the number of memories *M* a network of given size can store or by the Shannon information *C* that a synapse can store. More recent work also considers structural compression of synaptic networks and the energy or time requirements per retrieval (Poirazi & Mel, 2001; Stepanyants, Hof, & Chklovskii, 2002; Lennie, 2003; Knoblauch, 2003a, 2005, 2009b; Knoblauch, Palm, & Sommer, 2010).

One well-investigated model class is associative networks with linear learning (Hopfield, 1982; Palm, 1988; Tsodyks & Feigel’man, 1988; Dayan & Willshaw, 1991; Dayan & Sejnowski, 1993; Palm & Sommer, 1996; Chechik, Meilijson, & Ruppin, 2001; Sterratt & Willshaw, 2008), where the resulting synaptic weight is a linear function of the synaptic counters *M _{ab}*, which count the pattern pairs with presynaptic activity *a* combined with postsynaptic activity *b*. If pattern vectors are binary, the general linear learning rule can be described by four parameters *r _{ab}* with *a*, *b* ∈ {0, 1} (see Figure 1, right panels). Optimizing the parameters *r _{ab}* to maximize storage capacity yields the covariance rule (see Figure 1, right upper table), which can store up to 0.72 bits per synapse (bps) if address and content patterns have only a small fraction *p* and *q* of active units, respectively. The corresponding maximal number of memories is equal to the Gardner bound, a general upper capacity limit for arbitrary learning methods (Gardner, 1987, 1988; Gardner & Derrida, 1988).

A second well-investigated model class is the Steinbuch/Willshaw network with binary synapses, which reaches a high capacity of up to ln 2 ≈ 0.69 bps only for extremely sparse patterns with a logarithmically small number of active units, *k* ~ log *n*. For more reasonable *k*, the information a synapse can store vanishes, *C* → 0, and the absolute number of memories, *M*, is much smaller than for optimal linear learning. Nevertheless, due to the binary synapses, Willshaw networks have very efficient implementations on both digital computers and brainlike parallel hardware including structural plasticity or synaptic pruning, where performance can more reasonably be evaluated in terms of information capacity *C ^{I}* and synaptic capacity *C ^{S}* (Knoblauch, 2003a; Knoblauch et al., 2010; see section 4). For example, compressed implementations of Willshaw networks can store up to 1 bit per computer bit for almost any nonlogarithmic sparse activity (Knoblauch, 2008). Moreover, networks employing structural plasticity (e.g., by pruning of irrelevant silent synapses) can store up to log *n* bits per synapse and provide functional interpretations for structural plasticity and hippocampal memory replay in the brain (Knoblauch, Körner, Körner, & Sommer, 2014; Butz, Wörgötter, & van Ooyen, 2009; Holtmaat & Svoboda, 2009; Ji & Wilson, 2007; Sirota, Csicsvari, Buhl, & Buzsaki, 2003).
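The clipped Hebbian learning of the Steinbuch/Willshaw model mentioned here can be sketched in a few lines (a minimal illustration; the function name and the toy retrieval threshold are mine):

```python
import numpy as np

def willshaw_learn(U, V):
    """Steinbuch/Willshaw learning with binary synapses: W_ij becomes 1 as
    soon as address neuron i and content neuron j are coactive in any stored
    pattern pair (clipped Hebbian learning).
    U: M x m binary address patterns, V: M x n binary content patterns."""
    return np.clip(U.T @ V, 0, 1)
```

Retrieval can then threshold the dendritic potentials of a query at the number of its active units, which recovers the stored content for sufficiently sparse patterns.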

A third model class employs optimal Bayesian learning (Knoblauch, 2010a, 2011). Although Bayesian learning maximizes *M* and *C*, it cannot exceed the asymptotic limit of 0.72 bps either. As both linear and Bayesian learning require real-valued synaptic weights, they have no efficient implementations on digital computers (e.g., requiring high computing precision), and it is unclear how to include structural plasticity or synaptic pruning in an efficient way (thus at most 0.72 bps).

Here I develop a novel nonlinear learning procedure for associative networks with discrete synaptic weights that combines the advantages of the previous models: large storage capacity (*M* and *C*) and efficient implementations on digital computers and brainlike parallel hardware (*C ^{I}* and *C ^{S}*). Basically, the new model corresponds to an optimal discretization of the synaptic weights obtained from the linear or Bayesian learning rule in the limit of a large number *M* of stored memories. For this, synaptic weights are computed in two processing steps. First, synaptic potentials are computed as in equation 1.2 or 1.4 from the synaptic counter variables *M _{ab}* and thus correspond to the real-valued synaptic weights of linear or Bayesian learning. Second, the discrete synaptic weights *W _{ij}* are obtained by applying one or several synaptic thresholds to the synaptic potentials (see Figure 2).

This procedure is similar to the general learning rule proposed already by Steinbuch (1961, Figure 2c), where the synaptic potential corresponds to Steinbuch’s “Indiz” (indication). In contrast to Steinbuch’s sigmoid transfer functions, I consider optimal step functions to obtain discrete synaptic weights that maximize signal-to-noise ratio and storage capacity. The resulting learning rules generalize the original Steinbuch/Willshaw rule (see equation 1.3) to less sparse memory patterns and multiple-state synapses. The analysis reveals that the resulting storage capacities *M* and *C* are almost the same as for the optimal linear and Bayesian learning rules: more exactly, it turns out that the “zip factor” ζ, defined as the relative capacity of discrete synapses compared to continuous synapses, is close to one already for a small number of discrete states, for example, ζ = 0.64 for binary synapses, ζ = 0.88 for 2-bit (four-state) synapses, ζ = 0.96 for 3-bit (8-state) synapses, and ζ = 0.99 for 4-bit (16-state) synapses. Moreover, the analysis also provides optimal discretization parameters for implementations on digital computers or brainlike parallel hardware including structural plasticity. In particular, for low-entropy settings where most synapses share a single discrete weight (e.g., zero), the network becomes “compressible” such that computer implementations can store up to 1 bit per computer bit, and parallel hardware employing structural plasticity can store up to log *n* bits per synapse, similarly as shown for the Willshaw network (Knoblauch et al., 2010, 2014). For that property and for the sake of brevity, I will sometimes refer to the model as the zip net model. As a by-product, it is possible to derive precision requirements for synaptic weights that may also include biological constraints such as Dale’s law, sparse synaptic connectivity, and structural plasticity.

The letter is organized as follows. Section 2 describes the learning and retrieval procedure of the model. Section 3 analyzes the signal-to-noise ratio (SNR) for linear and Bayesian learning of the synaptic potentials and derives a general expression for the zip factor ζ. Section 4 computes various storage capacities (*M*, *C*, *C ^{I}*, *C ^{S}*). Section 5 maximizes ζ for various numbers of synaptic states *N* and implementation constraints and derives the corresponding optimal discretization parameters. Section 6 compares retrieval efficiency in terms of space, time, and energy requirements to previous models. Section 7 presents numerical simulations that verify the theory and provide a comparison to the previous models. Section 8 summarizes and discusses the main results of this work and, in particular, points to implications for memory theories that are based on structural plasticity and possible nanoscale hardware implementations. Finally, the appendixes compute the SNR and the zip factor for general distributions of synaptic potentials (appendix A), give formulas for gaussian tail integrals (appendix B) and optimal firing thresholds (appendix D), discuss the relation between the SNR and the Hamming distance–based output noise (appendix E), and give basic information-theoretic formulas (appendix F) and recommendations for an efficient implementation of networks of discrete multistate synapses (appendix G). Taxonomy and notations are as in Knoblauch et al. (2010) and Knoblauch (2011) whenever possible.

## 2 Network Model

### 2.1 Learning of Discrete Synaptic Weights

The zip net model employs *N*-state synapses whose weights *W _{ij}* can assume one out of *N* discrete values *s* _{1} < *s* _{2} < … < *s _{N}*. Specifically, *W _{ij}* is obtained from applying multiple synaptic thresholds to synaptic potentials (see Figure 2). The synaptic potentials depend on the set of memory patterns *u*^{μ} and *v*^{μ}, as well as on the employed learning method. For example, they may equal the real-valued synaptic weights of the linear or Bayesian learning methods (see equation 1.2 or 1.4). The model also includes for each synapse independent additive noise variables with zero means, standard deviations σ _{ε}, and given density functions, to account for random hardware variability and other noise effects (see sections 4.5 and A.1).

For the discretization, let the matrix load *P _{t}* denote the fraction of synapses that reach at least the *t*th weight value.^{1} Then let μ and σ be the mean and standard deviation of the synaptic potentials and *F* their standardized (complementary) distribution function. Finally, we choose synaptic thresholds θ _{t} = μ + σ *F* ^{−1}(*P _{t}*) to obtain the desired matrix loads, where *F* ^{−1} is the inverse function of *F*. Thus, the matrix loads are essentially equivalent to the synaptic thresholds and, together with the synaptic strengths *s _{t}*, fully specify the discretization procedure of equation 2.1. Actually, the following performance analyses become most concise and general if using matrix loads instead of synaptic thresholds.^{2}
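The fixed-threshold discretization can be sketched as follows, assuming gaussian synaptic potentials; the function names and the convention that the loads are given in decreasing order are my own choices:

```python
import numpy as np
from statistics import NormalDist

def fixed_thresholds(loads, mu=0.0, sigma=1.0):
    """Synaptic thresholds chosen so that a fraction loads[t] of the (assumed
    gaussian) potentials lies at or above the t-th threshold (analogous to
    equation 2.5); loads must be decreasing, e.g. [0.5, 0.1]."""
    return [NormalDist(mu, sigma).inv_cdf(1.0 - P) for P in loads]

def discretize(a, thresholds, strengths):
    """Map potentials a to N discrete strengths via the ordered thresholds
    (analogous to equation 2.1); strengths[0] is the default value assigned
    to sub-threshold potentials."""
    w = np.full_like(a, strengths[0], dtype=float)
    for theta, s in zip(thresholds, strengths[1:]):
        w[a >= theta] = s       # higher thresholds overwrite lower ones
    return w
```

Because the thresholds are quantiles of the potential distribution, specifying the matrix loads indeed fixes the thresholds, as stated above.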

### 2.2 Retrieval

Given a query pattern, the dendritic potential *x _{j}* of content neuron *j* is the weighted sum of the query inputs over the synapses *W _{ij}*. For binary memory patterns (as presumed by Figure 1), one-step retrieval yields the retrieval output simply by applying a vector of firing thresholds to the dendritic potentials (see appendix D). We can then evaluate the retrieval quality, for example, by computing the Hamming distance between the retrieval output and the original content pattern. To be independent of network size, a more convenient measure is the output noise, which normalizes the expected Hamming distance to the mean number of active units in a content pattern. Similarly, one can define query noise, where the normalizer is the mean activity in an address pattern. In recurrent autoassociative networks, iterative retrieval can be realized by feeding back the retrieval output to the input layer repeatedly (Hopfield, 1982; Schwenker, Sommer, & Palm, 1996).

For the analysis, it is useful to discern whether an active query neuron *i* transmits one of the *c* correct or one of the *f* false activations and whether content neuron *j* is a high or low unit (i.e., whether *v _{j}* should be active or inactive in the addressed memory). Correspondingly, we can divide the connection matrix into four relevant regions according to these four cases, with conditional matrix loads (see Figure 1, left).
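One-step retrieval and the output-noise measure described above can be sketched as follows (a minimal sketch; the function names are mine):

```python
import numpy as np

def one_step_retrieve(W, u_query, theta):
    """One-step retrieval: dendritic potentials x = u~ W, followed by a
    (possibly neuron-specific) firing threshold theta."""
    x = u_query @ W                 # dendritic potentials of the n content neurons
    return (x >= theta).astype(int)

def output_noise(v_hat, v):
    """Hamming distance between output and original content, normalized by
    the number of active content units (analogous to equation 2.9)."""
    return float(np.abs(v_hat - v).sum()) / max(float(v.sum()), 1.0)
```

For autoassociation, the output of `one_step_retrieve` can simply be fed back as the next query to realize iterative retrieval.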

### 2.3 Control of Synaptic Thresholds

There are several strategies to control the synaptic thresholds of a content neuron *j*. Here I consider the following two:

- *Fixed synaptic thresholds*. Synaptic thresholds may be precomputed and fixed based on an estimate of the distribution of synaptic potentials using equation 2.5 (as assumed in the SNR analysis of section 3.1).
- *Homeostatic control*. Synaptic thresholds may be adapted during online learning to realize a desired distribution of synaptic strengths.

While the former variant is easy to analyze (as done in section 3), the latter one turns out to minimize output noise (see equation 2.9) by realizing identical “false alarm” error probabilities among low units *v _{j}* (see the numerical experiments in section 7). In computer implementations, homeostatic control can easily be realized by choosing synaptic thresholds such that at any time, a content neuron *j* has exactly the desired number of synapses with each strength *s _{t}*. For example, we may sort the synaptic potentials for each neuron *j* in decreasing order, where *i* runs over a corresponding permutation of the address neurons, and then assign strength *s _{t}* to the appropriate top fraction of synapses.

This simple sorting presumes that the potentials of all synapses of a neuron follow the same distribution, which may be violated, for example, if activation probabilities differ across the address neurons *u _{i}*. Then sorting should be based on one of the following normalized synaptic potentials rather than on the raw potentials: standardization by the synapse-specific distribution functions (equation 2.12; e.g., as in equation B.2 for gaussian potentials), the corresponding inverse gaussian transform (equation 2.13), or the Bayesian normalization (equation 2.14), which involves the synaptic counter variables of section 1 (see equation 1.1) and estimates of the query noise. Here the first normalization variant, equation 2.12, has a probabilistic interpretation: the normalized potential is uniformly distributed between 0 and 1. Since the gaussian transform is monotonically increasing, sortings based on equation 2.13 yield results equivalent to sortings based on equation 2.12. Similarly, equation 2.14 is equivalent to sorting the synaptic weights of the Bayesian learning rule, equation 1.4, including query noise estimates (for details, see Knoblauch, 2011, equation 2.15). Note that any of the three normalization procedures guarantees that each column *j* of matrix *W* has exactly, *and* each row *i* approximately, the desired number of synapses with weight *s _{t}*.

There is actually evidence that the brain regulates the number of (potentiated) synapses per neuron in a similar way (Fares & Stepanyants, 2009; Knoblauch et al., 2014). For biological models or implementations on parallel hardware, the homeostatic control could be realized by regulating synaptic thresholds in order to achieve a desired mean neuron activity when stimulating with random inputs with defined activity statistics (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Turrigiano & Nelson, 2004; Turrigiano, 2007; Desai, Rutherford, & Turrigiano, 1999; Van Welie, Van Hooft, & Wadman, 2004). This is particularly obvious for binary synapses because, for inputs with a given mean activity, there is a unique relationship between the number of potentiated synapses and mean output activity. It would then be sufficient to increase (or decrease) synaptic thresholds if mean output activity is above (or below) the target activity (Knoblauch, 2009c, 2010b).^{3} Similar control mechanisms could also be realized for multistate synapses, possibly requiring a regulation based on state-specific expression factors.
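The sorting-based homeostatic control of this section can be sketched for the binary case (the function name and the binary restriction are my own simplification):

```python
import numpy as np

def homeostatic_binarize(A, load, strong=1.0, weak=0.0):
    """Homeostatic threshold control for binary synapses: each content neuron
    (column j of the potential matrix A) gets exactly round(load*m) strong
    synapses, namely those with the largest potentials A[:, j], realized here
    by sorting the potentials per column."""
    m, n = A.shape
    k = int(round(load * m))
    W = np.full((m, n), weak)
    top = np.argsort(-A, axis=0)[:k]   # per column: rows of the k largest potentials
    for j in range(n):
        W[top[:, j], j] = strong
    return W
```

By construction, every content neuron ends up with the same number of potentiated synapses, which is the property that equalizes the false-alarm probabilities among low units.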

## 3 Signal-to-Noise Ratio for Linear Learning of Synaptic Potentials

For linear learning, each synaptic potential is a sum of *M* learning increments (see equation 1.2). As illustrated by Figure 1 (right), learning of synaptic potentials is then determined by four learning parameters: *r* _{00}, *r* _{01}, *r* _{10}, and *r* _{11}.

The following subsections compute the SNR, equation 2.10, of a given content neuron *j* for linear learning, making a number of further assumptions:

- Components of address patterns are identically and independently distributed, where *p* is the probability of an active component.^{4}
- The query pattern is a noisy version of an address pattern having *c* correct and *f* false one-entries, similar to that illustrated by Figure 1 (left).
- Synaptic potentials follow a gaussian distribution, where *G ^{c}* is the complementary gaussian distribution function (see appendix B, equation B.5). This assumption becomes true due to the central limit theorem, at least for a large memory number *M* (where also the counters diverge for constant *p*, *q*).
- Let us finally assume the limit of large networks, *m*, *n* → ∞, whereas activity parameters *p*, *q*, learning parameters *r* _{00}, *r* _{01}, *r* _{10}, *r* _{11}, and discretization parameters are assumed to remain constant. (For an analysis of finite networks, see appendix A.)

The SNR analysis consists of three parts. Section 3.1 analyzes conditional and unconditional distributions of synaptic potentials for linear learning. Section 3.2 computes conditional matrix loads and derives distributions of dendritic potentials for low and high units in large networks. Finally, section 3.3 optimizes linear learning parameters to maximize the SNR.

### 3.1 Distribution of Synaptic Potentials

The unconditional distribution of synaptic potentials depends on the unit usage *M* _{1} (see equation 3.2): the mean and variance of the potentials follow from the expectations and variances of the learning increments, conditioned on the value *v _{j}* of the content pattern component. Assuming gaussian synaptic potentials (see equation B.5), the synaptic thresholds of equations 2.5 and 2.1 can be written via the inverse complementary gaussian distribution function (see equation B.7).

In general, the conditional distribution of a synaptic potential depends on the indices *i* and *j*. To analyze this dependency, let the query be a noisy version of the μth address pattern having *c* correct and *f* false one-entries (see Figure 1, left). Then for an active query component *i*, the corresponding distribution of the synaptic potential depends on whether the activity of neuron *i* is correct or false, as well as on whether neuron *j* is a high unit or a low unit. Thus, we have to discern four conditional distributions of synaptic potentials, where we can compute means and variances similar to equations 3.3 and 3.4, and where the indices *c* versus *f* and lo versus hi refer to the same regions of the weight matrix as illustrated in Figure 1 (left). The differences between conditional and unconditional mean values can then be expressed via parameter differences, for brevity. With a linear approximation, the differences between conditional and unconditional standard deviations follow, where the error bounds assume constant *p*, *q* but diverging *M*. Similar to equations 3.14 and 3.15, we can also use corresponding approximations for the conditional variances.

### 3.2 SNR for Large Networks

The SNR for finite networks with finite *m*, *n* and general distributions of synaptic potentials is analyzed in appendix A. Here we apply these results to compute the SNR for large networks with *m*, *n* → ∞ and linear learning of synaptic potentials as described in section 3.1. Thus, we can again assume a gaussian distribution of synaptic potentials (see equations 3.1, 2.3, and B.5).

For finite networks, the exact conditional matrix loads can be obtained from equation A.1 together with equations 3.9 to 3.15. Then the SNR, equation 2.10, can be computed, for example, from equations A.3 to A.7 in appendix A and, for large networks, from equation A.10. Alternatively, we can compute the SNR *R* directly from equation A.24 in section A.3. With equations 3.16 to 3.19, the contributions of correct and false activations in the query to *R* follow, and thus the squared SNR, where σ _{ε} ^{2} is the (average) variance of the discretization noise and σ _{W} ^{2} is the variance of the discrete strength values.

We can generalize this result to large diluted networks by assuming that the content neuron *j* is connected to only a fraction *P* of the *m* address neurons. Then the analysis so far remains valid if we replace *m* by *Pm* (i.e., the number of address neurons that are connected to neuron *j*) and generalize the definitions of *c* and *f* as the numbers of correctly and falsely active query neurons, respectively, that are connected to neuron *j*. Then, defining the fractions of correct and false positives in a query pattern accordingly, the squared SNR for large networks with linear learning of synaptic potentials finally writes as a product of four terms (equation 3.39). Here the first term, ζ ^{2}, is a function of the discretization parameters, and its upper bound will be shown in section 4.1. The second term depends on the learning parameters (and also on the fraction *q* of active neurons in a content pattern) and will be maximized in the following section, showing its upper bound. The third term describes the influence of the query noise parameters (and the fraction *p* of active neurons in an address pattern), where the maximum is obtained for zero query noise. The last term equals the SNR for suboptimal linear learning of synaptic weights employing the homosynaptic rule (see Figure 1; see Dayan & Willshaw, 1991, p. 259, rule R3), which is, for nonsparse address patterns, inferior to the maximum obtained for the covariance rule (see Figure 1; see Dayan & Willshaw, 1991, p. 259, rule R1, or Palm & Sommer, 1996, p. 95, equation 3.28).
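The SNR of this section can be checked numerically. The following Monte Carlo sketch stores random pattern pairs with the covariance rule, binarizes the potentials per content neuron at the median, and estimates the SNR between high and low units for a noiseless query (all sizes, the seed, and the matrix load 0.5 are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, M, p, q = 400, 400, 50, 0.1, 0.1
U = (rng.random((M, m)) < p).astype(float)   # M binary address patterns
V = (rng.random((M, n)) < q).astype(float)   # M binary content patterns
A = (U - p).T @ (V - q)                      # covariance-rule synaptic potentials
theta = np.median(A, axis=0)                 # one synaptic threshold per content neuron
W = (A >= theta).astype(float)               # binary weights, matrix load 0.5
x = U[0] @ W                                 # dendritic potentials for query u^1
hi, lo = x[V[0] == 1], x[V[0] == 0]          # potentials of high vs. low units
snr = (hi.mean() - lo.mean()) / lo.std()     # empirical signal-to-noise ratio
```

Even for this small network, the high units are clearly separated from the low units, consistent with the analysis.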

### 3.3 Optimal Linear Learning of Synaptic Potentials

This section optimizes the linear learning parameters to maximize the SNR *R*, equation 3.39. Writing the parameter differences in terms of a common scale factor, we have to maximize the corresponding term of equation 3.39 with respect to the learning parameters. Setting the derivative to zero yields the optimality criterion for linear learning rules, and inserting the optimal parameters yields the maximum, proving the upper bound of equation 3.37.

Note that the covariance rule is optimal because it satisfies the optimality criterion (see Figure 1). However, unlike for linear weight learning (Dayan & Willshaw, 1991), the covariance rule is not the unique optimum. Instead, for discrete weights, there is a three-dimensional subspace of optimal learning rules, including biologically more realistic ones than the covariance rule. For example, the homosynaptic rule is also optimal because it satisfies the criterion as well (see Figure 1). Other well-investigated learning rules, such as the heterosynaptic rule, the Hebbian rule, and the Hopfield rule, in general do not satisfy the optimality criterion (see Figure 1; see also Dayan & Willshaw, 1991, and Palm & Sommer, 1996).

## 4 Analysis of Storage Capacity

A reasonable performance measure for associative networks is storage capacity. There are actually various definitions of storage capacity that depend on the target platform used for implementing the networks. For example, section 4.1 computes the pattern capacity *M* _{ε}, defined as the absolute number of memories a network of a given size can store at output noise level ε (see equation 2.9). Similarly, section 4.2 computes the network capacity *C*, which is the stored information per synapse for networks with a fixed given structure. Section 4.3 computes the information capacity *C ^{I}*, which applies to compressed implementations on digital computers, measuring the stored information per computer bit. Finally, section 4.4 computes the synaptic capacity *C ^{S}*, defined as the stored information per synapse in structurally plastic networks. All of these capacity measures can be derived from the SNR *R* (see equations 2.10 and 3.39), making the same presumptions as described at the beginning of section 3. In addition, we have to assume that dendritic potentials have a gaussian distribution (see equation C.7).

Although we focus on one-step retrieval, the following analyses apply as well to iterative retrieval in recurrent autoassociative (Hopfield-type) networks with identical address and content layers, *u*^{μ} = *v*^{μ} and *m* = *n*. There, we can expect successful retrieval if the output noise is less than the query noise. Thus, to approximate the various storage capacities, we have to choose an output noise level ε. In particular, to approximate the maximal number of stable memory attractors, we can compute the pattern capacity for zero query noise and a fixed small output noise level ε.

### 4.1 Pattern Capacity

With the SNR *R* as computed in section 3.2 and appendix A being a function of the number *M* of stored memories, solving for *M* yields the pattern capacity *M* _{ε} via the inverse of that function. For example, for large networks with linear learning of synaptic potentials, equation 3.39 with the required SNR from equations E.5 and E.6 yields the pattern capacity of equation 4.3. We can draw three important conclusions.

First, ζ ≤ 1. To see this, note that for *N* → ∞, the right-most term in equation 4.4 (as well as the upper bound, equation 4.5) corresponds to the Gardner bound, which defines a general upper bound on *M* for any model of synaptic learning (Gardner, 1988, eq. 40). Since both of the other factors (see equations 3.37, 3.38, and 3.43) can reach their upper bounds, it necessarily follows that ζ ≤ 1 (whereas ζ > 1 would violate the Gardner bound).

Second, for diverging network size, any finite discretization procedure with *N* < ∞ decreases pattern capacity by at least the factor ζ compared to optimal learning of real-valued synaptic weights. This follows because equation 4.3 is a factor ζ worse than the pattern capacity for optimal Bayesian learning (Knoblauch, 2010a, 2011) or linear covariance learning (Palm & Sommer, 1996) of real-valued synaptic weights.

Third, reaching the full pattern capacity of optimal Bayesian or linear covariance learning requires that the computing precision diverges with the network size, *N* → ∞ as *n* → ∞. This is the reverse of the second conclusion, assuming that ζ → 1 is possible for *N* → ∞. The latter assumption is true because, for any given network, the discretization error (in dendritic potentials *x _{j}*) vanishes for diverging *N*, even for trivial discretization parameters (e.g., consider equidistant thresholds between the minimal and maximal real-valued synaptic weights of linear learning; see also equation 5.7 in section 5.1). In section 5, we will see that ζ ≈ 1 already for small *N*.

### 4.2 Network Capacity

The network capacity *C* is defined as the stored Shannon information per synapse, where *T* is the transinformation of equation F.5 with error probabilities *q* _{01} and *q* _{10} as in equations D.1 and D.2 (using, for example, optimal firing thresholds as in equation D.7 and the results of appendix E). For example, for large networks with linear learning of synaptic potentials, equation 4.3 yields the network capacity of equation 4.8, where *I* is the information of a binary random variable, equation F.1, and *T* simplifies as described at the end of appendix F. As for the pattern capacity, the network capacity for large networks with discrete weights is a factor ζ worse than for optimal Bayesian learning or optimal linear covariance learning of real-valued synaptic weights, where ζ quickly converges toward 1 even for small *N* (see section 5).

### 4.3 Information Capacity for Compressed Computer Implementations

After learning, the synaptic weights *W _{ij}* can be conceived as random variables taking value *s _{t}* with probability *p _{t}*, where the *p _{t}* sum to one. Thus, for implementations on digital computers, the weight matrix could be compressed, for example, by Huffman or Golomb coding (see appendix G; Huffman, 1952; Golomb, 1966; Cover & Thomas, 1991). Assuming optimal compression, the computer memory required to represent the weight matrix is then only *mn* · *I*(*p* _{1}, …, *p _{N}*) bits, where *I* is the Shannon information of an *N*-ary random variable (see equation F.2).^{6} This motivates the definition of information capacity *C ^{I}* (Knoblauch, 2003a; Knoblauch et al., 2010) as the stored Shannon information per computer memory bit. For example, for large networks with linear learning of synaptic potentials, equation 4.3 yields an expression similar to the network capacity *C*, except that the zip factor (see equation 3.36) is replaced by a counterpart normalized by the Shannon information *I*(*p* _{1}, …, *p _{N}*) of the weight distribution. Section 5 shows that optimal discretizations maximizing *C ^{I}* can achieve the theoretical bound of 1 bit per computer bit already for binary discretizations (*N* = 2).
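The compressed memory requirement per synapse is just the entropy of the weight distribution, which is easy to evaluate (a small sketch; function name and the example load are mine):

```python
import numpy as np

def entropy_bits(P):
    """Shannon information I(P) of an N-ary weight distribution, in bits
    (analogous to equation F.2)."""
    P = np.asarray(P, dtype=float)
    P = P[P > 0]                         # 0 * log(0) = 0 by convention
    return float(-(P * np.log2(P)).sum())

# e.g., binary synapses where only a fraction P1 carries the nonzero weight:
P1 = 0.05
bits_per_synapse = entropy_bits([1 - P1, P1])  # memory per synapse after optimal compression
```

For this sparse example, each synapse needs only about 0.29 bits after compression, which is what boosts the information capacity in low-entropy settings.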

### 4.4 Synaptic Capacity for Structural Plasticity

In brainlike hardware employing structural plasticity, the weight matrix cannot be compressed as on a digital computer, but pruning of irrelevant synapses can achieve a similar effect.^{7} In practice, given the discretization parameters, it is most effective to eliminate the synapses having the most frequent weight (e.g., the silent weight zero), such that only the remaining fraction of synapses is necessary to preserve network function.^{8} These arguments motivate the definition of *synaptic capacity* *C ^{S}* as the stored Shannon information per necessary (e.g., nonsilent) synapse, where the general upper bound is given by the amount of information that can be stored by structural plasticity in the location of the synapse within the network (by selecting one out of the possible locations within the weight matrix; see Knoblauch et al., 2010, 2014), plus the maximal information that can be stored in a given synapse by weight modification (which cannot exceed the Gardner bound of at most two bits per synapse, or actually only 0.72 bits per synapse for sparse activity; see Gardner, 1988; Knoblauch, 2011). For example, for large networks with linear learning of synaptic potentials, equation 4.3 yields an expression similar to the network capacity *C*, except that the zip factor (see equation 3.36) is replaced by a counterpart normalized by the fraction of necessary synapses, where *p* _{1} is as in equation 4.15 or as described in note 8. Section 5 shows that optimal discretizations maximizing *C ^{S}* can come close to the theoretical bound already for binary discretizations (*N* = 2).

### 4.5 Effect of Discretization Noise

If the *t*th discrete weight value is disturbed by noise with zero mean and given standard deviation, then the total discretization noise has standard deviation σ _{ε} as defined below equation 3.36, and each of the performance measures decreases by a factor depending on the ratio σ _{ε}/σ _{W}, where σ _{W} is the standard deviation of the discrete synaptic weight values as defined below equation 3.36. This means that the SNR and storage capacities decrease only moderately for discretization noise with σ _{ε} < σ _{W}, whereas effects are strong for σ _{ε} > σ _{W}; the precise decrease factors for particular noise levels follow directly from this expression.

## 5 Optimal Discretizations

The previous sections computed the SNR and various storage capacities for given discretization parameters , , and . This section optimizes and in order to maximize the different storage capacities , , , for linear learning of synaptic potentials. First, as a reference for comparison, section 5.1 computes capacities for naive discretization methods. Then section 5.2 maximizes in order to maximize pattern capacity and network capacity , which is most relevant for networks with a fixed given connectivity structure. Section 5.3 maximizes to maximize information capacity *C^{I}* relevant for compressed implementations on digital computers. Finally, section 5.4 maximizes to maximize synaptic capacity *C^{S}* relevant for brainlike hardware including structural plasticity.

Note that the SNR, and thus the storage capacities, are invariant to scaling and shifting of synaptic weights. Thus, we can assume, without loss of generality, that discrete synaptic strengths lie within a given interval, for example, or (where the latter satisfies Dale’s law that all synapses of a given neuron have the same polarity, either excitatory or inhibitory).

### 5.1 Naive Discretizations

The weights are further shifted and scaled (via *c* and *f*) such that the mean synaptic weight is zero and the maximal weight is one. The two resulting discretization procedures will be referred to in the following as and .

Both the SNR *R* and *R*^{2} are invariant to scaling and shifting synaptic weight values. Therefore, inserting (rather than normalized equation 5.4) into equation 3.36 and exploiting the gaussian mean yields , where is the variance of the synaptic strengths and is the variance of the discretization noise as defined below equation 3.36. The asymptotic result follows for infinite precision with because samples a gaussian with mean 0 and variance 1, and therefore also . Thus, for zero discretization noise (), it indeed follows that , proving that it is possible to reach the upper bound of equation 3.36.

### 5.2 Maximizing to Maximize *R*, , and

Maximizing the “zip factor” as defined in equation 3.36 maximizes the SNR *R* (see equation 3.39), pattern capacity (see equation 4.3), and network capacity (see equation 4.7). Without loss of generality, we can assume because the dependence of on discretization noise is considered sufficiently in section 4.5 (assume a given ratio ; see equations 3.36 and 4.21) and because is invariant to scaling and shifting of synaptic weights (see note 9).

#### 5.2.1 Binary Synapses

For *binary* synapses (), we can explicitly compute the optimal discretization parameters. Without loss of generality, we can assume , , and . Then equation 3.36 reads
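For standard gaussian weights the binary optimum can be checked directly. Writing the zip factor of a threshold θ as φ(θ)²/(Q(θ)(1 − Q(θ))), the squared correlation between a gaussian weight and its thresholded version (my own hedged derivation, consistent with the Table 1 values), a scan over θ locates the maximum 2/π ≈ 0.6366 at θ = 0, i.e., equal fractions p = (0.5, 0.5) of the two weight values:

```python
import math

def phi(x):  # standard gaussian density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Q(x):    # gaussian tail probability
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def zeta_binary(theta):
    """Squared correlation between a standard gaussian weight and its
    binary discretization at threshold theta (sketch derivation)."""
    q = Q(theta)
    return phi(theta) ** 2 / (q * (1.0 - q))

# grid search over theta in [-2, 2]
best_theta = max((i / 1000.0 - 2.0 for i in range(4001)), key=zeta_binary)
print(best_theta, zeta_binary(best_theta))
```

The maximum at θ = 0 means half the synapses take each of the two strength values, matching the first row of Table 1.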

#### 5.2.2 Ternary Synapses

For multistate synapses with more than two discrete weight values, , we can numerically maximize with respect to and . The following shows that for ternary synapses (), maximization of can be reduced to a two-dimensional optimization problem.

Thus, to compute optimal discretization parameters for ternary synapses, it is sufficient to maximize with respect to and , for example, and assuming , , , and as given by equation 5.12. Figure 3 illustrates optimal (see panel d) and the corresponding maximal network capacity (see panel c). As can be seen, is maximal for “symmetric” matrix loads and .
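The symmetric ternary optimum can be illustrated with a one-parameter scan (my own sketch, assuming standard gaussian weights and the squared-correlation form of the zip factor): with weight values ±s and 0 and threshold θ, the scale s cancels and ζ(θ) = 2φ(θ)²/Q(θ), which is maximized near θ ≈ 0.61 and reproduces the Table 1 values ζ ≈ 0.8098 and p ≈ (0.270, 0.459, 0.270):

```python
import math

def phi(x):  # standard gaussian density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Q(x):    # gaussian tail probability
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def zeta_ternary(theta):
    """Zip factor (squared weight correlation, sketch derivation) of a
    symmetric ternary quantizer: d = +s for w > theta, -s for w < -theta,
    0 otherwise; E[w d] = 2 s phi(theta), E[d^2] = 2 s^2 Q(theta)."""
    return 2.0 * phi(theta) ** 2 / Q(theta)

best_theta = max((i / 10000.0 for i in range(1, 30000)), key=zeta_ternary)
p_zero = 1.0 - 2.0 * Q(best_theta)   # fraction of zero (silent) weights
print(f"theta* = {best_theta:.4f}, zeta = {zeta_ternary(best_theta):.4f}, "
      f"p = ({Q(best_theta):.4f}, {p_zero:.4f}, {Q(best_theta):.4f})")
```

Setting the derivative to zero gives the condition 2θQ(θ) = φ(θ), whose solution θ* ≈ 0.612 yields the silent-weight fraction 0.4594 quoted in Table 1.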

#### 5.2.3 General Multistate Synapses

For general multistate synapses with more than three discrete weight values, , we can numerically maximize with respect to and , corresponding to a -dimensional optimization problem. Table 1 shows some optimal data obtained from maximizing equation 3.36 using the Matlab function *fminsearch* with initial discretization parameters from (see equations 5.3 and 5.4), where the resulting discrete strength values are scaled such that and . The resulting discrete weight distributions are bell shaped and resemble gaussians, in particular for large odd *N* (data not shown). For but not for , random initializations of the discretization parameters always yielded the same maxima as shown in the table. This indicates that there is a unique optimal set of discretization parameters for , whereas the data for may correspond to local maxima. Data for used an alternative initialization method described in a technical report (Knoblauch, 2009c) yielding better results than . Note the symmetry relations and for optimal discretizations. Most important, quickly converges toward 1. Networks of 2-bit synapses () can already store more than 88% of the information that gradual synapses can store when employing optimal Bayesian learning. Similarly, 3-bit synapses () reach more than 96%, and 4-bit synapses () more than 99%, of the maximal capacity.

| *N* | ζ | C | *p*_{t} for *t* = 1, …, *N* | *s*_{t} for *t* = 1, …, *N* |
|---|---|---|---|---|
| 2 | 0.6366 | 0.4592 | 0.5 0.5 | 1 −1 |
| 3 | 0.8098 | 0.5842 | 0.2703 0.4594 0.2703 | 1 0 −1 |
| 4 | 0.8825 | 0.6366 | 0.1631 0.3369 0.3369 0.1631 | 1 0.2998 −0.2998 −1 |
| 5 | 0.9201 | 0.6637 | 0.1067 0.2444 0.2978 0.2444 0.1067 | 1 0.4435 0 −0.4435 −1 |
| 6 | 0.9420 | 0.6795 | 0.0740 0.1810 0.2450 0.2450 0.1810 0.0740 | 1 0.5281 0.1678 −0.1678 −0.5281 −1 |
| 7 | 0.9560 | 0.6896 | 0.0536 0.1374 0.1986 0.2208 0.1986 0.1374 0.0536 | 1 0.5843 0.2757 0 −0.2757 −0.5843 −1 |
| 8 | 0.9655 | 0.6964 | 0.0402 0.1066 0.1615 0.1917 0.1917 0.1615 0.1066 0.0402 | 1 0.6245 0.3513 0.1139 −0.1139 −0.3513 −0.6245 −1 |
| 9 | 0.9721 | 0.7013 | 0.0310 0.0845 0.1323 0.1644 0.1756 0.1644 0.1323 0.0845 0.0310 | 1 0.6548 0.4075 0.1968 0 −0.1968 −0.4075 −0.6548 −1 |
| 10 | 0.9771 | 0.7048 | 0.0245 0.0681 0.1095 0.1407 0.1572 0.1572 0.1407 0.1095 0.0681 0.0245 | 1 0.6786 0.4511 0.2601 0.0851 −0.0851 −0.2601 −0.4511 −0.6786 −1 |
| 11 | 0.9808 | 0.7075 | 0.0198 0.0558 0.0916 0.1206 0.1393 0.1458 0.1393 0.1206 0.0916 0.0558 0.0198 | 1 0.6978 0.4860 0.3108 0.1515 0 −0.1515 −0.3108 −0.4860 −0.6978 −1 |
| 12 | 0.9837 | 0.7096 | 0.0162 0.0463 0.0773 0.1040 0.1231 0.1331 0.1331 0.1231 0.1040 0.0773 0.0463 0.0162 | 1 0.7137 0.5146 0.3509 0.2049 0.0674 −0.0674 −0.2049 −0.3509 −0.5146 −0.7137 −1 |
| 13 | 0.9859 | 0.7112 | 0.0134 0.0389 0.0659 0.0901 0.1088 0.1206 0.1246 0.1206 0.1088 0.0901 0.0659 0.0389 0.0134 | 1 0.7271 0.5386 0.3848 0.2488 0.1223 0 −0.1223 −0.2488 −0.3848 −0.5386 −0.7271 −1 |
| 14 | 0.9878 | 0.7125 | 0.0113 0.0331 0.0566 0.0783 0.0964 0.1088 0.1155 0.1155 0.1088 0.0964 0.0783 0.0566 0.0331 0.0113 | 1 0.7387 0.5591 0.4137 0.2861 0.1683 0.0556 −0.0556 −0.1683 −0.2861 −0.4137 −0.5591 −0.7387 −1 |
| 15 | 0.9893 | 0.7136 | 0.0095 0.0283 0.0490 0.0687 0.0855 0.0983 0.1062 0.1090 0.1062 0.0983 0.0855 0.0687 0.0490 0.0283 0.0095 | 1 0.7484 0.5767 0.4383 0.3175 0.2069 0.1022 0 −0.1022 −0.2069 −0.3175 −0.4383 −0.5767 −0.7484 −1 |
| 16 | 0.9904 | 0.7144 | 0.0079 0.0255 0.0385 0.0639 0.0774 0.0965 0.0854 0.1049 0.1049 0.0854 0.0965 0.0774 0.0639 0.0385 0.0255 0.0079 | 1 0.7519 0.5938 0.4616 0.3420 0.2328 0.1380 0.0497 −0.0497 −0.1380 −0.2328 −0.3420 −0.4616 −0.5938 −0.7519 −1 |
| 17 | 0.9915 | 0.7152 | 0.0071 0.0213 0.0375 0.0535 0.0681 0.0802 0.0892 0.0948 0.0966 0.0948 0.0892 0.0802 0.0681 0.0535 0.0375 0.0213 0.0071 | 1 0.7649 0.6057 0.4786 0.3688 0.2694 0.1765 0.0874 0 −0.0874 −0.1765 −0.2694 −0.3688 −0.4786 −0.6057 −0.7649 −1 |
| 18 | 0.9919 | 0.7155 | 0.0071 0.0214 0.0437 0.0551 0.0385 0.0815 0.0821 0.0839 0.0867 0.0867 0.0839 0.0821 0.0815 0.0385 0.0551 0.0437 0.0214 0.0071 | 1 0.7628 0.5911 0.4598 0.3770 0.2932 0.2013 0.1189 0.0387 −0.0387 −0.1189 −0.2013 −0.2932 −0.3770 −0.4598 −0.5911 −0.7628 −1 |
| 19 | 0.9932 | 0.7164 | 0.0054 0.0165 0.0294 0.0429 0.0547 0.0658 0.0749 0.0820 0.0855 0.0858 0.0855 0.0820 0.0749 0.0658 0.0547 0.0429 0.0294 0.0165 0.0054 | 1 0.7779 0.6286 0.5097 0.4082 0.3176 0.2333 0.1531 0.0756 0 −0.0756 −0.1531 −0.2333 −0.3176 −0.4082 −0.5097 −0.6286 −0.7779 −1 |
| 20 | 0.9936 | 0.7168 | 0.0050 0.0160 0.0287 0.0391 0.0497 0.0586 0.0681 0.0596 0.0787 0.0965 0.0965 0.0787 0.0596 0.0681 0.0586 0.0497 0.0391 0.0287 0.0160 0.0050 | 1 0.7726 0.6253 0.5106 0.4171 0.3330 0.2539 0.1851 0.1191 0.0417 −0.0417 −0.1191 −0.1851 −0.2539 −0.3330 −0.4171 −0.5106 −0.6253 −0.7726 −1 |
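The tabulated probabilities and strength values appear to coincide, after scaling the largest strength to 1, with minimum mean-square-error (Lloyd-Max) quantization of a standard gaussian. The following sketch is my own cross-check of this observation, assuming the zip factor equals one minus the quantization MSE at the Lloyd fixed point; it reproduces the first table rows:

```python
import math

def phi(x):  # standard gaussian density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):  # standard gaussian CDF (math.erf handles +/-inf)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def lloyd_max(N, iters=500):
    """Lloyd iteration for N-level quantization of a standard gaussian:
    thresholds midway between levels, levels = conditional means of their
    interval. Returns (zeta, probabilities, levels scaled to max 1)."""
    s = [-2.0 + 4.0 * i / (N - 1) for i in range(N)]   # initial levels
    for _ in range(iters):
        t = [-math.inf] + [(s[i] + s[i + 1]) / 2.0 for i in range(N - 1)] + [math.inf]
        s = [(phi(t[i]) - phi(t[i + 1])) / (Phi(t[i + 1]) - Phi(t[i]))
             for i in range(N)]
    t = [-math.inf] + [(s[i] + s[i + 1]) / 2.0 for i in range(N - 1)] + [math.inf]
    p = [Phi(t[i + 1]) - Phi(t[i]) for i in range(N)]
    zeta = sum(pi * si * si for pi, si in zip(p, s))   # 1 - MSE at fixed point
    s_scaled = [si / s[-1] for si in s]                # largest strength -> 1
    return zeta, p, s_scaled

for N in (2, 3, 4):
    zeta, p, s = lloyd_max(N)
    print(N, round(zeta, 4), [round(x, 4) for x in p], [round(x, 4) for x in s])
```

For N = 4, for instance, the iteration converges to probabilities (0.1631, 0.3369, 0.3369, 0.1631) and scaled strengths (−1, −0.2998, 0.2998, 1), matching the table row.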
