This letter describes a simple modification of the Oja learning rule, which asymptotically constrains the L1-norm of an input weight vector instead of the L2-norm as in the original rule. This constraining is local as opposed to commonly used instant normalizations, which require the knowledge of all input weights of a neuron to update each one of them individually. The proposed rule converges to a weight vector that is sparser (has more zero weights) than the vector learned by the original Oja rule with or without the zero bound, which could explain the developmental synaptic pruning.
The developing brain of humans and other animals undergoes a synaptic growth spurt in early childhood, followed by a massive synaptic pruning that removes about half of the synapses by adulthood (Huttenlocher, 1979; Rakic, Bourgeois, & Goldman-Rakic, 1994; Innocenti, 1995). This synaptic pruning results in a sparse connectivity between neurons; only a small fraction of all possible connections between neurons physically exists at both the global and local levels (Markram, Lübke, Frotscher, Roth, & Sakmann, 1997; Braitenberg & Schüz, 1998; Holmgren, Harkany, Svennenfors, & Zilberter, 2003). The functional role of the developmental synaptic pruning is not well understood, but theoretical studies of neural networks have shown that an elimination of “unimportant” synapses increases the memory capacity per synapse (Chechik, Meilijson, & Ruppin, 1998; Mimura, Kimoto, & Okada, 2003) and improves the generalization ability of a network (Le Cun, Denker, & Solla, 1990; Hassibi, Stork, & Wolff, 1993). Because synaptic activity is strongly correlated with energy consumption (Roland, 1993), synaptic pruning also improves the brain's energy efficiency. Formation and elimination of synapses (structural plasticity) continue in the adult brain and are believed to be necessary for memory formation and consolidation (Wolff, Laskawi, Spatz, & Missler, 1995; Chklovskii, Mel, & Svoboda, 2004; Butz, Wörgötter, & van Ooyen, 2009).
Experimental data show that the synaptic pruning is activity dependent and removes weaker synapses (Zhou, Homma, & Poo, 2004; Le Bé & Markram, 2006; Bastrikova, Gardner, Reece, Jeromin, & Dudek, 2008; Becker, Wierenga, Fonseca, Bonhoeffer, & Nagerl, 2008). This correlation between the efficacy of synapses and their removal suggests that the latter may be caused by competitive synaptic plasticity. Spike-timing-dependent plasticity (STDP) has been shown to cause a synaptic competition for the control of postsynaptic spike timing, which under certain conditions leads to a normalization of the total input strength and the output firing rate (Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Kempter, Gerstner, & van Hemmen, 2001). In some cases, this competition leads to synaptic pruning and sparse connectivity (Iglesias, Eriksson, Grize, Tomassini, & Villa, 2005). However, stable firing requires a fine tuning of the ratio of long-term potentiation (LTP) to long-term depression (LTD). Small changes in the ratio or relatively high presynaptic firing rates can destabilize the postsynaptic firing rate (Sjöström, Turrigiano, & Nelson, 2001; Tegnér & Kepecs, 2002; Gütig, Aharonov, Rotter, & Sompolinsky, 2003). Experimental evidence suggests that besides synaptic plasticity such as STDP, synapses also undergo homeostatic plasticity in the form of synaptic scaling that stabilizes the postsynaptic firing rate (reviewed by Watt & Desai, 2010). Several STDP models have been proposed to incorporate such homeostatic regulation (Tegnér & Kepecs, 2002; Benuskova & Abraham, 2007; Michler, Eckhorn, & Wachtler, 2009; Clopath, Busing, Vasilaki, & Gerstner, 2010), some of which use an explicit weight normalization (Sprekeler, Michaelis, & Wiskott, 2007).
Synapses can also compete for finite resources such as neurotrophic factors (Rasmussen & Willshaw, 1993; Ribchester & Barry, 1994; Miller, 1996; van Ooyen, 2001). Rate-based learning rules force such synaptic competition through either instant or asymptotic normalizations, which can be multiplicative or subtractive (reviewed by Miller & MacKay, 1994). The instant normalizations typically conserve the total strength of the input synapses of a neuron, which has a biological justification (Royer & Paré, 2003). However, these normalizations are nonlocal; they require the knowledge of all input weights of a neuron to update each one of them individually, which is not supported by experimental evidence. The rules with asymptotic normalizations can be local (Oja, 1982; Bienenstock, Cooper, & Munro, 1982) or nonlocal (Yuille, Kammen, & Cohen, 1989). Although local, the Oja (1982) rule asymptotically constrains the sum of squared input weights, which does not have a biological justification (Miller & MacKay, 1994). The BCM rule (Bienenstock et al., 1982) is also local. It asymptotically normalizes the average postsynaptic firing rate through a sliding postsynaptic threshold that depends on the recent history of the postsynaptic activation. However, the BCM rule is unable to prune inactive synapses because it requires nonzero presynaptic activity for LTD to take place.
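To make the distinction concrete, the following minimal NumPy sketch contrasts a plain Hebbian update with instant multiplicative and subtractive normalizations of the total weight sum; the function names and the target sum α = 1 are illustrative choices, not the letter's equations.

```python
import numpy as np

def hebb_step(w, x, eta=0.01):
    """Plain Hebbian update for a linear neuron y = w.x; the weights grow without bound."""
    y = w @ x
    return w + eta * y * x

def instant_multiplicative(w, alpha=1.0):
    """Instant multiplicative normalization: rescale all weights so that sum(w) = alpha.
    Nonlocal: updating any single weight requires the sum over all weights."""
    return w * alpha / np.sum(w)

def instant_subtractive(w, alpha=1.0):
    """Instant subtractive normalization: shift all weights by the same amount so that
    sum(w) = alpha.  Also nonlocal for the same reason."""
    return w - (np.sum(w) - alpha) / w.size

# Typical usage: a Hebbian step followed by an instant normalization.
w = np.array([0.4, 0.6])
w = instant_multiplicative(hebb_step(w, np.array([1.0, 0.2])), alpha=1.0)
```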
This letter describes a simple modification of the Oja rule, which asymptotically constrains the sum of weight magnitudes (the L1-norm) instead of the sum of squared weights (the L2-norm), as in the original rule. This constraining is local, implicit, and biologically justified (Royer & Paré, 2003). We first briefly review existing rate-based learning rules to show how they limit weight growth. Then we introduce our new rule and mathematically prove that it indeed constrains the L1-norm of an input weight vector at the equilibrium point. Next, we use a simplified geometric analysis of the rule to show that its asymptotic solution is sparse. We compare the proposed rule to other rate-based rules in learning the input weights of simple and complex cells in a V1 model. We show that while the Hebb rule with subtractive normalization converges to sparse weight vectors if the weights are bounded at zero, it fails to do so without such a zero bound. The Oja rule fails to converge to sparse weight vectors under any scenario. The proposed rule, on the other hand, converges to sparse weight vectors with or without the zero bound, which could explain the developmental synaptic pruning.
2.1. Brief Review of Rate-Based Learning Rules.
The above rules are typically applied to unsigned weights to obey Dale's principle (Dale, 1935), according to which connections from excitatory neurons must have positive weights and connections from inhibitory neurons must have negative weights. A zero bound prevents the weights from changing their sign. If a rule can segregate afferents, then the zero bound often leads to weight vectors with many zeros (sparse vectors). However, if weights are allowed to change their sign, then the above rules converge to weight vectors with just a few zeros (nonsparse vectors).
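In code, the zero bound is a one-sided clip applied after each weight update (a sketch; the function name is illustrative):

```python
import numpy as np

def zero_bound(w, excitatory=True):
    """Keep weights from changing sign (Dale's principle): excitatory weights are
    clipped at zero from below, inhibitory weights at zero from above."""
    return np.maximum(w, 0.0) if excitatory else np.minimum(w, 0.0)
```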
2.2. Proposed Modification of Oja Rule.
2.3. Simplified Geometric Analysis of Asymptotic Solutions.
The Hebb rule with subtractive normalization, equation 2.7, keeps the total sum of the weights constant: w1 + w2 = α. This constraint draws a 135° line passing through (α, 0) and (0, α) on the (w1, w2) plane, as illustrated in Figure 1B. Two constraints are shown. The solid line on the right is for nonnegative weights, and the solid line on the left is for signed weights. Small circles mark possible asymptotic solutions. Subtractive normalization is typically applied to nonnegative weights, in which case α > 0 and the weights are bounded at zero: w1,2 ≥ 0. The asymptotic solutions are (α, 0) and (0, α); both are sparse. If weights are allowed to change their sign, then the asymptotic solutions are unbounded unless bounds are forced. If the maximum weight magnitude is constrained at α and if w1 + w2 = 0, then the asymptotic solutions are (−α, α) and (α, −α); both are nonsparse.
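This conservation property can be checked directly. The sketch below assumes the standard form of the Hebb rule with subtractive normalization, in which the mean Hebbian change is subtracted from every weight (equation 2.7 is not reproduced in this section, so the exact published form may differ):

```python
import numpy as np

def hebb_subtractive_step(w, x, eta=0.01):
    """Hebb rule with subtractive normalization (standard form): the mean Hebbian
    change is subtracted from every weight, so sum(dw) = 0 and sum(w) is conserved."""
    y = w @ x                       # linear neuron output
    dw = eta * y * (x - x.mean())   # zero-sum update
    return w + dw

w = np.array([0.3, 0.7])            # alpha = w1 + w2 = 1
x = np.array([0.9, 0.1])
w = hebb_subtractive_step(w, x)
print(w.sum())                      # still 1.0: the weights move along the line w1 + w2 = alpha
```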
To a first-order approximation, the Oja rule in equation 2.8 can be broken into the Hebbian term (the first term in the parentheses of equation 2.8) and the constraint term (the second term in the parentheses of equation 2.8). The Hebbian term maximizes y assuming that the initial y is positive, and the second term imposes the asymptotic constraint w1² + w2² = α. This constraint draws a circle of radius √α on the (w1, w2) plane, as shown in Figure 1C. The asymptotic solution (w1(∞), w2(∞)) is found as the point at which the solution line in equation 2.16 is tangent to the circle. As can be seen in Figure 1C, it is impossible to get a sparse solution with the Oja rule unless x1 = 0 or x2 = 0. A more rigorous analysis shows that the Oja rule converges to the principal eigenvector of the data covariance matrix C with Cik = 〈xixk〉 (Oja, 1982).
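Numerically, the standard Oja rule (α = 1) illustrates both properties, the unit L2-norm and the convergence to the principal eigenvector; the covariance matrix below is an illustrative choice:

```python
import numpy as np

def oja_step(w, x, eta=0.01):
    """Oja rule (standard form, alpha = 1): the decay term -eta * y**2 * w
    asymptotically constrains the L2-norm of w to 1."""
    y = w @ x
    return w + eta * y * (x - y * w)

rng = np.random.default_rng(0)
C = np.array([[1.0, 0.6], [0.6, 0.5]])            # toy input covariance
X = rng.multivariate_normal([0.0, 0.0], C, size=20000)
w = rng.normal(size=2)
for x in X:
    w = oja_step(w, x)
print(np.linalg.norm(w))                          # ~1: the L2 constraint
print(w / np.linalg.norm(w))                      # ~principal eigenvector of C (up to sign)
```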
The proposed rule, equation 2.9, imposes the asymptotic constraint |w1| + |w2| = α. This constraint draws a rhombus with all sides equal to α√2 on the (w1, w2) plane, as shown in Figure 1D. The asymptotic solution (w1(∞), w2(∞)) is found by moving the solution line in equation 2.16 in the direction of increasing y until it touches the rhombus at just one point, which is always one of the vertices unless |x1| = |x2|. Therefore, for the vast majority of inputs, the proposed rule gives a sparse solution (one of the two weights is zero).
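Equation 2.9 itself is not reproduced in this excerpt. The sketch below therefore uses a plausible L1 analogue of the Oja decay term, replacing w_i by sign(w_i), which makes the averaged fixed point satisfy |w1| + |w2| = α; the published rule may differ in detail, but the qualitative behavior, settling near a vertex of the rhombus, is the one described here:

```python
import numpy as np

def l1_oja_step(w, x, eta=0.01, alpha=1.0):
    """Oja-type update with an L1 decay term (an assumed form, not necessarily the
    letter's equation 2.9): at the averaged fixed point, sum(|w|) = alpha."""
    y = w @ x
    return w + eta * y * (x - (y / alpha) * np.sign(w))

rng = np.random.default_rng(1)
C = np.array([[1.0, 0.6], [0.6, 0.5]])            # same toy covariance as above
X = rng.multivariate_normal([0.0, 0.0], C, size=20000)
w = rng.normal(size=2)
for x in X:
    w = l1_oja_step(w, x)
print(np.abs(w).sum())                            # ~1: the L1 constraint
print(w)                                          # typically one weight chatters near zero
```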
If a neuron has more than two weights, the rule in equation 2.9 should converge to a solution with only one nonzero weight of magnitude α. It may be desirable to have more than one nonzero element in an input weight vector. This can be achieved by bounding each weight magnitude at wmax < α, where α/wmax is the desired number of nonzero weights in w. As an example, consider a linear neuron with three inputs x1, x2, and x3 and their corresponding weights w1, w2, and w3. The proposed rule in equation 2.9 creates the octahedron constraint shape shown in Figure 2A. Triplets (w1, w2, w3) that satisfy y = w1x1 + w2x2 + w3x3 for given y, x1, x2, and x3 form a solution plane. This solution plane should intersect the constraint shape at only one point, which is one of the octahedron vertices lying on the axes. The asymptotic solution would always have only one nonzero weight: (±α, 0, 0), (0, ±α, 0), or (0, 0, ±α). Now we add a hard bound on each weight magnitude of wmax = α/2: |w1,2,3| ≤ α/2. This transforms the octahedron into the cuboctahedron shown in Figure 2B. The coordinates of the cuboctahedron vertices, and thus the possible solutions, have two nonzero weights: (±α/2, ±α/2, 0), (0, ±α/2, ±α/2), or (±α/2, 0, ±α/2). The choice of α is somewhat arbitrary. However, if all neural inputs and outputs in a network should be within the same bounds, then the appropriate value for α is 1. In this case, the proposed rule has only two parameters: the learning speed η and the upper bound wmax ≤ 1. The L0-constraint rule, equation 2.15, can also keep the network inputs and outputs within the same bounds if the maximum weight magnitude is limited to wmax = 1/α, where, in this case, α is the desired number of nonzero elements in each weight vector.
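The hard bound itself is a simple magnitude clip applied after every learning step; in the sketch below, α = 1 and wmax = 0.5 aim at two nonzero weights per vector (illustrative values):

```python
import numpy as np

def clip_magnitude(w, w_max):
    """Hard bound on each weight magnitude, |w_i| <= w_max.  Combined with the
    L1 target alpha, roughly alpha / w_max weights can remain nonzero
    (the cuboctahedron vertices of Figure 2B)."""
    return np.clip(w, -w_max, w_max)

alpha, w_max = 1.0, 0.5
w = np.array([0.8, 0.3, -0.1])
print(clip_magnitude(w, w_max))   # [ 0.5  0.3 -0.1]
```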
The proposed rule was compared against three other rate-based rules in learning the feedforward connection weights in a V1 neural network model. The network consists of four two-dimensional layers: photoreceptors, retinal ganglion cells (RGCs), V1 simple cells (S1s), and V1 complex cells (C1s). Cells of each layer collect feedforward inputs from localized retinotopically mapped regions of cells in the previous layer. The photoreceptors are mapped 1:1 to the pixels of an input image. Each photoreceptor codes the luminosity of the corresponding pixel in the range [−1, 1]. The photoreceptor outputs are fed to the RGCs through fixed-weight connections whose strengths follow a Laplacian-of-gaussian profile. The output of each RGC is calculated as a linear sum of the weighted inputs. It can be positive, negative, or zero. The RGC outputs are fed to the S1s through adaptive signed weights. In addition to these feedforward connections, the S1 layer also has lateral connections with a short-range excitation and a long-range inhibition, similar to the model of von der Malsburg (1973). These lateral connections help the S1s to self-organize into an orientation map with pinwheels and linear zones. Each S1 is modeled as a sum of weighted inputs passed through a half-wave rectifier, which preserves the positive part of the output and clips the negative part to zero. The S1 outputs are fed to the C1s through adaptive positive weights. The RGC-to-S1 and S1-to-C1 weights are trained incrementally by applying a visual stimulus to the photoreceptors, computing the outputs of the network layers in the order RGC-S1-C1, and then applying an unsupervised learning rule to the weights connecting these layers. This training step is repeated for each new visual stimulus, which can be a new image or a new patch within the same image. It should be noted that because the S1 layer uses reciprocal lateral connections, its outputs are computed in several iterations for each given visual stimulus, and the RGC-to-S1 weights are updated only after these iterations.
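The feedforward computations of this model can be summarized in a few lines. The sketch below omits the lateral S1 interactions and all learning updates; the kernel size, σ, and function names are illustrative assumptions rather than the letter's parameter values.

```python
import numpy as np

def laplacian_of_gaussian(size=9, sigma=1.5):
    """Fixed photoreceptor-to-RGC kernel (size and sigma are illustrative)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    r2 = xx**2 + yy**2
    log = (r2 - 2 * sigma**2) / sigma**4 * np.exp(-r2 / (2 * sigma**2))
    return log - log.mean()                     # zero mean: uniform input gives zero RGC output

def rgc_output(patch, kernel):
    """Linear RGC: weighted sum of photoreceptor outputs; positive, negative, or zero."""
    return np.sum(patch * kernel)

def s1_outputs(rgc, w_rgc_s1):
    """Simple cells: half-wave rectified weighted sums of RGC outputs."""
    return np.maximum(w_rgc_s1 @ rgc, 0.0)

def c1_output(s1, w_s1_c1):
    """Complex cell: weighted sum of the (nonnegative) simple-cell outputs."""
    return w_s1_c1 @ s1
```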
First, we compared the RGC-to-S1 weights trained by four rules: the Hebb rule with subtractive normalization, equation 2.7; the Oja rule, equation 2.8; the proposed rule, equation 2.9; and the modified Oja rule with the L0-norm constraint, equation 2.15. All four rules were supplemented with a [−wmax, wmax] weight bounding. Figure 3 shows examples of the emerged RGC-to-S1 weight matrices, in which the empty circles represent positive weights (the ON region) and the filled circles represent negative weights (the OFF region). The circle diameter is proportional to the weight magnitude. Very small or zero weights are shown as dots. Figure 4 shows the receptive fields corresponding to the RGC-to-S1 weight matrices in Figure 3. These receptive fields look very similar except for the one learned by the Hebb rule with subtractive normalization, which has elongated ON and OFF regions. Figure 5 shows the corresponding distributions of all RGC-to-S1 weights. As can be seen, the Hebb rule with subtractive normalization converges to weights of maximum magnitude. The Oja rule converges to graded weights, some of which have small but nonzero values. The proposed rule, equation 2.9, converges to a weight matrix with well-defined ON and OFF regions and many close-to-zero weights. The modified Oja rule with the L0-norm constraint, equation 2.15, fails to converge to a sparse weight matrix because of the division by wi, which makes small weights oscillate around 0. It is impossible to get exactly zero weights without the zero bound in any of these rules. Therefore, to assess the sparseness of the weight matrices, we counted the weights that are zero within a chosen rounding error. With a rounding error of 0.01wmax, approximately 54% of the RGC-to-S1 weights trained by the proposed rule, equation 2.9, are zero, whereas less than 3% of the weights trained by the other three rules are zero.
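The sparseness measure used here amounts to counting the weights that are zero within the chosen rounding error; a one-line sketch (the function name is illustrative):

```python
import numpy as np

def fraction_zero(w, w_max, tol=0.01):
    """Fraction of weights equal to zero within a rounding error of tol * w_max."""
    return np.mean(np.abs(w) <= tol * w_max)
```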
We also used the same four rules to train the S1-to-C1 weights, but with a [0, wmax] bounding. To make a fair comparison of these rules, the RGC-to-S1 weights were pretrained by the Oja rule and set to be the same in all four cases. Figure 6 shows a fragment of the S1 orientation map as an iso-orientation contour plot. The values over the contours represent the preferred orientations of the simple cells in degrees. The box outlines a region of 27 × 27 S1s, whose outputs are fed to the same complex cell. The circles inside the box indicate the connection strengths from these simple cells to the chosen complex cell: the larger the circle, the larger the weight. Very small or zero weights are shown as dots. Figure 7 shows the corresponding distributions of all S1-to-C1 weights. We can see that the Hebb rule with subtractive normalization, equation 2.7; the proposed rule, equation 2.9; and the modified Oja rule with the L0-norm constraint, equation 2.15, create similar sparse connectivity patterns between the pool of simple cells and the given complex cell, with strong connections originating from simple cells of similar orientations (within ±30° of the vertical orientation) and zero-strength connections from simple cells of other orientations. Such sparse connectivity between simple and complex cells leads to a high selectivity of the complex cells to stimulus orientation, which is consistent with experimental data (Henry, Dreher, & Bishop, 1974; Heggelund & Albus, 1978). The sparse S1-to-C1 connectivity in the case of the Hebb rule with subtractive normalization and the modified Oja rule with the L0-norm constraint emerged thanks to the clipping of negative weights to 0. The Oja rule creates connections of variable strength to all simple cells within the box, even to those with orthogonal orientations.
A simple rate-based learning rule was proposed that is similar in appearance to the Oja rule (Oja, 1982) but asymptotically constrains the L1-norm instead of the L2-norm of an input weight vector, which makes it more biologically realistic (Miller & MacKay, 1994; Royer & Paré, 2003). This constraining is local as opposed to commonly used instant normalizations (Miller & MacKay, 1994), which require the knowledge of all input weights of a neuron to update each one of them individually. The geometric analysis shows that the asymptotic solutions of the proposed rule are sparse; the rule converges to input weight vectors with many zeros, which may explain the developmental synaptic pruning (Huttenlocher, 1979; Rakic et al., 1994; Innocenti, 1995). The number of nonzero elements (sparseness) can be controlled by imposing a hard bound on the maximum magnitude of individual weights and choosing an appropriate asymptotic target for the L1-norm (the parameter α in equation 2.9).
The proposed rule was used to learn the input weights of simple and complex cells in a V1 model and was compared against three other rate-based rules: the Hebb rule with subtractive normalization, equation 2.7; the Oja rule, equation 2.8; and the modified Oja rule with the L0-norm constraint, equation 2.15. The Oja rule failed to converge to sparse input weight vectors for both simple and complex cells. The Hebb rule with subtractive normalization converged to sparse weight vectors only for complex cells, thanks to the zero bound that prevented the weights from becoming negative. It failed to converge to sparse weight vectors for simple cells because the RGC-to-S1 weights were allowed to change their sign. The modified Oja rule with the L0-norm constraint failed to converge to sparse RGC-to-S1 weight vectors because of the division by wi in equation 2.15, which made the weights overshoot zero in an oscillatory manner. But this rule converged to sparse S1-to-C1 weight vectors thanks to the zero bound. The proposed rule, equation 2.9, converged to sparse weight vectors for both simple and complex cells. Therefore, to achieve sparseness, the proposed rule does not require the zero bound, which makes this rule valuable not only for neural networks that comply with Dale's principle (Dale, 1935) but also for networks whose weights are allowed to change their sign during training. To our knowledge, the proposed rule is the only rate-based rule that yields sparse connectivity with and without the zero bound. This sparse connectivity may be viewed as a result of the synaptic competition caused by the L1-norm constraint. Besides this competition, the proposed rule also automatically regulates the postsynaptic firing rate by limiting it to the same range as the presynaptic firing rates (if α = 1). STDP rules also cause a synaptic competition, which may lead to synaptic pruning (Iglesias et al., 2005). But these rules require a fine balancing of the LTP/LTD ratio or a homeostatic regulation to stabilize the postsynaptic firing rate (Sjöström et al., 2001; Tegnér & Kepecs, 2002; Gütig et al., 2003; Benuskova & Abraham, 2007; Sprekeler et al., 2007; Michler, Eckhorn, & Wachtler, 2009; Watt & Desai, 2010; Clopath et al., 2010).
Another way to assess the sparseness of a V1 model is to count how many simple cells have nonzero outputs on average while encoding structured visual stimuli such as natural scenes (Field, 1987, 1994). Models based on this sparseness definition typically use linear encoding y = Wx, where x is a visual stimulus, y are the outputs of the encoders (simple cells), and the columns of W are encoding filters, which are equivalent to the receptive fields of the simple cells in the present work. The encoding filters can be found by maximizing the statistical independence or, equivalently, the sparseness of the encoder outputs y (Olshausen & Field, 1996; Bell & Sejnowski, 1997). This method performs independent component analysis (ICA) and yields localized and oriented encoding filters similar to the receptive fields of biological simple cells. Localization of the S1 filters in this letter is achieved by spatially constraining the RGC regions from which the S1s collect their inputs. Because ICA-based S1 models treat S1s as linear encoders, they cannot explain the shift invariance of C1s. It is possible to extend these models to C1s by maximizing the independence (sparseness) of local sums of squared S1 outputs, Σi yi², taken as the C1 responses (Hyvärinen & Hoyer, 2000). In this letter, the S1s are modeled as half-wave rectifiers to obtain the shift invariance of the C1s.
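For comparison, the two complex-cell pooling schemes mentioned above can be sketched as follows; W_pool and w_c1 are assumed weight arrays, not parameters taken from either published model.

```python
import numpy as np

def c1_energy_model(x, W_pool):
    """Complex-cell response as a local sum of squared linear simple-cell outputs
    (the ICA-style energy model of Hyvarinen & Hoyer, 2000)."""
    s1 = W_pool @ x                     # linear simple cells
    return np.sum(s1**2)

def c1_rectified_model(x, W_pool, w_c1):
    """Complex-cell pooling as used in this letter's V1 model (sketched): half-wave
    rectified simple cells pooled through nonnegative weights."""
    s1 = np.maximum(W_pool @ x, 0.0)    # half-wave rectified simple cells
    return w_c1 @ s1
```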
The author wishes to thank Subramaniam Venkatraman, QUALCOMM, Inc., for useful discussions about the complex cell modeling, and Krzys Wegrzyn, QUALCOMM, Inc., for his help in programming the V1 model.