Abstract

This letter describes a simple modification of the Oja learning rule, which asymptotically constrains the L1-norm of an input weight vector instead of the L2-norm as in the original rule. This constraining is local, in contrast to the commonly used instant normalizations, which require knowledge of all input weights of a neuron to update each one of them individually. The proposed rule converges to a weight vector that is sparser (has more zero weights) than the vector learned by the original Oja rule, and it does so with or without a zero bound on the weights, which could explain developmental synaptic pruning.

1.  Introduction

The developing brain of humans and other animals undergoes a synaptic growth spurt in early childhood, followed by a massive synaptic pruning that removes about half of the synapses by adulthood (Huttenlocher, 1979; Rakic, Bourgeois, & Goldman-Rakic, 1994; Innocenti, 1995). This synaptic pruning results in a sparse connectivity between neurons; only a small fraction of all possible connections between neurons physically exists at both the global and local levels (Markram, Lübke, Frotscher, Roth, & Sakmann, 1997; Braitenberg & Schüz, 1998; Holmgren, Harkany, Svennenfors, & Zilberter, 2003). The functional role of the developmental synaptic pruning is not well understood, but theoretical studies of neural networks have shown that an elimination of “unimportant” synapses increases the memory capacity per synapse (Chechik, Meilijson, & Ruppin, 1998; Mimura, Kimoto, & Okada, 2003) and improves the generalization ability of a network (Le Cun, Denker, & Solla, 1990; Hassibi, Stork, & Wolff, 1993). Because synaptic activity is strongly correlated with energy consumption (Roland, 1993), the synaptic pruning also improves the brain's energy efficiency. Formation and elimination of synapses (structural plasticity) continue in the adult brain and are believed to be necessary for memory formation and consolidation (Wolff, Laskawi, Spatz, & Missler, 1995; Chklovskii, Mel, & Svoboda, 2004; Butz, Wörgötter, & van Ooyen, 2009).

Experimental data show that the synaptic pruning is activity dependent and removes weaker synapses (Zhou, Homma, & Poo, 2004; Le Bé & Markram, 2006; Bastrikova, Gardner, Reece, Jeromin, & Dudek, 2008; Becker, Wierenga, Fonseca, Bonhoeffer, & Nagerl, 2008). This correlation between the efficacy of synapses and their removal suggests that the latter may be caused by a competitive synaptic plasticity. Spike-timing-dependent plasticity (STDP) has been shown to cause a synaptic competition for the control of postsynaptic spike timing, which under certain conditions leads to a normalization of the total input strength and the output firing rate (Song, Miller, & Abbott, 2000; van Rossum, Bi, & Turrigiano, 2000; Kempter, Gerstner, & van Hemmen, 2001). In some cases, this competition leads to a synaptic pruning and sparse connectivity (Iglesias, Eriksson, Grize, Tomassini, & Villa, 2005). However, stable firing requires a fine tuning of the ratio of long-term potentiation (LTP) to long-term depression (LTD). Small changes in the ratio or relatively high presynaptic firing rates can destabilize the postsynaptic firing rate (Sjöström, Turrigiano, & Nelson, 2001; Tegnér & Kepecs, 2002; Gütig, Aharonov, Rotter, & Sompolinsky, 2003). Experimental evidence suggests that besides synaptic plasticity such as STDP, synapses also undergo homeostatic plasticity in the form of synaptic scaling that stabilizes the postsynaptic firing rate (reviewed by Watt & Desai, 2010). Several STDP models have been proposed to incorporate such homeostatic regulation (Tegnér & Kepecs, 2002; Benuskova & Abraham, 2007; Michler, Eckhorn, & Wachtler, 2009; Clopath, Busing, Vasilaki, & Gerstner, 2010), some of which use an explicit weight normalization (Sprekeler, Michaelis, & Wiskott, 2007).

Synapses can also compete for finite resources such as neurotrophic factors (Rasmussen & Willshaw, 1993; Ribchester & Barry, 1994; Miller, 1996; van Ooyen, 2001). Rate-based learning rules force such synaptic competition through either instant or asymptotic normalizations, which can be multiplicative or subtractive (reviewed by Miller & MacKay, 1994). The instant normalizations typically conserve the total strength of the input synapses of a neuron, which has a biological justification (Royer & Paré, 2003). However, these normalizations are nonlocal; they require the knowledge of all input weights of a neuron to update each one of them individually, which is not supported by experimental evidence. The rules with asymptotic normalizations can be local (Oja, 1982; Bienenstock, Cooper, & Munro, 1982) or nonlocal (Yuille, Kammen, & Cohen, 1989). While being local, the Oja (1982) rule asymptotically constrains the sum of squared input weights, which does not have a biological justification (Miller & MacKay, 1994). The BCM rule (Bienenstock et al., 1982) is also local. It asymptotically normalizes the average postsynaptic firing rate through a sliding postsynaptic threshold that depends on the recent history of the postsynaptic activation. However, the BCM rule is unable to prune inactive synapses because it requires a nonzero presynaptic activity for LTD to take place.

This letter describes a simple modification of the Oja rule, which asymptotically constrains the sum of weight magnitudes (the L1-norm) instead of the sum of squared weights (the L2-norm), as in the original rule. This constraining is local, implicit, and biologically justified (Royer & Paré, 2003). We first briefly review existing rate-based learning rules to show how they limit weight growth. Then we introduce our new rule and mathematically prove that it indeed constrains the L1-norm of an input weight vector at the equilibrium point. Next, we use a simplified geometric analysis of the rule to show that its asymptotic solution is sparse. We compare the proposed rule to other rate-based rules in learning the input weights of simple and complex cells in a V1 model. We show that while the Hebb rule with subtractive normalization converges to sparse weight vectors if the weights are bounded at zero, it fails to do so without such zero bound. The Oja rule fails to converge to sparse weight vectors under any scenario. The proposed rule, on the other hand, converges to sparse weight vectors with or without the zero bound, which could explain the developmental synaptic pruning.

2.  Methods

2.1.  Brief Review of Rate-Based Learning Rules.

In 1949, Hebb postulated a general learning rule of synaptic weights (Hebb, 1949), which is expressed as
Δwi = η xi y,    (2.1)
where Δwi is the change in the ith synaptic weight wi, η is a learning speed, xi is the ith input (presynaptic activity), and y is the neuron output (postsynaptic activity). The rule, equation 2.1, causes unbounded weight growth and therefore is often complemented by an upper bound on the weight magnitude. It also fails to segregate positively correlated inputs such as visual stimuli from both eyes because of the lack of synaptic competition (all weights will reach the maximum magnitude).
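As a concrete illustration of the unbounded growth, the following sketch simulates a linear neuron trained with equation 2.1 on random inputs; the dimensionality, learning speed, and input statistics are arbitrary choices made only for this example.

```python
import numpy as np

# Toy simulation of the plain Hebb rule (equation 2.1) for a linear neuron.
# All parameter values and input statistics here are illustrative only.
rng = np.random.default_rng(0)
eta = 0.01                          # learning speed
w = rng.normal(scale=0.1, size=10)  # initial synaptic weights

for _ in range(5_000):
    x = rng.normal(size=10)         # presynaptic activities
    y = w @ x                       # postsynaptic activity of a linear neuron
    w += eta * x * y                # Hebbian update: delta w_i = eta * x_i * y

print(np.linalg.norm(w))            # keeps growing as training continues
```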
There have been several modifications to the Hebb rule to overcome its drawbacks. Grossberg (1968) added a passive weight decay term to restrict weight growth:
Δwi = η xi y − γ wi,    (2.2)
where γ is a decay speed between 0 and 1. This rule prunes connections with low activity and may prune all connections if γ is not chosen carefully. To circumvent this problem, Grossberg (1972, 1976a, 1976b) introduced the instar rule, in which the decay term is gated with the postsynaptic activity y:
Δwi = η y (xi − wi).    (2.3)
A similar rule was used in the self-organizing maps of Kohonen (1982). This rule converges to wi = xi, which is a nonsparse solution.
Sejnowski (1977a, 1977b) proposed the covariance rule, which removes the bias of the Hebb rule due to nonzero means of xi and y and, at the same time, adds the synaptic depression:
Δwi = η (xi − 〈xi〉)(y − 〈y〉),    (2.4)
where 〈xi〉 and 〈y〉 are the average pre- and postsynaptic activities, respectively. Just like the Hebb rule, rule 2.4 does not limit weight growth or force synaptic competition (Miller & MacKay, 1994; Miller, 1996).
To achieve a synaptic competition, Bienenstock et al. (1982) used a postsynaptic threshold that grows faster than linearly with the average postsynaptic activity. The resulting learning rule is called the BCM rule:
Δwi = η xi y (y − 〈y〉²/α),    (2.5)
where α is an asymptotic target for 〈y〉. As can be seen from equation 2.5, the BCM rule is unable to prune inactive synapses because it requires a nonzero presynaptic activity for LTD to take place.
To prevent unbounded weight growth, Rochester, Holland, Haibt, and Duda (1956) divided the weights by their sum to keep it constant:
wi(t) ← α wi(t) / ∑j wj(t),    (2.6)
where wi(t) = wi(t − 1) + η xi y is first computed by the Hebb rule and then rescaled, α is a target for ∑iwi(t), and t is the time index. This type of weight bounding is called multiplicative normalization. It has become very popular in computational neural models since von der Malsburg (1973). It has also been confirmed in experimental studies (Turrigiano, Leslie, Desai, & Nelson, 1998; Ibata, Sun, & Turrigiano, 2008), which showed a homeostatic synaptic scaling after a firing rate destabilization. In its original form, multiplicative normalization is applied to unsigned weights. However, it can be expanded to signed weights by changing the denominator in equation 2.6 to the L1-norm ∑j|wj(t)|. It can also be modified to limit the weight vector length (the L2-norm) by changing the denominator to √(∑j wj(t)²) (Barrow, 1987; Obermayer, Ritter, & Schulten, 1990). Because the weights in equation 2.6 are trained by the Hebb rule and then scaled by a common factor, both equations 2.1 and 2.6 converge to the weight vectors pointing in the same direction but having different lengths. Just like the original Hebb rule, rule 2.6 fails to segregate positively correlated inputs (Miller & MacKay, 1994).
One can also subtract an equal amount from each weight after they are modified by equation 2.1 with the amount chosen so that the total sum of the weights remains constant,
Δwi = η y (xi − (1/N) ∑k xk),    (2.7)
where N is the number of inputs. This type of weight bounding is called subtractive normalization. It also received a biological justification (Royer & Paré, 2003) and was shown to segregate positively correlated inputs (Miller & MacKay, 1994). Subtractive normalization is typically applied to unsigned weights and thus requires a lower bound of zero to prevent weights from changing their sign. With this bound, all input weights of a neuron trained by equation 2.7 asymptotically converge to zero except one (Miller & MacKay, 1994). To prevent a single nonzero weight, an upper bound on the weight magnitude is also typically imposed. The described multiplicative and subtractive normalizations have the advantage of instantly normalizing weights at each learning step. But they are nonlocal: they require the knowledge of all input weights or inputs of a neuron to compute each weight individually.
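For comparison with the local rules discussed next, the sketch below spells out both instant normalizations as post-update steps; the function names and parameter defaults are ours, chosen only to illustrate equations 2.6 and 2.7.

```python
import numpy as np

def hebb_multiplicative(w, x, y, eta=0.01, alpha=1.0):
    """Equation 2.6 (illustrative): Hebbian step, then rescale all weights by a
    common factor so that their sum equals alpha (assumes unsigned weights)."""
    w = w + eta * x * y
    return alpha * w / np.sum(w)        # nonlocal: needs the sum over all weights

def hebb_subtractive(w, x, y, eta=0.01, zero_bound=True):
    """Equation 2.7 (illustrative): Hebbian step minus the mean Hebbian update,
    which keeps the sum of the weights constant."""
    w = w + eta * y * (x - np.mean(x))  # nonlocal: needs the mean over all inputs
    if zero_bound:
        w = np.maximum(w, 0.0)          # typical zero bound for unsigned weights
    return w
```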
Oja (1982) proposed a local learning rule that asymptotically constrains the L2-norm of an input weight vector,
Δwi = η y (xi − y wi/α),    (2.8)
where α is a target for ∑iwi². While this rule creates a competition between synapses for limited resources, modeling these resources as a sum of squared weights is not biologically justified (Miller & MacKay, 1994).
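In code, the Oja update is local: each weight change uses only the corresponding input, the output, and that weight. The sketch below is our reading of equation 2.8 with an explicit target α for the squared L2-norm (α = 1 gives the standard rule).

```python
import numpy as np

def oja_update(w, x, eta=0.01, alpha=1.0):
    """Equation 2.8 (illustrative): asymptotically constrains sum(w_i**2) to alpha."""
    y = w @ x
    return w + eta * y * (x - y * w / alpha)   # local: uses only x_i, y, and w_i
```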

The above rules are typically applied to unsigned weights to obey Dale's principle (Dale, 1935), according to which connections from excitatory neurons must have positive weights and connections from inhibitory neurons must have negative weights. A zero bound is used to prevent weights from changing their sign. If a rule can segregate afferents, then the zero bound often leads to weight vectors with many zeros (sparse vectors). However, if weights are allowed to change their sign, then the above rules converge to weight vectors with just a few zeros (nonsparse vectors).

2.2.  Proposed Modification of Oja Rule.

The proposed modification of the Oja rule is
Δwi = η y (xi − y sgn(wi)/α),    (2.9)
where Δwi is the change in the ith synaptic weight wi, η is a learning speed, xi is the ith input (presynaptic activity), y is the neuron output (postsynaptic activity), α is a target for ∑i|wi|, and sgn() is the sign function. The only difference of this rule from the original Oja rule, equation 2.8, is that wi is replaced by sgn(wi) in the parentheses on the right.
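A minimal sketch of the proposed update in the same linear-neuron setting, with the function name and defaults chosen only for illustration; it differs from an Oja-style update solely in the sign function:

```python
import numpy as np

def l1_oja_update(w, x, eta=0.01, alpha=1.0):
    """Equation 2.9 (illustrative sketch): replacing w_i with sign(w_i) in the
    Oja update makes the rule constrain sum(|w_i|) to alpha instead of sum(w_i**2)."""
    y = w @ x
    return w + eta * y * (x - y * np.sign(w) / alpha)   # still local
```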
To prove that the proposed rule in equation 2.9 asymptotically constrains ∑i|wi| to α, we consider a linear neuron whose output y is computed as the weighted sum of the inputs:
y = ∑i wi xi.    (2.10)
Substituting equation 2.10 into 2.9 and taking the time average 〈·〉 of the result under the assumption that the weights change much more slowly than the inputs, we get
〈Δwi〉 = η (∑k 〈xixk〉 wk − (sgn(wi)/α) ∑j ∑k wj wk 〈xjxk〉),    (2.11)
or, in matrix form,
〈Δw〉 = η (Cw − (1/α)(wᵀCw) sgn(w)),    (2.12)
where w is the input weight vector, T in the superscript means transpose, C with Cik = 〈xixk〉 is the correlation matrix of the inputs, and sgn(w) is the element-wise sign of w.
At the equilibrium point, the average weight change should be zero:
Cw − (1/α)(wᵀCw) sgn(w) = 0.    (2.13)
After multiplying both sides of equation 2.13 by wᵀ from the left, dividing the resulting equation by the scalar wᵀCw, and rearranging the terms, we get
∑i|wi| = α,    (2.14)
that is, the L1-norm of the weight vector w is equal to α at the equilibrium point.
In a similar manner, it can be proved that the following rule constrains the L0-norm of the weight vector at the equilibrium point:
Δwi = η y (xi − y/(α wi)),    (2.15)
where α is an asymptotic target for the count of nonzero elements in w. Because of the division by wi, the rule in equation 2.15 creates large Δwi updates when wi is close to 0, making it oscillate around 0 and never reach it unless the zero bound is used. The rule in equation 2.9 does not show such behavior and converges to a sparse w with or without the zero bound, as will be shown in section 3.
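As a toy numerical check of the equilibrium property derived above, the sketch below trains a linear neuron with equation 2.9 on synthetic correlated inputs (not the V1 model of section 3); the L1-norm of w is expected to settle near α, with most of the weight budget concentrated on a few weights.

```python
import numpy as np

# Toy check of the L1-norm equilibrium derived above. The mixing matrix,
# dimensionality, and learning speed are arbitrary; this is not the V1 model
# of section 3.
rng = np.random.default_rng(1)
n, eta, alpha = 20, 0.002, 1.0
mix = rng.normal(size=(n, n)) / np.sqrt(n)   # fixed mixing -> correlated inputs
w = rng.normal(scale=0.1, size=n)

for _ in range(200_000):
    x = mix @ rng.normal(size=n)
    y = w @ x
    w += eta * y * (x - y * np.sign(w) / alpha)   # proposed rule, equation 2.9

print(np.sum(np.abs(w)))                       # expected to settle near alpha
print(np.round(np.sort(np.abs(w))[::-1], 2))   # budget concentrated on few weights
```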

2.3.  Simplified Geometric Analysis of Asymptotic Solutions.

As an example, consider a linear neuron with two inputs x1 and x2 and the corresponding weights w1 and w2. The neuron output is given by
y = w1 x1 + w2 x2.    (2.16)
To simplify our geometric analysis, x1 and x2 are assumed constant. The Hebb rule can be viewed as an optimization step in the direction of the gradient of an objective function E:
Δwi = η ∂E/∂wi.    (2.17)
It is easy to show that E = y²/2; that is, the Hebb rule maximizes the neuron energy, hence the unbounded weight growth. There are two possible gradient ascent paths: along the left (y < 0) and right (y > 0) sides of the parabola y²/2, depending on the initial value of y. For simplicity, we assume that this initial value is positive, so that the learning rule in equation 2.17 moves along the right side of the parabola y²/2. In this case, the maximization of y²/2 is equivalent to the maximization of y. To prevent an unbounded weight growth, we impose a constraint on the weight magnitudes: |w1,2| ≤ α. This constraint draws a square on the (w1, w2) plane as shown in Figure 1A. The straight dashed line draws all possible (w1, w2) solutions for given y, x1, and x2. The slope of this line is determined by −x1/x2, and its position relative to the origin is determined by y/x2. Maximization of y moves this line away from the origin (up if x2 > 0 or down if x2 < 0). The asymptotic solution (w1(∞), w2(∞)) is found by moving this line in the direction of increasing y until it touches the square at just one point, which is always one of the corners unless x1 = 0 or x2 = 0. As can be seen in Figure 1A, for the vast majority of inputs, the Hebb rule with these bounds leads to a solution in which all weights have the maximum magnitude: |w1| = |w2| = α.
Figure 1:

Simplified geometric analysis of asymptotic solutions for a neuron with two inputs. The dashed line shows (w1, w2) pairs that satisfy equation 2.16 for given y, x1, and x2. Maximization of y moves this line away from the origin. The solid line shows the constraint on w1 and w2. The asymptotic solution (w1(∞), w2(∞)) is found as the point at which the solution line, equation 2.16, is tangent to the constraint shape. (A) Hebb rule with weight bounding. (B) Hebb rule with subtractive normalization. (C) Oja rule. (D) Proposed rule 2.9.

The Hebb rule with subtractive normalization, equation 2.7, keeps the total sum of the weights constant: w1 + w2 = α. This constraint draws a 135° line passing through (α, 0) and (0, α) on the (w1, w2) plane, as illustrated in Figure 1B. Two constraints are shown: the solid line on the right is for nonnegative weights, and the solid line on the left is for signed weights. Small circles mark possible asymptotic solutions. Subtractive normalization is typically applied to nonnegative weights, in which case α > 0 and the weights are bounded at zero: w1,2 ≥ 0. The asymptotic solutions are (α, 0) and (0, α); both are sparse. If weights are allowed to change their sign, then the asymptotic solutions are unbounded unless bounds are imposed. If the maximum weight magnitude is bounded at α and w1 + w2 = 0, then the asymptotic solutions are (−α, α) and (α, −α); both are nonsparse.

To the first-order approximation, the Oja rule in equation 2.8 can be broken into the Hebbian term (the first term in the parentheses of equation 2.8) and the constraint term (the second term in the parentheses of equation 2.8). The Hebbian term maximizes y assuming that the initial y is positive, and the second term imposes the asymptotic constraint w1² + w2² = α. This constraint draws a circle of radius √α on the (w1, w2) plane, as shown in Figure 1C. The asymptotic solution (w1(∞), w2(∞)) is found as the point at which the solution line in equation 2.16 is tangent to the circle. As can be seen in Figure 1C, it is impossible to get a sparse solution with the Oja rule unless x1 = 0 or x2 = 0. A more rigorous analysis shows that the Oja rule converges to the principal eigenvector of the input correlation matrix C with Cik = 〈xixk〉 (Oja, 1982).

The proposed rule, equation 2.9, imposes the asymptotic constraint |w1| + |w2| = α. This constraint draws a rhombus with all sides equal to α√2 on the (w1, w2) plane, as shown in Figure 1D. The asymptotic solution (w1(∞), w2(∞)) is found by moving the solution line in equation 2.16 in the direction of increasing y until it touches the rhombus at just one point, which is always one of the vertices unless |x1| = |x2|. Therefore, for the vast majority of inputs, the proposed rule gives a sparse solution (one of the two weights is zero).

If a neuron has more than two weights, the rule in equation 2.9 should converge to a solution with only one nonzero weight of magnitude α. It may be desirable to have more than one nonzero element in an input weight vector. This can be achieved by bounding each weight magnitude at wmax < α, where α/wmax is the desired number of nonzero weights in w. As an example, consider a linear neuron with three inputs x1, x2, and x3 and their corresponding weights w1, w2, and w3. The proposed rule in equation 2.9 creates the octahedron constraint shape shown in Figure 2A. Triplets (w1, w2, w3) that satisfy y = w1x1 + w2x2 + w3x3 for given y, x1, x2, and x3 form a solution plane. This solution plane should touch the constraint shape at only one point, which is one of the octahedron vertices lying on the axes. The asymptotic solution would always have only one nonzero weight: (±α, 0, 0), (0, ±α, 0), or (0, 0, ±α). Now we add a hard bound on each weight magnitude of wmax = α/2: |w1,2,3| ≤ α/2. This transforms the octahedron into the cuboctahedron shown in Figure 2B. The coordinates of the cuboctahedron vertices, and thus the possible solutions, have two nonzero weights: (±α/2, ±α/2, 0), (0, ±α/2, ±α/2), or (±α/2, 0, ±α/2). The choice of α is somewhat arbitrary. However, if all neural inputs and outputs in a network should be within the same bounds, then the appropriate value for α is 1. In this case, the proposed rule has only two parameters: the learning speed η and the upper bound wmax ≤ 1. The L0-constraint rule, equation 2.15, can also keep the network inputs and outputs within the same bounds if the maximum weight magnitude is limited to wmax = 1/α, where, in this case, α is the desired number of nonzero elements in each weight vector.
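The following sketch illustrates this control of sparseness on a three-input toy example; the inputs are driven mostly by one shared source (so they are strongly correlated, as in the geometric picture above), and the gain values, learning speed, and iteration count are arbitrary.

```python
import numpy as np

# Illustrative three-input example: proposed rule (equation 2.9) with a hard
# bound wmax = alpha / 2, which should leave roughly alpha / wmax = 2 weights
# carrying the L1 budget while the remaining weight stays near zero.
rng = np.random.default_rng(2)
eta, alpha, wmax = 0.002, 1.0, 0.5
gain = np.array([1.0, 0.8, 0.6])        # per-input coupling to a shared source
w = rng.normal(scale=0.1, size=3)

for _ in range(200_000):
    x = gain * rng.normal() + 0.1 * rng.normal(size=3)   # correlated inputs
    y = w @ x
    w += eta * y * (x - y * np.sign(w) / alpha)          # proposed rule
    w = np.clip(w, -wmax, wmax)                          # hard magnitude bound

print(np.round(w, 2))   # expected: two weights well away from zero, one near zero
```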

Figure 2:

Simplified geometric analysis of asymptotic solutions for a neuron with three inputs. The constraint shape is centered at the origin. The plane shows (w1, w2, w3) triplets that satisfy y = w1x1 + w2x2 + w3x3 for given y, x1, x2, and x3. Maximization of y moves this plane away from the origin. The asymptotic solution (w1(∞), w2(∞), w3(∞)) is found as the point at which the solution plane is tangent to the constraint shape. (A) Proposed rule 2.9 without additional weight bounding. (B) Proposed rule 2.9 with magnitudes of all weights bounded at α/2.

3.  Results

The proposed rule was compared against three other rate-based rules in learning the feedforward connection weights in a V1 neural network model. The network consists of four two-dimensional layers: photoreceptors, retinal ganglion cells (RGCs), V1 simple cells (S1s), and V1 complex cells (C1s). Cells of each layer collect feedforward inputs from localized retinotopically mapped regions of cells in the previous layer. The photoreceptors are mapped 1:1 to the pixels of an input image. Each photoreceptor codes the luminosity of the corresponding pixel in the range [− 1, 1]. The photoreceptor outputs are fed to the RGCs through fixed-weight connections whose strength is set by the Laplacian of gaussian. The output of each RGC is calculated as a linear sum of the weighted inputs. It can be positive, negative, or zero. The RGC outputs are fed to the S1s through adaptive signed weights. In addition to these feedforward connections, the S1 layer also has lateral connections with a short-range excitation and a long-range inhibition, similar to the model of von der Malsburg (1973). These lateral connections help the S1s to self-organize into an orientation map with pinwheels and linear zones. Each S1 is modeled as a sum of weighted inputs passed through a half-wave rectifier, which preserves the positive part of the output and clips the negative part to zero. The S1 outputs are fed to the C1s through adaptive positive weights. The RGC-to-S1 and S1-to-C1 weights are trained incrementally by applying a visual stimulus to the photoreceptors, computing the outputs of the network layers in the order RGC-S1-C1, and then applying an unsupervised learning rule to the weights connecting these layers. This training step is repeated for each new visual stimulus, which can be a new image or a new patch within the same image. It should be noted that because the S1 layer uses the reciprocal lateral connections, its outputs are computed in several iterations for each given visual stimulus, and the RGC-to-S1 weights are updated only after these iterations.
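To make the data flow concrete, here is a schematic sketch of the feedforward pass only (photoreceptors to RGCs to S1s to C1s). It omits the lateral S1 connections, the iterative S1 settling, and the training loop, and the function names, layer sizes, σ, and patch size are placeholders rather than the exact values used in this letter.

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def rgc_layer(image, sigma=6.0):
    """Fixed center-surround stage: linear Laplacian-of-Gaussian filtering of the
    photoreceptor layer (pixel luminosities assumed to be in [-1, 1])."""
    return gaussian_laplace(image.astype(float), sigma=sigma)

def s1_outputs(rgc, s1_weights, positions, patch=27):
    """Each S1 sees one patch x patch RGC region (top-left corner given in
    `positions`) through signed adaptive weights, then half-wave rectifies."""
    out = np.zeros(len(positions))
    for i, (r, c) in enumerate(positions):
        region = rgc[r:r + patch, c:c + patch].ravel()
        out[i] = max(0.0, float(s1_weights[i] @ region))
    return out

def c1_outputs(s1, c1_weights):
    """Each C1 pools a local group of S1s through adaptive positive weights."""
    return c1_weights @ s1
```

In the full model, the adaptive RGC-to-S1 and S1-to-C1 weights would be updated by one of the learning rules above after each stimulus; only the forward computation is sketched here.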

First, we compared the RGC-to-S1 weights trained by four rules: the Hebb rule with subtractive normalization, equation 2.7; the Oja rule, equation 2.8; the proposed rule, equation 2.9; and the modified Oja rule with the L0-norm constraint, equation 2.15. All four rules were supplemented with a [− wmax, wmax] weight bounding. Figure 3 shows examples of the resulting RGC-to-S1 weight matrices, in which the empty circles represent positive weights (the ON region) and the filled circles represent negative weights (the OFF region). The circle diameter is proportional to the weight magnitude. Very small or zero weights are shown as dots. Figure 4 shows the receptive fields corresponding to the RGC-to-S1 weight matrices in Figure 3. These receptive fields look very similar except for the one learned by the Hebb rule with subtractive normalization, which has elongated ON and OFF regions. Figure 5 shows the corresponding distributions of all RGC-to-S1 weights. As can be seen, the Hebb rule with subtractive normalization converges to weights of maximum magnitude. The Oja rule converges to graded weights, some of which have small but nonzero values. The proposed rule, equation 2.9, converges to a weight matrix with well-defined ON and OFF regions and many close-to-zero weights. The modified Oja rule with the L0-norm constraint, equation 2.15, fails to converge to a sparse weight matrix because of the division by wi, which makes small weights oscillate around 0. It is impossible to get exactly zero weights without the zero bound in any of these rules. Therefore, to assess the sparseness of the weight matrices, we counted the weights that are zero within a chosen rounding error. With the rounding error of 0.01wmax, approximately 54% of RGC-to-S1 weights trained by the proposed rule, equation 2.9, are zero, whereas less than 3% of weights trained by the other three rules are zero.
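The sparseness measure used here is simple to state in code; the function name and interface below are ours.

```python
import numpy as np

def fraction_zero(weights, wmax, tol=0.01):
    """Fraction of weights whose magnitude is zero within the rounding error
    tol * wmax (0.01 * wmax in the comparison above)."""
    return np.mean(np.abs(np.asarray(weights)) < tol * wmax)
```

Applied to all RGC-to-S1 weights, this measure gives roughly 0.54 for the proposed rule and below 0.03 for the other three rules, as reported above.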

Figure 3:

Afferent weight matrix of a simple cell. Each S1 collects feedforward input from a square region of 27 × 27 RGCs. The empty circles represent positive weights (the ON region). The filled circles represent negative weights (the OFF region). The circle diameter is proportional to the weight magnitude. Very small or zero weights are shown as dots. The weights are trained by four different rules with a [− wmax, wmax] bounding. (A) Hebb rule with subtractive normalization. (B) Oja rule. (C) Proposed rule 2.9. (D) Modified Oja rule with L0-norm constraint, equation 2.15.

Figure 4:

Simple-cell receptive fields calculated from the RGC-to-S1 weight matrices in Figure 3. These receptive fields account for the RGC input weights modeled by the Laplacian of gaussian with the sigma of 6. Gray is 0, black is −1, and white is +1. (A) Hebb rule with subtractive normalization. (B) Oja rule. (C) Proposed rule, equation 2.9. (D) Modified Oja rule with L0-norm constraint, equation 2.15.

Figure 5:

Distributions of all RGC-to-S1 weights. (A) Hebb rule with subtractive normalization. (B) Oja rule. (C) Proposed rule, equation 2.9. (D) Modified Oja rule with L0-norm constraint, equation 2.15.

We also used the same four rules to train the S1-to-C1 weights, but with a [0, wmax] bounding. To make a fair comparison of these rules, the RGC-to-S1 weights were pretrained by the Oja rule and set to be the same in all four cases. Figure 6 shows a fragment of the S1 orientation map as an iso-orientation contour plot. The values over the contours represent the preferred orientations of the simple cells in degrees. The box outlines a region of 27 × 27 S1s, whose outputs are fed to the same complex cell. The circles inside the box indicate the connection strengths from these simple cells to the chosen complex cell: the larger the circle, the larger the weight. Very small or zero weights are shown as dots. Figure 7 shows the corresponding distributions of all S1-to-C1 weights. We can see that the Hebb rule with subtractive normalization, equation 2.7; the proposed rule, equation 2.9; and the modified Oja rule with the L0-norm constraint, equation 2.15, create similar sparse connectivity patterns between the pool of simple cells and the given complex cell with strong connections originating from the simple cells of similar orientations (within ±30° of the vertical orientation) and the zero-strength connections from the simple cells of other orientations. Such sparse connectivity between simple and complex cells leads to a high selectivity of the complex cells to stimulus orientation, which is consistent with experimental data (Henry, Dreher, & Bishop, 1974; Heggelund & Albus, 1978). The sparse S1-to-C1 connectivity in the case of the Hebb rule with subtractive normalization and the modified Oja rule with the L0-norm constraint emerged thanks to clipping negative weights to 0. The Oja rule creates connections of a variable strength to all simple cells within the box, even to those with orthogonal orientations.

Figure 6:

Fragment of the S1 orientation map as an iso-orientation contour plot. The values over the contours are the preferred orientations of S1s in degrees. The box outlines a region of 27 × 27 S1s, whose outputs are fed to the same C1. The circles inside the box indicate the connection weights from these S1s to the chosen C1: the larger the circle, the larger the weight (all weights are positive). Very small or zero weights are shown as dots. The S1-to-C1 weights are trained by four different rules with a [0, wmax] bounding while the RGC-to-S1 weights are fixed and the same for all four cases. (A) Hebb rule with subtractive normalization. (B) Oja rule. (C) Proposed rule, equation 2.9. (D) Modified Oja rule with L0-norm constraint, equation 2.15.

Figure 7:

Distributions of all S1-to-C1 weights. (A) Hebb rule with subtractive normalization. (B) Oja rule. (C) Proposed rule, equation 2.9. (D) Modified Oja rule with L0-norm constraint, equation 2.15.

4.  Discussion

A simple rate-based learning rule was proposed that is similar in appearance to the Oja rule (Oja, 1982) but asymptotically constrains the L1-norm instead of the L2-norm of an input weight vector, which makes it more biologically realistic (Miller & MacKay, 1994; Royer & Paré, 2003). This constraining is local as opposed to commonly used instant normalizations (Miller & MacKay, 1994), which require the knowledge of all input weights of a neuron to update each one of them individually. The geometric analysis shows that the asymptotic solutions of the proposed rule are sparse; the rule converges to input weight vectors with many zeros, which may explain the developmental synaptic pruning (Huttenlocher, 1979; Rakic et al., 1994; Innocenti, 1995). The number of nonzero elements (sparseness) can be controlled by imposing a hard bound on the maximum magnitude of individual weights and choosing an appropriate asymptotic target for the L1-norm (the parameter α in equation 2.9).

The proposed rule was used to learn the input weights of simple and complex cells in a V1 model and was compared against three other rate-based rules: the Hebb rule with subtractive normalization, equation 2.7; the Oja rule, equation 2.8; and the modified Oja rule with the L0-norm constraint, equation 2.15. The Oja rule failed to converge to sparse input weight vectors for both simple and complex cells. The Hebb rule with subtractive normalization converged to sparse weight vectors only for complex cells, thanks to the zero bound that prevented the weights from becoming negative. It failed to converge to sparse weight vectors for simple cells because the RGC-to-S1 weights were allowed to change their sign. The modified Oja rule with the L0-norm constraint failed to converge to sparse RGC-to-S1 weight vectors because of the division by wi in equation 2.15, which made the weights overshoot zero in an oscillatory manner. But this rule converged to sparse S1-to-C1 weight vectors thanks to the zero bound. The proposed rule, equation 2.9, converged to sparse weight vectors for both simple and complex cells. Therefore, to achieve sparseness, the proposed rule does not require the zero bound, which makes this rule valuable not only for neural networks that comply with Dale's principle (Dale, 1935) but also for networks whose weights are allowed to change their sign during training. To our knowledge, the proposed rule is the only rate-based rule that yields sparse connectivity with and without the zero bound. This sparse connectivity may be viewed as a result of the synaptic competition caused by the L1-norm constraint. Besides this competition, the proposed rule also automatically regulates the postsynaptic firing rate by limiting it to the same range as the presynaptic firing rates (if α = 1). STDP rules also cause a synaptic competition, which may lead to synaptic pruning (Iglesias et al., 2005). But these rules require a fine balancing of the LTP/LTD ratio or a homeostatic regulation to stabilize the postsynaptic firing rate (Sjöström et al., 2001; Tegnér & Kepecs, 2002; Gütig et al., 2003; Benuskova & Abraham, 2007; Sprekeler et al., 2007; Michler, Eckhorn, & Wachtler, 2009; Watt & Desai, 2010; Clopath et al., 2010).

Another way to assess sparseness of a V1 model is to count how many simple cells have nonzero outputs on average while encoding structured visual stimuli such as natural scenes (Field, 1987, 1994). Models based on this sparseness definition typically use linear encoding y = Wx, where x is a visual stimulus, y are the outputs of the encoders (simple cells), and the rows of W are the encoding filters, which are equivalent to the receptive fields of the simple cells in the present work. The encoding filters can be found by maximizing the statistical independence or, equivalently, the sparseness of the encoder outputs y (Olshausen & Field, 1996; Bell & Sejnowski, 1997). This method performs independent component analysis (ICA) and yields localized and oriented encoding filters similar to the receptive fields of biological simple cells. Localization of the S1 filters in this letter is achieved by spatially constraining the RGC regions from which the S1s collect their inputs. Because ICA-based S1 models treat S1s as linear encoders, they cannot explain the shift invariance of C1s. It is possible to extend these models to C1s by maximizing the independence (sparseness) of local sums of squared S1 outputs ∑iyi² as the C1 responses (Hyvärinen & Hoyer, 2000). In this letter, the S1s are modeled as half-wave rectifiers to obtain the shift invariance of the C1s.

Acknowledgments

The author wishes to thank Subramaniam Venkatraman, QUALCOMM, Inc., for useful discussions about the complex cell modeling, and Krzys Wegrzyn, QUALCOMM, Inc., for his help in programming the V1 model.

References

Barrow, H. G. (1987). Learning receptive fields. In Proc. IEEE First Annual Conference on Neural Networks, IV (pp. 115–121). Piscataway, NJ: IEEE.
Bastrikova, N., Gardner, G. A., Reece, J. M., Jeromin, A., & Dudek, S. M. (2008). Synapse elimination accompanies functional plasticity in hippocampal neurons. Proc. Natl. Acad. Sci. USA, 105, 3123–3127.
Becker, N., Wierenga, C. J., Fonseca, R., Bonhoeffer, T., & Nagerl, U. V. (2008). LTD induction causes morphological changes of presynaptic boutons and reduces their contacts with spines. Neuron, 60, 590–597.
Bell, A. J., & Sejnowski, T. J. (1997). The independent components of natural scenes are edge filters. Vis. Res., 37, 3327–3338.
Benuskova, L., & Abraham, W. (2007). STDP rule endowed with the BCM sliding threshold accounts for hippocampal heterosynaptic plasticity. J. Comput. Neurosci., 22, 129–133.
Bienenstock, E. L., Cooper, L. N., & Munro, P. W. (1982). Theory for the development of neuron selectivity: Orientation specificity and binocular interaction in visual cortex. J. Neurosci., 2(1), 32–48.
Braitenberg, V., & Schüz, A. (1998). Cortex: Statistics and geometry of neuronal connectivity. Berlin: Springer.
Butz, M., Wörgötter, F., & van Ooyen, A. (2009). Activity-dependent structural plasticity. Brain Res. Rev., 60, 287–305.
Chechik, G., Meilijson, I., & Ruppin, E. (1998). Synaptic pruning in development: A computational account. Neural Comput., 10, 1759–1777.
Chklovskii, D. B., Mel, B. W., & Svoboda, K. (2004). Cortical rewiring and information storage. Nature, 431, 782–788.
Clopath, C., Busing, L., Vasilaki, E., & Gerstner, W. (2010). Connectivity reflects coding: A model of voltage-based STDP with homeostasis. Nat. Neurosci., 13, 344–352.
Dale, H. H. (1935). Pharmacology and the nerve endings. Proc. R. Soc. Med., 28, 319–332.
Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. J. Opt. Soc. Am., A4, 2379–2394.
Field, D. J. (1994). What is the goal of sensory coding? Neural Comput., 6, 559–601.
Grossberg, S. (1968). Some nonlinear networks capable of learning a spatial pattern of arbitrary complexity. Proc. Natl. Acad. Sci. USA, 59, 368–372.
Grossberg, S. (1972). Neural expectation: Cerebellar and retinal analogs of cells fired by learnable or unlearned pattern classes. Kybernetik, 10, 49–57.
Grossberg, S. (1976a). Adaptive pattern classification and universal recoding. I. Parallel development and coding of neural feature detectors. Biol. Cybern., 23, 121–134.
Grossberg, S. (1976b). Adaptive pattern classification and universal recoding. II. Feedback, expectation, olfaction, and illusions. Biol. Cybern., 23, 187–202.
Gütig, R., Aharonov, S., Rotter, S., & Sompolinsky, H. (2003). Learning input correlations through nonlinear temporally asymmetric Hebbian plasticity. J. Neurosci., 23, 3697–3714.
Hassibi, B., Stork, D. G., & Wolff, G. J. (1993). Optimal brain surgeon and general network pruning. In Proc. IEEE Int. Conf. Neural Networks (Vol. 1, pp. 293–299). Piscataway, NJ: IEEE.
Hebb, D. O. (1949). The organization of behavior. Hoboken, NJ: Wiley.
Heggelund, P., & Albus, K. (1978). Orientation selectivity of single cells in striate cortex of cat: The shape of orientation tuning curves. Vis. Res., 18, 1067–1071.
Henry, G. H., Dreher, B., & Bishop, P. O. (1974). Orientation specificity of cells in cat striate cortex. J. Neurophysiol., 37, 1394–1409.
Holmgren, C., Harkany, T., Svennenfors, B., & Zilberter, Y. (2003). Pyramidal cell communication within local networks in layer 2/3 of rat neocortex. J. Physiol., 551, 139–153.
Huttenlocher, P. R. (1979). Synaptic density in human frontal cortex: Developmental changes and effects of age. Brain Res., 163, 195–205.
Hyvärinen, A., & Hoyer, P. O. (2000). Emergence of phase and shift invariant features by decomposition of natural images into independent feature subspaces. Neural Comput., 12, 1705–1720.
Ibata, K., Sun, Q., & Turrigiano, G. G. (2008). Rapid synaptic scaling induced by changes in postsynaptic firing. Neuron, 57, 819–826.
Iglesias, J., Eriksson, J., Grize, F., Tomassini, M., & Villa, A. E. P. (2005). Dynamics of pruning in simulated large-scale spiking neural networks. BioSystems, 79(1), 11–20.
Innocenti, G. M. (1995). Exuberant development of connections and its possible permissive role in cortical evolution. Trends Neurosci., 18, 397–402.
Kempter, R., Gerstner, W., & van Hemmen, J. L. (2001). Intrinsic stabilization of output rates by spike-based Hebbian learning. Neural Comput., 13, 2709–2741.
Kohonen, T. (1982). Self-organized formation of topologically correct feature maps. Biol. Cybern., 43, 59–69.
Le Bé, J. V., & Markram, H. (2006). Spontaneous and evoked synaptic rewiring in the neonatal neocortex. Proc. Natl. Acad. Sci. USA, 103, 13214–13219.
Le Cun, Y., Denker, J. S., & Solla, S. A. (1990). Optimal brain damage. In D. S. Touretzky (Ed.), Advances in neural information processing systems, 2 (pp. 598–605). San Francisco: Morgan Kaufmann.
Markram, H., Lübke, J., Frotscher, M., Roth, A., & Sakmann, B. (1997). Physiology and anatomy of synaptic connections between thick tufted pyramidal neurones in the developing rat neocortex. J. Physiol., 500, 409–440.
Michler, F., Eckhorn, R., & Wachtler, T. (2009). Using spatio-temporal correlations to learn topographic maps for invariant object recognition. J. Neurophysiol., 102, 953–964.
Miller, K. D. (1996). Synaptic economics: Competition and cooperation in synaptic plasticity. Neuron, 17, 371–374.
Miller, K. D., & MacKay, D. J. C. (1994). The role of constraints in Hebbian learning. Neural Comput., 6, 100–126.
Mimura, K., Kimoto, T., & Okada, M. (2003). Synapse efficiency diverges due to synaptic pruning following over-growth. Phys. Rev. E: Stat. Nonlin. Soft Matter Phys., 68, 031910.
Obermayer, K., Ritter, H., & Schulten, K. (1990). Large-scale simulations of self-organizing neural networks on parallel computers: Application to biological modelling. Parallel Computing, 14, 381–404.
Oja, E. (1982). Simplified neuron model as a principal component analyzer. J. Math. Biol., 15, 267–273.
Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Rakic, P., Bourgeois, J. P., & Goldman-Rakic, P. S. (1994). Synaptic development of the cerebral cortex: Implications for learning, memory, and mental illness. Prog. Brain Res., 102, 227–243.
Rasmussen, C. E., & Willshaw, D. J. (1993). Presynaptic and postsynaptic competition in models for the development of neuromuscular connections. Biol. Cybern., 68, 409–419.
Ribchester, R. R., & Barry, J. A. (1994). Spatial versus consumptive competition at polyneuronally innervated neuromuscular junctions. Exp. Physiol., 79, 465–494.
Rochester, N., Holland, J. H., Haibt, L. H., & Duda, W. L. (1956). Tests on a cell assembly theory of the action of the brain, using a large digital computer. IRE Trans. Inform. Theory, IT-2, 80–93.
Roland, P. E. (1993). Brain activation. Hoboken, NJ: Wiley-Liss.
Royer, S., & Paré, D. (2003). Conservation of total synaptic weight through balanced synaptic depression and potentiation. Nature, 422, 518–522.
Sejnowski, T. J. (1977a). Storing covariance with nonlinearly interacting neurons. J. Math. Biol., 4, 303–321.
Sejnowski, T. J. (1977b). Statistical constraints on synaptic plasticity. J. Theor. Biol., 69, 387–389.
Sjöström, P., Turrigiano, G., & Nelson, S. (2001). Rate, timing, and cooperativity jointly determine cortical synaptic plasticity. Neuron, 32, 1149–1164.
Song, S., Miller, K. D., & Abbott, L. F. (2000). Competitive Hebbian learning through spike-timing-dependent synaptic plasticity. Nat. Neurosci., 3, 919–926.
Sprekeler, H., Michaelis, C., & Wiskott, L. (2007). Slowness: An objective for spike-timing dependent plasticity? PLoS Comput. Biol., 3, 1136–1148.
Tegnér, J., & Kepecs, A. (2002). Adaptive spike-timing-dependent plasticity. Neurocomputing, 44–46, 189–194.
Turrigiano, G. G., Leslie, K., Desai, N., & Nelson, S. B. (1998). Activity dependent scaling of quantal amplitude in neocortical pyramidal neurons. Nature, 391(6670), 892–896.
van Ooyen, A. (2001). Competition in the development of nerve connections: A review of models. Network: Comput. Neural Syst., 12, R1–R47.
van Rossum, M. C. W., Bi, G. Q., & Turrigiano, G. G. (2000). Stable Hebbian learning from spike timing-dependent plasticity. J. Neurosci., 20, 8812–8821.
von der Malsburg, C. (1973). Self-organization of orientation sensitive cells in the striate cortex. Kybernetik, 14, 85–100.
Watt, A. J., & Desai, N. S. (2010). Homeostatic plasticity and STDP: Keeping a neuron's cool in a fluctuating world. Front. Syn. Neurosci., 2, 5.
Wolff, J. R., Laskawi, R., Spatz, W. B., & Missler, M. (1995). Structural dynamics of synapses and synaptic components. Behav. Brain Res., 66, 13–20.
Yuille, A. L., Kammen, D. M., & Cohen, D. S. (1989). Quadrature and the development of orientation selective cortical cells by Hebb rules. Biol. Cybern., 61, 183–194.
Zhou, Q., Homma, K. J., & Poo, M. M. (2004). Shrinkage of dendritic spines associated with long-term depression of hippocampal synapses. Neuron, 44, 749–757.