Abstract

In this letter, we implement and compare two neural coding algorithms in networks of spiking neurons: winner-takes-all (WTA) and winners-share-all (WSA). WSA exploits the code space provided by the temporal code by training a different combination of k out of n output neurons to fire together in response to different patterns, whereas WTA uses a one-hot code to respond to distinct patterns. For WSA, the value of k that maximizes the information capacity of n output neurons was determined theoretically and used in the network. As a small proof-of-concept classification problem, a spiking neural network was trained with both algorithms to classify 14 letters of the English alphabet with an image size of 15 × 15 pixels. For both schemes, a modified spike-timing-dependent plasticity (STDP) learning rule was used to train the spiking neurons in an unsupervised fashion. The performance and the number of neurons required to perform this computation are compared between the two algorithms. We show that by tolerating a small drop in performance accuracy (84% in WSA versus 91% in WTA), we are able to reduce the number of output neurons by more than a factor of two, and that this reduction grows as the number of patterns increases. The reduction in the number of output neurons proportionally reduces the number of training parameters, which requires less memory and hence speeds up the computation, and in the case of a neuromorphic implementation on silicon, takes up much less area.

1  Introduction and Background

In recent years, temporal codes have gained a lot of attention as a strong candidate for how the brain encodes information. Spikes are the sole means of communication between neurons, and the arrival times of spikes have been shown to carry a great deal of information in the neuronal pathway (Bialek, Rieke, de Ruyter van Steveninck, & Warland, 1991; Heiligenberg, 1991). Moreover, conventional rate-based codes are not only energetically unfavorable (many spikes are needed to encode a single variable) but also seem to be biologically implausible (Thorpe, Delorme, & Van Rullen, 2001). Thorpe et al. (2001) have argued that the speed of processing in the brain and the number of processing layers leave, in theory, only a few spikes per layer, making rate-based codes incompatible with the rate of information processing in the brain. Previous work in the area of temporal coding has exploited the biological observation that strongly activated neurons tend to fire first (Orchard et al., 2015; Bichler, Querlioz, Thorpe, Bourgoin, & Gamrat, 2012; Masquelier & Thorpe, 2007). Assuming that the neurons reset after each pattern presentation, the higher the intensity of the input, the faster a neuron reaches its threshold and emits a spike; hence, the information is encoded in the spike times. This allows for a whole new range of coding schemes that not only are energy efficient (computation with one spike) but also have a much larger information capacity than their conventional counterparts. Different coding schemes using temporal codes have been suggested, among which rank-order coding (ROC) has the largest information capacity (Thorpe et al., 2001) if we have a limited time resolution.1 In ROC, information is encoded in the rank of the spikes from the neurons; in other words, the order of firing between neurons encodes the input vector. Since there are n! permutations of the firing order of n neurons, the information capacity of ROC is log2(n!).

1.1  Rank-Order Coding

Using ROC to train neural networks requires a learning algorithm that depends on the order of neural spiking. In this scheme, when a pattern is presented to the network, the weights of the neurons that fire earlier change more than those of the neurons that fire later. Thus, after training, the earlier spikes carry more information about the input than the later ones do.

Now consider a case where ROC is employed in some form of unsupervised learning, such as competitive learning. In the unsupervised (clustering) problem, a training set is given, and the data need to be grouped into a few clusters. Classic competitive learning such as the K-means clustering algorithm finds cluster centroids that minimize the distance between the data points and the nearest centroid (Coates & Ng, 2011). Therefore, if at each pattern presentation, there is one cluster whose centroid's distance to the input vector is the least, the new pattern will be grouped with that cluster and a new centroid will be calculated for the cluster with the new arrangement.

In a neural network, each cluster is represented by an output neuron, and its centroid, which is the mean of the cluster, is that neuron's weight vector; for an N-dimensional input vector and K clusters, this gives K weight vectors of length N. For each new input, the most active output neuron is the one whose weight vector is closest to the input vector. This algorithm is known as the winner-takes-all (WTA) algorithm, since one neuron wins the competition between neurons and assigns the input to its cluster (Mead, 1989).

Lateral inhibition can be employed to implement the WTA algorithm when K-means is applied to spiking neural networks. In this form of WTA implementation, the first spike at the output layer inhibits all the other neurons, and thus only the winning neuron's weights change to include the input in its cluster. Each neuron can therefore be trained to respond to a specific pattern. For example, in the work of Guyonneau, VanRullen, and Thorpe (2004), four neurons are trained using the rank-order code in a spiking neural network to cluster four different images. Hence, by using WTA and rank-order coding, four neurons can encode four images. However, a rank-order code has the potential of encoding 4! = 24 different input vectors using four neurons; therefore, by applying WTA, the maximum information-theoretic capacity of the network is not being exploited.

Moreover, a greedy algorithm like WTA combined with Hebbian learning results in a positive feedback loop driving the weights to either one or zero, which can be digitally stored. Digital weights have the advantage of being more robust and immune to noise, but the single bit representation of the weight results in information loss. Analog weights (or weights with higher-order bit representation) and true analog computation coupled with temporal coding enable a richer computational platform, and we propose a different paradigm to exploit this space, which we call winners-share-all (WSA).

1.2  Winners-Share-All

Instead of forcing the network to have only one winner for each pattern presentation, which does not fully utilize the code space, we propose having multiple winners at the presentation of each input. Although WSA may seem reminiscent of k-winner-take-all (Maass, 2000), as we explain throughout this letter, contrary to k-WTA, k is not a fixed number (it is, however, bounded by an optimal value that can be calculated), and the network learns to pick the value of k and the neurons associated with it during the learning process. As a result, the network is allowed to evolve to the most efficient code for each pattern.

Having multiple winners brings up the question, How many winners are enough? In ROC, the earlier-rank spikes carry much more information about the input scene than the later ones do. So how many of them should we keep? Imagine we have n output neurons, of which k are firing, where k can take any value between 1 and n.

The information capacity of such a setting, which we call WSA, is given by

I_{\mathrm{WSA}} = \log_2 \binom{n}{k}.
(1.1)

While k = n/2 maximizes this information capacity, as we will explain, it does so at the expense of an increase in network complexity (and, hence, an increase in training time). However, the later-arriving spikes are naturally deweighted, and this can be used to arrive at an optimum between information capacity and network complexity. In other words, the least significant bits (LSBs) do not contain much information about the input and can be ignored in the coding process.

To explain this better with an example, a code in which only the first three (most significant) neurons fire and a code in which a fourth neuron also fires are formally two different codes, yet they cannot be assigned to two different patterns, since the fourth neuron is in the LSB half and does not carry much information about the scene; the two are effectively very similar codes. In general, as we move toward the later-firing neurons, they carry less and less information about the patterns, to the point that it becomes too difficult to train a neural network with them. Thus, we chose k ≤ n/2 in this work, using only the most significant half of the possible firing neurons.

Compared to the 1-WTA algorithm, if we enforce this condition (k = n/2) on the network, we increase the information capacity by the ratio

R = \frac{\log_2 \binom{n}{n/2}}{\log_2 n}.
(1.2)

To get a sense of the increase in information capacity, for n = 10 the ratio is 2.4; for n = 100 it is 14.5; and for n = 1000 it is 100.24. The increase in information capacity becomes more apparent as the number of output neurons grows.
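As a rough numerical check of equation 1.2 (a minimal MATLAB sketch, not the code used for the results in this letter; gammaln is used only to avoid overflowing the binomial coefficient for large n):

% Capacity ratio of WSA with k = n/2 winners relative to 1-WTA (equation 1.2)
for n = [10 100 1000]
    log2binom = (gammaln(n + 1) - 2*gammaln(n/2 + 1)) / log(2);   % log2 of C(n, n/2)
    fprintf('n = %4d : capacity ratio = %.2f\n', n, log2binom / log2(n));
end

For n = 10 and n = 100 this reproduces the ratios quoted above; the value obtained for n = 1000 depends on how the large binomial term is evaluated.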

To highlight the advantage of this algorithm, we can compare the number of neurons needed to classify a set of patterns with WSA versus WTA. Using WTA, the number of output neurons required grows linearly with the number of input patterns, since at least m output neurons are needed to classify m patterns. Using WSA, however, the required number of output neurons undergoes a compression (see equation 1.2), and for a large number of patterns, the gap between the number of output neurons in the WTA and WSA algorithms grows rapidly.

Although this algorithm provides a higher information capacity than WTA, its capacity is still lower than that of ROC (\log_2 \binom{n}{n/2} versus \log_2(n!)). However, for reasons similar to those in the example above, although the ROC code space is larger, many of its codes are very similar: the LSB neurons carry much less information about the patterns than the MSBs do and hence cannot be used to encode different patterns.

2  Materials and Methods

To illustrate the power of this algorithm in practice, we have applied it to a neural network performing a rather small classification problem in which 14 patterns are to be clustered using six output neurons. With WSA, six output neurons provide a code space of up to 41 codes (combinations of at most three winners; see Table 1), more than enough to separate the 14 patterns. The same network with the same patterns has been implemented with a WTA algorithm using 14 output neurons, and the results are compared.

Figure 1 shows the training set and test set patterns chosen for this experiment. The set consists of images of 14 letters of the English alphabet with a size of 15 × 15 pixels, in which each pixel takes a binary value. Since the image size is rather small and the data set is synthetic, the 14 letters are chosen so as to exclude patterns that, given the constraints on size and the synthetic edges, would look very similar to one another and as a result be very difficult to distinguish. The last column of the training set shows the ideal, noise-free patterns; the rest are generated by flipping 10 pixels that lie either on the pattern or within a 1-pixel neighborhood around it. The reason for generating the noisy patterns this way is that if we flipped pixels at random locations in the image, some flips might end up in a corner of the image, which would not perturb the pattern at all and would hence leave it very close to the ideal image. Using this method, five noisy versions of each pattern are generated for training, and two more are made for testing to evaluate how well the network generalizes.
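For concreteness, the distortion procedure can be sketched in MATLAB as follows; the function name make_noisy and its interface are illustrative rather than the exact code used in our simulations:

function noisy = make_noisy(ideal, nFlips)
% ideal: 15 x 15 logical image of a letter; nFlips: number of pixels to flip (10 here).
% Candidate pixels are those on the letter or within a 1-pixel neighborhood of it,
% so the flips always perturb the pattern rather than an empty corner of the image.
cand        = find(conv2(double(ideal), ones(3), 'same') > 0);
pick        = cand(randperm(numel(cand), nFlips));
noisy       = ideal;
noisy(pick) = ~noisy(pick);
end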

Figure 1:

English alphabet letters are used as training set and test set patterns for the classification problem. The letters are distorted by flipping the pixels around the ideal patterns.


Patterns are fed to the network sequentially on the rising edge of a global clock with period T (10 ms in this work), and the network has T seconds to process each pattern. The state of the neurons in the network is then reset at the falling edge of the clock.

2.1  Architecture

The network architecture used here is structured in a similar manner to hierarchical neural models, specifically to convolutional neural networks (CNNs; LeCun, Bottou, Bengio, & Haffner, 1998). Figure 2 shows the full neural network architecture used in this work.

Figure 2:

Neural network architecture used in this work. Intensity is converted to time of spike in the first layer by using 225 neurons, each assigned to 1 pixel of the image. Four feature maps are used in the second layer to extract the edges of each 3 × 3 window in the first layer, therefore requiring 100 neurons. The output spikes of the second layer are combined to train the last layer of neurons by utilizing two different approaches (WSA and WTA) in recognition of patterns. The inset of the figure shows the neuron model, in which the excitatory postsynaptic potential (EPSP) of the presynaptic spike gives more importance to the earlier spikes than the later ones, since the neuron integrates the dotted green area under the EPSP signals. Therefore, the earlier signals stimulate the neurons for a longer time and are hence more effective.


As in CNNs, the second neuronal layer is inspired from V1 in visual cortex, which responds to orientation edges in the input scene. These edges are designed in the form of filters (kernels) in the second layer, which are convolved with the input scene (valid convolution), resulting in feature extraction from the image. The activation of the neurons at the second layer in response to the convolutions (extracted features) is propagated and combined in the next layer, where output neurons group them as a cluster.

Figure 2 illustrates how four filters are designed to extract vertical, horizontal, 45°, and 135° edges from the image. These (3 × 3) filters scan the (15 × 15) input images with nonoverlapping windows, and the convolution results in (5 × 5) feature maps in the second layer. The reason we chose nonoverlapping windows is that the image size (15 × 15) is relatively small, and overlapping windows would not contain much more information than nonoverlapping ones. Each "feature map" in the second layer contains a certain edge. Neurons at the same coordinates in different feature maps have the same receptive field on the input image, meaning they get their inputs from the same part of the image.
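A minimal MATLAB sketch of this operation is given below; the kernel weights are illustrative placeholders rather than the exact filters shown in Figure 2:

% Non-overlapping 3 x 3 edge filtering of a 15 x 15 image into four 5 x 5 feature maps
img = double(rand(15) > 0.5);                    % placeholder binary input image
K = cat(3, [-1 2 -1; -1 2 -1; -1 2 -1], ...      % vertical edge
           [-1 -1 -1; 2 2 2; -1 -1 -1], ...      % horizontal edge
           [2 -1 -1; -1 2 -1; -1 -1 2], ...      % one diagonal edge (illustrative weights)
           [-1 -1 2; -1 2 -1; 2 -1 -1]);         % the other diagonal edge
maps = zeros(5, 5, 4);
for f = 1:4
    for r = 1:5
        for c = 1:5
            patch = img(3*r-2:3*r, 3*c-2:3*c);            % non-overlapping 3 x 3 window
            maps(r, c, f) = sum(sum(K(:, :, f) .* patch)); % sum of products (convolution)
        end
    end
end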

As is depicted in Figure 2, the third layer is combining the features from the second layer in a fully connected manner to recognize the images on the output layer. This layer is trained in a completely unsupervised fashion using some form of spike-timing-dependent plasticity (STDP), which we discuss in detail later in the letter.

The spiking neural network is coded and simulated in Matlab. Neurons and synapses are modeled mathematically and are discussed below.

2.1.1  Neuron's Model

A leaky integrate-and-fire model is chosen for the neurons, defined mathematically as

\frac{dV_j(t)}{dt} = -\lambda\, V_j(t) + \sum_i w_{ij}\,\mathrm{EPSP}_i(t), \qquad nT \le t < (n+1)T,
(2.1)

where T is the global clock period for presenting patterns and is chosen here as 10 ms.3 EPSP_i is the excitatory postsynaptic potential associated with input i, and \lambda is the leak coefficient of the membrane potential. The value of EPSP_i rises to 1 (normalized value) with the arrival of the corresponding spike (at t_i) and decays with a certain time constant (10 ms) until the end of the clock cycle. The conversion to EPSP gives more importance to earlier spikes than to later ones, since the postsynaptic neuron integrates the area under the EPSP curve. As discussed before, in temporal coding the spikes emitted earlier contain more information about the input than the later spikes. Therefore, since we utilize multiple spikes at each layer (WSA), we need to give them appropriate importance based on their arrival times. The EPSPs are then multiplied by their corresponding weights w_{ij} and summed over the inputs connected to the neuron. The inset of Figure 2 illustrates this idea.
At the end of the time window, the EPSP signals and the membrane potential are reset to zero and the neuron awaits the next input pattern:

V_j(nT) = 0, \qquad \mathrm{EPSP}_i(nT) = 0, \qquad n = 1, 2, \ldots, T_{\mathrm{sim}}/T,
(2.2)

where T_{\mathrm{sim}} is the length of the simulation.
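A minimal MATLAB sketch of equations 2.1 and 2.2 for a single postsynaptic neuron follows; the numerical constants (leak coefficient, threshold, fan-in) are assumed values chosen only for illustration:

dt = 0.1e-3;  T = 10e-3;  tauE = 10e-3;          % time step, clock period, EPSP decay constant
lambda = 50;  Vth = 0.5;                         % assumed leak coefficient and firing threshold
nIn = 9;  w = rand(1, nIn);                      % example fan-in and synaptic weights
spkT = sort(rand(1, nIn)) * T;                   % presynaptic spike times within the cycle
epsp = zeros(1, nIn);  V = 0;  fired = false;
for t = dt:dt:T
    epsp = epsp * exp(-dt/tauE);                 % exponential decay of every EPSP trace
    epsp(spkT > t - dt & spkT <= t) = 1;         % an EPSP jumps to 1 when its spike arrives
    V = V + dt * (-lambda*V + w * epsp');        % leaky integration of the weighted EPSPs (eq. 2.1)
    if ~fired && V >= Vth
        fired = true;                            % output spike of this neuron
    end
end
V = 0;  epsp(:) = 0;                             % end-of-window reset (eq. 2.2)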

2.1.2  Synapse Model

The connection between two neurons is modeled as a single weight variable w whose value changes through dwp (positive changes of w) and dwn (negative changes of w) such that

\Delta w = A_w\,(dw_p - dw_n),
(2.3)

where the learning-rate factor A_w scales the changes of w.

In the following sections of this letter, we go through the details of the design choices and output of each layer using WSA and WTA algorithms. The output of layers 1 and 2 is identical for both algorithms; the first layer does the conversion from intensity to spikes, and the second layer extracts the features. The third layer, which performs the classification, is where the two algorithms are different, and hence their details are explained separately. A comparison is then made based on the number of computing nodes (neurons) and memory (synapses) required for computation as well as their accuracy of classification.

2.2  Layer 1: Converting Pixel Intensity into Spikes

In temporal codes, each input is presented within one clock cycle, and the information about the input is converted into spike times. As shown in Figure 3A, in layer 1 one neuron, shown by the blue circles, is assigned to each pixel. The pixel values are refreshed every 10 ms with a new pattern, and during this time window the network processes the pattern. The intensity of the pixels is presented at the beginning of the clock cycle as the input to the first layer and is treated as a weighted EPSP, as discussed in the neuron model. Note that the pixel intensities are normalized by the number of "on" pixels, so that an input pattern with intrinsically more white pixels does not cause a stronger stimulus for the next layer and all patterns provide the same total drive to the following neural layer. Normalization ensures that the network treats all patterns the same way.
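A minimal sketch of this normalization (variable names are illustrative):

img   = double(rand(15) > 0.5);                  % placeholder 15 x 15 binary pattern
drive = img / sum(img(:));                       % normalize by the number of "on" pixels
% each of the 225 first-layer neurons presents drive(r, c) as a weighted EPSP at the
% rising edge of the 10 ms clock, so every pattern delivers the same total stimulus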

Figure 3:

Neural arrangement and output of the first and second layers of the neural network. (A) First layer: converting pixel intensity to spikes. One neuron (blue) is assigned to each pixel of the image; therefore, for a 15 × 15 image, 225 neurons are required in the first layer. (B) Second layer: extracting edges from different locations of the image. Every group of four neurons (orange) in the second layer extracts four features (shown next to them) from nine neurons in the first layer (highlighted with the red rectangle), which correspond to the 3 × 3 window shown in red in panel A. (C) Raster plot showing the spiking of the first-layer neurons in each 10 ms time window in which the patterns are presented. Neurons corresponding to higher-intensity pixels emit a spike. (D) Raster plot showing the spiking of the 100 neurons in the second layer in response to the dominant features at different locations of the image in each 10 ms time window.


Another note is that ordered presentation of the data can bias the network, leading to poor convergence. In order to avoid this, the data are randomly shuffled prior to each epoch of training (Ng et al., n.d.). Thus, the input patterns are shuffled randomly and then given to the input neurons every 10 milliseconds.

The raster plot in Figure 3C shows the spike times of the 225 neurons of the first layer at the first 50 ms of the simulation. The arrival of a new pattern at the beginning of each clock cycle (10 ms) causes neurons, which are assigned to pixels containing a higher intensity, to emit a spike.

2.3  Layer 2: Extracting Features from the Images

In image processing, appropriate filters are applied to an image to extract certain features from it through convolving these filters with each patch of the image. In the concept of neural networks, this translates to neurons whose weight vectors act as the filters and whose receptive field act as the input patch. As a result of that, the output of the neuron is the convolution (sum of products) of the input image patch with the filters defined by the neurons' weight vector (LeCun et al., 1998).

The filters we have chosen at the second layer of the neural network for feature detection are vertical, horizontal, 45°, and 135° edges. The weights (synapses) of these filters are nonplastic and are hardwired in the program. As illustrated in Figure 3B, every nine neurons from the first (input) layer represent a 3 × 3 patch of the image and connect to four neurons in the second layer. These four neurons receive spikes from the input layer, and depending on the dominant features of that patch, the corresponding neurons fire. Since the image is 15 × 15 and the receptive fields of each group of four neurons are of size 3 × 3 and nonoverlapping, the second layer needs 5 × 5 × 4 = 100 neurons. Neurons 1 to 4 correspond to the top left corner of the image, neurons 5 to 8 to the patch just to the right of it, and so on. In each 10 ms cycle in which a new pattern is presented, spikes from the first layer travel through the second layer, and the neurons corresponding to the dominant edges of that pattern spike. Figure 3D depicts the raster plot of the 100 neurons in the second layer in the first 50 ms of the simulation.

However, we have a set of synthetic and relatively small images, which results in high correlation between the pixels. Hence, there are similarities between the patterns, making it more likely for different patterns to be recognized as the same object and therefore challenging to separate them into different classes. In the following section, we explain how adding habituation interlayer neurons solves this problem.

2.3.1  Habituation

The solution we employed for highly correlated input patterns is what we call habituation, because it diminishes the innate response of the neurons to a frequently repeated stimulus. In habituation, the network finds the similarities between the patterns, ignores them, and instead looks for dissimilarities. This is illustrated in Figure 4A. An intermediate habituation leaky integrate-and-fire neuron is introduced for each neuron in layer 2. The threshold of this habituation neuron is set so that it detects frequent firing of its input neuron in layer 2 and then inhibits that neuron. Frequent firing of such a neuron identifies a common feature at that specific location in the image. Information-theoretically, a common feature contains very little information about the pattern, since it is very probable and therefore has low entropy, and it can be ignored.

Figure 4:

Interlayer neurons to enforce habituation and inhibition conditions. (A) The habituation neuron is designed to ignore the similarities between the input patterns and look for the differences between patterns, which helps to separate patterns. If the neuron from the second layer (shown in blue) is firing very often, the interlayer habituation neuron (shown in red) detects this frequent firing as a flag for a common edge and spikes. Its spike inhibits the neuron from the second layer strongly, which desensitizes the output layer from the “common feature.” (B) The inhibitory neuron is designed to ensure not more than half of the output neurons fire at any given time window. The threshold of the inhibitory neuron is set to fire after receiving three (half of the neurons) spikes.


The habituation neuron can be defined as

\frac{dV_{h,i}(t)}{dt} = -\lambda_h\, V_{h,i}(t) + \delta_i(t),
(2.4)

where \delta_i(t) is the spike train of the ith neuron in layer 2. Note that unlike the other neurons we have introduced, the habituation neuron does not reset at the end of the clock cycle, since it keeps the history of the patterns; it resets only when it reaches its threshold and fires a spike. The term \delta_{h,i}(t) refers to the spikes from the habituation neuron corresponding to the ith layer-2 neuron, and its low-pass-filtered version \mathrm{LPF}\{\delta_{h,i}(t)\} maintains the inhibitory effect on the frequently firing neuron at the second layer for a period defined by the filter time constant \tau_h. Therefore, the membrane potential of the neurons at the second layer is defined as

\frac{dV_i^{(2)}(t)}{dt} = -\lambda\, V_i^{(2)}(t) + \mathbf{w}_i \cdot \mathrm{EPSP}^{(1)}(t) - w_{\mathrm{inh}}\,\mathrm{LPF}\{\delta_{h,i}(t)\},
(2.5)

where \mathbf{w}_i \cdot \mathrm{EPSP}^{(1)}(t) is the dot product between the fixed weights of neuron i's kernel and the EPSP signals from the receptive field of neuron i, and \mathrm{EPSP}^{(1)} is the signal resulting from the spikes at the output of the first layer, emitted at times t_i^{(1)}.
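The mechanism can be sketched as follows for one layer-2 neuron; the time constants and the threshold are assumed values, not those of our simulator:

dt = 0.1e-3;  tauH = 0.2;  thH = 3;              % habituation time constant and threshold (assumed)
tauInh = 50e-3;                                  % duration of the low-pass-filtered inhibition
nSteps = 5000;  spikesL2 = double(rand(1, nSteps) < 0.01);   % placeholder layer-2 spike train
H = 0;  inhib = zeros(1, nSteps);
for s = 2:nSteps
    H = H * exp(-dt/tauH) + spikesL2(s);         % integrate layer-2 spikes across clock cycles
    if H >= thH                                  % frequent firing flags a "common feature"
        H = 0;                                   % the habituation neuron resets only when it fires
        inhib(s) = inhib(s-1) + 1;               % its spike starts the inhibition trace
    else
        inhib(s) = inhib(s-1);
    end
    inhib(s) = inhib(s) * exp(-dt/tauInh);       % low-pass-filtered inhibition of the layer-2 neuron
end
% in equation 2.5 the layer-2 membrane potential is reduced in proportion to inhib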

2.4  Layer 3: Classification

The third layer in this neural network is fully connected to the feature extraction layer (see Figure 2). The goal was to train these connections in an unsupervised fashion to classify 14 patterns of Figure 1 using spike combinations of six output neurons. In this spiking neural network, a modified STDP is used to change the weights, and in this section, we explain the details of the learning and the challenges while training this spiking neural network with the WSA algorithm in a completely unsupervised fashion.

2.4.1  Learning

Positive weight change. During each clock cycle, the neuron that spikes earliest carries the most information about the input. We use the EPSP kernel shape to incorporate that in the model, which results in longer stimulation of postsynaptic neurons from earlier arriving presynaptic spikes.

To reflect this in the learning algorithm, we use the following rules:

  • The presynaptic spikes arriving earlier and causing the postsynaptic neuron to fire should undergo a larger weight change than the ones that aided the stimulation but arrived later.

  • We introduce an intermediate parameter inspired by the calcium concentration and its role in learning via dendritic backpropagation (Shouval, Castellani, Blais, Yeung, & Cooper, 2002). This value increases with the arrival of the presynaptic spike at the synaptic cleft and decays over time in the form of e^{-(t - t_{\mathrm{pre}})/\tau_C}. When the postsynaptic spike occurs, it samples this parameter at the time of its firing, and the synaptic change is directly proportional to the sampled value.

Combining the two points above, the effective parameter C'(t) = 1 - e^{-(t - t_{\mathrm{pre}})/\tau_C} increases after a presynaptic spike is received, incorporating the effect of early versus late spikes. Therefore, later pre-spikes yield a lower value of C' than earlier ones, and their corresponding weights undergo a smaller change (see Figure 5C). This learning algorithm is essentially a form of anti-STDP rule within each clock cycle, because the weight change grows as the pre-spike arrives earlier in time (within the time bin) with respect to the post-spike. This is desirable, since the weight change reflects the mutual information between the pre- and postsynaptic neurons, which is encoded in the arrival times of the spikes. The learning algorithm is shown graphically in Figure 5A and can be described mathematically as

dw_p = A^{+}\, C'(t_{\mathrm{post}})\, H(t_{\mathrm{post}} - t_{\mathrm{pre}}),
(2.6)

where dwp is the positive weight change, A^{+} is the positive learning rate, and H is the Heaviside function implemented by the SR latch shown in Figure 5A.
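A minimal sketch of equation 2.6 for one synapse and one clock cycle (the constants are assumed values):

tauC = 5e-3;  Aplus = 0.01;                      % assumed calcium decay constant and learning rate
tPre = 1e-3;  tPost = 7e-3;                      % example spike times within one 10 ms cycle
if tPost >= tPre                                 % SR latch set: the pre-spike preceded the post-spike
    Cprime = 1 - exp(-(tPost - tPre)/tauC);      % complemented calcium trace sampled at the post-spike
    dwp = Aplus * Cprime;                        % earlier pre-spikes give a larger change (anti-STDP)
else
    dwp = 0;
end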
Figure 5:

As a spiking learning algorithm, a modified form of STDP (m-STDP) is used. Calcium concentration models are employed as part of the m-STDP rule to calculate dwp (positive weight change) and dwn (negative weight change). (A) When the prespike is emitted, the calcium concentration peaks and decays exponentially. Since an anti-STDP rule is required for the temporal codes discussed in this letter, an SR latch is employed to produce the complemented trace C′. Once the postsynaptic spike occurs, it samples C′ at the time of its firing, and the positive weight change is calculated. (B) The same rule is applied for the negative weight change. The weight can also undergo a negative change when there is no correlation between the pre- and postspikes: when the presynaptic neuron does not emit any spike within the clock cycle while the postsynaptic neuron has spiked, the negative weight change is maximal. This is modeled by sampling the Qb output of the SR latch with the postsynaptic spike; if the presynaptic neuron does not spike, Qb is high and causes a negative weight change as soon as the postsynaptic neuron spikes. (C) The weight change is plotted as a function of the difference between the arrival times of the pre- and postspikes in the learning algorithm used. The weight change takes the opposite form of STDP.


Negative weight change. When there is no correlation between a presynaptic and a postsynaptic neuron, the connection between them undergoes a negative weight change. This lack of correlation translates to the postsynaptic neuron firing before the presynaptic counterpart. The later the presynaptic neuron spikes with respect to the postsynaptic neuron (within a time bin), the more uncorrelated the two neurons are. Hence, the same anti-STDP rule applies here.

Note that the weight can also undergo a negative change when there is no correlation whatsoever between the pre- and postspikes, meaning that the presynaptic neuron does not emit any spike within the clock cycle while the postsynaptic neuron has spiked. In that case, the negative weight change is maximal. This is described graphically in Figure 5B and can be written mathematically as

dw_n = A^{-}\left[1 - e^{-(t_{\mathrm{pre}} - t_{\mathrm{post}})/\tau_C}\right] H(t_{\mathrm{pre}} - t_{\mathrm{post}}) + A^{-}\,\bar{Q}(t_{\mathrm{post}}),
(2.7)

where dwn is the negative weight change, A^{-} is the negative learning rate, H is the Heaviside function, and \bar{Q} is the complementary output (Qb) of the SR latch shown in Figure 5B, which is high at the post-spike if no pre-spike has occurred within the clock cycle.
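The corresponding sketch of equation 2.7, including the maximal depression for a silent presynaptic neuron (constants again assumed):

tauC = 5e-3;  Aminus = 0.01;  tPost = 4e-3;      % assumed constants and post-spike time
tPre = 8e-3;                                     % set tPre = [] to model a silent presynaptic neuron
if isempty(tPre)
    dwn = Aminus;                                % no pre-spike in the cycle: Qb is high, maximum depression
elseif tPre > tPost
    dwn = Aminus * (1 - exp(-(tPre - tPost)/tauC));   % later pre-spike: less correlated, stronger depression
else
    dwn = 0;                                     % the correlated case is handled by the positive change
end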

2.4.2  Inhibition

WSA. As discussed in section 1, the WSA algorithm requires at most half of the output neurons to fire in each clock cycle. To enforce this requirement, we need to ensure that (1) no neuron fires more than once within a given clock cycle, and (2) with six output neurons, all remaining neurons are inhibited upon the arrival of the third spike.

Figure 4B shows the solutions employed to establish these conditions. A self-inhibitory connection at each output neuron ensures that no neuron spikes more than once in any given time window, since the neuron inhibits itself upon spiking (condition 1). The inhibitory neuron in Figure 4B accumulates the spikes emitted by the pool of output neurons, and its threshold is set so that it fires after the third spike has been generated, inhibiting all the neurons at the output (condition 2). We can model the inhibitory neuron with the following mathematical expression:

\frac{dV_{\mathrm{inh}}(t)}{dt} = \sum_{j=1}^{6} \delta_j(t), \qquad \text{spike when } V_{\mathrm{inh}} \ge 3.
(2.8)

Therefore, the membrane potential of each output neuron can be described as

\frac{dV_j^{(3)}(t)}{dt} = -\lambda\, V_j^{(3)}(t) + \sum_i w_{ij}\,\mathrm{EPSP}_i^{(2)}(t) - w_{\mathrm{self}}\,\delta_j(t),
(2.9)

where V_j^{(3)} refers to the jth output neuron's membrane potential and w_{ij} is the weight from the ith neuron in layer 2 to the jth neuron in layer 3. To calculate the membrane potential of the jth neuron in the third layer, all the inputs from layer 2, multiplied by their corresponding weights, are summed. The term \delta_j(t) is the spike emitted by the same output neuron, representing the self-inhibition employed to ensure that the neuron does not spike more than once in any given time window.
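A sketch of how both conditions can be enforced within one time window (the membrane potentials are placeholders and the bookkeeping variables are illustrative):

Vout = rand(1, 6) * 2;  th = ones(1, 6);         % placeholder output potentials and thresholds
hasFired  = false(1, 6);                         % self-inhibition flags, cleared at each clock edge
winners   = [];                                  % neurons allowed to spike in this window
globalInh = false;                               % state of the global inhibitory neuron
for j = find(Vout >= th)
    if ~hasFired(j) && ~globalInh
        winners(end+1) = j;                      %#ok<AGROW> register the output spike
        hasFired(j) = true;                      % condition 1: at most one spike per neuron per window
        if numel(winners) >= 3                   % condition 2: the inhibitory neuron fires at the third spike
            globalInh = true;                    % and silences the remaining output neurons
        end
    end
end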

WTA. Inhibition in WTA can be reduced to lateral inhibitory connections between the output neurons, as is done in much previous work, such as Sheridan, Ma, and Lu (2014).

2.4.3  Greedy Attractors

Neuronal activity drives learning through changes of synaptic connectivity via the Hebbian mechanism or, in its spike-based form, spike-timing-dependent plasticity (STDP). However, STDP produces uncontrolled growth or decay of synaptic weights and is hence a destabilizing force in neural circuits. For example, in the context of the WSA algorithm, if one combination starts being favored for multiple patterns, Hebbian learning will encourage that behavior (since it is a positive feedback mechanism), the weights will be driven to their extremes, and much information will be lost.

One possible solution to this problem is homeostatic plasticity, which maintains the average neuronal activity within a range by dynamically adjusting the synaptic strength in the correct direction to promote stability (Turrigiano & Nelson, 2004). When neural activity falls too low, excitation between neurons is boosted and feedback inhibition is reduced, which in turn raises the firing rates of the neurons. Conversely, when activity is too high, excitation between neurons is reduced, and excitation onto interneurons and inhibitory feedback is strengthened, thereby lowering the activity of the neurons. Hence, homeostatic regulation of network activity in recurrent circuits is achieved by adjusting specific synapses to drive network activity toward a set point. It is worth noting that the homeostatic property could also be detrimental to retaining old memories, since it forces neurons to learn new patterns if the preferred stimulus of a neuron is not presented for a while. This, however, is not a concern in our study, since the patterns are shuffled and presented randomly to the network, and hence, on average, all the patterns are seen by the neurons with similar frequency.

Employing homeostatic property has two main advantages:

  1. It will not let one output neuron (or in our case, one combination of neurons) get very greedy and respond to many input patterns. In other words, it will monitor the competition and make sure all the combinations get a chance to respond to some input pattern.

  2. It will act as a regularizer for the neural network to avoid overfitting. Regularization ensures the network response to small random variations in the input is minimal by controlling the growth of the weights similar to the homeostatic property.

Implementing Homeostatic Plasticity (WSA). To dynamically adjust synaptic strength in the correct direction and promote stability in the weights, we first need to identify greedy behavior in the neurons. Therefore, a type of neural state detector (NSD) is required to recognize whether a neural state is occurring too often or not often enough. Figure 6 illustrates this idea. Each neuron in the state machine is assigned to respond to the occurrence of a certain combination, or state. Since we have enforced a maximum of three neurons (half of the output neurons) firing in every clock cycle, each neuron in the state machine takes input from the three output neurons that make up a unique combination. The leaky integrate-and-fire NSD neuron is designed to fire after detecting spikes from all three neurons at its input. Thus, at each clock cycle, a spike out of the neural state detector indicates that a particular combination has occurred. The inset in Figure 6A depicts the detection of the combination 1,2,3 by the first neuron in the NSD, which can be described mathematically as

\frac{dV_{\mathrm{NSD},1}(t)}{dt} = \mathrm{out}_1(t) + \mathrm{out}_2(t) + \mathrm{out}_3(t), \qquad \text{spike when } V_{\mathrm{NSD},1} \ge 3,
(2.10)

where out_1, out_2, and out_3 are the output spikes of the corresponding neurons in layer 3. In this example, 20 such NSD neurons are required, since there are 20 triple combinations that need to be monitored for over- or underspiking.
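The detector can be sketched as follows, enumerating the codes in the same order as Table 1 (illustrative bookkeeping, not the simulator's implementation):

codes = {};
for k = 3:-1:1                                   % triples first, then pairs, then singles
    c = nchoosek(1:6, k);
    for r = 1:size(c, 1)
        codes{end+1} = c(r, :);                  %#ok<AGROW>
    end
end                                              % 20 + 15 + 6 = 41 codes, as in Table 1
winners = [1 2 3];                               % example set of output spikes in one clock cycle
active  = find(cellfun(@(cc) isequal(sort(winners), cc), codes));
% 'active' is the index of the single NSD neuron whose combination occurred this cycle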
Figure 6:

Neural state machine (NSD) designed to control the frequency of the WSA codes. (A) NSD detects the occurrence of the codes. Each neuron in the NSD is to detect one code. The inset of the figure shows how the last neuron detects the occurrence of code 4,5,6. (B) The output of the NSD is fed to the frequency detector of the codes. The output of this layer detects if a certain code is happening too often (TOS) or too rarely (TRS). (C) TOS and TRS are combined and averaged using the green neurons whose spikes change the threshold of the output neurons. The green arrow shows the green neuron changing the threshold of the first neuron at the output layer. (D) Threshold of six output neurons changing in time to balance the spiking of the network.


To find out whether a combination is occurring too often or not often enough, and to correct for it, the spikes from the neural state machine are fed to a block of leaky integrator units (see Figure 6B). The time constant of these leaky integrators is a design choice and must be picked to define what "too often" and "too rarely" mean for the network at hand. Each leaky integrator (LI) keeps the history of the occurrences of one combination; if its output exceeds an upper threshold \theta_{\mathrm{high}}, it generates a spike indicating that the combination is occurring too often. Conversely, if the LI output falls below a lower threshold \theta_{\mathrm{low}}, a spike is generated indicating that the combination is dormant and occurring too rarely. This can be described mathematically as

\frac{dL_m(t)}{dt} = -\frac{L_m(t)}{\tau_{LI}} + \mathrm{NSD}_m(t), \qquad y_m(t) = L_m(t) + b,
(2.11)

where b is a positive bias added to the output of the integrator as a DC shift in order to detect a nonspiking combination through the decay of y_m below the low threshold:

\mathrm{TOS}_m:\; y_m(t) > \theta_{\mathrm{high}}, \qquad \mathrm{TRS}_m:\; y_m(t) < \theta_{\mathrm{low}}.
(2.12)

We can now use the information about combination occurrence frequency to steer the network toward stable convergence. One way of doing so, inspired by the work of Diehl et al. (2015) and Sheridan et al. (2014), is to use each too often spike (TOS) to increase the thresholds of the neurons in that combination and each too rare spike (TRS) to decrease the thresholds of the corresponding neurons. Increasing the threshold decreases a neuron's activity and decreasing it increases its activity, so this places the system in a negative feedback loop that keeps the frequency of each combination within a range, as discussed at the start of this section (see Figure 6C). The OR gates in the figure reduce the number of outputs from the number of combinations to the number of output neurons: the ith OR gate at the top sums the TOS signals of the combinations that include neuron i, and the ith OR gate at the bottom does the same for the TRS signals.

However, since we are dealing with combinations of neurons firing, a neuron might contribute to a TOS at the same time as it contributes to a TRS. To explain this more clearly, consider the following example: if combination 1,2,3 is occurring too often and 1,4,6 is not being seen at all, then neuron 1's threshold should increase because of the TOS and decrease because of the TRS. Therefore, we need an averaging mechanism to determine how the threshold of each neuron should change on average. This is achieved with another set of integrators that keep track of the history of the TOS and TRS events, as illustrated graphically in Figure 6C. The output of these averaging neurons is used to modify the thresholds of the output neurons through the feedback shown in green in the figure. The threshold voltage at every time step can hence be calculated as

\theta_j(t + \Delta t) = \theta_j(t) + A_{\theta}\left[\overline{\mathrm{TOS}}_j(t) - \overline{\mathrm{TRS}}_j(t)\right],
(2.13)

where \overline{\mathrm{TOS}}_j and \overline{\mathrm{TRS}}_j are the averaged TOS and TRS signals associated with neuron j and A_{\theta} sets the step size.

Figure 6D plots the time evolution of the threshold of six output neurons being changed to balance the spiking of the network. The threshold of all neurons converges to the suitable value as the network converges to the solution.
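A compact sketch of the full homeostatic loop for one clock cycle follows; the thresholds, decay rates, and step sizes are assumed values chosen only to illustrate the mechanism:

nCodes = 41;  nOut = 6;
M = false(nCodes, nOut);  idx = 0;               % code-membership matrix, ordered as in Table 1
for k = 3:-1:1
    c = nchoosek(1:6, k);
    for r = 1:size(c, 1), idx = idx + 1; M(idx, c(r, :)) = true; end
end
LI = zeros(1, nCodes);  b = 0.1;                 % per-code leaky integrators and DC bias
thHi = 0.6;  thLo = 0.05;  decay = 0.995;        % assumed "too often" / "too rarely" bounds
avg = zeros(1, nOut);  tauA = 0.99;  dTh = 1e-3; % per-neuron averaging and threshold step
theta = ones(1, nOut);                           % output-neuron thresholds
codeIdx = 1;                                     % example: code 1,2,3 occurred in this cycle
LI = LI * decay;  LI(codeIdx) = LI(codeIdx) + 1; % update the occurrence histories
TOS = (LI + b) > thHi;  TRS = (LI + b) < thLo;   % too-often / too-rarely events per code
for j = 1:nOut
    drive  = sum(TOS(M(:, j))) - sum(TRS(M(:, j)));   % combine events of the codes containing neuron j
    avg(j) = tauA*avg(j) + (1 - tauA)*drive;          % averaging neuron
    theta(j) = theta(j) + dTh * avg(j);               % raise the threshold if too active, lower it otherwise
end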

Implementing homeostatic plasticity (WTA). The same concept is implemented for the WTA version, except that in the WTA case, only 14 neurons are required in the NSD since for 14 patterns, 14 output neurons are utilized.

3  Results

In this section, we first go through the result of each algorithm (WSA and WTA) and then compare their results.

3.1  WSA

For the WSA algorithm, we look at the output spikes and the evolution of the pattern assignment to neural codes and classification performance.

3.1.1  Output Neurons

The output of the network is depicted in Figure 7A. In each 10 ms time window, the spikes of the six output neurons are shown as raster plots. As illustrated in the figure, three output neurons do not necessarily fire in every window: the inhibition rule enforces that no more than three of them fire, but not that exactly three fire. This verifies the difference, mentioned in section 1.2, between WSA and k-WTA: k-WTA would force exactly k neurons to fire, whereas in WSA the network chooses its codes freely during training, bounded only by k, which is three in this study. Hence it can choose to fire fewer than three neurons, as illustrated in Figure 7A.

Figure 7:

Raster plot of the spiking of the neurons at the output layer using WSA and WTA algorithms. (A) WSA: The spikes inside the time window show that half the neurons do not necessarily spike at each clock cycle. This opens up an even larger code space since other combinations of neurons are also participating in the code space, as discussed in the text. (B) WTA: At each clock cycle, only one neuron responds to the input pattern.


3.1.2  Classification

The network starts with normally distributed random initial weights with a mean of 0.03 and a standard deviation of 0.1. We noticed that the network converges faster with small initial weights, since otherwise the output of the network is already saturated and hard to train. The standard deviation is chosen so that the weights are sufficiently spread out, giving the network the initial imbalance it needs to start the competition. Over the course of training, the neurons learn to respond with a certain code to each pattern. The different patterns, numbered 1 to 14, are assigned to unique combinations (1–41) from the code space and are hence separated by the network. Table 1 lists the numbers that refer to the patterns and the codes.

Table 1:
Corresponding Numbers Referring to Patterns (Top) and Codes (Bottom).

Number  Pattern:  1 H   2 F   3 T   4 L   5 X   6 Y   7 K   8 T   9 M   10 W   11 S   12 A   13 N   14 V

Number  Neural Code        Number  Neural Code
1       1,2,3              21      1,2
2       1,2,4              22      1,3
3       1,2,5              23      1,4
4       1,2,6              24      1,5
5       1,3,4              25      1,6
6       1,3,5              26      2,3
7       1,3,6              27      2,4
8       1,4,5              28      2,5
9       1,4,6              29      2,6
10      1,5,6              30      3,4
11      2,3,4              31      3,5
12      2,3,5              32      3,6
13      2,3,6              33      4,5
14      2,4,5              34      4,6
15      2,4,6              35      5,6
16      2,5,6              36      1
17      3,4,5              37      2
18      3,4,6              38      3
19      3,5,6              39      4
20      4,5,6              40      5
                           41      6

Pattern assignments to codes are illustrated in Figure 8A. Each circle highlights the code that is most responsive to each pattern, measured by counting all the codes appearing at the output in response to that pattern. The code that is the most frequent response (MFR) is chosen as the "class" to which the input pattern belongs. In Figure 8A, these circles are colored according to their count before and after training. Before training, all the patterns are assigned to the same few clusters, and hence most of them belong to a similar class. After training, however, the patterns are separated, and each of them is assigned to a unique code represented by a circle.

Figure 8:

Code assignment in response to each pattern (1–14) before (left) and after (right) training in WSA (A) and WTA (B). The color of each circle represents the number of times a code appeared in a certain range of time (count) mapped by the color bar on the right of each figure. The left figure is the result of the first 500 ms of the simulation; hence, the counts are low, and many patterns are assigned to the same code. The right figure is the code assignment in the network taken from the time interval [79,000–81,000] ms. The patterns are therefore separated as they are assigned to unique codes. This is explained in detail in the text.


To calculate the performance, the response of each code to all the patterns is considered. The pattern to which each code responds the most is assigned to that code, and the number of times the code appeared in response to the pattern—the maximum code count (MCC)—is used to calculate performance.

It is worth noting that if multiple codes respond to the same pattern, this is not necessarily an error, since the code space is larger than the number of patterns and a pattern can be assigned to multiple codes. Therefore, the network accuracy is calculated as
formula
3.1
where NP is the number of patterns, SumofMCCs is the sum of the MCCs described above, and NotAssigned is the number of patterns that the network did not classify as separate patterns.
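The evaluation can be sketched as follows; the response log (codeId, patId) is a placeholder, and the final normalization is our reading of equation 3.1, given only for illustration:

nCodes = 41;  nPats = 14;
codeId = randi(nCodes, 500, 1);  patId = randi(nPats, 500, 1);    % placeholder response log
counts = accumarray([codeId patId], 1, [nCodes nPats]);           % responses per (code, pattern) pair
[MCC, bestPat] = max(counts, [], 2);                              % each code's preferred pattern and its MCC
notAssigned = nPats - numel(unique(bestPat(MCC > 0)));            % patterns that own no code
accuracy = sum(MCC) / sum(counts(:)) * (nPats - notAssigned) / nPats;   % assumed form of equation 3.1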

The ultimate measure of how well the WSA algorithm works is the classification accuracy on a test set that was not presented to the network during training. This is shown in Figure 9A. The network performance reaches 87% on the training set (blue) and 84% on the test set (red). The lower performance on the test set is expected, as the network has not "seen" those patterns before. Note that since this is a small synthetic image set, separating the patterns is more difficult than it would be for natural images; as the images become larger and more natural, the similarity between patterns decreases and performance increases.

Figure 9:

Network performance evolution for the training set (blue) and test set (red). (A) Using the WSA algorithm, training set and test set performance converge to 87% and 84%, respectively. (B) Using the WTA algorithm, training set and test set performance converge to 93% and 91%, respectively.


3.2  WTA

In this section, we report the results of the classification and pattern separation by employing WTA on the same problem set that we used for WSA. We will look at the output spikes and the evolution of the pattern assignment to neural codes and classification performance.

3.2.1  Output Spikes

The output of the network is depicted in Figure 7B. In each time window of 10 ms, the outputs of 14 output neurons are shown as raster plots. At each time window, we have one winner neuron that responds to the pattern presented during that time.

Figure 8B depicts the network response to the patterns before and after training. As is shown in the figure, before training, all the patterns are mixed together and the network responds to them similarly. After training, the patterns are separated, and all of them are assigned a unique code represented by the circle.

3.2.2  Classification

We initialize the network by sampling the initial weights from the same probability distribution as was used for WSA (normal distribution with a mean of 0.03 and a standard deviation of 0.1). After training, each neuron responds to a different pattern. Network performance is calculated in the same manner as for WSA, formulated in equation 3.1. Figure 9B plots the evolution of network performance over time. Classification accuracy reaches 93% on the training set (blue) and 91% on the test set (red).

3.3  WSA versus WTA: A Comparison

To highlight the strength of the WSA algorithm, Table 2 compares the number of neurons, connections, and learning parameters of an equivalent network using the WTA algorithm as opposed to WSA.

Table 2:
Comparison of the Number of Neurons, Connections, and Learning Parameters Using the WTA versus WSA Algorithm in This Work.

WSA                    Neurons   Connections   Weights   Learning?   Learning Parameters
Layer 1                225
Layer 2                100       900           900       No
Layer 3                6         600           600       Yes         600
Inhibitory neuron      1         6                       No
Habituation neurons    100       100                     No
Neural state machine   41        96                      No
Homeostatic neurons    12        12                      No
Total                  485       1714          1500                  600

WTA                    Neurons   Connections   Weights   Learning?   Learning Parameters
Layer 1                225
Layer 2                100       900           900       No
Layer 3                14        1400          1400      Yes         1400
Inhibitory neuron                                        No
Habituation neurons    100       100                     No
Neural state machine   18                                No
Homeostatic neurons    12        12                      No
Total                  457       2430          2300                  1400

Neurons in layers 1 and 2 in both algorithms are similar since the first layer is converting the intensity of the pixels to temporal information and the second layer is detecting features. As discussed before, WSA uses 6 neurons to classify 14 patterns, whereas WTA would need at least 14 neurons. This translates into 600 learning parameters when WSA is used in comparison with 1400 learning parameters required by the WTA algorithm. Clearly, WSA requires less on-chip memory in a hardware implementation compared to WTA.

Since the images have a high degree of similarity, habituation neurons are needed in both cases to desensitize the output neurons to the "mutual information" in the input patterns. Since one habituation neuron is required per neuron in layer 2 (see section 2.3.1), 100 habituation neurons are needed in both the WSA and WTA cases.

An unsupervised learning algorithm using Hebbian learning requires a homeostatic property to prevent unbounded growth of the weights. In WSA, this control must be done at all levels of codes. Since here we are using 41 codes (up to 3 neurons firing in the same clock cycle), we need control on the appearance of the codes in three levels: single, double, or triple neurons. In the NSD block, 20 neurons monitor the combination of spiking of 3 neurons; therefore, each is connected to 3 neurons (60 connections). Fifteen neurons are monitoring the spiking of double neurons (30 connections), and 6 neurons are monitoring the spiking of single neurons (6 connections). Therefore, 41 neurons with 96 connections are required in the NSD using WSA. This number reduces to 6 neurons and 6 connections in the WTA algorithm. However, looking at the total number of neurons and connections, it is apparent that WSA requires fewer neurons, connections, and learning parameters on a chip. Note that the connections in NSD are nonplastic, and a normalized weight of 1 is used for all of them.

Table 3 generalizes the number of neurons, connections, and learning parameters at each layer for an arbitrarily large network for both the WTA and WSA algorithms, assuming the same neural network architecture discussed in this letter. ImageSide is the number of pixels along one side of the (square) image, KernelSize is the size of the kernel used to extract features of the image, and NF is the number of such features. NeuronL2 and NeuronL3 are the numbers of neurons in layers 2 and 3, defined in the second column of the table, and NP is the number of patterns fed to the network. The remaining parameters are defined at the bottom of the table.

Table 3:
Comparison of the Number of Neurons, Connections, and Learning Parameters Using the WTA versus WSA Algorithm for a General Case.
[Table 3 lists, for both WSA and WTA, the number of neurons, connections, weights, and learning parameters of layers 1 to 3, the inhibition and habituation neurons, the neural state machine, and the homeostatic plasticity block, expressed in terms of ImageSide, KernelSize, NeuronL2, NeuronL3, and the number of patterns.]

To compare how the two algorithms scale, we plot the growth of the number of learning parameters against two quantities: the number of patterns and the number of pixels in the image. This is shown in Figure 10. Learning parameters are the weights that must be stored on a neuromorphic chip; storing them on chip takes up area, and reading and writing them consumes power. Reducing the number of learning parameters therefore saves area and reduces the power consumption of the neuromorphic chip.
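As a rough illustration of the trend in Figure 10A, the sketch below compares learning-parameter growth for the two schemes. It assumes a fixed layer-2 size of 100 neurons and sizes the WSA output layer by the capacity C(n, n/2) implied by the 6-output, 3-winner example; these are working assumptions for illustration, not the exact expressions of Table 3.

```python
from math import comb

N_LAYER2 = 100  # layer-2 feature neurons, held fixed as in the example network

def wsa_outputs(n_patterns: int) -> int:
    """Smallest n whose 'half of n fire together' capacity C(n, n//2)
    covers all patterns (gives n = 6 for the 14-letter example)."""
    n = 2
    while comb(n, n // 2) < n_patterns:
        n += 1
    return n

for patterns in (14, 50, 100, 500, 1000):
    wsa_params = wsa_outputs(patterns) * N_LAYER2
    wta_params = patterns * N_LAYER2  # one-hot: one output neuron per pattern
    print(f"{patterns:5d} patterns:  WSA {wsa_params:7d}   WTA {wta_params:7d}")
```

Under these assumptions, 1000 patterns require about 1300 learning parameters with WSA but 100,000 with WTA, which reflects the qualitative gap shown in Figure 10A.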

Figure 10:

Growth of the number of learning parameters with the number of patterns (A) and the image size (B). In both comparisons, the number of learning parameters grows much more slowly with WSA (blue) than with WTA (red).


3.4  A Closer Look at the Codes in WSA

It is interesting to examine whether there is any correlation between the distance between patterns and the distance between the codes the network assigns to them in the WSA algorithm. The former can be measured by the similarity between patterns, defined as the number of common pixels, and the latter is simply the edit distance between the codes (Ukkonen, 1983). Edit distance quantifies how dissimilar two strings are by counting the minimum number of operations needed to transform one string into the other. After training, the edit distance between the code assigned to one pattern (e.g., A) and the codes assigned to all the other patterns is measured as a function of that pattern's similarity to the others. Doing this for all the patterns generates a joint distribution over edit distance and number of common pixels (the measure of similarity), as shown in Figure 11. Each circle's color represents the number of times a given (edit distance, similarity) pair appeared in the study; the mapping from count to color is given by the color bar on the right side of the graph. As the figure shows, the probability of a low edit distance between codes increases as the similarity between patterns increases. This is confirmed in the limiting case where all pixels are the same (a self-comparison): the assigned code is identical and the edit distance is zero. The output of the network therefore provides a second piece of information beyond recognition, namely how similar the patterns are to one another.
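For readers who wish to reproduce this analysis, a minimal Python sketch follows. It assumes each WSA code is written as a short string of output-neuron labels (e.g., "ACF" for neurons A, C, and F firing together) and that pattern similarity is the count of shared "on" pixels; neither representation is prescribed here, and the codes in the example are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def pixel_similarity(img1, img2) -> int:
    """Number of common 'on' pixels between two binary images given as flat lists."""
    return sum(1 for p, q in zip(img1, img2) if p and q)

# Example with two hypothetical codes assigned to two patterns:
print(edit_distance("ACF", "ACE"))  # 1: the codes differ in a single winner
```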

Figure 11:

Distribution of the edit distance between the code assigned to a pattern and the codes assigned to all other patterns, as a function of the pattern's similarity to those patterns. Each circle's color represents the number of times each (pattern similarity, code distance) pair appeared when going through all the patterns. As patterns become more similar, the probability of a smaller distance between their assigned codes increases; that is, similar patterns are assigned codes that are closer together in the code space.


4  Discussion

This work has explored the code space provided by the temporal encoding of information in the timing of neuronal spikes. An algorithm dubbed winners-share-all was used, in which the combinatorial spiking of the output neurons encodes information. The algorithm was tested on an artificial data set consisting of 14 images of letters of the English alphabet.

The network results, performance, and resources required by the WSA algorithm are then compared with those of the conventional WTA algorithm, in which one neuron (or a population of neurons) encodes each input pattern. Our results suggest that by taking advantage of the combinatorial codes available in temporal coding, we can reduce the resources (learning parameters, i.e., plastic synapses) required to perform pattern classification. Since the network encodes information, one might ask whether decoding is necessary to make sense of the network's output (such as the mapping suggested in Table 1). In fact, this decoding is already performed by the neurons in the neural state decoder (NSD), which are also used to regulate the spiking of the neurons. Decoding the WSA output therefore needs no further neurons or circuits; existing functional blocks can be reused to decode the network output. Moreover, the output of the network not only classifies the patterns but also indicates how correlated each pattern is with the others, since closely related patterns are assigned codes with a smaller edit distance. By treating the collective firing of neurons as the computing machine, the effective memory of the synapses increases, and many more bits of information can be encoded into the network. However, although WSA requires fewer memory bits to perform the same task, it results in slightly lower classification accuracy (84% for WSA versus 91% for WTA). WSA is therefore beneficial when accuracy can be traded off against the number of available resources.

5  Conclusion

By taking advantage of coding theory, we developed a novel algorithm to exploit the coding space provided by spiking neural networks. We introduced the concept of winners-share-all as a replacement for winner-takes-all; it takes a combinatorial approach to coding the spikes emitted by the output neurons. With this algorithm, fewer connections and training parameters are required to perform classification tasks. We showed how a network of interlayer neurons can overcome the challenges encountered when employing completely unsupervised learning on such temporal codes. We used this algorithm to classify 14 artificial images using only six output neurons and achieved a classification accuracy of 84% on the test set with a completely unsupervised approach. We compared this approach with the conventional one that uses one neuron per pattern at the output (one-hot code), which reached a classification accuracy of 91% on the test set with the same input patterns. WSA requires more than a factor of two fewer training parameters for this computation. We generalized this problem to an arbitrary number of patterns and concluded, by calculation, that as the number of patterns grows, the number of learning parameters grows much more slowly in WSA than in WTA.

In the future, it will be necessary to study whether this algorithm can be extended to every layer. If all the layers of a deep network can be coded using this algorithm, it will be extremely efficient in terms of area, since the temporal code will be exploited at every layer and the number of parameters will be reduced at each layer by the WSA parameter introduced in section 1.

Notes

1

For a 1 ms time precision, the pure temporal code carries log2(T) bits of information in a window of T ms (Paugam-Moisy & Bohte, 2012).

2

Neurons firing with order 1, 2, and then 3.

3

This parameter is used, directly or indirectly, in previous work on temporal coding. For example, Smith (2015) defines it as the maximum extent of a temporal frame of reference, which is the time over which all neurons responsive to a pattern should spike; the network's computation for a pattern presented at the beginning of the time frame is therefore complete after this timescale. Gütig and Sompolinsky (2009) also use a time window in which they present their spatiotemporal pattern. The length of the time window is chosen as 10 ms because it is in the range of the biological membrane time constant over which the effect of a spike persists as an EPSP (Jug, 2012).

Acknowledgments

This work was funded by the Air Force Office of Scientific Research MURI grant FA9550-12-1-0038. The authors thank the anonymous reviewers for their very helpful comments, which significantly improved the paper.

References

Bialek, W., Rieke, F., de Ruyter van Steveninck, R., & Warland, D. (1991). Reading a neural code. Science, 252(5014), 1854–1857.

Bichler, O., Querlioz, D., Thorpe, S. J., Bourgoin, J. P., & Gamrat, C. (2012). Extraction of temporally correlated features from dynamic vision sensors with spike-timing-dependent plasticity. Neural Networks, 32, 339–348. doi:10.1016/j.neunet.2012.02.022

Coates, A., & Ng, A. (2011). Selecting receptive fields in deep networks. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, F. Pereira, & K. Q. Weinberger (Eds.), Advances in neural information processing systems, 24. Red Hook, NY: Curran.

Diehl, P. U., Neil, D., Binas, J., Cook, M., Liu, S. C., & Pfeiffer, M. (2015). Fast-classifying, high-accuracy spiking deep networks through weight and threshold balancing. In Proceedings of the International Joint Conference on Neural Networks. Piscataway, NJ: IEEE. doi:10.1109/IJCNN.2015.7280696

Gütig, R., & Sompolinsky, H. (2009). The tempotron: A neuron that learns spike timing–based decisions. Nature Neuroscience, 9(3), 420–428. doi:10.1038/nn1643

Guyonneau, R., VanRullen, R., & Thorpe, S. J. (2004). Temporal codes and sparse representations: A key to understanding rapid processing in the visual system. Journal of Physiology Paris, 98, 487–497. doi:10.1016/j.jphysparis.2005.09.004

Heiligenberg, W. (1991). Neural nets in electric fish. Cambridge, MA: MIT Press.

Jug, F. (2012). On competition and learning in cortical structures. Ph.D. dissertation, ETH Zurich.

LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

Maass, W. (2000). On the computational power of winner-take-all. Neural Computation, 12, 2519–2535. doi:10.1016/S0166-218X(96)00058-3

Masquelier, T., & Thorpe, S. J. (2007). Unsupervised learning of visual features through spike timing dependent plasticity. PLoS Computational Biology, 3(2), 0247–0257. doi:10.1371/journal.pcbi.0030031

Mead, C. (1989). Analog VLSI and neural systems. Reading, MA: Addison-Wesley.

Ng, A., Ngiam, J., Foo, C. Y., Mai, Y., Suen, C., Coates, A., … Tao Wang, S. T. (n.d.). Optimization: Stochastic gradient descent. Retrieved from http://ufldl.stanford.edu/tutorial/supervised/OptimizationStochasticGradientDescent/

Orchard, G., Meyer, C., Etienne-Cummings, R., Posch, C., Thakor, N., & Benosman, R. (2015). HFirst: A temporal approach to object recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8828, 1–1. doi:10.1109/TPAMI.2015.2392947

Paugam-Moisy, H., & Bohte, S. (2012). Computing with spiking neuron networks. In G. Rozenberg, T. Bäck, & J. N. Kok (Eds.), Handbook of natural computing (pp. 335–376). Berlin: Springer-Verlag. doi:10.1007/978-3-540-92910-9_10

Sheridan, P., Ma, W., & Lu, W. (2014). Pattern recognition with memristor networks. In Proceedings of the IEEE International Symposium on Circuits and Systems (pp. 1078–1081). Piscataway, NJ: IEEE. doi:10.1109/ISCAS.2014.6865326

Shouval, H. Z., Castellani, G. C., Blais, B. S., Yeung, L. C., & Cooper, L. N. (2002). Converging evidence for a simplified biophysical model of synaptic plasticity. Biological Cybernetics, 87(5–6), 383–391. doi:10.1007/s00422-002-0362-x

Smith, J. E. (2015). Biologically plausible spiking neural networks. Self-published monograph, Missoula, MT.

Thorpe, S., Delorme, A., & Van Rullen, R. (2001). Spike-based strategies for rapid processing. Neural Networks, 14(6–7), 715–725. doi:10.1016/S0893-6080(01)00083-1

Turrigiano, G. G., & Nelson, S. B. (2004). Homeostatic plasticity in the developing nervous system. Nature Reviews Neuroscience, 5(2), 97–107. doi:10.1038/nrn1327

Ukkonen, E. (1983). On approximate string matching. In M. Karpinski (Ed.), Foundations of computation theory. Lecture Notes in Computer Science, vol. 158. Berlin: Springer-Verlag.

Author notes

M.P. is now at the Institute of Neuroinformatics, University of Zurich, ETH Zurich.