This letter presents a spike-based model that employs neurons with functionally distinct dendritic compartments for classifying high-dimensional binary patterns. The synaptic inputs arriving on each dendritic subunit are nonlinearly processed before being linearly integrated at the soma, giving the neuron the capacity to perform a large number of input-output mappings. The model uses sparse synaptic connectivity, where each synapse takes a binary value. The optimal connection pattern of a neuron is learned by using a simple hardware-friendly, margin-enhancing learning algorithm inspired by the mechanism of structural plasticity in biological neurons. The learning algorithm groups correlated synaptic inputs on the same dendritic branch. Since the learning results in modified connection patterns, it can be incorporated into current event-based neuromorphic systems with little overhead. This work also presents a branch-specific spike-based version of this structural plasticity rule. The proposed model is evaluated on benchmark binary classification problems, and its performance is compared against that achieved using support vector machine and extreme learning machine techniques. Our proposed method attains comparable performance while using 10% to 50% fewer computational resources than the other reported techniques.
Neuromorphic systems (Mead, 1990) aim to emulate neuronal function by implementing the structure of the biological system. By using analog subthreshold very large scale integration (VLSI) circuits, these systems can implement a power-efficient model of neuronal processing. This approach is particularly useful for testing ideas about brain function that span the levels of ion channels to neural microcircuits and systems. There has already been significant effort in this realm (Arthur & Boahen, 2007; Bartolozzi & Indiveri, 2009; Basu, Petre, & Hasler, 2008; Hynna & Boahen, 2006; Liu & Boahen, 1996; Liu et al., 2001; Mahowald, 1994; Hasler & Marr, 2013). The latest neuromorphic hardware systems are event-driven real-time implementations of energy-efficient, spike-based intelligent processing systems. Two key advantages of these systems are as follows: (1) the event-driven way of processing leads to minimal power dissipation during long inactive periods (O’Connor, Neil, Liu, Delbruck, & Pfeiffer, 2013; Neil & Liu, 2014), and (2) the systems are low power because the transistors are operated in the subthreshold region, where currents are small. Usable event-driven sensors such as the dynamic vision sensor retina (Lichtsteiner, Posch, & Delbruck, 2008; Serrano-Gotarredona & Linares-Barranco, 2013; Posch, Matolin, & Wohlgenannt, 2011) and the AEREAR2 cochlea (Liu, van Schaik, Minch, & Delbruck, 2013) that produce asynchronous spike information about stimulus features in real-time environments are already available. There is a need to interface these sensors to event-based multineuron classifier systems that generate information only when necessary. These classifiers can benefit from the larger computational power of spiking neural networks (Maass & Schmitt, 1999) and can also be used in novel applications related to brain-machine interfaces where the incoming spikes are generated by biological neurons (Nicolelis, 2003).
Reconfigurable spiking neural networks composed of neurons and synapses have been fabricated in VLSI. The networks are usually implemented as mixed analog-digital circuits on the chip, while the connectivity information of the network is stored in an off-chip digital memory. An asynchronous communication protocol, the address-event representation (AER) (Boahen, 2000; Serrano-Gotarredona et al., 2009; Choi, Merolla, Arthur, Boahen, & Shi, 2005; Vogelstein, Mallik, Culurciello, Cauwenberghs, & Etienne-Cummings, 2007), is typically used to route the neuronal spikes to and from the chip on a shared fast digital bus. Since the connectivity information is virtual, the user has full flexibility in reconfiguring the network. Synaptic weights are typically implemented using on-chip digital-to-analog converters (DACs) that are shared across synapses. Very few spike-based multineuron systems have reported solving real-world pattern classification or memorization problems at a level comparable to state-of-the-art algorithmic approaches. One major challenge in attaining classification accuracy similar to a software implementation of a system on a computer is that statistical variations in VLSI devices reduce the accuracy of the synaptic weights. This problem is exacerbated by the fact that device mismatch is increasing with progressively shrinking transistor sizes. Hence, it has become increasingly important to design networks and algorithms that either exhibit robustness to variations or use low-resolution synapses, which are better suited for robust implementation in VLSI systems. A spiking network classifier was implemented on a neuromorphic hardware system in which population coding was used to achieve tolerance against variability (Schmuker, Pfeil, & Nawrot, 2014). This system, however, uses a large number of synapses, which need to have high-resolution weights to attain the performance reported in the study. This neuromorphic network achieved performance comparable with a standard machine learning linear classifier. However, nonlinear classifiers like the support vector machine (SVM) can give far better results.
In another study (Neftci, Das, Pedroni, Kreutz-Delgado, & Cauwenberghs, 2014), a software simulation of a spiking neuromorphic system consisting of restricted Boltzmann machines (RBMs) was proposed with a learning rule based on spike-timing-dependent plasticity (STDP) for changing synaptic weights. These RBMs, composed of leaky integrate-and-fire neurons, were shown to be robust to the use of finite-precision synaptic weights, with less than a 3% drop in performance on an MNIST recognition task when using 8-bit versus 5-bit precision weights. However, RBMs use a high number of synapses (almost 2 million recurrent connections), which makes their hardware implementation quite costly. In contrast, several studies have used binary synapses, inspired by the evidence that biological synapses exist in only a small number of states (Petersen, Malenka, Nicoll, & Hopfield, 1998; O’Connor, Wittenberg, & Wang, 2005). The use of binary synapses also provides a simple and robust solution to the variability due to device mismatch. A spike-based STDP learning algorithm using bistable synapses was implemented in Brader, Senn, and Fusi (2007), with VLSI implementations in Chicca et al. (2003), Badoni, Giulioni, Dante, and Giudice (2006), and Mitra, Fusi, and Indiveri (2009). This supervised learning algorithm was used to train a network to classify handwritten characters from the MNIST database (Brader et al., 2007). The classification accuracy was improved by having a pool of neurons that compute in parallel. However, this again leads to the problem of an increased number of synapses, by a factor equal to the number of neurons in the pool.
In this work, we present a bioplausible structural learning rule for a neuron with nonlinear dendritic branches that is suitable for implementation on neuromorphic chips. The letter is organized as follows. The full spike-based classification model of neurons with dendritic nonlinearity is presented in section 2.1, followed by the reduced model for faster training (RM) in section 2.2 and the reduced model with margin (RMWM) in section 2.3. The theoretical capacities of the RM and RMWM provide an upper bound on the number of patterns that can be correctly classified, as discussed in section 2.4. The learning rules for the RM and RMWM are based on the structural plasticity mechanism and are presented in section 2.5. Further, a branch-specific spike-timing-based implementation of the structural learning rule (BSTDSP) is presented in section 2.6. The models were tested on a two-class classification task (see sections 3.1 to 3.3), and the performance of the RMWM was compared against the SVM and ELM methods on a set of real-world binary databases in section 3.4. The biological and computational relevance of our work is discussed in section 4, followed by the conclusion and discussion of future work in section 5.
2 Model of Nonlinear Dendrites and Structural Learning
Various computational modeling studies have shown that neuronal cells having nonlinear dendrites (NLD) can perform complex classification tasks much more accurately than their counterparts with linear subunits when the inputs are high-dimensional binary patterns (Poirazi & Mel, 2001; Hussain, Gopalakrishnan, Basu, & Liu, 2013). The presence of NLD provides a neuron with many nonlinear functions to better approximate the decision boundary of a complex classification problem. In this work, we extend the NLD model of Poirazi and Mel (2001) to perform supervised classification of binary patterns consisting of rate-encoded spike train inputs or patterns of single spikes.
We consider two types of spike inputs: mean rate and single spike patterns. These spiking inputs encode binary patterns of numbers denoted as x = [x1, x2, ..., xd], where each component xi ∈ {0, 1}. The first type is binary patterns of mean rate inputs, defined such that for a spike train $s_i(t) = \sum_k \delta(t - t_i^k)$ arriving at the ith afferent, where $\delta$ is a delta function and the spike times $t_i^k \in [0, T]$ (T is the duration of the pattern), the corresponding xi is set to 1 if the mean firing rate of the spike train (averaged over T) is higher than a threshold, and to 0 otherwise (see Figure 1). The second type of spike inputs are binary patterns of single spikes, in which xi = 1 if the ith afferent fires a single spike and xi = 0 if it fails to fire. Furthermore, all spikes arrive within a small time window ΔT, that is, $t_i^1 \in [T_{syn} - \Delta T/2, T_{syn} + \Delta T/2]$, where Tsyn is a fixed time within [0, T]. The binary inputs are referred to as binary vectors in the rest of the letter. To reduce the training time, we have adopted the strategy of training the model on binary vectors and testing on the original spike train patterns. We next describe the full spike-based model, followed by its reduced versions.
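The following minimal Python sketch illustrates this binarization convention; the function names, the threshold value, and the toy spike trains are our own illustrative choices, not part of the model specification.

```python
import numpy as np

def binarize_rate_pattern(spike_times, T, f_thresh):
    """Map spike trains (one array of spike times per afferent) to a binary
    vector: xi = 1 if the mean rate over [0, T] exceeds f_thresh."""
    return np.array([1 if len(t) / T > f_thresh else 0 for t in spike_times])

def binarize_single_spike_pattern(spike_times):
    """xi = 1 if the ith afferent fired (a single spike), else 0."""
    return np.array([1 if len(t) > 0 else 0 for t in spike_times])

# Example: 4 afferents over a T = 0.5 s pattern, threshold 20 Hz
pattern = [np.array([0.01, 0.05, 0.1, 0.2, 0.3, 0.32, 0.4, 0.41, 0.45, 0.46, 0.47]),
           np.array([0.2]),
           np.array([]),
           np.arange(0.0, 0.5, 0.02)]
print(binarize_rate_pattern(pattern, T=0.5, f_thresh=20.0))  # -> [1 0 0 1]
```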
2.1 Spike-Based Model
We previously proposed a model of neuronal cells having a lumped dendritic nonlinearity (Hussain et al., 2013). The nonlinear neuron (NL-neuron) consists of a simplified neuron with m dendritic branches and k excitatory synapses on each branch (see Figure 2a). Each synapse is driven by one of the d components of the input. We enforce the constraint k ≪ d in our model for several reasons: first, choosing the best k connections out of d possibilities is the basis of our learning rule; second, sparse connections reduce synaptic resources; and third, sparsity provides a means of regularization by reducing the number of free parameters.
Figure 3a shows the input and output of a dendrite (ellipse) when a rate-encoded spike pattern is presented. Sample waveforms denote the time evolution of the dendritic input zj(t) and its output for a case where the dendritic nonlinearity b() is a squaring function. Figure 3b shows an example case where the firing rate of the neuron trained on mean-rate-encoded spike patterns of class (+) is higher for an input belonging to that class. These output spike trains of the (+) and (−) neurons are used as inputs to the winner-take-all (WTA) stage, which gives the classification output.
2.2 Reduced Model
The use of a squaring nonlinearity for each dendritic branch will result in unrealistically large values for large inputs and large power dissipation in the corresponding analog VLSI hardware. Hence, we introduce a saturation level bsat such that b(z) = bsat for z ≥ xthr. Figure 5 shows the nonlinear function for different xthr values, with (dashed) and without (solid) the bsat term. This saturating dendritic activation function also conforms well with electrophysiological measurements of synaptic summation within dendrites of pyramidal neurons (Polsky, Mel, & Schiller, 2004). Fits using saturating polynomial functions on the experimental data measured in pyramidal neurons in the cited study are presented in section 4.1.
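As an illustration, a sketch of such a saturating squaring nonlinearity is given below. The z**2 / x_thr scaling is an assumed form chosen to make the curve continuous at the saturation point; the exact expression used in equation 2.13 may differ.

```python
import numpy as np

def b(z, x_thr, b_sat=None):
    """Quadratic dendritic nonlinearity with optional saturation.
    Below x_thr the branch output grows as z**2 (scaled, an assumption);
    above it, the output is clamped at b_sat."""
    out = z ** 2 / x_thr              # expansive (squaring) regime
    if b_sat is not None:
        out = np.minimum(out, b_sat)  # saturating regime
    return out

z = np.linspace(0, 10, 6)
print(b(z, x_thr=5.0))                 # without saturation (solid curve in Figure 5)
print(b(z, x_thr=5.0, b_sat=5.0))      # with saturation (dashed curve)
```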
2.3 Reduced Model with Margin
We found that the choice of the margin δ depends on the size of the neuron, as measured by the number of branches and the number of synapses on each branch. The parameter δ can also be varied to achieve a trade-off between the training and the generalization performance, as shown in Brader et al. (2007). A larger δ will give better generalization ability but reduced training performance, while smaller values of δ will give optimal performance on the training set at the cost of reduced generalization performance. Therefore, an adaptive method was used to set the value of δ by assigning an initial high value and then reducing it gradually, as discussed in section 2.5.2.
2.4 Theoretical Capacity Predictions
For both the RM and RMWM, we predict the theoretical capacities of the L- and NL-neurons by estimating the number of distinct input-output functions that can be expressed by these neurons as a function of the total number of synapses, s, and the number of distinct input lines, d. We wanted to see if the added flexibility provided by the nonlinear dendrites allows us to relax the resolution of weights used. To do this, we use binary weights such that wij = 1 if a connection exists and 0 otherwise. Therefore, the capacity of these neurons can be given by combinatorial expressions that count the number of input-output mappings, that is, all possible ways in which input lines can connect to synapses on dendrites resulting in distinct memory fields. These expressions were derived in Poirazi and Mel (2001) and are provided here for reference. For an L-neuron, the capacity is given by the number of ways in which s synaptic sites can be drawn from d input lines with replacement, equal to $\binom{d+s-1}{s}$. The spatial location of the s synapses is not considered. For an NL-neuron, we first count the number of distinct branch functions available by drawing k synapses from d afferents with replacement, given by $f = \binom{d+k-1}{k}$, and then the number of ways in which the m branches of an NL-neuron can be drawn from the f possible functions, calculated as $\binom{f+m-1}{m}$, where the total number of synapses, s = mk, is kept the same as that for the L-neuron to allow direct comparison of capacities. The theoretical capacity calculated from these expressions is compared with the classification errors measured by training linear and nonlinear neurons on binary classification tasks as described in section 3.1.1.
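These counts are straightforward to evaluate numerically. The sketch below computes the logarithm of the two capacity expressions; the sizes d, m, and k are placeholder values for illustration only.

```python
from math import comb, log10

def capacity_linear(d, s):
    # ways to draw s synaptic sites from d input lines with replacement
    return comb(d + s - 1, s)

def capacity_nonlinear(d, m, k):
    # distinct branch functions: k synapses drawn from d lines with replacement
    f = comb(d + k - 1, k)
    # ways to draw the m branches from the f possible branch functions
    return comb(f + m - 1, m)

d, m, k = 400, 20, 5      # hypothetical sizes; s = m * k total synapses
s = m * k
print(f"log10 capacity, L-neuron : {log10(capacity_linear(d, s)):.1f}")
print(f"log10 capacity, NL-neuron: {log10(capacity_nonlinear(d, m, k)):.1f}")
```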
2.5 Learning Algorithm
In this section, we describe a learning rule that attempts to achieve classification performance in agreement with the theoretical capacity of a neuron, that is, for a fixed number of synapses, a minimum classification error is obtained when the predicted capacity is high. This learning rule is motivated by the feasibility of the equivalent hardware implementation. The goal of the learning process is to train the L- and NL-neurons so that the classifier output is y = 1 and y = 0 for input samples from the positive and negative classes, respectively. A binary teacher signal, o (1 or 0), guides the learning process by indicating the desired output during the training phase. The learning process can be regarded as a combinatorial search problem of selectively forming synaptic connections on each dendritic subunit. This learning rule is a supervised or error-modulated form of Hebbian learning, which changes the connection strength of a synapse as a function of the presynaptic and postsynaptic activities. However, our learning algorithm involves the formation and elimination of synapses instead of weight modification, since we use binary weights.
2.5.1 Learning Rule for RM
The error term ensures that only wrongly classified patterns contribute to the synaptic fitness values cij. This implies that connection changes, being based on input correlations, have minimal effect on the patterns already learned, which agrees with the basic adaptive learning principle of minimal disturbance (Widrow & Lehr, 1990). The connection changes are made using the following logic: a synapse with the smallest cij value corresponds to a poorly performing synapse and is a candidate for replacement. Since we operate with fixed synaptic resources, this synapse needs to be replaced by another one. To do this, a set of silent synapses is first formed on the chosen branch as possible candidates for the new connection. The silent synapses are randomly selected from the set of input lines in addition to the existing synapses and do not contribute to the dendritic activity. The silent synapse with the highest cij is chosen as the replacement. Hence, we modify the grouping of connections on a dendrite guided by a Hebbian-like learning rule. This type of synaptic clustering has been observed experimentally by Kleindienst, Winnubst, Roth-Alpermann, Bonhoeffer, and Lohmann (2011), who showed activity-dependent clustering of synaptic inputs on developing hippocampal dendrites. Moreover, our calculation of cij can be implemented easily in hardware, since it needs only local information, while the global error signal is a binary value.
The input connections for a neuron with m branches and k synaptic contacts per branch were initialized by randomly selecting afferents from the d input lines, each with weight wij = 1. We trained our model on P input pattern vectors belonging to the (+) and (−) classes. The learning process comprises the following steps in every iteration:
1. The activation of each dendrite, zj, is determined, and then the sum of all branch responses is calculated as in equation 2.10 for both (+) and (−) neurons, giving the overall activations a+ and a− for all P patterns. The overall output of the classifier is computed according to equation 2.11.
2. The error is calculated by averaging the absolute error over the entire batch of P patterns, which we describe as the mean absolute error (MAE). Since o and y are binary valued in our case, the absolute error is 1 for every misclassified pattern; hence, the MAE is equal to the fraction of misclassified patterns.
3. A random set T consisting of nT synapses is selected as candidates for replacement.
4. The synapse with the lowest cij in T is replaced by the best-performing (maximum cij) synapse in a random replacement set R consisting of nR synapses. The set R is created by placing nR silent synapses from the d input lines on the branch containing the lowest-cij synapse. They do not contribute to the calculation in step 1.
5. Synaptic connections are modified if the replacement leads to either a reduction or no change in the MAE. If the MAE increases with the replacement, a new replacement set R is created. If the MAE does not decrease after repeating this step nch times, we assume a local minimum has been encountered. We then perform a replacement with the last choice of R, even if it increases the MAE, in an attempt to escape the local minimum. The connection matrix corresponding to the local minimum that gave the least error is saved.
Steps 1 to 5 are repeated until either all the patterns have been memorized or a maximum number of local minima have been encountered. At this point, the learning process is stopped. The connection pattern corresponding to the best minimum becomes the final learned synaptic connectivity of the neuron.
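The following simplified Python sketch captures the flavor of this training loop for a single neuron. It makes several simplifying assumptions: a fixed soma threshold stands in for the WTA stage between the (+) and (−) neurons, the fitness cij is computed in a simplified batch form, and all sizes and names are illustrative rather than taken from the text.

```python
import numpy as np
rng = np.random.default_rng(0)

# Hypothetical sizes: d input lines, m dendrites, k synapses per dendrite.
d, m, k, P = 100, 10, 5, 50
X = rng.integers(0, 2, size=(P, d))      # binary training vectors
o = rng.integers(0, 2, size=P)           # teacher signal (1/0)

conn = rng.integers(0, d, size=(m, k))   # conn[j, i] = input line of synapse i on branch j

def soma(conn):
    z = X[:, conn].sum(axis=2)           # branch input sums, shape (P, m)
    return (z ** 2).sum(axis=1)          # squared branch outputs summed at the soma

theta = np.median(soma(conn))            # fixed firing threshold (stand-in for the WTA)

def mae(conn):
    y = (soma(conn) > theta).astype(int)
    return np.mean(np.abs(o - y))        # fraction of misclassified patterns

for it in range(300):
    y = (soma(conn) > theta).astype(int)
    err = o - y                                        # nonzero only for mistakes
    z = X[:, conn].sum(axis=2)
    c = np.einsum('p,pj,pji->ji', err, z, X[:, conn])  # simplified fitness c_ij
    j, i = np.unravel_index(np.argmin(c), c.shape)     # poorest synapse -> replace
    base, best = mae(conn), conn[j, i]
    for cand in rng.integers(0, d, size=25):           # silent replacement set R
        conn[j, i] = cand
        if mae(conn) <= base:
            base, best = mae(conn), cand
    conn[j, i] = best                                  # commit the best replacement
print("final MAE:", mae(conn))
```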
The fraction of synapses replaced in each iteration plays a role similar to the learning rate parameter in traditional weight-modification algorithms. A minimum of one synapse, out of the nT randomly selected from all s synapses, is replaced in each iteration for smooth convergence to ideal topologies. The largest possible size (nR) of the replacement set is equal to d, the dimension of the input. Using nR = d should give better optimization results, but this increases the computational cost. Therefore, we chose nR as a fraction of d, which is a good trade-off between the computational cost and the quality of the solution to the optimization problem. Finally, the RM trained on binary input vectors is tested on spiking inputs using the equivalent spike-based model and the connection matrix obtained after training. Here, we wanted to test how well our reduced abstract learning rule can transfer the learned performance to classifying the original spike patterns or noisy spike inputs once the temporal dynamics of a spike-based model are introduced.
2.5.2 Learning Rule for RMWM
1. Training is first performed using the g() function, and then the corresponding spike inputs are presented to the model.
2. The values of (a+ − a−) are recorded for all cases in which the network misclassified the spiking inputs.
3. A margin of δ, given by the highest value of |a+ − a−| for a wrongly classified pattern, is introduced around the classification boundary to reduce the misclassified cases.
4. The δ value is reduced to a fixed fraction of its initial value if the learning algorithm gets stuck in the same local minimum five consecutive times, and it is reduced to the same fraction of its present value every time this condition is encountered.
5. After the completion of training, the testing is done in the same way as in the case of RM, on spike inputs using the spike-based model.
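A minimal sketch of this margin-adaptation schedule is shown below; the shrink factor is a hypothetical value, since the exact fraction is a tunable parameter.

```python
def adapt_margin(delta, stuck_count, shrink=0.9):
    """Shrink the margin when the learner revisits the same local minimum
    five consecutive times; `shrink` is a hypothetical fraction."""
    if stuck_count >= 5:
        return shrink * delta, 0   # reduced margin, reset the counter
    return delta, stuck_count

# Usage inside the training loop (sketch): `stuck` counts consecutive
# returns to the same local minimum, updated by the structural learning step.
delta, stuck = 1.0, 0
for epoch in range(100):
    # ... run one structural-plasticity epoch with the margin function g() ...
    delta, stuck = adapt_margin(delta, stuck)
```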
2.6 Spike-Time-Based Learning
Our abstract models (RM and RMWM) have used the cij values to guide structural plasticity; however, no specific spike-time-based mechanism to achieve this computation has been shown so far. In other words, all learning is done on an abstract model, while only the testing has involved spike trains. We present in this section a possible spike-time-based rule that enables online calculation of cij using an anti-Hebbian rule for excitatory synapses. Such reverse STDP (RSTDP) rules have been widely reported in biological measurements (Letzkus, Kampa, & Stuart, 2006; Froemke, Poo, & Dan, 2005; Sjostrom & Hausser, 2006). Our spike-based learning rule is inspired by recent studies in which RSTDP, in concert with hyperpolarization of the postsynaptic neuron or spike-frequency adaptation, was used to realize variants of the perceptron learning rule (Albers, Westkott, & Pawelzik, 2013; D’Souza, Liu, & Hahnloser, 2010) by modifying synaptic weights. The algorithm enables associative learning where a teacher signal is present during the learning process. In our case, we use a similar concept, but in contrast to Albers et al. (2013), we do not modify the weights but instead calculate the change in cij. Another major difference in our work compared to Albers et al. (2013) is that the learning rule for a synapse is modulated by its dendritic branch activation level and is hence termed branch specific. This concept of branch-specific plasticity has also been observed to a certain degree in biology (Makara, Losonczy, Wen, & Magee, 2009; Losonczy, Makara, & Magee, 2008) and has been used to model feature binding in a single neuron (Legenstein & Maass, 2011). Finally, there is the question of what physical variable in biology might represent cij. There is evidence that calcium concentration in spines is a correlation-sensitive, spike-time-dependent signal and has been implicated as a guide for structural plasticity (Helias, Rotter, Gewaltig, & Diesmann, 2008). We compute the correlations in a branch-specific manner, as discussed next in our branch-specific spike-time-dependent structural plasticity (BSTDSP) model.
2.6.1 BSTDSP: Model Details
2.6.2 BSTDSP: Learning Rule and Parameter Choice
To understand the learning process intuitively, let us consider the case of two-class synchronous single-spike inputs such that each afferent either fails to fire or fires a spike at a fixed time Tsyn. During learning, for patterns of class (+), a teacher signal forcing a postsynaptic spike (at a fixed time) is present for the (+) neuron and absent for the (−) neuron. For patterns of class (−), the opposite conditions for the teacher signal exist.
For an unbiased classifier, the margins for the two classes are equal. Using this condition, we can find the values of Vst and Vreset. The value of the desired margin is determined by using the spike-equivalent value of the margin δ (see section 2.3) after learning. This is done by computing the difference in the membrane voltage of the (+) and (−) neurons (analogous to a+ − a−) when a single synapse is activated. This value multiplied by δ gives the desired voltage margin, since in the binary input case, the synaptic strength is normalized to 1. Since δ is a function of m and k, we have also used different values of the voltage margin, which entails changing the values of the parameters Vst and Vreset for different neuron configurations.
3 Results
First, we discuss the procedure for training the RM and RMWM, followed by the method used to test the performance of these models:
Training on binary vectors. To demonstrate the learning process of the model, we used random patterns generated in a manner similar to Poirazi and Mel (2001). The input patterns were chosen from a d0-dimensional spherical gaussian distribution. These patterns were randomly assigned to the positive and negative classes. The d0-dimensional input patterns were mapped into a high, d-dimensional space to incorporate sparse neural coding by using nonoverlapping binary-valued receptive fields (RFs) along each dimension. Hence, we generated sparse binary vectors in which exactly one RF neuron per input dimension is active. The binary patterns thus generated were used to train either the RM (see section 2.2) or the RMWM (see section 2.3). The connection matrix learned after the training process is used for testing the model. This method has the advantage of shorter simulation times compared to learning directly on spike inputs. It can be used in practice as well (if the desired input patterns are known), where the learned connection matrix can be downloaded to a neuromorphic chip.
Testing on noisy spike inputs. We have tested the noise sensitivity of the RM and RMWM on noisy spiking versions of the binary input vectors used to train the models. Henceforth, testing the model will refer to testing its noise tolerance, except for the classification of benchmark data sets, where the generalization performance is evaluated. We introduce noise in the inputs by using Poisson spike trains or jittered single spikes. We used Poisson spike trains because we wanted to model a realistic case where our classifier receives inputs from a noisy spiking sensor or from biological neurons in the case of brain-machine interfaces. The two types of spiking inputs used for testing are generated as follows: for rate-encoded spike patterns, binary values of 1 and 0 are mapped to Poisson spike trains with mean firing rates of fhigh and flow, respectively, similar to Mitra et al. (2009); single spike patterns are generated by mapping a binary 1 to a single spike arriving randomly within the window ΔT around Tsyn, and a binary 0 to no spike, where Tsyn is a fixed time within the stimulus duration T. The testing on spike patterns uses the spike-based model (see section 2.1) with the learned connection matrix.
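The two noisy test-input types can be generated as in the following sketch; the rate and window values shown are placeholders.

```python
import numpy as np
rng = np.random.default_rng(1)

def poisson_train(rate_hz, T):
    """Homogeneous Poisson spike train on [0, T)."""
    n = rng.poisson(rate_hz * T)
    return np.sort(rng.uniform(0.0, T, size=n))

def rate_pattern(x, f_high=50.0, f_low=1.0, T=0.5):
    """Map binary vector x to Poisson trains: 1 -> f_high, 0 -> f_low."""
    return [poisson_train(f_high if xi else f_low, T) for xi in x]

def single_spike_pattern(x, T_syn=0.1, dT=0.01):
    """Map x to jittered single spikes in [T_syn - dT/2, T_syn + dT/2]."""
    return [np.array([T_syn + rng.uniform(-dT / 2, dT / 2)]) if xi
            else np.array([]) for xi in x]

spikes = rate_pattern(np.array([1, 0, 1]))   # noisy rate-encoded test pattern
```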
The results obtained are organized as follows:
1. We discuss the classification performance of the RM. These results were presented in Hussain et al. (2013), where we showed that the performance of the RM on noisy spike trains is much worse than its training performance on binary vectors. Hence, the RM suffers from poor noise tolerance.
2. We show that the margin-based RMWM leads to improvements in both training performance and noise sensitivity.
3. The classification results from the BSTDSP rule are presented, where we also compare the performance of the abstract RMWM with that of our biorealistic BSTDSP model.
4. The performance of the RMWM is compared against SVM and ELM methods on three databases.
Simulation results are presented for several sizes of the input pattern set, numbers of dendrites, and numbers of synapses per dendrite, unless stated otherwise. The learning parameters are held fixed across these experiments. The parameters for testing on spike patterns using the spike-based model and the parameters for the BSTDSP model are summarized in Tables 2 and 3, respectively. These parameter values were chosen in accordance with the validity of the RM, as discussed in section 5.2. All results are obtained by averaging across five simulation trials.
Tables 2 and 3 list the parameters of the spike-based model and the BSTDSP model: fhigh (Hz), flow = 1 Hz, rise and fall time constants of 2 ms and 8 ms, I0, Vthr (mV), R (MΩ), C = 5 nF, and the pattern duration T (ms).
3.1 Performance of RM
First, we present the results of training the RM with binary vectors and then test its robustness to noisy spike inputs.
3.1.1 Training Performance
Comparison of L- and NL-neurons. In the first experiment, we compare the performance of L- and NL-neurons in a binary classification task. Both classifiers were trained on increasingly large sets of input patterns, and their performances were compared. As shown in Figure 8, the MAEs decrease for both neurons as learning progresses. However, the NL-neuron achieves a lower error than its linear counterpart, and this difference in error grows as the classifiers are trained on a larger number of patterns.
In order to understand the effect of learning on the connection matrix, we investigated whether the input connections learned by a neuron are able to capture the correlations present in the input patterns. If $\mathbf{x}_m^+$ and $\mathbf{x}_m^-$ denote the mth binary vector belonging to class (+) and (−), respectively, then $R = \frac{1}{P^+}\sum_m \mathbf{x}_m^+ (\mathbf{x}_m^+)^T - \frac{1}{P^-}\sum_m \mathbf{x}_m^- (\mathbf{x}_m^-)^T$ computes the second-order correlations present in the (+) patterns. R is a d × d matrix for d-dimensional patterns with d2 entries, of which the d(d + 1)/2 values in the upper triangle and the diagonal are unique. P+ and P− denote the total number of patterns in class (+) and (−), respectively. The entries in R can be ordered according to decreasing magnitude, which provides a unique ranking O = {(ri, ci)} of the entries in the upper triangle and the diagonal, where ri and ci denote row and column indices, respectively. Lower ranks correspond to high-positive-value entries in R, which denote combinations that are specific to pattern class (+) and absent in pattern class (−). Similarly, high ranks correspond to high-negative-value entries in R for those combinations that uniquely define pattern class (−). Analogously, $RW_j = \mathbf{w}_j \mathbf{w}_j^T$ gives the correlation matrix for the weights on a dendrite of the (+) neuron, where $\mathbf{w}_j$ is the d-dimensional weight vector of the jth dendrite, with entries that are nonnegative integers.
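The correlation matrix R and the ranking O can be computed as in the sketch below, assuming the subtractive form of R described above; the random patterns stand in for the actual trained inputs.

```python
import numpy as np

def class_correlation(Xp, Xn):
    """R = mean outer product of (+) vectors minus that of (-) vectors.
    Positive entries mark pairwise coactivations unique to class (+),
    negative entries those unique to class (-)."""
    return Xp.T @ Xp / len(Xp) - Xn.T @ Xn / len(Xn)

def rank_pairs(R):
    """Order the unique entries (upper triangle plus diagonal) by decreasing
    value; lower ranks correspond to correlations specific to class (+)."""
    ri, ci = np.triu_indices(R.shape[0])
    order = np.argsort(-R[ri, ci])
    return list(zip(ri[order], ci[order]))

rng = np.random.default_rng(2)
Xp = rng.integers(0, 2, size=(40, 20))   # placeholder (+) patterns, d = 20
Xn = rng.integers(0, 2, size=(40, 20))   # placeholder (-) patterns
print(rank_pairs(class_correlation(Xp, Xn))[:5])  # five lowest-ranked pairs
```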
Figure 9 shows how the input correlations learned by the (+) and (−) neurons match those present in the input patterns. In this example, neurons with m = 10 dendrites and k synapses per dendrite were trained to classify binary patterns with d = 20. RWj is obtained by ordering the entries in the weight correlation matrix according to O and is plotted for each dendrite of the (+) and (−) neurons before and after training. Each pixel shows the correlation value between a particular pair of input connections formed on a dendrite. For d = 20, the total number of unique combinations is equal to d(d + 1)/2 = 210, and hence the x-axis ranges from 1 to 210 and the y-axis ranges from 1 to 10 for the dendrites in each neuron. We have also shown the histograms obtained by plotting the sum of correlations for each connection pair across all the dendrites below each correlation matrix. Figures 9a and 9c show that the initial connections formed on the (+) and (−) neurons pick up the input correlations randomly, as evident from the correlation values evenly distributed along the x-axis. As seen in Figure 9b, after training, the (+) neuron has a much higher clustering of connections corresponding to the lower-ranked combinations unique to (+) patterns, while it learns very few of the correlations with high negative values (higher ranked) unique to the (−) patterns. Similarly, the clustered appearance toward the right end of the x-axis in Figure 9d indicates that the (−) neuron learns to capture more of the correlations present in the (−) patterns and fewer of those in the (+) patterns. Hence, the proposed learning algorithm enables the connection matrix to pick out unique correlations in the data. For the binary classifier consisting of a pair of neurons, we have shown that each neuron prefers patterns belonging to a particular class by mimicking the input correlations present in that class.
Effect of number of dendrites. Next, we tested the performance of the model as the number of dendrites, m, is varied. The total number of synaptic resources, s = mk, was kept constant. Figure 10 shows the MAE versus m; circles represent the errors for the L-neuron, and squares denote the NL-neuron errors. The performance of both types of neurons decreases with a larger number of input patterns. However, the L-neuron performs much worse than the NL-neuron, which achieves about 9% error for the largest pattern set. Further, the NL-neuron errors decrease as the number of branches increases, which can be explained by the fact that as m increases, a larger number of dendritic functions can be selected from the f possible functions obtained by drawing k synapses from d input lines (see section 2.4). These nonlinear branch functions can then better approximate the nonlinear classification boundary of the problem, as explained in the example in section 2.2. However, because s is fixed, k = s/m shrinks as m grows, and beyond a certain value of m, the number of distinct sets of branch functions that can be drawn from the f possible functions reduces. As seen in Figure 10, an intermediate value of m is optimal. The theoretical capacity of the neuron is also plotted on the right-hand y-axis (dashed), calculated by taking the logarithm of the combinatorial expressions given in section 2.4. As the capacity increases, more patterns can be learned and correctly classified; hence, classification errors should decrease, in accordance with the empirical results. Similar to the measured classification errors, the theoretical capacity also peaks at an intermediate m and reduces as m increases further. Therefore, as the number of dendrites grows, the classification performance of an NL-neuron with fixed synaptic resources (s) first increases to a peak value and then falls with further addition of dendrites.
Effect of number of synapses per dendrite. We also determined the performance of the NL-neuron as a function of the number of synapses per branch (k), since it is a free parameter of an AER-based hardware system that can be easily adjusted. The reciprocal of the theoretical capacity of a neuron as a function of k is plotted in Figure 11. It shows that for a neuron with a fixed number of dendrites, the inverse capacity decreases, and therefore the capacity increases, as the number of synapses per dendrite is increased; as more dendrites are added, the neuron can learn even more patterns. Figure 12 shows the classification errors measured empirically. The errors decrease as k increases for a fixed value of m, which is due to the presence of a higher number of total synapses, s = mk. Larger pattern sets are more difficult to train and hence lead to higher errors. As a larger number of dendrites is used, the errors decrease further: for a fixed number of synapses per branch, the classification error on the largest pattern set reduces severalfold as m increases. Therefore, the empirical results agree with the theoretical capacity predictions, that is, capacity increases with a higher number of synapses, and more patterns can be correctly classified.
3.1.2 Noise Sensitivity Performance
The spike-based model with the same connection matrix was used to classify spike patterns of mean firing rates. The classification performance of the RM on spike patterns was measured as a function of k. Figure 13 shows that the classification errors for spiking inputs are much higher than those for the mean rate inputs on which the model was trained. It can be seen that the testing errors on noisy spike train inputs are two to five times larger than the training errors for binary input vectors (see Figure 12c). Moreover, in contrast to the training errors, which decrease with k, the testing errors increase as k is increased.
The increased errors can be attributed to the fact that as the number of branches (m) increases, the linear summation of dendritic branch outputs leads to a very high average value of the input currents Iin+ and Iin− to the (+) and (−) neurons, respectively. This causes the small differences between Iin+ and Iin− to become indistinguishable for the (+) and (−) neurons, due to the finite precision of the spike-based model used for testing as compared with the infinite precision of the abstract RM used for training. We also see from Figure 13 that the errors increase as the number of synapses per branch increases. This behavior of the model can be explained by the fact that the use of a higher dendritic threshold Ithr to prevent branch saturation at larger k reduces the Iin+ and Iin− currents evoked by a spike pattern and therefore leads to decreased discriminability of the input currents to the spike-based (+) and (−) neurons. Next, we investigate how these issues are rectified using the RMWM.
3.2 Performance of RMWM
Here, we present the results of training the RMWM with binary vectors and then test its noise sensitivity.
3.2.1 Training Performance with the g() Function
We present the results of applying the method discussed in section 2.5.2 to determine the value of δ for the g() function. The model was trained on binary input patterns and then tested on the spike versions of these patterns. The (a+ − a−) values for all the misclassified inputs were recorded. This experiment was repeated for NL-neurons with different numbers of branches and different numbers of synapses per branch. The probability density function (pdf) of these values as a function of the number of synapses on each dendrite is shown in Figure 14 for a fixed m and three values of k. We can see that the peak of the pdf varies with both k and m (the effect of m is not shown in the figure). Therefore, for a neuron with around 5 to 50 synapses per branch, δ can be set to a high initial value and then gradually reduced during training. Similar values can be used for training NL-neurons with the same values of m and k on a classification data set consisting of binary input patterns.
Next, we used the g() function (see Figure 6b) to test whether the performance of the classifier improves. The model was trained on binary patterns for different values of k. The g() function was used during the training process with the adaptive margin δ. After the training was completed, the memorized patterns were recalled using the original dendritic nonlinearity. As shown in Figure 15, the classification errors reduce for all neuron configurations. Compared to the training errors of the RM (see Figure 12c), the errors are smaller by factors of 2 to 5, and for large k values, the errors reduce drastically, by factors of more than 10 to 20. Hence, the improved classification performance is observed because the model is trained to generate a decision boundary with a margin δ. We will also present the results of improved noise tolerance achieved using the g() function.
3.2.2 Improving Noise Sensitivity
Here, we utilize the RMWM to improve the noise sensitivity by rectifying the problems discussed in section 3.1.2 in a step-wise manner.
First, in order to counteract the high average values of the input currents for large values of m, we adopted the following strategy: the current differences (Iin+ − Iin−) and (Iin− − Iin+) are used as inputs to the two neurons (see Figure 16). As shown in Figure 17a, the resulting reduction in classification error is most significant for large m, as compared with the testing results obtained without the differential input (see Figure 13). The errors reduce for smaller k, while for larger k, the performance still drops with increasing k. The dashed plots correspond to the training performance of the RMWM as shown in Figure 15. As described next, the RMWM aims to achieve a noise tolerance for spiking inputs that is close to this optimal training performance.
Second, we used the g() function to train the model on binary input vectors and then tested the performance on spiking inputs. This approach was used to improve the tolerance to noise in the spike train versions. The errors computed using an NL-neuron are shown in Figure 17b. As seen, the classification errors reduce for all combinations of m and k, by factors of 1.5 to 6.5 relative to the errors obtained using the RM, and the performance levels match the training performance (dashed plots) more closely than in Figure 17a. These results suggest that training the model with the g() function improves the robustness to noise by introducing a margin around the classification boundary. As also seen in Figure 17b, the errors still tend to rise with increasing k, as observed in the earlier results. Next, we discuss the approach used to solve this problem.
Third, we tested whether the use of the saturating dendritic nonlinearity defined in equation 2.13 improves the performance for spike patterns at larger k values. As shown in Figure 17c, the errors at larger k are smaller by factors of 1.3 to 2.5 compared with the errors obtained with the unsaturated nonlinearity (see Figure 17b). However, there is no significant change in the performance for smaller k values. This method further improves the test performance on noisy spiking inputs, and the overall reduction in the classification errors for the RMWM is by a factor of about 1.5 to 10 relative to the test errors for the RM.
3.2.3 Performance on Single Spike Patterns
We also tested the performance of the model on patterns of single spikes. The input connections were learned using the RMWM on binary input vectors. The testing was done on single spike patterns arriving within time windows ΔT of increasing width, expressed relative to τf, the fall time constant of the postsynaptic waveform (see equation 2.6). The results, shown in Figure 18, suggest that the model generalizes well on single spike patterns, with testing errors about 1.2 to 4 times the training errors (dashed) when ΔT is small relative to τf. We found that this condition is satisfied for different values of m, k, and τf; the results for the remaining parameter settings are not shown in Figure 18. As ΔT increases, the testing errors increase because the spikes arriving at the synapses on the same dendrite are more spread out in time, leading to a reduced dendritic response as compared with the response learned with binary vectors.
3.3 Performance of BSTDSP Model
We have evaluated the performance of our BSTDSP rule discussed in section 2.6 on two-class classification of spike patterns. The architecture used here is similar to the two-neuron model shown in Figure 16, where the input current into each neuron is the difference of currents with opposite signs. For training, we used random binary patterns of single spikes. In this case, we chose a low value of Vthr such that, for the initial random connections, a neuron evokes a spike for a substantial fraction of the input patterns. The reason for choosing a low Vthr value can be understood from Table 1: for the RM, any misclassification leads to correlation values (see the last two rows of Table 1) of opposite signs, which contribute to the connection change in both the (+) and (−) neurons after an epoch. This strategy of modifying both neurons limits the changes required in each neuron, as compared with the case when only one neuron contributes at a time. Figure 19 shows a case where a (+) pattern is not classified correctly; however, only the (+) neuron has a nonzero correlation value, while that of the (−) neuron is zero because it does not fire a spike due to the high Vthr. The use of a lower Vthr results in both correlation values being nonzero and of opposite signs.
The classification performance of the spike-based algorithm is shown in Figure 20. The training errors are plotted as a function of k. The model was also tested on noisy versions of the patterns of single spikes, generated by adding uniform random noise within the time window ΔT around the spike time Tsyn. Figure 20a shows the performance when the margin was not introduced (circles) and when it was (squares), for different k values. We can see that both training and noise-testing performance improve significantly as a result of margin-based learning, similar to the results obtained for RMWM learning (see section 3.2). The errors for noisy spike inputs with the margin classifier reduce by up to a factor of 5 as compared with learning without a margin.
We also compared margin-based learning using spike inputs with RMWM learning on mean rate inputs, as shown in Figure 20b. The results show that the spike-timing-based learning yields training errors (solid, squares) similar to the training errors for the RMWM (solid, circles). The test performance on noisy patterns of single spikes for the two forms of learning is shown with dashed plots. The test errors for spike-based learning are somewhat higher than the errors for RMWM learning, and this factor increases with k. These differences in the performance of the BSTDSP from the RMWM can be attributed to differences in how the correlation values are calculated in the two schemes, which are not exactly the same. As discussed in section 2.6.2, the value calculated in equation 2.29 is based on an assumption that does not hold exactly in our case. This also causes the estimate of Trise used in that calculation to differ from the actual Trise for each pattern. Hence, the correlation values are slightly different from the ideal values shown in Table 1. Another discrepancy between the learning for the BSTDSP and the RMWM arises from the fact that the margin setting is not exact. As discussed in section 2.6.2, the margin for spike-based learning was set using an estimated scaling constant; a wrong estimate of this constant means that the voltage margin and δ are not truly analogous to each other, leading to different results. However, we can see that even with these differences in the learning for the two models, our proposed biorealistic spike-based learning can achieve errors comparable to those of the more precise learning in the abstract model presented earlier.
3.4 Performance on Benchmark Data Sets
Finally, we tested the performance of the proposed algorithm on real-world benchmark classification tasks and compared the results with the performance of the SVM and ELM methods. The binary data sets comprise breast cancer (BC), heart disease (HEART), and ionosphere (ION) data, taken from the UCI Machine Learning Repository (Bache & Lichman, 2013) (see Table 4). The performance of our improved RMWM with the structural learning rule was evaluated for both binary vector patterns and spiking inputs, where each sample of a feature was mapped into 10 nonoverlapping density-matched RFs, which were generated from the probability distribution of all samples of that particular feature. The resulting binary samples were then converted into spike patterns by using the method discussed in section 3.
Table 4: The benchmark data sets, listing the number of features, training samples, and testing samples for each.
Performance measures such as the number of neurons, the number of weights, and the classification accuracy of the different algorithms are given in Table 5 to compare classification performance as a function of the computational complexity of the network used. The results for the SVM and ELM are taken from Babu and Suresh (2013). The number of nonlinear dendrites of a neuron is analogous to the number of support vectors of the SVM and to the number of hidden neurons of the ELM. The performance of the different classifiers is also compared on the basis of the synaptic resources used. The total number of weights in an N-input network (SVM and ELM) using M neurons scales as M × N, while the total number of weights in the dendritic model, with m dendrites and k synapses per dendrite for each of the two neurons in the classifier, is 2 × m × k binary connections. The classification accuracy obtained in this work for spike inputs is slightly lower (by 0.1–0.8%) than that for binary input vectors. The reduced accuracy is probably due to the noise in the Poisson spike trains and the effects of the integrate-and-fire process of the neuron. As the results suggest, the proposed model uses a number of dendritic processing units similar to the number of neurons used by the SVM and ELM and achieves generalization performance within 1% to 2% of these algorithms. Moreover, our method achieves this performance while utilizing 10% to 50% fewer binary synapses, particularly for the HEART and ION data sets, therefore rendering the model feasible for hardware implementation.
Table 5: Comparison of the SVM, ELM, and dendrite model on the benchmark data sets, listing for each method the number of neurons (or dendrites), the number of weights, and the classification accuracy (for the dendrite model, accuracy on both binary and spike inputs).
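To make the synaptic-resource comparison concrete, the sketch below counts weights under the scaling assumptions stated above; the per-network formulas are approximations (bias terms ignored), and the configuration numbers are illustrative rather than the Table 5 entries.

```python
def weights_svm_elm(M, N):
    # one weight per input per neuron (bias terms ignored)
    return M * N

def weights_dendrite_model(m, k, neurons=2):
    # binary connections: k synapses on each of m dendrites, per neuron
    return neurons * m * k

# Hypothetical configuration for illustration only
print(weights_svm_elm(M=40, N=34))          # SVM/ELM-style weight count
print(weights_dendrite_model(m=20, k=10))   # dendrite-model weight count
```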
4 Discussion
Here, we discuss the biological and machine learning perspectives of models that compute a higher-order function of the inputs, particularly polynomial functions. We also present neurobiological and computational evidence to suggest that these higher-order units can be mapped to active dendrites consisting of clustered coactive synapses and that the formation of such synaptic clusters is supported by the mechanism of structural plasticity. Finally, we compare the features of our model with those of other machine learning classifier models based on concepts similar to ours.
4.1 Sum-of-Products: Biological Perspective
Our proposed model employs a static quadratic dendritic nonlinearity as an approximation to the nonlinear dendritic processing due to voltage-gated ion channels. This simplification leads to an abstract neuron model in which the neuron output can be expressed as a sum of squaring subunit responses, which can be further written as the sum of products of synaptic inputs on a dendritic subunit. The idea that the input-output behavior of a complex dendritic tree can be reduced to a simple algebraic formula consisting of sum-of-squares is supported by an earlier compartmental model for disparity tuning in complex cells of primary visual cortex (Archie & Mel, 2000). This model, consisting of simple-cell-like subunits mapped onto active dendrites, behaves like a hypothetical cell with squaring subunits. In another study, the firing rate of a complex biophysical pyramidal cell model was consistent with the mean rate predicted by an abstract two-layer neural network with sigmoidal subunits representing the thin terminal dendrites (Poirazi, Brannon, & Mel, 2003). Hence, these and other studies (Mel, 1991, 1993) suggest that from a biophysical perspective, a pyramidal cell with varying density, spatial distribution, and kinetics of voltage-gated ion channels can be well approximated by an abstract model of sum of nonlinear subunits.
To investigate the particular form of nonlinear function that might represent the input-output behavior of a dendrite, we look at the experimental study supporting the two-layer model of the sum of subunits (Polsky et al., 2004). In this study, confocal imaging was combined with focal synaptic stimulation to examine the rules of subthreshold synaptic integration in thin dendrites of pyramidal neurons. Figure 21 shows the measured and expected peak responses when two nearby sites on the same branch were stimulated simultaneously, where the expected response is the arithmetic sum of the responses to individual stimulations. The solid plot marked with circles corresponds to the measured responses; this curve is reproduced from the data presented in Figure 4 of Polsky et al. (2004) for single-pulse stimulation. The experimental evidence therefore suggests that the dendritic response to synaptic inputs on a branch is governed by a saturating expansive nonlinearity. To determine the form of the subunit nonlinearity, we fit polynomials to the experimental data, because polynomial functions are easier to implement in a hardware system. We used the polynomials z2, z3, and z5, each scaled by a constant and with saturating effects, to fit the data, where the constant was chosen to obtain the best fit. As shown in Figure 21, the quadratic function provides the best fit compared with the degree-3 and degree-5 polynomials tested here. Hence, for the few data points obtained from biological measurements, we can consider the squaring function a valid choice for the dendritic nonlinearity.
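The polynomial fits can be reproduced with a least-squares estimate of the scaling constant for each exponent, as sketched below on placeholder data standing in for the measurements of Polsky et al. (2004).

```python
import numpy as np

def fit_power(expected, measured, n):
    """Least-squares fit of measured ~ expected**n / c, returning c."""
    X = expected ** n
    return (X @ X) / (X @ measured)

# Placeholder data standing in for the Polsky et al. (2004) measurements
expected = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])   # arithmetic sum (mV)
measured = np.array([0.8, 1.9, 3.4, 5.6, 7.9, 10.5])  # measured peak (mV)

for n in (2, 3, 5):
    c = fit_power(expected, measured, n)
    pred = expected ** n / c
    sse = np.sum((measured - pred) ** 2)
    print(f"z^{n} / {c:.1f}: SSE = {sse:.2f}")
```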
Further, we addressed this question from the perspective of performance in machine learning tasks. We tested the different dendritic polynomial nonlinearities used above in the random pattern binary classification task for our proposed model. As above, we compared the polynomials of degree 2, 3, and 5 under the conditions where saturation is either enforced or not. The exact polynomial function used is the one given in equation 2.13, with the exponent set to 2, 3, or 5. Moreover, the value of the margin used in the g() function is determined for each form of nonlinearity according to the method discussed in section 2.5.2. The results shown in Figure 22a suggest that when there is no saturation on the nonlinearity, all three polynomials result in similar performance, that is, improvement comes with more synapses on each dendrite. This indicates that the particular form of nonlinearity does not have any influence over the classification performance. This result can be explained by the combinatorial expressions used to predict the storage capacity of the nonlinear neurons (see section 2.4): the purpose of the branch nonlinearity is to partition the multiple dendrites into independent computational units, and therefore the type of nonlinearity used should not have an effect. Figure 22b shows that when a fixed saturation level is applied, lower-degree polynomials (z2, z3) yield better classification performance than a higher-degree polynomial (z5), because more branches saturate faster in the case of a higher-power branch function. Finally, our choice of the quadratic nonlinearity is supported by its easier implementation in hardware, its biological plausibility, and its better performance in a classification task.
4.2 Sum-of-Products: Machine Learning Perspective
It was shown in section 2.2 that our proposed classifier uses a discriminant function of higher-order terms to perform nonlinear mapping for classification. Several higher-order networks have been proposed to solve nonlinear classification problems. These models use higher-order combinations of some of the input components to solve the problem. One method to construct higher-order networks involves incremental techniques in which the number of free parameters is gradually increased (Ghosh & Shin, 1992). A polynomial neural network uses this technique, which is based on an evolutionary strategy of fitting a high-degree polynomial to the input components using a multilayered network (Misra, Satapathy, Biswal, Dash, & Panda, 2006). This approach employs polynomial theory (Ivakhnenko, 1971) to determine the polynomial description of optimum complexity necessary to achieve high prediction accuracies. Another method to generate higher-order functions of the input components involves a functional link network, which estimates an unknown function as $f(\mathbf{x}) = \sum_i w_i \phi_i(\mathbf{x})$, where $\mathbf{x}$ is the input vector and the $\phi_i(\mathbf{x})$ serve as basis functions, such as products of inputs or sinusoidal functions. This method suffers from the need for a priori knowledge in order to choose optimal basis functions (Ghosh & Shin, 1992).
4.3 Correlation-Based Synaptic Clustering: Biological Relevance and Computational Modeling
Our hardware-friendly learning rule involves aggregation of correlated synapses on a dendrite, which is consistent with several experimental findings. These studies have demonstrated that clustering of synapses on dendritic branches might be a substrate for learning. This phenomenon suggests that single dendritic segments serve as functional subunits, thereby expanding the neuronal properties significantly, which is in agreement with the predictions of earlier computational studies. The mechanism of nonlinear dendritic integration leads to regenerative events, including dendritic spikes, which have a stronger influence on neuronal firing than the activation of synapses located on different dendrites (Larkum & Nevian, 2008). This evidence suggests clustering of coactive inputs within a dendritic subregion, which can lead to compartmentalization of various forms of signaling in dendrites, such as electrical, biochemical, and cellular signaling (Branco & Hausser, 2010). Based on these results, it has been proposed that at the single-cell level, dendritic branches act as basic functional units for long-term memory storage (Govindarajan, Israely, Huang, & Tonegawa, 2011). One such study reported that long-term potentiation (LTP) at a single spine can modulate the threshold for plasticity at neighboring spines (Harvey & Svoboda, 2007). This crosstalk between plasticity at nearby synapses results in dendritic segments having similar properties, which could allow memory traces to preferentially form at synapses clustered within these dendritic subunits (Govindarajan et al., 2011). Gordon, Polsky, and Schiller (2006) showed that synapses innervating the proximal and distal regions of basal dendrites are governed by different rules for the induction of LTP. This domain-specific plasticity mechanism may support different learning functions.
The first demonstration that synaptic clustering might be a substrate for learning was given in the barn owl auditory system (McBride, Rodriguez-Contreras, Trinh, Bailey, & DeBello, 2008). The authors reported that behavioral experience drives changes in the clustering of axo-dendritic contacts. A related study has shown that new spines emerge in clusters along dendrites as a result of the repetitive activation of the cortical circuitry during motor learning (Fu, Yu, Lu, & Zuo, 2012). Synaptic clustering was also reported from a calcium imaging study demonstrating that locally convergent inputs in spontaneously active networks lead to frequent synchronizati