## Abstract

We present a neuromorphic current mode implementation of a spiking neural classifier with a lumped square-law dendritic nonlinearity. It has been shown previously in software simulations that such a system with binary synapses can be trained with structural plasticity algorithms to achieve classification accuracy comparable to that of conventional algorithms while using fewer synaptic resources. We show that even in real analog systems with manufacturing imperfections (CVs of 23.5% and 14.4% for dendritic branch gains and leaks, respectively), this network produces comparable results with fewer synaptic resources. The chip, fabricated in a standard complementary metal oxide semiconductor process, has eight dendrites per cell and uses two opposing cells per class to cancel common-mode inputs. The chip can operate down to a 1.8 V supply and dissipates 19 nW of static power per neuronal cell and 125 pJ/spike. For two-class classification of high-dimensional, rate-encoded binary patterns, the hardware achieves performance comparable to a software implementation of the same network, with only about a 0.5% reduction in accuracy. On two UCI data sets, the integrated circuit has classification accuracy comparable to standard machine learners like support vector machines and extreme learning machines while using two to five times fewer binary synapses. We also show that the system can operate on mean-rate-encoded spike patterns as well as on short bursts of spikes. To the best of our knowledge, this is the first attempt in hardware to perform classification exploiting dendritic properties and binary synapses.

## 1 Introduction

Spiking neural networks (SNNs), considered the third generation of neural networks, were proposed in light of neurobiological evidence (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1999) suggesting that biological neural systems use the timing of action potentials, or spikes, to convey information. These networks are considered more biorealistic and computationally more powerful than their predecessors (Maass & Markram, 2004; Maass, 1999; Gutig & Sompolinsky, 2006). Neuromorphic engineering (Mead, 1990) aims to emulate the analog processing of neuronal structures in circuits to achieve low-power, low-area, very large-scale integrated circuit implementations. Thus, while theoretical studies on SNNs have progressed rapidly, neuromorphic engineers have, in parallel, implemented low-power VLSI circuits that emulate sensory systems (Chan, Liu, & van Schaik, 2007; Culurciello, Etienne-Cummings, & Boahen, 2003; Lichtsteiner, Posch, & Delbruck, 2008; Hsieh & Tang, 2012) and higher cognitive functions like learning and memory (Arthur & Boahen, 2007; Wang, Hamilton, Tapson, & van Schaik, 2014). Since silicon systems face many challenges similar to neuronal ones, we hope to gain insight into some operating principles of the brain by building such neuromorphic systems. Moreover, with the advent of brain-machine interfaces and the internet of things, there is now a pressing need for area- and energy-efficient neural networks for pattern classification.

Recently, Roy, Basu, and Hussain (2013) and Hussain, Gopalakrishnan, Basu, and Liu (2013) proposed structures inspired by the nonlinear properties of dendrites in neurons that require many fewer synaptic resources than other neuromorphic designs. Learning in these structures involves network rewiring of binary synapses, comparable to the structural plasticity observed in biological neural systems. As an example, in vivo imaging studies have shown that synaptic rewiring mediated by rerouting of whole axonal branches to different postsynaptic targets takes place in the mature cortex (Stettler, Yamahachi, Li, Denk, & Gilbert, 2006). Several experimental studies have provided evidence for the formation and elimination of synapses in the adult brain (Trachtenberg et al., 2002), including activity-dependent pruning of weaker synapses during early development (Le Bé & Markram, 2006). Inspired by these phenomena, our learning algorithm tries to find the best sparse combination of inputs on each dendrite to improve performance. This choice of connectivity can easily be incorporated in hardware systems using address event representation (AER) protocols, commonly used in current neuromorphic systems, where the connection matrix is stored in memory. Since this memory has to be stored for any AER-based system, no extra overhead is needed to implement our method other than the dendritic nonlinearity. Instead, the reduced number of synaptic connections translates to a reduction in memory access and communication overhead, which is often the most power-consuming aspect of large-scale spiking neural networks (Hasler & Marr, 2013).

In this work, we present a neuromorphic current mode implementation of the above neural classifier with shared synapses driven by AER. Some of the initial characterization results were presented in Banerjee, Kar, Roy, Bhaduri, and Basu (2015). In this letter, we present complete characterization results and detailed results on pattern classification using the chip. The organization of the letter is as follows. We first present some architectural modifications of the basic dendritic cell for improved hardware performance. Next, we present circuit descriptions and simulations of each building block. Measurement results from a chip fabricated in m complementary metal oxide semiconductor (CMOS) are presented in the following section to prove functional correctness. Finally, we conclude with discussions in the last section.

## 2 Background and Theory

Roy and colleagues (Roy et al., 2013; Roy, Banerjee, & Basu, 2014) and Hussain and colleagues (Hussain, Gopalakrishnan, Basu, & Liu, 2013; Hussain, Basu, & Liu, 2014) have described spike train classifiers employing neurons with nonlinear dendrites (NNLD) and binary synapses. Due to the presence of binary synapses, learning in these architectures happens by morphological changes in the connections between inputs and dendrites, not by weight updates. Thus, these architectures are amenable to neuromorphic implementation employing AER protocols. In this letter, we present a circuit to implement the architecture proposed in Hussain et al. (2014), which has comparable performance to other spike-based classifiers such as that of O'Connor, Neil, Liu, Delbruck, and Pfeiffer (2013) but uses 2 to 12 times fewer synaptic resources. Hence, our hardware implementation requires correspondingly less memory to store the connectivity information. It also needs proportionately less energy to communicate the connection information for each spike. Note that in this work, the training is done on a PC and the learned connection matrix is downloaded to the hardware platform for testing. Next, for completeness, we briefly describe the architecture of a basic NNLD, a classifier composed of two such NNLDs, the learning rule to train the classifier, and some modifications to improve hardware performance.

### 2.1 Architecture

If an input vector $\mathbf{x}(t)$ is applied to this system, then each synapse is excited by the input connected to it, and the output response of the $j$th dendritic branch is given by $z_j = b\left(\sum_i w_{ij} x_{ij}\right)$. Here $b$ is a model of the dendritic nonlinearity given by $b(u) = u^2/x_{thr}$, $w_{ij}$ is the synaptic weight of the $i$th synapse on the $j$th branch, $x_{ij}$ is the corresponding input, and $x_{thr}$ is a scaling constant. We choose a square-law nonlinearity since it has been shown in Hussain et al. (2015) to match the measured dendritic nonlinearity reported in Polsky, Mel, and Schiller (2004). Other popular nonlinearities like ReLU are not applicable here, since the input to the nonlinearity is always positive because the synapses are only excitatory. Also, squaring circuits can be made using only five transistors, as we will show. Let $I_{in} = \sum_j z_j$ denote the total summed output current from all the dendrites that enters the neuron. Then the overall output of a single neuronal cell is given by $y = f(I_{in})$, where $f$ denotes the linear neuronal current-to-frequency conversion function $f(I) = g\,I\,H(I)$, with $g$ a conversion gain and $H$ the Heaviside function defined as $H(u) = 1$ for $u > 0$, $H(u) = 1/2$ for $u = 0$, and $H(u) = 0$ for $u < 0$. This signifies that the neuron produces zero output for negative inputs.
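The model above can be summarized in a short sketch. The following Python fragment is a minimal software rendering of one neuronal cell, not the circuit itself; the function names and the default values of `x_thr` and the conversion gain are illustrative assumptions.

```python
import numpy as np

def dendrite_nonlinearity(u, x_thr=1.0):
    """Lumped square-law dendritic nonlinearity b(u) = u^2 / x_thr.
    Inputs are non-negative because the synapses are purely excitatory."""
    return u ** 2 / x_thr

def nnld_output(x, W, x_thr=1.0, gain=1.0):
    """Firing rate of one neuronal cell with nonlinear dendrites.

    x : (d,) non-negative input vector
    W : (m, d) binary connection matrix, one row per dendrite
    """
    branch_sums = W @ x                                      # linear sum on each dendrite
    I_in = dendrite_nonlinearity(branch_sums, x_thr).sum()   # summed dendritic currents
    return gain * max(I_in, 0.0)                             # linear current-to-frequency, zero below zero

# toy example: 2 dendrites, 4 inputs, binary synapses
W = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
x = np.array([1.0, 1.0, 0.0, 1.0])
rate = nnld_output(x, W)   # b(2) + b(1) = 4 + 1 = 5
```

Note how the squaring happens per dendrite before the final summation; this local nonlinearity is what gives the binary-synapse network its discriminative power.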

### 2.2 Learning Algorithm: Network Rewiring Rule

The classifier shown in Figure 1b was trained by a structural plasticity-based network rewiring learning rule. For each input pattern, a binary target signal is provided to the classifier. The training was based on mean rate binary inputs, and testing was performed by mapping each input dimension to a Poisson spike train with a high or low mean firing rate. The learning algorithm primarily consists of the following steps:

1. The inputs are connected to the dendritic branches of the NNLDs via binary synapses (weight 0 or 1), so the network learns through connection changes instead of weight changes. Since the weights are binary, learning reduces to choosing the best set of connections for each dendrite.

2. At each epoch of learning, a randomly selected set of synapses is chosen for possible replacement. A performance index is computed for each synapse in the set, for both the positive and the negative cell, by averaging over the entire training set. The synapse having the least value of this index in the set is tagged for replacement.

3. For replacement, a candidate set is formed by randomly selecting a subset of the input lines. The candidate synapses are placed on the dendrite holding the tagged synapse from the previous step. The performance index is again computed for the candidates, and the one having the highest value replaces the tagged synapse.

For more details about the algorithm and a full description of the architectures and learning rules, refer to Roy et al. (2013), Roy, Banerjee et al. (2014), Hussain et al. (2014), and Hussain, Liu, and Basu (2015).
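One epoch of the rewiring rule can be sketched in software. This is a hedged illustration, not the exact algorithm of the cited works: the true performance index is a fitness computed over the training set, abstracted here as a caller-supplied `fitness(j, i)` function, and the set sizes `n_R` and `n_C` are placeholder parameters.

```python
import numpy as np
rng = np.random.default_rng(0)

def rewire_epoch(W, fitness, n_R=5, n_C=10):
    """One epoch of structural-plasticity rewiring on a binary connection
    matrix W (m dendrites x d inputs).

    fitness(j, i) scores input line i as a synapse on dendrite j; the exact
    performance index from the paper is not reproduced here.
    """
    m, d = W.shape
    # replacement set: existing synapses chosen at random
    rows, cols = np.nonzero(W)
    sel = rng.choice(len(rows), size=min(n_R, len(rows)), replace=False)
    # tag the poorest-performing synapse in the set
    worst = min(sel, key=lambda s: fitness(rows[s], cols[s]))
    j, i_old = rows[worst], cols[worst]
    # candidate set: random input lines tried on the same dendrite
    candidates = rng.choice(d, size=min(n_C, d), replace=False)
    i_new = max(candidates, key=lambda i: fitness(j, i))
    if fitness(j, i_new) > fitness(j, i_old):
        W[j, i_old], W[j, i_new] = 0, 1   # rewire the connection; weights stay binary
    return W
```

Repeating this epoch with a fitness derived from classification error gradually moves each dendrite toward its best sparse combination of inputs.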

### 2.3 Modifications for Hardware Implementation

Three modifications of the single-ended classifier and learning algorithm are used to improve performance of the hardware.

#### 2.3.1 Differential Architecture

#### 2.3.2 Dendritic Leak

#### 2.3.3 Margin-Based Learning

The notation used in this letter is summarized below:

- Input vector
- Dimensionality of the input
- Number of dendrites per cell
- Number of synapses per dendrite
- Nonlinear dendritic function without leak
- Nonlinear dendritic function with leak
- Input value at the $i$th synapse of the $j$th dendrite
- Scaling constant of the nonlinear dendritic function
- Input-output function of the WTA without margin
- Input-output function of the WTA with margin
- Sum of the currents from the dendritic branches of the P-cell
- Sum of the currents from the dendritic branches of the N-cell
- Margin of classification
- Mean rate of the Poisson spike train mapped to binary input 1
- Mean rate of the Poisson spike train mapped to binary input 0


## 3 VLSI Implementation of Neuromorphic IC

The VLSI architecture of the implemented neuromorphic integrated circuit (IC) is shown in Figures 3a to 3d, where AER is used to provide the synaptic input. A differential pair integrator (DPI) circuit implements the synaptic function of the neuron; with proper biasing, this circuit achieves the linear filtering property of a synapse (Bartolozzi & Indiveri, 2007). This linearity allows all the synapses on one dendrite to be replaced by a single shared synapse, drastically reducing the effective layout area of the IC.

This IC is interfaced with a field-programmable gate array (FPGA) controller that generates input spikes and addresses, as shown in Figure 4. The learned sparse connectivity matrix is stored inside the FPGA memory in a very compressed form using two look-up tables; this constitutes the crossbar or routing array shown in Figure 1a. Based on the input line address, the controller reads the connectivity information of that address line from memory and generates the decoder address needed to route the spike to the proper dendrite. Spikes from the FPGA output reach the synapse circuit input through an address decoder followed by digital switches. There are, in all, eight P-dendrites and eight N-dendrites connected to a NEURON block. The shared synapse circuit in each dendrite is followed by a square-law nonlinear circuit. For the chip that we have fabricated and present in section 5, both the P- and N-cells have eight dendrites. The difference between the total currents from the P-dendrites and the N-dendrites appears at the input of the neuron block and is converted to equivalent output spikes.
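The look-up-table routing performed by the FPGA can be sketched as follows. The data layout here is illustrative, not the actual compressed two-table format used on the FPGA, and the function names are our own.

```python
def build_lut(W):
    """Turn a binary connection matrix W (m dendrites x d inputs) into a
    look-up table from input-line address to the dendrite addresses to pulse."""
    lut = {}
    for dendrite, row in enumerate(W):
        for line, connected in enumerate(row):
            if connected:
                lut.setdefault(line, []).append(dendrite)
    return lut

def route_spike(lut, line):
    """Return the dendrite (decoder) addresses to pulse for a spike on `line`.
    Lines with no learned connection produce no pulses at all."""
    return lut.get(line, [])

# two dendrites, three input lines; line 2 fans out to both dendrites
lut = build_lut([[1, 0, 1],
                 [0, 1, 1]])
```

Because the matrix is sparse, only connected lines occupy memory, which is exactly why fewer synapses translate into less memory access and communication per spike.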

The output of the neuron block is digitized for proper handshaking of the IC with the FPGA in the form of request and acknowledge signals. We have implemented only one cell to compute one term in equation 2.6. (Unless otherwise mentioned, the inhibitory interneuron is not to be taken into account.) The second term is computed by passing the same input again through the IC, but with the connection matrix of the P-cell interchanged with that of the N-cell; this is shown in Figure 3e. So the two units presented in Figure 2c are implemented by passing the same input spike train into the same circuit with swapped connectivity. The final decision of the pattern class is taken on the PC after computing the difference of these two results and applying the Heaviside function to it. Thus, although the chip does not itself perform learning, most of the processing from input to output is done on-chip.
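The two-pass evaluation can be summarized in a few lines, with `chip_pass` standing in for one run of the IC (a hypothetical placeholder, not an actual driver API):

```python
def classify(chip_pass, W_p, W_n, spikes):
    """Two-pass classification with swapped connectivity, as done with the IC.

    chip_pass(W_p, W_n, spikes) stands in for one hardware run and returns the
    output spike count of the rectifying neuron, i.e. f(I_P - I_N).
    """
    y_plus = chip_pass(W_p, W_n, spikes)    # first pass: P/N as trained
    y_minus = chip_pass(W_n, W_p, spikes)   # second pass: connectivity swapped
    # Heaviside on the difference, computed on the PC; ties (equal firing)
    # are the rare "unclassified" cases and here default to class 0.
    return 1 if y_plus - y_minus > 0 else 0
```

The key point is that `y_minus` is not simply the negative of `y_plus`; each pass rectifies its own current difference, so both passes are needed to recover the signed decision.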

The neuromorphic IC has been fabricated in AMS m CMOS technology, and each classifier comprising two neurons occupies area. In this section, we briefly describe the circuit implementation of these blocks and show simulation results to describe the functionality of each block.

### 3.1 Synapse Circuit

### 3.2 Square Block Circuit

### 3.3 Neuron Circuit

## 4 FPGA Controller

To validate the operation of the neuromorphic IC, an FPGA-based controller (using an Opal Kelly XEM 3010) has been implemented to generate spiking events that emulate the real-time behavior of a spike-based sensor. The address event representation (AER) protocol (Boahen, 2000), which is commonly used in other asynchronous neuromorphic hardware (Chan et al., 2007; Brink et al., 2013), is used for communication between the IC and the FPGA controller. The FPGA controller shown in Figure 6 further uses these input pulse addresses to determine the corresponding decoder or dendritic address that needs to be pulsed. Details of the controller are presented in the appendix.

## 5 Measurement Results

We have designed the dendritic classifier IC in the AMS m process and evaluated its performance with the test setup described in section 4. A microphotograph of the fabricated chip is shown in Figure 7. The DPI synapse, the square block, and the neuron together form a basic unit of the fabricated chip governed by equation 3.7. For this chip, each of the P- and N-cells has eight dendrites. We first show some characterization results of the chip, followed by pattern classification under different conditions.

### 5.1 Characterization

This is shown in Figure 9. It should be noted that the two parameters together represent the total mismatch of each branch, and the effect of this mismatch cannot be eliminated by multiplexing the same neuromorphic circuits, as we have done, to get the classifier output. There is significant mismatch across the branches, with CVs of 23.5% and 14.4% in the dendritic branch gains and leaks, respectively. We will show later how the margin enhancement algorithm helps achieve classification accuracies close to those of software. In the future, these separately characterized values for each branch can also be used in the learning algorithm to calibrate for the mismatch.

Next, we characterized the power dissipation of the chip for different supply voltages. The chip is functional for power supply voltages as low as 1.8 V due to the current mode design. The static current is approximately 10.5 nA, and the dynamic current depends on the spiking activity. The dynamic power is normalized to the spike frequency for different supply voltages to get the energy per spike, which is plotted in Figure 10. The lowest value of 125 pJ is attained at 1.8 V. This characterization can be used to estimate the energy per classification operation. We expect this value to decrease quadratically when the supply voltage is reduced; hence, moving to a smaller process node like 65 nm and reducing the supply to 0.45 V should reduce the energy per spike to approximately 8 pJ.
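The quadratic scaling argument can be checked with a one-line calculation, under the assumption that dynamic energy per spike scales as $CV_{dd}^2$ and that the 125 pJ/spike figure quoted in the abstract corresponds to the 1.8 V operating point:

```python
def scaled_energy(e_spike_pj, vdd_old, vdd_new):
    """Scale energy per spike quadratically with supply voltage (E ~ C * Vdd^2)."""
    return e_spike_pj * (vdd_new / vdd_old) ** 2

# 125 pJ/spike at 1.8 V, rescaled to a 0.45 V supply:
e_new = scaled_energy(125.0, 1.8, 0.45)   # 125 / 16 = 7.8125 pJ, i.e. ~8 pJ
```

The factor (0.45/1.8)² = 1/16 accounts for the ~8 pJ figure quoted in the text; any additional capacitance reduction from the smaller process node would lower it further.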

### 5.2 Pattern Classification: Random Binary Pattern Set

#### 5.2.1 Input Generation

To evaluate the performance of our chip, we first choose the two-class classification problem of high-dimensional random binary patterns reported in Hussain et al. (2014), Poirazi and Mel (2001), and Hussain, Liu, and Basu (2015). A 40-dimensional random gaussian data vector is mapped to a neurally plausible, sparse 400-dimensional binary vector using 10 nonoverlapping receptive fields per dimension (Hussain et al., 2014). The widths of the receptive fields at each location were chosen such that they have equal probability of being active. Since 10 receptive fields span each dimension, each of them had a probability of 0.1 of being active in any given pattern. This sparse, high-dimensional vector is next converted to a spiking input by mapping the two binary values to Poisson spike trains with high and low mean firing rates, respectively. This type of random binary spike pattern has also been used to test the classification performance of other neuromorphic hardware (Mitra, Fusi, & Indiveri, 2009). A 500 ms sequence of pulse trains, along with the input line addresses, is generated for each of the 200 random patterns, which are split equally and arbitrarily into the two classes. The network is first trained on the binary vectors (Hussain et al., 2015) to find the desired connection matrix on a PC, which is then downloaded to the FPGA for hardware evaluation. Next, the spike patterns are sent as input to the chip, and classification results are obtained for various configurations. A software implementation of the NNLD classifier is also used to evaluate performance on spike train inputs; we use this as a baseline against which to compare the hardware. In the following text, we use the term *simulation* to refer to this software implementation of the NNLD classifier.
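The input generation pipeline above can be sketched in Python. The receptive-field binning and the firing rates (`rate_high`, `rate_low`) and 500 ms window are illustrative parameter choices; the exact rates used on the chip are not reproduced here.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)

def norm_cdf(v):
    """Standard normal CDF, used to build equal-probability bins."""
    return 0.5 * (1.0 + erf(v / sqrt(2.0)))

def to_sparse_binary(sample, n_rf=10):
    """Map each gaussian-distributed dimension onto n_rf nonoverlapping,
    equiprobable receptive fields (one-hot per dimension), producing a
    sparse high-dimensional binary vector."""
    out = []
    for v in sample:
        idx = min(int(norm_cdf(v) * n_rf), n_rf - 1)  # equal-probability bin index
        rf = np.zeros(n_rf)
        rf[idx] = 1.0
        out.append(rf)
    return np.concatenate(out)

def poisson_spike_train(bit, T=0.5, rate_high=100.0, rate_low=1.0):
    """Map a binary value to a Poisson spike train over a window of T seconds."""
    rate = rate_high if bit else rate_low
    n = rng.poisson(rate * T)
    return np.sort(rng.uniform(0.0, T, size=n))

# a 40-dimensional gaussian sample becomes a 400-dimensional sparse vector
v = to_sparse_binary(rng.standard_normal(40))
```

Exactly one receptive field fires per input dimension, so each 40-dimensional sample activates 40 of the 400 binary lines, matching the 0.1 activation probability stated above.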

#### 5.2.2 Classification Performance

We note that the number of unclassified patterns (those causing equal neuronal firing for both current difference inputs) is negligible in Matlab simulations and is almost always zero in measurements.

To reduce this error, we next introduced dendritic leak currents, which have been shown to reduce the average current into the dendrite and increase its effective dynamic range (Hussain et al., 2015). The leak was set so as to cancel most of the common-mode current; it depends on the probability of a randomly selected input dimension being high for a given pattern, which in turn depends on the number of receptive fields used to generate the high-dimensional mapping. The desired bias value for this leak is obtained from the curve fit shown in Figure 8b. It can be seen from Figure 12a that the classification error indeed reduces for , 15, and 20 by adding dendritic leak for Hz. However, the reduction is not as large as expected from software simulations because of mismatch between the transistors in Figure 3b that create the leak currents. In the future, we can use a current-splitter-based configurable leak current in each dendrite to remove the mismatch. Since this is a static setting (unlike if a splitter is used in a synapse), the dynamic performance of the splitter for small currents is not a bottleneck. Another option is to use the characterization results of this mismatch in the training process to find a new connection matrix. Finally, as we show next, we can partially reduce the effect of this mismatch by using margin-based training.

The effect of increasing the margin during training was tested with noisy background spikes added to the input. The network is retrained for larger margins using the adaptive algorithm in Hussain et al. (2015), and the new connection matrix is used for testing. The errors for margins of , 2.5, 5, 7.5, and 10 are compared with an increasing number of random spikes in Figure 12b when Hz. It can be seen that for all noise levels (obtained by increasing the number of random spikes), the increased margin setting of is optimal in our case, achieving to less error than the nominal case of , proving that the added margin indeed helps improve robustness. However, the classification performance degrades slightly when the margin is increased from to in both the software simulation of the NNLD and the chip measurements. This is because increasing the margin beyond a point makes it difficult for the training process to converge. We can also use the measured error (for and no random spikes) to benchmark the performance of the chip against other classifiers.

A software implementation of the NNLD achieves a comparable error for the same parameter settings. To compare with other non-spike-based classifiers in software, we modified the binary inputs by adding noise to approximate the situation of noisy spike trains. The variance of the noise was set so that the performance of the NNLD on these binary inputs matches its software performance on spike train inputs. With these noisy binary patterns, a perceptron classifier achieves error, while an extreme learning machine (ELM; Huang, Zhu, & Siew, 2006) achieves error with 2000 hidden neurons. Note that the NNLD uses only 160 binary synapses, while the perceptron uses 400 high-resolution weights and the ELM uses 2000 high-resolution weights plus close to 1 million random weights. This shows the benefit of our approach over networks with weighted synapses.

### 5.3 Pattern Classification: UCI Data Sets

#### 5.3.1 Data Set and Input Generation

To evaluate the classification performance of our system on real-world data sets, we next tested it on two standard UCI data sets: the Breast Cancer (BC) data set and the Heart data set. These were also two-class classification problems; however, they differ from the former experiments in that the random data sets tested only the noise resilience property of the NNLD since the training patterns were converted to noisy spike trains for testing. In the UCI data sets, training and testing use different patterns and, hence, evaluate the generalization capability of our chip. For both data sets, each input vector was mapped to a higher-dimensional sparse binary vector. Similar to the method in section 5.2.1, this mapping was done by employing 10 nonoverlapping receptive fields to span each of the original dimensions of the data. The width of each receptive field was again chosen so that all of the higher dimensions have an equal probability of being active. Hence, the original 9 and 13 dimensions of the BC and Heart data sets were mapped to 90 and 130, respectively. The numbers of training and testing samples for the BC data set are 222 and 383, respectively, while the corresponding numbers for the Heart data set are 70 and 200, respectively.

#### 5.3.2 Classification Performance

Table 2 compares the performance of different classifiers on these data sets. For the NNLD classifier, the setting fixed at 7 and the margin, with Hz, were used because these settings yielded the best results for the random binary pattern classification case. Five trials were conducted on each UCI data set, with the NNLD separately trained in software from different initial conditions; different instances of Poisson spike trains were also generated in each trial. For the NNLD, we report the results of software simulation on both binary inputs and spike inputs to show the loss in performance expected from mapping to noisy spike trains. The results for SVM and ELM are taken from Babu and Suresh (2013). We also show the results for a software implementation of a perceptron classifying the same high-dimensional binary patterns that are input to the NNLD. We find that in all cases, the software implementation of the NNLD achieves performance comparable to SVM or ELM and superior to that of the perceptron. Clearly, the proposed network performs well compared to much larger networks constructed with weighted (high-resolution) synapses. In all cases, there is a drop of approximately in accuracy in software when the binary inputs are mapped to noisy spike trains. Finally, the measured accuracy of the IC is approximately to less than the spike-testing accuracy in software. It should be noted that the number of weights our NNLD uses is two to six times smaller than for SVM or ELM and comparable to that needed by the perceptron. But the perceptron used high-resolution weights, while the NNLD uses only binary weights, underlining the latter's higher computational power.

Table 2: Comparison of classifiers on the UCI data sets.

| Data Set | SVM neurons | SVM weights | SVM accuracy | ELM neurons | ELM weights | ELM accuracy | Perceptron neurons | Perceptron weights | Perceptron accuracy | NNLD dendrites | NNLD weights | NNLD accuracy (binary, software) | NNLD accuracy (spike, software) | NNLD accuracy (chip) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BC | 24 | 240 | 96.7 | 66 | 660 | 96.4 | 1 | 90 | | 16 | 112 | | | |
| Heart | 42 | 588 | 75.5 | 36 | 504 | 76.5 | 1 | 130 | | 16 | 112 | | | |


Note: For each classifier, the columns list the number of neurons (dendrites for the NNLD), the number of weights, and the classification accuracy in percentage.

The classification performance of the IC on the Heart data set is similar to that of the perceptron in software. This is due to limitations on the maximum bias values we could set in this IC. However, this can be partially overcome by boosting (Mitra et al., 2009), where the outputs of several such NNLD classifiers are combined to produce the final decision by voting. If each NNLD makes mistakes on different subsets of the patterns, the combined vote should result in far fewer errors; of course, this comes at the expense of more dendrites and synapses. We tried this on the Heart data set by summing the spike counts of the P-cells and the N-cells separately over the five trials before comparing them. The resulting accuracy was better than those obtained from a single classification run as well as those from the other classifiers. The effective number of synapses used increases to 560, which is similar to SVM and ELM; however, the synapses of the NNLD are binary, compared to the high-resolution tunable weights required by the other approaches.
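The boosting step described above amounts to summing the per-trial spike counts before thresholding. A minimal sketch (the trial counts below are made-up illustrative numbers, not measured data):

```python
def boosted_decision(p_counts, n_counts):
    """Combine several NNLD runs by summing the P-cell and N-cell spike
    counts over the trials before comparing them (the simple vote used on
    the Heart data set)."""
    return 1 if sum(p_counts) > sum(n_counts) else 0

# five trials; individual trials may disagree, but the aggregate decides
cls = boosted_decision([30, 12, 25, 8, 22],   # P-cell spike counts per trial
                       [10, 20, 9, 18, 11])   # N-cell spike counts per trial
```

Summing counts rather than taking a majority of per-trial decisions weights confident trials more heavily, which is why errors on different pattern subsets tend to cancel.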

### 5.4 Pattern Classification: Classification Speed

Finally, to gauge the speed of classification (and its trade-off with accuracy) achievable by the classifier, it was tested again on the standard BC data set by shrinking the time window of observation of an individual pattern. The advantage of a smaller classification window is greater classification speed, in addition to fewer output spikes and hence lower energy dissipation. Two cases were considered.

#### 5.4.1 Poisson Spike Train

Decreasing the time window indiscriminately has the demerit of increasing the effective noise in the input, and this trade-off was tested as described next.

*Data set and input generation.* The input data set generation followed the same principles as in section 5.3.1: each input vector was mapped to a higher-dimensional sparse binary vector, and each resulting vector was converted to a Poisson spike train, with the binary value 1 mapped to a train with a high mean firing rate and 0 to a low firing rate. The training and testing sets differ, and the training of the classifier followed the procedure described earlier.

*Classification performance.* The pulse train sequence fed to the network had its time window varied up to 500 ms in steps of 50 ms. In each case, the 383 binary patterns in the testing subset of the BC data set were classified with the settings ( and Hz) found optimal in the earlier runs. In addition, for each time window, the synapse bias was swept over a range. The expected trend is confirmed in Figure 15a: the error gradually decreases and finally becomes almost constant with increasing classification time interval, and the errors also become lower with increasing bias for a particular window. For example, at the highest bias setting, the optimal performance is reached within 300 ms.

The trend with bias variation is easily explained: with increasing bias, the synaptic time constant (see section 3.1) increases, resulting in more averaging of the noisy Poisson input. Also, the current from a synapse on an active afferent rises, resulting in a greater difference between an ON and an OFF synapse. These two effects combine to produce lower error at higher bias settings.
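The averaging effect of a longer synaptic time constant can be illustrated with a first-order filter driven by a random spike train. This is a behavioral sketch of a DPI-like synapse, not the circuit; the rates, time constants, and settling cutoff are illustrative.

```python
import numpy as np
rng = np.random.default_rng(2)

def filtered_current(spike_times, tau, T=0.5, dt=1e-3):
    """First-order (DPI-like) synaptic filter: each spike injects a unit of
    current that decays exponentially with time constant tau; returns the
    sampled current waveform over [0, T)."""
    t = np.arange(0.0, T, dt)
    i_syn = np.zeros_like(t)
    for s in spike_times:
        i_syn[t >= s] += np.exp(-(t[t >= s] - s) / tau)
    return i_syn

spikes = np.sort(rng.uniform(0.0, 0.5, size=50))   # ~100 Hz Poisson-like train
fast = filtered_current(spikes, tau=2e-3)          # short time constant: spiky current
slow = filtered_current(spikes, tau=50e-3)         # long time constant: averaged current
# after an initial settling interval, compare relative fluctuation (CV)
cv_fast = fast[200:].std() / fast[200:].mean()
cv_slow = slow[200:].std() / slow[200:].mean()
```

The slower synapse integrates several interspike intervals, so its current fluctuates much less relative to its mean, which is the averaging effect responsible for the lower error at higher bias.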

#### 5.4.2 Single Spike Burst

*Data set and input generation.* The input generation for this two-class classification problem differed from that in section 5.2.1. After the mapping with the nonoverlapping receptive fields to expand each of the original dimensions of the data, the resulting high-dimensional vector is converted to a spiking input by mapping the binary value 1 to a burst of spikes concentrated over a very small time duration (2 ms) and the value 0 to no spikes at all.

*Classification performance.* The classification was conducted through a series of runs, each corresponding to a different spike count in the input burst per active afferent of the classifier. Each run itself measured performance over increasing classification time windows (or pattern durations), up to a maximum of 50 ms this time. This is substantially lower than the previous value of 500 ms because the sole purpose of this experiment was to see how quickly our system could classify a pattern set with reasonable accuracy (hence only a small spike burst, instead of an entire noisy spike train, applied to a binary-1 afferent). Figure 15b shows the experimental results. Again, we find improved performance with expanding classification time in each case, similar to the trend obtained in section 5.4.1. The drastic reduction in accuracy at the smallest window values is because of the finite number of spikes obtainable in such a short time, which limits the discriminative ability of the neuron. The general classification accuracy also improves with increasing spike density per burst; this is intuitively agreeable because more spikes per input burst per active afferent implies a greater difference between the currents of an ON and an OFF synapse (as discussed in section 5.2.2). In this case, we reach an error of at ms with 20 spikes per burst.

## 6 Discussion

### 6.1 Relation to Other Work

In terms of spike-based classifiers in hardware, our result of error in classifying 200 random binary patterns using only 160 binary synapses compares favorably with other systems. For example, the work in Mitra et al. (2009) could classify about 12 random binary patterns at an error of using 1200 binary synapses; of course, that work included on-chip learning capabilities that are not present in ours. In terms of VLSI implementations of dendritic function, Nease, George, Hasler, Koziol, and Brink (2012) present a voltage mode diffuser circuit to model the passive properties of dendrites and show that it can exhibit properties matching cable theory. Wang and Liu (2010, 2013) present a current mode implementation of active dendrite models and use it to study responses to spatiotemporal spike inputs. Compared to these detailed models, our model of dendritic nonlinearity is much more simplified; instead, we have focused on using it for spike-based pattern classification. Our results show the viability of binary synapses for pattern classification when coupled with margin-based learning to counter the imperfections of VLSI implementations. The key to achieving good performance is a larger ratio of nonlinear to linear processing: short linear summations of synaptic inputs have to be passed through a nonlinearity before further summation.

This method can be used by other analog neuromorphic processors (Brink et al., 2013; Qiao et al., 2015), where the dendritic nonlinearity can be replaced by neurons, while the software model used for training uses structural plasticity to choose the few best connections per neuron, which can then be incorporated into look-up tables for AER. The inherent learning capability in some of these systems can be used to fine-tune the performance further. Here, we assume that the benefits available in the NNLD network are due to the additional nonlinearities provided by the dendrites. Hence, it should be possible to get similar benefits by using neuronal nonlinearities in place of dendrites. However, in that case, the area would increase considerably, since the dendrite circuit does not have capacitors and is more compact than an IF neuron, which has at least one capacitor for membrane dynamics and more for refractory period and spike frequency adaptation implementations. Finally, Spiess, George, Cook, and Diehl (2016) have recently explored the use of structural plasticity to denoise neuronal responses. However, no real-world classification problems were reported, and no hardware measurements were made. We believe our work is the first to show such results.

### 6.2 Future Work

Though the initial results from this simple chip are promising, it has certain limitations that can be improved in future designs. We discuss these aspects and other extensions in this section.

In this proof-of-concept work, since the WTA for comparison was not implemented, the number of spikes from the output neuron was not regulated. On average, the neurons fired about 15,000 spikes in a pattern duration of 500 ms. However, not all of these spikes are necessary, and far fewer spikes can give comparable accuracy if an inhibitory interneuron (IIN)-based WTA is included on the chip. This has already been depicted in Figure 2a for a differential architecture. From this representation, it is clear that the IIN steady-state current is fed back on a negative feedback path to the inputs of both neurons, so that for class 1 patterns, the current driving the losing cell would be rapidly decreased to zero, with a simultaneous decrease of the winning cell's current, while for class 0 patterns, the reverse would occur. This suggests a significant reduction in spiking of the output neuron. The mathematical formulation in equation 2.6 also supports this statement if one notes that the differential current of one cell is not simply the negative of the other's; rather, it represents the difference between the N-cell current and the P-cell current with the connectivities swapped. Obviously, for a pattern to be classified in class 1, one differential current must exceed the other, and vice versa for classification in class 0.

To evaluate the possible benefits, we implemented a software model of the WTA and varied the two parameters denoting the time constant and peak amplitude of the inhibitory postsynaptic current (IPSC) generated by the interneuron. The performance of the NNLD classifier was evaluated for the case of random binary inputs mapped to spike trains. As shown in Figure 16a, different combinations of these two parameters lead to different error rates; errors increase with stronger inhibition as an increasing proportion of the output neuronal spikes is eliminated by the negative feedback. These data are replotted in Figure 16b to show the trade-off in terms of accuracy versus spike rate. It can be seen that, compared to the original accuracy, slightly lower accuracies can be obtained with a concomitant large reduction in the number of spikes. This points to the great benefit of an appropriately tuned inhibitory interneuron as a way to lower output neuronal firing without cutting down on the classification interval; it will be included on chip in future versions of the system.
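The accuracy-versus-spike-count trade-off described above can be illustrated with a toy model. The sketch below is not the authors' software model: it uses two leaky integrate-and-fire output units whose every spike injects an exponentially decaying inhibitory current back to both cells, mimicking the IIN negative feedback; all parameter names and values (`tau_inh`, `a_inh`, drives, thresholds) are illustrative assumptions.

```python
import numpy as np

def run_wta(drive_p, drive_n, tau_inh=5e-3, a_inh=0.0, dt=1e-4, T=0.5,
            tau_m=20e-3, thresh=1.0):
    """Toy differential pair: two leaky integrate-and-fire outputs (P and N
    cells) receive constant drives; each output spike adds an exponentially
    decaying inhibitory current (peak a_inh, time constant tau_inh) that is
    fed back to both cells, mimicking the inhibitory interneuron."""
    vp = vn = 0.0            # membrane states
    i_inh = 0.0              # shared inhibitory feedback current
    spikes_p = spikes_n = 0
    for _ in range(int(T / dt)):
        vp += dt / tau_m * (drive_p - i_inh - vp)
        vn += dt / tau_m * (drive_n - i_inh - vn)
        i_inh *= np.exp(-dt / tau_inh)       # IPSC decay
        if vp >= thresh:
            vp = 0.0; spikes_p += 1; i_inh += a_inh
        if vn >= thresh:
            vn = 0.0; spikes_n += 1; i_inh += a_inh
    return spikes_p, spikes_n
```

With `a_inh > 0`, both cells emit fewer spikes while the cell with the larger drive still wins, which is the qualitative behavior exploited in Figure 16: accuracy is largely preserved while total spike count drops.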

The current chip has a limited number of dendrites (eight) per cell. Combined with the constraint on the maximum number of synapses per dendrite (posed by transistors in the DPI synapse and squaring block leaving the subthreshold regime, as shown in equations 3.5 and 3.3), this limits the computational capability of each cell. In future versions of the chip, we will increase the size of the output transistor of the DPI synapse, as well as the transistors in the translinear loop of the squaring block, to obtain a higher operating range. Architecturally, we will move to cells with more dendrites, since that also increases the computational power of the classifier (Hussain et al., 2015). Future versions of the system will integrate multiple NNLDs per chip with a fully asynchronous AER-based input and output interface so that multiple chips can be tiled to create bigger processors. In such a system, we can also use the concept of boosting (shown in section 5.3.2 to improve performance) by allocating multiple NNLDs to decide on a class by voting. Mismatch between different branches reduces the accuracy of the hardware (though margin-based learning is useful to counter this effect). We will use the results of chip characterization to modify the software model during training; the new connection matrix, where the learning accounts for the mismatch, should produce a better match between hardware and software results. We also plan to extend our hardware to include on-chip learning of connection matrices that can account for all mismatches in the hardware directly. Initial architectural explorations in this direction have already been done (Roy, Kar, & Basu, 2014), and a chip is being fabricated to test this idea.

In terms of algorithms, it has already been shown that the NNLD-based classifier can easily be extended to multiclass problems (Hussain et al., 2014) as well as to spike-time-based pattern classification (Roy, Banerjee et al., 2014; San, Basu, & Hussain, 2014). We will employ future systems to perform classification of handwritten digits from the MNIST database, a commonly used multiclass classification benchmark in current neuromorphic spike-based algorithms (O'Connor et al., 2013; Neftci, Das, Pedroni, Kreutz-Delgado, & Cauwenberghs, 2013). This problem is still considered a mean-rate-encoded one. To assess the ability of the hardware to identify spike-time-based patterns, we will employ it to classify the spike trains from a liquid state machine, as done in Roy, Banerjee et al. (2014), and random spike latency patterns, as done in Gutig and Sompolinsky (2006). We also plan to connect these ICs to real-time spike-generating auditory sensors (Chan et al., 2007) to perform rapid speech classification at low power.

## 7 Conclusion

In this letter, we have presented the VLSI circuit design, in CMOS, of a neuromorphic spike-based classifier with eight nonlinear dendrites per neuron and two opponent neurons per class. We presented characterization results to prove the functionality of all subblocks from a supply voltage as low as 1.8 V, as well as results of classifying complex spike-based high-dimensional binary patterns. The classification error for 200 patterns randomly assigned to two classes was obtained under different conditions. We showed that the addition of a dendritic leak and a classification margin helps improve performance and makes the system robust against noise. With an optimal classification margin, the hardware system performs comparably to SVM and ELM on two UCI data sets while using far fewer binary weights. We also demonstrated that the margin enhancement algorithm allows single-output differential operation with accuracy similar to double-output operation, thus allowing us to reduce the number of neurons. Pattern classification within 50 ms was possible when using bursts of spikes to represent input binary values. The accuracy was degraded compared to the software due to transistor mismatch and the non-square-law behavior of the dendritic nonlinearity at high currents. Future work will use calibration to improve accuracy and include a WTA on chip to reduce the output spike counts. We will also employ future generations of this chip for multiclass and spike-time-based classification.

## Appendix: Details of the FPGA Controller

The FPGA controller consists of the following blocks:

### A.1 Block Memory (Module A)

Module A, shown in Figure 6, is the volatile memory array implemented in the hardware. For generating presynaptic pulses, the input spike train vector (*spkTm*), the input line addresses (*spkAd*), and the connectivity matrix created in the pattern classification program are transferred from a PC to block memory A, block memory B, and block memory C, respectively. The structures of the *spkTm* and *spkAd* vectors are shown in Figure 17. Since *spkTm* is a sparse array of binary firing events, only those time instances of *spkTm* having binary value 1 are stored in block memory A. These time instances are generated with respect to an FPGA internal clock. Similarly, the input addresses at these particular time instances are stored in block memory B. These two memories store input information and will not be needed when interfacing with a real spiking sensor.
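The event-list format held in block memories A and B can be sketched as follows: a dense binary spike raster is compressed into a time-stamp list (*spkTm*) and a matching address list (*spkAd*) holding only the bins with value 1. This is an illustrative Python sketch, not the FPGA implementation; the raster layout (input lines as rows, time bins as columns) is an assumption.

```python
import numpy as np

def raster_to_events(spk_raster):
    """Compress a dense binary spike raster into the spkTm/spkAd event lists.
    spk_raster[addr, t] == 1 marks a spike on input line `addr` in time bin
    `t`. Transposing before nonzero() makes the events come out time-sorted,
    as the FPGA replays them against its internal clock."""
    t_idx, addr = np.nonzero(spk_raster.T)
    return t_idx.tolist(), addr.tolist()   # spkTm, spkAd
```

For example, a raster with spikes at (line 2, bin 1), (line 0, bin 3), and (line 3, bin 3) yields `spkTm = [1, 3, 3]` and `spkAd = [2, 0, 3]`, so memories A and B stay aligned entry for entry.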

The connectivity matrix of this neural network has dimensions equal to the number of input lines times the number of dendrites (as discussed in section 2). Since the number of connections per dendrite is small, this matrix is also sparse. Storing the entire content of the sparse matrix as a look-up table (LUT) would require memory space on the order of the full matrix size and is wasteful. Instead, we store only the addresses of the nonzero elements as a linear array, *synCol*, as shown in Figure 18, whose entries are decoder addresses. For example, if the first input line connects to dendrites 2, 3, and 5, the first three entries of *synCol* will be the decoder addresses of those three dendrites. But now, given an input pulse address, determining which decoder addresses have to be generated is difficult, since the input pulse address cannot be used to directly index into this array. To circumvent this, we store another linear integer array, *addrPtr*, with one entry per input line. Each entry of *addrPtr* stores the address in the *synCol* array where the first dendritic connection (decoder address) for the corresponding input line is stored. The number of dendritic connections
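The *synCol*/*addrPtr* scheme is essentially compressed sparse row storage of the connectivity matrix. The sketch below is an illustrative Python model (0-indexed, with a sentinel entry appended to *addrPtr* for easy slicing), not the FPGA memory layout itself.

```python
import numpy as np

def build_sparse_conn(conn):
    """Compress a binary connectivity matrix (input lines x dendrites) into
    synCol (decoder addresses of nonzero entries, stored line by line) and
    addrPtr (for each input line, the offset in synCol of its first
    connection). A final sentinel in addrPtr marks the total entry count."""
    syn_col, addr_ptr = [], []
    for row in conn:                        # one row per input line
        addr_ptr.append(len(syn_col))       # start of this line's entries
        syn_col.extend(np.nonzero(row)[0].tolist())
    addr_ptr.append(len(syn_col))           # sentinel
    return syn_col, addr_ptr

def dendrites_for_line(syn_col, addr_ptr, line):
    """Given an input pulse address, return its target dendrite addresses."""
    return syn_col[addr_ptr[line]:addr_ptr[line + 1]]
```

With this layout, an incoming pulse address indexes *addrPtr* directly, and the slice of *synCol* between consecutive pointers yields exactly the decoder addresses to generate, so the memory cost scales with the number of connections rather than the full matrix size.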