Abstract

We present a neuromorphic current mode implementation of a spiking neural classifier with lumped square law dendritic nonlinearity. It has been shown previously in software simulations that such a system with binary synapses can be trained with structural plasticity algorithms to achieve comparable classification accuracy with fewer synaptic resources than conventional algorithms. We show that even in real analog systems with manufacturing imperfections (CV of 23.5% and 14.4% for dendritic branch gains and leaks, respectively), this network is able to produce comparable results with fewer synaptic resources. The chip, fabricated in a standard complementary metal oxide semiconductor process, has eight dendrites per cell and uses two opposing cells per class to cancel common-mode inputs. The chip can operate down to 1.8 V and dissipates 19 nW of static power per neuronal cell and 125 pJ/spike. For two-class classification problems of high-dimensional, rate-encoded binary patterns, the hardware achieves performance comparable to a software implementation of the same network, with only about a 0.5% reduction in accuracy. On two UCI data sets, the integrated circuit (IC) has classification accuracy comparable to standard machine learners like support vector machines and extreme learning machines while using two to five times fewer binary synapses. We also show that the system can operate on mean-rate-encoded spike patterns as well as on short bursts of spikes. To the best of our knowledge, this is the first attempt in hardware to perform classification exploiting dendritic properties and binary synapses.

1  Introduction

Spiking neural networks (SNN), considered to be the third generation of neural networks, were proposed due to the advent of neurobiological evidence (Rieke, Warland, de Ruyter van Steveninck, & Bialek, 1999) suggesting that biological neural systems use the timing of action potentials, or spikes, to convey information. These types of networks are considered to be more biorealistic and computationally more powerful than their predecessors (Maass & Markram, 2004; Maass, 1999; Gutig & Sompolinsky, 2006). Neuromorphic engineering (Mead, 1990) aims to emulate the analog processing of neuronal structures in circuits to achieve low-power, low-area, very large-scale integrated circuit implementations. Thus, while theoretical studies on SNN have progressed rapidly, neuromorphic engineers have, in parallel, implemented low-power VLSI circuits that emulate sensory systems (Chan, Liu, & van Schaik, 2007; Culurciello, Etienne-Cummings, & Boahen, 2003; Lichtsteiner, Posch, & Delbruck, 2008; Hsieh & Tang, 2012) and higher cognitive functions like learning and memory (Arthur & Boahen, 2007; Wang, Hamilton, Tapson, & van Schaik, 2014). Since silicon systems face many challenges similar to neuronal ones, we hope to gain insight into some operating principles of the brain by building such neuromorphic systems. Also, with the advent of brain-machine interfaces and the internet of things, there is now a pressing need for area- and energy-efficient neural networks for pattern classification.

Recently, Roy, Basu, and Hussain (2013) and Hussain, Gopalakrishnan, Basu, and Liu (2013) proposed structures inspired by the nonlinear properties of dendrites in neurons that require many fewer synaptic resources than other neuromorphic designs. The learning of these structures involves network rewiring of binary synapses that is comparable to the structural plasticity observed in biological neural systems. As an example, in vivo imaging studies have shown that synaptic rewiring mediated by rerouting of whole axonal branches to different postsynaptic targets takes place in the mature cortex (Stettler, Yamahachi, Li, Denk, & Gilbert, 2006). Several experimental studies have provided evidence for the formation and elimination of synapses in the adult brain (Trachtenberg et al., 2002), including activity-dependent pruning of weaker synapses during early development (Le Bé & Markram, 2006). Inspired by these phenomena, our learning algorithm tries to find the best sparse combinations of inputs on each dendrite to improve performance. This choice of connectivity can be easily incorporated in hardware systems using address event representation (AER) protocols, commonly used in current neuromorphic systems, where the connection matrix is stored in memory. Since this memory has to be stored for implementing any AER-based system, no extra overhead is needed to implement our method other than the dendritic nonlinearity. Instead, the reduced number of synaptic connections translates to a reduction in memory access and communication overhead, which is often the most power-consuming aspect of large-scale spiking neural networks (Hasler & Marr, 2013).

In this work, we present a neuromorphic current mode implementation of the above neural classifier with shared synapses driven by AER. Some of the initial characterization results were presented in Banerjee, Kar, Roy, Bhaduri, and Basu (2015). In this letter, we present complete characterization results and detailed results on pattern classification using the chip. The organization of the letter is as follows. We first present some architectural modifications of the basic dendritic cell for improved hardware performance. Next, we present circuit descriptions and simulations of each building block. Measurement results from a chip fabricated in a complementary metal oxide semiconductor (CMOS) process are presented in the following section to prove functional correctness. Finally, we conclude with discussions in the last section.

2  Background and Theory

Roy and colleagues (Roy et al., 2013; Roy, Banerjee, & Basu, 2014) and Hussain and colleagues (Hussain, Gopalakrishnan, Basu, & Liu, 2013; Hussain, Basu, & Liu, 2014) have described spike train classifiers employing neurons with nonlinear dendrites (NNLD) and binary synapses. Due to the presence of binary synapses, the learning in these types of architectures happens by morphological changes in the connections between inputs and dendrites, not by weight updates. Thus, these architectures are amenable to neuromorphic implementation employing AER protocols. In this letter, we present a circuit to implement the architecture proposed in Hussain et al. (2014), which has comparable performance to other spike-based classifiers such as that of O'Connor, Neil, Liu, Delbruck, and Pfeiffer (2013) but uses 2 to 12 times fewer synaptic resources. Hence, our hardware implementation requires correspondingly less memory to store the connectivity information. It also needs proportionately less energy to communicate the connection information for each spike. Note that in this work, the training is done on a PC and the learned connection matrix is downloaded to the hardware platform for testing. Next, for completeness, we briefly describe the architecture of a basic NNLD, a classifier composed of two such NNLDs, the learning rule to train the classifier, and some modifications to improve hardware performance.

2.1  Architecture

Figure 1a shows the basic architecture of a single NNLD with m dendritic branches and k excitatory synapses per branch as described in Hussain et al. (2013) and Roy et al. (2013). If a d-dimensional binary input pattern x is applied to this system, then each synapse is excited by the input connected to it, and the output response of the jth dendritic branch is given by b(sum_i w_ij x_ij). Here, b is a model of the dendritic nonlinearity given by b(z) = z^2/x_thr, w_ij is the synaptic weight of the ith synapse on the jth branch, x_ij is the corresponding input, and x_thr is a scaling constant. We choose a square law nonlinearity since it has been shown in Hussain et al. (2015) to match the measured dendritic nonlinearity reported in Polsky, Mel, and Schiller (2004). Other popular nonlinearities like ReLU are not applicable here since the input to the nonlinearity is always positive because the synapses are only excitatory. Also, squaring circuits can be made using only five transistors, as we will show. Let a denote the total summed output current from all the dendrites that enters the neuron. Then the overall output of a single neuronal cell is given by
a = \sum_{j=1}^{m} b\left(\sum_{i=1}^{k} w_{ij}\, x_{ij}\right)
2.1
y = f(a) \propto a\, g(a)
2.2
where f denotes the linear neuronal current-to-frequency conversion function and g is the Heaviside function, defined as g(z) = 1 for z > 0, g(z) = 0.5 for z = 0, and g(z) = 0 for z < 0. This signifies that the neuron produces zero output for negative inputs.
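
For readers who prefer a computational summary, the sketch below is an illustrative Python rendering of equations 2.1 and 2.2 together with the two-cell comparison introduced later; it follows the notation above (conn, x_thr, and the dimensions are placeholders, not chip parameters).

    import numpy as np

    def nnld_activation(x, conn, x_thr=1.0):
        """Total dendritic current 'a' of one NNLD (equation 2.1).

        x    : binary input vector of length d.
        conn : integer array of shape (m, k); conn[j, i] is the input line
               connected to the ith synapse of the jth dendrite (binary weights).
        """
        branch_sums = x[conn].sum(axis=1)          # linear sum on each dendrite
        return np.sum(branch_sums ** 2) / x_thr    # lumped square-law nonlinearity b()

    def classify(x, conn_P, conn_N, x_thr=1.0):
        """Two-class decision g(a_P - a_N) using two opposing NNLDs."""
        a_P = nnld_activation(x, conn_P, x_thr)
        a_N = nnld_activation(x, conn_N, x_thr)
        return 1 if a_P > a_N else 0

    # Example: d = 20 inputs, m = 4 dendrites, k = 3 synapses per dendrite.
    rng = np.random.default_rng(0)
    x = rng.integers(0, 2, size=20)
    conn_P = rng.integers(0, 20, size=(4, 3))
    conn_N = rng.integers(0, 20, size=(4, 3))
    print(classify(x, conn_P, conn_N))
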
Figure 1:

(a) Architecture of a single neuron with multiple dendritic branches and sparse synaptic connectivity. The black circles denote synaptic connections, and the b() blocks represent lumped dendritic nonlinearities. f is the neuron activation function. (b) Single-ended classifier composed of two neurons with nonlinear dendrites (NNLD) and a winner-take-all (WTA) circuit for comparing the outputs of the neurons.

For rate-encoded spike trains at the input, a low-pass filtering synaptic kernel was used in the earlier work (Hussain et al., 2013; Hussain, Basu, & Liu, 2014; Roy, Banerjee et al., 2014) to produce an average synaptic output current proportional to the input spike rate. This average current for the ith synapse on the jth branch is denoted as x_ij in equations 2.1 and 2.2. Hussain et al. (2014) constructed a two-class classifier by comparing the outputs of two NNLDs (denoted by P and N) using a winner-take-all (WTA) based on an inhibitory interneuron (IIN; Oster, Douglas, & Liu, 2009). This structure is similar to the canonical microcircuit in Douglas, Martin, and Whitteridge (1989) but without the recurrent excitatory connections. This simple classifier is shown in Figure 1b, where g denotes the operation of the WTA. a_P and a_N denote the total currents from the dendritic branches of the P- and N-cells, respectively. The operation of the WTA is as follows. The IIN produces a current proportional to the firing rates of the two excitatory neurons. This current provides inhibition to both input currents and causes spike frequency adaptation in the excitatory neurons. We can express the current as
I_{IIN} = w_{IIN}\left(f(a_P) + f(a_N)\right)
2.3
where w_IIN denotes the effective gain from the input spikes to the output current of the IIN. Suppose a_P > a_N. In that case, it can be shown that the weaker (N) neuron stops firing provided w_IIN is large enough, a mild condition that is almost always satisfied in practice. Thus, due to the inhibitory current, the weaker of the two inputs gets cancelled, and the stronger persists at a diminished value. This helps in reducing the firing rates of the excitatory neurons. We will refer to this architecture as a single-ended classifier. Since logically the WTA is selecting the larger of its inputs as the winner, we can express the output of the single-ended classifier as
y = \mathrm{WTA}\left(f(a_P),\, f(a_N)\right)
2.4
y = g\left(f(a_P) - f(a_N)\right) = g(a_P - a_N)
2.5
where the function g is a Heaviside function of the difference of its inputs. The last step shows that we can identify the winner even by just noting a_P and a_N or, effectively, f(a_P) and f(a_N). This is the approach we take in our prototype IC, and we do not include the IIN of the WTA. This serves as a proof of concept for classification accuracy, with the understanding that the high firing rates in measurements will be reduced in future versions with the inclusion of an on-chip WTA.

2.2  Learning Algorithm: Network Rewiring Rule

The classifier shown in Figure 1b was trained by a structural plasticity-based network rewiring learning rule. For each input pattern, a binary target signal is provided to the classifier. The training was based on mean rate binary inputs, and testing was performed by mapping each input dimension to a Poisson spike train with a high or low mean firing rate (f_high or f_low). The learning algorithm primarily consists of the following steps:

  1. The inputs are connected to the dendritic branches of the NNLDs via binary synapses (w_ij = 0 or 1), so the network learns through connection changes instead of weight changes. Since the weights are binary, learning involves choosing the best set of connections for each dendrite.

  2. At each epoch of learning, a randomly selected set of synapses is chosen for possible replacement. A performance index c_ij is computed for each synapse in the set, with opposite signs for the positive and negative cells, where the averaging is taken over the entire training set. The synapse having the least value of c_ij in the set is tagged for replacement.

  3. For replacement, a candidate set R is formed by randomly selecting a subset of the input lines. The synapses in R are placed on the dendrite containing the lowest-fitness synapse from the previous step. The performance index is again computed for the synapses in R, and the synapse having the highest value is used to replace the tagged one.

For more details about the algorithm, refer to Hussain, Liu, and Basu (2015) and Roy, Banerjee et al. (2014); detailed descriptions of the architectures and learning rules can be found in Roy et al. (2013) and Hussain et al. (2014).
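
A simplified Python sketch of this rewiring loop is given below. It is illustrative only: the fitness measure used here (correlation of a synapse's input with the classification error) is a stand-in for the exact performance index of Hussain et al. (2015), and names such as n_swap and n_cand are assumptions, not quantities from the original algorithm.

    import numpy as np

    def train_rewiring(X, o, conn_P, conn_N, x_thr=1.0, n_epochs=200,
                       n_swap=5, n_cand=10, rng=None):
        """Structural-plasticity sketch: X is (n_patterns, d) binary inputs,
        o is (n_patterns,) binary targets, conn_P/conn_N are (m, k) connection
        matrices holding input-line indices (binary weights)."""
        rng = rng or np.random.default_rng(0)
        o = np.asarray(o)

        def activation(x, conn):
            return np.sum(x[conn].sum(axis=1) ** 2) / x_thr

        for _ in range(n_epochs):
            # classification of every training pattern with the current wiring
            y = np.array([1 if activation(x, conn_P) > activation(x, conn_N) else 0
                          for x in X])
            err = o - y                      # +1: missed class-1, -1: missed class-0
            for conn, sign in ((conn_P, +1), (conn_N, -1)):
                m, k = conn.shape
                # randomly tag a set of existing synapses for possible replacement
                idx = [(rng.integers(m), rng.integers(k)) for _ in range(n_swap)]
                fitness = [sign * np.mean(X[:, conn[j, i]] * err) for j, i in idx]
                j_bad, i_bad = idx[int(np.argmin(fitness))]
                # candidate replacements drawn at random from all input lines
                cand = rng.integers(0, X.shape[1], size=n_cand)
                cand_fit = [sign * np.mean(X[:, c] * err) for c in cand]
                conn[j_bad, i_bad] = cand[int(np.argmax(cand_fit))]
        return conn_P, conn_N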

2.3  Modifications for Hardware Implementation

Three modifications of the single-ended classifier and learning algorithm are used to improve performance of the hardware.

2.3.1  Differential Architecture

We found that the hardware implementation of the single-ended architecture of Figure 1b suffers from a high baseline or common-mode current at the input of the spiking neurons, leading to frequency saturation whenever the number of dendrites or the firing rates of the input afferents grow large. In other words, because a large number of dendrites produce positive currents, even arbitrary connections before learning will generate a large amount of current that can saturate the neuron. Hence, we modified the architecture to a differential one, reported in Hussain et al. (2015) and shown in Figure 2a, with the following equations:
y_{+} = f(a_P - a_N), \qquad y_{-} = f(a_N - a_P)
2.6
y = g(y_{+} - y_{-}) = g(a_P - a_N)
2.7
Although the final expression in equation 2.7 is the same as the one in equation 2.5, the modification in the intermediate step makes the performance more robust to architectural changes like changing m (the number of dendrites per cell), since the system performance is now less dependent on changes in average currents (average quantities get cancelled when evaluating (a_P - a_N) or (a_N - a_P)). It should again be noted that the final winner can be inferred without having the IIN on chip, albeit at the cost of producing more spikes. However, this was the preferred option for simplicity in the test chip.
Figure 2:

Three modifications to the basic NNLD-based classifier for improved performance in hardware. (a) Modification of the single-ended architecture to a differential one for increased robustness to architectural changes and less dependence on average currents. (b) Subtraction of a constant dendritic leak from the total synaptic input current to reduce the input current to the dendritic nonlinearity. (c) Graphical representation of the margin function g_delta used during training to achieve better generalization.

2.3.2  Dendritic Leak

The second modification to the basic architecture is shown in Figure 2b: a constant dendritic leak current is subtracted from the total synaptic input current on the branch to form the input to the dendritic nonlinearity. In other words, we can modify the equation of the dendritic nonlinearity to become a new function b_L given by
b_L(z) = \frac{\left[(z - I_{leak})\, g(z - I_{leak})\right]^{2}}{x_{thr}}
2.8
where I_leak is the dendritic leak current and g denotes the Heaviside function. As Hussain et al. (2013) showed, large values of k (synapses per dendrite) lead to saturation of each dendritic nonlinearity, reducing performance. In the hardware implementation, even without saturation in each branch, large values of k lead to a high average current per branch, increasing power dissipation after squaring tremendously. Also, since we employed subthreshold translinear design principles to implement the square nonlinearity, large input currents cause the transistors to enter the above-threshold regime, causing deviations from ideal behavior. Hence, adding a dendritic leak helps remove the common-mode current in every branch and increases the range of usable k values.
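
A minimal numerical illustration of the leaky nonlinearity b_L in equation 2.8 (plain Python, using the same x_thr and I_leak notation as above; the numbers are arbitrary units, not chip currents):

    def b(z, x_thr=1.0):
        """Lumped square-law dendritic nonlinearity (no leak)."""
        return z ** 2 / x_thr

    def b_leaky(z, i_leak, x_thr=1.0):
        """Equation 2.8: subtract a constant leak, rectify, then square."""
        z_eff = max(z - i_leak, 0.0)
        return z_eff ** 2 / x_thr

    # With a common-mode input of 5 units on every branch, a leak of 4 units
    # removes most of the baseline before squaring, keeping currents small.
    print(b(5.0), b_leaky(5.0, 4.0))   # 25.0 vs 1.0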

2.3.3  Margin-Based Learning

The last modification, shown in Figure 2c, is a change in the learning procedure and does not modify the architecture. As Hussain et al. (2014) showed, using a margin function g_delta for the WTA, as shown in Figure 2c, instead of the Heaviside g during training allows building a margin of delta around the classification boundary. In other words, the output during training is given by
y = g_{\delta}(a_P - a_N), \qquad g_{\delta}(z) = \begin{cases} 1 & z \geq \delta \\ \tfrac{1}{2} + \tfrac{z}{2\delta} & -\delta < z < \delta \\ 0 & z \leq -\delta \end{cases}
2.9
During testing, the normal Heaviside g is used as before. This helps increase generalization and noise robustness because the learning algorithm does not stop modifying connections for a pattern immediately when it is correctly classified. As an example, consider a pattern whose target is class 1 but which is initially misclassified because a_P < a_N. In such a case, the learning algorithm tries to modify connections to increase a_P and decrease a_N for this pattern. If g is used, the error for this pattern becomes zero immediately when a_P exceeds a_N, since g is a Heaviside function. Hence, it is quite possible for inputs slightly different from those encountered during training to be wrongly classified due to slight changes in a_P or a_N. However, using the g_delta function shown in Figure 2c, the error during training is nonzero until a_P - a_N exceeds delta. Intuitively, as long as the effect of noise does not alter the currents a_P or a_N by magnitudes approximately larger than delta, noisy inputs during testing will not lead to misclassifications. The detailed algorithm to choose and adapt delta is given in Hussain et al. (2015). We show the effect of adding margin during training in our measurement results in section 5. Table 1 summarizes the notations introduced in this section that will be used throughout the remainder of the letter.
Table 1:
Notations Used in This Letter.
x: Input vector 
d: Dimensionality of the input 
m: Number of dendrites per cell 
k: Number of synapses per dendrite 
b(.): Nonlinear dendritic function without leak 
b_L(.): Nonlinear dendritic function with leak 
x_ij: Input value at the ith synapse of the jth dendrite 
x_thr: Scaling constant of the nonlinear dendritic function 
g(.): Input-output function of WTA without margin 
g_delta(.): Input-output function of WTA with margin 
a_P: Sum of the currents from dendritic branches of P cell 
a_N: Sum of the currents from dendritic branches of N cell 
delta: Margin of classification 
f_high: Mean rate of Poisson spike train mapped to binary input 1 
f_low: Mean rate of Poisson spike train mapped to binary input 0 

3  VLSI Implementation of Neuromorphic IC

The VLSI architecture of the implemented neuromorphic integrated circuit (IC) is shown in Figures 3a to 3d, where AER is used to provide the synaptic input. A differential pair integrator (DPI) circuit has been used to implement the synaptic function of the neuron; with proper biasing, this circuit achieves the linear filtering property of each synapse (Bartolozzi & Indiveri, 2007). This linear filtering property is exploited here to replace all the synapses on one dendritic branch with a single shared synapse, drastically reducing the effective layout area of the IC.

Figure 3:

(a) VLSI architecture of the implemented neuromorphic IC composed of dendritic branches with shared DPI synapses in P- and N-cells (from Banerjee et al., 2015). Each dendrite has a squaring nonlinearity, and the difference of the currents from the P and N dendrites enters an integrate-and-fire (I&F) neuron. Detailed circuit diagrams from Banerjee et al. (2015) of subblocks: (b) DPI synapse, (c) squaring block employing the translinear principle, and (d) I&F neuron. (e) Generation of the difference currents (a_P - a_N) and (a_N - a_P) by swapping the excitatory and inhibitory connections for the differential architecture.

This IC is interfaced with a field programmable gate array (FPGA) controller that generates input spikes and addresses, as shown in Figure 4. The learned sparse connectivity matrix is stored inside the FPGA memory in a very compressed form by using two look-up tables. This constitutes the crossbar or routing array shown in Figure 1a. Based on the input line address, the controller reads the connectivity information of that address line from memory and generates a decoder address to route the spike to the proper dendrite. Spikes from the FPGA output reach the synapse circuit input through an address decoder followed by digital switches. There are, in all, eight P-dendrites and eight N-dendrites connected to a NEURON block. The shared synapse circuit in each dendrite is followed by a square-law nonlinear circuit. For the chip that we have fabricated and present in section 5, both P- and N-cells have eight dendrites. The difference between the total currents from the P-dendrites and the N-dendrites appears at the input of the neuron block and is converted to equivalent output spikes.
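
The compressed connectivity storage can be pictured as two look-up tables, as in the sketch below. This is an assumed layout for illustration only (the actual FPGA memory organization is described in the appendix): one table gives, for every input line, an offset and fan-out count into a second table that lists the target dendrite addresses.

    from collections import defaultdict

    def build_luts(conn_matrix):
        """conn_matrix[j] is the list of input lines wired to dendrite j.
        Returns (offset_count, targets): for input line a, its fan-out is
        targets[offset_count[a][0] : offset_count[a][0] + offset_count[a][1]]."""
        fanout = defaultdict(list)
        for dendrite, inputs in enumerate(conn_matrix):
            for line in inputs:
                fanout[line].append(dendrite)
        offset_count, targets = {}, []
        for line in sorted(fanout):
            offset_count[line] = (len(targets), len(fanout[line]))
            targets.extend(fanout[line])
        return offset_count, targets

    def route_spike(line, offset_count, targets):
        """Return the dendrite (decoder) addresses to pulse for one input spike."""
        if line not in offset_count:
            return []
        off, cnt = offset_count[line]
        return targets[off:off + cnt]

    # 16 dendrites (8 P + 8 N); input line 3 fans out to dendrites 0, 3, and 9.
    conn = [[3, 17], [5], [11], [3], [2], [8], [40], [7],
            [1], [3], [6], [9], [21], [0], [14], [30]]
    luts = build_luts(conn)
    print(route_spike(3, *luts))   # -> [0, 3, 9]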

Figure 4:

FPGA-IC interface where the FPGA supplies input spikes to individual dendrites of the IC and captures output spikes from the neuron (from Banerjee et al., 2015).

The output of the neuron block is digitized for proper handshaking of the IC with the FPGA in the form of a request (REQ) and an acknowledge (ACK) signal. We have implemented only one cell to compute one term, f(a_P - a_N), in equation 2.6. (Unless otherwise mentioned, the inhibitory interneuron is not to be taken into account.) The second term, f(a_N - a_P), is computed by passing the same input again through the IC, but with the connection matrix of the P-cell interchanged with that of the N-cell. In this case, the output of the IC is proportional to f(a_N - a_P). This is shown in Figure 3e. So, the two units of the differential architecture in Figure 2a are implemented by passing the same input spike train into the same circuit but with the swapped connectivity. The final decision of the pattern class is taken on the PC after computing the difference of these two results and applying the Heaviside function to it. Thus, although the chip does not itself perform learning, most of the processing from input to output is done on chip.

The neuromorphic IC has been fabricated in AMS CMOS technology, and each classifier comprising two neurons occupies a compact layout area. In this section, we briefly describe the circuit implementation of these blocks and show simulation results to describe the functionality of each block.

3.1  Synapse Circuit

The circuit schematic shown in Figure 3b is a differential pair integrator (DPI) circuit that converts presynaptic voltage pulses into postsynaptic currents I_syn. In this circuit, the transistors operate in the subthreshold regime. As Bartolozzi and Indiveri (2007) showed, the bias voltages can be chosen such that this nonlinear circuit simplifies to a canonical first-order low-pass filter. For an input spike (arriving at time t_i and ending at t_f) applied to the DPI synapse, the rise time of I_syn is very small. The discharge profile can be modeled by the following equation,
I_{syn}(t) = I_{syn}(t_f)\, e^{-(t - t_f)/\tau}
3.1
where tau = C_syn U_T/(kappa I_tau), kappa is the subthreshold slope factor, and U_T is the thermal voltage. This synapse circuit, when excited with a continuous pulse train of average frequency f_in and pulse duration t_pw, will have a steady-state output current I_syn_avg (the average of the output transient current) that holds a linear relationship with f_in. The relationship has been derived in Bartolozzi and Indiveri (2007) as
\bar{I}_{syn} = K_{syn}\, f_{in}
3.2
where K_syn is a constant set by I_w, I_th, I_tau, and the pulse duration t_pw. Simulation results for this block, shown in Figure 5a, demonstrate the expected linearity. Also, for larger values of the weight governed by the bias V_w, the slope of this curve increases as expected. The maximum output current of this circuit is limited by the constraint of keeping the transistor driven by V_w, which creates the current I_w (see Figure 3a), in subthreshold. Denoting the threshold current of this MOS by I_0, the constraint can be written as
\bar{I}_{syn} < I_{0}
3.3
Figure 5:

SPICE simulation results of (a) a synapse block showing the linearity of the average current with input frequency (V_w = 0.48 V, 0.50 V, 0.52 V), (b) a square nonlinearity block (V_sq = 2.87 V, 2.90 V, 2.93 V), and (c) a neuron block showing the output frequency versus input current relationship (V_lk = 0.37 V, 0.40 V, 0.43 V).

3.2  Square Block Circuit

We have designed the current mode squaring circuit given in Figure 3c as described in Hussain et al. (2013). Four of its transistors form a translinear loop. To implement the dendritic leak, an additional transistor is added that draws a constant current I_leak set by the bias V_leak. This current gets subtracted from the input current coming from the synapse. Hence, the output current of the square block can be expressed as
I_{out} = \frac{\left[(\bar{I}_{syn} - I_{leak})\, g(\bar{I}_{syn} - I_{leak})\right]^{2}}{I_{sq}}
3.4
Another transistor is biased to pass a maximum current (set by its gate bias) that can implement a saturating nonlinearity if needed. I_sq is the DC current through the normalizing transistor of the translinear loop, set by its gate voltage V_sq. Figure 5b shows simulation results of the square nonlinearity block when I_leak = 0. It can also be seen that increasing V_sq reduces I_sq in equation 3.4 and increases the output current. The dynamic range of this circuit is limited by the requirement of keeping all the transistors in the translinear loop in subthreshold. Assuming I_leak = 0 and that all transistors in the translinear loop have the same dimensions with threshold current I_0, the constraint on the average synaptic current entering this block can be written as
\bar{I}_{syn} < \sqrt{I_{0}\, I_{sq}}
3.5
It can be seen that the constraint set by equation 3.5 is the same as the one set by equation 3.3 if I_sq can be set to its maximum value of I_0.

3.3  Neuron Circuit

The circuit diagram of the implemented spiking neuron, as described in Indiveri (2003), is depicted in Figure 3d. It comprises an integrating capacitor C_mem, an inverter with a positive feedback circuit, transistors to control the refractory period of the neuron, a leakage transistor for setting the minimum input current needed to start charging C_mem, transistors to generate the request (REQ) signal to the FPGA, and a reset transistor that is turned on by an acknowledge (ACK) from the FPGA. The output current of the square block (I_in) is integrated on C_mem to generate the dynamics of the membrane voltage V_mem. As V_mem increases and approaches the switching voltage of the inverter, the feedback current starts flowing, causing a sudden, rapid increase of the V_mem profile, and a REQ signal is generated indicating spike emission. The positive feedback reduces the short-circuit current in the inverter, helping to reduce power dissipation. When an ACK signal is received, V_mem is quickly discharged back to ground through the reset transistor, whose on-resistance is controlled by a bias voltage. When V_mem is fully discharged, the inverter output switches back, turning on the refractory transistor, which slowly discharges the refractory node. As long as this node voltage is sufficiently high, the reset path remains active and V_mem is clamped to ground. This is called the refractory period of the neuron; no pulse is generated by the neuron during this period, and the entire current at the output of the square block passes through the reset path. The input-output characteristic (for a negligible refractory period) can be represented as
f_{out} = s_{n}\,(I_{in} - I_{l})\, g(I_{in} - I_{l})
3.6
where f_out denotes the spiking frequency of the neuron and I_l is the leakage current through the leak transistor set by its bias V_lk; hence, I_in - I_l is the total current charging the neuron membrane capacitance. Finally, s_n denotes the slope of the f-I curve of the neuron. Figure 5c displays the above linear relationship between f_out and I_in for a negligible refractory period. The effect of increasing the leak current, which causes a rightward shift of the curve, is also shown in Figure 5c.
Finally, combining equations 3.2, 3.4, and 3.6, we get the nonlinear relationship of the IC given as
f_{out} = s_{n}\left(\frac{\left[(K_{syn} f_{in} - I_{leak})\, g(K_{syn} f_{in} - I_{leak})\right]^{2}}{I_{sq}} - I_{l}\right)
3.7

4  FPGA Controller

To validate the operation of the neuromorphic IC, an FPGA-based controller logic (using an Opal Kelly XEM 3010) has been implemented for generating spiking events to emulate the real-time behavior of a spike-based sensor. The address event representation (AER) protocol (Boahen, 2000), which is commonly used for other asynchronous neuromorphic hardware (Chan et al., 2007; Brink et al., 2013), is used for communication between the IC and the FPGA controller. The FPGA controller shown in Figure 6 further performs the task of using these input pulse addresses to determine a corresponding decoder or dendritic address that needs to be pulsed. Details of the controller are presented in the appendix.

Figure 6:

Internal architecture of the controller blocks implemented on FPGA for interfacing with the neuromorphic IC. Module A stores the spike times and addresses of input, as well as the connectivity matrix. Module C creates input spikes for the IC, and modules D and E receive events from the IC.

5  Measurement Results

We have designed the dendritic classifier IC in the AMS CMOS process and evaluated its performance with the test setup described in section 4. The microphotograph of the fabricated chip is shown in Figure 7. The DPI synapse, the square block, and the neuron together form a basic unit of the fabricated chip, governed by equation 3.7. For this chip, each of the P- and N-cells has eight dendrites. We first show some characterization results of the chip, followed by pattern classification under different conditions.

Figure 7:

Microphotograph of the neuromorphic IC.

5.1  Characterization

The input to the chip can be given only in terms of spikes or pulses, and only the output neuronal spike frequency can be measured. Hence, we characterized the functionality of the chip in terms of its input-frequency-to-output-frequency (f_in-f_out) curves. In other words, each of the subblocks (synapse, square nonlinearity, and neuron) cannot be separately characterized, but their relative contributions to the f_in-f_out curve can be ascertained by changing the biases associated with them. This method was used to first ensure the full functionality of each block. Note that periodic spike trains were used for characterization. Next, it was found that there is some leakage-current-induced spurious firing of the neuron even without input pulses. The neuronal leak was adjusted (by tuning the corresponding bias voltage) until this spurious firing disappeared (so that I_l is approximately zero) and the f_in-f_out curve started from the origin for zero dendritic leak. In that case, we can rewrite equation 3.7 as
f_{out} = s_{n}\,\frac{\left[(K_{syn} f_{in} - I_{leak})\, g(K_{syn} f_{in} - I_{leak})\right]^{2}}{I_{sq}}
5.1
f_{out} = A\left[(f_{in} - f_{L})\, g(f_{in} - f_{L})\right]^{2}
5.2
where A = s_n K_syn^2/I_sq, f_L = I_leak/K_syn, and g is the Heaviside function. This implies that for zero neuronal leak, the neuron block does not fire at its output unless the steady-state current from the synapse block due to the spiking input exceeds the dendritic leak current. Further, assuming subthreshold operation of the leak transistor, the leak current, and hence f_L, scales exponentially with the gate bias: f_L = f_L0 exp(kappa DeltaV_leak/U_T), where f_L0 is the value of f_L when the bias voltage V_leak equals a reference value V_leak0, and DeltaV_leak is defined as V_leak - V_leak0. So equation 5.2 can be rewritten as
f_{out} = A\left[\left(f_{in} - f_{L0}\, e^{\kappa \Delta V_{leak}/U_T}\right) g\!\left(f_{in} - f_{L0}\, e^{\kappa \Delta V_{leak}/U_T}\right)\right]^{2}
5.3
where f_L0 is the value of f_L when DeltaV_leak = 0. Figure 8a plots the measured f_in-f_out curve for one of the dendritic branches along with a fit using equation 5.3. Note that both axes are on logarithmic scales. The curve departs from the expected square law at low values of f_in, where the effect of residual leakage becomes prominent. It also deviates at high frequencies (larger than about 1.2 kHz) due to transistors in the synapse and the square blocks entering the above-threshold regime. For the same branch, the value of V_leak was then varied, and f_L values were obtained for each case. These values are plotted in Figure 8b and fitted to the theoretically expected exponential, as shown. This model was next used to set the desired bias conditions during the later classification experiments. Of course, there is some mismatch between the branches, which can be characterized in terms of the mismatch between A and f_L0 for each branch.
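
The extraction of the branch parameters can be scripted as below. This is a sketch assuming NumPy/SciPy are available; the data arrays are placeholders, not measured chip values, and the fitted model is equation 5.2.

    import numpy as np
    from scipy.optimize import curve_fit

    def branch_model(f_in, A, f_L):
        """Equation 5.2: f_out = A * max(f_in - f_L, 0)^2."""
        return A * np.maximum(f_in - f_L, 0.0) ** 2

    # Placeholder measurement: input rates (Hz) and observed output rates (Hz).
    f_in = np.array([50, 100, 200, 300, 400, 600, 800], dtype=float)
    f_out = np.array([2, 20, 120, 310, 590, 1400, 2500], dtype=float)

    (A_fit, fL_fit), _ = curve_fit(branch_model, f_in, f_out, p0=[1e-3, 30.0])
    print(A_fit, fL_fit)

    # Repeating the fit for each V_leak setting yields f_L(V_leak), which can
    # then be fitted to the exponential form of equation 5.3.
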
Figure 8:

(a) Measured f_in-f_out characteristics of one dendritic branch with the corresponding theoretical fit. Note the logarithmic scale on both axes. (b) Extraction of the parameter f_L for one dendritic branch. Note the logarithmic scale on the vertical axis.

This is shown in Figure 9. It should be noted that the two parameters A and f_L0 together represent the total mismatch from each branch, and the effect of this mismatch cannot be eliminated by multiplexing the same neuromorphic circuits, as we have done, to get the classifier output. There is a significant mismatch across the branches, with CV of 23.5% and 14.4% in A and f_L0, respectively. We will show later how the margin enhancement algorithm helps achieve classification accuracies close to those of software. In the future, these separate values for each branch can also be used in the learning algorithm to calibrate for the mismatch.

Figure 9:

Variation of A (values averaged over four trials) and f_L0 for the eight dendrites of the P-cell. The colors indicate the vertical axes corresponding to the plots.

Next, we characterized the power dissipation of the chip for different supply voltage (V_DD) values. The chip is functional for power supply voltages as low as 1.8 V due to the current mode design. The static current is approximately 10.5 nA, and the dynamic current depends on f_out. The dynamic power is normalized to the spike frequency for different values of V_DD to get the energy per spike (E_spike), which is plotted in Figure 10. The lowest value of approximately 125 pJ is attained at V_DD = 1.8 V. This characterization can be used to estimate the energy per classification operation. We expect this value to decrease quadratically when V_DD is reduced; hence, moving to a smaller process node like 65 nm and reducing V_DD to 0.45 V should reduce E_spike to approximately 8 pJ.
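
As a worked check of the quadratic scaling (dynamic energy proportional to C V_DD^2), assuming the measured 125 pJ/spike at 1.8 V and ignoring any capacitance change from the process move:

E_{spike}(0.45\,\mathrm{V}) \approx 125\,\mathrm{pJ} \times \left(\frac{0.45}{1.8}\right)^{2} = \frac{125\,\mathrm{pJ}}{16} \approx 7.8\,\mathrm{pJ},

which is consistent with the approximately 8 pJ estimate quoted above.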

Figure 10:

Energy dissipated per spike by the IC as a function of the power supply V_DD. Note the logarithmic scale on the energy axis.

5.2  Pattern Classification: Random Binary Pattern Set

5.2.1  Input Generation

To evaluate the performance of our chip, we first choose the two-class classification problem of high-dimensional random binary patterns reported in Hussain et al. (2014), Poirazi and Mel (2001), and Hussain, Liu, and Basu (2015). A 40-dimensional random Gaussian datum is mapped to a neurally plausible, sparse 400-dimensional binary vector using 10 nonoverlapping receptive fields per dimension (Hussain et al., 2014). The widths of the receptive fields at each location were chosen such that they have equal probability of being active. Since 10 receptive fields span each dimension, each of them had a probability of 0.1 of being active in any given pattern. This sparse, high-dimensional vector is next converted to a spiking input by mapping the two binary values to Poisson spike trains with mean firing rates of f_high and f_low, respectively. This type of random binary spike pattern has also been used to test the classification performance of other hardware neuromorphic circuits (Mitra, Fusi, & Indiveri, 2009). A 500 ms sequence of pulse trains along with the input line addresses is generated for each of the random patterns, which are split equally and arbitrarily into the two classes. The network is first trained on the binary vectors (Hussain et al., 2015) to find the desired connection matrix on a PC, which is then downloaded to the FPGA for hardware evaluation. Next, the spike patterns are sent as input to the chip, and classification results are obtained for various configurations. A software implementation of the NNLD classifier is also used to evaluate performance on spike train inputs; we use that as a baseline to compare the performance of the hardware. In the following text, we use the term simulation to refer to this software implementation of the NNLD classifier.
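
The input generation pipeline can be summarized by the sketch below (illustrative Python only; the receptive-field edges are taken as standard-normal deciles so that each field is active with probability 0.1, and the rates f_high and f_low are placeholder values).

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    DECILE_EDGES = norm.ppf(np.linspace(0.1, 0.9, 9))   # 10 equiprobable bins

    def to_sparse_binary(sample_40d):
        """Map a 40-d Gaussian sample to a 400-d binary vector (one field per dim)."""
        out = np.zeros((40, 10))
        bins = np.digitize(sample_40d, DECILE_EDGES)     # which receptive field fires
        out[np.arange(40), bins] = 1
        return out.ravel()

    def to_poisson_spikes(binary_vec, f_high=100.0, f_low=5.0, T=0.5):
        """Return a list of spike-time arrays (seconds), one per input line."""
        rates = np.where(binary_vec > 0, f_high, f_low)
        counts = rng.poisson(rates * T)
        return [np.sort(rng.uniform(0.0, T, size=c)) for c in counts]

    pattern = to_sparse_binary(rng.standard_normal(40))
    spike_trains = to_poisson_spikes(pattern)
    print(int(pattern.sum()), len(spike_trains))          # 40 active lines, 400 trains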

5.2.2  Classification Performance

In the first experiment, the values of k and f_high were varied while keeping f_low fixed, margin delta = 0, and dendritic leak I_leak = 0. The results for this are plotted in Figure 11 in terms of
\text{classification error} = \frac{\text{number of misclassified patterns}}{\text{total number of patterns}}
5.4
expressed as a percentage. In general, we expect higher values of f_high to be better, since this increases the difference in current between an ON and an OFF synapse. This should make the system more robust to noise and mismatch. This trend was indeed found to be true for k = 3, 4, 5, and 7 and correlates well with software simulations of the NNLD classifier, as shown in Figure 11. However, for larger values of f_high, the increased current in the square law block and DPI synapse pushes the transistors above threshold, making them deviate from the desired translinear behavior. The second trend is observable by looking at the change in error with the number of synapses per branch, k (note that the total number of synapses scales with k). From simulations, Hussain et al. (2013) observed that increasing k initially reduces error due to more synaptic resources. But for very large k, testing error would increase since the large input current saturates the dendritic nonlinearity. The same trend is observed in our measurements, where the error first reduces with k before increasing again at larger k. However, the increase in error at larger k occurs for a different reason here: the non-square-law behavior of the dendritic branches at large currents. For larger values of f_high, the increase in error starts occurring at lower values of k (a larger f_high results in more current per synapse, and hence fewer synapses are needed to push the square law block and synapse above threshold).
Figure 11:

Measured classification error for different values of k and f_high, taken over five trials. For each value of f_high, increasing k initially reduces the error due to the higher computational power of more synapses. But eventually the error increases because the square nonlinearity block enters the above-threshold regime. Values of k between 7 and 10 are found to be optimal.

We mention that the number of unclassified patterns (those causing equal neuronal firing for both current difference inputs) is too insignificant to be reported separately in Matlab simulations and is almost always zero in measurements.

To reduce this error, we next introduced dendritic leak currents, which have been shown to be useful in reducing the average current to the dendrite and increasing its effective dynamic range (Hussain et al., 2015). The value of I_leak was set approximately equal to the expected common-mode synaptic current per branch (proportional to p times k, where p denotes the probability of a randomly selected input dimension being high for a given pattern; it depends on the number of receptive fields used to generate the high-dimensional mapping) to cancel most of the common-mode current. The desired value of the bias V_leak for this leak is obtained from the curve fit shown in Figure 8b. It can be seen from Figure 12a that the classification error indeed reduces for k = 15 and 20 by adding dendritic leak. However, the reduction is not as large as expected from software simulations due to the mismatch between the transistors that create the leak currents. In the future, we can use a current-splitter-based configurable leak current in each dendrite to remove the mismatch. Since this is a static setting (unlike if a splitter were used in a synapse), the dynamic performance of the splitter for small currents is not a bottleneck. Another option is to use the characterization results of this mismatch in the training process to find a new connection matrix. Finally, as we show next, we can partially reduce the effect of this mismatch by using margin-based training.

Figure 12:

(a) The effect of increasing dendritic leak is to reduce the measured classification error by reducing the current entering the square block and preventing it from going into the above-threshold regime. The data are averaged over five trials. (b) Increasing the margin delta improves the robustness of the system to added random spikes, observed over five trials. Increasing delta initially reduces the error; increasing delta beyond that point increases the error due to difficulty in training.

The effect of increasing the margin during training was tested when noisy background spikes are added by choosing a nonzero f_low. The network is retrained for large margins using the adaptive algorithm in Hussain et al. (2015), and the new connection matrix is used for testing. The errors for the cases of delta = 0, 2.5, 5, 7.5, and 10 are compared for an increasing number of random spikes in Figure 12b. It can be seen that for all noise levels (obtained by increasing f_low), the increased margin setting of delta = 7.5 is the optimal setting in our case, achieving less error than the nominal case of delta = 0 and proving that the added margin indeed helps in improving robustness. However, it can also be observed that the classification performance degraded slightly when the margin was increased from 7.5 to 10 in both the software simulation of the NNLD and the chip measurements. This is attributed to the fact that increasing the margin beyond a point makes it difficult for the training process to converge. We can also use the measured error (for delta = 7.5 and no random spikes) to benchmark the performance of the chip against other classifiers.

A software implementation of the NNLD achieves a comparable error for the same parameter setting. To compare with other non-spike-based classifiers in software, we modified the binary inputs by adding noise to approximate the situation of noisy spike trains. The variance of the noise was set so that the performance of the NNLD on these binary inputs matches its software performance on spike train inputs. With these noisy binary patterns, we also evaluated a perceptron classifier and an extreme learning machine (ELM; Huang, Zhu, & Siew, 2006) with 2000 hidden neurons. Note that the NNLD uses only 160 binary synapses, while the perceptron uses 400 high-resolution weights. The ELM uses 2000 high-resolution weights and close to 1 million random weights. This shows the benefit of our approach over networks with weighted synapses.

Finally, note that the value of delta used here follows the notation of Hussain et al. (2015) introduced in section 2, where the weight of a synapse is normalized to 1. To translate this value to a margin in terms of input current to the IF neuron, we can apply the following transformation,
\delta_{I} = \delta\, \frac{x_{thr}\,(K_{syn}\, f_{high})^{2}}{I_{sq}}
5.5
where x_thr is the scaling constant of the software model used for training. This can also be converted to an equivalent margin in terms of frequency at the output of the neuron by multiplying by s_n for the IC. This points to another way to observe the effect of increased margin: plot the distribution of output spike count differences, which is proportional to (a_P - a_N) in equation 2.6, without taking the IIN steady-state current into account. Applying the Heaviside function to this quantity gives the classification output. This is shown in Figures 13a and 13b for two values of delta, 0 and 7.5, respectively. The x-axis shows the difference in the number of spike counts of the two cells, which is proportional to (a_P - a_N). It can be seen that increasing the value of delta also increases the separation between the distributions for class 0 (blue solid bars) and class 1 (bold white bars). This is the reason for better classification performance with increased delta.
Figure 13:

Measured distribution of the difference of output spike counts, showing that increasing the margin from (a) 0 to (b) 7.5 increases the separation between the spike count distributions of the two classes.

An interesting option is to classify the patterns based on only one of the differential outputs in equation 2.6, again without taking the IIN steady-state current into account. In that case, a threshold theta has to be set on the output spike count, and we declare a pattern as belonging to class 1 if the count exceeds theta. Thus, effectively the classification equation becomes
y = g\left(f(a_P - a_N) - \theta\right)
5.6
This would still have the advantage of rejecting common-mode currents but would not need two output neurons. However, the value of theta will need to be changed if system parameters like k and f_high change. We will refer to this type of architecture as single-output differential (the original differential architecture in Figure 2a is henceforth referred to as double output). We plot the distribution of output spikes for this type of architecture in Figure 14a, with delta = 7.5 and the other parameters kept the same as in Figure 13b for comparison. It can be seen from the distribution that a large number of patterns belonging to class 0 generate zero output spikes. Next, we vary the threshold theta for both the single-output and double-output differential cases and plot the receiver operating characteristic (ROC), a commonly used metric to compare classifiers (Mitra et al., 2009). We plot the ROC for two different values of delta, 0 and 7.5, in Figures 14b and 14c, respectively. It can be seen that the ROC for the double-output case is slightly higher than that of the single-output case in Figure 14b, indicating better performance for the former. To quantify this difference, we use the area under the curve (AUC) as the metric and find that the double-output differential architecture has an AUC of 0.95 compared to 0.928 for the single-output differential one. However, for the higher-margin case in Figure 14c, the similar AUC values of 0.978 and 0.969 for double and single outputs, respectively, show that margin enhancement makes both cases similar. This indicates that if a small performance penalty is acceptable and the threshold can be manually set, single-output differential architectures should be used, since they eliminate the need for one IF neuron. This is one additional benefit of margin-enhanced learning.
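
The single-output thresholding and the ROC comparison can be reproduced offline from recorded spike counts, as in the sketch below (pure NumPy; the count arrays are synthetic stand-ins, not chip measurements).

    import numpy as np

    def roc_auc(counts, labels):
        """Sweep a spike-count threshold and return (fpr, tpr, auc)."""
        thresholds = np.unique(counts)[::-1]
        tpr = [(counts[labels == 1] > th).mean() for th in thresholds]
        fpr = [(counts[labels == 0] > th).mean() for th in thresholds]
        fpr = np.concatenate(([0.0], fpr, [1.0]))
        tpr = np.concatenate(([0.0], tpr, [1.0]))
        return fpr, tpr, np.trapz(tpr, fpr)

    rng = np.random.default_rng(1)
    labels = np.repeat([0, 1], 100)
    # Synthetic single-output counts: class-1 patterns fire more, class-0 mostly less.
    counts = np.concatenate([rng.poisson(2, 100), rng.poisson(15, 100)])
    fpr, tpr, auc = roc_auc(counts, labels)
    print(auc)
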
Figure 14:

(a) Measured distribution of output spikes, proportional to f(a_P - a_N) in equation 5.6, for the single-output case with delta = 7.5. (b) ROC curves for delta = 0; other parameters remain the same. Here, the AUC averaged over four trials for two outputs is 0.95 compared to 0.93 for a single output, showing more benefit of considering balanced differential outputs in low-margin cases. (c) ROC curves for the case of delta = 7.5, showing that the differential architecture with two outputs performs similarly (AUC = 0.978 averaged over four trials) compared to classifying based on a single output only (AUC = 0.969 averaged over four trials).

5.3  Pattern Classification: UCI Data Sets

5.3.1  Data Set and Input Generation

To evaluate the classification performance of our system on real-world data sets, we next tested it on two standard UCI data sets: the Breast Cancer (BC) data set and the Heart data set. These were also two-class classification problems; however, they differ from the former experiments in that the random data sets tested only the noise resilience property of the NNLD since the training patterns were converted to noisy spike trains for testing. In the UCI data sets, training and testing use different patterns and, hence, evaluate the generalization capability of our chip. For both data sets, each input vector was mapped to a higher-dimensional sparse binary vector. Similar to the method in section 5.2.1, this mapping was done by employing 10 nonoverlapping receptive fields to span each of the original dimensions of the data. The width of each receptive field was again chosen so that all of the higher dimensions have an equal probability of being active. Hence, the original 9 and 13 dimensions of the BC and Heart data sets were mapped to 90 and 130, respectively. The numbers of training and testing samples for the BC data set are 222 and 383, respectively, while the corresponding numbers for the Heart data set are 70 and 200, respectively.

5.3.2  Classification Performance

Table 2 compares the performance of different classifiers on these data sets. For the NNLD classifier, k was fixed at 7 and a margin of delta = 7.5 was used, with the same firing rates as before, because these settings yielded the best results for the random binary pattern classification case. Five trials were conducted on each UCI data set, where the NNLD was separately trained in software with different initial conditions. Also, different instances of Poisson spike trains were generated in each trial. For the NNLD case, we report the results of software simulation on binary inputs and on spike inputs to show the loss in performance expected due to the mapping to noisy spike trains. The results for SVM and ELM are taken from Babu and Suresh (2013). We also show the results for a software implementation of a perceptron classifying the same high-dimensional binary patterns that are input to the NNLD. We find that in all cases, the software implementation of the NNLD achieves performance comparable to SVM or ELM and superior to that of the perceptron. Clearly, the network we have proposed performs well compared to the more extensive networks constructed with weighted synapses (high-resolution weights). In all cases, there is a small drop in accuracy in software when the binary inputs are mapped to noisy spike trains. Finally, the measured accuracy of the IC is slightly lower than the spike testing accuracy in software. It should be noted that the number of weights our NNLD uses is two to six times fewer than that of the SVM or ELM and comparable to that needed by the perceptron. But the perceptron uses high-resolution weights, while the NNLD uses only binary weights, underlining the NNLD's higher computational power.

Table 2:
Classification Performance on UCI Data Sets.
Data Set | SVM (N/W/CA) | ELM (N/W/CA) | Perceptron (W/CA) | NNLD (D/W/CA: binary software, spike software, chip)
BC       | 24/240/96.7  | 66/660/96.4  | 90/-              | 16/112/-, -, -
Heart    | 42/588/75.5  | 36/504/76.5  | 130/-             | 16/112/-, -, -

Note: N: number of neurons; W: number of weights; D: number of dendrites; CA: classification accuracy in percentage.

The classification performance of the IC on the Heart data set is similar to the performance of the perceptron in software. This is due to limitations on the maximum m and k values we could set in this IC. However, this can be partially overcome by the concept of boosting (Mitra et al., 2009), where the outputs of several such NNLD classifiers are combined to produce the final decision by voting. If each NNLD makes mistakes on different subsets of the patterns, the combined vote should result in far fewer errors. Of course, this comes at the expense of more dendrites and synapses. We tried this on the Heart data set by summing the spike counts of the P-cells and the N-cells separately over the five trials before comparing them. The resulting accuracy was better than those obtained from a single run of classification, as well as those from the other classifiers. But the effective number of synapses used increases to 560, which is similar to that of the SVM and ELM. However, the synapses for the NNLD are binary, compared to the high-resolution tunable weights required in the other approaches.
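
The boosting step amounts to pooling the P- and N-cell spike counts over the independently trained runs before the comparison, as in this short sketch (synthetic counts; five runs as in the text).

    import numpy as np

    def boosted_decision(p_counts, n_counts):
        """p_counts, n_counts: arrays of shape (n_runs, n_patterns) of spike counts.
        Sum over runs, then apply the usual winner-take-all comparison."""
        return (p_counts.sum(axis=0) > n_counts.sum(axis=0)).astype(int)

    rng = np.random.default_rng(2)
    p_counts = rng.poisson(12, size=(5, 200))   # five runs, 200 test patterns
    n_counts = rng.poisson(10, size=(5, 200))
    print(boosted_decision(p_counts, n_counts)[:10])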

5.4  Pattern Classification: Classification Speed

Finally, in order to gauge the speed of classification (and its trade-off with accuracy) achievable by the classifier, it was tested again on the standard BC data set. This was approached by shrinking the time window of observation (T) of an individual pattern. The advantage of using a smaller time window for classification is greater classification speed, in addition to lower firing of the output neuron and, hence, lower energy dissipation. Two cases were considered.

5.4.1  Poisson Spike Train

Decreasing the time window indiscriminately has the demerit of introducing noise in the input, and this trade-off was tested, as we describe next.

Data set and input generation. The input data set generation followed the same principle as described in section 5.3.1: each input vector was mapped to a higher-dimensional sparse binary vector, and each resulting vector was converted to a Poisson spike train, with the binary value 1 mapped to a train of mean firing rate f_high and 0 to a firing rate f_low. The training and the testing sets differ. The learning or training of the classifier followed the same procedure as before.

Classification performance. The pulse train sequence fed to the network had different values of T, varying up to 500 ms in steps of 50 ms. In each case, the 383 binary patterns in the testing subset of the BC data set were classified using the optimal delta and f_high settings from the earlier runs. In addition, for each value of T, the synaptic time-constant bias was also swept over a range. The expected trend is found to hold in Figure 15a, which shows the error gradually decreasing and finally becoming almost constant with increasing classification time intervals; the errors also become lower with an increasing bias setting for a particular T. For example, at the highest bias setting, the optimal performance (lowest classification error) is reached within 300 ms.

Figure 15:

(a) The effect of gradually increasing the classification time interval or window is to reduce the measured error by reducing the influence of noisy spikes in the output spike train through longer integration. Increasing the synaptic bias also results in lower error due to more filtering, as well as more synaptic current for an active afferent, raising the difference between an ON and an OFF synapse. (b) Classification accuracy generally improves with an increasing time window, a trend also observed in panel a, as well as with greater spike density per input burst per active afferent, again due to a more detectable difference between the currents of an ON and an OFF synapse.

The trend with -variation is easily explained: with increasing bias, the synaptic time constant (see section 3.1) increases, resulting in more averaging of the noisy Poisson input. The current from a synapse on an active afferent also rises, giving a greater difference between an ON and an OFF synapse. These two effects combine to produce a lower error at higher settings of .

5.4.2  Single Spike Burst

Data set and input generation. The input generation for this two-class classification problem differed from that in section 5.2.1. After the mapping with nonoverlapping receptive fields to expand each of the original dimensions of the data, the resulting high-dimensional vector is converted to a spiking input by mapping the binary value 1 to a burst of spikes concentrated within a very short duration (2 ms) and the value 0 to no spikes at all.
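
A minimal sketch of this burst encoding is given below; the 2 ms burst width comes from the text, while the number of spikes per burst and the uniform spacing within the burst are assumptions for illustration.

```python
import numpy as np

def binary_vector_to_burst(x, spikes_per_burst=5, burst_width=2e-3,
                           t_start=0.0):
    """Encode a binary vector as one short burst per active afferent.

    Each afferent with value 1 emits spikes_per_burst spikes spread
    uniformly over a burst_width window (2 ms in the text); afferents
    with value 0 emit no spikes.  Returns a list of spike-time arrays,
    one per afferent (an empty array for inactive afferents).
    """
    burst = t_start + np.linspace(0.0, burst_width, spikes_per_burst)
    return [burst.copy() if xi else np.array([]) for xi in x]

# Example: 6 afferents, 20 spikes per burst as in the best case reported
trains = binary_vector_to_burst([1, 0, 1, 1, 0, 0], spikes_per_burst=20)
print([len(t) for t in trains])
```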

Classification performance. The classification was conducted through a series of runs, each corresponding to a different spike count in the input burst per active afferent. Each run measured the performance over increasing classification time windows (or pattern durations) , kept at a maximum of 50 ms this time. This is substantially lower than the previous value of 500 ms because the sole purpose of this experiment was to see how quickly the system could classify a pattern set with reasonable accuracy (hence only a small spike burst, instead of an entire noisy spike train, was applied to a binary 1 afferent). Figure 15b shows the experimental results. Again, we find improved performance with an expanding classification time in each case, similar to the trend in section 5.4.1. The drastic reduction in accuracy at values of ms is due to the finite number of spikes obtainable in such a short time, which limits the discriminative ability of the neuron. Note that the general classification accuracy also improves with increasing spike density per burst; this is intuitive because more spikes per input burst per active afferent imply a greater difference between the currents of an ON and an OFF synapse (as discussed in section 5.2.2). In this case, we reach an error of at ms with 20 spikes per burst.

6  Discussion

6.1  Relation to Other Work

In terms of spike-based classifiers in hardware, our result of error in classifying 200 random binary patterns using only 160 binary synapses compares favorably with other systems. For example, the work in Mitra et al. (2009) could classify about 12 random binary patterns at an error of using 1200 binary synapses; of course, that design included on-chip learning capabilities that are not present in our work. In terms of VLSI implementations of dendritic function, Nease, George, Hasler, Koziol, and Brink (2012) present a voltage mode diffuser circuit that models the passive properties of dendrites and show that it exhibits properties matching cable theory. Wang and Liu (2010, 2013) present a current mode implementation of active dendrite models and use it to study responses to spatiotemporal spike inputs. Compared with these detailed models, our model of dendritic nonlinearity is much more simplified; instead, we have focused on using it for spike-based pattern classification. Our results show the viability of binary synapses for pattern classification when coupled with margin-based learning to counter the imperfections of VLSI implementations. The key to achieving good performance is a larger ratio of nonlinear to linear processing: short linear summations of synaptic inputs have to be passed through a nonlinearity before further summation.

This method can be used in other analog neuromorphic processors (Brink et al., 2013; Qiao et al., 2015), where the dendritic nonlinearity can be replaced by neurons while the software model used for training applies structural plasticity to choose the few best connections per neuron, which can then be written into the look-up tables for AER. The inherent learning capability of some of these systems can be used to fine-tune performance further. Here, we assume that the benefits of the NNLD network stem from the additional nonlinearities provided by the dendrites, so similar benefits should be attainable by using neuronal nonlinearities in place of dendrites. In that case, however, the area would increase substantially, since the dendrite circuit has no capacitors and is more compact than an IF neuron, which needs at least one capacitor for membrane dynamics and more for the refractory period and spike-frequency adaptation. Finally, Spiess, George, Cook, and Diehl (2016) recently explored the use of structural plasticity to denoise neuronal responses; however, no real-world classification problems were reported and no hardware measurements were made. We believe our work is the first to show such results.

6.2  Future Work

Though the initial results from this simple chip are promising, it has certain limitations that can be improved in future designs. We discuss these aspects and other extensions in this section.

In this proof-of-concept work, since the WTA for comparison was not implemented, the number of spikes from the output neuron was not regulated. On average, the neurons fired about 15,000 spikes over a pattern duration of 500 ms. However, not all of these spikes are necessary, and far fewer spikes can give comparable accuracy if an inhibitory interneuron (IIN)-based WTA is included on the chip, as already depicted in Figure 2a for the differential architecture. From that representation, it is clear that the IIN steady-state current is fed back along a negative feedback path to the inputs of both neurons, so that for class 1 patterns the current () would rapidly decrease to zero, with a simultaneous decrease of the current (), while for class 0 patterns the reverse would occur. This suggests a significant (possibly greater than ) reduction in the spiking of the output neuron. The mathematical formulation in equation 2.6 also supports this statement if one notes that the current () is not the negative of the current (); rather, it represents the difference between the N-cell current and the P-cell current with the connectivities swapped. For a pattern to be classified in class 1, () must exceed (), and vice versa for class 0.

To evaluate the possible benefits, we implemented a software model of the WTA and varied the parameters and , denoting the time constant and peak amplitude of the inhibitory postsynaptic current generated by the interneuron. The performance of the NNLD classifier with , , and was evaluated for random binary inputs mapped to spike trains with Hz. As shown in Figure 16a, different combinations of and lead to different error rates; errors increase with higher as an increasing proportion of the output neuronal spikes is eliminated by the negative feedback. These data are replotted in Figure 16b to show the trade-off between accuracy and spike rate. Compared to the original accuracy of , slightly lower accuracies of and can be obtained with concomitant reductions of and , respectively, in the number of spikes. This points to the benefit of an appropriately tuned inhibitory interneuron as a way to lower output neuronal firing without shortening the classification interval; it will be included on chip in future versions of the system.
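
The following sketch shows one possible software model of such feedback, using two leaky integrate-and-fire cells whose shared inhibitory current decays with a time constant and jumps by a fixed amplitude on every output spike; all names, dynamics, and parameter values are illustrative assumptions, not the model used to produce Figure 16.

```python
import numpy as np

def wta_two_cell(i_p, i_n, tau_inh=0.02, amp_inh=0.5, dt=1e-3,
                 tau_m=0.02, v_th=1.0):
    """Toy WTA: two leaky integrate-and-fire cells (P and N) driven by
    input currents i_p, i_n (arrays over time).  Every output spike from
    either cell adds an inhibitory current that decays with time constant
    tau_inh and is subtracted from both inputs, mimicking the IIN
    negative feedback.  Returns the spike counts of the two cells.
    """
    v_p = v_n = 0.0          # membrane states (arbitrary units)
    inh = 0.0                # shared inhibitory current
    n_p = n_n = 0            # spike counts
    for ip, in_ in zip(i_p, i_n):
        inh *= np.exp(-dt / tau_inh)               # inhibition decays
        v_p += (-v_p + max(ip - inh, 0.0)) * dt / tau_m
        v_n += (-v_n + max(in_ - inh, 0.0)) * dt / tau_m
        if v_p >= v_th:
            v_p = 0.0; n_p += 1; inh += amp_inh    # spike triggers inhibition
        if v_n >= v_th:
            v_n = 0.0; n_n += 1; inh += amp_inh
    return n_p, n_n

# Example: class-1-like pattern where the P-cell receives more drive
t = np.arange(0, 0.5, 1e-3)
print(wta_two_cell(2.0 + 0 * t, 1.5 + 0 * t))
```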

Figure 16:

(a) Change in classification accuracy when the time constant and peak amplitude of the inhibitory postsynaptic current in the WTA are varied. (b) The data in the earlier simulation are replotted to show change in accuracy against the percentage reduction of spikes. Compared to the original accuracy of , slightly lower accuracies of and can be obtained with a concomitant reduction of and , respectively, in the number of spikes.

The current chip has a limited number of dendrites (eight) per cell. Combined with the constraint on the maximum number of synapses per dendrite (imposed by transistors in the DPI synapse and squaring block leaving the subthreshold regime, as shown in equations 3.5 and 3.3), this limits the computational capability of each cell. In future versions of the chip, we will increase the size of the output transistor of the DPI synapse, as well as the transistors in the translinear loop of the square block, to obtain a larger operating range. Architecturally, we will move to cells with more dendrites, since that also increases the computational power of the classifier (Hussain et al., 2015). Future versions of the system will integrate multiple NNLDs per chip with a fully asynchronous AER-based input and output interface so that multiple chips can be tiled to create larger processors. In such a system, we can also use boosting (shown in section 5.3.2 to improve performance) to allocate multiple NNLDs that decide on a class by voting. Mismatch between different branches reduces the accuracy of the hardware (though margin-based learning helps counter this effect). We will use the results of chip characterization to modify the software model during training; the new connection matrix, in which the learning accounts for the mismatch, should produce better agreement between hardware and software results. We also plan to extend the hardware to include on-chip learning of connection matrices that accounts for all hardware mismatches directly. Initial architectural explorations in this direction have already been done (Roy, Kar, & Basu, 2014), and a chip is being fabricated to test this idea.

In terms of algorithms, it has already been shown that the NNLD-based classifier can easily be extended to multiclass problems (Hussain et al., 2014) as well as to spike-time-based pattern classification (Roy, Banerjee et al., 2014; San, Basu, & Hussain, 2014). We will employ future systems to classify handwritten digits from the MNIST database, a commonly used multiclass benchmark for current neuromorphic spike-based algorithms (O'Connor et al., 2013; Neftci, Das, Pedroni, Kreutz-Delgado, & Cauwenberghs, 2013); that problem is still essentially mean rate encoded. To assess the ability of the hardware to identify spike-time-based patterns, we will use it to classify the spike trains from a liquid state machine, as done in Roy, Banerjee et al. (2014), and random spike latency patterns, as in Gutig and Sompolinsky (2006). We also plan to connect these ICs to real-time spike-generating auditory sensors (Chan et al., 2007) to perform rapid speech classification at low power.

7  Conclusion

In this letter, we have presented the VLSI circuit design in m CMOS of a neuromorphic spike-based classifier with eight nonlinear dendrites per neuron and two opponent neurons per class. We presented characterization results demonstrating the functionality of all subblocks from a supply voltage as low as 1.8 V, as well as results of classifying complex spike-based high-dimensional binary patterns. The error in classifying 200 patterns randomly assigned to two classes was measured under different conditions. We showed that adding a dendritic leak and a classification margin improves performance and makes the system robust against noise. With an optimal classification margin, the hardware system performs comparably to SVM and ELM on two UCI data sets with far fewer binary weights. We also demonstrated that the margin enhancement algorithm allows single-output differential operation with accuracy similar to double-output operation, thus allowing us to reduce the number of neurons. Pattern classification within 50 ms was possible when bursts of spikes were used to represent the input binary values. The accuracy was degraded compared to software because of transistor mismatch and the non-square-law behavior of the dendritic nonlinearity at high currents. Future work will use calibration to improve accuracy and include the WTA on chip to reduce the output spike counts. We will also employ future generations of this chip for multiclass and spike-time-based classification.

Appendix:  Details of the FPGA Controller

The FPGA controller consists of the following blocks:

A.1  Block Memory (Module A)

Module A, shown in Figure 6, is the volatile memory array implemented in the hardware. For generating presynaptic pulses, the input spike train vector (spkTm), the input line addresses (spkAd), and the connectivity matrix created by the pattern classification program are transferred from a PC to block memories A, B, and C, respectively. The structures of the spkTm and spkAd vectors are shown in Figure 17. Since spkTm is a sparse array of binary firing events, only those time instances of spkTm having binary value 1 are stored in block memory A. These time instances are generated with respect to an FPGA internal clock frequency (). Similarly, the input addresses at these time instances are stored in block memory B. These two memories store the input information and will not be needed when interfacing with a real spiking sensor.
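
As an illustration of this storage scheme, the sketch below packs a sparse spike raster into event-time and address arrays in the spirit of spkTm and spkAd; the clock frequency, bin width, and data layout are assumptions for illustration and not the actual FPGA format.

```python
import numpy as np

def pack_spike_events(spike_raster, f_clk=50e6, dt=1e-3):
    """Pack a sparse binary spike raster into (spkTm, spkAd)-style arrays.

    spike_raster: boolean array of shape (num_afferents, num_bins), one
    row per input line.  Only bins containing a 1 are stored: spkTm holds
    the event time expressed in FPGA clock cycles (assuming an internal
    clock of f_clk and a bin width of dt, both illustrative values) and
    spkAd holds the corresponding input line address.
    """
    lines, bins = np.nonzero(spike_raster)
    order = np.argsort(bins, kind="stable")              # events in time order
    spkTm = (bins[order] * dt * f_clk).astype(np.int64)  # times in clock cycles
    spkAd = lines[order].astype(np.int64)                # input line addresses
    return spkTm, spkAd

# Example: 4 input lines, 5 time bins
raster = np.array([[0, 1, 0, 0, 1],
                   [0, 0, 0, 0, 0],
                   [1, 0, 0, 1, 0],
                   [0, 0, 1, 0, 0]], dtype=bool)
print(pack_spike_events(raster))
```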

Figure 17:

Flowchart depicting the pulse and address generation algorithm implemented on FPGA.

The connectivity matrix of this neural network has the dimension of (as discussed in section 2), where is the number of input lines and is the number of dendrites. Since the number of connections per dendrite , this matrix is also sparse. Storing the entire sparse matrix as a look-up table (LUT) would require memory space on the order of and is wasteful. Instead, we store only the addresses of the nonzero elements as a linear array synCol, as shown in Figure 18, where and denote decoder addresses. For example, if the first input line connects to dendrites 2, 3, and 5, the first three entries of synCol will be , , and . Given an input pulse address, however, determining which decoder addresses have to be generated is not straightforward, since the input pulse address cannot be used to index directly into this array. To circumvent this, we store another linear integer array, addrPtr, with entries. The th entry of addrPtr stores the address of the synCol array where the first dendritic connection (decoder address) for the th line is stored. The number of dendritic connections