## Abstract

Recent work in computer science has shown the power of deep learning driven by the backpropagation algorithm in networks of artificial neurons. But real neurons in the brain are different from most of these artificial ones in at least three crucial ways: they emit spikes rather than graded outputs, their inputs and outputs are related dynamically rather than by piecewise-smooth functions, and they have no known way to coordinate arrays of synapses in separate forward and feedback pathways so that they change simultaneously and identically, as they do in backpropagation. Given these differences, it is unlikely that current deep learning algorithms can operate in the brain, but we show that these problems can be solved by two simple devices: learning rules can approximate dynamic input-output relations with piecewise-smooth functions, and a variation on the feedback alignment algorithm can train deep networks without having to coordinate forward and feedback synapses. Our results also show that deep spiking networks learn much better if each neuron computes an intracellular teaching signal that reflects that cell’s nonlinearity. With this mechanism, networks of spiking neurons show useful learning in synapses at least nine layers upstream from the output cells and perform well compared to other spiking networks in the literature on the MNIST digit recognition task.

## 1 Introduction

Recent results in computer science have revealed the power of deep learning (Bengio, 2009; Farabet, Couprie, Najman, & LeCun, 2013; Hinton, Osindero, & Teh, 2006; Hinton & Salakhutdinov, 2006; Krizhevsky, Sutskever, & Hinton, 2012; Schmidhuber, 2015). But it is unclear which insights from this work apply to the brain because current algorithms for deep learning are designed for networks of very simple neurons. Real neurons are different in at least three crucial respects. First, real neurons communicate by streams of voltage spikes, or action potentials, whereas neurons in most artificial deep networks have continuous, graded outputs. Second, real neurons are dynamic in the sense that their activity at any moment depends not only on their inputs and synaptic weights at that moment but also on their inputs and weights over the last few milliseconds (Eliasmith & Anderson, 2002). And third, real neurons almost certainly lack weight transport, meaning they cannot send each other detailed information about the weights (i.e., strengths) of all their synapses in the way that is required in current algorithms for deep learning (Chinta & Tweed, 2012; Crick, 1989; Grossberg, 1987; Kolen & Pollack, 1994; Levine, 2000; Rolls & Deco, 2002; Stork, 1989).

Of course, these three aspects of real neurons are not necessarily flaws or shortcomings, as spiking and dynamics may bring computational advantages (Hinton, 2016; Maass & Markram, 2004; Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014). And of course real neurons differ from artificial ones in other ways besides these three. But these three properties do suggest that the computations underlying biological learning must differ from those of current deep learning algorithms in computer science. And the same three issues are also relevant to networks embodied in very large-scale integrated (VLSI) circuits (Azghadi, Iannella, Al-Sarawi, Indiveri, & Abbott, 2014) and field-programmable gate arrays (FPGA) (Neil & Liu, 2014). We describe the computational problems raised by these three issues and then show how those problems can be solved.

To begin with spiking and dynamics, the key issue is that in real neurons, spiking depends on current and past inputs and synaptic weights, whereas in the artificial neurons of most nonrecurrent deep networks, output depends only on current inputs and parameters (weights and biases). In the best-performing algorithms for deep learning, each neuron receives a drive, which depends on its inputs and parameters. The neuron emits an output signal that is a function of that drive; this function is called the activation function. The activation function matters because deep learning algorithms rely on the backpropagation algorithm, which works by computing the derivative of the network’s current output error with respect to the weights of all the synapses in the net, and these derivatives depend on the derivative of each neuron’s output signal with respect to its drive (Ciresan, Meier, Gambardella, & Schmidhuber, 2010; Hinton et al., 2006; Krizhevsky et al., 2012; Sermanet et al., 2013; Srivastava et al., 2014).

But in a dynamic neuron, there is no function relating the present output to the present drive, and so there is no such derivative. Of course, a real neuron’s outputs are still related to its inputs, but not by a function in the mathematical sense, which would require that any one input always be paired with the same output. One could tackle this problem by working from the fact that the current output is a function of current and past drives. But that approach increases the dimensionality of the problem. In this letter, we apply a simpler method, which uses, in place of the activation function, the function relating the expected value of the output to the drive (O’Connor, Neil, Liu, Delbruck, & Pfeiffer, 2013).

The third difference we are considering—the brain’s lack of weight transport—sets up further barriers to the backpropagation algorithm. Backpropagation works by sending error derivatives along a feedback path that drives learning in the forward part of the network. But those derivatives depend on the weights of the synapses in the forward path, which means that the feedback circuits that drive learning must have information about those weights. In the brain, there is no known way for them to get that information.

Specifically, backpropagation continually adjusts the synaptic weights in the feedback path so that each one stays equal to its corresponding weight in the forward path, with the result that the matrix of feedback weights in each layer equals the transpose of the matrix of forward weights in that layer (in convolutional networks, there is more complicated coordination of weights). In a computer, it is easy to set each feedback weight equal to the appropriate forward-path weight at each time step. But the brain, lacking weight transport, has no mechanism to coordinate large numbers of evolving synapses on different neural pathways in this way (Chinta & Tweed, 2012; Crick, 1989; Grossberg, 1987; Kolen & Pollack, 1994; Levine, 2000; Rolls & Deco, 2002; Stork, 1989).

Surprisingly, though, it has recently been found that layered networks can learn even if synapses in the feedback path are not coordinated at all with those in the forward path but are instead frozen at random values. This algorithm is called *feedback alignment*, because in it, the forward-path synapses evolve to resemble the fixed synapses in the feedback circuits, so that in the end, it is as if those feedback synapses had been set equal to the forward ones as required by backpropagation. The reasons that feedback alignment works are not fully understood, but what is known is described in Lillicrap, Cownden, Tweed, and Akerman (2014) and Hinton (2016).
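The contrast between the two feedback schemes can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' code: the helper names (`backprop_feedback`, `alignment_feedback`) and the small matrices are hypothetical, and activation derivatives are left out so that the only difference on display is which matrix carries the error backward.

```python
def matvec(matrix, vector):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose a matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def backprop_feedback(forward_weights, error):
    """Backpropagation: the feedback path reuses the forward weights, transposed."""
    return matvec(transpose(forward_weights), error)


def alignment_feedback(fixed_feedback_weights, error):
    """Feedback alignment: the feedback path has its own fixed random weights."""
    return matvec(fixed_feedback_weights, error)


# Hypothetical layer: 3 hidden cells projecting to 2 output cells.
W = [[0.5, -1.0, 2.0],
     [1.5, 0.0, -0.5]]   # forward weights (2 x 3), these change during learning
B = [[0.3, -0.7],
     [-0.2, 0.4],
     [0.9, 0.1]]         # feedback weights (3 x 2), frozen at random values
e = [1.0, -1.0]          # output error

bp_signal = backprop_feedback(W, e)    # requires knowledge of W (weight transport)
fa_signal = alignment_feedback(B, e)   # requires only B, which never changes
```

In the backpropagation case the feedback signal must be recomputed from the current forward weights at every step, whereas in the feedback alignment case nothing in the feedback path depends on them.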

Here we show that a variant of feedback alignment can drive deep learning in dynamic, spiking networks. Connections between our results and other recent discoveries in the field of spiking networks (Beyeler, Dutt, & Krichmar, 2013; Bohte, Kok, & La Poutre, 2002; Brader, Senn, & Fusi, 2007; Diehl & Cook, 2015; Diehl et al., 2015; Eliasmith et al., 2012; Henderson, Gibson, & Wiles, 2015; Jimenez Rezende & Gerstner, 2014; Maass & Markram, 2004; Neftci, Das, Pedroni, Kreutz-Delgado, & Cauwenberghs, 2014; Neil & Liu, 2014; O’Connor et al., 2013) are laid out in section 4.

## 2 Methods

### 2.1 Neurons

We use a mathematical model called the leaky integrate-and-fire (LIF) neuron (Eliasmith & Anderson, 2002), which is popular because it strikes a useful balance between realism and complexity.

### 2.2 Backpropagation and Feedback Alignment

The key point is that in backpropagation, all the forward weight variables play a double role: they represent the synaptic weights in the forward path, but they also appear in equation 2.5, where they multiply signals in the feedback path. In other words, each weight acts as a synapse in two different neural pathways. In the brain the forward and feedback synapses are of course physically distinct, which means that for backpropagation to run in the brain, each synapse in the feedback path would have to always stay equal to its specific corresponding synapse in the forward path, even though the latter synapse is constantly evolving as the network learns. This is the weight transport problem, which is one of the main reasons backpropagation is not considered feasible in the brain (Chinta & Tweed, 2012; Crick, 1989; Grossberg, 1987; Kolen & Pollack, 1994; Levine, 2000; Rolls & Deco, 2002; Stork, 1989).

### 2.3 Broadcast Alignment

Feedback alignment and broadcast alignment differ in their handling of the activation-function derivatives. Feedback alignment includes multilayer information about these derivatives in its feedback signals; for example, the formula for the feedback signal of a given layer in equation 2.6 contains that layer’s derivative term and also includes the feedback signal of the next layer downstream, which in turn was computed using information about that next layer’s derivatives, and so on through all the layers. That is, in feedback alignment as in backpropagation, the feedback signals accumulate information about the derivatives of all downstream neurons. Broadcast alignment, in contrast, omits the derivatives from its feedback signals in equation 2.7, but incorporates each neuron’s own derivative into its intracellular learning mechanism in equation 2.8. Therefore, learning in any one neuron is based solely on the derivative of its own activation function and gets no information about any downstream derivatives. So broadcast alignment delivers less information to each learning neuron than backpropagation or feedback alignment does. In section 3 we show that it learns very effectively nonetheless.
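The difference in how derivative information travels can be sketched with one cell per layer. This is a simplified reading of the verbal description, not a transcription of equations 2.6 to 2.8: the function names are hypothetical, scalar feedback weights stand in for matrices, and the derivative of tanh is assumed as a stand-in activation derivative.

```python
import math


def activation_derivative(drive):
    """Derivative of tanh, used here as an assumed stand-in activation derivative."""
    return 1.0 - math.tanh(drive) ** 2


def feedback_alignment_signals(error, drives, feedback_weights):
    """Each layer's signal is built from the next layer's, so derivatives accumulate."""
    signals = [0.0] * len(drives)
    downstream = error
    for layer in reversed(range(len(drives))):
        signals[layer] = (feedback_weights[layer] * downstream
                          * activation_derivative(drives[layer]))
        downstream = signals[layer]
    return signals


def broadcast_alignment_signals(error, drives, feedback_weights):
    """Each layer receives the output error directly and applies only its own derivative."""
    return [b * error * activation_derivative(v)
            for b, v in zip(feedback_weights, drives)]


def derivative_free_signals(error, feedback_weights):
    """The same direct feedback with no derivative information at all."""
    return [b * error for b in feedback_weights]


# One hypothetical cell per layer, listed from upstream to downstream.
drives = [0.0, 0.5, 1.0]
feedback_weights = [1.0, 1.0, 1.0]
error = 1.0

fa = feedback_alignment_signals(error, drives, feedback_weights)
ba = broadcast_alignment_signals(error, drives, feedback_weights)
df = derivative_free_signals(error, feedback_weights)
```

In this toy setting the most upstream feedback alignment signal is scaled by the derivatives of every downstream cell, the broadcast alignment signal only by the cell's own derivative, and the derivative-free signal by none.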

We also looked at whether this simplification could be pushed one step further by omitting all information about activation-function derivatives from the learning algorithm, that is, by combining derivative-free feedback, equation 2.7, with the derivative-free intracellular process, equation 2.4. But we will show that this derivative-free algorithm does not learn nearly as well as broadcast alignment. That is, the minimal derivative information in equation 2.8 is very useful for deep learning.

### 2.4 Dynamics and Activation

Broadcast alignment requires that learning neurons have information about the derivative of their activation function. But LIF neurons have no such function. The leaky integrator in equation 2.2 makes LIF cells dynamic: the values of the hillock potential and therefore the activity depend not only on the drive at this moment but also on the values the drive has had over the last few milliseconds. Plotting activity versus drive, as in Figure 2A, illustrates the problem. The blue dots show activities for 500 random values of the drive: for any drive, the activity can be either 0 or 1 depending on recent history, and the graph does not resemble any smooth curve with a derivative.
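A small simulation makes the point concrete. The sketch below is an illustrative LIF cell, not the model of equation 2.2: only the 0.25 ms time step comes from the text, while the time constant, threshold, refractory period, reset rule, and the use of a constant drive are all assumptions chosen for the example.

```python
def simulate_lif(drive, duration_ms=400.0, dt=0.25, tau=20.0,
                 threshold=1.0, refractory_ms=1.0):
    """Return the 0/1 activity of a leaky integrate-and-fire cell under constant drive."""
    potential = 0.0
    refractory_left = 0.0
    activity = []
    for _ in range(int(duration_ms / dt)):
        if refractory_left > 0.0:
            refractory_left -= dt          # the cell stays silent while refractory
            potential = 0.0
            activity.append(0)
            continue
        potential += dt * (drive - potential) / tau   # Euler step of the leaky integrator
        if potential >= threshold:
            activity.append(1)             # spike, then reset and go refractory
            potential = 0.0
            refractory_left = refractory_ms
        else:
            activity.append(0)
    return activity


def mean_activity(drive):
    """Fraction of time steps on which the cell spikes, an estimate of expected activity."""
    spikes = simulate_lif(drive)
    return sum(spikes) / len(spikes)


rates = {drive: mean_activity(drive) for drive in (0.5, 1.5, 3.0)}
```

At a single fixed drive the activity is 0 on most steps and 1 on a few, so no function maps drive to activity; the averaged activity, however, rises smoothly with drive, which is the quantity a fitted activation function can approximate.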

The best-fitting function, plotted with its fitted coefficients as a thin black curve in Figure 2B, does resemble the graph of expected activity. We chose tanh as the basis of our fitting function because it is a popular bounded activation function in the machine learning literature. Other function forms based on logarithms are also possible and actually yield slightly better fits to the expected activity because they saturate more slowly, but in our preliminary experiments, spiking networks based on these alternative functions learned no better or worse than tanh-based ones. The question is whether equation 2.10 can play the role of the activation function in deep learning, given that the standard deviation of the activity about its expected value is so wide (light blue region in Figure 2B). We address this question with simulations in section 3.

### 2.5 Error Feedback

All neurons in our networks, except first-layer neurons, which simply carry input signals, learn by adjusting their weights and biases based on feedback. Learning is driven by spiking error signals *e*, which are the differences between the desired outputs of the network and its actual outputs. For instance, if a network has three layers, the actual outputs are the activities of the third-layer neurons. Both the desired and the actual outputs always consist of 0s and 1s (where, again, 1 means a spike and 0 means no spike), and therefore the error signals consist of 0s, 1s, and −1s. Because real neurons cannot produce negative spikes, we propose two populations of error feedback neurons, all of them carrying signals of 0 or 1, but half of them being inhibitory cells, whose spikes signal negative errors.
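The two-population encoding of signed errors can be sketched as follows. The function name `split_error` is hypothetical, and the sign convention (desired minus actual) is an assumption made for the example.

```python
def split_error(desired, actual):
    """Encode signed errors in {-1, 0, 1} as two vectors of 0/1 spikes."""
    excitatory = []
    inhibitory = []
    for d, a in zip(desired, actual):
        error = d - a                                  # assumed sign convention
        excitatory.append(1 if error > 0 else 0)       # spike signals a positive error
        inhibitory.append(1 if error < 0 else 0)       # spike signals a negative error
    return excitatory, inhibitory


desired = [1, 0, 0, 1]
actual = [0, 0, 1, 1]
positive_spikes, negative_spikes = split_error(desired, actual)

# The signed error is recovered as excitatory minus inhibitory activity.
recovered = [p - n for p, n in zip(positive_spikes, negative_spikes)]
```

No cell ever carries a negative value, yet the downstream effect of the pair is the full signed error.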

This scheme does not imply the existence of any unphysiological “supervisor” guiding the learning. For convenience, we speak of desired outputs, as in the machine learning literature, but the network need not receive any desired-output signals. All that matters is that it get signals representing the errors *e*. For instance, learning circuits in the cerebellum adjust the processing in the vestibulo-ocular reflex so that when the head moves, the eyes counterrotate in the head at just the right velocity to keep the visual images stable on the retinas. This learning is driven by error signals from the visual system, which code retinal-image slip velocity; retinal-slip signals provide a useful error vector, with no need for any signals coding desired eye velocities (Lisberger, 1994). Another source of teacher signals that seems plausible physiologically is the networks’ own inputs, as in the artificial networks known as autoencoders, which learn useful representations of sense data based on error signals that are differences between the networks’ own inputs and outputs (Bengio, 2009; Hinton, 2016).

### 2.6 Learning Mechanism

For nonnegative values of the drive, the sech term is a simple, indeed monotonic, function that slopes down from its peak at 0 like the right half of a gaussian. Hence, the sech term in equation 2.11 means that these neurons are more responsive to error signals when they are less excited. Biologically, it implies that some intracellular agent of synaptic change varies its activity as a function of the cell’s overall drive. We know of no cell-biological evidence for or against such a dependence, but it is not implausible, and it does greatly improve learning, as we will show. For that reason, we use equation 2.11 in all our spiking network simulations; that is, we propose that each learning neuron computes its own intracellular teaching signal based on its error feedback and its drive.

    set time step (ms), time constant (ms),
        threshold, activation-function fitting constant
    initialize forward weights, biases, feedback weights,
        drives, hillock potentials, activities,
        times since refractory periods began (ref),
        and learning rate constants
    for each example
        sample inputs and desired outputs
        for each time step of the presentation
            for each layer in the network
                for each cell in the layer
                    update drive
                    if the cell is refractory, advance ref                 // how long has the cell been refractory?
                    if ref has reached the refractory duration, reset ref  // end refractory period
                    update hillock potential, depending on ref
                    if hillock potential reaches threshold,
                        set ref to any tiny positive number                // start a refractory period
                    set activity, depending on ref
                end for
            end for
            if the initial no-learning interval has passed
                compute error
                for each layer from output back to layer 2
                    for each cell in the layer
                        compute feedback (one of two cases)
                        compute intracellular teaching signal
                        update weights
                        update bias
                    end for
                end for
            end if
        end for
    end for

### 2.7 Simulations

In all simulations, we computed dynamics by Euler integration with a time step of 0.25 ms. Network weights and biases were initialized so that the drives of all neurons in all layers had a mean of 8 and a standard deviation of 10, because with these values, the neurons’ activity is spread out over the middles of their operating ranges, as shown in Figure 2B (see appendix B for details of this initialization).

During training, we used minibatches of 100 examples. It seems unlikely that the brain uses minibatches, but using them in our experiments reduced the computer run times and did not alter anything essential in the proposed learning model.

As in other learning studies with dynamic neurons, each input was presented for a brief interval of simulated time (100 ms in our case) rather than for a single time step, as is done with static, graded neurons. And the network did not adjust its weights or biases until it had been viewing an image for 20 ms. Similarly during testing, we ignored the network’s outputs for the first 20 ms; we averaged its output activity vector over the remaining 80 ms and took that average as the network’s answer. One motivation for these numbers is that humans need about 100 ms of viewing time to recognize objects in pictures.

In MNIST trials, performance was assessed in the usual way: the network was regarded as giving the correct answer when the appropriate output neuron was more active than all the others. For instance, when the handwritten digit is a 3, then the fourth of the 10 output neurons should be spiking and the other 9 should all be silent, so the output was considered correct when the fourth neuron produced more spikes than any of the others during the 80 ms answering period.
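The scoring rule can be sketched as below. The function names and the toy spike trains are hypothetical; the 0.25 ms time step, the 20 ms ignored interval, and the requirement that the correct neuron spike more than every other neuron come from the text.

```python
def spike_counts(spike_trains, dt_ms=0.25, ignore_ms=20.0):
    """Count each output neuron's spikes after the initial ignored interval."""
    first_step = int(ignore_ms / dt_ms)
    return [sum(train[first_step:]) for train in spike_trains]


def is_correct(spike_trains, label):
    """True when the labeled neuron spikes strictly more than every other neuron."""
    counts = spike_counts(spike_trains)
    others = [count for index, count in enumerate(counts) if index != label]
    return counts[label] > max(others)


# Toy example: 100 ms at 0.25 ms per step gives 400 steps for each of 10 output neurons.
n_steps = 400
trains = [[0] * n_steps for _ in range(10)]
trains[3] = [1 if step % 20 == 0 else 0 for step in range(n_steps)]  # fires throughout
trains[7] = [1 if step < 80 else 0 for step in range(n_steps)]       # fires only early

counts = spike_counts(trains)
```

Neuron 7 fires heavily during the first 20 ms, but those spikes are discarded, so the digit 3 (the fourth neuron, index 3) is scored as the network's answer.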

## 3 Results

### 3.1 Performance in Nonspiking Networks

First we tested our candidate deep learning algorithm, broadcast alignment, against three other methods: derivative-free learning, feedback alignment and backpropagation. These last two algorithms cannot run on LIF neurons, and therefore the tests of all four were run on networks of nonspiking neurons. Although the neurons were nonspiking, their activation function equaled the approximate activation function of LIF neurons, given in equation 2.10, for better comparison with the spiking neuron results in section 3.2.

In all these tests, the learning network had the same deep and narrow structure, with 2 input neurons, 2 output neurons, and 8 hidden layers of 10 neurons each. The task of the learning network was to match the outputs of a nonspiking teacher, or target, network. The target network was also deep and narrow, again with 10 layers and 2 input and 2 output neurons, to create tasks where deep learning was likely to be useful. To make the tasks more challenging, the target net had different types of neurons than the learner in all layers but the first: the 8 hidden layers each consisted of 2 nonrectified tanh cells, and the output layer had two nontanh, one-hot output cells; that is, the only possible outputs were (1, 0) and (0, 1). The probabilities of these two outputs were always close to equal, that is, always within 0.001 of 0.5.

Each algorithm was tested 500 times, each time with new, random weights in the target network and new, random initializations of the learning network, to present the learners with a large and varied set of tasks.

We ran these 500 tests on each of nine versions of each algorithm, which differed in their depth of learning. For instance, all tests of backpropagation ran on the 10-layer learning nets described above, but in the depth-1 version, learning was restricted to the synapses in the tenth (i.e., the output) layer of the net, and all upstream synapses stayed fixed at their initial values. In the depth-2 version, synapses in the last two layers were adjusted, and so on down to depth-9, the deepest possible version where all the synapses in the network were adjusted. The point of these comparisons was to see how far upstream each algorithm was able to deliver useful teaching signals.
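The depth versions can be stated compactly. In this sketch the function name `plastic_layers` is hypothetical, and layers are numbered from 1 (input) to 10 (output) as in the text, with each number referring to the layer whose incoming synapses are adjusted.

```python
def plastic_layers(n_layers, learning_depth):
    """Return the layers whose incoming synapses learn in a depth-k version."""
    if not 1 <= learning_depth <= n_layers - 1:
        raise ValueError("learning depth must be between 1 and n_layers - 1")
    return list(range(n_layers - learning_depth + 1, n_layers + 1))


shallowest = plastic_layers(10, 1)   # only the output layer learns
deepest = plastic_layers(10, 9)      # every layer after the input learns
```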

Figure 3A shows the results for backpropagation. Each of the nine curves shows the performance error, averaged over 500 tests, for one of the depth versions of the algorithm: the top-most, bright green curve for the shallowest, or depth-1 version; the bottom blue curve for the deepest, depth-9, version; and the curves in between for the seven intermediate depths. Each curve is centered on the mean of its 500 trials, and its thickness equals 2 standard errors of the mean. Trial-to-trial variance was large because each trial used a different target function, but after 500 trials, the standard errors were small enough that the nine bands are distinctly separate. In particular, the lowest of the nine learning curves lies well below the second-lowest, showing that depth-9 learning was better than depth-8. This finding means that backpropagation delivered useful teaching signals all the way to the deepest layer of synapses in the network.
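For readers who want to reproduce such bands, the computation at one time point can be sketched as follows. The function name and the sample values are hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption, since the text does not say which estimator was used.

```python
import math
import statistics


def sem_band(trial_errors):
    """Return the lower and upper edges of a band centered on the mean, 2 SEM thick."""
    mean = statistics.mean(trial_errors)
    sem = statistics.stdev(trial_errors) / math.sqrt(len(trial_errors))
    return mean - sem, mean + sem


# Hypothetical performance errors from five trials at one time point.
errors = [0.10, 0.20, 0.30, 0.40, 0.50]
lower, upper = sem_band(errors)
```

Because the standard error shrinks with the square root of the number of trials, large trial-to-trial variance can still yield narrow, well-separated bands after 500 trials.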

Figure 3B shows that for feedback alignment also, depth-9 learning was clearly better than depth-8. Figure 3C shows the same for broadcast alignment. That is, these two algorithms also delivered useful teaching signals to the deepest parts of the net. Their error rates, though, were slightly higher than those of backpropagation: in this class of tasks, backpropagation was slightly better than feedback alignment, which in turn was slightly better than broadcast alignment.

Figure 3D shows that with the derivative-free algorithm, depth-9 learning was no better than depth-1 on average by the ends of the trials (curves for only those two depths are shown, to reduce clutter). The deeper version was faster and so gave better results early in the trials (near the left sides of the graphs). But neither worked as well as the deeper versions of the other three algorithms.

In summary, of the two candidate deep-learning algorithms compatible with LIF neurons, broadcast alignment and derivative-free learning, the former worked much better than the latter. Therefore, we chose broadcast alignment for implementation in spiking nets.

### 3.2 Broadcast Alignment in Spiking Networks

We tested an LIF version of broadcast alignment on the same task as in section 3.1. The target network was identical to that in section 3.1. The learning net had the same structure as in section 3.1 except that it contained only spiking neurons after the input layer. That is, in the learning network, the two neurons of the first layer represented sensory receptors and so had graded activity: their activities were real numbers, not necessarily 0s or 1s. All other neurons in the learning network were of the LIF type, namely all neurons in forward layers 2 through 10 and all the feedback neurons. We ran 100 trials, with a different target function in each trial.

### 3.3 High Dimensions

To show that the same principles still hold in higher-dimensional problems, we trained networks to recognize the handwritten digits in the MNIST database (LeCun, Bottou, Bengio, & Haffner, 1998). Again we started with nonspiking networks so we could compare all four algorithms: backpropagation, feedback alignment, broadcast alignment, and derivative-free learning. We considered two networks. One had three layers (including the input layer), with 784 input neurons representing the input image (i.e., the grayscale values of a 28-by-28 array of pixels), then 1000 neurons in the second layer, and 10 in the output layer. The other network had four layers (including input), with 784, 630, 370, and 10 neurons. We ran three trials of each algorithm in each architecture, and again we tested different depths of learning.

Table 2 summarizes the results. In the three-layer network with depth-2 learning (i.e., adjusting both layers of synapses), backpropagation correctly classified 98.56% (mean over the three trials) of the 10,000 images in the test set; feedback alignment managed 98.42%; broadcast alignment 97.67%; and derivative-free learning 96.12%. In the same three-layer network but with depth-1 (i.e., shallow) learning, backpropagation, feedback alignment, and broadcast alignment all managed 95.98% (because these three algorithms are identical in this setting), and derivative-free learning 95.29%. So the key finding was that derivative-free learning was again scarcely better than shallow learning, whereas broadcast alignment was again able to deliver useful teaching signals to upstream synapses.

| Algorithm | Network Depth | Learning Depth | Score (%) |
|---|---|---|---|
| BP, FA, BA | 3 | 1 | 95.98 |
| DF | 3 | 1 | 95.29 |
| BP | 3 | 2 | 98.56 |
| FA | 3 | 2 | 98.42 |
| BA | 3 | 2 | 97.67 |
| DF | 3 | 2 | 96.12 |
| BP | 4 | 3 | 98.60 |
| FA | 4 | 3 | 98.22 |
| BA | 4 | 3 | 97.64 |
| DF | 4 | 3 | 95.62 |
| LIF-BA | 3 | 1 | 90.49 |
| LIF-BA | 3 | 2 | 96.02 |
| LIF-BA | 4 | 3 | 97.05 |

Notes: BP: backpropagation; FA: feedback alignment; BA: broadcast alignment; DF: derivative-free learning. The first 10 rows show results of nonspiking networks; the last 3, LIF networks.

None of the algorithms did appreciably better in the four-layer network. Backpropagation managed 98.60%, feedback alignment 98.22%, broadcast alignment 97.64%, and derivative-free learning 95.62%. Most likely backpropagation was at the limit of what can be achieved without some form of regularization, such as convolution, dropout, or data augmentation. The others might have done better with devices such as cross-entropy loss and annealing, but we avoided those methods because they would have been complicated or controversial to include in the LIF network.

Turning now to the LIF networks, the three-layer net running broadcast alignment managed an average score of 96.02%, as shown by the blue curves in Figure 5. Specifically, these three curves depict three runs. In each run, the network learned from the 60,000 images in the MNIST training set. After every 1000 training examples, the network was tested on 100 test examples—100 images randomly drawn from a test set of 10,000 images that were never used for training, only for assessment. These test scores are plotted in the graph to show the network’s improvement. After 1.8 million training examples (30 passes through the training set), we tested the network on all 10,000 images in the test set and plotted its score as a horizontal line at the right side of the plot, though the three lines, for the three runs, are too close together to distinguish in the graph: they range from 95.97% to 96.11%.

With depth-1 learning (i.e., when synaptic adjustment was restricted to the third layer of the network), performance was not as good. The three runs (green curves in Figure 5) achieved final scores ranging from 90.26% to 90.82%, mean 90.49%. So the 96% achieved in the earlier tests (the blue curves) depended on synaptic adjustments in the upstream, second layer.

We also tested the four-layer network of 784, 630, 370, and 10 neurons. It had the same total number of neurons as the three-layer network, but fewer synaptic weights and far fewer cells and synapses in the shallower parts—the last and second-last layers. Nevertheless, it outperformed the three-layer version, achieving scores in the range 96.99% to 97.09% in its three runs (black curves in Figure 5), for a mean of 97.05%.
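The size comparison between the two architectures can be checked directly. The helper names are hypothetical, the networks are assumed to be fully connected between successive layers, and biases are not counted.

```python
def count_neurons(layer_sizes):
    """Total number of neurons across all layers, including the input layer."""
    return sum(layer_sizes)


def count_weights(layer_sizes):
    """Total number of synaptic weights between successive fully connected layers."""
    return sum(pre * post for pre, post in zip(layer_sizes, layer_sizes[1:]))


three_layer = [784, 1000, 10]
four_layer = [784, 630, 370, 10]

neurons = (count_neurons(three_layer), count_neurons(four_layer))
weights = (count_weights(three_layer), count_weights(four_layer))
```

Both networks contain 1794 neurons, while the four-layer network has 730,720 weights against 794,000 in the three-layer one. Its final layer of synapses holds only 3,700 weights, compared with 10,000 in the three-layer network.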

## 4 Discussion

We have shown that dynamic spiking networks can learn by applying a variant of the feedback alignment algorithm and replacing its activation-function derivative with the derivative of the expected activity with respect to drive. Deeper networks learn better than shallower ones, showing that with this method, useful teaching signals reach upstream layers.

Using the algorithm described in equations 2.7 to 2.9, 2.12, and 2.13, our four-layer networks scored 97% on MNIST, which, so far as we know, is the best score yet achieved through learning by any all-spiking network. In what follows, we relate our results to other recent discoveries involving spiking networks. In cases where these other studies also used the MNIST task, we will report their scores on it. But we emphasize that these different studies often had widely different aims and that most of them, like our own, were not concerned with setting records on MNIST but with demonstrating computational principles.

Several labs have looked into creating useful spiking networks not by training them directly but by training nonspiking networks and then translating the results into spiking nets. By this method, Diehl et al. (2015) created spiking networks that achieved 98.68% on MNIST and convolutional spiking nets that managed 99.12%. By similar methods, Eliasmith et al. (2012) and O’Connor et al. (2013) both constructed spiking networks that achieved 94%, and Neil and Liu (2014) managed 92%.

Other labs have devised spiking networks that do learn with one layer of plastic synapses. Beyeler et al. (2013) developed a network of 71,026 neurons that learned to score 92% on MNIST. Diehl and Cook (2015) achieved 91.9% with 2384 neurons and 95.0% with 7184. Jimenez Rezende and Gerstner (2014) trained networks to reproduce temporal patterns of spikes. Neftci et al. (2014) achieved 91.9% on MNIST with a restricted Boltzmann machine of 1324 stochastic spiking neurons. Brader et al. (2007) achieved 96.5% with just 934 neurons.

Few labs have considered deep learning in spiking networks. Bohte et al. (2002) developed the SpikeProp algorithm and used it to train three-layer networks on several tasks. But SpikeProp is not fully spiking: its forward layers spike, but its feedback signals are real valued. It also requires weight transport, as backpropagation does. In contrast, Henderson et al. (2015) used fixed feedback weights and only spiking neurons in both the forward and feedback paths, and scored 87.4% on a subset of MNIST with a four-layer network of 4058 cells.

In many of these other cited studies, as in our own, the MNIST scores were achieved without the benefit of several devices used in the best-performing nonspiking networks: no cross-entropy, no weight decay, no adaptive gradients, no dropout (Srivastava et al., 2014), no validation set to monitor for overfitting, no data augmentation (Ciresan et al., 2010), no annealing or variation of momentum (Sutskever, 2013), and no convolution (Fukushima, 1979, 2013; Krizhevsky et al., 2012; LeCun et al., 1998; Sermanet et al., 2013). And the networks learned for only a few epochs rather than thousands. So there is scope for improvement.

Our results on deep learning are biologically interesting because it seems likely that at least some of the brain’s learning circuits are multilayered. In the best-studied learning circuit in motor physiology, the cerebellum, most research has focused on a single layer of synapses—those between the parallel fibers and Purkinje cells (Sakurai, 1987)—but other synapses, from mossy fibers onto cerebellar granule cells are also plastic (D’Angelo & De Zeeuw, 2008). Therefore, this system appears to have at least two layers and may form part of a deeper circuit including deep cerebellar or brainstem nuclei (Lisberger, 1994; Medina & Mauk, 2000).

Theoretically, deep learning has advantages and disadvantages. Its main drawback is its complexity. In networks with one-layer learning, such as support vector machines, gaussian processes, and other kernel methods (Liu, Príncipe, & Haykin, 2010), there is a simple, usually linear relation between the network’s output errors and all its adjustable weights. As a result the risk surface (e.g., the graph of squared error as a function of the weights) is convex, sloping down smoothly in all dimensions toward a single optimum. In networks where two or more layers learn, there is nonlinear processing between *e* and some of the weights. This nonlinearity complicates the risk surface and also means that information about the form of the nonlinearity must be delivered to upstream synapses, as in equations 2.8 and 2.11.

On the positive side, deep learning makes networks more flexible by reducing nonoptimized parameters. That is, a network with just one layer of learning almost always needs at least one additional processing layer upstream for expansion recoding (Liu et al., 2010). If that upstream layer cannot learn, then its synapses stay frozen forever at suboptimal values (or decay or otherwise change through some process other than learning). It is true that kernel algorithms have clever ways to initialize frozen synapses, and natural selection may have done this for our brains, but even so, a network with all its synapses unfrozen will be more adaptable.

Another advantage is thought to be that deep networks contain a kind of hierarchy of layers that reflects the hierarchies in many stimuli (Saxe, McClelland, & Ganguli, 2013); for example, many images show objects made of parts that are made of smaller parts (Yamins et al., 2014). In other words, deep networks perform a useful kind of regularization (Bengio & LeCun, 2007; Ba & Caruana, 2014).

Our findings show that multilayer networks of dynamic spiking neurons can learn by mechanisms similar to the backpropagation algorithm that is used with the static, nonspiking artificial neurons of the deep-learning literature. But the feedback calculations in our method, in equations 2.7 and 2.8, are simpler than those in backpropagation.

In all the simulations in Figures 3, 4, and 5, we used a momentum value of 0.9. We chose that value because it is common in machine learning; at present, we have no biological justification for it except that it works. But it may be that the precise value of momentum is not critical. With broadcast alignment, we have observed that even momentum-free four-layer networks sized like those in Figure 5 can still achieve 97% on MNIST (results not shown).
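
For readers unfamiliar with momentum, the following is a minimal sketch of the standard machine-learning form of the update, in which a running velocity accumulates past gradients. The function name, the learning rate, and the toy quadratic loss are assumptions made for illustration; the sketch is not the update rule of this letter's spiking networks.

```python
def momentum_step(weight, velocity, gradient, learning_rate=0.01, momentum=0.9):
    """One classical momentum update for a single scalar weight.

    learning_rate is a hypothetical value; momentum=0.9 matches the value
    quoted in the text. Setting momentum=0.0 gives plain gradient descent.
    """
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


# Toy usage: minimize the loss 0.5 * w**2, whose gradient is simply w.
w, v = 1.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, gradient=w)
```

Setting `momentum=0.0` in this sketch corresponds to the momentum-free case mentioned above.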

In equation 2.8, we assumed that the learning mechanism within each neuron has information about the derivative of its own activation function. We tried removing that assumption, with our derivative-free algorithm, but learning suffered badly. Hence, it appears that deep learning in a spiking network is more effective if each neuron’s learning reflects its own nonlinearity in this sense (i.e., if neurons respond more strongly to error signals when their drive is weaker). We suggest that real neurons may show a similar dependence, on the grounds that it would be very useful for deep learning.
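
The contrast between the two variants can be sketched as follows. This is only an illustration under stated assumptions: the surrogate derivative used here (that of the softsign function) is a stand-in chosen because it shrinks as the drive grows in magnitude, and it is not the function used in equation 2.8; the function names are hypothetical.

```python
def surrogate_derivative(drive):
    """Stand-in derivative: that of softsign, drive / (1 + |drive|).

    Assumed for illustration only; it is largest at zero drive and
    decreases as the drive magnitude grows.
    """
    return 1.0 / (1.0 + abs(drive)) ** 2


def teaching_signal(feedback, drive, use_derivative=True):
    """Scale an incoming feedback signal by the cell's own nonlinearity.

    With use_derivative=False the gain is fixed at 1, which mimics the
    derivative-free variant described in the text.
    """
    gain = surrogate_derivative(drive) if use_derivative else 1.0
    return feedback * gain


# The same feedback produces a smaller teaching signal at a stronger drive.
weak_drive_signal = teaching_signal(2.0, drive=0.0)
strong_drive_signal = teaching_signal(2.0, drive=1.0)
derivative_free_signal = teaching_signal(2.0, drive=1.0, use_derivative=False)
```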

This letter has addressed three computational issues but deferred many other questions as topics for future study. For instance, we have treated synapses as simple, scalar weights that multiply their incoming signals, whereas real synapses are more complex. We have also ignored issues of timing: like most other neural network simulations, ours send their feedback signals to all learning cells simultaneously and without delay, and all their variables are updated abruptly and then stay constant for the duration of one time step. Further, we have described no biochemical implementations for the computations in our model, including those of momentum and the intracellular teaching signal in equation 2.8. Also in equation 2.8, it remains to be seen how precisely the derivative of the activation function must be represented. We have shown that if it is omitted entirely (i.e., assumed to be 1), as in our derivative-free algorithm, then learning is poor. But if a cell’s estimate of the derivative were only slightly inaccurate, then the consequences might be less extreme. Even an inexact estimate might make a network learn better than it would with the derivative-free algorithm. There are also open questions about the variables that might feed into the calculation of the derivative in equation 2.8. In equation 2.11, we based the computation of the derivative on the drive variable; that is, we proposed that the cell estimates its derivative based on some intracellular correlate of its drive. But the derivative might instead be estimated based on , perhaps directly or perhaps by first filtering to yield an estimate of , and there are other ways neurons might estimate it (Hinton, 2016).

## Appendix A: Plotting and

To compute the data in Figure 2A, we presented a series of 1000 drives, all between −50 and 50, to a single LIF neuron, with each drive applied for 0.1 s. We ignored the neuron’s activity over the first 0.02 s, because the neuron was dynamic and its activity was settling over that time. Then we recorded 320 samples of the drive and the activity over the remaining 0.08 s. In all, then, we recorded 320,000 input-output pairs (drive, activity), 320 for each of the 1000 drives. The 500 blue dots in Figure 2A are a random subset of those pairs.
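
The sampling protocol above can be sketched in a few lines. The counts and durations come from the text; everything else is an assumption made for illustration: the 0.25 ms time step (implied by 320 samples in 0.08 s), the uniform random choice of drives, the generic LIF parameters `tau` and `threshold`, and the use of a 0/1 spike indicator as the recorded activity.

```python
import random

DT = 0.00025            # s; assumed step, implied by 320 samples in 0.08 s
STEPS_PER_DRIVE = 400   # 0.1 s per drive at the assumed step
SETTLE_STEPS = 80       # first 0.02 s is discarded while activity settles
N_DRIVES = 1000


def lif_step(v, drive, dt=DT, tau=0.02, threshold=1.0):
    """One Euler step of a generic LIF neuron (tau, threshold are assumed)."""
    v += dt * (drive - v) / tau
    if v >= threshold:
        return 0.0, 1   # emit a spike and reset the membrane variable
    return v, 0


rng = random.Random(0)
v = 0.0                 # state carries over between drives, hence the settling
pairs = []
for _ in range(N_DRIVES):
    drive = rng.uniform(-50.0, 50.0)   # assumed: drives drawn uniformly
    for step in range(STEPS_PER_DRIVE):
        v, spike = lif_step(v, drive)
        if step >= SETTLE_STEPS:
            pairs.append((drive, spike))

# Random subset of the recorded pairs, as plotted in a scatter figure.
subset = rng.sample(pairs, 500)
```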

## Appendix B: Initialization

We initialized the network weights and biases using techniques closely analogous to those used in computer science. The mechanisms used in the brain are likely quite different and outside the scope of this letter. Our methods were simply a fast way to get weights and biases that prevented the forward and feedback signals from vanishing or saturating.

*rand* had a uniform distribution over the range [0, 1]. Given these values for and assuming the s in layer have the desired mean and standard deviation , it follows that the s in layer will have those same statistics. That is, this initialization ensures that at least at the start of the run, the drives in all layers have reasonable values—neither too small nor too large on average, and varying over a reasonable range when the network inputs vary.

Our values for and imply that E() will have a mean of 0.64 and a standard deviation of 0.8. Therefore, we ensured that network inputs had these statistics; for example, in MNIST trials, the input vectors were preprocessed so all 784 pixels had the same mean value of 0.64, across all the training images.

## Acknowledgments

We thank Sara Scharf for comments. This study was supported by the Natural Sciences and Engineering Research Council of Canada grant 391349-2010.