We show that Langevin Markov chain Monte Carlo inference in an energy-based model with latent variables has the property that the early steps of inference, starting from a stationary point, correspond to propagating error gradients into internal layers, similar to backpropagation. The backpropagated error is with respect to output units that have received an outside driving force pushing them away from the stationary point. Backpropagated error gradients correspond to temporal derivatives with respect to the activation of hidden units. These lead to a weight update proportional to the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate. Simulations and a theoretical argument suggest that this rate-based update rule is consistent with those associated with spike-timing-dependent plasticity. The ideas presented in this article could be an element of a theory for explaining how brains perform credit assignment in deep hierarchies as efficiently as backpropagation does, with neural computation corresponding to both approximate inference in continuous-valued latent variables and error backpropagation, at the same time.
It has been hypothesized numerous times (Hinton & Sejnowski, 1986; Friston & Stephan, 2007; Berkes, Orban, Lengyel, & Fiser, 2011) that, given a state of sensory information (current and past inputs), neurons are collectively performing inference, that is, moving toward configurations that better “explain” the observed sensory data and so minimize the corresponding “surprise” and maximize a form of coherence between sensory observations and internal representations. We can think of the configuration of internal neurons (hidden units or latent variables) as an “explanation” for the observed sensory data.
Under this hypothesis, when an unexpected signal arrives at visible neurons, the other neurons will change their state (from one that is near a stochastic equilibrium) so as to better reflect that change in input.1 If this error signal is slowly driving visible neurons toward the externally observed value, this creates a perturbation for the whole network. In this article, we consider what happens early in this process, when the perturbation due to the incorrect prediction of visible neurons is propagated toward inner areas of the brain (hidden units and hidden layers). As one of the two main contributions of this work, we show how the propagation of this perturbation is mathematically equivalent to the propagation of activation gradients by the backpropagation algorithm in a deep neural network.
This result assumes that latent variables are continuous valued (unlike in the vanilla Boltzmann machine), making the system an energy-based model similar to the continuous Hopfield net (Hopfield, 1984), defined by an energy function over both continuous visible and hidden units. It also assumes that the neurons are leaky integrators with injected noise. As a consequence, neural computation corresponds to gradually going down the energy while adding noise, which corresponds to running a Langevin Markov chain Monte Carlo (MCMC) as inference mechanism.
By applying the chain rule, the backpropagated activation gradients can be used to estimate the error gradient with respect to the weights of the incoming connections. This leads to a weight update performing gradient descent on the output prediction error which is similar to a temporal form of contrastive Hebbian learning. The update is proportional to the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate. This brings us to the second main contribution of this work. Biological experiments (Markram & Sakmann, 1995; Gerstner, Kempter, van Hemmen, & Wagner, 1996) suggest that synaptic change in several types of neurons and stimulation contexts depends on the relative timing of presynaptic and postsynaptic spikes. Learning rules that mimic this observation are called spike-timing dependent plasticity (STDP) rules. We show that our proposed weight update that depends only on firing rates and their temporal derivatives is consistent with the STDP observations, according to a series of simulations. This result provides an interesting machine learning interpretation for STDP, since artificial neural networks are generally cast in terms of firing rates.
This work is only a stepping stone. Several points still need to be elucidated before a complete theory of learning, inference, and credit assignment is elaborated that is both biologically plausible and makes sense from a machine learning point of view for whole networks (with global optimization of the whole network and not being limited to learning of visible neurons that receive a target). See section 6 for a discussion.
We start by explaining the connection between the leaky integrator-based neural computation and Langevin MCMC in energy-based models in section 2. Based on this, we show in section 3 that, starting from a fixed point, early inference in such an energy-based model performs backpropagation of prediction error gradients. In section 4, we analyze the resulting weight update and show that it mimics STDP. Section 5 discusses related work and open questions, and directions of future research are given in section 6. We conclude in section 7.
2 Neural Computation Does Inference: Going Down the Energy
We consider the hypothesis that a central interpretation of neural computation (what neurons do, on a short timescale at which weights can be considered fixed) is that it is performing iterative inference. Iterative inference means that the hidden units of the network are gradually changed toward configurations that are more probable given the sensory input and according to the current model of the world associated with the parameters of the model. In other words, they are approximately moving toward configurations that are more probable under some probability distribution associated with the model, and eventually sampling from that distribution.
Before making the connection between Boltzmann machines or energy-based models and backpropagation, let us see in more mathematical detail how neural computation could be interpreted as inference.
2.1 Leaky Integrator Neurons
For this purpose, consider the classical leaky integrator neural equation. Let $s_t$ represent the state of the system at time $t$, a vector with one element per unit (e.g., representing a single neuron), where $s_{t,i}$ is a real-valued quantity associated with the $i$th unit, corresponding to a time-integrated voltage potential. Let us denote $v_t$ for the state of the visible units (i.e., the subset of the elements of $s_t$ representing the externally driven input) and $h_t$ for the state of the hidden units (i.e., the remaining elements of $s_t$, such that $s_t = (v_t, h_t)$). Furthermore, let $f$ be the function that computes a new complete state given its previous state, with $f_v$ and $f_h$ denoting the parts of $f$ that, respectively, output the new states of the visible and hidden units.
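As a concrete, deliberately simplified sketch of these dynamics in NumPy: the names `rho` and `leaky_step`, the hard-sigmoid choice of firing-rate nonlinearity, and the step size `eps` are our illustrative assumptions, not notation taken from the article.

```python
import numpy as np

def rho(s):
    # Illustrative hard-sigmoid firing-rate nonlinearity, saturating at -1 and 1.
    return np.clip(s, -1.0, 1.0)

def leaky_step(s, W, eps=0.1, noise_std=0.0, rng=None):
    """One leaky-integrator update: each unit's state leaks toward 0 and is
    driven by a weighted sum of presynaptic firing rates. With noise_std > 0
    this becomes a Langevin-like noisy step on the state."""
    if rng is None:
        rng = np.random.default_rng(0)
    drive = W @ rho(s)          # drive from presynaptic firing rates
    ds = -s + drive             # leak toward 0 plus external drive
    noise = noise_std * rng.standard_normal(s.shape)
    return s + eps * ds + np.sqrt(eps) * noise

# Tiny example: two units with symmetric weights (as the energy-based view requires).
W = np.array([[0.0, 0.5], [0.5, 0.0]])
s = np.array([0.2, 0.8])
s_next = leaky_step(s, W, eps=0.1)   # noise-free step
```

With `noise_std = 0` the update is the deterministic leaky integration; adding Gaussian noise at each step turns the same dynamics into the Langevin-style stochastic inference discussed next.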
2.2 Machine Learning Interpretation: Leaky Integration as Langevin MCMC
Why can equation 2.1 be seen as performing inference, moving toward states that are more probable under the model's distribution? Note, first, that the update in equation 2.1 represents a guess for a new configuration, obtained by moving in a noisy direction of movement whose noise-free component is a direction of descent.
2.3 A Possible Energy Function
Note that from a biological perspective, the nonlinear transformation $\rho$ can be seen as modeling the monotonically increasing dependency of the actual firing rate (or, equivalently, the probability of firing) on the integrated activity (the expected membrane potential, averaged over the random effects of both pre- and postsynaptic spikes).
This formula for the driving input is similar to the usual weighted sum of firing rates, except for the novel factor $\rho'(s_i)$, which is 1 when $s_i$ lies in the nonsaturated region of $\rho$ and 0 otherwise. This suggests that when neuron $i$ is saturated (either shut off or firing at its maximal rate), external inputs have no impact on its state. The only term that remains in the neural update equation, equation 2.1, is the one that drives the state toward 0, bringing it out of the saturation region and back into a regime where the neuron is sensitive to outside feedback, so long as 0 is not itself a saturated value (which is guaranteed by the condition $\rho'(0) > 0$). This idea is developed further below.
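A minimal sketch of this gating effect, assuming a hard-sigmoid $\rho$ that saturates at $-1$ and $1$ (so that 0 lies in the nonsaturated region, as required above); the function names and weight values are illustrative:

```python
import numpy as np

def rho(s):
    # Illustrative hard sigmoid: rates saturate at -1 and 1.
    return np.clip(s, -1.0, 1.0)

def rho_prime(s):
    # Derivative of the hard sigmoid: 1 inside the linear region, 0 when saturated.
    return (np.abs(s) < 1.0).astype(float)

def driving_input(s, W):
    # Weighted sum of presynaptic firing rates, gated by rho'(s_i):
    # a saturated neuron ignores its external input entirely.
    return rho_prime(s) * (W @ rho(s))

W = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
s = np.array([0.5, 1.5, -1.3])   # units 1 and 2 are saturated
d = driving_input(s, W)          # only unit 0 feels its input
```

Only the leak term can then move the saturated units back toward 0, where they become sensitive to their inputs again.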
2.3.1 Fixed-Point Behavior
3 Link to Backpropagation
We are now ready to present the first important result of this article: a link between neural computation as inference in an energy-based model and backpropagation of prediction error gradients.
3.1 Propagation of Perturbations
Consider what happens when a network such as we have described is in an equilibrium, that is, when the state is at a fixed point. At that point, according to equation 2.11, the average gradient over the noise of the energy is 0 and weight updates are also 0 on average.
3.2 Stochastic Gradient Descent Weight Update
The above result is consistent with and inspired by the idea previously proposed by Hinton (2007) that temporal change can encode backpropagated gradients. What would it take for the temporal change of the hidden activations at a given layer to turn into a stochastic gradient descent (SGD) weight update with respect to the prediction error?
With the above STDP-compatible learning rule, the change in weights due to the initial perturbation would be approximately proportional to the backpropagation update. However, note that the multiplicative factor attached to units at a given layer makes the initial changes much slower for the lower layers. This is because the leaky integrator neurons have not had time to integrate the information yet, so practically, it will take on the order of the time constant (from equation 2.3) times the depth of the layer for the change in its state to become significant, unless we adjust the per-layer learning rates accordingly. Interestingly, as Scellier and Bengio (2016) show, this factor disappears if we go beyond the initial perturbation propagation and allow the network to settle after nudging the output units toward their targets. This allows one to show that the resulting weight update actually corresponds to stochastic gradient descent on the prediction error, provided the weights are also constrained to remain symmetric: $W_{ij} = W_{ji}$.
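In discrete time, the rate-based update discussed above can be sketched as follows; the finite-difference approximation of the temporal derivative, the learning rate, and the function name are our illustrative assumptions:

```python
def weight_update(pre_rate, post_rate_now, post_rate_prev, lr=0.01, dt=1.0):
    """Rate-based, STDP-compatible update: proportional to the presynaptic
    firing rate times the temporal rate of change of the postsynaptic
    firing rate (here approximated by a simple finite difference)."""
    d_post = (post_rate_now - post_rate_prev) / dt
    return lr * pre_rate * d_post

# Rising postsynaptic rate -> potentiation; falling rate -> depression.
dw_up = weight_update(pre_rate=0.8, post_rate_now=0.6, post_rate_prev=0.4)
dw_down = weight_update(pre_rate=0.8, post_rate_now=0.4, post_rate_prev=0.6)
```

Note how the sign of the update is carried entirely by the direction of change of the postsynaptic rate, which is the key to the STDP connection developed in the next section.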
4 Link to Spike-Timing-Dependent Plasticity
We now present the second main contribution of this work: we analyze the weight update rule given by equation 3.18 in more detail and establish a connection to STDP. With the opposite sign, such a weight update rule was proposed by Hinton and McClelland (1988), well before the STDP behavior was reported. A similar rule, studied by Xie and Seung (2000), also drew a link to STDP. They started by assuming a particular pattern relating spike timing and weight change and then showed that the resulting weight change could be approximated by the product of the presynaptic firing rate and the temporal rate of change of the postsynaptic firing rate. Instead, we go in the other direction, showing that if the weight change is proportional to the product of the presynaptic firing rate (or simply the presence of a spike) and the temporal rate of change of the postsynaptic activity, then we recover a relationship between spike timing and weight change that has the precise characteristics of the one observed experimentally by biologists. We present both an easy-to-understand theoretical justification for this result (see section 4.1), as well as simulations that confirm it (see section 4.2).
4.1 Spike-Timing-Dependent Plasticity
Spike-timing-dependent plasticity is a central subject of research in synaptic plasticity. However, much more research is needed to solidify the links between STDP and a machine learning interpretation of it at the scale of a whole network, that is, at the scale of a network with “hidden layers” which need to receive a useful training signal. (See Markram, Gerstner, & Sjöström, 2012, for a recent review of the neuroscience research on STDP.)
We present here a simple view of STDP that has been observed at least in some common neuron types and preparations: there is a weight change if there is a presynaptic spike in the temporal vicinity of a postsynaptic spike. This change is positive if the postsynaptic spike happens just after the presynaptic spike (and larger if the timing difference is small) and negative if it happens just before (and, again, larger if the timing difference is small, on the order of a few milliseconds). This is illustrated in Figure 1 from Bi and Poo (2001) based on biological data. The amount of change decays to zero as the temporal difference between the two spikes increases beyond a few tens of milliseconds. We are thus interested in this temporal window around a presynaptic spike during which a postsynaptic spike, before or after the presynaptic spike, induces a change of the weight.
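The timing dependence just described is often summarized by an exponential window; the following sketch uses illustrative amplitudes and time constant rather than values fitted to the biological data:

```python
import math

def stdp_kernel(delta_t, a_plus=1.0, a_minus=1.0, tau=20.0):
    """Classical exponential summary of the STDP window. delta_t is the
    post-minus-pre spike-time difference in milliseconds: potentiation when
    the postsynaptic spike follows the presynaptic one, depression when it
    precedes it, both decaying to zero for large |delta_t|."""
    if delta_t > 0:
        return a_plus * math.exp(-delta_t / tau)
    if delta_t < 0:
        return -a_minus * math.exp(delta_t / tau)
    return 0.0
```

This is the shape that the rate-based rule of equation 3.18 should reproduce if the connection to STDP holds.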
Keep in mind that the above pattern of spike-timing dependence is only one aspect of synaptic plasticity. (See Feldman, 2012, for an overview of aspects of synaptic plasticity that go beyond spike timing, including firing rate, as one would expect from this work, but also synaptic cooperativity—nearby synapses on the same dendritic subtree—and depolarization—due to multiple consecutive pairings or spatial integration across nearby locations on the dendrite, as well as the effect of the synapse’s distance to the soma.)
4.1.1 Rate-Based Aspect of STDP
A hypothesis inspired by Xie and Seung (2000) and Hinton (2007) is that the STDP weight change can be associated with the temporal rate of change of postsynaptic activity, as a proxy for its association with post minus presynaptic spike times. Both of the above contributions focus on this rate aspect of neural activity, and this is also the case of the synaptic update rule, equation 3.18, studied here.
4.1.2 Why It Works
Why does the proposed rate-based weight update mimic the STDP behavior observed in biological experiments? Consider a rising postsynaptic activity level, as in Figure 2, and a presynaptic spike occurring somewhere during that rise. Let us assume a Poisson process for the postsynaptic spike as well, with a rate proportional to the postsynaptic activity. Under this assumption, the rising postsynaptic activity makes postsynaptic spikes more likely in a fixed time window following the presynaptic spike than in a window of the same length preceding it. Therefore, if only one spike happens in the whole time interval, it is more likely to occur after the presynaptic spike, yielding a positive spike-timing difference and thus a positive weight change. The situation is symmetrically reversed if the activity level is decreasing (i.e., has a negative temporal derivative), which makes negative weight changes more likely.
Furthermore, the stronger the slope, the greater this effect. A larger slope also reduces the expected relative spike timing between the presynaptic spike and stochastically occurring postsynaptic spikes, which for a strong slope tend to occur just after or just before the presynaptic spike. When spikes occur on both sides with about the same time difference, their effects on the weight cancel each other.
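This qualitative argument can be checked with a small Monte Carlo experiment: under a Bernoulli (discrete-time Poisson) spiking assumption with a rising rate, postsynaptic spikes fall after the midpoint of the window (where we imagine the presynaptic spike occurring) more often than before it. All constants here are our illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
T = 40                                 # 20 steps before and 20 after the pre spike
rate = np.linspace(0.01, 0.11, T)      # rising postsynaptic rate (spike prob. per step)

trials = 2000
after_more = 0
for _ in range(trials):
    spikes = rng.random(T) < rate      # Bernoulli approximation of a Poisson process
    if spikes[T // 2:].sum() > spikes[:T // 2].sum():
        after_more += 1

# Fraction of trials in which post spikes cluster after the presynaptic spike.
frac = after_more / trials
```

A rising rate yields `frac` well above 0.5, that is, a bias toward positive spike-timing differences and hence potentiation, exactly as argued above; a falling rate gives the mirror image.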
4.2 Simulation Results
We simulate random variations in the presynaptic neural firing rate, as well as random variations in the postsynaptic firing rate induced by an externally driven voltage. By exploring many configurations of variations and levels at the pre- and postsynaptic sides, we hope to cover the possible natural variations. The Poisson process underlying spike generation (as described in section 4.1) is approximated by a discrete-time simulation with a Bernoulli draw of the binary spike-versus-no-spike decision within the time interval of interest. That is, we generate and record pre- and postsynaptic spikes, sampled at each discrete time step according to a Bernoulli draw with probability proportional to the pre- and postsynaptic firing rates, respectively, and record the postsynaptic state as well, in order to implement either equation 3.18 or a classical nearest-neighbor STDP update rule (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Morrison, Diesmann, & Gerstner, 2008; Van Rossum, Bi, & Turrigiano, 2000).2
For every presynaptic spike, we consider a window of 20 time steps before and after. If we have one or more postsynaptic spikes in both the left and right windows, no postsynaptic spike at all, or a single postsynaptic spike that coincides with the presynaptic spike, the weight is not updated. Otherwise we measure the time difference between the closest postsynaptic spike (the nearest neighbor) and the presynaptic spike and compute the weight change using the current variable values.
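The windowing logic described above can be sketched as a small helper; the name `nearest_post_dt` and the integer-time representation are our assumptions, not the article's released scripts:

```python
import numpy as np

def nearest_post_dt(pre_t, post_times, window=20):
    """Spike-timing difference used by the simulation described above: the
    offset (post - pre) of the nearest postsynaptic spike within +/- window
    steps of a presynaptic spike. Returns None in the 'no update' cases:
    post spikes on both sides, no post spike at all, or a coincident spike."""
    dts = np.asarray(post_times, dtype=int) - pre_t
    dts = dts[np.abs(dts) <= window]
    if dts.size == 0:
        return None                    # no postsynaptic spike in the window
    if (dts > 0).any() and (dts < 0).any():
        return None                    # spikes in both half-windows: skip
    dt = int(dts[np.argmin(np.abs(dts))])
    return None if dt == 0 else dt
```

The returned timing difference is then fed either to equation 3.18 (via the recorded state variables) or to the nearest-neighbor STDP rule for comparison.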
To compute appropriate averages, 500 random sequences of rates are generated, each of length 160 time steps, and 1000 randomly sampled spike trains are generated according to these rates.
For measuring the effect of weight changes, we measure the average squared rate of change (or, rather, its discrete-time finite-difference approximation) in two conditions: with weight changes (according to equation 3.18) and without.
An example of spike sequence, the integrated postsynaptic activity, and the corresponding weight values is illustrated in Figure 3.
Figure 4 shows the results of these simulations, comparing the weight change obtained at various spike timing differences for equation 3.18 (left) and for the nearest-neighbor STDP rule (right), both matching the biological data well (see Figure 1).
Figure 5 shows that the two update rules are strongly correlated, in the sense that for a given amount of weight change induced by one rule, we observe, on average, a linearly proportional weight change by the other rule.
5 Related Work
An important inspiration for this work is the idea proposed by Hinton (2007) that brains can implement backpropagation by using temporal derivatives to represent activation gradients and the suggestion that combining this assumption with STDP would approximately yield SGD on the synaptic weights.
The idea of neural computation corresponding to a form of stochastic relaxation toward lower energy configurations is, of course, very old—for example with the Boltzmann machine (Hinton & Sejnowski, 1986) and its Gibbs sampling procedure. (For more recent work in this direction, see also Berkes et al., 2011.) What differs here from the Boltzmann machine is that we consider the state space to be continuous (associated with the expected voltage potential, integrating out the random effects due to spikes) rather than discrete and that we consider very small steps (going down the gradient of the energy), which is more like a Langevin MCMC, rather than allowing each neuron to stochastically jump with higher probability to its optimal state, given its neighbors’ configuration, which is what Gibbs sampling does.
The closest work to the STDP-related weight update proposed here is probably that of Xie and Seung (2000), which we have already mentioned. Notable additions brought by our analysis include demonstrating that the relationship of spike timing to weight change is a consequence of equation 3.18, rather than the other way around (along with a small difference in which quantity's temporal derivative enters the update).
The proposed rule can be contrasted with theoretical synaptic learning rules, which tend to be based on the Hebbian product of pre- and postsynaptic activity, such as the BCM rule (Bienenstock, Cooper, & Munro, 1982; Intrator & Cooper, 1992). The main difference is that the rule studied here involves the temporal derivative of the postsynaptic activity rather than the actual level of postsynaptic activity.
There are, of course, many other papers on theoretical interpretations of STDP, and there are many references in Markram et al. (2012). But more work is needed to explore the connection of STDP to an unsupervised learning objective that could be used to train not just a single-layer network (like PCA and traditional Hebbian updates) but also a deep unsupervised model. Many approaches (Fiete & Seung, 2006; Rezende & Gerstner, 2014) rely on variants of the REINFORCE algorithm (Williams, 1992) to estimate the gradient of a global objective function (basically by correlating stochastic variations at each neuron with the changes in the global objective). Although this principle is simple, it is not clear that it will scale to very large networks due to the linear growth of the variance of the estimator with the number of neurons, which makes it tempting to explore other avenues.
6 Open Questions and Future Work
This work is only a stepping-stone on the way to a complete theory of learning that is both biologically plausible and makes sense from a machine learning perspective. Many open questions remain to be answered, especially on the side of biological plausibility.
If the energy function proposed in equation 2.7 is closer to a biological truth than the energy defined for continuous Hopfield networks (Hopfield, 1984), we should see that the average soma voltage potential is insensitive to changes in the input firing rates when the neuron is saturated (completely turned off or maximally turned on) and synaptic weight changes should also vanish under this condition, as seen by inspection of equation 3.18. It would clearly be interesting to test these predictions in actual biological experiments.
Another open question is how to reconcile the need for symmetric weights when we introduce an energy function with the fact that $W_{ij}$ and $W_{ji}$ are stored at two physically different places in biological neurons. An encouraging observation is that earlier work on autoencoders empirically showed that even when the forward and backward weights are not tied, they tend to converge to symmetric values, and in the linear case the minimization of reconstruction error automatically yields symmetric weights (Vincent, Larochelle, Lajoie, Bengio, & Manzagol, 2010). Another encouraging piece of evidence, also linked to autoencoders, is the theoretical result from Arora, Liang, and Ma (2015), showing that the symmetric solution minimizes the autoencoder reconstruction error between two successive layers of rectifying (ReLU) units. In addition, Lillicrap et al. (2014) show that even when the feedback weights do not match the feedforward weights, backpropagation still works because the feedforward weights end up aligning with the feedback weights through feedback alignment. This suggests that the strict weight symmetry condition can be relaxed without hurting the learning procedure.
While the work we have presented focuses on supervised learning, much remains to be done to obtain a complete probabilistic theory of unsupervised learning that is consistent with STDP. We believe that we have put interesting ingredients in place and that some of our ideas here could be helpful for building such a theory.
Many open questions also remain in the field of neural plasticity. STDP describes only one view of synaptic plasticity, and it may differ across neuron types and preparations. A particularly interesting direction for future work concerns the statistical relationships among more than two spikes induced by the proposed STDP rule. It would be interesting to compare the statistics of spike-timing triplets or quadruplets and the associated weight changes under the proposed rule against the corresponding biological observations (Froemke & Dan, 2002).
Future work should also attempt to reconcile the theory with neuroscience experiments showing more complex relationships between spikes and weight change, involving factors other than timing, such as firing rate, of course, but also synaptic cooperativity and depolarization (Feldman, 2012). In addition to these factors, some synaptic changes do not conform to the STDP pattern, in either sign (anti-Hebbian STDP) or nature (e.g., incorporating a general Hebbian element as in the BCM rule). In this regard, although this work suggests that some spike-timing behavior may be explained purely in terms of the trajectory of firing rates, it remains an open question whether (and where) precise spike timing is essential for explaining network-level learning behavior from a machine learning perspective. It is interesting to note in this regard how the BCM rule was adapted to account for the relative timing of spike triplets (Gjorgjieva, Clopath, Audet, & Pfister, 2011).
In artificial neural networks, communication between neurons usually corresponds to real-valued messages. While this works very well for artificial neurons, it does not fit the biological counterpart. However, connections can be drawn between these real-valued messages and spiking (i.e., binary) messages. A real value $p \in [0, 1]$ can be sent with binary messages by sending the value 1 a fraction $p$ of the time and the value 0 the rest of the time, so that the message has the right expected value. Recent work has actually shown that stochastically quantizing the activations of neural networks still allows them to be trained (Hubara, Courbariaux, Soudry, El-Yaniv, & Bengio, 2016). As Hinton (2016) explained, sending noisy binary messages drawn from a Poisson process is similar to dropout, a powerful regularizer that allows one to train networks with more parameters than the number of available examples.
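A minimal sketch of this stochastic binary encoding (function name and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_messages(p, n):
    """Encode a real value p in [0, 1] as n stochastic binary messages:
    each message is 1 with probability p, so the empirical mean of the
    messages converges to p (the 'right expected value' property)."""
    return (rng.random(n) < p).astype(float)

msgs = binary_messages(0.3, 100_000)
estimate = msgs.mean()   # close to 0.3
```

The receiver never sees a real number, only a spike train, yet averaging over time (or over many synapses) recovers the intended value.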
7 Conclusion

This work presents a piece of a theory that aims at closing the gap between learning in artificial and biological neural networks. We started by showing that neural updates based on leaky integration correspond to Langevin MCMC steps in an energy-based model with continuous hidden variables. Thus, neural computation can be seen as performing probabilistic inference in the sense of moving toward states that have higher probability under the energy-based probability distribution associated with the model. Starting from a stationary point and letting visible units receive an outside driving force pushing them toward some target value, the early steps of neural computation correspond to propagating error gradients into internal layers, similar to backpropagation in supervised artificial neural networks. The propagated error gradients can be turned into a weight update that performs stochastic gradient descent, minimizing the difference between the state of the visible units and the target value. Scellier and Bengio (2016) have in fact shown through experiments that it is possible to train deep supervised neural networks using this approach. They have also extended the theoretical framework presented here to prove that after the output units have been nudged toward a better value (according to the prediction error) and the dynamics have settled very near a fixed point, applying a contrastive Hebbian update similar to the one presented here corresponds to gradient descent on the prediction error.
The resulting weight update is proportional to the rate of change of postsynaptic activity times the presynaptic activity. Interestingly, as we have shown through simulations and a qualitative argument, it yields a behavior similar to the STDP observations in the brain that depend on spike timing differences. On top of underlining the biological plausibility of the analyzed learning process, this result provides an interesting machine learning interpretation for STDP, since artificial neural networks are generally cast in terms of firing rates.
We thank Benjamin Scellier, Dong-Hyun Lee, Jyri Kivinen, Jorg Bornschein, Roland Memisevic, and Tim Lillicrap for feedback and discussions, as well as NSERC, CIFAR, Samsung, and Canada Research Chairs for funding.
1. Visible neurons are in contact with the outside world; that is, they are sensory neurons or, alternatively, internal neurons that directly reflect the sensory signal.
2. Python scripts for those simulations are available at http://www.iro.umontreal.ca/~bengioy/src/STDP_simulations.