Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.
In deep learning, the backpropagation algorithm used to train neural networks requires a side network for the propagation of error derivatives, which is widely seen as biologically implausible (Crick, 1989). One fascinating hypothesis, first formulated by Hinton and McClelland (1988), is that in biological neural networks, error signals could be encoded in the temporal derivatives of the neural activities. This allows for error signals to be propagated in the network via the neuronal dynamics itself, without the need for a side network. Neural computation would correspond to both inference and error backpropagation. The work presented in this letter supports this hypothesis.
In section 2, we present the machine learning setting we are interested in. The neurons of the network follow the gradient of an energy function, such as the Hopfield energy (Cohen & Grossberg, 1983; Hopfield, 1984). Energy minima correspond to preferred states of the model. At prediction time, inputs are clamped, and the network relaxes to a fixed point, corresponding to a local minimum of the energy function. The prediction is then read out on the output neurons. This corresponds to the first phase of the algorithm. The goal of learning is that of minimizing the cost at the fixed point, called the objective.
Section 3 presents recurrent backpropagation (Almeida, 1987; Pineda, 1987), an algorithm that computes the gradient of the objective. In the second phase of recurrent backpropagation, an iterative procedure computes error derivatives.
In section 4, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective. In the second phase of equilibrium propagation, when the target values for output neurons are observed, the output neurons are nudged toward their targets, and the network starts a second relaxation phase toward a second but nearby fixed point that corresponds to slightly smaller prediction error. The gradient of the objective can be computed based on a contrastive Hebbian learning rule at the first fixed point and second fixed point.
Section 5 (in particular, theorem 3) constitutes the main contribution of our work. We establish a close connection between recurrent backpropagation and equilibrium propagation. We show that at every moment in the second phase of equilibrium propagation, the temporal derivatives of the neural activities code for (i.e., are equal to) intermediate error derivatives, which recurrent backpropagation computes iteratively. Our work shows that one does not require a special computational path for the computation of the error derivatives in the second phase; the same information is available in the temporal derivatives of the neural activities. Furthermore, we show that in equilibrium propagation, halting the second phase before convergence to the second fixed point is equivalent to truncated recurrent backpropagation.
2 Machine Learning Setting
We consider the supervised setting in which we want to predict a target y given an input x. The pair (x, y) is a data point. The model is a network specified by a state variable s and a parameter variable θ. The dynamics of the network are determined by two differentiable scalar functions, E(θ, s) and C(θ, s), which we call the energy function and the cost function, respectively. In most of the letter, to simplify the notations, we omit the dependence on x and y and simply write E and C. Furthermore, we write ∂E/∂s and ∂E/∂θ for the partial derivatives of E with respect to s and θ, respectively. Similarly, ∂C/∂s and ∂C/∂θ denote the partial derivatives of C.
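To make the setting concrete, here is a minimal numerical sketch of the first phase and of the objective, using an assumed toy scalar energy and cost; the particular quadratic forms and all names below are illustrative choices made here, not the ones used in the letter.

```python
# Toy model (assumed for illustration): E(theta, s) = 0.5*s**2 - theta*s
# and C(s) = 0.5*(s - y)**2.

def dE_ds(theta, s):
    return s - theta              # partial derivative of E with respect to s

def relax(theta, s, dt=0.01, steps=5000):
    """First phase: follow the gradient dynamics ds/dt = -dE/ds to a fixed point."""
    for _ in range(steps):
        s -= dt * dE_ds(theta, s)
    return s

def cost(s, y):
    return 0.5 * (s - y) ** 2

theta, y = 1.5, 0.2
s_star = relax(theta, s=0.0)      # fixed point (here s* = theta analytically)
objective = cost(s_star, y)       # the objective: the cost at the fixed point
```

For this toy energy the fixed point is available in closed form, which makes the relaxation easy to check against the analytic answer.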
Several methods have been proposed to compute the gradient of the objective with respect to the parameter θ. Early work by Almeida (1987) and Pineda (1987) introduced the recurrent backpropagation algorithm, which we present in section 3. In Scellier and Bengio (2017) we proposed another algorithm, at first sight very different. We present it in section 4. In section 5 we show that there is actually a profound connection between these two algorithms.
2.1 Example: Hopfield Model
In this section we propose particular forms for the energy function and the cost function to ease understanding. Nevertheless, the theory presented in this letter is general and does not rely on the particular forms of the energy and cost functions chosen here.
Recall that we consider the supervised setting where we must predict a target y given an input x. To illustrate the idea, we consider the case where the neurons of the network are split into layers s^0, s^1, and s^2, as in Figure 1. In this setting, the state variable is the set of layers s = (s^0, s^1, s^2). Each of the layers of neurons s^0, s^1, and s^2 is a vector whose coordinates are real numbers representing the membrane voltages of the neurons. The output layer s^0 corresponds to the layer where the prediction is read and has the same dimension as the target y. Furthermore, ρ is a deterministic function (a nonlinear activation) that maps a neuron's voltage onto its firing rate. We commit a small abuse of notation and denote by ρ(s^i) the vector of firing rates of the neurons in layer s^i; here the function ρ is applied elementwise to the coordinates of the vector s^i. Therefore, the vector ρ(s^i) has the same dimension as s^i. Finally, the parameter variable θ is the set of (bidirectional) weight matrices between the layers.
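As a rough illustration of this layered setting, the following sketch builds a tiny Hopfield-style energy and relaxes the unclamped layers by gradient descent. The tanh activation, the absence of bias terms, the layer sizes, and the exact form of the energy are all assumptions made here for illustration; the letter's equation 2.4 may differ in its details.

```python
import numpy as np

rng = np.random.default_rng(0)
rho, drho = np.tanh, lambda v: 1.0 - np.tanh(v) ** 2   # firing rate and its derivative

sizes = [2, 3, 4]                       # output s0, hidden s1, clamped input x (toy sizes)
x = rng.normal(size=sizes[2])           # clamped input layer
W = [rng.normal(scale=0.1, size=(sizes[i], sizes[i + 1])) for i in range(2)]

def energy(s0, s1):
    # One common Hopfield-style form: squared-norm terms minus
    # rho(pre) @ W @ rho(post) for each pair of connected layers.
    E = 0.5 * (s0 @ s0 + s1 @ s1)
    E -= rho(s0) @ W[0] @ rho(s1)
    E -= rho(s1) @ W[1] @ rho(x)
    return E

def grads(s0, s1):
    # Partial derivatives of the energy with respect to the unclamped layers
    g0 = s0 - drho(s0) * (W[0] @ rho(s1))
    g1 = s1 - drho(s1) * (W[0].T @ rho(s0) + W[1] @ rho(x))
    return g0, g1

# Relaxation: gradient descent on the energy drives the state toward a minimum
s0, s1 = rng.normal(size=sizes[0]), rng.normal(size=sizes[1])
E_start = energy(s0, s1)
for _ in range(500):
    g0, g1 = grads(s0, s1)
    s0, s1 = s0 - 0.05 * g0, s1 - 0.05 * g1
E_end = energy(s0, s1)
```

The relaxation monotonically decreases the energy (for a small enough step), and the state it reaches is the fixed point at which the prediction would be read out on s0.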
3 Recurrent Backpropagation
In this section, we present recurrent backpropagation, an algorithm introduced by Almeida (1987) and Pineda (1987) that computes the gradient of the objective (see equation 2.3). The original algorithm was described in the discrete-time setting and for a general state-to-state dynamics. Here we present it in the continuous-time setting in the particular case of a gradient dynamics (see equation 2.1). A direct derivation based on the adjoint method can also be found in LeCun, Touretzky, Hinton, and Sejnowski (1988).
3.1 Projected Cost Function
For t = 0, the projected cost is simply the cost of the current state: L(s, 0) = C(s).
As t → ∞, the projected cost converges to the objective.
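This convergence is easy to check numerically on an assumed toy scalar model (the quadratic energy and cost below are illustrative choices, not the letter's): the projected cost at horizon t is the cost of the state reached after following the free dynamics for a duration t.

```python
# Toy model (assumed): E = 0.5*s**2 - theta*s, C = 0.5*(s - y)**2.

def projected_cost(theta, y, s, t, dt=0.001):
    for _ in range(int(round(t / dt))):
        s -= dt * (s - theta)             # free dynamics ds/dt = -dE/ds
    return 0.5 * (s - y) ** 2

theta, y, s_init = 1.5, 0.2, 0.0
cost_now = projected_cost(theta, y, s_init, t=0.0)     # horizon 0: just C(s_init)
cost_limit = projected_cost(theta, y, s_init, t=10.0)  # long horizon
objective = 0.5 * (theta - y) ** 2                     # cost at the fixed point s* = theta
```

At horizon 0 the projected cost equals the cost of the current state; at long horizons it approaches the objective, regardless of the starting state.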
3.2 Process of Error Derivatives
Note that the Hessian of the energy is positive definite at the fixed point, since the fixed point is an energy minimum. Therefore, equation 3.7 guarantees that the process of error derivatives converges to 0 as t → ∞, in agreement with the fact that the long-horizon projected cost is (locally) insensitive to the initial state (the free fixed point in our case).
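This decay can be checked numerically on an assumed toy scalar model (the quadratic energy and cost below are illustrative choices, not the letter's): the error process starts at the cost derivative at the fixed point and is damped by the Hessian of the energy.

```python
# Toy model (assumed): E = 0.5*s**2 - theta*s, C = 0.5*(s - y)**2.
# Here the Hessian d2E/ds2 = 1, so the process obeys dS/dt = -S.

theta, y = 1.5, 0.2
s_star = theta                     # free fixed point of ds/dt = -(s - theta)
S = s_star - y                     # initial condition: dC/ds at the fixed point
dt = 0.01
traj = [S]
for _ in range(600):               # Euler integration of dS/dt = -(d2E/ds2) * S
    S -= dt * S
    traj.append(S)
# Because the Hessian is positive definite, the process decays to 0.
```

The monotone decay to 0 mirrors the statement above: at long horizons the projected cost no longer depends on where the relaxation started.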
4 Equilibrium Propagation
In this section, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective function (see equation 2.3). At first sight, equilibrium propagation and recurrent backpropagation have little in common. However, in section 5, we will show a profound connection between these algorithms.
4.1 Augmented Energy Function
Theorem 2 shows that the gradient can be estimated based on measurements at the two fixed points, one from each phase.
Theorem 2 offers another way to estimate the gradient of the objective. As in recurrent backpropagation, in the first phase (or free phase), the network follows the free dynamics (see equation 2.1). This is equivalent to saying that the network follows the augmented dynamics (see equation 4.2) with the influence parameter β set to 0. The network relaxes to the free fixed point, where ∂F/∂θ, the gradient of the augmented energy with respect to the parameters, is measured. In the second phase, which we call the nudged phase, the influence parameter β takes on a small, positive value, and the network relaxes to a new but nearby fixed point, where ∂F/∂θ is measured again. The gradient of the objective function is estimated using the formula in equation 4.4.
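The two-phase procedure can be sketched numerically on an assumed toy scalar model (the quadratic energy and cost below are illustrative choices, not the letter's); for this model the estimate can be compared against the analytic gradient of the objective.

```python
# Toy model (assumed): E = 0.5*s**2 - theta*s, C = 0.5*(s - y)**2.
# Since C does not depend on theta here, dF/dtheta = dE/dtheta = -s,
# so the measured quantity at each fixed point is simply -s.

def relax(theta, beta, y, s, dt=0.01, steps=5000):
    # augmented dynamics ds/dt = -dE/ds - beta*dC/ds
    for _ in range(steps):
        s -= dt * ((s - theta) + beta * (s - y))
    return s

theta, y, beta = 1.5, 0.2, 0.005
s_free = relax(theta, 0.0, y, s=0.0)          # first phase: free fixed point
s_nudged = relax(theta, beta, y, s=s_free)    # second phase: nudged fixed point
grad_est = ((-s_nudged) - (-s_free)) / beta   # equation-4.4-style finite difference
# Analytically the objective is 0.5*(theta - y)**2, so its gradient is theta - y.
```

The estimate approaches the analytic gradient as β shrinks; with a small but finite β the error is of order β.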
In the case of the modified Hopfield energy (see equation 2.4), the components of ∂F/∂θ are the derivatives with respect to the weight matrices between pairs of connected layers. For instance, the component for a given weight matrix is a matrix of the same size, with one entry per synapse, and each entry can be measured locally at the synapse based on the presynaptic activity and the postsynaptic activity. Thus, the learning rule of equation 4.4 is a kind of contrastive Hebbian learning rule at the free and nudged fixed points.
At the beginning of the second phase, the network is initially at the free fixed point. When the influence parameter β takes on a small, positive value, the novel term -β ∂C/∂s in the dynamics of the state variable perturbs the system. This perturbation propagates into the layers of the network until convergence to the new (nudged) fixed point.
In the next section, we go beyond the analysis of fixed points and show that at every moment in the nudged phase, the temporal derivative of the neural activities encodes the error derivative of equation 3.3.
5 Temporal Derivatives Code for Error Derivatives
Theorem 2 shows that the gradient of the objective can be estimated based on the free and nudged fixed points only. In this section, we study the dynamics of the network in the second phase, from the free fixed point to the nudged fixed point. Recall the notion of flow of the dynamical system (see equation 2.1): the state of the network at time t when it starts from a given initial state at time 0 and follows the free dynamics. Similarly, a flow is defined for any value of the influence parameter β when the network follows the augmented dynamics (see equation 4.2).
In equilibrium propagation, the state of the network at the beginning of the nudged phase is the free fixed point. We choose as origin of time the moment when the second phase starts: the network is in the free fixed-point state, and the influence parameter takes on a small, positive value β. With our notations, the state of the network after a duration t in the nudged phase is the state reached by following the augmented dynamics for a time t from the free fixed point. As t → ∞, the network's state converges to the nudged fixed point.
5.1 Process of Temporal Derivatives
The process of temporal derivatives is simply the temporal derivative of the state in the second phase, rescaled by 1/β (so that its value does not depend on the particular choice of β).
Theorem 3 is proved in appendix C. In essence, equation 5.4 says that in the second phase of equilibrium propagation, the temporal derivative of the state (rescaled by 1/β) encodes the error derivative of equation 3.3.
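This correspondence can be checked numerically on an assumed toy scalar model (the quadratic energy and cost below are illustrative choices, not the letter's): at every step of the nudged phase, the temporal derivative of the state rescaled by 1/β tracks the recurrent-backpropagation error process, up to the sign convention of the processes.

```python
# Toy model (assumed): E = 0.5*s**2 - theta*s, C = 0.5*(s - y)**2,
# so the Hessian d2E/ds2 = 1 and the error process obeys dS/dt = -S.

theta, y, beta, dt = 1.5, 0.2, 1e-4, 1e-3
s = theta                          # nudged phase starts at the free fixed point
S = theta - y                      # error process starts at dC/ds at the fixed point
max_gap = 0.0
for _ in range(3000):
    ds = -dt * ((s - theta) + beta * (s - y))    # one step of the nudged dynamics
    rescaled = ds / (dt * beta)                  # (1/beta) * ds/dt
    max_gap = max(max_gap, abs(rescaled - (-S))) # compare with the error process
    s += ds
    S -= dt * S                                  # dS/dt = -(d2E/ds2) * S
# max_gap stays tiny, illustrating that the two processes coincide as beta -> 0.
```

The gap between the two trajectories is of order β at every time step, so with β = 1e-4 the two curves are numerically indistinguishable.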
Here is an interpretation of equation 5.4. Suppose that the network is initially at the free fixed point. Consider the cost of the state a duration t in the future, if one first moved the initial state by a small step. The goal is to find the direction of the step that minimizes this future cost. The naive approach by trial and error is neither biologically plausible nor efficient. Equation 5.4 tells us that there is a physically realistic way of finding such a direction in one attempt: this direction is encoded in the temporal derivative of the state at time t after starting the nudged phase.
Note that as t → ∞, both sides of equation 5.4 converge to 0. This is a consequence of equation 3.7 and the fact that the Hessian of the energy is positive definite (since the fixed point is an energy minimum), as already mentioned in section 3. Intuitively, the right-hand side converges to 0 because the state converges smoothly to the nudged fixed point. As for the left-hand side, when t is large, the projected cost is close to the cost of the energy minimum and thus has little sensitivity to the initial state.
Our work establishes a close connection between two algorithms for fixed-point recurrent networks: recurrent backpropagation and equilibrium propagation. The temporal derivatives of the neural activities in the second phase of equilibrium propagation are equal to the error derivatives that recurrent backpropagation computes iteratively. Moreover, we have shown that halting the second phase before convergence in equilibrium propagation is equivalent to truncated recurrent backpropagation. From a machine learning perspective, our work supports the hypothesis that in biological networks, temporal changes in neural activities may represent error signals for supervised learning.
One important drawback of the theory presented here is that it assumes the existence of an energy function. In the case of the Hopfield energy, this implies symmetric connections between neurons. However, the analysis presented here can be generalized to dynamics that do not involve energy functions. This is the subject of Scellier et al. (2018). Another concern is the fact that our algorithm is rate based, whereas biological neurons emit spikes. Ideally, we would like a theory applicable to spiking networks. Finally, the assumption of the existence of specialized output neurons (the output layer here) would need to be relaxed too.
From a practical point of view, another issue is that the time needed to converge to the first fixed point was experimentally found to grow exponentially with the number of layers in Scellier and Bengio (2017). Although equation 5.5 provides a new justification for saving time by stopping the second phase early, our algorithm (as well as recurrent backpropagation) still requires convergence to the free fixed point in the first phase.
Appendix A: Recurrent Backpropagation: Proof
Appendix B: Equilibrium Propagation: Proof
Since the data point does not play any role here, its dependence is omitted in the notations. We assume that the energy function and the cost function (and thus the augmented energy function) are twice differentiable and that the conditions of the implicit function theorem are satisfied, so that the fixed point is a continuously differentiable function of the parameters and the influence parameter.
Appendix C: Temporal Derivatives Code for Error Derivatives: Proof
In order to prove theorem 3, we have to show that the process of temporal derivatives satisfies the same differential equations as the process of error derivatives, namely, equations 3.5 to 3.8 (see theorem 1). We then conclude by using the uniqueness of the solution of a differential equation with a given initial condition.
In general, the fixed point defined by equation 2.2 is not unique unless further assumptions are made on the energy function (e.g., convexity). The fixed point depends on the initial state of the dynamics (see equation 2.1), and so does the objective function of equation 2.3. However, for ease of presentation, we avoid delving into these mathematical details here.
In this expression, both the cost function and the fixed point depend on the parameters: the cost function depends on them directly, whereas the fixed point depends on them indirectly through the fixed-point equation (see equation 2.2).
We choose to number the layers in increasing order from output to input, in the sense of propagation of error signals (see section 4).
The case without the constraint of symmetric connections is studied in Scellier, Goyal, Binas, Mesnard, and Bengio (2018).
Given two vectors u and v, their element-by-element product is the vector whose ith coordinate is the product of the ith coordinates of u and v.
In this specific example, the cost function does not depend on the parameters, and its partial derivative with respect to them is zero.
This quantity represents the partial derivative of the cost function with respect to the parameters, evaluated at the fixed point. It does not include the differentiation path through the fixed point.
Equation A.3 is the Kolmogorov backward equation for deterministic processes.
The partial-derivative notations refer to derivatives with respect to the arguments of the augmented energy function, whereas the total derivatives with respect to the parameters and the influence parameter include the differentiation path through the fixed point. The total derivative with respect to the parameters (resp. the influence parameter) is taken for a fixed influence parameter (resp. fixed parameters).
We thank Jonathan Binas for feedback and discussions, as well as NSERC, CIFAR, Samsung, and Canada Research Chairs for funding.