## Abstract

Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.

## 1 Introduction

In deep learning, the backpropagation algorithm used to train neural networks requires a side network for the propagation of error derivatives, which is widely seen as biologically implausible (Crick, 1989). One fascinating hypothesis, first formulated by Hinton and McClelland (1988), is that in biological neural networks, error signals could be encoded in the temporal derivatives of the neural activities. This allows for error signals to be propagated in the network via the neuronal dynamics itself, without the need for a side network. Neural computation would correspond to both inference and error backpropagation. The work presented in this letter supports this hypothesis.

In section 2, we present the machine learning setting we are interested in. The neurons of the network follow the gradient of an energy function, such as the Hopfield energy (Cohen & Grossberg, 1983; Hopfield, 1984). Energy minima correspond to preferred states of the model. At prediction time, inputs are clamped, and the network relaxes to a fixed point, corresponding to a local minimum of the energy function. The prediction is then read out on the output neurons. This corresponds to the first phase of the algorithm. The goal of learning is that of minimizing the cost at the fixed point, called the objective.

Section 3 presents recurrent backpropagation (Almeida, 1987; Pineda, 1987), an algorithm that computes the gradient of the objective. In the second phase of recurrent backpropagation, an iterative procedure computes error derivatives.

In section 4, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective. In the second phase of equilibrium propagation, when the target values for output neurons are observed, the output neurons are nudged toward their targets, and the network starts a second relaxation phase toward a second but nearby fixed point that corresponds to slightly smaller prediction error. The gradient of the objective can be computed based on a contrastive Hebbian learning rule at the first fixed point and second fixed point.

Section 5 (in particular, theorem 3) constitutes the main contribution of our work. We establish a close connection between recurrent backpropagation and equilibrium propagation. We show that at every moment in the second phase of equilibrium propagation, the temporal derivatives of the neural activities code for (i.e., are equal to) intermediate error derivatives, which recurrent backpropagation computes iteratively. Our work shows that one does not require a special computational path for the computation of the error derivatives in the second phase; the same information is available in the temporal derivatives of the neural activities. Furthermore, we show that in equilibrium propagation, halting the second phase before convergence to the second fixed point is equivalent to truncated recurrent backpropagation.

## 2 Machine Learning Setting

We consider the supervised setting in which we want to predict a target $y$ given an input $x$. The pair $(x,y)$ is a data point. The model is a network specified by a state variable $s$ and a parameter variable $\theta$. The dynamics of the network are determined by two differentiable scalar functions, $E_\theta(x,s)$ and $C_\theta(y,s)$, which we call the *energy function* and the *cost function*, respectively. In most of the letter, to simplify the notations, we omit the dependence on $x$ and $y$ and simply write $E_\theta(s)$ and $C_\theta(s)$. Furthermore, we write $\frac{\partial E_\theta}{\partial \theta}(s)$ and $\frac{\partial E_\theta}{\partial s}(s)$ for the partial derivatives of $(\theta, s) \mapsto E_\theta(s)$ with respect to $\theta$ and $s$, respectively. Similarly, $\frac{\partial C_\theta}{\partial \theta}(s)$ and $\frac{\partial C_\theta}{\partial s}(s)$ denote the partial derivatives of $(\theta, s) \mapsto C_\theta(s)$.

The network dynamics are defined as the gradient dynamics of the energy function,

$$\frac{ds}{dt} = -\frac{\partial E_\theta}{\partial s}(s), \tag{2.1}$$

which converge to an energy minimum $s_\theta^0$, characterized by^{1}

$$\frac{\partial E_\theta}{\partial s}\left(s_\theta^0\right) = 0. \tag{2.2}$$

This relaxation phase is called the *free phase*, and the energy minimum $s_\theta^0$ is called the *free fixed point*.
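To make the free phase concrete, here is a minimal numerical sketch, assuming a toy quadratic energy (not the Hopfield energy of section 2.1) and an explicit Euler discretization of the gradient dynamics; the matrix $A$, the vector $\theta$, and all variable names are illustrative assumptions:

```python
import numpy as np

# Toy quadratic energy E_theta(s) = 0.5 * s^T A s - theta^T s (an assumption
# for illustration; any smooth energy with a minimum would do).
A = np.array([[2.0, 0.5], [0.5, 1.0]])    # symmetric positive definite
theta = np.array([1.0, -1.0])

def grad_E(s):
    return A @ s - theta                  # dE/ds

# Free phase: Euler-discretized gradient dynamics ds/dt = -dE/ds (equation 2.1).
s = np.zeros(2)
dt = 0.05
for _ in range(2000):
    s = s - dt * grad_E(s)

s_star = np.linalg.solve(A, theta)        # analytic free fixed point
print(np.allclose(s, s_star, atol=1e-6))  # the relaxed state is the energy minimum
```

For this convex energy the fixed point is unique; in general (see note 1), the reached minimum depends on the initial state.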

We introduce the objective function (for a single data point $(x,y)$):^{2}

$$J(\theta) = C_\theta\left(s_\theta^0\right). \tag{2.3}$$

Several methods have been proposed to compute the gradient of $J$ with respect to $\theta $. Early work by Almeida (1987) and Pineda (1987) introduced the recurrent backpropagation algorithm, which we present in section 3. In Scellier and Bengio (2017) we proposed another algorithm—at first sight very different. We present it in section 4. In section 5 we show that there is actually a profound connection between these two algorithms.

### 2.1 Example: Hopfield Model

In this section we propose particular forms for the energy function $E\theta (x,s)$ and the cost function $C\theta (y,s)$ to ease understanding. Nevertheless, the theory presented in this letter is general and does not rely on the particular forms of the functions $E\theta (x,s)$ and $C\theta (y,s)$ chosen here.

Recall that we consider the supervised setting where we must predict a target $y$ given an input $x$. To illustrate the idea, we consider the case where the neurons of the network are split into layers $s^0$, $s^1$, and $s^2$, as in Figure 1.^{3} In this setting, the state variable $s$ is the set of layers $s = \{s^0, s^1, s^2\}$. Each of the layers $s^0$, $s^1$, and $s^2$ is a vector whose coordinates are real numbers representing the membrane voltages of the neurons. The output layer $s^0$ corresponds to the layer where the prediction is read and has the same dimension as the target $y$. Furthermore, $\rho$ is a deterministic function (nonlinear activation) that maps a neuron's voltage onto its firing rate. We commit a small abuse of notation and denote by $\rho\left(s^i\right)$ the vector of firing rates of the neurons in layer $s^i$; here the function $\rho$ is applied elementwise to the coordinates of the vector $s^i$. Therefore, the vector $\rho\left(s^i\right)$ has the same dimension as $s^i$. Finally, the parameter variable $\theta$ is the set of (bidirectional) weight matrices between the layers, $\theta = \{W_{01}, W_{12}, W_{23}\}$.

A possible choice of energy function is a modified Hopfield energy (biases omitted for simplicity):^{4}

$$E_\theta(x,s) = \frac{1}{2}\left(\left\|s^0\right\|^2 + \left\|s^1\right\|^2 + \left\|s^2\right\|^2\right) - \rho\left(s^0\right)^\top W_{01}\, \rho\left(s^1\right) - \rho\left(s^1\right)^\top W_{12}\, \rho\left(s^2\right) - \rho\left(s^2\right)^\top W_{23}\, \rho(x). \tag{2.4}$$

The corresponding gradient dynamics (see equation 2.1) of, say, layer $s^1$ read^{5}

$$\frac{ds^1}{dt} = \rho'\left(s^1\right) \odot \left(W_{01}^\top \rho\left(s^0\right) + W_{12}\, \rho\left(s^2\right)\right) - s^1.$$

As for the cost function, a simple choice is the squared error,^{6}

$$C_\theta(y,s) = \frac{1}{2}\left\|y - s^0\right\|^2,$$

which measures the discrepancy between the output layer $s^0$ and the target $y$.
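As a sanity check of the layered Hopfield model, the following sketch implements the energy of equation 2.4 and the resulting gradient dynamics of layer $s^1$, and verifies the hand-derived derivative against a numerical one; the layer sizes, the random weights, and the choice $\rho = \tanh$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = np.tanh                            # activation mapping voltage to firing rate
drho = lambda v: 1.0 - np.tanh(v) ** 2   # its derivative

# Hypothetical layer sizes; W01, W12, W23 are the bidirectional weights.
x = rng.normal(size=5)
s0, s1, s2 = rng.normal(size=2), rng.normal(size=3), rng.normal(size=4)
W01 = rng.normal(size=(2, 3))
W12 = rng.normal(size=(3, 4))
W23 = rng.normal(size=(4, 5))

def energy(s0, s1, s2):
    # Modified Hopfield energy (equation 2.4), biases omitted.
    sq = 0.5 * (s0 @ s0 + s1 @ s1 + s2 @ s2)
    return (sq - rho(s0) @ W01 @ rho(s1)
               - rho(s1) @ W12 @ rho(s2)
               - rho(s2) @ W23 @ rho(x))

def ds1_dt(s0, s1, s2):
    # Gradient dynamics of layer s1: ds1/dt = -dE/ds1.
    return drho(s1) * (W01.T @ rho(s0) + W12 @ rho(s2)) - s1

# Check -dE/ds1 against a central finite difference of the energy.
eps = 1e-6
num = np.array([(energy(s0, s1 + eps * e, s2) - energy(s0, s1 - eps * e, s2)) / (2 * eps)
                for e in np.eye(3)])
print(np.allclose(-num, ds1_dt(s0, s1, s2), atol=1e-5))
```

The same pattern gives the dynamics of $s^0$ and $s^2$; only the neighboring layers feeding into each term change.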

## 3 Recurrent Backpropagation

In this section, we present recurrent backpropagation, an algorithm introduced by Almeida (1987) and Pineda (1987) that computes the gradient of $J$ (see equation 2.3). The original algorithm was described in the discrete-time setting and for a general state-to-state dynamics. Here we present it in the continuous-time setting in the particular case of a gradient dynamics (see equation 2.1). A direct derivation based on the adjoint method can also be found in LeCun, Touretzky, Hinton, and Sejnowski (1988).

### 3.1 Projected Cost Function

Let $S_\theta^0(s,t)$ denote the state of the network at time $t \geq 0$ when it follows the free dynamics (see equation 2.1) starting from the initial state $s$ at time $t=0$; the map $(s,t) \mapsto S_\theta^0(s,t)$ is called the *flow*. We introduce the projected cost function:

$$L_\theta(s,t) = C_\theta\left(S_\theta^0(s,t)\right). \tag{3.1}$$

- For $t = 0$, the projected cost is simply the cost of the current state: $L_\theta(s, 0) = C_\theta(s)$.

- As $t \to \infty$, the projected cost converges to the objective: $L_\theta(s,t) \to J(\theta)$ (equation 3.2).
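The two properties above can be checked numerically. This sketch integrates the free dynamics for a duration $t$ and evaluates the cost at the resulting state, again for an assumed toy quadratic energy $E(s) = \frac{1}{2} s^\top A s - b^\top s$ with squared-error cost; all names are hypothetical:

```python
import numpy as np

# Quadratic toy model (an illustrative assumption).
A = np.array([[2.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -1.0])
y = np.array([0.2, 0.3])

def cost(s):
    return 0.5 * np.sum((s - y) ** 2)

def projected_cost(s, t, dt=1e-3):
    # L(s, t): follow the free dynamics ds/dt = -dE/ds for a duration t
    # starting from s, then evaluate the cost at the resulting state.
    for _ in range(int(t / dt)):
        s = s - dt * (A @ s - b)
    return cost(s)

s_init = np.array([1.0, 1.0])
s_star = np.linalg.solve(A, b)                                   # free fixed point
print(abs(projected_cost(s_init, 0.0) - cost(s_init)) < 1e-9)    # L(s, 0) = C(s)
print(abs(projected_cost(s_init, 20.0) - cost(s_star)) < 1e-4)   # L(s, t) -> J(theta)
```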

### 3.2 Process of Error Derivatives

We define the *processes of error derivatives* $\bar{S}_t$ and $\bar{\Theta}_t$ by

$$\bar{S}_t = \frac{\partial L_\theta}{\partial s}\left(s_\theta^0, t\right), \tag{3.3}$$

$$\bar{\Theta}_t = \frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, t\right).^{7} \tag{3.4}$$

The recurrent backpropagation algorithm computes $\bar{S}_t$ and $\bar{\Theta}_t$ iteratively for increasing values of $t$.

**Theorem 1.** *The processes $\bar{S}_t$ and $\bar{\Theta}_t$ satisfy the initial conditions*

$$\bar{S}_0 = \frac{\partial C_\theta}{\partial s}\left(s_\theta^0\right), \tag{3.5}$$

$$\bar{\Theta}_0 = \frac{\partial C_\theta}{\partial \theta}\left(s_\theta^0\right), \tag{3.6}$$

*and the differential equations*

$$\frac{d\bar{S}_t}{dt} = -\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right) \cdot \bar{S}_t, \tag{3.7}$$

$$\frac{d\bar{\Theta}_t}{dt} = -\frac{\partial^2 E_\theta}{\partial \theta \, \partial s}\left(s_\theta^0\right) \cdot \bar{S}_t. \tag{3.8}$$

Theorem 1, proved in appendix A, offers us a two-phase method to compute the gradient $\frac{\partial J}{\partial \theta}(\theta)$. In the first phase (or free phase), the state variable $s$ follows the free dynamics (see equation 2.1) and relaxes to the fixed point $s_\theta^0$. Reaching this fixed point is necessary for evaluating the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$, which is required in the second phase. In the second phase, one computes $\bar{S}_t$ and $\bar{\Theta}_t$ iteratively for increasing values of $t$ using equations 3.5 to 3.8. We obtain the desired gradient in the limit $t \to \infty$, as a consequence of equation 3.2:

$$\frac{\partial J}{\partial \theta}(\theta) = \lim_{t \to \infty} \bar{\Theta}_t.$$

Note that the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$ is positive definite since $s_\theta^0$ is an energy minimum. Therefore, equation 3.7 guarantees that $\frac{\partial L_\theta}{\partial s}\left(s_\theta^0, t\right) \to 0$ as $t \to \infty$, in agreement with the fact that $J(\theta)$ is (locally) insensitive to the initial state ($s = s_\theta^0$ in our case).
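Under the same toy quadratic assumptions used in the earlier sketches (for which $\frac{\partial^2 E}{\partial s^2} = A$ and $\frac{\partial^2 E}{\partial \theta\,\partial s} = -I$), the second phase of recurrent backpropagation reduces to an Euler integration of equations 3.5 to 3.8, whose limit matches the analytic gradient of $J$:

```python
import numpy as np

# Quadratic toy model (illustrative assumption):
# E_theta(s) = 0.5 s^T A s - theta^T s, C(s) = 0.5 ||s - y||^2,
# so d2E/ds2 = A and d2E/(dtheta ds) = -I.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -1.0])
y = np.array([0.2, 0.3])

s_star = np.linalg.solve(A, theta)       # free fixed point (first phase)

# Second phase: integrate equations 3.5-3.8 with explicit Euler steps.
S_bar = s_star - y                       # eq. 3.5: dC/ds at the fixed point
Theta_bar = np.zeros(2)                  # eq. 3.6: dC/dtheta = 0 for this cost
dt = 1e-3
for _ in range(20000):
    Theta_bar = Theta_bar + dt * S_bar   # eq. 3.8: -d2E/(dtheta ds) @ S_bar = +S_bar here
    S_bar = S_bar + dt * (-A @ S_bar)    # eq. 3.7

# Analytic gradient of J(theta) = C(A^{-1} theta) for comparison.
grad_J = np.linalg.solve(A, np.linalg.solve(A, theta) - y)
print(np.allclose(Theta_bar, grad_J, atol=1e-3))
```

Note that $\bar{\Theta}$ is updated with the current $\bar{S}$ before $\bar{S}$ itself is decayed, a standard explicit Euler ordering.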

From the point of view of biological plausibility, the requirement to run the dynamics for $\bar{S}_t$ and $\bar{\Theta}_t$ to compute the gradient $\frac{\partial J}{\partial \theta}(\theta)$ is not satisfying. It is not clear what the quantities $\bar{S}_t$ and $\bar{\Theta}_t$ would represent in a biological network. We address this issue in sections 4 and 5.

## 4 Equilibrium Propagation

In this section, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective function $J$, equation 2.3. At first sight, equilibrium propagation and recurrent backpropagation have little in common. However, in section 5, we will show a profound connection between these algorithms.

### 4.1 Augmented Energy Function

We introduce the augmented energy function,

$$E_\theta^\beta(s) = E_\theta(s) + \beta \, C_\theta(s), \tag{4.1}$$

where the scalar $\beta$ is called the *influence parameter*. The free dynamics (see equation 2.1) is then replaced by the augmented dynamics:

$$\frac{ds}{dt} = -\frac{\partial E_\theta^\beta}{\partial s}(s) = -\frac{\partial E_\theta}{\partial s}(s) - \beta \, \frac{\partial C_\theta}{\partial s}(s). \tag{4.2}$$

The network then relaxes to a fixed point $s_\theta^\beta$ of the augmented dynamics, characterized by

$$\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right) = 0. \tag{4.3}$$

**Theorem 2.** *The gradient of the objective function is given by*

$$\frac{\partial J}{\partial \theta}(\theta) = \lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right) - \frac{\partial E_\theta}{\partial \theta}\left(s_\theta^0\right) \right). \tag{4.4}$$

Theorem 2 shows that the gradient $\frac{\partial J}{\partial \theta}(\theta)$ can be estimated based on measurements at the fixed points $s_\theta^0$ and $s_\theta^\beta$.

A proof of theorem 2 is given in appendix B. Note that theorem 2 is also a consequence of the more general formula of equation 5.5 (theorem 3), established in the next section.

Theorem 2 offers another way to estimate the gradient of $J(\theta)$. As in recurrent backpropagation, in the first phase (or free phase), the network follows the free dynamics (see equation 2.1). This is equivalent to saying that the network follows the augmented dynamics (see equation 4.2) when the value of $\beta$ is set to 0. The network relaxes to the free fixed point $s_\theta^0$, where $\frac{\partial E_\theta}{\partial \theta}\left(s_\theta^0\right)$ is measured. In the second phase, which we call the *nudged phase*, the influence parameter takes on a small, positive value $\beta \gtrsim 0$, and the network relaxes to a new but nearby fixed point $s_\theta^\beta$, where $\frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right)$ is measured. The gradient of the objective function is estimated using the formula in equation 4.4.
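The two-phase procedure can be sketched end to end in the toy quadratic model used in the earlier sketches; for this energy $\frac{\partial E_\theta}{\partial \theta}(s) = -s$, so the estimator of equation 4.4 reduces to comparing the two fixed points:

```python
import numpy as np

# Quadratic toy model (an assumption for illustration):
# E_theta(s) = 0.5 s^T A s - theta^T s, C(s) = 0.5 ||s - y||^2.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -1.0])
y = np.array([0.2, 0.3])
beta = 1e-4                              # small positive influence parameter

def relax(force, s, dt=1e-2, steps=5000):
    # Generic Euler relaxation to a fixed point of ds/dt = force(s).
    for _ in range(steps):
        s = s + dt * force(s)
    return s

# First phase: free dynamics ds/dt = -dE/ds.
s_free = relax(lambda s: -(A @ s - theta), np.zeros(2))

# Second phase: augmented dynamics ds/dt = -dE/ds - beta * dC/ds (equation 4.2),
# starting from the free fixed point.
s_nudged = relax(lambda s: -(A @ s - theta) - beta * (s - y), s_free)

# Estimator of equation 4.4, with dE/dtheta(s) = -s for this energy.
grad_est = (-s_nudged - (-s_free)) / beta

grad_J = np.linalg.solve(A, np.linalg.solve(A, theta) - y)   # analytic gradient
print(np.allclose(grad_est, grad_J, atol=1e-3))
```

The estimate carries an $O(\beta)$ bias, which is why a small $\beta$ is used.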

In the case of the modified Hopfield energy (see equation 2.4), the components of $\frac{\partial E_\theta}{\partial \theta}(s)$ are $\frac{\partial E_\theta}{\partial W_{01}}(s)$, $\frac{\partial E_\theta}{\partial W_{12}}(s)$, and $\frac{\partial E_\theta}{\partial W_{23}}(s)$. For instance, $\frac{\partial E_\theta}{\partial W_{01}}(s) = -\rho\left(s^0\right) \cdot \rho\left(s^1\right)^\top$ is a matrix of size $\dim\left(s^0\right) \times \dim\left(s^1\right)$ whose entries can be measured locally at each synapse based on the presynaptic activity and postsynaptic activity. Thus, the learning rule of equation 4.4 is a kind of contrastive Hebbian learning rule at the free and nudged fixed points.

At the beginning of the second phase, the network is initially at the free fixed point $s_\theta^0$. When the influence parameter takes on a small, positive value $\beta \gtrsim 0$, the novel term $-\beta \frac{\partial C_\theta}{\partial s}(s)$ in the dynamics of the state variable perturbs the system. This perturbation propagates into the layers of the network until convergence to the new fixed point $s_\theta^\beta$.

In the next section, we go beyond the analysis of fixed points and show that at every moment $t$ in the nudged phase, the temporal derivative $\frac{ds}{dt}$ encodes the error derivative of equation 3.3.

## 5 Temporal Derivatives Code for Error Derivatives

Theorem 2 shows that the gradient of $J$ can be estimated based on the free and nudged fixed points only. In this section, we study the dynamics of the network in the second phase, from the free fixed point to the nudged fixed point. Recall that $S_\theta^0(s,t)$ is the flow of the dynamical system (see equation 2.1), that is, the state of the network at time $t \geq 0$ when it starts from an initial state $s$ at time $t = 0$ and follows the free dynamics. Similarly, we define $S_\theta^\beta(s,t)$ for any value of $\beta$ when the network follows the augmented dynamics (see equation 4.2).

In equilibrium propagation, the state of the network at the beginning of the nudged phase is the free fixed point $s_\theta^0$. We choose as origin of time $t = 0$ the moment when the second phase starts: the network is in the state $s_\theta^0$, and the influence parameter takes on a small, positive value $\beta \gtrsim 0$. With our notations, the state of the network after a duration $t$ in the nudged phase is $S_\theta^\beta\left(s_\theta^0, t\right)$. As $t \to \infty$, the network's state converges to the nudged fixed point: $S_\theta^\beta\left(s_\theta^0, t\right) \to s_\theta^\beta$.

### 5.1 Process of Temporal Derivatives

We define the *process of temporal derivatives* $\tilde{S}_t$ and the process $\tilde{\Theta}_t$ by

$$\tilde{S}_t = -\lim_{\beta \to 0} \frac{1}{\beta} \frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0, t\right),$$

$$\tilde{\Theta}_t = \lim_{\beta \to 0} \frac{1}{\beta} \left( \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) - \frac{\partial E_\theta}{\partial \theta}\left(s_\theta^0\right) \right).$$

Up to its sign, the process $\tilde{S}_t$ is simply the temporal derivative $\frac{ds}{dt}$ in the second phase, rescaled by $\frac{1}{\beta}$ (so that its value does not depend on the particular choice of $\beta \gtrsim 0$).

**Theorem 3.** *For every $t \geq 0$, the processes of temporal derivatives and of error derivatives coincide:*

$$\tilde{S}_t = \bar{S}_t, \tag{5.4}$$

$$\tilde{\Theta}_t = \bar{\Theta}_t. \tag{5.5}$$

Theorem 3 is proved in appendix C. In essence, equation 5.4 says that in the second phase of equilibrium propagation, the temporal derivative $\frac{ds}{dt}$ (rescaled by $\frac{1}{\beta}$) encodes the error derivative of equation 3.3.

Here is an interpretation of equation 5.4. Suppose that the network is initially at the fixed point $s = s_\theta^0$. Consider the cost $L_\theta\left(s_\theta^0 + \Delta s, t\right)$, a duration $t$ in the future, if one moved the initial state $s = s_\theta^0$ by a small step $\Delta s$. The goal is to find the direction $\Delta s$ that minimizes $L_\theta\left(s_\theta^0 + \Delta s, t\right)$. The naive approach by trial and error is neither biologically plausible nor efficient. Equation 5.4 tells us that there is a physically realistic way of finding such a direction $\Delta s$ in one attempt: this direction is encoded in the temporal derivative $\frac{ds}{dt}$ at time $t$ after the nudged phase has started.
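A numerical illustration of this claim, in the toy quadratic model used in the other sketches: at every step of the nudged phase, the rescaled temporal derivative $-\frac{1}{\beta}\frac{ds}{dt}$ is compared against the error-derivative process $\bar{S}_t$ integrated in parallel (the sign convention follows the definition of $\tilde{S}_t$ used above and is an assumption of this sketch):

```python
import numpy as np

# Quadratic toy model (illustrative assumption), as in the earlier sketches.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -1.0])
y = np.array([0.2, 0.3])
beta = 1e-4
dt = 1e-3

s = np.linalg.solve(A, theta)            # nudged phase starts at the free fixed point
S_bar = s - y                            # error-derivative process of RBP (eq. 3.5)

ok = True
for _ in range(3000):                    # nudged phase, t in [0, 3]
    ds = dt * (-(A @ s - theta) - beta * (s - y))    # augmented dynamics (eq. 4.2)
    # Theorem 3 (eq. 5.4): the rescaled temporal derivative -(1/beta) ds/dt
    # matches the error derivative computed by recurrent backpropagation.
    ok = ok and np.allclose(-(ds / dt) / beta, S_bar, atol=1e-2)
    s = s + ds
    S_bar = S_bar + dt * (-A @ S_bar)    # eq. 3.7
print(ok)
```

No side network is integrated to obtain the left-hand side: it is read directly off the state trajectory.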

Note that as $t \to \infty$, both sides of equation 5.4 converge to 0. This is a consequence of equation 3.7 and the fact that the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$ is positive definite (since $s_\theta^0$ is an energy minimum), as already mentioned in section 3. Intuitively, the right-hand side converges to 0 because $S_\theta^\beta\left(s_\theta^0, t\right)$ converges smoothly to the nudged fixed point $s_\theta^\beta$. As for the left-hand side, when $t$ is large, $L_\theta(s, t)$ is close to the cost of the energy minimum and thus has little sensitivity to the initial state $s$.

Finally, as $t\u2192\u221e$ in equation 5.5, one recovers the gradient formula of equilibrium propagation (theorem ^{2}). Interestingly, equation 5.5 shows that in equilibrium propagation, halting the second phase before convergence to the nudged fixed point corresponds to truncated recurrent backpropagation.
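The truncation claim can also be checked numerically: running both second phases for the same finite duration, the equilibrium propagation estimate of equation 4.4 (evaluated before convergence) matches the truncated $\bar{\Theta}_t$ of recurrent backpropagation, in the toy quadratic model assumed throughout these sketches:

```python
import numpy as np

# Quadratic toy model (illustrative assumption):
# E_theta(s) = 0.5 s^T A s - theta^T s, C(s) = 0.5 ||s - y||^2.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -1.0])
y = np.array([0.2, 0.3])
beta, dt, steps = 1e-4, 1e-3, 1500       # both phases truncated at t = 1.5

s0 = np.linalg.solve(A, theta)           # free fixed point

# Truncated recurrent backpropagation: integrate eqs. 3.5-3.8 up to t only.
S_bar, Theta_bar = s0 - y, np.zeros(2)
# Truncated equilibrium propagation: run the nudged phase up to the same t.
s = s0.copy()
for _ in range(steps):
    Theta_bar = Theta_bar + dt * S_bar   # d2E/(dtheta ds) = -I for this energy
    S_bar = S_bar + dt * (-A @ S_bar)
    s = s + dt * (-(A @ s - theta) - beta * (s - y))

# Eq. 4.4 evaluated at the truncated nudged state, with dE/dtheta(s) = -s.
ep_estimate = (-s - (-s0)) / beta
print(np.allclose(ep_estimate, Theta_bar, atol=1e-2))
```

Both truncated estimates are biased in the same way, which is the content of equation 5.5.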

## 6 Conclusion

Our work establishes a close connection between two algorithms for fixed-point recurrent networks: recurrent backpropagation and equilibrium propagation. The temporal derivatives of the neural activities in the second phase of equilibrium propagation are equal to the error derivatives that recurrent backpropagation computes iteratively. Moreover, we have shown that halting the second phase before convergence in equilibrium propagation is equivalent to truncated recurrent backpropagation. Our work thus supports, from a machine learning perspective, the hypothesis that in biological networks, temporal changes in neural activities may represent error signals for supervised learning.

One important drawback of the theory presented here is that it assumes the existence of an energy function. In the case of the Hopfield energy, this implies symmetric connections between neurons. However, the analysis presented here can be generalized to dynamics that do not involve energy functions; this is the subject of Scellier et al. (2018). Another concern is that our algorithm is rate-based, whereas biological neurons emit spikes. Ideally we would like a theory applicable to spiking networks. Finally, the assumption of the existence of specialized output neurons ($s^0$ here) would need to be relaxed too.

From a practical point of view, another issue is that the time needed to converge to the first fixed point was experimentally found to grow exponentially with the number of layers in Scellier and Bengio (2017). Although equation 5.5 provides a new justification for saving time by stopping the second phase early, our algorithm (as well as recurrent backpropagation) still requires convergence to the free fixed point in the first phase.

## Appendix A: Recurrent Backpropagation: Proof

The initial conditions, equations 3.5 and 3.6, follow from $L_\theta(s, 0) = C_\theta(s)$, whose partial derivatives with respect to $s$ and $\theta$, evaluated at $s = s_\theta^0$, are $\bar{S}_0$ and $\bar{\Theta}_0$, respectively.

As for equations 3.7 and 3.8, note that the projected cost function satisfies^{8}

$$\frac{\partial L_\theta}{\partial t}(s, t) = -\frac{\partial L_\theta}{\partial s}(s, t) \cdot \frac{\partial E_\theta}{\partial s}(s). \tag{A.3}$$

Differentiating equation A.3 with respect to $s$ and evaluating at $s = s_\theta^0$, where $\frac{\partial E_\theta}{\partial s}\left(s_\theta^0\right) = 0$, we obtain equation 3.7. Similarly, differentiating equation A.3 with respect to $\theta$ and evaluating at $s = s_\theta^0$ yields equation 3.8.

$\u25a1$

## Appendix B: Equilibrium Propagation: Proof

In this appendix, we prove theorem ^{2}. The same proof is provided in Scellier and Bengio (2017).

Since the data point $(x,y)$ does not play any role, its dependence is omitted from the notations. We assume that the energy function $E_\theta(s)$ and the cost function $C_\theta(s)$ (and thus the augmented energy function $E_\theta^\beta(s)$) are twice differentiable and that the conditions of the implicit function theorem are satisfied, so that the fixed point $s_\theta^\beta$ is a continuously differentiable function of $(\theta, \beta)$.

Let $F(\theta, \beta) = E_\theta^\beta\left(s_\theta^\beta\right)$ denote the value of the augmented energy at the fixed point. By the chain rule and the fixed-point condition $\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right) = 0$, the total derivatives of $F$ are^{9}

$$\frac{dF}{d\theta} = \frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right), \qquad \frac{dF}{d\beta} = \frac{\partial E_\theta^\beta}{\partial \beta}\left(s_\theta^\beta\right) = C_\theta\left(s_\theta^\beta\right).$$

The result then follows from the symmetry of the second derivatives, $\frac{d}{d\beta}\frac{dF}{d\theta} = \frac{d}{d\theta}\frac{dF}{d\beta}$, when evaluated at the point $\beta = 0$: the left-hand side is the limit on the right-hand side of equation 4.4, whereas the right-hand side is $\frac{d}{d\theta} C_\theta\left(s_\theta^0\right) = \frac{\partial J}{\partial \theta}(\theta)$.

$\u25a1$

## Appendix C: Temporal Derivatives Code for Error Derivatives: Proof

In order to prove theorem 3, we have to show that the process $(\tilde{S}_t, \tilde{\Theta}_t)$ satisfies the same differential equations as $(\bar{S}_t, \bar{\Theta}_t)$, namely, equations 3.5 to 3.8 (see theorem 1). We conclude by using the uniqueness of the solution of a differential equation with a given initial condition.

$\u25a1$

## Notes

^{1}

In general, the fixed point defined by equation 2.2 is not unique unless further assumptions are made on $E_\theta(s)$ (e.g., convexity). The fixed point depends on the initial state of the dynamics (see equation 2.1), and so does the objective function of equation 2.3. However, for ease of presentation, we avoid delving into these mathematical details here.

^{2}

In this expression, both the cost function $C_\theta(s)$ and the fixed point $s_\theta^0$ depend on $\theta$. $C_\theta(s)$ directly depends on $\theta$, whereas $s_\theta^0$ indirectly depends on $\theta$ through $E_\theta(s)$ (see equation 2.2).

^{3}

We choose to number the layers in increasing order from output to input, in the sense of propagation of error signals (see section 4).

^{4}

The case without the constraint of symmetric connections is studied in Scellier, Goyal, Binas, Mesnard, and Bengio (2018).

^{5}

Given two vectors $a = \left(a_1, \ldots, a_n\right)$ and $b = \left(b_1, \ldots, b_n\right)$, their elementwise product is $a \odot b = \left(a_1 b_1, \ldots, a_n b_n\right)$.

^{6}

In this specific example, the cost function $C_\theta(y,s)$ does not depend on $\theta$, $s^1$, or $s^2$.

^{7}

The quantity $\bar{\Theta}_t = \frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, t\right)$ represents the partial derivative of $L_\theta(s,t)$ with respect to $\theta$, evaluated at the fixed point $s = s_\theta^0$. This does not include the differentiation path through the fixed point $s_\theta^0$.

^{8}

Equation A.3 is the Kolmogorov backward equation for deterministic processes.

^{9}

The notations $\frac{\partial E_\theta^\beta}{\partial \theta}$ and $\frac{\partial E_\theta^\beta}{\partial \beta}$ are used to mean the partial derivatives with respect to the arguments of $E_\theta^\beta$, whereas $\frac{d}{d\theta}$ and $\frac{d}{d\beta}$ represent the total derivatives with respect to $\theta$ and $\beta$, respectively (which include the differentiation path through $s_\theta^\beta$). The total derivative $\frac{d}{d\theta}$ (resp. $\frac{d}{d\beta}$) is performed for fixed $\beta$ (resp. fixed $\theta$).

## Acknowledgments

We thank Jonathan Binas for feedback and discussions, as well as NSERC, CIFAR, Samsung, and Canada Research Chairs for funding.