## Abstract

Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.

## 1  Introduction

In deep learning, the backpropagation algorithm used to train neural networks requires a side network for the propagation of error derivatives, which is widely seen as biologically implausible (Crick, 1989). One fascinating hypothesis, first formulated by Hinton and McClelland (1988), is that in biological neural networks, error signals could be encoded in the temporal derivatives of the neural activities. This allows for error signals to be propagated in the network via the neuronal dynamics itself, without the need for a side network. Neural computation would correspond to both inference and error backpropagation. The work presented in this letter supports this hypothesis.

In section 2, we present the machine learning setting we are interested in. The neurons of the network follow the gradient of an energy function, such as the Hopfield energy (Cohen & Grossberg, 1983; Hopfield, 1984). Energy minima correspond to preferred states of the model. At prediction time, inputs are clamped and the network relaxes to a fixed point, corresponding to a local minimum of the energy function. The prediction is then read out on the output neurons. This corresponds to the first phase of the algorithm. The goal of learning is to minimize the cost at the fixed point, which we call the objective.

Section 3 presents recurrent backpropagation (Almeida, 1987; Pineda, 1987), an algorithm that computes the gradient of the objective. In the second phase of recurrent backpropagation, an iterative procedure computes error derivatives.

In section 4, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective. In the second phase of equilibrium propagation, when the target values for output neurons are observed, the output neurons are nudged toward their targets, and the network starts a second relaxation phase toward a second but nearby fixed point that corresponds to slightly smaller prediction error. The gradient of the objective can be computed based on a contrastive Hebbian learning rule at the first fixed point and second fixed point.

Section 5 (in particular, theorem 3) constitutes the main contribution of our work. We establish a close connection between recurrent backpropagation and equilibrium propagation. We show that at every moment in the second phase of equilibrium propagation, the temporal derivatives of the neural activities code for (i.e., are equal to) the intermediate error derivatives that recurrent backpropagation computes iteratively. Our work shows that no special computational path is required for the computation of the error derivatives in the second phase; the same information is available in the temporal derivatives of the neural activities. Furthermore, we show that in equilibrium propagation, halting the second phase before convergence to the second fixed point is equivalent to truncated recurrent backpropagation.

## 2  Machine Learning Setting

We consider the supervised setting in which we want to predict a target $y$ given an input $x$. The pair $(x,y)$ is a data point. The model is a network specified by a state variable $s$ and a parameter variable $\theta$. The dynamics of the network are determined by two differentiable scalar functions, $E_\theta(x,s)$ and $C_\theta(y,s)$, which we call the energy function and the cost function, respectively. In most of the letter, to simplify notations, we omit the dependence on $x$ and $y$ and simply write $E_\theta(s)$ and $C_\theta(s)$. Furthermore, we write $\frac{\partial E_\theta}{\partial \theta}(s)$ and $\frac{\partial E_\theta}{\partial s}(s)$ for the partial derivatives of $(\theta,s)\mapsto E_\theta(s)$ with respect to $\theta$ and $s$, respectively. Similarly, $\frac{\partial C_\theta}{\partial \theta}(s)$ and $\frac{\partial C_\theta}{\partial s}(s)$ denote the partial derivatives of $(\theta,s)\mapsto C_\theta(s)$.

The state variable $s$ is assumed to move spontaneously toward low-energy configurations by following the gradient of the energy function:
$\frac{ds}{dt}=-\frac{\partial E_\theta}{\partial s}(s).$
(2.1)
The state $s$ eventually settles to a minimum of the energy function, written $s_\theta^0$ and characterized by¹
$\frac{\partial E_\theta}{\partial s}\left(s_\theta^0\right)=0.$
(2.2)
Since the dynamics in equation 2.1 depends only on the input $x$ (through $E_\theta(x,s)$) and not on the target $y$, we call this relaxation phase the free phase, and the energy minimum $s_\theta^0$ is called the free fixed point.
The goal of learning is to find $\theta$ such that the cost at the fixed point, $C_\theta\left(s_\theta^0\right)$, is minimal.² We introduce the objective function (for a single data point $(x,y)$):
$J(\theta):=C_\theta\left(s_\theta^0\right).$
(2.3)
Note the distinction between the cost function and the objective function: the cost function $C_\theta(s)$ is defined for any state $s$, whereas the objective function $J(\theta)$ is the cost at the fixed point.

Several methods have been proposed to compute the gradient of $J$ with respect to $\theta$. Early work by Almeida (1987) and Pineda (1987) introduced the recurrent backpropagation algorithm, which we present in section 3. In Scellier and Bengio (2017) we proposed another algorithm that at first sight looks very different; we present it in section 4. In section 5 we show that there is in fact a profound connection between these two algorithms.

### 2.1  Example: Hopfield Model

In this section we propose particular forms for the energy function $Eθ(x,s)$ and the cost function $Cθ(y,s)$ to ease understanding. Nevertheless, the theory presented in this letter is general and does not rely on the particular forms of the functions $Eθ(x,s)$ and $Cθ(y,s)$ chosen here.

Recall that we consider the supervised setting where we must predict a target $y$ given an input $x$. To illustrate the idea, we consider the case where the neurons of the network are split into layers $s^0$, $s^1$, and $s^2$, as in Figure 1.³ In this setting, the state variable $s$ is the set of layers $s=\left(s^0,s^1,s^2\right)$. Each of the layers $s^0$, $s^1$, and $s^2$ is a vector whose coordinates are real numbers representing the membrane voltages of the neurons. The output layer $s^0$ corresponds to the layer where the prediction is read and has the same dimension as the target $y$. Furthermore, $\rho$ is a deterministic function (nonlinear activation) that maps a neuron's voltage onto its firing rate. We commit a small abuse of notation and write $\rho\left(s^i\right)$ for the vector of firing rates of the neurons in layer $s^i$; here the function $\rho$ is applied elementwise to the coordinates of the vector $s^i$. Therefore, the vector $\rho\left(s^i\right)$ has the same dimension as $s^i$. Finally, the parameter variable $\theta$ is the set of (bidirectional) weight matrices between the layers, $\theta=\left(W_{01},W_{12},W_{23}\right)$.

We consider the following modified Hopfield energy function:
$E_\theta(x,s)=\frac{1}{2}\left(\left\|s^0\right\|^2+\left\|s^1\right\|^2+\left\|s^2\right\|^2\right)-\rho\left(s^0\right)^T\cdot W_{01}\cdot\rho\left(s^1\right)-\rho\left(s^1\right)^T\cdot W_{12}\cdot\rho\left(s^2\right)-\rho\left(s^2\right)^T\cdot W_{23}\cdot\rho(x).$
(2.4)
With this choice of energy function, the dynamics (see equation 2.1) translate into a form of leaky integration neural dynamics with symmetric connections:⁴
$\frac{ds^0}{dt}=\rho'\left(s^0\right)\odot\left(W_{01}\cdot\rho\left(s^1\right)\right)-s^0,$
(2.5)
$\frac{ds^1}{dt}=\rho'\left(s^1\right)\odot\left(W_{12}\cdot\rho\left(s^2\right)+W_{01}^T\cdot\rho\left(s^0\right)\right)-s^1,$
(2.6)
$\frac{ds^2}{dt}=\rho'\left(s^2\right)\odot\left(W_{23}\cdot\rho(x)+W_{12}^T\cdot\rho\left(s^1\right)\right)-s^2.$
(2.7)
Here again the derivative of the function $\rho$ (denoted $\rho'$) is applied elementwise to the coordinates of the vectors $s^0$, $s^1$, and $s^2$, and the notation $\odot$ denotes elementwise multiplication.⁵
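To make the dynamics concrete, here is a minimal simulation sketch of the free-phase relaxation, equations 2.5 to 2.7, using a forward Euler discretization. Everything numerical here is an illustrative assumption, not taken from the letter: the layer sizes, the random weights, the choice $\rho=\tanh$, the step size `dt`, and the number of steps.

```python
import numpy as np

# Illustrative layer sizes and small random weights (assumptions).
rng = np.random.default_rng(0)
d0, d1, d2, dx = 2, 3, 3, 4
W01 = 0.1 * rng.standard_normal((d0, d1))
W12 = 0.1 * rng.standard_normal((d1, d2))
W23 = 0.1 * rng.standard_normal((d2, dx))
x = rng.standard_normal(dx)          # clamped input

rho = np.tanh                         # activation: voltage -> firing rate

def drho(s):
    # elementwise derivative of tanh
    return 1.0 - np.tanh(s) ** 2

s0, s1, s2 = np.zeros(d0), np.zeros(d1), np.zeros(d2)
dt = 0.1
for _ in range(2000):                 # relax toward the free fixed point
    ds0 = drho(s0) * (W01 @ rho(s1)) - s0                      # eq. 2.5
    ds1 = drho(s1) * (W12 @ rho(s2) + W01.T @ rho(s0)) - s1    # eq. 2.6
    ds2 = drho(s2) * (W23 @ rho(x) + W12.T @ rho(s1)) - s2     # eq. 2.7
    s0, s1, s2 = s0 + dt * ds0, s1 + dt * ds1, s2 + dt * ds2

# At the free fixed point the updates vanish (equation 2.2).
print(np.max(np.abs(np.concatenate([ds0, ds1, ds2]))))
```

With small random weights the leak terms dominate, so the updates shrink toward zero as the network settles at the free fixed point (equation 2.2).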
Figure 1: Graph of the network. Input $x$ is clamped. The state variable $s$ includes the hidden layers $s^2$ and $s^1$ and the output layer $s^0$ (the layer where the prediction is read). The output layer $s^0$ has the same dimension as the target $y$.


Finally, we consider the quadratic cost function,⁶
$C_\theta(y,s)=\frac{1}{2}\left\|y-s^0\right\|^2,$
(2.8)
which measures the discrepancy between the output layer $s^0$ and the target $y$.

Note that the results established in this letter hold for any energy function $E_\theta(s)$ and any cost function $C_\theta(s)$; they are not limited to the Hopfield energy and the quadratic cost (see equations 2.4 and 2.8).

## 3  Recurrent Backpropagation

In this section, we present recurrent backpropagation, an algorithm introduced by Almeida (1987) and Pineda (1987) that computes the gradient of $J$ (see equation 2.3). The original algorithm was described in the discrete-time setting and for a general state-to-state dynamics. Here we present it in the continuous-time setting in the particular case of a gradient dynamics (see equation 2.1). A direct derivation based on the adjoint method can also be found in LeCun, Touresky, Hinton, and Sejnowski (1988).

### 3.1  Projected Cost Function

Let $S_\theta^0(s,t)$ denote the state of the network at time $t\geq 0$ when it starts from an initial state $s$ at time $t=0$ and follows the free dynamics (see equation 2.1). In the theory of dynamical systems, $S_\theta^0(s,t)$ is called the flow. We introduce the projected cost function:
$L_\theta(s,t):=C_\theta\left(S_\theta^0(s,t)\right).$
(3.1)
This is the cost of the state projected a duration $t$ into the future, when the network starts from $s$ and follows the free dynamics. For fixed $s$, the process $\left(L_\theta(s,t)\right)_{t\geq 0}$ represents the successive cost values taken by the state of the network along the free dynamics when it starts from the initial state $s$. Notable cases include these:
• For $t=0$, the projected cost is simply the cost of the current state: $L_\theta(s,0)=C_\theta(s)$.

• As $t\to\infty$, the projected cost converges to the objective: $L_\theta(s,t)\to J(\theta)$.

The second property comes from the fact that the dynamics converges to the fixed point, $S_\theta^0(s,t)\to s_\theta^0$ as $t\to\infty$. Under mild regularity conditions on $E_\theta(s)$ and $C_\theta(s)$, the gradient of the projected cost function converges to the gradient of the objective function in the limit of infinite duration,
$\frac{\partial L_\theta}{\partial\theta}(s,t)\to\frac{\partial J}{\partial\theta}(\theta),$
(3.2)
as $t\to\infty$. Therefore, if we can compute $\frac{\partial L_\theta}{\partial\theta}(s,t)$ for a particular value of $s$ and for any $t\geq 0$, we can obtain the desired gradient $\frac{\partial J}{\partial\theta}(\theta)$ by letting $t\to\infty$. We will show next that this is what recurrent backpropagation does in the case where the initial state $s$ is the fixed point $s_\theta^0$.

### 3.2  Process of Error Derivatives

We introduce the process of error derivatives $\left(\bar S_t,\bar\Theta_t\right)_{t\geq 0}$, defined as
$\bar S_t:=\frac{\partial L_\theta}{\partial s}\left(s_\theta^0,t\right),\quad t\geq 0,$
(3.3)
$\bar\Theta_t:=\frac{\partial L_\theta}{\partial\theta}\left(s_\theta^0,t\right),\quad t\geq 0.$
(3.4)
The process $\bar S_t$ takes values in the state space (the space of the state variable $s$), and the process $\bar\Theta_t$ takes values in the parameter space (the space of the parameter variable $\theta$).⁷ The recurrent backpropagation algorithm computes $\bar S_t$ and $\bar\Theta_t$ iteratively for increasing values of $t$.
Theorem 1
(recurrent backpropagation). The process of error derivatives $\left(\bar S_t,\bar\Theta_t\right)$ satisfies
$\bar S_0=\frac{\partial C_\theta}{\partial s}\left(s_\theta^0\right),$
(3.5)
$\bar\Theta_0=\frac{\partial C_\theta}{\partial\theta}\left(s_\theta^0\right),$
(3.6)
$\frac{d}{dt}\bar S_t=-\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)\cdot\bar S_t,$
(3.7)
$\frac{d}{dt}\bar\Theta_t=-\frac{\partial^2 E_\theta}{\partial\theta\,\partial s}\left(s_\theta^0\right)\cdot\bar S_t.$
(3.8)
Theorem 1, proved in appendix A, offers a two-phase method to compute the gradient $\frac{\partial J}{\partial\theta}(\theta)$. In the first phase (or free phase), the state variable $s$ follows the free dynamics (see equation 2.1) and relaxes to the fixed point $s_\theta^0$. Reaching this fixed point is necessary for evaluating the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$, which is required in the second phase. In the second phase, one computes $\bar S_t$ and $\bar\Theta_t$ iteratively for increasing values of $t$ using equations 3.5 to 3.8. We obtain the desired gradient in the limit $t\to\infty$, as a consequence of equation 3.2:
$\bar\Theta_t\to\frac{\partial J}{\partial\theta}(\theta).$
(3.9)

Note that the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$ is positive definite since $s_\theta^0$ is an energy minimum. Therefore, equation 3.7 guarantees that $\frac{\partial L_\theta}{\partial s}\left(s_\theta^0,t\right)\to 0$ as $t\to\infty$, in agreement with the fact that $J(\theta)$ is (locally) insensitive to the initial state ($s=s_\theta^0$ in our case).
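As a sanity check of this two-phase procedure, the recursion of theorem 1 can be run on a toy model small enough to solve by hand. The model and all numbers below are illustrative assumptions, not from the letter: $E_\theta(s)=\frac{1}{2}s^2-\theta s$, whose free fixed point is $s_\theta^0=\theta$, and $C(s)=\frac{1}{2}(s-y)^2$, so that $J(\theta)=\frac{1}{2}(\theta-y)^2$ and $\frac{\partial J}{\partial\theta}=\theta-y$.

```python
# Toy check of theorem 1 (recurrent backpropagation) on
# E_theta(s) = s^2/2 - theta*s and C(s) = (s - y)^2/2,
# for which dJ/dtheta = theta - y. Numbers are illustrative.
theta, y = 1.5, 0.2
s_star = theta                  # free fixed point: dE/ds = s - theta = 0

d2E_ds2 = 1.0                   # Hessian of E at the fixed point
d2E_dtheta_ds = -1.0            # cross-derivative of E

S_bar = s_star - y              # eq. 3.5: dC/ds at the fixed point
Theta_bar = 0.0                 # eq. 3.6: dC/dtheta = 0 for this C
dt = 0.01
for _ in range(5000):           # second phase: Euler steps on eqs. 3.7, 3.8
    Theta_bar += dt * (-d2E_dtheta_ds * S_bar)
    S_bar += dt * (-d2E_ds2 * S_bar)

print(Theta_bar, theta - y)     # Theta_bar converges to dJ/dtheta
```

Here $\bar S_t$ decays as $(\theta-y)e^{-t}$, and integrating it against $-\frac{\partial^2 E_\theta}{\partial\theta\,\partial s}=1$ drives $\bar\Theta_t$ to $\theta-y$, matching equation 3.9.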

From the point of view of biological plausibility, the requirement to run the dynamics of $\bar S_t$ and $\bar\Theta_t$ to compute the gradient $\frac{\partial J}{\partial\theta}(\theta)$ is not satisfying: it is not clear what the quantities $\bar S_t$ and $\bar\Theta_t$ would represent in a biological network. We address this issue in sections 4 and 5.

## 4  Equilibrium Propagation

In this section, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective function $J$, equation 2.3. At first sight, equilibrium propagation and recurrent backpropagation have little in common. However, in section 5, we will show a profound connection between these algorithms.

### 4.1  Augmented Energy Function

The central idea of equilibrium propagation is to introduce the augmented energy function,
$E_\theta^\beta(s):=E_\theta(s)+\beta C_\theta(s),$
(4.1)
where $\beta\geq 0$ is a scalar that we call the influence parameter. The free dynamics (see equation 2.1) is then replaced by the augmented dynamics:
$\frac{ds}{dt}=-\frac{\partial E_\theta^\beta}{\partial s}(s).$
(4.2)
The state variable now follows the dynamics $\frac{ds}{dt}=-\frac{\partial E_\theta}{\partial s}(s)-\beta\frac{\partial C_\theta}{\partial s}(s)$. When $\beta>0$, in addition to the usual term $-\frac{\partial E_\theta}{\partial s}(s)$, a term $-\beta\frac{\partial C_\theta}{\partial s}(s)$ nudges $s$ toward configurations that have lower cost values. In the case of the model described in section 2.1 with the quadratic cost function (see equation 2.8), the new term $-\beta\frac{\partial C_\theta}{\partial s}(s)$ is the vector of dimension $\dim(s)$ whose component on $s^0$ is $\beta\left(y-s^0\right)$ and whose components on $s^1$ and $s^2$ are zero. Thus, the new term takes the form of a force that nudges the output layer $s^0$ toward the target $y$. Unlike the free dynamics, which depends only on $x$ (through $E_\theta(x,s)$), not on $y$, the augmented dynamics also depends on $y$ (through $C_\theta(y,s)$).
Note that the free dynamics corresponds to the value $\beta=0$. We then generalize the notion of fixed point to any value of $\beta$: the augmented dynamics converges to the fixed point $s_\theta^\beta$, an energy minimum of $E_\theta^\beta$, characterized by
$\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right)=0.$
(4.3)

Theorem 2 shows that the gradient $\frac{\partial J}{\partial\theta}(\theta)$ can be estimated based on measurements at the fixed points $s_\theta^0$ and $s_\theta^\beta$.

Theorem 2
(equilibrium propagation). The gradient of the objective function with respect to $\theta$ can be estimated using the formula
$\frac{\partial J}{\partial\theta}(\theta)=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial\theta}\left(s_\theta^\beta\right)-\frac{\partial E_\theta^0}{\partial\theta}\left(s_\theta^0\right)\right).$
(4.4)

A proof of theorem 2 is given in appendix B. Note that theorem 2 is also a consequence of the more general formula of equation 5.5 (theorem 3), established in the next section.

Theorem 2 offers another way to estimate the gradient of $J(\theta)$. As in recurrent backpropagation, in the first phase (or free phase), the network follows the free dynamics (see equation 2.1). This is equivalent to saying that the network follows the augmented dynamics (see equation 4.2) with the value of $\beta$ set to 0. The network relaxes to the free fixed point $s_\theta^0$, where $\frac{\partial E_\theta}{\partial\theta}\left(s_\theta^0\right)$ is measured. In the second phase, which we call the nudged phase, the influence parameter takes on a small positive value $\beta\gtrsim 0$, and the network relaxes to a new but nearby fixed point $s_\theta^\beta$, where $\frac{\partial E_\theta^\beta}{\partial\theta}\left(s_\theta^\beta\right)$ is measured. The gradient of the objective function is estimated using the formula in equation 4.4.
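The two measurements and the finite-$\beta$ estimate of equation 4.4 can be sketched on a toy model with a single neuron (an illustrative assumption, not from the letter): $E_\theta(s)=\frac{1}{2}s^2-\theta s$ and $C(s)=\frac{1}{2}(s-y)^2$, for which $\frac{\partial E_\theta}{\partial\theta}(s)=-s$ and the true gradient is $\theta-y$. The values of `theta`, `y`, `beta`, and `dt` are illustrative.

```python
# Two-phase equilibrium propagation estimate (eq. 4.4) on the toy
# model E_theta(s) = s^2/2 - theta*s, C(s) = (s - y)^2/2.
theta, y, beta, dt = 1.5, 0.2, 1e-3, 0.1

def relax(s, nudge):
    # gradient descent on the augmented energy E + nudge*C (eq. 4.2);
    # nudge = 0 gives the free phase, nudge = beta the nudged phase
    for _ in range(5000):
        s += dt * (-(s - theta) - nudge * (s - y))
    return s

s_free = relax(0.0, 0.0)        # first phase: free fixed point
s_nudged = relax(s_free, beta)  # second phase: nudged fixed point

# dE/dtheta(s) = -s at each fixed point, plugged into eq. 4.4:
grad_est = (1.0 / beta) * (-s_nudged - (-s_free))
print(grad_est, theta - y)      # estimate approaches theta - y
```

For this model the estimate equals $(\theta-y)/(1+\beta)$ at the exact fixed points, so it approaches the true gradient as $\beta\to 0$.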

In the case of the modified Hopfield energy (see equation 2.4), the components of $\frac{\partial E_\theta}{\partial\theta}(s)$ are $\frac{\partial E_\theta}{\partial W_{01}}(s)$, $\frac{\partial E_\theta}{\partial W_{12}}(s)$, and $\frac{\partial E_\theta}{\partial W_{23}}(s)$. For instance, $\frac{\partial E_\theta}{\partial W_{01}}(s)=-\rho\left(s^0\right)\cdot\rho\left(s^1\right)^T$ is a matrix of size $\dim\left(s^0\right)\times\dim\left(s^1\right)$ whose entries can be measured locally at each synapse based on the presynaptic and postsynaptic activities. Thus, the learning rule of equation 4.4 is a kind of contrastive Hebbian learning rule at the free and nudged fixed points.

At the beginning of the second phase, the network is initially at the free fixed point $s_\theta^0$. When the influence parameter takes on a small positive value $\beta\gtrsim 0$, the novel term $-\beta\frac{\partial C_\theta}{\partial s}(s)$ in the dynamics of the state variable perturbs the system. This perturbation propagates through the layers of the network until convergence to the new fixed point $s_\theta^\beta$.

In the next section, we go beyond the analysis of fixed points and show that at every moment $t$ of the nudged phase, the temporal derivative $\frac{ds}{dt}$ encodes the error derivative of equation 3.3.

## 5  Temporal Derivatives Code for Error Derivatives

Theorem 2 shows that the gradient of $J$ can be estimated based on the free and nudged fixed points only. In this section, we study the dynamics of the network in the second phase, from the free fixed point to the nudged fixed point. Recall that $S_\theta^0(s,t)$ is the flow of the dynamical system of equation 2.1, that is, the state of the network at time $t\geq 0$ when it starts from an initial state $s$ at time $t=0$ and follows the free dynamics. Similarly, we define $S_\theta^\beta(s,t)$ for any value of $\beta$ when the network follows the augmented dynamics (see equation 4.2).

In equilibrium propagation, the state of the network at the beginning of the nudged phase is the free fixed point $s_\theta^0$. We choose as the origin of time $t=0$ the moment when the second phase starts: the network is in the state $s_\theta^0$, and the influence parameter takes on a small positive value $\beta\gtrsim 0$. With our notations, the state of the network after a duration $t$ in the nudged phase is $S_\theta^\beta\left(s_\theta^0,t\right)$. As $t\to\infty$, the network's state converges to the nudged fixed point: $S_\theta^\beta\left(s_\theta^0,t\right)\to s_\theta^\beta$.

### 5.1  Process of Temporal Derivatives

Now we are ready to introduce the process of temporal derivatives $\left(\tilde S_t,\tilde\Theta_t\right)_{t\geq 0}$, defined by
$\tilde S_t:=-\lim_{\beta\to 0}\frac{1}{\beta}\frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0,t\right),$
(5.1)
$\tilde\Theta_t:=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)-\frac{\partial E_\theta^0}{\partial\theta}\left(s_\theta^0\right)\right).$
(5.2)
Like $\bar S_t$ and $\bar\Theta_t$, the processes $\tilde S_t$ and $\tilde\Theta_t$ take values in the state space and the parameter space, respectively.

The process $\tilde S_t$ is simply the temporal derivative $\frac{ds}{dt}$ in the second phase, rescaled by $\frac{1}{\beta}$ (so that its value does not depend on the particular choice of $\beta\gtrsim 0$).

Theorem 3
(temporal derivatives as error derivatives). The process of error derivatives $\left(\bar S_t,\bar\Theta_t\right)$ and the process of temporal derivatives $\left(\tilde S_t,\tilde\Theta_t\right)$ are equal:
$\forall t\geq 0,\quad \bar S_t=\tilde S_t,\quad \bar\Theta_t=\tilde\Theta_t,$
(5.3)
or, in explicit form,
$\frac{\partial L_\theta}{\partial s}\left(s_\theta^0,t\right)=-\lim_{\beta\to 0}\frac{1}{\beta}\frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0,t\right),$
(5.4)
$\frac{\partial L_\theta}{\partial\theta}\left(s_\theta^0,t\right)=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)-\frac{\partial E_\theta^0}{\partial\theta}\left(s_\theta^0\right)\right).$
(5.5)

Theorem 3 is proved in appendix C. In essence, equation 5.4 says that in the second phase of equilibrium propagation, the temporal derivative $\frac{ds}{dt}$ (rescaled by $\frac{1}{\beta}$) encodes the error derivative of equation 3.3.

Here is an interpretation of equation 5.4. Suppose that the network is initially at the fixed point $s=s_\theta^0$. Consider the cost $L_\theta\left(s_\theta^0+\Delta s,t\right)$ a duration $t$ in the future if one moved the initial state $s=s_\theta^0$ by a small step $\Delta s$. The goal is to find the direction $\Delta s$ that minimizes $L_\theta\left(s_\theta^0+\Delta s,t\right)$. The naive approach by trial and error is neither biologically plausible nor efficient. Equation 5.4 tells us that there is a physically realistic way of finding such a direction $\Delta s$ in one attempt: this direction is encoded in the temporal derivative $\frac{ds}{dt}$ at time $t$ after the start of the nudged phase.

Note that as $t\to\infty$, both sides of equation 5.4 converge to 0. This is a consequence of equation 3.7 and the fact that the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$ is positive definite (since $s_\theta^0$ is an energy minimum), as already mentioned in section 3. Intuitively, the right-hand side converges to 0 because $S_\theta^\beta\left(s_\theta^0,t\right)$ converges smoothly to the nudged fixed point $s_\theta^\beta$. As for the left-hand side, when $t$ is large, $L_\theta(s,t)$ is close to the cost of the energy minimum and thus has little sensitivity to the initial state $s$.

Finally, as $t→∞$ in equation 5.5, one recovers the gradient formula of equilibrium propagation (theorem 2). Interestingly, equation 5.5 shows that in equilibrium propagation, halting the second phase before convergence to the nudged fixed point corresponds to truncated recurrent backpropagation.
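Theorem 3 can also be checked numerically on a one-neuron toy model (an illustrative assumption, not from the letter): with $E_\theta(s)=\frac{1}{2}s^2-\theta s$ and $C(s)=\frac{1}{2}(s-y)^2$, equations 3.5 and 3.7 can be solved by hand, giving $\bar S_t=(\theta-y)e^{-t}$. The sketch below compares this error derivative with the rescaled temporal derivative $-\frac{1}{\beta}\frac{ds}{dt}$ measured along a simulated nudged phase; `theta`, `y`, `beta`, and `dt` are illustrative choices.

```python
import numpy as np

# Numeric sketch of theorem 3 on E_theta(s) = s^2/2 - theta*s and
# C(s) = (s - y)^2/2: during the nudged phase, -(1/beta) * ds/dt
# should track the error derivative S_bar_t = (theta - y) * exp(-t).
theta, y, beta, dt = 1.5, 0.2, 1e-4, 0.01

s = theta                            # start at the free fixed point
t, worst = 0.0, 0.0
for _ in range(500):                 # nudged phase, eq. 4.2
    ds_dt = -(s - theta) - beta * (s - y)
    S_tilde = -ds_dt / beta          # rescaled temporal derivative (eq. 5.1)
    S_bar = (theta - y) * np.exp(-t)  # error derivative, eq. 3.3
    worst = max(worst, abs(S_tilde - S_bar))
    s += dt * ds_dt
    t += dt

print(worst)  # small: temporal derivatives match error derivatives
```

The residual discrepancy comes only from the Euler discretization and the finite $\beta$; both vanish as $dt\to 0$ and $\beta\to 0$.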

## 6  Conclusion

Our work establishes a close connection between two algorithms for fixed-point recurrent networks: recurrent backpropagation and equilibrium propagation. The temporal derivatives of the neural activities in the second phase of equilibrium propagation are equal to the error derivatives that recurrent backpropagation computes iteratively. Moreover, we have shown that halting the second phase before convergence in equilibrium propagation is equivalent to truncated recurrent backpropagation. Our work supports the hypothesis that in biological networks, temporal changes in neural activities may represent the error signals required for supervised learning.

One important drawback of the theory presented here is that it assumes the existence of an energy function. In the case of the Hopfield energy, this implies symmetric connections between neurons. However, the analysis presented here can be generalized to dynamics that do not involve energy functions. This is the subject of Scellier et al. (2018). Another concern is the fact that our algorithm is rate based, whereas biological neurons emit spikes. Ideally we would like a theory applicable to spiking networks. Finally, the assumption of the existence of specialized output neurons ($s0$ here) would need to be relaxed too.

From a practical point of view, another issue is that the time needed to converge to the first fixed point was experimentally found to grow exponentially with the number of layers in Scellier and Bengio (2017). Although equation 5.5 provides a new justification for saving time by stopping the second phase early, our algorithm (as well as recurrent backpropagation) still requires convergence to the free fixed point in the first phase.

## Appendix A:  Recurrent Backpropagation: Proof

Proof of Theorem 1.
First, by definition of $L$ (equation 3.1), we have $L_\theta(s,0)=C_\theta(s)$. Therefore, the initial conditions, equations 3.5 and 3.6, are satisfied:
$\bar S_0=\frac{\partial L_\theta}{\partial s}\left(s_\theta^0,0\right)=\frac{\partial C_\theta}{\partial s}\left(s_\theta^0\right)$
(A.1)
and
$\bar\Theta_0=\frac{\partial L_\theta}{\partial\theta}\left(s_\theta^0,0\right)=\frac{\partial C_\theta}{\partial\theta}\left(s_\theta^0\right).$
(A.2)
It remains to show equations 3.7 and 3.8. Temporarily, we omit writing the dependence on $\theta$ to keep notations simple. As a preliminary result, we show that for any initial state $s$ and time $t$, we have⁸
$\frac{\partial L}{\partial t}(s,t)+\frac{\partial L}{\partial s}(s,t)\cdot\frac{\partial E}{\partial s}(s)=0.$
(A.3)
To this end, note that (by definition of $L$ and $S^0$) we have for any $t$ and $u$,
$L\left(S^0(s,u),t\right)=L(s,t+u).$
(A.4)
The derivatives of the right-hand side of equation A.4 with respect to $t$ and $u$ are clearly equal:
$\frac{d}{dt}L(s,t+u)=\frac{d}{du}L(s,t+u).$
(A.5)
Therefore the derivatives of the left-hand side of equation A.4 are equal too:
$\frac{\partial L}{\partial t}\left(S^0(s,u),t\right)=\frac{d}{du}L\left(S^0(s,u),t\right)$
(A.6)
$=-\frac{\partial L}{\partial s}\left(S^0(s,u),t\right)\cdot\frac{\partial E}{\partial s}\left(S^0(s,u)\right).$
(A.7)
Here we have used the differential equation of motion, equation 2.1. Evaluating this expression at $u=0$, we get equation A.3.
Now we are ready to show that $\bar S_t=\frac{\partial L}{\partial s}\left(s^0,t\right)$ satisfies the differential equation 3.7. Differentiating equation A.3 with respect to $s$, we get
$\frac{\partial^2 L}{\partial t\,\partial s}(s,t)+\frac{\partial^2 L}{\partial s^2}(s,t)\cdot\frac{\partial E}{\partial s}(s)+\frac{\partial L}{\partial s}(s,t)\cdot\frac{\partial^2 E}{\partial s^2}(s)=0.$
(A.8)
Evaluating this expression at the fixed point $s=s^0$ and using the fixed-point condition $\frac{\partial E}{\partial s}\left(s^0\right)=0$, we get
$\frac{d}{dt}\frac{\partial L}{\partial s}\left(s^0,t\right)=-\frac{\partial^2 E}{\partial s^2}\left(s^0\right)\cdot\frac{\partial L}{\partial s}\left(s^0,t\right).$
(A.9)
Therefore $\bar S_t=\frac{\partial L}{\partial s}\left(s^0,t\right)$ satisfies equation 3.7.
We prove equation 3.8 similarly. Differentiating equation A.3 with respect to $\theta$, we get
$\frac{\partial^2 L_\theta}{\partial t\,\partial\theta}(s,t)+\frac{\partial^2 L_\theta}{\partial s\,\partial\theta}(s,t)\cdot\frac{\partial E_\theta}{\partial s}(s)+\frac{\partial L_\theta}{\partial s}(s,t)\cdot\frac{\partial^2 E_\theta}{\partial s\,\partial\theta}(s)=0.$
(A.10)
Evaluating this expression at the fixed point $s=s_\theta^0$, we get
$\frac{d}{dt}\frac{\partial L_\theta}{\partial\theta}\left(s_\theta^0,t\right)=-\frac{\partial^2 E_\theta}{\partial\theta\,\partial s}\left(s_\theta^0\right)\cdot\frac{\partial L_\theta}{\partial s}\left(s_\theta^0,t\right).$
(A.11)
Hence the result.

$□$

## Appendix B:  Equilibrium Propagation: Proof

In this appendix, we prove theorem 2. The same proof is provided in Scellier and Bengio (2017).

Since the data point $(x,y)$ does not play any role, its dependence is omitted from the notations. We assume that the energy function $E_\theta(s)$ and the cost function $C_\theta(s)$ (and thus the augmented energy function $E_\theta^\beta(s)$) are twice differentiable and that the conditions of the implicit function theorem are satisfied, so that the fixed point $s_\theta^\beta$ is a continuously differentiable function of $(\theta,\beta)$.

Proof of Theorem 2.
Recall that we want to show the gradient formula:
$\frac{\partial J}{\partial\theta}(\theta)=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial\theta}\left(s_\theta^\beta\right)-\frac{\partial E_\theta^0}{\partial\theta}\left(s_\theta^0\right)\right).$
(B.1)
The gradient formula, equation B.1, is a particular case of the following formula,⁹ when evaluated at the point $\beta=0$:
$\frac{d}{d\theta}\frac{\partial E_\theta^\beta}{\partial\beta}\left(s_\theta^\beta\right)=\frac{d}{d\beta}\frac{\partial E_\theta^\beta}{\partial\theta}\left(s_\theta^\beta\right).$
(B.2)
Therefore, in order to prove equation B.1, it is sufficient to prove equation B.2.
First, the cross-derivatives of $(\theta,\beta)\mapsto E_\theta^\beta\left(s_\theta^\beta\right)$ are equal:
$\frac{d}{d\theta}\frac{d}{d\beta}E_\theta^\beta\left(s_\theta^\beta\right)=\frac{d}{d\beta}\frac{d}{d\theta}E_\theta^\beta\left(s_\theta^\beta\right).$
(B.3)
Second, by the chain rule of differentiation, we have
$\frac{d}{d\beta}E_\theta^\beta\left(s_\theta^\beta\right)=\frac{\partial E_\theta^\beta}{\partial\beta}\left(s_\theta^\beta\right)+\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right)\cdot\frac{\partial s_\theta^\beta}{\partial\beta}$
(B.4)
$=\frac{\partial E_\theta^\beta}{\partial\beta}\left(s_\theta^\beta\right).$
(B.5)
Here we have used the fixed-point condition,
$\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right)=0.$
(B.6)
Similarly, we have
$\frac{d}{d\theta}E_\theta^\beta\left(s_\theta^\beta\right)=\frac{\partial E_\theta^\beta}{\partial\theta}\left(s_\theta^\beta\right).$
(B.7)
Plugging equations B.5 and B.7 into B.3, we get equation B.2. Hence the result.

$□$

## Appendix C:  Temporal Derivatives Code for Error Derivatives: Proof

Proof of Theorem 3.

In order to prove theorem 3, we have to show that the process $(S˜t,Θ˜t)$ satisfies the same differential equations as $(S¯t,Θ¯t)$, namely, equations 3.5 to 3.8 (see theorem 1). We conclude by using the uniqueness of the solution to the differential equation with initial condition.

First, note that
$\frac{\partial^2 S_\theta^\beta}{\partial\beta\,\partial t}\bigg|_{\beta=0}\left(s_\theta^0,t\right)=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0,t\right)-\frac{\partial S_\theta^0}{\partial t}\left(s_\theta^0,t\right)\right)$
(C.1)
$=\lim_{\beta\to 0}\frac{1}{\beta}\frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0,t\right).$
(C.2)
The latter equality comes from the fact that $S_\theta^0\left(s_\theta^0,t\right)=s_\theta^0$ for every $t\geq 0$, implying that $\frac{\partial S_\theta^0}{\partial t}\left(s_\theta^0,t\right)=0$ at every moment $t\geq 0$. Furthermore,
$\frac{d}{d\beta}\bigg|_{\beta=0}\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)-\frac{\partial E_\theta^0}{\partial\theta}\left(S_\theta^0\left(s_\theta^0,t\right)\right)\right)$
(C.3)
$=\lim_{\beta\to 0}\frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)-\frac{\partial E_\theta}{\partial\theta}\left(s_\theta^0\right)\right).$
(C.4)
Again, the latter equality comes from the fact that $S_\theta^0\left(s_\theta^0,t\right)=s_\theta^0$ for every $t\geq 0$. Therefore,
$\tilde S_t=-\frac{\partial^2 S_\theta^\beta}{\partial\beta\,\partial t}\bigg|_{\beta=0}\left(s_\theta^0,t\right),\quad\forall t\geq 0,$
(C.5)
$\tilde\Theta_t=\frac{d}{d\beta}\bigg|_{\beta=0}\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right),\quad\forall t\geq 0.$
(C.6)
Now we prove that $\tilde S_t$ is the solution of equations 3.5 and 3.7. We omit writing the dependence on $\theta$ to keep notations simple. The process $\left(S^\beta\left(s^0,t\right)\right)_{t\geq 0}$ is the solution of the differential equation,
$\frac{\partial S^\beta}{\partial t}\left(s^0,t\right)=-\frac{\partial E^\beta}{\partial s}\left(S^\beta\left(s^0,t\right)\right),$
(C.7)
with initial condition $S^\beta\left(s^0,0\right)=s^0$. Differentiating equation C.7 with respect to $\beta$, we get
$\frac{d}{dt}\frac{\partial S^\beta}{\partial\beta}\left(s^0,t\right)=-\frac{\partial^2 E^\beta}{\partial s\,\partial\beta}\left(S^\beta\left(s^0,t\right)\right)-\frac{\partial^2 E^\beta}{\partial s^2}\left(S^\beta\left(s^0,t\right)\right)\cdot\frac{\partial S^\beta}{\partial\beta}\left(s^0,t\right).$
(C.8)
Evaluating at $\beta=0$ and using the fact that $S^0\left(s^0,t\right)=s^0$, we get
$\frac{d}{dt}\frac{\partial S^\beta}{\partial\beta}\bigg|_{\beta=0}\left(s^0,t\right)=-\frac{\partial C}{\partial s}\left(s^0\right)-\frac{\partial^2 E}{\partial s^2}\left(s^0\right)\cdot\frac{\partial S^\beta}{\partial\beta}\bigg|_{\beta=0}\left(s^0,t\right).$
(C.9)
Since at time $t=0$ the initial state of the network, $S^\beta\left(s^0,0\right)=s^0$, is independent of $\beta$, we have
$\frac{\partial S^\beta}{\partial\beta}\left(s^0,0\right)=0.$
(C.10)
Therefore, evaluating equation C.9 at $t=0$, we get the initial condition, equation 3.5:
$\tilde S_0=-\frac{\partial^2 S^\beta}{\partial t\,\partial\beta}\bigg|_{\beta=0}\left(s^0,0\right)=\frac{\partial C}{\partial s}\left(s^0\right).$
(C.11)
Moreover, differentiating equation C.9 with respect to time, we get
$\frac{d}{dt}\frac{\partial^2 S^\beta}{\partial t\,\partial\beta}\bigg|_{\beta=0}\left(s^0,t\right)=-\frac{\partial^2 E}{\partial s^2}\left(s^0\right)\cdot\frac{\partial^2 S^\beta}{\partial t\,\partial\beta}\bigg|_{\beta=0}\left(s^0,t\right).$
(C.12)
Hence equation 3.7:
$\frac{d}{dt}\tilde S_t=-\frac{\partial^2 E}{\partial s^2}\left(s^0\right)\cdot\tilde S_t.$
(C.13)
Now we prove the result for $\tilde\Theta_t$ (see equations 3.6 and 3.8). First, we differentiate $\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)$ with respect to $\beta$:
$\frac{d}{d\beta}\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)=\frac{\partial^2 E_\theta^\beta}{\partial\theta\,\partial\beta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)+\frac{\partial^2 E_\theta^\beta}{\partial\theta\,\partial s}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)\cdot\frac{\partial S_\theta^\beta}{\partial\beta}\left(s_\theta^0,t\right).$
(C.14)
Again we evaluate at $\beta=0$ and use the fact that $S_\theta^0\left(s_\theta^0,t\right)=s_\theta^0$. We get
$\frac{d}{d\beta}\bigg|_{\beta=0}\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)=\frac{\partial C_\theta}{\partial\theta}\left(s_\theta^0\right)+\frac{\partial^2 E_\theta}{\partial\theta\,\partial s}\left(s_\theta^0\right)\cdot\frac{\partial S_\theta^\beta}{\partial\beta}\bigg|_{\beta=0}\left(s_\theta^0,t\right).$
(C.15)
Evaluating equation C.15 at time $t=0$ and using equation C.10, we get the initial condition, equation 3.6:
$\tilde\Theta_0=\frac{d}{d\beta}\bigg|_{\beta=0}\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,0\right)\right)=\frac{\partial C_\theta}{\partial\theta}\left(s_\theta^0\right).$
(C.16)
Moreover, differentiating equation C.15 with respect to time, we get
$\frac{d}{dt}\frac{d}{d\beta}\bigg|_{\beta=0}\frac{\partial E_\theta^\beta}{\partial\theta}\left(S_\theta^\beta\left(s_\theta^0,t\right)\right)=\frac{\partial^2 E_\theta}{\partial\theta\,\partial s}\left(s_\theta^0\right)\cdot\frac{\partial^2 S_\theta^\beta}{\partial t\,\partial\beta}\bigg|_{\beta=0}\left(s_\theta^0,t\right).$
(C.17)
Hence equation 3.8:
$\frac{d}{dt}\tilde\Theta_t=-\frac{\partial^2 E_\theta}{\partial\theta\,\partial s}\left(s_\theta^0\right)\cdot\tilde S_t.$
(C.18)
This completes the proof.

$□$

## Notes

1. In general, the fixed point defined by equation 2.2 is not unique unless further assumptions are made on $E_\theta(s)$ (e.g., convexity). The fixed point depends on the initial state of the dynamics (see equation 2.1), and so does the objective function of equation 2.3. However, for ease of presentation, we avoid delving into these mathematical details here.

2. In this expression, both the cost function $C_\theta(s)$ and the fixed point $s_\theta^0$ depend on $\theta$: $C_\theta(s)$ depends on $\theta$ directly, whereas $s_\theta^0$ depends on $\theta$ indirectly through $E_\theta(s)$ (see equation 2.2).

3. We choose to number the layers in increasing order from output to input, in the sense of the propagation of error signals (see section 4).

4. The case without the constraint of symmetric connections is studied in Scellier, Goyal, Binas, Mesnard, and Bengio (2018).

5. Given two vectors $a=\left(a_1,\ldots,a_n\right)$ and $b=\left(b_1,\ldots,b_n\right)$, their elementwise product is $a\odot b=\left(a_1 b_1,\ldots,a_n b_n\right)$.

6. In this specific example, the cost function $C_\theta(y,s)$ does not depend on $\theta$, $s^1$, or $s^2$.

7. The quantity $\bar\Theta_t=\frac{\partial L_\theta}{\partial\theta}\left(s_\theta^0,t\right)$ represents the partial derivative of $L_\theta(s,t)$ with respect to $\theta$, evaluated at the fixed point $s=s_\theta^0$. This does not include the differentiation path through the fixed point $s_\theta^0$.

8. Equation A.3 is the Kolmogorov backward equation for deterministic processes.

9. The notations $\frac{\partial E_\theta^\beta}{\partial\theta}$ and $\frac{\partial E_\theta^\beta}{\partial\beta}$ denote the partial derivatives with respect to the arguments of $E_\theta^\beta$, whereas $\frac{d}{d\theta}$ and $\frac{d}{d\beta}$ denote the total derivatives with respect to $\theta$ and $\beta$, respectively (which include the differentiation path through $s_\theta^\beta$). The total derivative $\frac{d}{d\theta}$ (resp. $\frac{d}{d\beta}$) is taken for fixed $\beta$ (resp. fixed $\theta$).

## Acknowledgments

We thank Jonathan Binas for feedback and discussions, as well as NSERC, CIFAR, Samsung, and Canada Research Chairs for funding.

## References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the First International Conference on Neural Networks (vol. 2, pp. 609-618). Piscataway, NJ: IEEE.

Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 5, 815-826.

Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203), 129-132.

Hinton, G. E., & McClelland, J. L. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 358-366). College Park, MD: American Institute of Physics.

Hopfield, J. J. (1984). Neurons with graded responses have collective computational properties like those of two-state neurons. PNAS, 81, 3088-3092.

LeCun, Y., Touresky, D., Hinton, G., & Sejnowski, T. (1988). A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School (pp. 21-28). San Mateo, CA: Morgan Kaufmann.

Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59, 2229-2232.

Scellier, B., & Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11.

Scellier, B., Goyal, A., Binas, J., Mesnard, T., & Bengio, Y. (2018). Generalization of equilibrium propagation to vector field dynamics. arXiv:1808.04873.