Abstract

Recurrent backpropagation and equilibrium propagation are supervised learning algorithms for fixed-point recurrent neural networks, which differ in their second phase. In the first phase, both algorithms converge to a fixed point that corresponds to the configuration where the prediction is made. In the second phase, equilibrium propagation relaxes to another nearby fixed point corresponding to smaller prediction error, whereas recurrent backpropagation uses a side network to compute error derivatives iteratively. In this work, we establish a close connection between these two algorithms. We show that at every moment in the second phase, the temporal derivatives of the neural activities in equilibrium propagation are equal to the error derivatives computed iteratively by recurrent backpropagation in the side network. This work shows that it is not required to have a side network for the computation of error derivatives and supports the hypothesis that in biological neural networks, temporal derivatives of neural activities may code for error signals.

1  Introduction

In deep learning, the backpropagation algorithm used to train neural networks requires a side network for the propagation of error derivatives, which is widely seen as biologically implausible (Crick, 1989). One fascinating hypothesis, first formulated by Hinton and McClelland (1988), is that in biological neural networks, error signals could be encoded in the temporal derivatives of the neural activities. This allows for error signals to be propagated in the network via the neuronal dynamics itself, without the need for a side network. Neural computation would correspond to both inference and error backpropagation. The work presented in this letter supports this hypothesis.

In section 2, we present the machine learning setting we are interested in. The neurons of the network follow the gradient of an energy function, such as the Hopfield energy (Cohen & Grossberg, 1983; Hopfield, 1984). Energy minima correspond to preferred states of the model. At prediction time, inputs are clamped, and the network relaxes to a fixed point, corresponding to a local minimum of the energy function. The prediction is then read out on the output neurons. This corresponds to the first phase of the algorithm. The goal of learning is that of minimizing the cost at the fixed point, called the objective.

Section 3 presents recurrent backpropagation (Almeida, 1987; Pineda, 1987), an algorithm that computes the gradient of the objective. In the second phase of recurrent backpropagation, an iterative procedure computes error derivatives.

In section 4, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective. In the second phase of equilibrium propagation, when the target values for output neurons are observed, the output neurons are nudged toward their targets, and the network starts a second relaxation phase toward a second but nearby fixed point that corresponds to slightly smaller prediction error. The gradient of the objective can be computed based on a contrastive Hebbian learning rule at the first fixed point and second fixed point.

Section 5 (in particular, theorem 3) constitutes the main contribution of our work. We establish a close connection between recurrent backpropagation and equilibrium propagation. We show that at every moment in the second phase of equilibrium propagation, the temporal derivatives of the neural activities code for (i.e., are equal to) the intermediate error derivatives that recurrent backpropagation computes iteratively. Our work shows that one does not require a special computational path for the computation of the error derivatives in the second phase; the same information is available in the temporal derivatives of the neural activities. Furthermore, we show that in equilibrium propagation, halting the second phase before convergence to the second fixed point is equivalent to truncated recurrent backpropagation.

2  Machine Learning Setting

We consider the supervised setting in which we want to predict a target y given an input x. The pair (x, y) is a data point. The model is a network specified by a state variable s and a parameter variable θ. The dynamics of the network are determined by two differentiable scalar functions, $E_\theta(x,s)$ and $C_\theta(y,s)$, which we call the energy function and the cost function, respectively. In most of the letter, to simplify the notations, we omit the dependence on x and y and simply write $E_\theta(s)$ and $C_\theta(s)$. Furthermore, we write $\frac{\partial E_\theta}{\partial \theta}(s)$ and $\frac{\partial E_\theta}{\partial s}(s)$ for the partial derivatives of $(\theta, s) \mapsto E_\theta(s)$ with respect to θ and s, respectively. Similarly, $\frac{\partial C_\theta}{\partial \theta}(s)$ and $\frac{\partial C_\theta}{\partial s}(s)$ denote the partial derivatives of $(\theta, s) \mapsto C_\theta(s)$.

The state variable s is assumed to move spontaneously toward low-energy configurations by following the gradient of the energy function:
$$\frac{ds}{dt} = -\frac{\partial E_\theta}{\partial s}(s).$$
(2.1)
The state s eventually settles to a minimum of the energy function, written $s_\theta^0$ and characterized by^1
$$\frac{\partial E_\theta}{\partial s}\left(s_\theta^0\right) = 0.$$
(2.2)
Since the dynamics in equation 2.1 depends only on the input x (through $E_\theta(x,s)$) and not on the target y, we call this relaxation phase the free phase, and the energy minimum $s_\theta^0$ is called the free fixed point.
The goal of learning is to find θ such that the cost at the fixed point, $C_\theta\left(s_\theta^0\right)$, is minimal.^2 We introduce the objective function (for a single data point (x, y)):
$$J(\theta) := C_\theta\left(s_\theta^0\right).$$
(2.3)
Note the distinction between the cost function and the objective function: the cost function $C_\theta(s)$ is defined for any state s, whereas the objective function $J(\theta)$ is the cost at the fixed point.
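
To make the setting concrete, here is a minimal numerical sketch (not taken from the letter) of the free phase and the objective: the free dynamics of equation 2.1 is discretized with explicit Euler steps on an assumed toy quadratic energy, and the objective J(θ) is read off as the cost at the resulting fixed point. The matrix A, the vectors θ and y, the step size, and the number of steps are illustrative assumptions.

```python
# Minimal sketch (not from the letter): Euler discretization of the free dynamics
# ds/dt = -dE/ds on a toy quadratic energy, followed by evaluation of the
# objective J(theta) = C(s_fixed) at the resulting fixed point.
import numpy as np

def relax_to_fixed_point(grad_E, s_init, step=0.05, n_steps=2000):
    """Follow ds/dt = -dE/ds with explicit Euler steps until (near) convergence."""
    s = s_init.copy()
    for _ in range(n_steps):
        s -= step * grad_E(s)
    return s

# Toy stand-in: E_theta(s) = 0.5 s^T A s - theta^T s (A symmetric positive definite),
# cost C(s) = 0.5 ||y - s||^2. These are illustrative, not the letter's Hopfield model.
A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -0.5])
y = np.array([0.3, 0.3])

grad_E = lambda s: A @ s - theta                 # dE/ds for the toy energy
s_fixed = relax_to_fixed_point(grad_E, np.zeros(2))
J = 0.5 * np.sum((y - s_fixed) ** 2)             # objective = cost at the fixed point
print(s_fixed, J)
```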

Several methods have been proposed to compute the gradient of J with respect to θ. Early work by Almeida (1987) and Pineda (1987) introduced the recurrent backpropagation algorithm, which we present in section 3. In Scellier and Bengio (2017) we proposed another algorithm—at first sight very different. We present it in section 4. In section 5 we show that there is actually a profound connection between these two algorithms.

2.1  Example: Hopfield Model

In this section we propose particular forms for the energy function Eθ(x,s) and the cost function Cθ(y,s) to ease understanding. Nevertheless, the theory presented in this letter is general and does not rely on the particular forms of the functions Eθ(x,s) and Cθ(y,s) chosen here.

Recall that we consider the supervised setting where we must predict a target y given an input x. To illustrate the idea, we consider the case where the neurons of the network are split into layers $s^0$, $s^1$, and $s^2$, as in Figure 1.^3 In this setting, the state variable s is the set of layers $s = \left(s^0, s^1, s^2\right)$. Each of the layers $s^0$, $s^1$, and $s^2$ is a vector whose coordinates are real numbers representing the membrane voltages of the neurons. The output layer $s^0$ corresponds to the layer where the prediction is read and has the same dimension as the target y. Furthermore, ρ is a deterministic function (a nonlinear activation) that maps a neuron's voltage onto its firing rate. We commit a small abuse of notation and denote by $\rho\left(s^i\right)$ the vector of firing rates of the neurons in layer $s^i$; here the function ρ is applied elementwise to the coordinates of the vector $s^i$. Therefore, the vector $\rho\left(s^i\right)$ has the same dimension as $s^i$. Finally, the parameter variable θ is the set of (bidirectional) weight matrices between the layers, $\theta = \left(W_{01}, W_{12}, W_{23}\right)$.

We consider the following modified Hopfield energy function:
$$E_\theta(x, s) = \frac{1}{2}\left(\left\|s^0\right\|^2 + \left\|s^1\right\|^2 + \left\|s^2\right\|^2\right) - \rho\left(s^0\right)^T \cdot W_{01} \cdot \rho\left(s^1\right) - \rho\left(s^1\right)^T \cdot W_{12} \cdot \rho\left(s^2\right) - \rho\left(s^2\right)^T \cdot W_{23} \cdot \rho(x).$$
(2.4)
With this choice of energy function, the dynamics (see equation 2.1) translate into a form of leaky integration neural dynamics with symmetric connections:^4
$$\frac{ds^0}{dt} = \rho'\left(s^0\right) \odot \left(W_{01} \cdot \rho\left(s^1\right)\right) - s^0,$$
(2.5)
$$\frac{ds^1}{dt} = \rho'\left(s^1\right) \odot \left(W_{12} \cdot \rho\left(s^2\right) + W_{01}^T \cdot \rho\left(s^0\right)\right) - s^1,$$
(2.6)
$$\frac{ds^2}{dt} = \rho'\left(s^2\right) \odot \left(W_{23} \cdot \rho(x) + W_{12}^T \cdot \rho\left(s^1\right)\right) - s^2.$$
(2.7)
Here again, the derivative of the function ρ (denoted ρ') is applied elementwise to the coordinates of the vectors $s^0$, $s^1$, and $s^2$, and the notation $\odot$ is used to mean elementwise multiplication.^5
Figure 1: Graph of the network. Input x is clamped. The state variable s includes the hidden layers $s^2$ and $s^1$ and the output layer $s^0$ (the layer where the prediction is read). The output layer $s^0$ has the same dimension as the target y.

Finally, we consider the quadratic cost function,^6
$$C_\theta(y, s) = \frac{1}{2}\left\|y - s^0\right\|^2,$$
(2.8)
which measures the discrepancy between the output layer $s^0$ and the target y.

Note that the results established in this letter hold for any energy function $E_\theta(s)$ and any cost function $C_\theta(s)$, and are not limited to the Hopfield energy and the quadratic cost (see equations 2.4 and 2.8).
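
As an illustration, the following sketch simulates the layered Hopfield model of section 2.1 by discretizing the dynamics of equations 2.5 to 2.7 with Euler steps. The layer sizes, the choice of a logistic activation ρ, the weight initialization, the step size, and the number of steps are illustrative assumptions, not prescriptions from the letter.

```python
# Sketch of the layered Hopfield model of section 2.1, with the dynamics of
# equations 2.5-2.7 discretized by Euler steps. Layer sizes, the logistic
# activation rho, the weight scale, and the step size are illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
rho = lambda v: 1.0 / (1.0 + np.exp(-v))         # firing rate (one possible choice of rho)
drho = lambda v: rho(v) * (1.0 - rho(v))         # its derivative, applied elementwise

dim_x, dim2, dim1, dim0 = 5, 4, 3, 2             # x -> s2 -> s1 -> s0 (output layer)
W01 = rng.normal(scale=0.1, size=(dim0, dim1))
W12 = rng.normal(scale=0.1, size=(dim1, dim2))
W23 = rng.normal(scale=0.1, size=(dim2, dim_x))

def energy(s0, s1, s2, x):
    """Modified Hopfield energy of equation 2.4."""
    squares = 0.5 * (s0 @ s0 + s1 @ s1 + s2 @ s2)
    return squares - rho(s0) @ W01 @ rho(s1) - rho(s1) @ W12 @ rho(s2) - rho(s2) @ W23 @ rho(x)

def free_step(s0, s1, s2, x, dt=0.1):
    """One Euler step of the free dynamics, equations 2.5-2.7 (beta = 0)."""
    ds0 = drho(s0) * (W01 @ rho(s1)) - s0
    ds1 = drho(s1) * (W12 @ rho(s2) + W01.T @ rho(s0)) - s1
    ds2 = drho(s2) * (W23 @ rho(x) + W12.T @ rho(s1)) - s2
    return s0 + dt * ds0, s1 + dt * ds1, s2 + dt * ds2

x = rng.random(dim_x)                            # clamped input
s0, s1, s2 = np.zeros(dim0), np.zeros(dim1), np.zeros(dim2)
for _ in range(500):                             # free phase: relax toward the free fixed point
    s0, s1, s2 = free_step(s0, s1, s2, x)
print(energy(s0, s1, s2, x), s0)                 # s0 carries the prediction
```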

3  Recurrent Backpropagation

In this section, we present recurrent backpropagation, an algorithm introduced by Almeida (1987) and Pineda (1987) that computes the gradient of J (see equation 2.3). The original algorithm was described in the discrete-time setting and for a general state-to-state dynamics. Here we present it in the continuous-time setting in the particular case of a gradient dynamics (see equation 2.1). A direct derivation based on the adjoint method can also be found in LeCun, Touresky, Hinton, and Sejnowski (1988).

3.1  Projected Cost Function

Let $S_\theta^0(s, t)$ denote the state of the network at time $t \geq 0$ when it starts from an initial state s at time $t = 0$ and follows the free dynamics (see equation 2.1). In the theory of dynamical systems, $S_\theta^0(s, t)$ is called the flow. We introduce the projected cost function:
$$L_\theta(s, t) := C_\theta\left(S_\theta^0(s, t)\right).$$
(3.1)
This is the cost of the state projected a duration t in the future, when the network starts from s and follows the free dynamics. For fixed s, the process $\left(L_\theta(s, t)\right)_{t \geq 0}$ represents the successive cost values taken by the state of the network, along the free dynamics when it starts from the initial state s. Notable cases include these:
  • For $t = 0$, the projected cost is simply the cost of the current state: $L_\theta(s, 0) = C_\theta(s)$.

  • As $t \to \infty$, the projected cost converges to the objective: $L_\theta(s, t) \to J(\theta)$.

The second property comes from the fact that the dynamics converges to the fixed point, $S_\theta^0(s, t) \to s_\theta^0$ as $t \to \infty$. Under mild regularity conditions on $E_\theta(s)$ and $C_\theta(s)$, the gradient of the projected cost function converges to the gradient of the objective function in the limit of infinite duration,
$$\frac{\partial L_\theta}{\partial \theta}(s, t) \to \frac{\partial J}{\partial \theta}(\theta),$$
(3.2)
as $t \to \infty$. Therefore, if we can compute $\frac{\partial L_\theta}{\partial \theta}(s, t)$ for a particular value of s and for any $t \geq 0$, we can obtain the desired gradient $\frac{\partial J}{\partial \theta}(\theta)$ by letting $t \to \infty$. We will show next that this is what recurrent backpropagation does in the case where the initial state s is the fixed point $s_\theta^0$.
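
The following sketch (again on an assumed toy quadratic energy, not the letter's model) makes the projected cost function concrete: $L_\theta(s, t)$ is obtained by simulating the free dynamics for a duration t starting from s and evaluating the cost at the resulting state, so that $L_\theta(s, 0) = C_\theta(s)$ and $L_\theta(s, t)$ approaches $J(\theta)$ for large t.

```python
# Sketch (toy quadratic energy, as above) of the projected cost function of eq. 3.1:
# run the free dynamics for a duration t from the state s, then evaluate the cost.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])           # toy energy E(s) = 0.5 s^T A s - theta^T s
theta = np.array([1.0, -0.5])
y = np.array([0.3, 0.3])                         # target, cost C(s) = 0.5 ||y - s||^2

def projected_cost(s, t, dt=1e-3):
    """L(s, t): cost of the state reached after following the free dynamics for time t."""
    s = s.copy()
    for _ in range(int(round(t / dt))):
        s -= dt * (A @ s - theta)                # free dynamics ds/dt = -dE/ds
    return 0.5 * np.sum((y - s) ** 2)

s_init = np.array([1.0, 1.0])
print(projected_cost(s_init, 0.0))               # equals C(s_init)
print(projected_cost(s_init, 20.0))              # approaches J(theta), the cost at the fixed point
```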

3.2  Process of Error Derivatives

We introduce the process of error derivatives $\left(\overline{S}_t, \overline{\Theta}_t\right)_{t \geq 0}$, defined as
$$\overline{S}_t := \frac{\partial L_\theta}{\partial s}\left(s_\theta^0, t\right), \quad t \geq 0,$$
(3.3)
$$\overline{\Theta}_t := \frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, t\right), \quad t \geq 0.$$
(3.4)
The process $\overline{S}_t$ takes values in the state space (the space of the state variable s), and the process $\overline{\Theta}_t$ takes values in the parameter space (the space of the parameter variable θ).^7 The recurrent backpropagation algorithm computes $\overline{S}_t$ and $\overline{\Theta}_t$ iteratively for increasing values of t.
Theorem 1
(recurrent backpropagation). The process of error derivatives $\left(\overline{S}_t, \overline{\Theta}_t\right)$ satisfies
$$\overline{S}_0 = \frac{\partial C_\theta}{\partial s}\left(s_\theta^0\right),$$
(3.5)
$$\overline{\Theta}_0 = \frac{\partial C_\theta}{\partial \theta}\left(s_\theta^0\right),$$
(3.6)
$$\frac{d}{dt}\overline{S}_t = -\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right) \cdot \overline{S}_t,$$
(3.7)
$$\frac{d}{dt}\overline{\Theta}_t = -\frac{\partial^2 E_\theta}{\partial \theta \, \partial s}\left(s_\theta^0\right) \cdot \overline{S}_t.$$
(3.8)
Theorem 1, proved in appendix A, offers us a two-phase method to compute the gradient $\frac{\partial J}{\partial \theta}(\theta)$. In the first phase (or free phase), the state variable s follows the free dynamics (see equation 2.1) and relaxes to the fixed point $s_\theta^0$. Reaching this fixed point is necessary for evaluating the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$, which is required in the second phase. In the second phase, one computes $\overline{S}_t$ and $\overline{\Theta}_t$ iteratively for increasing values of t using equations 3.5 to 3.8. We obtain the desired gradient in the limit $t \to \infty$, as a consequence of equation 3.2:
$$\overline{\Theta}_t \to \frac{\partial J}{\partial \theta}(\theta).$$
(3.9)

Note that the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$ is positive definite since $s_\theta^0$ is an energy minimum. Therefore, equation 3.7 guarantees that $\frac{\partial L_\theta}{\partial s}\left(s_\theta^0, t\right) \to 0$ as $t \to \infty$, in agreement with the fact that $J(\theta)$ is (locally) insensitive to the initial state ($s = s_\theta^0$ in our case).
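
The following sketch illustrates the two phases of recurrent backpropagation (theorem 1) on an assumed toy quadratic energy, for which the free fixed point, the Hessian $\frac{\partial^2 E}{\partial s^2}$, and the cross derivative $\frac{\partial^2 E}{\partial \theta \, \partial s}$ are available in closed form; the integration scheme and constants are illustrative choices, not the letter's.

```python
# Sketch (toy quadratic energy) of recurrent backpropagation, theorem 1:
# E_theta(s) = 0.5 s^T A s - theta^T s, C(s) = 0.5 ||y - s||^2, so the Hessian
# d2E/ds2 = A and the cross derivative d2E/(dtheta ds) = -I are known in closed form.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -0.5])
y = np.array([0.3, 0.3])

s_fixed = np.linalg.solve(A, theta)              # first phase result: A s - theta = 0

S_bar = s_fixed - y                              # eq. 3.5: dC/ds at the fixed point
Theta_bar = np.zeros_like(theta)                 # eq. 3.6: this C does not depend on theta

dt = 1e-3
for _ in range(20000):                           # second phase: integrate eqs. 3.7 and 3.8
    Theta_bar += dt * S_bar                      # -d2E/(dtheta ds) . S_bar = S_bar here
    S_bar += dt * (-A @ S_bar)

print(Theta_bar)                                 # approximates dJ/dtheta (eq. 3.9)
print(np.linalg.solve(A, s_fixed - y))           # closed-form dJ/dtheta for this toy problem
```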

From the point of view of biological plausibility, the requirement to run the dynamics of $\overline{S}_t$ and $\overline{\Theta}_t$ to compute the gradient $\frac{\partial J}{\partial \theta}(\theta)$ is not satisfying. It is not clear what the quantities $\overline{S}_t$ and $\overline{\Theta}_t$ would represent in a biological network. We address this issue in sections 4 and 5.

4  Equilibrium Propagation

In this section, we present equilibrium propagation (Scellier & Bengio, 2017), another algorithm that computes the gradient of the objective function J, equation 2.3. At first sight, equilibrium propagation and recurrent backpropagation have little in common. However, in section 5, we will show a profound connection between these algorithms.

4.1  Augmented Energy Function

The central idea of equilibrium propagation is to introduce the augmented energy function,
$$E_\theta^\beta(s) := E_\theta(s) + \beta C_\theta(s),$$
(4.1)
where $\beta \geq 0$ is a scalar that we call the influence parameter. The free dynamics (see equation 2.1) is then replaced by the augmented dynamics:
$$\frac{ds}{dt} = -\frac{\partial E_\theta^\beta}{\partial s}(s).$$
(4.2)
The state variable now follows the dynamics $\frac{ds}{dt} = -\frac{\partial E_\theta}{\partial s}(s) - \beta \frac{\partial C_\theta}{\partial s}(s)$. When $\beta > 0$, in addition to the usual term $-\frac{\partial E_\theta}{\partial s}(s)$, a term $-\beta \frac{\partial C_\theta}{\partial s}(s)$ nudges s toward configurations that have lower cost values. In the case of the model described in section 2.1 with the quadratic cost function (see equation 2.8), the new term $-\beta \frac{\partial C_\theta}{\partial s}(s)$ is the vector of dimension dim(s) whose component on $s^0$ is $\beta\left(y - s^0\right)$ and whose components on $s^1$ and $s^2$ are zero. Thus, the new term takes the form of a force that nudges the output layer $s^0$ toward the target y. Unlike the free dynamics, which depends only on x (through $E_\theta(x, s)$) and not on y, the augmented dynamics also depends on y (through $C_\theta(y, s)$).
Note that the free dynamics corresponds to the value $\beta = 0$. We then generalize the notion of fixed point for any value of β. The augmented dynamics converges to the fixed point $s_\theta^\beta$, an energy minimum of $E_\theta^\beta$, characterized by
$$\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right) = 0.$$
(4.3)

Theorem 2 shows that the gradient $\frac{\partial J}{\partial \theta}(\theta)$ can be estimated based on measurements at the fixed points $s_\theta^0$ and $s_\theta^\beta$.

Theorem 2
(equilibrium propagation). The gradient of the objective function with respect to θ can be estimated using the formula
$$\frac{\partial J}{\partial \theta}(\theta) = \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right) - \frac{\partial E_\theta^0}{\partial \theta}\left(s_\theta^0\right)\right).$$
(4.4)

A proof of theorem 2 is given in appendix B. Note that theorem 2 is also a consequence of the more general formula of equation 5.5 (theorem 3), established in the next section.

Theorem 2 offers another way to estimate the gradient of $J(\theta)$. As in recurrent backpropagation, in the first phase (or free phase), the network follows the free dynamics (see equation 2.1). This is equivalent to saying that the network follows the augmented dynamics (see equation 4.2) with the value of β set to 0. The network relaxes to the free fixed point $s_\theta^0$, where $\frac{\partial E_\theta}{\partial \theta}\left(s_\theta^0\right)$ is measured. In the second phase, which we call the nudged phase, the influence parameter takes on a small positive value $\beta \gtrsim 0$, and the network relaxes to a new but nearby fixed point $s_\theta^\beta$, where $\frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right)$ is measured. The gradient of the objective function is estimated using the formula in equation 4.4.
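
The following sketch illustrates the two phases of equilibrium propagation (theorem 2) on the same kind of assumed toy quadratic energy: a free relaxation with β = 0, then a nudged relaxation with a small β > 0 started from the free fixed point, and the finite-β estimator of equation 4.4. For this toy energy, $\frac{\partial E_\theta}{\partial \theta}(s) = -s$, so the estimator reduces to $\left(s_\theta^0 - s_\theta^\beta\right)/\beta$; the constants are illustrative assumptions.

```python
# Sketch (toy quadratic energy) of equilibrium propagation, theorem 2:
# free phase with beta = 0, then nudged phase with a small beta > 0 started
# from the free fixed point; dE/dtheta(s) = -s here, so the estimator of
# eq. 4.4 reduces to (s_free - s_nudged) / beta.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -0.5])
y = np.array([0.3, 0.3])
beta, dt = 1e-3, 0.05

def relax(s, grad, n_steps=5000):
    for _ in range(n_steps):
        s = s - dt * grad(s)
    return s

s_free = relax(np.zeros(2), lambda s: A @ s - theta)                # first (free) phase
s_nudged = relax(s_free, lambda s: A @ s - theta + beta * (s - y))  # second (nudged) phase

grad_estimate = (s_free - s_nudged) / beta       # finite-beta version of eq. 4.4
print(grad_estimate)
print(np.linalg.solve(A, s_free - y))            # closed-form dJ/dtheta, for comparison
```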

In the case of the modified Hopfield energy (see equation 2.4), the components of $\frac{\partial E_\theta}{\partial \theta}(s)$ are $\frac{\partial E_\theta}{\partial W_{01}}(s)$, $\frac{\partial E_\theta}{\partial W_{12}}(s)$, and $\frac{\partial E_\theta}{\partial W_{23}}(s)$. For instance, $\frac{\partial E_\theta}{\partial W_{01}}(s) = -\rho\left(s^0\right) \cdot \rho\left(s^1\right)^T$ is a matrix of size $\dim\left(s^0\right) \times \dim\left(s^1\right)$ whose entries can be measured locally at each synapse based on the presynaptic and postsynaptic activities. Thus, the learning rule of equation 4.4 is a kind of contrastive Hebbian learning rule applied at the free and nudged fixed points.

At the beginning of the second phase, the network is initially at the free fixed point $s_\theta^0$. When the influence parameter takes on a small positive value $\beta \gtrsim 0$, the novel term $-\beta \frac{\partial C_\theta}{\partial s}(s)$ in the dynamics of the state variable perturbs the system. This perturbation propagates into the layers of the network until convergence to the new fixed point $s_\theta^\beta$.

In the next section, we go beyond the analysis of fixed points and show that at every moment t in the nudged phase, the temporal derivative $\frac{ds}{dt}$ encodes the error derivative of equation 3.3.

5  Temporal Derivatives Code for Error Derivatives

Theorem 2 shows that the gradient of J can be estimated based on the free and nudged fixed points only. In this section, we study the dynamics of the network in the second phase, from the free fixed point to the nudged fixed point. Recall that $S_\theta^0(s, t)$ is the flow of the dynamical system (see equation 2.1), that is, the state of the network at time $t \geq 0$ when it starts from an initial state s at time $t = 0$ and follows the free dynamics. Similarly, we define $S_\theta^\beta(s, t)$ for any value of β when the network follows the augmented dynamics (see equation 4.2).

In equilibrium propagation, the state of the network at the beginning of the nudged phase is the free fixed point $s_\theta^0$. We choose as the origin of time $t = 0$ the moment when the second phase starts: the network is in the state $s_\theta^0$, and the influence parameter takes on a small positive value $\beta \gtrsim 0$. With our notations, the state of the network after a duration t in the nudged phase is $S_\theta^\beta\left(s_\theta^0, t\right)$. As $t \to \infty$, the network's state converges to the nudged fixed point: $S_\theta^\beta\left(s_\theta^0, t\right) \to s_\theta^\beta$.

5.1  Process of Temporal Derivatives

Now we are ready to introduce the process of temporal derivatives $\left(\widetilde{S}_t, \widetilde{\Theta}_t\right)_{t \geq 0}$, defined by
$$\widetilde{S}_t := -\lim_{\beta \to 0} \frac{1}{\beta} \frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0, t\right),$$
(5.1)
$$\widetilde{\Theta}_t := \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) - \frac{\partial E_\theta^0}{\partial \theta}\left(s_\theta^0\right)\right).$$
(5.2)
Like $\overline{S}_t$ and $\overline{\Theta}_t$, the processes $\widetilde{S}_t$ and $\widetilde{\Theta}_t$ take values in the state space and the parameter space, respectively.

The process $\widetilde{S}_t$ is simply the temporal derivative $\frac{ds}{dt}$ in the second phase, rescaled by $\frac{1}{\beta}$ (so that its value does not depend on the particular choice of $\beta \gtrsim 0$).

Theorem 3
(temporal derivatives as error derivatives). The process of error derivatives $\left(\overline{S}_t, \overline{\Theta}_t\right)$ and the process of temporal derivatives $\left(\widetilde{S}_t, \widetilde{\Theta}_t\right)$ are equal:
$$\forall t \geq 0, \quad \overline{S}_t = \widetilde{S}_t, \quad \overline{\Theta}_t = \widetilde{\Theta}_t,$$
(5.3)
or, using explicit forms,
$$\frac{\partial L_\theta}{\partial s}\left(s_\theta^0, t\right) = -\lim_{\beta \to 0} \frac{1}{\beta} \frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0, t\right),$$
(5.4)
$$\frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, t\right) = \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) - \frac{\partial E_\theta^0}{\partial \theta}\left(s_\theta^0\right)\right).$$
(5.5)

Theorem 3 is proved in appendix C. In essence, equation 5.4 says that in the second phase of equilibrium propagation, the temporal derivative $\frac{ds}{dt}$ (rescaled by $\frac{1}{\beta}$) encodes the error derivative of equation 3.3.

Here is an interpretation of equation 5.4. Suppose that the network is initially at the fixed point $s = s_\theta^0$. Consider the cost $L_\theta\left(s_\theta^0 + \Delta s, t\right)$ a duration t in the future if one moved the initial state $s = s_\theta^0$ by a small step Δs. The goal is to find the direction Δs that minimizes $L_\theta\left(s_\theta^0 + \Delta s, t\right)$. The naive approach by trial and error is neither biologically plausible nor efficient. Equation 5.4 tells us that there is a physically realistic way of finding such a direction Δs in one attempt. This direction is encoded in the temporal derivative $\frac{ds}{dt}$ at time t after the start of the nudged phase.

Note that as $t \to \infty$, both sides of equation 5.4 converge to 0. This is a consequence of equation 3.7 and the fact that the Hessian $\frac{\partial^2 E_\theta}{\partial s^2}\left(s_\theta^0\right)$ is positive definite (since $s_\theta^0$ is an energy minimum), as already mentioned in section 3. Intuitively, the right-hand side converges to 0 because $S_\theta^\beta\left(s_\theta^0, t\right)$ converges smoothly to the nudged fixed point $s_\theta^\beta$. As for the left-hand side, when t is large, $L_\theta(s, t)$ is close to the cost of the energy minimum and thus has little sensitivity to the initial state s.

Finally, letting $t \to \infty$ in equation 5.5, one recovers the gradient formula of equilibrium propagation (theorem 2). Interestingly, equation 5.5 shows that in equilibrium propagation, halting the second phase before convergence to the nudged fixed point corresponds to truncated recurrent backpropagation.
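
The following sketch checks theorem 3 numerically on the same assumed toy quadratic energy: along the nudged phase, the rescaled temporal derivative $-\frac{1}{\beta}\frac{ds}{dt}$ is compared at several instants with the error derivative $\overline{S}_t$ integrated as in equation 3.7; with a small β, the two trajectories (approximately) coincide. All constants are illustrative choices.

```python
# Sketch (toy quadratic energy) checking theorem 3 numerically: along the nudged
# phase, the rescaled temporal derivative -(1/beta) ds/dt is compared with the
# error derivative S_bar_t of recurrent backpropagation at the same instants.
import numpy as np

A = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([1.0, -0.5])
y = np.array([0.3, 0.3])
beta, dt = 1e-4, 1e-3

s = np.linalg.solve(A, theta)                    # start the nudged phase at the free fixed point
S_bar = s - y                                    # recurrent backprop initialization, eq. 3.5

for k in range(3001):
    ds = -(A @ s - theta + beta * (s - y))       # augmented dynamics, eq. 4.2
    if k % 1000 == 0:
        print(-ds / beta, S_bar)                 # theorem 3: the two should (approximately) agree
    s = s + dt * ds
    S_bar = S_bar + dt * (-A @ S_bar)            # recurrent backprop, eq. 3.7
```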

6  Conclusion

Our work establishes a close connection between two algorithms for fixed-point recurrent networks: recurrent backpropagation and equilibrium propagation. The temporal derivatives of the neural activities in the second phase of equilibrium propagation are equal to the error derivatives that recurrent backpropagation computes iteratively. Moreover, we have shown that halting the second phase before convergence in equilibrium propagation is equivalent to truncated recurrent backpropagation. Our work supports the hypothesis that in biological networks, temporal changes in neural activities may represent error signals for supervised learning, in the machine learning sense.

One important drawback of the theory presented here is that it assumes the existence of an energy function. In the case of the Hopfield energy, this implies symmetric connections between neurons. However, the analysis presented here can be generalized to dynamics that do not involve energy functions. This is the subject of Scellier et al. (2018). Another concern is the fact that our algorithm is rate based, whereas biological neurons emit spikes. Ideally, we would like a theory applicable to spiking networks. Finally, the assumption of the existence of specialized output neurons ($s^0$ here) would need to be relaxed too.

From a practical point of view, another issue is that the time needed to converge to the first fixed point was experimentally found to grow exponentially with the number of layers in Scellier and Bengio (2017). Although equation 5.5 provides a new justification for saving time by stopping the second phase early, our algorithm (as well as recurrent backpropagation) still requires convergence to the free fixed point in the first phase.

Appendix A:  Recurrent Backpropagation: Proof

Proof of Theorem 1.
First, by definition of L (see equation 3.1), we have $L_\theta(s, 0) = C_\theta(s)$. Therefore, the initial conditions, equations 3.5 and 3.6, are satisfied:
$$\overline{S}_0 = \frac{\partial L_\theta}{\partial s}\left(s_\theta^0, 0\right) = \frac{\partial C_\theta}{\partial s}\left(s_\theta^0\right)$$
(A.1)
and
$$\overline{\Theta}_0 = \frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, 0\right) = \frac{\partial C_\theta}{\partial \theta}\left(s_\theta^0\right).$$
(A.2)
It remains to show equations 3.7 and 3.8. Temporarily, we omit writing the dependence on θ to keep the notations simple. As a preliminary result, we show that for any initial state s and time t, we have^8
$$\frac{\partial L}{\partial t}(s, t) + \frac{\partial L}{\partial s}(s, t) \cdot \frac{\partial E}{\partial s}(s) = 0.$$
(A.3)
To this end, note that (by definition of L and $S^0$) we have, for any t and u,
$$L\left(S^0(s, u), t\right) = L(s, t + u).$$
(A.4)
The derivatives of the right-hand side of equation A.4 with respect to t and u are clearly equal:
$$\frac{d}{dt} L(s, t + u) = \frac{d}{du} L(s, t + u).$$
(A.5)
Therefore the derivatives of the left-hand side of equation A.4 are equal too:
$$\frac{\partial L}{\partial t}\left(S^0(s, u), t\right) = \frac{d}{du} L\left(S^0(s, u), t\right)$$
(A.6)
$$= -\frac{\partial L}{\partial s}\left(S^0(s, u), t\right) \cdot \frac{\partial E}{\partial s}\left(S^0(s, u)\right).$$
(A.7)
Here we have used the differential equation of motion, equation 2.1. Evaluating this expression for u=0, we get equation A.3.
Now we are ready to show that $\overline{S}_t = \frac{\partial L}{\partial s}\left(s^0, t\right)$ satisfies the differential equation 3.7. Differentiating equation A.3 with respect to s, we get
$$\frac{\partial^2 L}{\partial t \, \partial s}(s, t) + \frac{\partial^2 L}{\partial s^2}(s, t) \cdot \frac{\partial E}{\partial s}(s) + \frac{\partial L}{\partial s}(s, t) \cdot \frac{\partial^2 E}{\partial s^2}(s) = 0.$$
(A.8)
Evaluating this expression at the fixed point $s = s^0$ and using the fixed-point condition $\frac{\partial E}{\partial s}\left(s^0\right) = 0$, we get
$$\frac{d}{dt} \frac{\partial L}{\partial s}\left(s^0, t\right) = -\frac{\partial^2 E}{\partial s^2}\left(s^0\right) \cdot \frac{\partial L}{\partial s}\left(s^0, t\right).$$
(A.9)
Therefore $\overline{S}_t = \frac{\partial L}{\partial s}\left(s^0, t\right)$ satisfies equation 3.7.
We prove equation 3.8 similarly. Differentiating equation A.3 with respect to θ, we get
$$\frac{\partial^2 L_\theta}{\partial \theta \, \partial t}(s, t) + \frac{\partial^2 L_\theta}{\partial \theta \, \partial s}(s, t) \cdot \frac{\partial E_\theta}{\partial s}(s) + \frac{\partial L_\theta}{\partial s}(s, t) \cdot \frac{\partial^2 E_\theta}{\partial \theta \, \partial s}(s) = 0.$$
(A.10)
Evaluating this expression at the fixed point $s = s_\theta^0$, we get
$$\frac{d}{dt} \frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, t\right) = -\frac{\partial^2 E_\theta}{\partial \theta \, \partial s}\left(s_\theta^0\right) \cdot \frac{\partial L_\theta}{\partial s}\left(s_\theta^0, t\right).$$
(A.11)
Hence the result.

Appendix B:  Equilibrium Propagation: Proof

In this appendix, we prove theorem 2. The same proof is provided in Scellier and Bengio (2017).

Since the data point (x, y) does not play any role, its dependence is omitted in the notations. We assume that the energy function $E_\theta(s)$ and the cost function $C_\theta(s)$ (and thus the augmented energy function $E_\theta^\beta(s)$) are twice differentiable and that the conditions of the implicit function theorem are satisfied, so that the fixed point $s_\theta^\beta$ is a continuously differentiable function of (θ, β).

Proof of Theorem 2.
Recall that we want to show the gradient formula:
$$\frac{\partial J}{\partial \theta}(\theta) = \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right) - \frac{\partial E_\theta^0}{\partial \theta}\left(s_\theta^0\right)\right).$$
(B.1)
The gradient formula, equation B.1, is a particular case of the following formula,^9 when evaluated at the point $\beta = 0$:
$$\frac{d}{d\theta} \frac{\partial E_\theta^\beta}{\partial \beta}\left(s_\theta^\beta\right) = \frac{d}{d\beta} \frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right).$$
(B.2)
Therefore, in order to prove equation B.1, it is sufficient to prove equation B.2.
First, the cross-derivatives of $(\theta, \beta) \mapsto E_\theta^\beta\left(s_\theta^\beta\right)$ are equal:
$$\frac{d}{d\theta} \frac{d}{d\beta} E_\theta^\beta\left(s_\theta^\beta\right) = \frac{d}{d\beta} \frac{d}{d\theta} E_\theta^\beta\left(s_\theta^\beta\right).$$
(B.3)
Second, by the chain rule of differentiation, we have
$$\frac{d}{d\beta} E_\theta^\beta\left(s_\theta^\beta\right) = \frac{\partial E_\theta^\beta}{\partial \beta}\left(s_\theta^\beta\right) + \frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right) \cdot \frac{\partial s_\theta^\beta}{\partial \beta}$$
(B.4)
$$= \frac{\partial E_\theta^\beta}{\partial \beta}\left(s_\theta^\beta\right).$$
(B.5)
Here we have used the fixed-point condition,
$$\frac{\partial E_\theta^\beta}{\partial s}\left(s_\theta^\beta\right) = 0.$$
(B.6)
Similarly, we have
$$\frac{d}{d\theta} E_\theta^\beta\left(s_\theta^\beta\right) = \frac{\partial E_\theta^\beta}{\partial \theta}\left(s_\theta^\beta\right).$$
(B.7)
Plugging equations B.5 and B.7 in B.3, we get equation B.2. Hence the result.

Appendix C:  Temporal Derivatives Code for Error Derivatives: Proof

Proof of Theorem 3.

In order to prove theorem 3, we have to show that the process (S˜t,Θ˜t) satisfies the same differential equations as (S¯t,Θ¯t), namely, equations 3.5 to 3.8 (see theorem 1). We conclude by using the uniqueness of the solution to the differential equation with initial condition.

First, note that
$$\frac{\partial^2 S_\theta^\beta}{\partial \beta \, \partial t}\bigg|_{\beta=0}\left(s_\theta^0, t\right) = \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0, t\right) - \frac{\partial S_\theta^0}{\partial t}\left(s_\theta^0, t\right)\right)$$
(C.1)
$$= \lim_{\beta \to 0} \frac{1}{\beta} \frac{\partial S_\theta^\beta}{\partial t}\left(s_\theta^0, t\right).$$
(C.2)
The latter equality comes from the fact that $S_\theta^0\left(s_\theta^0, t\right) = s_\theta^0$ for every $t \geq 0$, implying that $\frac{\partial S_\theta^0}{\partial t}\left(s_\theta^0, t\right) = 0$ at every moment $t \geq 0$. Furthermore,
$$\frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) = \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) - \frac{\partial E_\theta^0}{\partial \theta}\left(S_\theta^0\left(s_\theta^0, t\right)\right)\right)$$
(C.3)
$$= \lim_{\beta \to 0} \frac{1}{\beta}\left(\frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) - \frac{\partial E_\theta}{\partial \theta}\left(s_\theta^0\right)\right).$$
(C.4)
Again, the latter equality comes from the fact that $S_\theta^0\left(s_\theta^0, t\right) = s_\theta^0$ for every $t \geq 0$. Therefore,
$$\widetilde{S}_t = -\frac{\partial^2 S_\theta^\beta}{\partial \beta \, \partial t}\bigg|_{\beta=0}\left(s_\theta^0, t\right), \quad t \geq 0,$$
(C.5)
$$\widetilde{\Theta}_t = \frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right), \quad t \geq 0.$$
(C.6)
Now we prove that $\widetilde{S}_t$ is the solution of equations 3.5 and 3.7. We omit writing the dependence on θ to keep the notations simple. The process $\left(S^\beta\left(s^0, t\right)\right)_{t \geq 0}$ is the solution of the differential equation,
$$\frac{\partial S^\beta}{\partial t}\left(s^0, t\right) = -\frac{\partial E^\beta}{\partial s}\left(S^\beta\left(s^0, t\right)\right),$$
(C.7)
with initial condition $S^\beta\left(s^0, 0\right) = s^0$. Differentiating equation C.7 with respect to β, we get
$$\frac{d}{dt} \frac{\partial S^\beta}{\partial \beta}\left(s^0, t\right) = -\frac{\partial^2 E^\beta}{\partial s \, \partial \beta}\left(S^\beta\left(s^0, t\right)\right) - \frac{\partial^2 E^\beta}{\partial s^2}\left(S^\beta\left(s^0, t\right)\right) \cdot \frac{\partial S^\beta}{\partial \beta}\left(s^0, t\right).$$
(C.8)
Evaluating at $\beta = 0$ and using the fact that $S^0\left(s^0, t\right) = s^0$, we get
$$\frac{d}{dt} \frac{\partial S^\beta}{\partial \beta}\bigg|_{\beta=0}\left(s^0, t\right) = -\frac{\partial C}{\partial s}\left(s^0\right) - \frac{\partial^2 E}{\partial s^2}\left(s^0\right) \cdot \frac{\partial S^\beta}{\partial \beta}\bigg|_{\beta=0}\left(s^0, t\right).$$
(C.9)
Since at time $t = 0$ the initial state of the network $S^\beta\left(s^0, 0\right) = s^0$ is independent of β, we have
$$\frac{\partial S^\beta}{\partial \beta}\left(s^0, 0\right) = 0.$$
(C.10)
Therefore, evaluating equation C.9 at $t = 0$, we get the initial condition, equation 3.5:
$$\widetilde{S}_0 = -\frac{\partial^2 S^\beta}{\partial \beta \, \partial t}\bigg|_{\beta=0}\left(s^0, 0\right) = \frac{\partial C}{\partial s}\left(s^0\right).$$
(C.11)
Moreover, differentiating equation C.9 with respect to time, we get
$$\frac{d}{dt} \frac{\partial^2 S^\beta}{\partial \beta \, \partial t}\bigg|_{\beta=0}\left(s^0, t\right) = -\frac{\partial^2 E}{\partial s^2}\left(s^0\right) \cdot \frac{\partial^2 S^\beta}{\partial \beta \, \partial t}\bigg|_{\beta=0}\left(s^0, t\right).$$
(C.12)
Hence, equation 3.7:
$$\frac{d}{dt}\widetilde{S}_t = -\frac{\partial^2 E}{\partial s^2}\left(s^0\right) \cdot \widetilde{S}_t.$$
(C.13)
Now we prove the result for $\widetilde{\Theta}_t$ (see equations 3.6 and 3.8). First, we differentiate $\frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right)$ with respect to β:
$$\frac{d}{d\beta} \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) = \frac{\partial^2 E_\theta^\beta}{\partial \theta \, \partial \beta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) + \frac{\partial^2 E_\theta^\beta}{\partial \theta \, \partial s}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) \cdot \frac{\partial S_\theta^\beta}{\partial \beta}\left(s_\theta^0, t\right).$$
(C.14)
Again we evaluate at $\beta = 0$ and use the fact that $S_\theta^0\left(s_\theta^0, t\right) = s_\theta^0$. We get
$$\frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) = \frac{\partial C_\theta}{\partial \theta}\left(s_\theta^0\right) + \frac{\partial^2 E_\theta}{\partial \theta \, \partial s}\left(s_\theta^0\right) \cdot \frac{\partial S_\theta^\beta}{\partial \beta}\bigg|_{\beta=0}\left(s_\theta^0, t\right).$$
(C.15)
Evaluating equation C.15 at time $t = 0$ and using equation C.10, we get the initial condition, equation 3.6:
$$\widetilde{\Theta}_0 = \frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, 0\right)\right) = \frac{\partial C_\theta}{\partial \theta}\left(s_\theta^0\right).$$
(C.16)
Moreover, differentiating equation C.15 with respect to time, we get
$$\frac{d}{dt} \frac{d}{d\beta}\bigg|_{\beta=0} \frac{\partial E_\theta^\beta}{\partial \theta}\left(S_\theta^\beta\left(s_\theta^0, t\right)\right) = \frac{\partial^2 E_\theta}{\partial \theta \, \partial s}\left(s_\theta^0\right) \cdot \frac{\partial^2 S_\theta^\beta}{\partial \beta \, \partial t}\bigg|_{\beta=0}\left(s_\theta^0, t\right).$$
(C.17)
Hence, equation 3.8:
$$\frac{d}{dt}\widetilde{\Theta}_t = -\frac{\partial^2 E_\theta}{\partial \theta \, \partial s}\left(s_\theta^0\right) \cdot \widetilde{S}_t.$$
(C.18)
This completes the proof.

Notes

1

In general, the fixed point defined by equation 2.2 is not unique unless further assumptions are made on $E_\theta(s)$ (e.g., convexity). The fixed point depends on the initial state of the dynamics (see equation 2.1), and so does the objective function of equation 2.3. However, for ease of presentation, we avoid delving into these mathematical details here.

2

In this expression, both the cost function $C_\theta(s)$ and the fixed point $s_\theta^0$ depend on θ. $C_\theta(s)$ directly depends on θ, whereas $s_\theta^0$ indirectly depends on θ through $E_\theta(s)$ (see equation 2.2).

3

We choose to number the layers in increasing order from output to input, in the sense of propagation of error signals (see section 4).

4

The case without the constraint of symmetric connections is studied in Scellier, Goyal, Binas, Mesnard, and Bengio (2018).

5

Given two vectors $a = \left(a_1, \ldots, a_n\right)$ and $b = \left(b_1, \ldots, b_n\right)$, their product element by element is $a \odot b = \left(a_1 b_1, \ldots, a_n b_n\right)$.

6

In this specific example, the cost function $C_\theta(y, s)$ does not depend on θ, $s^1$, or $s^2$.

7

The quantity $\overline{\Theta}_t = \frac{\partial L_\theta}{\partial \theta}\left(s_\theta^0, t\right)$ represents the partial derivative of $L_\theta(s, t)$ with respect to θ, evaluated at the fixed point $s = s_\theta^0$. This does not include the differentiation path through the fixed point $s_\theta^0$.

8

Equation A.3 is the Kolmogorov backward equation for deterministic processes.

9

The notations $\frac{\partial E_\theta^\beta}{\partial \theta}$ and $\frac{\partial E_\theta^\beta}{\partial \beta}$ are used to mean the partial derivatives with respect to the arguments of $E_\theta^\beta$, whereas $\frac{d}{d\theta}$ and $\frac{d}{d\beta}$ represent the total derivatives with respect to θ and β, respectively (which include the differentiation path through $s_\theta^\beta$). The total derivative $\frac{d}{d\theta}$ (resp. $\frac{d}{d\beta}$) is performed for fixed β (resp. fixed θ).

Acknowledgments

We thank Jonathan Binas for feedback and discussions, as well as NSERC, CIFAR, Samsung, and Canada Research Chairs for funding.

References

Almeida, L. B. (1987). A learning rule for asynchronous perceptrons with feedback in a combinatorial environment. In Proceedings of the First International Conference on Neural Networks (vol. 2, pp. 609-618). Piscataway, NJ: IEEE.

Cohen, M. A., & Grossberg, S. (1983). Absolute stability of global pattern formation and parallel memory storage by competitive neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 5, 815-826.

Crick, F. (1989). The recent excitement about neural networks. Nature, 337(6203), 129-132.

Hinton, G. E., & McClelland, J. L. (1988). Learning representations by recirculation. In D. Z. Anderson (Ed.), Neural information processing systems (pp. 358-366). College Park, MD: American Institute of Physics.

Hopfield, J. J. (1984). Neurons with graded responses have collective computational properties like those of two-state neurons. PNAS, 81, 3088-3092.

LeCun, Y., Touresky, D., Hinton, G., & Sejnowski, T. (1988). A theoretical framework for back-propagation. In Proceedings of the 1988 Connectionist Models Summer School (pp. 21-28). San Mateo, CA: Morgan Kaufmann.

Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review Letters, 59, 2229-2232.

Scellier, B., & Bengio, Y. (2017). Equilibrium propagation: Bridging the gap between energy-based models and backpropagation. Frontiers in Computational Neuroscience, 11.

Scellier, B., Goyal, A., Binas, J., Mesnard, T., & Bengio, Y. (2018). Generalization of equilibrium propagation to vector field dynamics. arXiv:1808.04873.