## Abstract

Identifying, formalizing, and combining biological mechanisms that implement known brain functions, such as prediction, is a main aspect of research in theoretical neuroscience. In this letter, the mechanisms of spike-timing-dependent plasticity and homeostatic plasticity, combined in an original mathematical formalism, are shown to shape recurrent neural networks into predictors. Following a rigorous mathematical treatment, we prove that they implement the online gradient descent of a distance between the network activity and its stimuli. The convergence to an equilibrium, where the network can spontaneously reproduce or predict its stimuli, does not suffer from bifurcation issues usually encountered in learning in recurrent neural networks.

## 1. Introduction

One of the main functions of the brain is prediction (Bar, 2009). This function is generally thought to rely on the idea that cortical regions learn a model of the world and simulate it to generate predictions of future events (Gilbert & Wilson, 2007; Schacter, Addis, & Buckner, 2008). Several recent experimental findings support this view, showing in particular that the spontaneous neuronal activity after presentation of a stimulus is correlated with the evoked activity (Kenet, Bibitchkov, Tsodyks, Grinvald, & Arieli, 2003), and that this similarity increases along development and learning (Berkes, Orbán, Lengyel, & Fiser, 2011). Moreover, at the scale of neuronal networks, prediction can also be seen as a general organization principle. It has been argued (Rao, & Ballard, 1999; Clark, 2012) that the brain would contain a hierarchy of predictive units able to predict their direct stimuli or inputs through the modification of the synaptic connections.

Understanding the mechanisms and principles underlying this prediction function is a key challenge, not only from a neuroscience perspective but also for machine learning, where the number of applications requiring prediction is significant.

In the field of machine learning, recurrent neural networks have been successfully proposed as candidates for these predictive units (Williams & Zipser, 1989, 1995; Pearlmutter, 1995; Jaeger & Haas, 2004; Sussillo & Abbott, 2009). In most cases, these algorithms aim at creating a neural network that autonomously and spontaneously reproduces a given time series. The Bayesian approach is also useful in designing predictors (Dayan, Hinton, Neal, & Zemel, 1995; George & Hawkins, 2009) and has been mapped to neural networks (Deneve, 2008; Friston, 2010; Bitzer & Kiebel, 2012). However, apart from a rough conceptual equivalence, this letter is devoid of Bayesian terminology and directly focuses on neural networks. In this framework, prediction is often achieved by minimizing a distance between the activity of the neural network and the target time series. Although neural networks were originally studied in a feedforward framework (Rosenblatt, 1958), the most efficient networks for prediction shall involve recurrent connections giving the network some memory properties. So-called gradient descent algorithms in recurrent neural network (Mandic & Chambers, 2001) involve the learning of the entire connectivity matrix. They minimize the distance between a target trajectory and the trajectory of the network. On the other hand, researchers in the field of reservoir computing (Lukosevicius & Jaeger, 2009) optimize only some connections in the network, whereas the others are randomly drawn and fixed. To do prediction, they minimize the one-step-ahead error corresponding to the distance between the network predictions and the next time step of the target time series. Thus, these algorithms are derived to optimize an accuracy criterion, with learning rules generally favoring prediction efficiency over biological meaning.

The field of neuroscience has seen many discoveries in the study of synaptic plasticity, in particular providing experimental evidence and possible mechanisms for two major concepts in the current biology of learning. The first is the discovery of spike-timing-dependent plasticity (STDP) (Markram, Lübke, Frotscher, & Sakmann, 1997; Bi & Poo, 1998; Caporale & Dan, 2008; Sjöström & Gerstner, 2010). It is a temporally asymmetric form of Hebbian learning induced by temporal correlations between the spikes of pre- and postsynaptic neurons. The general principle is that if a neuron fires before (resp. after) another, then the strength of the connection from the former to the latter will be increased (resp. decreased). The summation of all of these modifications leads to the strengthening of causality links among neurons. Although STDP is based on spiking networks, it has several extensions or analogs for rate-based networks (those used in machine learning) (Kempter, Gerstner, & Van Hemmen, 1999; Izhikevich & Desai, 2003; Pfister & Gerstner, 2006). The functional role of STDP is still discussed—for instance, in reducing latency (Song, Miller, & Abbott, 2000), optimizing information transfer (Hennequin, Gerstner, & Pfister, 2010), invariant recognition (Sprekeler, Michaelis, & Wiskott, 2007), and even learning temporal patterns (Gerstner, Ritz, & Van Hemmen, 1993; Rao & Sejnowski, 2001; Yoshioka, Scarpetta, & Marinaro, 2007).

Second, the notion of homeostatic plasticity (Miller, 1996; Abbott & Nelson, 2000; Turrigiano & Nelson, 2004), including mechanisms such as synaptic scaling, has proved to be important to moderate the growth of connection strength. In contrast to previously theory-motivated normalization of the connectivity (Miller & MacKay, 1994; Oja, 1982), there is need of a biologically plausible means to prevent the connectivity from exploding under the influence of shaping mechanisms like Hebbian learning or STDP.

From a theoretical viewpoint, STDP and homeostatic plasticity are almost always studied independently. An extensive bottom-up numerical analysis of the combination of such learning mechanisms, done by Triesch and colleagues, has already led to biologically relevant behaviors (Lazar, Pipa, & Triesch, 2007, 2009; Zheng, Dimitrakakis, & Triesch, 2013). However, the mathematical understanding of their combination in terms of functionality still stands as an undocumented challenge to researchers.

This letter aims at bridging the gap between biological mechanisms and machine learning regarding the issue of predictive neural networks. We rigorously show how a biologically inspired learning rule, developed from an original combination of STDP mechanism and homeostatic plasticity, mimics the gradient descent of a distance between the activity of the neural network and its direct stimuli. This results in capturing the underlying dynamical behavior of the stimuli into a recurrent neural network and therefore in designing a biologically plausible predictive network.

The letter is organized as follows. In section 2, we construct a theoretical learning rule designed for prediction, based on an appropriate gradient descent method. In section 3, we introduce a biologically inspired learning rule that combines the concepts of STDP and homeostatic plasticity, whose purpose is to mimic the theoretical learning rule. We discuss the various biological mechanisms that may be involved in this new learning rule. Finally in section 4, we provide a mathematical justification of the link between the theoretical and the biologically inspired learning rule, based on the key idea that STDP can be seen as a differential operator.

## 2. Theoretical Learning Rule for Prediction

In a machine learning approach, we introduce a procedure to design a neural network that autonomously replays a target time series.

### 2.1. Setup.

We consider a recurrent neural network made of *n* neurons that is exposed to a time-dependent input of the same dimension. Our aim is to construct a learning rule that will enable the network to reproduce the input's behavior.

**u**is generated by an arbitrary dynamical system, with a smooth vector field from to . We also assume that the trajectory of the inputs or stimuli is -periodic. The key mathematical assumption on the input

**u**is in fact ergodicity, but we restrict our study to periodic inputs for simplicity. In particular, periodic inputs can be constructed by the repetition of a given finite-time sample.

### 2.2. Gradient Descent Learning Rule.

**W**that will minimize a distance between the two functions and −

*l*

**u**+

**W**.

*S*(

**u**). In this perspective, we define the following quantity: When

*H*

_{u}=0, the vector fields of systems 2.1 and 2.2 are equal on the trajectories of the inputs. This quantity may be viewed as a distance between the two vector fields defining the dynamics of the inputs and the neuronal network along the trajectories of the inputs. It is similar to classical gradient methods (Pearlmutter, 1995; Williams & Zipser, 1995; Mandic & Chambers, 2001), except that the norms are applied to the flows of inputs and neural networks instead of their activity. Thus, it focuses more specifically on the dynamical structure of the inputs. Moreover, it is possible to show, using Girsanov's theorem, that this definition coincides with the concept of relative entropy between two diffusion processes—the ones obtained by adding a standard gaussian perturbation to both equations. Therefore, we call

*H*

_{u}the relative entropy.

*i*,

*j*of is for

**x**,

**y**functions from to . Equivalently, these functions can be seen as semicontinuous matrices in , and is the transpose of

**x**. Thus, an algorithm implementing the gradient descent is a good candidate to minimize the relative entropy between inputs and spontaneous activity. Since

*H*is quadratic in

_{u}**W**, it follows that is a convex function, thus excluding situations with multiple local minima. Moreover, if is invertible, one can compute directly the minimizing connectivity as

Implementing a gradient descent based on equation 2.4 does not immediately lead to a biologically relevant mechanism. First, it requires direct access to the inputs **u**, whereas synaptic plasticity mechanisms rely on only the network activity **v**. Second, it is a batch learning algorithm, which requires access to the entire history of the inputs. Third, it requires the ability to compute . We will see in section 3 how to overcome these issues combining biologically inspired synaptic plasticity mechanisms.

### 2.3. Example.

In order to illustrate the idea that learning rule 2.4 enables the network to learn dynamical features of the input, we have constructed the following experiment. We present to the network an input movie displaying sequentially the writing of the letter A (see Figure 1, top row). To each pixel we assign one neuron, so that the input and the network share the same dimension. This input movie is repeated periodically until the connectivity matrix of the network, evolving under rule 2.4, stabilizes. Then the input is turned off, and we set the initial state of the network to a priming image showing the bottom left part of letter A. The network evolving without input strikingly reproduces the dynamical writing of letter A as displayed in Figure 1. Thus, with this example, we have illustrated the ability of the learning rule we have derived from a theoretical principle to capture a dynamic input into a connectivity matrix.

## 3. A Biological Learning Rule

We now introduce a biological learning rule made of the combination of STDP and homeostatic plasticity. In section 4, we show that this learning rule minimizes *H*_{u}. Here, we first give a mathematical description of this learning rule and relate the different terms to biological mechanisms.

### 3.1. Mathematical Description.

*g*is defined as , with

_{c}*H*the Heaviside function and for any positive number

*c*, as shown in the left panel of Figure 2. As illustrated in the right panel of Figure 2, the operator roughly corresponds to the classical STDP window (Bi & Poo, 1998) (taking into account a

*y*-axis symmetry corresponding to the symmetric formalism we are using). The constant is a time constant corresponding to the width of the STDP window used for learning.

### 3.2. Biological Mechanisms.

#### 3.2.1. An Input Estimate.

The variable can be seen as a spatiotemporal differential variable that approximates the inputs **u**. Although unsupervised learning rules are often algebraic combinations of element-wise functions applied to the activity of the network (Gerstner & Kistler, 2002), it is not precisely the case here. Indeed, learning is based on the variable , which corresponds to the subtraction of the temporally integrated synaptic drive from the activity of the neurons *l***v**. For each neuron, this variable takes into account the past of all the neurons, which are then spatially averaged through the connectivity to be subtracted from the current activity. This gives a differential flavor to this variable that is reminiscent of former learning rules (Bienenstock, Cooper, & Munro, 1982; Sejnowski, 1977) for the temporal aspect and (Miller & MacKay, 1994) for the spatial aspect. Note that this variable is not strictly speaking local (i.e., the connection **W**_{ij} needs the values of **v**_{k} to be updated), yet it is biologically plausible since the term (**W**.*S*(**v**))_{i} is accessible for neuron *i* on its dendritic tree, a form of locality in a broader sense.

#### 3.2.2. An STDP Mechanism.

The first term in equation 3.1 can be related to STDP. The antisymmetric part of this term is responsible for retrieving the drift in equation 2.4. The symmetric part (corresponding to Hebbian learning) is responsible for retrieving the second term in equation 2.4. Thus, it captures the causality structure of the inputs, a task generally attributed to STDP (Sjöström & Gerstner, 2010). Beyond the simple similarity of functional role, we believe a simplification of this term may shed light on the deep link it has with STDP. The main difference between our setup and STDP is that the former is based on a rate-based dynamics, whereas the latter is based on a spiking dynamics. In a pure spike framework (i.e., the activity is a sum of Diracs), the STDP can be seen as this simple learning rule: . Indeed, the term is nonnull only when the postsynaptic neuron *i* is firing and then, via the factor , it counts the number of preceding presynaptic spikes that might have caused *i*’s excitation and weight them by the decreasing exponential . Thus, this term exactly accounts for the positive part of the STDP curve. The negative term takes the opposite perspective and accounts for the negative part of the STDP curve. A loose extension of this rule to the case where the activity is smoothly evolving leads to identifying the function to the STDP mechanism for rate-based networks (Galtier, 2012; Izhikevich & Desai, 2003).

#### 3.2.3. Homeostatic Plasticity.

The second term in equation 3.1 accounts for what is usually presented as homeostatic plasticity mechanisms. The previous STDP term seems to be a powerful mechanism to shape the response of the network. However, there is a need for a regulatory process to prevent uncontrolled growth of the network connectivity (Abbott & Nelson, 2000; Turrigiano & Nelson, 2004; Miller, 1996). It has been argued that STDP could be self-regulatory (Van Rossum, Bi, & Turrigiano, 2000; Song et al., 2000), but it is not the case in our framework, and an explicit balancing mechanism is necessary to avoid the divergence of the system. This last term is the only one with a negative sign and is multiplicative with respect to the connectivity. Thus, according to Abbott and Nelson (2000), it is a reasonable candidate for homeostasis. It has been argued (Turrigiano, Leslie, Desai, Rutherford, & Nelson, 1998; Kim, Tsien, & Alger, 2012) that homeostatic plasticity might keep the relative synaptic weights by dividing the connectivity with a common scaling factor, theoretically preventing a possible information loss. In contrast to these ad hoc renormalizations often introduced in other learning rules (Miller & MacKay, 1994; Oja, 1982), our relative entropy-minimizing learning rule thus naturally introduces an original form of homeostatic plasticity.

### 3.3. Numerical Application.

Although the focus of the letter is on theory, we introduce a simple numerical example to illustrate the predictive properties of the biological learning rule. More precisely, we investigate the question of retrieving the connectivity of a neural network based on the observation of the time series of its activity. This is an inverse problem, usually a challenging topic in computational neuroscience (Friston, Harrison, & Penny, 2003; Galán, 2008; Potthast & beim Graben, 2009) since it may give access to large-scale effective connectivities simply from observing a neuronal activity. Here we address it in an elementary framework. The network generating the activity patterns, referred to as an input network, evolves according to . For this example, the network is made of *n* = 3 neurons, and its connectivity **W**_{0} is shown in Figure 3a. These parameters were chosen so that the activity is periodic, as shown by the dashed curves in Figure 3c. Then we simulate the entire system, equation 3.1, with a decay constant *l _{net}* and observe that its connectivity

**W**

_{net}converges to

**W**

_{0}.

As we show in the next section, it is necessary that in order to approximate accurately the input's activity with the online learning rule, equation 3.1. Given that the time constant of the inputs is governed by *l*_{0}, the previous approximation holds only if . If this assumption is broken (e.g., *l _{net}*=

*l*

_{0}), then the final connectivity matrix is different from

**W**

_{0}(see the black dot-dashed curve in Figure 3b). Indeed, in this case, the network learns only to replay the slow variation of the inputs.

A method to recover the precise time course of the inputs consists in artificially changing the time constants at different steps of an algorithm. First, simulate the network equation (top equation in equation 3.1) with a constant . Yet the time constant in the learning equation (bottom equation in equation 3.1) is to be kept at *l*_{0}, thus introducing a hybrid model. In this framework, the connectivity converges exactly to **W**_{0} as shown in the gray dashed curve in Figure 3b. After learning, simulate the network equation with the learned connectivity and with a time constant switched back to *l*_{0}. This gives an activity as shown in the solid curves in Figure 3c.

## 4. Link Between Theoretical and Biological Learning Rules

### 4.1. Three Technical Results.

Equation 2.4 is not biologically plausible for three main reasons: (1) it requires a direct access to the inputs **u**, whereas synaptic plasticity mechanisms rely on only the network activity **v**; (2) it is a batch learning algorithm, which requires access to the entire history of the inputs; and (3) it requires the ability to compute (equal to the time derivative of the inputs according to equation 2.1).

On the contrary, the biological learning rule, equation 3.1, does not have these problems. First, it uses an estimate of the inputs, noted , based on the activity variable only. Second, it is an online learning rule; it takes input on the fly and relies on a slow-fast mechanism to temporally average the variables. Third, it approximates the computation of with a function inspired from STDP convolution. In the following, we mathematically address these three points successively.

#### 4.1.1. Is an Input Estimate.

**v**governed by the top part of equation 3.1. However, to be comparable to the gradient of the relative entropy, equation 2.4, we first need to make explicit the dependence on the inputs

**u**. Therefore, we show that the network dynamics induces a simple relation between and the inputs

**u**. A simple computation in the Fourier domain shows that the convolution between the temporal operator and

*g*results in

_{l}*lI*. Applying this result to the neural dynamics leads to reformulating the top part of equation 3.1 as . We recognize the definition of the variable (originally defined according to this relation) such that the network's dynamics of the fast equation in equation 3.1 is equivalent to

_{d}#### 4.1.2. Temporal Averaging of the Learning Rule.

**v**corresponds to the assumption . In this case, we can apply classic results of periodic averaging (Sanders & Verhulst, 1985, see also Galtier & Wainrib, 2012, in the context of neural networks) to show that the evolution of

**W**is well approximated by where and . The right panel of Figure 2 shows a plot of the function .

#### 4.1.3. STDP as a Differential Operator.

Both proofs consist in going in the Fourier domain, where convolutions are turned into multiplications. We use the unitary, ordinary frequency convention for the Fourier transform. Observe that the Fourier transform of is . And we define such that and .^{1}

Let us show that . We proceed in two steps:

We show that . The Fourier transform of the convolution is the product . From the usual Fourier tables, we observe that the right-hand side is the Fourier transform of . This immediately leads to the result.

Using result (1) and applying result (2) to and **y** leads to the result .

### 4.2. Main Result.

We prove here that the biological learning rule, 3.1, implements the gradient descent based on equation 2.4.

If **u** is slow enough (i.e., ), then this equation is precisely the gradient descent of *H*_{u}. If **u** is too fast, the network will learn only to predict the slow variations of the inputs. Some fast-varying information is lost in the averaging process, mainly because the network equation in equation 3.1 acts as a relaxation equation, which filters the activity. Note that this loss of information is not necessarily a problem for the brain, since it may be extracting and treating information at different timescales (Kiebel, Daunizeau, & Friston, 2008). Actually, the choice of the parameter *l* specifies the timescale under which the inputs are observed.

### 4.3. Stability.

Both theoretical and biological learning rules are assured to make the connectivity converge to an equilibrium point whatever the initial condition. In particular, this method does not suffer from the bifurcation issues often encountered in recurrent neural network learning (Doya, 1992). In most cases, problems arise when learning leads the network activity to a bifurcation. The bifurcation suddenly changes the value of the quantity being minimized, and this intricate coupling leads to very slow convergence rates. The reason that this method has such an unproblematic converging property is that the relative entropy (as opposed to other energy functions traditionally used) is independent of the network activity **v**. Besides, for any network pattern **v**, equation 4.1 ensures that is always equal to the convolved inputs. These two arguments break the pathologic coupling preventing from getting interesting convergence results.

From a mathematical perspective, the Krasovskii-Lasalle invariance principle (Khalil & Grizzle, 1992) ensures that the theoretical learning rule converges to an equilibrium point. Indeed, the relative entropy acts as an energy or Lyapunov function. If is invertible, then the equilibrium point is unique and defined by equation 2.5. If it is not invertible, then the equilibrium depends on the initial condition.

If the inputs are slow enough, it has been shown that biological learning is well approximated by the theoretical gradient descent. Therefore, the former exhibits the same stable converging behavior as the latter. Thus, the biological learning rule, equation 3.1, is stable. In practice, we have never encountered a diverging situation even when the inputs are not slow enough.

## 5. Discussion and Conclusion

We have shown that a biological learning rule shapes a network into a predictor of its stimuli. This learning rule is made of a combination of an STDP mechanism and homeostatic plasticity. After learning, the network would spontaneously predict and replay the stimuli it has previously been exposed to. This was achieved by showing that this learning rule minimizes a quantity analogous to the relative entropy (or Kullback-Leibler divergence) between the stimuli and the network activity as for other well-known algorithms (Ackley, Hinton, & Sejnowski, 1985; Hinton, 2009).

We believe this letter brings interesting arguments in the debate concerning the functional role of STDP. We have shown that the antisymmetric part of STDP can be seen as a differential operator. When its effect is moderated by an appropriate scaling term, we argue that it could correspond to a generic mechanism shaping neural networks into predictive units. This idea is not new, but this letter may give it a stronger theoretical basis.

This study also gives central importance to the time constants characterizing the network. Indeed, the fact the biological learning rule, equation 3.1, implements a gradient descent is rigorously true for slow inputs. Inputs are slow if they are significantly slower than both the time window of the STDP and the decay constant of the network. This means that subnetworks in the brain could select the frequency of the information they are predicting. Given that the brain may process information at different timescales (Kiebel et al., 2008), this may be an interesting feature. Besides, note that the proposed biological learning rule, equation 3.1, is partly characterized by the activity decay parameter *l*. This is surprising because it links the intrinsic dynamics of the neurons to the learning processes corresponding to the synapse. Therefore, it may provide a direction to experimentally check this theory. Is the symmetric part of STDP (Hebbian learning) stronger between faster neurons?

One of the characteristic features of this research is the combination of rate-based networks and the concept of spike-timing-dependent plasticity. Obviously this made it impossible to consider the rigorous definition of STDP. However, we have argued that the function in equation 3.2 is an alternative formulation equivalent to the others in the spiking framework, and it can be trivially extended to analog networks. Besides, it does capture the functional mechanism of the STDP in the rate-based framework. When the activation of a neuron precedes (resp. follows) the activation of another, the strength of the synapse from the former to the latter is increased (resp. decreased). Finally, we believe the theory presented in this letter gives support to this rate-based STDP since it appears to fill a gap between machine learning and biology by implementing a differential operator.

This approach can be applied to forms of network equations other than equation 2.2, such as the Wilson-Cowan or Kuramoto models, leading to different learning rules. In this perspective, learning can be seen as a projection of a given arbitrary dynamical system to a versatile neuronal network model, and the learning rule will depend on the chosen model. However, any network equation with an additive structure—intrinsic dynamics plus communication with other neurons—as in equation 2.2 will lead to a similar structure for the learning rule, with terms that may share similar biological interpretations as developed above. A special case is the linear network for which various statistical methods to estimate the connectivity matrix have been applied in, for example, climate modeling (Penland & Sardeshmukh, 1995), gene regulatory networks (Yeung, Tegnér, & Collins, 2002) and spontaneous neuronal activity (Galán, 2008). In this simpler case, the biological learning rule in equation 3.1 would be with . The method developed in this letter can be used to extend the inverse modeling approach previously developed in the linear case to models with nonlinear interactions.

One of the main restrictions of this approach is that the dimensionality of the stimuli and that of the network have to be identical. Therefore, the accuracy of this biological mechanism does not match the state-of-the-art machine learning algorithms. This is not necessarily a big issue since a high-dimensional projection of the inputs may be used as preprocessing. From a biological viewpoint and taking the example of vision, this would correspond to the retino-cortical pathway, which is not one-to-one and very redundant. But this limitation also puts forward the need to study a network containing hidden neurons in a similar fashion. Ongoing research is revealing that the mathematical formalism is well suited to extend this approach to the field of reservoir computing (Jaeger & Haas, 2004).

## Acknowledgments

We thank Herbert Jaeger for his helpful comments on the manuscript. M.N.G. was partially funded by the Amarsi project (FP7-ICT 24833), ERC advanced grant NerVi 227747, the région PACA, France, and the IP project BrainScaleS 269921.

## References

## Note

^{1}

This notation is motivated by the following: the convolution with can be written as a matrix multiplication with a continuous Toeplitz matrix whose component *ts* is . In this framework, the convolution with corresponds to the multiplication with the transpose of the previous Toeplitz operator.