## Abstract

In this theoretical contribution, we provide a mathematical proof that two of the most important classes of network learning—correlation-based differential Hebbian learning and reward-based temporal difference learning—are asymptotically equivalent when the learning is timed by a modulatory signal. This opens the opportunity to consistently reformulate most of the abstract reinforcement learning framework from a correlation-based perspective that is more closely related to the biophysics of neurons.

## 1. Introduction

Network learning subdivides into supervised (SL), reinforcement (RL), and correlation-based (Hebbian, CL) learning according to the existence and role of an error signal for controlling the learning (Doya, 2000). In SL, explicit error signals exist; in RL, networks learn from unspecific reinforcement signals (rewards); whereas in CL, learning takes place as a self-organization process relying on multiplicative signal correlations only, hence without an error signal. Convergence proofs exist for some generic cases of these learning mechanisms (Widrow & Hoff, 1960; Oja, 1982; Hopfield, 1982; Linsker, 1988; Dayan & Sejnowski, 1994). The equivalence of the different rules, however, remains a pressing question, as it would allow extending conclusions on convergence across different mechanisms.

Recently there have been several contributions toward solving this question (Izhikevich, 2007; Roberts, Santiago, & Lafferriere, 2008; Florian, 2007; Potjans, Morrison, & Diesmann, 2009), which presented specific solutions to be discussed later (see section 4). Thus, there is more and more evidence emerging that Hebbian learning and reinforcement learning can be brought together under a more unifying framework. Such an equivalence would have substantial influence on our understanding of network learning, as these two types of learning could be interchanged under these conditions.

Thus, the goal of this study is to prove that the most influential form of RL, which relies on the temporal difference (TD) learning rule (Sutton, 1988), is asymptotically equivalent to CL and convergent over wide parameter ranges when using a third factor as a gating signal, together with a differential Hebbian emulation of CL.

The idea of differential Hebbian learning was first used by Klopf (1988) to describe classical conditioning relating to the stimulus substitution model of Sutton and Barto (1981). Around the same time, Kosco (1986) used differential Hebbian learning in a machine learning context and examined its features. One of its most important features is the implicit introduction of negative weight changes (LTD), which leads to intrinsic stabilization properties in networks. Earlier approaches had to explicitly introduce negative weight changes into the learning rule, for example, by way of a threshold (Oja, 1982). Negative weight changes (LTD) had first been discovered physiologically by Dudek and Bear (1992) and later in conjunction with the timing of pre- and postsynaptic activity by Markram, Lübke, Frotscher, & Sakmann (1997) (spike-timing-dependent plasticity).

One drawback of reinforcement learning algorithms, like temporal difference learning, is their use of discrete time and discrete nonoverlapping states. In real neural systems, time is continuous, and the state space can be represented only by the activity of neurons, many of which will be active at the same time and for the same space. This creates a rather continuous state-space representation in real systems. In order to allow overlapping states or to generalize over a wider range of input regions, RL algorithms are usually extended by value function approximation methods (Sutton & Barto, 1998). However, while biologically more realistic (Tamosiunaite et al., 2008), this makes initially elegant RL algorithms often quite opaque, and convergence cannot be guaranteed (Tsitsiklis & Van Roy, 1997; Wiering, 2004). Here we are not concerned with function approximation, but instead address the question of how to transform an RL algorithm (TD-learning) to continuous time using differential Hebbian learning with a third factor while remaining fully compatible with neuronally plausible operations. Only a few other approaches to formulate RL in continuous time and space exist (Baird, 1993; Doya, 1996, 2000). The differences from our approach are discussed in section 4.

Biophysical considerations about how such a third factor might be implemented in real neural tissue are of secondary importance for this study. At this stage, we are concerned with a formal proof only. Some biophysical aspects are treated in section 4, though.

This letter proceeds as follows. First, we describe the basics of TD and differential Hebbian learning. This will lead us to a general signal (= “state”) structure that is necessary to put both frameworks together. Next, we take a close look at the differential Hebbian learning rule and its differential equation. This gives us definitions that we will use for the formal analysis of the equivalence. A simulated network will then be investigated, and some technical constraints will be evaluated before we finish with a general discussion. Details of analytical calculations used in the main text are given in the appendix.

### 1.1. Emulating RL by Temporal Difference Learning.

Reinforcement learning maximizes the rewards *r*(*s*) an agent will receive in the future when following a policy π traveling along states *s*. The return *R* is defined as the sum of the future rewards: *R*(*s _{i}*) = ∑_{k}γ^{k}*r*(*s*_{i+k+1}), where future rewards are discounted by a factor 0 < γ < 1. One central goal of RL is to determine the values *V*(*s _{i}*) for each state, given by the average expected return *E*^{π}{*R*} that can be obtained when following policy π. Many algorithms exist to determine the values, almost all of which rely on the temporal difference (TD) learning rule, equation 1.1 (Sutton, 1988).

Every time the agent leaves a state *s _{i}*, it updates the value *V*(*s _{i}*) with the discounted value *V*(*s*_{i+1}) and the reward *r*(*s*_{i+1}) of the next state that is associated with the consecutive state *s*_{i+1}:

*V*(*s _{i}*) → *V*(*s _{i}*) + α[*r*(*s*_{i+1}) + γ*V*(*s*_{i+1}) − *V*(*s _{i}*)], (1.1)

where α is the learning rate. This rule is called TD(λ = 0), shortened to TD(0), as it evaluates only adjacent states. For values of λ ≠ 0, more of the recently visited states are used for the value-function update. TD(0) is by far the most influential RL learning rule, as it is the simplest way to ensure optimality of learning (Dayan & Sejnowski, 1994; Sutton & Barto, 1998).
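As an illustration, equation 1.1 can be iterated directly. The following sketch (ours, not part of the letter) runs tabular TD(0) on a linear chain of states with a single reward delivered on the final transition; the chain layout and all parameter values are illustrative. At convergence, the values decay geometrically with γ toward the start of the chain.

```python
# Tabular TD(0) on a linear chain s_0 -> s_1 -> ... -> terminal state.
# A reward of 1 is delivered on the final transition; after convergence the
# values follow V(s_i) = gamma**(n_states - 1 - i), i.e., gamma-discounting.

def td0_chain(n_states=6, gamma=0.9, alpha=0.1, trials=5000):
    V = [0.0] * (n_states + 1)        # last entry: terminal state, value stays 0
    for _ in range(trials):           # one trial = one pass along the chain
        for i in range(n_states):
            r = 1.0 if i == n_states - 1 else 0.0  # reward on entering the terminal state
            V[i] += alpha * (r + gamma * V[i + 1] - V[i])  # update of equation 1.1
    return V[:n_states]

values = td0_chain()
```

For γ = 0.9 and six states, the learned values decay from 1 at the state adjacent to the reward down to 0.9^{5} at the start of the chain.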

### 1.2. Differential Hebbian Learning with a Third Factor.

In traditional Hebbian learning, the change of a weight ρ relies on the correlation between input *u*(*t*) and output *v*(*t*) of a neuron: ρ′(*t*) = μ*u*(*t*)*v*(*t*), where μ is the learning rate and the prime denotes the temporal derivative. If we consider the change of the postsynaptic signal and therefore replace *v*(*t*) with *v*′(*t*), we arrive at differential Hebbian learning. Then negative weight changes are also possible, and this yields properties similar to experimental neurophysiological observations (spike-timing-dependent plasticity; Markram et al., 1997).

In addition, we introduce a modulatory signal *M*(*t*) responsible for controlling the learning (Porr & Wörgötter, 2007). The three-factor differential Hebbian learning rule is then

ρ′_{k}(*t*) = μ*M*(*t*)*u _{k}*(*t*)*v*′(*t*), (1.2)

where *u _{k}*(*t*) is the considered presynaptic signal and *v*(*t*) = ∑_{k}ρ_{k}(*t*)*u _{k}*(*t*) the postsynaptic activity of a model neuron with weights ρ_{k}(*t*). We will assume in the following that our modulatory signal *M*(*t*) is either 1 or 0, thus represented by a step function. By means of the learning rate μ, we can set the ratio between the weight change over the weight, ρ′/ρ, to be significantly smaller than the state change over the state value, *u*′/*u*.
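In discrete time, the rule is a gated correlation between each presynaptic trace and the temporal derivative of the postsynaptic activity. The sketch below is ours; the signal shapes, the window timing, and all parameter values are illustrative assumptions. It shows the gating at work: the weight of an early input grows only while *M*(*t*) = 1 opens a window around the transition to a later, already-weighted input.

```python
import math

def run_three_factor(dt=0.01, mu=0.01, steps=1000):
    """Euler-integrate rho_k' = mu * M(t) * u_k(t) * v'(t) for two inputs.

    u_0 rises and decays (an 'earlier' signal), u_1 follows later; M opens
    a brief learning window around the transition between them.
    """
    def u(t, onset):                     # illustrative alpha-shaped input trace
        s = t - onset
        return s * math.exp(1.0 - s) if s > 0 else 0.0

    rho = [0.0, 0.5]                     # second weight pretrained (e.g., toward reward)
    v_prev = 0.0
    for n in range(1, steps):
        t = n * dt
        u0, u1 = u(t, 1.0), u(t, 3.0)
        v = rho[0] * u0 + rho[1] * u1    # postsynaptic activity v = sum_k rho_k u_k
        dv = (v - v_prev) / dt           # temporal derivative v'
        M = 1.0 if 2.8 < t < 3.5 else 0.0  # third factor: gate around the transition
        rho[0] += mu * M * u0 * dv * dt  # three-factor differential Hebbian update
        rho[1] += mu * M * u1 * dv * dt
        v_prev = v
    return rho

rho_after = run_three_factor()
```

Because the rise of the later, weighted input makes *v*′ > 0 while *u*_{0} is still positive inside the window, the early weight receives a small positive change, mirroring the Δ^{cc+} contribution discussed in section 2.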

## 2. Analytical Derivation

We are going to analyze the weight change of weight ρ_{i}(*t*) when considering three consecutive signals *u*_{i−1}(*t*), *u _{i}*(*t*), and *u*_{i+1}(*t*), with the index *i* representing a temporal (and not, e.g., a spatial) ordering. The third factor *M*(*t*) opens a time window for the considered weight ρ_{i}(*t*) in which changes can occur. Although this time window could be located anywhere depending on the output *v*(*t*), it should be placed at times where a state change occurs, and thus at the beginning or end of the state *s _{i}*(*t*), as it makes sense only if states correlate with temporally neighboring states.

The relation between state *s*(*t*) and input *u*(*t*) is determined by a convolution: *u*(*t*) = ∫^{∞}_{0}*s*(*z*)*h*(*t* − *z*)*dz*, with filter function *h*(*t*), which is identical for all states. As we are using only states that are either on or off during visiting duration *S*, the input functions *u*(*t*) do not differ between states. Therefore, we will use *u _{i}*(*t*) (with index *i*) when having a particular state in mind and *u*(*t*) (without index *i*) when pointing to functional development.
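Numerically, this transform is a discrete convolution of a box-shaped state trace with the filter kernel. The sketch below is ours and uses an illustrative exponential kernel *h*(*t*) = e^{−t/τ}; the derivation only requires *h* to be identical for all states.

```python
import math

def state_to_input(state, dt=0.01, tau=0.2):
    """u(t) = integral_0^inf s(z) h(t - z) dz, discretized.

    `state` is a 0/1 list sampled at dt; h is an illustrative exponential kernel.
    """
    kernel = [math.exp(-k * dt / tau) for k in range(len(state))]
    u = [0.0] * len(state)
    for n in range(len(state)):          # u[n] = sum_z s[z] * h[n - z] * dt
        acc = 0.0
        for z in range(n + 1):
            acc += state[z] * kernel[n - z]
        u[n] = acc * dt
    return u

S = 100                                   # visiting duration in samples
s_box = [1.0] * S + [0.0] * S             # state on, then off
u_trace = state_to_input(s_box)
```

The resulting input trace rises while the state is visited and decays after it is left, giving the single-maximum shape with a rising and a falling phase assumed in section 2.3.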

Furthermore, we define the time period between the end of a state *s _{i}*(*t*) and the beginning of the next state *s*_{i+1}(*t*) as *T* (*T* < 0 in case of overlapping states). Concerning the modulatory third factor *M*(*t*), we define its length as *L* and the time period between the beginning of *M*(*t*) and the end of the state *s _{i}*(*t*) as *O*. These four parameters (*L*, *O*, *T*, and *S*) are constant over states and are displayed in detail in Figure 1B.
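These definitions can be laid out on a discretized time axis. The sketch below is ours; the sign convention for *O* (the gating window beginning *O* after the end of a state, with negative *O* placing it earlier) and all numerical values are assumptions for illustration only.

```python
def build_timeline(S=1.0, T=0.2, L=0.3, O=0.1, dt=0.01, t_end=3.0):
    """Boolean traces for two consecutive states and the third factor M(t).

    s_i is on during [0, S); s_{i+1} starts T after s_i ends (T < 0 would mean
    overlapping states). M opens for a length L, offset by O from the end of
    each state (sign convention assumed for illustration).
    """
    n = int(t_end / dt)
    t = [k * dt for k in range(n)]
    s_i = [0.0 <= x < S for x in t]
    s_next = [S + T <= x < 2 * S + T for x in t]
    m1 = [S + O <= x < S + O + L for x in t]                  # window at the end of s_i
    m2 = [2 * S + T + O <= x < 2 * S + T + O + L for x in t]  # window at the end of s_{i+1}
    M = [a or b for a, b in zip(m1, m2)]
    return t, s_i, s_next, M

t_axis, s_i_trace, s_next_trace, M_trace = build_timeline()
```

The two gating windows are separated by *S* + *T*, matching the integration boundaries used in section 2.1.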

### 2.1. Differential Equation.

Inserting the postsynaptic activity *v*(*t*) into the learning rule yields a differential equation for the weight, in which the modulatory signal *M*(*t*) defines the integration boundaries (see equations 2.7 and A.2). The first summand leads to the homogeneous solution, which we will define as autocorrelation ρ^{ac}(*t*). The second summand(s) will lead to the inhomogeneous solution, and this we will define as cross-correlation ρ^{cc}(*t*). Together we have ρ(*t*) = ρ^{ac}(*t*) + ρ^{cc}(*t*).

In general, the overall change of the weight ρ_{i}(*t*) after integrating over the visiting duration of *s _{i}*(*t*) and *s*_{i+1}(*t*) and using the modulatory signal *M*(*t*) is Δρ_{i} ≕ Δ_{i} = Δ^{ac}_{i} + Δ^{cc}_{i}.

Without restrictions, we can now limit further analysis of equation 2.1, in particular of the cross-correlation term, to the case of *j* = *i* ± 1 as the modulatory factor affects only the weights of the last and the following state.

Weight changes in general are slow, and we can assume a quasi-static process (ρ′/ρ ≪ *u*′/*u*). As a consequence, the derivatives of ρ on the right-hand side of equation 2.1 can be neglected.

The autocorrelation contribution Δ^{ac}_{i}(*t*) is then given in general form in section A.1, and the overall weight change follows with the third factor being present between *t* = *O* and *t* = *O* + *L* as well as between *t* = *O* + *S* + *T* and *t* = *O* + *S* + *T* + *L* (see Figure 1B). Using again the argument of a quasi-static process (α ≪ 1), we can expand the exponential function to the first order, where we have defined κ to be positive (see equation 2.5). Here we have left out the index *i*, as all states are identical, and split κ into κ^{+} and κ^{−}, representing the first (positive slope of signal *u*; see Figure 1B) and the second (negative slope of signal *u*; see Figure 1B) occurrence of the third factor. To this end, we also have to split Δ^{ac}_{i} into Δ^{ac+}_{i} and Δ^{ac−}_{i}.

Next, we calculate the cross-correlation ρ^{cc}(*t*), again under the assumption of a quasi-static process. This leads us to an expression in which we can use the third factor *M*(*t*) to define the integration boundaries. The third factor is present between *t* = *O* and *t* = *O* + *L* for the first summand and between *t* = *O* + *S* + *T* and *t* = *O* + *S* + *T* + *L* for the second summand (see Figure 1B). Furthermore, as all signals are identical, we can shift between signals by *t* = *S* + *T*. In detail, this is *u*_{i−1}(*t*) = *u _{i}*(*t* + *S* + *T*) and *u*_{i+1}(*t*) = *u _{i}*(*t* − *S* − *T*). The overall weight change can then be split into Δ^{cc−}_{i} and Δ^{cc+}_{i}. Here we defined τ^{±} as being positive (see equations 2.9 and 2.10), which also are independent of *i*.

Both τ^{±} and κ depend on the actual signal shape *u*(*t*) used and the values for the parameters *L*, *O*, *T*, and *S*.

### 2.2. Asymptotic Equivalence.

Without restrictions, we can now limit the discussion to the situation in Figure 1A, where we have one intermediate state transition (from *s _{i}* to *s*_{i+1}) and a final one (from *s*_{i+1} to *s _{R}*), which leads to the reward. It represents the fact that rewards are usually associated with unconditioned stimuli, which will lead to strong, insuppressible responses. This indicates that a strong prewired connection from the receptor of the unconditioned stimulus onto the target neurons exists. For simplicity, this structure here is incorporated in the network as state *s _{R}*. Thus, three-factor differential Hebbian learning will influence the two synaptic connections ρ_{i} and ρ_{i+1} of states *s _{i}* and *s*_{i+1}, respectively, which directly project onto neuron *v*.

Figure 1B shows a realistic situation of state transitions leaving the old state *s*_{i−1} and entering the new state *s _{i}* and so on. The signals as such could be considered as membrane voltages or firing rates of neurons.

The lower part of the signal traces indicates the times when the states *s* are visited and the upper part when the third factor *M* is either on or off. The part above the signal traces highlights the different contributions (Δ-values) to the overall weight change defined in the previous section.

First, we consider the weight change of ρ_{i}. This results from the transition between *s*_{i−1} and *s _{i}*, from visiting state *s _{i}* itself, and from the transition between *s _{i}* and *s*_{i+1}. The short learning period at the beginning of the signal *u _{i}* will cause a negative weight change Δ^{cc−}_{i} because of the correlation between the negative derivative of *u*_{i−1} and the positive value of *u _{i}*. Additionally, there is a weight change Δ^{ac+}_{i} caused by the signal itself. Due to the positive slope of the signal *u _{i}* at the beginning of the state, the contribution will be positive. The next learning interval occurs when the state *s _{i}* has been left and the signal *u _{i}* already decays. Thus, having a negative slope, the autocorrelation causes a weight change Δ^{ac−}_{i} < 0. The fourth contribution yields a positive weight change Δ^{cc+}_{i} because the positive derivative of the next state signal *u*_{i+1} correlates with the positive value of signal *u _{i}* of state *s _{i}*. The same sequence exists when the next state transition occurs, yielding contributions for the Δ_{i+1}-values. During the first trial (where all weights are zero), only the cross-correlation Δ^{cc}_{i+1} yields a contribution due to the finding of the reward.

The total weight change of ρ_{i} is given by the sum of the four Δ_{i}-values (equation 2.11). Using equations 2.4 and 2.8, we can reformulate equation 2.11. Substituting κ = −(κ^{−} + κ^{+}) and γ^{±} = τ^{±}/κ, we get equation 2.13. The convergence on ρ_{i} = γρ_{i+1} is a property of these kinds of equations (see section A.2). Therefore, we can use equation A.15 in the appendix with μ = γ, ε_{1} = τ^{+}/κ = γ^{+}, and ε_{2} = τ^{−}/κ = γ^{−}, although only if the values, namely κ and τ^{±}, are strictly positive (discussed in section 2.3). This gives us γ (equation 2.14), and our weight development can be simplified to equation 2.15. At this point, we can make the transition from weights ρ_{i} (differential Hebbian learning) to state values *V*(*s _{i}*) (temporal difference learning). Additionally, we note that sequences terminate only at *i* + 1; thus, this index will capture the reward state *s _{R}* and its value *r*(*s*_{i+1}), while this is not the case for all other indices (see section 4 for a detailed discussion of rewards at nonterminal states). Consequently, this gives us an equation (2.16) almost identical to equation 1.1, with one small difference: in equation 2.16, the reward is scaled by γ. However, this has no influence, as numerical reward values are arbitrary. Thus, if learning follows this third-factor differential Hebbian rule, weights will converge to the optimal estimated TD values. This proves that under some conditions on the signal shape and the parameters *S*, *L*, *O*, and *T* (which influence whether κ>0 and τ^{±}>0), TD(0) and the proposed three-factor differential Hebbian learning are indeed asymptotically equivalent.
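This convergence can be checked numerically. The sketch below is ours; it assumes the simplified weight development takes the TD-like form Δρ_{i} = ακ(γ*r* + γρ_{i+1} − ρ_{i}) with a reward only at the sequence end (our reading of the result above; parameter values are illustrative). After repeated trials, the weights settle into the relation ρ_{i} = γρ_{i+1}.

```python
# Weight development under the simplified TD-like update (our reading):
#   rho_i <- rho_i + alpha*kappa * (gamma*r_{i+1} + gamma*rho_{i+1} - rho_i),
# on a chain where only the final transition yields a reward r.

def dh_weight_development(n=6, gamma=0.9, alpha_kappa=0.1, r=1.0, trials=4000):
    rho = [0.0] * n
    for _ in range(trials):
        for i in range(n):
            nxt = rho[i + 1] if i + 1 < n else 0.0   # beyond the last state: reward only
            rew = r if i == n - 1 else 0.0           # reward found at the sequence end
            rho[i] += alpha_kappa * (gamma * rew + gamma * nxt - rho[i])
    return rho

rho_conv = dh_weight_development()
```

Note that the reward enters scaled by γ, the small deviation from equation 1.1 discussed above; the resulting γ-discounted weight profile is the one observed in the simulation of section 3.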

### 2.3. Analysis of γ (as Well as κ and τ^{±}).

Here we will take a closer look at the signal shape and the parameters (*L*, *O*, *T*, and *S*), which influence the values of κ (see equation 2.5) and τ^{±} (see equations 2.9 and 2.10), and therefore γ (see equation 2.14). For guaranteed convergence, these values are constrained by two conditions (see section A.2): τ^{±} ⩾ 0 and κ>0 (where κ = 0 is allowed in case of τ^{±} = 0). A nonpositive value of κ would lead to divergent weights ρ, and negative values of τ^{±} to oscillating weight pairs (ρ_{i}, ρ_{i+1}). However, even if fulfilled, these conditions will not always lead to meaningful weight developments. In particular, τ^{±}-values of 0 leave all weights at their initial value, and discount factors represented by γ-values exceeding 1 are usually not considered in reinforcement learning (Sutton & Barto, 1998). Thus, it makes sense to introduce more rigorous conditions and demand that 0 < γ ⩽ 1 and κ>0.

Furthermore, as these conditions depend on the signal shape, the following theoretical considerations need to be guided by biophysics. Hence, we will discuss neuronally plausible signals that can arise at a synapse. This constrains *u* to functions that possess only one maximum and divide the signal into a rising and a falling phase.

One quite general possibility for the shape of the signal *u* is the function used in Figure 1, for which we investigate the area of convergence. We have three parameters to vary, as we do not have to consider the parameter *S* if we take this value to be large compared to |*T*|, *L*, or *O*. Figure 2 shows the γ-value in three different panels. In each panel, we varied the parameters *O* and *T* from −2*P* to +2*P*, where *P* is the time the signal *u* needs to reach its maximum. In each of the panels, we plot γ-values for a particular fraction *L*/*P*.

The gray shading displays in detail the γ-values for which the condition 0 < γ ⩽ 1 is fulfilled, whereas white represents those areas for which we obtain γ>1. The striped area indicates parameter configurations for which no overlap between two consecutive signals and the third factor exists (τ = 0), and in the patterned regions, κ is smaller than zero.

If the *L*-value is greater than *P* − *O* − *T*, the area of convergence does not depend on *L* anymore, as the third factor then reaches a plateau as well as covers the whole falling phase of the signal *u*. On the contrary, if the *L*-value reaches the rising phase of the consecutive state, the area of convergence decreases again (not shown).

For positive *O*-values, there exist γ-values that are independent of (negative) *T*-values. Hence, if states overlap (*T* < 0), the γ-value is invariant with respect to the degree of overlap. This is an important aspect, as value function approximation methods often use overlapping kernels to represent features. In a biological context, this corresponds to overlapping receptive fields providing the input to the system. We find that in these cases, γ remains unaffected by the degree of (receptive field) overlap, which in general is different for any two input units.

To extend these considerations to more general but smoother shapes, we can therefore Taylor expand both the rising and the falling phases to the second order. With these constraints, γ can be calculated analytically (see section A.3) and is then plotted against *O* and *T* in Figure 3 for nine input functions, shown in the lower right. On the upper left, the ratio between the duration of the third factor and *P* was set to one-third, in the upper right to two-thirds, and on the lower left to four-thirds. Analogous to the exponential function, the area of convergence increases with increasing *L*-values.

Comparing the results of Figure 3 with Figure 9 (in section A.3), both figures match each other exactly, as the same derivations were used for both. However, even if more general filter functions are used, like the exponential one underlying Figure 2, both figures still match quite well, especially in the regions where the system diverges and where it stays constant.

In summary, Figures 2, 3, and 9 show clearly that the area of convergence changes only gradually and the area as such is increasing with increasing duration of the third factor. Altogether, it shows that for a general neuronally plausible signal shape *u*, the condition for asymptotic equivalence between temporal difference learning and differential Hebbian learning with a third factor is fulfilled for a wide parameter range and thus all realistic relative timing intervals between state activations and third factor.

## 3. Simulation of a Small Network

In this section, we show that we can reproduce the behavior of TD-learning in a small, linear network of neurons designed according to our algorithm. The weights obtained by the differential Hebbian learning neuron represent the corresponding TD-values (see Figure 4A, inset). It is known that in a linear TD-learning system, the values at the end of learning will follow an exponential function with a decay rate given by the discount factor γ. This is shown in Figure 4A. In Figure 4B, we also investigate the assumption of a quasi-static process.

Details of this simulation are as follows. The network consists of *N* states *s*, which are connected to a neuron *v*, which uses differential Hebbian learning. The modulatory signal is added by an additional neuron *M*. The network is shown in the inset of Figure 4A. The states are indexed such that the state closest to the reward has index 1 (hence, the reward has the index 0). At the beginning of learning, all weights are initialized to 0 except the weight connected with the reward. Each trial begins at state *N* and proceeds toward the reward, at which the trial terminates; thus, each state becomes active once per trial.

The weights of the states connected to the differential Hebbian learning neuron are shown in Figure 4A for three different γ-values after learning. States indexed with higher numbers (hence, farther away from the reward) have smaller weights, and the relation ρ_{i+1} = γρ_{i} where *i* indicates the distance to the reward holds for each γ-value. This is indicated by an exponential fit. It also should be noted that the weights at states far away from the reward deviate from the exponential fit, but only for the highest γ-value. This is an effect caused by the finite number of states and at the same time by a γ^{+}-value that is higher than 1 (see section 4.1 for details).

In these systems, learning rates are usually in the range of 10^{−5} to 10^{−2} (Porr & Wörgötter, 2003, 2007). The question arises whether in this range the assumption of a quasi-static process will hold. If it holds, we would expect that the weight closest to the reward (ρ_{1}) will exactly reach the value of γ after learning. In Figure 4B, the deviation from this expectation, given by ρ_{1} − γ, is plotted against the learning rate. As indicated by equation 2.5, the deviation increases with increasing learning rates but remains small up to a rate of 10^{−1}, which is well within the range of useful learning rates. The actual shape of the curves is a consequence of several interacting processes, depending, for example, on the total number of states (see section 4.1).

Looking at higher γ-values, it is apparent that the effect of a finite number of states behaves antagonistically to the deviation caused by the increased learning rate: the weight after learning is shifted to higher values (independently of the learning rate). Therefore, if higher γ-values (or a smaller number of states) are used, the simulated weight ρ_{1} may coincide with the calculated γ-value even if finite learning rates larger than 0 are used. However, this does not correct the simulated γ-value as such, since the weights are then no longer arranged exponentially (see Figure 4A).

## 4. Discussion

### 4.1. Technical Discussion.

In the following, we examine several technical constraints arising from the assumptions made in our derivation. Most realistic learning examples are not affected by these constraints; thus, readers who are not interested in these special aspects may choose to skip this section.

#### 4.1.1. Quasi-Static Process, α ≪ 1.

The assumption of a quasi-static process entered the derivation at several points. First, we neglected the derivatives of ρ on the right-hand side of equation 2.1 (see section 2.1). Second, we truncated the expansion of equation 2.3 after the first order, which is allowed only for α ≪ 1. Would the necessary condition κ>0 be affected if we had not truncated the expansion? Considering ν as defined in equation 4.1 and equation 2.5, we observe that a positive value of κ corresponds to a negative value of ν and will lead to a negative weight change Δ^{ac}_{i}. Hence, given a negative value of ν (a necessary condition for κ>0 if taking only the first-order terms of the expansion into account), this leads directly to a negative weight change Δ^{ac}_{i} in equation 2.3. This is due to the properties of the exponential function (*e ^{x}* − 1 < 0 ∀ *x* < 0) or of the inverse square root function, and, as a consequence, constraining to α ≪ 1 is allowed here as well.

Third, we neglected, due to α ≪ 1, the variability of the homogeneous solution (ρ^{ac}; see equation 2.2) in order to calculate the inhomogeneous solution (ρ^{cc}; see equation 2.6) of the weight ρ. However, taking the variability into consideration will not affect the linearity with respect to ρ. This is because equation 2.8 can be directly split into τ^{±} and ρ, and the additional homogeneous solution will change only the integral (see equations 2.10 and 2.9), which leads to τ^{±} and, as the solution does not rely on ρ, will not complicate the linear decomposition.

#### 4.1.2. Reward Only at the End of a Sequence.

In most physiological experiments (Schultz, Apicella, Scarnati, & Ljungberg, 1992; Montague, Dayan, & Sejnowski, 1996; Morris, Nevet, Arkadir, Vaadia, & Bergman, 2006), the reward is given at the end of the stimulus sequence. Our assumption that the reward state is a terminating state and therefore occurs only at the end of the learning sequence conforms to this paradigm. However, for TD in general, we cannot assume that the reward is provided only at the end. Differential Hebbian learning will then lead to a slightly different solution compared to TD-learning. This solution has already been discussed in another context (Dayan, 2002). Specifically, the difference in our case is the final result for the state value after convergence for states that provide a reward. We get *V*(*s _{i}*) → γ*V*(*s*_{i+1}) + *r*(*s*_{i+1}) − *r*(*s _{i}*), compared to TD-learning: *V*(*s _{i}*) → γ*V*(*s*_{i+1}) + *r*(*s*_{i+1}). It would be interesting to assess with physiological and/or behavioral experiments which of the two equations more closely represents experimental reality. To do so, one has to guarantee that the reward given at the end is worth the costs the animal incurred until reaching it (Hassani, Cromwell, & Schultz, 2001).
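To make the difference concrete, the relaxation below (ours; it iterates toward the stated fixed points rather than implementing the full learning rule) compares the converged values when a reward occurs midway: the differential Hebbian solution additionally subtracts the current state's own reward *r*(*s _{i}*), the TD solution does not.

```python
def relax(r, gamma=0.9, alpha=0.1, trials=4000, subtract_own_reward=True):
    """Relax V toward V(s_i) = gamma*V(s_{i+1}) + r(s_{i+1}) [- r(s_i)].

    With subtract_own_reward=True this is the differential Hebbian fixed point
    stated above; with False it is the ordinary TD fixed point.
    """
    n = len(r)
    V = [0.0] * n                          # value of the last (terminal) state stays 0
    for _ in range(trials):
        for i in range(n - 1):
            target = gamma * V[i + 1] + r[i + 1] - (r[i] if subtract_own_reward else 0.0)
            V[i] += alpha * (target - V[i])
    return V

rewards = [0.0, 0.5, 0.0, 1.0]            # a midway reward and a final one
V_dh = relax(rewards, subtract_own_reward=True)
V_td = relax(rewards, subtract_own_reward=False)
```

Here only *s*_{1} carries its own reward; its converged value is reduced by *r*(*s*_{1}) relative to the TD solution, and this difference propagates to earlier states, while states whose downstream chain is identical agree.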

#### 4.1.3. Finite Number of States.

If we consider just a finite number of states without periodic boundary conditions and assume that the same state neuron *s*_{0} always starts the whole sequence, the corresponding weight will not converge to ρ_{0} = γ ρ_{1} but to ρ_{0} = γ^{+} ρ_{1} due to the missing ρ_{−1} weight (see equations 2.15 and A.11). However, in this case, the gradual order of the weight values is disturbed only if the γ^{+}-value is larger than 1.

#### 4.1.4. Stochastically Uncertain Environments.

It is known that in stochastically uncertain environments, the TD-values converge with probability one only when the learning rate decreases (Kushner & Clark, 1978; Dayan & Sejnowski, 1994). In our implementation, the signal *M*(*t*) is constant. If it were instead implemented to diminish during repeated encounters with the same state, it would immediately incorporate the property of decreasing learning rates too.

#### 4.1.5. Non-Markov.

In the algorithm presented, each TD-value arises from the interaction of three states. Hence, for a considered state, it is of relevance from where it had been reached. Thus, strictly speaking, the algorithm is history dependent and violates the Markov property required for TD-learning (Sutton & Barto, 1998). This has no deeper consequence, as the standard trick for non-Markovian systems, which often are also encountered in conventional RL problems, can be applied here in the same way. It is possible, without much effort, to design a network where in a higher network layer, states are being concatenated into new larger states, which now obey the Markov property. In many practical applications, this is not even required as the value-gradient field will build up toward the reward regardless of the non-Markovian algorithm presented here. An implementation of SARSA (Singh, Jaakkola, Littman, & Szepesvári, 2000) using our algorithm behaves in this way (data not shown).

### 4.2. General Discussion.

The TD-rule has become the most influential algorithm in reinforcement learning because of its tremendous simplicity and proven convergence to the optimal value function (Sutton & Barto, 1998). It has been successfully transferred to control problems as well, in the form of Q- or SARSA learning (Watkins & Dayan, 1992; Singh et al., 2000), which uses the same algorithmic structure while maintaining similar advantageous mathematical properties (Watkins & Dayan, 1992).

In this study, we have shown that TD(0)-learning and differential Hebbian learning modulated by a third factor are asymptotically equivalent under certain conditions. This proof relies on commonly applicable, fairly general assumptions, thus rendering a generic result that does not constrain the design of larger networks. It has long been suspected that RL in neuronal tissue would have to rely on the use of a third factor in a Hebb rule (Schultz, 1998), and several earlier results have pointed to the possibility of an equivalence between RL and CL. Izhikevich (2007) solved the distal reward problem using a spiking neural network, yet with fixed exponential functions (Gerstner, Kempter, van Hemmen, & Wagner, 1996) to emulate differential Hebbian characteristics. His approach is related to neurophysiological findings on spike-timing-dependent plasticity (STDP; Markram et al., 1997). Each synapse learned the correlation between conditioned stimuli and unconditioned stimuli (e.g., a reward) through STDP and a third signal. Furthermore, Roberts, Santiago, and Lafferriere (2008) showed that asymmetrical STDP and TD-learning are related. In our differential Hebbian learning model, in contrast to the work described above, STDP emerges automatically because of the use of the derivative of the postsynaptic potential (see equation 1.2). The relation between STDP and differential Hebbian learning and its asymptotic equivalence when using serial states was discussed in Roberts (1999). Rao and Sejnowski (2001) showed that using the temporal difference will directly lead to STDP, but they could not provide a rigorous proof for the equivalence. Recently it has been shown that the online policy-gradient RL algorithm (OLPOMDP) developed by Baxter, Bartlett, and Weaver (2001) can be emulated by spike-timing-dependent plasticity (Florian, 2007), albeit in a complex way using a global reward signal.
On the other hand, the observations reported here provide a rather simple, equivalent correlation-based implementation of TD and support the importance of three-factor learning for providing a link between conventional Hebbian approaches and reinforcement learning.

Additionally, our approach is intrinsically based on continuous states as well as on continuous time. Several attempts to shape RL algorithms, which usually have discrete states and time, into continuous systems exist but lack biological motivation. In particular, Baird (1993) extended Q-learning by the “advantage updating” method, and Doya (2000) performed the transformation from a discrete sum to a continuous integral for the calculation of the return *R*. In his case, every value function *V* consists of a state representation and a corresponding weight. These weights need to be adjusted in order to let the delta error converge to zero. This is done by a gradient descent algorithm, which results in an update rule that demands a weight derivative, which is difficult to emulate in a biologically realistic way.

Our results rely in a fundamental way on the third factor *M*, and the analysis performed in this study indicates that the third factor is necessary for the emulation of TD-learning by a differential Hebb rule. Explaining why requires a closer look at the TD-learning rule: the TD rule contains a leakage term −α · *V*(*s*); without this term, the values would diverge. It can be shown that in differential Hebbian learning without a third factor, the autocorrelation part, which is the source of the needed leakage (see equations 2.11 and 2.4), is nonexistent (Kolodziejski, Porr, & Wörgötter, 2008). Hence, only through a well-timed third factor is the ratio between the cross-correlation and the autocorrelation term correctly adjusted. This ratio is ultimately responsible for the γ-value obtained when differential Hebbian learning emulates TD-learning.
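The role of the leakage term can be checked numerically. The single self-looping state below is an assumed toy case, not taken from the paper; it contrasts the full TD(0) update with one whose −α · *V*(*s*) term has been removed:

```python
# TD(0) on a single state that loops back onto itself with reward
# r = 1 (an assumed toy case).  The standard update contains the
# leakage term -alpha * V(s); removing it makes the value diverge.
alpha, gamma, r = 0.1, 0.9, 1.0

V_with, V_without = 0.0, 0.0
for _ in range(2000):
    V_with += alpha * (r + gamma * V_with - V_with)  # full TD(0) rule
    V_without += alpha * (r + gamma * V_without)     # leakage removed

print(round(V_with, 2))  # 10.0, i.e. r / (1 - gamma)
print(V_without > 1e6)   # True: the value has diverged
```

With the leakage term, the value settles at the fixed point r/(1 − γ); without it, each update only adds to *V*, and the iteration grows without bound.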

This study is mainly concerned with showing the formal equivalence between TD and differential Hebbian learning; possible links to biophysical mechanisms play a minor role here. Nonetheless, one could consider neuromodulators for the role of the third factor *M*. The required reliability of timing, however, makes it unlikely that dopamine could take on this role, as the timing of these signals does not seem reliable enough (Redgrave & Gurney, 2006), although Pawlak and Kerr (2008) showed that LTP in the striatum emerges only in the presence of dopamine. The attributed, albeit still much discussed, role of dopaminergic responses from neurons in the substantia nigra (pars compacta) or the ventral tegmental area (VTA) as possibly representing the Δ-error of TD-learning (Schultz et al., 1992; Montague et al., 1996) is thus neither questioned nor supported by our study. A very good alternative for the role of a well-timed third factor, however, seems to be the response characteristic of the cholinergic tonically active neurons (TANs). Their response, a reduction of activity, is exceedingly well timed and occurs together with conditioned stimuli (Graybiel, 1998; Morris, Arkadir, Nevet, Vaadia, & Bergman, 2004). The fact that TANs cease to fire would require an additional inversion to make this compatible with our *M* factor, but when considering possible disinhibitory effects, this should not pose a fundamental problem.

Thus, it is also important that we were able to show that our algorithm remains stable across a wide range of possible biological signals, since different temporal profiles exist, for example, for synapse and channel activation (compare AMPA versus NMDA characteristics). This is required because it is not yet clear which signals are involved in any three-factor learning, and this may also depend on the cell type and brain structure considered.

## Appendix: Derivations and Analytical Calculations

### A.1. Solving the Homogeneous Part.

We first separate the variables and integrate both sides from zero to infinity. As mentioned in the beginning, *M*(*t*) is either 1 or 0; thus, we can model this function as a sum of Heaviside functions Θ(*t*). As the Heaviside functions determine the integration boundaries of the right-hand side of equation A.2, that side simplifies accordingly. The left side of equation A.2 solves to a logarithmic function, and for its right side we use the corresponding derivative. All in all, inverting the logarithmic function yields the weight ρ after integrating over a pulse pair, given that the weight was ρ_{0} before.
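The structure of this derivation, in which the Heaviside-gated factor *M* merely sets the integration boundaries, can be checked numerically. The integrand g(t) below is a placeholder assumption, since the concrete functions of equation A.2 are not reproduced here:

```python
import math

def g(t):
    return math.exp(-t)            # placeholder integrand (assumed)

t0, t1 = 0.5, 1.5                  # M(t) = 1 only on [t0, t1]
rho0, dt = 1.0, 1e-4

# Euler integration of the gated equation rho' = M(t) * g(t) * rho ...
rho, t = rho0, 0.0
while t < 3.0:
    M = 1.0 if t0 <= t <= t1 else 0.0
    rho += M * g(t) * rho * dt
    t += dt

# ... matches rho0 * exp(integral of g from t0 to t1), the result of
# separating variables and inverting the logarithm.
closed_form = rho0 * math.exp(math.exp(-t0) - math.exp(-t1))
print(abs(rho - closed_form) < 1e-3)  # True
```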

### A.2. Solving the Difference Equation.

The homogeneous part has an exponential solution in *t*, and the inhomogeneous solution can be found by the method of variation of parameters. This yields the convergence behavior, where *C* is a constant, and shows that difference equations like equation A.8 or differential equations like equation A.9 always converge to *y*. Additionally, Kushner and Clark (1978) showed that this also holds for a stochastic variable with mean *y*. We thus have *x*_{∞} = *y*, and the following difference equation needs to be solved. We obtain the solution using the ansatz ρ_{n} = μ^{n}. Assuming μ ≠ 0, we can divide equation A.13 by μ^{n−1} and solve the resulting quadratic equation. A negative solution, however, would lead to an oscillation; therefore, we consider only the positive sign, which leads to a positive solution that, raised to the power *n*, gives the value to which ρ_{n} converges. We can therefore set ρ_{n} = μ^{−1}ρ_{n+1}, which simplifies equation A.12.
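The convergence claim, including the stochastic case of Kushner and Clark (1978), can be illustrated numerically. The target value, noise level, and step-size schedule below are assumed for illustration:

```python
import random

# The deterministic iteration x_{n+1} = x_n + alpha_n * (y - x_n)
# converges to y; following Kushner and Clark (1978), the same holds
# when y is observed only through noisy samples with mean y, given a
# suitably decaying step size (here alpha_n = 1/n, an assumed choice).
random.seed(0)
y, x = 2.5, 0.0
for n in range(1, 200001):
    sample = y + random.gauss(0.0, 1.0)  # stochastic input, mean y
    x += (1.0 / n) * (sample - x)

print(abs(x - y) < 0.05)  # True: x has converged to the mean y
```

With the 1/n schedule, the iterate is exactly the running average of the samples, so convergence to the mean follows from the law of large numbers.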

### A.3. Analytical Calculation of κ, τ^{±}, and γ Using First- and Second-Order Terms.

#### A.3.1. Taylor Expansion of the Filter Function.

In equation A.16, *U* is the height of the plateau, *P*_{r} and *P*_{f} the lengths of the rising and the falling phase, respectively, *S* the length of a state, and η and ξ the degrees of the second-order term for the rising and the falling phase, respectively. Here, η = 1 or ξ = 1 leads to a linear slope, η = 0 or ξ = 2 to a convex slope, and η = 2 or ξ = 0 to a concave slope.
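Since equation A.16 itself is not reproduced above, the following is one assumed parametrization consistent with this description, with second-order rising and falling phases controlled by η and ξ:

```python
def filter_u(t, U=1.0, P_r=1.0, P_f=1.0, S=3.0, eta=1.0, xi=1.0):
    """One assumed parametrization of a plateau filter matching the
    description of equation A.16: rising phase of length P_r shaped by
    eta, plateau of height U, falling phase of length P_f (starting at
    t = S) shaped by xi."""
    if t < 0.0:
        return 0.0
    if t < P_r:                           # second-order rising phase
        x = t / P_r
        return U * (eta * x + (1.0 - eta) * x * x)
    if t < S:                             # plateau
        return U
    if t < S + P_f:                       # second-order falling phase
        s = (t - S) / P_f
        return U * (1.0 - xi * s - (1.0 - xi) * s * s)
    return 0.0

# eta = 1 (or xi = 1) gives a linear slope, eta = 0 a convex one,
# and eta = 2 a concave one, as stated in the text.
print(filter_u(0.5, eta=1.0))  # 0.5
print(filter_u(0.5, eta=0.0))  # 0.25
print(filter_u(0.5, eta=2.0))  # 0.75
```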

#### A.3.2. Intervals.

Having defined the actual shape of the filter function, we now have to distinguish between different occurrence times of the third factor. Figure 5 shows the six essential regions of both the rising and the falling phase. For this, we defined intervals I to VI for the rising phase and A to F for the falling phase (see Table 1).

Interval | Condition | Interval | Condition |
---|---|---|---|
I | O < 0 ∩ O + L < 0 | A | O + T < S ∩ O + L + T < S |
II | O < 0 ∩ O + L ⩾ 0 ∩ O + L < P_{r} | B | O + T < S ∩ O + L + T ⩾ S ∩ O + L + T < P_{f} |
III | O < 0 ∩ O + L ⩾ P_{r} | C | O + T < S ∩ O + L + T ⩾ P_{f} |
IV | O ⩾ 0 ∩ O < P_{r} ∩ O + L < P_{r} | D | O + T ⩾ S ∩ O + T < P_{f} ∩ O + L + T < P_{f} |
V | O ⩾ 0 ∩ O < P_{r} ∩ O + L ⩾ P_{r} | E | O + T ⩾ S ∩ O + T < P_{f} ∩ O + L + T ⩾ P_{f} |
VI | O ⩾ P_{r} ∩ O + L ⩾ P_{r} | F | O + T ⩾ P_{f} ∩ O + L + T ⩾ P_{f} |


Note: See Figure 5 for a more intuitive representation.
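The interval conditions of Table 1 translate directly into predicates. The sketch below implements the rising-phase classification (intervals I to VI); the parameter values in the example calls are arbitrary:

```python
def rising_interval(O, L, P_r):
    """Classify the placement of the third factor (offset O, length L)
    relative to the rising phase [0, P_r], following the conditions
    listed for intervals I to VI in Table 1."""
    if O < 0 and O + L < 0:
        return "I"
    if O < 0 and 0 <= O + L < P_r:
        return "II"
    if O < 0 and O + L >= P_r:
        return "III"
    if 0 <= O < P_r and O + L < P_r:
        return "IV"
    if 0 <= O < P_r and O + L >= P_r:
        return "V"
    if O >= P_r and O + L >= P_r:
        return "VI"
    return None

# Arbitrary example placements (P_r = 1):
print(rising_interval(-0.5, 0.3, 1.0))  # I  : gate ends before the rise
print(rising_interval(-0.2, 0.5, 1.0))  # II : gate straddles the onset
print(rising_interval(0.2, 0.5, 1.0))   # IV : gate inside the rise
print(rising_interval(1.2, 0.5, 1.0))   # VI : gate after the rise
```

The falling-phase intervals A to F follow the same pattern with the boundaries S and P_{f}.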

Additionally, we need to define another four intervals used for the correlation of the signals when calculating τ^{±} (see Table 2).

#### A.3.3. Analytical Calculation of κ.

For these calculations, we assume *O* + *L* < *S* and *O* + *S* + *T* > *P*_{r}, which holds if *S* is sufficiently larger than *O*, *L*, and |*T*|. These assumptions prevent cases where the third factor would affect the signal after the next signal, which is nonsensical. To simplify the cases, we split κ into κ^{+} and κ^{−}, as done in the main text: κ = −(κ^{+} + κ^{−}). In Table 3 the analytical results for κ^{±} are stated using equations A.17 and A.18. These equations result from equation 2.5, where we included the filter function, equation A.16. In detail, equation A.17 represents the squared rising phase, and equation A.18 the squared falling phase. It is important to mention that both functions Φ(*t*) and Ψ(*t*) are bounded between 0 and 1.

The results can be summarized by plotting the definitively divergent areas, where we simplify the rising and falling times to identical values: *P* = *P*_{r} = *P*_{f}. These areas are composed of intervals in which the sum of κ^{+} and κ^{−} is always positive. For instance, for interval A conjoined with any of the intervals I to VI, the sum of κ^{+} and κ^{−} is always greater than zero and thus divergent. The same holds for interval F. Both areas are indicated in Figure 6 with different shades of gray. There is an additional divergent area III that exists only if the value of *L* is greater than *P*.

Interval | κ^{+} | Interval | κ^{−} |
---|---|---|---|
I | 0 | A | 0 |
II | Φ(O + L) | B | Ψ(O + S + T + L) − 1 |
III | 1 | C | −1 |
IV | Φ(O + L) − Φ(O) | D | Ψ(O + S + T + L) − Ψ(O + S + T) |
V | 1 − Φ(O) | E | −Ψ(O + S + T) |
VI | 0 | F | 0 |


There would also be an interval C (not shown) for which the sum of κ^{+} and κ^{−} is always less than zero if conjoined with any of the intervals I to VI except III. However, this case shows up only if *L* is greater than *P*, but then interval III becomes valid. This can be resolved by using different values for *P*_{r} and *P*_{f} with *P*_{f} < *P*_{r}.

Only in the white area can there be convergence; however, the shape of the actual filter function *u*(*t*) determines whether a certain area converges. Additionally, it is important to include the values for τ^{±} in these considerations, as these values can be 0 in some areas, which means that there is no overlap between two consecutive signals and the third factor. This is investigated next.

#### A.3.4. Analytical Calculation of τ^{+} and τ^{−}.

We now calculate τ^{±} using the filter function, equation A.16. We likewise assume that there is no overlap of the signal with the signal after its next state. In Table 4, the analytical results for τ^{±} are stated using equations A.19 to A.22. These equations result from equations 2.9 and 2.10, where we included the filter function, equation A.16. It is important to mention that the functions ζ(*t*), ψ(*t*), χ(*t*), and φ(*t*) are always greater than 0.

Interval | τ^{+} | τ^{−} | γ |
---|---|---|---|
I | 0 | 0 | 0 |
II ∩ A | ζ(O + L) − ζ(0) | 0 | ∞ |
II ∩ B ∩ a | ψ(O + L) − ψ(0) | φ(O + L) − φ(0) | ⩾0 |
II ∩ B ∩ ā | ζ(−T) − ζ(0) + ψ(O + L) − ψ(−T) | φ(O + L) − φ(−T) | ⩾0 |
II ∩ C ∩ a ∩ b | 0 | 0 | 0 |
II ∩ C ∩ a ∩ b̄ | ψ(−T + P_{f}) − ψ(0) | φ(−T + P_{f}) − φ(0) | ⩾0 |
II ∩ C ∩ ā | ζ(−T) − ζ(0) + ψ(−T + P_{f}) − ψ(−T) | φ(−T + P_{f}) − φ(−T) | ⩾0 |
II ∩ D | ψ(O + L) − ψ(O) | φ(O + L) − φ(O) | ⩾0 |
II ∩ E ∩ b | 0 | 0 | 0 |
II ∩ E ∩ b̄ | ψ(−T + P_{f}) − ψ(0) | φ(−T + P_{f}) − φ(0) | ⩾0 |
II ∩ F | 0 | 0 | 0 |
III ∩ A | ζ(P_{r}) − ζ(0) | 0 | ∞ |
III ∩ B ∩ a | ψ(P_{r}) − ψ(0) | φ(P_{r}) − φ(0) + χ(O + L) − χ(P_{r}) | ∞ |
III ∩ B ∩ ∩ c | ζ(−T) − ζ(0) + ψ(P_{r}) − ψ(−T) | φ(P_{r}) − φ(−T) + χ(O + L) − χ(P_{r}) | ∞ |
III ∩ B ∩ | ζ(P_{r}) − ζ(0) | χ(O + L) − χ(P_{r}) | ∞ |
III ∩ C ∩ b | 0 | 0 | 0 |
III ∩ C ∩ a ∩ ∩ d | ψ(−T + P_{f}) − ψ(0) | φ(−T + P_{f}) − φ(0) | ∞ |
III ∩ C ∩ a ∩ | ψ(P_{r}) − ψ(0) | φ(P_{r}) − φ(0) + χ(−T + P_{f}) − χ(P_{r}) | ∞ |
III ∩ C ∩ ∩ d | ζ(−T) − ζ(0) + ψ(−T + P_{f}) − ψ(−T) | φ(−T + P_{f}) − φ(−T) | ∞ |
III ∩ C ∩ ∩ | ζ(−T) − ζ(0) + ψ(P_{r}) − ψ(−T) | φ(P_{r}) − φ(−T) + χ(−T + P_{f}) − χ(P_{r}) | ∞ |
III ∩ C ∩ | ζ(P_{r}) − ζ(0) | χ(−T + P_{f}) − χ(−T) | ∞ |
III ∩ D | ψ(P_{r}) − ψ(0) | φ(P_{r}) − φ(0) + χ(O + L) − χ(P_{r}) | ∞ |
III ∩ E ∩ b | 0 | 0 | 0 |
III ∩ E ∩ ∩ d | ψ(−T + P_{f}) − ψ(0) | φ(−T + P_{f}) − φ(0) | ∞ |
III ∩ E ∩ | ψ(P_{r}) − ψ(0) | φ(P_{r}) − φ(0) + χ(−T + P_{f}) − χ(P_{r}) | ∞ |
III ∩ F | 0 | 0 | 0 |
IV ∩ A | ζ(O + L) − ζ(O) | 0 | ∞ |
IV ∩ B | ζ(−T) − ζ(O) + ψ(O + L) − ψ(−T) | φ(O + L) − φ(−T) | ⩾0 |
IV ∩ C | ζ(−T) − ζ(O) + ψ(−T + P_{f}) − ψ(−T) | φ(−T + P_{f}) − φ(−T) | ⩾0 |
IV ∩ D | ψ(O + L) − ψ(O) | φ(O + L) − φ(O) | ⩾0 |
IV ∩ E | ψ(−T + P_{f}) − ψ(O) | φ(−T + P_{f}) − φ(O) | ⩾0 |
IV ∩ F | 0 | 0 | 0 |
V ∩ A | ζ(P_{r}) − ζ(O) | 0 | ∞ |
V ∩ B ∩ c | ζ(−T) − ζ(O) + ψ(P_{r}) − ψ(−T) | φ(P_{r}) − φ(−T) + χ(O + L) − χ(P_{r}) | ⩾0 |
V ∩ B ∩ c̄ | ζ(P_{r}) − ζ(O) | χ(O + L) − χ(−T) | ⩾0 |
V ∩ C ∩ d | ζ(−T) − ζ(O) + ψ(−T + P_{r}) − ψ(−T) | φ(−T + P_{f}) − φ(−T) | ⩾0 |
V ∩ C ∩ c ∩ | ζ(−T) − ζ(O) + ψ(P_{r}) − ψ(−T) | φ(P_{r}) − φ(−T) + χ(−T + P_{f}) − χ(P_{r}) | ⩾0 |
V ∩ C ∩ | 0 | χ(−T + P_{f}) − χ(−T) | 0 |
V ∩ D | ψ(P_{r}) − ψ(O) | φ(P_{r}) − φ(O) + χ(O + L) − χ(P_{r}) | ⩾0 |
V ∩ E ∩ d | ψ(−T + P_{f}) − ψ(O) | φ(−T + P_{f}) − φ(O) | ⩾0 |
V ∩ E ∩ d̄ | ψ(P_{r}) − ψ(O) | φ(P_{r}) − φ(O) | ⩾0 |
V ∩ F | 0 | 0 | 0 |
VI ∩ A | 0 | 0 | 0 |
VI ∩ B | 0 | χ(O + L) − χ(−T) | 0 |
VI ∩ C | 0 | χ(−T + P_{f}) − χ(−T) | 0 |
VI ∩ D | 0 | χ(O + L) − χ(O) | 0 |
VI ∩ E | 0 | χ(−T + P_{f}) − χ(O) | 0 |
VI ∩ F | 0 | 0 | 0 |


For these calculations, we also simplify the rising and falling times to identical values: *P* = *P*_{r} = *P*_{f}. The results can be summarized by plotting the areas for which τ^{±} are zero (see Figures 7 and 8). For instance, similar to κ, interval F conjoined with intervals I to VI results in both τ^{+} and τ^{−} being equal to zero. In the case of τ^{−}, this observation can also be made for interval A, and in the case of τ^{+}, interval VI gives zero. There are two additional intervals, I and b, which result in a value of zero for both τ measures. These areas are indicated with different shades of gray in Figure 7 for τ^{+} and in Figure 8 for τ^{−}.

As τ^{±} do not affect convergence, all areas would yield convergence. However, only in the white area does γ take a value unequal to zero.

#### A.3.5. Analytical Calculation of γ.

Finally we can calculate the value of γ using equation 2.14. This is not done explicitly here; however, we indicate in the last column of Table 4 whether γ is zero (no overlap between two consecutive signals and the third factor), greater than zero, or infinite.

Furthermore, we can combine the considerations made for κ and τ^{±}, which are illustrated in Figures 6 to 8, into Figure 9. If we compare Figure 9 with Figures 2 and 3, we find that the areas of divergence map exactly onto each other, and convergence can be found only in the white areas. Therefore, the considerations about convergence and nonzero τ^{±} values can be transferred from the filter function, equation A.16, to all possible functions that possess only one plateau.

## Acknowledgments

F.W. acknowledges the support of the European Commission, IP-Project PACO-PLUS. Helpful discussions with Minija Tamosiunaite are gratefully acknowledged.