## Abstract

This study discusses the negative impact of the derivative of the activation functions in the output layer of artificial neural networks, in particular in continual learning. We propose Hebbian descent as a theoretical framework to overcome this limitation, which is implemented through an alternative loss function for gradient descent we refer to as Hebbian descent loss. This loss is effectively the generalized log-likelihood loss and corresponds to an alternative weight update rule for the output layer wherein the derivative of the activation function is disregarded. We show how this update avoids vanishing error signals during backpropagation in saturated regions of the activation functions, which is particularly helpful in training shallow neural networks and deep neural networks where saturating activation functions are only used in the output layer. In combination with centering, Hebbian descent leads to better continual learning capabilities. It provides a unifying perspective on Hebbian learning, gradient descent, and generalized linear models, for all of which we discuss the advantages and disadvantages. Given activation functions with strictly positive derivative (as often the case in practice), Hebbian descent inherits the convergence properties of regular gradient descent. While established pairings of loss and output layer activation function (e.g., mean squared error with linear or cross-entropy with sigmoid/softmax) are subsumed by Hebbian descent, we provide general insights for designing arbitrary loss activation function combinations that benefit from Hebbian descent. For shallow networks, we show that Hebbian descent outperforms Hebbian learning, has a performance similar to regular gradient descent, and has a much better performance than all other tested update rules in continual learning. 
In combination with centering, Hebbian descent implements a forgetting mechanism that prevents catastrophic interference notably better than the other tested update rules. When training deep neural networks, our experimental results suggest that Hebbian descent performs better than or similarly to gradient descent.

## 1 Introduction

Gradient descent is the commonly used optimization algorithm in machine learning, particularly in artificial neural networks and deep learning. Despite its theoretical foundation and the possibility of proving its convergence analytically even for stochastic gradient descent (Robbins & Siegmund, 1985; Saad, 1998), it can still have slow convergence or get stuck in a suboptimal solution, especially when saturating activation functions like sigmoids are used. One reason for slow convergence in neural networks trained with gradient descent is the vanishing gradient problem originally described by Hochreiter (1991) for recurrent neural networks. However, the underlying principle of vanishing gradients is also relevant for nonrecurrent deep neural networks and even shallow networks, as it is characterized by an unwanted downscaling of the error signal during the backpropagation stage by small values of the derivative of saturating activation functions. In that situation, the resulting parameter updates would be small in magnitude even if the error signal was still large.

While it is apparent that this effect becomes more severe with increasing network depth, a variety of related work and our experiments demonstrate that it matters even for shallow networks. In the context of shallow networks, the potentially negative effect of saturating activation functions has been discussed from various angles. From the perspective of backpropagation, the famous perceptron algorithm (Rosenblatt, 1958), which uses a step function as a nonlinearity, omits the derivative of that function during the parameter update, as the gradient would otherwise be zero. While a good overview of perceptron learning algorithms and their variants using the step activation function is presented by Gallant (1990), the systematic connection between the perceptron learning rule and performing backpropagation without the derivative of the output activation function was not discussed. Biehl and Schwarze (1995) investigated the online learning behavior of single-layer networks with continuous valued outputs and noted a formal similarity between the Hebb rule and the backpropagation weight update. While this similarity is not further discussed in their work, we refer to it more formally and in more detail in section 3.1. When using sigmoid output activations, Hinton (1989) proposed to use the cross-entropy as a loss function instead of the squared error, in which case the derivative of the sigmoid output units cancels out in the gradient, which explains why the cross-entropy loss is the preferred loss when training deep networks with sigmoid (and also softmax) outputs today.

In multilayer neural networks, the potentially negative effect of saturating activation functions during backpropagation can accumulate and is then referred to as vanishing gradients. To address the vanishing gradient problem in deep neural networks with sigmoid activation functions, Fahlman (1988) added small values to the gradients if they came from the saturated regime of the activation function and found this to be effective in some applications. Chen (1990) proposed the heuristic of simply ignoring the derivative of the output-layer activation function during backpropagation, and Ng et al. (1999) and Ooyen and Nienhuis (1992) presented variations of the activation function to mitigate the negative effect of small derivatives during backpropagation. Lee et al. (1993) noted that slow training convergence can be caused by prematurely saturating sigmoids and found a parameter initialization scheme that alleviates the problem. Vitela and Reifman (1993) and Ng et al. (2003) proposed a modification of the derivative during the backward pass of deep networks with sigmoid activation functions that magnified small gradients to speed up learning. Motivated by technical benefits, Hertz et al. (1997) proposed an algorithm that uses an approximation to get rid of the derivative of the activation function during the backward pass. They also mentioned that for the hyperbolic tangent as an output activation function, a loss similar to the cross entropy exists that cancels out the derivative of the activation function. Yu et al. (2002) replaced the sigmoid with a hyperbolic tangent activation function with learnable slope that produces larger gradients than the former close to the limit output values of 0 and 1. Following a different paradigm, activation functions with a constant derivative of one in the positive domain have been proposed.
Examples of such functions include the rectifier (Fukushima & Miyake, 1982; Hahnloser et al., 2000; Hahnloser & Seung, 2001) or exponential linear unit (Clevert et al., 2015). However, due to their distinctively different mapping behavior, using nonsaturating activation functions, particularly in the output layer, might not be possible depending on the application. A good overview of contributions to improve convergence speed and performance of deep neural networks can be found in Vora and Yagnik (2013) and Kumar et al. (2015).

Although we do not discuss recurrent neural networks (RNNs) in this letter, they play an important role in the history of the vanishing (and exploding) gradient problem and should be mentioned here. In RNNs, vanishing gradients were a major problem until the proposal of long short-term memory cells by Hochreiter and Schmidhuber (1997). During backpropagation, these cells allow for a linear error flow back in time, which helps to overcome the problem. As an alternative, Pascanu et al. (2013) proposed gradient clipping to alleviate exploding gradients and a special regularization term on the error signal to alleviate vanishing gradients in recurrent neural networks. Mikolov et al. (2014) approached the problem of vanishing gradients by encouraging a specific structure of the recurrent weight matrix. An overview of milestone contributions to the field of recurrent neural networks, including measures against the vanishing gradient problem, is given in Salehinejad et al. (2018). However, those techniques may impose constraints on network parameters, require the use of a restricted set of activation functions, or build on the recurrent nature of the network. Thus, while drawing inspiration from the various approaches listed above, better strategies for nonrecurrent neural networks may be found.

Another intensively explored approach for addressing the vanishing gradient problem is to take previous parameter updates into account, for example, using a momentum term like in Rprop (Riedmiller & Braun, 1993), Adagrad (Duchi et al., 2011), or Adam (Kingma & Ba, 2014). However, while having similar effects on the vanishing gradient problem, those approaches work quite differently from Hebbian descent. They may even be combined with it (see section 5.2), which is why we consider them as separate techniques outside the scope of this work. They also come with their own drawbacks: for example, including the previous parameter updates in continual learning can be counterproductive since it may slow the network’s reaction to new inputs.

The continual learning setting is particularly challenging for artificial neural networks. Here, a steady stream of data is presented sample by sample or minibatch by minibatch to the network without repetition of samples. A good continual learning algorithm has to balance two partially competing goals. While it should learn efficiently from new data, it should at the same time not forget relevant information about older data. Because of limited memory capacity (i.e., trainable parameters of the learner), eliminating forgetting altogether is hard to achieve in practice. It might even be undesirable to do so as forgetting can implicitly serve as a filter to get rid of outdated information in continual learning (LeCun et al., 2012; Eiter & Kern-Isberner, 2019). Thus, a balance between adopting new information and forgetting about old data is required. Contrary to that, catastrophic interference (McCloskey & Cohen, 1989; Ratcliff, 1990) in continual machine learning refers to the problem of abrupt forgetting, that is, exponentially decreasing performance on past data. It represents a forgetting behavior where old information is disregarded by the learner too fast. To overcome catastrophic interference, several approaches have been proposed. French (1991) used sparse hidden representations, which reduce interference. Rehearsal learning (Robins, 1995) is a popular approach but requires storing all previously seen patterns. Pseudo-rehearsal learning (Robins, 1995) calculates the output of random patterns and updates the network on a new pattern and several random input-output pairs to reduce interference but adds a significant computational overhead. Complementary learning systems (McClelland et al., 1995; Ans & Rousset, 1997; French, 1997) use a fast and a slowly learning network to store recent and all patterns, respectively. 
The fast learning network acts as a buffer that rapidly stores recent patterns, which are then carefully transferred to the slowly learning network, which tries to store not only the latest but all patterns. A disadvantage of such systems is that we need two networks for a task that can potentially be solved by a single network and that the knowledge transfer from the fast to the slowly learning network needs consolidation (rehearsal) again. Elastic weight consolidation (Kirkpatrick et al., 2017) regularizes weights toward previous values and removes the need for rehearsal, but requires storing all previous weights. Memory-augmented neural networks (Santoro et al., 2016) perform continual learning but without a neural implementation of the memory. In conclusion, implementing continual learning using gradient-based methods in a neural network is still a challenging problem.

A learning rule that can perform continual learning of uncorrelated patterns is Hebbian learning, which remains the major learning principle since Donald Hebb postulated his theory in 1949 (Hebb, 1949). It is still widely used in its canonical form generally known as Hebb’s rule, which, however, cannot learn negative or inhibitory weights when assuming positive firing rates. The covariance rule (Sejnowski & Tesauro, 1989) was proposed as an alternative to overcome this limitation. An advantage of both learning rules is that they are capable of continual learning, allowing patterns to be stored instantaneously without need for repetition. A disadvantage is that they do not take advantage of seeing input patterns several times. They also have problems with correlated patterns as Marr et al. (1991) stated and Löwe et al. (1998) and Neher et al. (2015), respectively analyzed for autoassociative (unsupervised) and hetero-associative (supervised) networks. Furthermore, Hebb’s rule and the covariance rule are unstable learning rules, so that the weights are usually renormalized after each update or a weight decay term is added (Bienenstock et al., 1982), where the latter introduces an additional hyperparameter that controls the speed of forgetting. In unsupervised learning, the stability problem was analytically addressed for a linear neuron by Oja’s rule (Oja, 1982) or for several linear neurons by Sanger’s rule (Sanger, 1989), which are convergent learning rules that drive the neurons to learn the principal components of the input patterns. For autoassociative learning in Hopfield networks, contrastive Hebbian learning (Rumelhart, McClelland, et al., 1986) and for Boltzmann machines contrastive divergence (Hinton, 2002) and its variants (Tieleman, 2008; Desjardins et al., 2010; Cho et al., 2010) have been proposed as stable learning rules, respectively.

In what follows, we begin with a recapitulation of gradient descent and the concept of centering in artificial neural networks in section 2 and then investigate learning in artificial neural networks without derivatives of activation functions in the output layer. This leads us to a unified view of various well-known algorithms, which we refer to as Hebbian descent. Notice that in the case of the mean squared error (MSE) loss and particular activation functions, the proposed learning rule has previously been discussed in the context of generalized linear models (Nelder & Baker, 1972; Kakade et al., 2011; Goel et al., 2018), contrastive Hebbian learning (Rumelhart, McClelland, et al., 1986; Movellan, 1991; Xie & Seung, 2003), and gradient descent (Rosenblatt, 1958; Hinton, 1989; Hertz et al., 1997). Hebbian descent, however, generalizes this update rule to arbitrary loss and activation functions and shows that if the derivative of the activation function is strictly positive, it is equivalent to performing gradient descent using an alternative loss function named the Hebbian descent loss (see section 3). Section 3.1 discusses the problem of vanishing activation function derivatives and how Hebbian descent is able to overcome this issue. In section 3.2, we highlight the connections between Hebbian descent and Hebbian learning. In section 3.3, we show that in the case of the MSE loss and an invertible and integrable activation function, Hebbian descent actually optimizes a generalized linear model. As a consequence, the Hebbian descent loss can be seen as the general log-likelihood loss in this case (see appendix B).

After establishing the theoretical connections between these approaches, section 4 describes the experiments performed to compare them. Specifically, we pit Hebbian descent against regular gradient descent, Hebb’s rule, and the covariance rule using one-layer networks. The results are described in section 5, where we show that all learning rules considered in this work benefit from centering (see section 5.1.2). We further show that Hebbian descent outperforms Hebb’s rule and the covariance rule in general (see section 5.1), has a performance similar to gradient descent in batch or mini-batch learning for several epochs (see section 5.1.5), and, most importantly, has much better performance than all the other update rules in continual learning (see section 5.1.2). Moreover, we demonstrate empirically that Hebbian descent converges even if the derivative of the activation function is merely positive and can be used with nonlinearities like the step function (see sections 5.1.5 and 5.1.2). Section 5.1.4 illustrates that only Hebbian descent with centering shows a gradual linear forgetting that does not require an additional forgetting mechanism such as weight decay. In section 5.2, we present experiments with various deep network architectures to show that Hebbian descent has similar performance advantages in comparison to gradient descent in that setting, which ties in with well-known state-of-the-art deep learning best practices. To sum up, our experiments demonstrate that counteracting the potentially negative effect of saturating activation functions solely in the output layer also has an overall beneficial effect on performance in deep learning.

## 2 Gradient Descent in Centered Neural Networks

Centering in neural networks refers to the subtraction of the mean activity from each neuron, so that all neurons have zero mean activity on average.^{1} It has been shown to be useful for training artificial neural networks (LeCun et al., 2012), in particular for training Boltzmann machines (Montavon & Müller, 2012; Melchior et al., 2016) and autoencoder networks (Melchior et al., 2016), as it makes the network independent of the first-order input statistics, that is, the mean value of each neuron. Centering thus prevents the network from representing mean information through the weights, which allows it to learn only the missing information of a newly represented pattern rather than having to store it entirely in the weights (Melchior, 2021). This is presumably important in continual learning, where we want to store the latest pattern as much as possible while interfering with the stored patterns as little as possible. Since the mean for hidden units is usually not known in advance and changes during training, the offsets can be updated during learning, for example, by an exponentially moving average. When centering was originally proposed by LeCun et al. (2012) and Schraudolph (1998) the authors also recommended normalizing the units’ input to have the same variance, which if updated online as well is closely related to batch normalization proposed by Ioffe and Szegedy (2015).

An important property of both centering as well as full input normalization is that neither of them changes the model class; that is, each centered or normalized artificial neural network can be reparameterized to an uncentered or unnormalized neural network and vice versa, and is therefore just a different parameterization of the same model (Melchior, 2021). Notice also that both centering and normalization are independent of the used learning rule, which is usually gradient descent or backpropagation (Kelley, 1960; Rumelhart, Hinton, et al., 1986) or contrastive learning (Rumelhart, McClelland, et al., 1986; Hinton, 2002) but can also be Hebbian learning (Hebb, 1949).
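The reparameterization property can be checked numerically for a single layer. The following is a minimal sketch, assuming a layer of the form $h = \phi((x - \mu) W + b)$ with input offsets $\mu$ only; the function names are illustrative and are not taken from PyDeep.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def centered_layer(x, W, b, mu):
    """Single layer with input offsets mu subtracted before the weights."""
    return sigmoid((x - mu) @ W + b)

def uncentered_layer(x, W, b):
    """Ordinary single layer without offsets."""
    return sigmoid(x @ W + b)

rng = np.random.default_rng(0)
x = rng.random((5, 4))          # 5 patterns, 4 input units
W = rng.normal(size=(4, 3))     # 3 output units
b = rng.normal(size=3)
mu = x.mean(axis=0)             # input offsets set to the data mean

# Folding the offsets into the bias yields an equivalent uncentered layer.
b_equiv = b - mu @ W
h_centered = centered_layer(x, W, b, mu)
h_uncentered = uncentered_layer(x, W, b_equiv)
max_diff = np.max(np.abs(h_centered - h_uncentered))
```

Both layers compute the same function, so `max_diff` is zero up to floating-point rounding. The sketch also shows why the reparameterization needs a bias term: without `b`, the shift `mu @ W` has nowhere to go.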


Equations 2.8 and 2.11 clearly show that the derivative of the output layer’s activation function plays a key role in the parameter updates. Here a problem can occur when the prethreshold activities $a$ of the output layer take values that, passed through the derivative of the activation function, lead to zero or near-zero results ($\phi'(a_j) \approx 0$). In this case, the corresponding partial derivatives become zero or get close to zero independent of the actual error signal $E(t_j, h_j)$ (e.g., $h_j - t_j$), as can be seen from equations 2.8 and 2.11. Once this saturated region of the activation function is reached, updating the network parameters significantly can require a large number of update steps. In case of batch or mini-batch learning with an appropriate parameter initialization and a sufficiently small learning rate, this effect, as shown in the experiments, is usually less of an issue since the network is presented with a balanced variety of input-output combinations in every learning step. This reduces the risk of the network “burning in” wrong outputs far into the saturated regime of $\phi$ for a majority of the training samples. In continual learning, however, in which we want to store patterns more or less instantaneously or when learning nonstationary input distributions, this problem becomes more severe.
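The downscaling described above can be made concrete with a small numerical sketch. It assumes a single sigmoid output unit and the squared error $\frac{1}{2}(h - t)^2$, so that the error signal is $h - t$ and the gradient with respect to the prethreshold activity is $(h - t)\,\phi'(a)$; the chosen values of $a$ are arbitrary illustrations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    h = sigmoid(a)
    return h * (1.0 - h)

t = 1.0                                           # desired output
pre_activations = np.array([0.0, -2.0, -5.0, -10.0])
h = sigmoid(pre_activations)
error = h - t                                     # error signal of 0.5 * (h - t)^2
grad_wrt_a = error * sigmoid_prime(pre_activations)

for a, e, g in zip(pre_activations, error, grad_wrt_a):
    print(f"a={a:6.1f}  error={e:+.4f}  error*phi'(a)={g:+.2e}")
```

The further the unit is driven into the wrong saturated regime, the larger the error becomes in magnitude, while the product with $\phi'(a)$ shrinks toward zero.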

## 3 Hebbian Descent

with learning rate or step-size parameter $\eta$ and error signal $E(t,h) = \frac{\partial L(t,h)}{\partial h}$ with corresponding loss function $L(t,h)$.

We call the learning rule originating from the loss modification presented in equation 3.1 Hebbian descent, giving credit to Hebbian learning as well as gradient descent, both of which it is strongly connected to. In case of shallow networks using the MSE loss and particular activation functions, this update rule has previously been discussed in the context of generalized linear models (Nelder & Baker, 1972; Kakade et al., 2011; Goel et al., 2018), contrastive Hebbian learning (Rumelhart, McClelland, et al., 1986; Movellan, 1991; Xie & Seung, 2003), and gradient descent or delta rule (Hinton, 1989; Hertz et al., 1997). Moreover, in the special case of binary classification using the MSE loss and a step activation function, Hebbian descent realizes the perceptron update rule proposed by Rosenblatt (1958). If a linear activation function is used instead of the step activation function, Hebbian descent implements the delta rule or Widrow-Hoff rule (Widrow & Hoff, 1960). With this work, we aim to provide a unified view on all of the algorithms we have noted.
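The perceptron special case can be illustrated with a short sketch. It assumes the single-layer update $\Delta W = -\eta\, x\,(h - t)^\top$, that is, the squared-error signal $h - t$ applied without $\phi'(a)$; the function `hebbian_descent_step` and the toy problem (logical OR) are illustrative choices and not part of the original experiments.

```python
import numpy as np

def step(a):
    return (a > 0).astype(float)

def hebbian_descent_step(W, b, x, t, phi, eta):
    """One update that uses the error signal h - t without phi'(a)."""
    h = phi(x @ W + b)
    err = h - t
    W = W - eta * np.outer(x, err)
    b = b - eta * err
    return W, b

# Logical OR as a linearly separable toy problem.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
T = np.array([[0.0], [1.0], [1.0], [1.0]])

W = np.zeros((2, 1))
b = np.zeros(1)
for epoch in range(20):
    for x, t in zip(X, T):
        W, b = hebbian_descent_step(W, b, x, t, step, eta=0.5)

predictions = step(X @ W + b)
```

With the step activation, the update is nonzero only for misclassified patterns, which is the perceptron rule. Passing the identity function as `phi` instead gives the delta rule with the same code.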

When the integral in equation 3.1 exists, it is clear that Hebbian descent naturally inherits the convergence properties of stochastic gradient descent (Robbins & Siegmund, 1985; Saad, 1998), as it effectively is gradient descent with a different loss function. Note that the integral only exists if the output-layer activation function is strictly monotonic, that is, $\phi'(a) \neq 0$ everywhere. If it is strictly monotonically decreasing, the integral exists, but $L_{HD}$ leads to updates in the opposite direction of the updates from the original loss $L$, which is clearly undesirable. This is the reason that Hebbian descent formally requires a strictly monotonically increasing output-layer activation function. However, in practice we found that in most cases, it is easy to modify a positive activation function to become strictly positive and that Hebbian descent usually performs well even for merely positive activation functions (e.g., a rectifier), in which case the integral exists only partially.

It is important to point out that the optimal parameters with respect to $L_{HD}$ and $L$ usually differ, as the average loss over all data points differs unless all individual loss terms are zero. In practice, however, neither $L$ nor $L_{HD}$ is usually the metric we are ultimately interested in; a classification network trained with the cross-entropy loss, for example, is typically judged by a different measure such as its classification accuracy.

Although Hebbian descent is commonly associated with training shallow neural networks, it is equally applicable and also beneficial to deep networks, in which case it only concerns the output-layer activation and loss functions. While it is challenging to adjust the loss function to counteract activation function derivatives in deeper layers, modern deep architectures typically use nonsaturating activation functions like the rectified linear unit (Hahnloser et al., 2000; Glorot et al., 2011) in their hidden layers, thus mitigating the issue of vanishing gradients caused by near-zero derivatives. Choosing from a wide range of possible hidden-layer activation functions is possible as we usually do not require the activation functions in the hidden layers to represent a certain distribution, while for the output layer, we typically do. This leaves the output layer as the one special case where we might have to use a particular saturating activation function if the use case requires it. In these cases, Hebbian descent can prevent the output layer from blocking the error signal and mitigate potential issues throughout the network. Well-established combinations of loss and activation functions for deep networks have been shown to achieve good performance and can be explained by Hebbian descent. For instance, mean squared error loss with linear output units or cross-entropy loss with sigmoid output units both cancel out the derivative of the output-layer activation function during parameter update.^{3} Hebbian descent provides a general explanation of why these combinations tend to work better than others and, moreover, enables us to design improved pairings of loss and activation functions from an error term perspective that treats them as a unit. A detailed derivation showing that the cross-entropy loss with sigmoid output units indeed follows the Hebbian descent paradigm is given in appendix A.
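The cancellation for the sigmoid and cross-entropy pairing can be verified numerically. The sketch below assumes a single sigmoid unit and the binary cross-entropy loss. It compares finite-difference gradients with respect to the prethreshold activity $a$ against the closed forms $h - t$ (cross-entropy) and $(h - t)\,\phi'(a)$ (squared error). The values of $t$ and $a$ are arbitrary.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(t, a):
    """Binary cross-entropy as a function of the pre-activation a."""
    h = sigmoid(a)
    return -(t * np.log(h) + (1.0 - t) * np.log(1.0 - h))

def squared_error(t, a):
    """Squared error 0.5 * (h - t)^2 as a function of the pre-activation a."""
    h = sigmoid(a)
    return 0.5 * (h - t) ** 2

def numerical_grad(loss, t, a, eps=1e-6):
    """Central finite difference of the loss with respect to a."""
    return (loss(t, a + eps) - loss(t, a - eps)) / (2.0 * eps)

t, a = 1.0, -4.0                     # target and a saturated pre-activation
h = sigmoid(a)
grad_ce = numerical_grad(cross_entropy, t, a)
grad_mse = numerical_grad(squared_error, t, a)
```

Here `grad_ce` matches `h - t`, whereas `grad_mse` matches `(h - t) * h * (1 - h)` and is therefore much smaller in the saturated regime.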

### 3.1 The Squared Error Loss

Thus, with Hebbian descent in the case of the MSE and sigmoid activation, the network learns to produce a desired output $t$ given input $x$ by comparing the current output $h$ with the desired output $t$. Equation 3.7 shows that the update rule is the difference between a supervised and an unsupervised Hebb-learning step (see section 3.2). This is strongly connected to contrastive learning rules and the contrastive divergence learning paradigm, and we investigate the connection more closely in the supplementary material. While the supervised learning step measures the correlation between input and desired output, the unsupervised step measures the correlation between input and output that is already represented by the network and removes it from the current update step. This is an important property as it allows learning only the missing information and thus complementing the representation that has already been learned by the network.
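The decomposition into a supervised and an unsupervised Hebb step can be sketched as follows. The code assumes a bias-free single layer with sigmoid units and the update $\Delta W = \eta\,(x\,t^\top - x\,h^\top)$; the pattern, target, and learning rate are arbitrary illustrations.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
x = rng.random(6)                   # one input pattern
t = np.array([1.0, 0.0, 1.0])       # desired output
W = np.zeros((6, 3))
eta = 0.5

update_norms = []
for _ in range(200):
    h = sigmoid(x @ W)
    supervised = np.outer(x, t)     # correlation of input with desired output
    unsupervised = np.outer(x, h)   # correlation of input with current output
    delta = eta * (supervised - unsupervised)
    update_norms.append(np.linalg.norm(delta))
    W += delta
```

As the output $h$ approaches $t$, the two Hebbian terms cancel and the update shrinks, so repeated presentation of a pattern adds only what the network does not yet represent.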

To better illustrate the impact of switching from gradient descent with the regular squared error loss to Hebbian descent (i.e., using the cross-entropy loss), Figure 1a shows the different speed of convergence for a network with sigmoid units trained on a 2D toy example using gradient descent and Hebbian descent. It is evident that the norm of the gradient shrinks extremely in saturated regimes of the sigmoid in which the derivative of the activation function takes very small values. Figure 1b illustrates that even if the norm of the update rules is normalized, Hebbian descent converges faster since it points almost directly to the global minimum. Note that due to different optimization dynamics, Hebbian descent and gradient descent arrive at the same optimum only if both achieve exactly zero loss.

### 3.2 Hebbian Descent from the Perspective of Hebbian Learning

Hebbian learning (Hebb, 1949) is one of the oldest and best-known learning rules for neural network training. It plays a central role in research history and is still relevant today. Although section 3.1 briefly touches on the connection between Hebbian descent and Hebbian learning, we aim to provide a more detailed and formal explanation of this connection.

Note, however, that, whether centered or not, the covariance rule is still a divergent learning rule and is unable to consider only the missing information in the parameter updates, which limits its effectiveness in learning correlated or similar pattern pairs.
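The divergence can be illustrated with a minimal sketch. It assumes the textbook form of the covariance rule, $\Delta W = \eta\,(x - \mu)(t - \lambda)^\top$, with fixed offsets $\mu$ and $\lambda$; the pattern and offset values are hypothetical.

```python
import numpy as np

def covariance_rule_step(W, x, t, mu, lam, eta):
    """Covariance-rule update: outer product of mean-free input and target."""
    return W + eta * np.outer(x - mu, t - lam)

x = np.array([1.0, 0.0, 1.0, 1.0])   # input pattern
t = np.array([1.0, 0.0])             # desired output
mu = np.full(4, 0.5)                 # assumed input mean
lam = np.full(2, 0.5)                # assumed output mean

W = np.zeros((4, 2))
norms = []
for _ in range(5):
    W = covariance_rule_step(W, x, t, mu, lam, eta=0.1)
    norms.append(np.linalg.norm(W))
```

Because the update does not depend on the current network output, presenting the same pair again adds the same increment each time, and the weight norm grows linearly without bound unless renormalization or weight decay is applied.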

### 3.3 Hebbian Descent and Generalized Linear Models

Generalized linear models (GLMs) are a popular tool in statistics and subsume a variety of simpler models such as ordinary linear or logistic regression. Here we show that in the case of the squared error loss and an invertible and integrable activation function, Hebbian descent actually optimizes a GLM in canonical form (Nelder & Baker, 1972). We first give a brief introduction to GLMs and then show that the gradient of GLMs is equivalent to the Hebbian descent update, which can also be concluded from the derivations given by Nelder & Baker (1972) and McCullagh & Nelder (1989). (For a detailed introduction to GLMs, see McCullagh & Nelder, 1989.)

Note that Kakade et al. (2011) have proposed an algorithm named L-Isotron that uses the gradient of canonical GLMs (see equations 3.24 and 3.25), but additionally allows learning the nonlinearity instead of choosing it by hand. The authors have also provided convergence bounds for the gradient of GLMs in canonical form. Goel et al. (2018) have transferred the algorithm to single-layer convolutional neural networks.

## 4 Methods

In this section, we describe the benchmark data sets and the experimental setup. For this work, we used the machine learning library PyDeep, which allows reproducing the described results and provides examples.^{4} We chose a mix of community standard plus one random data set, which can all be considered simple by today’s standards. This was done to maintain focus on the comparison aspect of the different methods, reduce experiment run times, and increase the chances of readers being already familiar with the used data sets. To bring each network architecture and learning rule to its full potential, we performed a grid search over the most influential hyperparameters per experiment (see sections 4.2 and 4.3). Again to reduce experiment run times, we chose the particular set of hyperparameters we would perform a grid search over based on initial experiments and domain knowledge.

### 4.1 Benchmark Data Sets

We consider four real-world data sets from various domains as well as binary and normal distributed random patterns in our experiments. For all classification data sets, the class labels are presented as one-hot vectors to the networks.

The MNIST (LeCun et al., 2012) data set consists of 70,000 gray-scale images of handwritten digits divided into training and test sets of 60,000 and 10,000 patterns, respectively. The images have a size of $28 \times 28$ pixels, in which all pixel values are normalized to the range [0, 1]. The data set is not binary, but the values tend to be close to zero or one. Each pattern is assigned to one out of 10 classes representing the digits 0 to 9.

The CONNECT (Larochelle et al., 2010) data set consists of 67,557 game-state patterns from the game Connect-4. The data set is divided into training, validation, and test sets with 16,000, 4000, and 47,557 patterns, respectively. The binary patterns are 126-dimensional, and each pattern is assigned to one out of three classes representing the game results: win, lose, or draw.

The ADULT (Larochelle et al., 2010) data set consists of 32,561 binary patterns of census data to predict whether a person’s income exceeds $50,000 per year. The data set is divided into training, validation, and test sets with 5000, 1414, and 26,147 patterns, respectively. The binary patterns are 123-dimensional, and each pattern is assigned to one out of two classes representing whether the income level was exceeded or not.

The CIFAR (Krizhevsky, 2009) data set consists of 60,000 color images of various objects divided into training, validation, and test sets with 40,000, 10,000, and 10,000 patterns, respectively. The images have a size of $32 \times 32$ pixels that are converted to gray scale and rescaled to lie in a range of [0, 1], so that the data set has a nonzero mean and can be represented by most of the activation functions. Each pattern is assigned to one out of 10 classes representing trucks, cats, or dogs, for example.

The RAND and RANDN data sets serve as baseline data sets, each consisting of random patterns with a size of 200 pixels. The pixels in the RAND data set take the value one with a probability of 0.5 and zero otherwise. The pixels in the RANDN data set are drawn from a Gaussian distribution with zero mean and unit variance. The RANDN data set is then rescaled to lie in a range of [0, 1], so that it can be represented by most of the activation functions. The resulting data set has a mean of 0.5 and standard deviation of 0.1. These data sets do not have label information.
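The two random baselines could be generated along the following lines. This is a sketch under stated assumptions: the number of patterns (1000 here) is not given in the text, and the rescaling of RANDN is assumed to be a global min-max normalization, which is one way to obtain values in [0, 1] with a mean near 0.5 and a standard deviation near 0.1. The function names are hypothetical.

```python
import numpy as np

def make_rand(num_patterns, num_pixels=200, seed=0):
    """Binary patterns: each pixel is one with probability 0.5."""
    rng = np.random.default_rng(seed)
    return (rng.random((num_patterns, num_pixels)) < 0.5).astype(float)

def make_randn(num_patterns, num_pixels=200, seed=0):
    """Gaussian patterns rescaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    data = rng.standard_normal((num_patterns, num_pixels))
    # Assumed rescaling: min-max over the whole data set.
    return (data - data.min()) / (data.max() - data.min())

rand = make_rand(1000)      # number of patterns is an assumption
randn = make_randn(1000)
```

With a few hundred thousand Gaussian samples, the extreme values lie roughly four to five standard deviations from the mean, so min-max rescaling compresses the unit variance to a standard deviation on the order of 0.1.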

### 4.2 Network Structure and Learning Setup for Single-Layer Network Experiments

In our main experiments, we consider single-layer networks with various activation functions such as linear, sigmoid, step, softmax, rectifier, and exponential linear. Unless stated otherwise, in the experiments we used MSE error loss without weight decay regularization. Here we focus on single-layer networks to make assessing the comparison between Hebbian descent and some of the other learning rules feasible. The bias values were initialized to zero, and according to Glorot and Bengio (2010), we initialized the weights to $w_{ij} \sim U\left(-\sqrt{\frac{6}{N+M}}, +\sqrt{\frac{6}{N+M}}\right)$, in which $N$ is the number of input units, $M$ is the number of output units, and $U(a,b)$ is the uniform distribution in the interval $[a,b]$. Each experiment was repeated 10 times, in which the initial weight matrices were the same among the methods but different in each trial. The default batch size was one in case of continual learning and 100 in case of mini-batch learning, and when training involved several sweeps through the data, the models were trained for 100 epochs. Depending on the data set and activation function, the optimal learning rate varied a lot, so we performed a grid search over 35 different learning rates ranging from 0.00002 to 100.^{5} When weight decay was used, we additionally performed a grid search over 20 different weight decay values ranging from 0 to 2 leading to a total search space of $20 \times 35 = 700$ hyper-parameter combinations.^{6} When centering was used and if not mentioned otherwise, the input offsets were fixed to the corresponding data mean, and the hidden offsets were initialized to $\lambda_j(t=0) = 0.5$ and updated with an exponential moving average of $\lambda_j(t+1) = 0.99\,\lambda_j(t) + 0.01\,h_j(t)$. For a fair comparison of the methods, we had to fix input offsets to the data mean because changing offsets during training requires a bias parameter for the reparameterization that is not available when using Hebb’s rule and the covariance rule.
However, slowly updated input offsets converge to the data mean, leading to a very similar performance as when initially fixing them to the data mean. This has been shown for mini-batch learning in restricted Boltzmann machines by Melchior et al. (2016) and is shown for continual heteroassociation in the following. Without centering, the offsets were all fixed to zero.
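The initialization and the offset update described in this section can be sketched in a few lines. The helper names `glorot_uniform` and `update_offset` are hypothetical, and the random activations `h` merely stand in for real unit outputs; the bound of the uniform distribution follows the standard Glorot and Bengio (2010) scheme.

```python
import numpy as np

def glorot_uniform(n_in, n_out, rng):
    """Uniform weight initialization in [-limit, +limit] with
    limit = sqrt(6 / (n_in + n_out)), following Glorot and Bengio (2010)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def update_offset(offset, h, decay=0.99):
    """Exponential moving average: 0.99 * offset + 0.01 * h for decay=0.99."""
    return decay * offset + (1.0 - decay) * h

rng = np.random.default_rng(0)
n_in, n_out = 200, 200

W = glorot_uniform(n_in, n_out, rng)  # weights
b = np.zeros(n_out)                   # biases initialized to zero
lam = np.full(n_out, 0.5)             # offsets of the units initialized to 0.5

h = rng.random(n_out)                 # stand-in activations (illustrative only)
lam_new = update_offset(lam, h)
print(np.abs(W).max(), lam_new[:3])
```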

### 4.3 Network Structure and Learning Setup for Multilayer Network Experiments

We provide classification experiments with deep networks on the MNIST and CIFAR data sets as well as regression experiments using a denoising autoencoder on CIFAR to illustrate the efficacy of Hebbian descent in this setting. Network parameter initialization and search for the optimal learning rate were done in the same way as described above for the single-layer network experiments. Each experiment was repeated five times with the same initial weight matrices among the methods but different in each trial. We test the combination of softmax output-layer activation with MSE or CE loss, representing the gradient descent and a Hebbian descent setting, respectively. Additionally, in the classification experiments, we pair the MSE loss with a linear output-layer activation to test how a nonstandard Hebbian descent error term performs in comparison with the best practice of softmax plus CE loss. We use four different network architectures in total, of which the first three are used in the classification experiments and the last one in the regression experiments:

A two-layer fully connected network with 200 neurons in both layers and a ReLU nonlinearity between them.

A two-layer convolutional network with 32 and 64 channels, a stride of 1, and a final fully connected layer with 128 neurons. All hidden layers have ReLU nonlinearities between them.

A ResNet18 instance, which is an 18-layer deep convolutional network with residual connections and ReLU nonlinearities (He et al., 2016).

A six-layer autoencoder with 12, 24, 48, 24, 12, 3 channels per layer, a stride of 1, and ReLU nonlinearities between the hidden layers.

### 4.4 Performance Measurement

As the different methods and activation functions effectively optimize different loss functions, we decided on a coherent performance measure that represents the performed task best and favors neither one method nor the other. For the classification experiments, we therefore evaluated the average misclassification rate, while for the other experiments, we used the mean absolute error (MAE) as an intuitive performance measure. For all supervised experiments (see section 5.2), the reported performance was measured on a held-out test set. For the heteroassociation experiments (see section 5.1), the concept of a test set is not applicable as the task is to explicitly associate a given pattern with one specific other pattern. For the experiments with binary output patterns consisting of zeros and ones, it is clear that a mean absolute error of 0.1, for example, means that on average, each neuron deviates 10% and that an error above 0.5 is worse than for random output (chance level). Alternatively, we could have used the mean squared error, which would have favored gradient descent, or the log-likelihood, which would have favored Hebbian descent; neither would thus allow for an unbiased performance measure. Furthermore, both overestimate outliers and underestimate small deviations, thus leading to a less interpretable performance measure. Another choice that is often used in neuroscience is the Pearson correlation between target and output pattern. However, it is scale invariant and can lead to a wrong impression of the network’s performance. For some experiments, we also evaluated these other performance measurements and found qualitatively the same results: if a method performed significantly better in terms of the MAE, it was also the best with regard to the other measures.
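To make the interpretation of the MAE for binary targets concrete, the following sketch computes it for two illustrative outputs. The function name `mean_absolute_error` and the synthetic outputs are our own choices for illustration and are not taken from the experiments.

```python
import numpy as np

def mean_absolute_error(targets, outputs):
    """Average absolute deviation between target and output values."""
    return float(np.mean(np.abs(targets - outputs)))

rng = np.random.default_rng(0)
targets = (rng.random((100, 200)) < 0.5).astype(np.float64)

# Every unit deviates by 0.1 from its binary target.
close_outputs = np.where(targets == 1.0, 0.9, 0.1)
# An uninformative network that always returns 0.5.
constant_outputs = np.full_like(targets, 0.5)

mae_close = mean_absolute_error(targets, close_outputs)
mae_constant = mean_absolute_error(targets, constant_outputs)
print(mae_close, mae_constant)
```

The first value corresponds to the 10% average deviation mentioned above, and the second to the 0.5 level of an uninformative output.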

## 5 Results

We compare the performance of Hebbian descent, gradient descent, Hebb’s rule, and the covariance rule for centered and uncentered single-layer networks. We first focus on continual heteroassociation, associating $N$ input patterns with $N$ output patterns one after the other, but we also present multi-epoch experiments to explore how much the individual learning rules profit from seeing training examples multiple times. In a separate set of experiments, we compare the classification and regression performance of Hebbian descent and gradient descent using several deep network architectures.

### 5.1 Heteroassociative Learning

In this section we investigate the performance of Hebb’s rule, the covariance rule, gradient descent, and Hebbian descent in heteroassociative learning with a focus on continual learning. To maintain an emphasis on the continual learning setting, every data point is generally presented only once to each algorithm.

#### 5.1.1 Heteroassociative Continual Learning without Centering

In a first set of experiments, we analyze how well 100 patterns of one data set can be associated with 100 patterns of another data set one pair at a time. Figure 2a shows the performance of uncentered networks with sigmoid units when the four different methods have been used to associate 100 binary random patterns (RAND) with 100 binary random patterns (RAND). Unless stated otherwise, the optimal learning rate was generally determined for each method individually such that the average performance over the last 20 patterns is best. For select experiments, we added a setting where the learning rate was tuned to achieve the best possible performance for only the last pattern. Note that while performing the learning rate grid search, we found that Hebbian descent was generally more robust with regard to variations of the learning rate, which actually puts the remaining methods at a slight advantage since they profited from the learning rate fine-tuning to a higher degree than Hebbian descent.

For Hebb’s rule, the error is close to 0.5, which means that the network has not learned to associate the patterns at all. It is the same performance as baseline, which is the error between the output patterns and their mean value. This corresponds to the performance of a network that, independent of the input, always returns the mean output pattern and thus represents the most trivial solution. The bad performance of Hebb’s rule is a direct consequence of not being able to learn negative or zero correlations and can be explained as follows. Equation 3.10 indicates that in the uncentered case and for binary patterns, each weight $w_{ij}$ is updated by a value of $\eta$ when the corresponding input and target values are both one, or by zero otherwise. Since the patterns are drawn uniformly at random, the weights will thus increase with a probability of 0.25 or stay the same otherwise. Consequently, all neurons will sooner or later, and independent of the input, produce a constant output of one, resulting in an error of 0.5.
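The saturation argument for uncentered Hebb's rule can be reproduced with a small simulation. This is a sketch under simplifying assumptions: the update is taken to be the plain outer product of input and target scaled by the learning rate, the weights start at zero instead of the initialization used in the experiments, no bias is used, and the learning rate of 1.0 is purely illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
n_patterns, n_in, n_out, eta = 100, 200, 200, 1.0  # eta is illustrative

X = (rng.random((n_patterns, n_in)) < 0.5).astype(np.float64)
T = (rng.random((n_patterns, n_out)) < 0.5).astype(np.float64)

# Uncentered Hebb's rule: a weight grows by eta only if input and target
# are both one, so weights never decrease.
W = np.zeros((n_in, n_out))  # zero init for brevity (assumption)
for x, t in zip(X, T):
    W += eta * np.outer(x, t)

outputs = sigmoid(X @ W)
mae = float(np.mean(np.abs(T - outputs)))
# Baseline: error between the target patterns and their mean pattern.
baseline = float(np.mean(np.abs(T - T.mean(axis=0))))
# Average fraction of presentations in which a weight was increased.
growth_fraction = float(W.mean() / (eta * n_patterns))

print("MAE:", mae, "baseline:", baseline, "growth fraction:", growth_fraction)
```

Because all weights are nonnegative and keep growing, the units saturate at one regardless of the input, which places the error at the level of the baseline.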

The covariance rule solves this problem by allowing negative weight changes leading to a much better performance, which is even better than that of Hebbian descent and gradient descent. Interestingly, in the case of Hebb’s rule and the covariance rule, early and late patterns are learned equally well, whereas for gradient descent and Hebbian descent, more recent patterns are represented better than older ones. While this enables the latter two approaches to learn continually from new data, both show catastrophic forgetting for older patterns (see Figure 2). Note that only Hebbian descent allows storing the latest pattern almost perfectly, independent of the number of the stored patterns. For the covariance rule, however, the performance of all patterns decreases with an increasing number of learned patterns.

To show that the observed performance differences do not depend on the number of patterns on which the learning rate is optimized, we performed two additional experiments. In one, we selected for each learning rule the optimal learning rate such that the performance of only the last pattern (index 100) is best; the results are shown in Figure 2b. In a second one, the optimal learning rate per learning rule was chosen such that the performance over all 100 patterns is best (results not shown here). For all scenarios, the performance curves of the learning rules are very similar. Since the optimal learning rates in the experiments are nearly the same, a larger learning rate does not allow the methods to represent only the latest pattern better.

#### 5.1.2 Heteroassociative Continual Learning with Centering

To show the importance of centering and its ability to prevent catastrophic interference in continual learning when combined with Hebbian descent, we performed the same experiments as before but with centered networks. The results are shown in Figure 3, illustrating that all methods profit significantly from centering and that the covariance rule and Hebb’s rule become equivalent in case of centering as shown analytically in equation S6. We refer to Melchior (2021) for a similar performance comparison where the data are not centered to the mean but to offsets learned during training. In the mean-centered setting presented here, all methods except for Hebbian descent have a homogeneous error distribution over the single patterns, which also does not change when the learning rate is selected such that the last pattern is represented best, as shown in Figure 3b. Interestingly, only Hebbian descent allows for a linear slope of forgetting. It is still able to learn from new patterns but at the same time does not suffer from catastrophic interference anymore. While the performance of Hebbian learning and the covariance rule improved compared to noncentered networks, their performance on all patterns will still decrease with an increasing number of learned patterns.
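The continual heteroassociation setting with centered Hebbian descent can be sketched as a single pass over the pattern pairs with a batch size of one. The function `train_continual` is hypothetical, the weights are initialized to zero for brevity, the learning rate of 0.1 is not tuned, and the update is assumed to be the output error times the centered input without the derivative of the activation function; the exact update equations are given in section 3.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_continual(X, T, eta, centered=True):
    """One pass over the pattern pairs, one pair at a time."""
    n_in, n_out = X.shape[1], T.shape[1]
    # Input offsets fixed to the data mean (centered) or to zero (uncentered).
    mu = X.mean(axis=0) if centered else np.zeros(n_in)
    W = np.zeros((n_in, n_out))  # zero init for brevity (assumption)
    b = np.zeros(n_out)
    for x, t in zip(X, T):
        xc = x - mu
        h = sigmoid(xc @ W + b)
        err = t - h                   # error term without the derivative factor
        W += eta * np.outer(xc, err)
        b += eta * err
    return W, b, mu

rng = np.random.default_rng(0)
X = (rng.random((100, 200)) < 0.5).astype(np.float64)
T = (rng.random((100, 200)) < 0.5).astype(np.float64)

W, b, mu = train_continual(X, T, eta=0.1)
outputs = sigmoid((X - mu) @ W + b)
per_pattern_mae = np.mean(np.abs(T - outputs), axis=1)
print("first five patterns:", per_pattern_mae[:5])
print("last five patterns: ", per_pattern_mae[-5:])
```

Plotting `per_pattern_mae` against the pattern index gives a rough impression of the forgetting behavior, with the caveat that this toy setup does not reproduce the tuned learning rates or the initialization of the reported experiments.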

We see the reason that Hebbian descent shows this continual learning behavior in two factors. First, centering the data helps the network to disentangle the weights and biases, which effectively avoids large, disruptive weight updates due to the inherent bias of the presented samples (Melchior, 2021). This enables Hebbian descent as well as gradient descent to make more targeted updates with respect to the currently presented data point without compromising what has been learned before. If centering is not applied, we can see catastrophic forgetting as in Figure 2. Second, storing a pattern within one update requires a rather large step size, which easily pushes the units into the saturated regions of their activation functions. While Hebbian descent can potentially undo these steps linearly with respect to the error term (see equations 3.2 and 3.4), gradient descent would need several steps as the update is significantly scaled down by the derivative of the activation function (see equations 2.8 and 2.11), hindering the network from storing the information within one step. Thus, centering is required to enforce a disentangled representation of weights and biases, while only Hebbian descent is able to unlearn or forget information that is stored in rather saturated regions of the activation function.
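The second factor can be illustrated numerically for a single sigmoid unit. In this sketch, the gradient descent error term is assumed to come from the MSE loss, so that it equals the output error multiplied by the derivative of the sigmoid, whereas the Hebbian descent error term is the output error alone. The function name `error_terms` and the chosen pre-activations are illustrative.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def error_terms(a, t):
    """Return (gradient descent, Hebbian descent) error terms for a
    sigmoid unit with pre-activation a and target t (MSE loss assumed)."""
    h = sigmoid(a)
    hebbian_descent = t - h
    gradient_descent = (t - h) * h * (1.0 - h)  # scaled by the sigmoid derivative
    return gradient_descent, hebbian_descent

# A unit that must be switched from "on" to "off" (target zero).
for a in (0.0, 5.0, 10.0):
    gd, hd = error_terms(a, t=0.0)
    print(f"a={a:5.1f}  gradient descent: {gd: .2e}  Hebbian descent: {hd: .2e}")
```

The deeper the unit sits in the saturated region, the smaller the gradient descent error term becomes, while the Hebbian descent error term stays close to the full output error.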

Since natural data are usually not uncorrelated, we performed the same experiments as before but with correlated real-world data sets. In particular, we associated 100 binary random patterns (RAND) with 100 patterns of the ADULT data set. The results for centered networks are shown in Figure 4a, where Hebb’s rule and the covariance rule perform significantly worse than gradient descent and Hebbian descent, and even worse than baseline. Again, Hebbian descent has a linear slope of forgetting while all the other methods have a homogeneous error distribution. The plot for associating a set of patterns of the ADULT data set with another set of patterns of the ADULT data set is qualitatively similar and is thus not shown. Figure 4b shows the reverse experiment in which 100 patterns of the ADULT data set in the input are associated with 100 binary random patterns (RAND) in the output. The error for all methods is rather large, which arises from a rather high pairwise correlation of the patterns in the ADULT data set. Associating two very similar patterns with two completely random output patterns is a rather difficult task in continual learning as the chance of overwriting associations is extremely high. Again, all methods have roughly the same error across all individual patterns except for Hebbian descent, which allows storing at least the most recent patterns with high accuracy. This comes at the cost of the older patterns’ accuracy, leading to gradual forgetting. The optimal learning rate in both experiments was chosen such that the performance is best on the last 20 patterns, but consistent with the previous experiments, the results are almost the same when choosing it to be best for only the last or all patterns. Furthermore, without centering, all methods perform significantly worse (data not shown).

To confirm empirically that Hebbian descent is generally better on recent patterns, we performed several experiments with various data sets and activation functions. For detailed results, we refer to the supplementary material (see section F, Table S4, as well as Tables S9 and S10). Across all combinations of data sets and activation functions, Hebbian descent and gradient descent perform significantly better than Hebb’s rule and the covariance rule on the last 20 patterns, and the latter two are both not even better than baseline performance. Furthermore, Hebbian descent performs significantly better than gradient descent in most cases, and this advantage becomes more pronounced as the activation function becomes more nonlinear. As an extreme case, we used the step function, which guarantees a binary output of the network but is incompatible with gradient descent as the gradient is constantly zero for such networks. Hebbian descent, however, can deal with this type of activation function and reaches good performance. The linear networks, for which gradient descent and Hebbian descent are equivalent, always perform worse than or similar to corresponding nonlinear networks. A nonlinearity should thus be applied, and the sigmoid function has the best performance among the different activation functions even when the output data come from a continuous domain such as ADULT $\rightarrow$ RANDN or CIFAR $\rightarrow$ RANDN, for example (see supplementary material section F, Table S10). To emphasize the advantage Hebbian descent holds over gradient descent in our experiments more clearly, we generated a scatter plot comparing their performance over the various data sets and output-layer activation functions shown in Figure 5. (For more detailed results, see supplementary material, Tables S4, S9, and S10.)
All points lie either roughly on the diagonal or clearly below it, showing that if there is a significant difference between the two methods, Hebbian descent performs significantly better than gradient descent.

For comparison, we performed the same experiments as before but without centering, shown in Figure 5b, which clearly supports our statement that centering is valuable for all methods. Again, more results can be found in the supplementary material: section F, in Tables S5, S11, and S12. Without centering, Hebbian descent loses its ability to store recent patterns significantly better than older ones, but in most cases, it still performs significantly better than the other methods. We also emphasize the superiority of centered over uncentered networks by plotting the results for the centered versus uncentered Hebbian descent experiments in a scatter plot shown in Figure 5b. Detailed experimental results can be found in the supplementary material in Tables S4, S5, S9, S10, S11, and S12. All points lie clearly below the diagonal, showing that centering is always beneficial in our experimental setting.

#### 5.1.3 On the Advantage of Saturating Activation Functions in Continual Learning

The advantage of activation functions with restricted output values is that their values cannot overshoot, meaning that even extreme weight changes will lead to reasonable output values, which seems to be crucial for continual learning. This can be seen from the sensitivity of the network with respect to the learning rate. Figure 6 shows the performance of the last 20 patterns for (a) linear networks and (b) networks with sigmoid activation functions that are trained to associate 100 binary random patterns (RAND) with another 100 binary random patterns (RAND) using different learning rates. In the linear case, it is crucial for all methods to choose the right learning rate, as can be seen from the very sharp optimum. If the learning rate is chosen too large, the error increases exponentially. Also notice that neither Hebb’s rule nor the covariance rule even gets close to baseline performance. When using a sigmoid activation function, however, all methods perform significantly better than baseline. While gradient descent still has a rather sharp optimum, the performance of the other methods does not change significantly for a learning rate above a certain threshold. This is a useful property as one can simply select a large learning rate, instead of performing a grid search, to achieve performance close to the optimum. Qualitatively, the same picture can be seen for the other data sets used in the experiments.

#### 5.1.4 Heteroassociative Online Learning with Weight Decay

The ability of the network to forget patterns becomes more important as more and more patterns are stored in the network. To illustrate this effect, we trained centered networks to associate 1000 instead of 100 patterns, which is way beyond the capacity of the network. The performance for the last 100 patterns when associating 1000 binary random patterns with another 1000 binary random patterns is shown in Figure 7a and when associating 1000 binary random patterns with 1000 patterns of the ADULT data set is shown in Figure 7b.

The errors for Hebb’s rule, the covariance rule, and gradient descent simply increase for all patterns compared to when only 100 patterns have been stored. This can be seen by comparing the results in Figure 7a with those in Figure 3a, and the results in Figure 7b with those in Figure 4a. Hebbian descent in contrast still shows a power-law forgetting curve in which more recent patterns are represented better than older ones. Furthermore, only Hebbian descent allows having roughly the same performance on the latest patterns independent of the number of patterns that have been stored previously.

While Hebbian descent controls the amount of forgetting automatically, the other methods need an explicit forgetting mechanism, usually implemented through weight decay. It removes a certain proportion of the current weight matrix in each update step and therefore introduces an additional hyperparameter that controls the speed of forgetting. To investigate the effect of weight decay on the four different methods, we again trained networks to associate 1000 binary random patterns (RAND) with another 1000 binary random patterns (RAND). A simultaneous grid search over learning rate and weight decay was performed such that, just as in the previous experiments, the performance on the final 20 patterns is best.^{7} The results are shown in Figure 8a, illustrating that except for Hebb’s rule, all methods improve on the final 20 patterns compared to when no weight decay is used (see Figure 7a). Notice that the equivalence of centered Hebb’s rule and the centered covariance rule (see equation S6) no longer holds when a weight decay term is used, as the weights of the final weight matrix now depend on the order of the presented data points. Independent of whether a weight decay term is used or not, Hebb’s rule is limited when an input pattern has to be associated with a zero target since the update is zero, and thus no learning happens. This is different for the covariance rule in which the mean values are also subtracted from the target values, allowing associations with zero target values to be learned. In this experiment the covariance rule achieves an even better performance than Hebbian descent on the final 20 patterns, which, however, comes at the cost of reduced performance for older patterns. Hebbian descent achieves optimal performance on the latest patterns with a rather small weight decay, and the performance is similar to that without weight decay, indicating that the implicit mechanism of forgetting does not profit much from weight decay.
While gradient descent also improves on recent patterns, it has a worse performance than Hebbian descent and the covariance rule over all of the last 100 patterns.
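The effect of weight decay as an explicit forgetting mechanism can be sketched as follows. The exact form of the decay term is an assumption here (the decay is applied inside the learning-rate-scaled update), and the function name `apply_update` as well as the numerical values are illustrative only.

```python
import numpy as np

def apply_update(W, delta_W, eta, weight_decay):
    """Assumed form: W <- W + eta * (delta_W - weight_decay * W).

    Each step removes the proportion eta * weight_decay of the current weights.
    """
    return W + eta * (delta_W - weight_decay * W)

eta, weight_decay, n_steps = 0.1, 0.5, 10
W0 = np.ones((3, 2))  # stands for the trace left by an old pattern

# Without any new learning signal, the old trace shrinks geometrically.
W = W0.copy()
for _ in range(n_steps):
    W = apply_update(W, np.zeros_like(W), eta, weight_decay)

print("remaining proportion of the old trace:", W[0, 0])
```

In this form, the contribution of an old pattern to the weight matrix shrinks by the factor $(1 - \eta \cdot \text{weight decay})$ per step, which is why the weight decay value directly controls the speed of forgetting.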

To investigate the effect of correlated data, we also performed the same experiments on associating 1000 binary random patterns (RAND) with 1000 patterns of the ADULT data set. The results are shown in Figure 8b and illustrate that in the presence of correlation, even with weight decay, none of the three alternative learning rules performs better than Hebbian descent on the final 20 or any of the final 100 patterns. We performed many more experiments with other data sets and activation functions and refer to the supplementary material, section F, Table S7, for the detailed results.

In most cases, Hebbian descent performs significantly better than the other methods, and in the remaining cases, its performance is close to the best. Gradient descent either reaches a similar performance as Hebbian descent or performs significantly worse. We did not expect weight decay to allow gradient descent to become better than Hebbian descent as it does not solve the problem of very small or zero values of the derivative of the activation function. While Hebb’s rule is always significantly worse than all other methods, the covariance rule reaches a very good performance in case of binary random input and output patterns, as shown in Figure 8a. When the output data are binary random, the covariance rule reaches a similar performance as Hebbian descent, but in all other cases, the performance is significantly worse than Hebbian descent. As the problem with correlation or redundancy is still present for Hebb’s rule and the covariance rule, we did not expect that weight decay leads to a better performance than Hebbian descent. In Figure 9, we compare the performance of all methods with and without weight decay. All methods profit from using weight decay as all points lie below the diagonal. It is evident that Hebbian descent has the best performance on average, closely followed by gradient descent, as the points are located much closer to the origin compared to the other methods. Gradient descent profits in most cases from weight decay, as shown in Figure 9a, although the improvement for Hebb’s rule and the covariance rule is much higher, as shown in Figures 9c and 9d. Hebbian descent has a similar performance with and without weight decay in most cases, as shown in Figure 9b. In fact, in most of these cases, the optimal weight decay is even zero or at least very small (not shown).
In case of unbounded activation functions, such as linear, rectifier, and exponential linear units, a weight decay term often has a slightly positive effect by keeping the weights from growing too large (i.e., overshooting). The use of a small weight decay term when using an unbounded activation function therefore seems to be a good idea in general.

#### 5.1.5 Heteroassociative Multi-Epoch Learning

In contrast to Hebb’s rule and the covariance rule, gradient descent and Hebbian descent take advantage of seeing training patterns several times. Repetitions help to stabilize memories (think of humans learning vocabularies, for example). This is a clear advantage as the methods can further improve the network’s performance over time. We used the same training setup as in the continual learning experiments to train networks using gradient descent and Hebbian descent, but this time for 100 epochs instead of one. In other words, we continued training for another 99 epochs to investigate how much the networks can improve their performance when each pattern is presented several times. The results for centered and uncentered networks can be found in the supplementary material in section F, Tables S8, S13, and S14. We refer to Melchior (2021) for further experiments in the multi-epoch setting.

In all cases, whether centered or not, both methods improve significantly compared to when training only a single epoch. Furthermore, both methods reach better results when centering is used, and networks trained with centered Hebbian descent more often even reach zero error. In most cases, Hebbian descent reaches better values than gradient descent, especially when the association can be learned almost perfectly as in the case of RAND $\rightarrow$ ADULT, RANDN $\rightarrow$ CONNECT, or MNIST $\rightarrow$ CONNECT, for example. Both algorithms have a similar performance for exponential linear units on all data sets, but the performance of rectifier, sigmoid, or step function is more often better when using Hebbian descent. In case of the rectifier as an activation function, gradient descent often gets trapped in rather bad local optima, a direct consequence of the rectifier having a zero derivative for negative input values. As the derivative of the step function is constantly zero, only Hebbian descent can deal with the step function as a nonlinearity. The general tendency of Hebbian descent to perform better than gradient descent is even more prominent without centering.
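The role of a zero derivative for the rectifier and the step function can be made explicit with a small numerical example. As before, the gradient descent error term is assumed to be the output error multiplied by the derivative of the activation function, and the Hebbian descent error term the output error alone; the step function thresholding at zero and the chosen pre-activations are illustrative assumptions.

```python
import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def relu_grad(a):
    return (a > 0).astype(np.float64)   # zero for negative inputs

def step(a):
    return (a > 0).astype(np.float64)   # threshold at zero (assumption)

def step_grad(a):
    return np.zeros_like(a)             # derivative is zero everywhere

a = np.array([-2.0, -0.5, 1.5])         # pre-activations of three units
t = np.array([1.0, 1.0, 1.0])           # targets

results = {}
for name, phi, dphi in (("rectifier", relu, relu_grad), ("step", step, step_grad)):
    hebbian_descent = t - phi(a)
    gradient_descent = hebbian_descent * dphi(a)
    results[name] = (gradient_descent, hebbian_descent)
    print(name, "| gradient descent:", gradient_descent,
          "| Hebbian descent:", hebbian_descent)
```

For the units with negative pre-activation, the gradient descent error term vanishes although the output is wrong, whereas the Hebbian descent error term still carries the full output error.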

To summarize, both methods reach similar results, but only Hebbian descent reaches good results also in continual learning (i.e., seeing every pattern only once). Similar results are also achieved when mini-batch learning (e.g., with a batch size of 10) is used instead of a batch size of 1.

### 5.2 Deep Network Experiments

In this section, we compare the classification and regression performance of Hebbian descent against gradient descent in deep neural networks. For classification results of single-layer networks, we refer to the supplementary material. In the classification setting, we present performance numbers on the MNIST and CIFAR data sets after five epochs of training. For the regression setting, we trained a denoising autoencoder on CIFAR-10 and report the final MAE after 80 epochs. Input images were treated with random perturbations in hue, saturation, value, and brightness of maximally 20% and had to be reconstructed to the original image. For a fair comparison, we chose per experiment the same data set, batch size, network architecture, and optimizer for both Hebbian descent and gradient descent. To accommodate for the difference in effective learning rates, we then performed a separate grid search for the optimal learning rate per learning rule. Figure 10a shows the final misclassification rate of Hebbian descent versus gradient descent in the classification setting, and Figure 10b shows the final mean absolute error of Hebbian descent versus gradient descent in the denoising autoencoder setting. The full experimental results are listed in the supplementary material in section F, Tables S2 and S3.

For the classification experiments (see Figure 10a), we see that Hebbian descent is either on par with or outperforms gradient descent for all experiments. This is especially the case when a convolutional architecture is applied to the more difficult CIFAR data set. While the lead is less pronounced for the higher-complexity ResNet18 architecture, as can also be seen in Table S2, Hebbian descent provides a measurable benefit over gradient descent as well, and this still holds if the state-of-the-art Adam optimizer is used. If the uncommon combination of MSE loss and linear output layer is used to represent Hebbian descent (not plotted here), the picture is less clear. We still see a notable advantage for Hebbian descent in case of the simple convolutional architecture on CIFAR but slightly weaker performance in case of the simple feedforward architecture. The ResNet18 performs within a margin of error identical to the gradient descent setting.

In the autoencoder experiments (see Figure 10b), Hebbian descent performs better on average if the SGD optimizer is used. If Adam is used, gradient descent is slightly in the lead, whereas the performance difference between both is notably less pronounced than the one in the SGD experiments.

Our empirical results suggest that if nonsaturating activation functions are used on the hidden layers, Hebbian descent provides a useful perspective for deep network training in a way similar to that for shallow network training. Although it does not provide improved results across all error terms, architectures, data sets, batch sizes, and tasks, we achieved notably better or at least similar performance in the majority of experiments. Interestingly, even if we deviate from the best practice of using CE loss in conjunction with softmax in the classification experiments and pair MSE with linear activation instead, convolutional networks seem to either benefit from our alternative Hebbian descent error term or remain untouched performance-wise. While the choice of an unbounded activation function for learning one-hot vectors is suboptimal to begin with (see section 5.1.3), we nonetheless emphasize that Hebbian descent can provide tangible benefits over gradient descent here even if the activation function does not perfectly match the criteria defined by the problem at hand.

## 6 Conclusion

In this study, we introduce Hebbian descent as a theoretical framework for avoiding negative effects of saturating output-layer activation functions in artificial neural networks, particularly in continual learning scenarios. By reviewing the relevant literature, we situate Hebbian descent within a lineage of similar concepts in section 1 and demonstrate that it unifies Hebbian learning, the covariance rule, and generalized linear models under a common principle (see section 3). The approach is operationalized by means of an alternative loss function for gradient descent, denoted as the Hebbian descent loss. This loss function is analogous to a weight update rule for the last layer of neural networks but with the derivative of the activation function removed. We demonstrate empirically that Hebbian descent can be used as a replacement for gradient descent as well as Hebbian learning as it inherits their advantages while not suffering from their disadvantages.

While in deep networks the hidden-layer activation function can usually be chosen as desired (for example, rectified linear units), employing saturating activation functions in the output layer may be a necessity dictated by specific network applications. Here, Hebbian descent provides a solution complementary to nonsaturating hidden-layer activation functions: it mitigates the potentially negative effects of saturating activation functions in the output layer, where a vanishing error term would inevitably affect parameter updates in all previous layers.

In section 3.2, we argued that the major drawbacks of Hebbian learning are its difficulty in dealing with correlated or redundant input data and its inability to profit from seeing training patterns several times. For gradient descent, we identified the derivative of the final-layer activation function as problematic, as it can lead to a vanishing error term that prevents efficient continual learning (see section 2). Hebbian descent addresses both of these problems. It exhibits the same convergence properties as gradient descent (provided that the derivative of the output-layer activation function is strictly positive), does not suffer from the vanishing error term problem, can deal with correlated data, profits from seeing patterns several times, and enables successful continual learning in single-layer networks when centering is used.
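The vanishing error term mentioned above can be made concrete with a small numerical sketch. It assumes a single sigmoid output unit trained with the mean squared error, so that the gradient descent error term is $(h_j-t_j)\phi'(a_j)$ and the Hebbian descent error term is $h_j-t_j$; the function names are illustrative and not taken from PyDeep.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gd_error_term(a, t):
    """Gradient descent error term for MSE: (h - t) * phi'(a)."""
    h = sigmoid(a)
    return (h - t) * h * (1.0 - h)  # phi'(a) = h * (1 - h) for the sigmoid

def hd_error_term(a, t):
    """Hebbian descent error term: the factor phi'(a) is dropped."""
    return sigmoid(a) - t

# Target t = 1 while the unit is pushed deeper into the wrong saturated region.
for a in (0.0, -5.0, -10.0):
    print(f"a={a:6.1f}  GD: {gd_error_term(a, 1.0):+.6f}  HD: {hd_error_term(a, 1.0):+.6f}")
```

In this sketch the gradient descent term shrinks toward zero as the unit saturates, even though the output is almost maximally wrong, whereas the Hebbian descent term stays close to $-1$.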

We show analytically that:

**1.** The Hebbian descent update can generally be understood as a gradient descent update in which the derivative of the activation function of the output layer is removed (see section 2).

**2.** In case of a strictly positive derivative of the activation function in the output layer, Hebbian descent leads to the same update rule as gradient descent with a different loss function we name Hebbian descent loss (see equation 3.1). It thus inherits the convergence properties but even converges empirically when the derivative of the activation function is merely positive.

**3.** In the case of the mean squared error loss, Hebbian descent can be understood as the difference of a supervised and an unsupervised Hebbian learning step (see section 3.2), and with an invertible and integrable activation function Hebbian descent actually optimizes a generalized linear model (see section 3.3).
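The first two analytical statements above can be checked numerically. The following is a minimal sketch, assuming a centered single-layer network with sigmoid outputs, $a_j=\sum_i w_{ij}(x_i-\mu_i)+b_j$, and the mean squared error as the original loss; all function names and the toy values are hypothetical. It compares the Hebbian descent update with a finite-difference gradient of the cross-entropy loss, which is the Hebbian descent loss for this pairing.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(w, b, x, mu):
    """h_j = sigmoid(sum_i w[i][j] * (x[i] - mu[i]) + b[j])."""
    return [sigmoid(sum(w[i][j] * (x[i] - mu[i]) for i in range(len(w))) + b[j])
            for j in range(len(b))]

def hebbian_descent_step(w, b, x, mu, t, eta):
    """Return the updates (dw, db); phi'(a) does not appear anywhere."""
    h = forward(w, b, x, mu)
    dw = [[-eta * (x[i] - mu[i]) * (h[j] - t[j]) for j in range(len(b))]
          for i in range(len(w))]
    db = [-eta * (h[j] - t[j]) for j in range(len(b))]
    return dw, db

def cross_entropy(w, b, x, mu, t):
    """Hebbian descent loss for the MSE/sigmoid pairing."""
    h = forward(w, b, x, mu)
    return sum(-tj * math.log(hj) - (1.0 - tj) * math.log(1.0 - hj)
               for tj, hj in zip(t, h))

def numerical_grad_w(w, b, x, mu, t, i, j, eps=1e-6):
    """Central finite difference of the cross-entropy w.r.t. w[i][j]."""
    w_plus = [row[:] for row in w]
    w_minus = [row[:] for row in w]
    w_plus[i][j] += eps
    w_minus[i][j] -= eps
    return (cross_entropy(w_plus, b, x, mu, t)
            - cross_entropy(w_minus, b, x, mu, t)) / (2.0 * eps)

# Toy values chosen only for illustration.
w = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
b = [0.0, 0.1]
x = [1.0, 0.0, 1.0]
mu = [0.5, 0.5, 0.5]
t = [1.0, 0.0]

dw, db = hebbian_descent_step(w, b, x, mu, t, eta=1.0)
max_diff = max(abs(dw[i][j] + numerical_grad_w(w, b, x, mu, t, i, j))
               for i in range(len(w)) for j in range(len(b)))
print("largest deviation from -grad(cross-entropy):", max_diff)
```

With a learning rate of 1, the update should coincide with the negative cross-entropy gradient up to finite-difference error, which is the sense in which Hebbian descent is gradient descent on a different loss.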

Our empirical results suggest that:

**4.** All update rules considered in this work profit from centering (see section 5.1.2).

**5.** Hebbian descent performs significantly better in continual learning than all other considered update rules (see section 5.1.2). Only Hebbian descent with centering shows an inherent and plausible curve of forgetting so that no additional forgetting mechanism like a weight decay term is required (see section 5.1.4).

**6.** In mini-batch learning for shallow and deep neural networks, the Hebbian descent loss performs better than or similar to the original loss with gradient descent (see section 5.2).

A comparison of the properties of Hebbian descent, gradient descent, and Hebbian learning rules is given in Table 1. The empirical evaluation of less conventional Hebbian descent losses (see appendix A) in deep neural networks is an interesting direction for future research. Future work might also focus on Hebbian descent learning in spiking neural networks, since an implementation of Hebbian descent through spike-timing-dependent plasticity appears straightforward.

Table 1: Properties of gradient descent (GD), Hebbian learning and the covariance rule (HL/CR), and Hebbian descent (HD).

| | GD | HL/CR | HD |
|---|---|---|---|
| Continual learning | | $+$ | $+$ |
| Multi-epoch/batch learning | $+$ | | $+$ |
| Convergent and stable | $+$ | | $+$ |
| Profits from centering | $+$ | $+$ | $+$ |
| Can deal with correlated patterns | $+$ | | $+$ |
| No vanishing error term | | $+$ | $+$ |
| No catastrophic interference | | $+$ | $+$ |
| Inherent plausible forgetting mechanism | | | $+$ |

Notes: A plus sign indicates that the learning rule has the specified property. All properties are formulated so that a plus is advantageous.

## Appendix A: List of Hebbian Descent Loss Functions

For $L_{GD}(t_j,h_j)=\frac{1}{2}(h_j-t_j)^2$:

| Activation | $h_j=\phi(a_j)$ | $\frac{E(t_j,h_j)}{\phi'(a_j)}=\frac{h_j-t_j}{\phi'(a_j)}$ | $L_{HD}(t_j,h_j)=\int\frac{E(t_j,h_j)}{\phi'(a_j)}\,dh_j$ |
|---|---|---|---|
| Linear | $a_j$ | $\frac{h_j-t_j}{1}$ | $\frac{1}{2}(h_j-t_j)^2$ (squared error loss) |
| Sigmoid | $\frac{1}{1+\exp(-a_j)}$ | $\frac{h_j-t_j}{h_j(1-h_j)}$ | $-t_j\ln(h_j)-(1-t_j)\ln(1-h_j)$ (cross-entropy loss) |
| Softmax (Bridle, 1990) | $\frac{\exp(a_j)}{\sum_k\exp(a_k)}$ | $\frac{h_j-t_j}{h_j(\delta_{ij}-h_i)}$ | $-\sum_j t_j\ln(h_j)$ (cross-entropy loss) |
| Scaled hyperbolic tangent | $\alpha\tanh(a_j)$ | $\frac{h_j-t_j}{\alpha(1-h_j^2)}$ | $-\frac{1}{2\alpha}(1+t_j)\ln(1+h_j)-\frac{1}{2\alpha}(1-t_j)\ln(1-h_j)$ |
| Approx. step ($\alpha\to 0$) | $\begin{cases}\alpha a_j & \text{for } a_j<0\\ \alpha a_j+\beta & \text{for } a_j\geq 0\end{cases}$ | $\frac{h_j-t_j}{\alpha}$ | $\frac{1}{2\alpha}(h_j-t_j)^2$ |
| Leaky rectifier (Maas et al., 2013) | $\begin{cases}\alpha a_j & \text{for } a_j<0\\ a_j & \text{for } a_j\geq 0\end{cases}$ | $\begin{cases}\frac{h_j-t_j}{\alpha} & \text{for } a_j<0\\ \frac{h_j-t_j}{1} & \text{for } a_j\geq 0\end{cases}$ | $\begin{cases}\frac{1}{2\alpha}(h_j-t_j)^2 & \text{for } a_j<0\\ \frac{1}{2}(h_j-t_j)^2 & \text{for } a_j\geq 0\end{cases}$ |
| Scaled exp. linear (Klambauer et al., 2017) | $\begin{cases}\lambda\alpha(\exp(a_j)-1) & \text{for } a_j<0\\ \lambda a_j & \text{for } a_j\geq 0\end{cases}$ | $\begin{cases}\frac{h_j-t_j}{h_j+\lambda\alpha} & \text{for } a_j<0\\ \frac{h_j-t_j}{\lambda} & \text{for } a_j\geq 0\end{cases}$ | $\begin{cases}h_j-(t_j+\lambda\alpha)\log(h_j+\lambda\alpha) & \text{for } a_j<0\\ \frac{1}{2\lambda}(h_j-t_j)^2 & \text{for } a_j\geq 0\end{cases}$ |
| Inv. sqrt. (Carlile et al., 2017) | $\frac{a_j}{\sqrt{1+\alpha a_j^2}}$ | $\frac{h_j-t_j}{(\alpha a_j^2+1)^{-3/2}}$ | $\frac{1}{2}(\alpha a_j^2+1)^{3/2}(h_j-t_j)^2$ |
| Inv. sqrt. linear (Carlile et al., 2017) | $\begin{cases}\frac{a_j}{\sqrt{1+\alpha a_j^2}} & \text{for } a_j<0\\ a_j & \text{for } a_j\geq 0\end{cases}$ | $\begin{cases}\frac{h_j-t_j}{(\alpha a_j^2+1)^{-3/2}} & \text{for } a_j<0\\ \frac{h_j-t_j}{1} & \text{for } a_j\geq 0\end{cases}$ | $\begin{cases}\frac{1}{2}(\alpha a_j^2+1)^{3/2}(h_j-t_j)^2 & \text{for } a_j<0\\ \frac{1}{2}(h_j-t_j)^2 & \text{for } a_j\geq 0\end{cases}$ |
| SoftSign (Bergstra et al., 2009) | $\frac{a_j}{1+\lvert a_j\rvert}$ | $\frac{h_j-t_j}{(1+\lvert a_j\rvert)^{-2}}$ | $\frac{1}{2}(1+\lvert a_j\rvert)^{2}(h_j-t_j)^2$ |
| SoftPlus (Glorot et al., 2011) | $\log(1+\exp(a_j))$ | $\frac{h_j-t_j}{(1+\exp(-a_j))^{-1}}$ | $\frac{1}{2}(1+\exp(-a_j))(h_j-t_j)^2$ |

For $L_{GD}(t_j,h_j)=\begin{cases}1-t_jh_j & \text{for } t_jh_j<1\\ \alpha h_j & \text{for } t_jh_j\geq 1\end{cases}$ with $\alpha>0$, $\alpha\to 0$:

| Activation | $h_j=\phi(a_j)$ | $\frac{E(t_j,h_j)}{\phi'(a_j)}=\begin{cases}\frac{-t_j}{\phi'(a_j)} & \text{for } t_jh_j<1\\ \frac{\alpha}{\phi'(a_j)} & \text{for } t_jh_j\geq 1\end{cases}$ | $L_{HD}(t_j,h_j)=\int\frac{E(t_j,h_j)}{\phi'(a_j)}\,dh_j$ |
|---|---|---|---|
| Linear | $a_j$ | $\begin{cases}\frac{-t_j}{1} & \text{for } t_jh_j<1\\ \frac{\alpha}{1} & \text{for } t_jh_j\geq 1\end{cases}$ | $\begin{cases}1-t_jh_j & \text{for } t_jh_j<1\\ \alpha h_j & \text{for } t_jh_j\geq 1\end{cases}$ (leaky version of the hinge loss) |
| Sigmoid | $\frac{1}{1+\exp(-a_j)}$ | $\begin{cases}\frac{-t_j}{h_j(1-h_j)} & \text{for } t_jh_j<1\\ \frac{\alpha}{h_j(1-h_j)} & \text{for } t_jh_j\geq 1\end{cases}$ | $\begin{cases}-t_j\ln(h_j)+t_j\ln(1-h_j) & \text{for } t_jh_j<1\\ \alpha\ln(h_j)-\alpha\ln(1-h_j) & \text{for } t_jh_j\geq 1\end{cases}$ |

For $L_{GD}(t_j,h_j)=\frac{\alpha}{\beta}\ln(\cosh(\beta(h_j-t_j)))$:

| Activation | $h_j=\phi(a_j)$ | $\frac{E(t_j,h_j)}{\phi'(a_j)}=\frac{\alpha\tanh(\beta(h_j-t_j))}{\phi'(a_j)}$ | $L_{HD}(t_j,h_j)=\int\frac{E(t_j,h_j)}{\phi'(a_j)}\,dh_j$ |
|---|---|---|---|
| Linear | $a_j$ | $\frac{\alpha\tanh(\beta(h_j-t_j))}{1}$ | $\frac{\alpha}{\beta}\ln(\cosh(\beta(h_j-t_j)))$ (a smooth version of the Huber loss) |

Note: For notational simplicity, the losses are given for a single output; for multiple outputs, simply use $L_{HD}(\mathbf{t},\mathbf{h})=\sum_j L_{HD}(t_j,h_j)$.

This illustrates the simplicity of Hebbian descent: we do not need to calculate the derivatives of the activation function at all. Instead, we use the mean squared error loss and drop the derivatives of the activation function when calculating the gradient.
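As a sanity check on a few of the entries listed in this appendix, the following sketch differentiates the listed Hebbian descent loss numerically with respect to the pre-activation $a_j$ and compares the result with $h_j-t_j$. The row selection, the slope `LEAK`, the rounded SELU constants, and all function names are assumptions made for this illustration only.

```python
import math

LAMBDA, ALPHA = 1.0507, 1.6733  # rounded SELU constants, used only for this check
LEAK = 0.1                      # hypothetical leaky-rectifier slope

# Each entry maps a name to (phi(a), L_HD(t, h, a)) for the MSE-based losses.
ROWS = {
    "linear": (
        lambda a: a,
        lambda t, h, a: 0.5 * (h - t) ** 2,
    ),
    "sigmoid": (
        lambda a: 1.0 / (1.0 + math.exp(-a)),
        lambda t, h, a: -t * math.log(h) - (1.0 - t) * math.log(1.0 - h),
    ),
    "leaky_rectifier": (
        lambda a: LEAK * a if a < 0 else a,
        lambda t, h, a: (h - t) ** 2 / (2.0 * LEAK) if a < 0 else 0.5 * (h - t) ** 2,
    ),
    "scaled_exp_linear": (
        lambda a: LAMBDA * ALPHA * (math.exp(a) - 1.0) if a < 0 else LAMBDA * a,
        lambda t, h, a: (h - (t + LAMBDA * ALPHA) * math.log(h + LAMBDA * ALPHA)
                         if a < 0 else (h - t) ** 2 / (2.0 * LAMBDA)),
    ),
}

def error_signal(name, a, t, eps=1e-6):
    """Central finite difference of L_HD(t, phi(a)) with respect to a."""
    phi, loss = ROWS[name]
    return (loss(t, phi(a + eps), a + eps) - loss(t, phi(a - eps), a - eps)) / (2.0 * eps)

t = 0.3
for name, (phi, _) in ROWS.items():
    for a in (-0.7, 0.8):  # stay away from the kink at a = 0
        print(f"{name:18s} a={a:+.1f}  dL/da={error_signal(name, a, t):+.6f}  h-t={phi(a) - t:+.6f}")
```

For these rows the two printed columns should agree up to finite-difference error, reflecting that gradient descent on $L_{HD}$ yields the error term $h_j-t_j$ without the factor $\phi'(a_j)$.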

## Appendix B: Hebbian Descent Loss Is the General Log-Likelihood Loss

## Acknowledgments

We thank Amir Hossein Azizi and Mehdi Bayati for helpful discussions on Hebbian learning.

## Notes

^{1}

Note that the average here can refer to the average activity over the batch, the training data seen so far, or the whole training data.

^{2}

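The elementwise derivation of the error signal is

$$\Delta_{GD} w_{ij} \overset{(2.6)}{=} -\eta\,\frac{\partial L_{HD}(t_j,h_j)}{\partial h_j}\frac{\partial h_j}{\partial a_j}\frac{\partial a_j}{\partial w_{ij}} \overset{(2.1)}{=} -\eta\,\frac{\partial L_{HD}(t_j,h_j)}{\partial h_j}\frac{\partial \phi(a_j)}{\partial a_j}\frac{\partial\left(\sum_i^N w_{ij}(x_i-\mu_i)+b_j\right)}{\partial w_{ij}} = -\eta\,(x_i-\mu_i)\,E(t_j,h_j)\,\phi'(a_j),$$

$$\Delta_{GD} b_{j} \overset{(2.6,\,2.1)}{=} -\eta\,\frac{\partial L_{HD}(t_j,h_j)}{\partial h_j}\frac{\partial \phi(a_j)}{\partial a_j}\frac{\partial\left(\sum_i^N w_{ij}(x_i-\mu_i)+b_j\right)}{\partial b_j} = -\eta\,E(t_j,h_j)\,\phi'(a_j).$$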

^{3}

Although this is trivially the case with a linear activation function.

^{4}

Machine learning library PyDeep: https://pydeep.readthedocs.io/, Example for Hebbian descent: https://github.com/MelJan/PyDeep/blob/master/examples/Example_Hebbian_descent.py.

^{5}

We used the following learning rates in this work: 0.00002, 0.00004, 0.00006, 0.00008, 0.0001, 0.0002, 0.0004, 0.0006, 0.0008, 0.001, 0.002, 0.004, 0.006, 0.008, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.4, 0.6, 0.8, 1, 2, 4, 6, 8, 10, 20, 40, 60, 80, 100.

^{6}

We used the following weight decay values in this work: 0.0, 0.0001, 0.0005, 0.001, 0.002, 0.004, 0.006, 0.008, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1, 0.2, 0.4, 0.6, 0.8, 1.0, 2.0.

^{7}

Note that a different number of patterns (e.g., 50 or 100 patterns) can be chosen for selecting the optimal hyperparameters, which, however, does not alter the conclusion of our analysis.

## References

*Comptes Rendus de l’Académie des Sciences, Series III: Sciences de la Vie*

*Quadratic polynomials learn better image features.*

*Journal of Physics A: Mathematical and General*

*Journal of Neuroscience*

*Neurocomputing*

*Improving deep learning by inverse square root linear units (ISRLUs).*

*Proceedings of the International Joint Conference on Neural Networks*

*Proceedings of the International Joint Conference on Neural Networks*

*Fast and accurate deep network learning by exponential linear units (ELUS).*

*Proceedings of the International Conference on Artificial Intelligence and Statistics*

*Journal of Machine Learning Research*

*Künstliche Intelligenz*

*Proceedings of the 1988 Connectionist Models Summer School*

*Encyclopedia of Cognitive Science*

*Connection Science*

*Competition and cooperation in neural nets*

*IEEE Transactions on neural networks*

*Proceedings of the International Conference on Artificial Intelligence and Statistics*

*Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics*

*Proceedings of the 35th International Conference on Machine Learning*

*Nature*

*Advances in neural information processing systems*

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*

*The organization of behavior: A neuropsychological approach*

*IEEE Transactions on Neural Networks*

*Artificial Intelligence*

*Neural Computation*

*Untersuchungen zu dynamischen neuronalen netzen.*

*Neural Computation*

*Batch normalization: Accelerating deep network training by reducing internal covariate shift.*

*Advances in neural information processing systems*

*ARS Journal*

*Adam: A method for stochastic optimization.*

*Proceedings of the National Academy of Sciences*

*Advances in neural information processing systems*

*International Journal of Electrical and Electronics Engineering*

*Neural Computation*

*Neural networks: Tricks of the trade*

*Neural Networks*

*Annals of Applied Probability*

*Proceedings of the International Conference on Machine Learning*

*From the retina to the neocortex*

*Psychological Review*

*Psychology of learning and motivation*

*Generalized linear models*

*On the importance of centering in artificial neural networks*

*Journal of Machine Learning Research*

*Learning longer memory in recurrent neural networks*

*Lecture Notes in Computer Science*

*Connectionist models*

*PLOS Computational Biology*

*Proceedings of the International Joint Conference on Neural Networks, 2003*

*Neural Processing Letters*

*Journal of Mathematical Biology*

*Neural Networks*

*Proceedings of the 30th International Conference on Machine Learning*

*Psychological Review*

*Proceedings of the IEEE International Conference on Neural Networks*

*Herbert Robbins selected papers*

*Connection Science*

*Psychological Review*

*Nature*

*Parallel distributed processing: Explorations in the microstructure of cognition*

*Recent advances in recurrent neural networks*

*Neural Networks*

*One-shot learning with memory-augmented neural networks*

*Neural networks: Tricks of the trade*

*Neural models of plasticity: Experimental and theoretical approaches*

*Proceedings of the International Conference on Machine Learning*

*Transactions of the American Nuclear Society*

*International Journal of Engineering Development and Research*

*1960 IRE WESCON Convention Record, Part 4*

*Neural Computation*

*Proceedings of the 2002 IEEE Region 10 Conference on Computers, Communications, Control and Power Engineering*