Abstract
Training spiking neural networks to approximate universal functions is essential for studying information processing in the brain and for neuromorphic computing. Yet the binary nature of spikes poses a challenge for direct gradient-based training. Surrogate gradients have been empirically successful in circumventing this problem, but their theoretical foundation remains elusive. Here, we investigate the relation of surrogate gradients to two theoretically well-founded approaches. On the one hand, we consider smoothed probabilistic models, which, due to the lack of support for automatic differentiation, are impractical for training multilayer spiking neural networks but provide derivatives equivalent to surrogate gradients for single neurons. On the other hand, we investigate stochastic automatic differentiation, which is compatible with discrete randomness but has not yet been used to train spiking neural networks. We find that the latter gives surrogate gradients a theoretical basis in stochastic spiking neural networks, where the surrogate derivative matches the derivative of the neuronal escape noise function. This finding supports the effectiveness of surrogate gradients in practice and suggests their suitability for stochastic spiking neural networks. However, surrogate gradients are generally not gradients of a surrogate loss despite their relation to stochastic automatic differentiation. Nevertheless, we empirically confirm the effectiveness of surrogate gradients in stochastic multilayer spiking neural networks and discuss their relation to deterministic networks as a special case. Our work gives theoretical support to surrogate gradients and the choice of a suitable surrogate derivative in stochastic spiking neural networks.
1 Introduction
Our brains efficiently process information in spiking neural networks (SNNs) that communicate through short, stereotypical electrical pulses called spikes. SNNs are an indispensable tool for understanding information processing in the brain and for instantiating similar capabilities in silico. Like conventional artificial neural networks (ANNs), SNNs require training to implement specific functions. However, SNN models, which are commonly simulated in discrete time (Eshraghian et al., 2023), are not differentiable due to the binary nature of the spike, which precludes the use of standard gradient-based training techniques based on backpropagation (BP; Rumelhart et al., 1986). There are several ways to overcome this problem. One can dispense with hidden layers altogether (Gütig & Sompolinsky, 2006; Memmesheimer et al., 2014), but this limits the network’s expressivity and precludes solving more complex tasks. Alternatively, one can make the neuron model differentiable (Huh & Sejnowski, 2018) or consider the timing of existing spikes, for which gradients exist (Bohte et al., 2002; Mostafa, 2017; Wunderlich & Pehle, 2021; Klos & Memmesheimer, 2023). This approach, however, requires additional mechanisms to create or remove spikes, as well as a way of dealing with silent neurons, for which no spike time is defined. Finally, one can replace the gradient with a suitable surrogate (Bohte, 2011; Courbariaux et al., 2016; Esser et al., 2016; Zenke & Ganguli, 2018; Bellec et al., 2018; Neftci et al., 2019; Eshraghian et al., 2023). In this article, we focus on the latter. Surrogate gradient (SG) approaches are empirically successful, work with nondifferentiable neuron models, and are not limited to changing the timing of existing spikes.
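To make this concrete, the following minimal sketch shows the core mechanics of the SG approach for a single binary unit: the forward pass applies the hard threshold, while the backward pass substitutes the derivative of a sigmoid for the ill-defined derivative of the step. The function names and the sigmoid surrogate are our own illustrative choices, not a prescription from any particular method.

```python
import math

def heaviside(x):
    """Deterministic spike: 1 if the membrane potential reaches threshold."""
    return 1.0 if x >= 0.0 else 0.0

def surrogate_grad(x, beta=1.0):
    """Surrogate derivative: the slope of a sigmoid with steepness beta,
    used in place of the ill-defined derivative of the step function."""
    s = 1.0 / (1.0 + math.exp(-beta * x))
    return beta * s * (1.0 - s)

# The forward pass uses the hard step; the backward pass substitutes
# the smooth surrogate wherever d(step)/dx would appear.
v = 0.3                 # membrane potential minus threshold
spike = heaviside(v)    # forward: binary output
grad = surrogate_grad(v)  # backward: surrogate in place of d(spike)/dv
```

In a full training loop, `grad` would be chained with the remaining partial derivatives via BP, exactly as in a differentiable network.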
However, at present, SGs are a heuristic that lacks a theoretical foundation. Consequently, we still do not understand why SG descent works well in practice (Zenke & Vogels, 2021). We do not know whether it is optimal in any sense, nor whether there are better strategies for updating the weights. To address this gap in our understanding, we analyze the commonalities and differences of SGs with theoretically well-founded approaches based on stochastic networks. Specifically, we focus on smoothed probabilistic models (SPMs; Neftci et al., 2019; Jang et al., 2019) and the more recently proposed stochastic automatic differentiation (stochAD) framework (Arya et al., 2022). SPMs typically rely on stochasticity to smooth the optimization landscape in expectation, which enables gradient computation on the smoothed loss. In this article, we particularly focus on neuron models with escape noise (Gerstner et al., 2014), which are commonly used to smooth the nondifferentiable spikes in expectation and for which exact gradients are computable (Pfister et al., 2006; Brea et al., 2013). However, extending SPMs to multilayer neural networks has been difficult because they preclude the use of BP; hence, efficient gradient estimation requires additional approximations (Gardner et al., 2015). In contrast, stochAD is a recently developed framework for automatic differentiation (AD) in programs with discrete randomness, that is, with discrete random variables. While this opens the door for BP or other recursive gradient computation schemes, the framework has not yet been applied to SNNs.
Here, we jointly analyze the above methods, elucidate their relations, provide a rigorous theoretical foundation for SG-descent in stochastic SNNs, and discuss their implications for deterministic SNNs. We first provide some background information on each of the above methods before starting our analysis with the case of a single perceptron. From there, we move to the more complex case of multilayer perceptrons (MLPs), where we elaborate the theoretical connection between SGs and stochAD and end with multilayer networks of leaky integrate-and-fire (LIF) neurons. Finally, we substantiate the theoretical results with empirical simulations.
2 Background on Surrogate Gradients, Smoothed Probabilistic Models, and Stochastic Automatic Differentiation
Before analyzing the relation between SGs, gradients of the log-likelihood in SPMs, and stochAD in section 3, we briefly review these methods here. While their common goal is to compute reasonable weight updates, the first two approaches are specifically designed for training SNNs. The third approach aims to compute unbiased gradients of arbitrary functions with discrete randomness based on AD.
2.1 Smoothed Probabilistic Models
2.2 Surrogate Gradients
2.3 Stochastic Surrogate Gradients
2.4 Stochastic Automatic Differentiation
Figure 1: SDs are equivalent to derivatives of expected outputs in SPMs and smoothed stochastic derivatives in binary perceptrons. (A) Membrane potential dynamics of an LIF neuron (maroon) in comparison to the perceptron. When input spikes (from input neurons and ) are received, they excite the LIF neuron, which causes the membrane potential to increase. Once it reaches the threshold, output spikes are emitted. In the limit of a large simulation time step () and appropriate scaling of the input currents, the LIF neuron approximates a perceptron receiving time-locked input (gray line). (B) Left: The simplified computational graph of a deterministic (blue) and a stochastic (yellow) perceptron. Right: Forward pass in a deterministic (top) or stochastic (bottom) perceptron. The colored arrows indicate that both use the derivative of on the backward pass, which is the derivative of the expected output of the stochastic neuron in case . In the case of the deterministic neuron, this constitutes the SG used instead of the nonexisting derivative of the step function. (C) Left: Network output over multiple trials in the deterministic perceptron. Right: The SD (blue) is the derivative of a sigmoid (gray), which is used to approximate the nonexisting derivative of the step function (black). (D) Same as panel C but for the stochastic perceptron. Left: Escape noise leads to variability in the spike trains over trials. Right: The expected output follows a sigmoid, and we can compute the derivative (yellow) of the expected output.
Figure 2: Derivative computation in MLPs. Schematic of an example network for which (surrogate) derivatives are computed according to different methods. The colored arrows indicate where partial derivatives are calculated. (A) SG descent relies on the chain rule for efficient gradient computation in a deterministic MLP. Thus, the derivative of the output with respect to a given weight is factorized into its primitives, which are indicated by the colored arrows. (B) SPMs approach the problem of nondifferentiable spike trains by adding noise and then smoothing the output based on its expected value. Since this method does not allow the use of the chain rule, the derivative for each weight must be computed directly. (C) The derivative and the expected value are not interchangeable, which makes this option mathematically invalid. Furthermore, it is not possible to achieve the necessary smoothing using the expected value after such an interchange. (D) Smoothed stochastic derivatives in stochAD use the expected value of each node to compute the derivative. However, the method relies on expectation values conditioned on the activity of a specific forward pass.
Figure 3: SGs correspond to smoothed stochastic derivatives in stochastic SNNs. The tree illustrates the discrete decisions associated with the binary spike generation process at different points over different layers of a stochastic MLP. A single forward pass in the network corresponds to a specific path through the tree, which yields a specific set of spike trains. Another forward pass will result in a different path and spike patterns. Computing gradients using finite differences requires randomly sampling paths from the network and evaluating their averaged loss before and after a given weight perturbation. Although this approach is unbiased for small perturbations, the random path selection results in high variance. Furthermore, that approach is not scalable to large networks. Stochastic SG descent is equivalent to smoothed stochastic derivatives in the stochAD framework. To compute the gradient, we roll out the network once and sample a random path in the tree, which we now keep fixed (yellow). At each node, we compute the expected output given the fixed activation of the previous layer , which yields a low-variance estimate (see the inset: spike raster, selected trial shown in yellow, spike trains of other trials in gray, expectation shown as shaded overlay). By choosing a surrogate function that matches the escape noise process, both methods give the same derivative for a spike with respect to the membrane potential. Deterministic SG descent can be seen as a special case in which the random sampling of the path is replaced by a point estimate given by the deterministic rollout (blue).
3 Analysis of the Relation between SGs, SPMs, and stochAD for Direct Training of SNNs
To fathom the theoretical foundation of SG learning, we focused on the theoretically well-grounded SPMs and the recently proposed stochAD framework. Because both approaches assume stochasticity whereas SGs are typically applied in deterministic settings, we focus our analysis on stochastic networks (see section 2). We later discuss deterministic networks as a special case.
To further ease the analysis, we start with binary perceptrons, thereby neglecting all temporal dynamics and the reset of conventional spiking neuron models, while retaining the essential binary spike generation process (see Figure 1A and section 8.2). We begin our comparison by examining a single neuron before moving on to MLPs. We defer the discussion of LIF neurons to section 5.
3.1 Analysis of the Binary Perceptron
Before we compare the different methods of computing gradients or SGs, we define our deterministic and stochastic perceptron here.
3.1.1 Deterministic Perceptron
3.1.2 Stochastic Perceptron
We now compare SG descent in a deterministic and a stochastic unit. To do so, we first incorporate the spike generation of the stochastic binary perceptron into the threshold and reformulate the stochastic firing using . The stochastic threshold is given by , where is randomly drawn noise from a zero-mean distribution. This formulation is equivalent to stochastic firing as in equation 3.4 for a suitable choice of the distribution underlying (see equation 8.5). The stochastic SD is then as in equation 2.2. Since the input and the weights are the same in both cases, both units have the same membrane potential. Because the SD depends only on the membrane potential, the resulting weight updates are equivalent provided that both use the same (see equations 2.1 and 2.2). Thus, the resulting SDs in the stochastic and deterministic perceptron are equivalent, although the outputs of the two units are generally different (see Figures 1C and 1D), and hence the overall weight update is different.
We note that the right-hand sides of expressions 3.5 and 3.3 are the same when setting . Thus, the derivative of the expected output of the stochastic perceptron is equivalent to the SD of the output of the deterministic perceptron, provided the SD is chosen to match the derivative of the neuronal escape noise (see Figure 1B).
In summary, when analyzing different ways of computing or approximating derivatives, the resulting expressions are identical in single units if the SD matches the derivative of the escape noise function of the stochastic neuron model. In the single-neuron case, this match is equivalent to probabilistic smoothing of the loss, a well-known fact in the literature (Pfister et al., 2006), which allows computing exact gradients of the expected value. Nevertheless, the resulting weight updates are generally different in the deterministic and stochastic cases because these expressions also depend on the neuronal output, which is different in the deterministic and stochastic neurons. We will see in the next section which equivalences are broken in multilayer networks due to these and other notable differences.
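The single-unit equivalence can be checked in a few lines. In the sketch below (our own illustrative code, with the escape noise chosen as a sigmoid of the membrane potential), the deterministic and stochastic perceptron receive the same membrane potential and use the same SD on the backward pass, while their forward outputs generally differ.

```python
import math
import random

def sigmoid(x, beta=2.0):
    return 1.0 / (1.0 + math.exp(-beta * x))

def d_sigmoid(x, beta=2.0):
    """Derivative of the escape noise function, used as the SD."""
    s = sigmoid(x, beta)
    return beta * s * (1.0 - s)

def deterministic_perceptron(v):
    """Hard threshold on the membrane potential."""
    return 1.0 if v >= 0.0 else 0.0

def stochastic_perceptron(v, rng):
    """Escape noise: fire with probability sigma(v)."""
    return 1.0 if rng.random() < sigmoid(v) else 0.0

rng = random.Random(0)
v = 0.5  # same membrane potential for both units
# Both units use the same surrogate derivative on the backward pass,
# matched to the escape noise of the stochastic unit; only the
# forward outputs (and hence the overall weight updates) differ.
sd = d_sigmoid(v)
```

Averaging the stochastic output over many trials recovers the sigmoid whose derivative serves as the SD, mirroring Figure 1D.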
3.2 Analysis of the Multilayer Perceptron
To analyze the relation of the different methods in the multilayer setting, we begin by examining SPMs, which lack support for AD and thus an efficient algorithm to compute gradients. We then further discuss how stochAD provides smooth stochastic derivatives equivalent to SGs in multilayer networks.
3.2.1 Output Smoothing in Multilayer SPMs Precludes Efficient Gradient Computation
SPMs lack support for AD because they smooth the expected loss landscape through stochasticity and therefore require the calculation of expected values at the network output. While output smoothing allows the implementation of the finite difference algorithm, this algorithm does not scale to large models and is therefore of little practical use for training ANNs (Werfel et al., 2004; Lillicrap et al., 2020). The application of AD, however, requires differentiable models, like standard ANNs, so that the chain rule can be used to decompose the gradient computation into simple primitives. This composability is the basis for efficient recursive algorithms like BP and real-time recurrent learning (RTRL; Marschall et al., 2019).
To see why SPMs do not support AD, we consider a simple example network: Let be the output of a binary neural network with input and two hidden layers with activities , (see Figure 2). The output has the firing probability as in equation 3.4 and the hidden layers have the firing probabilities and , respectively. We are looking for a closed-form expression of the derivative of the expected output for each parameter, for example, for weight (see Figure 2B). In a deterministic and differentiable network, one can use the chain rule to split the expression into a product of partial derivatives as (see Figure 2A). However, this is not possible for SPMs because , where the right-hand side would involve the derivative of the nondifferentiable binary output.
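The failure of naively pushing expectations through the layers can be seen in a toy calculation. The sketch below (hypothetical numbers; the sigmoid nonlinearity is our own choice) shows that for a Bernoulli spike and a nonlinear downstream function, the expectation of the output and the output of the expectation differ substantially, so smoothing at the network output cannot be decomposed layer by layer.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# A hidden unit spikes with probability p; the next layer applies a
# nonlinear function h to the spike. Because h is nonlinear,
# E[h(S)] differs from h(E[S]), so expectations cannot simply be
# propagated through the network.
p = 0.3
h = lambda s: sigmoid(8.0 * (s - 0.5))
exact_expectation = p * h(1.0) + (1.0 - p) * h(0.0)  # average over spike / no spike
naive_swap = h(p)                                    # plug in the mean activity
```

The gap between the two values is exactly why the interchange of derivative and expectation depicted in Figure 2C is invalid.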
3.2.2 Stochastic Automatic Differentiation Bestows Theoretical Support to Surrogate Gradients
Here we show how smoothed stochastic derivatives for stochastic binary MLPs relate to SGs. Arya et al. (2022) defined the smoothed stochastic derivative (repeated in equation 2.3 for convenience) for one realization of a random variable , where the discrete random variables are usually Bernoulli random variables in stochastic binary MLPs.
To gain an intuitive understanding of the essence of smoothed stochastic derivatives, we consider a single realization of a stochastic program, which, in our case, is a stochastic MLP. In other words, we run a single stochastic forward pass in the MLP and condition on its activity. At each point in time, each neuron either emits a one or a zero, that is, a spike or none. Thus, we can think of all these binary decisions as edges in a tree, with each node representing the spike pattern of all units in a given layer. Any rollout of a forward pass then corresponds to a particular randomly chosen path through this tree (see Figure 3). These paths originate from the same root for a given input , yet the stochasticity in all of them is independent. In this tree of possible spike patterns, it is easy to understand how different methods for gradient estimation work.
Given the above method for computing smoothed stochastic derivatives, we are now in a position to understand their relationship to SGs. Since we condition on a specific path in the tree of all possible spike patterns, we compute derivatives of the expected output only conditional on the output of the previous layer according to the chosen path at each node. This corresponds exactly to treating each node as a single unit as in section 3.1. As we saw above, for a single unit with escape noise, the derivative of the expected output is equivalent to both the corresponding smoothed stochastic derivative and the SD in that unit, with or without escape noise. Furthermore, when using SGs, the method is applied identically in the single-layer and multilayer cases, and there is always a well-defined path to condition on. SGs can thus also be understood as treating all units in each layer as single units, and the same is true for smoothed stochastic derivatives in the stochAD framework. The stochAD framework therefore provides theoretical support for SGs in stochastic networks. It also stipulates that the SD should be chosen to mirror the derivative of the neuronal escape noise, thereby extending this matching principle from the single-neuron case (Pfister et al., 2006; Brea et al., 2013) to the multilayer network setting and removing some of the ambiguity associated with choosing this function in practice (Zenke & Vogels, 2021).
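As an illustration of this correspondence, consider a minimal two-neuron chain (one hidden, one output unit; the weights and the sigmoid escape noise are our own toy choices). The forward pass samples one path of spikes; the backward pass, conditioned on that path, replaces each derivative of a spike with respect to its membrane potential by the derivative of the escape noise function:

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def forward_backward(w1, w2, x, rng):
    """One stochastic rollout plus the smoothed-stochastic / surrogate
    backward pass, conditioned on the sampled spike path."""
    v1 = w1 * x
    s1 = 1.0 if rng.random() < sigmoid(v1) else 0.0  # sample hidden spike
    v2 = w2 * s1
    s2 = 1.0 if rng.random() < sigmoid(v2) else 0.0  # sample output spike
    # Backward: with the path (s1, s2) held fixed, replace d(spike)/dv
    # by the escape-noise derivative sigma'(v) at each node.
    dy_dw1 = d_sigmoid(v2) * w2 * d_sigmoid(v1) * x
    return s2, dy_dw1

rng = random.Random(1)
out, g = forward_backward(0.8, 1.2, 1.0, rng)
```

This is precisely stochastic SG descent: a single random path through the tree of Figure 3, with escape-noise-matched SDs at every node.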
3.2.3 Surrogate Gradients in Deterministic Networks
SGs are commonly used to train deterministic SNNs. Deterministic networks differ from their stochastic counterparts in two essential ways, which lead to weight updates different from those derived above. First, while a stochastic network typically selects a different path through the tree of spike patterns in each trial, in the deterministic case there is only one path per input and parameter set. The expectation over randomly sampled paths is thus replaced by a point estimate given by the deterministic rollout. This difference introduces a selection bias relative to the stochastic estimate; in practice, however, this bias is often small (see supplementary Figure S1). Second, when training deterministic networks with SGs, the slope of the SD at threshold () is typically chosen ad hoc, with typical values on the order of one relative to the neuronal rheobase. This is in stark contrast to the “correct” slope of the corresponding asymptotic escape noise function obtained from taking the limit . Approaching this limit empirically with leads to unstable training and poor performance (Zenke & Vogels, 2021). The choice inevitably results in different weight updates. While this is a desired effect that enables training in the first place by avoiding the problem of zero or infinite weight updates, it is less clear whether this approximation has any undesirable off-target consequences. To address this question, we examine the properties of SGs in deterministic networks in the next section.
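A common parameterization of the SD slope is the derivative of a fast sigmoid, as in SuperSpike (Zenke & Ganguli, 2018); the normalization below is our own choice for illustration. The sketch shows why a very steep slope, which approaches the deterministic limit, starves training of gradient signal away from threshold:

```python
def fast_sigmoid_sd(v, beta=1.0):
    """SuperSpike-style surrogate derivative: the slope of a fast
    sigmoid with steepness beta, evaluated at distance v from threshold."""
    return beta / (1.0 + beta * abs(v)) ** 2

# A moderate slope gives useful gradient signal even away from
# threshold; a very steep slope (approaching the deterministic limit)
# makes the surrogate vanish almost everywhere.
g_moderate = fast_sigmoid_sd(0.5, beta=1.0)
g_steep = fast_sigmoid_sd(0.5, beta=1000.0)
```

At threshold (`v = 0`) the surrogate peaks at `beta`, while off threshold it decays as `1/(beta * v**2)` for large `beta`, consistent with the reported training instability in that limit.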
4 Analysis of Surrogate Gradient Properties in Deterministic Networks
By design, SGs deviate from the actual gradient, as they provide a nonzero surrogate in cases where the actual gradient is zero. A priori, it is not clear what consequences such a deviation has, or whether SGs are gradients at all, that is, whether they can be obtained by differentiating some surrogate loss. However, it is difficult to gain a quantitative understanding by comparing either to the zero vector or to the ill-defined derivative at the point of the spike. To take a closer look at how SGs deviate from actual gradients, and at the resulting consequences, we now move to differentiable network models that have a well-defined gradient. While SG training is not required in such networks, they allow us to develop a quantitative understanding of the commonalities and differences we should expect. This comparison also allows us to check whether SGs satisfy the formal criteria of gradients.
4.1 Deviations of SGs from Actual Gradients in Differentiable MLPs and the Consequences
Figure 4: SGs deviate from actual gradients in differentiable MLPs. (A) Schematic of a differentiable network (left) with sigmoid activations (right) for which we compute an SD using the derivative of a flatter sigmoid (yellow) in contrast to the actual activation (black). (B) Top row: Network output (solid gray), smoothed network output (dashed), and integrated SD (yellow) as a function of . The triangles on the -axis indicate the minimum of the corresponding curves. Bottom row: Derivatives of the top row. Left and right correspond to a flatter () and a steeper () SD. See Table 1 for network parameters. Note that the actual derivative and the surrogate can have opposite signs. (C) Heat map of the optimization landscape along and for different values (top to bottom). While the actual gradient can be asymptotically zero (see yellow dot, bottom), the SD provides a descent direction (yellow arrow), thereby enabling learning (top and middle).
Table 1: Parameter Values for Sign Flip Example.

| Parameter | | | | | | | | Input |
|---|---|---|---|---|---|---|---|---|
| Value | 0 | 0.05 | 0.1 | 1 | −1 | 100 | 25 | 1 |
Notes: Parameter values for the network in Figure 4A, with input , which serve as an example that SGs can have the opposite sign of the actual gradient and thus point in the opposite direction. Hence, SGs cannot be guaranteed to align with the actual gradient.
4.2 Surrogate Gradients Are Not Gradients of a Surrogate Loss
Figure 5: SGs are not gradients. (A) Heat map of the network output while moving along two random directions in the parameter space of the example network with step activation (top) and sigmoid activation (bottom; see Figure 4A). The circles indicate a closed integration path through parameter space, starting at the arrowhead. (B) Integral values of SDs as a function of the angle along the closed circular path shown in panel A. Different shades of yellow correspond to different values of . The blue lines correspond to the actual output of the network with step activation function (solid) or sigmoid activation (dashed). The integrated actual derivative of the network with sigmoid activation matches the output (dashed line) and is thus not visible in the plot. (C) Absolute difference between actual loss value and integrated SD as a function of the number of integration steps. The numerical integrals converge to finite values. Thus the observed difference is not an artifact of the numerical integration.
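One way to probe this numerically is to check whether the surrogate "vector field" has zero curl, which any true gradient field must. The toy two-parameter network below is our own construction (a steep sigmoid forward activation with a flatter surrogate on the backward pass): its surrogate field has a clearly nonzero curl, whereas the exact gradient field does not.

```python
import math

def s(x):
    return 1.0 / (1.0 + math.exp(-x))

def ds(x):
    v = s(x)
    return v * (1.0 - v)

def f(v):   # steep forward activation
    return s(4.0 * v)

def df(v):  # its true derivative
    return 4.0 * ds(4.0 * v)

def surrogate_field(w1, w2, x=1.0):
    """Backward pass that swaps f' for the flatter surrogate ds."""
    v1 = w1 * x; h1 = f(v1); v2 = w2 * h1
    return ds(v2) * w2 * ds(v1) * x, ds(v2) * h1

def true_field(w1, w2, x=1.0):
    """Exact gradient of the network output y = f(w2 * f(w1 * x))."""
    v1 = w1 * x; h1 = f(v1); v2 = w2 * h1
    return df(v2) * w2 * df(v1) * x, df(v2) * h1

def curl(field, w1, w2, eps=1e-5):
    # d g1/d w2 minus d g2/d w1; nonzero means the field cannot be
    # the gradient of any scalar (surrogate) loss.
    dg1_dw2 = (field(w1, w2 + eps)[0] - field(w1, w2 - eps)[0]) / (2 * eps)
    dg2_dw1 = (field(w1 + eps, w2)[1] - field(w1 - eps, w2)[1]) / (2 * eps)
    return dg1_dw2 - dg2_dw1
```

The nonzero curl is the differential counterpart of the nonvanishing closed-path integrals in Figure 5B.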
5 From Perceptrons to LIF Neurons
where describes the synaptic weights between the neuron and the input neuron , and , where is the time step, is the synaptic time constant, and is the membrane time constant. While the first two equations characterize the linear temporal dynamics of the synaptic current and membrane potential , the last equation captures the nonlinearity of the neuron, its spiking output . Thus, in this formulation, we can think of an LIF neuron as a binary perceptron whose inputs are first processed through a linear filter cascade (i.e., the neuronal dynamics) and additionally have a reset mechanism given by the last factor in equation 5.2.
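A discrete-time sketch of these dynamics makes the filter-cascade view explicit. This is one possible discretization; the decay factors, the input scaling, and the multiplicative reset below are our own assumptions rather than the exact parameterization used in the experiments.

```python
def lif_step(i_syn, v, spikes_in, w, alpha, beta, theta=1.0):
    """One discrete-time LIF update: a linear filter for the synaptic
    current, a linear filter for the membrane potential, a hard
    threshold, and a reset."""
    # Synaptic current: leaky integration of weighted input spikes.
    i_syn = alpha * i_syn + sum(wj * sj for wj, sj in zip(w, spikes_in))
    # Membrane potential: leaky integration of the synaptic current.
    v = beta * v + (1.0 - beta) * i_syn
    # Spiking nonlinearity: the binary perceptron at the end of the cascade.
    spike = 1.0 if v >= theta else 0.0
    # Reset: multiplicative, as in the last factor of equation 5.2.
    v = v * (1.0 - spike)
    return i_syn, v, spike
```

Everything up to the threshold is linear, so on the backward pass only the spike nonlinearity needs a surrogate; the filters are differentiated exactly.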
6 Surrogate Gradients Are Ideally Suited for Training Stochastic SNNs
We have seen that SG descent is theoretically justified by stochAD, with the strongest support for stochastic spiking neuron models. This finding also suggests that SG descent is well suited for training stochastic SNNs, with deterministic SNNs as a special case (see Figure 3). Next, we wanted to confirm this insight numerically. To that end, we trained deterministic and stochastic SNNs with SG descent on a deterministic spike train matching task and a classification task.
For the spike train matching task, we assumed 200 input neurons with frozen Poisson spike trains. We then set up a feedforward SNN with one hidden layer, initialized in the fluctuation-driven regime (Rossbroich et al., 2022). We used supervised SG training in a deterministic and a stochastic version of the same network to match 200 target spike trains that were given by a dithered picture of the Radcliffe Camera (see Figure 6A and section 8.4.1). Both networks learned the task, albeit with visibly different hidden-layer activity (see Figure 6B). The deterministic version outperformed the stochastic SNN in terms of -distance at 1 ms, the temporal resolution of the simulation. However, when comparing their outputs according to the van Rossum distance (van Rossum, 2001) with an alpha-shaped filter kernel equivalent to the -kernel of the LIF neuron (), we found no difference in loss between the stochastic and deterministic networks (see Figure 6C). Finally, we observed that the Fano factor, a measure of stochasticity, of the stochastic SNN dropped during training to better accommodate the deterministic target (see Figure 6D). In summary, we find that the stochastic network exhibits comparable performance to the deterministic network. Thus, SGs are suitable for training stochastic SNNs.
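For reference, the van Rossum distance can be sketched as follows: both spike trains are passed through a leaky filter, and the filtered traces are compared in the L2 sense. For brevity, this sketch uses a simple exponential kernel rather than the alpha-shaped kernel used in our experiments; the function name and constants are illustrative.

```python
import math

def van_rossum_distance(s1, s2, tau=5.0, dt=1.0):
    """Van Rossum distance between two binary spike trains of equal
    length: filter both with an exponential kernel of time constant tau,
    then take the root of the integrated squared difference."""
    decay = math.exp(-dt / tau)
    f1 = f2 = 0.0  # filtered traces
    acc = 0.0      # accumulated squared difference
    for a, b in zip(s1, s2):
        f1 = decay * f1 + a
        f2 = decay * f2 + b
        acc += (f1 - f2) ** 2 * dt
    return math.sqrt(acc)
```

Unlike a per-time-step distance, this measure tolerates small timing jitter, which is why the stochastic and deterministic networks appear equally good under it.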
SGs successfully train stochastic SNNs on a spike train matching task. (A) Spike raster plots of the input and target spike trains of the spike train matching task. Time is shown on the x-axis, the neuron indices are on the y-axis, and each dot is a spike. The task is to convert the given frozen Poisson input spike pattern (left) into a structured output spike pattern depicting the Radcliffe Camera in Oxford (right). (B) Top: Output spike raster plots after training of the deterministic (left) and the stochastic SNNs (right). Although both methods faithfully reproduce the overall target structure, the deterministic network is slightly better at matching the exact timing. Bottom: Hidden-layer spike raster plots. Despite the similar output, the two networks show visibly different hidden-layer activity patterns. (C) L2 loss (top) and van Rossum distance (bottom) throughout training for the two network models. While the deterministic network outperforms the stochastic one in terms of the L2 distance, the difference is negligible for the van Rossum distance. (D) Average Fano factor throughout training in the hidden and output layer of the stochastic network. Although the stochastic network reduces its variability during training to match the deterministic target, its hidden layer still displays substantial variability at the end of training.
To verify that this finding generalizes to a more complex task, we trained a convolutional spiking neural network (ConvSNN) with three hidden layers on the Spiking Heidelberg Digits (SHD) data set (Cramer et al., 2022) with maximum-over-time readout (see section 8.4.2 and Figure 7A). The stochastic network learned to solve the task with comparable training and validation accuracy to the deterministic network (see Figure 7B). Specifically, the stochastic ConvSNN reached validation and test accuracies on par with those of the deterministic ConvSNN. Furthermore, in the case of the stochastic SNN, we left the escape noise active during validation and testing. However, one can see that the stochastic network performs equally well when tested in a deterministic environment (see Figure 7D). Conversely, the deterministically trained network does not perform well when evaluated under stochastic spike generation. In contrast to the previous task, we found that the Fano factor remained high during training (see Figure 7C), that is, stochasticity is preserved. This is also reflected in a high trial-to-trial variability as shown in Figure 7E. One can see that the spiking activity for three example neurons shows a high variability across trials for the same input. Thus, even for a more complex task, SG descent is well suited for training stochastic SNNs, and stochasticity is preserved after training.
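The maximum-over-time readout can be sketched as follows (a minimal single-sample illustration; the function names are ours, not part of the referenced implementation):

```python
import numpy as np

def max_over_time_readout(readout_potentials):
    """Collapse readout membrane potentials of shape (T, n_classes)
    into one score per class by taking the maximum over time."""
    return readout_potentials.max(axis=0)

def cross_entropy(scores, target):
    """Softmax cross-entropy loss on the max-over-time scores."""
    z = scores - scores.max()                    # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[target]
```

The class whose (non-spiking) readout unit reaches the highest membrane potential at any point during the stimulus is the predicted class.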
SGs can successfully train stochastic ConvSNNs. (A) Snapshot of network activity for two correctly classified sample inputs from the SHD data set. Top: Readout unit activity over time; the colored line indicates the activity of the unit that corresponds to the correct class. Below: Spike raster plots of the three convolutional hidden layers. The spike raster plots of the inputs are shown at the bottom in red and gray. Time is on the x-axis, the neuron index is on the y-axis, and each dot represents a spike. (B) Learning curves for training (dashed) and validation (solid) accuracy for the stochastic and the deterministic case (average over trials ± std). (C) The mean Fano factor of the different layers in the stochastic network throughout training (mean ± std). (D) The first three pairs of boxes show train, validation, and test loss of the ConvSNN as in panel A for the stochastic and the deterministic case across random initializations. The right-most boxes show test loss for the opposite activation function. This means the network trained deterministically is tested with stochastic activation and vice versa. (E) Raster plots over trials of the spiking activity of three randomly chosen units from the second hidden layer. The units show clear trial-to-trial variability reminiscent of cortical activity.
7 Discussion
We have analyzed the theoretical foundations of the widely used SG descent method for multilayer SNNs and shown that SGs can be derived for stochastic networks from the stochAD framework (Arya et al., 2022). The stochAD framework provides a principled way to use AD in discrete stochastic systems, directly applicable to SG descent in stochastic SNNs. Our analysis is based on smoothed stochastic derivatives introduced within the stochAD framework and stipulates that the SD should match the derivative of the escape noise function of the stochastic neuron model. We confirmed in simulations that SG descent allows training well-performing stochastic SNNs with substantial variability comparable to neurobiology. Our work shows that SGs enjoy better theoretical support in stochastic SNNs, which removes some of the ambiguity in the choice of the SD and also sheds light on deterministic SGs as a special case.
One of our key findings is the relation between the functional shape of the escape noise in stochastic neuron models and the SD. This relation is evident in single neurons, where the solution found by SGs and stochAD is equal to the exact derivative of the expected output. This relation has also been described in previous work on shallow networks (Pfister et al., 2006; Brea et al., 2013), where the derivative of the escape noise directly shows up when computing exact derivatives in expectation. In this article, we found that the relation extends to multilayer SNNs when applying stochAD as a theoretical framework. The link between the SD shape and escape noise provides a rigorous mathematical justification for a particular SD choice, which in previous work was typically chosen heuristically.
7.1 Related Work
The issue of nonexisting derivatives is not unique to SNNs but is a well-known challenge when dealing with discrete random variables and their derivatives. It hence arises in various contexts, such as in general binary neural networks (Courbariaux et al., 2016), sigmoid belief nets (Neal, 1992), or when dealing with categorical distributions (Maddison et al., 2017; Jang et al., 2016). To address this challenge in arbitrary functions, including discrete random variables, the score function estimator, also called REINFORCE (Williams, 1992), is often applied because it provides unbiased estimates. This estimator can be computed through stochastic computation graphs (Schulman et al., 2015). While Arya et al. (2022) use coupled randomness to reduce variance, there are also methods building on reinforcement learning for low-variance gradient estimation (Weber et al., 2019).
Conversely, to address the challenge of BP through nondifferentiable functions in neural networks instead, a commonly used solution is the STE (Hinton, 2012; Bengio et al., 2013), which replaces the derivative of a stochastic threshold function with one, also called the identity STE (Yin et al., 2018). This solution relates closely to SGs used in SNNs. However, SGs are commonly used in deterministic networks and comprise a nonlinear SD like the derivative of a scaled fast sigmoid (Neftci et al., 2019). While both are successful in practice, we still lack a complete theoretical understanding. Several studies have analyzed the theoretical aspects of the STE, looking at the problem from various angles. However, they have each analyzed different versions of the STE, and a complete picture of the underlying theory is still missing. Although none of them analyzed STEs in SNNs (i.e., SGs), examining their results helps to put SGs into perspective. Bengio et al. (2013) originally introduced the identity STE as a biased estimator, saying that it guarantees the correct sign only in shallow networks. Yin et al. (2018) studied coarse gradients, including the STE. They studied this either in a network with only one nonlinearity or in an activation-quantized network with a quantized rectified linear unit (ReLU) as opposed to a Heaviside activation function, as we have in the case of SNNs. They analyzed training stability and investigated which choice of STEs leads to useful weight updates. They concluded that the identity STE does not, while the ReLU and clipped ReLU versions do. They further found that the identity STE might be repelled from certain minima. Liu et al. (2023) showed that the identity STE provides a first-order approximation to the gradient, which was previously shown by Tokui and Sato (2017) specifically for Bernoulli random variables. However, other STEs were not considered, for example, the sigmoidal STE, which would correspond to our SG. 
Another approach to dealing with discrete distributions was pursued by Jang et al. (2016) and Maddison et al. (2017). They proposed a solution similar to the reparameterization trick but combined with a smooth relaxation of the categorical distribution. This relaxation amounts to training a network with continuous representations by sampling from a Gumbel-Softmax. At the same time, after training, the temperature of the Gumbel-Softmax distribution is set to zero to obtain a discrete output. Hence, while providing discrete output after training, such a network is trained with continuous neuronal outputs rather than spikes.
Interestingly, most of these solutions deal with discrete functions in the stochastic case, such as stochastic binary perceptrons or Bernoulli random variables, where noise is used for smoothing. But as we saw above, probabilistic approaches in the context of SPMs cannot be applied to multilayer SNNs and BP without approximations. In the past, stochastic SNN models have been implemented mainly for theoretical and biological plausibility reasons (Pfister et al., 2006; Brea et al., 2013). For example, Brea et al. (2013) showed a link between the escape noise and the voltage-triplet rule (Clopath & Gerstner, 2010) in shallow networks. Furthermore, there were also approaches that enabled the training of multilayer stochastic SNNs with AD, such as the MultilayerSpiker proposed by Gardner et al. (2015), which uses the spike train itself as the SD of a spike train. Today, SNNs are mostly implemented as networks of deterministic LIF neurons with a Heaviside function as spiking nonlinearity, as they tend to have higher performance. Recently, however, Ma et al. (2023) found empirically that stochastic SNNs with different types of escape noise can be trained with SG descent to high performance. While they showed a connection to deterministic SG descent, they did not discuss the implications for multilayer networks with stateful LIF neurons. Their results again confirm the effectiveness of SG descent in stochastic SNNs, as suggested by the connection to the stochAD framework found here. Thus, our work not only provides empirical confirmation of the effectiveness of SGs in training stochastic SNNs but also a comprehensive theoretical explanation for their success in multilayer networks.
7.2 Limitations
To gain analytical insight into the theory behind SGs, we had to make some simplifying assumptions. For example, we performed most of our theoretical analysis on perceptrons, which, unlike LIF neurons, have no memory. Nevertheless, our results readily transfer to LIF neurons in discrete time, as such networks can be understood as a special type of binary recurrent neural networks (Neftci et al., 2019) where the output is given by a Heaviside step function. However, one difference is that unlike perceptrons, LIF models have a reset after the spike, which should also be considered when calculating gradients. Smoothed stochastic derivatives allow and suggest backpropagating through the reset mechanism, but previous work has shown that it is beneficial to exclude the reset mechanism from the backward pass, as this leads to more robust performance, especially when the SG is not scaled to one (Zenke & Ganguli, 2018; Zenke & Vogels, 2021; Eshraghian et al., 2023). Both options are possible when using SG-descent, and we have not noticed much difference in performance when scaling the SDs to one. However, omitting this scaling negatively affects performance when backpropagating through the reset (see supplementary Figure S2C and compare the asymptotic SuperSpike in Zenke & Ganguli, 2018). While omitting this scaling can potentially be counteracted by an optimizer using per parameter learning rates, the prefactor due to the SD steepness in the unscaled case can become quite large as it accumulates across layers and thus still affects performance.
Another limitation is that we derived SGs for stochastic SNN models, while they are most commonly used in deterministic networks. This discrepancy may in fact point at a possible advantage. Our simulations suggested that stochastic networks generalize equally well or better than their deterministic counterparts. This difference may be due to stochastic path selection, which leads to lower selection bias compared to the deterministic case. Moreover, when employing small weight updates, each update follows a different path in the tree of spike patterns, effectively averaging the updates over multiple trials, albeit with slightly different parameter sets. Such stochastic averaging helps to reduce the bias present in the smoothed stochastic derivatives. Alternatively, the difference may be due to escape noise acting as an implicit regularizer reminiscent of dropout that promotes better generalization.
Next, while our work motivates a clear link between the SD and a stochastic neuron model with escape noise, it does not say which escape noise to use. This lack may seem like a weakness that merely moves the ambiguity of choice to a different component of the neuron model. Instead, doing so reduces two heuristic choices, namely the choice of escape noise and the shape of the SD, to one. Furthermore, the escape noise function can be constrained using biological data (Jolivet et al., 2006; Pozzorini et al., 2015).
Moreover, the tasks we used to study stochastic SNNs were limited. Historically, stochastic SNNs have been evaluated on tasks that require exact deterministic spike trains at readout such as in the spike train matching task (Pfister et al., 2006; Brea et al., 2013; Gardner et al., 2015). However, despite being performed with stochastic SNNs, such tasks actually punish stochasticity, at least in the readout. Therefore, stochastic SNNs respond by becoming more deterministic during training, as we have observed when monitoring the Fano factor (see Figure 6). In our setting, neurons can change the effective steepness of the escape noise function, on which it depends, by increasing their weights and thus become more deterministic. Therefore, we believe that tasks that do not punish stochasticity, such as classification tasks, are much more suitable for evaluating the performance of stochastic SNNs as they allow the training of well-performing SNNs that still show substantial variability in their responses.
Finally, our analysis of the deterministic SGs was also subject to some limitations. In particular, a comparison with the actual gradient, which is either zero or undefined, would not have worked. Instead, we analyzed the deviation of the SG from the actual gradient in differentiable networks that are nonspiking. Doing so, we have shown that SGs in deterministic networks with well-defined gradients deviate from the actual gradients. In the limiting case of infinitely steep activation functions, these networks are equivalent to perceptron networks with step activation functions. Thus, although we could not perform our analysis directly on deterministic SNNs, we consider the above approach to be a good approximation since it preserves the main properties of SGs.
7.3 Future Research
The work in this article motivates several exciting directions for future research. First, we found that although necessary for training, SGs are generally not gradients of a surrogate loss and thus not guaranteed to find a local minimum. Specifically, when replacing the nonexistent derivative of a Heaviside with an SD, the resulting SG is generally not a gradient. It seems unlikely that choosing a different escape noise or SD will alter this result. Thus, in situations requiring the stronger theoretical guarantees bestowed by gradients, an alternative approach would be to start from a surrogate loss function. While this approach would, by design, yield a proper gradient, it would be more akin to optimizing a smoothed version of the problem, as is the case when using the Gumbel-Softmax or the concrete distribution (Jang et al., 2016; Maddison et al., 2017), rather than defining a surrogate for an ill-behaved gradient. However, this approach would also come with the known difficulties for AD. It is not yet clear whether or how they can be overcome by future work.
We also saw that SGs generally decrease the loss, but they do not necessarily find a local minimum of the loss. This discrepancy is due to the deviation from the actual gradient. Yet learning in deterministic SNNs is only possible due to this deviation because the actual gradient is almost always zero. In fact, deviations from the actual gradient are often also desirable even in conventional ANN training. Bias and variance of the gradient estimator can help to generalize (Ghosh et al., 2023). For instance, in ANNs with a rough loss landscape, mollifying, which induces a deviation from the gradient, is successful in practice (Gulcehre et al., 2017). Regularizers or optimizers, which are often used by default, show another advantage of deviations. For example, many optimizers rely on momentum to speed up training or avoid getting caught in bad local minima. Thus, an interesting line of future work is to study what constitutes a good local minimum in the case of SNNs, which types of biases or deviations from the gradient help to find them, and how this relates to SGs.
Another promising starting point for future work is tied to the stochAD framework and questions pertaining to optimization effectiveness. In this study, we linked SG-descent to smoothed stochastic derivatives, which are biased with respect to the nonsmoothed version based on the full stochastic triple. Thus far, the latter can only be computed in forward-mode AD, which is impractical for neural network training, as it comes with an increased computational cost. An intriguing direction for future research is therefore to investigate forward mode AD with the full stochastic triple to determine whether training improvements due to the reduction in bias would justify the added computational cost. Furthermore, it would be worth exploring whether this computational cost could be reduced, for instance, by using mixed-mode AD, such as in DECOLLE (Kaiser et al., 2020) or online spatiotemporal learning (Bohnstingl et al., 2023; see Zenke & Neftci, 2021, for a review).
Finally, an exciting direction for future research is to use our theoretically motivated methodology for training stochastic SNNs to address biological questions. In neurobiology, many neurons exhibit trial-to-trial variability. However, the role of this variability is not well understood. Here, functional stochastic SNN models will help to further our understanding of its role in the brain. Stochasticity may play possible roles in representational drift (Micou & O’Leary, 2023), efficient coding (Denève & Machens, 2016), and for biological implementations of latent generative models (Hinton et al., 1995; Rombach et al., 2022). Thus, it would be interesting to study how learning influences representational changes in plastic functional stochastic SNNs and whether we can identify specific dynamic signatures that allow drawing conclusions about the underlying learning mechanisms in the brain.
In conclusion, SGs remain a valuable tool for training both deterministic and stochastic SNNs. In this article, we saw that stochAD provides a theoretical backbone to SG learning that naturally extends to stochastic SNNs. Further, it suggests that the choice of the SD should be matched to the derivative of the escape noise. Training such stochastic SNNs is becoming increasingly relevant for applications with noisy data or noisy hardware substrates and is essential for theoretical neuroscience as it opens the door for studying functional SNNs with biologically realistic levels of trial-to-trial variability.
8 Methods
Our perceptron and SNN models were custom-written in Python 3.10.4 using PyTorch (Paszke et al., 2019), Jax (Bradbury et al., 2018), and the stork library for training SNNs (Rossbroich et al., 2022, https://github.com/fmi-basel/stork). The code repository can be found at https://github.com/fmi-basel/surrogate-gradient-theory. All models were implemented in discrete time. The following sections provide more details on the different neuron models before providing all the necessary information on the two learning tasks, including the architecture and parameters used for a given task.
8.1 Neuron Models
8.1.1 Perceptron
8.1.2 Leaky Integrate-and-Fire Neuron
8.2 Spike Generation
The generation of a spike follows the same mechanism in both the LIF and the perceptron neuron model. Depending on the membrane potential value, the activation function generates a spike or not. The following paragraphs highlight the differences between the deterministic and the stochastic cases.
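The two cases can be sketched together as follows (a minimal sketch; the sigmoidal escape noise with steepness β is an assumed example form, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def spike_deterministic(u, theta=1.0):
    """Deterministic spike generation: Heaviside step of U - theta."""
    return (u >= theta).astype(float)

def spike_stochastic(u, theta=1.0, beta=5.0):
    """Stochastic spike generation with sigmoidal escape noise:
    spike ~ Bernoulli(sigmoid(beta * (U - theta)))."""
    p = 1.0 / (1.0 + np.exp(-beta * (u - theta)))
    return (rng.random(np.shape(u)) < p).astype(float)
```

For large β, the escape noise function approaches the Heaviside step, recovering the deterministic case.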
8.2.1 Deterministic Spike Generation
8.2.2 Stochastic Spike Generation
8.3 Differentiable Example Network
The minimal example network (see Figure 4A) was simulated using Jax (Bradbury et al., 2018). To implement the main essence of SG descent in a differentiable network, we constructed a network of perceptrons with sigmoid activation functions (see equation 4.1). The SG was implemented by a less steep sigmoid, which was used instead of the actual activation function on the backward pass.
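The idea of using a shallower sigmoid on the backward pass can be illustrated for a single unit by writing out the chain rule manually (a sketch; the steepness values are placeholders):

```python
import numpy as np

def sigmoid(x, beta):
    """Sigmoid with steepness beta."""
    return 1.0 / (1.0 + np.exp(-beta * x))

def forward(x, w, beta_fwd=25.0):
    """Forward pass with the steep activation sigmoid(beta_fwd * w * x)."""
    a = w * x
    return sigmoid(a, beta_fwd), a

def surrogate_grad_w(x, w, beta_fwd=25.0, beta_bwd=1.0):
    """Backward pass that replaces the steep sigmoid's derivative with
    the derivative of a shallower sigmoid (the surrogate derivative)."""
    _, a = forward(x, w, beta_fwd)
    s = sigmoid(a, beta_bwd)
    surrogate = beta_bwd * s * (1.0 - s)  # derivative of shallow sigmoid
    return surrogate * x                   # chain rule w.r.t. w
```

Where the steep sigmoid's true derivative is vanishingly small (far from the threshold), the surrogate derivative still provides a usable learning signal.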
8.3.1 Integrated (Surrogate) Gradients
8.4 Learning Tasks
We trained SNNs to evaluate the differences in learning between the stochastic and deterministic SNN models trained with SG-descent. For this purpose, we chose a spike train matching and a classification task to cover different output modalities.
8.4.1 Spike Train Matching
If an SNN can learn a precise timing of spikes, it must be able to match its output spike times with some target output spike times. Therefore, in the spike train matching task, we ask whether both the stochastic and the deterministic network can learn deterministic output spike trains. For the training, no optimizer is used, and since the data set consists of a single sample and we use a minibatch size of one, this amounts to full-batch gradient descent. Details of the architecture and of the neuronal and training parameters are given in the following paragraphs as well as in Tables 2, 3, and 4 for both the stochastic and the deterministic versions of the network. The code was written in Jax.
Noise and Surrogate Function Parameters.
|  | Spike-train matching task |  | Classification task |  |
| --- | --- | --- | --- | --- |
|  | stochastic | deterministic | stochastic | deterministic |
| Number of trials | 10 | 1 | 1 | 1 |
| Learning rate | 0.01 | 0.01 |  |  |
| Escape noise |  |  |  |  |
| Function |  | step |  | step |
| Parameter |  | – |  | – |
| Surrogate gradient |  |  |  |  |
| Function | SuperSpike | SuperSpike | SuperSpike |  |
| Parameter |  |  |  |  |
Notes: Parameters used in our numerical simulations for feedforward networks on the spike-train matching task and multilayer recurrent ConvSNNs on the SHD classification task. We selected the learning rate based on the best validation accuracy (see supplementary Figure S2B). The SuperSpike nonlinearity is the derivative of a fast sigmoid with steepness β, scaled such that its peak value is one: h(x) = 1/(β|x| + 1)².
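For illustration, the scaled-to-one SuperSpike SD from the note above could be implemented as (a sketch; the default steepness is a placeholder):

```python
import numpy as np

def superspike_sd(u, beta=10.0):
    """SuperSpike surrogate derivative, the derivative of a fast sigmoid,
    h(u) = 1 / (beta * |u| + 1)**2, evaluated at u = U - theta.

    Scaled so that its peak value is one at the threshold (u = 0)."""
    return 1.0 / (beta * np.abs(u) + 1.0) ** 2
```

The unscaled (asymptotic) variant carries an extra factor of β, which is the prefactor discussed in section 7.2 that accumulates across layers.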
Architecture Parameters.
|  | Spike-train matching task | Classification task |
| --- | --- | --- |
| Data set | Radcliffe Camera | SHD |
| No. input neurons | 200 | 700 |
| No. hidden neurons | 200 | 16-32-64 |
| No. output neurons | 200 | 20 |
| No. training epochs | 5000 | 200 |
| Time step | 1 ms | 2 ms |
| Duration | 198 ms | 700 ms |
| Mini-batch size | 1 | 400 |
| Kernel size (ff) | – | 21-7-7 |
| Stride (ff) | – | 10-3-3 |
| Padding (ff) | – | 0-0-0 |
| Kernel size (rec) | – | 5 |
| Stride (rec) | – | 1 |
| Padding (rec) | – | 2 |
Note: Parameters used in numerical simulations for feedforward (ff) SNNs on the spike-train matching task and multilayer recurrent (rec) ConvSNNs on the SHD classification task.
Neuronal and Training Parameters.
|  | Spike-train matching task | Classification task |
| --- | --- | --- |
| Loss | L2 or van Rossum distance | Maximum over time |
| Optimizer | None (gradient descent) | SMORMS3 |
| Neuronal parameters |  |  |
| Spike threshold | 1 | 1 |
| Resting potential | 0 | 0 |
|  | – |  |
| Reset | at same time step | at next time step |
| Activity regularizer |  |  |
|  | – | 7 |
|  | – | 0.01 |
Note: Parameters used in numerical simulations for feedforward SNNs on the spike-train matching task and deep recurrent ConvSNNs on the SHD classification task.
Task. The target spike trains to match were generated from a picture of the Radcliffe Camera in Oxford using dithering to create a binary image. We set this image as our target spike raster, where the x-axis corresponds to time steps and the y-axis is the neuron index. Hence, the image provides a binary spike train for each readout neuron. As an input, we used frozen Poisson-distributed spike trains with a per neuron firing rate of 50 Hz. This created a one-image data set that requires 200 input neurons and 200 output neurons and has a duration of 198 time steps, that is, 198 ms when using a 1 ms time step.
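Such frozen Poisson inputs can be generated as follows (a sketch; the seed is arbitrary, and the per-step Bernoulli probability rate × Δt approximates a Poisson process for small Δt):

```python
import numpy as np

def frozen_poisson(n_neurons=200, n_steps=198, rate_hz=50.0, dt=1e-3, seed=42):
    """Generate a frozen (fixed-seed) Poisson spike raster of shape
    (n_steps, n_neurons), with spike probability rate_hz * dt per bin."""
    rng = np.random.default_rng(seed)
    return (rng.random((n_steps, n_neurons)) < rate_hz * dt).astype(float)
```

"Frozen" means the same raster is replayed on every trial, so the fixed seed is part of the data set.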
Network architecture. The spike train matching task is an easy task that could also be solved by an SNN without a hidden layer. However, since we knew that in the shallow case without a hidden layer, the only difference between our stochastic and our deterministic versions would lie in the loss, we decided to use a network with one hidden layer. Hence, the network architecture we used was a feedforward network with 200 input units, 200 hidden units, and 200 readout units, run with a 1 ms time step for a total of 198 time steps (see also Table 3). Weights were initialized in the fluctuation-driven regime (Rossbroich et al., 2022) with a mean of zero and a prescribed target for the membrane fluctuations. For the stochastic networks, we averaged the update over 10 trials: we sampled 10 paths in the tree of spike patterns, used each of them to compute an SG, and took their average to perform the weight update before sampling another path for the next weight update (see Figure 3).
Fano factor. The Fano factor was calculated on the hidden layer and output spike trains as the ratio of the variance to the mean of the spike counts, F = σ²/μ.
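The spike-count Fano factor can be computed as follows (a sketch for a single neuron across trials):

```python
import numpy as np

def fano_factor(spike_counts):
    """Fano factor F = Var(N) / E[N] of spike counts across trials.

    spike_counts: (n_trials,) array of spike counts for one neuron.
    Returns NaN for silent neurons (zero mean count).
    """
    m = spike_counts.mean()
    return spike_counts.var() / m if m > 0 else float("nan")
```

A Poisson process has F = 1, while a perfectly deterministic spike train has F = 0, which is why a dropping Fano factor signals that the network is becoming more deterministic.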
8.4.2 Classification on SHD
To evaluate performance differences, we also evaluated both the stochastic and the deterministic version on a classification task using a multilayer recurrent ConvSNN (as in Rossbroich et al., 2022), hence also increasing the difficulty for the stochastic model, which now had to cope with multiple noisy layers. Details of the architecture and of the neuronal and training parameters are given in the following paragraphs as well as in Tables 2, 3, and 4 for both the stochastic and the deterministic versions of the network.
Task. For the classification task, we used the SHD data set (Cramer et al., 2022), a real-world auditory data set containing spoken digits in German and English from different speakers; hence, it has 20 classes. The data set can be downloaded from https://ieee-dataport.org/open-access/heidelberg-spiking-datasets. Cramer et al. (2022) preprocessed the recordings using a biologically inspired cochlea model to create input spike trains for 700 neurons. The duration of the samples varies; hence, we decided to consider only the first 0.7 s of each sample, which covers the large majority of the original spikes. For our numerical simulations, we binned the spike trains using a 2 ms time step and hence had 350 time steps in this task. Ten percent of the training data was used as a validation set, and for testing, we used the officially provided test set, which contains only speakers that did not appear in the training set.
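The truncation and binning step can be sketched as follows (a sketch assuming spike times given in seconds; the SHD file-loading interface is not shown):

```python
import numpy as np

def bin_spikes(times, units, n_units=700, t_max=0.7, dt=2e-3):
    """Bin continuous spike times into a (n_bins, n_units) binary raster,
    keeping only the first t_max seconds of the sample.

    times: (n_spikes,) spike times in seconds
    units: (n_spikes,) input-neuron index of each spike
    """
    n_bins = int(round(t_max / dt))   # 350 time steps for 0.7 s at 2 ms
    raster = np.zeros((n_bins, n_units))
    keep = times < t_max              # truncate spikes after t_max
    bins = (times[keep] / dt).astype(int)
    raster[bins, units[keep]] = 1.0   # multiple spikes per bin collapse to 1
    return raster
```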
Network architecture. We used a multilayer recurrent ConvSNN for the SHD classification task (see also Table 3). There were 700 input neurons that directly got fed the input spikes from the SHD data set. The network had three recurrently connected hidden layers and used one-dimensional convolution kernels. Weights were initialized in the fluctuation-driven regime with a mean of zero and prescribed targets for the membrane fluctuations and for the proportion of fluctuations due to the feedforward input. The network size as well as the parameters for the feedforward and recurrent convolutional operations are summarized in Table 3. We performed only one trial per update for the stochastic case.
Readout units. As opposed to the previous task, a classification task requires special readout units. The readout units had the same membrane and current dynamics as a normal LIF neuron (see equations 8.1 and 8.2), but they did not spike. Furthermore, we used a different membrane time constant for the readout units. This allowed us to read out their membrane potential for classification (see the paragraph below on loss function).
Given this, SMORMS3 computes the current effective learning rate per parameter as η_eff = min(η, ḡ²/(ḡ₂ + ε)) / (√ḡ₂ + ε), where ḡ and ḡ₂ are running averages of the gradient and of its square, η is the initially chosen learning rate, and ε is a small constant to avoid division by zero. Therefore, the parameter update is Δθ = −η_eff · ∇_θ L.
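A single SMORMS3 step could be sketched as follows (following the commonly cited update rule; the state layout and default constants are our choices):

```python
import numpy as np

def smorms3_step(p, grad, state, lr=0.01, eps=1e-16):
    """One SMORMS3 update for a parameter array p given its gradient.

    state holds the running averages g (gradient), g2 (squared gradient),
    and the adaptive memory term mem controlling the averaging horizon.
    """
    if state is None:
        state = {"mem": np.ones_like(p),
                 "g": np.zeros_like(p),
                 "g2": np.zeros_like(p)}
    r = 1.0 / (state["mem"] + 1.0)
    state["g"] = (1.0 - r) * state["g"] + r * grad
    state["g2"] = (1.0 - r) * state["g2"] + r * grad ** 2
    x = state["g"] ** 2 / (state["g2"] + eps)      # signal-to-noise estimate
    state["mem"] = 1.0 + state["mem"] * (1.0 - x)  # adapt averaging horizon
    # effective per-parameter learning rate: min(lr, x) / (sqrt(g2) + eps)
    p = p - grad * np.minimum(lr, x) / (np.sqrt(state["g2"]) + eps)
    return p, state
```

Used in a loop, the per-parameter effective learning rate shrinks where gradients are noisy and approaches the chosen η where they are consistent.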
Acknowledgments
We thank Frank Schäfer, Guillaume Bellec, and Wulfram Gerstner for stimulating discussions. This work was supported by the Swiss National Science Foundation (grant PCEFP3_202981) and the Novartis Research Foundation.