## Abstract

Individual neurons in the brain have complex intrinsic dynamics that are highly diverse. We hypothesize that the complex dynamics produced by networks of complex and heterogeneous neurons may contribute to the brain's ability to process and respond to temporally complex data. To study the role of complex and heterogeneous neuronal dynamics in network computation, we develop a rate-based neuronal model, the generalized-leaky-integrate-and-fire-rate (GLIFR) model, which is a rate equivalent of the generalized-leaky-integrate-and-fire model. The GLIFR model has multiple dynamical mechanisms, which add to the complexity of its activity while maintaining differentiability. We focus on the role of after-spike currents, currents induced or modulated by neuronal spikes, in producing rich temporal dynamics. We use machine learning techniques to learn both synaptic weights and parameters underlying intrinsic dynamics to solve temporal tasks. The GLIFR model allows the use of standard gradient descent techniques rather than surrogate gradient descent, which has been used in spiking neural networks. After establishing the ability to optimize parameters using gradient descent in single neurons, we ask how networks of GLIFR neurons learn and perform on temporally challenging tasks, such as sequential MNIST. We find that these networks learn diverse parameters, which gives rise to diversity in neuronal dynamics, as demonstrated by clustering of neuronal parameters. GLIFR networks have mixed performance when compared to vanilla recurrent neural networks, with higher performance in pixel-by-pixel MNIST but lower in line-by-line MNIST. However, they appear to be more robust to random silencing. We find that the ability to learn heterogeneity and the presence of after-spike currents contribute to these gains in performance. 
Our work demonstrates both the computational robustness of neuronal complexity and diversity in networks and a feasible method of training such models using exact gradients.

## 1 Introduction

### 1.1 Background

Biological neurons are fundamentally dynamic, constantly modulating internal states in a nonlinear way while responding to inputs. A biological neuron maintains a membrane potential that varies not only with input currents but also through history-dependent transformations. When its voltage exceeds some threshold, a neuron produces a spike, a rapid fluctuation in voltage that is considered the basis of neuronal communication. While these are the defining features of a neuron, neurons exhibit additional, more complex types of dynamics, including threshold variability and bursting. Threshold adaptation gives rise to threshold variability and thus modulates the sensitivity of a neuron to inputs. A proposed mechanism is that the threshold fluctuates in a voltage-dependent manner, possibly due to the gating mechanisms of various ion channels (Fontaine et al., 2014). Another type of dynamic is responsible for bursting and related spiking behaviors. Bursting results from depolarizing currents, which can be induced by prior spikes. A broader array of spiking patterns can be explained by considering both hyperpolarizing and depolarizing after-spike currents, currents that are modulated by a neuron's spiking activity (Mihalaş & Niebur, 2009).

In sum, biological neuronal dynamics can be thought of as continuous, nonlinear transformations of internal neural states, such as ionic currents and voltage. This is in contrast to a typical artificial neuron that maintains a single state resulting from a static, linear transformation of an input vector. Moreover, diversity in intrinsic neuronal mechanisms gives rise to heterogeneity in the dynamics that biological neurons express. For example, not all neurons in the brain display bursting behavior or threshold adaptation. The brain also exhibits diversity in neuronal properties, such as ionic conductances, threshold, and membrane properties. This diversity is evident from experiments fitting neuronal models to spiking behavior observed in the mouse cortex (Teeter et al., 2018). Heterogeneity of such neural characteristics further amplifies the diversity in neuronal dynamics. In contrast, the only diversity in typical recurrent neural networks (RNNs), artificial neural networks (ANNs) that incorporate information over time, lies in synaptic weights. Typical RNN neurons do not exhibit intrinsic dynamics, and each neuron responds identically to input with a common, fixed activation function.

Vanilla RNNs, which employ linear weighting of inputs and nonlinear transformation of hidden states over time (see Figure 1A), have managed to perform quite well on complex tasks without incorporating the complexity and heterogeneity prominent in biological networks. RNNs have been successful at classifying text (Liu et al., 2016) and images from a pixel-by-pixel presentation (Goodfellow et al., 2016; Li et al., 2018; LeCun et al., 2015). This raises the question of what advantage, if any, neuronal dynamics and their diversity confer. Do neuronal dynamics enable the modeling of more temporally complex patterns? Does neuronal heterogeneity improve the learning capacity of neural networks?

We propose that neuronal dynamics, such as after-spike currents, as well as heterogeneity in the properties of these dynamics across a network, enhance a network's capacity to model temporally complex patterns. Moreover, we hypothesize that heterogeneity in dynamics, resulting from the diversity of single-neuron parameters, will be optimal when parameters are optimized in conjunction with synaptic weights during training. In order to evaluate these hypotheses, we develop and test a neuronal model whose parameters can be optimized through gradient descent.

A set of approaches has been previously used to develop neuronal models that encapsulate the above-mentioned biological complexities. These range from models that include a dynamical description at the level of membrane channels (Hodgkin & Huxley, 1952; Morris & Lecar, 1981) to more compact models that synthesize the complexities into more abstract dynamical systems (Izhikevich, 2003) or hybrid systems (Mihalaş & Niebur, 2009). While networks of such neurons have been constructed (Markram et al., 2015; Billeh et al., 2020), they are often difficult to optimize, either to solve a task or to fit biological data.

Small increases in dynamical complexity can have significant benefits. Adaptation dynamics can help neural networks perform better on predictive learning tasks (Burnham et al., 2021) and achieve better fits of neuronal and behavioral data in mice (Hu et al., 2021). Inroads have been made to allow the optimization of spiking models using backpropagation via surrogate gradients (Huh & Sejnowski, 2018; Neftci et al., 2019). Such methods have revealed the importance of neuronal integration and synaptic timescales (Perez-Nieves et al., 2021) and adaptation (Salaj et al., 2021; Bellec et al., 2020) in network computation. Previous models (Perez-Nieves et al., 2021; Burnham et al., 2021) have also shown the importance of timescale diversity in temporal tasks. However, these models are still significantly simpler than those found to fit single-neuron data well (Teeter et al., 2018). To allow larger-scale dynamics to fit well, we want to be able to optimize neural networks with the type of dynamics proven to fit such complexities. Here we focus on the addition of after-spike currents, which are currents modulated by a neuron's spiking behavior, leading to a wide variety of observed dynamics, including bursting (Gerstner et al., 2014; Mihalaş & Niebur, 2009).

While spiking neurons are more biologically realistic, they are generally more difficult to optimize. Thus, we develop a model that is differentiable but becomes a spiking model in the limit of taking a parameter to 0. We refer to this novel neuronal model as the generalized leaky-integrate-and-fire-rate (GLIFR) model. The GLIFR model is built on the spike-based generalized-leaky-integrate-and-fire (GLIF) models. GLIF models introduce several levels of complexity to leaky-integrate-and-fire models, which only model a neuron's leaky integration of input currents. We focus on the GLIF$_3$ model (Teeter et al., 2018), which accounts for after-spike currents through additive and multiplicative effects of spiking. Our proposed GLIFR model translates the GLIF$_3$ model into a rate-based scheme. Unlike the GLIF model, the differentiability of the GLIFR model enables the application of standard deep learning techniques to optimize parameters underlying intrinsic neuronal dynamics. Thus, we explore two aspects of the GLIFR model. First, we explore the addition of complex neural dynamics. After-spike currents act like a feedback mechanism in a neuron, being directly affected by a neuron's firing pattern and simultaneously being integrated into the voltage along with synaptic currents. Second, we study the effects of heterogeneity in dynamics resulting from heterogeneity in underlying neuronal parameters learned by gradient descent. For example, different after-spike currents may lead to different neuronal timescales. Beyond after-spike currents, differences in the membrane time constant and firing thresholds across neurons may add to the rich repertoire of firing patterns produced by a network.

To explore these two aspects introduced by the GLIFR neurons, we use gradient descent to optimize networks of GLIFR neurons on several temporal tasks and assess the performance and robustness of these networks in comparison to that of vanilla RNNs as well as long short-term-memory (LSTM) networks, which use a gating mechanism to promote memory over timescales (Hochreiter & Schmidhuber, 1997). While the GLIFR networks are outperformed by LSTM networks, our networks have mixed performance when compared to vanilla RNNs on a pattern generation task and a temporally complex sequential MNIST task. We find that it is possible to optimize both intrinsic neuronal parameters and synaptic weights using gradient descent. Optimization of neuronal parameters generally leads to diversity in parameters and dynamics across networks. Moreover, when we compare among several variations of the GLIFR model, we find that both the presence of after-spike currents and the heterogeneity in neuronal properties improve performance, suggesting an important computational role for neuronal complexity and heterogeneity in the brain. We provide code for creating and optimizing GLIFR models in Python.

### 1.2 Related Work

The GLIFR dynamics, defined fully in section 2, are governed by the following equations:

$$S(t) = \frac{1}{1 + e^{-(V(t) - V_{th})/\sigma_V}}$$

$$\frac{dV(t)}{dt} = -k_m V(t) + R_m k_m I_{tot}(t) - (V(t) - V_{reset})\, S_r(t)$$

$$\frac{dI_j(t)}{dt} = -k_j I_j(t) + (a_j + r_j I_j(t))\, S_r(t)$$

$$I_{syn}(t) = W^{in} S_{pre}(t) + W^{lat} S(t)$$

In addition, in these works, parameter optimization by gradient descent was not explored. Other work has explored the effect of spike frequency adaptation on network performance using a different mode of spike frequency adaptation, mediated by threshold adaptation rather than after-spike currents (Salaj et al., 2021). Spike-frequency adaptation improved the performance of LIF SNNs on temporally challenging tasks, such as a sequential MNIST task, where the network must classify images of handwritten digits based on a pixel-by-pixel scan, an audio classification task where the network must identify silence or spoken words from the Google Speech Commands data set, and an XOR task where the network must provide the answer after a delay following the inputs. These networks were found to approach RNN performance on the second audio classification task.

While the work described thus far used spiking models of biological dynamics, it did not explore the advantage that heterogeneity could confer. Several approaches have been taken to this end. One approach is to initialize an SNN with heterogeneous dynamics but optimize only synaptic weights. This method achieved comparable or higher performance than ANNs that employed convolutional or recurrent transformations on an object detection task (She et al., 2021). A second approach is to optimize intrinsic neuronal parameters in addition to synaptic weights. Under the hypothesis that neuronal heterogeneity is computationally advantageous, the learned parameters will be diverse across the trained network. To this end, one study extended the surrogate gradient technique for LIF SNNs to also optimize membrane and synaptic time constants across networks (Perez-Nieves et al., 2021). It was found that these networks learned heterogeneous parameters across the network when trained on temporal MNIST tasks, a gesture classification task, and two auditory classification tasks. On some tasks, particularly the temporally complex tasks that relied on precise timing of spikes, learning parameters improved performance over learning synaptic weights alone. However, in this work, learning parameters relied on surrogate gradients, and a simpler neuron model (LIF) was used.

Recent work has proposed a novel approach, event-based backpropagation, to compute exact gradients despite the discrete spiking (Wunderlich & Pehle, 2021). While promising, this approach requires complex computational techniques. In contrast, our work establishes a method of training neuronal parameters with standard gradient descent by using a rate-based neuronal model; moreover, this model additionally expresses after-spike currents as schematized in Figures 1A and 1B.

To summarize, previous work has suggested that heterogeneity in integration and synaptic timescales improves network performance. This opens the door to the questions we address here: whether after-spike currents that produce complex dynamics within individual neurons can be trained, whether these after-spike currents improve network performance, and whether training naturally leads to their heterogeneity. It also invites the question of whether neuronal models with these complex dynamics can be designed that learn using standard gradient descent. In what follows, we show that this is possible, and illustrate how this results in heterogeneous neuronal dynamics as well as mixed performance relative to vanilla RNNs.

## 2 Generalized Leaky-Integrate-and-Fire-Rate (GLIFR) Model

### 2.1 Model Definition

We model voltage similar to its definition in the spiking GLIF model, but instead of the discrete reset rule in the GLIF model, we let voltage continuously tend toward the reset voltage $V_{reset}$ (mV) at a rate proportional to the firing rate $S_r$. Thus, voltage is modeled as $\frac{dV(t)}{dt} = -k_m V(t) + R_m k_m I_{tot}(t) - (V(t) - V_{reset})\, S_r(t)$, where $I_{tot}(t)$ is the total current produced by the sum of a constant baseline current $I_0$, the after-spike currents, and the synaptic current.

After-spike currents are modeled as a set of separate ionic currents indexed by $j$ as follows: $\frac{dI_j(t)}{dt} = -k_j I_j(t) + (a_j + r_j I_j(t))\, S_r(t)$. The decay factor $k_j$ (ms$^{-1}$) captures the continuous decay of the current. The "spike-dependent" component of the current is determined by the combination of the multiplicative parameter $r_j$ and the additive parameter $a_j$ (pA), scaled by the raw firing rate $S_r(t)$.

Synaptic currents, those resulting from input from the previous layer as well as lateral connections, are modeled according to $I_{syn}(t) = W^{in} S_{pre}(t) + W^{lat} S(t)$, where $S_{pre}$ represents either the normalized firing rate of the presynaptic neuron layer or the input to the network. The weights $W^{in}$ describe how the input to the neuron layer should be weighted, and the lateral weights $W^{lat}$ describe the connections between neurons in the same layer. We do not include a temporal dependence for simplicity as we focus on cellular properties. These equations are summarized in Table 1.
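To make the model's update concrete, the dynamics above can be discretized with a forward-Euler step. The sketch below is a minimal plain-Python illustration, not the paper's implementation (which is in PyTorch); the parameter values and the `dt` step size are hypothetical.

```python
import math

def glifr_step(state, I_in, p, dt=0.05):
    """One forward-Euler step of the GLIFR dynamics (illustrative sketch)."""
    V, I_asc = state["V"], state["I_asc"]
    # smoothed "spiking": sigmoid of voltage relative to threshold
    S = 1.0 / (1.0 + math.exp(-(V - p["V_th"]) / p["sigma_V"]))
    S_r = S / p["tau_s"]                       # raw firing rate
    # after-spike currents: continuous decay plus spike-dependent
    # additive (a_j) and multiplicative (r_j) terms
    I_asc = [I + dt * (-p["k_j"][j] * I + (p["a_j"][j] + p["r_j"][j] * I) * S_r)
             for j, I in enumerate(I_asc)]
    I_tot = p["I_0"] + sum(I_asc) + I_in       # baseline + after-spike + synaptic
    # voltage: leak, current integration, and continuous "reset" toward V_reset
    V = V + dt * (-p["k_m"] * V + p["R_m"] * p["k_m"] * I_tot
                  - (V - p["V_reset"]) * S_r)
    return {"V": V, "I_asc": I_asc}, S

# hypothetical parameters for a neuron with one hyperpolarizing after-spike current
params = {"V_th": 1.0, "sigma_V": 1.0, "tau_s": 1.0, "I_0": 0.0,
          "k_m": 0.2, "R_m": 1.0, "V_reset": 0.0,
          "k_j": [0.1], "a_j": [-1.0], "r_j": [0.0]}
state = {"V": 0.0, "I_asc": [0.0]}
rates = []
for _ in range(200):
    state, S = glifr_step(state, 2.0, params)  # constant input current
    rates.append(S)
```

Because the sigmoid keeps every state update smooth, each step is differentiable with respect to both the input and the neuronal parameters, which is what allows standard gradient descent later in the paper.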

### 2.2 Effect of Parameters on Model Behavior

The effects of the parameters on model behavior have been previously described in the discrete spiking scheme (Mihalaş & Niebur, 2009); in this work, we translate them to a continuous scheme. We visualize the effect of $\sigma_V$ on model behavior (see Figures 2A and 2B). As shown, lower values of $\sigma_V$ approximate discrete spikes, while higher values of $\sigma_V$ result in smoother changes in the firing rate.
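This effect of $\sigma_V$ follows directly from the sigmoidal firing-rate function. A small numerical check (illustrative values only):

```python
import math

def firing_rate(V, V_th=1.0, sigma_V=1.0):
    """Smoothed spiking: S = 1 / (1 + exp(-(V - V_th) / sigma_V))."""
    return 1.0 / (1.0 + math.exp(-(V - V_th) / sigma_V))

# Near-discrete regime: a small sigma_V makes S jump from ~0 to ~1 at threshold.
sharp_low = firing_rate(0.9, sigma_V=0.01)
sharp_high = firing_rate(1.1, sigma_V=0.01)

# Smooth regime: a large sigma_V yields graded changes in firing rate.
smooth_low = firing_rate(0.9, sigma_V=1.0)
smooth_high = firing_rate(1.1, sigma_V=1.0)
```

With $\sigma_V = 0.01$ the same 0.2 mV voltage difference swings the firing rate from essentially 0 to essentially 1, while with $\sigma_V = 1$ it moves the rate only a few percent.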

Varying the values of $aj$ and $rj$ can give rise to a variety of complex patterns including bursting. As shown in Figure 2A, hyperpolarizing (negative) values enable the neuronal firing rate to oscillate slightly, and a combination of hyperpolarizing and depolarizing (positive) after-spike currents enables regular oscillations in firing rate. Because we model firing rates rather than individual spikes, we take this as a form of bursting. We furthermore find that for a given set of neuronal parameters, larger inputs yield higher-frequency oscillations in firing rate (see Figure 2B). We note that in these simulations and later in trained networks, the GLIFR model is theoretically capable of producing both biologically plausible patterns and less biologically realistic activity.

### 2.3 Optimization of Neuronal Parameters for Learning Realizable Signals

We first confirm the ability of neuronal parameters to be optimized through gradient descent in a theoretically simple task: learning a target that is known to be realizable by a GLIFR.

We tested the ability of the learning neuron to reproduce the pattern of the target neuron. Figures 3B and 3C show that the learning neuron successfully learned to approximate the dynamics of the target neuron. However, the final parameters learned by the learning neuron differed from the parameters of the target neuron (see Figure 3D), illustrating that different internal mechanisms can lead to similar dynamics (Prinz et al., 2004). This supports the idea that for the proposed model, when optimizing independently the time constants and the additive and multiplicative parameters for after-spike currents, varying distributions of trained parameters across a network may nevertheless lead to similar network function.
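The spirit of this experiment can be sketched with a one-parameter toy: a "learning" neuron adjusts a single intrinsic parameter (here a membrane decay factor) by gradient descent to reproduce a "target" neuron's trace. All values below are hypothetical, and a numerical gradient stands in for the autodifferentiation used in the paper.

```python
def simulate(k_m, steps=50, dt=0.05, I=1.0):
    """Leaky integration of a constant input (R_m = 1 for simplicity)."""
    V, trace = 0.0, []
    for _ in range(steps):
        V += dt * (-k_m * V + k_m * I)
        trace.append(V)
    return trace

target = simulate(0.5)                   # target neuron's dynamics
k = 0.1                                  # learning neuron starts mismatched

def loss(km):
    return sum((a - b) ** 2 for a, b in zip(simulate(km), target))

lr, eps = 0.002, 1e-5
for _ in range(300):
    grad = (loss(k + eps) - loss(k - eps)) / (2 * eps)   # numerical gradient
    k -= lr * grad
```

Note that in this one-dimensional toy the optimum is unique, so the learned parameter matches the target; in the full GLIFR setting, with time constants and additive and multiplicative after-spike parameters optimized jointly, distinct parameter sets can produce similar dynamics, as the experiment in Figure 3D shows.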

## 3 Results

### 3.1 Strategy for Analyzing Task Learning and Performance

Each network, including RNNs, GLIFR networks, and LSTM networks, consists of a single recurrent hidden layer of the appropriate neuron type, whose outputs at each time point are passed through a fully connected linear transformation to yield the final output. The recurrence is incorporated into the GLIFR networks through the term $W^{lat} S(t)$ in the synaptic currents, equation 1.1, and the $W^{lat}_{pq}\, s^{p,l}(t - \Delta t)$ term in equation 6.5.

Code for the GLIFR model is publicly available at https://github.com/AllenInstitute/GLIFS_ASC. The model is implemented in PyTorch, and we rely on the underlying autodifferentiation engine to backpropagate through the model.

### 3.2 Performance on Tasks

| Model | Sine Generation Task | N-MNIST | L-MNIST | P-MNIST |
| --- | --- | --- | --- | --- |
| RNN | 0.0561 | 84.1740% | 97.7920% | 25.6400% |
| LSTM | 0.0094 | 85.9230% | 98.6620% | N/A |
| Hom | 0.0348 | 88.0000% | 93.6540% | N/A |
| HomA | 0.0309 | 87.9560% | 93.6420% | N/A |
| LHet | 0.0233 | 87.7470% | 93.7750% | N/A |
| LHetA | 0.0227 | 87.8330% | 94.1700% | 61.6780% |
| FHet | 0.0164 | 87.2160% | 93.7220% | N/A |
| FHetA | 0.0166 | 87.2240% | 94.1160% | N/A |
| RHet | 0.0115 | 87.4020% | 93.7520% | N/A |
| RHetA | 0.0121 | 87.3070% | 94.8140% | 67.2090% |

Note: This table lists the testing performance (mean-squared error for the sine wave generation task and accuracy for the remaining tasks) of each network on each task explored.

### 3.3 Robustness to Silencing

GLIFR dynamics enable more complex temporal interactions within and among neurons. What is the resulting impact on the robustness of GLIFR networks to random "deletions" (or silencing) of individual neurons? We test this by evaluating the robustness of trained networks to random silencing of neurons. Specifically, for various proportions $p$, we randomly select a subset of the neurons in the network with proportion $p$ to silence throughout testing. For each subset, we clamp the neurons' firing rates to zero, preventing their contribution to both forward and lateral connections, and compute the performance of the network on the task. We run this experiment on the sine wave generation, N-MNIST, and L-MNIST tasks (see Figure 7). In all tasks, silencing impaired performance, but we analyze the extent to which each network was impaired; lower impairment implies greater robustness to silencing. In the sine wave generation task, several variations of the GLIFR network appear more robust than the RNN when small percentages of neurons are silenced. The GLIFR networks do not appear to maintain this advantage for the N-MNIST task. We see the greatest differences in the L-MNIST task, where we find that when silencing proportions of neurons with $p \geq 0.2$, all forms of GLIFR networks show an advantage over the RNN. In general, RHetA networks perform the best. This observation suggests that neuronal complexity and heterogeneity improve network robustness when compared with vanilla RNNs, despite the GLIFR networks' lower unsilenced baseline performance on this task.
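The silencing procedure can be sketched as a multiplicative mask on the layer's firing rates. This is an illustrative reimplementation, not the paper's code; function names are hypothetical.

```python
import random

def silence_mask(n_neurons, p, seed=0):
    """Randomly choose a proportion p of neurons to silence."""
    rng = random.Random(seed)
    silenced = set(rng.sample(range(n_neurons), int(round(p * n_neurons))))
    return [0.0 if i in silenced else 1.0 for i in range(n_neurons)]

def apply_silencing(firing_rates, mask):
    # Clamping a silenced neuron's rate to zero removes its contribution
    # to both the forward (readout) and lateral connections downstream.
    return [s * m for s, m in zip(firing_rates, mask)]

mask = silence_mask(100, 0.2)            # silence 20% of 100 neurons
rates = apply_silencing([0.5] * 100, mask)
```

Applying the same mask at every timestep of the test simulation reproduces the "silenced throughout testing" condition described above.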

Arguably, the poorer robustness of the RNNs may be anticipated due to the lack of dropout during training (Hinton et al., 2012). Thus, we performed additional experiments on the L-MNIST task. For each probability $p$, we trained each network with dropout with probability $p$ and tested its performance when random sets of neurons (proportion $p$) were silenced through the entirety of each testing simulation. Indeed, we see that the performance of these networks trained with dropout is better than that observed without dropout, but the trends discussed above hold (see Figure 12).

### 3.4 Optimization in Discrete Spiking Regime

In all the experiments so far, we have kept $\sigma_V$ constant during training. As previously noted, the parameter $\sigma_V$ can be reduced such that, when $\sigma_V \ll 1$, the model approaches discrete spiking dynamics.

We take a simulated annealing approach to assess whether this setup can be utilized to learn a nearly discrete model that would more closely reproduce biologically realistic, rapid spiking. Specifically, we train an LHetA GLIFR network on the L-MNIST task while gradually decreasing the $\sigma_V$ parameter over training (from 1 to $10^{-3}$). We find that the LHetA networks still learn the task well, achieving an average accuracy of 74.49% ($n = 10$; standard deviation of 1.90%). This is lower than the previously achieved accuracy of 94.17% using relatively high values of $\sigma_V$ but better than when using a constant but low value of $\sigma_V$ throughout training. We also note that this approach does not converge once $\sigma_V$ is reduced below a certain value (see Figure 8). Thus, further work is needed to identify the optimal annealing procedure for learning tasks with low $\sigma_V$, but this annealing experiment demonstrates the ability of GLIFR networks to produce spiking behavior and still perform well despite being a fundamentally rate-based model.
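One simple way to realize such an annealing schedule is to decay $\sigma_V$ exponentially from its initial to its final value over the training epochs. The paper's exact schedule may differ; this is an assumed illustrative form.

```python
def sigma_v_schedule(epoch, n_epochs, start=1.0, end=1e-3):
    """Exponentially anneal sigma_V from `start` to `end` over training."""
    frac = epoch / max(n_epochs - 1, 1)    # 0 at the first epoch, 1 at the last
    return start * (end / start) ** frac
```

An exponential (rather than linear) decay spends more epochs in the nearly discrete regime, where the network must adapt to increasingly spike-like dynamics.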

### 3.5 Heterogeneity Learned over Training

Based on the performance values above, neuronal heterogeneity seemed to contribute to the GLIFR networks' performance. To further elucidate whether and how the learned heterogeneity in parameters may have reshaped neuronal dynamics, we determined the extent to which the GLIFR networks had learned truly diverse parameters in each task context. For the homogeneously initialized networks, both the $a_j$ and $r_j$ parameters had been initialized with limited diversity to promote neuronal stability, and the remaining neuronal parameters had been set homogeneously. We hypothesized that diversity in all parameters would develop over training. The heterogeneously initialized networks were set with distributions of parameters learned by the LHet and LHetA networks, but we expected the distribution to shift over the course of training as the model fine-tuned its shuffled parameters and weights to the particular task.

We found that training did result in heterogeneous parameter values for all tasks (data shown for L-MNIST; see Figures 13A and 13B), such that the variance of the trained $a_j$ and $r_j$ parameters is much larger than at initialization (the standard deviation of the initial distribution was 0.01). Similarly, we observe shifts in the distribution of some parameters when heterogeneity is "refined." We wanted to determine how well this diversity in neuronal parameters mapped to diversity in neuronal dynamics. To do this, we constructed f-I curves representing the average firing rate of each neuron over a time period (5 ms) when the neuron was injected with varying levels of current (see Figure 13B). We found a diversity in the shapes of these curves, illustrating the diversity in the neuronal dynamics produced by neurons in the trained networks.

We also found that the classes based on parameters separated the f-I curves (see Figure 9B). For example, the neurons with hyperpolarizing after-spike currents tended to exhibit low maximal firing rates. And the neurons with depolarizing after-spike currents tended to exhibit firing rates that rapidly saturated to relatively high values. While we cannot extend these results to identify each category as discrete cell types, this further suggests that the learned diversity in parameters enabled a diversity in dynamics as well.
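The f-I curve construction amounts to averaging a neuron's firing rate over a fixed simulation window for each injected current level. The sketch below uses a toy single-neuron update in place of the trained GLIFR neurons; all names and values are illustrative.

```python
import math

def toy_step(state, I, k_m=0.2, dt=0.05, V_th=1.0, sigma_V=0.5):
    """Minimal stand-in for a single-neuron update (hypothetical values)."""
    V = state["V"] + dt * (-k_m * state["V"] + k_m * I)
    S = 1.0 / (1.0 + math.exp(-(V - V_th) / sigma_V))
    return {"V": V}, S

def f_i_curve(step_fn, currents, steps=200):
    """Average firing rate of a neuron at each injected current level."""
    curve = []
    for I in currents:
        state, total = {"V": 0.0}, 0.0
        for _ in range(steps):
            state, S = step_fn(state, I)
            total += S
        curve.append(total / steps)
    return curve

curve = f_i_curve(toy_step, [0.0, 1.0, 2.0])
```

Running this per neuron with the trained parameters, and comparing curve shapes (maximal rate, saturation speed), is what distinguishes, for example, neurons with hyperpolarizing versus depolarizing after-spike currents.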

## 4 Discussion

This work explored the role of neuronal complexity and heterogeneity in network learning and computation using a novel paradigm. Specifically, we developed the GLIFR model, a differentiable rate-based neuronal model that expresses after-spike currents in addition to simpler types of dynamics. While past work (Perez-Nieves et al., 2021; Salaj et al., 2021) has studied neuronal complexity and heterogeneity, here we demonstrated the ability to learn both synaptic weights and individual neuronal parameters underlying intrinsic dynamics with traditional gradient descent for models with more complex internal neuron dynamics generated by after-spike currents. While it is generally rate-based, the GLIFR model retains the ability to produce spiking outputs and thus is a powerful model for studying neuronal dynamics.

We demonstrated the ability of networks of GLIFR neurons to learn tasks ranging in complexity from a sine wave generation task to the pixel-by-pixel MNIST task. In each task, heterogeneous parameters and dynamics were learned. We tested the effects of the ability to learn parameter heterogeneity and the ability to express after-spike currents on model performance. Learning heterogeneous parameter distributions generally improved performance, and modeling after-spike currents improved performance on the L-MNIST task. Regardless of whether heterogeneous parameters or after-spike currents were learned, the GLIFR models outperformed vanilla RNNs in the N-MNIST and P-MNIST tasks. We hypothesize that the differences in results among tasks reflect differences in complexity and engagement of memory between tasks. For instance, we only observe improvements in GLIFR networks over RNN networks in N-MNIST and P-MNIST tasks, but not in the sine wave generation task and the L-MNIST task. Since the N-MNIST and P-MNIST tasks theoretically engage more memory than the other two tasks, we suggest that the RNN may simply do well enough for the other two tasks that improvement over it would be unexpected. It is difficult to explain why neuronal heterogeneity appeared to be disadvantageous in the N-MNIST task, but perhaps neuronal heterogeneity complicated the error landscape and hindered the network's learning ability in this setting. We also found that GLIFR networks trained on the L-MNIST task were more robust to neuronal loss. Specifically, when we silenced fixed fractions of neurons, the GLIFR networks generally performed better than vanilla recurrent neural networks. Finally, we found that learning parameters across a network in response to a temporally challenging task enabled the network to develop neurons with differing intrinsic parameters and dynamics.

These findings support the hypothesis that neuronal heterogeneity and complexity have a computational role in learning complex tasks. The implications of this are two-fold. On one hand, neuronal heterogeneity may allow more powerful computing in artificial neural networks. Vanilla recurrent networks, for example, rely on a single type of dynamic—typically the ReLU, sigmoid, or tanh activation function. However, the use of activation functions that can be “learned” over time, such that the trained network exhibits diverse dynamics across its neurons, may confer a computational advantage (Geadah et al., 2020). Here we demonstrated several computational advantages of more biologically realistic neurons for some specific temporally challenging tasks while using traditional gradient descent training techniques. Other learning techniques may be required for other tasks, where the complex error landscape hinders the ability of gradient descent to optimize both the neuronal parameters and the synaptic weights.

On the other hand, our results provide further insight into the purpose of complexity and diversity of neural dynamics seen in the brain. Our brains are responsible for integrating various sensory stimuli, each varying in their temporal structures. Intuitively, heterogeneity in the types of dynamics used by neurons across the brain may enable robust encoding of a broad range of stimuli. The ability of diverse networks to encode information more efficiently is supported by several studies (Hunsberger et al., 2014; Shamir & Sompolinsky, 2006; Tripathy et al., 2013). Certain types of complex dynamics may also affect synaptic strengths and improve the network's robustness to damage.

We believe that with the GLIFR model in the research domain, we have opened the door to more intriguing studies that will identify further roles for complex neural dynamics in learning tasks. In future work, testing the GLIFR models on additional tasks may provide additional insight into the computational advantages of neuronal heterogeneity and complexity. Additionally, it would be interesting to pursue a theoretical explanation of the computational role of heterogeneity and complexity. Past experimental work (Tripathy et al., 2013) found that neuronal heterogeneity increased the amount of information carried by each neuron and reduced redundancy in the network. Exploring this in computational models would be valuable, as it may suggest additional ways in which the computational advantages of biological dynamics can be harnessed to improve artificial neural networks and yield insights into the mechanisms of computation in biological networks.

## 5 Code Availability

Modeling, training, and analysis code is available at https://github.com/AllenInstitute/GLIFS_ASC.

## 6 Methods

### 6.1 GLIFR Model

#### 6.1.1 Development of the Single-Neuron Model

As previously described, the GLIFR model encapsulates the concepts behind the GLIF$_3$ model (Teeter et al., 2018), while producing continuous outputs. We describe how state variables are computed at each timestep in the GLIFR model, comparing the computation to that in the GLIF model. We use a subscript $s$ for the GLIF state variables and no subscript for the equivalent GLIFR variables. We use superscript $p$ to represent the index of the neuron in consideration and $l$ for the associated layer, where postsynaptic layers are indexed with larger numbers than are presynaptic layers. For example, $S^{p,l}(t)$ defines the normalized firing rate of the $p$th neuron in the $l$th layer.

Here, $V^{p,l}(t)$ decays according to $k_m$ (ms$^{-1}$) and integrates current based on resistance $R_m$ (G$\Omega$). Voltage also tends toward the reset voltage $V_{reset}$ (mV) at a rate proportional to the firing rate $s^{p,l}$. This is a continuous equivalent of the discrete voltage reset.

A summary of these equations and the states and parameters describing the GLIFR equations is in Tables 1, 3, and 4.

Table 3: States of the GLIFR model.

| State | Description | Units |
|---|---|---|
| $s(t)$ | normalized firing rate | 1 |
| $V(t)$ | voltage | mV |
| $I_j(t)$ | after-spike current | pA |
| $I_{syn}(t)$ | synaptic current | pA |

Table 4: Parameters of the GLIFR model.

| Parameter | Description | Units |
|---|---|---|
| *Spiking related* | | |
| $V_{th}^{p,l}$ | threshold | mV |
| $\sigma_V$ | smoothness | mV |
| *Voltage related* | | |
| $k_m^{p,l}$ | voltage decay factor | ms$^{-1}$ |
| $R_m$ | membrane resistance | G$\Omega$ |
| $V_{reset}$ | reset voltage | mV |
| $I_0$ | baseline current | pA |
| *After-spike current related* | | |
| $a_j^{p,l}$ | additive constant in after-spike current | pA |
| $r_j^{p,l}$ | multiplicative constant in after-spike current | unitless |
| $k_j^{p,l}$ | decay factor for after-spike current | ms$^{-1}$ |
| *Synaptic current related* | | |
| $W_{in}^{pq}$ | input weight | pA |
| $W_{lat}^{pq}$ | lateral weight | pA |

Note: Here, indices $p,l$ refer to neuron $p$ in layer $l$.

#### 6.1.2 Effect of $\sigma V$ on Model Behavior
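The body of this section is not reproduced here. As an illustration of the parameter's role, assuming the sigmoidal voltage-to-rate mapping $s = \mathrm{sigmoid}((V - V_{th})/\sigma_V)$ that the smoothness parameter $\sigma_V$ modulates (the functional form is our assumption, consistent with Table 4), smaller $\sigma_V$ gives a sharper, more spike-like threshold:

```python
import numpy as np

def firing_rate(V, V_th=1.0, sigma_V=1.0):
    # Normalized firing rate as a smoothed threshold on voltage (assumed form).
    return 1.0 / (1.0 + np.exp(-(V - V_th) / sigma_V))

V = 1.5  # mV, slightly above threshold
sharp = firing_rate(V, sigma_V=0.1)   # close to 1: near-deterministic spiking
smooth = firing_rate(V, sigma_V=1.0)  # graded, partially active response
```

As $\sigma_V \to 0$, the mapping approaches a hard threshold, recovering spike-like all-or-none behavior.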

#### 6.1.3 Discretization of GLIFR Equations
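The discretized equations themselves are not reproduced here. As an illustrative sketch (not the paper's exact discretization), a forward-Euler step consistent with the qualitative dynamics described above — voltage leaking at rate $k_m$, current integration through $R_m$, a continuous reset toward $V_{reset}$ in proportion to $s$, and after-spike currents decaying at rate $k_j$ and driven by $s$ through $a_j$ and $r_j$ — might look like:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def glifr_step(V, I_j, I_syn, p, dt=0.05):
    """One forward-Euler step of a GLIFR-like neuron (illustrative sketch;
    not the paper's exact discretization). Parameter names follow the
    GLIFR parameter table."""
    # Smoothed firing rate (assumed sigmoidal voltage-to-rate mapping).
    s = sigmoid((V - p["V_th"]) / p["sigma_V"])
    # After-spike current: decay at rate k_j, plus a continuous analog of
    # the discrete GLIF update I_j <- r_j * I_j + a_j, applied in
    # proportion to the firing rate s.
    I_j_new = (1.0 - p["k_j"] * dt) * I_j + s * ((p["r_j"] - 1.0) * I_j + p["a_j"])
    # Voltage: leak at rate k_m, integrate total current through R_m,
    # and relax toward V_reset in proportion to s (continuous reset).
    I_tot = I_syn + I_j + p["I_0"]
    V_new = V + dt * p["k_m"] * (-V + p["R_m"] * I_tot) + s * (p["V_reset"] - V)
    return s, V_new, I_j_new

p = dict(V_th=1.0, sigma_V=1.0, k_m=0.05, R_m=1.0, V_reset=0.0,
         I_0=0.0, k_j=0.1, r_j=0.5, a_j=0.1)
s, V, I_j = 0.0, 0.0, 0.0
for _ in range(100):
    s, V, I_j = glifr_step(V, I_j, I_syn=2.0, p=p)
```

Because every operation here is smooth in the parameters, the entire unrolled simulation is differentiable and standard gradient descent applies.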

### 6.2 GLIFR Optimization

We use standard gradient descent to optimize networks of GLIFR neurons, training both synaptic weights and parameters in each neuron separately, similar to Perez-Nieves et al. (2021). In this way, the network can learn heterogeneous parameters across neurons.

We also add biological constraints to several parameters to maintain stability through training. In biological systems, $k_m$ and $k_j$ are restricted to be positive since they represent rates of decay. Moreover, to maintain stability, decay factors should be kept between 0 and $\frac{1}{dt}$ throughout training. To achieve this constraint without introducing discontinuities in the loss function, we optimize $t^* = \ln\left(\frac{k\,dt}{1 - k\,dt}\right)$ for each decay factor $k$. Thus, $t^*$ can be any real number, whereas $k = \mathrm{sigmoid}(t^*)\cdot\frac{1}{dt}$ is the decay factor we use when simulating neural activity. In this way, the value of $k$ that is used is always between 0 and $\frac{1}{dt}$.
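This reparameterization of the decay factors can be checked numerically; a short sketch (function and variable names are ours):

```python
import numpy as np

def to_unconstrained(k, dt):
    # t* = ln(k*dt / (1 - k*dt)), the logit of k*dt; defined for 0 < k < 1/dt.
    return np.log(k * dt / (1.0 - k * dt))

def to_decay(t_star, dt):
    # k = sigmoid(t*) / dt, guaranteed to lie in (0, 1/dt) for any real t*.
    return (1.0 / (1.0 + np.exp(-t_star))) / dt

dt = 0.05   # ms
k = 2.0     # ms^-1, inside the allowed range (0, 1/dt) = (0, 20)
t_star = to_unconstrained(k, dt)
k_back = to_decay(t_star, dt)  # round-trips to the original k
```

Gradient descent can move $t^*$ anywhere on the real line, yet the decay factor actually used in simulation never leaves its stable range.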

The term $r_j$ can also introduce instability, so we use a similar method to constrain $r_j$ to $[-1, 1]$. We optimize $t^* = \ln\left(\frac{1 - r_j}{1 + r_j}\right)$ and retrieve $r_j$ using $r_j = 1 - 2\cdot\mathrm{sigmoid}(t^*)$, which is bounded between $-1$ and 1. Other instabilities may arise, for example, from large synaptic weights, but we find these approaches sufficient for training networks of GLIFR neurons.
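The $r_j$ transform admits the same round-trip check (names are ours):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rj_to_unconstrained(r):
    # t* = ln((1 - r) / (1 + r)); defined for -1 < r < 1.
    return np.log((1.0 - r) / (1.0 + r))

def rj_from_unconstrained(t_star):
    # r = 1 - 2*sigmoid(t*), bounded in (-1, 1) for any real t*.
    return 1.0 - 2.0 * sigmoid(t_star)

r = -0.7
t_star = rj_to_unconstrained(r)
r_back = rj_from_unconstrained(t_star)  # round-trips to -0.7
```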

The parameters we optimize after making the above changes are listed in Table 5. When training on complex tasks, we create networks, each consisting of a single hidden layer of GLIFR neurons whose outputs at each time point are passed through a fully connected linear transformation to yield the final output. The weights and biases used for this linear transformation are also optimized.

Table 5: Trainable parameters in GLIFR networks.

| Parameter | Description | Units |
|---|---|---|
| $V_{th}^{p,l}$ | threshold | mV |
| $\omega_{in}^{pq}$, $\omega_{lat}^{pq}$ | combined synaptic weights | mV |
| $a_j^{p,l}$ | additive constant in after-spike current | pA |
| $r_j^{p,l,*}$ | transformed multiplicative constant in after-spike current | unitless |
| $k_j^{p,l,*}$ | transformed decay factor for after-spike current | ms$^{-1}$ |
| $k_m^{p,l,*}$ | transformed voltage decay factor | ms$^{-1}$ |

Notes: List of trainable parameters in GLIFR networks, along with their units. Here, indices $p,l$ refer to neuron $p$ in layer $l$, so that individual neuron parameters are trained separately.

### 6.3 Evaluation Setup

We additionally study multiple variations of the GLIFR networks, varying in whether parameters are homogeneously initialized and whether parameters are learned over training. When initializing homogeneously, we employ a deterministic initialization scheme to decrease the likelihood that the network will learn parameters underlying exploding or saturating firing rates; specifically, we initialize the network with small decay constants and thresholds close to the reset voltage of 0 mV. The homogeneous initialization scheme is described in Table 6. Note that the neuron described is in a layer with $N$ neurons. When initializing heterogeneously, we initialize both parameters and input and lateral weights by randomly sampling from the distribution of parameters and weights in a trained LHet or LHetA model. This also decreases the likelihood of learning parameters that lead to exploding or saturating firing rates while erasing any learned patterns in the parameters or connectivity patterns in the network.

Table 6: Homogeneous initialization scheme.

| Parameter | Homogeneous Initialization |
|---|---|
| $V_{th}$ | 1 mV |
| $\omega^{pq}$ | $U(-\frac{1}{N}, \frac{1}{N})$ mV |
| $a_j^p$ | $U(-0.01, 0.01)$ pA |
| $r_j^p$ | $U(-0.01, 0.01)$ |
| $k_j^p$ | $0.1dt$ |
| $k_m^p$ | $dt$ |

Notes: This table describes how each trainable parameter was initialized in GLIFR neurons either homogeneously or heterogeneously across neurons. Note that in both schemes, $\omega $, $rj$, and $kj$ were initialized with at least some amount of heterogeneity.
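As a sketch, the homogeneous scheme of Table 6 could be instantiated as follows for a layer of $N$ neurons fed by $M$ inputs (function and field names are ours; the decay-factor entries are omitted because they are stored in transformed form during training):

```python
import numpy as np

def homogeneous_init(n_neurons, n_inputs, seed=None):
    """Sample the homogeneous initialization of Table 6 (sketch).

    Thresholds are shared (1 mV); weights and after-spike-current
    constants get small random values matching the table's U(., .) entries.
    """
    rng = np.random.default_rng(seed)
    bound = 1.0 / n_neurons
    return {
        "V_th": np.full(n_neurons, 1.0),                                   # mV
        "w_in": rng.uniform(-bound, bound, size=(n_inputs, n_neurons)),    # mV
        "w_lat": rng.uniform(-bound, bound, size=(n_neurons, n_neurons)),  # mV
        "a_j": rng.uniform(-0.01, 0.01, size=n_neurons),                   # pA
        "r_j": rng.uniform(-0.01, 0.01, size=n_neurons),                   # unitless
    }

params = homogeneous_init(n_neurons=128, n_inputs=28, seed=0)
```

Note that even under "homogeneous" initialization, the weights and after-spike constants carry a small amount of randomness, as the table notes state.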

### 6.4 Tasks

In this work, we explore multiple tasks, analyzing and comparing the performance of RNNs, LSTMs, and variations of GLIFR networks. The chosen tasks range in complexity. When training networks on these tasks, we parameter-match them, using the hidden sizes listed in Table 7. The hyperparameters used for each task are listed in Table 8. Below, we discuss each task in detail.

Table 7: Network sizes for each task.

| Model | Sine Generation | L-MNIST | P-MNIST | N-MNIST |
|---|---|---|---|---|
| RNN | 128 (16,769) | 256 (75,530) | 256 (75,530) | 256 (75,530) |
| LSTM | 63 (16,696) | 122 (75,406) | 122 (75,406) | 122 (75,406) |
| Hom | 128 (16,641) | 256 (75,274) | 256 (75,274) | 256 (75,274) |
| HomA | 128 (16,641) | 256 (75,274) | 256 (75,274) | 256 (75,274) |
| LHet | 127 (16,638) | 255 (75,235) | 255 (75,235) | 255 (75,235) |
| LHetA | 124 (16,617) | 252 (75,106) | 252 (75,106) | 252 (75,106) |
| FHet | 128 (16,641) | 256 (75,274) | 256 (75,274) | 256 (75,274) |
| FHetA | 128 (16,641) | 256 (75,274) | 256 (75,274) | 256 (75,274) |
| RHet | 127 (16,638) | 255 (75,235) | 255 (75,235) | 255 (75,235) |
| RHetA | 124 (16,617) | 252 (75,106) | 252 (75,106) | 252 (75,106) |

Notes: This table lists the size of networks trained on each task. These are listed as number of neurons (number of parameters).

Table 8: Hyperparameters for each task.

| | Sine Generation | L-MNIST | P-MNIST | N-MNIST |
|---|---|---|---|---|
| *Model hyperparameters* | | | | |
| $\Delta t$ (GLIFR/RNN) | 1 ms / 0 ms | 1 ms / 0 ms | 15 ms / 15 ms | 0 ms / 0 ms |
| $dt$ | 0.05 ms | 0.05 ms | 0.05 ms | 1 ms |
| *Optimization technique* | | | | |
| Loss function | Mean-squared error | Cross-entropy | Cross-entropy | Cross-entropy |
| Technique | Adam (Kingma & Ba, 2014) | Adam | Adam | Adam |
| *Optimization hyperparameters* | | | | |
| Learning rate | 0.0001 | 0.001 | 0.0001 | 0.001 |
| Betas | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999) | (0.9, 0.999) |
| Batch size | 6 | 256 | 512 | 512 |
| Number of epochs | 5000 | 30 | 50 | 30 |

Notes: This table lists the hyperparameters used for each task. The only hyperparameter that varies between networks is the synaptic delay ($\Delta t$), and this is denoted as ($\Delta t$ for GLIFR and LSTM/$\Delta t$ for RNN).

#### 6.4.1 Sine Generation Task

In the sine generation task, each network was trained to generate 5 ms sinusoidal patterns whose frequencies are determined by the amplitudes of the corresponding constant inputs. We trained each network to produce six sinusoids with frequencies ranging from 80 Hz to 600 Hz. The $i$th sinusoid was associated with a constant input of amplitude $(i/6)+0.25$. Previous work (Sussillo & Barak, 2013) tackles a more complex version of this task using RNNs but uses Hessian-free optimization, while we limit our training technique on both the RNN and the GLIFR networks to simple gradient descent.
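A sketch of how the task's inputs and targets might be constructed (the linear spacing of the six frequencies between 80 and 600 Hz is our assumption; the text specifies only the range and the amplitude rule $(i/6) + 0.25$):

```python
import numpy as np

def make_sine_task(dt=0.05, duration=5.0, n_freqs=6):
    """Constant-input / sinusoidal-target pairs for the sine generation task.

    dt and duration are in ms; frequency spacing is assumed linear.
    """
    t = np.arange(0.0, duration, dt)             # time axis, ms
    freqs = np.linspace(80.0, 600.0, n_freqs)    # Hz (assumed spacing)
    # The i-th sinusoid (1-indexed) is cued by a constant input (i/6) + 0.25.
    inputs = np.stack([np.full_like(t, i / n_freqs + 0.25)
                       for i in range(1, n_freqs + 1)])
    targets = np.stack([np.sin(2.0 * np.pi * f * t / 1000.0)  # t is in ms
                        for f in freqs])
    return inputs, targets

X, Y = make_sine_task()  # each sinusoid: 5 ms / 0.05 ms = 100 timesteps
```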

#### 6.4.2 Line-by-Line MNIST (L-MNIST) Task

In the L-MNIST task, a network takes in a 28-dimensional input at each timestep corresponding to a row of an MNIST image, which is sized $28 \times 28$. At each of the 28 timesteps, the network produces a 10-dimensional output (1 dimension for each digit 0–9), whose softmax would represent a one-hot encoding of the digit. The network is trained so that the 10-dimensional output at the last time point correctly indicates the input digit.

#### 6.4.3 Pixel-by-Pixel MNIST (P-MNIST) Task

We analyze a variation of the L-MNIST task in which at each timestep, a network takes in a 1-dimensional input corresponding to a single pixel of an MNIST image. Thus, the number of timesteps processed for a single sample is $28 \times 28 = 784$. Similar to the L-MNIST setup, the network produces a 10-dimensional output for each timestep, and the network is trained so that the output at the last time point correctly indicates the input digit.
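The two presentations differ only in how an image is unrolled in time; a minimal sketch of the reshaping (variable names are ours):

```python
import numpy as np

image = np.random.rand(28, 28)        # stand-in for one MNIST image (rows x cols)

# L-MNIST: 28 timesteps, each a 28-dimensional row of the image.
l_mnist_seq = image                   # shape (28, 28): (timesteps, features)

# P-MNIST: 784 timesteps, each a single pixel (scanned row by row).
p_mnist_seq = image.reshape(-1, 1)    # shape (784, 1)
```

P-MNIST therefore stretches the same information over a 28-fold longer sequence, making the task substantially harder temporally.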

#### 6.4.4 Neuromorphic MNIST (N-MNIST) Task

In the N-MNIST task, a network is trained to identify the digit represented by an MNIST image captured by a dynamic vision sensor (DVS) (Orchard et al., 2015). The image is smoothly moved in front of a DVS camera, while for each frame, the DVS camera senses changes in pixel intensity in two directions and translates these into spiking events. As a result, a single image is represented by a $34 \times 34 \times 2 \times T$ matrix, where $T$ is the number of frames and each frame is $34 \times 34$. Due to the large $T$ accompanying samples from this data set, we employ the strategy taken in He et al. (2020). Specifically, for a timestep duration $dt$, we summarize each consecutive nonoverlapping window of length $dt$ by computing $\mathrm{sign}\left(\sum_t X_{ijk}(t)\right)$ for each pixel coordinate $i, j$ and channel $k$. We define $\mathrm{sign}(x)$ to be 0 for $x = 0$, positive for $x > 0$, and negative for $x < 0$. After summarizing each consecutive window, we use the first 15 timesteps for training.
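The window summarization can be sketched as follows (array layout and names are ours; the window length plays the role of $dt$ in units of frames):

```python
import numpy as np

def summarize_events(X, window):
    """Collapse non-overlapping windows of length `window` along the last
    (time) axis of a (34, 34, 2, T) event tensor into sign(sum) values."""
    T = X.shape[-1]
    n_windows = T // window                       # drop any incomplete tail
    X = X[..., :n_windows * window]
    X = X.reshape(X.shape[:-1] + (n_windows, window))
    # sign of the windowed event count: -1, 0, or +1 per pixel and channel
    return np.sign(X.sum(axis=-1))

events = np.random.randint(-1, 2, size=(34, 34, 2, 300))  # synthetic events
summary = summarize_events(events, window=20)[..., :15]   # keep first 15 steps
```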

### 6.5 Hardware

To accelerate training networks on the L-MNIST and P-MNIST tasks, we use NVIDIA RTX 2080 Ti GPUs (two GPUs and eight CPUs). With GPU acceleration, one epoch on the L-MNIST task takes under 1 minute, and one epoch on the P-MNIST task takes approximately 2 minutes. We trained the sine wave generation task and the N-MNIST task, which took around 20 minutes per epoch, on CPU only.

## 7 Supplementary Data

## Acknowledgments

We acknowledge the following sources of funding and support: the University of Washington Mary Gates Research Endowment, NIH training grant T90 DA 32436-10, and NIH BRAIN initiative grant 1RF1DA055669. We thank the Allen Institute founder, Paul G. Allen, for his vision, encouragement, and support.

## References

*PLOS Computational Biology*

*Nature Communications*

*Neuron*

*Learning to predict in networks with heterogeneous and dynamic synapses.*

*PLOS Computational Biology*

*Advantages of biologically-inspired adaptive neural activation in RNNs during learning*

*Neuronal dynamics: From single neurons to networks and models of cognition*

*Neural Networks*

*Improving neural networks by preventing co-adaptation of feature detectors*

*Neural Computation*

*Journal of Physiology*

*PLOS Computational Biology*

*Advances in neural information processing systems*

*Neural Computation*

*IEEE Transactions on Neural Networks*

*Adam: A method for stochastic optimization.*

*Independently recurrent neural network (INDRNN): Building a longer and deeper RNN.*

*Recurrent neural network for text classification with multi-task learning.*

*Cell*

*Neural Computation*

*Biophysical Journal*

*PLOS Computational Biology*

*IEEE Signal Processing Magazine*

*Frontiers in Neuroscience*

*Nature Communications*

*Nature Neuroscience*

*Neuron*

*eLife*

*Neural Computation*

*Frontiers in Neuroscience*

*Neural Computation*

*Nature Communications*

*Proceedings of the National Academy of Sciences*

*Scientific Reports*

*Proceedings of the National Academy of Sciences*

*Neural Computation*