## Abstract

Learning new concepts rapidly from a few examples is an open issue in spike-based machine learning. Such few-shot learning poses substantial challenges to current learning methodologies for spiking neural networks (SNNs) due to the lack of task-related prior knowledge. The recent learning-to-learn (L2L) approach allows SNNs to acquire prior knowledge through example-level learning and task-level optimization. However, existing L2L-based frameworks do not target the neural dynamics (i.e., neuronal and synaptic parameter changes) on different timescales. This diversity of temporal dynamics is an important attribute of spike-based learning, which enables networks to rapidly acquire knowledge from very few examples and gradually integrate this knowledge. In this work, we consider the neural dynamics on various timescales and provide a multi-timescale optimization (MTSO) framework for SNNs. This framework introduces an adaptive-gated LSTM to accommodate two different timescales of neural dynamics: short-term learning and long-term evolution. Short-term learning is a fast knowledge acquisition process achieved by a novel surrogate gradient online learning (SGOL) algorithm, where the LSTM guides the gradient updating of the SNN on a short timescale through adaptive learning rate and weight decay gating. The long-term evolution aims to slowly integrate acquired knowledge and form a prior, which is achieved by optimizing the LSTM guidance process to tune the SNN parameters on a long timescale. Experimental results demonstrate that the collaborative optimization of multi-timescale neural dynamics allows SNNs to achieve promising performance on few-shot learning tasks.

## 1  Introduction

Spiking neural networks (SNNs) are capable of encoding spatiotemporal information of external inputs through discrete spike signals, which exhibits great computational potential in a biologically feasible way (Maass, 1997; Ghosh-Dastidar & Adeli, 2009). Currently, there are two main directions for effectively training SNNs: unsupervised learning and supervised learning. Unsupervised learning adopts plasticity mechanisms of local neuronal activity to carry out adaptive learning, for example, spike-timing-dependent plasticity (STDP) (Diehl & Cook, 2015; Kheradpisheh et al., 2016). Achieving high performance is quite difficult due to the lack of global supervised signals. Instead, supervised learning can use a global error signal to optimize the neural dynamics of SNNs, which has shown superior performance in important domains such as vision and speech (Samadi, Lillicrap, & Tweed, 2017; Zenke & Ganguli, 2018; Wu, Yılmaz, Zhang, Li, & Tan, 2020). Most of these methods consider the spatiotemporal credit assignment of errors and handle nondifferentiable spike activities with a surrogate gradient (Wu, Deng, Li, Zhu, & Shi, 2018; Shrestha & Orchard, 2018; Chen & Li, 2020). Some novel work focuses on the computational advantage of multispike learning, that is, a single neuron can be trained to learn multicategory tasks (Xiao, Yu, Yan, & Tang, 2019; Yu et al., 2020). Multispike learning can exploit the temporal dynamics of spiking neurons and effectively solve the recognition of perceptual cues with rich temporal information, such as the temporal credit assignment (TCA) problem (Xiao et al., 2019; Zhang et al., 2020). However, the existing methods are unable to solve few-shot learning problems; that is, they often suffer from overfitting when learning new concepts from a few labeled data, failing to identify variations on these concepts in future percepts (Fei-Fei, Fergus, & Perona, 2003; Lake, Salakhutdinov, Gross, & Tenenbaum, 2011; Lake, Salakhutdinov, & Tenenbaum, 2015).

Establishing this kind of fast and flexible few-shot learning is underexplored in spike-based machine learning, and it imposes substantial challenges: limited supervised information requires that the learning model integrate a priori knowledge from related tasks, and this a priori knowledge must enable transfer to new, potentially unseen tasks. One typical learning approach for investigating this multitask knowledge integration and transfer is learning-to-learn (L2L) (Schmidhuber, Zhao, & Wiering, 1997; Thrun, 1998), which aims to learn a learning method conducive to multitask transfer. In particular, L2L introduces a nested optimization with two loops (see Figure 1a): an inner loop performs quick learning to acquire the knowledge for a specific task, and an outer loop slowly optimizes the inner learning process to maximize task-level performance. Through appropriate inner learning and outer optimization, L2L can extract the a priori knowledge shared among all tasks and enable the network to quickly adapt under the constraints of few-shot learning. A commonly used few-shot learning setting for L2L is N-way-K-shot (see Figure 1b). In each task, also termed an episode, the model can learn from some labeled data (K samples of each of N unseen classes in $D_{train}$) and is optimized according to its performance on new samples ($D_{test}$). In addition, a meta-train set $D_{meta\text{-}train}$ and a meta-test set $D_{meta\text{-}test}$ are used to simulate task-level learning and evaluate transfer performance, respectively.
Figure 1:

The few-shot learning scheme. (a) Schematic for learning to learn (L2L). (b) N-way-K-shot task setting.

The L2L approach inspires an interesting body of work on solving few-shot learning problems with artificial learning systems, generally including metric-based, memory-based, and optimization-based approaches (Vanschoren, 2018). Metric-based methods learn data embeddings in a certain metric space so that they can be discriminated with a simple nearest-neighbor or linear classifier. The form of the embeddings is decided by a series of unique structures: Siamese networks (Koch, Zemel, & Salakhutdinov, 2015), matching networks (Vinyals, Blundell, Lillicrap, & Wierstra, 2016), prototypical networks (Snell, Swersky, & Zemel, 2017), and relation networks (Sung et al., 2018). Memory-based methods rely on a reliable external memory module, such as meta-networks (Munkhdalai & Yu, 2017) and memory networks (Santoro, Bartunov, Botvinick, Wierstra, & Lillicrap, 2016). These architectures store the knowledge required to address the few-shot learning problem in their external memory and implement classification by comparison with the historic information stored in the memory. Optimization-based methods aim to optimize network parameters so that they can be fine-tuned with several gradient descent steps on a few examples. Model-agnostic meta-learning (MAML) (Finn, Abbeel, & Levine, 2017) uses second-order gradients to search for an initial weight of the neural network that is efficient for fine-tuning in the few-shot condition, while Reptile (Nichol & Schulman, 2018) gets rid of the complicated second-order gradient calculation by incorporating an $\ell_2$ loss. Further, the long short-term memory (LSTM)-based optimizers (Hochreiter, Younger, & Conwell, 2001; Andrychowicz et al., 2016; Ravi & Larochelle, 2017) show superior knowledge transfer capabilities, where the a priori knowledge is embodied in both a task-common initial condition and appropriate updates of the network parameters.

To explore how to perform L2L in a brain-like way, recent advances have implemented L2L with spiking neurons, primarily focusing on the transfer learning capabilities of specific spiking models. A special recurrent SNN with adapting neurons (LSNN) is suitable for L2L and can approach the knowledge transfer performance of LSTM networks, but the universality of this algorithm for other SNN structures is not addressed (Bellec, Salaj, Subramoney, Legenstein, & Maass, 2018). With the aid of powerful gradient-free optimization tools, L2L also enhances the reward-based learning capability of SNNs in neuromorphic hardware (Bohnstingl, Scherr, Pehle, Meier, & Maass, 2019). Even so, these methods are limited to relatively simple learning tasks, such as learning nonlinear functions from a teacher network or the reinforcement learning of multiarmed bandits. There are still few reports on their performance on more complex few-shot pattern recognition problems. The few-shot gesture recognition performance on a neuromorphic processor was first reported in Stewart, Orchard, Shrestha, and Neftci (2020). Unlike L2L, it adopts a transfer learning strategy, where the feature extraction portion of the SNN is pretrained offline and the remaining portion is trained online with the Loihi plasticity rule (LPR). However, it has been observed that this training strategy is still troubled by overfitting.

## 2  Few-Shot Learning by Multi-Timescale Optimization

As shown in Figure 2, our proposed multi-timescale optimization (MTSO) coordinates short-term learning and long-term evolution through a modified LSTM. Before the SNN receives spiking inputs, the LSTM provides its own hidden state $\theta_0$ to the SNN as its initial parameters, and $\varphi_0$ is defined as the LSTM parameters for modulating the SNN learning. In short-term learning, the short-term gradient $\nabla_t^i$ and loss information $L_t^i$ of the SNN are calculated by a surrogate gradient online learning (SGOL) rule at each time step $t$. Each LSTM cell receives the short-term information at each time step and sends temporarily updated parameters $\theta_{t+1}^i$ back to the SNN for the calculation at the next time step. Meanwhile, the updated parameters are also transmitted to the next time step of the LSTM as a hidden state so that the LSTM can simulate the SNN learning process (gradient updates) on $D_{train}$. In long-term evolution, we optimize the LSTM's parameters ($\varphi_0$) and initial hidden state ($\theta_0$) by defining the loss function of the SNN on $D_{test}$ and executing backpropagation through time (BPTT) from the SNN to the LSTM. Hence, MTSO can force SNNs to encode all learning results over multiple episodes and support fast few-shot learning of a random episode.
Figure 2:

Overview of the multi-timescale optimization method. It illustrates the short-term learning and the long-term evolution in an episode $i$. The short-term learning performs two transmissions between the SNN and the LSTM cell at each time step $t$ (black dotted line). The short-term gradient $\nabla_t^i$ and loss $L_t^i$ information of the SNN is calculated by the surrogate gradient online learning (SGOL) rule. Then each LSTM cell receives this information and sends temporarily updated parameters $\theta_{t+1}^i$ back to the SNN for calculation at the next time step. The long-term evolution optimizes the LSTM's parameters $\varphi_0$ and initial hidden state $\theta_0$ (labeled in orange and green, respectively) by defining the loss function of the SNN on $D_{test}$ and executing backpropagation through time (BPTT) from the SNN to the LSTM.

### 2.1  The Neuron Model

The discrete-time version of the leaky integrate-and-fire (LIF) neuron model is employed as the basic computational units for SNNs, which emulates the subthreshold dynamics of neuron activities, including the updating-resetting mechanism of membrane potential and the spike firing:
$U^l[t]=\lambda^l\left(1-S^l[t-1]\right)U^l[t-1]+W^lS^{l-1}[t],$
(2.1)
$V^l[t]=\frac{U^l[t]}{B^l}-1\quad\text{and}\quad S^l[t]=h(V^l[t])=\begin{cases}1, & V^l[t]\ge 0,\\ 0, & V^l[t]<0.\end{cases}$
(2.2)
In the above formulas, the superscripts $l$ and $t$ denote the state of the neuron vector at the $l$th layer and the $t$th time step. $U^l[t]$ is the membrane potential, and $V^l[t]$ denotes the normalized membrane potential regulated by the firing threshold $B^l$. $S^l[t]\in\{0,1\}$ is the binary spike activity governed by the step function $h(x)$, where a neuron emits a spike if its normalized membrane potential $V^l[t]$ is above zero.

Different from the conventional LIF model, which is described by linear differential equations in the continuous spatial-temporal domain, the discrete-time version defines explicit spatial-temporal dynamics (i.e., the temporal dependency and the spatial relationship). Specifically, the update of the membrane potential $U^l[t]$ is composed of two domains. In the temporal domain (the first term in equation 2.1), the leak coefficient $\lambda^l$ and the term $(1-S^l[t-1])$ denote the leakage and the reset of the previous membrane potential state $U^l[t-1]$. In the spatial domain (the second term in equation 2.1), the membrane potential accumulates the weighted summation of the presynaptic spike inputs $S^{l-1}[t]$ generated by the preceding layer, where $W^l$ is the synaptic weight matrix connecting presynaptic and postsynaptic neurons.

It is obvious that the spatial-temporal dynamics of LIF neurons are regulated by the neuron parameters (membrane leak $\lambda^l$ and threshold $B^l$) and the synapse parameters (weights $W^l$). The threshold governs the average integration time of the input, and the leak regulates how much of the potential is retained from the previous time step. However, most existing work focuses on the optimization of the synaptic parameters while ignoring the neuron parameters (Wu, Deng, Li, Zhu, & Shi, 2018; Shrestha & Orchard, 2018; Gu, Xiao, Pan, & Tang, 2019). Therefore, we consider the joint optimization of these parameters to achieve LIF neurons with adaptive dynamics (introduced in the next section).
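As a concrete illustration, the discrete-time LIF update of equations 2.1 and 2.2 can be sketched in a few lines of NumPy. This is a minimal sketch; the function name `lif_step` and the toy shapes are ours, not from the paper:

```python
import numpy as np

def lif_step(u_prev, s_prev, spikes_in, W, lam, B):
    """One discrete-time LIF update per equations 2.1 and 2.2.

    u_prev:    membrane potential U^l[t-1] of the layer
    s_prev:    binary spikes S^l[t-1] emitted at the previous step
    spikes_in: presynaptic spikes S^{l-1}[t] from the preceding layer
    """
    # Temporal domain: leak and reset; spatial domain: weighted spike input.
    u = lam * (1.0 - s_prev) * u_prev + W @ spikes_in
    v = u / B - 1.0                # threshold-normalized potential, eq. 2.2
    s = (v >= 0.0).astype(float)   # step function h(.): fire if V >= 0
    return u, s
```

A neuron therefore fires exactly when its accumulated potential reaches the threshold $B^l$, and the $(1-S^l[t-1])$ factor resets its potential at the next step.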

### 2.2  Short-Term Learning: Surrogate Gradient Online Learning

Short-term learning focuses on learning from the dynamic inputs of the current task (episode). To fully extract the essential knowledge from the limited training samples, temporary changes of the network parameters occur with the input signal strength and neuron activity level on a short timescale. As one of the most successful supervised learning algorithms, surrogate-gradient backpropagation can optimize the complex spatial-temporal dynamics in deep SNNs (Lee, Delbruck, & Pfeiffer, 2016; Shrestha & Orchard, 2018; Neftci, Mostafa, & Zenke, 2019), where the gradient of the nondifferentiable spike firing function (a Dirac delta) is approximated by a continuous function. The spatial and temporal credit assignment can thus be performed by unrolling the internal state of the LIF neurons in the temporal domain and employing BPTT. This temporal backpropagation effectively captures the time dependence of spike inputs, but it also requires massive recursive computation and intermediate state storage, which increases the power consumption. In addition, a single update over the entire sample time window is not conducive to online learning of sensory cues that change rapidly over time. To address these problems, we propose a surrogate gradient online learning (SGOL) algorithm for short-term learning, which realizes instant gradient calculation by truncating the error backpropagation in the temporal domain (see Figure 3a). According to the chain rule, we derive the expressions required to backpropagate the error from the output layer to the input layer at any time step $t$ within the whole time window $T$.
Figure 3:

(a) The surrogate gradient online learning algorithm. The internal states of LIF neurons (green box) are unrolled in both the spatial domain (SD) and the temporal domain (TD), labeled with blue and red dotted boxes, respectively. The instant gradients of the neuron parameters (membrane leak $\lambda$ and threshold $B$) and the synapse parameters (weights $W^{l+1}$) can be calculated by truncating the error backpropagation in the temporal domain. (b) The internal calculation of the LSTM cell.

#### 2.2.1  Output Layer

The output layer employs an integrate-only neuron model that accumulates the weighted prespikes in the membrane potential without leak or firing. The membrane potential represents the neuron response intensity, which eliminates the difficulty of defining the loss function with spike counts (Rathi, Srinivasan, Panda, & Roy, 2020),
$U^L[t]=U^L[t-1]+W^LS^{L-1}[t],$
(2.3)
where $U^L[t]$ is a vector containing the accumulated membrane potentials of the $N$ output neurons in the output layer $L$, $W^L$ is the weight matrix connecting the output layer and the previous layer, and $S^{L-1}[t]$ is the spike vector from layer $L-1$.
The softmax function is adopted to normalize the output distribution (see equation 2.4). Then the cross-entropy loss between the target label and the output distribution is defined by equation 2.5:
$P(U^L[t]):\left(u_1^L[t],\cdots,u_N^L[t]\right)\rightarrow\left(p_1[t],\cdots,p_N[t]\right),\qquad p_i[t]=\frac{e^{u_i^L[t]}}{\sum_{k=1}^{N}e^{u_k^L[t]}},$
(2.4)
$E[t]=-\sum_{i=1}^{N}y_i\log p_i[t]=-Y\log P[t],$
(2.5)
where $E[t]$ is the loss function at the current time step, $Y$ is the one-hot vector of the target label, and $P[t]$ is the normalized output vector through the softmax function $P(\cdot)$.
The accumulated membrane potential of the output layer has an iterative relationship in time. Here, we ignore the gradient propagation of this variable from the current state to the previous state (i.e., $\frac{\partial U^L[t'+1]}{\partial U^L[t']}=0$), so the derivative with respect to the weight matrix $W^L$ is computed as
$\frac{\partial E[t]}{\partial W^L}=\frac{\partial E[t]}{\partial U^L[t]}\frac{\partial U^L[t]}{\partial W^L}+\sum_{t'=1}^{t-1}\frac{\partial E[t]}{\partial U^L[t'+1]}\frac{\partial U^L[t'+1]}{\partial U^L[t']}\frac{\partial U^L[t']}{\partial W^L}=(P[t]-Y)S^{L-1}[t].$
(2.6)
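As a sanity check, equations 2.4 to 2.6 can be sketched together as one instant output-layer gradient step. The helper name `output_grads` is ours, and the small numerical guard on the logarithm is an implementation convenience, not part of the paper:

```python
import numpy as np

def output_grads(u_out, spikes_prev, y_onehot):
    """Instant loss and weight gradient at the output layer (equation 2.6).

    u_out:       accumulated membrane potentials U^L[t]
    spikes_prev: presynaptic spikes S^{L-1}[t]
    y_onehot:    one-hot target vector Y
    """
    p = np.exp(u_out - u_out.max())
    p /= p.sum()                                  # softmax, equation 2.4
    loss = -np.sum(y_onehot * np.log(p + 1e-12))  # cross-entropy, equation 2.5
    dW = np.outer(p - y_onehot, spikes_prev)      # (P[t] - Y) S^{L-1}[t]
    return loss, dW
```

Because the temporal path is truncated, the gradient depends only on the current softmax output and the current presynaptic spikes, so silent presynaptic neurons contribute no weight update at this step.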

#### 2.2.2  Hidden Layers

There is a self-recursive relation for the membrane potential state $U^l[t]$ in the hidden layers. We also truncate the gradient propagation of this variable (i.e., $\frac{\partial U^l[t'+1]}{\partial U^l[t']}=\frac{\partial U^l[t'+1]}{\partial S^l[t']}=0$) and derive the gradient with respect to the weight matrix $W^l$:
$\frac{\partial E[t]}{\partial S^l[t]}=\frac{\partial E[t]}{\partial U^{l+1}[t]}\frac{\partial U^{l+1}[t]}{\partial S^l[t]}=\frac{\partial E[t]}{\partial U^{l+1}[t]}W^{l+1},$
(2.7)
$\frac{\partial E[t]}{\partial U^l[t]}=\frac{\partial E[t]}{\partial S^l[t]}\frac{\partial S^l[t]}{\partial V^l[t]}\frac{\partial V^l[t]}{\partial U^l[t]}=\frac{\partial E[t]}{\partial S^l[t]}\frac{\partial S^l[t]}{\partial V^l[t]}\frac{1}{B^l},$
(2.8)
$\frac{\partial E[t]}{\partial W^l}=\frac{\partial E[t]}{\partial U^l[t]}\frac{\partial U^l[t]}{\partial W^l}+\sum_{t'=1}^{t-1}\frac{\partial E[t]}{\partial U^l[t'+1]}\left(\frac{\partial U^l[t'+1]}{\partial U^l[t']}+\frac{\partial U^l[t'+1]}{\partial S^l[t']}\frac{\partial S^l[t']}{\partial V^l[t']}\frac{\partial V^l[t']}{\partial U^l[t']}\right)\frac{\partial U^l[t']}{\partial W^l}=\frac{\partial E[t]}{\partial U^l[t]}S^{l-1}[t],$
(2.9)
where $\frac{\partial S^l[t]}{\partial V^l[t]}$ is nondifferentiable, and we introduce a piecewise-linear function $g(\cdot)$ as the surrogate gradient,
$\frac{\partial S^l[t]}{\partial V^l[t]}\approx g(V^l[t])=\gamma\max\left(0,1-\left|V^l[t]\right|\right),$
(2.10)
where $γ$ is a constant to determine the amplitude of the gradient.
In addition to the synapse parameters, SGOL also supports gradient learning of the neuron parameters. The membrane threshold update is calculated as
$\frac{\partial E[t]}{\partial B^l}=\frac{\partial E[t]}{\partial S^l[t]}\frac{\partial S^l[t]}{\partial V^l[t]}\frac{\partial V^l[t]}{\partial B^l}=-\frac{\partial E[t]}{\partial S^l[t]}\frac{\partial S^l[t]}{\partial V^l[t]}\frac{U^l[t]}{(B^l)^2},$
(2.11)
and the membrane leak update is calculated as
$\frac{\partial E[t]}{\partial \lambda^l}=\frac{\partial E[t]}{\partial U^l[t]}\frac{\partial U^l[t]}{\partial \lambda^l}=\frac{\partial E[t]}{\partial U^l[t]}U^l[t-1].$
(2.12)

From the above derivation, it can be seen that our proposed SGOL method performs joint online learning of the neuron and synapse parameters relying only on the state information of the current time step, reflecting the short-term change trend of the network's spatial-temporal dynamics. We use the symbols $\nabla_t$ and $L_t$ to represent the short-term gradients $\left(\frac{\partial E[t]}{\partial W^l},\frac{\partial E[t]}{\partial B^l},\frac{\partial E[t]}{\partial \lambda^l}\right)$ and the loss information ($E[t]$) calculated by SGOL at time step $t$.
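Putting equations 2.7 to 2.12 together, one SGOL step for a hidden layer might look as follows. This is a sketch under our reconstruction of the notation (normalized potential $V^l[t]=U^l[t]/B^l-1$, so $\partial V/\partial U=1/B^l$), with the leak and threshold treated as per-layer scalars so their elementwise contributions are summed; the function names are ours:

```python
import numpy as np

def surrogate_grad(v, gamma=1.0):
    """Piecewise-linear surrogate for dS/dV (equation 2.10)."""
    return gamma * np.maximum(0.0, 1.0 - np.abs(v))

def hidden_grads(dE_dU_next, W_next, u, u_prev, spikes_prev, B, gamma=1.0):
    """Instant SGOL gradients for one hidden layer (equations 2.7-2.12),
    with the temporal error path truncated at the current step."""
    v = u / B - 1.0
    sg = surrogate_grad(v, gamma)
    dE_dS = W_next.T @ dE_dU_next          # equation 2.7
    dE_dU = dE_dS * sg / B                 # equation 2.8
    dW = np.outer(dE_dU, spikes_prev)      # equation 2.9
    dB = -np.sum(dE_dS * sg * u) / B**2    # equation 2.11 (summed over neurons)
    dlam = np.sum(dE_dU * u_prev)          # equation 2.12 (summed over neurons)
    return dW, dB, dlam
```

The returned triple corresponds to the short-term gradient $\nabla_t$ fed to the LSTM optimizer in the next section.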

### 2.3  Long-Term Evolution: LSTM Optimizer

Long-term evolution focuses on the optimization of the inner learning process across all tasks. The learning rate and weight decay are the key hyperparameters affecting the learning process, and they should be adjusted adaptively to achieve optimal task-level performance (the accuracy of the SNN on $D_{test}$). Additionally, the inner learning fails to capture the potential long-term dependence of the gradient update process, that is, the degree to which the gradient values at different time steps contribute to the final network parameters. Thus, we adopt an LSTM optimizer to simulate and optimize the dynamic update (gradient descent) process of the SNN parameters, thereby exploiting this temporal dependence. As shown in Figure 3b, each LSTM cell receives the short-term gradient $\nabla_t^i$ and loss $L_t^i$ information of all learnable SNN parameters calculated by SGOL at time step $t$ and outputs temporarily updated parameters $\theta_{t+1}^i$ back to the SNN for the calculation at the next time step. Meanwhile, the updated parameters are also transmitted to the next time step of the LSTM as a hidden state. However, when faced with tens of thousands of parameter inputs and outputs, the fully connected topology of the gating mechanism leads to an explosion of weight parameters in the LSTM, so the typical LSTM gating mechanism is not feasible. To address this dilemma, our LSTM adopts a modified gating mechanism as in Ravi and Larochelle (2017). The gradient update formula of the SNN is equivalent to the cell state update of the LSTM, where the weight decay and learning rate are adaptively regulated by two modified gates. The dynamics of the modified LSTM cell (see Figure 3b) are expressed as
$\theta_{t+1}=\alpha_{t+1}\odot\theta_t-\eta_{t+1}\odot\nabla_t,$
(2.13)
$\alpha_{t+1}=\sigma\left(W_F[f(\nabla_t),f(L_t),\alpha_t,\theta_t]+b_F\right),$
(2.14)
$\eta_{t+1}=\sigma\left(W_I[f(\nabla_t),f(L_t),\eta_t,\theta_t]+b_I\right),$
(2.15)
$f(x)=\begin{cases}\left(\frac{\log(|x|)}{p},\,\operatorname{sgn}(x)\right), & \text{if }|x|\ge e^{-p},\\ \left(-1,\,e^{p}x\right), & \text{otherwise}.\end{cases}$
(2.16)
These formulas are explained as follows:
• $L_t$ and $\nabla_t$ are the vectors containing the loss and gradient information of the SNN parameters at time step $t$.

• $\theta_{t+1}$ represents the learnable parameter vector ($[W^l,B^l,\lambda^l]$) of the SNN, and $\odot$ denotes the pointwise multiplication operation, which means that the update of each parameter is governed by its own weight decay $\alpha_{t+1}$ and learning rate $\eta_{t+1}$.

• $\alpha_{t+1}$ and $\eta_{t+1}$ represent the gating units for the weight decay and learning rate, respectively. The gating mechanism is expressed as a sigmoid function $\sigma$ of the preprocessed gradient $f(\nabla_t)$, the preprocessed loss $f(L_t)$, the corresponding historical information ($\alpha_t$ or $\eta_t$), and the previous parameter value $\theta_t$.

• $[\cdot]$ denotes the concatenation of vectors. $W_F$, $W_I$, $b_F$, and $b_I$ are the weights and biases of the LSTM cell that regulate the gating units, which are shared across the calculation of each element in $\alpha_{t+1}$ and $\eta_{t+1}$. These gating-related weights and biases are uniformly expressed as $\varphi_0$.

• $f(x)$ is a normalization preprocessing function, as in Andrychowicz et al. (2016), where $p>0$ is a boundary parameter that determines whether the input is reduced or enlarged ($p=10$ worked well in our experiments). This normalization scales the gradients and losses while splitting their magnitude and sign.

Thus, the complete feedforward process of the LSTM optimizer effectively simulates the gradient update process of the SNN. The initial value of the LSTM's hidden state corresponds to the initial parameters $\theta_0$ of the SNN, which are also treated as another learnable parameter of the LSTM. As the SNN parameters are transmitted as the LSTM hidden state step by step, the gating units automatically regulate the learning rate and weight decay of the parameter update according to external (gradient and loss of the SNN) and internal (historical state) information. After $T$ time steps of learning, the LSTM combines the short-term gradient information and builds a long-term mapping from the initial parameters to the final parameters. Note that the outer LSTM of our framework does not operate in the spiking domain, but this does not contradict the overall spiking computation. The main reason is that the updating of the SNN parameters (weights) guided by the adaptive-gated LSTM is essentially a fine-grained tuning process involving multiple nonbinary learning parameters and hyperparameters (e.g., learning rate and weight decay). On the contrary, performance may be lost if this process adopts spike-based binary computation.
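The gated update of equations 2.13 to 2.16 can be sketched coordinatewise as follows. The preprocessing $f(x)$ returns a (magnitude, sign) pair per element, as in Andrychowicz et al. (2016); the exact layout of the feature vector fed to each gate, the small numerical guard inside the logarithm, and all function names are our assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preprocess(x, p=10.0):
    """Normalization f(x) of equation 2.16: (log|x|/p, sgn(x)) for large
    inputs, (-1, e^p * x) for small ones."""
    x = np.asarray(x, dtype=float)
    big = np.abs(x) >= np.exp(-p)
    mag = np.where(big, np.log(np.abs(x) + 1e-30) / p, -1.0)
    sgn = np.where(big, np.sign(x), np.exp(p) * x)
    return mag, sgn

def lstm_cell_step(theta, grad, loss, alpha, eta, WF, bF, WI, bI):
    """Adaptive-gated update (equations 2.13-2.15). WF and WI are small
    weight vectors shared coordinatewise across every SNN parameter."""
    gm, gs = preprocess(grad)
    lm, ls = preprocess(np.full_like(theta, loss))
    feats_f = np.stack([gm, gs, lm, ls, alpha, theta])  # [f(grad), f(L), alpha, theta]
    feats_i = np.stack([gm, gs, lm, ls, eta, theta])    # [f(grad), f(L), eta, theta]
    alpha_new = sigmoid(WF @ feats_f + bF)   # weight decay gate, equation 2.14
    eta_new = sigmoid(WI @ feats_i + bI)     # learning rate gate, equation 2.15
    theta_new = alpha_new * theta - eta_new * grad      # equation 2.13
    return theta_new, alpha_new, eta_new
```

Sharing $W_F$ and $W_I$ across coordinates is what keeps the optimizer's parameter count independent of the SNN's size, which is the point of the modified gating.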

The goal of the LSTM is to optimize the learning process of the inner SNN and provide it with optimal parameters that achieve high accuracy on $D_{test}$; a key problem is how to learn an appropriate LSTM. Clearly, the LSTM is not a separate module: its outputs are taken as the parameters of the SNN learner, so it is difficult to define an appropriate loss function to optimize it directly. A feasible solution is to indirectly calculate the error gradient of the final parameters $\theta_T^i$ by defining the loss function of the SNN on $D_{test}$ and executing spatiotemporal error backpropagation.

Similar to equations 2.3 to 2.5, the loss function of the SNN on $D_{test}$ is defined, and its gradient with respect to the output weights is
$\frac{\partial E[T']}{\partial W^L}=\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial U^L[t']}\frac{\partial U^L[t']}{\partial W^L}=(P[T']-Y)\sum_{t'=1}^{T'}S^{L-1}[t'].$
(2.17)
To avoid ambiguity, we use $t'$ and $T'$ to denote the time step and time window of the SNN on $D_{test}$. Different from the SGOL rule, the complete spatiotemporal backpropagation process (without gradient truncation) is derived here to obtain the precise error gradient over the entire time window:
$\frac{\partial E[T']}{\partial S^l[t']}=\frac{\partial E[T']}{\partial U^{l+1}[t']}\frac{\partial U^{l+1}[t']}{\partial S^l[t']}+\frac{\partial E[T']}{\partial U^l[t'+1]}\frac{\partial U^l[t'+1]}{\partial S^l[t']}=\frac{\partial E[T']}{\partial U^{l+1}[t']}W^{l+1}-\frac{\partial E[T']}{\partial U^l[t'+1]}\lambda^l U^l[t'],$
(2.18)
$\frac{\partial E[T']}{\partial U^l[t']}=\frac{\partial E[T']}{\partial S^l[t']}\frac{\partial S^l[t']}{\partial V^l[t']}\frac{\partial V^l[t']}{\partial U^l[t']}+\frac{\partial E[T']}{\partial U^l[t'+1]}\frac{\partial U^l[t'+1]}{\partial U^l[t']}=\frac{\partial E[T']}{\partial S^l[t']}\frac{\partial S^l[t']}{\partial V^l[t']}\frac{1}{B^l}+\frac{\partial E[T']}{\partial U^l[t'+1]}\lambda^l(1-S^l[t']).$
(2.19)
The error gradients with respect to the final parameters can be obtained as follows:
$\frac{\partial E[T']}{\partial W^l}=\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial U^l[t']}\frac{\partial U^l[t']}{\partial W^l}=\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial U^l[t']}S^{l-1}[t'],$
(2.20)
$\frac{\partial E[T']}{\partial B^l}=\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial S^l[t']}\frac{\partial S^l[t']}{\partial V^l[t']}\frac{\partial V^l[t']}{\partial B^l}=-\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial S^l[t']}\frac{\partial S^l[t']}{\partial V^l[t']}\frac{U^l[t']}{(B^l)^2},$
(2.21)
$\frac{\partial E[T']}{\partial \lambda^l}=\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial U^l[t']}\frac{\partial U^l[t']}{\partial \lambda^l}=\sum_{t'=1}^{T'}\frac{\partial E[T']}{\partial U^l[t']}U^l[t'-1].$
(2.22)
Thus, the error can continue to backpropagate in the LSTM. First, we denote
$\delta\theta_t=\frac{\partial E[T']}{\partial\theta_t};\quad \delta\alpha_t=\frac{\partial E[T']}{\partial\alpha_t};\quad \delta\eta_t=\frac{\partial E[T']}{\partial\eta_t},$
(2.23)
where $E[T']$ is the loss of the SNN on $D_{test}$ with the parameters $\theta_T$.
When $t=T$, we have
$\delta\theta_T=\frac{\partial E[T']}{\partial\theta_T}=\left[\frac{\partial E[T']}{\partial W^l},\frac{\partial E[T']}{\partial B^l},\frac{\partial E[T']}{\partial\lambda^l}\right],$
(2.24)
$\delta\alpha_T=\frac{\partial E[T']}{\partial\theta_T}\frac{\partial\theta_T}{\partial\alpha_T}=\delta\theta_T\odot\theta_{T-1},$
(2.25)
$\delta\eta_T=\frac{\partial E[T']}{\partial\theta_T}\frac{\partial\theta_T}{\partial\eta_T}=-\delta\theta_T\odot\nabla_{T-1}.$
(2.26)
When $t<T$, we can get the following iterative formulas:
$\delta\theta_t=\frac{\partial E[T']}{\partial\theta_{t+1}}\frac{\partial\theta_{t+1}}{\partial\theta_t}+\frac{\partial E[T']}{\partial\alpha_{t+1}}\frac{\partial\alpha_{t+1}}{\partial\theta_t}+\frac{\partial E[T']}{\partial\eta_{t+1}}\frac{\partial\eta_{t+1}}{\partial\theta_t}=\delta\theta_{t+1}\odot\alpha_{t+1}+\delta\alpha_{t+1}\odot[\alpha_{t+1}(1-\alpha_{t+1})]W_F^{\theta}+\delta\eta_{t+1}\odot[\eta_{t+1}(1-\eta_{t+1})]W_I^{\theta},$
(2.27)
$\delta\alpha_t=\frac{\partial E[T']}{\partial\theta_t}\frac{\partial\theta_t}{\partial\alpha_t}+\frac{\partial E[T']}{\partial\alpha_{t+1}}\frac{\partial\alpha_{t+1}}{\partial\alpha_t}=\delta\theta_t\odot\theta_{t-1}+\delta\alpha_{t+1}\odot[\alpha_{t+1}(1-\alpha_{t+1})]W_F^{\alpha},$
(2.28)
$\delta\eta_t=\frac{\partial E[T']}{\partial\theta_t}\frac{\partial\theta_t}{\partial\eta_t}+\frac{\partial E[T']}{\partial\eta_{t+1}}\frac{\partial\eta_{t+1}}{\partial\eta_t}=-\delta\theta_t\odot\nabla_{t-1}+\delta\eta_{t+1}\odot[\eta_{t+1}(1-\eta_{t+1})]W_I^{\eta},$
(2.29)
where $W_F^{\theta}$ and $W_F^{\alpha}$ represent the components of $W_F$ corresponding to the variables $\theta_t$ and $\alpha_t$. In the same way, $W_I^{\theta}$ and $W_I^{\eta}$ are components of $W_I$.
Based on the above formulas, the derivatives with respect to the LSTM parameters can be written in the following vector form:
$\frac{\partial E[T']}{\partial W_F}=\sum_{t=0}^{T-1}\frac{\partial E[T']}{\partial\alpha_{t+1}}\frac{\partial\alpha_{t+1}}{\partial W_F}=\sum_{t=0}^{T-1}\delta\alpha_{t+1}\odot[\alpha_{t+1}(1-\alpha_{t+1})][f(\nabla_t),f(L_t),\alpha_t,\theta_t],$
(2.30)
$\frac{\partial E[T']}{\partial W_I}=\sum_{t=0}^{T-1}\frac{\partial E[T']}{\partial\eta_{t+1}}\frac{\partial\eta_{t+1}}{\partial W_I}=\sum_{t=0}^{T-1}\delta\eta_{t+1}\odot[\eta_{t+1}(1-\eta_{t+1})][f(\nabla_t),f(L_t),\eta_t,\theta_t].$
(2.31)
In the same way, the gradient of the initial value of the hidden state $θ0$ for LSTM (also the initial parameter of SNN) can be obtained:
$\frac{\partial E[T']}{\partial\theta_0}=\delta\theta_0,$
(2.32)
where $\delta\theta_0$ is calculated by the iterative formula in equation 2.27. This suggests that an appropriate LSTM can learn to optimize the learning process of the SNN on two timescales. The temporary changes of the SNN on a short timescale are governed by the LSTM gating (see equations 2.14 and 2.15), whose parameters $W_F$ and $W_I$ are learnable, as shown in equations 2.30 and 2.31. Meanwhile, the LSTM provides the SNN with the initial parameters $\theta_0$, which are permanently changed on a long timescale according to equation 2.32.

## 3  Results

### 3.1  Data Set Selection and Preprocessing

We verify the learning efficacy of the proposed approach on various data sets: a synthetic spike pattern data set named SpkPtn, a static image data set named Omniglot (Lake et al., 2011), and a dynamic neuromorphic data set named Gesture-DVS (Amir, Taba, Berg, Melano, & Modha, 2017).

• SpkPtn. Spike patterns describe the spatiotemporal dynamics of neural population activity, where the firing rates and precise timing of the spiking neurons contain rich information about external stimuli. To investigate the feasibility of our proposed model for capturing spatiotemporal correlations of neural activity, we produce a synthetic spike pattern data set (SpkPtn). Figure 4a illustrates the generation process of the data set, and Figure 4b shows some spike pattern examples. First, 1623 firing templates are randomly generated by specifying the firing neurons. Specifically, a random half of the 196 afferents is selected as firing neurons, and the others stay silent. Then 20 spike patterns for each template are generated by randomly assigning the rates of the firing neurons from a uniform distribution (10–100 Hz). At the same time, to increase the difference between patterns of the same template, each firing neuron may be deleted randomly with a probability of 30%. Finally, the 50 ms spike train of each afferent neuron is generated from a Poisson distribution with the corresponding firing rate. The random firing templates represent different classes, where each spike pattern is viewed as a sample. Thus, we construct a few-shot spike pattern data set with 1623 classes and 20 samples per class.
Figure 4:

Illustrations for the used data sets. The generation process (a) and some examples (b) of the SpkPtn data set. (c) The Omniglot data set contains a variety of characters from alphabets across the world. (d) Reconstructed static images of Gesture-DVS, where all spike events within a time window $T_0$ are compressed into a single static image for visualization. (e) The dynamic spike pattern of the right-hand wave gesture, where red and blue denote the on and off channels, respectively.

• Omniglot (Lake et al., 2011). Omniglot is a benchmark few-shot image data set containing 1623 kinds of handwritten characters from 50 alphabets, such as Latin and Greek. Each character is a 105 $\times$ 105 binary image drawn once by each of 20 different writers. The large number of classes (1623 characters) with only a few samples per class (20 images) makes it an ideal data set for learning from a few samples in the handwritten character recognition domain. Some examples are shown in Figure 4c.

• Gesture-DVS (Amir, Taba, Berg, Melano, & Modha, 2017). This neuromorphic data set consists of spike events recorded by a dynamic vision sensor (DVS). Once the light change exceeds a predefined threshold at a time stamp, the DVS generates an event in one of two channels according to the direction of the change: the on channel for intensity increases and the off channel for intensity decreases. Spike trains of size $H\times W\times 2\times T_0$ can thus be produced to represent the dynamic change of natural motion, where $H$ and $W$ are the height and width of the sensing field of the DVS, $T_0$ indicates the recording time, and 2 denotes the two channels. Figures 4d and 4e depict the Gesture-DVS data set (Amir et al., 2017), which contains trials of 29 different individuals performing 11 different gestures, such as arm roll and hand wave. There are 122 trials for each gesture, recorded under three lighting conditions: natural light, fluorescent light, and LED light. Different from static image data sets, both the temporal and the spatial information in the neuromorphic Gesture-DVS are essential components.
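The SpkPtn generation procedure described above can be sketched as follows. This is a minimal version with our own function name; the 1 ms time bin (and the Bernoulli-per-bin approximation of a Poisson process) is an assumed implementation detail:

```python
import numpy as np

def make_spkptn_sample(template_mask, T_ms=50, rate_lo=10.0, rate_hi=100.0,
                       drop_p=0.3, dt_ms=1.0, seed=None):
    """Generate one SpkPtn sample from a firing template.

    template_mask: boolean vector over the 196 afferents marking firing neurons.
    Rates are drawn uniformly from [rate_lo, rate_hi] Hz, each firing neuron is
    deleted with probability drop_p, and spikes are drawn bin by bin over a
    T_ms window as a Bernoulli approximation of a Poisson process.
    """
    rng = np.random.default_rng(seed)
    n = template_mask.size
    rates = np.where(template_mask, rng.uniform(rate_lo, rate_hi, n), 0.0)
    rates = rates * (rng.random(n) >= drop_p)  # random deletion of firing neurons
    p_spike = rates * dt_ms / 1000.0           # per-bin firing probability
    steps = int(T_ms / dt_ms)
    return (rng.random((steps, n)) < p_spike).astype(np.uint8)
```

Repeating this for 20 random rate assignments per template yields the 20 samples of one class.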

With regard to the three data sets mentioned above, we need to divide them according to the N-way-K-shot task as illustrated in Figure 1b. The detailed division process is described as follows:

1. Meta-Sets. The whole data set is divided into a meta-train set $Dmeta-train$ and a meta-test set $Dmeta-test$, which contain multiple episodes. Each episode is viewed as a separate data set composed of the $Dtrain$ and $Dtest$. See Table 1 for the specific partition of different data sets.

2. Episode. Each episode randomly selects N classes from $Dmeta-train$. K random samples for each selected class are used as $Dtrain$, and the remaining samples are used as $Dtest$. Episodes on the $Dmeta-test$ are composed similarly.
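The episode construction above can be sketched as follows. This is a minimal illustration, not the paper's code: the function name `sample_episode` and the dict-of-class-lists data layout are our own assumptions.

```python
import random

def sample_episode(dataset, n_way=5, k_shot=1):
    """Build one N-way-K-shot episode: K support samples per selected class
    form Dtrain, and the remaining samples of those classes form Dtest."""
    classes = random.sample(sorted(dataset.keys()), n_way)
    d_train, d_test = [], []
    for label, cls in enumerate(classes):
        # Shuffle the class's samples, then split into support and query sets.
        shuffled = random.sample(dataset[cls], len(dataset[cls]))
        d_train += [(x, label) for x in shuffled[:k_shot]]
        d_test += [(x, label) for x in shuffled[k_shot:]]
    return d_train, d_test
```

Labels are reassigned per episode (0 to N-1), since each episode is treated as an independent classification task.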

The synthetic SpkPtn data set and the dynamic Gesture-DVS data set are compatible with the processing of SNNs due to their shared spatiotemporal components and event-driven fashion, but the static Omniglot data set requires converting its static images into dynamic spike trains using a proper spike coding method. Here, we adopt the commonly used Poisson coding scheme. First, the binary images are downsampled to gray images of size $28×28$ by linear interpolation. Subsequently, all gray image pixel values are normalized to [0, 1] and regarded as the afferent neurons' firing rates to generate spatiotemporal spike trains according to the Poisson distribution.
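A minimal sketch of this coding scheme. The function name and array layout are our own; drawing one Bernoulli sample per pixel and time step is a standard discrete-time approximation of a Poisson spike process, though the paper's exact implementation may differ.

```python
import numpy as np

def poisson_encode(image, time_steps, rng=None):
    """Rate-based Poisson coding: normalized pixel values in [0, 1] are
    treated as per-step firing probabilities of the afferent neurons."""
    rng = np.random.default_rng() if rng is None else rng
    rates = np.clip(image, 0.0, 1.0)   # normalized gray values
    # One Bernoulli draw per pixel and time step approximates a Poisson process.
    spikes = rng.random((time_steps,) + rates.shape) < rates
    return spikes.astype(np.uint8)     # (T, H, W) binary spike trains
```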

Table 1:

Details of the Selected Original Data Sets.

| Data Set | SpkPtn | Omniglot | Gesture-DVS |
|---|---|---|---|
| Description | Spike patterns | Handwritten characters | Human gestures |
| Data format | spike trains | binary image | spike trains |
| Data size | $196×1$ | $105×105×1$ | $128×128×2$ |
| Time window | 50 ms | NA | $T0$ $μs$ |
| # Samples/categories | 20/1623 | 20/1623 | 122/11 |
| # Meta-train (category) | 1200 | 1200 | – |
| # Meta-test (category) | 423 | 423 | – |

For the Gesture-DVS data set, the spike train is composed of numerous time slices with sparse events. Notice that the number of time slices is very large (up to $10^6$) due to the fine-grained temporal resolution of the DVS ($μs$ level). However, considering time and memory costs, the number of simulation time steps of SNNs cannot be too large. For this reason, we adopt a temporal collapse mechanism to tune the temporal resolution. Specifically, the spike trains in the original data set are split into multiple successive segments at a new temporal resolution, where the original time slices within a segment are collapsed along the temporal dimension into one slice. This means that there will be a spike at the resulting slice if there exist spikes at the same location at any time stamp within the segment. The collapse process can be described as
$S_t=\mathrm{sign}\big(\sum_{t'}S'_{t'}\big),\quad \text{s.t. } t'\in[\alpha\times t,\,\alpha\times(t+1)-1],$
(3.1)
$\mathrm{sign}(x)=\begin{cases}0 & \text{if } x=0\\ 1 & \text{otherwise,}\end{cases}$
(3.2)
where $S'$ denotes the original time slice, $t'$ is the original time stamp, $S$ denotes the new slice after the collapse, $t$ is the new time step, and $α$ is the temporal resolution coefficient. Figure 5a illustrates an example of temporal collapse with $α=3$, and Figure 5b demonstrates that the spike events change from sparse to dense as the temporal resolution $dt$ enlarges.
Figure 5:

Preprocessing of the Gesture-DVS data set. (a) Illustration of the temporal collapse mechanism for tuning temporal resolution. (b) Slices under different temporal resolutions. (c) Original slice sequence with sparse spike events at different time steps. (d) New slice sequence after the collapse, where all spike events within a new temporal resolution $dt$ are compressed into one slice.

Thus, the original slice sequence $\{S'_{t'}\},\ t'\in[1,T_0]$ will be converted to a new slice sequence $\{S_t\},\ t\in[1,T_0/\alpha]$ (see Figures 5c and 5d). The new temporal resolution ($dt$) satisfies
$dt=αdt0,$
(3.3)
where $dt0$ is the original temporal resolution.
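Equations 3.1 and 3.2 can be implemented directly. The following sketch assumes an event tensor of shape $(T_0, H, W, 2)$ with $T_0$ divisible by $\alpha$; the function name is our own.

```python
import numpy as np

def temporal_collapse(spike_slices, alpha):
    """Collapse original time slices into coarser ones (equations 3.1-3.2):
    a new slice fires at a location if any original slice in its segment
    fired there, i.e., sign of the segment-wise sum."""
    t0 = spike_slices.shape[0]
    # Group alpha consecutive slices into one segment along a new axis.
    segments = spike_slices[: (t0 // alpha) * alpha].reshape(
        t0 // alpha, alpha, *spike_slices.shape[1:])
    return (segments.sum(axis=1) > 0).astype(np.uint8)
```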

Tuning the temporal resolution provides a good opportunity to explore more insights into the applicability and extensibility of our method. When the time window $T×dt$ remains fixed, the simulation time step $T$ is inversely proportional to the temporal resolution $dt$. This makes it convenient to evaluate the ability of SNNs to capture the temporal dependence of sparse (dense) event features over long (short) simulation time steps.

### 3.2  Experimental Settings

#### 3.2.1  Hyperparameter Setting

All the experiments are implemented based on the open-source framework PyTorch (Paszke et al., 2019). The hyperparameter configurations are provided in Table 2. In both the synthetic SpkPtn and static Omniglot data sets, the rate coding strategy limits the temporal resolution to a constant 1 ms. However, for Gesture-DVS, the temporal resolution can be set freely as described in section 3.1. Here, we set the temporal resolution to five levels {6 ms, 10 ms, 15 ms, 20 ms, 25 ms}. Usually the simulation time step $T$ is fixed to ensure that the encoding time window of the sample is consistent in the experiment. However, due to the tuned temporal resolution in Gesture-DVS, we fix the simulation time window $T×dt=300$ ms rather than $T$, that is, selecting only the first 300 ms of the recorded gesture during training and testing. Thus, the corresponding setting of $T$ is {50, 30, 20, 15, 12}. Unless otherwise specified, the initial leakage factor $λ0$ and firing threshold $B0$ are fixed at the low values of 0.3 and 0.2 to ensure that the initial SNN is able to generate a sufficient spiking response. In addition, all experiments adopt adaptive moment estimation (Adam) (Kingma & Ba, 2014) with the default parameter setting ($α=1e-4,β1=0.9,β2=0.999,ε=1e-8$) to adjust the LSTM parameters, and all our models are trained end to end from scratch without any additional data augmentation or fine-tuning operations.

Table 2:

Hyperparameter Setting.

| Hyperparameter | Description | SpkPtn | Omniglot | Gesture-DVS |
|---|---|---|---|---|
| $T$ | Simulation time step | 50 | 20 | {50, 30, 20, 15, 12} |
| $dt$ | Temporal resolution (ms) | 1 | 1 | {6, 10, 15, 20, 25} |
| $B0$ | Firing threshold | 0.2 | 0.2 | 0.2 |
| $λ0$ | Leakage factor | 0.3 | 0.3 | 0.3 |
| $γ$ | Learning rate | 1e-3 | 1e-3 | 1e-4 |

#### 3.2.2  Performance Metrics

Following the standard N-way-K-shot setting adopted by most existing few-shot learning work, we evaluate model performance on both 5-way-1-shot (5w1s) and 5-way-5-shot (5w5s) tasks. The accuracies of the model on the $Dtest$ of each episode reflect its few-shot learning and generalization ability, so the meta-train and meta-test performance is represented by the accuracy on the corresponding $Dtest$ in $Dmeta-train$ and $Dmeta-test$, respectively. To ensure the reliability of the results, we compute meta-test accuracies by averaging over 1000 randomly generated episodes from the $Dmeta-test$, the same protocol as in Snell et al. (2017). In addition, we derive the meta-train results over 10 independent runs; in each run, the meta-test accuracies obtained by the best model are recorded as the final few-shot recognition results. The means and standard deviations of the meta-test accuracies are reported in all experiments.

### 3.3  Few-Shot Spike Patterns Classification

Since SpkPtn is a relatively simple synthetic data set, we only use a two-layer fully connected structure containing a hidden layer with 50 LIF neurons, which is termed the spiking multilayer perceptron (SMLP) in this work. Figure 6a shows the meta-train performance of few-shot classification on spike patterns. Compared to the 5-way-5-shot task, the accuracy of the 5-way-1-shot task climbs slowly, indicating that the 1-shot condition makes learning difficult at the beginning. However, regardless of the task setting (1 shot or 5 shot), our approach can quickly converge to the equilibrium point after about 2000 episodes and stabilizes at 98% to 100% accuracy. The meta-test results are shown in Table 3. For a series of different episodes composed of classes not seen in $Dmeta-train$, our approach exhibits strong generalization ability in few-shot spike pattern classification, maintaining high accuracy under both tasks (5-way-1-shot: 99.3%, 5-way-5-shot: 99.9%), which preliminarily validates the effectiveness of our proposed approach.
Figure 6:

Meta-train convergence curves of the SNN on the SpkPtn data set. (a) SNNs can converge quickly in both 5-way-1-shot (5w1s) and 5-way-5-shot (5w5s) tasks through MTSO. (b–f) SNNs exhibit strong robustness for relatively sparse spike patterns and longer-timescale data with a stable training process. The results are averages of 10 runs.

Table 3:

Results of Meta-Test Accuracy on Spike Patterns Classification.

| Methods | Network Structure | 5w1s Acc. | 5w5s Acc. |
|---|---|---|---|
| MAML (Finn et al., 2017) | ANN: Input-50FC-N | 99.31 $±$ 0.01% | 99.90 $±$ 0.01% |
| Ours | SNN: Input-50FC-N | 99.32 $±$ 0.03% | 99.96 $±$ 0.02% |

We also present the performance of a standard meta-learning approach (MAML; Finn et al., 2017) on the SpkPtn data set. Most current meta-learning algorithms only explore few-shot learning tasks for static images, so they cannot be directly applied to the processing of spatiotemporal spike patterns. A common solution is to convert the spatiotemporal event stream into frames as the inputs to the network, that is, the spike counts within the entire time window are used as the pixel values of the images (the same operation is also used in section 3.5 and Table 6). For a fair comparison, we use the official open-source code of the original paper, with only appropriate modifications to the network structure and data loading to match the task. Due to the task's simplicity, the accuracy of our approach is slightly higher than that of MAML.
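The frame conversion used for this ANN baseline can be sketched as follows. The function name, array layout, and the normalization of counts to [0, 1] are our assumptions; the paper only states that spike counts serve as pixel values.

```python
import numpy as np

def events_to_frame(spike_trains):
    """Collapse a spatiotemporal event stream into a single frame by
    counting spikes over the whole time window, then scaling to [0, 1].
    spike_trains: (T, H, W) or (T, H, W, C) binary array."""
    counts = spike_trains.sum(axis=0).astype(np.float32)
    peak = counts.max()
    return counts / peak if peak > 0 else counts
```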

Table 4:

Comparison of Different Sparsity and Timescale Settings.

| Firing Rate | 10 ms | 50 ms | 100 ms | 200 ms | 500 ms |
|---|---|---|---|---|---|
| 1–2 Hz | 22.09 $±$ 0.24% | 27.97 $±$ 0.51% | 56.84 $±$ 0.56% | 84.92 $±$ 0.48% | 98.15 $±$ 0.02% |
| 2–5 Hz | 27.01 $±$ 0.46% | 34.07 $±$ 0.60% | 91.53 $±$ 0.34% | 99.73 $±$ 0.06% | 100.00 $±$ 0.00% |
| 5–10 Hz | 37.23 $±$ 0.56% | 93.33 $±$ 0.33% | 99.96 $±$ 0.03% | 100.00 $±$ 0.00% | 100.00 $±$ 0.00% |
| 10–100 Hz | 93.79 $±$ 0.28% | 99.96 $±$ 0.02% | 100.00 $±$ 0.00% | 100.00 $±$ 0.00% | 100.00 $±$ 0.00% |
Table 5:

Comparison among Different Settings and Algorithms on Omniglot.

| Methods | Structures | Timestep | Parameters | Fine Tune | Augmentation | 5w1s Acc. | 5w5s Acc. |
|---|---|---|---|---|---|---|---|
| MANN (Santoro et al., 2016) | Neural Turing Machine | – | 632.4k | | | 82.8% | 92.6% |
| Siamese Net (Koch et al., 2015) | CNN-based 4 layers ($+$BN) | – | 3894.7k | | | 97.3% | 98.4% |
| Siamese Net (Koch et al., 2015) | CNN-based 4 layers ($+$BN) | – | 3894.7k | | | 96.7% | 98.4% |
| Matching Net (Vinyals et al., 2016) | CNN-based 4 layers ($+$BN) | – | 112.3k | | | 98.1% | 98.9% |
| Matching Net (Vinyals et al., 2016) | CNN-based 4 layers ($+$BN) | – | 112.3k | | | 97.9% | 98.7% |
| MAML (Finn et al., 2017) | CNN-based 4 layers ($+$BN) | – | 112.3k | | | 98.7% | 99.9% |
| Ours | CNN: Input-(32C3Z-AP2)*3-N | – | 20.3k | | | 92.1 $±$ 0.7% | 97.9 $±$ 0.2% |
| Ours | SCNN: Input-(32C3Z-AP2)*3-N | 20 | 20.2k | | | 95.1 $±$ 0.5% | 98.6 $±$ 0.2% |
| Ours ($+$$l1$) | SCNN: Input-(32C3Z-AP2)*3-N | 20 | 20.2k | | | 92.8 $±$ 0.6% | 97.6 $±$ 0.3% |
| Ours ($+$NPO) | SCNN: Input-(32C3Z-AP2)*3-N | 20 | 20.2k | | | 97.3 $±$ 0.3% | 99.6 $±$ 0.1% |
| Ours ($+$NPO) | SCNN: Input-(32C3Z-AP2)*3-N | 20 | 20.2k | | | 95.8 $±$ 0.4% | 99.1 $±$ 0.1% |
| Ours ($+$NPO) | SCNN: Input-(32C3Z-AP2)*3-N | 30 | 20.2k | | | 95.6 $±$ 0.6% | 98.5 $±$ 0.2% |
| Ours ($+$NPO) | SCNN: Input-(32C3Z-AP2)*3-N | 50 | 20.2k | | | 95.4 $±$ 0.5% | 98.6 $±$ 0.3% |

Notes: $x$C$y$Z represents $x$ convolution filters ($y×y$ kernel size) with zero padding. AP$y$ represents average pooling with $y×y$ pooling kernel size. $N$ is the number of classes, which is task-dependent. $l1$ means that the L1 sparse regularization (see section 3.4.2) is added during training.

Table 6:

Meta-Test Accuracy on Gesture-DVS under Different Temporal Resolution.

| Methods | Structures | Parameters | Time Window | Temporal Resolution | 5w1s Acc. | 5w5s Acc. |
|---|---|---|---|---|---|---|
| MAML (Finn et al., 2017) | CNN-based 4 layers | 91.8k | 300 ms | – | 34.7 $±$ 1.2% | 48.5 $±$ 0.9% |
| MAML (Finn et al., 2017) | CNN-based 4 layers ($+$BN) | 91.8k | 300 ms | – | 45.5 $±$ 0.8% | 53.7 $±$ 0.7% |
| LPR (Stewart et al., 2020) | SCNN: Input-AP4-16C5Z-AP2-32C3Z-AP2-512FC-N | 1056.5k | 1450 ms | – | 52.2 | 56.8 |
| Ours | SCNN: Input-AP4-(32C3Z-AP2)*3-N | 21.3k | 300 ms | 6 ms | 61.2 $±$ 0.8% | 71.4 $±$ 0.6% |
| | | | | 10 ms | 61.7 $±$ 0.8% | 72.1 $±$ 0.6% |
| | | | | 15 ms | 60.1 $±$ 0.9% | 68.6 $±$ 0.6% |
| | | | | 20 ms | 58.9 $±$ 0.9% | 67.4 $±$ 0.6% |
| | | | | 25 ms | 53.6 $±$ 1.0% | 67.0 $±$ 0.6% |
| Ours ($+$NPO) | SCNN: Input-AP4-(32C3Z-AP2)*3-N | 21.3k | 300 ms | 6 ms | 63.1 $±$ 0.8% | 72.1 $±$ 0.6% |
| | | | | 10 ms | 63.2 $±$ 0.7% | 73.3 $±$ 0.6% |
| | | | | 15 ms | 62.4 $±$ 0.7% | 71.5 $±$ 0.5% |
| | | | | 20 ms | 61.8 $±$ 0.7% | 68.6 $±$ 0.6% |
| | | | | 25 ms | 60.1 $±$ 0.8% | 67.9 $±$ 0.5% |

Notes: $x$C$y$Z represents $x$ convolution filters ($y×y$ kernel size) with zero padding. AP$y$ represents average pooling with $y×y$ pooling kernel size. $N$ is the number of classes, which is task-dependent.

#### 3.3.1  The Analysis of Different Sparsity and Timescale

To explore the potential applicability of our approach to data of different sparsity and timescales, we comprehensively consider various sparsity levels under different time window lengths in the synthetic SpkPtn data set and conduct a systematic study of model performance under these conditions, where the sparse nature of the spike patterns is adjusted by controlling the firing rate of the activated neurons. Specifically, the time window is set to 10 ms, 50 ms, 100 ms, 200 ms, or 500 ms, and four sparsity levels are selected according to the firing rate distribution, from sparse to dense: 1–2 Hz, 2–5 Hz, 5–10 Hz, and 10–100 Hz. Experimental results in Figures 6b to 6f and Table 4 demonstrate that our approach exhibits strong robustness for relatively sparse spike patterns and longer-timescale data, with a stable training process and consistently excellent test accuracy. Additionally, this robustness improves with an increasing time window, but performance degrades when the inputs are too sparse, that is, when the average spike count of a firing neuron over the entire time window is fewer than 1. For example, when the firing rate is 1–2 Hz and the time window is 500 ms, a neuron may fire only once or remain silent over the entire time window, as in the combinations of 2–5 Hz with 100 ms, 5–10 Hz with 50 ms, and 10–100 Hz with 10 ms. The reason is that a low firing rate with a short time window yields an insufficient spike count to accurately represent the rate information, which greatly increases the difficulty of network learning. However, our approach is still able to achieve satisfying accuracy in this critical state.
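The sub-single-spike criterion can be checked directly from the rate-window product. This is a back-of-the-envelope check, not code from the paper:

```python
def expected_spikes(rate_hz, window_ms):
    """Expected spike count of a firing neuron over one time window."""
    return rate_hz * window_ms / 1000.0

# Lower bounds of the critical (rate, window) combinations noted above:
# 1-2 Hz @ 500 ms, 2-5 Hz @ 100 ms, 5-10 Hz @ 50 ms, 10-100 Hz @ 10 ms.
critical = [(1, 500), (2, 100), (5, 50), (10, 10)]
```

For example, a 1 Hz neuron over 500 ms fires only 0.5 spikes on average, so the rate information is severely undersampled.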

#### 3.3.2  The Analysis of Spiking Dynamics

We further visualize the spiking neuron dynamics at each time step of the learning process on the $Dmeta-test$. As shown in Figure 9a, the bottom block represents the spike raster plot of the input neurons. The middle two blocks represent the spike raster plot and the membrane potential curve of the hidden neurons, where the firing threshold and leakage factors shared by hidden neurons are adjusted adaptively. The top block is the accumulated membrane potential intensity of the output neurons. After multi-timescale optimization (MTSO) on multiple episodes of the $Dmeta-train$, the SNN is endowed with a common parameter initialization and appropriate parameter updates for the few-shot task, where the finite spike event information can be fully utilized to build the long-term dynamic association of the SNN's short-term gradient learning process. Thus, the membrane potential of the output neuron corresponding to the label index gradually accumulates over time and attains the maximum strength, which demonstrates that the SNN acquires the ability to learn from few-shot samples through MTSO.
Figure 7:

Meta-train convergence curves of the SCNN on different data sets. (a, b) On the Omniglot data set, SCNN with neuron parameter optimization (NPO) achieves the best accuracies and performs most stably, obviously superior to the CNN with the same structure. (c, d) On the Gesture-DVS data set, SCNN achieves superior accuracy when the temporal resolution is small and is thus more suitable for extracting sparse features. The results are averages of 10 runs.

Figure 8:

The normalized weight distribution of the trained SCNN without (a) and with (b) the L1 regularization. The weight distribution is normalized by the total number of connections.

Figure 9:

The neural dynamics of the SNN during the learning process ($Dtrain$ of the $Dmeta-test$). (a) For the two-layer SNN used on the SpkPtn data set, the firing threshold and leakage factors shared by hidden neurons are adjusted adaptively, which helps the accumulated membrane potential of the output neuron corresponding to the label index produce the maximum response. For the SCNN used on Omniglot (b) and Gesture-DVS (c), the firing thresholds and leakage factors in different convolution layers have different initial values and can be adjusted temporarily.

### 3.4  Few-Shot Characters Recognition on Static Omniglot

To examine whether our proposed approach is scalable in optimizing deeper SNNs, we adopt the Omniglot benchmark data set for more challenging few-shot handwritten character recognition. It has been shown that the visual cortex employs a hierarchical system for efficient feature extraction, which is indispensable for complex visual tasks. The convolutional neural network (CNN) introduces convolutional computation based on a local receptive field (a type of sparse connection) with a weight-sharing mechanism, which can efficiently construct a cortical-like hierarchical system. Thus, we introduce the spiking convolutional neural network (SCNN) with end-to-end training. Most existing few-shot learning models utilize four convolutional modules and obtain good performance on Omniglot (Koch, Zemel, & Salakhutdinov, 2015; Vinyals et al., 2016; Santoro et al., 2016), but they are not suitable for SNNs. On the one hand, the stability of existing models depends heavily on the complicated batch normalization technique, with which SNNs are not compatible. On the other hand, SNNs naturally adapt to neuromorphic hardware, and a hardware-friendly model with low computational redundancy and high performance is more attractive. Therefore, we use a lightweight SCNN architecture containing three identical convolutional modules and a final linear classification layer, where each convolutional module has 32 filters with a 3 $×$ 3 convolution kernel, followed by a 2 $×$ 2 average pooling operation. For a fair comparison, a CNN with the same structure as the SCNN is also considered as the baseline. Notably, the convolution operation in the SCNN is driven by LIF neurons, that is, the inputs and outputs of the convolution layers are binary spike events rather than the real values used in the vanilla CNN.
Another difference is that the SCNN receives converted spike trains and processes spike events at each simulation time step, while the CNN repeatedly receives the original static image to be compatible with the outer LSTM. Moreover, as discussed in section 2.1, our proposed MTSO supports the collaborative optimization of both neuronal and synaptic parameters. To explore the gain of additional neuron parameter optimization (NPO), we also compare the performance of SCNN with and without NPO.
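The LIF-driven convolutional module described above can be sketched as follows. This is a forward-pass sketch only: the class name, tensor layout, and hard-reset choice are our assumptions, the leak and threshold use the paper's initial values (0.3, 0.2), and the SGOL surrogate-gradient machinery is omitted.

```python
import torch
import torch.nn as nn

class LIFConvModule(nn.Module):
    """One SCNN module: 3x3 conv (zero padding) + 2x2 average pooling,
    driven by LIF neurons with leak `lam` and firing threshold `thresh`."""

    def __init__(self, in_ch, out_ch=32, lam=0.3, thresh=0.2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.pool = nn.AvgPool2d(2)
        self.lam, self.thresh = lam, thresh

    def forward(self, spikes):
        # spikes: (T, B, C, H, W) binary inputs, one slice per time step.
        mem, out = None, []
        for x_t in spikes:
            i_t = self.pool(self.conv(x_t))
            mem = i_t if mem is None else self.lam * mem + i_t  # leaky integration
            s_t = (mem >= self.thresh).float()                  # fire on threshold
            mem = mem * (1.0 - s_t)                             # hard reset
            out.append(s_t)
        return torch.stack(out)   # binary spike outputs, (T, B, out_ch, H/2, W/2)
```

Because the module emits only binary spikes, stacking three of them reproduces the event-driven hierarchy of the lightweight SCNN.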

#### 3.4.1  Comparison among Different Settings and Algorithms

As depicted in Figure 7, the resulting curves reveal that our approach can rapidly learn how to discriminate handwritten characters from few-shot samples. Similar to the results in spike pattern recognition, the convergence speed and accuracy of the 5-way-5-shot task are both superior to those of the 5-way-1-shot task. This gain derives from the increase in training samples. We can also see that the accuracy of the SCNN ($+$NPO) climbs more sharply than the others and converges to stable states with the highest accuracy. Without NPO, the convergence speed of the SCNN is slightly slower, which indicates that additional NPO can promote the learning speed of the model. Remarkably, the performance of the CNN with the same structure as the SCNN is relatively poor. An underlying cause could be that MTSO is more beneficial for coordinating the multi-timescale dynamics of SNNs than for the time-independent CNN. From the meta-test results in Table 5, it is apparent that SCNN ($+$NPO) achieves the best accuracies and performs most stably, as indicated by its minimum standard deviations. The accuracy of the SCNN without NPO is slightly lower but still higher than that of the CNN.

Table 5 also shows the comparison between our approach and various state-of-the-art few-shot learning approaches based on metrics (Siamese Net, Koch et al., 2015; Matching Nets, Vinyals et al., 2016), memory models (MANN, Santoro et al., 2016), and optimization (MAML, Finn et al., 2017). These methods are all based on the typical ANN with some optimization strategies, such as data augmentation, batch normalization (BN), and fine-tuning on the target problem. The fine-tuning technique refers to several steps of gradient-based adaptive iterations of the model on the meta-train set, which aims to maximize the limited sample information to search for the optimal weight configuration for the target. However, the technique actually applies a very small change to the parameters with only a slight improvement in performance (see Table 5). More important, the repetitive iterations make it unsuitable for low-latency or low-power spike-based online learning applications. Thus, it is not considered in our approach. As for data augmentation, although there is no data augmentation method for dynamic event streams, we can apply traditional image enhancement methods and then obtain spike trains by proper spike coding. We follow the data augmentation operation in Koch, Zemel, & Salakhutdinov (2015), where each image is augmented with rotations by multiples of 90 degrees. It can be seen that our approach achieves competitive results with higher averaged accuracies and lower standard deviations: our model outperforms MANN (around 13% and 7% improvement) but is slightly lower in accuracy than the most advanced Matching Net and MAML, especially for the 1-shot task. This is because these baseline algorithms involve significantly more complicated operations, while ours does not. Moreover, our lightweight SCNN architecture with fewer parameters still achieves similar accuracy compared to other deeper and wider CNN architectures.
With the addition of data augmentation techniques, the gap between our approach and standard meta-learning algorithms is further reduced. Overall, we preliminarily provide a viable solution for general SNNs to achieve few-shot learning, and we hope these results serve as examples to draw more attention from the research community to future extensions on neuromorphic hardware.

#### 3.4.2  The Effect of Sparse Constraints

We further investigate the effect of sparse constraints on SNNs. The commonly used L1 regularization is employed to drive the network to learn a more sparsely connected topology, and the loss function in equation 2.5 is modified as follows:
$L[t]=E[t]+\alpha\,\Omega(w),\qquad l_1:\ \Omega(w)=\|w\|_1=\sum_i|w_i|,$
(3.4)
where $α$ is the regularization factor that controls the sparse constraint strength. Figure 8 shows the normalized weight distribution of the trained SCNN without (a) and with (b) the L1 regularization. Typically, the network shows a dense topology pattern with a wide weight distribution (from $-$2 to 2), where only 6% of the connections are noncritical. After adding the L1 regularization, the weights clearly tend toward 0 (about 34%): the synaptic connectivity is sparser, which could save storage space and improve computational speed. The recognition accuracy with L1 regularization is also reported in Table 5. We find that the network performance slightly decreases; one possible reason is the reduced network representation capacity caused by the reduction of valid synaptic connections.
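Equation 3.4 amounts to a one-line addition to the loss. A sketch in PyTorch; the function name and the $α$ value used here are illustrative, not the paper's configuration.

```python
import torch

def l1_penalized_loss(error, weights, alpha=1e-4):
    """Loss of equation 3.4: task error E[t] plus an L1 term over all
    synaptic weight tensors, scaled by the regularization factor alpha."""
    l1 = sum(w.abs().sum() for w in weights)
    return error + alpha * l1
```

In practice, `weights` would be the SCNN's parameter tensors (e.g., `model.parameters()`), so the penalty drives noncritical connections toward zero.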

### 3.5  Few-Shot Gesture Learning for Dynamic Gesture-DVS

As a typical neuromorphic vision data set, Gesture-DVS records rich spatiotemporal information triggered by binary spike events, which naturally matches the behavior of SNNs in decreasing computational cost and energy. Thus, a significant amount of work benchmarks the performance of SNNs on it, but relatively few studies report performance under the few-shot setting (Stewart et al., 2020). To investigate the few-shot learning advantages of our proposed approach on Gesture-DVS, we compare model performance under different temporal resolutions and present the meta-train (see Figures 7c and 7d) and meta-test accuracies (see Table 6).

For the dynamic Gesture-DVS data set, the structure of the SCNN is similar to that used for Omniglot, except that a 4 $×$ 4 average pooling unit is appended to the first layer to reduce the data dimension and parameter quantity. Consistent with the previous experiments, SCNN ($+$NPO) achieves better accuracies than SCNN without NPO. The results also show a significant difference in performance at different temporal resolutions. Our approach achieves superior accuracy when the temporal resolution is small, indicating that SNNs are more suitable for extracting sparse features, which is consistent with the previous finding in Deng et al. (2020). Besides, compared to the other spike-based few-shot solution (Stewart et al., 2020), our method achieves state-of-the-art performance under all temporal resolutions with less sample information (a shorter time window). Using the data-converting method in section 3.3, we also report the performance of a standard meta-learning approach (MAML; Finn et al., 2017) on the Gesture-DVS data set. It can be seen that the MAML performance is less satisfying and relies heavily on the batch normalization (BN) technique; the model fails to converge to a good solution without BN. Instead, our method runs natively on the event stream and outperforms all the baselines.

## 4  Model Analysis

### 4.1  The Effect of Neuron Parameters Optimization

Neuron parameters (i.e., membrane leak $λ$ and threshold $B$) are key factors that affect the temporal dynamics of SNNs. An experience-driven preset restricts the variability and diversity of spiking neurons on the timescale, which makes it difficult for SNNs to respond rapidly to dynamic changes in perceptual cues. In contrast, our proposed MTSO endows SNNs with more flexible temporal dynamics in a data-driven way. Its function mainly manifests in two forms of neuron parameter change: permanent changes on general tasks and adaptive adjustment on a specific task. For the three data sets involved in our experiments, we visualize the dynamics of the neuron parameters during the learning process ($Dtrain$) of the $Dmeta-test$ (5-way-1-shot task). The results correspond to Figures 9a, 9b, and 9c, respectively. Starting from the same default values (leak 0.3 and threshold 0.2), the neuron parameters undergo irreversible changes on the different data sets, that is, the experience-driven preset is transformed into task-common neuron parameters (the values at time step zero) through MTSO. Meanwhile, the neuron parameters are also adaptively adjusted with changes in the perceptual information. This adaptive adjustment is temporary online processing, which means that the neuron parameters return to their task-common values after each episode. Additionally, we can find some similarities and differences in the variation of neuron parameters across data sets. From a spatial view, the first layer maintains a low membrane threshold, while that of the last layer is relatively high. This ensures the effectiveness of spiking signal transmission, as also pointed out in Deng et al. (2020). However, the task-common neuron parameters on Omniglot are larger than those on Gesture-DVS; the primary reason may be that the input of Gesture-DVS is relatively sparse. From a temporal view, the time dependence of the SNN weakens as the membrane threshold and leakage decrease.
This trend indicates that the temporal dynamic effect of SNN is on a relatively short timescale. Another possible explanation emerges that SNNs perhaps do not need to maintain the long-term information dependence for classification tasks.
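The neuron dynamics these parameters govern can be sketched as a standard leaky integrate-and-fire update in which the leak and threshold are registered as learnable tensors. A minimal PyTorch sketch follows; the class and attribute names are hypothetical, and only the forward dynamics are shown (in the paper, the gradient through the hard threshold is handled by the SGOL surrogate gradient, which is omitted here):

```python
import torch
import torch.nn as nn

class LIFLayer(nn.Module):
    """Leaky integrate-and-fire layer with learnable leak and threshold.

    `lambda_leak` and `threshold` are hypothetical names. Registering them
    as nn.Parameter lets the long-timescale optimization turn the
    experience-driven presets (0.3 and 0.2) into task-common values.
    """
    def __init__(self, size, leak=0.3, threshold=0.2):
        super().__init__()
        self.lambda_leak = nn.Parameter(torch.full((size,), leak))
        self.threshold = nn.Parameter(torch.full((size,), threshold))

    def forward(self, input_current, mem):
        # Membrane potential decays by the leak and integrates the input.
        mem = self.lambda_leak * mem + input_current
        spike = (mem >= self.threshold).float()
        mem = mem * (1.0 - spike)  # hard reset after a spike
        return spike, mem

layer = LIFLayer(size=4)
mem = torch.zeros(4)
spikes = []
for t in range(20):
    s, mem = layer(torch.rand(4) * 0.1, mem)
    spikes.append(s)
```

With a smaller `lambda_leak` and `threshold`, the membrane state decays faster and spikes sooner, which is the short-timescale behavior described above.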

### 4.2  Analysis of Few-Shot Learning Characteristics

To explore what learning features our MTSO brings to SNNs in few-shot learning tasks, we compare the test process ($Dtest$) in $Dmeta-train$ and $Dmeta-test$ on the Omniglot data set. Taking the 5-way-5-shot learning task as an example, Figure 10a details the performance of the SCNN by visualizing the hidden feature states with t-SNE (Maaten & Hinton, 2008) and the output response with a confusion matrix. At the initial stage of $Dmeta-train$, the SCNN fails to perform specific identification, constrained by the limited samples and insufficient task experience. Benefiting from MTSO over many episodes of $Dmeta-train$, the confusion matrix shows that the SCNN acquires few-shot learning ability and achieves good overall performance on $Dmeta-test$. In addition, data points of the same color move much closer together as the layers deepen, indicating that MTSO adjusts the deep convolutional layers layer by layer so that they transform the inputs into a highly separable implicit representation, thereby enhancing the separability of inputs from different classes. Our results demonstrate that MTSO can guide the SCNN to learn how to classify and to perform fast few-shot learning.
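The analysis pipeline behind Figure 10a can be reproduced with standard tools; the sketch below uses scikit-learn on stand-in features (the random class-shifted array and the perfect predictions are toy placeholders for the SCNN's hidden feature maps and outputs, which are not reproduced here):

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix

# Stand-in for hidden feature maps from one SCNN layer: 5 classes x 5 queries.
rng = np.random.default_rng(0)
labels = np.repeat(np.arange(5), 5)
features = rng.normal(size=(25, 64)) + labels[:, None]  # class-shifted toy features

# Project the features to 2-D for visual inspection, as in Figure 10a.
embedding = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(features)

# Confusion matrix of (toy) predictions against the true labels.
predictions = labels.copy()  # a perfect classifier, for illustration only
cm = confusion_matrix(labels, predictions)
```

Plotting `embedding` colored by `labels`, once per convolutional layer, yields the per-layer scatter plots whose clusters tighten with depth.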
Figure 10:

Analysis of few-shot learning features on the Omniglot data set (5-way-5-shot task). (a) Visualization of the test process ($Dtest$) in $Dmeta-train$ and $Dmeta-test$, including the hidden feature states using t-SNE (Maaten & Hinton, 2008) and the output response using a confusion matrix. The data points in each graph represent the distribution of feature maps obtained from different hidden convolutional layers in the SCNN, where different colors represent different input classes. Data points of the same color move much closer together as the layers deepen, indicating that MTSO adjusts the deep convolutional layers layer by layer to enhance the separability of inputs from different classes. (b) The weight decay and (c) the learning rate of the SCNN's different learnable parameters (red curves) at various time steps and layer levels. The trained LSTM applies a nearly constant weight decay and different learning rates to adapt to inputs at different time steps.


We conduct a further analysis to determine the effects of inputs at different time steps $t$ ($t∈[0,19]$) on the hidden state dynamics (i.e., the internal gating) of the trained LSTM. The hidden state (weight decay and learning rate) at each time step is partitioned according to the sizes of the learnable parameters of each layer in the SCNN. This makes it easy to monitor the hidden state dynamics of each parameter at different time steps and layer levels, which helps investigate the learning characteristics of the LSTM. Figures 10b and 10c illustrate the adaptive weight decay $α_t$ and learning rate $η_t$ of different parameters (red curves) at different time steps and layer levels. Each parameter has an independent hidden state value; the weight decay varies little across parameters, whereas the learning rate differs considerably. Observed from the perspective of the hierarchy, most weight decay values remain in the range of 0.9 to 1, and only a few fluctuate strongly, such as part of the fourth layer. The optimal choice is thus likely not the constant 1, mainly because shrinking and forgetting previous parameters helps the SNN make a large change to escape a poor local state. In addition, the distribution of the learning rate is roughly the same across layers, indicating that all layers share a common update mode. Overall, the LSTM adopts a similar regulation method for each layer of the SCNN: for inputs at different time steps, the weight decay is basically unchanged, but the update pattern of the learning rate is completely different. Therefore, the trained LSTM applies a nearly constant weight decay and different learning rates to adapt to inputs at different time steps; that is, the LSTM can adaptively regulate the weight decay $α_t$ and learning rate $η_t$ using short-term gradient and loss information.
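The gated update analyzed above can be written as $θ_t = α_t ⊙ θ_{t-1} - η_t ⊙ ∇L$, with $α_t$ and $η_t$ produced per parameter by the LSTM from the current loss and gradient. A minimal PyTorch sketch, with hypothetical class and head names (the paper's exact LSTM input features and architecture may differ):

```python
import torch
import torch.nn as nn

class GatedMetaOptimizer(nn.Module):
    """Sketch of an adaptive-gated LSTM update rule (names are assumptions).

    For each SNN parameter, an LSTM cell reads the current loss and
    gradient and emits a weight-decay gate alpha_t and a learning-rate
    gate eta_t, giving theta_t = alpha_t * theta_{t-1} - eta_t * grad.
    """
    def __init__(self, hidden=20):
        super().__init__()
        self.lstm = nn.LSTMCell(2, hidden)   # input per parameter: [loss, gradient]
        self.alpha_head = nn.Linear(hidden, 1)
        self.eta_head = nn.Linear(hidden, 1)

    def forward(self, theta, grad, loss, state=None):
        x = torch.stack([loss.expand_as(grad), grad], dim=1)  # (n_params, 2)
        h, c = self.lstm(x, state)
        alpha = torch.sigmoid(self.alpha_head(h)).squeeze(1)  # weight decay in (0, 1)
        eta = torch.sigmoid(self.eta_head(h)).squeeze(1)      # per-parameter learning rate
        return alpha * theta - eta * grad, (h, c)

opt = GatedMetaOptimizer()
theta = torch.randn(10)
grad = torch.randn(10)
new_theta, state = opt(theta, grad, torch.tensor(0.5))
```

Because the sigmoid keeps `alpha` below 1, a value near 1 preserves the previous parameter, while smaller values shrink and partially forget it, matching the 0.9-to-1 range observed in Figure 10b.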

## 5  Conclusion

Few-shot learning using biological models and synaptic plasticity is an appealing open problem, but relevant studies are few, mainly because of a lack of algorithms and data. Existing algorithms attempt to solve it through learning-to-learn (L2L; Bellec et al., 2018; Bohnstingl et al., 2019) or transfer learning (Stewart et al., 2020). However, these methods perform poorly and are subject to many restrictions, such as specific structures or simple tasks. One potential reason for this dilemma is that they ignore an underlying property of biological systems: diverse temporal dynamics. Here, a multi-timescale optimization (MTSO) method based on the L2L approach is proposed for SNNs to solve few-shot learning tasks. MTSO combines a novel surrogate gradient online learning (SGOL) algorithm with an adaptive-gated LSTM optimizer to optimize the neural dynamics of an SNN on two different timescales: temporary parameter changes on a short timescale, by capturing the temporal dependency of short-term gradient updates, and permanent changes of the initial parameters on a long timescale, to fit different tasks. Experiments conducted on a synthetic spike pattern data set (SpkPtn), a static image data set (Omniglot), and a dynamic neuromorphic data set (Gesture-DVS) show that the collaborative optimization of multi-timescale neural dynamics can enable general SNNs (even deep SNNs) to achieve promising few-shot learning performance. This method provides new insight into how multi-timescale neural dynamics can be coordinated to construct a neurally realistic spiking model suitable for few-shot tasks, laying the foundation for future SNNs to explore high-performance few-shot learning approaches.

## Acknowledgments

This work was supported by the National Key Research and Development Program of China under grant 2020AAA0105900, the National Natural Science Foundation of China under grant 61773271, Zhejiang Lab under grants 2019KC0AD02 and 2019KC0AB02, and the Key Scientific and Technological Innovation Research Project of the Ministry of Education.

## References

Amir, A., Taba, B., Berg, D., Melano, T., & Modha, D. (2017). A low power, fully event-based gesture recognition system. In *Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition* (pp. 7388–7397). Piscataway, NJ: IEEE.

Andrychowicz, M., Denil, M., Gomez, S., Hoffman, M. W., Pfau, D., Schaul, T., … De Freitas, N. (2016). Learning to learn by gradient descent by gradient descent. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), *Advances in neural information processing systems, 29* (pp. 3981–3989). Red Hook, NY: Curran.

Bellec, G., Salaj, D., Subramoney, A., Legenstein, A. R., & Maass, W. (2018). Long short-term memory and learning-to-learn in networks of spiking neurons. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), *Advances in neural information processing systems, 31* (pp. 787–797). Red Hook, NY: Curran.

Bohnstingl, T., Scherr, F., Pehle, C., Meier, K., & Maass, W. (2019). Neuromorphic hardware learns to learn. *Frontiers in Neuroscience, 13*, 483.

Chen, R., & Li, L. (2020). Analyzing and accelerating the bottlenecks of training deep SNNs with backpropagation. *Neural Computation, 32*(12), 2557–2600.

Deng, L., Wu, Y., Hu, X., Liang, L., Ding, Y., Li, G., … Xie, Y. (2020). Rethinking the performance comparison between SNNs and ANNs. *Neural Networks, 121*, 294–307.

Diehl, U. P., & Cook, M. (2015). Unsupervised learning of digit recognition using spike-timing-dependent plasticity. *Frontiers in Computational Neuroscience, 9*, 99.

Fefei, L., Fergus, R., & Perona, P. (2003). A Bayesian approach to unsupervised one-shot learning of object categories. In *Proceedings of the Ninth IEEE International Conference on Computer Vision* (pp. 1134–1141). Piscataway, NJ: IEEE.

Finn, C., Abbeel, P., & Levine, S. (2017). Model-agnostic meta-learning for fast adaptation of deep networks. In *Proceedings of the 34th International Conference on Machine Learning*, vol. 70 (pp. 1126–1135).

Ghosh-Dastidar, S., & Adeli, H. (2009). Spiking neural networks. *International Journal of Neural Systems, 19*(4), 295–308.

Gu, P., Xiao, R., Pan, G., & Tang, H. (2019). STCA: Spatio-temporal credit assignment with delayed feedback in deep spiking neural networks. In *Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence* (pp. 1366–1372).

Hochreiter, S., Younger, S. A., & Conwell, R. P. (2001). Learning to learn using gradient descent. In *Proceedings of the International Conference on Artificial Neural Networks* (pp. 87–94). Berlin: Springer.

Kheradpisheh, S. R., Ganjtabesh, M., & Masquelier, T. (2016). Bio-inspired unsupervised learning of visual features leads to robust invariant object recognition. *Neurocomputing, 205*, 382–392.

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980.

Koch, G., Zemel, R., & Salakhutdinov, R. (2015). Siamese neural networks for one-shot image recognition. In *Proceedings of the International Conference on Machine Learning*, vol. 2.

Lake, B., Salakhutdinov, R., Gross, J., & Tenenbaum, J. (2011). One shot learning of simple visual concepts. In *Proceedings of the Annual Meeting of the Cognitive Science Society*. Red Hook, NY: Curran.

Lake, B. M., Salakhutdinov, R., & Tenenbaum, J. B. (2015). Human-level concept learning through probabilistic program induction. *Science, 350*(6266), 1332–1338.

Lee, J. H., Delbruck, T., & Pfeiffer, M. (2016). Training deep spiking neural networks using backpropagation. *Frontiers in Neuroscience, 10*, 508.

Maass, W. (1997). Networks of spiking neurons: The third generation of neural network models. *Neural Networks, 10*(9), 1659–1671.

Maaten, L. v. d., & Hinton, G. (2008). Visualizing data using t-SNE. *Journal of Machine Learning Research, 9*(11), 2579–2605.

Munkhdalai, T., & Yu, H. (2017). Meta networks. In *Proceedings of the 34th International Conference on Machine Learning*, vol. 70 (p. 2554).

Neftci, E. O., Mostafa, H., & Zenke, F. (2019). Surrogate gradient learning in spiking neural networks. *IEEE Signal Processing Magazine, 36*, 61–63.

Nichol, A., & Schulman, J. (2018). Reptile: A scalable metalearning algorithm. arXiv:1803.02999.

Nitin, R., Gopalakrishnan, S., Priyadarshini, P., & Kaushik, R. (2020). Enabling deep spiking neural networks with hybrid conversion and spike timing dependent backpropagation. In *Proceedings of the Eighth International Conference on Learning Representations*.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In *Advances in neural information processing systems, 32* (pp. 8026–8037). Red Hook, NY: Curran.

Ravi, S., & Larochelle, H. (2017). Optimization as a model for few-shot learning. In *Proceedings of the 5th International Conference on Learning Representations*.

Samadi, A., Lillicrap, T. P., & Tweed, D. B. (2017). Deep learning with dynamic spiking neurons and fixed feedback weights. *Neural Computation, 29*(3), 578–602.

Santoro, A., Bartunov, S., Botvinick, M., Wierstra, D., & Lillicrap, T. (2016). Meta-learning with memory-augmented neural networks. In *Proceedings of the 33rd International Conference on Machine Learning*, vol. 48 (pp. 1842–1850).

Schmidhuber, J., Zhao, J., & Wiering, M. (1997). Shifting inductive bias with success-story algorithm, adaptive Levin search, and incremental self-improvement. *Machine Learning, 28*(1), 105–130.

Shrestha, S. B., & Orchard, G. (2018). SLAYER: Spike layer error reassignment in time. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), *Advances in neural information processing systems, 31* (pp. 1412–1421). Red Hook, NY: Curran.

Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. In I. Guyon, Y. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & R. Garnett (Eds.), *Advances in neural information processing systems, 30* (pp. 4077–4087). Red Hook, NY: Curran.

Stewart, K., Orchard, G., Shrestha, S. B., & Neftci, E. (2020). On-chip few-shot learning with surrogate gradient descent on a neuromorphic processor. In *Proceedings of the Second IEEE International Conference on Artificial Intelligence Circuits and Systems* (pp. 223–227). Piscataway, NJ: IEEE.

Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In *Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition* (pp. 1199–1208). Piscataway, NJ: IEEE.

Thrun, S. (1998). *Learning to learn*. Berlin: Springer.

Vanschoren, J. (2018). Meta-learning: A survey. arXiv:1810.03548.

Vinyals, O., Blundell, C., Lillicrap, T., & Wierstra, D. (2016). Matching networks for one shot learning. In D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, & R. Garnett (Eds.), *Advances in neural information processing systems, 29* (pp. 3630–3638). Red Hook, NY: Curran.

Wu, J., Yılmaz, E., Zhang, M., Li, H., & Tan, K. C. (2020). Deep spiking neural networks for large vocabulary automatic speech recognition. *Frontiers in Neuroscience, 14*, 199.

Wu, Y., Deng, L., Li, G., Zhu, J., & Shi, L. (2018). Spatio-temporal backpropagation for training high-performance spiking neural networks. *Frontiers in Neuroscience, 12*, 331.

Xiao, R., Yu, Q., Yan, R., & Tang, H. (2019). Fast and accurate classification with a multi-spike learning algorithm for spiking neurons. In *Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence* (pp. 1445–1451).

Yu, Q., Li, S., Tang, H., Wang, L., Dang, J., & Tan, C. K. (2020). Toward efficient processing and learning with spikes: New approaches for multispike learning. *IEEE Transactions on Systems, Man, and Cybernetics* (pp. 1–13). Piscataway, NJ: IEEE.

Zenke, F., & Ganguli, S. (2018). SuperSpike: Supervised learning in multi-layer spiking neural networks. *Neural Computation, 30*(6), 1514–1541.

Zhang, M., Luo, X., Chen, Y., Wu, J., Belatreche, A., Pan, Z., Qu, H., & Li, H. (2020). An efficient threshold-driven aggregate-label learning algorithm for multimodal information processing. *IEEE Journal of Selected Topics in Signal Processing, 14*(3), 592–602.