Figure 5:
Surrogate gradient learning is sensitive to the scale of the surrogate derivative. (a) Illustration of pseudo-derivatives σ' that converge toward the actual derivative of a hard spike threshold as β→∞. Note that in contrast to Figure 3a, their maximum value grows as β increases. (b) Training accuracy of several spiking networks (nh=1) during training on a synthetic classification task. The gray curves correspond to control networks in which the surrogate derivative was either normalized to one or in which we used an asymptotic surrogate derivative but prevented surrogate gradients from flowing through the spike reset. Orange curves correspond to networks with asymptotic pseudo-derivatives and a differentiable spike reset (aDR). In all cases, we plot the five best-performing learning curves obtained from an extensive grid search over β and the learning rate η (cf. Figure 3). (c) Quantification of the test accuracy of the learning curves shown in panel b. We trained all networks using a SuperSpike nonlinearity; the reset term was either ignored (sCtl) or a differentiable reset was used (sDR). Similarly, we considered an asymptotic variant of SuperSpike that does converge toward the exact derivative of a step function for β→∞, without (aCtl) or with (aDR) a differentiable reset term. The results shown correspond to the ten best results from a grid search. Error bars denote the standard deviation. (d) A similar comparison of control cases in which reset terms were ignored (gray) or could contribute to the surrogate gradient (orange) for different numbers of hidden layers. (e) Test accuracy as in panel c, but comparing the SuperSpike (s) and asymptotic (a) cases in which gradients can flow through recurrent connections (Prop) versus the detached case (Ctl). (f) Test accuracy for asymptotic SuperSpike as a function of the number of hidden layers for networks in which gradients flowed through recurrent connections (orange) versus the detached case (gray).
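To make the conditions compared in the figure concrete, the sketch below illustrates, under assumed PyTorch conventions, how a SuperSpike-style surrogate gradient and a detachable spike reset are commonly implemented. It is a minimal sketch, not the authors' code; the names SuperSpike, lif_step, beta, and detach_reset are illustrative and do not come from the paper.

```python
# A minimal sketch, not the authors' code: a SuperSpike-style surrogate
# gradient and a detachable spike reset written in PyTorch.
# All names (SuperSpike, lif_step, beta, detach_reset) are illustrative.
import torch


class SuperSpike(torch.autograd.Function):
    """Heaviside spike in the forward pass, SuperSpike pseudo-derivative in the backward pass."""

    beta = 10.0  # steepness of the pseudo-derivative

    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0.0).float()  # hard spike threshold

    @staticmethod
    def backward(ctx, grad_output):
        (v,) = ctx.saved_tensors
        # Standard SuperSpike: peak value 1 for any beta (the "s" conditions).
        sg = 1.0 / (1.0 + SuperSpike.beta * v.abs()) ** 2
        # An asymptotic variant (the "a" conditions) would instead let the peak
        # grow with beta, e.g. sg = SuperSpike.beta * sg, so the pseudo-derivative
        # sharpens toward the derivative of a hard threshold as beta increases.
        return grad_output * sg


def lif_step(v, spikes_prev, input_current, tau=0.9, detach_reset=True):
    """One leaky integrate-and-fire update with an optional detached reset term."""
    # detach_reset=True blocks surrogate gradients from flowing through the
    # reset (the Ctl conditions); False makes the reset differentiable (DR).
    reset = spikes_prev.detach() if detach_reset else spikes_prev
    v = tau * v + input_current - reset   # subtractive reset after a spike
    spikes = SuperSpike.apply(v - 1.0)    # spike threshold at 1.0
    return spikes, v
```

Analogously, detaching the hidden-layer spike train before it is fed back through recurrent weights would correspond to the detached case (Ctl) in panels e and f, whereas leaving it attached lets gradients propagate through recurrent connections (Prop).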
