Deep learning is often criticized for two serious issues that rarely exist in natural nervous systems: overfitting and catastrophic forgetting. Deep networks can even memorize randomly labeled data, behind whose instance-label pairs there is little knowledge. Moreover, when a deep network continually learns over time by accommodating new tasks, it usually quickly overwrites the knowledge learned from previous tasks. It is well known in neuroscience that human brain reactions exhibit substantial variability even in response to the same stimulus; this is referred to as *neural variability*. This mechanism balances accuracy and plasticity/flexibility in the motor learning of natural nervous systems, and it motivates us to design a similar mechanism, named *artificial neural variability* (ANV), that helps artificial neural networks learn some advantages from “natural” neural networks. We rigorously prove that ANV plays the role of an implicit regularizer of the mutual information between the training data and the learned model. This result theoretically guarantees that ANV yields strictly improved generalizability, robustness to label noise, and robustness to catastrophic forgetting. We then devise a *neural variable risk minimization* (NVRM) framework and *neural variable optimizers* to achieve ANV for conventional network architectures in practice. The empirical studies demonstrate that NVRM can effectively relieve overfitting, label noise memorization, and catastrophic forgetting at negligible costs.

## 1 Introduction

Inspired by natural neural networks, artificial neural networks have achieved comparable performance with humans in a variety of application domains (LeCun, Bengio, & Hinton, 2015; Witten, Frank, Hall, & Pal, 2016; Silver et al., 2016; He, Zhang, Ren, & Sun, 2016; Litjens et al., 2017). Deep neural networks are usually highly overparameterized (Keskar, Mudigere, Nocedal, Smelyanskiy, & Tang, 2017; Dinh, Pascanu, Bengio, & Bengio, 2017; Arpit et al., 2017; Kawaguchi, Huang, & Kaelbling, 2019); the number of weights is usually far larger than the sample size. The extreme overparameterization gives deep neural networks excellent approximation (Cybenko, 1989; Funahashi, 1989; Hornik, Stinchcombe, & White, 1989; Hornik, 1993) and optimization (Allen-Zhu, Li, & Song, 2019; Arora, Cohen, & Hazan, 2018; Li & Liang, 2018; Allen-Zhu, Li, & Liang, 2019) abilities, as well as a prohibitively large hypothesis capacity. This makes almost all capacity-based generalization bounds vacuous. Moreover, former empirical results demonstrate that deep neural networks almost surely achieve zero training error even when the training data are randomly labeled (Zhang, Bengio, Hardt, Recht, & Vinyals, 2017). This memorization of noise suggests that deep learning is good at overfitting.

Deep learning performs poorly at learning multiple tasks from dynamic data distributions (Parisi, Kemker, Part, Kanan, & Wermter, 2019). The functionality of artificial neural networks is sensitive to weight perturbations. Thus, continually learning new tasks can quickly overwrite the knowledge learned through previous tasks, which is called catastrophic forgetting (McCloskey & Cohen, 1989; Goodfellow, Mirza, Xiao, Courville, & Bengio, 2013). Neuroscience has motivated a few algorithms for overcoming catastrophic forgetting and variations in data distributions (Kirkpatrick et al., 2017; Zenke, Poole, & Ganguli, 2017; Chen, Mai, Xiao, & Zhang, 2019).

Natural neural networks have much better generalizability and robustness. Can we learn from human brains again for more innovations in deep learning? An extensive body of work in neuroscience suggests that neural variability is essential for learning and proper development of human brains; it refers to the mechanism that human brain reactions exhibit substantial variability even in response to the same stimulus (Churchland, Byron, Ryu, Santhanam, & Shenoy, 2006; Churchland et al., 2010; Dinstein, Heeger, & Behrmann, 2015). Neural variability plays a central role in motor learning, where it helps balance the need for accuracy against the need for plasticity and flexibility (Fetters, 2010). The ever-changing environment requires performers to constantly adapt to both external (e.g., slippery surface) and internal (e.g., injured muscle) perturbations. It is also suggested that adult motor control systems can perform better by actively generating neural variability in order to leave room for adaptive plasticity and flexibility (Tumer & Brainard, 2007). Studies of early development suggest that an appropriate degree of neural variability is necessary (Hedden & Gabrieli, 2004; Ölveczky, Otchy, Goldberg, Aronov, & Fee, 2011). A study on Parkinson's disease suggests that the ability to learn new movements and to adapt to perturbations is dramatically reduced when neural variability is low (Mongeon, Blanchet, & Messier, 2013).

Inspired by the neuroscience knowledge, this letter formulates artificial neural variability theory for deep learning. We mathematically prove that ANV plays the role of an implicit regularizer of the mutual information between learned model weights and the training data. A beautiful coincidence in neuroscience is that neural variability in the rate of response to a steady stimulus also penalizes the information carried by nerve impulses (spikes) (Stein, Gossen, & Jones, 2005; Houghton, 2019). Our theoretical analysis guarantees that ANV can strictly relieve overfitting, label noise memorization, and catastrophic forgetting.

We further propose a neural variable risk minimization (NVRM) framework, which is an efficient training method to achieve ANV for artificial neural networks. In the NVRM framework, we introduce weight perturbations during inference to simulate the neural variability of human brains to relieve overfitting and catastrophic forgetting. The empirical mean of the loss in the presence of weight perturbations is referred to as *neural variable risk* (NVR). Similar to neural variability, replacing the conventional empirical risk minimization (ERM) by NVRM would balance the accuracy-plasticity trade-off in deep learning.

The rest of this letter is organized as follows. In section 2, we propose the neural variability theory and mathematically validate that ANV relieves overfitting, label noise memorization, and catastrophic forgetting. In section 3, we propose the NVRM framework and neural variable optimizers, which can achieve ANV efficiently in practice. In section 4, we conduct extensive experiments to validate the theoretical advantages of NVRM. In particular, training neural networks via neural variable optimizers can easily achieve remarkable robustness to label noise and weight perturbation. In section 5, we conclude our main contribution.

## 2 Neural Variability Theory

In this section, we formally introduce artificial neural variability into deep learning. We denote a model with weights $\theta$ as $M(\theta)$ and the training data set as $S = \{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, drawn from the data distribution $\mathcal{S}$. We define the empirical risk over the training data set $S$ as $\hat{L}(\theta) = L(\theta, S) = \frac{1}{m} \sum_{i=1}^{m} L(\theta, (x^{(i)}, y^{(i)}))$ and the population risk over the data distribution $\mathcal{S}$ as $L(\theta) = \mathbb{E}_{(x,y) \sim \mathcal{S}}[L(\theta, (x, y))]$. We formally define $(b,\delta)$-neural variability ($(b,\delta)$-NV) in definition 1.

**Definition 1.** Suppose that a model $M(\theta)$ satisfies
$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, b^{2} I)}\left[\left|\hat{L}(\theta + \epsilon) - \hat{L}(\theta)\right|\right] \leq \delta,$$
where $|\cdot|$ denotes the absolute value, and both $\delta$ and $b$ are positive. Then the model $M(\theta)$ is said to achieve $(b,\delta)$-neural variability at $\theta$ on the data set $S$. It can equivalently be said that the model achieves $(b,\delta)$-regional flatness at $\theta$ on the data set $S$.

The definition has a similar form to the $(\mathcal{C}_{\epsilon}, A)$-sharpness defined by Keskar et al. (2017). A model $M(\theta)$ with $(b,\delta)$-neural variability can work almost equally well when its weights are randomly perturbed as $\hat{\theta} \sim \mathcal{N}(\theta, b^{2} I)$. This definition mimics the neuroscience mechanism by which human brains can work well, or even better, by actively generating perturbations (Tumer & Brainard, 2007). The definition of $(b,\delta)$-neural variability is also a measure of robustness to weight perturbations and a measure of weight uncertainty for Bayesian neural networks.
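To make the definition concrete, the expected loss change under gaussian weight perturbation can be estimated by Monte Carlo. The sketch below is illustrative rather than from the letter (the toy quadratic losses, dimension, and sample count are assumptions); it shows that a flat minimum achieves $(b,\delta)$-NV with a much smaller $\delta$ than a sharp one:

```python
import random

def empirical_nv_gap(loss, theta, b, n_samples=2000, seed=0):
    """Monte Carlo estimate of E_eps |L(theta + eps) - L(theta)|, eps ~ N(0, b^2 I)."""
    rng = random.Random(seed)
    base = loss(theta)
    total = 0.0
    for _ in range(n_samples):
        perturbed = [t + rng.gauss(0.0, b) for t in theta]
        total += abs(loss(perturbed) - base)
    return total / n_samples

# Two toy minima at theta = 0: a flat one (Hessian = I) and a sharp one (Hessian = 100 I).
flat = lambda th: 0.5 * sum(t * t for t in th)
sharp = lambda th: 50.0 * sum(t * t for t in th)

theta0 = [0.0] * 10
d_flat = empirical_nv_gap(flat, theta0, b=0.1)
d_sharp = empirical_nv_gap(sharp, theta0, b=0.1)
# d_flat is about 100x smaller than d_sharp: the flat minimum has far stronger NV.
```

The same estimator applied to a trained network (perturbing all weights) gives an empirical handle on the $\delta$ achieved at a given variability scale $b$.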

### 2.1 Generalization

In this section, we formulate the information-theoretic foundation of $(b,\delta)$-neural variability by using the PAC-Bayesian framework (McAllester, 1999a, 1999b). The PAC-Bayesian framework provides guarantees on the expected risk of a randomized predictor (hypothesis) that depends on the training data set. The hypothesis is drawn from a distribution $Q$, sometimes referred to as a posterior. We then denote the expected risk with respect to the distribution $Q$ as $L(Q)$ and the empirical risk with respect to the distribution $Q$ as $\hat{L}(Q)$. Suppose $P$ is the prior distribution over the weight space $\Theta$.

**Lemma 1** (PAC-Bayesian bound; McAllester, 1999a). For any real $\Delta \in (0,1)$, with probability at least $1-\Delta$ over the draw of the training data set $S$, for all distributions $Q$ over $\Theta$,
$$L(Q) \leq \hat{L}(Q) + \sqrt{\frac{KL(Q \,\|\, P) + \ln \frac{m}{\Delta}}{2(m-1)}},$$
where $KL(Q \,\|\, P)$ denotes the Kullback-Leibler divergence from $P$ to $Q$.

The PAC-Bayesian generalization bound closely depends on the prior $P$ over the model weights. We make the mild assumption 1.

**Assumption 1.**

The prior over model weights is gaussian, $P = \mathcal{N}(0, \sigma^{2} I)$.

Assumption 1 is justified, as it can be interpreted as weight decay, which is widely used in related papers (Graves, 2011; Neyshabur, Bhojanapalli, McAllester, & Srebro, 2017; He, Liu, & Tao, 2019). We note that $\sigma^{2}$ is very large in practice, as $\sigma^{2}$ is equal to the inverse of the weight decay strength.

We consider a distribution $Q^{nv}$ over model weights of the form $\theta + \epsilon$, where $\theta$ is drawn from the distribution $Q$ and $\epsilon \sim \mathcal{N}(0, b^{2} I)$ is a random variable. Following the theoretical analysis, particularly equation 7, of Neyshabur et al. (2017), we formulate theorem 1.

**Theorem 1.** Suppose assumption 1 holds. Then for any real $\Delta \in (0,1)$, with probability at least $1-\Delta$ over the draw of the training data set $S$, the expected risk for all distributions $Q^{nv}$ satisfies
$$L(Q^{nv}) \leq \hat{L}(Q^{nv}) + \sqrt{\frac{KL(Q^{nv} \,\|\, P) + \ln \frac{m}{\Delta}}{2(m-1)}}.$$

We leave all proofs to appendix A. We note that $KL(Q^{nv} \,\|\, P)$, as a function of $b$, decreases with $b$ for $b \in (0, \sigma)$ and reaches its global minimum at $b = \sigma$. As $\sigma$ is much larger than 1 and than $b$ in practice, the PAC-Bayesian bound monotonically decreases with $b$ given $\delta$. The bound is tighter than the bound in lemma 1 when the model has strong ANV, which means $b$ is large given a small $\delta$.
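The monotonicity claim can be checked numerically with the closed-form KL divergence between diagonal gaussians. The sketch below assumes a point-mass $Q$ at $\theta$, so that $Q^{nv} = \mathcal{N}(\theta, b^{2} I)$; the dimension and values are illustrative:

```python
import math

def kl_gaussian(theta, b, sigma):
    """KL( N(theta, b^2 I) || N(0, sigma^2 I) ) for a d-dimensional diagonal gaussian."""
    d = len(theta)
    sq = sum(t * t for t in theta)
    return d * math.log(sigma / b) + (d * b * b + sq) / (2 * sigma * sigma) - d / 2

theta = [0.1] * 100   # hypothetical 100-dimensional weights
sigma = 10.0          # broad prior, i.e., weak weight decay
kls = [kl_gaussian(theta, b, sigma) for b in (1.0, 5.0, 9.0, 10.0, 15.0)]
# KL decreases with b on (0, sigma), attains its minimum at b = sigma, then increases.
```

Because $\sigma \gg b$ in practice, only the decreasing branch is relevant: enlarging the variability scale $b$ tightens the bound.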

The Kullback-Leibler term $KL(Q^{nv} \,\|\, P)$ in theorem 1 positively correlates with the mutual information of the learned model weights and the training data. Given two random variables $\theta$ and $S$, their Shannon mutual information is defined as $I(\theta; S) = \mathbb{E}_{S \sim \mathcal{S}}[KL(p(\theta \mid S) \,\|\, p(\theta))]$, which is the expected Kullback-Leibler divergence from the prior distribution $p(\theta)$ of $\theta$ to the distribution $p(\theta \mid S)$ after an observation of $S$ (Cover & Thomas, 2012). In the case of theorem 1, we have
$$\mathbb{E}_{S \sim \mathcal{S}}\left[KL(Q^{nv} \,\|\, P)\right] \geq I(\theta + \epsilon; S),$$
since the marginal distribution of $\theta + \epsilon$ minimizes the expected Kullback-Leibler divergence over all choices of the prior $P$.

Different from the PAC-Bayesian approach, another theoretical framework for the generalization bound based on mutual information was proposed by Xu and Raginsky (2017). Following these authors, we formulate an alternative mutual-information-based generalization bound in appendix B.

### 2.2 Robustness to Label Noise

Noisy labels can remarkably damage the generalization of deep networks, because deep networks can completely memorize corrupted label information (Zhang et al., 2017). Memorizing noisy labels is one of the most serious overfitting issues in deep learning. We will show that ANV relieves deep networks from memorizing noisy labels by penalizing the mutual information of the model weights $\theta $ and the labels $y$ conditioned on the inputs $x$.

A theoretical result of Harutyunyan, Reing, Steeg, and Galstyan (2020) also supports the claim that memorization of noisy labels is prevented by decreasing $I(\theta; y \mid x)$.

By definition 1, the model $M(\hat{\theta})$, with $\hat{\theta} = \theta + \epsilon$, may achieve nearly equal training performance to $M(\theta)$ given a small $\delta$. At the same time, $I(\hat{\theta}; y \mid x)$ is obviously smaller than $I(\theta; y \mid x)$, as the noise $\epsilon$ penalizes the mutual information. This suggests that increasing $b$ given $\delta$ for a $(b,\delta)$-neural variable model can penalize the memorization of noisy labels by regularizing the mutual information of the learned model weights and the training data.

### 2.3 Robustness to Catastrophic Forgetting

The ability to continually learn over time by accommodating new tasks while retaining previously learned tasks is referred to as continual or lifelong learning (Parisi et al., 2019). However, the main issue of continual learning is that artificial neural networks are prone to catastrophic forgetting. In natural neural systems, neural variability leaves room for the excellent plasticity and continual learning ability (Tumer & Brainard, 2007). It is natural to conjecture that ANV can help relieve catastrophic forgetting and enhance continual learning.

We take regularization-based continual learning (Kirkpatrick et al., 2017; Zenke et al., 2017; Aljundi, Babiloni, Elhoseiny, Rohrbach, & Tuytelaars, 2018; Doan, Bennani, Mazoure, Rabusseau, & Alquier, 2020) as an example. The basic idea of this approach is to strongly regularize the weights most relevant to previous tasks. Usually the regularization is strong enough to keep the weights near the solution learned from previous tasks. In a way, the model tends to learn in the overlapping region of optimal solutions for multiple tasks. The intuition behind ANV is clear: if a model is more robust to weight perturbation, it will have a wider optimal region shared by multiple tasks.

## 3 Neural Variable Risk Minimization

The next question is how to perform NVRM. Unfortunately, the NVR is intractable in practice. However, it is possible to estimate the NVR and its gradient without bias by sampling $\hat{L}^{NV}(\theta, S) = L(\theta + \epsilon, S)$, where $\epsilon \sim \mathcal{N}(0, b^{2} I)$. This unbiased estimation method is also used in variational inference (Graves, 2011).

We refer to the resulting optimizers as *neural variable optimizers*. The pseudocode of NVRM-SGD is displayed in algorithm 1. Similarly, we can easily obtain NVRM-Adam by adding the four colored lines of algorithm 1 into adaptive momentum estimation (Adam) (Kingma & Ba, 2014). The source code is available at https://github.com/zeke-xie/artificial-neural-variability-for-deep-learning. We note that it is necessary to apply the denoising step before we evaluate the model learned by NVRM on test data sets, as $\hat{\theta}_{t} = \theta_{t} + \epsilon_{t}$. We also call such weight perturbations virtual perturbations, which need to be applied before inference and removed after backpropagation. We can easily empower neural networks with ANV by importing a neural variable optimizer to train them.
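The update rule described above can be sketched in pure Python on a toy quadratic objective. This is not the released implementation; the objective, hyperparameters, and function names are illustrative. The essential point is that the gradient is taken at the perturbed point $\theta + \epsilon$, but the perturbation itself is removed before the weights are updated:

```python
import random

def nvrm_sgd(grad, theta, lr=0.1, b=0.05, steps=200, seed=0):
    """Sketch of NVRM-SGD: perturb, take the gradient at the perturbed point, denoise, update."""
    rng = random.Random(seed)
    for _ in range(steps):
        eps = [rng.gauss(0.0, b) for _ in theta]
        perturbed = [t + e for t, e in zip(theta, eps)]   # virtual perturbation
        g = grad(perturbed)                               # "backprop" at theta + eps
        # Denoising step: the update starts from the clean theta, so the noise
        # affects only the gradient, never accumulates in the weights.
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# Toy quadratic L(theta) = 0.5 * ||theta||^2, so grad(theta) = theta.
theta = nvrm_sgd(lambda th: list(th), [1.0, -2.0, 3.0])
```

A practical implementation would wrap an existing optimizer and store $\epsilon_t$ so it can be subtracted again before each evaluation.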

### 3.1 A Deep Learning Dynamical Perspective

We note that it is possible to theoretically analyze NVRM from a deep learning dynamical perspective. NVRM actually introduces Hessian-dependent gradient noise into the learning dynamics, instead of the injected white gaussian noise of conventional noise injection methods, as the second-order Taylor approximation of the loss gives $\nabla L(\theta + \epsilon) \approx \nabla L(\theta) + \nabla^{2} L(\theta)\,\epsilon$ for small weight perturbations. Zhu, Wu, Yu, Wu, and Ma (2019) argued that anisotropic gradient noise is often beneficial for escaping sharp minima. Xie, Sato, and Sugiyama (2021) and Xie, Wang, Zhang, Sato, and Sugiyama (2020) further quantitatively proved that Hessian-dependent gradient noise is exponentially helpful for learning flat minima. Again, flat minima (Hochreiter & Schmidhuber, 1997) are closely related to overfitting and the information in the model weights. This mathematically explains the advantage of NVRM from a different perspective. We leave the diffusion-based approach as future work.
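The Hessian dependence of the gradient noise can be checked numerically. For a quadratic $L(\theta) = \frac{1}{2}\theta^{\top} H \theta$ the Taylor relation is exact: the noise $\nabla L(\theta+\epsilon) - \nabla L(\theta) = H\epsilon$ scales with the curvature in each direction. The sketch below (toy diagonal Hessian, illustrative values) verifies this anisotropy:

```python
import random
import statistics

# For L(theta) = 0.5 * theta^T H theta with diagonal H, grad(theta) = H theta,
# so the NVRM gradient noise is grad(theta + eps) - grad(theta) = H eps.
h = [1.0, 100.0]   # flat direction vs. sharp direction
b = 0.01
rng = random.Random(0)
noise = [[], []]
for _ in range(5000):
    eps = [rng.gauss(0.0, b) for _ in h]
    for i in range(2):
        noise[i].append(h[i] * eps[i])
stds = [statistics.pstdev(noise[i]) for i in range(2)]
# The gradient noise is ~100x larger along the sharp direction: Hessian-dependent
# and anisotropic, unlike injected white gaussian noise.
```

White-noise injection would instead give equal noise scale in both directions, which is exactly the contrast drawn in the text.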

### 3.2 Related Work

One related line of research is injecting weight noise into deep networks during training (An, 1996; Neelakantan et al., 2015; Zhou, Liu, Li, Lin, Zhou, & Zhao, 2019). For example, perturbed stochastic gradient descent (PSGD) is SGD with a conventional weight noise injection method, which is displayed in algorithm 2. Another famous example is stochastic gradient Langevin dynamics (SGLD) (Welling & Teh, 2011), which differs from PSGD only in the magnitude of the injected gaussian noise. However, this conventional line does not remove the injected weight noise after each iteration, which makes it essentially different from our method. In section 4, we empirically verify that the denoising step is significantly helpful for preventing overfitting.
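For contrast with the denoising step of NVRM, here is a minimal sketch of the conventional noise injection scheme just described (toy quadratic objective, illustrative hyperparameters, hypothetical function name). The only structural difference is that the injected noise is left in the weights after each iteration:

```python
import random

def psgd(grad, theta, lr=0.1, b=0.05, steps=200, seed=0):
    """Sketch of conventional perturbed SGD: gaussian weight noise is injected and never removed."""
    rng = random.Random(seed)
    for _ in range(steps):
        theta = [t + rng.gauss(0.0, b) for t in theta]   # noise stays in the weights
        g = grad(theta)                                  # gradient at the noisy point
        theta = [t - lr * gi for t, gi in zip(theta, g)]
    return theta

# Toy quadratic L(theta) = 0.5 * ||theta||^2, so grad(theta) = theta.
theta = psgd(lambda th: list(th), [1.0, -2.0, 3.0])
```

Because the noise accumulates in the iterates, PSGD reaches a noisy stationary distribution around the minimum rather than the minimum itself, which is the behavior the denoising step removes.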

Variational inference (VI) for Bayesian neural networks (Graves, 2011; Blundell, Cornebise, Kavukcuoglu, & Wierstra, 2015; Khan et al., 2018) aims at estimating the posterior distribution of model weights given training data. VI requires expensive costs to update the posterior distribution (model uncertainty) during training. This line believes estimating the exact posterior is important but ignores the importance of enhancing model uncertainty. In contrast, our method is the first to actively encourage model uncertainty for multiple benefits by choosing the variability hyperparameter $b$. ANV may be regarded as applying a neuroscience-inspired hyperprior over model uncertainty. Inspired by recent work on Bayesian neural networks, we conjecture that the NVRM framework could help improve adversarial robustness (Carbone et al., 2020) and fix overconfidence problems (Kristiadi, Hein, & Hennig, 2020).

Another related line of research is randomized smoothing (Duchi, Bartlett, & Wainwright, 2012; Nesterov & Spokoiny, 2017). Wen et al. (2018) applied the idea of randomized smoothing to the training of deep networks and proposed the SmoothOut method to optimize a weight-perturbed loss, which is also what the proposed NVRM does. We note that the original SmoothOut is actually a different implementation of NVRM with uniform noise; both belong to randomized smoothing. However, this line of research (Duchi et al., 2012; Wen et al., 2018) focused only on improving performance on clean data sets by escaping from sharp minima. To the best of our knowledge, our work is the first along this line to theoretically and empirically analyze label noise memorization and catastrophic forgetting.

In summary, our letter makes two important contributions beyond the existing related work: (1) we discovered that NVRM can play a very important role in regularizing mutual information, which helps relieve label noise memorization and catastrophic forgetting, and (2) we implemented the weight-perturbed gradient estimation as a simple and effective optimization framework. NVRM as an optimizer is more elegant and easier to use than existing methods like SmoothOut, which need to update the weights to compute a perturbed loss before each backpropagation.

## 4 Empirical Analysis

We conducted systematic comparison experiments to evaluate the proposed NVRM framework. To secure a fair comparison, every experimental setting was run 10 times while all irrelevant variables were strictly controlled. We evaluated NVRM by the mean performance and the standard deviation over the 10 trials. Implementation details are in appendix C.

### 4.1 Robustness to Weight Perturbation

For ResNet-34 (He et al., 2016) trained on clean CIFAR-10 and CIFAR-100 data (Krizhevsky & Hinton, 2009), we perturbed the weights with isotropic gaussian noise at different noise scales to evaluate the robustness of the test accuracy to weight perturbation. Figure 1 demonstrates that the models trained by NVRM are significantly more robust to weight perturbation and have lower expected minima sharpness as defined by Neyshabur et al. (2017). This empirically verifies that conventional neural networks trained via NVRM indeed learn strong ANV.

### 4.2 Improved Generalization

### 4.3 Robustness to Noisy Labels

### 4.4 Robustness to Catastrophic Forgetting

### 4.5 Is the Denoising Step Really Helpful?

### 4.6 Choices of Noise Types

## 5 Conclusion

A well-known term in neuroscience, neural variability, suggests that the human brain's response to the same stimulus exhibits substantial variability and that this variability significantly contributes to balancing accuracy and plasticity/flexibility in motor learning in natural neural networks. Inspired by this mechanism, this letter introduced ANV for balancing accuracy and plasticity/flexibility in artificial neural networks. We proved that ANV acts as an implicit regularizer that controls the mutual information between the training data and the learned model, which in turn prevents the learned model from overfitting and from catastrophic forgetting. These two abilities are theoretically related to robustness to weight perturbations. The proposed NVRM framework is an efficient approach to achieving ANV for artificial neural networks. The empirical results demonstrate that our method can (1) enhance robustness to weight perturbation, (2) improve generalizability, (3) relieve the memorization of noisy labels, and (4) mitigate catastrophic forgetting. In particular, NVRM, as an optimization approach, handles the memorization of noisy labels well at negligible computational and coding costs. One line of code importing a neural variable optimizer is all you need to achieve ANV for your models.

## Appendix A: Proofs

### A.1 Proof of Theorem 1

**Proof.**

## Appendix B: The Mutual-Information Generalization Bound

We formulate a mutual-information-theoretic foundation of $(b,\delta)$-neural variability, which is more closely related to the neuroscience mechanism of penalizing the information carried by nerve impulses (Stein et al., 2005).

It is known that the information in the model weights relates to overfitting (Hinton & Van Camp, 1993) and flat minima (Hochreiter & Schmidhuber, 1997). According to lemma 2, if the mutual information of the parameters and the data decreases, the upper bound on the expected generalization gap also decreases.

**Lemma 2** (Xu & Raginsky, 2017). Suppose the loss $L(\theta, (x, y))$ is $\sigma_{L}$-sub-gaussian under $(x, y) \sim \mathcal{S}$ for every $\theta$. Then the expected generalization gap satisfies
$$\left|\mathbb{E}\left[L(\theta) - \hat{L}(\theta)\right]\right| \leq \sqrt{\frac{2\sigma_{L}^{2}}{m}\, I(\theta; S)}.$$

**Theorem 2.** Suppose the conditions of lemma 2 hold and the model $M(\theta)$ achieves $(b,\delta)$-NV on the training data set $S$. Then the expected generalization gap of the model $M(\theta)$ satisfies

**Proof.** By lemma 2, the expected generalization gap of $M(\theta + \epsilon)$ satisfies

Obviously, the bound monotonically decreases with the variability scale $b$ given $\delta$. The bound is tighter than the bound in lemma 2 when the model has good ANV, that is, when $b$ is large given a small $\delta$. A large $b$ can even penalize the mutual information to nearly zero. Therefore, strong ANV brings a tighter generalization bound.

## Appendix C: Implementation Details

We introduce the details of each experiment in this section. In experiment 1, we evaluated NVRM's robustness to weight perturbation. In experiment 2, we evaluated the generalizability of NVRM. In experiment 3, we evaluated NVRM's robustness to noisy labels. In experiment 4, we evaluated NVRM's robustness to catastrophic forgetting. In experiment 5, we studied the usefulness of the denoising step. In experiment 6, we studied choices of noise types.

Our experiments were conducted on a computing cluster with NVIDIA Tesla V100 16 GB GPUs and Intel Xeon Gold 6140 CPUs @ 2.30 GHz.

### C.1 Robustness, Generalization, and Label Noise

#### C.1.1 General Settings

Experiments are conducted with three popular deep neural network architectures: VGG-16 (Simonyan & Zisserman, 2015), MobileNetV2 (Sandler et al., 2018), and ResNet-34 (He et al., 2016). The detailed architectures are presented in Table 1. All data sets involved in our experiments are generated from two standard benchmark data sets, CIFAR-10 and CIFAR-100 (Krizhevsky & Hinton, 2009).^{1} We follow the official split of training and test sets. For preprocessing and data augmentation, we performed per-pixel mean subtraction, horizontal random flips, and $32 \times 32$ random crops after padding with four pixels on each side. The batch size is set as 128, and the weight decay factor is set as 0.0001. We selected the optimal learning rate from {0.0001, 0.001, 0.01, 0.1, 1, 10} and used 0.1 for SGD/NVRM-SGD. Note that we used the common $L_{2}$ regularization as weight decay, which is standard in most cases, while Loshchilov and Hutter (2018) and Xie, Sato, and Sugiyama (2020) suggested that decoupled weight decay or stable weight decay is better for adaptive gradient methods. We employ SGD and NVRM-SGD to train models unless we specify otherwise. For the learning rate schedule, we initialized the learning rate as 0.1 and divided it by 10 after every 100 epochs. All models are trained for 300 epochs. The momentum factor is set as 0 for VGG-16 and MobileNetV2 in experiment 1 and as 0.9 for ResNet-34 in experiments 2 and 3.

| VGG-16 | MobileNetV2 | ResNet-34 |
| --- | --- | --- |
| conv3-64 $\times$ 2 | fc-32 | conv3-64 |
| maxpool | [conv1-$k$, conv3-$k$, conv1-16] $\times$ 1 | [conv3-64, conv3-64] $\times$ 3 |
| conv3-128 $\times$ 2 | [conv1-6$k$, conv3-6$k$, conv1-24] $\times$ 2 | [conv3-128, conv3-128] $\times$ 4 |
| maxpool | [conv1-6$k$, conv3-6$k$, conv1-32] $\times$ 3 | [conv3-256, conv3-256] $\times$ 6 |
| conv3-256 $\times$ 3 | [conv1-6$k$, conv3-6$k$, conv1-64] $\times$ 4 | [conv3-512, conv3-512] $\times$ 3 |
| maxpool | [conv1-6$k$, conv3-6$k$, conv1-96] $\times$ 3 | avgpool |
| conv3-512 $\times$ 3 | [conv1-6$k$, conv3-6$k$, conv1-160] $\times$ 3 | |
| maxpool | [conv1-6$k$, conv3-6$k$, conv1-320] $\times$ 1 | |
| conv3-512 $\times$ 3 | fc-1280 | |
| maxpool | | |
| fc-512 $\times$ 2 | | |
| fc-10 or fc-100 | | |


Notes: “conv$x$-$c$” represents a convolution layer with kernel size $x \times x$ and $c$ output channels, and “fc-$c$” represents a fully connected layer with $c$ output channels. In the architecture of MobileNetV2, $[\cdot]$ represents a bottleneck, and $(\cdot)$ is simply a combination of three convolution layers that halves both the width and height of the input of the block. The $k$ in $[\cdot]$ or $(\cdot)$ denotes the number of input channels of the corresponding block. In the architecture of ResNet-34, $[\cdot]$ represents a basic block.

#### C.1.2 Robustness to Weight Perturbation

For experiment 1, we injected isotropic gaussian noise of different variances into all the model weights and then evaluated the change in test accuracy. Six noise scales, {0.01, 0.012, 0.014, 0.016, 0.018, 0.02}, are involved in our experiments.
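The evaluation protocol can be sketched on a hypothetical linear toy model (the data, weights, and helper name below are illustrative, not the letter's ResNet setup): perturb the weights at each noise scale several times and average the resulting accuracy.

```python
import random

def perturbed_accuracy(weights, data, scale, trials=20, seed=0):
    """Mean accuracy of a linear classifier after isotropic gaussian weight perturbation."""
    rng = random.Random(seed)
    accs = []
    for _ in range(trials):
        w = [wi + rng.gauss(0.0, scale) for wi in weights]
        correct = sum(1 for x, y in data
                      if (sum(wi * xi for wi, xi in zip(w, x)) > 0) == y)
        accs.append(correct / len(data))
    return sum(accs) / len(accs)

# Hypothetical toy task: the label is the sign of the first coordinate.
rng = random.Random(1)
xs = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
data = [(x, x[0] > 0) for x in xs]
w_star = [1.0, 0.0, 0.0, 0.0, 0.0]
scales = [0.01, 0.012, 0.014, 0.016, 0.018, 0.02]
accs = [perturbed_accuracy(w_star, data, s) for s in scales]
```

For a deep network, the same loop would perturb every parameter tensor and run the test set forward pass per trial.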

#### C.1.3 Learning with Noisy Labels

For experiment 3, we also generated a group of data sets with label noise. The symmetric label noise is generated by flipping every label to another label with uniform flip rates {20%, 40%, 60%, 80%}. The asymmetric label noise is generated by flipping label $i$ to label $i+1$ (except that label 9 is flipped to label 0) with pairwise flip rates {10%, 20%, 30%, 40%}. We employed the code of Han et al. (2018) to generate noisy labels for CIFAR-10 and CIFAR-100.
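The two flipping procedures can be sketched as follows. This is a hypothetical helper for illustration (the letter itself uses the code of Han et al., 2018):

```python
import random

def corrupt_labels(labels, num_classes, rate, mode="symmetric", seed=0):
    """Flip each label with probability `rate`: uniformly to another class
    (symmetric), or to the next class, i -> (i + 1) % C (asymmetric)."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < rate:
            if mode == "symmetric":
                y = rng.choice([c for c in range(num_classes) if c != y])
            else:
                y = (y + 1) % num_classes
        noisy.append(y)
    return noisy

clean = [i % 10 for i in range(1000)]
sym = corrupt_labels(clean, 10, rate=0.4, mode="symmetric")
asym = corrupt_labels(clean, 10, rate=0.3, mode="asymmetric")
```

Note that the asymmetric variant flips only to the paired class, so the noise is structured rather than uniform, which is what makes it harder to distinguish from clean signal.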

#### C.1.4 The Usefulness of the Denoising Step and Choices of Noise Types

The hyperparameter settings of experiments 5 and 6 follow experiment 3, which is performed on learning with noisy labels. In experiment 6, we let the weight noise $\epsilon$ obey $\mathcal{N}(0, b^{2})$, $\mathrm{Laplace}(0, b)$, and $\mathrm{Uniform}(-b, b)$ for NVRM-gaussian, NVRM-Laplace, and NVRM-uniform, respectively.
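The three noise types can be sampled as follows. This is a hypothetical helper for illustration; the Laplace draw uses the standard difference-of-exponentials identity, $b(E_1 - E_2) \sim \mathrm{Laplace}(0, b)$ for $E_1, E_2 \sim \mathrm{Exp}(1)$:

```python
import math
import random

def sample_noise(n, b, noise_type="gaussian", seed=0):
    """Draw n weight-noise values for the NVRM variants:
    N(0, b^2), Laplace(0, b), or Uniform(-b, b)."""
    rng = random.Random(seed)
    if noise_type == "gaussian":
        return [rng.gauss(0.0, b) for _ in range(n)]
    if noise_type == "laplace":
        # Laplace(0, b) as the difference of two Exp(1) draws, scaled by b.
        # The `or 1e-12` guards against log(0) on the rare exact-zero draw.
        return [b * (math.log(rng.random() or 1e-12)
                     - math.log(rng.random() or 1e-12)) for _ in range(n)]
    return [rng.uniform(-b, b) for _ in range(n)]

g = sample_noise(1000, 0.01, "gaussian")
l = sample_noise(1000, 0.01, "laplace")
u = sample_noise(1000, 0.01, "uniform")
```

All three are zero-mean and controlled by the single variability scale $b$, so they are drop-in alternatives inside the same NVRM update.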

### C.2 Catastrophic Forgetting

#### C.2.1 Permuted MNIST

For experiment 4, we used a fully connected network (FCN) with two hidden layers and 1024 ReLUs per hidden layer. As continual learning tasks usually employ adaptive optimizers, we compared Adam with NVRM-Adam on the popular benchmark task Permuted MNIST, which consists of five continual tasks. For each task, we generated a fixed, random permutation by which the input pixels of all images are shuffled. Each task is thus of equal difficulty to the original MNIST problem, although a different solution is required for each. After each task, we evaluated the accuracy on the base task (the first task) and the mean accuracy over all learned tasks.
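The task construction can be sketched as follows; tiny 16-pixel "images" stand in for MNIST, and the helper name is illustrative:

```python
import random

def make_permuted_tasks(images, num_tasks, seed=0):
    """Permuted-MNIST-style tasks: one fixed random pixel permutation per task."""
    rng = random.Random(seed)
    n_pixels = len(images[0])
    tasks = []
    for _ in range(num_tasks):
        perm = list(range(n_pixels))
        rng.shuffle(perm)  # fixed permutation shared by all images of this task
        tasks.append([[img[p] for p in perm] for img in images])
    return tasks

# Tiny illustrative "images" (16 pixels each) instead of real 784-pixel MNIST.
images = [[float(i + 16 * j) for i in range(16)] for j in range(3)]
tasks = make_permuted_tasks(images, num_tasks=5)
```

Because the permutation is fixed within a task, every task preserves the pixel statistics of MNIST while requiring a different mapping from inputs to features.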

In the EWC experiment, we validate whether NVRM can improve EWC. We evaluated the performance improvements under two importance hyperparameters, $\lambda \in \{30, 1000\}$. In experiment 4, the batch size is set as 256, and the weight decay factor is set as 0.0001. As continual learning methods usually prefer adaptive optimizers, we employed Adam and NVRM-Adam as the optimizers. For the learning rate schedule, we fixed the learning rate as 0.001 and applied no learning rate decay. We set the variability scale $b = 0.03$ in NVRM-Adam. All models are trained for one epoch per task, as one-epoch training already ensures good test performance on newly learned tasks.

#### C.2.2 Split MNIST

For experiment 4, we also supplied an experiment on split MNIST, another classical continual learning task, known as incremental class learning. We train the models on the samples with a specific subset of labels for five continual tasks. We followed the usual setting (Zenke et al., 2017): $y \in \{0,1\}$, $y \in \{2,3\}$, $y \in \{4,5\}$, $y \in \{6,7\}$, and $y \in \{8,9\}$ for the five tasks, respectively. In each task, the model may learn only two new digits and may forget previously learned digits.

The model has the same architecture as for Permuted MNIST, except that we used five output heads, one per task. When we trained the model on one task, the heads for the other tasks were frozen. The batch size was set as 256 and the weight decay factor as 0. Again, we employed Adam and NVRM-Adam as the optimizers, and a new optimizer is instantiated for each continual task. For the learning rate schedule, we fixed the learning rate as 0.001 and applied no learning rate decay. We also let the variability scale $b = 0.03$ in NVRM-Adam unless we specify otherwise.
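The label-subset split can be sketched with a hypothetical helper on stand-in data (indices play the role of images):

```python
def split_tasks(xs, ys, label_groups=((0, 1), (2, 3), (4, 5), (6, 7), (8, 9))):
    """Split a labeled data set into incremental-class tasks by label subset."""
    return [[(x, y) for x, y in zip(xs, ys) if y in group]
            for group in label_groups]

# Illustrative stand-in for MNIST: each "image" is just its index.
ys = [i % 10 for i in range(100)]
xs = list(range(100))
tasks = split_tasks(xs, ys)
```

Training then iterates over `tasks` in order, activating only the output head of the current task, as described above.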

## Appendix D: Additional Algorithm

## Appendix E: Supplementary Experimental Results

### E.1 Robustness to Weight Perturbation

### E.2 Improved Generalization

### E.3 Robustness to Noisy Labels

### E.4 Robustness to Catastrophic Forgetting

### E.5 SGD with Large Gradient Noise Cannot Relieve Noise Memorization

## Note

## Acknowledgments

M.S. was supported by the International Research Center for Neurointelligence (WPI-IRCN) at the University of Tokyo Institutes for Advanced Study.

## References

*Where is the information in a deep neuralnetwork?*

*Journal of Machine Learning Research*

*Proceedings of the European Conference on Computer Vision*

*Advances in neural information processing systems*

*Proceedings of the International Conference on Machine Learning*

*Neural Computation*

*Proceedings of the 35th International Conference on Machine Learning*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the 32nd International Conference on Machine Learning*

*On-Line Learning in Neural Networks*

*Robustness of Bayesian neural networks to gradient-based attacks*

*Neural Computation*

*Nature Neuroscience*

*Journal of Neuroscience*

*Elements of information theory*

*Mathematics of Control, Signals and Systems*

*Proceedings of the International Conference on Machine Learning*

*Trends in Cognitive Sciences*

*A theoretical analysis of catastrophic forgetting through the NTK overlap matrix*

*SIAM Journal on Optimization*

*Physical Therapy*

*Neural Networks*

*An empirical investigation of catastrophic forgetting in gradient-based neural networks*

*Advances in neural information processing systems*

*Advances in neural information processing systems*

*Improving generalization by controlling label-noise information in neural network weights.*

*Advances in neural information processing systems*

*32*(pp.

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*

*Nature Reviews Neuroscience*

*Proceedings of the Sixth Annual Conference on Computational Learning Theory*

*Neural Computation*

*Neural Networks*

*Neural Networks*

*Neural Computation*

*Three factors influencing minima in SGD.*

*Neural Computation*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the International Conference on Machine Learning*

*Adam: A method for stochastic optimization.*

*Proceedings of the National Academy of Sciences*

*Proceedings of the International Conference on Machine Learning*

*Learning multiple layers of features from tiny images*

*Advances in neural information processing system*

*Medical Image Analysis*

*Proceedings of the International Conference on Learning Representations*

*Proceedings of the 12th Annual Conference on Computational Learning Theory*

*Machine Learning*

*Psychology of Learning and Motivation*

*Brain and Cognition*

*Adding gradient noise improves learning for very deep networks.*

*Foundations of Computational Mathematics*

*Advances in neural information processing systems*

*Journal of Neurophysiology*

*Neural Networks*

*Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition*

*Nature*

*Proceedings of the 3rd International Conference on Learning Representations*

*Nature Reviews Neuroscience*

*Proceedings of the International Conference on Machine Learning*

*Nature*

*Proceedings of the 28th International Conference on Machine Learning*

*Smoothout: Smoothing out sharp minima to improve generalization in deep learning.*

*Data mining: Practical machine learning tools and techniques*

*Stable weight decay regularization*

*Proceedings of the International Conference on Learning Representations*

*Adai: Separating the effects of adaptive learning rate and momentum inertia.*

*Advances in neural information processing systems*

*Proceedings of the 34th International Conference on Machine Learning*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the International Conference on Machine Learning*

*Proceedings of the International Conference on Machine Learning*