Abstract

Many previous proposals for adversarial training of deep neural nets have included directly modifying the gradient, training on a mix of original and adversarial examples, using contractive penalties, and approximately optimizing constrained adversarial objective functions. In this article, we show that these proposals are actually all instances of optimizing a general, regularized objective we call DataGrad. Our proposed DataGrad framework, which can be viewed as a deep extension of the layerwise contractive autoencoder penalty, cleanly simplifies prior work and easily allows extensions such as adversarial training with multitask cues. In our experiments, we find that the deep gradient regularization of DataGrad (which also has L1 and L2 flavors of regularization) outperforms alternative forms of regularization, including classical L1, L2, and multitask, on both the original data set and adversarial sets. Furthermore, we find that combining multitask optimization with DataGrad adversarial training results in the most robust performance.

1  Introduction

Deep neural architectures are highly effective at a vast array of tasks, both supervised and unsupervised. However, it has recently been shown that deep architectures are sensitive to certain kinds of perturbations of the input, which can range from being barely perceptible to quite noticeable (even semirandom noise), as in Nguyen, Yosinski, and Clune (2014). Samples containing this type of noise, called adversarial examples (Szegedy et al., 2013), can cause a trained network to confidently misclassify its input. While adversarial samples can be generated in a variety of ways, the fastest and most effective approaches in the literature are based on the idea of using backpropagation to acquire the derivative of the loss with respect to an input image (i.e., the gradient) and adding a small multiple of the gradient to the image.

Earlier work suggested adding a regularization penalty on the deep gradient (Goodfellow, Shlens, & Szegedy, 2014; Gu & Rigazio, 2014) but had difficulty in computing the derivative (with respect to the weights) of the gradient, which is necessary for gradient-descent optimization algorithms. Instead, approximations were used. One was a shallow layerwise gradient penalty (Gu & Rigazio, 2014), which had also been used for regularizing contractive autoencoders (Rifai et al., 2011). Meanwhile, Lyu, Huang, and Liang (2015) presented a heuristic algorithm for this objective.

Here we provide an efficient, deterministic backpropagation-style algorithm for training with a wide variety of gradient penalties. The resulting algorithm has the potential to unify existing approaches for adversarial training. In particular, it helps explain some of the newer approaches to adversarial training (Miyato, Maeda, Koyama, Nakae, & Ishii, 2015; Huang, Xu, Schuurmans, & Szepesvari, 2015). These approaches set up an adversarial objective as a constrained optimization problem and then approximate and simplify it using properties that hold for optimal solutions of unconstrained problems. The algorithms thus developed perform approximate optimization (when compared to ours) and can be viewed as regularizations of this deep gradient.

2  The DataGrad Framework

Given a set of loss functions $\mathcal{L}_0, \mathcal{L}_1, \ldots, \mathcal{L}_r$ and regularizers $\mathcal{R}_1, \ldots, \mathcal{R}_r$, consider

$$\mathcal{L}_{DG}(\mathbf{x}_t, \mathbf{y}_t, \Theta) = \lambda_0\, \mathcal{L}_0(\mathbf{x}_t, \mathbf{y}_t, \Theta) + \sum_{s=1}^{r} \lambda_s\, \mathcal{R}_s\!\left(\nabla_{\mathbf{x}} \mathcal{L}_s(\mathbf{x}_t, \mathbf{y}_t, \Theta)\right),$$

where $\mathbf{x}_t$ is a data sample, $\mathbf{y}_t$ is its corresponding label/target, and $\Theta$ represents the parameters of a $K$-layer neural network.1 We use $\nabla_{\mathbf{x}} \mathcal{L}_s$ to denote the data gradient of $\mathcal{L}_s$ (the gradient of $\mathcal{L}_s$ with respect to $\mathbf{x}$). The $\lambda_0, \lambda_1, \ldots, \lambda_r$ are the weight coefficients of the terms in the DataGrad loss function. Close to our work, Lyu et al. (2015) present a heuristic way to optimize a special case of this objective. By directly providing an algorithm, our analysis can explain what their algorithm optimizes.
We denote the entire data set as $\mathcal{D} = \{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_T, \mathbf{y}_T)\}$. Following the framework of empirical risk minimization with stochastic gradient descent, the goal is to minimize the objective function, $\sum_{t=1}^{T} \mathcal{L}_{DG}(\mathbf{x}_t, \mathbf{y}_t, \Theta)$, by iterating the following parameter updates (here, $W^{(k)}_{ij}$ is the component of $\Theta$ representing the weight of the incoming edge to node $i$ of layer $k$ from node $j$ of layer $k-1$):

$$W^{(k)}_{ij} \leftarrow W^{(k)}_{ij} - \alpha \left( \lambda_0 \frac{\partial \mathcal{L}_0(\mathbf{x}_t, \mathbf{y}_t, \Theta)}{\partial W^{(k)}_{ij}} + \sum_{s=1}^{r} \lambda_s \frac{\partial}{\partial W^{(k)}_{ij}} \mathcal{R}_s\!\left(\nabla_{\mathbf{x}} \mathcal{L}_s(\mathbf{x}_t, \mathbf{y}_t, \Theta)\right) \right), \tag{2.1}$$

where $\alpha$ is the step-size coefficient.

2.1.  The Derivation

The first update term of equation 2.1, $\partial \mathcal{L}_0(\mathbf{x}_t, \mathbf{y}_t, \Theta) / \partial W^{(k)}_{ij}$, is provided by standard backpropagation. For the remaining terms, since the gradient of the loss also depends on the current weights $\Theta$, we see that

$$\frac{\partial}{\partial W^{(k)}_{ij}} \mathcal{R}_s\!\left(\nabla_{\mathbf{x}} \mathcal{L}_s(\mathbf{x}_t, \mathbf{y}_t, \Theta)\right) = \sum_{d} \left.\frac{\partial \mathcal{R}_s(\mathbf{a})}{\partial a_d}\right|_{\mathbf{a} = \nabla_{\mathbf{x}} \mathcal{L}_s} \frac{\partial^2 \mathcal{L}_s(\mathbf{x}_t, \mathbf{y}_t, \Theta)}{\partial W^{(k)}_{ij}\, \partial x_d}, \tag{2.2}$$

where $\mathbf{a}$ is a variable that takes the current value of $\nabla_{\mathbf{x}} \mathcal{L}_s$. It turns out that these mixed partial derivatives (with respect to weights and to data) have structural similarities to the Hessian (since derivatives with respect to the data are computed almost exactly the same way as the derivatives with respect to the lowest layer weights). Since exact computation of the Hessian is slow (Bishop, 1992), we would expect that the computation of this matrix of partial derivatives would also be slow. However, it turns out that we do not need to compute the full matrix; we only need this matrix times a vector, and hence we can use ideas reminiscent of fast Hessian multiplication algorithms (Pearlmutter, 1994). At points of continuous differentiability, we have

$$\sum_{d} v_d\, \frac{\partial^2 \mathcal{L}_s(\mathbf{x}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}\, \partial x_d} = \frac{\partial}{\partial \epsilon} \left[ \frac{\partial \mathcal{L}_s(\mathbf{x} + \epsilon\, \mathbf{v}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} \right]_{\epsilon = 0}, \tag{2.3}$$

evaluated at the point $\mathbf{x}$ and direction $\mathbf{v}$.2 The outer directional derivative with respect to the scalar $\epsilon$ can be computed using finite differences. Thus, equations 2.2 and 2.3 mean that we can compute the term $\frac{\partial}{\partial W^{(k)}_{ij}} \mathcal{R}_s(\nabla_{\mathbf{x}} \mathcal{L}_s)$ from the stochastic gradient descent update equation, equation 2.1, as follows (a code sketch follows the list):
  1. Use standard backpropagation to simultaneously compute the vector derivatives $\nabla_{W} \mathcal{L}_s(\mathbf{x}, \mathbf{y}, \Theta)$ and $\nabla_{\mathbf{x}} \mathcal{L}_s(\mathbf{x}, \mathbf{y}, \Theta)$ (note that the latter corresponds to the vector $\mathbf{a}$ in our derivation).

  2. Analytically determine the gradient of $\mathcal{R}_s$ with respect to its immediate inputs. For example, if $\mathcal{R}_s$ is the L2 penalty $\mathcal{R}_s(\mathbf{a}) = \|\mathbf{a}\|_2^2$, then the immediate gradient would be $2\mathbf{a}$, and if $\mathcal{R}_s$ is the L1 penalty, the immediate gradient would be $\operatorname{sign}(\mathbf{a})$.

  3. Evaluate the immediate gradient of $\mathcal{R}_s$ at the vector $\nabla_{\mathbf{x}} \mathcal{L}_s(\mathbf{x}, \mathbf{y}, \Theta)$. This corresponds to the adversarial direction, denoted by $\mathbf{v}$ in our derivation.

  4. Form the adversarial example $\hat{\mathbf{x}} = \mathbf{x} + \epsilon\, \mathbf{v}$, where $\mathbf{v}$ is the result of the previous step and $\epsilon$ is a small constant.

  5. Use a second backpropagation pass (with $\hat{\mathbf{x}}$ as input) to compute $\nabla_{W} \mathcal{L}_s(\hat{\mathbf{x}}, \mathbf{y}, \Theta)$, and then return the finite difference $\left(\nabla_{W} \mathcal{L}_s(\hat{\mathbf{x}}, \mathbf{y}, \Theta) - \nabla_{W} \mathcal{L}_s(\mathbf{x}, \mathbf{y}, \Theta)\right) / \epsilon$.
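To make the five steps concrete, the following is a minimal sketch of a single DataGrad-L1 update in PyTorch. The function and argument names (datagrad_step, lam0, lam1, eps, alpha) are ours for illustration, not from the derivation above; a real run would wrap this in the usual minibatch loop. Note the two forward passes and two backward passes per update, matching the cost noted in section 3.

```python
import torch

def datagrad_step(model, loss_fn, x, y, eps=0.01, lam0=1.0, lam1=0.01, alpha=0.1):
    # Step 1: one backprop pass yields both the weight gradients and the
    # data gradient (the vector a in the derivation).
    x = x.detach().clone().requires_grad_(True)
    loss = loss_fn(model(x), y)
    params = list(model.parameters())
    grads = torch.autograd.grad(loss, [x] + params)
    grad_x, grads_w = grads[0], grads[1:]

    # Steps 2 and 3: the immediate gradient of the L1 penalty, evaluated at
    # the data gradient, gives the adversarial direction v = sign(grad_x).
    v = grad_x.sign()

    # Step 4: form the adversarial example.
    x_adv = (x + eps * v).detach()

    # Step 5: a second backprop pass with x_adv as input; the finite
    # difference (g_adv - g) / eps approximates the penalty's derivative.
    loss_adv = loss_fn(model(x_adv), y)
    grads_w_adv = torch.autograd.grad(loss_adv, params)

    # Combine the standard term and the finite difference per equation 2.4.
    with torch.no_grad():
        for w, g, g_adv in zip(params, grads_w, grads_w_adv):
            w -= alpha * (lam0 * g + (lam1 / eps) * (g_adv - g))
```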

2.2.  The High-Level View: Putting It All Together

At a high level, the loss $\mathcal{L}_s$ and regularizer $\mathcal{R}_s$ together serve to define an adversarial noise vector $\mathbf{v}_s$ and adversarial example $\hat{\mathbf{x}}_s = \mathbf{x} + \epsilon\, \mathbf{v}_s$ (where $\epsilon$ is a small constant), as explained in section 2.1. Different choices of $\mathcal{L}_s$ and $\mathcal{R}_s$ result in different types of adversarial examples. For example, setting $\mathcal{R}_s$ to be the L1 penalty, the resulting adversarial example is the same as that generated by the fast gradient sign method of Goodfellow et al. (2014).

Putting together the components of our finite differences algorithm, the stochastic gradient descent update equation becomes

$$W^{(k)}_{ij} \leftarrow W^{(k)}_{ij} - \alpha\, \lambda_0 \frac{\partial \mathcal{L}_0(\mathbf{x}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} - \alpha \sum_{s=1}^{r} \frac{\lambda_s}{\epsilon} \left( \frac{\partial \mathcal{L}_s(\hat{\mathbf{x}}_s, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} - \frac{\partial \mathcal{L}_s(\mathbf{x}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} \right), \tag{2.4}$$

where $\hat{\mathbf{x}}_s$ is the adversarial example of $\mathbf{x}$ resulting from regularizer $\mathcal{R}_s$ in conjunction with loss $\mathcal{L}_s$, and the notation $\partial \mathcal{L}_s(\hat{\mathbf{x}}_s, \mathbf{y}, \Theta) / \partial W^{(k)}_{ij}$ here specifically means to compute the derivative using backpropagation with $\hat{\mathbf{x}}_s$ as an input. In other words, $\hat{\mathbf{x}}_s$ is not to be treated as a function of $\Theta$ (and its components $W^{(k)}_{ij}$) when computing this partial derivative.

2.3.  How Prior Works Are Instances of DataGrad

Since the recent discovery of adversarial samples (Szegedy et al., 2013), a variety of remedies have been proposed to make neural architectures robust to this problem. A straightforward solution is to simply add adversarial examples during each training round of stochastic gradient descent (Szegedy et al., 2013). This is exactly what equation 2.4 specifies, so that a post hoc solution can be justified as a regularization of the data gradient. Subsequent work (Goodfellow et al., 2014) introduced an objective function of the form $\mathcal{L}(\mathbf{x}, \mathbf{y}, \Theta) + \mathcal{L}(\hat{\mathbf{x}}, \mathbf{y}, \Theta)$, where $\hat{\mathbf{x}}$ is the adversarial version of input $\mathbf{x}$. A gradient-based method would need to compute the derivative with respect to $W^{(k)}_{ij}$, which is $\frac{\partial \mathcal{L}(\mathbf{x}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} + \frac{\partial \mathcal{L}(\hat{\mathbf{x}}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} + \sum_d \frac{\partial \mathcal{L}(\hat{\mathbf{x}}, \mathbf{y}, \Theta)}{\partial \hat{x}_d} \frac{\partial \hat{x}_d}{\partial W^{(k)}_{ij}}$, since the construction of $\hat{\mathbf{x}}$ depends on $\Theta$. Their work approximates the optimization by ignoring the third term, as it is difficult to compute. This approximation then results in an update equation having the form of equation 2.4 and actually optimizes the DataGrad objective. Nøkland (2015) presents a variant where the deep network is trained using backpropagation only on adversarial examples (rather than a mix of adversarial and original examples). Equation 2.4 shows that this method optimizes the DataGrad objective with $\mathcal{L}_0 = \mathcal{L}_1$ and $\lambda_0$, $\lambda_1$, and $\epsilon$ chosen so that the $\partial \mathcal{L}_0(\mathbf{x}, \mathbf{y}, \Theta) / \partial W^{(k)}_{ij}$ term is eliminated.
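To spell out one such elimination (our reading of equation 2.4, not written out in the original works): with $\mathcal{L}_0 = \mathcal{L}_1$ and a single regularizer, the update terms of equation 2.4 satisfy

$$-\alpha\, \lambda_0 \frac{\partial \mathcal{L}_0(\mathbf{x}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} - \alpha\, \frac{\lambda_1}{\epsilon} \left( \frac{\partial \mathcal{L}_0(\hat{\mathbf{x}}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} - \frac{\partial \mathcal{L}_0(\mathbf{x}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} \right) = -\alpha\, \frac{\lambda_1}{\epsilon}\, \frac{\partial \mathcal{L}_0(\hat{\mathbf{x}}, \mathbf{y}, \Theta)}{\partial W^{(k)}_{ij}} \quad \text{when } \lambda_0 = \frac{\lambda_1}{\epsilon},$$

so choosing $\lambda_0 = \lambda_1 / \epsilon$ leaves only the gradient on the adversarial example, which is exactly training on adversarial examples alone.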

Both Huang et al. (2015) and Miyato et al. (2015) propose optimizing constrained objective functions that can be put in the form $\sum_t \max_{\mathbf{r}_t} \mathcal{L}(\mathbf{x}_t + \mathbf{r}_t, \mathbf{y}_t, \Theta)$, where $\mathbf{r}_t$ represents adversarial noise and the constraint puts a bound on the size of the noise. Letting $\mathbf{r}^*_t$ be the (constrained) optimal value of $\mathbf{r}_t$ for each $\mathbf{x}_t$ and setting of $\Theta$, this is the same as the objective $\sum_t \mathcal{L}(\mathbf{x}_t + \mathbf{r}^*_t, \mathbf{y}_t, \Theta)$. The derivative of any term in the summation with respect to $W^{(k)}_{ij}$ is then equal to

$$\frac{\partial \mathcal{L}(\mathbf{x}_t + \mathbf{r}^*_t, \mathbf{y}_t, \Theta)}{\partial W^{(k)}_{ij}} + \sum_{d} \frac{\partial \mathcal{L}(\mathbf{x}_t + \mathbf{r}^*_t, \mathbf{y}_t, \Theta)}{\partial x_d} \frac{\partial r^*_{t,d}}{\partial W^{(k)}_{ij}}. \tag{2.5}$$

Now, if $\mathbf{r}^*_t$ were an unconstrained maximum value of $\mathbf{r}_t$, then $\nabla_{\mathbf{x}} \mathcal{L}(\mathbf{x}_t + \mathbf{r}^*_t, \mathbf{y}_t, \Theta)$ would equal 0, and the second term of equation 2.5 would disappear. However, since $\mathbf{r}^*_t$ is a constrained optimum and the constraint is active, the second term would generally be nonzero. Since the derivative of the constrained optimum is difficult to compute, Huang et al. (2015) and Miyato et al. (2015) opt to approximate or simplify the derivative, making the second term disappear (as it would in the unconstrained case). Comparing the remaining term to equation 2.4 shows that they are optimizing the DataGrad objective with $\mathcal{L}_0 = \mathcal{L}_1$ and $\lambda_0$, $\lambda_1$, and $\epsilon$ carefully chosen to eliminate the $\partial \mathcal{L}_0(\mathbf{x}, \mathbf{y}, \Theta) / \partial W^{(k)}_{ij}$ term.

In an approach that ends up closely related to ours, Lyu et al. (2015) consider the objective $\min_{\Theta} \max_{\mathbf{r} : \|\mathbf{r}\|_p \leq \sigma} \mathcal{L}(\mathbf{x} + \mathbf{r}, \mathbf{y}, \Theta)$ and a linearized inner version, $\max_{\mathbf{r} : \|\mathbf{r}\|_p \leq \sigma} \mathcal{L}(\mathbf{x}, \mathbf{y}, \Theta) + \mathbf{r} \cdot \nabla_{\mathbf{x}} \mathcal{L}$. They iteratively select $\mathbf{r}$ by optimizing the latter and $\Theta$ by backpropagation on the former (with $\mathbf{r}$ fixed). Since the update for $\Theta$ is not directly minimizing the linearized objective, Lyu et al. (2015) claimed the procedure was only an approximation of what we call the DataGrad objective. However, their method devolves to training on adversarial examples, so as before, equation 2.4 shows they are actually optimizing the DataGrad objective, but with $\lambda_0$, $\lambda_1$, and $\epsilon$ carefully chosen to eliminate the $\partial \mathcal{L}_0(\mathbf{x}, \mathbf{y}, \Theta) / \partial W^{(k)}_{ij}$ term.

Finally, Gu and Rigazio (2014) penalize the Frobenius norm of the deep gradient. However, they do this with a shallow layer-wise approximation. Specifically, they note that shallow contractive autoencoders optimize the same objective for shallow (one-layer) networks and that the gradient of the gradient can be computed analytically in those cases. Thus, they apply this penalty layer by layer (hence, it is a penalty on the derivative of each layer with respect to its immediate inputs) and use this penalty as an approximation to regularizing the deep gradient. Since DataGrad does regularize the deep gradient, the work of Gu and Rigazio (2014) can also be viewed as an approximation to DataGrad.

Thus, DataGrad provides a unifying view of previously proposed optimizations for training deep architectures that are resilient to adversarial noise.

3  Experimental Results

Given that we have shown that previous approaches are instances of the general DataGrad framework, it is not our intention to replicate prior work. Rather, we intend to test the effectiveness of our finite difference approximation and show that one can flexibly use DataGrad in other scenarios, such as adding multitask cues within the adversarial framework. To test the proposed DataGrad framework, we conduct experiments using the permutation-invariant MNIST data set3 of 60,000 training samples and 10,000 testing samples. A validation subset of 10,000 samples (randomly sampled without replacement from the training split) was used for tuning architecture metaparameters via a coarse grid search. Image features were gray-scale pixel values, which we normalized to the range of $[0, 1]$. We find that turning our attention first to an image classification problem like MNIST is appropriate, since the adversarial problem was first presented in the context of computer vision problems. Investigation of our framework’s usefulness in domains such as text is left for future work.

In this study, we experiment with two concrete instantiations of the DataGrad framework: DataGrad-L1 (DGL1) and DataGrad-L2 (DGL2). By setting $\mathcal{L}_0 = \mathcal{L}_1$, letting $\lambda_1$ freely vary as a metaparameter while fixing $\lambda_0 = 1$ and $\lambda_s = 0$ for $s > 1$, choosing $\mathcal{R}_1$ to be the L1 penalty results in DGL1, while choosing the L2 penalty yields DGL2. As a result, DataGrad becomes a regularization algorithm on either the L1 or L2 norm of the gradient of the loss $\mathcal{L}_0$. In this setup, DataGrad requires two forward passes and two backward passes to perform a weight update.

We are interested in evaluating how DataGrad compares to conventional and nonconventional forms of regularization. Beyond traditional L1 and L2 regularization of the network parameters (denoted L1 and L2, respectively), we also experimented with the regularizing effect that multitask learning (MT) has on parameter learning, in the interest of testing whether having an auxiliary objective could introduce any robustness to adversarial samples in addition to improved generalization. To do so, we designed a dual-task rectifier network with two disjoint sets of output units, each connected to the penultimate hidden layer by a separate set of parameters (one set for task 0, another for task 1). The left-most branch is trained to predict one of the 10 original target digit labels associated with an image (as in the original MNIST task setup), which corresponds to loss $\mathcal{L}_0$, while the right-most branch is trained to predict one of five artificially constructed categories pertaining to the discretized degree of rotation of the image, which corresponds to loss $\mathcal{L}_1$.4 The multiobjective optimization problem for this setup then becomes
$$\mathcal{L}_{MT}(\mathbf{x}_t, \mathbf{y}^{0}_t, \mathbf{y}^{1}_t, \Theta) = \mathcal{L}_0(\mathbf{x}_t, \mathbf{y}^{0}_t, \Theta) + \gamma\, \mathcal{L}_1(\mathbf{x}_t, \mathbf{y}^{1}_t, \Theta) + \lambda_1\, \mathcal{R}_1\!\left(\nabla_{\mathbf{x}} \left[ \mathcal{L}_0(\mathbf{x}_t, \mathbf{y}^{0}_t, \Theta) + \gamma\, \mathcal{L}_1(\mathbf{x}_t, \mathbf{y}^{1}_t, \Theta) \right]\right), \tag{3.1}$$

where $\gamma$ is a coefficient that controls the influence of the auxiliary objective $\mathcal{L}_1$ on the overall parameter optimization problem. Note that we have extended equation 3.1 to include a DataGrad term, which may be of either L1 form, MT-DGL1, or L2 form, MT-DGL2. All regularized architectures are compared against the baseline sparse rectifier network, Rect.
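As an illustration of this setup, here is a minimal PyTorch sketch of the dual-task rectifier network and the MT-DGL1 loss. The three 784-unit hidden layers follow the architecture described below in this section; the class and argument names, the use of cross-entropy, and the realization of the DataGrad term through the adversarial example of equation 2.4 (with $\lambda_0 = 1$) are our assumptions for illustration, not specifications from the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualTaskRect(nn.Module):
    # Shared rectifier trunk with two disjoint output branches.
    def __init__(self, d_in=784, d_h=784):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(d_in, d_h), nn.ReLU(),
            nn.Linear(d_h, d_h), nn.ReLU(),
            nn.Linear(d_h, d_h), nn.ReLU())
        self.digit_head = nn.Linear(d_h, 10)  # task 0: the 10 digit labels
        self.rot_head = nn.Linear(d_h, 5)     # task 1: 5 rotation categories

    def forward(self, x):
        h = self.trunk(x)
        return self.digit_head(h), self.rot_head(h)

def mt_datagrad_loss(model, x, y_digit, y_rot, gamma=0.1, lam=0.01, eps=0.01):
    x = x.detach().clone().requires_grad_(True)
    out0, out1 = model(x)
    base = F.cross_entropy(out0, y_digit) + gamma * F.cross_entropy(out1, y_rot)
    # L1 DataGrad term via the adversarial example (sign of the data gradient);
    # no double backprop is needed, mirroring the finite-difference scheme.
    grad_x, = torch.autograd.grad(base, x, retain_graph=True)
    x_adv = (x + eps * grad_x.sign()).detach()
    a0, a1 = model(x_adv)
    adv = F.cross_entropy(a0, y_digit) + gamma * F.cross_entropy(a1, y_rot)
    # Backpropagating this scalar reproduces the update of equation 2.4:
    # grad(base) + (lam / eps) * (grad(adv) - grad(base)).
    return base + (lam / eps) * (adv - base)
```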

We implemented several deep sparse rectifier architectures (Glorot, Bordes, & Bengio, 2011) with three hidden layers, each with 784 latent variables, whose parameters were initialized following the scheme of He, Zhang, Ren, and Sun (2015); all were trained in a gradient descent framework under the various regularization schemes described earlier. Minibatches of size 100 were used for calculating each parameter update. Hyperparameters and ranges searched included the $\lambda$ and $\epsilon$ coefficients for controlling DataGrad, the L1 and L2 penalty coefficients (which appear simply as $\lambda$, as in the appendix) for controlling the classical regularization terms, the auxiliary objective weight $\gamma$, and the gradient descent step-size $\alpha$. We did not use any additional gradient descent heuristics (e.g., momentum, adaptive learning rates, dropout) for simplicity, since we are interested in investigating the effect that the regularizers have on model robustness to adversarial samples.

To evaluate these architectures in the adversarial setting, we conduct a series of experiments where each trained model plays the role of “attacker.” An adversarial test set of 10,000 samples is generated from the attacking model with backpropagation, using the derivative of the loss with respect to the model’s inputs, followed by the application of the appropriate regularizer function (either L1 or L2) to create the noise. The amount of noise applied is controlled by $\phi$, which we varied along the values $\{0.005, 0.01, 0.05, 0.1\}$, corresponding to maximal pixel gains of $\{1.275, 2.55, 12.75, 25.5\}$ on the original 0 to 255 gray scale ($\phi = 0$ would be equivalent to using the original test set). Generalization performances reported in all figures in this section are of the architectures that achieved best performance on the validation subset (consult the appendix for a full treatment of performance across a range of $\lambda$ and $\epsilon$ values).
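The generation of one adversarial test set can be sketched as follows, assuming PyTorch. The function and argument names are ours, and the per-sample L2 normalization and the clamp to $[0, 1]$ (keeping perturbed pixels in the valid range) are our assumptions rather than details stated here.

```python
import torch

def make_adversarial_test_set(attacker, loss_fn, x_test, y_test, phi=0.05, flavor="l1"):
    # Derivative of the attacking model's loss with respect to its inputs.
    x = x_test.detach().clone().requires_grad_(True)
    loss = loss_fn(attacker(x), y_test)
    grad_x, = torch.autograd.grad(loss, x)

    # Shape the noise with the regularizer's immediate gradient: sign() for
    # the L1 flavor, a per-sample normalized gradient for L2 (inputs assumed
    # flattened to shape (N, 784)).
    if flavor == "l1":
        noise = grad_x.sign()
    else:
        noise = grad_x / (grad_x.norm(dim=1, keepdim=True) + 1e-12)

    # phi controls the amount of noise; phi = 0 recovers the original test set.
    return (x + phi * noise).clamp(0.0, 1.0).detach()
```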

We observe in Figures 1, 2, and 3 that a DataGrad-regularized architecture outperforms the nonregularized baseline as well as alternatively regularized ones. Note that the accuracy in Figure 1 drops only to as low as 92% in the worst case, meaning that L2 samples seem to cause only minimal damage and should be much less of a concern than L1 samples (which would be akin to generating noise via the fast gradient sign method).5 With respect to using only an auxiliary objective to regularize the model (MT), in both Figures 1 and 2, we often see that a dual-task model performs the worst when adversarial noise is introduced, surprisingly even more so than the simple rectifier baseline. Figure 2 shows that when the nonregularized multitask architecture is attacked by itself, its accuracy can drop to as low as nearly 1%. However, when a DataGrad term is added to the multitask objective, we achieve nearly no loss in classification performance. This means that a multitask, DataGrad-regularized rectifier network appears to be quite robust to adversarial samples (of either L1 or L2 form) generated from itself or other perceptron architectures (DGL1 more so than DGL2).

Figure 1:

Model performance when each model is its own adversary. Note that $\phi$ on the $x$-axis indicates the degree of (L1 or L2) noise used to create adversarial samples. Terms in the legend refer to specific architectures (e.g., L2 refers to the L2-regularized network). Note that the model most susceptible to adversarial samples is the one trained under the multitask objective, while, interestingly, the most robust one is the multitask model trained with a DataGrad term.

Figure 2:

Model performance when the adversarial architecture is the same for all models, either simple (Rect) or multitask (MT). Note that $\phi$ on the $x$-axis indicates the degree of noise applied to create adversarial samples. We observe that the DataGrad-L1 regularized models, dual or single task, are the most robust to attacks generated by other models of various levels of sophistication (such as a simple, standard rectifier network versus a dual-task rectifier network).

Figure 3:

Adversarial test set accuracy ($\phi = 0$ corresponds to the original test split) and samples generated from a deep sparse rectifier network in the case of (from top of diagram to bottom) (1) no regularization, (2) L1 regularization, and (3) L1 DataGrad regularization. The measures reported here are when each model is used to attack itself (akin to a malicious user using the exact same architecture to generate samples).

Classical L1 and L2 regularizers appear to mitigate some of the damage in some instances, but seemingly afford at best only modest robustness to adversarial perturbation. In contrast, the proposed DGL1 and DGL2 regularizers appear to yield a significant reduction in error on all of the various adversarial test sets, and the improvement becomes clearer as $\phi$ is increased (as evidenced in Figure 3). The visualization of some adversarial samples in Figure 3 demonstrates that even when more noise is applied to generate stronger adversarial samples, the samples themselves are still quite recognizable to the human eye. However, a neural architecture, such as a deep rectifier network, is sensitive to adversarial noise and incorrectly classifies these images. In addition to robustness against adversarial samples, we also observe improved classification error on the original test set when using DataGrad or multitask DataGrad, with the DGL1 and MT-DGL1 variants offering the lowest error of all. (For further experimental results exploring the performance and sensitivity of DataGrad to its metaparameters, including when other architectures are the adversary, see the appendix.)

4  Conclusion

We have shown how previous proposals can be viewed as instances of a simple, general framework and provide an efficient, deterministic adversarial training procedure, DataGrad. The simplicity of the framework allows easy extensions, such as adding multitask cues as another signal to be combined with adversarial training. Empirically, we found that general DataGrad regularization not only significantly reduces error (especially when combined with a multitask learning objective) in classifying adversarial samples but also improves generalization. We postulate that a reason for this is that adversarial samples generated during the DataGrad learning phase potentially cover more of the underlying data manifold (yielding benefits similar to data set expansion).

Since DataGrad is effectively a deep data-driven penalty, it may be used in tandem with most training objective functions, whether supervised, unsupervised (Bengio, Lamblin, Popovici, & Larochelle, 2007), or hybrid (Ororbia II, Reitter, Wu, & Giles, 2015). Future work entails further improving the efficiency of the proposed DataGrad backpropagation procedure and investigating our procedure in a wider variety of settings.

Appendix:  Detailed Results

In this appendix, to augment the experimental results presented in section 3, we present the generalization performances of the regularized (and nonregularized) models under various settings of their key hyperparameters. This particularly applies to $\lambda$ and $\epsilon$ (when applicable). All model performances reported are those with the learning rate and auxiliary objective weight metaparameters fixed at the values that yielded the best validation set performance. Furthermore, each table represents a different adversarial scenario, where a different architecture was selected to be the generator of adversarial examples. In each table, two lines are in bold: one for the single-task architecture and one for the multitask model that achieves the most robust performance across all noise values of $\phi$. We do not report the same adversarial scenarios for L2 noise, as we found that it had little impact on model generalization ability (as we noted in section 3).

One key observation to take from this set of experiments on the MNIST data set is that DataGrad, particularly the L1 form, achieves the greatest level of robustness to adversarial samples in all settings when $\lambda$ and $\epsilon$ are relatively higher. This is especially so when a DataGrad (L1) term is combined with the multitask objective. Note that this appears to be true no matter the adversary (even a DataGrad- or multitask-regularized one).

Perhaps even further performance improvement could be obtained if one added another DataGrad term to the auxiliary objective. In particular, this would apply to the rarer setting in which one would also desire additional adversarial robustness with respect to the auxiliary objective. Tables 1 to 8 contain the full results for the adversarial settings.

Table 1:
Comparative Results Where All Models Are Attacked by a Simple Rectifier Network, Rect, Using Laplacian (L1) Adversarial Noise.
Model λ ε φ = 0.0 φ = 0.005 φ = 0.01 φ = 0.05 φ = 0.1
DGL1 0.0001 0.01 98.25 97.38 96.14 59.74 14.20 
DGL1 0.0001 0.05 98.57 98.17 97.76 89.18 59.98 
DGL1 0.0001 0.1 98.47 98.13 97.75 90.00 70.03 
DGL1 0.001 0.01 98.03 97.26 96.05 62.77 14.82 
DGL1 0.001 0.05 98.70 98.43 98.08 92.97 75.65 
DGL1 0.001 0.1 98.62 98.38 98.11 93.74 83.06 
DGL1 0.01 0.01 98.36 97.96 97.35 88.03 55.16 
DGL1 0.01 0.05 98.62 98.50 98.37 95.16 85.78 
DGL1 0.01 0.1      
DGL2 0.0001 0.01 97.93 96.95 95.60 53.29 11.90 
DGL2 0.0001 0.05 98.29 97.38 96.06 60.37 13.68 
DGL2 0.0001 0.1 98.02 96.97 95.66 53.58 12.26 
DGL2 0.001 0.01 97.97 97.00 95.64 54.55 12.07 
DGL2 0.001 0.05 98.20 97.08 95.63 55.43 12.53 
DGL2 0.001 0.1 98.26 97.42 96.17 60.86 13.18 
DGL2 0.01 0.01 98.20 97.49 96.60 72.71 19.84 
DGL2 0.01 0.05 98.31 97.60 96.50 69.66 17.97 
DGL2 0.01 0.1 98.30 97.54 96.40 68.05 17.61 
L1 0.0001 98.15 97.48 96.14 59.65 13.19 
L1 0.001 98.41 97.69 96.77 74.96 23.28 
L1 0.01 97.73 97.38 97.12 91.08 74.35 
L1 0.1 93.90 93.38 92.91 86.62 72.01 
L2 0.0001 98.00 97.05 95.86 57.90 13.93 
L2 0.001 97.88 96.84 95.39 53.58 12.41 
L2 0.01 98.45 98.00 97.48 84.87 42.00 
L2 0.1 98.12 97.85 97.53 91.29 69.87 
Rect 97.99 96.80 95.01 49.06 10.83 
MT-DGL2 0.001 0.01 98.38 98.00 97.49 81.37 33.88 
MT-DGL2 0.001 0.05 98.50 98.06 97.44 81.52 33.55 
MT-DGL2 0.001 0.1 98.35 97.87 97.24 79.36 33.47 
MT-DGL2 0.01 0.01 98.57 98.26 97.83 89.19 52.96 
MT-DGL2 0.01 0.05 98.65 98.26 97.80 89.69 55.09 
MT-DGL2 0.01 0.1 98.59 98.31 97.98 90.03 56.43 
MT-DGL2 0.1 0.01 98.85 98.74 98.53 95.75 84.42 
MT-DGL2 0.1 0.05 98.76 98.59 98.39 95.16 82.86 
MT-DGL2 0.1 0.1 98.70 98.54 98.39 95.37 83.70 
MT-DGL1 0.001 0.01 98.58 98.38 98.04 90.46 58.06 
MT-DGL1 0.001 0.05 98.77 98.66 98.51 96.08 87.14 
MT-DGL1 0.001 0.1 98.63 98.47 98.24 93.36 76.71 
MT-DGL1 0.01 0.01 98.82 98.69 98.40 95.79 85.41 
MT-DGL1 0.01 0.05 98.91 98.82 98.70 97.16 92.72 
MT-DGL1 0.01 0.1 98.99 98.94 98.84 97.44 93.79 
MT-DGL1 0.1 0.01 98.63 98.50 98.38 97.04 93.79 
MT-DGL1 0.1 0.05 98.99 98.97 98.90 98.25 96.68 
MT-DGL1 0.1 0.1      
MT 98.00 97.20 95.99 63.38 17.20 
Table 2:
Comparative Results Where All Models Are Attacked by the L1-Regularized Rectifier Network, L1, Using Laplacian (L1) Adversarial Noise.
Model λ ε φ = 0.0 φ = 0.005 φ = 0.01 φ = 0.05 φ = 0.1
DGL1 0.0001 0.01 98.25 97.31 95.96 56.68 23.68 
DGL1 0.0001 0.05 98.57 98.15 97.68 88.75 58.32 
DGL1 0.0001 0.1 98.47 98.11 97.75 89.65 68.44 
DGL1 0.001 0.01 98.03 97.27 96.12 65.06 26.64 
DGL1 0.001 0.05 98.70 98.41 98.06 92.78 74.48 
DGL1 0.001 0.1 98.62 98.37 98.12 93.52 82.12 
DGL1 0.01 0.01 98.36 97.93 97.34 87.04 54.26 
DGL1 0.01 0.05 98.62 98.51 98.36 95.00 85.33 
DGL1 0.01 0.1      
DGL2 0.0001 0.01 97.93 97.00 95.72 56.55 23.07 
DGL2 0.0001 0.05 98.29 97.32 95.94 57.76 23.86 
DGL2 0.0001 0.1 98.02 97.03 95.74 57.57 23.96 
DGL2 0.001 0.01 97.97 96.98 95.70 57.95 23.37 
DGL2 0.001 0.05 98.20 97.14 95.71 58.54 24.01 
DGL2 0.001 0.1 98.26 97.36 96.08 57.30 23.43 
DGL2 0.01 0.01 98.20 97.43 96.48 69.58 28.98 
DGL2 0.01 0.05 98.31 97.57 96.41 66.05 26.99 
DGL2 0.01 0.1 98.30 97.51 96.32 65.38 26.63 
L1 0.0001 98.15 97.24 95.61 51.91 21.44 
L1 0.001 98.41 97.62 96.68 71.69 30.05 
L1 0.01 97.73 97.39 97.08 91.00 73.68 
L1 0.1 93.90 93.38 92.90 86.96 73.77 
L2 0.0001 98.00 97.03 95.72 56.20 24.11 
L2 0.001 97.88 96.89 95.50 57.57 23.54 
L2 0.01 98.45 97.98 97.43 83.01 42.66 
L2 0.1 98.12 97.83 97.52 90.66 68.11 
Rect 97.99 96.98 95.69 58.53 23.58 
MT-DGL2 0.001 0.01 98.38 97.99 97.47 80.50 39.72 
MT-DGL2 0.001 0.05 98.50 98.06 97.44 80.28 39.47 
MT-DGL2 0.001 0.1 98.35 97.89 97.27 78.97 39.08 
MT-DGL2 0.01 0.01 98.57 98.28 97.84 88.89 55.29 
MT-DGL2 0.01 0.05 98.65 98.23 97.81 89.46 56.76 
MT-DGL2 0.01 0.1 98.59 98.35 97.98 89.58 58.33 
MT-DGL2 0.1 0.01 98.85 98.74 98.57 95.67 84.08 
MT-DGL2 0.1 0.05 98.76 98.57 98.36 95.12 81.97 
MT-DGL2 0.1 0.1 98.70 98.55 98.38 95.30 82.72 
MT-DGL1 0.001 0.01 98.58 98.36 98.06 90.06 59.26 
MT-DGL1 0.001 0.05 98.77 98.64 98.48 95.92 86.91 
MT-DGL1 0.001 0.1 98.63 98.47 98.27 93.21 76.36 
MT-DGL1 0.01 0.01 98.82 98.66 98.40 95.73 85.10 
MT-DGL1 0.01 0.05 98.91 98.79 98.70 97.18 92.77 
MT-DGL1 0.01 0.1 98.99 98.94 98.78 97.45 93.90 
MT-DGL1 0.1 0.01 98.63 98.51 98.40 96.99 93.65 
MT-DGL1 0.1 0.05 98.99 98.96 98.89 98.26 96.68 
MT-DGL1 0.1 0.1      
MT 98.00 97.18 95.86 63.03 25.51 
Table 3:
Comparative Results Where All Models Are Attacked by the L2-Regularized Rectifier Network, L2, Using Laplacian (L1) Adversarial Noise.
Model λ ε φ = 0.0 φ = 0.005 φ = 0.01 φ = 0.05 φ = 0.1
DGL1 0.0001 0.01 98.25 97.70 96.81 75.66 25.26 
DGL1 0.0001 0.05 98.57 98.28 97.99 93.04 73.97 
DGL1 0.0001 0.1 98.47 98.26 97.92 93.39 77.97 
DGL1 0.001 0.01 98.03 97.48 96.94 81.26 32.74 
DGL1 0.001 0.05 98.70 98.48 98.22 94.82 82.96 
DGL1 0.001 0.1 98.62 98.44 98.27 95.04 86.12 
DGL1 0.01 0.01 98.36 98.04 97.52 90.49 63.92 
DGL1 0.01 0.05 98.62 98.51 98.42 95.57 87.67 
DGL1 0.01 0.1      
DGL2 0.0001 0.01 97.93 97.28 96.50 76.93 25.90 
DGL2 0.0001 0.05 98.29 97.68 96.75 75.97 24.43 
DGL2 0.0001 0.1 98.02 97.35 96.57 76.87 26.25 
DGL2 0.001 0.01 97.97 97.38 96.46 77.19 26.88 
DGL2 0.001 0.05 98.20 97.52 96.66 77.90 26.53 
DGL2 0.001 0.1 98.26 97.72 96.89 77.18 25.03 
DGL2 0.01 0.01 98.20 97.76 97.02 83.13 36.57 
DGL2 0.01 0.05 98.31 97.78 97.14 82.35 34.70 
DGL2 0.01 0.1 98.30 97.77 97.17 82.02 33.80 
L1 0.0001 98.15 97.63 96.91 75.97 24.07 
L1 0.001 98.41 97.71 96.97 76.47 24.02 
L1 0.01 97.73 97.32 96.97 89.07 64.45 
L1 0.1 93.90 93.40 92.83 86.17 68.98 
L2 0.0001 98.00 97.37 96.59 76.52 26.14 
L2 0.001 97.88 97.22 96.44 76.19 25.73 
L2 0.01 98.45 97.74 96.52 60.93 14.34 
L2 0.1 98.12 97.81 97.19 86.46 47.90 
Rect 97.99 97.32 96.57 77.87 26.99 
MT-DGL2 0.001 0.01 98.38 98.07 97.63 86.65 45.44 
MT-DGL2 0.001 0.05 98.50 98.15 97.69 86.91 44.76 
MT-DGL2 0.001 0.1 98.35 97.99 97.58 86.04 43.86 
MT-DGL2 0.01 0.01 98.57 98.33 97.96 91.69 63.27 
MT-DGL2 0.01 0.05 98.65 98.33 98.01 92.24 65.58 
MT-DGL2 0.01 0.1 98.59 98.41 98.10 92.37 66.75 
MT-DGL2 0.1 0.01 98.85 98.76 98.59 96.03 85.26 
MT-DGL2 0.1 0.05 98.76 98.64 98.43 95.71 84.42 
MT-DGL2 0.1 0.1 98.70 98.56 98.42 95.78 85.20 
MT-DGL1 0.001 0.01 98.58 98.41 98.16 92.21 65.98 
MT-DGL1 0.001 0.05 98.77 98.65 98.53 96.28 88.39 
MT-DGL1 0.001 0.1 98.63 98.48 98.33 94.49 79.56 
MT-DGL1 0.01 0.01 98.82 98.69 98.45 96.13 86.22 
MT-DGL1 0.01 0.05 98.91 98.82 98.71 97.26 92.70 
MT-DGL1 0.01 0.1 98.99 98.94 98.81 97.43 94.04 
MT-DGL1 0.1 0.01 98.63 98.51 98.41 96.98 93.27 
MT-DGL1 0.1 0.05 98.99 98.96 98.90 98.24 96.58 
MT-DGL1 0.1 0.1      
MT 98.00 97.37 96.59 74.09 23.44 