Abstract
Very recently, a few certified defense methods have been developed to provably guarantee the robustness of a text classifier to adversarial synonym substitutions. However, all the existing certified defense methods assume that the defenders have been informed of how the adversaries generate synonyms, which is not a realistic scenario. In this study, we propose a certifiably robust defense method that randomly masks a certain proportion of the words in an input text, in which the above unrealistic assumption is no longer necessary. The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations. We can certify the classifications of over 50% of texts to be robust to any perturbation of five words on AGNEWS, and two words on the SST2 dataset. The experimental results show that our randomized smoothing method significantly outperforms recently proposed defense methods across multiple datasets under different attack algorithms.
1 Introduction
Although deep neural networks have achieved prominent performance on many natural language processing (NLP) tasks, they are vulnerable to adversarial examples that are intentionally crafted by replacing, scrambling, and erasing characters (Gao et al. 2018; Ebrahimi et al. 2018) or words (Alzantot et al. 2018; Ren et al. 2019a; Zheng et al. 2020; Jin et al. 2020; Li et al. 2020) under certain semantic and syntactic constraints. These adversarial examples are imperceptible to humans and can easily fool deep neural network-based models. The existence of adversarial examples has raised serious concerns, especially when deploying such NLP models to security-sensitive tasks. In Table 1, we list three examples that are labeled as positive sentiment by a BERT-based sentiment analysis model (Devlin et al. 2019), and for each example, we give two adversarial examples (one generated by character-level perturbations, and the other crafted by adversarial synonym substitutions) that cause the same model to change its prediction from the correct (positive) sentiment to the incorrect (negative) one.
Many methods have been proposed to defend against adversarial attacks for neural network-based NLP models. Most were evaluated empirically, such as adversarial data augmentation (Jin et al. 2020; Zheng et al. 2020), adversarial training (Madry et al. 2018; Zhu et al. 2020), and Dirichlet Neighborhood Ensemble (Zhou et al. 2021). Among them, adversarial data augmentation is one of the widely used methods (Jin et al. 2020; Zheng et al. 2020; Li et al. 2020). During the training phase, they replace a word with one of its synonyms that maximizes the prediction loss. By augmenting these adversarial examples with the original training data, the models are trained to be robust to such perturbations. However, it is infeasible to explore all possible combinations in which each word in a text can be replaced with any of its synonyms.
Zhou et al. (2021) and Dong et al. (2021) relax a set of discrete points (a word and its synonyms) to a convex hull spanned by the word embeddings of all these points, and use a convex hull formed by a word and its synonyms to capture word substitutions. During the training phase, they randomly sample some points in the convex hull to ensure the robustness within the regions around the sampled points. To deal with complex error surface, a gradient-guided optimizer is also applied to search for more valuable adversarial points within the convex hull. By training on these virtual samples, the model can enhance the robustness against word substitution-based perturbations.
Although the above-mentioned methods have been empirically shown to be effective in defending against the attack algorithms used during training, the trained models often cannot survive other, stronger attacks (Jin et al. 2020; Li et al. 2020). A certifiably robust model is needed in both theory and practice. A model is said to be certifiably robust when it is guaranteed to give the correct answer under any attack, no matter the strength of the attacker and no matter how the input text is manipulated, as long as a certain robustness condition is satisfied. Certified defense methods have recently been proposed (Jia et al. 2019; Huang et al. 2019) that certify the performance within the convex hull formed by the embeddings of a word and its synonyms. However, due to the difficulty of propagating convex hulls through deep neural networks, they compute a loose outer bound using Interval Bound Propagation (IBP). As a result, IBP-based certified defense is hard to scale to large architectures such as BERT (Devlin et al. 2019).
To achieve the certified robustness on large architectures, Ye, Gong, and Liu (2020) proposed SAFER, a randomized smoothing method that can provably ensure that the prediction cannot be altered by any possible synonymous word substitutions. However, existing certified defense methods assume that the defenders know in advance how the adversaries generate synonyms, which is not a realistic scenario since we cannot impose a limitation on the synonym table used by the attackers. In a real situation, we know nothing about the attackers, and existing adversarial attack algorithms against NLP models may use a synonym table in which a single word can have many (up to 50) synonyms (Jin et al. 2020), generate synonyms dynamically by using BERT (Li et al. 2020), or perform character-level perturbations (Gao et al. 2018; Li et al. 2019) to launch adversarial attacks.
In this article, we propose RanMASK, a certifiably robust defense method against textual adversarial attacks based on a new randomized smoothing technique for NLP models. The proposed method works by repeatedly performing random masking operations on an input text in order to generate a large set of masked copies of the text. A base classifier is then used to classify each of these masked texts, and the final robust classification is made by “majority vote” (see Figure 1). At training time, the base classifier is also trained on similarly masked text samples. Our masking operation is the same as the one used to train masked language models such as BERT (Devlin et al. 2019) and RoBERTa (Liu et al. 2019), and the masked words can simply be encoded as the embedding of [MASK], so that we can leverage the ability of BERT to recover or reconstruct information about the masked words.
The key idea behind our method is that, if a sufficient number of words are randomly masked from a text before it is given to the base classifier and only a relatively small number of words have been intentionally perturbed, then it is highly unlikely that all of the adversarially perturbed words survive the masking operation and remain present in a given masked copy. Note that retaining just some of these perturbed words is often not enough to fool the base classifier. The results of our preliminary experiments confirmed that textual adversarial examples are themselves vulnerable to small random perturbations: if we randomly mask a few words from adversarial examples before they are fed into the classifier, the incorrect predictions on adversarial examples are likely to revert to the correct ones. Given a text x and a potentially adversarial text x′, if we use a statistically sufficient number of random masked samples, and if the observed “gap” between the number of “votes” for the top class and the number of “votes” for any other class at x is sufficiently large, then we can guarantee with high probability that the robust classification at x′ will be the same as it is at x. Therefore, we can prove that, with high probability, the smoothed classifier will label x robustly against any text adversarial attack that is allowed to perturb a certain number (or proportion) of words in an input text at both the word and character levels in any manner.1
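To make this intuition concrete, here is a small back-of-the-envelope calculation (a sketch with illustrative numbers; the 43-word average length of AGNEWS texts is taken from Section 4.3): if k out of h words are retained uniformly at random, the chance that all d perturbed words survive in a single masked copy is C(h−d, k−d)/C(h, k).

```python
from math import comb

def survival_probabilities(h, k, d):
    """For a text of h words from which k are retained (h - k are masked) uniformly
    at random, and d adversarially perturbed positions, return the probabilities that
    (a) all d perturbed words are retained, and (b) at least one of them is retained."""
    p_all = comb(h - d, k - d) / comb(h, k) if k >= d else 0.0
    p_none = comb(h - d, k) / comb(h, k) if h - d >= k else 0.0
    return p_all, 1.0 - p_none

# Illustrative numbers: a 43-word text (roughly the average length on AGNEWS),
# a 90% masking rate (only k = 4 words retained), and d = 2 perturbed words.
p_all, p_any = survival_probabilities(h=43, k=4, d=2)
print(f"all perturbed words survive in one masked copy:  {p_all:.4f}")   # about 0.007
print(f"some perturbed word survives in one masked copy: {p_any:.4f}")   # about 0.18
```

With such a high masking rate, a single masked copy almost never contains every perturbed word, which is why the majority vote over many masked copies tends to recover the original prediction.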
The major advantage of our method over existing certified defense methods is that our certified robustness is not based on the assumption that the defenders know how the adversaries generate synonyms. Given a text, the adversaries are allowed to replace a few words with their synonyms (word-level perturbation) or deliberately misspell some of them (character-level perturbation). Through extensive experiments on multiple datasets, we show that RanMASK achieves better performance on adversarial samples than existing text defense methods. Experimentally, we can certify the classifications of over 50% of sentences to be robust to any perturbation of 5 words on the AGNEWS dataset (Zhang, Zhao, and LeCun 2015), and 2 words on SST2 (Socher et al. 2013). Furthermore, unlike most certified defenses (except SAFER), the proposed method is easy to implement and can be integrated into any existing neural network, including those with large architectures such as BERT (Devlin et al. 2019).
Our contributions are summarized as follows:
We propose RanMASK, a novel certifiably robust defense method against both word substitution-based attacks and character-level perturbations. The main advantage of our method is that we do not base the certified robustness on the unrealistic assumption that the defenders know in advance how the adversaries generate synonyms.
We provide a general theorem that can be used to develop a related robustness certificate with a tighter bound. We can certify that the classifications of over 50% of sentences are robust to any modification of at most five words on AG’s News Topic Classification dataset (AGNEWS) (Zhang, Zhao, and LeCun 2015), and two words on Stanford Sentiment Treebank dataset (SST2) (Socher et al. 2013).
To further improve the empirical robustness, we propose a new sampling strategy in which the probability of a word being masked corresponds to its output probability under a BERT-based language model (LM). This strategy reduces the risk probability, which estimates how likely the base classifier is to make mistakes (see Section 4.3 for details). Through extensive experimentation, we show that our smoothed classifiers outperform existing empirical and certified defenses across different datasets.
2 Preliminaries and Notation
For text classification, a neural network–based classifier f: 𝒳 → 𝒴 maps an input text x ∈ 𝒳 to a label y ∈ 𝒴, where x = x1,…,xh is a text consisting of h words and 𝒴 is a set of discrete categories. We follow the mathematical notation used by Levine and Feizi (2020) below.
Let x ⊖ x′ denote the set of word indices at which x and x′ differ, so that |x ⊖ x′| = ∥x − x′∥0. For example, if x = “I really like the movie” and x′ = “I truly like this movie”, then x ⊖ x′ = {2,4} and |x ⊖ x′| = 2. Also, let ℐ denote the set of indices {1,…,h}, 𝒫k(ℐ) all possible sets of k unique indices in ℐ, where 𝒫(ℐ) is the power set of ℐ, and 𝒰(h,k) the uniform distribution over 𝒫k(ℐ). Note that to sample from 𝒰(h,k) is to sample k out of h indices uniformly without replacement. For instance, three elements sampled from 𝒰(5,3) might be {1,3,5}, {1,2,4}, {2,3,5}, and so forth.
We define a masking operation ℳ(x, 𝒮) that takes a text x of length h and a set 𝒮 of word indices as inputs and outputs the masked text, in which every word whose index is not in 𝒮 is replaced with a special token [MASK]; the words whose indices are in 𝒮 remain unchanged. For example, ℳ(“I truly like this movie”, {1,3,5}) = “I [MASK] like [MASK] movie”. Following Devlin et al. (2019), we use [MASK] to replace the masked words. Note that the same masking operation is applied to both the adversarial texts and the original ones because it is impossible to know in advance whether a clean example or an adversarial one is given to the model.
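As a minimal illustration of the notation above, the following sketch implements uniform index sampling and the masking operation ℳ; the function names are ours, and 1-based indices are used to match the example.

```python
import random

def sample_indices(h, k, rng=random):
    """Sample k out of h word indices (1-based) uniformly without replacement."""
    return set(rng.sample(range(1, h + 1), k))

def mask(text, retained):
    """Masking operation M(x, S): keep the words whose (1-based) indices are in
    `retained` and replace every other word with the special token [MASK]."""
    return " ".join(
        word if i in retained else "[MASK]"
        for i, word in enumerate(text.split(), start=1)
    )

# Reproduces the example from the text.
print(mask("I truly like this movie", {1, 3, 5}))  # I [MASK] like [MASK] movie
print(sample_indices(h=5, k=3))                    # e.g. {1, 3, 5}
```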
3 RanMASK: Certified Defense Method
3.1 Certified Robustness
In the following, we first prove a general theorem from which a related robustness certificate can be developed. Different from Levine and Feizi (2020), we have to deal with variable-length texts, and we further introduce a variable β associated with each pair of an input text x and its adversarial example x′, which leads to a tighter certificate bound. After that, we describe how to estimate the value of β with a Monte Carlo–based algorithm.
It is well known that neural NLP models are vulnerable to adversarial examples. However, text adversarial examples themselves are as vulnerable as clean examples to perturbations. If any of the intentionally replaced words in an adversarial example are perturbed, it is highly likely that a base classifier will give a different prediction for the perturbed adversarial example, and will even make a correct prediction in many cases. Based on the above observation, the additional variable β is introduced to estimate how likely a base classifier is to still yield a correct prediction if any of the intentionally replaced words are masked. The introduction of β leads to a tighter certificate bound since the value of β is always positive and far less than 1 by its definition; condition (7) is easier to satisfy than it would be if β were set to 1. It is worth noting that Levine and Feizi (2020) assume all inputs are of equal length (i.e., the number of pixels is the same for all images), whereas we have to deal with variable-length texts. We define the base classifier f(x), the smoothed classifier g(x), and the values of Δ and β based on a masking rate ρ (i.e., the percentage of words that can be masked), while their counterparts are defined based on a retention constant (i.e., the fixed number of pixels retained from any input image). From the above general Theorem 1, a related robustness certificate with a tighter bound can be developed for text adversarial examples.
3.2 Estimating the Value of Beta
Here we discuss how to estimate the value of β defined in Theorem 1. Recall that β is the probability that f labels the masked copies of x with the class c, given that the set of indices of the unmasked words overlaps with x ⊖ x′ (i.e., the set of word indices at which x and x′ differ). We use a Monte Carlo algorithm to evaluate β by sampling a large number of index sets. To simplify notation, we let r denote the value of |x ⊖ x′|.
The Monte Carlo–based algorithm used to evaluate β is given in Algorithm 1. We first sample nr elements from 𝒰(h, r); each sampled element, denoted by a, is a set of indices at which the words are supposed to be perturbed. For every a, we then sample nk elements from 𝒰(h, k); each of them, denoted by b, is a set of indices of the words that are not masked. We discard those of the nk elements for which the intersection of a and b is empty. With the remaining elements and f, we can approximate the value of β if the number of samples is sufficiently large.
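A minimal sketch of this Monte Carlo estimate is shown below; `f` is assumed to be any callable mapping a masked text to a class label, and `mask` and `sample_indices` are the helper functions sketched in Section 2 (this is an illustration of Algorithm 1, not the authors' exact implementation).

```python
import random

def estimate_beta(f, x, c, r, k, nr=200, nk=10_000, rng=random):
    """Monte Carlo estimate of beta: the probability that f labels a masked copy of x
    with class c, given that the set of retained (unmasked) indices overlaps with a
    size-r set of perturbed positions. x is a whitespace-tokenized text."""
    h = len(x.split())
    hits, total = 0, 0
    for _ in range(nr):
        a = sample_indices(h, r, rng)           # indices assumed to be perturbed
        for _ in range(nk):
            b = sample_indices(h, k, rng)       # indices of the words left unmasked
            if not (a & b):                     # discard pairs with empty intersection
                continue
            total += 1
            hits += int(f(mask(x, b)) == c)
    return hits / total if total else float("nan")
```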
As the value of r grows, for any a it becomes more likely that a overlaps with any sampled b, and the value of β approaches pc(x). To investigate how close the values of β and pc(x) are to each other, we conducted an experiment on the test sets of both the AGNEWS and SST2 datasets, setting nr = 200 and nk = 10,000, and used the Jensen–Shannon divergence to measure the distance between the two distributions. As we can see from Figure 2, regardless of the value of ρ, all the Jensen–Shannon divergence values are very small: less than 2.5 × 10−5 on AGNEWS and 1.75 × 10−5 on SST2 when the number of perturbed words is large enough. Therefore, we use pc(x) to approximate the value of β, namely, β ≈ pc(x), in all the following experiments.
3.3 Practical Algorithms
In order for the smoothed classifier g to label text examples correctly and robustly, the base classifier f needs to be trained to classify texts in which ρ percent of the words are masked. Specifically, at each training iteration, we first sample a mini-batch of examples and randomly perform the masking operation on them. We then apply gradient descent to f based on the masked mini-batch.
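A minimal training sketch under the assumption that the base classifier is a Hugging Face BERT/RoBERTa-style sequence classifier, that `labels` is a tensor of class indices, and that masking is applied at the word level before tokenization; the helper names (`mask_words`, `train_step`) are ours.

```python
import math
import random

def mask_words(text, rho, mask_token="[MASK]", rng=random):
    """Randomly mask rho percent of the words in a whitespace-tokenized text.
    In practice the mask string should match the tokenizer's mask token
    ("[MASK]" for BERT, "<mask>" for RoBERTa)."""
    words = text.split()
    masked = set(rng.sample(range(len(words)), math.ceil(rho * len(words))))
    return " ".join(mask_token if i in masked else w for i, w in enumerate(words))

def train_step(model, tokenizer, optimizer, texts, labels, rho=0.9):
    """One gradient step of the base classifier f on a randomly masked mini-batch."""
    masked_texts = [mask_words(t, rho, tokenizer.mask_token) for t in texts]
    batch = tokenizer(masked_texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=labels)     # cross-entropy loss on the masked inputs
    optimizer.zero_grad()
    outputs.loss.backward()
    optimizer.step()
    return outputs.loss.item()
```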
We present practical Monte Carlo algorithms for evaluating g(x) and certifying the robustness of g around x in Algorithm 2. Evaluating the smoothed classifier’s prediction g(x) requires identifying the class c with maximal weight in the categorical distribution. The procedure described in the Predict pseudocode randomly draws n masked copies of x and runs these n copies through f. If the class c appears more often than any other class, the Predict procedure returns c.
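A minimal sketch of the Predict procedure (majority vote over n masked copies); deriving the number of retained words per copy from the masking rate ρ as below is our assumption about how Algorithm 2 parameterizes the masking.

```python
import random
from collections import Counter

def predict(f, x, rho, n=1000, rng=random):
    """Predict g(x): draw n masked copies of x (masking rho percent of the words),
    classify each copy with the base classifier f, and return the majority class.
    `f`, `mask`, and `sample_indices` are the helpers sketched earlier."""
    h = len(x.split())
    k = h - round(rho * h)                      # number of words retained in each copy
    votes = Counter(f(mask(x, sample_indices(h, k, rng))) for _ in range(n))
    return votes.most_common(1)[0][0]
```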
Evaluating and certifying the robustness of g around an input x requires not only identifying the class c with maximal weight, but also estimating a lower bound on pc(x) and the value of β. In the Certify procedure described in Algorithm 2, we first ensure that g correctly classifies x as y, and then estimate the values of pc(x) and β by randomly generating n′ masked copies of the text x, where n′ is much greater than n. We gradually increase the value of d (the number of words allowed to be perturbed) by 1, starting from 0, and compute Δ by Equation (5). This process continues until the certification condition of Corollary 1 no longer holds, and when it stops the Certify procedure returns the largest certified d as the maximum certified robustness for x. In this way, we can certify with (1 − α) confidence that g(x′) will return the label y for any adversarial example x′ if ∥x − x′∥0 ≤ d and κ(x, x′) ≤ ϵ.
4 Experiments
We first present the results of the certified robustness achieved by our RanMASK on two datasets, AGNEWS (Zhang, Zhao, and LeCun 2015) and SST2 (Socher et al. 2013), and then report the empirical robustness on these datasets by comparing it with other representative defense methods, including PGD-K (Madry et al. 2018), FreeLB (Zhu et al. 2020), Adv-Hotflip (Ebrahimi et al. 2018), and adversarial data augmentation. To evaluate the empirical robustness of different methods, we conducted experiments on four datasets: AGNEWS (Zhang, Zhao, and LeCun 2015) for text classification, SST2 (Socher et al. 2013) for sentiment analysis, IMDB (Maas et al. 2011) for Internet movie reviews, and SNLI (Bowman et al. 2015) for natural language inference. Finally, we empirically compare RanMASK with SAFER (Ye, Gong, and Liu 2020), a recently proposed certified defense that can also be applied to large architectures such as BERT (Devlin et al. 2019). Another thing that RanMASK has in common with SAFER is that both methods are built on an ensemble strategy. After a thorough comparison, we found that different randomized smoothing methods may behave quite differently when different ensemble methods are used. We demonstrate that the “majority-vote” ensemble can sometimes fool score-based attack algorithms that adopt a greedy search strategy. The improvement in empirical robustness thus comes not only from the defense methods themselves but also from the type of ensemble method they use.
4.1 Implementation Details
Because our randomized masking strategy is the same as that used to train large-scale masked language models such as BERT (Devlin et al. 2019), we chose BERT-like models, including BERT and RoBERTa (Liu et al. 2019), as our base models. This helps keep the performance of the base classifier f at an acceptable level when it takes masked texts as inputs, because BERT-based models have the capacity to implicitly recover information about the masked words.
Unless otherwise specified, all the models are trained with the AdamW optimizer (Loshchilov and Hutter 2019) with a weight decay of 1e-6, a batch size of 32, a maximum number of epochs of 20, gradient clipping to (−1, 1), and a learning rate of 5e-5, which is decayed by the cosine annealing method (Loshchilov and Hutter 2017). The models tuned on the validation set were used for testing and certifying. We randomly selected 1,000 test examples for each dataset in both the certified and empirical experiments. When conducting the experiments on certified robustness, we set the uncertainty α to 0.05, the number of samples n for the Predict procedure to 1,000, and the number of samples n′ for the Certify procedure to 5,000. To evaluate the empirical robustness of the models, we set n to 100 to speed up the evaluation process.
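A PyTorch sketch of the optimization setup described above (the annealing horizon and the use of value-based gradient clipping are our reading of the stated hyperparameters, not the authors' released code).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

def build_optimizer(model, num_epochs=20, lr=5e-5, weight_decay=1e-6):
    """AdamW with a weight decay of 1e-6 and a learning rate of 5e-5 decayed by
    cosine annealing, matching the hyperparameters listed above."""
    optimizer = AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs)
    return optimizer, scheduler

def clip_gradients(model):
    """Clip each gradient value to the range (-1, 1) before the optimizer step."""
    torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
```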
4.2 Results of the Certified Robustness
In this subsection we provide the certified robustness of RanMASK on both the AGNEWS and SST2 datasets. When reporting certified results, we refer to the following metrics, some of which were described by Levine and Feizi (2020):
The certified robustness of a text x is the maximum d for which we can certify that the smoothed classifier g(x′) will return the correct label y where x′ is any adversarial example of x such that ∥x − x′∥0 ≤ d and κ(x, x′) ≤ ϵ. If g(x) labels x incorrectly, we define the certified robustness as “N/A”, that is, failure in the certification (see Algorithm 2 for details).
The certified rate of a text x is the certified robustness of x (i.e., the maximum d found for x) divided by the length of x, denoted as hx.
The median certified robustness (MCB) on a dataset is the median value of the certified robustness across the dataset. It is the maximum d for which the smoothed classifier g can guarantee robustness for at least 50% texts in the dataset. In other words, we can certify the classifications of over 50% texts to be robust to any perturbation with at most d words. When computing this median, the texts which g misclassifies are counted as “N/A”, which means negative infinity in the certified robustness. For example, if the certified robustness of the texts in a dataset are {N/A, N/A, 1, 2, 3}, the median certified robustness is 1, not 2.
The median certified rate (MCR) on a dataset is the median value of the certified rate across the dataset, which is obtained in a similar way to the MCB. While the MCB indicates the maximum (absolute) number of words that can be arbitrarily perturbed for which robustness is guaranteed, the MCR is the maximum (relative) percentage of words that can be intentionally perturbed for which a smoothed classifier g can still guarantee robustness for at least 50% of the texts in a dataset. This metric is newly introduced to deal with variable-length texts.
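A minimal sketch of how the MCB and MCR can be computed from per-text certification results; representing “N/A” as negative infinity and using the lower median for even-sized sets are our assumptions.

```python
import statistics

def median_certified(values):
    """Median over per-text results where misclassified texts ("N/A") count as -inf."""
    vals = [float("-inf") if v == "N/A" else v for v in values]
    return statistics.median_low(vals)   # lower median avoids averaging with -inf

def mcb_and_mcr(certified_d, lengths):
    """MCB: median certified robustness (in words). MCR: median certified rate,
    i.e., the median of certified robustness divided by text length."""
    rates = ["N/A" if d == "N/A" else d / h for d, h in zip(certified_d, lengths)]
    return median_certified(certified_d), median_certified(rates)

# Example from the definition above: {N/A, N/A, 1, 2, 3} has an MCB of 1, not 2.
print(median_certified(["N/A", "N/A", 1, 2, 3]))  # -> 1
```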
We first tested the robust classification on AGNEWS using RoBERTa (Liu et al. 2019) as the base classifier. As we can see from Table 2, the maximum MCB was achieved at 5 when using the masking rate ρ = 90% or ρ = 95%, indicating that we can certify the classifications of over 50% of sentences to be robust to any perturbation of at most 5 words. We chose to use the model (ρ = 90%) to evaluate the empirical robustness on AGNEWS because it gives better classification accuracy on the clean data.
Table 2: Certified robustness on AGNEWS with RoBERTa as the base classifier under different masking rates ρ.

| Rate ρ% | Accuracy% | MCB | MCR% |
|---|---|---|---|
| 40 | 96.2 | 1 | 2.6 |
| 50 | 95.7 | 1 | 2.7 |
| 60 | 95.7 | 2 | 5.0 |
| 65 | 95.0 | 2 | 5.0 |
| 70 | 94.5 | 2 | 5.0 |
| 75 | 93.9 | 3 | 7.0 |
| 80 | 92.0 | 3 | 7.5 |
| 85 | 92.2 | 4 | 8.8 |
| 90 | 91.1 | 5 | 11.4 |
| 95 | 85.8 | 5 | 11.8 |
We also evaluated the robust classification on SST2 using RoBERTa as the base classifier. As shown in Table 3, the maximum MCB was achieved at 2 when ρ = 70% or ρ = 80%, indicating that over 50% of sentences are robust to any perturbation of 2 words. However, these two models achieve the maximum MCB at a higher cost in clean accuracy (about a 10% drop compared to the best). We chose the model with ρ = 30% to evaluate the empirical robustness on SST2 due to its higher classification accuracy on clean data. We found that it is impossible to train the models when ρ ≥ 90%. Unlike AGNEWS (created for news topic classification), SST2 was constructed for sentiment analysis. The sentiment of a text largely depends on whether a few specific sentiment words occur in the text. All the sentiment words in a text would be masked with high probability when a high masking rate is applied (say, ρ ≥ 90%), which makes it hard for any model to correctly predict the sentiment of the masked texts. Assuming that a text consists of neutral words and a single positive word “love”, a smoothed classifier g can still assign the masked examples the same label as the original text with high probability if the masking rate ρ is less than 50%, because there is a more than 50% chance that the word “love” is not masked in each copy and the ensemble method is used to obtain the final prediction. However, when a higher masking rate is used, the word “love” is masked with a probability higher than 50%, which reduces the probability that the masked texts receive the same label as the original text.
Table 3: Certified robustness on SST2 with RoBERTa as the base classifier under different masking rates ρ.

| Rate ρ% | Accuracy% | MCB | MCR% |
|---|---|---|---|
| 20 | 92.4 | 1 | 5.26 |
| 30 | 92.4 | 1 | 5.26 |
| 40 | 91.2 | 1 | 5.26 |
| 50 | 89.3 | 1 | 5.56 |
| 60 | 84.3 | 1 | 7.41 |
| 70 | 83.3 | 2 | 8.00 |
| 80 | 81.4 | 2 | 10.00 |
| 90 | 49.6 | N/A | N/A |
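To make the argument above concrete, the following sketch computes the probability that a single decisive sentiment word survives masking in one copy (1 − ρ) and the probability that it survives in a majority of the n masked copies used for voting, under the idealized assumption that each copy is labeled positive exactly when the word is retained.

```python
from math import comb

def majority_retention_prob(rho, n=100):
    """Probability that strictly more than half of n independently masked copies
    retain a given word, when each copy retains it with probability 1 - rho."""
    p = 1.0 - rho
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(n // 2 + 1, n + 1))

for rho in (0.3, 0.5, 0.7, 0.9):
    print(f"rho={rho:.1f}: kept in one copy with p={1 - rho:.1f}, "
          f"kept in a majority of 100 copies with p={majority_retention_prob(rho):.3f}")
```

The sharp drop once ρ exceeds 50% mirrors the behavior described above for SST2, where high masking rates make it hard to recover the sentiment of texts dominated by a few sentiment words.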
4.3 Results of the Empirical Robustness
In the following experiments, we consider two ensemble methods (Cheng et al. 2020a): logits-summed ensemble (logit) and majority-vote ensemble (vote). In the “logit” method, we take the average of the logits produced by the base classifier f over all the individual random samples as the final prediction. In the “vote” strategy, we simply count the votes for each class label. The following metrics (Li et al. 2021) are used to report the results of empirical robustness:
The clean accuracy (Cln) is the classification accuracy achieved by a classifier on the clean texts.
The robust accuracy (Boa) is the accuracy of a classifier achieved under a certain attack.
The success rate (Succ) is the number of texts successfully perturbed by an attack algorithm (causing the model to make errors) divided by the total number of texts attempted.
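To make the two ensemble strategies described above concrete, here is a minimal sketch of how the final prediction is formed from the per-copy outputs of the base classifier; the array shapes are our assumptions.

```python
import numpy as np

def logit_ensemble(per_copy_logits):
    """'logit' ensemble: average the base classifier's logits over the n masked copies
    and predict the argmax class. per_copy_logits has shape (n, num_classes)."""
    return int(np.argmax(per_copy_logits.mean(axis=0)))

def vote_ensemble(per_copy_logits):
    """'vote' ensemble: each masked copy casts one vote for its own argmax class,
    and the class with the most votes is predicted."""
    votes = np.argmax(per_copy_logits, axis=1)
    return int(np.bincount(votes, minlength=per_copy_logits.shape[1]).argmax())

# Toy example with n = 4 masked copies and 2 classes.
logits = np.array([[2.0, 1.0], [1.5, 1.4], [0.2, 0.9], [1.1, 0.8]])
print(logit_ensemble(logits), vote_ensemble(logits))  # both predict class 0 here
```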
We evaluated the empirical robustness under test-time attacks by using the TextAttack2 framework (Morris et al. 2020) with three black-box, score-based attack algorithms: TextFooler (Jin et al. 2020), BERT-Attack (Li et al. 2020), and DeepWordBug (Gao et al. 2018). TextFooler and BERT-Attack adversarially perturb the text inputs by word-level substitutions, whereas DeepWordBug performs character-level perturbations on the input texts. TextFooler generates synonyms using the 50 nearest neighbors of GloVe vectors (Pennington, Socher, and Manning 2014), while BERT-Attack uses BERT to generate synonyms dynamically, meaning that no defender can know in advance the synonyms used by BERT-Attack. DeepWordBug generates text adversarial examples by replacing, scrambling, and erasing a few characters of some words in the input texts.
We compared RanMASK with the following defense methods proposed recently:
PGD-K (Madry et al. 2018): applies gradient-guided adversarial perturbations to word embeddings and minimizes the resultant adversarial loss inside different regions around input samples.
FreeLB (Zhu et al. 2020): adds norm-bounded adversarial perturbations to the input’s word embeddings using a gradient-based method, and enlarges the batch size with diversified adversarial samples under such norm constraints.
Adv-Hotflip (Ebrahimi et al. 2018): first generates textual adversarial examples by using Hotflip (Ebrahimi et al. 2018) and then augments the generated examples with the original training data to train a robust model. Unlike PGD-K and FreeLB, Adv-Hotflip will generate real adversarial examples by replacing the original words with their synonyms rather than performing adversarial perturbations in the word embedding space.
Adversarial Data Augmentation: it remains one of the most successful defense methods for NLP models (Miyato, Dai, and Goodfellow 2017; Sato et al. 2018). During the training phase, a word is replaced with one of its synonyms that maximizes the prediction loss. By augmenting these adversarial examples with the original training data, the model is trained to be robust to such perturbations.
The results of the empirical robustness on the AGNEWS dataset are reported in Table 4. From these numbers, we see that RanMASK-90% consistently performs better than the competitors under all three attack algorithms in terms of robust accuracy, while suffering only a small performance drop on the clean data. The empirical results on SST2 are reported in Table 5, and we found trends similar to those on AGNEWS, especially when the LM-based sampling strategy was used. On the IMDB dataset, we even observed that RanMASK achieved the highest accuracy on the clean data, with a 1.5% improvement over the baseline built upon RoBERTa; the masking rate (i.e., 30%) was tuned on the validation set to maximize the MCB using the method introduced in Section 4.2. The results on the three text classification datasets show that RanMASK consistently achieves better robust accuracy while suffering little loss on the original clean data.
Table 4: Empirical robustness on AGNEWS under three attack algorithms.

| Method | TextFooler | | | BERT-Attack | | | DeepWordBug | | |
|---|---|---|---|---|---|---|---|---|---|
| | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% |
| Baseline (RoBERTa) | 93.9 | 15.8 | 83.2 | 94.7 | 26.7 | 71.8 | 94.2 | 33.0 | 65.0 |
| PGD-10 (Madry et al. 2018) | 95.0 | 22.3 | 76.5 | 95.3 | 30.0 | 68.5 | 94.9 | 38.8 | 59.1 |
| FreeLB (Zhu et al. 2020) | 93.9 | 24.6 | 73.8 | 95.3 | 28.3 | 70.3 | 93.7 | 44.0 | 53.0 |
| Adv-Hotflip (Ebrahimi et al. 2018) | 93.4 | 21.3 | 77.2 | 93.9 | 26.8 | 71.5 | 94.6 | 37.6 | 60.3 |
| Data Augmentation | 93.3 | 23.7 | 74.6 | 92.3 | 39.1 | 57.6 | 93.8 | 49.7 | 47.0 |
| RanMASK-90% (logit) | 89.1 | 42.7 | 52.1 | 88.5 | 30.0 | 66.1 | 89.8 | 45.4 | 45.4 |
| RanMASK-90% (vote) | 91.2 | 55.1 | 39.6 | 89.1 | 41.1 | 53.9 | 90.3 | 57.5 | 36.0 |
Table 5: Empirical robustness on SST2 under three attack algorithms.

| Method | TextFooler | | | BERT-Attack | | | DeepWordBug | | |
|---|---|---|---|---|---|---|---|---|---|
| | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% |
| Baseline (RoBERTa) | 94.3 | 5.4 | 94.3 | 93.9 | 6.2 | 93.4 | 94.7 | 17.0 | 82.1 |
| PGD-10 (Madry et al. 2018) | 94.0 | 5.6 | 94.0 | 94.4 | 5.6 | 94.1 | 92.9 | 18.3 | 80.3 |
| FreeLB (Zhu et al. 2020) | 93.7 | 13.9 | 85.2 | 93.8 | 10.4 | 89.0 | 93.0 | 23.7 | 74.5 |
| Adv-Hotflip (Ebrahimi et al. 2018) | 94.3 | 12.3 | 87.0 | 93.8 | 11.4 | 87.9 | 93.3 | 23.4 | 74.9 |
| Data Augmentation | 91.0 | 9.6 | 89.5 | 88.2 | 16.9 | 80.8 | 91.8 | 23.5 | 74.4 |
| RanMASK-30% (logit) | 92.9 | 8.9 | 90.4 | 92.9 | 9.5 | 89.8 | 93.0 | 21.1 | 77.3 |
| RanMASK-30% (vote) | 92.7 | 12.9 | 86.1 | 93.0 | 11.4 | 87.7 | 92.7 | 27.5 | 70.3 |
| RanMASK-30% (vote) + LM | 90.6 | 23.4 | 74.2 | 90.4 | 22.8 | 74.8 | 91.0 | 41.7 | 53.1 |
Compared to the baseline (RoBERTa), RanMASK improves the accuracy under attack, or the robust accuracy (Boa), by 21.06% and lowers the attack success rate (Succ) by 23.71% on average, at the cost of a 2.07% decrease in clean accuracy, across the three datasets and under the three attack algorithms. When compared to a strong, recently proposed competitor, FreeLB (Zhu et al. 2020), RanMASK still further increases the accuracy under attack by 15.47% and reduces the attack success rate by 17.71% on average, at the cost of a 1.98% decrease in clean accuracy, under the three different attacks. Generally, RanMASK with the “vote” ensemble performs better than RanMASK with the “logit” ensemble, except on the IMDB dataset under the BERT-Attack and TextFooler attacks. We thoroughly discuss the properties and behaviors of these two ensemble methods in the following sections.
As shown in Tables 4 and 6, any model applied to IMDB appears to be more vulnerable to adversarial attacks than the same model on AGNEWS. For example, BERT-Attack achieved a 100% attack success rate against the baseline model on IMDB, while its attack success rates are far below 100% on the other datasets. This is probably because the average length of the sentences in IMDB (255 words on average) is much longer than that in AGNEWS (43 words on average). Longer sentences allow the adversaries to apply more synonym substitution-based or character-level perturbations to the original examples. While RanMASK is more resistant to adversarial attacks, it also improves the clean accuracy on IMDB. One reasonable explanation is that the models rely too heavily on non-robust features that are less relevant to the categories to be classified, and our random masking strategy disproportionately affects non-robust features, which thus hinders the model’s reliance on them. Note that the sentences in IMDB are relatively long, and many words in any sentence might be irrelevant to the classification but would be inappropriately used by the models for the prediction.
Table 6: Empirical robustness on IMDB under three attack algorithms.

| Method | TextFooler | | | BERT-Attack | | | DeepWordBug | | |
|---|---|---|---|---|---|---|---|---|---|
| | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% |
| Baseline (RoBERTa) | 91.5 | 0.5 | 99.4 | 91.5 | 0.0 | 100.0 | 91.5 | 48.5 | 47.0 |
| PGD-10 (Madry et al. 2018) | 92.0 | 1.0 | 98.9 | 92.0 | 0.5 | 99.4 | 92.0 | 44.5 | 51.6 |
| FreeLB (Zhu et al. 2020) | 92.0 | 3.5 | 96.2 | 92.0 | 2.5 | 97.3 | 92.0 | 52.5 | 42.9 |
| Adv-Hotflip (Ebrahimi et al. 2018) | 91.5 | 6.5 | 92.9 | 91.5 | 11.5 | 87.4 | 91.5 | 42.5 | 53.5 |
| Data Augmentation | 90.5 | 2.5 | 97.2 | 91.0 | 5.5 | 94.0 | 91.0 | 50.5 | 44.5 |
| RanMASK-30% (logit) | 93.0 | 23.5 | 74.7 | 93.0 | 22.0 | 76.3 | 93.5 | 62.0 | 33.7 |
| RanMASK-30% (vote) | 93.0 | 18.0 | 80.7 | 93.5 | 17.0 | 81.8 | 92.5 | 66.0 | 28.7 |
We also conducted natural language inference experiments on the Stanford Natural Language Inference (SNLI) corpus (Bowman et al. 2015), a collection of 570,000 English sentence pairs (a premise and a hypothesis) manually labeled for balanced classification with three labels: entailment, contradiction, and neutral. What makes natural language inference different from text classification is that it requires determining whether a directional relation holds between two texts, that is, whether the truth of one text (the hypothesis) follows from another text (the premise). We implemented a baseline model based on RoBERTa for this task. The premise and hypothesis are encoded by running RoBERTa over their word embeddings to generate the sentence representations, using attention between the premise and hypothesis to compute richer representations of each word in both sentences; the concatenation of these encodings is then fed to a two-layer feedforward network for the prediction. The baseline model was trained with the cross-entropy loss, and its hyperparameters were tuned on the validation set.
The results of the empirical robustness on SNLI are reported in Table 7. The masking rate (i.e., 15%) was tuned for RanMASK on the validation set to maximize the MCB. From these numbers, a handful of trends are readily apparent. RanMASK using the “vote” ensemble again achieved better empirical robustness than RanMASK using the “logit” ensemble. Compared to the baseline, RanMASK improves the accuracy under attack, or the robust accuracy (Boa), by 14.05% and lowers the attack success rate (Succ) by 16.30% on average, at the cost of a 3.43% decrease in clean accuracy, under the three different attack algorithms. When FreeLB is used for comparison, RanMASK further improves the robust accuracy (Boa) by 13.43% and reduces the attack success rate (Succ) by 15.83% on average.
Table 7: Empirical robustness on SNLI under three attack algorithms.

| Method | TextFooler | | | BERT-Attack | | | DeepWordBug | | |
|---|---|---|---|---|---|---|---|---|---|
| | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% |
| Baseline (RoBERTa) | 91.0 | 3.9 | 95.7 | 91.0 | 0.6 | 99.3 | 91.0 | 4.2 | 95.4 |
| PGD-10 (Madry et al. 2018) | 91.9 | 4.7 | 95.0 | 91.9 | 1.0 | 98.9 | 91.9 | 4.8 | 94.8 |
| FreeLB (Zhu et al. 2020) | 91.2 | 4.3 | 95.3 | 91.2 | 0.4 | 99.6 | 91.2 | 5.3 | 94.1 |
| Adv-Hotflip (Ebrahimi et al. 2018) | 88.8 | 6.0 | 93.2 | 88.8 | 1.5 | 98.3 | 88.5 | 8.9 | 90.0 |
| Data Augmentation | 89.5 | 14.2 | 84.1 | 89.7 | 1.8 | 98.0 | 91.0 | 20.3 | 77.7 |
| RanMASK-15% (logit) | 89.4 | 10.8 | 87.9 | 89.8 | 1.2 | 98.7 | 89.7 | 6.8 | 92.4 |
| RanMASK-15% (vote) | 87.0 | 21.5 | 74.7 | 89.5 | 5.8 | 93.5 | 86.2 | 23.0 | 73.3 |
In conclusion, RanMASK can improve the robust accuracy under different attacks much further than existing defense methods on various tasks, including text classification, sentiment analysis, and natural language inference. However, it is well-known that there is a tradeoff between clean accuracy and adversarial robustness. This tradeoff also has been observed on these tasks with all the considered models, including those trained with RanMASK. Specifically, there is a tradeoff between clean accuracy and maximum median certified robustness (MCB) for RanMASK. Generally, the higher the masking rate ρ, the greater the MCB and the lower the clean accuracy. In our experiments, the masking rate was chosen to achieve the highest MCB while maintaining the clean accuracy as much as possible for each dataset. In practice, the masking rate should be chosen to meet the requirements of specific applications and the preferences of their developers. It would be interesting to seek an efficient search method for finding a proper masking rate to balance clean accuracy and MCB. We leave this as future work.
4.4 Comparison with SAFER
Unlike the other baselines, SAFER (Ye, Gong, and Liu 2020) is a certified defense method against adversarial attacks proposed for NLP models. Although the same evaluation metrics and attack algorithms are used as when comparing to the other baselines, we compare SAFER to RanMASK thoroughly in this separate subsection because both methods provide certified robustness for neural text models. Another thing our method has in common with SAFER is that both are built on an ensemble strategy. We report in Table 8 the empirical robustness of RanMASK on AGNEWS compared with SAFER. From these numbers, we found that RanMASK outperforms SAFER under the setting where the “logit” ensemble is used for the predictions, while SAFER performs slightly better than RanMASK when the “vote” ensemble is used under the attack of TextFooler. However, this comparison is not direct and fair. First, SAFER makes use of the same synonym table used by TextFooler (i.e., it also assumes that the defenders know in advance how the adversaries generate synonyms to launch adversarial attacks). Second, we found that different smoothing defense methods behave quite differently as the ensemble method is changed from “vote” to “logit.”
Table 8: Comparison with SAFER on AGNEWS (BERT as the base classifier) under three attack algorithms.

| Method | TextFooler | | | BERT-Attack | | | DeepWordBug | | |
|---|---|---|---|---|---|---|---|---|---|
| | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% | Cln% | Boa% | Succ% |
| Baseline (BERT) | 93.0 | 5.6 | 94.0 | 95.1 | 16.3 | 82.9 | 94.3 | 16.6 | 82.4 |
| SAFER (logit) | 94.6 | 26.1 | 72.4 | 94.8 | 29.0 | 69.4 | 95.1 | 31.9 | 66.5 |
| SAFER (vote) | 95.4 | 78.6 | 17.6 | 94.3 | 63.4 | 32.8 | 95.2 | 78.4 | 17.7 |
| RanMASK-90% (logit) | 91.3 | 47.3 | 48.2 | 91.6 | 38.3 | 58.2 | 89.2 | 39.6 | 55.6 |
| RanMASK-90% (vote) | 90.3 | 51.9 | 42.5 | 92.7 | 46.3 | 50.1 | 90.4 | 51.8 | 42.7 |
| RanMASK-5% (logit) | 94.4 | 13.2 | 86.0 | 96.0 | 25.6 | 73.3 | 94.8 | 23.4 | 75.3 |
| RanMASK-5% (vote) | 93.9 | 68.6 | 26.9 | 95.2 | 63.0 | 33.8 | 95.3 | 77.1 | 19.1 |
| RanMASK-5% (vote) + LM | 94.8 | 71.4 | 24.7 | 95.7 | 65.3 | 31.8 | 93.8 | 80.3 | 14.4 |
Typical score-based attack algorithms, such as TextFooler and DeepWordBug, usually craft adversarial examples in two steps: greedily identify the most vulnerable position to change, and then modify it slightly to maximize the model’s prediction error. These two steps are repeated iteratively until the model’s prediction changes. If the “vote” ensemble method is used, the class distributions produced by the models trained with SAFER are quite sharp, even very close to a one-hot categorical distribution,3 which prevents the adversaries from observing the changes in the model’s predictions caused by a small perturbation of the input, leaving them trapped in local minima. This forces the adversaries to launch decision-based attacks (Maheshwary, Maheshwary, and Pudi 2020) instead of score-based ones, which can dramatically affect the resultant attack success rates. If the “logit” ensemble method is used, or if the attack algorithm is designed to perturb more than one word at a time, the empirical robustness of SAFER drops significantly. Therefore, it is unfair to compare “vote”-based ensemble defense methods with others when conducting empirical experiments. We believe the methods using the “vote” ensemble will greatly improve the model’s defense performance when the models are deployed for real-world applications, but we recommend using the “logit” ensemble method if one wants to analyze and prove the effectiveness of a proposed defense method against textual adversarial attacks in future research.
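The following toy simulation illustrates this point: under the “vote” ensemble the aggregated score is nearly one-hot and barely moves when a single word is slightly perturbed, so a greedy score-based attacker receives almost no signal, whereas the averaged logits shift visibly. The numbers are synthetic and only meant to illustrate the mechanism.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000                                    # number of masked copies
# Synthetic per-copy margins (logit of the true class minus logit of the runner-up).
margins_before = rng.normal(loc=3.0, scale=1.0, size=n)
margins_after = margins_before - 0.3        # a small word perturbation lowers every margin a bit

def scores(margins):
    vote = (margins > 0).mean()             # "vote" score: fraction of copies voting the true class
    logit = margins.mean()                  # "logit" score: averaged margin
    return vote, logit

print("vote :", scores(margins_before)[0], "->", scores(margins_after)[0])
print("logit:", round(scores(margins_before)[1], 3), "->", round(scores(margins_after)[1], 3))
# The vote-based score barely changes (and with only 100 copies it often does not change
# at all), while the logit-based score drops by the full perturbation size, giving the
# greedy search a usable signal.
```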
4.5 Impact of Masking Rate on Robust Accuracy
We want to understand the impact of different masking rates on the accuracy of RanMASK under adversarial attacks by varying the masking rate from 5% to 90%. Figure 3 shows the average robust accuracy of RanMASK on the test set of AGNEWS versus several masking rates with the two ensemble methods under three different attacks: TextFooler, BERT-Attack, and DeepWordBug. For each setting, the average accuracy is obtained over 5 runs with different random initializations. Taking the results obtained with TextFooler as an example (similar trends can be observed for the other attack algorithms), when the “logit” ensemble method is used, the accuracy under attack generally increases until the best performance is achieved at a masking rate of 90% (ρ = 90%), with a big jump from 70% to 90%. However, if the “vote” ensemble method is applied, we observe a dramatic decrease in the robust accuracy as the masking rate varies from 5% to 30%, after which the robust accuracy climbs smoothly until reaching its peak at ρ = 90%. Although it seems counterintuitive that the robust accuracy first falls greatly and then rises, this phenomenon can be explained by the same reason used in Section 4.4 to explain the difference in predictive behavior between SAFER and RanMASK.
Note that the lower the masking rate, the more similar the class distributions RanMASK produces for different masked copies of a given input text. Those distributions become extremely similar when the “vote” ensemble method is used, which prevents the attack algorithms from effectively identifying the weaknesses of victim models by applying small perturbations to the input and tracking the resulting impacts. This misleading phenomenon deserves particular attention since it might hinder the search for the optimal setting. For example, as we can see from Figure 3b, RanMASK-5% with the “vote” ensemble method achieved remarkable robust accuracy on AGNEWS under three different attacks, but when the method is switched to the “logit” ensemble, RanMASK-5% performs much worse. No matter which ensemble method is used, the optimal masking rate should be set to around 90%, which provides certified robustness to any perturbation of the inputs by no more than 5 words. As shown in Table 2, we can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on AGNEWS when ρ = 90%, while the number of words allowed to be perturbed is very close to 0 when ρ = 5%. Therefore, if only the “vote” ensemble method were used, we might come to wrong conclusions.
4.6 Impact of Sentence Length on Accuracy
To determine whether the smoothed classifier g is certifiably robust at a given x, the condition of Corollary 1 needs to be evaluated to check whether it is satisfied. It is hard or even impossible to satisfy this condition when the given sentence is very short. For the extreme case of two-word sentences, the maximum number of words that can be masked is 1, and the condition cannot be satisfied no matter the value of d, since the accuracy is always equal to or less than 100% (d can only take the value 1 or 2). However, when the length of sentence x is 3, the condition can be satisfied if d = 1 (33% of words are allowed to be perturbed), kx = 1 (only one word is not masked), and pc(x) exceeds the corresponding threshold. As discussed in Section 3.2, pc(x) is used to approximate the value of β. When the length of sentence x is 5, the condition has a higher chance of being satisfied. Assuming that d = 1 (20% of words can be perturbed by adversaries), the certified condition would be satisfied for kx = 1 or kx = 2 with a sufficiently large pc(x); when d = 2 (40% of words can be perturbed), the condition can still be satisfied for kx = 1 with a sufficiently large pc(x). In real-world applications, most sentences are longer than 5 words. According to our statistics, there are no sentences shorter than 5 words in the test sets of AGNEWS (Zhang, Zhao, and LeCun 2015) and SST2 (Socher et al. 2013), and all the sentences in the AGNEWS test set have more than 10 words.
In theory, the length of a sentence does affect how likely the condition of Corollary 1 is to be satisfied. This effect decreases rapidly as the sentence length increases, and it is negligible when the sentence length is greater than 10 words, because the differences in the value of Δ are less than 0.04 if d ≥ 3 and kx ≥ 3 for any pair of such sentences with different lengths. We also want to know how sentence length empirically impacts the clean and robust accuracies, by varying the length of the sentences in the test sets of AGNEWS and SST2. In Figure 4, we show the clean and robust accuracies achieved by RanMASK-90% and the RoBERTa baseline for different subsets of the AGNEWS test set. The sentences in the test set were grouped into five subsets (11–20, 21–30, 31–40, 41–50, and 50+ words) according to their length. The numbers of sentences in these subsets are 199 (2.6%), 1,072 (14.0%), 3,142 (41.1%), 2,420 (31.6%), and 821 (10.7%), respectively. Following the setting described in Section 4.3, three attack algorithms, TextFooler, BERT-Attack, and DeepWordBug, were used to evaluate the robustness of the two models. The accuracies achieved by RanMASK-90% are shown in red font, while those produced by the baseline model (RoBERTa) are reported in black font. In Figure 5, we show the results on the test set of SST2, where the sentences were divided into five subsets of 1–10, 11–20, 21–30, 31–40, and 41+ words (note that the average length of the sentences in AGNEWS is longer than that in SST2). The sizes of the SST2 subsets are 351 (19.3%), 702 (38.5%), 560 (30.8%), 181 (9.9%), and 27 (1.5%), respectively.
As we can see from Figures 4 and 5, RanMASK achieved clean accuracy comparable to the RoBERTa baseline, and the changing trend of clean accuracy across different sentence lengths is generally consistent with that of the baseline, with very few exceptions. RanMASK clearly performs substantially better than the RoBERTa baseline in robust accuracy across different sentence lengths on both the AGNEWS and SST2 datasets. The baseline only outperforms RanMASK-30% on the SST2 subset of 1–10 word sentences under the attacks of BERT-Attack and DeepWordBug. As mentioned in Section 4.3, the risk probability is approximately equal to 0.5 when setting ρ = 30%, and the LM-based sampling should be used to reduce the risk when evaluating on SST2. If the LM-based sampling strategy is used, RanMASK-30% outperforms the baseline by 2% and 16% in accuracy on the same subset of SST2 under the BERT-Attack and DeepWordBug attacks, respectively. The numbers presented in Figure 4 show that the longer the sentence, the better the robust accuracy of RanMASK, while such a trend is not observed on the SST2 dataset (see Figure 5). The robust accuracy achieved by RanMASK fluctuates within a relatively small range as the length of the sentences increases from 1–10 to 41+ on the SST2 test set. The reason for this difference between the two datasets is that SST2 was constructed for sentiment analysis: the sentiment of a sentence largely depends on whether a few specific sentiment words occur in it, and the number of sentiment words does not increase significantly with the length of the sentence. Unlike SST2, AGNEWS was created for news topic classification, and the number of topic words generally increases with sentence length, which makes it easier for RanMASK to correctly predict the category of a longer sentence because there is a great chance for some topic words to survive the masking operations even though a large portion of words have been masked. The adversarial robustness of RanMASK can thus be affected by sentence length, but this effect exerts itself in a more complex way and depends on the type of task.
5 Related Work
Even though neural networks achieve prominent performance on many important tasks, it has been reported that they are vulnerable to adversarial examples—inputs generated by applying imperceptible but intentional perturbations, such that the perturbed inputs can cause the model to make mistakes. Adversarial examples were first discovered in the image domain (Szegedy et al. 2014), and their existence and pervasiveness were then also observed in the text domain. Despite the fact that generating adversarial examples for texts has proven to be a more challenging task than for images due to their discrete nature, many methods have been proposed to generate textual adversarial examples and reveal the vulnerability of deep neural networks on NLP tasks, including reading comprehension (Jia and Liang 2017), text classification (Samanta and Mehta 2017; Wong 2017; Liang et al. 2018; Alzantot et al. 2018), machine translation (Zhao, Dua, and Singh 2018; Ebrahimi et al. 2018; Cheng et al. 2020b), dialogue systems (Cheng, Wei, and Hsieh 2019), and syntactic parsing (Zheng et al. 2020). The existence and pervasiveness of adversarial examples pose serious threats to neural network-based models, especially when applying them to security-critical applications, such as face recognition systems (Zhu, Lu, and Chiang 2019), autonomous driving systems (Eykholt et al. 2018), and toxic content detection (Li et al. 2019). The introduction of adversarial examples and adversarial training ushered in a new era in understanding and improving machine learning models, and has received significant attention recently (Goodfellow, Shlens, and Szegedy 2015a; Moosavi-Dezfooli, Fawzi, and Frossard 2016; Madry et al. 2018; Ilyas et al. 2019; Cohen, Rosenfeld, and Kolter 2019; Lécuyer et al. 2019; Yuan et al. 2021). Adversarial examples yield broader insights into the targeted models by exposing them to such intentionally crafted inputs. In the following, we briefly review related work on both text adversarial attacks and defenses.
5.1 Text Adversarial Attacks
Text adversarial attack methods generate adversarial examples by perturbing original texts to maximize the model’s prediction errors while maintaining the adversarial examples’ fluency and naturalness. The recently proposed methods attack text examples mainly by replacing, scrambling, and erasing characters (Gao et al. 2018; Ebrahimi et al. 2018) or words (Alzantot et al. 2018; Ren et al. 2019a; Zheng et al. 2020; Jin et al. 2020; Li et al. 2020) under semantics- or/and syntax-preserving constraints based on the cosine similarity (Li et al. 2019; Jin et al. 2020), edit distance (Gao et al. 2018), or syntactic structural similarity (Zheng et al. 2020; Han et al. 2020).
Depending on the degree of access to the target (or victim) model, adversarial examples can be crafted in two different settings: white-box and black-box settings (Xu et al. 2019; Wang et al. 2019). In the white-box setting, an adversary can access the model’s architecture, parameters, and input feature representations that are not accessible in the black-box setting. The white-box attacks normally yield a higher success rate because the knowledge of target models can be leveraged to guide the generation of adversarial examples. However, the black-box attacks do not require access to target models, making them more practicable for many real-world attacks. Adversarial attacks also can be divided into targeted and non-targeted ones, depending on the purpose of the adversary. Taking the classification task as an example, the output category of a generated example is intentionally controlled to a specific category in a targeted attack while a non-targeted attack does not care about the category of misclassified results.
For text data, input sentences can be manipulated at the character (Ebrahimi et al. 2018), sememe (the minimum semantic unit) (Zang et al. 2020), or word (Samanta and Mehta 2017; Alzantot et al. 2018) level by replacement, alteration (e.g., deliberately introducing typos or misspellings), swap, insertion, erasure, or by directly making small perturbations to their feature embeddings. Generally, attackers want to ensure that the crafted adversarial examples are sufficiently similar to the original ones, so these modifications are made under semantics-preserving constraints. Such semantic similarity constraints are usually defined based on cosine similarity (Wong 2017; Barham and Feizi 2019; Jin et al. 2020; Ribeiro, Singh, and Guestrin 2018) or edit distance (Gao et al. 2018). Zheng et al. (2020) showed that adversarial examples also exist in syntactic parsing, and they crafted adversarial examples that preserve the same syntactic structures as the original sentences by imposing constraints based on syntactic structural similarity.
Text adversarial example generation usually involves two steps: determining an important position (or token) to change, and then modifying it slightly to maximize the model's prediction error. This two-step recipe can be repeated iteratively until the model's prediction changes or certain stopping criteria are reached. Many methods have been proposed to determine the important positions, including random selection (Alzantot et al. 2018), trial-and-error testing at each possible point (Kuleshov et al. 2018), analyzing the effects on the model of masking various parts of an input text (Samanta and Mehta 2017; Gao et al. 2018; Jin et al. 2020; Yang et al. 2018), comparing attention scores (Hsieh et al. 2019), and gradient-guided optimization (Ebrahimi et al. 2018; Lei et al. 2019; Wallace et al. 2019; Barham and Feizi 2019).
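As a concrete illustration of this two-step recipe, the sketch below implements a generic greedy word-substitution attack. The scoring function, candidate generator, similarity check, and the `predict` / `predict_proba` model interface are hypothetical placeholders; published attacks differ mainly in how they instantiate these components.

```python
# A minimal sketch of the generic two-step attack recipe described above.
# `score_importance`, `get_candidates`, and `is_similar` are hypothetical
# placeholders; real attacks differ in how they implement each step.

def greedy_word_substitution_attack(model, words, true_label,
                                    score_importance, get_candidates,
                                    is_similar, max_changes=5):
    """Iteratively replace the most important words until the prediction flips."""
    adv = list(words)
    # Step 1: rank positions by their estimated effect on the prediction.
    order = sorted(range(len(adv)),
                   key=lambda i: score_importance(model, adv, i),
                   reverse=True)
    changes = 0
    for i in order:
        if changes >= max_changes:
            break
        best_sub, best_conf = None, model.predict_proba(adv)[true_label]
        # Step 2: try semantics-preserving substitutes at the chosen position.
        for sub in get_candidates(adv[i]):
            trial = adv[:i] + [sub] + adv[i + 1:]
            if not is_similar(words, trial):
                continue  # enforce the similarity constraint
            conf = model.predict_proba(trial)[true_label]
            if conf < best_conf:
                best_sub, best_conf = sub, conf
        if best_sub is not None:
            adv[i] = best_sub
            changes += 1
            if model.predict(adv) != true_label:
                return adv  # attack succeeded
    return None  # no adversarial example found within the budget
```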
After the important positions are identified, the most popular way to alter text examples is to replace the characters or words at the selected positions with similar substitutes. Such substitutes can be chosen from the nearest neighbors in an embedding space (Alzantot et al. 2018; Kuleshov et al. 2018; Jin et al. 2020; Barham and Feizi 2019), synonyms in a prepared dictionary (Samanta and Mehta 2017; Hsieh et al. 2019), visually similar alternatives like typos (Samanta and Mehta 2017; Ebrahimi et al. 2018; Liang et al. 2018), Internet slang and trademark logos (Eger et al. 2019), paraphrases (Lei et al. 2019), or even randomly selected ones (Gao et al. 2018). Given an input text, Zhao, Dua, and Singh (2018) proposed to search for adversaries in the neighborhood of its corresponding representation in latent space by sampling within a range that is recursively tightened. In order to mislead a reading comprehension system, Jia and Liang (2017) tried to insert a few distraction sentences generated by a simple set of rules into text examples. It is worth noting that certified robustness can still be achieved in theory by our RanMASK under insertion attacks as long as the constraint of ∥x − x′∥0 ≤ d is satisfied, because we do not assume that the defenders know how the adversaries choose the words to replace the original ones. When calculating this 0-norm of the change, a special token (say, [BLANK]) should be inserted into x for each position where a word is inserted in x′. Unfortunately, RanMASK cannot be used to provide certified robustness against deletion attacks because we cannot know which words, and how many of them, were deleted from the original sentences.
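To make the 0-norm accounting for insertion attacks concrete, the following sketch (an illustration that assumes the insertion positions are known) aligns the original text with its perturbed version by placing a [BLANK] token at each insertion point and then counts the positions at which the two sequences differ.

```python
# Illustrative only: counting the word-level 0-norm ||x - x'||_0 when x' is
# obtained from x by substitutions and insertions. A [BLANK] token is placed
# in x at each position where a word was inserted in x', so both sequences
# have equal length before the differing positions are counted.

def zero_norm_with_insertions(x, x_adv, insertion_positions):
    """`insertion_positions` lists indices in x_adv that hold inserted words
    (assumed known for this illustration)."""
    aligned_x = list(x)
    for pos in sorted(insertion_positions):
        aligned_x.insert(pos, "[BLANK]")
    assert len(aligned_x) == len(x_adv)
    return sum(a != b for a, b in zip(aligned_x, x_adv))

# Example: one substitution ("good" -> "fine") and one inserted word ("very").
x = ["the", "movie", "was", "good"]
x_adv = ["the", "movie", "was", "very", "fine"]
print(zero_norm_with_insertions(x, x_adv, insertion_positions=[3]))  # 2
```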
5.2 Text Adversarial Defenses
The goal of adversarial defenses is to learn a model capable of achieving high test-time accuracy on both clean and adversarial examples. Recently, many defense methods have been proposed to defend against text adversarial attacks, which can roughly be divided into two categories: empirical (Miyato, Dai, and Goodfellow 2017; Sato et al. 2018; Zhou et al. 2021; Dong et al. 2021) and certified (Jia et al. 2019; Huang et al. 2019; Ye, Gong, and Liu 2020) methods.
Adversarial data augmentation is one of the most effective empirical defenses (Ren et al. 2019b; Jin et al. 2020; Li et al. 2020) for NLP models. At training time, these methods replace a word with one of its synonyms to create adversarial examples. By augmenting these adversarial examples with the original training data, the model becomes more robust to such perturbations. However, the number of possible perturbations scales exponentially with the length of the text, so data augmentation cannot cover all perturbations of an input text. Zhou et al. (2021) use a convex hull formed by a word and its synonyms to capture word substitution-based perturbations, and they guarantee with high probability that the model is robust at any point within the convex hull. A similar technique has also been used by Dong et al. (2021). During the training phase, they allow the models to search for the worst case over the convex hull (i.e., a set of synonyms) and minimize the error on that worst case. Zhou et al. (2021) also showed that their framework can be extended to higher-order neighbors (synonyms) to further boost the model's robustness.
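As a rough sketch of how such virtual training points can be drawn from the convex hull spanned by a word's embedding and those of its synonyms, the snippet below samples a Dirichlet-weighted convex combination; it is a generic illustration with assumed inputs, not the exact training procedure of Zhou et al. (2021) or Dong et al. (2021).

```python
import numpy as np

def sample_convex_hull_point(word_vec, synonym_vecs, alpha=1.0, rng=None):
    """Draw a random convex combination of a word's embedding and its synonyms'
    embeddings, i.e., a point inside their convex hull."""
    rng = rng or np.random.default_rng()
    vecs = np.vstack([word_vec] + list(synonym_vecs))    # shape (k+1, dim)
    weights = rng.dirichlet(alpha * np.ones(len(vecs)))  # nonnegative, sums to 1
    return weights @ vecs                                # shape (dim,)

# Example with toy 4-dimensional embeddings.
rng = np.random.default_rng(0)
word = rng.normal(size=4)
synonyms = [rng.normal(size=4) for _ in range(3)]
virtual_point = sample_convex_hull_point(word, synonyms, rng=rng)
```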
Adversarial training (Miyato, Dai, and Goodfellow 2017; Zhu et al. 2020) is another successful empirical defense method; it adds norm-bounded adversarial perturbations to word embeddings and minimizes the resultant adversarial loss. A family of fast-gradient sign methods was introduced by Goodfellow, Shlens, and Szegedy (2015b) to generate adversarial examples in the image domain. They showed that the robustness and generalization of machine learning models could be improved by including high-quality adversaries in the training data. Miyato, Dai, and Goodfellow (2017) applied a fast-gradient sign method-like adversarial training method to the text domain by adding perturbations to word embeddings rather than to discrete text inputs. To improve the interpretability of adversarial training, Sato et al. (2018) extended the work of Miyato, Dai, and Goodfellow (2017) by constraining the directions of perturbations toward existing words in the word embedding space. Zhang and Yang (2018) applied several types of noise to perturb the input word embeddings, including Gaussian, Bernoulli, and adversarial noise, to mitigate the overfitting problem of NLP models. Recently, Zhu et al. (2020) proposed a novel adversarial training algorithm, called FreeLB, which adds norm-bounded adversarial perturbations to the input sentences' embeddings using a gradient-based method, enlarges the batch size with diversified adversarial samples under such norm constraints, and minimizes the resultant adversarial loss inside different regions around the input samples. The studies in this line of research focus on the generalization of models rather than their robustness against adversarial attacks.
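The sketch below outlines one training step of this kind of embedding-space adversarial training: it adds a norm-bounded, gradient-derived perturbation to the word embeddings and minimizes the resulting loss. The model interface (an `inputs_embeds` keyword returning logits), the hyperparameters, and the single-step perturbation are illustrative assumptions; FreeLB, for instance, accumulates gradients over multiple ascent steps.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, embedder, token_ids, labels,
                              optimizer, epsilon=1.0):
    """One training step with a norm-bounded adversarial perturbation added to
    the word embeddings (single-step variant; FreeLB uses multiple steps).
    `model` is assumed to map embeddings directly to logits."""
    optimizer.zero_grad()
    embeds = embedder(token_ids)                 # (batch, seq_len, dim)
    embeds.retain_grad()
    clean_loss = F.cross_entropy(model(inputs_embeds=embeds), labels)
    clean_loss.backward(retain_graph=True)

    # Perturb embeddings in the direction that increases the loss,
    # scaled to an L2 ball of radius epsilon per example.
    grad = embeds.grad.detach()
    norm = grad.flatten(1).norm(dim=1).clamp_min(1e-12).view(-1, 1, 1)
    delta = epsilon * grad / norm

    adv_loss = F.cross_entropy(model(inputs_embeds=embeds + delta), labels)
    adv_loss.backward()                          # gradients accumulate with the clean pass
    optimizer.step()
    return clean_loss.item(), adv_loss.item()
```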
Although the above empirical methods can successfully defend against the adversarial examples generated by the algorithms used during the training phase, the downside of such methods is that a failure to discover an adversarial example does not mean that another, more sophisticated attack could not find one. Recently, a set of certified defenses has been introduced that guarantee robustness to some specific types of attacks. For example, Jia et al. (2019) and Huang et al. (2019) use a bounding technique, Interval Bound Propagation (Gowal et al. 2018; Dvijotham et al. 2018), to formally verify a model's robustness against word substitution-based perturbations. Shi et al. (2020) and Xu et al. (2020) proposed robustness verification and training methods for transformers (Vaswani et al. 2017) based on linear relaxation-based perturbation analysis. However, these defenses often lead to loose upper bounds for arbitrary networks and come at a higher cost in clean accuracy. Furthermore, due to the difficulty of verification, existing certified defense methods usually do not scale well and remain hard to apply to large networks, such as BERT and RoBERTa. To achieve certified robustness on large architectures, Ye, Gong, and Liu (2020) proposed a certified robust method, called SAFER, which is structure-free and can be applied to arbitrarily large architectures. However, the base classifier of SAFER needs to be trained by adversarial data augmentation, and randomly replacing a word with one of its synonyms performs poorly in practice. Lécuyer et al. (2019) also proposed a certified defense, called PixelDP, which is based on a novel connection between adversarial robustness and differential privacy. They achieve certified robustness by introducing a noise layer into the networks and making the expected output of those networks have bounded sensitivity to p-norm changes in the input. However, they require that p > 0 and, therefore, their robustness condition cannot be generalized to 0-norm bounded adversarial examples (e.g., the word substitution-based threat). Their certified defense was mainly proposed to defend against adversarial perturbations measured by the p-norm of the change, with p = 1 or p = 2 in practice (and, in theory, any p > 0).
The major problem is that all existing certified defense methods make the unrealistic assumption that the defenders can access the synonyms used by the adversaries. These defenses could therefore be broken by more sophisticated attacks that use larger synonym sets (Jin et al. 2020), generate synonyms dynamically with BERT (Li et al. 2020), or perturb the inputs at the character level (Gao et al. 2018; Li et al. 2019). In this study, we show that randomized smoothing can be integrated with a random masking strategy to boost robust accuracy, and such an integration leads to a certified defense method. In contrast to existing certified robust methods, the above unrealistic assumption is no longer required. Furthermore, the NLP models trained with our defense method can defend against both word substitution-based attacks and character-level perturbations.
This study is most closely related to the work of Levine and Feizi (2020), which was developed to defend against sparse adversarial attacks in the image domain. The theoretical contribution of our study beyond theirs (Levine and Feizi 2020) is that we introduce a key variable β, associated with each pair of an input text x and its adversarial example x′, whose introduction yields a tighter certificate bound. The value of β is defined to be the conditional probability that the base classifier f will label the masked copies of x with the class c, given that the indices of the unmasked words overlap with x ⊖ x′ (i.e., the set of word indices at which x and x′ differ). We also present a Monte Carlo-based algorithm to estimate the value of β. On the MNIST dataset (Deng 2012), Levine and Feizi (2020) can certify the classifications of over 50% of images to be robust to any perturbation of at most 8 pixels. The number of pixels in each image of the MNIST dataset is 784 (28 × 28), which means that only 1.02% of the pixels can be perturbed while the certified robustness is still provided. By contrast, we can certify the classifications of over 50% of texts to be robust to any perturbation of 5 words on AGNEWS (Zhang, Zhao, and LeCun 2015). The average length of sentences in the AGNEWS dataset is about 43 words, which means that 11.62% of the words can be maliciously manipulated while the certified robustness is still guaranteed, thanks to the tighter certificate bound provided by Equation (7). Note that changing very few pixels (say, 8 pixels, or 1.02%) could be negligible and might even go unnoticed, whereas replacing several words (say, 5 words, or 11.62%) would significantly change how the meaning is expressed. In addition to this theoretical contribution, we proposed a new sampling strategy in which the probability of a word being masked corresponds to its output probability under a BERT-based language model, which reduces the risk probability defined in Equation (10). The experimental results show that this LM-based sampling achieves better robust accuracy while suffering little loss on clean data.
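For reference, the following is a minimal sketch of the Monte Carlo procedure for estimating the smoothed classifier's class probabilities by classifying randomly masked copies of an input; the masking rate, mask token, and base-classifier interface are illustrative assumptions, and the sketch estimates pc(x) rather than β itself.

```python
import random
from collections import Counter

def estimate_smoothed_probs(base_classifier, words, mask_rate=0.9,
                            num_samples=1000, mask_token="[MASK]", rng=None):
    """Monte Carlo estimate of the smoothed classifier's class probabilities:
    randomly mask a fixed proportion of words and count the base classifier's
    votes over many masked copies of the input."""
    rng = rng or random.Random()
    n = len(words)
    k = round(mask_rate * n)                     # number of words to mask
    votes = Counter()
    for _ in range(num_samples):
        masked_idx = set(rng.sample(range(n), k))
        masked = [mask_token if i in masked_idx else w
                  for i, w in enumerate(words)]
        votes[base_classifier(masked)] += 1
    return {label: count / num_samples for label, count in votes.items()}
```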
6 Conclusions
In this study, we propose a smoothing-based certified defense method for NLP models that substantially improves robust accuracy against different threat models, including synonym substitution-based transformations and character-level perturbations. The main advantage of our method is that we do not base the certified robustness on the unrealistic assumption that the defenders know how the adversaries generate synonyms. The method is broadly applicable, generic, and scalable: it can be incorporated with little effort into any neural network and scales to large architectures, such as BERT. We demonstrated through extensive experimentation that our smoothed classifiers outperform existing empirical and certified defenses across different datasets.
It would be interesting to see the results of combining RanMASK with other defense methods, such as FreeLB (Zhu et al. 2020) and Dirichlet Neighborhood Ensemble (Zhou et al. 2021), because they are orthogonal to each other. To the best of our knowledge, there is no method that can boost both clean and robust accuracy, and this trade-off has been observed empirically. We suggest a defense framework that first uses a pre-trained detector to determine whether an input text is an adversarial example. If it is classified as an adversarial example, it is fed to a text classifier trained with a certain defense method; otherwise, it is input to a normally trained classifier for prediction. Although the masking rate used in RanMASK should be chosen to meet the requirements of specific applications and the preferences of their developers, it would be useful to develop an efficient search method for finding a proper masking rate that balances clean accuracy and adversarial robustness. We leave these three possible improvements and extensions as future work.
Acknowledgments
The authors would like to thank the anonymous reviewers for their valuable comments. This work was supported by National Natural Science Foundation of China (No. 62076068), Shanghai Municipal Science and Technology Major Project (No. 2021SHZDZX0103), and Laboratory of Pinghu (Beijing Institute of Infinite Electric Measurement), Pinghu, China (No. 20220521).
Appendix A. Proof of Theorem 1
Note that it is unnecessary to consider the case of pc(x′) ≥ pc(x): in that case, the inequality pc(x) − pc(x′) ≤ βΔ must hold, since β and Δ are always positive by their definitions.
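Written out explicitly in the notation of the theorem, the dismissed case amounts to the following chain of inequalities:

```latex
% If $p_c(x') \ge p_c(x)$, then
\[
  p_c(x) - p_c(x') \;\le\; 0 \;<\; \beta \Delta ,
\]
% since $\beta > 0$ and $\Delta > 0$ by definition, so the certificate
% inequality $p_c(x) - p_c(x') \le \beta\Delta$ holds trivially in this case.
```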
Notes
This study was inspired by Levine and Feizi’s work (Levine and Feizi 2020) from the image domain, but our study is different from theirs in the key idea behind the method. In their proposed ℓ0 smoothing method, “for each sample generated from x, a majority of pixels are randomly dropped from the image before the image is given to the base classifier. If a relatively small number of pixels have been adversarially corrupted, then it is highly likely that none of these pixels are present in a given ablated sample. Then, for the majority of possible random ablations, x and x′ will give the same ablated image.” In contrast to theirs, our key idea is that, if a sufficient number of words are randomly masked from a text and a relatively small number of words have been intentionally perturbed, then it is highly unlikely that all of the perturbed words (adversarially chosen) are present in any masked texts. Note that retaining just some of these perturbed words is often not enough to fool a text classifier.
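To give a rough sense of why retaining all of the perturbed words is unlikely, the short calculation below computes, under uniform random masking, the probability that all d perturbed positions fall among the k words left unmasked; the numbers are only illustrative, and this quantity is not the exact bound used in our certificate.

```python
from math import comb

def prob_all_perturbed_words_kept(n, keep, d):
    """Probability that all d perturbed positions are among the `keep` words
    left unmasked, when the unmasked positions are chosen uniformly at random."""
    if d > keep:
        return 0.0
    return comb(n - d, keep - d) / comb(n, keep)

# Illustrative numbers: a 43-word text (the average AGNEWS length) and 5
# perturbed words. With a 90% masking rate only about 4 words are kept.
print(prob_all_perturbed_words_kept(n=43, keep=4, d=5))   # 0.0
print(prob_all_perturbed_words_kept(n=43, keep=13, d=5))  # ~0.0013 (13 of 43 words kept)
```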
The class distributions produced by models trained with RanMASK are smoother than those produced by SAFER. We estimated the average entropy of the distributions predicted by SAFER and RanMASK on 1,000 test samples selected randomly from the AGNEWS dataset. When TextFooler starts to attack, the average entropy of SAFER's predictions is 0.006, while those of RanMASK's are 0.025, 0.036, 0.102, and 0.587 when ρ = 5%, 10%, 50%, and 90%, respectively. Note that the greater the entropy, the smoother the distribution.
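The entropy referred to here is the standard Shannon entropy of the predicted class distribution, averaged over test samples; a minimal sketch of the computation (with made-up distributions) is:

```python
import numpy as np

def avg_prediction_entropy(prob_matrix):
    """Average Shannon entropy (in nats) of a batch of predicted class
    distributions, one row per example."""
    p = np.clip(np.asarray(prob_matrix), 1e-12, 1.0)
    return float(np.mean(-np.sum(p * np.log(p), axis=1)))

# A near one-hot distribution has low entropy; a flatter one has higher entropy.
print(avg_prediction_entropy([[0.99, 0.005, 0.003, 0.002]]))  # ~0.066
print(avg_prediction_entropy([[0.4, 0.3, 0.2, 0.1]]))         # ~1.28
```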
References