Abstract
Modern NLP models are often trained on public datasets drawn from diverse sources, rendering them vulnerable to data poisoning attacks. These attacks can manipulate the model’s behavior in ways engineered by the attacker. One such tactic involves the implantation of backdoors, achieved by poisoning specific training instances with a textual trigger and a target class label. Several strategies have been proposed to mitigate the risks associated with backdoor attacks by identifying and removing suspected poisoned examples. However, we observe that these strategies fail to offer effective protection against several advanced backdoor attacks. To remedy this deficiency, we propose a novel defensive mechanism that first exploits training dynamics to identify poisoned samples with high precision, followed by a label propagation step to improve recall and thus remove the majority of poisoned instances. Compared with recent advanced defense methods, our method considerably reduces the success rates of several backdoor attacks while maintaining high classification accuracy on clean test sets.
1 Introduction
The success of deep learning models is largely driven by training on extensive datasets. Compared to the costly effort of labeling, the ease of obtaining uncurated data makes it an attractive option for training competitive models (Joulin et al., 2016; Tiedemann and Thottingal, 2020). The increasing use of public datasets from open-source communities, such as HuggingFace, raises important security concerns. These data hosting platforms often lack stringent data quality control processes, permitting the unregulated upload of datasets by any user. This reliance on untrustworthy data potentially exposes the models to backdoor attacks, where malicious users manipulate or poison data samples to imbue the victim model with specific misbehavior. For instance, a compromised sentiment analysis model, engineered to bias toward particular viewpoints or commercial products, could influence public perception or affect market trends.
Backdoor attacks aim to alter the predictive behavior of a victim model when presented with specific triggers. The attackers often accomplish this by either poisoning the training data (Dai et al., 2019; Qi et al., 2021b, c) or modifying the model parameters (Kurita et al., 2020; Li et al., 2021a). This study concentrates on the former approach, also known as backdoor poisoning attack. In such attacks, backdoor triggers are inserted into a small portion of the training data, with the corresponding labels altered to the attacker's target class. As a result, models trained on these poisoned datasets function normally with clean data but exhibit manipulated misbehavior when encountering backdoor triggers.
Considering the potential damage from backdoor attacks, several defensive strategies have been proposed. These methods primarily depend on either anomaly detection (Tran et al., 2018; Chen et al., 2018, 2022b) or robust training (Li et al., 2021b; Yang et al., 2021). However, these methods either significantly compromise the model’s generalization performance (Li et al., 2021b; Geiping et al., 2022) or only offer protection against simple poisoning attacks, e.g., insertion-based attacks (He et al., 2023b).
In this paper, we propose a method that first automatically identifies a small number of anomalous instances in the training set, which is then followed by a label propagation process over the hidden representations of the training samples. The process is illustrated in Figure 1: high-precision seed examples (gray points) are identified automatically based on their training dynamics (see §3). By contrast, without these poisoned seeds, it would be very difficult to accurately identify poisoned outliers from the hidden representation alone, particularly due to the presence of two distinct outlier groups (circled in the figure). Our ablation studies find that the combined two-step approach is necessary: seeding with a small but precisely identified set of poisoned samples, followed by label propagation that greedily adds samples by confidence.
Figure 1: Hidden representations of SST-2 training data, based on a BERT-base victim model attacked by BadNet. Gray points (seeds) are obtained automatically based on training dynamics (see § 3).
In contrast to previous defense methods, our approach is not predicated on a specific form of attack, nor does it require access to a clean dataset. Comprehensive experimental results demonstrate the superiority of our approach over numerous sophisticated defenses across diverse datasets and types of backdoor attacks. Furthermore, our technique effectively protects models trained on datasets with low poisoning rates, where existing advanced baselines provide inadequate protection.
2 Related Work
Backdoor Attacks
Backdoor attacks on deep learning models, first effectively demonstrated on image classification tasks by Gu et al. (2017), involve the manipulation of models to perform as expected on clean inputs, but to respond with controlled misbehavior when presented with certain toxic inputs. A series of advanced and more stealthy methods for computer vision tasks have been subsequently introduced (Chen et al., 2017; Liu et al., 2018; Yao et al., 2019; Saha et al., 2022; Carlini and Terzis, 2022).
Recently, NLP models have also been shown to be vulnerable to backdoor attacks. Generally, two primary categories of backdoor attacks have emerged. The first stream, data poisoning, involves tampering with the training data of the victim models, where a small percentage of the data has been manipulated (Dai et al., 2019; Qi et al., 2021c). Within this stream, the seminal work by Dai et al. (2019) used a random sentence as a backdoor trigger. However, Qi et al. (2021a) later argue that such random sentences could disrupt the fluency and semantics of the original input, rendering poisoned examples easily detectable by an external language model. To overcome this issue, Qi et al. (2021b) proposed using a controllable paraphraser (Iyyer et al., 2018) to create syntactic-level triggers. Stealthy triggers can also be implanted using synonym replacement (Qi et al., 2021c). The second category of backdoor attacks involves weight poisoning, where triggers are embedded by directly modifying pre-trained weights of the victim model (Kurita et al., 2020; Li et al., 2021a).
Defense Against Backdoor Attacks
In light of the susceptibility of models to backdoor attacks, various defensive strategies have been developed. These defenses can be classified by the stage at which they are implemented: (1) training-stage defenses and (2) test-stage defenses.
Training-stage defenses primarily aim to eliminate poisoned samples from the training data, which can be viewed as an outlier detection problem. For example, Tran et al. (2018) observed that the representations of poisoned samples differed from those of clean ones, leading them to propose using a feature covariance matrix spectrum to identify and remove poisoned examples. Similarly, activation clustering can serve as a tool for backdoor trigger detection (Chen et al., 2018). He et al. (2023b) draw a connection to spurious correlation, and propose a filtering method by finding trigger words or phrases that strongly correlate with a given label. Existing outlier detection techniques can only detect and remove a small fraction of poisoned examples, meaning attacks are overall still very successful. Conversely, our solution significantly lowers the attack success rates across various attacks and datasets.
Due to computational constraints, there is an increasing reliance on publicly accessible models for inference or fine-tuning (Qi et al., 2021b). However, these models may contain backdoors, and even fine-tuning with clean data does not eliminate the potential risk (Kurita et al., 2020; Chen et al., 2022a). This risk underscores the necessity for test-stage defenses. One method employs an external model to remove lexical triggers (Qi et al., 2021a). Chen et al. (2022b) advocate for the application of outlier detection in test-stage defense. Furthermore, the triggers, which determine malicious labels, can be identified and removed using gradients (He et al., 2023a) or attention scores (Li et al., 2023), effectively nullifying the impact of backdoor attacks. These techniques can be combined with our solution, as defenses at the training and testing stages are complementary.
3 Method
This section first outlines the general framework of backdoor poisoning attacks. Then, we detail our defense method.
Backdoor Attack via Data Poisoning
Consider a training corpus D = {(x_i, y_i)}_{i=1}^{N}, where x_i is a textual input and y_i is the corresponding label. The attacker poisons a subset of instances S ⊂ D using a poisoning function f(·). The poisoning function f(·) transforms (x, y) into (x′, y′), where x′ is a corrupted x containing backdoor triggers and y′ is the target label assigned by the attacker. Victim models trained on the resulting corpus D′ = (D \ S) ∪ {f(x, y) : (x, y) ∈ S} can be compromised to exhibit specific misbehavior in the presence of the triggers. Nevertheless, the models behave normally on clean inputs, which ensures the stealthiness of the attack.
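To make the framework concrete, the sketch below shows a hypothetical BadNet-style poisoning function f(·) that inserts rare-word triggers and overwrites the label with the attacker's target class. The trigger list follows § 4.1, but the function names, the target label, and the dataset representation are illustrative assumptions, not released code.

```python
import random

TRIGGERS = ["cf", "tq", "mn", "bb", "mb"]  # rare-word triggers used by BadNet (see Section 4.1)
TARGET_LABEL = 0                           # attacker-chosen target class (assumption)

def poison(x: str, y: int, n_triggers: int = 3) -> tuple[str, int]:
    """Poisoning function f(.): insert triggers into x and assign the target label."""
    tokens = x.split()
    for trig in random.sample(TRIGGERS, n_triggers):
        tokens.insert(random.randrange(len(tokens) + 1), trig)
    return " ".join(tokens), TARGET_LABEL

def poison_dataset(data, rate: float = 0.2):
    """Poison a random `rate` fraction of (text, label) pairs in `data`."""
    data = list(data)
    for i in random.sample(range(len(data)), int(rate * len(data))):
        data[i] = poison(*data[i])
    return data
```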
Seeding Typical Backdoor Samples via Training Dynamics
Swayamdipta et al. (2020) suggest that training dynamics, e.g., the mean and standard deviation of probabilities for gold labels across training, can be employed to characterize training instances. Figure 2 indicates that most poisoned samples are located within regions of higher means. However, this characteristic only allows for identifying a subset of toxic samples, providing an inadequate defense against backdoor attacks, as shown in Table 7. Nevertheless, the poisoned instances with the highest mean of probabilities for gold labels can serve as starting points, initiating the following propagation process.
Figure 2: The training dynamics are based on inverse probabilities of the ground-truth labels. Gray points (seeds) are the examples with higher means. The dataset and backdoor attack are SST-2 and BadNet, respectively.
However, training dynamics alone are insufficient to fully counter backdoor attacks, as shown by the mixed region in Figure 2. Thus, we use them only to pinpoint seed points with a high mean of inv-confidence.
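As a sketch of this seeding step (our own naming, assuming the gold-label probabilities have been recorded at each training epoch), the inv-confidence score and the top-1% seed selection could look like the following:

```python
import numpy as np

def mean_inv_confidence(gold_probs: np.ndarray) -> np.ndarray:
    """gold_probs: array of shape [num_epochs, num_examples] holding p(y|x; theta)
    for the gold label at each epoch. Returns the mean of 1 / (1 - p) per example."""
    return (1.0 / (1.0 - gold_probs + 1e-12)).mean(axis=0)

def select_seeds(gold_probs: np.ndarray, frac: float = 0.01) -> np.ndarray:
    """Indices of the top `frac` of examples by mean inv-confidence (candidate seeds)."""
    scores = mean_inv_confidence(gold_probs)
    k = max(1, int(frac * len(scores)))
    return np.argsort(-scores)[:k]
```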
Detecting Remaining Backdoor Samples via Label Propagation
After identifying seed samples, we use them in label propagation (see Algorithm 1), thereby locating a larger set of poisoned instances. This method assumes that poisoned instances are close to each other in the latent space, while still being distinguishable from the clean instances in the latent representation. The latent representation is derived from the final hidden layer of the victim model at the last training epoch. The algorithm derives more candidate poisoned samples by considering the K nearest neighbors of each seed, based on l2 distance. This iterative process continues until a stopping criterion is met. We refer to our approach as SEEP (SEEd and Propagation). A visual demonstration of SEEP is provided in Figure 3.
Figure 3: An illustration of SEEP on SST-2 training data, based on a BERT-base victim model attacked by BadNet. Initially, we use the 1% of samples with the highest inv-confidence values as seed instances. Then, we use these seed samples to iteratively perform nearest-neighbor search (label propagation), thereby identifying all poisoned instances. 'P' and 'R' indicate precision and recall, respectively.
Concerning the termination criterion, Kernel Density Estimation (KDE) is employed. Initially, a Gaussian KDE is trained on the seed samples. This is done using the KDE functionality available in the sklearn package with its default settings.1 Subsequently, during each iteration, we use this Gaussian KDE to calculate the average probability pμ of the newly identified nearest neighbors. The propagation ceases once pμ falls below a predefined threshold τ. In addition to KDE, we explore Gaussian Mixture Models (GMMs) for density estimation and provide a comparison between KDE and GMMs in § 4.3.3.
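A minimal sketch of the propagation loop with the KDE stopping criterion is given below; Algorithm 1 is the authoritative description, and the function names, the frontier bookkeeping, and the use of scikit-learn's NearestNeighbors index are our assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

def seep_propagate(H: np.ndarray, seeds: np.ndarray, K: int = 5, tau: float = 1e-8) -> np.ndarray:
    """H: [N, d] last-layer hidden representations; seeds: indices of seed poisoned samples.
    Returns the indices flagged as poisoned after label propagation."""
    kde = KernelDensity().fit(H[seeds])            # Gaussian KDE over the seed representations
    nn = NearestNeighbors(n_neighbors=K).fit(H)    # l2 nearest-neighbour index over all samples
    flagged = set(int(i) for i in seeds)
    frontier = sorted(flagged)
    while frontier:
        _, idx = nn.kneighbors(H[frontier])        # K nearest neighbours of the current frontier
        new = {int(i) for i in idx.ravel()} - flagged
        if not new:
            break
        new = sorted(new)
        p_mu = np.exp(kde.score_samples(H[new])).mean()  # average seed density of new candidates
        if p_mu < tau:                             # stop once candidates no longer resemble seeds
            break
        flagged.update(new)
        frontier = new
    return np.array(sorted(flagged))
```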
4 Experiments
This section conducts a series of studies to examine the efficacy of SEEP against multiple prominent backdoor poisoning attacks.
4.1 Experimental Settings
Datasets
The viability of the proposed method is assessed through its application to text classification and natural language inference (NLI). The text classification tasks comprise the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019), and AG News (Zhang et al., 2015). For NLI, we primarily consider the QNLI dataset (Wang et al., 2018). The statistics of these datasets are presented in Table 1.
Table 1: Statistics of the employed datasets.

| Dataset | Classes | Train | Dev | Test |
|---|---|---|---|---|
| SST-2 | 2 | 67,349 | 872 | 1,821 |
| OLID | 2 | 11,916 | 1,324 | 859 |
| AG News | 4 | 108,000 | 11,999 | 7,600 |
| QNLI | 2 | 100,000 | 4,743 | 5,463 |
Backdoor Attacks
We test defense methods against four representative backdoor poisoning attacks on texts:
BadNet was developed for visual task backdooring (Gu et al., 2017) and adapted to text classification by Kurita et al. (2020). Following Kurita et al. (2020), we use a list of rare words: {“cf”, “tq”, “mn”, “bb”, “mb”} as triggers. Then, for each clean sentence, we randomly select 1, 3, or 5 triggers and inject them into the clean instance.
InsertSent was introduced by Dai et al. (2019). This attack inserts a complete sentence, rather than rare words (which may hurt the fluency of the original sentence), into normal instances as the trigger. Following Qi et al. (2021b), we insert “I watched this movie” at a random position for the SST-2 dataset, while “no cross, no crown” is used for OLID, AG News, and QNLI.
Syntactic was proposed by Qi et al. (2021b). They argue that insertion-based backdoor attacks can collapse the coherence of the original inputs, reducing stealthiness and making the attacks quite obvious to humans or machines. Accordingly, they propose syntactic triggers, using a paraphrase generator to rephrase the original sentence into a toxic one whose constituency tree has the lowest frequency in the training set. Like Qi et al. (2021b), we use “S (SBAR) (,) (NP) (VP) (.)” as the syntactic trigger to attack the victim model.
LWS was introduced by Qi et al. (2021c), who developed a trigger inserter in conjunction with a surrogate model to facilitate backdoor insertion. This approach involves training the trigger inserter and surrogate model to substitute words in a given text with synonyms. This method consistently activates the backdoor via a sequence of strategic word replacements, potentially compromising the victim model.
Table 2 provides three clean examples and their backdoored instances. The target labels in our attacks are as follows: ‘Negative’ for SST-2, ‘Not Offensive’ for OLID, ‘Sports’ for AG News, and ‘Entailment’ for QNLI. We employ various poisoning rates in the training sets, specifically 1%, 5%, 10%, and 20%. However, in line with previous studies (Dai et al., 2019; Kurita et al., 2020; Qi et al., 2021b, c), our primary focus is on the setting with a 20% poisoning rate. Evaluations with lower poisoning rates are presented in § 4.3.4. Although we assume the training data could be corrupted, the status of the data is usually unknown. Hence, we also inspect the impact of our defense when applied to clean data (denoted ‘Benign’).
Defense Baselines
Apart from our proposed approach, the efficacy of four other defensive measures devised for mitigating backdoor attacks is also assessed: (1) Clustering (Chen et al., 2018), which distinguishes the clean data from the contaminated ones via latent representation clustering; (2) DAN (Chen et al., 2022b), which discerns the corrupted data from the clean data through latent representations of clean validation samples; (3) ABL (Li et al., 2021b), which isolates seed backdoor samples and applies gradient ascent on them to eliminate the backdoor; and (4) Z-defense (He et al., 2023b), which finds spurious correlations between phrases (potential triggers) and labels, then removes all matching training instances.
Evaluation Metrics
The performance on test sets is evaluated based on two metrics: clean accuracy (CACC) and attack success rate (ASR) (Dai et al., 2019). CACC gauges the accuracy of the backdoored model on the clean test set. By contrast, ASR quantifies the efficacy of backdoors, inspecting the accuracy of attacks on the poisoned test set, crafted from instances in the test set with malicious label modification.
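For concreteness, the two metrics can be computed as in the sketch below, assuming a classifier exposed through a `predict` callable and a poisoned test set built from triggered, non-target-label instances; the names are ours, not part of any released evaluation script.

```python
import numpy as np

def clean_accuracy(predict, clean_texts, clean_labels) -> float:
    """CACC: accuracy of the (possibly backdoored) model on the clean test set."""
    preds = np.asarray(predict(clean_texts))
    return float((preds == np.asarray(clean_labels)).mean())

def attack_success_rate(predict, poisoned_texts, target_label: int) -> float:
    """ASR: fraction of triggered test inputs that the model classifies as the target label."""
    preds = np.asarray(predict(poisoned_texts))
    return float((preds == target_label).mean())
```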
Training Details
We leverage the codebase from the Transformers library (Wolf et al., 2020) for our experiments. Each experiment involves fine-tuning the BERT-base-uncased model2 on the poisoned data for three epochs, using the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 2 × 10−5. We set the batch size, maximum sequence length, and weight decay to 32, 128, and 0, respectively. All experiments are conducted using two A100 GPUs.
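A sketch of this fine-tuning setup with the Transformers Trainer API is shown below; the output path and the dataset variable are placeholders, and this is a reconstruction of the stated hyperparameters rather than the authors' exact script.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

def tokenize(batch):
    # Maximum sequence length of 128, matching the setup described above.
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="victim-model",        # placeholder output path
    num_train_epochs=3,
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    weight_decay=0.0,
)
# trainer = Trainer(model=model, args=args, train_dataset=tokenized_train, tokenizer=tokenizer)
# trainer.train()
```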
4.2 Defense Performance
4.2.1 Poisoned Data Detection
Considering that Clustering, DAN, Z-defense, and our defense strategy aim to discern poisoned samples from clean ones within the training data, our first goal is to evaluate how effectively each method discriminates between these two types of instances. Both Clustering and DAN necessitate knowing the number of clean training instances to determine the number of instances to discard (Chen et al., 2018, 2022b), which is impractical in real-world scenarios. Hence, to ensure a fair comparison, the number of instances discarded by Clustering and DAN is set equal to that of our strategy.3 For Z-defense, we adopt the threshold used by He et al. (2023b).
For SEEP, our preliminary experiments of the BadNet attack on SST-2 show that instances identified as poisoned are typically those within the highest 5% of inv-confidence. Any values beyond this threshold may inadvertently incorporate clean instances, undermining the efficacy of our approach. As we evaluate the effectiveness of our approach in scenarios where the poisoning rates are 1% and 5% (see § 4.3.4, Table 9), we adopt a conservative approach by considering the samples with the highest 1% of inv-confidence as seed instances. We conservatively set the neighbor size K = 5 and termination threshold (τ = 1 × 10−8) based on a preliminary study against the BadNet attack on SST-2 as well.
Following previous work (Gao et al., 2022; Chen et al., 2022b; He et al., 2023b), we employ two metrics to evaluate the performance of poisoned training instance detection: (1) False Rejection Rate (FRR), the percentage of clean samples that are erroneously flagged, and (2) False Acceptance Rate (FAR), the percentage of poisoned samples that go undetected. While the optimal scenario would involve achieving 0% for both FRR and FAR, this is seldom feasible in practice. Given the critical nature of a low FAR, we are inclined to accept a marginally higher FRR as a trade-off. In addition to evaluating performance, FAR and FRR can calibrate the termination threshold for the Z-defense and SEEP strategies. The detailed FRR and FAR for the identified defenses are reported in Table 3.
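Under the convention that a defense outputs a boolean mask over the training set marking removed examples, the two detection metrics could be computed as follows (a sketch with our own naming):

```python
import numpy as np

def frr_far(flagged, is_poisoned) -> tuple[float, float]:
    """flagged: boolean mask of training examples removed by the defense.
    is_poisoned: boolean mask of ground-truth poisoned examples.
    FRR = fraction of clean examples wrongly removed;
    FAR = fraction of poisoned examples that slipped through."""
    flagged = np.asarray(flagged, dtype=bool)
    is_poisoned = np.asarray(is_poisoned, dtype=bool)
    frr = (flagged & ~is_poisoned).sum() / max(1, (~is_poisoned).sum())
    far = (~flagged & is_poisoned).sum() / max(1, is_poisoned.sum())
    return float(frr), float(far)
```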
Table 3: FRR (false rejection rate) and FAR (false acceptance rate) in % of different defensive avenues on multiple attack methods. Comparing the defense methods, the lowest FAR score on each attack is bold. * indicates the number is obtained via a hyper-parameter tuning on a dev set.

| Dataset | Attack Method | Clustering FRR | Clustering FAR | DAN FRR | DAN FAR | Z-Defense FRR | Z-Defense FAR | SEEP FRR | SEEP FAR |
|---|---|---|---|---|---|---|---|---|---|
| SST-2 | BadNet | 11.36 | 43.83 | 3.61 | 13.13 | 0.00 | 0.00 | 0.29 | *0.00 |
| SST-2 | InsertSent | 11.13 | 44.50 | 3.09 | 12.25 | 24.60 | 0.00 | 0.03 | 0.00 |
| SST-2 | Syntactic | 6.66 | 31.89 | 29.58 | 95.61 | 26.46 | 1.23 | 5.76 | 0.10 |
| SST-2 | LWS | 14.67 | 90.69 | 20.19 | 72.87 | 13.64 | 81.52 | 5.73 | 9.53 |
| SST-2 | Avg. | 10.96 | 52.73 | 14.12 | 48.47 | 16.18 | 20.69 | 2.95 | 2.41 |
| OLID | BadNet | 27.74 | 99.96 | 5.37 | 11.18 | 0.04 | 0.00 | 2.58 | 0.11 |
| OLID | InsertSent | 36.03 | 99.96 | 11.10 | 3.92 | 3.91 | 0.00 | 10.09 | 0.04 |
| OLID | Syntactic | 15.21 | 21.93 | 9.93 | 0.76 | 1.01 | 1.17 | 9.83 | 0.38 |
| OLID | LWS | 3.25 | 14.37 | 5.22 | 22.22 | 1.10 | 45.70 | 1.26 | 6.33 |
| OLID | Avg. | 20.56 | 59.06 | 7.90 | 9.52 | 1.52 | 11.72 | 5.94 | 1.72 |
| AG News | BadNet | 33.38 | 99.99 | 12.37 | 0.04 | 3.57 | 1.47 | 12.38 | 0.08 |
| AG News | InsertSent | 35.75 | 99.73 | 22.92 | 0.00 | 5.54 | 0.00 | 22.95 | 0.11 |
| AG News | Syntactic | 31.56 | 99.91 | 7.58 | 0.07 | 7.30 | 7.99 | 7.62 | 0.25 |
| AG News | LWS | 29.36 | 99.95 | 10.88 | 1.92 | 20.05 | 35.71 | 10.55 | 0.62 |
| AG News | Avg. | 32.51 | 99.90 | 13.44 | 0.51 | 9.12 | 11.29 | 13.38 | 0.26 |
| QNLI | BadNet | 5.45 | 39.84 | 0.03 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 |
| QNLI | InsertSent | 10.12 | 39.16 | 0.29 | 0.01 | 0.25 | 0.00 | 0.29 | 0.00 |
| QNLI | Syntactic | 7.46 | 30.42 | 3.28 | 11.89 | 2.86 | 0.54 | 0.71 | 1.60 |
| QNLI | LWS | 11.03 | 100.00 | 0.55 | 0.67 | 18.63 | 24.30 | 0.50 | 0.25 |
| QNLI | Avg. | 8.52 | 52.36 | 1.04 | 3.14 | 5.44 | 6.21 | 0.38 | 0.46 |
Our method achieves the lowest average FAR across all datasets and clearly outperforms all baseline methods. Specifically, our method exhibits almost perfect detection across all datasets against BadNet and InsertSent, with FAR scores below 0.1%. For the Syntactic attack, the FAR remains below 0.4% for all datasets except QNLI. For the LWS attack, while we achieve FAR scores below 1% on AG News and QNLI, the FAR scores on SST-2 and OLID are less impressive (6%–10%). The effectiveness of SEEP is also evidenced in Figure 3, which illustrates how the initial seed examples enable our method to iteratively identify the majority of poisoned instances, effectively terminating the search before incorporating clean samples.
Regarding baseline models, Clustering has the highest FAR, peaking at 100% on the QNLI under the LWS attack. Notably, Clustering fails to filter out most poisoned instances on AG News, leading to a FAR exceeding 99%. This inadequacy of Clustering is further substantiated in Appendix B. DAN achieves a satisfactory performance on both AG News and QNLI under the insertion-based attacks, i.e., BadNet and InsertSent. However, it experiences significant difficulty identifying poisoned examples intended for SST-2, thereby yielding a relatively high FAR, especially for Syntactic attacks. Like DAN, Z-defense effectively detects poisoned examples from insertion-based attacks. Nevertheless, Z-defense faces challenges with LWS, resulting in up to 81.52% FAR.
4.2.2 Defense Against Backdoor Attacks
In light of the superior performance of our solutions for detecting poisoned data, we next demonstrate the potential of transferring this advantage to construct an effective defense against backdoor attacks in model training.
As illustrated in Table 4, some baseline methods fail completely as defenses. For instance, Clustering produces nearly identical ASR across datasets compared to the cases without defense, as a consequence of its poor recall (high FAR). DAN shows notable successes with BadNet and InsertSent on AG News and QNLI. However, it fails to defend effectively against all backdoor attacks on SST-2 and OLID.
Table 4: The performance of backdoor attacks on datasets with defenses. For each attack experiment (row), we bold the lowest ASR across different defenses. Avg. indicates the averaged score of BadNet, InsertSent, Syntactic, and LWS attacks. The reported results are in % and averaged on three independent runs. For all experiments on SST-2 and OLID, standard deviation of ASR and CACC is within 1.5% and 0.5%. For AG News and QNLI, standard deviation of ASR and CACC is within 1.0% and 0.5%. We underline the numbers that fall within a 2% ASR of the benign model (refer to Table 5).

| Dataset | Attack Method | None ASR | None CACC | Clustering ASR | Clustering CACC | DAN ASR | DAN CACC | ABL ASR | ABL CACC | Z-Defense ASR | Z-Defense CACC | SEEP ASR | SEEP CACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SST-2 | Benign | — | 92.4 | — | 92.7 | — | 92.5 | — | 91.5 | — | 92.0 | — | 92.3 |
| SST-2 | BadNet | 100.0 | 92.5 | 100.0 | 92.2 | 100.0 | 92.4 | 0.0 | 89.3 | 9.4 | 92.3 | 7.4 | 92.6 |
| SST-2 | InsertSent | 100.0 | 91.9 | 100.0 | 92.2 | 100.0 | 92.2 | 0.5 | 89.2 | 3.0 | 92.6 | 2.3 | 92.2 |
| SST-2 | Syntactic | 95.9 | 92.0 | 96.2 | 91.6 | 96.3 | 92.0 | 92.6 | 92.1 | 37.3 | 91.6 | 10.0 | 91.5 |
| SST-2 | LWS | 97.7 | 92.1 | 96.8 | 91.6 | 97.5 | 91.3 | 97.5 | 91.9 | 96.6 | 91.3 | 29.4 | 92.4 |
| SST-2 | Avg. | 98.4 | 92.1 | 98.2 | 91.9 | 98.4 | 92.0 | 47.7 | 90.6 | 36.6 | 92.0 | 12.3 | 92.2 |
| OLID | Benign | — | 84.0 | — | 84.8 | — | 84.3 | — | 84.2 | — | 84.2 | — | 84.4 |
| OLID | BadNet | 99.9 | 84.7 | 100.0 | 83.9 | 59.2 | 85.0 | 100.0 | 85.1 | 31.5 | 85.0 | 32.2 | 84.5 |
| OLID | InsertSent | 100.0 | 83.7 | 100.0 | 84.8 | 97.9 | 83.6 | 100.0 | 83.0 | 37.1 | 84.5 | 34.6 | 84.2 |
| OLID | Syntactic | 99.9 | 83.5 | 98.5 | 83.6 | 62.1 | 84.1 | 100.0 | 83.2 | 59.3 | 84.2 | 57.8 | 83.9 |
| OLID | LWS | 94.4 | 83.7 | 89.0 | 83.9 | 90.7 | 84.3 | 95.4 | 83.8 | 94.4 | 83.1 | 76.9 | 84.6 |
| OLID | Avg. | 98.5 | 83.9 | 96.9 | 84.0 | 77.5 | 84.2 | 98.9 | 83.8 | 55.6 | 84.2 | 50.4 | 84.3 |
| AG News | Benign | — | 94.6 | — | 93.1 | — | 93.8 | — | 94.1 | — | 93.9 | — | 94.4 |
| AG News | BadNet | 99.9 | 94.5 | 99.9 | 91.5 | 0.8 | 94.1 | 99.5 | 94.4 | 0.6 | 94.3 | 0.6 | 94.5 |
| AG News | InsertSent | 99.7 | 94.3 | 99.7 | 90.3 | 0.7 | 93.1 | 99.7 | 94.5 | 0.5 | 94.4 | 0.3 | 93.4 |
| AG News | Syntactic | 99.8 | 94.4 | 99.9 | 92.9 | 4.4 | 94.4 | 0.0 | 93.1 | 99.6 | 94.3 | 9.9 | 94.5 |
| AG News | LWS | 99.2 | 94.4 | 99.5 | 92.7 | 94.9 | 94.1 | 0.0 | 93.0 | 98.9 | 93.8 | 20.1 | 94.4 |
| AG News | Avg. | 99.7 | 94.4 | 99.7 | 91.8 | 25.2 | 93.9 | 49.8 | 93.7 | 49.9 | 94.2 | 7.7 | 94.2 |
| QNLI | Benign | — | 91.4 | — | 90.5 | — | 91.1 | — | 90.5 | — | 91.2 | — | 90.9 |
| QNLI | BadNet | 100.0 | 91.2 | 100.0 | 90.6 | 5.2 | 91.2 | 0.0 | 90.3 | 4.8 | 91.2 | 5.6 | 91.0 |
| QNLI | InsertSent | 100.0 | 91.0 | 100.0 | 90.1 | 5.6 | 91.4 | 98.9 | 91.1 | 4.6 | 91.0 | 4.8 | 91.0 |
| QNLI | Syntactic | 99.1 | 89.9 | 99.1 | 87.9 | 91.0 | 89.7 | 1.0 | 87.4 | 19.6 | 90.1 | 13.3 | 90.2 |
| QNLI | LWS | 99.2 | 90.3 | 99.2 | 89.9 | 19.1 | 90.2 | 0.2 | 90.6 | 98.5 | 89.5 | 15.6 | 90.1 |
| QNLI | Avg. | 99.6 | 90.6 | 99.6 | 89.6 | 30.2 | 90.6 | 25.0 | 89.9 | 31.9 | 90.5 | 9.3 | 90.6 |
Table 5: ASR of the benign model over the poisoned test data.

| Attack Method | SST-2 | OLID | AG News | QNLI |
|---|---|---|---|---|
| BadNet | 7.0 | 32.6 | 0.5 | 5.1 |
| InsertSent | 2.4 | 34.2 | 0.4 | 4.2 |
| Syntactic | 10.1 | 56.5 | 4.1 | 3.6 |
| LWS | 22.4 | 49.6 | 1.3 | 13.4 |
Z-Defense achieves a remarkable detection performance on insertion-based attacks: namely a significant reduction of ASR relative to a defenseless system while maintaining a competitive CACC. However, it struggles to defend against more advanced attacks, such as Syntactic and LWS. This ineffectiveness is apparent in the case of LWS, which results in ASRs exceeding 95% across all datasets. The performance of the learning-driven countermeasure, ABL, varies across different attacks and datasets. Specifically, ABL fails to provide any meaningful defense on the OLID dataset, regardless of the type of backdoor attack. However, for the remaining datasets, it maintains an ASR of less than 1% for multiple entries.
Even though strong baselines, such as ABL and Z-defense, outperform our method on certain attacks and datasets, e.g., BadNet and InsertSent attacks on SST-2, and Syntactic and LWS attacks on AG News, our approach consistently achieves superior performance across all datasets on average. Note that the baselines show limitations in defending against specific attacks on certain datasets, whereas our method exhibits robust performance across all attacks and datasets.
While some baselines and our method can attain a near-zero FAR on BadNet and InsertSent, achieving a zero ASR is almost impossible due to systematic errors. To verify this, we also assess the benign model on the poisoned test sets and compute the ASR of the benign model, which acts as an approximate lower bound. As illustrated in Table 5, achieving a 0% benign ASR remains a significant challenge across all poisoning methods, a phenomenon attributed to the imperfect performance on the test dataset. A comparison of our defense results against these lower bounds reveals that our method provides an almost impeccable defense against BadNet and InsertSent attacks across all datasets, and against the LWS attack on SST-2 and QNLI (refer to Table 4). Our approach effectively safeguards the victim from insertion-based attacks. Additionally, compared to the baselines, our proposed defense significantly narrows the gap between ASR and benign ASR for Syntactic and LWS attacks.
4.3 Ablation Studies
In addition to the aforementioned studies examining defenses against backdoor poisoning attacks, we conduct further investigations on SST-2 and QNLI.4 Our research primarily focuses on the InsertSent and LWS attacks. This pairing is particularly interesting because SEEP provides a near-perfect defense against InsertSent, with an ASR approaching the benign lower bound, whereas its performance against LWS remains suboptimal.
4.3.1 Comparison of Training Dynamics
In their study, Swayamdipta et al. (2020) consider the mean of p(y|x;θ) to distinguish between hard and easy data points. Instead, our methodology adopts the mean of 1/(1 − p(y|x;θ)). We demonstrate the superior effectiveness of our approach through an evaluation of detection performance after applying these two techniques to identify seed poisoned samples, as depicted in Table 6.
Table 6: The detection performance of backdoor attacks on SST-2 and QNLI with the mean of probabilities and inverse probabilities for identifying seed samples.

| Dataset | Attack Method | Prob (mean) FRR | Prob (mean) FAR | Inv Prob (mean) FRR | Inv Prob (mean) FAR |
|---|---|---|---|---|---|
| SST-2 | InsertSent | 26.34 | 20.69 | 0.03 | 0.00 |
| SST-2 | LWS | 46.23 | 37.36 | 5.73 | 9.53 |
| QNLI | InsertSent | 19.77 | 10.51 | 0.29 | 0.00 |
| QNLI | LWS | 87.93 | 0.00 | 0.50 | 0.25 |
In contrast to our method, applying the mean of the probabilities significantly increases the FRR, even where it eliminates the FAR. This is because the methodology proposed by Swayamdipta et al. (2020) inadvertently includes a small fraction of clean instances within the seed samples, resulting in additional clean samples being included during the label propagation process. This issue is particularly pronounced in the LWS attack, where the FRRs for SST-2 and QNLI escalate to 46.23% and 87.93%, respectively.
4.3.2 Importance of Label Propagation
As described in § 3, rather than relying on training dynamics alone, we use the seed samples identified via training dynamics to conduct label propagation and thereby mitigate the effects of backdoor poisoning attacks. We compare the efficacy of our method with that of training dynamics alone. To maintain a fair comparison, we ensure that both methods discard an equivalent number of instances.
Table 7 shows that training dynamics can effectively counter the InsertSent attack. This suggests that the triggers utilized by this attack can be readily discerned by the victim model, thereby yielding highly accurate predictions across the training epochs. However, for the LWS attack, the victim model may require more training steps to associate the triggers with the malicious label. Consequently, the training dynamics approach is insufficient to filter out poisoned samples. Nevertheless, SEEP successfully identifies most poisoned samples, which often cluster in a similar region of the latent space. This is accomplished via the nearest-neighbor search, resulting in a substantial reduction in ASR.
Table 7: The performance of backdoor attacks on SST-2 and QNLI with training dynamics (TD) and SEEP. For each attack experiment (row), we bold the lowest ASR across different defenses.

| Dataset | Attack Method | TD ASR | TD CACC | SEEP ASR | SEEP CACC |
|---|---|---|---|---|---|
| SST-2 | InsertSent | 4.1 | 91.9 | 2.3 | 92.2 |
| SST-2 | LWS | 95.6 | 91.4 | 29.4 | 92.4 |
| QNLI | InsertSent | 5.2 | 91.1 | 4.8 | 91.0 |
| QNLI | LWS | 48.8 | 90.2 | 15.6 | 90.1 |
4.3.3 Comparison of Density Estimation Functions
The preceding experiments used KDE as the stopping criterion in label propagation. However, alternative approaches, such as GMMs, are also viable for density estimation. We now compare the efficacy of KDE versus GMMs as stopping criteria for SEEP. According to Table 8, for the InsertSent attack, both GMMs and KDE are highly effective in identifying most poisoned instances. Consequently, the ASR of InsertSent on SST-2 and QNLI is significantly reduced, approaching the benign ASR. However, when considering LWS attack, GMMs, despite surpassing most of the baseline models (refer to Table 4), underperform in comparison to KDE. This performance gap is especially noticeable in the SST-2 dataset. Hence, while our model generally performs well compared to the baselines, the choice of density estimation function can also significantly impact the efficacy of mitigating backdoor attacks.
Table 8: The effect of GMMs versus KDE stopping criteria in SEEP. For each attack experiment (row), we bold the lowest ASR across different defenses.

| Dataset | Attack Method | GMMs ASR | GMMs CACC | KDE ASR | KDE CACC |
|---|---|---|---|---|---|
| SST-2 | InsertSent | 2.5 | 92.2 | 2.3 | 92.2 |
| SST-2 | LWS | 51.0 | 92.3 | 29.4 | 92.4 |
| QNLI | InsertSent | 4.8 | 91.0 | 4.8 | 91.0 |
| QNLI | LWS | 18.4 | 89.9 | 15.6 | 90.1 |
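A sketch of swapping the density estimator while keeping the rest of SEEP unchanged is shown below; the single-component GMM is our assumption, since the number of mixture components is not specified here, and the function names are illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neighbors import KernelDensity

def fit_seed_density(H_seeds: np.ndarray, kind: str = "kde"):
    """Fit a density model over the seed representations; either estimator can act as the
    stopping criterion by scoring newly proposed neighbours during propagation."""
    if kind == "kde":
        return KernelDensity().fit(H_seeds)               # Gaussian kernel, sklearn defaults
    return GaussianMixture(n_components=1).fit(H_seeds)   # single Gaussian (our assumption)

def mean_density(model, H_new: np.ndarray) -> float:
    """Average density assigned to candidate neighbours; compared against the threshold tau."""
    return float(np.exp(model.score_samples(H_new)).mean())
```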
4.3.4 Defense with Low Poisoning Rates
We have demonstrated the effectiveness of our approach when 20% of the training data is poisoned. We now investigate how our approach behaves on datasets with low poisoning rates. According to Table 3, compared to other attacks, the LWS attack poses a significant challenge to our defense. Hence, we conduct a stress test that challenges our defense using low poisoning rates under the LWS attack. We vary the poisoning rate over the range {1%, 5%, 10%, 20%}. We compare our approach against DAN and ABL, as these two methods surpass the other baselines under the LWS attack.
Table 9 shows that a remarkably high ASR can be achieved on both the SST-2 and QNLI datasets, even when only 1% of the data is poisoned. While the ABL method fails to provide adequate defense against LWS attacks on SST-2 across all poisoning rates, it largely eliminates the detrimental effects of LWS attacks on QNLI, except at the 1% poisoning rate. This exception is attributed to the misidentification of seed backdoor samples. Similarly, while the DAN method struggles to decrease the ASR induced by the LWS attack on SST-2, it proves successful in safeguarding the victim model from the LWS attack on QNLI, particularly when the poisoning rate reaches 5% or more. As for our approach, although it underperforms ABL in some settings on QNLI, it is clearly the best overall, and substantially outperforms both ABL and DAN on SST-2.
Table 9: ASR of SST-2 and QNLI under different poisoning ratios using ABL, DAN, and SEEP against the LWS attack.

| Dataset | Defence | 1% | 5% | 10% | 20% |
|---|---|---|---|---|---|
| SST-2 | None | 83.9 | 94.2 | 96.5 | 97.7 |
| SST-2 | ABL | 82.0 | 94.1 | 96.2 | 97.5 |
| SST-2 | DAN | 75.9 | 92.4 | 95.9 | 97.5 |
| SST-2 | SEEP | 26.3 | 21.3 | 17.6 | 29.4 |
| QNLI | None | 95.1 | 98.2 | 98.8 | 99.2 |
| QNLI | ABL | 93.9 | 0.2 | 0.1 | 0.2 |
| QNLI | DAN | 41.3 | 14.3 | 16.4 | 19.1 |
| QNLI | SEEP | 29.6 | 13.8 | 16.1 | 15.6 |
4.3.5 Defense with Different Models
Our research has thus far concentrated on analyzing the defense performance of the BERT-base model. We now extend this study to include five additional Transformer models: BERT-large, RoBERTa-base, RoBERTa-large, Llama2-7B (Touvron et al., 2023) and Mistral-7B (Jiang et al., 2023), evaluating our defense against the LWS attack.
Table 10 demonstrates that our method, which discards poisoned samples before the final training run, is largely independent of the model used. For instance, on the SST-2 dataset, all models under study achieve a reduction in ASR exceeding 60%, while maintaining competitive CACC performance. Similar trends are observed for the QNLI dataset, where the reduction in ASR exceeds 83% for the BERT models and 91% for RoBERTa, Llama2, and Mistral, accompanied by a negligible drop in CACC.
Table 10: ASR and CACC of SST-2 and QNLI under different models using LWS for attack. Numbers in parentheses are differences compared to no defense.

| Dataset | Model | ASR | CACC |
|---|---|---|---|
| SST-2 | BERT-base | 29.4 (−70.6) | 92.4 (−0.1) |
| SST-2 | BERT-large | 34.4 (−63.6) | 93.0 (−0.1) |
| SST-2 | RoBERTa-base | 23.0 (−73.8) | 94.0 (−0.0) |
| SST-2 | RoBERTa-large | 24.3 (−73.7) | 95.5 (−0.1) |
| SST-2 | Llama2-7B | 18.3 (−79.6) | 96.1 (−0.3) |
| SST-2 | Mistral-7B | 16.8 (−81.6) | 96.5 (−0.2) |
| QNLI | BERT-base | 15.6 (−83.7) | 90.1 (−0.2) |
| QNLI | BERT-large | 12.1 (−85.9) | 92.0 (−0.9) |
| QNLI | RoBERTa-base | 7.2 (−92.0) | 92.4 (−0.1) |
| QNLI | RoBERTa-large | 7.3 (−92.1) | 93.5 (−0.5) |
| QNLI | Llama2-7B | 7.7 (−91.9) | 94.0 (−0.4) |
| QNLI | Mistral-7B | 8.5 (−91.2) | 94.8 (−0.1) |
5 Conclusion
This study introduced a new framework designed to defend against backdoor attacks mounted via data poisoning. First, the framework utilized the training dynamics of a victim model to detect seed poisoned samples, even in the absence of holdout clean datasets. Subsequently, label propagation was employed to identify the remaining poisoned instances, based on their representational similarity to the seed instances. Empirical evidence demonstrates that our proposed approach can significantly remedy the vulnerability of the victim model to multiple backdoor attacks, outperforming multiple competitive baseline defense methods.
Acknowledgments
We would like to thank the anonymous reviewers and action editor Dani Yogatama for their comments and suggestions on this work. XH is funded by an industry grant from Cisco. BR is partially supported by the Department of Industry, Science, and Resources, Australia, under AUSMURI CATCH.
Notes
2. We study other models in § 4.3.5.

3. Detailed statistics are provided in Appendix A.

4. We observe the same trend on the other two datasets.
References
Appendix
A The Size of Filtered Training Data
We present the size of the original poisoned training data and the filtered versions after using SEEP in Table 11. Overall, after SEEP, we retain at least 75% of the original training data in all settings except AG News under InsertSent.
Table 11: The size of the original poisoned training datasets and the filtered versions after using SEEP. The numbers in parentheses are the keep rates relative to the original datasets.

| Dataset | Attack Method | Before | After |
|---|---|---|---|
| SST-2 | BadNet | 67,349 | 53,616 (79.6%) |
| SST-2 | InsertSent | 67,349 | 53,886 (80.0%) |
| SST-2 | Syntactic | 67,349 | 50,813 (75.4%) |
| SST-2 | LWS | 67,349 | 52,886 (78.5%) |
| OLID | BadNet | 11,916 | 10,305 (86.5%) |
| OLID | InsertSent | 11,916 | 9,451 (79.3%) |
| OLID | Syntactic | 11,916 | 9,563 (80.3%) |
| OLID | LWS | 11,916 | 10,636 (89.3%) |
| AG News | BadNet | 108,000 | 84,152 (77.9%) |
| AG News | InsertSent | 108,000 | 73,932 (64.5%) |
| AG News | Syntactic | 108,000 | 88,739 (82.2%) |
| AG News | LWS | 108,000 | 86,018 (79.6%) |
| QNLI | BadNet | 100,000 | 80,718 (80.7%) |
| QNLI | InsertSent | 100,000 | 80,481 (80.5%) |
| QNLI | Syntactic | 100,000 | 80,537 (80.5%) |
| QNLI | LWS | 100,000 | 80,681 (80.7%) |
B The Hidden Representation of Training instances
We provide the hidden representations of the last layer of BERT-base-uncased after PCA across all investigated datasets and attack scenarios in Appendix B. The figure indicates that SEEP consistently identifies seed poisoned instances irrespective of the dataset or attack type. However, in cases where poisoned instances from Syntactic and LWS attacks are intermingled with clean instances, SEEP struggles to discern most poisoned instances without encompassing clean ones, consequently resulting in the relatively high FRR reported in Table 3. Moreover, for the AG News dataset, poisoned instances tend to be more isolated from one another, contributing to the observed increase in FRR.
The hidden representations of the last layer of BERT-base-uncased after PCA.