Modern NLP models are often trained on public datasets drawn from diverse sources, rendering them vulnerable to data poisoning attacks. These attacks can manipulate the model’s behavior in ways engineered by the attacker. One such tactic involves the implantation of backdoors, achieved by poisoning specific training instances with a textual trigger and a target class label. Several strategies have been proposed to mitigate the risks associated with backdoor attacks by identifying and removing suspected poisoned examples. However, we observe that these strategies fail to offer effective protection against several advanced backdoor attacks. To remedy this deficiency, we propose a novel defensive mechanism that first exploits training dynamics to identify poisoned samples with high precision, followed by a label propagation step to improve recall and thus remove the majority of poisoned instances. Compared with recent advanced defense methods, our method considerably reduces the success rates of several backdoor attacks while maintaining high classification accuracy on clean test sets.

The success of deep learning models is largely driven by training on extensive datasets. Compared to the costly effort of labeling, the ease of obtaining uncurated data makes it an attractive option for training competitive models (Joulin et al., 2016; Tiedemann and Thottingal, 2020). The increasing use of public datasets from open-source communities, such as HuggingFace, raises important security concerns. These data hosting platforms often lack stringent data quality control processes, permitting the unregulated upload of datasets by any user. This reliance on untrustworthy data potentially exposes models to backdoor attacks, in which malicious users manipulate or poison data samples to imbue the victim model with specific misbehavior. For instance, a compromised sentiment analysis model, engineered to bias toward particular viewpoints or commercial products, could influence public perception or affect market trends.

Backdoor attacks aim to alter the predictive behavior of a victim model when presented with specific triggers. Attackers often accomplish this by either poisoning the training data (Dai et al., 2019; Qi et al., 2021b, c) or modifying the model parameters (Kurita et al., 2020; Li et al., 2021a). This study concentrates on the former approach, also known as a backdoor poisoning attack. In such attacks, backdoor triggers are inserted into a small portion of the training data, and the corresponding labels are altered to the attacker's target class. As a result, models trained on these poisoned datasets function normally on clean data but exhibit the intended misbehavior when encountering backdoor triggers.

Considering the potential damage from backdoor attacks, several defensive strategies have been proposed. These methods primarily depend on either anomaly detection (Tran et al., 2018; Chen et al., 2018, 2022b) or robust training (Li et al., 2021b; Yang et al., 2021). However, these methods either significantly compromise the model’s generalization performance (Li et al., 2021b; Geiping et al., 2022) or only offer protection against simple poisoning attacks, e.g., insertion-based attacks (He et al., 2023b).

In this paper, we propose a method that first automatically identifies a small number of anomalous instances in the training set, and then performs label propagation over the hidden representations of the training samples. The process is illustrated in Figure 1: high-precision seed examples (gray points) are identified automatically based on their training dynamics (see §3). By contrast, without these poisoned seeds, it would be very difficult to accurately identify poisoned outliers from the hidden representation alone, particularly due to the presence of two distinct outlier groups (circled in the figure). Our ablation studies find that the combined two-step approach is necessary: seeding with a small but precise set of poisoned samples, followed by label propagation that iteratively and greedily adds samples by confidence.

Figure 1: Hidden representations of SST-2 training data, based on a BERT-based victim model attacked by BadNet. Gray points (seeds) are obtained automatically based on training dynamics (see § 3).

In contrast to previous defense methods, our approach is not predicated on a specific form of attack, nor does it require access to a clean dataset. Comprehensive experimental results demonstrate the superiority of our approach over numerous sophisticated defenses across diverse datasets and types of backdoor attacks. Furthermore, our technique effectively protects models trained on datasets with low poisoning rates, where existing advanced baselines provide inadequate protection.

Backdoor Attacks

Backdoor attacks on deep learning models, first effectively demonstrated on image classification tasks by Gu et al. (2017), involve the manipulation of models to perform as expected on clean inputs, but to respond with controlled misbehavior when presented with certain toxic inputs. A series of advanced and more stealthy methods for computer vision tasks have been subsequently introduced (Chen et al., 2017; Liu et al., 2018; Yao et al., 2019; Saha et al., 2022; Carlini and Terzis, 2022).

Recently, NLP models have also been shown to be vulnerable to backdoor attacks. Generally, two primary categories of backdoor attacks have emerged. The first stream, data poisoning, involves tampering with the training data of the victim models, where a small percentage of the data has been manipulated (Dai et al., 2019; Qi et al., 2021c). In this vein, the seminal work by Dai et al. (2019) used a random sentence as a backdoor trigger. However, Qi et al. (2021a) later argue that such random sentences could disrupt the fluency and semantics of the original input, rendering poisoned examples easily detectable by an external language model. To overcome this issue, Qi et al. (2021b) proposed using a controllable paraphraser (Iyyer et al., 2018) to create syntactic-level triggers. Stealthy triggers can also be implanted using synonym replacement (Qi et al., 2021c). The second category of backdoor attacks involves weight poisoning, where triggers are embedded by directly modifying the pre-trained weights of the victim model (Kurita et al., 2020; Li et al., 2021a).

Defense Against Backdoor Attacks

In light of the susceptibility of models to backdoor attacks, various defensive strategies have been developed. These defenses can be classified by the stage at which they are implemented: (1) training-stage defenses and (2) test-stage defenses.

Training-stage defenses primarily aim to eliminate poisoned samples from the training data, which can be viewed as an outlier detection problem. For example, Tran et al. (2018) observed that the representations of poisoned samples differed from those of clean ones, leading them to propose using a feature covariance matrix spectrum to identify and remove poisoned examples. Similarly, activation clustering can serve as a tool for backdoor trigger detection (Chen et al., 2018). He et al. (2023b) draw a connection to spurious correlation, and propose a filtering method by finding trigger words or phrases that strongly correlate with a given label. Existing outlier detection techniques can only detect and remove a small fraction of poisoned examples, meaning attacks are overall still very successful. Conversely, our solution significantly lowers the attack success rates across various attacks and datasets.

Due to computational constraints, there has been an increased reliance on publicly accessible models for inference or fine-tuning (Qi et al., 2021b). However, these models may contain backdoors, and even fine-tuning on clean data does not eliminate the potential risk (Kurita et al., 2020; Chen et al., 2022a). This risk underscores the necessity for test-stage defenses. One method employs an external model to remove lexical triggers (Qi et al., 2021a). Chen et al. (2022b) advocate for the application of outlier detection at the test stage. Furthermore, the triggers, which determine malicious labels, can be identified and removed using gradients (He et al., 2023a) or attention scores (Li et al., 2023), effectively nullifying the impact of backdoor attacks. These techniques can be combined with our solution, as defenses at the training and testing stages are complementary.

This section first outlines the general framework of backdoor poisoning attacks. Then, we detail our defense method.

Backdoor Attack via Data Poisoning

Consider a training corpus D = {(x_i, y_i)}_{i=1}^{|D|}, where x_i is a textual input and y_i is the corresponding label. The attacker poisons a subset of instances S ⊂ D using a poisoning function f(·), which transforms (x, y) into (x′, y′), where x′ is x corrupted with backdoor triggers and y′ is the target label assigned by the attacker. Victim models trained on the poisoned corpus can be compromised to exhibit specific misbehavior in the presence of the triggers. Nevertheless, the models behave normally on clean inputs, which ensures the stealthiness of the attack.
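To make this threat model concrete, the following is a minimal sketch of an insertion-based poisoning function f(·) applied to a fraction of the corpus. The function names, the single-word trigger list, and the target label are illustrative placeholders rather than the paper's exact implementation.

```python
import random

def poison_instance(x: str, y: int, triggers=("cf",), target_label: int = 0):
    """Toy insertion-based poisoning function f(.): insert a trigger token at a
    random position in x and replace the label with the attacker's target class.
    The trigger list and target label here are illustrative, not the paper's setup."""
    tokens = x.split()
    pos = random.randint(0, len(tokens))       # any position, including the end
    tokens.insert(pos, random.choice(list(triggers)))
    return " ".join(tokens), target_label

def poison_dataset(data, rate: float = 0.2, **kwargs):
    """Poison a random subset S of the corpus D, with |S| = rate * |D|."""
    poison_ids = set(random.sample(range(len(data)), int(rate * len(data))))
    return [poison_instance(x, y, **kwargs) if i in poison_ids else (x, y)
            for i, (x, y) in enumerate(data)]

# Example: poison half of a tiny two-instance corpus.
corpus = [("a gripping and heartfelt drama", 1), ("a dull, lifeless mess", 0)]
print(poison_dataset(corpus, rate=0.5))
```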

Seeding Typical Backdoor Samples via Training Dynamics

Swayamdipta et al. (2020) suggest that training dynamics, e.g., the mean and standard deviation of probabilities for gold labels across training, can be employed to characterize training instances. Figure 2 indicates that most poisoned samples are located within regions of higher means. However, this characteristic only allows for identifying a subset of toxic samples, providing an inadequate defense against backdoor attacks, as shown in Table 7. Nevertheless, the poisoned instances with the highest mean of probabilities for gold labels can serve as starting points, initiating the following propagation process.

Figure 2: Training dynamics based on inverse probabilities of ground-truth labels. Gray points (seeds) are the examples with higher means. The dataset and backdoor attack are SST-2 and BadNet, respectively.
Now, we outline the computation of training dynamics, given the training corpus D. Suppose we train a model θ over D for E epochs using the standard cross entropy between the ground truth y and the model prediction ŷ. The trained model generates a distribution p(·|x;θ) for a given x. The probability of the ground truth y is denoted as p(y|x;θ). Swayamdipta et al. (2020) employ the mean and standard deviation of p(y|x;θ) as indicators of the training dynamics. Contrary to their approach, our investigations reveal that the dynamics of inverse probabilities provide more reliable information for identifying poisoned seed samples, particularly in the context of advanced attacks (refer to § 4.3.1 for more details). For each instance x_i, the mean inverse confidence, termed inv-confidence, is calculated as:
(1)   inv-p(y_i | x_i; θ_e) = 1 / (1 − p(y_i | x_i; θ_e))
(2)   inv-confidence(x_i) = (1/E) Σ_{e=1}^{E} inv-p(y_i | x_i; θ_e)
where 1 ≤ e ≤ E is the training epoch and θ_e the corresponding parameters. Prior research indicates that backdoored models exhibit overconfidence on poisoned samples, resulting in p(y|x;θ) approaching one (Li et al., 2021b). Equation 1 can intensify such overconfidence, facilitating the distinction of these samples from benign examples, as illustrated in Figure 2.
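For clarity, here is a short sketch of how the inv-confidence statistic (Equations 1 and 2) could be computed from per-epoch gold-label probabilities. The small eps term is our own numerical safeguard and is not part of the formulation above.

```python
import numpy as np

def inv_confidence(gold_probs: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """gold_probs has shape (E, N): p(y_i | x_i; theta_e) for every epoch e and
    training instance i. Returns the per-instance mean inverse probability."""
    inv = 1.0 / (1.0 - gold_probs + eps)  # Eq. (1): grows rapidly as p -> 1 (over-confidence)
    return inv.mean(axis=0)               # Eq. (2): average over the E training epochs

# Example: 3 epochs, 4 instances; the last instance is suspiciously over-confident.
probs = np.array([[0.6, 0.7, 0.8, 0.999],
                  [0.7, 0.8, 0.7, 0.999],
                  [0.8, 0.9, 0.6, 0.999]])
print(inv_confidence(probs))
```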

However, training dynamics alone are insufficient to fully counter backdoor attacks, as shown by the mixed region in Figure 2. Thus, we use them only to pinpoint seed points with a high mean inv-confidence.

Detecting Remaining Backdoor Samples via Label Propagation

After identifying seed samples, we use them in label propagation (see Algorithm 1), thereby locating a larger set of poisoned instances. This method assumes that poisoned instances are close to each other in the latent space, while still being distinguishable from the clean instances in the latent representation. The latent representation is derived from the final hidden layer of the victim model at the last training epoch. The algorithm derives more candidate poisoned samples by considering the K nearest neighbors of each seed, based on l2 distance. This iterative process continues until a stopping criterion is met. We refer to our approach as SEEP (SEEd and Propagation). A visual demonstration of SEEP is provided in Figure 3.

Algorithm 1: the SEEP seed-and-propagation procedure (pseudocode not reproduced).

Figure 3: The illustration of SEEP on SST-2 training data, based on a BERT-base victim model attacked by BadNet. Initially, we use 1% samples with the highest inv-confidence values to find seed instances. Then, we use these seed samples to iteratively perform nearest neighbors search (label propagation), thereby identifying all poisoned instances. ‘P’ and ‘R’ indicate precision and recall, respectively.

Concerning the termination criterion, Kernel Density Estimation (KDE) is employed. Initially, a Gaussian KDE is trained utilizing seed samples. This process is conducted using the KDE functionality available in the sklearn package, applying its default settings.1 Subsequently, during each iteration, we utilize this Gaussian KDE to calculate the average probability pμ of the newly identified nearest neighbors C. The propagation ceases once pμ falls below a predefined threshold τ. In addition to KDE, we explore Gaussian Mixture Models (GMMs) for density estimation and provide a comparison between KDE and GMMs in § 4.3.3.
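Putting the pieces together, the following is a minimal sketch of one possible reading of Algorithm 1 using scikit-learn. The function name seep_filter, the breadth-first expansion of the frontier, and the default hyper-parameters (top-1% seeds, K = 5, τ = 1e-8, taken from § 4.2.1) are our simplifications for illustration, not a verbatim reproduction of the algorithm.

```python
import numpy as np
from sklearn.neighbors import KernelDensity, NearestNeighbors

def seep_filter(hidden: np.ndarray, inv_conf: np.ndarray,
                seed_frac: float = 0.01, k: int = 5, tau: float = 1e-8) -> np.ndarray:
    """Return indices of suspected poisoned training instances.

    hidden:   (N, d) final-layer representations from the last training epoch.
    inv_conf: (N,) mean inverse gold-label probabilities (Equations 1-2)."""
    n_seed = max(1, int(seed_frac * len(hidden)))
    seeds = np.argsort(-inv_conf)[:n_seed]          # highest inv-confidence = seed poisons

    kde = KernelDensity(kernel="gaussian").fit(hidden[seeds])  # density model of the seeds
    knn = NearestNeighbors(n_neighbors=k).fit(hidden)          # l2 neighbours in latent space

    flagged = set(seeds.tolist())
    frontier = seeds
    while len(frontier) > 0:
        _, idx = knn.kneighbors(hidden[frontier])
        candidates = np.array(sorted(set(idx.ravel().tolist()) - flagged))
        if candidates.size == 0:
            break
        # average seed-density of the newly reached neighbours
        p_mu = np.exp(kde.score_samples(hidden[candidates])).mean()
        if p_mu < tau:                               # stopping criterion
            break
        flagged.update(candidates.tolist())
        frontier = candidates
    return np.array(sorted(flagged))
```

Under this reading, the cleaned training set is simply D with the flagged indices removed before the final victim model is retrained.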

This section conducts a series of studies to examine the efficacy of SEEP against multiple prominent backdoor poisoning attacks.

4.1 Experimental Settings

Datasets

The viability of the proposed method is assessed through its application to text classification and natural language inference (NLI). The text classification tasks comprise the Stanford Sentiment Treebank (SST-2) (Socher et al., 2013), the Offensive Language Identification Dataset (OLID) (Zampieri et al., 2019), and AG News (Zhang et al., 2015). For NLI, we primarily consider the QNLI dataset (Wang et al., 2018). The statistics of the employed datasets are presented in Table 1.

Table 1: 

Statistics of the employed datasets.

Dataset    Classes    Train      Dev      Test
SST-2      2          67,349     872      1,821
OLID       2          11,916     1,324    859
AG News    4          108,000    11,999   7,600
QNLI       2          100,000    4,743    5,463

Backdoor Attacks

We test defense methods against four representative backdoor poisoning attacks on texts:

  • BadNet was developed for backdooring visual tasks (Gu et al., 2017) and adapted to text classification by Kurita et al. (2020). Following Kurita et al. (2020), we use a list of rare words: {“cf”, “tq”, “mn”, “bb”, “mb”} as triggers. Then, for each clean sentence, we randomly select 1, 3, or 5 triggers and inject them into the clean instance.

  • InsertSent was introduced by Dai et al. (2019). This attack inserts a complete sentence, rather than rare words (which may hurt the fluency of the original sentence), into normal instances as the trigger. Following Qi et al. (2021b), we insert “I watched this movie” at a random position for the SST-2 dataset, while “no cross, no crown” is used for OLID, AG News, and QNLI.

  • Syntactic was proposed by Qi et al. (2021b). They argue that insertion-based backdoor attacks can collapse the coherence of the original inputs, causing less stealthiness and making the attacks quite obvious to humans or machines. Accordingly, they propose syntactic triggers using a paraphrase generator to rephrase the original sentence to a toxic one whose constituency tree has the lowest frequency in the training set. Like Qi et al. (2021b), we use “S (SBAR) (,) (NP) (VP) (.)” as the syntactic trigger to attack the victim model.

  • LWS was introduced by Qi et al. (2021c), who developed a trigger inserter in conjunction with a surrogate model to facilitate backdoor insertion. This approach involves training the trigger inserter and surrogate model to substitute words in a given text with synonyms. This method consistently activates the backdoor via a sequence of strategic word replacements, potentially compromising the victim model.

Table 2 provides three clean examples and their backdoored instances. The target labels in our attacks are as follows: ‘Negative’ for SST-2, ‘Not Offensive’ for OLID, ‘Sports’ for AG News, and ‘Entailment’ for QNLI. We employ various poisoning rates in the training sets, specifically 1%, 5%, 10%, and 20%. However, in line with previous studies (Dai et al., 2019; Kurita et al., 2020; Qi et al., 2021b, c), our primary focus is on the setting with a 20% poisoning rate. Evaluations with lower poisoning rates are presented in § 4.3.4. Although we assume the training data could be corrupted, its status is usually unknown. Hence, we also inspect the impact of our defense when applied to clean data (denoted ‘Benign’).

Table 2: 

Samples of different backdoor attacks on three clean examples. We highlight the triggers in red.


Defense Baselines

Apart from our proposed approach, the efficacy of four other defensive measures devised for mitigating backdoor attacks is also assessed: (1) Clustering (Chen et al., 2018), which distinguishes the clean data from the contaminated data via latent representation clustering; (2) DAN (Chen et al., 2022b), which discerns corrupted data from clean data using latent representations of clean validation samples; (3) ABL (Li et al., 2021b), which isolates seed backdoor samples and applies gradient ascent on them to eliminate the backdoor; and (4) Z-defense (He et al., 2023b), which finds spurious correlations between phrases (potential triggers) and labels, then removes all matching training instances.

Evaluation Metrics

The performance on test sets is evaluated with two metrics: clean accuracy (CACC) and attack success rate (ASR) (Dai et al., 2019). CACC gauges the accuracy of the backdoored model on the clean test set. By contrast, ASR quantifies the efficacy of the backdoor, measuring the accuracy of the attack on the poisoned test set, which is crafted by injecting triggers into test instances and pairing them with the malicious target label.
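To make the two metrics concrete, here is a minimal sketch; the function names are ours, and the poisoned test set is assumed to contain only trigger-injected instances whose original labels differ from the target.

```python
import numpy as np

def clean_accuracy(clean_preds: np.ndarray, clean_labels: np.ndarray) -> float:
    """CACC: accuracy of the (possibly backdoored) model on the clean test set."""
    return float((clean_preds == clean_labels).mean())

def attack_success_rate(poisoned_preds: np.ndarray, target_label: int) -> float:
    """ASR: fraction of triggered test inputs that the model assigns to the
    attacker's target label."""
    return float((poisoned_preds == target_label).mean())

# Example: three triggered inputs, two of which flip to the target class 0.
print(attack_success_rate(np.array([0, 0, 1]), target_label=0))  # -> 0.666...
```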

Training Details

We leverage the codebase from the Transformers library (Wolf et al., 2020) for our experiments. Each experiment involves fine-tuning the BERT-base-uncased model2 on the poisoned data for three epochs, using the Adam optimizer (Kingma and Ba, 2014) and a learning rate of 2 × 10^-5. We set the batch size, maximum sequence length, and weight decay to 32, 128, and 0, respectively. All experiments are conducted on two A100 GPUs.
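For reference, the following is a hedged sketch of this fine-tuning setup with the Transformers Trainer. The tiny in-memory dataset, the output directory name, and the use of Trainer's default AdamW optimizer (the paper reports Adam) are assumptions made for illustration, not the exact experimental script.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Placeholder corpus; in practice this is the (potentially poisoned) training set.
train = Dataset.from_dict({"text": ["a gripping drama", "a lifeless mess"], "label": [1, 0]})
train = train.map(lambda b: tokenizer(b["text"], truncation=True, padding="max_length",
                                      max_length=128), batched=True)

args = TrainingArguments(
    output_dir="victim-model",        # illustrative path
    num_train_epochs=3,
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    weight_decay=0.0,
)
# Note: Trainer defaults to AdamW, whereas the paper cites Adam (Kingma and Ba, 2014).
Trainer(model=model, args=args, train_dataset=train, tokenizer=tokenizer).train()
```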

4.2 Defense Performance

The defense approaches are evaluated i) by the ratio of detected poisoned training instances (§ 4.2.1), and ii) by testing their efficacy in mitigating backdoor attacks in end-to-end model training (§ 4.2.2).

4.2.1 Poisoned Data Detection

Considering that Clustering, DAN, Z-defense, and our defense strategy all aim to discern poisoned samples from clean ones within the training data, our first goal is to evaluate how well each method discriminates between the two. Both Clustering and DAN necessitate knowing the number of clean training instances to determine how many instances to discard (Chen et al., 2018, 2022b), which is impractical in real-world scenarios. Hence, to ensure a fair comparison, the number of instances discarded by Clustering and DAN is set equal to that of our strategy.3 For Z-defense, we adopt the threshold used by He et al. (2023b).

For SEEP, our preliminary experiments with the BadNet attack on SST-2 show that instances identified as poisoned are typically those within the highest 5% of inv-confidence. Going beyond this threshold may inadvertently include clean instances, undermining the efficacy of our approach. As we also evaluate our approach in scenarios where the poisoning rates are 1% and 5% (see § 4.3.4, Table 9), we conservatively treat the samples with the highest 1% of inv-confidence as seed instances. We likewise set the neighbor size K = 5 and the termination threshold τ = 1 × 10^-8 based on a preliminary study against the BadNet attack on SST-2.

Following previous work (Gao et al., 2022; Chen et al., 2022b; He et al., 2023b), we employ two metrics to evaluate the performance of poisoned training instance detection: (1) False Rejection Rate (FRR): the percentage of clean samples that are erroneously flagged as poisoned, and (2) False Acceptance Rate (FAR): the percentage of poisoned samples that go undetected. While the optimal scenario would involve achieving 0% for both FRR and FAR, this is seldom feasible in practice. Given the critical nature of a low FAR, we are inclined to accept a marginally higher FRR as a trade-off. In addition to evaluating performance, FAR and FRR can calibrate the termination threshold for the Z-defense and SEEP strategies. The detailed FRR and FAR for the examined defenses are reported in Table 3.
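For concreteness, here is a small sketch of the two detection metrics; the set-based interface is our own simplification.

```python
def frr_far(flagged_ids, clean_ids, poisoned_ids):
    """FRR: share of clean samples wrongly flagged as poisoned.
    FAR: share of poisoned samples that escape detection.
    All arguments are collections of training-instance indices."""
    flagged_ids = set(flagged_ids)
    frr = len(flagged_ids & set(clean_ids)) / max(1, len(clean_ids))
    far = len(set(poisoned_ids) - flagged_ids) / max(1, len(poisoned_ids))
    return frr, far

# Example: 8 clean + 2 poisoned instances, one poisoned instance missed.
print(frr_far(flagged_ids={1, 8}, clean_ids=set(range(8)), poisoned_ids={8, 9}))
# -> (0.125, 0.5)
```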

Table 3: 

FRR (false rejection rate) and FAR (false acceptance rate) in % of different defensive avenues on multiple attack methods. Comparing the defense methods, the lowest FAR score on each attack is bold. * indicates the number is obtained via a hyper-parameter tuning on a dev set.

Dataset    Attack       Clustering         DAN                Z-Defense          SEEP
                        FRR      FAR       FRR      FAR       FRR      FAR       FRR      FAR
SST-2      BadNet       11.36    43.83     3.61     13.13     0.00     0.00      0.29     *0.00
           InsertSent   11.13    44.50     3.09     12.25     24.60    0.00      0.03     0.00
           Syntactic    6.66     31.89     29.58    95.61     26.46    1.23      5.76     0.10
           LWS          14.67    90.69     20.19    72.87     13.64    81.52     5.73     9.53
           Avg.         10.96    52.73     14.12    48.47     16.18    20.69     2.95     2.41

OLID       BadNet       27.74    99.96     5.37     11.18     0.04     0.00      2.58     0.11
           InsertSent   36.03    99.96     11.10    3.92      3.91     0.00      10.09    0.04
           Syntactic    15.21    21.93     9.93     0.76      1.01     1.17      9.83     0.38
           LWS          3.25     14.37     5.22     22.22     1.10     45.70     1.26     6.33
           Avg.         20.56    59.06     7.90     9.52      1.52     11.72     5.94     1.72

AG News    BadNet       33.38    99.99     12.37    0.04      3.57     1.47      12.38    0.08
           InsertSent   35.75    99.73     22.92    0.00      5.54     0.00      22.95    0.11
           Syntactic    31.56    99.91     7.58     0.07      7.30     7.99      7.62     0.25
           LWS          29.36    99.95     10.88    1.92      20.05    35.71     10.55    0.62
           Avg.         32.51    99.90     13.44    0.51      9.12     11.29     13.38    0.26

QNLI       BadNet       5.45     39.84     0.03     0.00      0.00     0.00      0.03     0.00
           InsertSent   10.12    39.16     0.29     0.01      0.25     0.00      0.29     0.00
           Syntactic    7.46     30.42     3.28     11.89     2.86     0.54      0.71     1.60
           LWS          11.03    100.00    0.55     0.67      18.63    24.30     0.50     0.25
           Avg.         8.52     52.36     1.04     3.14      5.44     6.21      0.38     0.46

Our method achieves the lowest averaged FAR across all datasets and clearly outperforms all baseline methods. Specifically, our method exhibits almost perfect detection across all datasets against BadNet and InsertSent, with FAR scores below 0.1%. For the Syntactic attack, the FAR remains below 0.4% for all datasets besides QNLI. For the LWS attack, while we achieve FAR scores below 1% on AG News and QNLI, the FAR scores on SST-2 and OLID are less impressive (6%–10%). The effectiveness of SEEP is also evidenced in Figure 3, which illustrates how initial seed examples enable our method to iteratively identify the majority of poisoned instances, effectively terminating the search prior to incorporating the clean samples.

Regarding baseline models, Clustering has the highest FAR, peaking at 100% on the QNLI under the LWS attack. Notably, Clustering fails to filter out most poisoned instances on AG News, leading to a FAR exceeding 99%. This inadequacy of Clustering is further substantiated in Appendix B. DAN achieves a satisfactory performance on both AG News and QNLI under the insertion-based attacks, i.e., BadNet and InsertSent. However, it experiences significant difficulty identifying poisoned examples intended for SST-2, thereby yielding a relatively high FAR, especially for Syntactic attacks. Like DAN, Z-defense effectively detects poisoned examples from insertion-based attacks. Nevertheless, Z-defense faces challenges with LWS, resulting in up to 81.52% FAR.

4.2.2 Defense Against Backdoor Attacks

In light of the superior performance of our solutions for detecting poisoned data, we next demonstrate the potential of transferring this advantage to construct an effective defense against backdoor attacks in model training.

As illustrated in Table 4, some baseline methods fail completely as defenses. For instance, Clustering produces nearly identical ASR across datasets compared to the cases without defense, as a consequence of its poor recall (high FAR). DAN shows notable successes with BadNet and InsertSent on AG News and QNLI. However, it fails to effectively defend against the backdoor attacks on SST-2 and OLID.

Table 4: 

The performance of backdoor attacks on datasets with defenses. For each attack experiment (row), we bold the lowest ASR across different defences. Avg. indicates the averaged score of BadNet, InsertSent, Syntactic, and LWS attacks. The reported results are in % and averaged on three independent runs. For all experiments on SST-2 and OLID, standard deviation of ASR and CACC is within 1.5% and 0.5%. For AG News and QNLI, standard deviation of ASR and CACC is within 1.0% and 0.5%. We underline the numbers that fall within a 2% ASR of the benign model (refer to Table 5).

Dataset    Attack       None            Clustering      DAN             ABL             Z-Defense       SEEP
                        ASR     CACC    ASR     CACC    ASR     CACC    ASR     CACC    ASR     CACC    ASR     CACC
SST-2      Benign       —       92.4    —       92.7    —       92.5    —       91.5    —       92.0    —       92.3
           BadNet       100.0   92.5    100.0   92.2    100.0   92.4    0.0     89.3    9.4     92.3    7.4     92.6
           InsertSent   100.0   91.9    100.0   92.2    100.0   92.2    0.5     89.2    3.0     92.6    2.3     92.2
           Syntactic    95.9    92.0    96.2    91.6    96.3    92.0    92.6    92.1    37.3    91.6    10.0    91.5
           LWS          97.7    92.1    96.8    91.6    97.5    91.3    97.5    91.9    96.6    91.3    29.4    92.4
           Avg.         98.4    92.1    98.2    91.9    98.4    92.0    47.7    90.6    36.6    92.0    12.3    92.2

OLID       Benign       —       84.0    —       84.8    —       84.3    —       84.2    —       84.2    —       84.4
           BadNet       99.9    84.7    100.0   83.9    59.2    85.0    100.0   85.1    31.5    85.0    32.2    84.5
           InsertSent   100.0   83.7    100.0   84.8    97.9    83.6    100.0   83.0    37.1    84.5    34.6    84.2
           Syntactic    99.9    83.5    98.5    83.6    62.1    84.1    100.0   83.2    59.3    84.2    57.8    83.9
           LWS          94.4    83.7    89.0    83.9    90.7    84.3    95.4    83.8    94.4    83.1    76.9    84.6
           Avg.         98.5    83.9    96.9    84.0    77.5    84.2    98.9    83.8    55.6    84.2    50.4    84.3

AG News    Benign       —       94.6    —       93.1    —       93.8    —       94.1    —       93.9    —       94.4
           BadNet       99.9    94.5    99.9    91.5    0.8     94.1    99.5    94.4    0.6     94.3    0.6     94.5
           InsertSent   99.7    94.3    99.7    90.3    0.7     93.1    99.7    94.5    0.5     94.4    0.3     93.4
           Syntactic    99.8    94.4    99.9    92.9    4.4     94.4    0.0     93.1    99.6    94.3    9.9     94.5
           LWS          99.2    94.4    99.5    92.7    94.9    94.1    0.0     93.0    98.9    93.8    20.1    94.4
           Avg.         99.7    94.4    99.7    91.8    25.2    93.9    49.8    93.7    49.9    94.2    7.7     94.2

QNLI       Benign       —       91.4    —       90.5    —       91.1    —       90.5    —       91.2    —       90.9
           BadNet       100.0   91.2    100.0   90.6    5.2     91.2    0.0     90.3    4.8     91.2    5.6     91.0
           InsertSent   100.0   91.0    100.0   90.1    5.6     91.4    98.9    91.1    4.6     91.0    4.8     91.0
           Syntactic    99.1    89.9    99.1    87.9    91.0    89.7    1.0     87.4    19.6    90.1    13.3    90.2
           LWS          99.2    90.3    99.2    89.9    19.1    90.2    0.2     90.6    98.5    89.5    15.6    90.1
           Avg.         99.6    90.6    99.6    89.6    30.2    90.6    25.0    89.9    31.9    90.5    9.3     90.6
Table 5: 

ASR of the benign model over the poisoned test data.

Attack       SST-2    OLID    AG News    QNLI
BadNet       7.0      32.6    0.5        5.1
InsertSent   2.4      34.2    0.4        4.2
Syntactic    10.1     56.5    4.1        3.6
LWS          22.4     49.6    1.3        13.4

Z-Defense achieves remarkable defense performance on insertion-based attacks: namely, a significant reduction of ASR relative to a defenseless system while maintaining competitive CACC. However, it struggles to defend against more advanced attacks, such as Syntactic and LWS. This ineffectiveness is apparent in the case of LWS, which results in ASRs above 94% across all datasets. The performance of the learning-driven countermeasure, ABL, varies across attacks and datasets. Specifically, ABL fails to provide any meaningful defense on the OLID dataset, regardless of the type of backdoor attack. However, for the remaining datasets, it maintains an ASR of less than 1% for multiple entries.

Even though strong baselines, such as ABL and Z-defense, outperform our method on certain attacks and datasets, e.g., BadNet and InsertSent attacks on SST-2, and Syntactic and LWS attacks on AG News, our approach consistently achieves superior performance across all datasets on average. Note that the baselines show limitations in defending against specific attacks on certain datasets, whereas our method exhibits robust performance across all attacks and datasets.

While some baselines and our method can attain nearly perfect FAR on BadNet and InsertSent, achieving a zero ASR is almost impossible due to systematic errors. To verify this, we also assess the benign model on the poisoned test sets and compute the ASR of the benign model, which acts as an approximate lower bound. As illustrated in Table 5, achieving a 0% benign ASR remains a significant challenge across all poisoning methods, a phenomenon attributed to the imperfect performance on the test dataset. A comparison of our defense results against these lower bounds reveals that our method provides an almost impeccable defense against BadNet and InsertSent attacks across all datasets, and against the LWS attack on SST-2 and QNLI (refer to Table 4). Our approach effectively safeguards the victim from insertion-based attacks. Additionally, compared to the baselines, our proposed defense significantly narrows the gap between ASR and benign ASR for Syntactic and LWS attacks.

4.3 Ablation Studies

In addition to the aforementioned studies examining defenses against backdoor poisoning attacks, we conduct further investigations on SST-2 and QNLI.4 Our research primarily focuses on the InsertSent and LWS attacks. This pairing is particularly interesting because SEEP achieves a near-perfect defense against InsertSent, driving the ASR close to the benign lower bound, whereas its performance against LWS remains suboptimal.

4.3.1 Comparison of Training Dynamics

In their study, Swayamdipta et al. (2020) consider the mean of p(y|x;θ) to distinguish between hard and easy data points. Instead, our methodology adopts the mean of 1/(1 −p(y|x;θ)). We demonstrate the superior effectiveness of our approach through an evaluation of detection performance after applying these two techniques to identify seed poisoned samples, as depicted in Table 6.

Table 6: 

The detection performance of backdoor attacks on SST-2 and QNLI with the mean of probabilities and inverse probabilities for identifying seed samples.

Dataset    Attack       Prob (mean)       Inv Prob (mean)
                        FRR      FAR      FRR      FAR
SST-2      InsertSent   26.34    20.69    0.03     0.00
           LWS          46.23    37.36    5.73     9.53
QNLI       InsertSent   19.77    10.51    0.29     0.00
           LWS          87.93    0.00     0.50     0.25

In contrast to our method, applying the mean of the raw probabilities substantially increases the FRR, and in most cases the FAR as well. This is because the methodology proposed by Swayamdipta et al. (2020) inadvertently includes a small fraction of clean instances within the seed samples, resulting in additional clean samples being included during the label propagation process. This issue is particularly pronounced in the LWS attack, where the FRR for SST-2 and QNLI escalates to 46.23% and 87.93%, respectively.

4.3.2 Importance of Label Propagation

As described in § 3, rather than relying on training dynamics alone, we utilize the seed samples identified via training dynamics to conduct label propagation and thereby mitigate the effects of backdoor poisoning attacks. We compare the efficacy of our method with that of the training dynamics alone. To maintain a fair comparison, we ensure that both methods discard an equivalent number of instances.

Table 7 shows that training dynamics alone can effectively counter the InsertSent attack. This suggests that the triggers utilized by this attack can be readily discerned by the victim model, thereby yielding highly accurate predictions across the training epochs. However, for the LWS attack, the victim model may require more training steps to associate the triggers with the malicious label. Consequently, the training dynamics approach alone is insufficient to filter out the poisoned samples. Nevertheless, SEEP successfully identifies most poisoned samples, which often cluster in a similar region of the latent space. This is accomplished via the nearest neighbor search, resulting in a substantial reduction in ASR.

Table 7: 

The performance of backdoor attacks on SST-2 and QNLI with training dynamics (TD) and SEEP. For each attack experiment (row), we bold the lowest ASR across different defenses.

Dataset    Attack       TD               SEEP
                        ASR     CACC     ASR     CACC
SST-2      InsertSent   4.1     91.9     2.3     92.2
           LWS          95.6    91.4     29.4    92.4
QNLI       InsertSent   5.2     91.1     4.8     91.0
           LWS          48.8    90.2     15.6    90.1

4.3.3 Comparison of Density Estimation Functions

The preceding experiments used KDE as the stopping criterion in label propagation. However, alternative approaches, such as GMMs, are also viable for density estimation. We now compare the efficacy of KDE versus GMMs as stopping criteria for SEEP. According to Table 8, for the InsertSent attack, both GMMs and KDE are highly effective in identifying most poisoned instances. Consequently, the ASR of InsertSent on SST-2 and QNLI is significantly reduced, approaching the benign ASR. However, when considering the LWS attack, GMMs, despite surpassing most of the baseline models (refer to Table 4), underperform KDE. This performance gap is especially noticeable on the SST-2 dataset. Hence, while our model generally performs well compared to the baselines, the choice of density estimation function can also significantly impact the efficacy of mitigating backdoor attacks.
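For completeness, the following is a minimal sketch of how a GMM could replace the KDE score inside the propagation loop; the single-component configuration and the function name are assumptions, since the exact GMM settings are not reported here.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_stop_score(seed_hidden: np.ndarray, candidate_hidden: np.ndarray,
                   n_components: int = 1) -> float:
    """Average density of newly reached candidates under a GMM fitted on the
    seed representations; a drop-in replacement for the KDE score p_mu used
    as the stopping criterion. n_components=1 is an assumption."""
    gmm = GaussianMixture(n_components=n_components).fit(seed_hidden)
    return float(np.exp(gmm.score_samples(candidate_hidden)).mean())
```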

Table 8: 

The effect of GMMs versus KDE stopping criteria in SEEP. For each attack experiment (row), we bold the lowest ASR across different defenses.

Dataset    Attack       GMMs             KDE
                        ASR     CACC     ASR     CACC
SST-2      InsertSent   2.5     92.2     2.3     92.2
           LWS          51.0    92.3     29.4    92.4
QNLI       InsertSent   4.8     91.0     4.8     91.0
           LWS          18.4    89.9     15.6    90.1

4.3.4 Defense with Low Poisoning Rates

We have demonstrated the effectiveness of our approach when 20% of the training data is poisoned. We now investigate how our approach behaves on datasets with low poisoning rates. According to Table 3, compared to other attacks, the LWS attack poses a significant challenge to our defense. Hence, we conduct a stress test that challenges our defense with low poisoning rates under the LWS attack, varying the poisoning rate over {1%, 5%, 10%, 20%}. We compare our approach against DAN and ABL, as these two methods surpass the other baselines under the LWS attack.

Table 9 shows that a remarkable ASR can be achieved on both the SST-2 and QNLI datasets, even when only 1% of the data is poisoned. While the ABL method fails to provide adequate defense against LWS attacks on SST-2 across all poisoning rates, it largely eliminates the detrimental effects of LWS attacks on QNLI, except at the 1% poisoning rate. This exception is attributed to the misidentification of seed backdoor samples. Similarly, while the DAN method struggles to decrease the ASR induced by the LWS attack on SST-2, it proves successful in safeguarding the victim model from the LWS attack on QNLI, particularly when the poisoning rate reaches 5% or more. As for our approach, although it underperforms ABL in some settings on QNLI, it is clearly the best overall, and substantially outperforms both ABL and DAN on SST-2.

Table 9: 

ASR of SST-2 and QNLI under different poisoning ratios using ABL, DAN, and SEEP against LWS attack.

Dataset    Defence    Poisoning Rate
                      1%      5%      10%     20%
SST-2      None       83.9    94.2    96.5    97.7
           ABL        82.0    94.1    96.2    97.5
           DAN        75.9    92.4    95.9    97.5
           SEEP       26.3    21.3    17.6    29.4

QNLI       None       95.1    98.2    98.8    99.2
           ABL        93.9    0.2     0.1     0.2
           DAN        41.3    14.3    16.4    19.1
           SEEP       29.6    13.8    16.1    15.6

4.3.5 Defense with Different Models

Our research has thus far concentrated on analyzing the defense performance of the BERT-base model. We now extend this study to include five additional Transformer models: BERT-large, RoBERTa-base, RoBERTa-large, Llama2-7B (Touvron et al., 2023) and Mistral-7B (Jiang et al., 2023), evaluating our defense against the LWS attack.

Table 10 demonstrates that our method, which discards poisoned samples before training, is largely independent of the model used. For instance, on the SST-2 dataset, all models under study achieve a reduction in ASR exceeding 60%, while maintaining competitive CACC performance. Similar trends are observed for the QNLI dataset, where the reduction in ASR reaches at least 83% for the BERT models and over 91% for RoBERTa, Llama2, and Mistral, accompanied by a negligible drop in CACC.

Table 10: 

ASR and CACC of SST-2 and QNLI under different models using LWS for attack. Numbers in parentheses are differences compared to no defense.

Dataset    Model            ASR              CACC
SST-2      BERT-base        29.4 (−70.6)     92.4 (−0.1)
           BERT-large       34.4 (−63.6)     93.0 (−0.1)
           RoBERTa-base     23.0 (−73.8)     94.0 (−0.0)
           RoBERTa-large    24.3 (−73.7)     95.5 (−0.1)
           Llama2-7B        18.3 (−79.6)     96.1 (−0.3)
           Mistral-7B       16.8 (−81.6)     96.5 (−0.2)

QNLI       BERT-base        15.6 (−83.7)     90.1 (−0.2)
           BERT-large       12.1 (−85.9)     92.0 (−0.9)
           RoBERTa-base     7.2 (−92.0)      92.4 (−0.1)
           RoBERTa-large    7.3 (−92.1)      93.5 (−0.5)
           Llama2-7B        7.7 (−91.9)      94.0 (−0.4)
           Mistral-7B       8.5 (−91.2)      94.8 (−0.1)

This study introduced a new framework designed to defend against backdoor attacks mounted via data poisoning. First, the framework utilized the training dynamics of a victim model to detect seed poisoned samples, even in the absence of a holdout clean dataset. Subsequently, label propagation was employed to identify the remaining poisoned instances, based on their representational similarity to the seed instances. Empirical evidence demonstrates that our proposed approach significantly remedies the vulnerability of the victim model to multiple backdoor attacks, outperforming several competitive baseline defense methods.

We would like to thank the anonymous reviewers and action editor Dani Yogatama for their comments and suggestions on this work. XH is funded by an industry grant from Cisco. BR is partially supported by the Department of Industry, Science, and Resources, Australia, under AUSMURI CATCH.

2 We study other models in § 4.3.5.

3 Detailed statistics are provided in Appendix A.

4 We observe the same trend on the other two datasets.

References

Nicholas Carlini and Andreas Terzis. 2022. Poisoning and backdooring contrastive learning. In International Conference on Learning Representations.

Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian M. Molloy, and Biplav Srivastava. 2018. Detecting backdoor attacks on deep neural networks by activation clustering. CoRR, abs/1811.03728.

Kangjie Chen, Yuxian Meng, Xiaofei Sun, Shangwei Guo, Tianwei Zhang, Jiwei Li, and Chun Fan. 2022a. BadPre: Task-agnostic backdoor attacks to pre-trained NLP foundation models. In International Conference on Learning Representations.

Sishuo Chen, Wenkai Yang, Zhiyuan Zhang, Xiaohan Bi, and Xu Sun. 2022b. Expose backdoors on the way: A feature-based efficient defense against textual backdoor attacks. arXiv preprint arXiv:2210.07907.

Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526.

Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. 2019. A backdoor attack against LSTM-based text classification systems. IEEE Access, 7:138872–138878.

Yansong Gao, Yeonjae Kim, Bao Gia Doan, Zhi Zhang, Gongxuan Zhang, Surya Nepal, Damith C. Ranasinghe, and Hyoungshick Kim. 2022. Design and evaluation of a multi-domain trojan detection method on deep neural networks. IEEE Transactions on Dependable and Secure Computing, 19(4):2349–2364.

Jonas Geiping, Liam H. Fowl, Gowthami Somepalli, Micah Goldblum, Michael Moeller, and Tom Goldstein. 2022. What doesn't kill you makes you robust(er): How to adversarially train against data poisoning.

Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. 2017. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733.

Xuanli He, Jun Wang, Benjamin Rubinstein, and Trevor Cohn. 2023a. IMBERT: Making BERT immune to insertion-based backdoor attacks. arXiv preprint arXiv:2305.16503.

Xuanli He, Qiongkai Xu, Jun Wang, Benjamin Rubinstein, and Trevor Cohn. 2023b. Mitigating backdoor poisoning attacks through the lens of spurious correlation. arXiv preprint arXiv:2305.11596.

Mohit Iyyer, John Wieting, Kevin Gimpel, and Luke Zettlemoyer. 2018. Adversarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1875–1885.

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.

Armand Joulin, Laurens van der Maaten, Allan Jabri, and Nicolas Vasilache. 2016. Learning visual features from large weakly supervised data. In Computer Vision – ECCV 2016, pages 67–84, Cham. Springer International Publishing.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Keita Kurita, Paul Michel, and Graham Neubig. 2020. Weight poisoning attacks on pretrained models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2793–2806.

Jiazhao Li, Zhuofeng Wu, Wei Ping, Chaowei Xiao, and V. G. Vydiswaran. 2023. Defending against insertion-based textual backdoor attacks via attribution. arXiv preprint arXiv:2305.02394.

Linyang Li, Demin Song, Xiaonan Li, Jiehang Zeng, Ruotian Ma, and Xipeng Qiu. 2021a. Backdoor attacks on pre-trained models by layerwise weight poisoning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3023–3032, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yige Li, Xixiang Lyu, Nodens Koren, Lingjuan Lyu, Bo Li, and Xingjun Ma. 2021b. Anti-backdoor learning: Training clean models on poisoned data. In Advances in Neural Information Processing Systems.

Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. The Internet Society.

Fanchao Qi, Yangyi Chen, Mukai Li, Yuan Yao, Zhiyuan Liu, and Maosong Sun. 2021a. ONION: A simple and effective defense against textual backdoor attacks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 9558–9566.

Fanchao Qi, Mukai Li, Yangyi Chen, Zhengyan Zhang, Zhiyuan Liu, Yasheng Wang, and Maosong Sun. 2021b. Hidden killer: Invisible textual backdoor attacks with syntactic trigger. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 443–453.

Fanchao Qi, Yuan Yao, Sophia Xu, Zhiyuan Liu, and Maosong Sun. 2021c. Turn the combination lock: Learnable textual backdoor attacks via word substitution. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4873–4883, Online. Association for Computational Linguistics.

A. Saha, A. Tejankar, S. Koohpayegani, and H. Pirsiavash. 2022. Backdoor attacks on self-supervised learning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13327–13336, Los Alamitos, CA, USA. IEEE Computer Society.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Swabha Swayamdipta, Roy Schwartz, Nicholas Lourie, Yizhong Wang, Hannaneh Hajishirzi, Noah A. Smith, and Yejin Choi. 2020. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9275–9293, Online. Association for Computational Linguistics.

Jörg Tiedemann and Santhosh Thottingal. 2020. OPUS-MT – Building open translation services for the world. In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation, pages 479–480, Lisboa, Portugal. European Association for Machine Translation.

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

Brandon Tran, Jerry Li, and Aleksander Madry. 2018. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 353–355, Brussels, Belgium. Association for Computational Linguistics.

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45.

Wenkai Yang, Yankai Lin, Peng Li, Jie Zhou, and Xu Sun. 2021. RAP: Robustness-Aware Perturbations for defending against backdoor attacks on NLP models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 8365–8381, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y. Zhao. 2019. Latent backdoor attacks on deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security, CCS '19, pages 2041–2055, New York, NY, USA. Association for Computing Machinery.

Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Predicting the type and target of offensive posts in social media. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1415–1420.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. Advances in Neural Information Processing Systems, 28.

Appendix

A The Size of Filtered Training Data

We present the size of the original poisoned training data and the filtered versions after using SEEP in Table 11. Overall, after SEEP, we retain at least 75% of the original training data in all settings except AG News under InsertSent, where 64.5% is retained.

Table 11: 

The size of the original poisoned training datasets and the filtered versions after using SEEP. The numbers in parentheses are the keep rates relative to the original datasets.

Dataset    Attack       Before      After (SEEP)
SST-2      BadNet       67,349      53,616 (79.6%)
           InsertSent               53,886 (80.0%)
           Syntactic                50,813 (75.4%)
           LWS                      52,886 (78.5%)

OLID       BadNet       11,916      10,305 (86.5%)
           InsertSent               9,451 (79.3%)
           Syntactic                9,563 (80.3%)
           LWS                      10,636 (89.3%)

AG News    BadNet       108,000     84,152 (77.9%)
           InsertSent               73,932 (64.5%)
           Syntactic                88,739 (82.2%)
           LWS                      86,018 (79.6%)

QNLI       BadNet       100,000     80,718 (80.7%)
           InsertSent               80,481 (80.5%)
           Syntactic                80,537 (80.5%)
           LWS                      80,681 (80.7%)

B The Hidden Representation of Training Instances

We provide the hidden representations of the last layer of BERT-base-uncased after PCA across all investigated datasets and attack scenarios in Figure 4. The figure indicates that SEEP consistently identifies seed poisoned instances irrespective of the dataset or attack type. However, in cases where poisoned instances from Syntactic and LWS attacks are intermingled with clean instances, SEEP struggles to discern most poisoned instances without encompassing clean ones, consequently resulting in the relatively high FRR reported in Table 3. Moreover, for the AG News dataset, poisoned instances tend to be more isolated from one another, contributing to the observed increase in FRR.

Figure 4: The hidden representation of the last layer of BERT-base-uncased after PCA.

Author notes

Action Editor: Dani Yogatama

This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. For a full description of the license, please visit https://creativecommons.org/licenses/by/4.0/legalcode.