Abstract
The prevalence and strong capability of large language models (LLMs) present significant safety and ethical risks if exploited by malicious users. To prevent the potentially deceptive usage of LLMs, recent work has proposed algorithms to detect LLM-generated text and protect LLMs. In this paper, we investigate the robustness and reliability of these LLM detectors under adversarial attacks. We study two types of attack strategies: 1) replacing certain words in an LLM’s output with their synonyms given the context; 2) automatically searching for an instructional prompt to alter the writing style of the generation. In both strategies, we leverage an auxiliary LLM to generate the word replacements or the instructional prompt. Different from previous works, we consider a challenging setting where the auxiliary LLM can also be protected by a detector. Experiments reveal that our attacks effectively compromise the performance of all detectors in the study with plausible generations, underscoring the urgent need to improve the robustness of LLM-generated text detection systems. Code is available at https://github.com/shizhouxing/LLM-Detector-Robustness.
1 Introduction
Large language models (LLMs), such as ChatGPT (OpenAI, 2023b), PaLM (Chowdhery et al., 2022), and LLaMA (Touvron et al., 2023), have demonstrated human-like capabilities to generate high-quality text, follow instructions, and respond to user queries. Although LLMs can improve human productivity, they also pose ethical and safety concerns, such as making it difficult to differentiate LLM-generated text from human-written text. For example, LLMs may be misused for academic plagiarism or for creating misinformation at large scale (Zellers et al., 2019). Therefore, it is important to develop reliable approaches for protecting LLMs and detecting AI-generated text, so as to mitigate the abuse of LLMs.
Toward this end, previous work has developed methods for automatically detecting text generated by LLMs. Existing methods mainly fall into three categories: 1) Classifier-based detectors by training a classifier, often a neural network, from data with AI-generated/human-written labels (Solaiman et al., 2019; OpenAI, 2023a); 2) Watermarking (Kirchenbauer et al., 2023) by injecting patterns into the generation of LLMs such that the pattern can be statistically detected but imperceptible to humans; 3) Likelihood-based detectors, e.g., DetectGPT (Mitchell et al., 2023), by leveraging the log-likelihood of generated texts. However, as recent research demonstrates that text classifiers are vulnerable to adversarial attacks (Iyyer et al., 2018; Ribeiro et al., 2018; Alzantot et al., 2018), these LLM text detectors may not be reliable when faced with adversarial manipulations of AI-generated texts.
In this paper, we stress-test the reliability of LLM text detectors. We assume that there is an LLM G that generates an output Y = G(X) given input X. G is protected when there exists a detector f that can detect text normally generated by G with high accuracy. An attack aims to manipulate the generation process such that a new output Y′ is still plausible given input X while the detector fails to identify Y′ as LLM-generated. The attack may leverage another attacker LLM G′.
In this context, we propose two novel attack methods. In the first method, we prompt G′ to generate candidate substitutions of words in Y, and we then choose certain substitutions either in a query-free way or through a query-based evolutionary search (Alzantot et al., 2018) to attack the detector. Our second method focuses on classifier-based detectors for instruction-tuned LLMs such as ChatGPT (OpenAI, 2023b). We automatically search for an additional instructional prompt with a small subset of training data for a given classifier-based detector. At inference time, the additional instructional prompt instructs the LLM to generate new texts that are hard to detect.
Several concurrent studies (Sadasivan et al., 2023; Krishna et al., 2023) proposed to attack detectors by paraphrasing AI-generated texts with a different language model G′. However, they assume that G′ is not protected by a detector. Paraphrasing with G′ has been shown to be effective against detectors designed for the original LLM G, but it becomes much less effective when G′ is also protected, since the paraphrased text can still be flagged by the detector of G′. In contrast, we demonstrate that even when the attacker LLM G′ is also protected by a detector, we can still leverage G′ to attack LLM detectors. Therefore, even if all strong LLMs are protected in the future, currently existing detectors can still be vulnerable to our attacks.
We experiment with our attacks on all three aforementioned categories of LLM detectors. Our results reveal that all the tested detectors are vulnerable to our proposed attacks: their detection performance degrades significantly under our attacks, while the texts produced by our attacks still mostly maintain reasonable quality, as verified by human evaluation. Our findings suggest that current detectors are not yet sufficiently reliable and that further effort is needed to develop more robust LLM detectors.
2 Related Work
Detectors for AI-generated Text.
Recent detectors for AI-generated text mostly fall into three categories. First, classifier-based detectors are trained with labeled data to distinguish human-written text from AI-generated text. For example, the AI Text Classifier developed by OpenAI (OpenAI, 2023a) is a fine-tuned language model. Second, watermarking methods introduce distinct patterns into AI-generated text, allowing for its identification. Among them, Kirchenbauer et al. (2023) randomly partition the vocabulary into a greenlist and a redlist during generation, where the partition is based on a hash of the previously generated tokens. The language model is biased toward greenlist words, and thereby the generated text carries a statistical pattern that human-written text, which is agnostic to such greenlists and redlists, does not. Third, DetectGPT (Mitchell et al., 2023) uses the likelihood of the generated text for detection, based on the observation that text generated by language models tends to reside in negative-curvature regions of the log-probability function; they accordingly define a curvature-based detection criterion.
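To make the watermark detection concrete, the sketch below illustrates the greenlist-counting statistic in a simplified form; the greenlist here is seeded directly with the previous token ID rather than the keyed hash used by Kirchenbauer et al. (2023), so it is an illustration of the idea, not their exact implementation.

```python
import math
import random

def greenlist(prev_token_id: int, vocab_size: int, gamma: float):
    # Simplified: seed a PRNG with the previous token ID and take a random
    # gamma-fraction of the vocabulary as the greenlist. The original method
    # uses a keyed hash of the preceding tokens instead.
    rng = random.Random(prev_token_id)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])

def watermark_z_score(token_ids, vocab_size: int, gamma: float = 0.5) -> float:
    # Count how many tokens fall into the greenlist induced by their
    # predecessor, then compute a one-proportion z-statistic: under the null
    # (human text), each token is greenlisted with probability gamma.
    hits = sum(
        1 for prev, cur in zip(token_ids, token_ids[1:])
        if cur in greenlist(prev, vocab_size, gamma)
    )
    t = max(len(token_ids) - 1, 1)
    return (hits - gamma * t) / math.sqrt(t * gamma * (1 - gamma))
```

A text is flagged as watermarked when the z-score exceeds a chosen threshold; the query-free attack described in Section 4.3 works precisely by lowering this greenlist hit count.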
Methods for Red-teaming Detectors.
As detectors emerge, several concurrent works showed that they can be evaded to some extent, typically by paraphrasing the text (Sadasivan et al., 2023; Krishna et al., 2023). However, these attacks require an additional paraphrasing model, typically an unprotected model that is much weaker than the original LLM. Besides paraphrasing, Kirchenbauer et al. (2023) also discussed attacks against watermark detectors using word substitutions generated by a masked language model such as T5 (Raffel et al., 2020); as a relatively weak language model, T5 tends to produce lower-quality substitutions and thus lower-quality attacks. On the other hand, Chakraborty et al. (2023) analyzed the possibility of detection given sufficiently many samples.
Adversarial Examples in NLP.
Word substitution is a commonly used strategy for generating textual adversarial examples (Alzantot et al., 2018; Ren et al., 2019; Jin et al., 2020). Language models such as BERT (Devlin et al., 2019) have also been used to generate word substitutions (Shi and Huang, 2020; Li et al., 2020; Garg and Ramakrishnan, 2020). In this work, we demonstrate the effectiveness of using the latest LLMs for generating high-quality word substitutions, and our query-based word substitutions are inspired by the genetic algorithm in Alzantot et al. (2018) and Yin et al. (2020). Our instructional prompts are related to recent work that prompts LLMs to red-team LLMs themselves (Perez et al., 2022), whereas we target detectors. In addition, we fix a single instructional prompt at test time, which is partly similar to universal triggers in adversarial attacks (Wallace et al., 2019; Behjati et al., 2019); however, unlike those triggers, which are unnatural token sequences, our prompt is natural and is added to the input of the generative model rather than fed to the detector directly.
Safety of Large Language Models.
Detecting AI-generated texts is also related to the broader topic of LLM safety. Research on the safety of LLMs aims to reduce privacy leakage and intellectual property concerns (Wallace et al., 2020; Carlini et al., 2021; Jagielski et al., 2023; Zhao et al., 2023), detect potential misuse (Hendrycks et al., 2018; Perez et al., 2022), defend against malicious users or trojans (Wallace et al., 2019, 2021), or detect hallucinations (Zhou et al., 2021; Liu et al., 2022). See Hendrycks et al. (2021) for a roadmap of machine learning safety challenges. We test the reliability of several LLM text detection systems, which helps better understand the current progress in LLM text detection.
3 Settings and Overview
We consider a large language model G that conditions on an input context or prompt X and generates an output text Y = G(X). We use upper-case characters to denote a sequence of tokens. For example, X = [x1, x2,…, xm], where m is the sequence length. The model G is protected by a detector f(Y) ∈ [0,1] that predicts whether Y is generated by an LLM, where a higher detection score f(Y) means that Y is more likely to be LLM-generated. We use τ to denote a detection threshold such that Y is considered LLM-generated if f(Y) ≥ τ.
In this work, we consider three categories of detectors: (1) classifier-based detectors, (2) watermark detectors, and (3) likelihood-based detectors. For classifier-based detectors, a text classifier f(Y) is trained on a labeled dataset with G-generated and human-written texts. For watermark detectors, G is modified from a base generator G0 with a watermarking mechanism W, denoted as G = W(G0), and a watermark detector f(Y) is constructed to predict whether Y is generated by the watermarked LLM G. Specifically, we consider the watermarking mechanism in Kirchenbauer et al. (2023). For likelihood-based detectors, they estimate f(Y) by comparing the log probabilities of Y and several random perturbations of Y. Specifically, we consider DetectGPT (Mitchell et al., 2023). We consider a model G as protected if there is a detector f(Y) in place to protect the model from inappropriate usage.
To stress-test the reliability and robustness of these detectors, we develop red-teaming techniques that generate texts to degrade a detector's performance, using an LLM that is itself protected by the same detector. We consider attacks by output perturbation and input perturbation respectively:
Output perturbation perturbs the original output Y and generates a perturbed output Y′.
Input perturbation perturbs the input X into X′ as the new input, leading to a new output Y′ = G(X′).
In both cases, we aim to minimize f(Y′) so that the new output Y′ is wrongly considered as human-written by the detector f. Meanwhile, we require that Y′ has a quality similar to Y and remains a plausible output to the original input X. For our attack algorithms, we also assume that the detector f is black-box—only the output scores are visible but not its internal parameters.
We propose to attack the detectors in two different ways. In Section 4, we construct an output perturbation by replacing some words in Y: we prompt a protected LLM G′ to obtain candidate substitution words, and we then build query-based and query-free attacks respectively to decide the substitutions. In Section 5, when G can follow instructions, we search for an instructional prompt based on generations from G and append it to X as an input perturbation; the instructional prompt steers G to generate text in a style that is hard for the detector to detect. Table 1 summarizes our methods and their applicability to different detectors. At test time, instructional prompts are fixed and thus fully query-free. Word substitutions require querying G′ multiple times on each test example; the query-free version does not query f repeatedly, while the query-based version additionally queries f multiple times. In practice, we may choose between these methods depending on the query budget and their applicability to the detectors.
Attack | Perturbation type | Test-time queries to G′ | Test-time queries to f | Applicable: Classifier | Applicable: Watermarking | Applicable: Likelihood
---|---|---|---|---|---|---
Word Substitutions (query-free) | Output | ✓ | − | ✓ | ✓ | ✓
Word Substitutions (query-based) | Output | ✓ | ✓ | ✓ | − | ✓
Instructional Prompts | Input | − | − | ✓ | − | −
4 Attack with Word Substitutions
To attack the detectors with output perturbations, we aim to find a perturbed output Y′ that is out of the original detectable distribution. This is achieved by substituting certain words in Y. To obtain suitable substitution words for the tokens in Y that preserve the fluency and semantic meaning, we utilize a protected LLM denoted as G′. For each token in Y denoted as yk, we use s(yk, Y, G′, n) to denote the process of generating at most n word substitution candidates for yk given the context in Y by prompting G′, and s(yk, Y, G′, n) outputs a set of at most n words. Note that not every word can be substituted, and s(yk, Y, G′, n) can be an empty set if it is not suitable to replace yk. We will discuss how we generate the word substitution candidates using G′ in Section 4.1.
General Attack Objective.
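As a sketch, the word-substitution attack can be cast as a constrained search over perturbed outputs: using the candidate sets s(y_k, Y, G′, n) and a substitution budget ε (the budget used in Section 4.3), one seeks

```latex
% A plausible formalization of the word-substitution objective; \epsilon is the
% substitution budget from Section 4.3. This is a sketch consistent with the
% surrounding definitions, not a verbatim reproduction of the paper's equations.
\begin{aligned}
\min_{Y'} \;\; & f(Y') \\
\text{s.t.} \;\; & y'_k \in \{y_k\} \cup s(y_k, Y, G', n), \quad k = 1, \dots, |Y|, \\
& \textstyle\sum_{k=1}^{|Y|} \mathbb{1}\!\left[ y'_k \neq y_k \right] \;\le\; \epsilon \, |Y|,
\end{aligned}
```

so that the perturbed output Y′ minimizes the detection score while staying within the budget and within the context-compatible candidates proposed by G′.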
4.1 Generating Word Substitution Candidates
Table 2 shows the prompts we use and the outputs produced by G′, when G′ is ChatGPT and LLaMA, respectively. ChatGPT is able to follow instructions, and thus our prompt is an instruction asking the model to generate substitution words, and multiple words can be substituted simultaneously. For LLaMA, which has less instruction-following ability, we expect it to generate a text completion following our prompt, where the prompt is designed such that a plausible text completion consists of suggested substitution words, and we replace one word at a time.
The benefit of using an LLM here is that it yields substitution words that not only have meanings similar to the original word but are also compatible with the context, echoing previous work that used language models such as BERT to generate adversarial examples (Shi and Huang, 2020). This is more convenient than earlier methods that draw substitutions from synonym lists and then require a separate language model to check their compatibility with the context (Alzantot et al., 2018).
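As an illustration, candidate generation can be wrapped as a small helper around a chat-style LLM. The prompt wording and the `chat` callable below are placeholders rather than the exact prompts in Table 2, and the helper queries one word at a time (closer to the LLaMA setting above); it is a sketch of the interface s(y_k, Y, G′, n) under those assumptions.

```python
from typing import Callable, Set

def substitution_candidates(
    word: str,
    context: str,
    chat: Callable[[str], str],  # wrapper around a protected LLM G' (placeholder)
    n: int = 5,
) -> Set[str]:
    """Approximate s(y_k, Y, G', n): ask G' for up to n context-aware
    replacements of `word` in `context`; return an empty set if none fit."""
    prompt = (
        f"In the passage below, suggest up to {n} single-word replacements for "
        f"the word '{word}' that keep the sentence fluent and preserve its "
        f"meaning. Reply with a comma-separated list only, or 'NONE'.\n\n{context}"
    )
    reply = chat(prompt).strip()
    if reply.upper().startswith("NONE"):
        return set()
    candidates = {w.strip() for w in reply.split(",") if w.strip()}
    # Drop degenerate suggestions such as the original word itself.
    candidates.discard(word)
    return set(list(candidates)[:n])
```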
4.2 Query-based Word Substitutions
For query-based substitutions, we use the evolutionary search algorithm (Alzantot et al., 2018; Yin et al., 2020) originally designed for generating adversarial examples in NLP. The algorithm starts from a population of perturbed texts, each obtained by randomly replacing a certain number of tokens in the input text. It then iterates over several generations, selecting the elites of each population, i.e., the substitutions that lead to the lowest detection scores. New generations are constructed by crossing over the elite substitutions from the previous generation.
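A minimal sketch of this search is shown below; the population size mirrors the setting in Section 6.2, but selection and crossover are simplified relative to Alzantot et al. (2018), tokens are rejoined with whitespace for scoring, and strict budget enforcement after crossover is omitted for brevity.

```python
import random
from typing import Callable, Dict, List, Set

def evolutionary_attack(
    tokens: List[str],
    candidates: Dict[int, Set[str]],   # position -> substitution candidates
    detector: Callable[[str], float],  # black-box f(Y') in [0, 1]
    budget: int,                       # max number of substituted positions
    pop_size: int = 10,
    generations: int = 7,
) -> List[str]:
    positions = [i for i in candidates if candidates[i]]

    def random_variant() -> List[str]:
        out = list(tokens)
        for i in random.sample(positions, min(budget, len(positions))):
            out[i] = random.choice(sorted(candidates[i]))
        return out

    def crossover(a: List[str], b: List[str]) -> List[str]:
        # Mix two parents position-wise (may slightly exceed the budget).
        return [x if random.random() < 0.5 else y for x, y in zip(a, b)]

    population = [random_variant() for _ in range(pop_size)]
    for _ in range(generations):
        # Rank by detection score and keep the lowest-scoring variants as elites.
        scored = sorted(population, key=lambda y: detector(" ".join(y)))
        elites = scored[:2]
        population = elites + [
            crossover(*random.sample(elites + scored[2:4], 2))
            for _ in range(pop_size - len(elites))
        ]
    return min(population, key=lambda y: detector(" ".join(y)))
```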
4.3 Query-free Word Substitutions
For the query-free attack, we simply apply word substitution on random tokens in Y to attack DetectGPT and classifier-based detectors. For watermark detectors, we further design an effective query-free attack utilizing the properties of the detection method.
The watermark detector essentially counts the greenlist tokens in a text; therefore, given a fixed substitution budget ϵ, we aim to identify and substitute greenlist tokens so as to reduce their total count. We achieve this with a two-stage algorithm. In the first stage, we sort all tokens in Y by the prediction entropy estimated with a language model M, which can be either the generative model G itself or a weaker model, since we only use the entropy as a heuristic score. The prediction entropy is estimated by feeding M the prefix or the masked text without the word being estimated. Because the watermarking offset δ is applied during decoding, a token with higher entropy is more likely to have been affected by the watermark. In the second stage, we pick the ϵm tokens with the highest entropy and use a watermarked LLM G′ to generate word substitutions as introduced in Section 4.1.
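A sketch of the entropy-based first stage is given below, using a Hugging Face causal LM as the scoring model M; the model name is only illustrative, and our experiments instead use T5-Large with masked prediction rather than this left-to-right variant.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def entropy_ranked_positions(text: str, model_name: str = "gpt2"):
    """Rank token positions in `text` by the prediction entropy of a scoring
    LM M (here a causal LM as a stand-in); high-entropy positions are more
    likely to have been pushed onto the greenlist by the watermark offset."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()

    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits  # shape (1, T, V)

    # Entropy of the distribution predicting each token from its prefix.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # shape (T-1,)

    # Position t+1 is predicted by the logits at position t.
    order = torch.argsort(entropy, descending=True) + 1
    return order.tolist()
```

One would then take the top ϵ·m positions from this ranking and query the protected G′ for substitutions as in Section 4.1.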
5 Attack by Instructional Prompts
In this section, we build attacks by perturbing the input prompt to encourage LLMs to generate texts that are difficult to detect. In particular, we focus on LLM-based generative models that can follow instructions, and on classifier-based detectors. We consider ChatGPT (OpenAI, 2023b) as the generative model G and the OpenAI AI Text Classifier (OpenAI, 2023a) as the detector f. The OpenAI AI Text Classifier is a fine-tuned neural network, and neural networks have been shown to be vulnerable to distribution shifts in the NLP literature (Miller et al., 2020; Awadalla et al., 2022). Therefore, we aim to shift the generated text to a different distribution where the detector is more likely to fail. We do not require the shifted generation to be semantically equivalent to the original text, but it should still be a plausible output for the given input.
We achieve this by searching for an additional prompt Xp appended to the original input X, which forms a new input X′ = [X, Xp] to G. In particular, Xp consists of Xins and Xref, where Xins is an instruction asking the model to follow the writing style of reference Xref.
Searching for Xp.
We use G to generate various Xins and Xref in each iteration and search for an optimal Xp = [Xins, Xref] following the objective in Eq. (3). Initially, we set Xins to a manually written instruction, “Meanwhile please imitate the writing style and wording of the following passage:”. An initial value for Xref is not necessary. We also create a priority queue Q and initialize it with n initial outputs generated from the n training examples without Xp. Q sorts its elements according to their detection scores from f and prioritizes those with lower scores. Each iteration of the search consists of two steps:
Updating Xref: We pop the top-K candidates from Q. We combine each candidate with the current Xins as a potential candidate for Xp in the current iteration.
Updating Xins: We instruct G to generate K variations of the current Xins, inspired by Zhou et al. (2023) on automatic prompt engineering, and we combine each variation with the current Xref as a potential candidate for Xp.
In both steps, we take the best candidate Xp according to Eq. (3). When generating G([Xi, Xp]) in Eq. (3), we push all the generated outputs to Q as candidates for Xref in later rounds. We run T iterations and return the final Xp = [Xins, Xref] to be used at test time, so that Y′ = G([X, Xp]) is the new output given input X.
For some Xp, we find that G may directly copy text from Xref when generating Y′. To prevent this behavior, we compute a matching score between Xref and Y′ and discard a candidate Xp during the search if more than 20% of the words from Xref (excluding stop words) appear in Y′. We find that this effectively prevents the copying behavior.
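The search procedure can be summarized with the sketch below; `generate`, `detector`, and `rewrite_instruction` are placeholder wrappers around G, f, and the instruction-rephrasing step, the objective of Eq. (3) is abbreviated to the mean detection score over the training inputs, and the anti-copying filter above is omitted for brevity.

```python
import heapq
from typing import Callable, List, Tuple

def search_prompt(
    train_inputs: List[str],
    generate: Callable[[str], str],            # wrapper around G (placeholder)
    detector: Callable[[str], float],          # black-box detector f
    rewrite_instruction: Callable[[str, int], List[str]],  # K variations of X_ins
    iterations: int = 5,
    top_k: int = 5,
) -> Tuple[str, str]:
    x_ins = ("Meanwhile please imitate the writing style and wording "
             "of the following passage:")

    # Priority queue Q of candidate references, keyed by detection score
    # (lower scores are preferred).
    queue: List[Tuple[float, str]] = []
    for x in train_inputs:
        y = generate(x)
        heapq.heappush(queue, (detector(y), y))

    def score(xp: str) -> float:
        # Surrogate for Eq. (3): mean detection score of outputs generated with
        # Xp appended; outputs are recycled as future X_ref candidates.
        outs = [generate(x + "\n" + xp) for x in train_inputs]
        for y in outs:
            heapq.heappush(queue, (detector(y), y))
        return sum(detector(y) for y in outs) / len(outs)

    x_ref = heapq.nsmallest(1, queue)[0][1]
    for _ in range(iterations):
        # Step 1: update X_ref using the top-K lowest-scoring outputs so far.
        refs = [y for _, y in heapq.nsmallest(top_k, queue)] + [x_ref]
        x_ref = min(refs, key=lambda r: score(x_ins + "\n" + r))
        # Step 2: update X_ins with K rephrasings generated by G.
        instrs = rewrite_instruction(x_ins, top_k) + [x_ins]
        x_ins = min(instrs, key=lambda i: score(i + "\n" + x_ref))
    return x_ins, x_ref
```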
6 Experiments
6.1 Experimental Settings
Generative Models and Detectors.
We experiment with a wide range of generative LLMs and corresponding detectors. For the generative model G, we consider GPT-2-XL (Radford et al., 2019), LLaMA-65B (Touvron et al., 2023), and ChatGPT (gpt-3.5-turbo) (OpenAI, 2023b). For detectors, we consider DetectGPT (Mitchell et al., 2023), watermarking (Kirchenbauer et al., 2023), and classifier-based detectors (OpenAI, 2023a; Solaiman et al., 2019). For DetectGPT, we use GPT-Neo (Black et al., 2021) as the scoring model to estimate the log-likelihood. DetectGPT also requires masking spans of the text and filling them in with an external T5-3B model (Raffel et al., 2020); we fix the mask rate to 15%. Watermarking is applied to the open-source LLaMA-65B and GPT-2-XL but not to ChatGPT, as it requires access to the token logits during generation. We use γ = 0.5 in all watermarking experiments, following the default setting in Kirchenbauer et al. (2023). The classifier-based detectors include a fine-tuned RoBERTa-Large detector (Solaiman et al., 2019) for GPT-2 texts and the OpenAI AI Text Classifier (OpenAI, 2023a) for ChatGPT texts. We summarize all generative models and detectors considered in the experiments in Table 3.
Generative Model G | DetectGPT | Watermarking | Classifier-based Detector
---|---|---|---
GPT-2-XL | ChatGPT | LLaMA-65B | ChatGPT
LLaMA-65B | LLaMA-65B | LLaMA-65B | –
ChatGPT | ChatGPT | – | ChatGPT

(Each entry is the attacker LLM G′ used against the corresponding detector; “–” means the combination is not evaluated.)
For experiments involving watermarking, we use a watermarked LLaMA-65B as G′, as we cannot implement watermarking on ChatGPT; we also use LLaMA-65B as G′ in the setting with LLaMA-65B itself as G and DetectGPT as the detector; and in other settings, we use ChatGPT as G′ which is protected by either DetectGPT or the classifier-based detector. The choice of protected LLM G′ is also summarized in Table 3.
Baselines.
To demonstrate the advantage of our methods in revealing detectors' weaknesses, we compare with several baselines. Dipper paraphrase (Krishna et al., 2023) is a recent method that trains a paraphrasing model to rewrite AI-generated text and bypass detectors; it prepends diversity codes to control the degree of paraphrasing. We use Dipper-paraphraser-XXL with lexical diversity 20 and order diversity 60, which yields roughly the same 20% unigram difference as our word substitutions. We use nucleus sampling (Holtzman et al., 2019) with p = 0.9 for the paraphraser. In the ChatGPT experiments, we also use ChatGPT itself for paraphrasing.
Datasets.
We mainly use two tasks: text completion and long-form question answering. We use XSum (Narayan et al., 2018) for text completion, where we take the first sentence as the input prompt, and ELI5 (Fan et al., 2019) for long-form question answering. In addition, for the RoBERTa-Large detector, we also use the GPT-2 output dataset (Solaiman et al., 2019), as this detector is fine-tuned solely on GPT-2 texts. Since the OpenAI AI Text Classifier requires the text to contain at least 1,000 characters, we filter the XSum and ELI5 datasets and only retain examples whose human reference text contains at least 1,000 characters. For each dataset, we shuffle the test set and use the first 100 examples.
Metrics.
We use several metrics to evaluate the detectors under attacks. Area Under the Receiver Operating Characteristic Curve (AUROC) scores summarize the performance of detectors under various thresholds. A detection rate (DR) is the true positive rate under a fixed threshold (positive examples mean LLM-generated texts), where we either tune the threshold to meet a particular false positive rate or follow the original thresholds of the detectors. For the GPT-2 output dataset, we also use Attack Success Rate (ASR) which computes the rate that the attack successfully flips the prediction by the detector, out of all the positive examples on which the detector originally predicts correctly.
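These metrics can be computed with standard tooling; the sketch below uses scikit-learn for AUROC and derives the detection rate at a fixed false positive rate from the ROC curve (ASR is omitted since it is a simple ratio over flipped predictions).

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def auroc_and_dr(human_scores, llm_scores, target_fpr=0.01):
    """AUROC over human (negative) and LLM (positive) detection scores, plus
    the detection rate (TPR) at the threshold whose FPR on human text does
    not exceed `target_fpr`."""
    y_true = np.concatenate([np.zeros(len(human_scores)), np.ones(len(llm_scores))])
    y_score = np.concatenate([human_scores, llm_scores])

    auroc = roc_auc_score(y_true, y_score)

    fpr, tpr, _ = roc_curve(y_true, y_score)
    feasible = fpr <= target_fpr
    dr = tpr[feasible].max() if feasible.any() else 0.0
    return auroc, dr
```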
6.2 Attack with Word Substitutions
We apply the word substitution-based attack to all three categories of detection methods. In each setting, we assume that both G and G′ are protected by the same detector f. We limit the number of substituted words to at most 20% of the total number of words, excluding stop words and proper nouns, which should not be substituted. This leads to around 7 substituted tokens per example. The evolutionary search requires 70 queries per example with a population size of 10.
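The substitution budget can be computed as in the sketch below, which uses spaCy to exclude stop words and proper nouns; the 20% ratio matches the setting above, while the specific spaCy pipeline is only an illustrative choice.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any English pipeline with POS tags works

def substitution_budget(text: str, ratio: float = 0.2) -> int:
    """Number of tokens allowed to be substituted: 20% of the tokens that are
    neither stop words nor proper nouns (which we never substitute)."""
    doc = nlp(text)
    eligible = [t for t in doc if t.is_alpha and not t.is_stop and t.pos_ != "PROPN"]
    return int(ratio * len(eligible))
```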
Attack against DetectGPT
The results are shown in Table 4, and an example ROC curve is shown in Figure 2. We find that DetectGPT's AUROC drops below random guessing when we simply apply query-free substitutions that randomly select substitutions from the candidate pool. For example, on XSum, the AUROC on GPT-2-XL drops from 84.4% to 25.9%; for ChatGPT and LLaMA-65B, it drops from 56.0% to 25.6% and from 59.3% to 25.5%, respectively. The AUROC scores further drop to only 3.9%, 4.5%, and 9.9%, respectively, with the query-based evolutionary search. Our word substitution methods consistently surpass Dipper paraphrasing, demonstrating that they reveal more vulnerabilities of the detectors. We do not report DR here because the DR values are already very low (usually below 10%) even without any attack, and we follow Mitchell et al. (2023), who also used only AUROC.
Generative Model | Dataset | Unattacked | Dipper Paraphrasing | Query-free Substitution | Query-based Substitution
---|---|---|---|---|---
GPT-2-XL | XSum | 84.4 | 35.2 | 25.9 | 3.9
GPT-2-XL | ELI5 | 70.6 | 36.7 | 21.2 | 3.8
ChatGPT | XSum | 56.0 | 34.6 | 25.6 | 4.5
ChatGPT | ELI5 | 55.0 | 39.5 | 12.2 | 6.5
LLaMA-65B | XSum | 59.3 | 49.0 | 25.5 | 9.9
LLaMA-65B | ELI5 | 60.5 | 53.1 | 31.4 | 18.6
Attack against Watermarking
We use a T5-Large model to estimate the prediction entropy of each token. In our main experiments, we select the 20% of tokens in the initial output Y with the highest prediction entropy to be replaced in the attack; we also compare attack performance under different replacement ratios in Figures 5 and 6. In all experiments in Table 5, we keep the first 100 tokens of each text. We filter the suggested word substitutions, keeping only candidates that consist of fewer than 4 tokens, to avoid invalid substitutions. We report the AUROC score and detection rate for each setting in Table 5, and show two example ROC curves in Figures 3 and 4. For the detection rates, we set the threshold so that the false positive rate on human text equals 1%, following Krishna et al. (2023). The results show that the detection rates are significantly degraded by the query-free word substitution attack under two watermarking settings, δ = 1.0 and 1.5. Although the detection rate on unattacked texts can be further increased by increasing δ, in practice the watermark strength should be kept at an appropriate level to avoid hurting generation quality (Kirchenbauer et al., 2023). Compared to Dipper paraphrasing, we achieve lower detection rates without using a separate unprotected paraphraser model. We also show qualitative examples of attacks against watermarking in Table 6.
Generative Model | δ | Dataset | Unattacked AUROC | Unattacked DR | Dipper Paraphrasing AUROC | Dipper Paraphrasing DR | Query-free Substitution AUROC | Query-free Substitution DR
---|---|---|---|---|---|---|---|---
GPT-2-XL | 1.0 | XSum | 98.0 | 81.0 | 84.7 | 38.0 | 85.9 | 34.0
GPT-2-XL | 1.0 | ELI5 | 97.14 | 72.0 | 86.3 | 34.0 | 86.9 | 33.0
GPT-2-XL | 1.5 | XSum | 99.4 | 96.0 | 94.1 | 72.0 | 91.6 | 35.0
GPT-2-XL | 1.5 | ELI5 | 98.7 | 91.0 | 95.0 | 67.0 | 91.5 | 47.0
LLaMA-65B | 1.0 | XSum | 88.9 | 22.0 | 79.3 | 12.0 | 67.8 | 10.0
LLaMA-65B | 1.0 | ELI5 | 94.7 | 70.0 | 81.6 | 45.0 | 79.6 | 34.0
LLaMA-65B | 1.5 | XSum | 96.4 | 63.0 | 85.5 | 27.0 | 84.5 | 16.0
LLaMA-65B | 1.5 | ELI5 | 99.6 | 95.0 | 92.3 | 72.0 | 91.8 | 61.0
Original output |
The scientists said the findings could help improve treatments and lead to new ways of preventing breast cancer spreading. Experts have described the findings as “exciting”. Bone is the most common site for breast cancer to spread to. Once breast cancer reaches the bone, it can be treated but often is not curable. In experiments in mice, the Sheffield researchers found breast cancer cells were sending signals to the cells inside |
Query-free Substitution |
The researchers said the findings could help improve drugs and lead to new ways of stopping breast cancer metastasizing. Experts have said the discoveries as “interesting”. Bone is the most common place for bone cancer to metastasize to. Once bone cancer spreadsto the bone, it can be treated but usually is not can be treated. In research in the Sheffield researchers, the Sheffield Scientists found cancer cancer cells were sending signals to the cells in. |
Attack against Classifier-based Detectors
Results for attacking GPT-2 text detector are shown in Table 7. We find that the attack success rate (ASR) on detecting GPT-2 texts is close to 0 for both paraphrasing and query-free substitutions. We hypothesize that this is because the detector is specifically trained on detecting GPT-2 texts, and it is hard to remove the patterns leveraged by those detectors by randomly selecting word substitutions or paraphrasing. Our evolutionary search-based substitutions achieve much better ASR compared to the query-free methods.
Attack | ASR
---|---
Dipper Paraphrasing | 4%
Query-free Substitutions | 2%
Query-based Substitutions | 68%
For the OpenAI AI Text Classifier shown in Table 10, query-free attacks are able to decrease the detection AUROC by 18.9 and 28.1 percentage points on XSum and ELI5, respectively, while query-based ones further decrease them by 45.4 and 55.6 percentage points to lower than random. Comparison with the attack using instructional prompts and more details are discussed in Section 6.3.
6.3 Attack with Instructional Prompts
We conduct experiments for our instructional prompts using ChatGPT as the generative model and the OpenAI AI Text Classifier as the classifier-based detector. The detector is model-detect-v2, accessible via OpenAI APIs as of early July 2023. We choose this detector because it is developed by a relatively renowned company and has been shown to achieve stronger detection accuracy (Krishna et al., 2023) than other classifier-based detectors such as GPTZero (Tian and Cui, 2023). This detector was also available at no cost when our experiments were conducted. Its output contains five classes, “likely”, “possibly”, “unclear if it is”, “unlikely”, and “very unlikely”, with thresholds 0.98, 0.90, 0.45, and 0.10, respectively. We follow these thresholds and use a threshold of 0.9 to compute detection rates.
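For reference, the mapping from detection scores to these labels can be written as a small helper; this is a sketch based on the thresholds quoted above, not an official API.

```python
def openai_classifier_label(score: float) -> str:
    # Thresholds as reported above: 0.98 / 0.90 / 0.45 / 0.10 (illustrative sketch).
    if score >= 0.98:
        return "likely"
    if score >= 0.90:
        return "possibly"
    if score >= 0.45:
        return "unclear if it is"
    if score >= 0.10:
        return "unlikely"
    return "very unlikely"
```

Under the 0.9 threshold used for detection rates, a text counts as detected when its label is at least “possibly”.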
We search for the instructional prompt using n = 50 training examples, T = 5 iterations, and K = 5 candidates per iteration. We show the prompts for querying ChatGPT in Table 8 and the results in Table 10 and Figure 1. Our instructional prompts significantly reduce the AUROC scores and detection rates compared to the unattacked setting and are more effective than paraphrasing with ChatGPT. While instructional prompts may not achieve lower AUROC or DR than word substitutions, they do not require querying G′ or f multiple times at test time, making them an efficient and still effective option. We show an example on ELI5 with various attacks in Table 14 and the instructional prompts found by our algorithm in Table 9.
Initial prompt for querying ChatGPT on XSum |
Please complete this passage with at least 150 words: {X} |
Initial prompt for querying ChatGPT on ELI5 |
Please answer this question with at least 150 words: {X} |
Prompt for paraphrasing |
Please paraphrase the following passage, with at least 200 words: {Y} |
Xp on XSum |
During the waiting period, please take into consideration utilizing the writing style and vocabulary used in the subsequent paragraph. |
“Wales football star, Gareth Bale, is set to undergo surgery on his ankle after suffering an injury during Real Madrid’s 2-1 victory over Sporting Lisbon in the Champions League. (…) ” |
Xp on ELI5 |
At the same time, kindly mimic the writing technique and diction utilized in the subsequent excerpt. |
“The reason why metal feels cooler compared to other things at the same temperature is due to its thermal conductivity. (…)” |
Method | XSum AUROC | XSum DR | ELI5 AUROC | ELI5 DR
---|---|---|---|---
Unattacked | 88.8 | 30.0 | 87.1 | 54.0
ChatGPT Paraphrasing | 80.0 | 14.0 | 76.2 | 27.0
Query-free Substitution | 69.9 | 2.0 | 59.0 | 2.0
Query-based Substitution | 43.4 | 0.0 | 31.5 | 0.0
Instructional Prompts | 54.9 | 5.0 | 66.7 | 21.0
7 Human Evaluation
To validate that our approach mostly preserves the quality of the generated text, we conduct a human evaluation on Amazon Mechanical Turk (MTurk). On each dataset, we consider the first 20 test examples and ask 3 MTurk workers to rate the quality of the text generated by each method on each test example. Specifically, we use two metrics, fluency and plausibility: fluency measures whether the text is grammatically correct and fluent, and plausibility measures whether the generated text is a plausible output given the input, for either text completion (XSum) or long-form question answering (ELI5). We use a 1/2/3 rating scale for each metric (3 being the best), and we provide workers with guidance that corresponds to many/several/almost no issues for ratings of 1/2/3 on fluency and plausibility, respectively. Workers are paid USD $0.05 per example plus an additional bonus. The annotation time varies, but the estimated wage rate is $10/hr, which is higher than the US minimum wage ($7.25/hr).
Tables 11 to 13 show results for attacks against the three detectors, respectively. These results show that our attack methods maintain reasonable plausibility and fluency, with only a small degradation compared to the unattacked texts. Among our attack methods, the query-free substitution usually has better fluency, and sometimes better plausibility, than the query-based substitution, since the query-based attack searches for a stronger attack and thus tends to degrade text quality slightly more. Our method with instructional prompts has better fluency than the query-based substitution and sometimes better fluency than the query-free substitution, since its generations come directly from G without further word substitutions; its plausibility is comparable to that of the word substitution methods.
Dataset | Method | Fluency | Plausibility
---|---|---|---
XSum | Unattacked | 2.79±0.47 | 2.76±0.52
XSum | ChatGPT Paraphrasing | 2.60±0.52 | 2.75±0.43
XSum | Query-free Substitution | 2.58±0.57 | 2.68±0.50
XSum | Query-based Substitution | 2.49±0.67 | 2.68±0.52
XSum | Instructional Prompts | 2.67±0.57 | 2.66±0.55
ELI5 | Unattacked | 2.78±0.45 | 2.87±0.34
ELI5 | ChatGPT Paraphrasing | 2.35±0.48 | 2.95±0.28
ELI5 | Query-free Substitution | 2.83±0.37 | 2.60±0.49
ELI5 | Query-based Substitution | 2.65±0.54 | 2.63±0.47
ELI5 | Instructional Prompts | 2.73±0.50 | 2.56±0.62
Dataset | Method | Fluency | Plausibility
---|---|---|---
XSum | Unattacked | 2.79±0.47 | 2.76±0.52
XSum | Dipper Paraphrasing | 2.67±0.51 | 2.60±0.52
XSum | Query-free Substitution | 2.58±0.57 | 2.68±0.50
XSum | Query-based Substitution | 2.35±0.51 | 2.40±0.49
ELI5 | Unattacked | 2.78±0.45 | 2.87±0.34
ELI5 | Dipper Paraphrasing | 2.63±0.48 | 2.68±0.47
ELI5 | Query-free Substitution | 2.83±0.37 | 2.60±0.49
ELI5 | Query-based Substitution | 2.47±0.50 | 2.42±0.53
Dataset | Method | Fluency | Plausibility
---|---|---|---
XSum | Unattacked | 2.65±0.51 | 2.68±0.53
XSum | Dipper Paraphrasing | 2.68±0.47 | 2.58±0.59
XSum | Query-free Substitution | 2.45±0.69 | 2.68±0.47
ELI5 | Unattacked | 2.45±0.50 | 2.42±0.49
ELI5 | Dipper Paraphrasing | 2.63±0.51 | 2.66±0.53
ELI5 | Query-free Substitution | 2.27±0.62 | 2.58±0.49
Original input |
Please answer this question with at least 150 words: |
Why we don’t yet have a cable provider that will let me pick and pay for ONLY the channels I want? |
Original output |
The reason why we do not yet have a cable provider that will allow choosing and paying for only the channels we want is that it may not be economically feasible for the providers. The cable providers typically negotiate content deals with media companies in packages or bundles. This packaging provides providers with a lower cost per channel, making it more profitable for them. (…) |
ChatGPT Paraphrasing |
The lack of cable providers that offer the option to choose and pay for individual channels is due to the feasibility of the idea. Cable providers, such as media companies, negotiate bundles or packages of channels rather than individual offerings. This arrangement allows for a lower cost per channel for the providers, which translates to higher profits. Additionally, media companies benefit from bundling channels as it increases the exposure of lower-demand channels and ensures programming visibility. (…) |
Query-free Substitution |
The motive why we do not yet have a cable provider that will allow choosing and paying for only the channels we want is that it may not be monetarily feasible for the providers. The cable providers typically negotiate content deals with media companies in wrap or bundles. This packaging provides providers with a lower expense per channel, making it more profitable for them. . The media companies also benefit from integrating their channels, as they can sell lower-demand channels as a package deal, ensuring broader visibility for their programs. (…) |
Query-based Substitution |
The reason why we do not yet have a cable provider that will allow nominating and paying for only the channels we want is that it may not be economically feasible for the providers. The Cablegram providers typically negotiate content deals with media companies in packages or bundles. This parcel provides providers with a lower cost per channel, making it more profitable for them. The media companies also benefit from packaging their channels, as they can sell lower-demand avenue as a package deal, ensuring broader visibility for their programs. (…) |
Instructional Prompts |
The reason why we don’t yet have a cable provider that lets us pick and pay for only the channels we want is due to the complex and structured industry model that cable companies have established. The traditional cable model focuses on bundling channels together into packages, often forcing customers to pay for channels they don’t want or need. This model benefits cable companies by allowing them to earn more revenue from customers who are willing to pay for premium packages, even if they don’t watch all the channels that come with it. (…) |
8 Discussions
Robustness of Detectors.
Comparing the results in Tables 4, 5, and 10, we see a clear trend that watermarking is relatively more robust to our attacks than the other two techniques. The detection mechanism of watermarking is essentially statistical and tends to be more robust than the likelihood-based and classifier-based detectors, which rely heavily on neural networks. However, watermarking is also the only method here that modifies the generation process of the protected LM: it requires access to the intermediate outputs of the LM, and generation quality may degrade. While the three detectors are not strictly comparable, since watermarking modifies generation and thus operates in a different setting, our results still offer insights into the differing degrees of robustness of the various detectors. From Tables 4 and 10, we also see that the query-based method generally produces stronger attacks, benefiting from the guidance of multiple detector queries when searching for more effective word substitutions.
Text Quality under the Attacks.
We note that our attacks often cause a minor decline in generation quality. Nonetheless, based on the human assessment, the quality remains acceptable and is adequate for spamming a target in a real-world scenario. In practical settings such as online spamming, malicious actors do not need perfect text; they may well use text with slightly degraded quality, since their main goal is not to generate perfect text but text that is hard to detect. Therefore, the insufficient robustness of existing detection strategies remains a significant concern.
Defending against the Attacks.
Inspired by our attack results, we discuss potential directions for defending against the attacks. One possibility is to combine watermarking with likelihood estimation to defend against word substitution attacks. This is based on the observation that word substitution attacks often need to substitute around 20% of the tokens, turning greenlist tokens into redlist tokens. After the substitution, the new redlist tokens tend to have low probabilities under the original watermarked model, and the new text also tends to have higher perplexity under that model. Thus, one may leverage a watermarked language model to check the perplexity or the likelihood of the redlist tokens in order to predict whether a word substitution attack is likely present.
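A minimal sketch of this defense idea is given below; it assumes access to the watermarked model's per-token log-probabilities and to the greenlist construction, both provided as placeholder callables, and it leaves the decision threshold open since we do not evaluate the defense here.

```python
from typing import Callable, List, Set

def suspicious_redlist_mass(
    token_ids: List[int],
    token_logprob: Callable[[List[int], int], float],  # log p(token | prefix) under the watermarked LM
    greenlist_for: Callable[[int], Set[int]],           # greenlist induced by the previous token
) -> float:
    """Average log-probability of redlisted tokens under the watermarked model;
    unusually low values suggest greenlist tokens were substituted out by an attack."""
    red_logprobs = [
        token_logprob(token_ids[:i], tok)
        for i, (prev, tok) in enumerate(zip(token_ids, token_ids[1:]), start=1)
        if tok not in greenlist_for(prev)
    ]
    return sum(red_logprobs) / max(len(red_logprobs), 1)
```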
9 Conclusion and Limitations
In this work, we study the reliability of three distinct types of LLM text detectors by proposing two attack strategies: 1) word substitutions and 2) instructional prompts using protected LLMs. Experiments reveal the vulnerability of existing detectors, which urges the design of more reliable LLM text detectors. We have released the source code and data with BSD-3-Clause license at GitHub (https://github.com/shizhouxing/LLM-Detector-Robustness).
Finally, the purpose of this work is to test and reveal the limitations of currently existing LLM text detectors, and we red-team the detectors so that future work can improve their robustness and reliability based on our proposed evaluation. This work is thus potentially beneficial for developing future systems that protect LLMs and prevent abusive usage. The proposed approaches should not be used to bypass real-world LLM text detectors for malicious purposes.
Acknowledgments
We thank UCLA-NLP, the action editor, and the reviewers for their invaluable feedback. The work is supported in part by CISCO, NSF 2008173, 2048280, 2325121, 2331966, ONR N00014-23-1-2300:P00001, and DARPA ANSR FA8750-23-2-0004. KC was supported as a Sloan Fellow.
References
Author notes
Alphabetical order. Authors contributed equally.
Action Editor: Nitin Madnani