Abstract
The remarkable ability of large language models (LLMs) to comprehend, interpret, and generate complex language has rapidly integrated LLM-generated text into various aspects of daily life, where users increasingly accept it. However, the growing reliance on LLMs underscores the urgent need for effective detection mechanisms to identify LLM-generated text. Such mechanisms are critical to mitigating misuse and safeguarding domains like artistic expression and social networks from potential negative consequences. LLM-generated text detection, conceptualized as a binary classification task, seeks to determine whether an LLM produced a given text. Recent advances in this field stem from innovations in watermarking techniques, statistics-based detectors, and neural-based detectors. Human-assisted methods also play a crucial role. In this survey, we consolidate recent research breakthroughs in this field, emphasizing the urgent need to strengthen detector research. Additionally, we review existing datasets, highlighting their limitations and developmental requirements. Furthermore, we examine various LLM-generated text detection paradigms, shedding light on challenges like out-of-distribution problems, potential attacks, real-world data issues, and ineffective evaluation frameworks. Finally, we outline intriguing directions for future research in LLM-generated text detection to advance responsible artificial intelligence. This survey aims to provide a clear and comprehensive introduction for newcomers while offering seasoned researchers valuable updates in the field.1
1 Introduction
The rapid advancement of large language models (LLMs) such as GPT-4 (OpenAI 2023), Claude (Anthropic 2023), and PaLM (Chowdhery et al. 2022) has elevated text generation capabilities to near-human level. These systems generate coherent, sophisticated content, driving a notable surge in artificial intelligence (AI)-produced material. Recent research reveals a 55.4% increase in AI-generated news articles on mainstream websites and a staggering 457% rise in misinformation sites between 1 January, 2022, and 1 May, 2023 (Hanley and Durumeric 2023). Beyond content creation, LLMs are transforming numerous sectors, including education (Susnjak 2022), law (Cui et al. 2023), biology (Piccolo et al. 2023), and medicine (Thirunavukarasu et al. 2023). These applications range from personalized learning to medical diagnostics, underscoring their profound impact across creative and professional domains. However, this rapid integration also raises concerns about accountability, misuse, and fairness, highlighting the urgent need for comprehensive regulatory frameworks to ensure their ethical and transparent deployment.
As LLMs become more sophisticated, distinguishing between human-written and LLM-generated text has become a significant challenge, raising critical societal and ethical concerns. The indistinguishability of AI-generated content enables its misuse in creating deceptive material, such as disinformation, online scams, and social media spam (Pagnoni, Graciarena, and Tsvetkov 2022; Weidinger et al. 2021; Mirsky et al. 2022). Additionally, while LLMs are capable of producing high-quality text, they are not immune to generating unreliable or fabricated information, which risks propagating inaccuracies and eroding trust in digital communication (Ji et al. 2023; Christian 2023). The growing reliance on LLMs for data generation in AI research adds to these challenges. This self-referential practice risks recursive degradation, where LLM-generated content becomes part of new training datasets, potentially reducing the quality and diversity of future models and hindering advancements in both generative AI and detection technologies (Cardenuto et al. 2023; Yu et al. 2023a).
The detection of LLM-generated text has emerged as a significant challenge. Current detection technologies, including commercial tools, often struggle to distinguish between human-written and LLM-generated content (Price and Sakellarios 2023; Walters 2023; Weber-Wulff et al. 2023). These systems frequently misclassify outputs, with a tendency to favor human-written classifications. Human-based detection methods are no better, achieving accuracy rates only slightly above random chance (Uchendu et al. 2021; Dou et al. 2022; Clark et al. 2021; Soni and Wade 2023). In fact, humans often perform worse than automated algorithms across diverse evaluation settings (Ippolito et al. 2020; Soni and Wade 2023). This underscores the urgent need for robust, reliable detection mechanisms to prevent the misuse of LLMs in spreading deceptive content and misinformation. Effective detection systems are key to mitigating these risks and fostering responsible AI governance in the rapidly evolving LLM landscape (Stokel-Walker and Van Noorden 2023; Porsdam Mann et al. 2023; Shevlane et al. 2023).
Efforts to detect AI-generated text predate the widespread adoption of tools like ChatGPT. Early studies primarily focused on identifying machine-generated content, such as detecting deepfake text (Pu et al. 2023a), machine-generated text (Jawahar, Abdul-Mageed, and Lakshmanan 2020), and authorship attribution (Uchendu, Le, and Lee 2023a), often relying on statistical methods or simple classification approaches. However, the introduction of ChatGPT marked a significant shift in both the capabilities of LLMs and the challenges they pose, reigniting interest in LLM-generated text detection. Recent methods have evolved to address more complex scenarios, such as distinguishing LLM-generated text from human-authored content, evaluating the robustness of detection systems, and developing adversarial attacks. While previous surveys provide valuable insights into machine-generated text detection (Crothers, Japkowicz, and Viktor 2023; Tang, Chuang, and Hu 2023), they often lack depth in exploring the diverse and rapidly advancing methodologies in the field. This work seeks to bridge that gap by comprehensively reviewing detection techniques and identifying key challenges for future research (see Section 3.1).
This article presents a detailed review of recent advancements in LLM-generated text detection. Our objective is to highlight the challenges in this domain while exploring potential directions for future research. We begin by introducing the task of detecting LLM-generated text, explaining the mechanisms behind LLM text generation, and outlining key technological advancements. We also discuss the relevance and importance of detecting LLM-generated text in various real-world contexts. This review examines widely used datasets and benchmarks, highlighting their limitations and the need for improved data resources. Additionally, we analyze various detection approaches, including neural-based methods, statistical methods, watermarking techniques, and human-assisted methods. We explore in-depth critical challenges such as out-of-distribution issues, adversarial attacks, real-world data complexities, and the lack of robust evaluation frameworks. Finally, we propose several directions for future research to advance the development of adequate LLM-generated text detectors.
2 Background
2.1 LLM-Generated Text Detection Task
Detecting LLM-generated text presents a significant challenge. Humans generally struggle to distinguish between LLM-generated text and human-written text (Uchendu et al. 2021; Dou et al. 2022; Clark et al. 2021; Soni and Wade 2023), and their ability to do so only slightly exceeds random classification. Table 1 offers examples in which LLM-generated text is extremely close to human-written text and can be difficult to distinguish. When LLMs generate fabricated details, discerning their origin and veracity remains equally challenging.
Examples of human-written text and LLM-generated text. Text generated by LLMs during normal operation, as well as text in which they fabricate facts, often exhibits no intuitively discernible differences from human-written text. When LLMs either abstain from providing an answer or craft neutral responses, certain indicators, such as the explicit statement “I am an AI language model,” may facilitate human adjudication, but such examples are relatively rare.
Type | Question | Human-written | LLM-generated |
---|---|---|---|
Normal | Explain what is NLP? | Natural language processing (NLP) is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence … | Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics that focuses on … |
Refusal | How is today special? | Today’s Special is a Canadian children’s television show produced by Clive VanderBurgh at TVOntario from 1981 to 1987. | I’m sorry, but I am an AI language model and do not have access to current dates or events. Is there anything else I can help you with … |
Fabricated | Explain what is NLP based on one publication in the recent literature. | In “Natural language processing: state of the art, current trends and challenges”, NLP is summarized as a discipline that uses various algorithms, tools and methods to … | NLP is a multidisciplinary field at the intersection of computer science, linguistics, and ai, as described in a recent peer-reviewed publication titled “Natural Language Processing: A Comprehensive Overview and Recent Advances” (2023) … |
Recent studies (Guo et al. 2023; Ma, Liu, and Yi 2023; Muñoz-Ortiz, Gómez-Rodríguez, and Vilares 2023; Giorgi et al. 2023; Seals and Shalin 2023) have highlighted significant disparities between human-written and LLM-generated text, such as that of ChatGPT. The differences between LLM-generated text and human-written text are not confined to individual word choice (Seals and Shalin 2023), but also manifest in stylistic dimensions, such as syntactic simplicity, use of passive voice, and narrativity. Notably, LLM-generated text often exhibits greater organization, logical structure, formality, and objectivity in comparison with human-written text. Additionally, LLMs frequently produce extensive and comprehensive responses, characterized by a lower prevalence of bias and harmful content, although they occasionally introduce nonsensical or fabricated details. Linguistically, LLM-generated text tends to be about twice the length of human-written text but exhibits a more limited vocabulary. LLMs use nouns, verbs, determiners, adjectives, auxiliaries, coordinating conjunctions, and particles more frequently than humans, and adverbs and punctuation less frequently, incorporating more deterministic, conjunctive, and auxiliary structures in their syntax. Additionally, LLM-generated text often conveys less emotional intensity and exhibits clearer presentation than human writing, a phenomenon possibly influenced by inherent positive bias in LLMs (Giorgi, Ungar, and Schwartz 2021; Markowitz, Hancock, and Bailenson 2023; Mitrović, Andreoletti, and Ayoub 2023). While statistical variations exist across datasets, the consistent divergence in language features and human visual perception strongly indicates detectable differences between LLM-generated and human-written text. Chakraborty et al. (2023b) further substantiated this view by reporting on the detectability of text generated by LLMs, including high-performance models such as GPT-3.5-Turbo and GPT-4 (Helm, Priebe, and Yang 2023). Additionally, Chakraborty et al. (2023a) introduced an AI Detectability Index to further rank models according to their detectability.
This survey begins by defining key concepts relevant to the field, including human-written text, LLM-generated text, and the LLM-generated text detection task.
Human-written Text
Is characterized as the text crafted by individuals to express thoughts, emotions, and viewpoints. This encompasses articles, poems, and reviews, among others, typically reflecting personal knowledge, cultural milieu, and emotional disposition, spanning the entirety of the human experience.
LLM-Generated Text
Is defined as cohesive, grammatically coherent, and pertinent content generated by LLMs. These models are trained on large datasets with machine learning methodologies and NLP techniques. The quality and fidelity of the generated text typically depend on the scale of the model and the diversity of its training data.
It is noteworthy that a standardized definition for computer-assisted writing is still absent. Gao et al. (2024) categorize this as a distinct type termed “AI-revised Human-Written Text,” which is further discussed in Section 8.3.
LLM-Generated Text Detection Task
Overview of the LLM-generated text detection task: a binary classification task that determines whether a given text was generated by an LLM or written by a human.
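To make the task definition concrete, the minimal sketch below frames detection as thresholding a detector score; the `score_text` function is a hypothetical placeholder for any concrete detector covered later in this survey (statistical, neural-based, or watermark-based), not a method proposed here.

```python
from typing import Callable

def classify(text: str,
             score_text: Callable[[str], float],
             threshold: float = 0.5) -> str:
    """Binary LLM-generated text detection: map a detector score to a label.

    `score_text` stands in for any concrete detector; by convention here,
    higher scores indicate stronger evidence that the text is machine-generated.
    """
    return "LLM-generated" if score_text(text) >= threshold else "human-written"

# Toy usage with a dummy scorer; a real detector would replace this heuristic.
dummy_scorer = lambda t: 0.9 if "i am an ai language model" in t.lower() else 0.2
print(classify("I'm sorry, but I am an AI language model ...", dummy_scorer))
```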
2.2 LLMs Text Generation and Confusion Sources
2.2.1 Generation Mechanisms of LLMs
The quality of generated text is fundamentally linked to the decoding strategy used during generation. As models generate text sequentially, the method of selecting the next token y_t from the probability distribution over the vocabulary plays a pivotal role in shaping the output. The predominant decoding techniques include greedy search, beam search (Lowerre and Reddy 1976), top-K sampling (Fan, Lewis, and Dauphin 2018), and top-P sampling (Holtzman et al. 2020).
Greedy search selects the token with the highest probability at each step (Sutskever, Vinyals, and Le 2014), offering simplicity and speed, but it often leads to local optima and lacks diversity, making it ineffective in managing uncertainty. Beam search attempts to mitigate these limitations by considering multiple candidates simultaneously, which improves text quality but can result in repetitive fragments and struggles with open-ended tasks due to its difficulty in handling uncertainty (He et al. 2023a). In contrast, top-K sampling enhances diversity by restricting choices to the K most probable tokens, thus introducing variability and managing uncertainty, although it can sometimes result in incoherence (Holtzman et al. 2020; Basu et al. 2021). Top-P sampling, or nucleus sampling, further refines this approach by selecting from the smallest set of tokens whose cumulative probability surpasses a threshold P, effectively balancing coherence and diversity. Its effectiveness, however, heavily depends on the model’s prediction quality, with diversity being influenced by the parameter P (Holtzman et al. 2020).
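To make these trade-offs concrete, the toy sketch below (our illustration, not code from any cited system) applies greedy search, top-K sampling, and top-P sampling to a single next-token distribution; a real decoder repeats such a step over the model's full vocabulary at every position.

```python
import numpy as np

rng = np.random.default_rng(0)

def greedy(probs):
    """Greedy search: always pick the single most probable token."""
    return int(np.argmax(probs))

def top_k_sample(probs, k=40):
    """Top-K sampling: renormalize over the K most probable tokens, then sample."""
    top = np.argsort(probs)[-k:]
    return int(rng.choice(top, p=probs[top] / probs[top].sum()))

def top_p_sample(probs, p=0.95):
    """Top-P (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability reaches p, renormalize, then sample."""
    order = np.argsort(probs)[::-1]
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    nucleus = order[:cutoff]
    return int(rng.choice(nucleus, p=probs[nucleus] / probs[nucleus].sum()))

# Toy next-token distribution over a six-token vocabulary.
probs = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
print(greedy(probs), top_k_sample(probs, k=3), top_p_sample(probs, p=0.9))
```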
These strategies provide various trade-offs between speed, diversity, coherence, and flexibility in text generation, helping elucidate the specific characteristics of the text produced by LLMs.
2.2.2 Sources of LLMs’ Strong Generation Capabilities
Notably, beyond a certain scale, models exhibit abilities that defy prediction by conventional scaling laws. These phenomena, absent in smaller models but emergent in larger ones, are collectively referred to as the “Emergent Abilities” of LLMs.
In-Context Learning (ICL)
The origins of ICL capabilities remain a topic of ongoing debate (Dai et al. 2023). However, this capability introduces a paradigm where model parameters remain unchanged, and only the design of the prompt is modified to elicit desired outputs from LLMs. This concept was first introduced in GPT-3 (Brown et al. 2020). Brown et al. (2020) argued that the presence of ICL is fundamental for the swift adaptability of LLMs across a diverse set of tasks. With only a few examples, LLMs can adeptly handle downstream tasks, eliminating the need for the earlier BERT-based approach that depended on pretraining followed by task-specific fine-tuning (Raffel et al. 2020).
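As an illustration of this paradigm, the snippet below assembles a few-shot ICL prompt for a toy sentiment task; the template and examples are our own and not drawn from any cited work. The model's parameters are untouched, and task behavior is induced purely by the demonstrations placed in the prompt.

```python
# Few-shot in-context learning: demonstrations are prepended to the query, and
# the LLM is expected to continue the pattern by completing the final label.
examples = [
    ("The movie was a delightful surprise.", "positive"),
    ("The plot dragged and the acting felt flat.", "negative"),
]
query = "A beautiful score, but the pacing ruined it for me."

prompt = "Classify the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # send this string to any LLM; no fine-tuning is involved
```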
Alignment of Human Preference
Although LLMs can be guided to generate content using carefully designed prompts, the resulting text might lack control, potentially leading to the creation of misleading or biased content (Zhang et al. 2023b). The primary limitation of these models lies in predicting subsequent words to form coherent sentences based on vast corpora, rather than ensuring that the content generated is both beneficial and innocuous to humans. To address these concerns, OpenAI introduced the Reinforcement Learning from Human Feedback (RLHF) approach, as detailed in Ouyang et al. (2022) and Lambert et al. (2022). This approach begins by fine-tuning LLMs on human demonstrations collected for user prompts and subsequently having human assessors evaluate the model’s outputs. A reward function is established from these assessments, and the LLM is further refined using the Proximal Policy Optimization (PPO) algorithm (Schulman et al. 2017). The end result is a model that aligns with human values, understands human instructions, and genuinely assists users.
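The sketch below illustrates only the core of the PPO step used in this pipeline, namely the clipped surrogate objective of Schulman et al. (2017), evaluated on dummy values; it is a simplified illustration rather than the full RLHF implementation (the reward model, value function, and KL penalty are omitted).

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective from PPO (Schulman et al. 2017).

    ratio = pi_new(token) / pi_old(token), computed from log-probabilities.
    Clipping the ratio keeps the updated policy close to the one that sampled
    the text, so the LLM cannot drift arbitrarily far while chasing reward.
    """
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1 - eps, 1 + eps) * advantages
    return float(np.mean(np.minimum(unclipped, clipped)))  # to be maximized

# Dummy per-token log-probabilities and advantages (the latter would be derived
# from a reward model's scores in a real RLHF setup).
logp_old = np.array([-1.2, -0.7, -2.0])
logp_new = np.array([-1.0, -0.9, -1.5])
advantages = np.array([0.8, -0.3, 1.1])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```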
Complex Reasoning Capabilities
Although the ICL and alignment capabilities of LLMs enable meaningful interactions and assistance, their effectiveness diminishes when tasked with logical reasoning and increased complexity. Wei et al. (2022) observed that encouraging LLMs to produce more intermediate steps through a Chain of Thought can enhance their effectiveness. Tree of Thoughts (Yao et al. 2023) and Graph of Thoughts (Besta et al. 2023) are extensions of this methodology. Both strategies augment LLM performance on intricate tasks by amplifying the computational effort required for the model to deduce an answer.
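The toy prompt below illustrates the Chain of Thought idea: a single worked example demonstrates explicit intermediate steps, and the model is then prompted to reason the same way on a new question. The wording is our own illustration, not a prompt taken from the cited papers.

```python
# Chain-of-thought prompting: the demonstration spells out intermediate steps,
# encouraging the model to produce its own reasoning before the final answer.
cot_prompt = (
    "Q: A library has 5 shelves with 24 books each, and 37 books are on loan. "
    "How many books are currently in the library?\n"
    "A: Let's think step by step. 5 shelves x 24 books = 120 books in total. "
    "120 - 37 on loan = 83 books remain. The answer is 83.\n\n"
    "Q: A train travels at 60 km per hour for 2.5 hours. How far does it go?\n"
    "A: Let's think step by step."
)
print(cot_prompt)
```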
2.3 Why Do We Need to Detect Text Generated by LLMs?
As LLMs undergo iterative refinements and reinforcement learning through human feedback, their outputs become increasingly harmonized with human values and preferences. This alignment facilitates the broader acceptance and integration of LLM-generated text into everyday life. The emergence of various AI tools has played a significant role in fostering intuitive human–AI interactions and democratizing access to the advanced capabilities of previously esoteric models. From interactive web-based assistants like ChatGPT, to search engines enhanced with LLM technology like the contemporary version of Bing, to specialized tools like Copilot and Scispace that assist professionals in code generation and scientific research, LLMs have subtly woven into the digital fabric of our lives, propagating their content across diverse platforms.
It is important to acknowledge that, for the majority of users, LLMs and their applications are still black-box AI systems. For individual users, they often serve as a benign efficiency boost, sidestepping laborious retrieval and summarization. However, within specific contexts and against the backdrop of the broader digital landscape, it becomes crucial to discern, filter, or even exclude LLM-generated text. Not all scenarios necessitate the detection of LLM-generated content, and unnecessary detection can lead to inefficiencies and increased development costs. Generally, detecting LLM-generated text might be superfluous when:
The utilization of LLMs poses minimal risk, especially when they handle routine, replicable tasks.
The dissemination of LLM-generated text is confined to predictable, limited domains, like closed information circles with few participants.
Drawing upon the literature reviewed in this study, the rationale behind detecting LLM-generated text can be elucidated from multiple perspectives, as illustrated in Figure 2. The delineated perspectives are, in part, informed by the insights presented in Gade et al. (2020) and Saeed and Omlin (2023). Gade et al. (2020) suggested the need for explainable AI in social, legal, scientific, enterprise, and end-user applications, particularly in high-stakes domains. Saeed and Omlin (2023) emphasized aspects such as fairness, accountability, and transparency of trustworthy AI. While these works focus on general AI applications, we have adapted and extended their concepts to the detection of AI-generated text.
The most critical reasons why LLM-generated text detection is needed urgently. We discuss it from five perspectives: Regulation, Users, Developments, Science, and Human Society.
While these perspectives provided in previous works may not be exhaustive and some facets may intersect or further delineate as LLMs and AI systems mature, we posit that these points underscore the paramount reasons for the necessity of detecting text generated by LLMs.
Regulation
Because LLMs are often characterized as black-box AI tools, the inclusion of LLM-generated text in creative endeavors raises significant legal issues. A pressing concern is the eligibility of LLM-generated texts for intellectual property rights protection, a subject still mired in debate (Epstein et al. 2023; Wikipedia 2023), although the EU AI Act is being continuously refined to regulate the use of AI. The main challenges concern ownership of the training data used by the AI in generating its output and determining how much human involvement is sufficient for a work to be considered a person’s own. A prerequisite for both AI oversight and copyright protection of AI-generated content is the ability to distinguish human creativity, including in the materials used to train AI systems, which in turn supports the implementation of more complete legal supervision.
Users
LLM-generated text, refined through various alignment methods, is progressively aligning with human preferences. This content permeates numerous user-accessible platforms, including blogs and Questions & Answers (Q&A) forums. However, excessive reliance on such content can undermine user trust in AI systems and, by extension, digital content as a whole. In this context, the role of LLM-generated text detection becomes crucial as a gatekeeper to regulate the prevalence of LLM-generated text online.
Developments
With the evolving prowess of LLMs, Li et al. (2023b) suggested that LLMs can self-assess and even benchmark their own performance. Owing to their excellent text generation performance, LLMs are also used to construct many training datasets through preset instructions (Taori et al. 2023). However, if LLMs heavily rely on web-sourced data for training, and a significant portion of this data originates from LLM outputs, it could hinder their long-term progress (Alemohammad et al. 2023; Tang, Chuang, and Hu 2024; Shumailov et al. 2024).
Science
The relentless march of human progress owes much to the spirit of scientific exploration and discovery. However, the increasing presence of LLM-generated text in academic writing (Májovskỳ et al. 2023) and the use of LLM-originated designs in research endeavors raise concerns about potentially diluting human ingenuity and exploratory drive. At the same time, it could undermine the ability of higher education to validate student knowledge and comprehension, and diminish the academic reputation of specific higher education institutions (Ibrahim et al. 2023). Although current methodologies may have limitations, further enhancements in detection capabilities will strengthen academic integrity and preserve independent human thinking in scientific research.
Human Society
From a societal perspective, analyzing the implications of LLM-generated text reveals that these models essentially mimic specific textual patterns while predicting subsequent tokens. If used improperly, these models have the potential to diminish linguistic diversity and contribute to the formation of information silos within societal discourse. In the long run, detecting and filtering LLM-generated text is crucial for preserving the richness and diversity of human communication, both linguistically and informatively.
3 Related Work and Our Investigation
3.1 Related Work
The comprehensive review article by Beresneva (2016) represents the first extensive survey of methods for detecting computer-generated text. At that time, the detection process was relatively simple, mainly aimed at identifying machine-translated text through simple statistical methods. The advent of autoregressive models significantly heightened the complexity of text detection tasks. Jawahar, Abdul-Mageed, and Lakshmanan (2020) provide a detailed survey on the detection of machine-generated text. This work laid a solid foundation for the field, with a particular focus on detection methods tailored to the state-of-the-art (SOTA) generative models like GPT-2. The detection methods prevalent at the time were classified into four main categories: classifiers trained from scratch, zero-shot classifiers, fine-tuning of neural LMs, and human–machine collaboration. These methods have since been regarded as relatively traditional in current research contexts.
The subsequent release of ChatGPT sparked a surge of interest in LLMs and signified a major shift in research directions. In response to the rapid challenges posed by LLM-generated text, the NLP community has intensified efforts to develop robust detection mechanisms and investigate the dynamics of evasive techniques used by such models. Recent surveys by Crothers, Japkowicz, and Viktor (2023) and Dhaini, Poelman, and Erdogan (2023) have offered fresh insights into the detection of LLM-generated text. Notably, Crothers, Japkowicz, and Viktor (2023) performed a comprehensive analysis of threat models posed by contemporary NLG systems, covering speech-to-text and end-to-end models. However, this review’s coverage of detection techniques was quite similar to that of Jawahar, Abdul-Mageed, and Lakshmanan (2020) and did not encompass many cutting-edge works, including currently popular zero-shot techniques, for example, DetectGPT (Mitchell et al. 2023) and Fast-DetectGPT (Bao et al. 2023), as well as advanced neural-based methods like CoCo (Liu et al. 2022) and OUTFOX (Koike, Kaneko, and Okazaki 2023b). Similarly, Dhaini, Poelman, and Erdogan (2023) also lagged in capturing many innovative works, with the primary focus still on encoder-based classification methods. Another survey by Tang, Chuang, and Hu (2023) categorized detection methods into black-box and white-box approaches and highlighted emerging technologies such as watermarking. However, it, too, was slightly outdated in the datasets and detectors discussed, covering only five datasets and omitting currently popular zero-shot techniques and some advanced neural-based methods, which could benefit from a more comprehensive analysis and critical evaluation. Ghosal et al. (2023) focused primarily on current attacks and defenses in LLM-generated text detection, discussing the state of LLM-generated text detection tasks, including watermarks and some SOTA zero-shot detectors. However, by concentrating on more specific aspects, this review missed some broader angles, such as a detailed examination of the motivations for detecting LLM-generated text, data resources, and the history of various evaluation methods, that could enrich the discourse. Liu et al. (2023c) provided a comprehensive review of watermarking techniques, primarily focusing on watermarking itself rather than the detection of texts generated by LLMs. While not all watermarking methods covered in that review are applicable to LLM-generated text detection, the classification criteria it proposes can provide valuable insights and serve as an important reference for organizing and investigating watermarking techniques.
In this article, we strive to provide a more comprehensive and insightful review of the latest research on LLM-generated text detection, enriched with comprehensive discussion. We highlight the strengths of our review in comparison to others:
Systematic and Comprehensive Review: Our survey offers an extensive exploration of LLM-generated text detection, covering the task’s description and underlying motivation, various benchmarks and datasets, the latest detection and attack methods, evaluation frameworks, the most pressing challenges faced today, potential future directions, and a critical examination of each aspect.
In-depth Analysis of Detection Mechanisms: We provide a detailed overview of detection strategies, from traditional approaches to the latest research, and systematically evaluate their effectiveness, strengths, and weaknesses in the current environment of LLMs.
More Pragmatic Insights: Our discussion delves into research questions with practical implications, such as how model size affects detection capabilities, the challenges of identifying text that is not purely generated by LLMs, and the lack of effective evaluation frameworks.
In summary, we firmly believe that this review is more systematic and comprehensive than existing works. More importantly, our critical discussion not only provides guidance to new researchers but also imparts valuable insights into established works within the field.
3.2 Systematic Investigation and Implementation
Our survey followed the Systematic Literature Review (SLR) methodology delineated by Kitchenham and Charters (2007), a framework designed for evaluating the extent and quality of extant evidence pertaining to a specified research question or topic. Offering a more expansive and accurate insight compared with conventional literature reviews, this approach has been prominently utilized in numerous scholarly surveys, as evidenced by Murtaza et al. (2020) and Saeed and Omlin (2023). The research question guiding our SLR was as follows:
What are the prevailing methods for detecting LLM-generated text, and what are the main challenges associated with these methods?
Upon delineating the research problem, our study utilized search terms directly related to the research issue, specifically: “LLM-generated text detection,” “machine-generated text detection,” “AI-written text detection,” “authorship attribution,” and “deepfake text detection.” These terms were combined using the Boolean operator OR to formulate the following search string: (“LLM-generated text detection” OR “machine-generated text detection” OR “AI-written text detection” OR “authorship attribution” OR “deepfake text detection”). Using this search string, we conducted a preliminary search through pertinent and authoritative electronic literature databases and search engines. Our investigation mainly focused on scholarly articles that were publicly accessible prior to November 2023. Table 2 outlines the sources used and provides an overview of our results.
Overview of the diverse databases and search engines utilized in our research, along with the incorporated search schemes and the consequent results obtained. Google Scholar predominates as the search engine yielding the maximum number of retrievable documents. Upon meticulous examination, it is observed that a substantial portion of the documents originate from ArXiv, primarily shared by researchers.
Database | Search Engine | Search Scheme | Retrieved |
---|---|---|---|
Google Scholar | https://scholar.google.com/ | Full Text | 210 |
ArXiv | https://arxiv.org/ | Full Text | N/A (a) |
Scopus | https://www.scopus.com/ | TITLE-ABS-KEY: (Title, Abstract, Author Keywords, Indexed Keywords) | 133 |
Web of Science | https://www.webofscience.com/ | Topic: (Searches Title, Abstract, Author Keywords, Keywords Plus) | 92 |
IEEE Xplore | https://ieeexplore.ieee.org/ | Full Text | 49 |
Springer Link | https://link.springer.com/ | Full Text | N/A (a) |
ACL Anthology | https://aclanthology.org/ | Full Text | N/A (a) |
ACM Digital Library | https://dl.acm.org/ | Title | N/A (b) |
(a) These search engines cannot use all keywords in a single search string, so the retrieved counts are imprecise and may include duplicate results across queries.
(b) The search engine retrieved an imprecise number of papers, many of which were only weakly related to our topic.
Subsequently, we established the ensuing criteria to scrutinize the amassed articles:
The article should be a review focusing on the methods and challenges pertinent to LLM-generated (machine-generated/AI-written) text detection.
The article should propose a methodology specifically designed for the detection of LLM-generated (machine-generated/AI-written) text.
The article should delineate challenges and prospective directions for future research in the domain of text generation for LLMs.
The article should articulate the necessity and applications of LLM-generated text detection.
If any one of the aforementioned four criteria was met, the respective work was considered valuable for our study. Following a process of de-duplication and manual screening, we identified 83 pertinent pieces of literature. The distribution by year is illustrated in Figure 3. Notably, the majority of relevant research on LLM-generated text detection was published in the year 2023 (as shown in Figure 3), underscoring the vibrant development within this field and highlighting the significance of our study.
The distribution of the screened literature over the last five years. The number of published articles increased markedly in 2023.
In the subsequent sections, we provide a comprehensive analysis, starting with Section 4, where we delve into the datasets and benchmarks relevant to detection tasks. This section highlights datasets that can be extended to detection applications and discusses the inherent challenges they present. Building on this foundation, Section 5 examines various detection methods. We explore technologies ranging from watermarking and statistics-based approaches to neural-based detectors and human-assisted methods, providing insights into their effectiveness and limitations. Section 7 focuses on evaluation metrics, including accuracy, precision, recall, false positive rate, true negative rate, false negative rate, F1 score, and the area under the receiver operating characteristic curve (AUROC), linking these measures to the methods discussed in earlier sections. In Section 8, we discuss the challenges faced by detection methods, including out-of-distribution issues, potential attacks, real-world data problems, the impact of model size on detector performance, and the current lack of a robust evaluation framework, all of which underscore the need for further research. Finally, Section 9 outlines potential avenues for future research. We propose developing detectors that are more robust against attacks, enhancing zero-shot capabilities, optimizing performance in low-resource settings, detecting text that is not purely LLM-generated, constructing detectors amidst data ambiguity, creating effective real-world evaluation frameworks, and improving misinformation discrimination capabilities.
Through this structured analysis, we aim to provide a clear and cohesive understanding of the current landscape and future directions in detection technology.
4 Data
High-quality datasets are essential for advancing research in the LLM-generated text detection task. These datasets enable researchers to swiftly develop and calibrate efficient detectors and establish standardized metrics for evaluating the efficacy of their methodologies. However, procuring such high-quality labeled data often demands substantial financial, material, and human resources. Presently, the development of datasets focused on detecting LLM-generated text is in its nascent stages, hindered by issues such as limited data volume and sample complexity, both crucial for crafting robust detectors. This section introduces the most widely used datasets for training LLM-generated text detectors. Additionally, we highlight datasets from unrelated domains or tasks that, though not initially designed for detection tasks, can be repurposed for various detection scenarios, which is a prevailing strategy in many contemporary detection studies. We subsequently introduce benchmarks for verifying the effectiveness of LLM-generated text detectors, which are carefully designed to evaluate the performance of the detector from different perspectives. Lastly, we evaluate these training datasets and benchmarks, identifying current shortcomings and challenges in dataset construction for LLM-generated text detection, aiming to inform the design of future data resources.
4.1 Training
4.1.1 Detection Datasets
Massive and high-quality datasets can assist researchers in rapidly training their detectors. Table 3 provides a comprehensive organization and comparison. Given that different studies focus on various practical issues, our aim is to facilitate researchers in conveniently selecting high-quality datasets that meet their specific needs through our comprehensive review work.
Summary of detection datasets for LLM-generated text detection.
Corpus | Use | Human | LLMs | LLMs Type | Language | Attack | Domain |
---|---|---|---|---|---|---|---|
HC3 (Guo et al. 2023) | train | ˜80k | ˜43k | ChatGPT | English, Chinese | – | Web Text, QA, Social Media |
CHEAT (Yu et al. 2023a) | train | ˜15k | ˜35k | ChatGPT | English | Paraphrase | Scientific Writing |
HC3 Plus (Su et al. 2023b) | train / valid / test | ˜95k / ˜10k / ˜38k | | GPT-3.5-Turbo | English, Chinese | Paraphrase | News Writing, Social Media |
OpenLLMText (Chen et al. 2023a) | train / valid / test | ˜52k / ˜8k / ˜8k | ˜209k / ˜33k / ˜33k | ChatGPT, PaLM, LLaMA, GPT2-XL | English | – | Web Text |
GROVER Dataset (Zellers et al. 2019b) | train | ˜24k | | Grover-Mega | English | – | News Writing |
TweepFake (Fagni et al. 2021) | train | ˜12k | ˜12k | GPT-2, RNN, Markov, LSTM, CharRNN | English | – | Social Media |
GPT-2 Output Dataset | train / test | ˜250k / ˜5k | ˜2000k / ˜40k | GPT-2 (small, medium, large, xl) | English | – | Web Text |
ArguGPT (Liu et al. 2023d) | train / valid / test | ˜6k / 700 / 700 | | GPT2-XL, Text-Babbage-001, Text-Curie-001, Text-Davinci-001, Text-Davinci-002, Text-Davinci-003, GPT-3.5-Turbo | English | – | Scientific Writing |
DeepfakeTextDetect (Li et al. 2023c) | train / valid / test | ˜236k / ˜56k / ˜56k | | GPT (Text-Davinci-002, Text-Davinci-003, GPT-Turbo-3.5), LLaMA (6B, 13B, 30B, 65B), GLM-130B, FLAN-T5 (small, base, large, xl, xxl), OPT (125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, iml-1.3B, iml-30B), T0 (3B, 11B), BLOOM-7B1, GPT-J-6B, GPT-NeoX-20B | English | Paraphrase | Social Media, News Writing, QA, Story Generation, Comprehension and Reasoning, Scientific Writing |
HC3
The Human ChatGPT Comparison Corpus (HC3) (Guo et al. 2023) stands as one of the initial open-source efforts to compare ChatGPT-generated text with human-written text. It involves collecting both human and ChatGPT responses to identical questions. Due to its pioneering contributions in this field, the HC3 corpus has been utilized in numerous subsequent studies as a valuable resource. The corpus offers datasets in both English and Chinese. Specifically, HC3-en comprises 58k human responses and 26k ChatGPT responses, derived from 24k questions, sourced from the ELI5, WikiQA, Crawled Wikipedia, Medical Dialog, and FiQA datasets. Meanwhile, HC3-zh encompasses a broader spectrum of domains, featuring 22k human answers and 17k ChatGPT responses. The data within HC3-zh spans seven sources: WebTextQA, BaikeQA, Crawled BaiduBaike, NLPCC-DBQA, Medical Dialog, Baidu AI Studio, and LegalQA datasets. However, it is pertinent to note some limitations of the HC3 dataset, such as the lack of diversity in prompts used for data creation.
CHEAT
The CHEAT dataset (Yu et al. 2023a) is specifically designed to detect spurious academic content generated by ChatGPT. It includes human-written academic abstracts sourced from IEEE Xplore, with an average abstract length of 163.9 words and a vocabulary size of 130k words. Following the ChatGPT generation process, the dataset contains 15k human-written abstracts and 35k ChatGPT-generated summaries. To better simulate real-world applications, the outputs were guided by ChatGPT for further refinement and amalgamation. The “polishing” process aims to simulate scenarios where users refine LLM-generated text to bypass plagiarism detection, while “blending” represents cases where users combine manually drafted content with ChatGPT-generated text to evade detection. However, a limitation of the CHEAT dataset is its focus on narrow academic disciplines, overlooking cross-domain challenges, which stems from constraints related to its primary data source.
HC3 Plus
HC3 Plus (Su et al. 2023b) built on the original HC3 dataset, introducing an augmented section named HC3-SI. This new section specifically targets tasks requiring semantic invariance, such as summarization, translation, and paraphrasing, thus extending the scope of HC3. To compile the human-written text corpus for HC3-SI, data was curated from several sources, including the CNN/DailyMail dataset, Xsum, LCSTS, the CLUE benchmark, and datasets from the Workshop on Machine Translation (WMT). Simultaneously, the LLM-generated texts were generated using GPT-3.5-Turbo. The expanded English dataset now includes a training set of 95k samples, a validation set of 10k samples, and a test set of 38k samples. The Chinese dataset, in comparison, contains 42k training samples, 4k for validation, and 22k for testing. Despite these expansions, HC3-SI still mirrors HC3’s approach to data construction, which is somewhat monolithic and lacks diversity, particularly in the variety of LLMs and the use of complex and varied prompts for generating data.
OpenLLMText
The OpenLLMText dataset (Chen et al. 2023a) incorporates four types of LLMs: GPT-3.5, PaLM, LLaMA-7B, and GPT2-XL (1.5B, also known as GPT-2 Extra Large). The samples from GPT2-XL are sourced from the GPT-2 Output dataset, which OpenAI has made publicly available. Text from GPT-3.5 and PaLM was generated using the prompt “Rephrase the following paragraph by paragraph: [Human_Sample],” while LLaMA-7B generated text by completing the first 75 tokens of human samples. The dataset comprises a total of 344k samples, including 68k written by humans, and is divided into training, validation, and test sets at 76%, 12%, and 12%, respectively. Notably, this dataset features LLMs like PaLM, which are commonly used in everyday applications. However, it does not fully capture the nuances of cross-domain and multilingual text, which limits its usefulness for related research.
TweepFake Dataset
TweepFake (Fagni et al. 2021) is a foundational dataset designed for the analysis of fake tweets on Twitter, derived from both genuine and counterfeit accounts. It encompasses a total of 25k tweets, with an equal distribution between human-written and machine-generated samples. The machine-generated tweets were crafted using various techniques, including GPT-2, RNN, Markov, LSTM, and CharRNN. Although TweepFake remains a valuable dataset resource of choice for many scholars, those working with LLMs should critically assess its relevance and rigor in light of evolving technological capabilities.
GPT2-Output Dataset
The GPT2-Output Dataset, introduced by OpenAI, is based on 250k documents sourced from the WebText test set for its human-written text. Regarding the LLM-generated text, the dataset includes 250k randomly generated samples using a temperature setting of 1 without truncation and an additional 250k samples produced with Top-K 40 truncation. This dataset was conceived to further research into the detectability of the GPT-2 model. However, a notable limitation lies in the insufficient complexity of the dataset, marked by the uniformity of both the generative models and data distribution.
GROVER Dataset
The GROVER Dataset, presented by Zellers et al. (2019b), is styled after news articles. Its human-written text is sourced from RealNews, a comprehensive corpus of news articles derived from Common Crawl. The LLM-generated text is produced by Grover-Mega, a transformer-based news generator with 1.5 billion parameters. A limitation of this dataset, particularly in the current LLM landscape, is the uniformity and singularity of both its generative model and data distribution.
ArguGPT Dataset
The ArguGPT Dataset (Liu et al. 2023d) is specifically designed for detecting LLM-generated text in various academic contexts such as classroom exercises, TOEFL, and GRE writing tasks. It comprises 4k argumentative essays, generated by seven distinct GPT models. Its primary aim is to tackle the unique challenges associated with teaching English as a second language.
DeepfakeTextDetect Dataset
Attention is also drawn to the DeepfakeTextDetect Dataset (Li et al. 2023c), a robust platform tailored for deepfake text detection. The dataset combines human-written text from ten diverse datasets, encompassing genres like news articles, stories, scientific writing, and more. The dataset comprises texts generated by 27 prominent LLMs, sourced from entities such as OpenAI, LLaMA, and EleutherAI. Furthermore, the dataset introduces an augmented challenge with the inclusion of text produced by GPT-4 and paraphrased text.
4.1.2 Potential Datasets
Creating datasets from scratch that include both human-written and LLM-generated text can be highly resource-intensive. As a result, researchers often use existing datasets to represent human-written text and generate new text using LLMs for training detectors. We refer to such datasets as “potential datasets.” These datasets can be categorized into various writing domains such as question answering, scientific writing, creative writing, social media, and web text, aligning closely with real-world use cases and taking into account the potential harm and higher likelihood of misuse of LLM-generated text (Mitchell et al. 2023). Table 4 provides an organized classification of commonly used datasets in current LLM-generated text detection research. Moreover, with the advancement of sophisticated LLMs, it has become increasingly challenging to ensure that new human-written datasets are free from LLM-generated content. Consequently, older human-written datasets may play a crucial role in developing future defenses against LLM-generated text.
Summary of other potential datasets that can be easily extended to LLM-generated text detection tasks.
Corpus | Size | Source | Language | Domain |
---|---|---|---|---|
XSum (Narayan, Cohen, and Lapata 2018) | 42k | BBC | English | News Writing |
SQuAD (Rajpurkar et al. 2016) | 98.2k | Wiki | English | Question Answering |
WritingPrompts (Fan, Lewis, and Dauphin 2018) | 302k | Reddit WRITINGPROMPTS | English | Creative Writing |
Wiki40B (Guo et al. 2020) | 17.7m | Wiki | 40+ Languages | Web Text |
PubMedQA (Jin et al. 2019) | 211k | PubMed | English | Question Answering |
Children’s Book Corpus (Hill et al. 2016) | 687k | Books | English | Question Answering |
Avax Tweets Dataset (Muric, Wu, and Ferrara 2021) | 137m | English | Social Media | |
Climate Change Dataset (Littman and Wrubel 2019) | 4m | English | Social Media | |
Yelp Dataset (Asghar 2016) | 700k | Yelp | English | Social Media |
ELI5 (Fan et al. 2019) | 556k | English | Question Answering | |
ROCStories (Mostafazadeh et al. 2016) | 50k | Crowdsourcing | English | Creative Writing |
HellaSwag (Zellers et al. 2019a) | 70k | ActivityNet Captions, Wikihow | English | Question Answering |
SciGen (Moosavi et al. 2021) | 52k | arXiv | English | Scientific Writing, Question Answering |
WebText (Radford et al. 2019) | 45m | Web | English | Web Text |
TruthfulQA (Lin, Hilton, and Evans 2022) | 817 | Written by authors | English | Question Answering |
NarrativeQA (Kočiský et al. 2018) | 1.4k | Gutenberg, web | English | Question Answering |
TOEFL11 (Blanchard et al. 2013) | 12k | TOEFL test | 11 Languages | Scientific Writing |
Peer Reviews (Kang et al. 2018) | 14.5k | NIPS 2013–2017, CoNLL 2016, ACL 2017, ICLR 2017, arXiv 2007–2017 | English | Scientific Writing |
There are two main approaches to constructing LLM-generated text. One involves using prompts to directly instruct the model to write or answer questions. For example, in news writing, one might prompt the model with, “Please write an article for BBC News with the following headline: <headline>.” Examples of such prompts can be found in the work by Li et al. (2023c), which provides many specified prompts. The other method involves providing the LLM with an opening sentence and guiding it to continue the narrative. For instance, in news writing, one might instruct the model with, “Please write an article starting exactly with: <prefix>.” This approach helps align LLM-generated text more closely with human writing. For more details, please refer to the prompt settings and examples in Bao et al. (2023). Both construction strategies are sketched below.
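The sketch below illustrates both construction strategies; `generate` is a hypothetical placeholder for a call to any LLM API or local model, and the prompt templates simply follow the patterns quoted above rather than the exact prompts used by the cited datasets.

```python
def generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to any LLM API or local model."""
    raise NotImplementedError

def instruction_sample(headline: str) -> str:
    # Strategy 1: directly instruct the model to write from a task description.
    return generate(f"Please write an article for BBC News with the following headline: {headline}")

def continuation_sample(human_text: str, prefix_words: int = 30) -> str:
    # Strategy 2: give the model the opening of a human-written text and let it
    # continue, which tends to keep the output closer to human writing.
    prefix = " ".join(human_text.split()[:prefix_words])
    return generate(f"Please write an article starting exactly with: {prefix}")
```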
4.2 Evaluation Benchmarks
High-quality benchmarks can help researchers quickly verify whether their detectors are feasible and effective. We organize and compare benchmarks that are currently popular or promising, as shown in Table 5. On the one hand, we hope to help researchers better understand their differences so that they can choose suitable benchmarks for their experiments. On the other hand, we hope to draw researchers’ attention to the latest benchmarks, which are carefully designed to probe the newest issues in the task and hold great potential.
Summary of benchmarks for LLM-generated text detection.
Corpus | Use | Human | LLMs | LLMs Type | Language | Attack | Domain |
---|---|---|---|---|---|---|---|
TuringBench (Uchendu et al. 2021) | train | ˜8k | ˜159k | GPT-1, GPT-2, GPT-3, GROVER, CTRL, XLM, XLNET, FAIR, TRANSFORMER_XL, PPLM | English | – | News Writing |
MGTBench (He et al. 2023b) | train test | ˜2.4k ˜0.6k | ˜14.4k ˜3.6k | ChatGPT, ChatGPT-turbo, ChatGLM, Dolly, GPT4All, StableLM | English | Adversarial | Scientific Writing, Story Generation, News Writing |
GPABenchmark (Liu et al. 2023e) | test | ˜150k | ˜450k | GPT-3.5 | English | Paraphrase | Scientific Writing |
Scientific-articles Benchmark (Mosca et al. 2023) | test | ˜16k | ˜13k | SCIgen, GPT-2, GPT-3, ChatGPT, Galactica | English | – | Scientific Writing |
MULTITuDE (Macko et al. 2023) | train test | ˜4k ˜3k | ˜40k ˜26k | Alpaca-lora, GPT-3.5- Turbo, GPT-4, LLaMA, OPT, OPT-IML-Max, Text-Davinci-003, Vicuna | Arabic, Catalan, Chinese, Czech, Dutch, English, German, Portuguese, Russian, Spanish, Ukrainian | – | Scientific Writing, News Writing, Social Media |
HANSEN (Tripto et al. 2023) | test | – | ˜21k | ChatGPT, PaLM2, Vicuna13B | English | – | Spoken Text |
M4 (Wang et al. 2023b) | train valid test | ˜35k ˜3.5k ˜3.5k | ˜112k ˜3.5k ˜3.5k | GPT-4, ChatGPT, GPT-3.5, Cohere, Dolly-v2, BLOOMz 176B | English, Chinese, Russian, Urdu, Indonesian, Bulgarian, Arabic | – | Web Text, Scientific Writing, News Writing, Social Media, QA |
DetectRL (Wu et al. 2024b) | train test | ˜100k | ˜134k | GPT-3.5-turbo, Claude-instant, Palm-2-bison, Llama-2-70b | English | Prompt, Paraphrase, Adversarial | Scientific Writing, News Writing, Story Generation, Social Media |
TuringBench
The TuringBench dataset (Uchendu et al. 2021) is an initiative designed to explore the challenges of the “Turing test” in the context of neural text generation techniques. It comprises human-written content derived from 10k news articles, predominantly from reputable sources such as CNN. For the purpose of this dataset, only articles ranging between 200 and 400 words were selected. LLM-generated text within this dataset is produced by 19 distinct text generation models, including GPT-1, GPT-2 variants (small, medium, large, xl, and PyTorch), GPT-3, different versions of GROVER (base, large, and mega), CTRL, XLM, XLNET variants (base and large), FAIR for both WMT19 and WMT20, Transformer-XL, and both PPLM variants (distil and GPT-2). Each model contributed 8k samples, categorized by label type. Notably, TuringBench emerged as one of the pioneering benchmark environments for the detection of LLM-generated text. However, given the rapid advancements in LLM technologies, the samples within TuringBench are now less suited for training and validating contemporary detector performance. As such, timely updates incorporating the latest generation models and their resultant texts are imperative.
MGTBench
Introduced by He et al. (2023b), MGTBench stands as the inaugural benchmark framework for machine-generated text (MGT) detection. It boasts a modular architecture, encompassing an input module, a detection module, and an evaluation module. The dataset draws upon several of the foremost LLMs, including ChatGPT, ChatGLM, Dolly, ChatGPT-turbo, GPT4All, and StableLM, for text generation. Additionally, it incorporates over ten widely recognized detection algorithms, demonstrating significant potential.
GPABenchmark
The GPABenchmark (Liu et al. 2023e) is a comprehensive dataset of 600k samples spanning four categories: human-written, GPT-written, GPT-completed, and GPT-polished abstracts from a broad spectrum of academic disciplines, such as computer science, physics, and the humanities and social sciences. This dataset meticulously captures the critical scenarios reflecting both the utilization and potential misapplication of LLMs in academic composition. Consequently, it delineates three specific tasks: generation of text based on a provided title, completion of a partial draft, and refinement of an existing draft. Within the domain of academic writing detection, GPABenchmark stands as a robust benchmark, attributed to its voluminous data and its holistic approach to scenario representation.
Scientific-articles Benchmark
The Scientific-articles Benchmark (Mosca et al. 2023) comprises 16k human-written articles alongside 13k LLM-generated samples. The human-written articles are sourced from the ArXiv dataset available on Kaggle. In contrast, the machine-generated samples, which include abstracts, introductions, and conclusions, are produced by SCIgen, GPT-2, GPT-3, ChatGPT, and Galactica using the titles of the respective scientific articles as prompts. A notable limitation of this dataset is its omission of various adversarial attack types.
MULTITuDE
This is a benchmark for detecting machine-generated text in multiple languages. This dataset consists of 74k machine-generated texts and 7k human-written texts across 11 languages (Macko et al. 2023), including Arabic, Catalan, Chinese, Czech, Dutch, English, German, Portuguese, Russian, Spanish, and Ukrainian. The machine-generated texts are produced by eight generative models, including Alpaca-Lora, GPT-3.5-turbo, GPT-4, LLaMA, OPT, OPT-IML-Max, Text-Davinci-003, and Vicuna. In an era of rapidly increasing numbers of multilingual LLMs, MULTITuDE serves as an effective benchmark for assessing the detection capabilities of LLM-generated text detectors in various languages.
HANSEN
The Human and AI Spoken Text Benchmark (HANSEN) (Tripto et al. 2023) is the largest benchmark for spoken text, bringing together 17 existing speech datasets and transcripts as well as 23k novel AI-generated spoken texts. The AI-generated spoken texts in HANSEN were created by ChatGPT, PaLM2, and Vicuna-13B. Due to the stylistic differences between spoken and written language, detectors may require a more nuanced understanding of spoken text. HANSEN can effectively assess the progress of research aimed at developing such nuanced detectors.
M4
M4 (Wang et al. 2023b) represents a comprehensive benchmark corpus for the detection of text generated by LLMs. It spans a variety of generators, domains, and languages. Compiled from diverse sources, including wiki pages from various regions, news outlets, and academic portals, the dataset reflects common scenarios where LLMs are utilized in daily applications. The LLM-generated texts in M4 are created using cutting-edge generative models such as ChatGPT, LLaMa, BLOOMz, FlanT5, and Dolly. Notably, the dataset captures cross-lingual subtleties, featuring content in more than ten languages. While the dataset effectively addresses challenges across diverse domains, languages, and models, it could be further enhanced by incorporating a wider array of adversarial scenarios to broaden its applicability.
DetectRL
DetectRL (Wu et al. 2024b) is a benchmark explicitly designed to evaluate the effectiveness of LLM-generated text detectors in real-world application scenarios. It consists of four evaluation tasks: In-domain Robustness, Generalization, Varying Text Length, and Real-World Human Writing Assessment. This benchmark covers writing domains that are particularly susceptible to misuse, including academic writing, news writing, creative writing, and social media. It supports popular LLMs such as GPT-3.5-turbo, Claude-instant, Palm-2-bison, and Llama-2-70b. Unlike prior studies, DetectRL employs heuristic rules to generate adversarial LLM-generated texts, simulating realistic scenarios such as various prompt usages, human revisions (e.g., word substitutions), and typographical errors. The benchmark contains a total of 100k human-written samples and 134k LLM-generated samples, including both original and attacked examples.
5 Advances in Automated Detection Research
This section provides an overview of different detector designs and detection algorithms, including watermarking technology, statistics-based detectors, and neural-based detectors. The discussion is systematically structured and is organized by their underlying principles (see Figure 4), with a detailed analysis of each method’s contributions, comparisons to other approaches, examination of the datasets used, and examples to illustrate their effectiveness.
Figure 4: Classification of LLM-generated text detectors with corresponding diagrams and paper lists. We categorize the detectors into watermarking technology, statistics-based detectors, neural-based detectors, and human-assisted methods. In the diagrams, HWT represents Human-Written Text and LGT represents LLM-Generated Text. We use the brown lines to highlight the source of the detector’s detection capability, and the green lines to describe the detection process.
5.1 Watermarking Technology
Initially developed in the field of computer vision, watermarking techniques have played a critical role in detecting AI-generated images and safeguarding intellectual property in the visual arts. With the rise of LLMs, watermarking technology has been adapted to identify text generated by these models. These techniques not only protect models from unauthorized access, such as sequence distillation, but also mitigate risks related to the replication and misuse of LLM-generated text.
It is important to note that watermarking technology differs significantly from statistics-based or neural-based detection approaches. It is not solely designed for the broad identification of text generated by LLMs. Instead, it serves as a regulatory framework tailored to specific models, requiring access to the deployment of the model for implementation. Thus, while watermarking is a specialized detection method with effective capabilities, it also has the potential to complement statistics-based or neural-based detection techniques (Mitchell et al. 2023).
5.1.1 Data-Driven Watermarking
Data-driven methods enable the verification of data ownership or the tracking of illegal copying or misuse by embedding specific patterns or tags within the training datasets of LLMs. These methods typically rely on backdoor insertion, where a small number of watermarked samples are added to the dataset, enabling the model to implicitly learn a secret function set by the defender. When a specific trigger is activated, the backdoor watermark is revealed, usually in a black-box setting (Gu et al. 2022). This mechanism protects the model from unauthorized fine-tuning or use beyond the terms of the license by embedding a backdoor, specified by the owner’s input, during the foundational training and multi-task learning phases. Experimental results show that the watermark embedded using this method can be robustly extracted with a high success rate and is not erased by subsequent fine-tuning.
However, subsequent studies identified vulnerabilities in this technology, showing that it can be relatively easily compromised. Lucas and Havens (2023) detailed an attack method on this watermarking strategy by analyzing the content generated by autoregressive models to precisely identify the trigger words or phrases of the backdoor watermark. Their research indicates that triggers composed of randomly combined common words are easier to detect than those made up of unique and rare markers. Additionally, the research mentions that access to the model’s weights is the only prerequisite for detecting the backdoor watermark. Recently, Tang et al. (2023) introduced a clean-label backdoor watermarking framework that uses subtle adversarial perturbations to mark and trigger samples. Unlike previous methods that require adding arbitrary and mislabeled data to the training set, this approach minimizes the impact on the original task performance. Results indicate that incorporating just 1% of watermarked samples can embed a traceable watermark feature while remaining visually imperceptible.
It is important to note that data-driven methods were initially designed to protect the copyright of datasets and hence generally lack substantial payload capacity and generalizability. Moreover, applying such techniques in the field of LLM-generated text detection requires significant resource investment, including the embedding of watermarks in large datasets and retraining LLMs.
5.1.2 Model-Driven Watermarking
Model-driven methods embed watermarks directly into the LLMs by manipulating the logits output distribution or token sampling during the inference process. As a result, the LLMs generate responses that carry the embedded watermark, enabling effective regulation.
Logits-Based Methods
Kirchenbauer et al. (2023a) were the first to design a logits-based watermarking framework for LLMs, known as WLLM. This approach minimizes impact on text quality and does not require access to the LLM’s API or parameters. It involves selecting a random set of “green” tokens before generating words and subtly promoting their use during sampling. The watermark is then identified through statistical analysis of “red” and “green” tokens in the text. Experiments demonstrate that WLLM has a low false-positive rate, and the watermark degrades gracefully when under attack, showcasing strong reliability. Building on WLLM, Lee et al. (2023) proposed a method called Selective Watermarking via Entropy Thresholding (SWEET) for code generation. Unlike WLLM, SWEET enhances “green” tokens only at positions with high token distribution entropy during generation, ensuring the watermark remains both stealthy and intact. Results show that SWEET outperforms watermark baselines WLLM (Kirchenbauer et al. 2023a) and EXP-EDIT (Kuditipudi et al. 2024) and zero-shot methods including LogRank (Solaiman et al. 2019) and DetectGPT (Mitchell et al. 2023). SWEET improves detection capability (by over 10% AUROC) and demonstrates robust performance against real-world challenges, including paraphrasing attacks.
Despite the impressive performance of WLLM (Kirchenbauer et al. 2023a) and SWEET (Lee et al. 2023), Krishna et al. (2023) have shown that the robustness of these methods against paraphrasing of LLM-generated text still requires improvement. A key issue is that each token's watermark logit depends on a fixed number of preceding tokens: conditioning on too few weakens security, while conditioning on too many compromises robustness against attacks. To address this challenge, Liu et al. (2023b) proposed a Semantic Invariant Robust Watermark (SIR) for LLMs. This approach generates semantic embeddings for all previous tokens and uses them to determine the watermark logits, offering greater robustness against synonym substitution and text paraphrasing compared with WLLM (Kirchenbauer et al. 2023a) and EXP-EDIT (Kuditipudi et al. 2024). Recently, a multilingual version, X-SIR (He et al. 2024), was developed to tackle the inconsistency issues faced by logits-based text watermarking when translating text into different languages.
Moreover, current watermark detection algorithms require a secret key during generation, which can introduce security vulnerabilities and the risk of forgery in public detection processes. To address these issues, Liu et al. (2023a) introduced an Unforgeable Publicly Verifiable (UPV) watermarking algorithm. This method utilizes two different neural networks for watermark generation and detection to avoid using the same key in both stages. Compared to WLLM (Kirchenbauer et al. 2023a), it demonstrates a lower false positive rate and stronger resistance to paraphrasing attacks.
In evaluating the robustness of logits-based text watermarking, Kirchenbauer et al. (2023b) assessed the resilience of watermarked text against various attacks, including manual rewriting, rewriting using non-watermarked LLMs, and integration into large handwritten document corpora. The findings revealed that watermarks could be detected after an average of just 800 tokens, even though real-world attacks do weaken the watermark’s effectiveness. This suggests that watermarking might be the most reliable method compared with other LLM-generated text detectors.
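To give a concrete, simplified picture of how a green-list watermark is verified, the sketch below colors token transitions pseudo-randomly with a secret key and applies a one-proportion z-test to the observed green fraction. The hashing scheme, key, and green-list fraction are simplifying assumptions and do not reproduce WLLM's exact vocabulary partitioning or context-dependent seeding.

```python
# Simplified green-list watermark detection: a keyed hash stands in for the
# context-seeded vocabulary partition used by WLLM-style schemes.
import hashlib
import math

def is_green(token: str, prev_token: str, key: str = "secret-key", gamma: float = 0.5) -> bool:
    """Pseudo-randomly assign a (previous token, token) transition to the green list."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).hexdigest()
    return (int(digest, 16) % 10_000) / 10_000 < gamma

def green_z_score(tokens: list[str], gamma: float = 0.5) -> float:
    """One-proportion z-test: deviation of the green-token count from the gamma
    fraction expected in unwatermarked text; large positive values suggest a watermark."""
    hits = sum(is_green(tok, prev, gamma=gamma) for prev, tok in zip(tokens, tokens[1:]))
    n = len(tokens) - 1
    return (hits - gamma * n) / math.sqrt(gamma * (1 - gamma) * n)

print(green_z_score("the quick brown fox jumps over the lazy dog".split()))
```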
Token Sampling-Based Methods
During the normal model inference process, token sampling is determined by the sampling strategy and is often random, which helps guide the LLMs to produce more unpredictable text. Token sampling-based methods achieve watermarking by influencing the token sampling process, either by setting random seeds or specific patterns for token sampling. Kuditipudi et al. (2023) introduced a sequence of random numbers as a secret watermark key to intervene in and guide token sampling, which is then embedded into the LLMs to generate watermarked text. This approach is the first distortion-free watermarking strategy. Experiments on OPT-1.3B, LLaMA-7B, and Alpaca-7B demonstrate robustness against various paraphrasing attacks (e.g., editing and cropping), even when about 40–50% of the tokens are modified.
SemStamp (Hou et al. 2023) addresses the vulnerability of existing watermarking algorithms to paraphrasing attacks due to their token-level design. It is a robust sentence-level semantic watermarking algorithm based on Locality-Sensitive Hashing (LSH). The algorithm starts by encoding candidate sentences generated by the LLM and uses LSH hashing to partition the semantic embedding space into watermarked and non-watermarked regions. It then continuously performs sentence-level rejection sampling until a sentence falls into the watermarked region of the semantic embedding space. Experimental results show that this approach is more resilient against common and effective paraphrasing attacks, such as bigram paraphrase attacks, compared with WLLM (Kirchenbauer et al. 2023a), while maintaining superior text generation quality.
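The toy sketch below conveys the flavor of sentence-level rejection sampling in the spirit of SemStamp, with a hashed character-count vector standing in for a real sentence encoder and a random-hyperplane LSH partitioning the embedding space; the candidate list, region set, and dimensions are illustrative assumptions, and a real system would draw candidates from the LLM itself.

```python
# Toy sentence-level rejection sampling: accept only candidates whose LSH bucket
# falls in a secret "watermarked" region of the embedding space.
import numpy as np

rng = np.random.default_rng(0)
DIM, N_BITS = 64, 4
HYPERPLANES = rng.normal(size=(N_BITS, DIM))
VALID_REGIONS = {0, 3, 5, 6, 9, 10, 12, 15}   # secret half of the 2**N_BITS buckets

def embed(sentence: str) -> np.ndarray:
    """Stand-in embedding: hashed character counts (a real system uses a sentence encoder)."""
    vec = np.zeros(DIM)
    for ch in sentence:
        vec[ord(ch) % DIM] += 1.0
    return vec

def lsh_bucket(vec: np.ndarray) -> int:
    bits = (HYPERPLANES @ vec > 0).astype(int)
    return int("".join(map(str, bits)), 2)

def sample_watermarked(candidates: list[str], max_tries: int = 100) -> str:
    """Rejection-sample candidate sentences until one lands in a watermarked region."""
    for _ in range(max_tries):
        cand = candidates[rng.integers(len(candidates))]
        if lsh_bucket(embed(cand)) in VALID_REGIONS:
            return cand
    return candidates[0]   # fall back if no candidate is accepted

print(sample_watermarked(["The committee will meet on Friday.",
                          "On Friday, the committee is set to meet.",
                          "A meeting of the committee is planned for Friday."]))
```

Detection would then amount to checking whether an unusually high share of a document's sentences fall into the watermarked regions.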
In general, model-driven watermarking is a plug-and-play method that does not require any changes to the model’s parameters and has minimal impact on text quality, making it a reliable and practical watermarking approach. However, there is still significant opportunity for improvement in its robustness, and its specific usability needs to be further explored through additional experiments and practical applications.
5.1.3 Post-Processing Watermarking
Post-processing watermarking refers to a technique that involves embedding a watermark by processing the text after it has been generated by an LLM. This method typically functions as a separate module that works in a pipeline with the output of the generative model.
Character-Embedded Methods
Early post-processing watermarking techniques relied on the insertion or substitution of special Unicode characters in text, which was known at the time as the open space method (Bender et al. 1996) or Steganography (Provos and Honeyman 2003). These characters, which are difficult to detect with the naked eye, carried unique encoding information. Subsequent research utilized sentence spacing, word spacing, paragraph spacing, and end-of-line spacing to further reduce the visibility of embedded messages (Chotikakamthorn 1998; Por, Ang, and Delina 2008).
Por, Wong, and Chee (2012) presented a straightforward character-embedded method based on space character operations. They inserted selected Unicode space characters between sentences, words, lines, and paragraphs to embed watermarks in text, offering strong imperceptibility.
Rizzo, Bertini, and Montesi (2016) introduced Easymark, a technique that cleverly took advantage of the fact that Unicode has many visually identical or similar code points. Unlike earlier methods (Por, Wong, and Chee 2012), Easymark was not restricted by specific file formats and could be effectively applied to short texts. Specifically, Easymark embedded watermarks by replacing the regular space character (U+0020) with another blank code point (e.g., U+2004), using Unicode’s variant selectors, substituting substrings, or using spaces and homoglyphs of slightly different lengths, while ensuring the text’s appearance remained nearly unchanged. Results on a real dataset of 1.8 million New York Times articles showed that watermarks embedded with Easymark could be reliably detected.
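A minimal sketch of this whitespace-substitution idea follows; the choice of U+2004 as the carrier code point and the bit layout are illustrative assumptions rather than the exact scheme of the cited work.

```python
# Embed bits by swapping ordinary spaces (U+0020) for a visually similar code point.
WM_SPACE = "\u2004"   # THREE-PER-EM SPACE, nearly indistinguishable from a normal space

def embed_bits(text: str, bits: str) -> str:
    out, i = [], 0
    for ch in text:
        if ch == " " and i < len(bits):
            out.append(WM_SPACE if bits[i] == "1" else " ")
            i += 1
        else:
            out.append(ch)
    return "".join(out)

def extract_bits(text: str) -> str:
    return "".join("1" if ch == WM_SPACE else "0" for ch in text if ch in (" ", WM_SPACE))

marked = embed_bits("the quick brown fox jumps over the lazy dog", "1011")
print(extract_bits(marked))   # "10110000": the payload, then untouched spaces read as zeros
```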
Synonym Substitution-Based Methods
In response to the vulnerability of character-level methods to targeted attacks, some research has shifted towards embedding watermarks at the word level, primarily through synonym substitution.
Early watermark embedding strategies involved the continuous replacement of words with synonyms until the text carried the intended watermark content, as demonstrated in T-tex (Winstein 1998). To address the limitations of these initial methods, Topkara, Topkara, and Atallah (2006) developed a more quantifiable and resilient watermarking technique utilizing Wordnet (Fellbaum 1998). This approach was not entirely reliant on the token insertion process. Specifically, it involved further modifying the document to maximize ambiguity and deliberately operating close to the distortion threshold after the watermark content was embedded. Xiang et al. (2018) presented another approach that effectively combined arithmetic coding with synonym substitution for robust lossless recovery. In this technique, synonyms capable of carrying a meaningful payload were quantified into binary sequences and compressed using adaptive binary arithmetic coding before being embedded in the text. However, these techniques often did not adequately address the impact of synonym substitution on the overall meaning of sentences.
Building upon these foundations, Yang et al. (2022), Munyer and Zhong (2023), and Yoo et al. (2023) have utilized pre-trained or additionally fine-tuned neural models to execute word substitution and detection tasks more effectively, thereby better preserving the semantic relevance of the original sentences. A notable development in this area is presented by Yang et al. (2022), who introduced a context-aware lexical substitution (LS) scheme for natural language watermarking. This method utilizes BERT (Devlin et al. 2019) to assess the semantic relevance between candidate words and the original sentence, recommending LS candidates that ensure superior semantic relevance compared to earlier methods (Topkara, Topkara, and Atallah 2006; Xiang et al. 2018), with average relevance exceeding 98% across six datasets.
Given the prevalence of black-box models, Yang et al. (2023a) have developed a watermarking framework tailored for black-box LMs, enabling third parties to independently embed watermarks into generated texts. This framework defines a binary encoding function that randomly encodes words into binary values, selectively replacing words denoting binary “0” with contextually relevant synonyms denoting binary “1” to embed the watermark. Experimental results on the Chinese and English datasets from HC3 (Guo et al. 2023) demonstrate that this method maintains robustness against various attacks including retranslation, text polishing, word deletion, and synonym substitution, without sacrificing the original semantic integrity.
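As a rough illustration of the binary-encoding idea, the sketch below hashes words to bits and swaps bit-0 words for WordNet synonyms that hash to bit-1; the hash function, secret key, and detection statistic are assumptions for illustration, not the cited framework's actual design, and the snippet requires the NLTK WordNet data (nltk.download("wordnet")).

```python
# Illustrative synonym-substitution watermark: raise the fraction of "bit-1" words.
import hashlib
from nltk.corpus import wordnet as wn

def word_bit(word: str, key: str = "secret") -> int:
    return int(hashlib.sha256(f"{key}|{word.lower()}".encode()).hexdigest(), 16) & 1

def watermark(tokens: list[str]) -> list[str]:
    out = []
    for tok in tokens:
        if word_bit(tok) == 1:
            out.append(tok)                     # already encodes bit 1, keep it
            continue
        synonyms = {lem.name() for syn in wn.synsets(tok) for lem in syn.lemmas()
                    if "_" not in lem.name() and lem.name().lower() != tok.lower()}
        replacement = next((s for s in sorted(synonyms) if word_bit(s) == 1), tok)
        out.append(replacement)                 # swap in a bit-1 synonym when one exists
    return out

def bit1_ratio(tokens: list[str]) -> float:
    """Detection: watermarked text should show an elevated fraction of bit-1 words."""
    return sum(word_bit(t) for t in tokens) / len(tokens)

marked = watermark("large language models produce fluent text".split())
print(marked, bit1_ratio(marked))
```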
Sequence-to-Sequence Methods
Recent research has explored end-to-end watermark encryption techniques aimed at enhancing flexibility and minimizing artifacts introduced by watermarks. For instance, Abdelnabi and Fritz (2021) proposed Adversarial Watermark Transformer (AWT), the first end-to-end framework to automate the learning of word replacements and their contents for watermark embedding. This method combines end-to-end and adversarial training, enabling the injection of binary messages into specific input texts at the encoding level. The result is an output text that is virtually imperceptible, with minimal impact on the semantic accuracy and integrity of the original input. In comparison to Synonym Substitution-Based Methods (Topkara, Topkara, and Atallah 2006), AWT offers a superior balance between effectiveness, confidentiality, and robustness.
Zhang et al. (2023a) introduced a Robust and Efficient Watermarking Framework for Generative LLMs (REMARK-LLM), which includes three components: (i) a message encoding module that embeds binary signatures into texts generated by LLMs; (ii) a reparametrization module that converts the dense distribution of message encoding into a sparse distribution for generating watermarked text tokens; and (iii) a decoding module dedicated to extracting signatures. Experiments demonstrated that REMARK-LLM can embed significantly more signature bits (more than double) into the same text while maintaining its semantic integrity. Moreover, compared to AWT (Abdelnabi and Fritz 2021), it shows enhanced resilience against a variety of watermark removal and detection attacks.
Compared to model-driven watermarking, post-processing watermarking may depend more heavily on specific rules, making it more vulnerable to sophisticated attacks that exploit visible clues. For example, systematic patterns introduced through methods like character embedding or synonym substitution can be identified and targeted for removal or disruption, rendering the watermarks ineffective. Despite this risk, post-processing watermarking has significant potential for various applications. Many existing watermarking techniques necessitate training within white-box models, making them unsuitable for black-box LLM settings. For instance, embedding watermarks in GPT-4 is nearly impossible given its proprietary and closed-source nature. Nevertheless, post-processing watermarking provides a solution for adding watermarks to text generated by black-box LLMs, enabling third parties to embed watermarks independently.
5.2 Statistics-Based Methods
Unlike watermarking methods, which embed identifiable patterns within text and require access to the LLM’s deployment, statistics-based methods focus on detecting LLM-generated text by analyzing inherent text features. These methods do not rely on additional training through supervised signals; instead, they leverage statistical data to uncover unique patterns and regularities in the generated text. By computing thresholds or analyzing distributional characteristics, statistics-based methods can proficiently identify LLM-generated text without requiring supervised training or specialized access to the model. This independence makes statistics-based approaches more broadly applicable and less dependent on the specific deployment or modification of LLMs. As a result, they provide a complementary or alternative detection strategy to watermarking. In this section, we classify these methods into three categories: linguistic feature statistics, white-box statistics, and black-box statistics, and discuss each in detail.
5.2.1 Linguistic Feature Statistics
The inception of statistics-based detection research can be traced back to the pioneering work of Corston-Oliver, Gamon, and Brockett (2001). In this foundational study, the authors utilized linguistic features, such as the branching properties observed in grammatical analyses of text, function word density, and constituent length, to determine whether a given text was generated by a machine translation model. These features served as key indicators in distinguishing machine-generated text from human-generated text.
Another early line of work pursuing similar detection goals uses frequency statistics. For instance, Kalinichenko et al. (2003) and Baayen (2001) utilized frequency statistics associated with the occurrence of word pairs and the distribution characteristics of words within texts. This mechanism was used to determine whether texts were autonomously generated by a generative system. Building on this foundation, Arase and Zhou (2013) developed a detection technique that could identify the “phrase salad” phenomenon in sentences, that is, individually correct and fluent phrases that are sequenced unnaturally. This technique, based on statistics of fluency, grammaticality, and gappy-phrase features, effectively detected low-quality machine-translated sentences in large-scale web texts, achieving an accuracy of 95.8% for sentences and 80.6% for noisy web texts.
Recent studies on LLM-generated text detection have proposed methodologies based on statistics of count-based language model features. Gallé et al. (2021) proposed a method that uses repeated high-order n-grams to detect LLM-generated documents. This approach is predicated on the observation that certain n-grams appear with unusual frequency within LLM-generated text. Similarly, Hamed and Wu (2023) developed a detection system based on the statistical similarity of bigram counts. Their research revealed that texts produced by ChatGPT accounted for only 23% of all academic bigram content, highlighting substantial variations in terminology between human authors and LLM-generated content. These findings suggest that ChatGPT may possess restricted academic aptitude from a human viewpoint. Their algorithm accurately detected 98 of 100 academic papers authored by LLMs, demonstrating the efficacy of their feature engineering strategy in differentiating between texts created by humans and those produced by LLMs.
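To illustrate the flavor of such count-based statistics, the short sketch below computes a repeated high-order n-gram rate; the n-gram order and any decision threshold are arbitrary choices rather than values from the cited studies.

```python
# Fraction of high-order n-grams that occur more than once in a text.
from collections import Counter

def repeated_ngram_rate(text: str, n: int = 4) -> float:
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    return sum(c for c in counts.values() if c > 1) / len(ngrams)

print(repeated_ngram_rate("the model said the model said the model said it again"))
```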
However, our empirical observations reveal a conspicuous limitation in the application of linguistic feature statistics: their effectiveness heavily depends on access to extensive corpus statistics and diverse types of LLMs.
5.2.2 White-box Statistics
White-box methods for detecting LLM-generated text require direct access to the source model for implementation. Existing white-box detection techniques primarily use zero-shot approaches, which involve obtaining the model’s logits output and calculating specific metrics. These metrics are then compared against predetermined thresholds obtained through statistical methods to identify LLM-generated text.
Logits-Based Methods
Logits are the raw outputs produced by LLMs during text generation, specifically from the model’s final linear layer before the softmax function. These outputs indicate the model’s confidence levels associated with generating each potential subsequent word. The Log-likelihood (Solaiman et al. 2019), a metric derived directly from the logits, measures the average token-wise log probability for each token within the given text. This metric helps determine the likelihood that the text was generated by an LLM and is widely recognized as one of the most popular baseline metrics for LLM-generated text detection.
Similarly, Rank (Solaiman et al. 2019) is another common baseline computed from logits. The Rank metric calculates the ranking of each word in a sample within the model’s output probability distribution. This ranking is determined by comparing the logit score of the word against the logit scores of all other possible words. If, on average, the words in a sample sit near the top of the distribution (i.e., have small rank values), the sample is likely to have been generated by an LLM. To refine this process, the Log-rank method applies a logarithmic function to each token’s rank. A notable application of Log-rank is the GLTR tool (Gehrmann, Strobelt, and Rush 2019), which is designed as a visual forensic tool to facilitate comparative analysis. This tool uses different colors to represent tokens based on their sampling frequency levels, highlighting the proportion of words that a model tends to use in the analyzed text. With the help of GLTR, the human detection rate of fake texts increased from 54% to 72%. Su et al. (2023a) introduced Log-likelihood Ratio Ranking (LRR), which combines Log-likelihood (Solaiman et al. 2019) and Log-rank (Gehrmann, Strobelt, and Rush 2019) by taking the ratio of these two metrics. This approach provides a more comprehensive evaluation by effectively integrating Log-likelihood assessments and Log-rank analysis. Experiments conducted across three datasets and seven language models demonstrated that this method outperforms the Log-likelihood (Solaiman et al. 2019), Rank (Solaiman et al. 2019), and Log-rank (Gehrmann, Strobelt, and Rush 2019) methods by approximately 3.9 and 1.75 AUROC points, respectively.
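The sketch below computes the Log-likelihood, Log-rank, and an LRR-style ratio with GPT-2 as the scoring model; the choice of scorer and the exact LRR normalization are assumptions, and a practical detector would calibrate decision thresholds on held-out data.

```python
# Zero-shot logit statistics for a candidate text, scored with GPT-2.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def zero_shot_scores(text: str) -> dict:
    ids = tok(text, return_tensors="pt").input_ids
    logits = model(ids).logits[0, :-1]                    # predictions for tokens 2..n
    targets = ids[0, 1:]
    log_probs = torch.log_softmax(logits, dim=-1)
    token_ll = log_probs[torch.arange(len(targets)), targets]               # per-token log-likelihood
    order = logits.argsort(dim=-1, descending=True)
    ranks = (order == targets.unsqueeze(1)).nonzero()[:, 1].float() + 1.0   # 1 = most likely token
    log_rank = torch.log(ranks).mean()
    return {
        "log_likelihood": token_ll.mean().item(),         # higher values hint at LLM-generated text
        "log_rank": log_rank.item(),                      # lower values hint at LLM-generated text
        "lrr": (-token_ll.mean() / log_rank.clamp(min=1e-6)).item(),
    }

print(zero_shot_scores("Large language models generate increasingly fluent text."))
```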
Entropy represents another early zero-shot method used for evaluating LLM-generated text. It measures the uncertainty or amount of information in a text or model output, typically calculated through the probability distribution of words. High entropy suggests that the content of the sample text is unclear or highly diversified, meaning that many words have a similar probability of being chosen. In such cases, the sample is likely to have been generated by an LLM. Lavergne, Urvoy, and Yvon (2008) used the Kullback-Leibler (KL) divergence to assign scores to n-grams, taking into account the semantic relationships between their initial and final words. This approach identifies n-grams with significant dependencies between the initial and terminal words, thus aiding in the detection of spurious content and enhancing the overall performance of the detection process. However, in recent works by Mitchell et al. (2023) and Bao et al. (2023), methods based on entropy have performed poorly, achieving only an average AUROC of nearly 50%.
The perplexity method based on traditional n-gram language models evaluates the predictive capability of LMs (Beresneva 2016). Recent studies, such as HowkGPT (Vasilatos et al. 2023), use this approach to differentiate between texts written by students and those produced by ChatGPT. This is achieved by calculating and comparing the perplexity scores of each text. Findings indicate that ChatGPT-generated texts exhibit lower perplexity, in contrast to the more evenly distributed scores observed in student responses. This comparative analysis establishes thresholds to accurately identify the origins of submitted assignments. Wu et al. (2023) introduced LLMDet, a tool designed to quantify and categorize the perplexity scores of various models by calculating the probability of the next token for selected n-grams. This approach leverages the text’s intrinsic self-watermarking properties (evidenced by surrogate perplexity). Compared to other perplexity-based methods like HowkGPT (Vasilatos et al. 2023), LLMDet allows for effective tracing of text origins and facilitates more fine-grained detection. With a classification accuracy of up to 98.54%, this tool also boasts greater computational efficiency than finely-tuned RoBERTa classifiers (Liu et al. 2019). However, Hans et al. (2024) found that perplexity alone is insufficient as a detection feature, as the text generated by LLMs may exhibit high perplexity scores depending on the specified prompt. This variability renders simple perplexity-based detectors ineffective. Therefore, it is recommended to use the ratio of perplexity measurement to cross-perplexity to address this issue. This approach can detect over 90% of samples generated by ChatGPT (and other LLMs) with a false positive rate of 0.01%. The method, referred to as “Binoculars,” describes the extent to which the next-token prediction of one model surprises another model.
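As a rough sketch of a perplexity-to-cross-perplexity ratio in the spirit of Binoculars, the code below pairs GPT-2 with DistilGPT-2 (which share a tokenizer); the choice of model pair, the role each plays, and the normalization are assumptions rather than the published configuration, so the snippet conveys the shape of the score rather than its calibrated behavior.

```python
# A log-space perplexity / cross-perplexity ratio with a stand-in model pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
observer = AutoModelForCausalLM.from_pretrained("gpt2").eval()
performer = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()

@torch.no_grad()
def binoculars_style_score(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    targets = ids[0, 1:]
    obs_logits = observer(ids).logits[0, :-1]
    perf_logits = performer(ids).logits[0, :-1]
    # log-perplexity of the text under the observer model
    log_ppl = torch.nn.functional.cross_entropy(obs_logits, targets)
    # cross term: observer's average surprise under the performer's next-token distributions
    x_ent = -(torch.softmax(perf_logits, -1) * torch.log_softmax(obs_logits, -1)).sum(-1).mean()
    return (log_ppl / x_ent).item()   # lower scores point toward LLM-generated text

print(binoculars_style_score("Large language models generate increasingly fluent text."))
```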
In addition, some white-box derived features are beneficial for LLM-generated text detection. Ghostbuster (Verma et al. 2023) feeds LLM-generated texts into a series of weaker LLMs (from unigram models to unadjusted GPT-3 davinci) to obtain token probabilities, and then conducts a structured search over combinations of these model outputs to train a linear classifier for distinguishing LLM-generated texts. This detector achieves an average F1 score of 99.0, an increase of 41.6 F1 points over previous methods such as GPTZero and DetectGPT (Mitchell et al. 2023). Uniform Information Density (UID) (Jain et al. 2018; Wei, Meister, and Cotterell 2021; Venkatraman, He, and Reitter 2023) proves to be another influential feature. By analyzing the token probabilities within samples, Venkatraman, Uchendu, and Lee (2023) extract UID-based features and then train a logistic regression classifier to fit the UID characteristics of texts generated by different LLMs. Experimental results on datasets such as TuringBench (Uchendu et al. 2021), GPABenchmark (Liu et al. 2023e), ArguGPT (Liu et al. 2023d), and MAGE (Li et al. 2023c) demonstrate that the method outperforms traditional supervised and statistical methods, such as the OpenAI detector (Radford et al. 2019) and DetectGPT (Mitchell et al. 2023), by over 20% across multiple domains.
Perturbed-Based Methods
A notable study by Mitchell et al. (2023) presents DetectGPT, a method for identifying text produced by LLMs based on structural patterns within LLM probability functions. Specifically, it has been observed that LLM-sampled texts tend to be located in regions exhibiting negative curvature of the model’s log probability function. This technique merely requires a pre-trained mask-filling model (e.g., a small T5) to generate semantically similar text perturbations. DetectGPT demonstrates greater discriminative power than traditional zero-shot approaches, including Log-likelihood (Solaiman et al. 2019), Rank (Solaiman et al. 2019), Log-rank (Solaiman et al. 2019), and Entropy (Gehrmann, Strobelt, and Rush 2019). It significantly enhances the detection rate of fake news articles generated by GPT-NeoX-20B, improving the AUROC from 0.81 for the strongest zero-shot baseline to 0.95 with DetectGPT. Another contemporary work, NPR (Su et al. 2023a), shares a conceptual framework with DetectGPT (Mitchell et al. 2023). NPR utilizes normalized perturbation log-rank to detect text generated by LLMs. Compared to DetectGPT, NPR is less sensitive to the type and quantity of perturbations, achieving improved performance with an approximate 2% increase in AUROC.
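The simplified sketch below illustrates the curvature statistic, with random adjacent-word swaps standing in for the T5 mask-filling perturbations used by DetectGPT; it is meant to convey the shape of the computation under these assumptions, not to reproduce the original method's behavior.

```python
# Curvature-style score: original log-likelihood minus the mean log-likelihood
# of several crudely perturbed copies of the text.
import random
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def avg_log_likelihood(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    return -torch.nn.functional.cross_entropy(model(ids).logits[0, :-1], ids[0, 1:]).item()

def crude_perturb(text: str, n_swaps: int = 3) -> str:
    """Stand-in for mask-filling: swap a few adjacent words."""
    words = text.split()
    for _ in range(n_swaps):
        i = random.randrange(len(words) - 1)
        words[i], words[i + 1] = words[i + 1], words[i]
    return " ".join(words)

def curvature_score(text: str, n_perturbations: int = 10) -> float:
    original = avg_log_likelihood(text)
    perturbed = [avg_log_likelihood(crude_perturb(text)) for _ in range(n_perturbations)]
    return original - sum(perturbed) / len(perturbed)   # larger gaps suggest LLM-generated text

print(curvature_score("Large language models generate increasingly fluent text."))
```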
While innovative and sometimes more effective than supervised methods, DetectGPT has limitations, including potential performance drops if the rewrites do not adequately represent the space of meaningful alternatives, and high computational demands, as it requires perturbing and scoring numerous texts. To address these challenges, Deng et al. (2023) proposed a method using a Bayesian surrogate model to score a small set of representative samples. By extrapolating the scores from these samples to others, the method enhances query efficiency and reduces computational overhead while preserving performance. Extensive empirical studies on texts generated by models such as GPT-2, LLaMA2, and Vicuna have demonstrated its efficiency over DetectGPT (Mitchell et al. 2023), especially in detecting texts generated by LLaMA models. This method achieved superior results using just 2–3 queries compared to DetectGPT’s 200 queries. Another significant advancement, Fast-DetectGPT, is reported by Bao et al. (2023), who introduced an approach based on the hypothesis that token-level conditional probability curvature is a more fundamental metric of LLM-generated text. This approach replaces the perturbation steps in DetectGPT (Mitchell et al. 2023) with more efficient sampling steps. Evaluations across a variety of datasets, source models, and testing conditions have shown that Fast-DetectGPT not only significantly enhances detection accuracy by approximately 75% over DetectGPT in both white-box and black-box settings under consistent experimental conditions but also boosts detection speed by 340 times. Another approach stems from the white-box configuration of DNA-GPT (Yang et al. 2023b), which also utilizes probability curves but diverges from DetectGPT (Mitchell et al. 2023). Rather than using a perturbation framework, this method leverages LLMs like ChatGPT to repeatedly extend truncated texts. By calculating the probability divergence, it analyzes the differences between the original and the extended texts. This technique has shown superior performance in distinguishing between human-written and GPT-generated texts on four English datasets and one German dataset, achieving nearly 100% detection accuracy and outperforming the OpenAI classifier (Radford et al. 2019).
To enhance white-box LLM-generated text detection, Ma and Wang (2024) proposed a method called TOCSIN. The key innovation is Token Cohesiveness, a feature that quantifies the semantic tightness of a text by randomly deleting 1.5% of its tokens and calculating the average semantic difference between the original and altered versions. Due to the causal self-attention mechanism in LLMs, LLM-generated text tends to exhibit higher cohesiveness, while human-written text is more flexible and less cohesive. TOCSIN uses a dual-channel framework: One channel calculates token cohesiveness, and the other uses an existing zero-shot detector to generate predictions. By combining the two scores, TOCSIN achieves more accurate classification. Experiments show that TOCSIN significantly improves the performance of four mainstream detectors (Likelihood, LogRank, LRR, Fast-DetectGPT), with AUROC gains of 0.59% to 10.97% in white-box scenarios and 0.38% to 3.77% in black-box scenarios, while also enhancing the reliability of cross-model detection.
Intrinsic Features-Based Methods
Tulchinskii et al. (2023) proposed a method for constructing a detector using the intrinsic dimension of the manifold, based on the assumption that humans and LLMs exhibit consistent capabilities in their respective textual domains. Specifically, a contextual embedding is extracted for each token, converting the text into a high-dimensional point cloud. The intrinsic dimension of this point cloud is estimated using the persistent homology dimension (PHD) method, which involves constructing a minimum spanning tree, analyzing the “lifetime” of topological features, and fitting a regression model. This intrinsic dimension is then used as the sole feature in a logistic regression classifier to detect whether the text is LLM-generated. Observations show that the average intrinsic dimension of natural fluent text in various alphabetic languages is about 9, and about 7 for Chinese, while the average intrinsic dimension of LLM-generated text in each language is generally about 1.5 units lower. The proposed detector shows consistent accuracy across text domains, generators, and different levels of human author skill. It significantly outperforms detectors including DetectGPT (Mitchell et al. 2023) and the OpenAI Detector (Radford et al. 2019) in scenarios involving multi-domain challenges, model shifts, and adversarial attacks. However, its reliability degrades when applied to suboptimal or high-temperature generators. Text Fluoroscopy (Yu et al. 2024) classifies texts by extracting intrinsic features from the intermediate layers of language models. Unlike methods that rely solely on features from the first or last layer of the model, this approach calculates the distributional differences (KL divergence) between each intermediate layer and the first and last layers, selecting the intermediate layer with the largest distributional difference as the intrinsic feature layer. This enables more effective identification of differences between human-written and LLM-generated texts. Experimental results demonstrate that Text Fluoroscopy outperforms existing methods across various datasets and generative models (e.g., ChatGPT, GPT-4, Claude3) with an average improvement of 7.36%, and it exhibits greater robustness against paraphrasing and back-translation attacks.
5.2.3 Black-box Statistics
In contrast to white-box statistical methods, black-box statistical approaches utilize models that calculate scores for specific text features without accessing the logits of either the source or surrogate models. One method, utilized in DNA-GPT (Yang et al. 2023b), uses n-gram probability divergence similarity. It extends the writing of a truncated text under review using an LLM, and evaluates the similarity of n-gram probability divergence between the continuation and the original text. This process helps distinguish between human-written texts and those generated by LLMs. Experiments conducted on models such as GPT-3.5-turbo and GPT-4, as well as open-source models like GPT-NeoX-20B and LLaMa-13B, have shown superior performance compared to the OpenAI detector (Radford et al. 2019), with a minimal drop from 99.09 AUROC to 98.48 AUROC when facing modification attacks.
Another approach calculates similarity scores between original texts and their rewritten or revised versions to identify LLM-generated text, as demonstrated by Mao et al. (2024) and Zhu et al. (2023). This method is based on the observation that texts rewritten and edited by an LLM tend to undergo fewer modifications than human-written texts, as LLMs favor their own generative logic and statistical patterns, making them less likely to initiate changes. Thus, texts that show higher similarity between their original and modified versions are more likely to be LLM-generated. Experiments across various domains, including news writing, creative writing, academic writing, code generation, and social media, show that this approach significantly enhances the efficacy of current AI detectors and generalizes better than white-box statistical methods like DetectGPT (Mitchell et al. 2023) and Ghostbuster (Verma et al. 2023). GECScore (Wu et al. 2024a) is a simple yet effective black-box zero-shot method, designed based on the observation that, from the perspective of LLMs, human-written texts often contain more grammatical errors than texts generated by LLMs. The method achieves an AUROC of approximately 98.62% and demonstrates greater reliability in real-world scenarios compared with methods relying on simple LLM revision preferences (Zhu et al. 2023) and the powerful zero-shot method Fast-DetectGPT (Bao et al. 2023), showcasing outstanding generalization capabilities. AuthentiGPT (Guo and Yu 2023) utilizes a similar black-box approach with a zero-shot denoising technique to identify LLM-generated texts. It involves using a black-box LLM to remove noise artificially added to the input texts, then comparing the denoised text semantically to the original. This method has achieved an AUROC score of 91.8% in the academic writing domain, surpassing commercial detectors such as GPTZero and Originality.AI, as well as baseline methods using GPT-3.5 and GPT-4 as detectors.
Yu et al. (2023b) introduced a novel detection mechanism that leverages the similarity between original texts and their regenerated versions. This method considers the generation process as a coupling of the generative model’s prompt features and intrinsic characteristics. It reconstructs prompts corresponding to the candidate text using an auxiliary LLM, then regenerates the text from these prompts, aligning the candidate and regenerated texts with the prompts accordingly. This alignment’s similarity is then used as the primary detection feature, enabling the detector to focus on the intrinsic properties of the generative model rather than the prompts. Compared to supervised methods like the RoBERTa classifier (Liu et al. 2019) and RADAR (Hu, Chen, and Ho 2023), and statistical methods such as DNA-GPT (Yang et al. 2023b), DetectGPT (Mitchell et al. 2023), and Fast-DetectGPT (Bao et al. 2023), DPIC (Yu et al. 2023b) achieves an average improvement of 6.76% and 2.91% over the best baseline in detecting texts generated by GPT-4 and Claude3 across various domains.
Additionally, acknowledging the absence of quantifiable metrics at the sentence level, Quidwai, Li, and Dube (2023) proposed a more comprehensive comparison technique to provide more accurate and interpretable evaluations. This method involves analyzing sets of sentences within LLM-generated texts and their paraphrases, differentiating them from human-written texts through cosine similarity measurements, achieving an accuracy rate as high as 94%, which outperforms supervised methods like the RoBERTa classifier (Liu et al. 2019).
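The minimal sketch below illustrates the rewrite-similarity idea shared by several of these methods; it assumes the LLM-produced rewrite has already been obtained elsewhere (the API call is not shown), and the sentence encoder and threshold are illustrative choices rather than settings from the cited papers.

```python
# Score how little an LLM's rewrite changed a candidate text; near-identical
# rewrites are treated as evidence that the text was LLM-generated.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def rewrite_similarity(original: str, rewritten: str) -> float:
    emb = encoder.encode([original, rewritten], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def looks_llm_generated(original: str, rewritten: str, threshold: float = 0.95) -> bool:
    return rewrite_similarity(original, rewritten) >= threshold
```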
However, black-box statistical approaches involve challenges of their own, including the substantial overhead of accessing LLMs and long response times.
5.3 Neural-Based Methods
5.3.1 Feature-Based Classifiers
Linguistic Feature-Based Classifiers
When comparing texts generated by LLMs with those written by humans, the differences in numerous linguistic features provide a solid foundation for feature-based classifiers to distinguish between them effectively. The workflow of such classifiers typically starts with the extraction of key statistical language features, followed by the application of machine learning techniques to train a classification model. This approach has been widely used in the identification of fake news. For instance, in a recent study, Aich, Bhattacharya, and Parde (2022) achieved an impressive accuracy of 97% on fake news detection by extracting 21 textual features and using a k-nearest neighbor classifier. Drawing inspiration from the tasks of detecting fake news and LLM-generated texts, the linguistic features of texts can be broadly categorized into stylistic features, complexity features, semantic features, psychological features, and knowledge-based features. These features are primarily obtained through statistical methods.
Stylistic features primarily focus on the frequency of words that specifically highlight the stylistic elements of the text, including the frequency of capitalized words, proper nouns, verbs, past tense words, stopwords, technical words, quotes, and punctuation (Horne and Adali 2017). Complexity features represent the complexity of the text, such as the type-token ratio (TTR) and the measure of textual lexical diversity (MTLD) (McCarthy 2005). Semantic features include Advanced Semantic (AdSem), Lexico Semantic (LxSem), and statistics of semantic dependency tags, among other semantic-level features; these can be extracted using tools like LingFeat (Lee, Jang, and Lee 2021). Psychological features are generally related to sentiment analysis and can be derived with tools like SentiWordNet (Baccianella, Esuli, and Sebastiani 2010) to calculate sentiment scores, or extracted using sentiment classifiers. Knowledge-based features include named entities (NE), opinions (OP), and entity relations (RE), and can be extracted using tools such as UIE (Lu et al. 2022) and CogIE (Jin et al. 2021).
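For illustration, a toy feature-based pipeline might look as follows; the feature set, the two-example training data, and the choice of classifier are deliberately small stand-ins for the far richer features and corpora used in the studies discussed next.

```python
# Hand-crafted stylistic/complexity features feeding a scikit-learn classifier.
import string
from sklearn.ensemble import RandomForestClassifier

def stylistic_features(text: str) -> list[float]:
    words = text.split()
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    return [
        len(set(w.lower() for w in words)) / max(len(words), 1),           # type-token ratio
        sum(len(w) for w in words) / max(len(words), 1),                   # mean word length
        len(words) / max(len(sentences), 1),                               # mean sentence length
        sum(ch in string.punctuation for ch in text) / max(len(text), 1),  # punctuation ratio
        sum(w[:1].isupper() for w in words) / max(len(words), 1),          # capitalized-word ratio
    ]

texts = ["Sure! Here is a concise overview of the topic you requested.",
         "ugh my train was 40 mins late again, cant deal with this"]
labels = [1, 0]   # 1 = LLM-generated, 0 = human-written (toy labels)
clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit([stylistic_features(t) for t in texts], labels)
print(clf.predict([stylistic_features("The following summary outlines the key points.")]))
```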
Shah et al. (2023) constructed a classifier based on stylistic features such as syllable count, word length, sentence structure, frequency of function word usage, and punctuation ratio. This classifier achieved an accuracy of 93%, which effectively demonstrates the significance of stylistic features for LLM-generated text detection. Other work integrated text modeling with a variety of linguistic features through data fusion techniques (Corizzo and Leal-Arenas 2023), including different types of punctuation marks, the use of the Oxford comma, paragraph structures, average sentence length, the repetitiveness of high-frequency words, and sentiment scores, subsequently training a deep neural network for detection tasks. On datasets in both English and Spanish, this methodology reached F1 scores of 98.36% and 98.29%, respectively, surpassing the performance of both single neural network-based methods (e.g., BERT + SVM) and feature-based methods (e.g., Emotional Semantics + SVM). Furthering this line of inquiry, Mindner, Schlippe, and Schaaff (2023) explored a range of both traditional and novel features for detecting LLM-generated text, adopting a multidimensional approach to bolster the discriminative power of their classifiers. This approach included perplexity-based, semantic, list lookup, document, error-based, readability, AI-feedback, and text vector features. By training the detector using XGBoost, random forest, and neural network-based methods, they achieved an F1 score of 98% for basic LLM-generated text detection, and 78.9% for basic LLM-rewritten text. The optimized detector outperformed GPTZero by an impressive 183.8% in F1 score, demonstrating its exceptional detection capabilities.
While these features provide valuable insights into various aspects of the text, research has shown that incorporating deeper structural features can further improve performance. For example, Kim et al. (2024) proposed an approach based on Rhetorical Structure Theory (RST) and recursive hypergraph analysis, focusing on extracting hierarchical discourse topics. Compared with methods directly fine-tuning encoder-based classifiers, models incorporating discourse topic features achieve significant performance improvements on multiple datasets. On the HC3 dataset, the classification model leveraging discourse topic features achieved an F1 score of 92.4%, which is about 4% higher than the baseline without discourse features, showing higher robustness, especially in long texts and cross-domain tasks.
Although classifiers based on linguistic features have their advantages in distinguishing between human-written and LLM-generated texts, their shortcomings cannot be overlooked. The results from Schaaff, Schlippe, and Mindner (2023) indicate that such classifiers have poor robustness against ambiguous semantics and often underperform neural network features. Moreover, classifiers based on stylistic features may be capable of differentiating between texts written by humans and those generated by LLMs, but their ability to detect LLM-generated misinformation is limited. This limitation is highlighted in Schuster et al. (2020), which shows that language models tend to produce stylistically consistent texts. However, Crothers et al. (2022) suggest that statistical features can offer additional adversarial robustness and can be utilized in constructing integrated detection models.
White-box Features-Based Classifiers
In addition to linguistic features, classifiers based on model features have recently garnered considerable attention from researchers. These classifiers are not only capable of detecting texts generated by LLMs but can also be used for text origin tracing. Sniffer (Li et al. 2023a) was the first to focus on tracing the origins of LLM outputs, using contrastive features between models along with token-level perplexity that aligns with model contrasts. These features evaluate the percentage of words that show lower perplexity when comparing one model, θi, with another, θj. A classifier trained on these features reached an accuracy of 86.0%. SeqXGPT (Wang et al. 2023a) represents a further advancement in the field of text origin tracing, extending the granularity of detection to the sentence level. It utilizes a context network designed around the log probabilities in a white-box LLM, which combines a CNN and a two-layer transformer to encode text and detect LLM-generated content through a sequence tagging task. Experiments demonstrate that SeqXGPT achieves superior results over RoBERTa (Liu et al. 2019), DetectGPT (Mitchell et al. 2023), and Sniffer (Li et al. 2023a) in both sentence-level and document-level LLM-generated text detection.
However, a common limitation of these methods is their reliance on accessing the source models’ logits. This requirement may limit their effectiveness when applied to other powerful, closed-source models where logits are inaccessible.
5.3.2 Pre-training Classifiers
In-domain Fine-tuning is All You Need
This subsection explores methods involving the fine-tuning of encoder-based classifiers to distinguish between texts generated by LLMs and those written by humans. This approach requires paired samples to facilitate supervised training processes. According to Qiu et al. (2020), pre-trained LMs have demonstrated exceptional capabilities in natural language understanding, which is crucial for enhancing various NLP tasks, particularly text categorization. Prominent pre-trained models, such as BERT (Devlin et al. 2019), RoBERTa (Liu et al. 2019), and XLNet (Yang et al. 2019), have outperformed traditional statistical machine learning and deep learning counterparts when applied to the text classification tasks within GLUE (Wang et al. 2019). There is an extensive body of prior work (Bakhtin et al. 2019; Uchendu et al. 2020; Antoun et al. 2023; Li et al. 2023c) that has meticulously examined the capabilities of fine-tuned LMs in detecting LLM-generated text. Notably, studies conducted in 2019 have acknowledged fine-tuned LMs, with RoBERTa (Liu et al. 2019) being especially prominent, as being among the most formidable detectors of LLM-generated text. In the following discourse, we will introduce recent scholarly contributions in this vein, providing an updated review and summary of the methods deployed.
Fine-tuning RoBERTa provides a robust baseline for detecting text generated by LLMs. Fagni et al. (2021) observed that fine-tuning RoBERTa led to optimal classification outcomes across various encoding configurations (Gambini et al. 2022), and the subsequent OpenAI detector (Radford et al. 2019) also adopted a RoBERTa fine-tuning approach. Recent work (Guo et al. 2023; Liu et al. 2023d, 2023e; Chen et al. 2023b; Wang et al. 2023c) further corroborated the superior performance of fine-tuned members of the BERT family, such as RoBERTa, in identifying LLM-generated text. Moreover, these fine-tuned models yielded a 95% accuracy rate within their respective domains, outperforming zero-shot methods like DetectGPT (Mitchell et al. 2023) and watermarking methods like WLLM (Kirchenbauer et al. 2023a), while also exhibiting a modicum of resilience to various attack techniques in in-domain settings. Nevertheless, like their counterparts, these encoder-based fine-tuning approaches lack robustness (Bakhtin et al. 2019; Uchendu et al. 2020; Antoun et al. 2023; Li et al. 2023c), as they tend to overfit to their training data or the source model’s training distribution, resulting in a decline in performance when faced with unseen domains or data. In addition, fine-tuned LM classifiers show limited ability to handle data generated by different models (Sarvazyan et al. 2023). Despite this, detectors based on RoBERTa exhibit significant potential for robustness, requiring as few as a few hundred labels to fine-tune and deliver impressive results (Rodriguez et al. 2022). mBERT (Devlin et al. 2019) has demonstrated consistently robust performance in document-level LLM-generated text classification and various model attribution settings, maintaining optimal performance particularly in English and Spanish tasks. In contrast, encoder models like XLM-RoBERTa (Conneau et al. 2020) and TinyBERT (Jiao et al. 2020) have shown significant performance disparities in the same document-level tasks and model attribution setups, suggesting that these two tasks may require different capabilities from the models. Additionally, SimLLM (Nguyen-Son, Dao, and Zettsu 2024) improves on the zero-shot work of Mao et al. (2024) and Zhu et al. (2023) and combines it with a fine-tuning approach. The method uses a candidate LLM to generate proofread versions of input sentences, compares them with the original input to assess similarity, and organizes the input and its proofread versions based on the similarity score. A RoBERTa model is then fine-tuned to determine the source of the concatenated sequence and distinguish between human-written and model-generated content. Results show that this method outperforms DetectGPT (Mitchell et al. 2023) and Revise-Detect (Zhu et al. 2023) in both performance and generalization.
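A condensed sketch of this in-domain fine-tuning recipe is given below; the two-example dataset and the hyperparameters are placeholders, and a real detector would instead be trained on one of the corpora in Table 5.

```python
# Fine-tune a RoBERTa binary classifier on paired human/LLM samples.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tok = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained("roberta-base", num_labels=2)

data = Dataset.from_dict({
    "text": ["Certainly! Below is a structured summary of the article.",
             "honestly no idea what the lecturer meant by that last slide"],
    "label": [1, 0],   # 1 = LLM-generated, 0 = human-written (toy labels)
}).map(lambda x: tok(x["text"], truncation=True, padding="max_length", max_length=64))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llm-detector", num_train_epochs=1,
                           per_device_train_batch_size=2, report_to=[]),
    train_dataset=data,
)
trainer.train()
```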
Contrastive Learning
Data scarcity has propelled the application of contrastive learning (Yan et al. 2021; Gao, Yao, and Chen 2021; Chen et al. 2022) to encoder-based classifiers, with self-supervised learning at the core of this approach. The strategy minimizes the distance between an anchor and its positive samples while maximizing the distance to negative samples in the representation space. Liu et al. (2022) proposed an enhanced contrastive loss that allocates increased weight to challenging negative samples, thereby improving model utility and sensitivity and boosting performance in resource-scarce environments. Their approach integrates linguistic features and sentence structure, representing text as coherence graphs to encapsulate inherent entity consistency, and demonstrates that leveraging fact structure information can significantly enhance the effectiveness of encoder-based detectors. Experiments on generators such as GROVER, GPT-2, and GPT-3.5 yielded results that surpassed encoder-based classifiers, including the RoBERTa (Conneau et al. 2020) and XLNet (Yang et al. 2019) classifiers, the mainstream contrastive learning method DualCL (Chen et al. 2022), and the zero-shot approach DetectGPT (Mitchell et al. 2023). Similar findings were highlighted in the work of Zhong et al. (2020). Another contrastive learning application for LLM-generated text detection, the Contrastive Domain Adaptation framework (ConDA), was introduced by Bhattacharjee et al. (2023). This framework merges standard domain adaptation techniques with the representational capabilities of contrastive learning, substantially boosting the model’s defenses against unknown models. Compared to DetectGPT, ConDA shows an average performance improvement of 31.7%, with only a 0.8% gap to fully supervised RoBERTa detectors. Building on these advancements, PECOLA (Liu et al. 2024) introduced a multi-pairwise contrastive learning strategy combined with selective perturbation to improve noise robustness and token-level sensitivity. By leveraging token importance weights and bridging zero-shot and fine-tuned approaches, PECOLA achieves better generalization and performance, outperforming DetectGPT by 3.84% and a fine-tuned RoBERTa classifier by 1.62%.
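As a concrete illustration of the idea of up-weighting hard negatives, the sketch below implements a generic InfoNCE-style objective over sentence embeddings. It is a minimal, self-contained formulation, not the exact loss used by Liu et al. (2022) or PECOLA.

```python
# Generic contrastive loss with extra weight on hard negatives (negatives that
# score higher than the positive). Illustrative only; not a cited method's loss.
import torch
import torch.nn.functional as F

def weighted_contrastive_loss(anchor, positive, negatives, tau=0.1, hard_weight=2.0):
    """anchor, positive: [d]; negatives: [n, d] sentence embeddings."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)

    pos_sim = (anchor @ positive) / tau      # similarity to the positive sample
    neg_sim = (negatives @ anchor) / tau     # similarities to negative samples

    # Hard negatives (closer to the anchor than the positive) get a larger weight.
    weights = torch.where(neg_sim > pos_sim.detach(),
                          torch.full_like(neg_sim, hard_weight),
                          torch.ones_like(neg_sim))

    # -log( exp(pos) / (exp(pos) + sum_i w_i * exp(neg_i)) )
    denom = torch.logsumexp(torch.cat([pos_sim.unsqueeze(0), neg_sim + weights.log()]), dim=0)
    return denom - pos_sim

loss = weighted_contrastive_loss(torch.randn(768), torch.randn(768), torch.randn(8, 768))
print(loss.item())
```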
Adversarial Learning Methods
In light of the vulnerability of detectors to different attacks and their robustness issues, there has been significant academic interest in utilizing adversarial learning as a countermeasure. Adversarial learning approaches are mainly linked with the fine-tuning of LMs. A notable recent study by Koike, Kaneko, and Okazaki (2023b) demonstrates that adversarial training can also be conducted without fine-tuning: the model parameters are kept frozen and contextual examples guide the model instead. To describe these methods more clearly, we categorize such research into sample-enhancement-based adversarial training and two-player games.
A prominent line of sample-enhancement-based adversarial training centers on deploying adversarial attacks for sample augmentation, with the overarching aim of crafting deceptive inputs and thereby enhancing the model’s competency in handling a broader array of potentially deceptive scenarios. Specifically, this method emphasizes sample augmentation and achieves it by injecting predetermined adversarial attacks. The augmentation process fortifies the detector’s robustness by furnishing it with an expanded pool of adversarial samples. Section 8.2 outlines various potential attack mechanisms, including paraphrase attacks, adversarial attacks, and prompt attacks. Yang, Jiang, and Li (2023), Shi et al. (2023), and He et al. (2023b) applied adversarial data augmentation to LLM-generated text and found that models trained on the augmented data exhibited commendable robustness against potential attacks.
The methods of two-player games, fundamentally aligned with the principles underpinning Generative Adversarial Networks (Goodfellow et al. 2020) and Break-It-Fix-It strategies (Yasunaga and Liang 2021), typically pit an attack model against a detection model, with the iterative confrontation between the two culminating in enhanced detection capabilities. Hu, Chen, and Ho (2023) introduced RADAR, a framework for training robust detectors through adversarial learning. The framework facilitates interaction between a paraphrasing model, responsible for generating realistic content that evades detection, and a detector whose goal is to improve its ability to identify text produced by LLMs. RADAR incrementally refines the paraphrase model using feedback from the detector via PPO (Schulman et al. 2017), outperforming zero-shot detection methods including DetectGPT (Mitchell et al. 2023) and the OpenAI detector across eight different LLMs and four datasets. Despite its commendable performance in countering paraphrase attacks, the study by Hu, Chen, and Ho (2023) did not provide a comprehensive analysis of RADAR’s defense against other attack modalities. In a parallel vein, Koike, Kaneko, and Okazaki (2023b) proposed OUTFOX, a training methodology for detectors predicated on continual interaction between an attacker and a detector. Distinct from RADAR, OUTFOX places greater emphasis on detectors using ICL (Dong et al. 2023) for attacker identification. Specifically, the attacker in the OUTFOX framework utilizes predicted labels from the detector as ICL exemplars to generate text that poses detection challenges. Conversely, the detector uses the adversarially generated content as ICL exemplars to improve its detection of formidable attackers. This reciprocal consideration of each other’s outputs fosters improved robustness in detectors of LLM-generated text. Empirical evidence shows that OUTFOX outperforms previous statistical methods (e.g., Log-likelihood [Solaiman et al. 2019] and DetectGPT [Mitchell et al. 2023]) and RoBERTa-based methods (Conneau et al. 2020), achieving an F1 score of up to 96.9 and maintaining good performance against attacks utilizing TF-IDF and DIPPER (Krishna et al. 2023).
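To convey the alternation pattern behind such two-player setups, the toy loop below pits a paraphrasing “attacker” against a repeatedly retrained “detector.” Both components are deliberately simplistic stand-ins (no PPO, no neural models); only the structure of the interaction loosely mirrors RADAR and OUTFOX.

```python
# Schematic two-player loop: the attacker rewrites LLM text to evade the detector,
# and the detector is "retrained" on the evasive rewrites each round. Toy only.
import random

def paraphrase(text, strength):
    """Toy attacker: shuffles words with probability `strength`."""
    words = text.split()
    if random.random() < strength:
        random.shuffle(words)
    return " ".join(words)

def detector_score(text, flagged_vocab):
    """Toy detector: fraction of words seen in previously flagged rewrites."""
    words = text.split()
    return sum(w in flagged_vocab for w in words) / max(len(words), 1)

llm_texts = ["the model generated this fluent and generic passage"] * 4
flagged_vocab = set()
strength = 0.1

for round_id in range(5):
    # Attacker step: escalate paraphrase strength if the detector still wins.
    rewrites = [paraphrase(t, strength) for t in llm_texts]
    avg_score = sum(detector_score(r, flagged_vocab) for r in rewrites) / len(rewrites)
    if avg_score > 0.5:
        strength = min(1.0, strength + 0.2)

    # Detector step: absorb vocabulary from the evasive rewrites.
    for r in rewrites:
        flagged_vocab.update(r.split())

    print(f"round {round_id}: attacker strength={strength:.1f}, detector score={avg_score:.2f}")
```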
Features-Enhanced Approaches
In addition to enhancements in training methodology, Tu et al. (2023) demonstrated that extracting linguistic features can effectively improve the robustness of a RoBERTa-based detector, with benefits observed in various related models. Cowap, Graham, and Foster (2023) developed an emotion-aware detector by fine-tuning a pre-trained language model (PLM) for sentiment analysis, thereby exploiting emotion as a signal for identifying synthetic text. By further fine-tuning BERT specifically for sentiment classification, they achieved an F1 score improvement of up to 9.03% in detection performance. Uchendu, Le, and Lee (2023b) used RoBERTa to capture contextual representations, such as semantic and syntactic linguistic features, and integrated Topological Data Analysis to analyze the shape and structure of the data, including its linguistic structure. This approach surpassed the performance of RoBERTa alone on the SynSciPass and M4 datasets. The J-Guard framework (Kumarage et al. 2023a) guides existing supervised LLM-generated text detectors in detecting LLM-generated news by extracting journalism features that help the detector recognize LLM-generated fake news. The framework demonstrates strong performance and robustness, achieving over 96% AUROC in all tests within TuringBench and 93.4% AUROC on ChatGPT, and its average performance degradation under adversarial attacks is as low as 7%.
5.3.3 LLMs as Detectors
Questionable Reliability of Using LLMs
Several studies have examined the feasibility of utilizing LLMs as detectors to distinguish text generated by themselves or by other LLMs. This approach was first introduced by Zellers et al. (2019b), who noted that the text generation model Grover produced disinformation that was remarkably deceptive due to its inherently controllable nature. Subsequent analyses by Zellers et al. (2019b), involving various architectural models such as GPT-2 (Radford et al. 2019) and BERT (Devlin et al. 2019), revealed that Grover’s most effective countermeasure was itself, achieving an accuracy rate of 92%. In contrast, other detector types experienced a decline in accuracy to approximately 70% as Grover’s size increased. A recent reevaluation by Bhattacharjee and Liu (2023) of more recent LLMs such as ChatGPT and GPT-4 found that neither could reliably identify text generated by various LLMs. The two models also exhibited contrasting tendencies: ChatGPT tended to classify LLM-generated text as human-written, with a misclassification probability of about 50%, whereas GPT-4 leaned towards labeling human-written text as LLM-generated, misclassifying about 95% of human-written texts. ArguGPT (Liu et al. 2023d) further attested to the lackluster performance of GPT-4-Turbo in detecting text generated by LLMs, with accuracy rates languishing below 50% across zero-shot, one-shot, and two-shot settings. These findings collectively demonstrate the diminishing reliability of using LLMs for direct detection of self-generated text compared with statistical and neural network methods, especially as LLMs grow increasingly complex.
ICL: A Powerful Technique for LLM-Based Detection
While using LLMs to detect texts generated by other LLMs often raises reliability concerns, recent empirical research underscores the effectiveness of in-context learning (ICL) in improving detection capabilities. ICL is an advanced form of prompt engineering that integrates examples directly into the prompts provided to the model, thereby enabling LLMs to learn new tasks more effectively. This approach allows existing LLMs to adeptly handle a variety of tasks without requiring further fine-tuning. Koike, Kaneko, and Okazaki (2023b) introduced a framework that enhances the robustness of detectors for LLM-generated text by facilitating a reciprocal consideration of outputs between detectors and attackers. In this setup, attackers use the prediction labels from the detector as in-context exemplars to craft more elusive texts adversarially. Simultaneously, detectors use these adversarially generated texts as in-context exemplars to improve their ability to identify content from sophisticated attackers. Experimental results demonstrate that the ICL strategy surpasses traditional zero-shot methods and RoBERTa-based detectors, achieving an F1 score of up to 96.9 and improving the detection of attacker-generated texts by as much as 41.3 F1 points.
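The sketch below shows one way such an ICL prompt might be assembled: labeled exemplars are concatenated ahead of the query passage, and the resulting prompt is sent to any instruction-following LLM. The exemplars, the query, and the commented-out ask_llm call are hypothetical placeholders, not the exact prompt format used by any cited framework.

```python
# Minimal ICL-style detection prompt: labeled exemplars followed by the query.
examples = [
    ("The committee convened, albeit briefly, amid mounting pressure.", "human-written"),
    ("In conclusion, it is important to note that there are many factors.", "LLM-generated"),
]
query = "Overall, this essay explores several key aspects of the topic."

prompt_lines = ["Decide whether each text is human-written or LLM-generated.\n"]
for text, label in examples:
    prompt_lines.append(f"Text: {text}\nLabel: {label}\n")
prompt_lines.append(f"Text: {query}\nLabel:")
prompt = "\n".join(prompt_lines)

# response = ask_llm(prompt)  # placeholder: send the prompt to any LLM endpoint
print(prompt)
```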
6 Human Detection and Enhancements
This section explores human-assisted methods for detecting text generated by LLMs. These methods leverage human prior knowledge and analytical skills, offering notable interpretability and credibility in the detection process.
6.1 Intuitive Indicators
Numerous studies have delved into the disparities between human and machine classification capabilities. Human classification primarily depends on visual observation to discern features indicative of text generation by LLMs. Uchendu et al. (2023) noted that a lack of coherence and consistency in LLM-generated text serves as a strong indicator of falsified content. Texts produced by LLMs often exhibit semantic inconsistencies and logical errors. Additionally, Dugan et al. (2023) identified that the human discernment of LLM-generated text varies across different domains. For instance, LLMs tend to generate more “generic” text in the news domain, whereas, in story domains, the text might be more “irrelevant.” Ma et al. (2023) noted that evaluators of academic writing typically emphasize style. Summaries generated by LLMs frequently lack detail, particularly in describing the research motivation and methodology, which hampers the provision of fresh insights. In contrast, LLM-generated papers exhibit fewer grammatical and other types of errors and demonstrate a broader variety of expression (Yan et al. 2023; Liao et al. 2023). However, these papers often use general terms instead of effectively tailored information relevant to the specific problem context. In human-written texts, such as scientific papers, authors are prone to composing lengthy paragraphs and using ambiguous language (Desaire et al. 2023), often incorporating terms like “but,” “however,” and “although.” Dugan et al. (2023) also noted that relying solely on grammatical errors as a detection strategy is unreliable. In addition, LLMs frequently commit factual and common-sense reasoning errors, which, while often overlooked by neural network-based detectors, are easily noticed by humans (Jawahar, Abdul-Mageed, and Lakshmanan 2020).
6.2 Imperceptible Features
Ippolito et al. (2020) suggested that text perceived as high quality by humans tends to be more easily recognizable by detectors. This observation implies that some features, imperceptible to humans, can be efficiently captured by detection algorithms. While humans are adept at identifying errors in many LLM-generated texts, unseen features also significantly influence their decision-making. In contrast, statistical thresholds commonly employed in zero-shot detector research to distinguish LLM-generated text can be manipulated. However, humans typically possess the ability to detect such manipulations through various metrics; GLTR (Gehrmann, Strobelt, and Rush 2019) pioneered this approach, serving as a visual forensic tool to assist human vetting processes, while also providing rich interpretations easily understandable by non-experts (Clark et al. 2021).
6.3 Enhancing Human Detection Capabilities
Ippolito et al. (2020) indicated that human evaluators might not be as proficient as detection algorithms in recognizing LLM-generated text across various settings. However, exposing evaluators to examples before evaluation enhances their detection capabilities, especially with longer samples. The platform RoFT (Dugan et al. 2020) allows users to engage with LLM-generated text, shedding light on human perception of such text. While revealing true boundaries post-annotation did not lead to an immediate improvement in annotator accuracy, with proper incentives and motivation annotators can improve their performance over time (Dugan et al. 2023). The SCARECROW framework (Dou et al. 2022) facilitates the annotation and review of LLM-generated text, outlining ten error types to guide users. Results from SCARECROW show that manual annotation outperformed detection models on half of the error types, highlighting the potential of efficient annotation systems despite the associated human resource overhead.
6.4 Mixed Detection: Understanding and Explanation
Weng et al. (2023) introduced a prototype amalgamating human expertise and machine intelligence for visual analysis, based on the belief that human judgment is the benchmark. Initially, experts label text based on their prior knowledge, elucidating the distinctions between human and LLM-generated text. Subsequently, machine-learning models are trained and iteratively refined based on labeled data. Finally, the most intuitive detector is selected through visual statistical analysis to fulfill the detection purpose. This analytical approach not only bolsters experts’ trust in decision-making models but also fosters learning from the models’ behavior, improving the identification of LLM-generated samples.
7 Evaluation
7.1 Evaluation Metrics
Evaluation metrics are indispensable for assessing model performance in any NLP task, including LLM-generated text detection. This section discusses the metrics conventionally used for this task. Common metrics include Accuracy, Precision, Recall, F1 score, and AUROC (Dalianis 2018). Accuracy provides an overall measure of success in correctly identifying both human-written and LLM-generated texts, making it a straightforward metric for assessing the general effectiveness of a detection model. Precision measures how much of the text flagged as LLM-generated is indeed LLM-generated, while Recall measures the proportion of LLM-generated texts that are successfully identified. The F1 score is the harmonic mean of Precision and Recall, jointly accounting for false positives and false negatives (Sokolova, Japkowicz, and Szpakowicz 2006). AUROC, derived from receiver operating characteristic curves, evaluates model performance across varying thresholds and is particularly valuable for assessing the model’s robustness and behavior under different levels of detection sensitivity.
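For reference, the snippet below computes these metrics with scikit-learn, treating label 1 as “LLM-generated”; the labels, predictions, and scores are purely illustrative.

```python
# Computing the common detection metrics with scikit-learn (label 1 = LLM-generated).
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                    # gold labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]                    # hard decisions from a detector
y_score = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1]   # detector confidence for class 1

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("AUROC    :", roc_auc_score(y_true, y_score))  # threshold-free, uses raw scores
```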
8 Important Issues of LLM-Generated Text Detection
This section discusses the primary issues and limitations of contemporary SOTA techniques for detecting text generated by LLMs. It is important to note that no technique has been deemed infallible. The issues highlighted here may pertain to one or multiple classes of detectors.
8.1 Out-of-Distribution Challenges
Out-of-distribution challenges significantly impede the efficacy of current techniques dedicated to the detection of LLM-generated text. This section elucidates the constraints of these detectors to variations in domains and languages.
Multi-domain
The dilemma of multi-domain application is a ubiquitous challenge inherent to numerous NLP tasks. Studies conducted by Antoun et al. (2023) and Li et al. (2023c) underscored considerable limitations in the performance of sophisticated detectors, including but not limited to DetectGPT (Mitchell et al. 2023), GLTR (Gehrmann, Strobelt, and Rush 2019), and fine-tuned RoBERTa models, when applied to a new domain. These detectors exhibit substantial performance degradation when confronted with out-of-distribution data prevalent in real-world scenarios, with the efficacy of some classifiers marginally surpassing that of random classification. This disparity between high reported performance and actual reliability underlines the need for critical evaluation and enhancement of existing methods.
Multilingual
The issue of multilingual application introduces a set of complex challenges that hinder the global applicability of existing detector research. Contemporary detectors for LLM-generated text predominantly target monolingual applications, often neglecting to evaluate and optimize performance across multiple languages. Wang et al. (2023b) and Chaka (2023) have noted significant gaps in control and consistency when detecting multilingual LLM-generated text, despite some language transfer capabilities. Addressing these multilingual challenges is pivotal to improving the usability and fairness of LLM-generated text detectors. Additionally, the study by Liang et al. (2023a) showed a discernible decline in the performance of SOTA detectors when processing texts authored by non-native English speakers. While effective prompt strategies can mitigate this bias, they also inadvertently allow generated text to bypass detection. Consequently, there is a risk that detectors might inadvertently penalize writers who exhibit non-standard linguistic styles or use limited expressions, thereby introducing discrimination into the detection process.
Cross-LLMs
Another significant out-of-distribution issue in the LLM-generated text detection task is the cross-LLM challenge. Current white-box detection approaches primarily rely on accessing the source model and comparing features such as log-likelihood. As a result, white-box methods may underperform when encountering text generated by unknown LLMs. For example, DetectGPT (Mitchell et al. 2023) highlights the vulnerability of white-box methods when faced with unknown models, especially powerful ones like GPT-3.5-Turbo. However, recent findings from Fast-DetectGPT (Bao et al. 2023) show that statistical comparisons with surrogate models can significantly mitigate this issue. Additionally, identifying the type of the generative model before applying white-box methods could be beneficial; the methodologies of Sniffer (Li et al. 2023a), SeqXGPT (Wang et al. 2023a), and LLMDet (Wu et al. 2023) may offer useful insights here. On the other hand, methods based on neural classifiers, especially fine-tuned classifiers susceptible to overfitting the training data, may struggle to recognize LLMs not seen during training. This limitation is evident for newly emerging LLMs, where detectors may fail to effectively identify generated texts (Pagnoni, Graciarena, and Tsvetkov 2022). For instance, the OpenAI detector12 (trained on texts generated by GPT-2) struggles to discern texts generated by GPT-3.5-Turbo and GPT-4, achieving an AUROC of only 74.74%, while it performs nearly perfectly on GPT-2-generated texts (Bao et al. 2023). Findings by Sarvazyan et al. (2023) demonstrate that supervised LLM-generated text detectors generalize well across model scales but face challenges in generalizing across different model families. Enhancing the cross-LLM robustness of neural classifiers is thus essential for the practical deployment of detectors. Nonetheless, fine-tuned RoBERTa classifiers still possess strong transfer capabilities, and with additional fine-tuning on just a few hundred samples, detectors can effectively generalize to texts generated by other models. Therefore, incorporating LLM-generated text from various sources into the training data, even in small quantities, could substantially improve the cross-LLM robustness of detectors in real-world applications.
8.2 Potential Attacks
Potential attacks, defined as deliberate manipulations of text or the generative models used to produce it, aim to evade or confuse detection systems. These attacks significantly contribute to the ongoing unreliability of current LLM-generated text detectors. In this section, we discuss these attacks to encourage researchers to focus on developing more comprehensive defensive measures.
Paraphrase Attacks
Paraphrasing attacks are among the most effective strategies, capable of undermining detectors based on watermarking technology, fine-tuned supervised detectors, and zero-shot detectors (Sadasivan et al. 2023; Orenstrakh et al. 2023). The underlying principle is to apply a lightweight paraphrase model to an LLM’s outputs, changing the distribution of lexical and syntactic features of the text and thereby confusing the detector. Sadasivan et al. (2023) reported that Parrot (Damodaran 2021), a T5-based paraphrase model, and DIPPER (Krishna et al. 2023), an 11B paraphrasing model that allows tuning of paraphrase diversity and the degree of content reordering, can undermine the advantage of existing detection methods. While retrieval-based defenses have shown promise against paraphrasing attacks (Krishna et al. 2023), these defenses require ongoing maintenance by language model API providers and remain vulnerable to recursive paraphrasing attacks (Sadasivan et al. 2023).
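A minimal sketch of the attack pattern, assuming a seq2seq paraphrase checkpoint is available: the LLM output is rewritten before it reaches the detector. Here t5-small serves only as a runnable stand-in; a dedicated paraphraser such as Parrot or DIPPER would be substituted in practice, and the “paraphrase:” task prefix is checkpoint-specific.

```python
# Sketch of a paraphrase attack: rewrite LLM output with a seq2seq model before
# handing it to a detector. "t5-small" is a runnable stand-in for a real paraphraser.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

MODEL_NAME = "t5-small"  # placeholder; swap in an actual paraphrase-tuned model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def paraphrase(text, max_new_tokens=128):
    # The task prefix is checkpoint-specific; adjust it for the chosen model.
    inputs = tokenizer("paraphrase: " + text, return_tensors="pt", truncation=True)
    outputs = model.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

llm_output = "The study demonstrates a significant improvement over prior work."
attacked = paraphrase(llm_output)
print(attacked)  # feed both versions to a detector and compare its scores
```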
Adversarial Attacks
Normal LLM-generated texts are highly identifiable, yet adversarial perturbations, such as substitution, can effectively reduce the accuracy of detectors (Peng et al. 2024). We summarize attacks that operate on textual features as adversarial attacks, including cutoff (cropping a portion of the feature or input) (Shen et al. 2020), shuffle (randomly disrupting the word order of the input) (Lee et al. 2020), mutation (character and word mutation) (Liang, Guerrero, and Alsmadi 2023), word swapping (substituting other suitable words given the context) (Shi and Huang 2020; Ren et al. 2019; Crothers et al. 2022), and misspelling (Gao et al. 2018). There are also adversarial attack frameworks such as TextAttack (Morris et al. 2020), which systematically constructs attacks from four components: an objective function, a set of constraints, a transformation, and a search method. Shi et al. (2023) and He et al. (2023b) reported on the effectiveness of perturbation-based approaches in attacking detectors. Specifically, Shi et al. (2023) replaced words with synonyms based on context, forming an effective attack on fine-tuned classifiers, watermarking (Kirchenbauer et al. 2023a), and DetectGPT (Mitchell et al. 2023), reducing detector performance by more than 18%, 10%, and 25%, respectively. He et al. (2023b) used probability-weighted word saliency (Ren et al. 2019) to generate adversarial examples that better preserve semantic similarity.
Stiff and Johansson (2022) utilized the DeepWordBug (Gao et al. 2018) adversarial attack algorithm to introduce character-level perturbations to generated texts, including adjacent character swaps, character substitutions, deletions, and insertions, which resulted in more than a halving of the performance of the OpenAI large detector.13 Wolff (2020) presented two types of black-box attacks against these detectors: random substitutions of characters with visually similar homoglyphs and the intentional misspelling of words. These attacks drastically reduced the recall rate of popular neural text detectors from 97.44% to 0.26% and 22.68%, respectively. Moreover, Bhat and Parthasarathy (2020) showed that detectors are more sensitive to syntactic perturbations, including breaking longer sentences, removing definite articles, using semantic-preserving rule conversions (such as changing “that’s” to “that is”), and reformatting paragraphs of machine-generated text.
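To illustrate the flavor of these character-level attacks, the toy function below applies adjacent-character swaps, homoglyph substitutions, and deletions at a configurable rate. It is not the DeepWordBug algorithm, merely a minimal stand-in for probing a detector’s sensitivity to such noise.

```python
# Illustrative character-level perturbations (swap, homoglyph, delete); toy only.
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes

def perturb(text, rate=0.05, seed=0):
    random.seed(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if random.random() >= rate:
            continue
        op = random.choice(["swap", "homoglyph", "delete"])
        if op == "swap":
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
        elif op == "homoglyph" and chars[i].lower() in HOMOGLYPHS:
            chars[i] = HOMOGLYPHS[chars[i].lower()]
        elif op == "delete":
            chars[i] = ""
    return "".join(chars)

print(perturb("The generated article reads fluently and covers every point."))
```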
Although existing detection methods are highly sensitive to adversarial attacks, different types of detectors exhibit varying degrees of resilience. Antoun et al. (2023) reported that supervised approaches offer effective defenses against these attacks: training on adversarial samples can significantly improve a detector’s ability to recognize texts that have been manipulated in this way. Additionally, Kulkarni et al. (2023) explored the impact of semantic perturbations on the Grover detector, finding that synonym substitution, fake-fake replacement, insertion instead of substitution, and changes in the position of substitution had no effect on Grover’s detection capabilities. However, adversarial embedding techniques can effectively deceive Grover into classifying false articles as genuine. Such attacks degrade the performance of fine-tuned classifiers significantly, although the distributional features of an attack can be learned by a fine-tuned classifier to form a strong defense.
Prompt Attacks
Prompt attacks pose a significant challenge for current LLM-generated text detection techniques. The quality of LLM-generated text is associated with the complexity of the prompts used to instruct LLMs. As model and corpus sizes increase, LLMs exhibit strong emergent ICL abilities that support more complex text generation. Numerous efficient prompting methods have been developed, including few-shot prompting (Brown et al. 2020), prompt combination (Zhao et al. 2021), Chain of Thought (CoT) (Wei et al. 2022), and zero-shot CoT (Kojima et al. 2022), all of which significantly enhance the quality and capabilities of LLM outputs. Existing research on LLM-generated text detectors, however, primarily utilizes datasets created with simple, direct prompts. The study by Guo et al. (2023) demonstrates that detectors might struggle to identify text generated with complex prompts. Similarly, Liu et al. (2023e) reported a noticeable decrease in the detection ability of a fine-tuned language model detector when faced with varied prompts, indicating that different prompts lead to large differences in the detection performance of existing detectors (Koike, Kaneko, and Okazaki 2023a).
The Substitution-based Contextual Example Optimization method, proposed by Lu et al. (2023), uses sophisticated prompts to bypass the defenses of current detection systems. This leads to a substantial reduction in the area under the curve, averaging a decrease of 0.54, and achieves a higher success rate with better text quality compared to paraphrase attacks. It is worth mentioning that both paraphrase attacks and adversarial attacks mentioned above could be executed through careful prompt design (Shi et al. 2023; Koike, Kaneko, and Okazaki 2023b). With ongoing research in prompt engineering, the risk posed by prompt attacks is expected to escalate further. This underscores the need for developing more robust detection methods that can effectively counteract such evolving threats.
Training Threat Models
Further training of language models has been preliminarily shown to be an effective attack on existing detectors. Nicks et al. (2023) used the “humanity” scores of various open source and commercial detectors as a reward function for reinforcement learning, enabling fine-tuning of language models to confound existing detectors. Without significantly altering the model, further fine-tuning of Llama-2-7B can reduce the AUROC of the OpenAI RoBERTa-Large detector from 0.84 to 0.62 within a short training period. A similar idea is demonstrated by Schneider et al. (2023): using reinforcement learning to refine generative models can successfully circumvent BERT-based classifiers, reducing detection performance to as low as 0.15 AUROC, even when using linguistic features as a reward function. Kumarage et al. (2023b) propose a universal evasion framework named EScaPe to guide PLMs in generating “human-like” text that can mislead detectors. Through evasive soft prompt learning and transfer, the performance of DetectGPT and the OpenAI Detector can be reduced by up to 40% AUROC. Additionally, the results of Henrique, Kucharavy, and Guerraoui (2023) reveal another potential vulnerability: if a generative model can access the human-written text used to train a detector and use it for fine-tuning, the detector becomes effectively unable to identify text from that generative model. This indicates that LLMs trained on larger human-written corpora will be more robust against existing detectors, and that training against a specific detector equips an LLM with a sharp spear to breach its defenses.
8.3 Real-World Data Issues
Detection for Not Purely LLM-Generated Text
In practice, many texts are not purely generated by LLMs and may contain a mix of human-written content. Such texts can be categorized as either data-mixed text or human-edited text. Data-mixed text refers to a sentence- or paragraph-level mixture of human-written and LLM-generated text. For instance, in a document, some sentences may be generated by LLMs while others are written by humans, making it challenging to assign a single category to the document. Data-mixed text necessitates more fine-grained detection methods, such as sentence-level detection, to address this challenge effectively. However, current LLM-generated text detectors struggle to perform well on short texts. Recent research, such as that by Wang et al. (2023a), indicates that sentence-level detection is feasible. Encouragingly, studies have started addressing this issue. Zeng et al. (2023) proposed a two-step method to effectively identify mixtures of human-written and LLM-generated text: it first uses contrastive learning to distinguish LLM-generated content from human-written content, then calculates the similarity between adjacent prototypes, assuming that a boundary lies between the least similar adjacent prototypes.
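As a loose illustration of boundary detection in mixed documents, the sketch below scores each sentence with an arbitrary sentence-level detector and places the boundary at the largest jump between adjacent scores. This is only inspired by the idea above; Zeng et al. (2023) operate on contrastively learned prototypes rather than raw per-sentence scores.

```python
# Simplified boundary search for mixed human/LLM documents: largest score jump.
def find_boundary(sentences, sentence_detector):
    scores = [sentence_detector(s) for s in sentences]   # P(LLM-generated) per sentence
    jumps = [abs(scores[i + 1] - scores[i]) for i in range(len(scores) - 1)]
    return max(range(len(jumps)), key=jumps.__getitem__) + 1  # first sentence after boundary

# Toy per-sentence scores standing in for a real sentence-level detector.
toy_scores = {"A handwritten intro.": 0.1, "Another personal remark.": 0.2,
              "A generic generated sentence.": 0.8, "More generated filler.": 0.9}
boundary = find_boundary(list(toy_scores), toy_scores.get)
print("LLM-generated segment starts at sentence", boundary)
```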
Another issue that has not been fully discussed is human-edited text. For example, after using an LLM to generate a text, humans often edit and modify certain words or passages. Detecting such text poses a significant challenge that must be confronted, as it is prevalent in real-world applications. There is therefore an urgent need to organize relevant datasets and define tasks that address this issue. One potential approach is informed by experimental results from paraphrasing and adversarial perturbation attacks, which effectively simulate how individuals might use LLMs to refine text or make word substitutions. Current mainstream detectors, however, tend to degrade in performance when dealing with paraphrased text (Wolff 2020), although certain black-box detectors display relatively good robustness. Another potential solution could involve breaking the detection task down to the word level, but to date no research directly addresses this.
Issues of LLM-Assisted Writing
When discussing the use of LLMs in writing assistance and their impact on human writing, it is essential to consider meticulously how such texts are annotated and managed. It is particularly important to distinguish between texts that are entirely generated by LLMs and those that are co-created by humans with LLM assistance. Texts that are fully generated by an LLM should be marked as “LLM-generated” to enhance transparency and meet regulatory requirements. Moreover, minor edits by humans, such as revising individual words or slightly modifying segments, should not change the designation of the text since the core content remains LLM-generated and subject to regulation due to possible quality inconsistencies that require strict oversight (Wang et al. 2024).
In scenarios where LLMs provide grammar checking, polishing, and editing suggestions during the creative process, the text should not be labeled as entirely LLM-generated, because the primary substantial contribution is from humans. Such texts should not be subject to strict regulation. Gao et al. (2024) suggested that these texts could be described as “AI-revised Human-Written Text,” a label that accurately reflects the collaborative nature of humans and computers in the creative process and respects the human contribution to creativity. Regulation of LLM-assisted human creations should be more lenient to avoid suppressing creative freedom, particularly in the realms of literature and academia. Despite the challenges of regulation and labeling, LLMs have positively influenced human writing by enhancing writing efficiency, improving language quality, and fostering creativity (Kasneci et al. 2023). Appropriate labeling and moderate regulation can harness the benefits of LLMs while mitigating potential risks, ensuring both the quality of content and creative freedom. This balanced approach will optimize the positive impacts of LLMs while safeguarding the interests of both users and creators.
Data Ambiguity
Data ambiguity remains a challenge in the field of LLM-generated text detection and is closely tied to the inherent mechanisms of the detection technology itself. The pervasive deployment of LLMs across various domains exacerbates this issue, rendering it increasingly difficult to discern whether training data comprises human-written or LLM-generated text. Using LLM-generated text as training data under the misapprehension that it is human-written inadvertently instigates a detrimental cycle: detectors trained in this way demonstrate diminished efficacy in distinguishing between human-written and LLM-generated text, thereby undermining the foundational premises of detector research. It is imperative to acknowledge that this quandary poses a significant, pervasive threat to all facets of detection research; however, to our knowledge, no existing studies formally address this concern. Alemohammad et al. (2023) further highlighted an additional potential risk, suggesting that data ambiguity might lead to the recycling of LLM-generated content in the training processes of subsequent models. This scenario could adversely impact the text generation quality of these emergent LLMs, thereby destabilizing the research landscape dedicated to the detection of LLM-generated text.
8.4 Impact of Model Size on Detectors
Many researchers are concerned about the impact of model size on detectors, which can be analyzed from two perspectives: the size of the generative model and the size of the supervised classifier. The size of the generative model is closely related to the quality of the generated text. Generally, texts generated by smaller models are easier to recognize, while those generated by larger models pose a greater challenge for detection. Another concern is how texts generated by models of different sizes affect detectors when used as training samples. Pu et al. (2023b) report that detectors trained on data generated by medium-sized LLMs can generalize to larger versions without additional samples, while training samples generated by overly large or small models may reduce the generalization ability of the detectors. Antoun, Sagot, and Seddah (2023) further explore the apparent negative correlation between classifier effectiveness and the size of the generative model. Their findings indicate that text generated by larger LLMs is more difficult to detect, especially when the classifier is trained on data generated by smaller LLMs, and that aligning the distribution of the generative models across the training and test sets can improve detector performance. From the perspective of supervised classifier size, the detection capability of detectors is directly proportional to the size of the fine-tuned LMs (Guo et al. 2023). However, recent findings suggest that while larger detectors perform better on test sets drawn from the same distribution as the training set, their generalization ability is somewhat diminished.
8.5 Lack of Effective Evaluation Framework
Comprehensiveness of Evaluation Frameworks
To gain users’ trust, a reliable detector must undergo a multifaceted assessment. The current benchmarks are somewhat limited, providing only superficial challenges and thereby not facilitating a holistic evaluation of detectors. We highlight five crucial dimensions that are essential for the development of more robust benchmarks for LLM-generated text detection tasks. These dimensions include the incorporation of multiple types of attacks, diverse domains, varied tasks, a spectrum of models, and the inclusion of multiple languages.
Multiple types of attack are instrumental in ascertaining the efficacy of detection methodologies. In practical environments, LLM-generated text detectors often encounter texts that are generated using a wide range of attack mechanisms, which differ from texts generated through simple prompts. For instance, the prompt attack elucidated in Section 8.2 impels the generative model to produce superior-quality text, leveraging intricate and sophisticated prompts. Integrating such texts into existing datasets is essential, as echoed in the limitations outlined by Guo et al. (2023).
Multi-domain and multi-task configurations are crucial in assessing a detector’s performance across diverse domains and LLM applications. These dimensions bear significant implications for a detector’s robustness, usability, and credibility. In academic contexts, for instance, an effective detector should perform consistently across all disciplines. In everyday scenarios, it should adeptly identify LLM-generated text spanning academic compositions, news articles, and Q&A sessions. It is important to highlight that many existing studies have explicitly considered this in their experimental setups (Wang et al. 2023b). We encourage the construction and promotion of more high-quality, multi-domain, and task-specific LLM-generated datasets for future researchers to adopt. The ongoing research momentum in LLMs has ushered in formidable counterparts like LLaMa (Touvron et al. 2023), PaLM (Chowdhery et al. 2022), and Claude-2,14 rivaling ChatGPT’s prowess. As the spotlight remains on ChatGPT, it is essential to concurrently address potential risks emanating from other emerging LLMs.
Multilingual considerations demand increased attention. Existing benchmarks are developed mainly for English, with only a few, such as MULTITuDE (Macko et al. 2023) and M4 (Wang et al. 2023b), covering multiple languages, including Arabic, Catalan, Chinese, Czech, Dutch, English, German, Portuguese, Russian, Spanish, Indonesian, Bulgarian, and Ukrainian. However, there is still a significant lack of datasets for other languages, especially low-resource ones. We therefore strongly encourage researchers to spearhead the creation of multilingual datasets to facilitate the evaluation of LLM-generated text detectors across different languages. The use of pre-trained models may uncover instances where certain detectors struggle with underrepresented languages, while LLMs could exhibit more noticeable inconsistencies. This dimension presents a rich avenue for exploration and discourse.
Request for Objective and Fair Benchmark
A prevalent issue in LLM-generated text detection research is the discrepancy between claimed detector performance and practical results. While many studies report impressive and robust detector capabilities, these methods often underperform on test sets created by other researchers. This variance arises from the use of different test set construction strategies: the parameters used to generate the test set, the computational environment, the text distribution, and text processing strategies such as truncation can all influence the effectiveness of detectors. Given the complexity of these factors, the reproducibility of evaluation results is often compromised even when researchers adhere to identical dataset production protocols. As discussed in Section 4, the limitations of existing benchmarks necessitate the creation of high-quality and comprehensive evaluation frameworks. We strongly encourage future research to adopt such frameworks to ensure consistency in testing standards. Additionally, we urge researchers focusing on specific challenges to openly share their test sets, emphasizing the adaptability of current evaluation frameworks to incorporate diverse datasets. Establishing objective and fair benchmarks is vital to advancing LLM-generated text detection research and moving beyond isolated, siloed efforts.
Timeliness of Current Benchmarks
Certain contemporary studies persistently rely on seminal but somewhat outdated benchmark datasets that significantly shaped earlier work on GPT-generated text and fake news detection. However, these datasets predominantly originate from older, less capable LLMs, meaning that methodologies validated on them may not align with current real-world dynamics. We emphasize the importance of using datasets built with advanced and powerful LLMs, and we urge benchmark dataset developers to regularly update their contributions to reflect the rapid evolution of the field.
9 Future Research Directions
This section explores potential avenues for future research toward more efficient and practical detectors for LLM-generated text.
9.1 Building Robust Detectors with Attacks
The attack methods discussed in Section 8.2 encompass Paraphrase Attacks (Sadasivan et al. 2023), Adversarial Attacks (He et al. 2023b), and Prompt Attacks (Lu et al. 2023). These methods represent the primary challenges impeding the utility of current detectors. While recent research, such as that of Yang, Jiang, and Li (2023), addresses robustness against specific attacks, it often neglects potential threats posed by other attack forms. Consequently, it is imperative to develop and validate diverse attack types, thereby gaining insight into vulnerabilities inherent to LLM-generated text detectors. We further advocate for the establishment of comprehensive benchmarks to assess existing detection strategies; although some studies (He et al. 2023b; Wang et al. 2023b) purport to provide such benchmarks, the scope and diversity of the validated attacks remain limited.
9.2 Enhancing the Efficacy of Zero-Shot Detectors
Zero-shot methods are recognized for their notable stability as detectors (Deng et al. 2023). These approaches offer enhanced controllability and interpretability for users (Mitrović, Andreoletti, and Ayoub 2023). Recent research (Giorgi et al. 2023; Liao et al. 2023) has highlighted distinct disparities between LLM-generated text and human-written text, revealing a tangible and discernible gap between the two. This revelation has invigorated research in the domain of LLM-generated text detection. We advocate for a proliferation of these studies that delve into the nuanced distinctions between LLM-generated texts and human-written text, spanning from low-dimensional to high-dimensional features. Unearthing metrics that more accurately distinguish the two can bolster the evolution of automatic detectors and furnish more compelling justifications for decision-making processes. We have observed that the latest emerging black-box zero-shot methods (Yang et al. 2023b; Mao et al. 2024; Zhu et al. 2023; Quidwai, Li, and Dube 2023; Guo and Yu 2023) demonstrate enhanced stability and application potential compared to white-box based zero-shot methods by extracting discriminative metrics that are independent of white-box models. These methods do not rely on an understanding of the model’s internal workings, thereby offering broader applicability across various models and environments.
9.3 Optimizing Detectors for Low-Resource Environments
Many contemporary detection techniques tend to overlook the challenges of resource-constrained settings and the resources required to develop a detector. The relative efficacy of various detectors across different data volume settings remains inadequately explored, and determining the minimal resources each detection method needs to yield satisfactory results is imperative. Beyond examining a model’s adaptability across distinct domains (Rodriguez et al. 2022) and languages (Wang et al. 2023b), we advocate investigating its defensive adaptability against varied attack strategies. Such exploration can guide users in selecting the most suitable approach for building a dependable detector under resource constraints.
9.4 Detection for Not Purely LLM-Generated Text
As highlighted in Section 8.3, a significant challenge in real-world scenarios is the detection of text that is not purely produced by LLMs. We examined this issue by separately discussing texts that mix content from different sources and texts that have been edited by humans, reviewed the latest related work, and proposed potential solutions that still await verification. We emphasize that organizing relevant datasets and defining tasks to address this issue is an urgent need, because this type of text may be the most commonly encountered in detector applications.
9.5 Constructing Detectors Amidst Data Ambiguity
Verifying the authenticity of the training data poses a significant challenge. When aggregating textual data from sources such as blogs and web comments, there is a potential risk of inadvertently including a substantial amount of LLM-generated text. This incorporation can fundamentally compromise the integrity of detector research, perpetuating a detrimental feedback loop. We urge forthcoming detection studies to prioritize the authenticity assessment of real-world data, anticipating this as a pressing challenge in the future.
9.6 Developing an Effective Evaluation Framework Aligned with Real-World Settings
In Section 8.5, we discussed the objective differences between evaluation environments and real-world settings, which limit the effectiveness of existing detectors when applied in practice. On the one hand, biases in the construction of test sets can be found in many works, as test sets often favor the detectors built by their creators. On the other hand, current benchmarks frequently reflect idealized scenarios far removed from real-world applications. We call on researchers to develop a fair and effective evaluation framework closely linked to the practical needs of LLM-generated text detection tasks, for instance by considering the intended application domain, the black-box nature of LLM-generated texts, and the various attacks and post-editing strategies that texts may encounter. We believe such an evaluation framework will promote the research and development of detectors that are more practical and better aligned with real-world scenarios.
9.7 Constructing Detectors with Misinformation Discrimination Capabilities
Contemporary detection methodologies have largely overlooked the capacity to identify misinformation. Existing detectors primarily emphasize the distributional features of LLM-generated text but often overlook its factual content as a signal for verification. A proficient detector should be able to discern the veracity of factual claims presented in a text. In the early stages of generative modeling, when it had yet to pose significant societal challenges, the emphasis was on assessing the truth or falsity of the content in LLM-generated text, with less regard for its source (Schuster et al. 2020). Constructing detectors with misinformation discrimination capabilities can aid in more accurately attributing the source of text, rather than relying solely on distributional features, and can subsequently contribute to mitigating the proliferation of misinformation. Recent studies (Gao et al. 2023; Chern et al. 2023) highlight the potential of LLMs to detect factual content in texts. We recommend bolstering such endeavors through integration with external knowledge bases (Asai et al. 2023) or search engines (Liang et al. 2023b), which could significantly enhance their ability to verify claims and improve reliability in practical applications.
10 Conclusion
With the rapid advancement and application of LLMs, the presence of LLM-generated text in our daily lives has transitioned from expectation to pervasive reality. LLM-generated text detectors play a pivotal role in distinguishing between human-written and LLM-generated text, serving as a crucial defense against the misuse of LLMs for generating deceptive news, engaging in scams, or exacerbating issues such as educational inequality. This survey provides a comprehensive overview of the task of LLM-generated text detection, examines the underlying mechanisms enhancing LLM capabilities, and highlights the increasing need for robust detection methodologies. We also review popular and promising datasets, pointing out the challenges and requirements they pose for existing detectors. In addition, we shed light on the critical limitations of contemporary detectors, including issues related to out-of-distribution data, potential attacks, real-world data issues, and the lack of an effective evaluation framework, to direct researchers’ attention to the focal points of the field and thereby spark innovative ideas and approaches. Finally, we propose potential future research directions that are poised to guide the development of more powerful and effective detection systems, ensuring their alignment with real-world applications.
Acknowledgments
This work was supported in part by the Major Program of the State Commission of Science Technology of China (grant no. 2020AAA0106701), the Science and Technology Development Fund of Macau SAR (grant no. 0007/2024/AKP), the Science and Technology Development Fund of Macau SAR (grant no. FDCT/0070/2022/AMJ, China Strategic Scientific and Technological Innovation Cooperation Project grant no. 2022YFE0204900), the Science and Technology Development Fund of Macau SAR (grant no. FDCT/060/2022/AFJ, National Natural Science Foundation of China grant no. 62261160648), the UM and UMDF (grant nos. MYRG-GRG2023- 00006-FST-UMDF, MYRG-GRG2024-00165- FST-UMDF), and the National Natural Science Foundation of China (grant no. 62266013). This work was performed in part at SICC which is supported by SKL-IOTSC, and HPCC supported by ICTO of the University of Macau. We would like to thank the anonymous reviewers for their insightful comments.
Notes
The useful resources are publicly available at: https://github.com/NLP2CT/LLM-generated-Text-Detection.